* [PATCH v17.1 00/14] xfs-4.19: online repair support
@ 2018-07-30  5:47 Darrick J. Wong
  2018-07-30  5:47 ` [PATCH 01/14] xfs: refactor the xrep_extent_list into xfs_bitmap Darrick J. Wong
                   ` (13 more replies)
  0 siblings, 14 replies; 53+ messages in thread
From: Darrick J. Wong @ 2018-07-30  5:47 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs, bfoster, david, allison.henderson

Hi all,

This is the seventeenth revision of a patchset that adds kernel support
to XFS for online metadata scrubbing and repair.  There aren't any
on-disk format changes.

New for v17.1 are a few fixes suggested by Brian Foster in v17 and a
rebase of the series atop for-next, which adapts the repair code to the
new way of handling deferred log operations.  New for v17 of the patch
series are fixes for numerous review comments that came from Dave and
Allison.  The long prefixes of the previous versions have been
drastically shortened.  Comments about the strategies used to repair
broken parts of the filesystem have been expanded where reviewers
thought it confusing.  A few data structures have been renamed to
reflect more accurately what they do.

Note, this series does not include any of the controversial repair
functionality that requires fs freezing; that has been deferred to a
later posting.

The first patch moves the 'extent list' functionality into a separate
file and renames it xfs_bitmap, since that's what the data structure
actually represents.

Patches 2-12 implement reconstruction of the AGF/AGI/AGFL headers, the
free space btrees, the inode btrees, the inodes, the inode forks, the
inode block maps, symbolic links, and extended attributes.

Patch 13 augments scrub to rebuild extended attributes when any of the
attr blocks are fragmented.

Patch 14 implements reconstruction of quota blocks.

If you're going to start using this mess, you probably ought to just
pull from my git trees.  The kernel patches[1] should apply against
4.18-rc7.  xfsprogs[2] and xfstests[3] can be found in their usual
places.  The git trees contain all four series' worth of changes.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

[1] https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=djwong-devel
[2] https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=djwong-devel
[3] https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=djwong-devel


* [PATCH 01/14] xfs: refactor the xrep_extent_list into xfs_bitmap
  2018-07-30  5:47 [PATCH v17.1 00/14] xfs-4.19: online repair support Darrick J. Wong
@ 2018-07-30  5:47 ` Darrick J. Wong
  2018-07-30 16:21   ` Brian Foster
  2018-07-30  5:48 ` [PATCH 02/14] xfs: repair the AGF Darrick J. Wong
                   ` (12 subsequent siblings)
  13 siblings, 1 reply; 53+ messages in thread
From: Darrick J. Wong @ 2018-07-30  5:47 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs, bfoster, david, allison.henderson

From: Darrick J. Wong <darrick.wong@oracle.com>

As mentioned previously, the xrep_extent_list basically implements a
bitmap with two functions: set and disjoint union.  Rename all these
functions to xfs_bitmap to shorten the name and make it more obvious
what we're doing.
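
For readers who want to see the subtraction cases in isolation, here is
a minimal user-space sketch of removing one range from another.  It
follows the LEFT_ALIGNED/RIGHT_ALIGNED case split used by
xfs_bitmap_disunion() in the diff below, but it is not the kernel code:
struct range and range_subtract() are hypothetical names, the list
handling is replaced by a two-element output array, and the sketch
clamps end offsets rather than lengths when trimming.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical user-space stand-in for struct xfs_bitmap_range. */
struct range {
	uint64_t	start;
	uint64_t	len;
};

#define LEFT_ALIGNED	(1 << 0)
#define RIGHT_ALIGNED	(1 << 1)

/*
 * Subtract @sub from @br and store the surviving pieces in @out, which
 * must have room for two ranges.  Returns the number of pieces: 0, 1 or 2.
 */
static int
range_subtract(
	const struct range	*br,
	const struct range	*sub,
	struct range		out[2])
{
	uint64_t		br_end = br->start + br->len;
	uint64_t		sub_start = sub->start;
	uint64_t		sub_end = sub->start + sub->len;
	int			state = 0;

	assert(br->len > 0 && sub->len > 0);

	/* No overlap at all: @br survives untouched. */
	if (sub_end <= br->start || sub_start >= br_end) {
		out[0] = *br;
		return 1;
	}

	/* Trim @sub so that it lies entirely inside @br. */
	if (sub_start < br->start)
		sub_start = br->start;
	if (sub_end > br_end)
		sub_end = br_end;

	if (sub_start == br->start)
		state |= LEFT_ALIGNED;
	if (sub_end == br_end)
		state |= RIGHT_ALIGNED;

	switch (state) {
	case LEFT_ALIGNED:
		/* Coincides with only the left: keep the tail. */
		out[0].start = sub_end;
		out[0].len = br_end - sub_end;
		return 1;
	case RIGHT_ALIGNED:
		/* Coincides with only the right: keep the head. */
		out[0].start = br->start;
		out[0].len = sub_start - br->start;
		return 1;
	case LEFT_ALIGNED | RIGHT_ALIGNED:
		/* Total overlap: nothing survives. */
		return 0;
	default:
		/* Hole in the middle: keep the head and the tail. */
		out[0].start = br->start;
		out[0].len = sub_start - br->start;
		out[1].start = sub_end;
		out[1].len = br_end - sub_end;
		return 2;
	}
}
```

Subtracting blocks 13-14 from a range covering blocks 10-19 takes the
default branch and leaves two pieces, 10-12 and 15-19; this is the case
where the kernel function has to allocate a new range and can therefore
fail with -ENOMEM.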

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/scrub/bitmap.c |  183 +++++++++++++++++++++++++------------------------
 fs/xfs/scrub/bitmap.h |   35 ++++-----
 fs/xfs/scrub/repair.c |   85 ++++++++++-------------
 fs/xfs/scrub/repair.h |    8 +-
 fs/xfs/scrub/trace.h  |    1 
 5 files changed, 149 insertions(+), 163 deletions(-)


diff --git a/fs/xfs/scrub/bitmap.c b/fs/xfs/scrub/bitmap.c
index a7c2f4773f98..c770e2d0b6aa 100644
--- a/fs/xfs/scrub/bitmap.c
+++ b/fs/xfs/scrub/bitmap.c
@@ -16,183 +16,186 @@
 #include "scrub/repair.h"
 #include "scrub/bitmap.h"
 
-/* Collect a dead btree extent for later disposal. */
+/*
+ * Set a range of this bitmap.  Caller must ensure the range is not set.
+ *
+ * This is the logical equivalent of bitmap |= mask(start, len).
+ */
 int
-xrep_collect_btree_extent(
-	struct xfs_scrub	*sc,
-	struct xrep_extent_list	*exlist,
-	xfs_fsblock_t		fsbno,
-	xfs_extlen_t		len)
+xfs_bitmap_set(
+	struct xfs_bitmap	*bitmap,
+	uint64_t		start,
+	uint64_t		len)
 {
-	struct xrep_extent	*rex;
+	struct xfs_bitmap_range	*bmr;
 
-	trace_xrep_collect_btree_extent(sc->mp,
-			XFS_FSB_TO_AGNO(sc->mp, fsbno),
-			XFS_FSB_TO_AGBNO(sc->mp, fsbno), len);
-
-	rex = kmem_alloc(sizeof(struct xrep_extent), KM_MAYFAIL);
-	if (!rex)
+	bmr = kmem_alloc(sizeof(struct xfs_bitmap_range), KM_MAYFAIL);
+	if (!bmr)
 		return -ENOMEM;
 
-	INIT_LIST_HEAD(&rex->list);
-	rex->fsbno = fsbno;
-	rex->len = len;
-	list_add_tail(&rex->list, &exlist->list);
+	INIT_LIST_HEAD(&bmr->list);
+	bmr->start = start;
+	bmr->len = len;
+	list_add_tail(&bmr->list, &bitmap->list);
 
 	return 0;
 }
 
-/*
- * An error happened during the rebuild so the transaction will be cancelled.
- * The fs will shut down, and the administrator has to unmount and run repair.
- * Therefore, free all the memory associated with the list so we can die.
- */
+/* Free everything related to this bitmap. */
 void
-xrep_cancel_btree_extents(
-	struct xfs_scrub	*sc,
-	struct xrep_extent_list	*exlist)
+xfs_bitmap_destroy(
+	struct xfs_bitmap	*bitmap)
 {
-	struct xrep_extent	*rex;
-	struct xrep_extent	*n;
+	struct xfs_bitmap_range	*bmr;
+	struct xfs_bitmap_range	*n;
 
-	for_each_xrep_extent_safe(rex, n, exlist) {
-		list_del(&rex->list);
-		kmem_free(rex);
+	for_each_xfs_bitmap_extent(bmr, n, bitmap) {
+		list_del(&bmr->list);
+		kmem_free(bmr);
 	}
 }
 
+/* Set up a per-AG block bitmap. */
+void
+xfs_bitmap_init(
+	struct xfs_bitmap	*bitmap)
+{
+	INIT_LIST_HEAD(&bitmap->list);
+}
+
 /* Compare two btree extents. */
 static int
-xrep_btree_extent_cmp(
+xfs_bitmap_range_cmp(
 	void			*priv,
 	struct list_head	*a,
 	struct list_head	*b)
 {
-	struct xrep_extent	*ap;
-	struct xrep_extent	*bp;
+	struct xfs_bitmap_range	*ap;
+	struct xfs_bitmap_range	*bp;
 
-	ap = container_of(a, struct xrep_extent, list);
-	bp = container_of(b, struct xrep_extent, list);
+	ap = container_of(a, struct xfs_bitmap_range, list);
+	bp = container_of(b, struct xfs_bitmap_range, list);
 
-	if (ap->fsbno > bp->fsbno)
+	if (ap->start > bp->start)
 		return 1;
-	if (ap->fsbno < bp->fsbno)
+	if (ap->start < bp->start)
 		return -1;
 	return 0;
 }
 
 /*
- * Remove all the blocks mentioned in @sublist from the extents in @exlist.
+ * Remove all the blocks mentioned in @sub from the extents in @bitmap.
  *
  * The intent is that callers will iterate the rmapbt for all of its records
- * for a given owner to generate @exlist; and iterate all the blocks of the
+ * for a given owner to generate @bitmap; and iterate all the blocks of the
  * metadata structures that are not being rebuilt and have the same rmapbt
- * owner to generate @sublist.  This routine subtracts all the extents
- * mentioned in sublist from all the extents linked in @exlist, which leaves
- * @exlist as the list of blocks that are not accounted for, which we assume
+ * owner to generate @sub.  This routine subtracts all the extents
+ * mentioned in sub from all the extents linked in @bitmap, which leaves
+ * @bitmap as the list of blocks that are not accounted for, which we assume
  * are the dead blocks of the old metadata structure.  The blocks mentioned in
- * @exlist can be reaped.
+ * @bitmap can be reaped.
+ *
+ * This is the logical equivalent of bitmap &= ~sub.
  */
 #define LEFT_ALIGNED	(1 << 0)
 #define RIGHT_ALIGNED	(1 << 1)
 int
-xrep_subtract_extents(
-	struct xfs_scrub	*sc,
-	struct xrep_extent_list	*exlist,
-	struct xrep_extent_list	*sublist)
+xfs_bitmap_disunion(
+	struct xfs_bitmap	*bitmap,
+	struct xfs_bitmap	*sub)
 {
 	struct list_head	*lp;
-	struct xrep_extent	*ex;
-	struct xrep_extent	*newex;
-	struct xrep_extent	*subex;
-	xfs_fsblock_t		sub_fsb;
-	xfs_extlen_t		sub_len;
+	struct xfs_bitmap_range	*br;
+	struct xfs_bitmap_range	*new_br;
+	struct xfs_bitmap_range	*sub_br;
+	uint64_t		sub_start;
+	uint64_t		sub_len;
 	int			state;
 	int			error = 0;
 
-	if (list_empty(&exlist->list) || list_empty(&sublist->list))
+	if (list_empty(&bitmap->list) || list_empty(&sub->list))
 		return 0;
-	ASSERT(!list_empty(&sublist->list));
+	ASSERT(!list_empty(&sub->list));
 
-	list_sort(NULL, &exlist->list, xrep_btree_extent_cmp);
-	list_sort(NULL, &sublist->list, xrep_btree_extent_cmp);
+	list_sort(NULL, &bitmap->list, xfs_bitmap_range_cmp);
+	list_sort(NULL, &sub->list, xfs_bitmap_range_cmp);
 
 	/*
-	 * Now that we've sorted both lists, we iterate exlist once, rolling
-	 * forward through sublist and/or exlist as necessary until we find an
+	 * Now that we've sorted both lists, we iterate bitmap once, rolling
+	 * forward through sub and/or bitmap as necessary until we find an
 	 * overlap or reach the end of either list.  We do not reset lp to the
-	 * head of exlist nor do we reset subex to the head of sublist.  The
+	 * head of bitmap nor do we reset sub_br to the head of sub.  The
 	 * list traversal is similar to merge sort, but we're deleting
 	 * instead.  In this manner we avoid O(n^2) operations.
 	 */
-	subex = list_first_entry(&sublist->list, struct xrep_extent,
+	sub_br = list_first_entry(&sub->list, struct xfs_bitmap_range,
 			list);
-	lp = exlist->list.next;
-	while (lp != &exlist->list) {
-		ex = list_entry(lp, struct xrep_extent, list);
+	lp = bitmap->list.next;
+	while (lp != &bitmap->list) {
+		br = list_entry(lp, struct xfs_bitmap_range, list);
 
 		/*
-		 * Advance subex and/or ex until we find a pair that
+		 * Advance sub_br and/or br until we find a pair that
 		 * intersect or we run out of extents.
 		 */
-		while (subex->fsbno + subex->len <= ex->fsbno) {
-			if (list_is_last(&subex->list, &sublist->list))
+		while (sub_br->start + sub_br->len <= br->start) {
+			if (list_is_last(&sub_br->list, &sub->list))
 				goto out;
-			subex = list_next_entry(subex, list);
+			sub_br = list_next_entry(sub_br, list);
 		}
-		if (subex->fsbno >= ex->fsbno + ex->len) {
+		if (sub_br->start >= br->start + br->len) {
 			lp = lp->next;
 			continue;
 		}
 
-		/* trim subex to fit the extent we have */
-		sub_fsb = subex->fsbno;
-		sub_len = subex->len;
-		if (subex->fsbno < ex->fsbno) {
-			sub_len -= ex->fsbno - subex->fsbno;
-			sub_fsb = ex->fsbno;
+		/* trim sub_br to fit the extent we have */
+		sub_start = sub_br->start;
+		sub_len = sub_br->len;
+		if (sub_br->start < br->start) {
+			sub_len -= br->start - sub_br->start;
+			sub_start = br->start;
 		}
-		if (sub_len > ex->len)
-			sub_len = ex->len;
+		if (sub_len > br->len)
+			sub_len = br->len;
 
 		state = 0;
-		if (sub_fsb == ex->fsbno)
+		if (sub_start == br->start)
 			state |= LEFT_ALIGNED;
-		if (sub_fsb + sub_len == ex->fsbno + ex->len)
+		if (sub_start + sub_len == br->start + br->len)
 			state |= RIGHT_ALIGNED;
 		switch (state) {
 		case LEFT_ALIGNED:
 			/* Coincides with only the left. */
-			ex->fsbno += sub_len;
-			ex->len -= sub_len;
+			br->start += sub_len;
+			br->len -= sub_len;
 			break;
 		case RIGHT_ALIGNED:
 			/* Coincides with only the right. */
-			ex->len -= sub_len;
+			br->len -= sub_len;
 			lp = lp->next;
 			break;
 		case LEFT_ALIGNED | RIGHT_ALIGNED:
 			/* Total overlap, just delete ex. */
 			lp = lp->next;
-			list_del(&ex->list);
-			kmem_free(ex);
+			list_del(&br->list);
+			kmem_free(br);
 			break;
 		case 0:
 			/*
 			 * Deleting from the middle: add the new right extent
 			 * and then shrink the left extent.
 			 */
-			newex = kmem_alloc(sizeof(struct xrep_extent),
+			new_br = kmem_alloc(sizeof(struct xfs_bitmap_range),
 					KM_MAYFAIL);
-			if (!newex) {
+			if (!new_br) {
 				error = -ENOMEM;
 				goto out;
 			}
-			INIT_LIST_HEAD(&newex->list);
-			newex->fsbno = sub_fsb + sub_len;
-			newex->len = ex->fsbno + ex->len - newex->fsbno;
-			list_add(&newex->list, &ex->list);
-			ex->len = sub_fsb - ex->fsbno;
+			INIT_LIST_HEAD(&new_br->list);
+			new_br->start = sub_start + sub_len;
+			new_br->len = br->start + br->len - new_br->start;
+			list_add(&new_br->list, &br->list);
+			br->len = sub_start - br->start;
 			lp = lp->next;
 			break;
 		default:
diff --git a/fs/xfs/scrub/bitmap.h b/fs/xfs/scrub/bitmap.h
index 1038157695a8..dad652ee9177 100644
--- a/fs/xfs/scrub/bitmap.h
+++ b/fs/xfs/scrub/bitmap.h
@@ -6,32 +6,27 @@
 #ifndef __XFS_SCRUB_BITMAP_H__
 #define __XFS_SCRUB_BITMAP_H__
 
-struct xrep_extent {
+struct xfs_bitmap_range {
 	struct list_head	list;
-	xfs_fsblock_t		fsbno;
-	xfs_extlen_t		len;
+	uint64_t		start;
+	uint64_t		len;
 };
 
-struct xrep_extent_list {
+struct xfs_bitmap {
 	struct list_head	list;
 };
 
-static inline void
-xrep_init_extent_list(
-	struct xrep_extent_list		*exlist)
-{
-	INIT_LIST_HEAD(&exlist->list);
-}
+void xfs_bitmap_init(struct xfs_bitmap *bitmap);
+void xfs_bitmap_destroy(struct xfs_bitmap *bitmap);
 
-#define for_each_xrep_extent_safe(rbe, n, exlist) \
-	list_for_each_entry_safe((rbe), (n), &(exlist)->list, list)
-int xrep_collect_btree_extent(struct xfs_scrub *sc,
-		struct xrep_extent_list *btlist, xfs_fsblock_t fsbno,
-		xfs_extlen_t len);
-void xrep_cancel_btree_extents(struct xfs_scrub *sc,
-		struct xrep_extent_list *btlist);
-int xrep_subtract_extents(struct xfs_scrub *sc,
-		struct xrep_extent_list *exlist,
-		struct xrep_extent_list *sublist);
+#define for_each_xfs_bitmap_extent(bex, n, bitmap) \
+	list_for_each_entry_safe((bex), (n), &(bitmap)->list, list)
+
+#define for_each_xfs_bitmap_block(b, bex, n, bitmap) \
+	list_for_each_entry_safe((bex), (n), &(bitmap)->list, list) \
+		for ((b) = bex->start; (b) < bex->start + bex->len; (b)++)
+
+int xfs_bitmap_set(struct xfs_bitmap *bitmap, uint64_t start, uint64_t len);
+int xfs_bitmap_disunion(struct xfs_bitmap *bitmap, struct xfs_bitmap *sub);
 
 #endif	/* __XFS_SCRUB_BITMAP_H__ */
diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
index 27a904ef6189..85b048b341a0 100644
--- a/fs/xfs/scrub/repair.c
+++ b/fs/xfs/scrub/repair.c
@@ -368,17 +368,17 @@ xrep_init_btblock(
  *
  * However, that leaves the matter of removing all the metadata describing the
  * old broken structure.  For primary metadata we use the rmap data to collect
- * every extent with a matching rmap owner (exlist); we then iterate all other
+ * every extent with a matching rmap owner (bitmap); we then iterate all other
  * metadata structures with the same rmap owner to collect the extents that
- * cannot be removed (sublist).  We then subtract sublist from exlist to
+ * cannot be removed (sublist).  We then subtract sublist from bitmap to
  * derive the blocks that were used by the old btree.  These blocks can be
  * reaped.
  *
  * For rmapbt reconstructions we must use different tactics for extent
  * collection.  First we iterate all primary metadata (this excludes the old
  * rmapbt, obviously) to generate new rmap records.  The gaps in the rmap
- * records are collected as exlist.  The bnobt records are collected as
- * sublist.  As with the other btrees we subtract sublist from exlist, and the
+ * records are collected as bitmap.  The bnobt records are collected as
+ * sublist.  As with the other btrees we subtract sublist from bitmap, and the
  * result (since the rmapbt lives in the free space) are the blocks from the
  * old rmapbt.
  *
@@ -386,11 +386,11 @@ xrep_init_btblock(
  *
  * Now that we've constructed a new btree to replace the damaged one, we want
  * to dispose of the blocks that (we think) the old btree was using.
- * Previously, we used the rmapbt to collect the extents (exlist) with the
+ * Previously, we used the rmapbt to collect the extents (bitmap) with the
  * rmap owner corresponding to the tree we rebuilt, collected extents for any
  * blocks with the same rmap owner that are owned by another data structure
- * (sublist), and subtracted sublist from exlist.  In theory the extents
- * remaining in exlist are the old btree's blocks.
+ * (sublist), and subtracted sublist from bitmap.  In theory the extents
+ * remaining in bitmap are the old btree's blocks.
  *
  * Unfortunately, it's possible that the btree was crosslinked with other
  * blocks on disk.  The rmap data can tell us if there are multiple owners, so
@@ -406,7 +406,7 @@ xrep_init_btblock(
  * If there are no rmap records at all, we also free the block.  If the btree
  * being rebuilt lives in the free space (bnobt/cntbt/rmapbt) then there isn't
  * supposed to be a rmap record and everything is ok.  For other btrees there
- * had to have been an rmap entry for the block to have ended up on @exlist,
+ * had to have been an rmap entry for the block to have ended up on @bitmap,
  * so if it's gone now there's something wrong and the fs will shut down.
  *
  * Note: If there are multiple rmap records with only the same rmap owner as
@@ -419,7 +419,7 @@ xrep_init_btblock(
  * The caller is responsible for locking the AG headers for the entire rebuild
  * operation so that nothing else can sneak in and change the AG state while
  * we're not looking.  We also assume that the caller already invalidated any
- * buffers associated with @exlist.
+ * buffers associated with @bitmap.
  */
 
 /*
@@ -429,13 +429,12 @@ xrep_init_btblock(
 int
 xrep_invalidate_blocks(
 	struct xfs_scrub	*sc,
-	struct xrep_extent_list	*exlist)
+	struct xfs_bitmap	*bitmap)
 {
-	struct xrep_extent	*rex;
-	struct xrep_extent	*n;
+	struct xfs_bitmap_range	*bmr;
+	struct xfs_bitmap_range	*n;
 	struct xfs_buf		*bp;
 	xfs_fsblock_t		fsbno;
-	xfs_agblock_t		i;
 
 	/*
 	 * For each block in each extent, see if there's an incore buffer for
@@ -445,18 +444,16 @@ xrep_invalidate_blocks(
 	 * because we never own those; and if we can't TRYLOCK the buffer we
 	 * assume it's owned by someone else.
 	 */
-	for_each_xrep_extent_safe(rex, n, exlist) {
-		for (fsbno = rex->fsbno, i = rex->len; i > 0; fsbno++, i--) {
-			/* Skip AG headers and post-EOFS blocks */
-			if (!xfs_verify_fsbno(sc->mp, fsbno))
-				continue;
-			bp = xfs_buf_incore(sc->mp->m_ddev_targp,
-					XFS_FSB_TO_DADDR(sc->mp, fsbno),
-					XFS_FSB_TO_BB(sc->mp, 1), XBF_TRYLOCK);
-			if (bp) {
-				xfs_trans_bjoin(sc->tp, bp);
-				xfs_trans_binval(sc->tp, bp);
-			}
+	for_each_xfs_bitmap_block(fsbno, bmr, n, bitmap) {
+		/* Skip AG headers and post-EOFS blocks */
+		if (!xfs_verify_fsbno(sc->mp, fsbno))
+			continue;
+		bp = xfs_buf_incore(sc->mp->m_ddev_targp,
+				XFS_FSB_TO_DADDR(sc->mp, fsbno),
+				XFS_FSB_TO_BB(sc->mp, 1), XBF_TRYLOCK);
+		if (bp) {
+			xfs_trans_bjoin(sc->tp, bp);
+			xfs_trans_binval(sc->tp, bp);
 		}
 	}
 
@@ -519,9 +516,9 @@ xrep_put_freelist(
 	return 0;
 }
 
-/* Dispose of a single metadata block. */
+/* Dispose of a single block. */
 STATIC int
-xrep_dispose_btree_block(
+xrep_reap_block(
 	struct xfs_scrub	*sc,
 	xfs_fsblock_t		fsbno,
 	struct xfs_owner_info	*oinfo,
@@ -593,41 +590,35 @@ xrep_dispose_btree_block(
 	return error;
 }
 
-/* Dispose of btree blocks from an old per-AG btree. */
+/* Dispose of every block of every extent in the bitmap. */
 int
-xrep_reap_btree_extents(
+xrep_reap_extents(
 	struct xfs_scrub	*sc,
-	struct xrep_extent_list	*exlist,
+	struct xfs_bitmap	*bitmap,
 	struct xfs_owner_info	*oinfo,
 	enum xfs_ag_resv_type	type)
 {
-	struct xrep_extent	*rex;
-	struct xrep_extent	*n;
+	struct xfs_bitmap_range	*bmr;
+	struct xfs_bitmap_range	*n;
+	xfs_fsblock_t		fsbno;
 	int			error = 0;
 
 	ASSERT(xfs_sb_version_hasrmapbt(&sc->mp->m_sb));
 
-	/* Dispose of every block from the old btree. */
-	for_each_xrep_extent_safe(rex, n, exlist) {
+	for_each_xfs_bitmap_block(fsbno, bmr, n, bitmap) {
 		ASSERT(sc->ip != NULL ||
-		       XFS_FSB_TO_AGNO(sc->mp, rex->fsbno) == sc->sa.agno);
-
+		       XFS_FSB_TO_AGNO(sc->mp, fsbno) == sc->sa.agno);
 		trace_xrep_dispose_btree_extent(sc->mp,
-				XFS_FSB_TO_AGNO(sc->mp, rex->fsbno),
-				XFS_FSB_TO_AGBNO(sc->mp, rex->fsbno), rex->len);
+				XFS_FSB_TO_AGNO(sc->mp, fsbno),
+				XFS_FSB_TO_AGBNO(sc->mp, fsbno), 1);
 
-		for (; rex->len > 0; rex->len--, rex->fsbno++) {
-			error = xrep_dispose_btree_block(sc, rex->fsbno,
-					oinfo, type);
-			if (error)
-				goto out;
-		}
-		list_del(&rex->list);
-		kmem_free(rex);
+		error = xrep_reap_block(sc, fsbno, oinfo, type);
+		if (error)
+			goto out;
 	}
 
 out:
-	xrep_cancel_btree_extents(sc, exlist);
+	xfs_bitmap_destroy(bitmap);
 	return error;
 }
 
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index a3d491a438f4..5a4e92221916 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -27,13 +27,11 @@ int xrep_init_btblock(struct xfs_scrub *sc, xfs_fsblock_t fsb,
 		struct xfs_buf **bpp, xfs_btnum_t btnum,
 		const struct xfs_buf_ops *ops);
 
-struct xrep_extent_list;
+struct xfs_bitmap;
 
 int xrep_fix_freelist(struct xfs_scrub *sc, bool can_shrink);
-int xrep_invalidate_blocks(struct xfs_scrub *sc,
-		struct xrep_extent_list *btlist);
-int xrep_reap_btree_extents(struct xfs_scrub *sc,
-		struct xrep_extent_list *exlist,
+int xrep_invalidate_blocks(struct xfs_scrub *sc, struct xfs_bitmap *btlist);
+int xrep_reap_extents(struct xfs_scrub *sc, struct xfs_bitmap *exlist,
 		struct xfs_owner_info *oinfo, enum xfs_ag_resv_type type);
 
 struct xrep_find_ag_btree {
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index 93db22c39b51..4e20f0e48232 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -511,7 +511,6 @@ DEFINE_EVENT(xrep_extent_class, name, \
 		 xfs_agblock_t agbno, xfs_extlen_t len), \
 	TP_ARGS(mp, agno, agbno, len))
 DEFINE_REPAIR_EXTENT_EVENT(xrep_dispose_btree_extent);
-DEFINE_REPAIR_EXTENT_EVENT(xrep_collect_btree_extent);
 DEFINE_REPAIR_EXTENT_EVENT(xrep_agfl_insert);
 
 DECLARE_EVENT_CLASS(xrep_rmap_class,



* [PATCH 02/14] xfs: repair the AGF
  2018-07-30  5:47 [PATCH v17.1 00/14] xfs-4.19: online repair support Darrick J. Wong
  2018-07-30  5:47 ` [PATCH 01/14] xfs: refactor the xrep_extent_list into xfs_bitmap Darrick J. Wong
@ 2018-07-30  5:48 ` Darrick J. Wong
  2018-07-30 16:22   ` Brian Foster
  2018-07-30  5:48 ` [PATCH 03/14] xfs: repair the AGFL Darrick J. Wong
                   ` (11 subsequent siblings)
  13 siblings, 1 reply; 53+ messages in thread
From: Darrick J. Wong @ 2018-07-30  5:48 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs, bfoster, david, allison.henderson

From: Darrick J. Wong <darrick.wong@oracle.com>

Regenerate the AGF from the rmap data.
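
As a rough illustration of the counter arithmetic, the sketch below
recomputes the free-space counters the way xrep_agf_walk_allocbt() and
xrep_agf_calc_from_btrees() do in the diff that follows: freeblks is the
sum of all free extent lengths, longest is the largest of them, and
btreeblks adds up the bnobt, cntbt and rmapbt block counts minus one
root block each.  This is a user-space sketch only; struct free_rec,
struct agf_counters, agf_count_free() and agf_btreeblks() are
hypothetical names, and a plain array stands in for the btree walk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one free space btree record. */
struct free_rec {
	uint32_t	startblock;
	uint32_t	blockcount;
};

/* Hypothetical holder for the counters derived from the btrees. */
struct agf_counters {
	uint32_t	freeblks;
	uint32_t	longest;
};

/* Sum the free extents and remember the longest one. */
static void
agf_count_free(
	const struct free_rec	*recs,
	size_t			nr,
	struct agf_counters	*c)
{
	size_t			i;

	c->freeblks = 0;
	c->longest = 0;
	for (i = 0; i < nr; i++) {
		c->freeblks += recs[i].blockcount;
		if (recs[i].blockcount > c->longest)
			c->longest = recs[i].blockcount;
	}
}

/* Count btree blocks, leaving out the root block of each btree. */
static uint32_t
agf_btreeblks(
	uint32_t		bno_blocks,
	uint32_t		cnt_blocks,
	uint32_t		rmap_blocks)
{
	/* Every btree has at least its root block. */
	assert(bno_blocks >= 1 && cnt_blocks >= 1 && rmap_blocks >= 1);
	return (bno_blocks - 1) + (cnt_blocks - 1) + (rmap_blocks - 1);
}
```

In the patch the records come from xfs_alloc_query_all() on the bnobt
and the block counts from xfs_btree_count_blocks(); the results are
converted with cpu_to_be32() before they are stored in the AGF.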

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/scrub/agheader_repair.c |  370 ++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/repair.c          |   27 ++-
 fs/xfs/scrub/repair.h          |    4 
 fs/xfs/scrub/scrub.c           |    2 
 4 files changed, 393 insertions(+), 10 deletions(-)


diff --git a/fs/xfs/scrub/agheader_repair.c b/fs/xfs/scrub/agheader_repair.c
index 1e96621ece3a..4842fc598c9b 100644
--- a/fs/xfs/scrub/agheader_repair.c
+++ b/fs/xfs/scrub/agheader_repair.c
@@ -17,12 +17,19 @@
 #include "xfs_sb.h"
 #include "xfs_inode.h"
 #include "xfs_alloc.h"
+#include "xfs_alloc_btree.h"
 #include "xfs_ialloc.h"
+#include "xfs_ialloc_btree.h"
 #include "xfs_rmap.h"
+#include "xfs_rmap_btree.h"
+#include "xfs_refcount.h"
+#include "xfs_refcount_btree.h"
 #include "scrub/xfs_scrub.h"
 #include "scrub/scrub.h"
 #include "scrub/common.h"
 #include "scrub/trace.h"
+#include "scrub/repair.h"
+#include "scrub/bitmap.h"
 
 /* Superblock */
 
@@ -54,3 +61,366 @@ xrep_superblock(
 	xfs_trans_log_buf(sc->tp, bp, 0, BBTOB(bp->b_length) - 1);
 	return error;
 }
+
+/* AGF */
+
+struct xrep_agf_allocbt {
+	struct xfs_scrub	*sc;
+	xfs_agblock_t		freeblks;
+	xfs_agblock_t		longest;
+};
+
+/* Record free space shape information. */
+STATIC int
+xrep_agf_walk_allocbt(
+	struct xfs_btree_cur		*cur,
+	struct xfs_alloc_rec_incore	*rec,
+	void				*priv)
+{
+	struct xrep_agf_allocbt		*raa = priv;
+	int				error = 0;
+
+	if (xchk_should_terminate(raa->sc, &error))
+		return error;
+
+	raa->freeblks += rec->ar_blockcount;
+	if (rec->ar_blockcount > raa->longest)
+		raa->longest = rec->ar_blockcount;
+	return error;
+}
+
+/* Does this AGFL block look sane? */
+STATIC int
+xrep_agf_check_agfl_block(
+	struct xfs_mount	*mp,
+	xfs_agblock_t		agbno,
+	void			*priv)
+{
+	struct xfs_scrub	*sc = priv;
+
+	if (!xfs_verify_agbno(mp, sc->sa.agno, agbno))
+		return -EFSCORRUPTED;
+	return 0;
+}
+
+/*
+ * Offset within the xrep_find_ag_btree array for each btree type.  Avoid the
+ * XFS_BTNUM_ names here to avoid creating a sparse array.
+ */
+enum {
+	XREP_AGF_BNOBT = 0,
+	XREP_AGF_CNTBT,
+	XREP_AGF_RMAPBT,
+	XREP_AGF_REFCOUNTBT,
+	XREP_AGF_END,
+	XREP_AGF_MAX
+};
+
+/*
+ * Given the btree roots described by *fab, find the roots, check them for
+ * sanity, and pass the root data back out via *fab.
+ *
+ * This is /also/ a chicken and egg problem because we have to use the rmapbt
+ * (rooted in the AGF) to find the btrees rooted in the AGF.  We also have no
+ * idea if the btrees make any sense.  If we hit obvious corruptions in those
+ * btrees we'll bail out.
+ */
+STATIC int
+xrep_agf_find_btrees(
+	struct xfs_scrub		*sc,
+	struct xfs_buf			*agf_bp,
+	struct xrep_find_ag_btree	*fab,
+	struct xfs_buf			*agfl_bp)
+{
+	struct xfs_agf			*old_agf = XFS_BUF_TO_AGF(agf_bp);
+	int				error;
+
+	/* Go find the root data. */
+	error = xrep_find_ag_btree_roots(sc, agf_bp, fab, agfl_bp);
+	if (error)
+		return error;
+
+	/* We must find the bnobt, cntbt, and rmapbt roots. */
+	if (fab[XREP_AGF_BNOBT].root == NULLAGBLOCK ||
+	    fab[XREP_AGF_BNOBT].height > XFS_BTREE_MAXLEVELS ||
+	    fab[XREP_AGF_CNTBT].root == NULLAGBLOCK ||
+	    fab[XREP_AGF_CNTBT].height > XFS_BTREE_MAXLEVELS ||
+	    fab[XREP_AGF_RMAPBT].root == NULLAGBLOCK ||
+	    fab[XREP_AGF_RMAPBT].height > XFS_BTREE_MAXLEVELS)
+		return -EFSCORRUPTED;
+
+	/*
+	 * We relied on the rmapbt to reconstruct the AGF.  If we get a
+	 * different root then something's seriously wrong.
+	 */
+	if (fab[XREP_AGF_RMAPBT].root !=
+	    be32_to_cpu(old_agf->agf_roots[XFS_BTNUM_RMAPi]))
+		return -EFSCORRUPTED;
+
+	/* We must find the refcountbt root if that feature is enabled. */
+	if (xfs_sb_version_hasreflink(&sc->mp->m_sb) &&
+	    (fab[XREP_AGF_REFCOUNTBT].root == NULLAGBLOCK ||
+	     fab[XREP_AGF_REFCOUNTBT].height > XFS_BTREE_MAXLEVELS))
+		return -EFSCORRUPTED;
+
+	return 0;
+}
+
+/*
+ * Reinitialize the AGF header, making an in-core copy of the old contents so
+ * that we know which in-core state needs to be reinitialized.
+ */
+STATIC void
+xrep_agf_init_header(
+	struct xfs_scrub	*sc,
+	struct xfs_buf		*agf_bp,
+	struct xfs_agf		*old_agf)
+{
+	struct xfs_mount	*mp = sc->mp;
+	struct xfs_agf		*agf = XFS_BUF_TO_AGF(agf_bp);
+
+	memcpy(old_agf, agf, sizeof(*old_agf));
+	memset(agf, 0, BBTOB(agf_bp->b_length));
+	agf->agf_magicnum = cpu_to_be32(XFS_AGF_MAGIC);
+	agf->agf_versionnum = cpu_to_be32(XFS_AGF_VERSION);
+	agf->agf_seqno = cpu_to_be32(sc->sa.agno);
+	agf->agf_length = cpu_to_be32(xfs_ag_block_count(mp, sc->sa.agno));
+	agf->agf_flfirst = old_agf->agf_flfirst;
+	agf->agf_fllast = old_agf->agf_fllast;
+	agf->agf_flcount = old_agf->agf_flcount;
+	if (xfs_sb_version_hascrc(&mp->m_sb))
+		uuid_copy(&agf->agf_uuid, &mp->m_sb.sb_meta_uuid);
+
+	/* Mark the incore AGF data stale until we're done fixing things. */
+	ASSERT(sc->sa.pag->pagf_init);
+	sc->sa.pag->pagf_init = 0;
+}
+
+/* Set btree root information in an AGF. */
+STATIC void
+xrep_agf_set_roots(
+	struct xfs_scrub		*sc,
+	struct xfs_agf			*agf,
+	struct xrep_find_ag_btree	*fab)
+{
+	agf->agf_roots[XFS_BTNUM_BNOi] =
+			cpu_to_be32(fab[XREP_AGF_BNOBT].root);
+	agf->agf_levels[XFS_BTNUM_BNOi] =
+			cpu_to_be32(fab[XREP_AGF_BNOBT].height);
+
+	agf->agf_roots[XFS_BTNUM_CNTi] =
+			cpu_to_be32(fab[XREP_AGF_CNTBT].root);
+	agf->agf_levels[XFS_BTNUM_CNTi] =
+			cpu_to_be32(fab[XREP_AGF_CNTBT].height);
+
+	agf->agf_roots[XFS_BTNUM_RMAPi] =
+			cpu_to_be32(fab[XREP_AGF_RMAPBT].root);
+	agf->agf_levels[XFS_BTNUM_RMAPi] =
+			cpu_to_be32(fab[XREP_AGF_RMAPBT].height);
+
+	if (xfs_sb_version_hasreflink(&sc->mp->m_sb)) {
+		agf->agf_refcount_root =
+				cpu_to_be32(fab[XREP_AGF_REFCOUNTBT].root);
+		agf->agf_refcount_level =
+				cpu_to_be32(fab[XREP_AGF_REFCOUNTBT].height);
+	}
+}
+
+/* Update all AGF fields which derive from btree contents. */
+STATIC int
+xrep_agf_calc_from_btrees(
+	struct xfs_scrub	*sc,
+	struct xfs_buf		*agf_bp)
+{
+	struct xrep_agf_allocbt	raa = { .sc = sc };
+	struct xfs_btree_cur	*cur = NULL;
+	struct xfs_agf		*agf = XFS_BUF_TO_AGF(agf_bp);
+	struct xfs_mount	*mp = sc->mp;
+	xfs_agblock_t		btreeblks;
+	xfs_agblock_t		blocks;
+	int			error;
+
+	/* Update the AGF counters from the bnobt. */
+	cur = xfs_allocbt_init_cursor(mp, sc->tp, agf_bp, sc->sa.agno,
+			XFS_BTNUM_BNO);
+	error = xfs_alloc_query_all(cur, xrep_agf_walk_allocbt, &raa);
+	if (error)
+		goto err;
+	error = xfs_btree_count_blocks(cur, &blocks);
+	if (error)
+		goto err;
+	xfs_btree_del_cursor(cur, error);
+	btreeblks = blocks - 1;
+	agf->agf_freeblks = cpu_to_be32(raa.freeblks);
+	agf->agf_longest = cpu_to_be32(raa.longest);
+
+	/* Update the AGF counters from the cntbt. */
+	cur = xfs_allocbt_init_cursor(mp, sc->tp, agf_bp, sc->sa.agno,
+			XFS_BTNUM_CNT);
+	error = xfs_btree_count_blocks(cur, &blocks);
+	if (error)
+		goto err;
+	xfs_btree_del_cursor(cur, error);
+	btreeblks += blocks - 1;
+
+	/* Update the AGF counters from the rmapbt. */
+	cur = xfs_rmapbt_init_cursor(mp, sc->tp, agf_bp, sc->sa.agno);
+	error = xfs_btree_count_blocks(cur, &blocks);
+	if (error)
+		goto err;
+	xfs_btree_del_cursor(cur, error);
+	agf->agf_rmap_blocks = cpu_to_be32(blocks);
+	btreeblks += blocks - 1;
+
+	agf->agf_btreeblks = cpu_to_be32(btreeblks);
+
+	/* Update the AGF counters from the refcountbt. */
+	if (xfs_sb_version_hasreflink(&mp->m_sb)) {
+		cur = xfs_refcountbt_init_cursor(mp, sc->tp, agf_bp,
+				sc->sa.agno);
+		error = xfs_btree_count_blocks(cur, &blocks);
+		if (error)
+			goto err;
+		xfs_btree_del_cursor(cur, error);
+		agf->agf_refcount_blocks = cpu_to_be32(blocks);
+	}
+
+	return 0;
+err:
+	xfs_btree_del_cursor(cur, error);
+	return error;
+}
+
+/* Commit the new AGF and reinitialize the incore state. */
+STATIC int
+xrep_agf_commit_new(
+	struct xfs_scrub	*sc,
+	struct xfs_buf		*agf_bp)
+{
+	struct xfs_perag	*pag;
+	struct xfs_agf		*agf = XFS_BUF_TO_AGF(agf_bp);
+
+	/* Trigger fdblocks recalculation */
+	xfs_force_summary_recalc(sc->mp);
+
+	/* Write this to disk. */
+	xfs_trans_buf_set_type(sc->tp, agf_bp, XFS_BLFT_AGF_BUF);
+	xfs_trans_log_buf(sc->tp, agf_bp, 0, BBTOB(agf_bp->b_length) - 1);
+
+	/* Now reinitialize the in-core counters we changed. */
+	pag = sc->sa.pag;
+	pag->pagf_btreeblks = be32_to_cpu(agf->agf_btreeblks);
+	pag->pagf_freeblks = be32_to_cpu(agf->agf_freeblks);
+	pag->pagf_longest = be32_to_cpu(agf->agf_longest);
+	pag->pagf_levels[XFS_BTNUM_BNOi] =
+			be32_to_cpu(agf->agf_levels[XFS_BTNUM_BNOi]);
+	pag->pagf_levels[XFS_BTNUM_CNTi] =
+			be32_to_cpu(agf->agf_levels[XFS_BTNUM_CNTi]);
+	pag->pagf_levels[XFS_BTNUM_RMAPi] =
+			be32_to_cpu(agf->agf_levels[XFS_BTNUM_RMAPi]);
+	pag->pagf_refcount_level = be32_to_cpu(agf->agf_refcount_level);
+	pag->pagf_init = 1;
+
+	return 0;
+}
+
+/* Repair the AGF. v5 filesystems only. */
+int
+xrep_agf(
+	struct xfs_scrub		*sc)
+{
+	struct xrep_find_ag_btree	fab[XREP_AGF_MAX] = {
+		[XREP_AGF_BNOBT] = {
+			.rmap_owner = XFS_RMAP_OWN_AG,
+			.buf_ops = &xfs_allocbt_buf_ops,
+			.magic = XFS_ABTB_CRC_MAGIC,
+		},
+		[XREP_AGF_CNTBT] = {
+			.rmap_owner = XFS_RMAP_OWN_AG,
+			.buf_ops = &xfs_allocbt_buf_ops,
+			.magic = XFS_ABTC_CRC_MAGIC,
+		},
+		[XREP_AGF_RMAPBT] = {
+			.rmap_owner = XFS_RMAP_OWN_AG,
+			.buf_ops = &xfs_rmapbt_buf_ops,
+			.magic = XFS_RMAP_CRC_MAGIC,
+		},
+		[XREP_AGF_REFCOUNTBT] = {
+			.rmap_owner = XFS_RMAP_OWN_REFC,
+			.buf_ops = &xfs_refcountbt_buf_ops,
+			.magic = XFS_REFC_CRC_MAGIC,
+		},
+		[XREP_AGF_END] = {
+			.buf_ops = NULL,
+		},
+	};
+	struct xfs_agf			old_agf;
+	struct xfs_mount		*mp = sc->mp;
+	struct xfs_buf			*agf_bp;
+	struct xfs_buf			*agfl_bp;
+	struct xfs_agf			*agf;
+	int				error;
+
+	/* We require the rmapbt to rebuild anything. */
+	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
+		return -EOPNOTSUPP;
+
+	xchk_perag_get(sc->mp, &sc->sa);
+	/*
+	 * Make sure we have the AGF buffer, as scrub might have decided it
+	 * was corrupt after xfs_alloc_read_agf failed with -EFSCORRUPTED.
+	 */
+	error = xfs_trans_read_buf(mp, sc->tp, mp->m_ddev_targp,
+			XFS_AG_DADDR(mp, sc->sa.agno, XFS_AGF_DADDR(mp)),
+			XFS_FSS_TO_BB(mp, 1), 0, &agf_bp, NULL);
+	if (error)
+		return error;
+	agf_bp->b_ops = &xfs_agf_buf_ops;
+	agf = XFS_BUF_TO_AGF(agf_bp);
+
+	/*
+	 * Load the AGFL so that we can screen out OWN_AG blocks that are on
+	 * the AGFL now; these blocks might have once been part of the
+	 * bno/cnt/rmap btrees but are not now.  This is a chicken and egg
+	 * problem: the AGF is corrupt, so we have to trust the AGFL contents
+	 * because we can't do any serious cross-referencing with any of the
+	 * btrees rooted in the AGF.  If the AGFL contents are obviously bad
+	 * then we'll bail out.
+	 */
+	error = xfs_alloc_read_agfl(mp, sc->tp, sc->sa.agno, &agfl_bp);
+	if (error)
+		return error;
+
+	/*
+	 * Spot-check the AGFL blocks; if they're obviously corrupt then
+	 * there's nothing we can do but bail out.
+	 */
+	error = xfs_agfl_walk(sc->mp, XFS_BUF_TO_AGF(agf_bp), agfl_bp,
+			xrep_agf_check_agfl_block, sc);
+	if (error)
+		return error;
+
+	/*
+	 * Find the AGF btree roots.  This is also a chicken-and-egg situation;
+	 * see the function for more details.
+	 */
+	error = xrep_agf_find_btrees(sc, agf_bp, fab, agfl_bp);
+	if (error)
+		return error;
+
+	/* Start rewriting the header and implant the btrees we found. */
+	xrep_agf_init_header(sc, agf_bp, &old_agf);
+	xrep_agf_set_roots(sc, agf, fab);
+	error = xrep_agf_calc_from_btrees(sc, agf_bp);
+	if (error)
+		goto out_revert;
+
+	/* Commit the changes and reinitialize incore state. */
+	return xrep_agf_commit_new(sc, agf_bp);
+
+out_revert:
+	/* Mark the incore AGF state stale and revert the AGF. */
+	sc->sa.pag->pagf_init = 0;
+	memcpy(agf, &old_agf, sizeof(old_agf));
+	return error;
+}
diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
index 85b048b341a0..17cf48564390 100644
--- a/fs/xfs/scrub/repair.c
+++ b/fs/xfs/scrub/repair.c
@@ -128,9 +128,12 @@ xrep_roll_ag_trans(
 	int			error;
 
 	/* Keep the AG header buffers locked so we can keep going. */
-	xfs_trans_bhold(sc->tp, sc->sa.agi_bp);
-	xfs_trans_bhold(sc->tp, sc->sa.agf_bp);
-	xfs_trans_bhold(sc->tp, sc->sa.agfl_bp);
+	if (sc->sa.agi_bp)
+		xfs_trans_bhold(sc->tp, sc->sa.agi_bp);
+	if (sc->sa.agf_bp)
+		xfs_trans_bhold(sc->tp, sc->sa.agf_bp);
+	if (sc->sa.agfl_bp)
+		xfs_trans_bhold(sc->tp, sc->sa.agfl_bp);
 
 	/* Roll the transaction. */
 	error = xfs_trans_roll(&sc->tp);
@@ -138,9 +141,12 @@ xrep_roll_ag_trans(
 		goto out_release;
 
 	/* Join AG headers to the new transaction. */
-	xfs_trans_bjoin(sc->tp, sc->sa.agi_bp);
-	xfs_trans_bjoin(sc->tp, sc->sa.agf_bp);
-	xfs_trans_bjoin(sc->tp, sc->sa.agfl_bp);
+	if (sc->sa.agi_bp)
+		xfs_trans_bjoin(sc->tp, sc->sa.agi_bp);
+	if (sc->sa.agf_bp)
+		xfs_trans_bjoin(sc->tp, sc->sa.agf_bp);
+	if (sc->sa.agfl_bp)
+		xfs_trans_bjoin(sc->tp, sc->sa.agfl_bp);
 
 	return 0;
 
@@ -150,9 +156,12 @@ xrep_roll_ag_trans(
 	 * buffers will be released during teardown on our way out
 	 * of the kernel.
 	 */
-	xfs_trans_bhold_release(sc->tp, sc->sa.agi_bp);
-	xfs_trans_bhold_release(sc->tp, sc->sa.agf_bp);
-	xfs_trans_bhold_release(sc->tp, sc->sa.agfl_bp);
+	if (sc->sa.agi_bp)
+		xfs_trans_bhold_release(sc->tp, sc->sa.agi_bp);
+	if (sc->sa.agf_bp)
+		xfs_trans_bhold_release(sc->tp, sc->sa.agf_bp);
+	if (sc->sa.agfl_bp)
+		xfs_trans_bhold_release(sc->tp, sc->sa.agfl_bp);
 
 	return error;
 }
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index 5a4e92221916..1d283360b5ab 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -58,6 +58,8 @@ int xrep_ino_dqattach(struct xfs_scrub *sc);
 
 int xrep_probe(struct xfs_scrub *sc);
 int xrep_superblock(struct xfs_scrub *sc);
+int xrep_agf(struct xfs_scrub *sc);
+int xrep_agfl(struct xfs_scrub *sc);
 
 #else
 
@@ -81,6 +83,8 @@ xrep_calc_ag_resblks(
 
 #define xrep_probe			xrep_notsupported
 #define xrep_superblock			xrep_notsupported
+#define xrep_agf			xrep_notsupported
+#define xrep_agfl			xrep_notsupported
 
 #endif /* CONFIG_XFS_ONLINE_REPAIR */
 
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 6efb926f3cf8..1e8a17c8e2b9 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -214,7 +214,7 @@ static const struct xchk_meta_ops meta_scrub_ops[] = {
 		.type	= ST_PERAG,
 		.setup	= xchk_setup_fs,
 		.scrub	= xchk_agf,
-		.repair	= xrep_notsupported,
+		.repair	= xrep_agf,
 	},
 	[XFS_SCRUB_TYPE_AGFL]= {	/* agfl */
 		.type	= ST_PERAG,



* [PATCH 03/14] xfs: repair the AGFL
  2018-07-30  5:47 [PATCH v17.1 00/14] xfs-4.19: online repair support Darrick J. Wong
  2018-07-30  5:47 ` [PATCH 01/14] xfs: refactor the xrep_extent_list into xfs_bitmap Darrick J. Wong
  2018-07-30  5:48 ` [PATCH 02/14] xfs: repair the AGF Darrick J. Wong
@ 2018-07-30  5:48 ` Darrick J. Wong
  2018-07-30 16:25   ` Brian Foster
  2018-07-30  5:48 ` [PATCH 04/14] xfs: repair the AGI Darrick J. Wong
                   ` (10 subsequent siblings)
  13 siblings, 1 reply; 53+ messages in thread
From: Darrick J. Wong @ 2018-07-30  5:48 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs, bfoster, david, allison.henderson

From: Darrick J. Wong <darrick.wong@oracle.com>

Repair the AGFL from the rmap data.
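
As a rough illustration of the idea (not the code in this patch): every
OWN_AG block is collected from the rmapbt, the blocks still in use by the
bnobt, cntbt, and rmapbt are subtracted, and whatever remains is assumed
to be the old AGFL.  The sketch below models that with sorted arrays of
block numbers instead of the extent bitmap; sketch_subtract(),
sketch_flcount(), and SKETCH_AGFL_SIZE are hypothetical names used only
for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical AGFL capacity; the patch asks xfs_agfl_size() instead. */
#define SKETCH_AGFL_SIZE	4

/*
 * Copy into out[] every block of own_ag[] that does not appear in btree[].
 * Both inputs must be sorted ascending with no duplicates, and out[] must
 * have room for n_own entries.  Returns the number of blocks written.
 */
static size_t
sketch_subtract(
	const uint32_t	*own_ag,
	size_t		n_own,
	const uint32_t	*btree,
	size_t		n_bt,
	uint32_t	*out)
{
	size_t		i;
	size_t		j = 0;
	size_t		n_out = 0;

	for (i = 0; i < n_own; i++) {
		/* Enforce the sortedness assumption stated above. */
		assert(i == 0 || own_ag[i - 1] < own_ag[i]);

		/* Skip btree blocks that sort before the current block. */
		while (j < n_bt && btree[j] < own_ag[i])
			j++;

		/* Still owned by a btree, so not an AGFL candidate. */
		if (j < n_bt && btree[j] == own_ag[i])
			continue;

		out[n_out++] = own_ag[i];
	}
	return n_out;
}

/* Clamp the candidate count to the AGFL capacity. */
static size_t
sketch_flcount(
	size_t		candidates)
{
	return candidates > SKETCH_AGFL_SIZE ? SKETCH_AGFL_SIZE : candidates;
}
```

With OWN_AG blocks {10, 11, 12, 13, 20, 21, 22} and btree blocks
{10, 12, 20}, the candidates are {11, 13, 21, 22}.  Candidates beyond
the clamp are the overflow that the patch frees after rolling the
transaction.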

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/scrub/agheader_repair.c |  276 ++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/bitmap.c          |   92 +++++++++++++
 fs/xfs/scrub/bitmap.h          |    4 +
 fs/xfs/scrub/scrub.c           |    2 
 4 files changed, 373 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/scrub/agheader_repair.c b/fs/xfs/scrub/agheader_repair.c
index 4842fc598c9b..bfef066c87c3 100644
--- a/fs/xfs/scrub/agheader_repair.c
+++ b/fs/xfs/scrub/agheader_repair.c
@@ -424,3 +424,279 @@ xrep_agf(
 	memcpy(agf, &old_agf, sizeof(old_agf));
 	return error;
 }
+
+/* AGFL */
+
+struct xrep_agfl {
+	/* Bitmap of other OWN_AG metadata blocks. */
+	struct xfs_bitmap	agmetablocks;
+
+	/* Bitmap of free space. */
+	struct xfs_bitmap	*freesp;
+
+	struct xfs_scrub	*sc;
+};
+
+/* Record all OWN_AG (free space btree) information from the rmap data. */
+STATIC int
+xrep_agfl_walk_rmap(
+	struct xfs_btree_cur	*cur,
+	struct xfs_rmap_irec	*rec,
+	void			*priv)
+{
+	struct xrep_agfl	*ra = priv;
+	xfs_fsblock_t		fsb;
+	int			error = 0;
+
+	if (xchk_should_terminate(ra->sc, &error))
+		return error;
+
+	/* Record all the OWN_AG blocks. */
+	if (rec->rm_owner == XFS_RMAP_OWN_AG) {
+		fsb = XFS_AGB_TO_FSB(cur->bc_mp, cur->bc_private.a.agno,
+				rec->rm_startblock);
+		error = xfs_bitmap_set(ra->freesp, fsb, rec->rm_blockcount);
+		if (error)
+			return error;
+	}
+
+	return xfs_bitmap_set_btcur_path(&ra->agmetablocks, cur);
+}
+
+/*
+ * Map out all the non-AGFL OWN_AG space in this AG so that we can deduce
+ * which blocks belong to the AGFL.
+ *
+ * Compute the set of old AGFL blocks by subtracting from the list of OWN_AG
+ * blocks the list of blocks owned by all other OWN_AG metadata (bnobt, cntbt,
+ * rmapbt).  These are the old AGFL blocks, so return that list and the number
+ * of blocks we're actually going to put back on the AGFL.
+ */
+STATIC int
+xrep_agfl_collect_blocks(
+	struct xfs_scrub	*sc,
+	struct xfs_buf		*agf_bp,
+	struct xfs_bitmap	*agfl_extents,
+	xfs_agblock_t		*flcount)
+{
+	struct xrep_agfl	ra;
+	struct xfs_mount	*mp = sc->mp;
+	struct xfs_btree_cur	*cur;
+	struct xfs_bitmap_range	*br;
+	struct xfs_bitmap_range	*n;
+	int			error;
+
+	ra.sc = sc;
+	ra.freesp = agfl_extents;
+	xfs_bitmap_init(&ra.agmetablocks);
+
+	/* Find all space used by the free space btrees & rmapbt. */
+	cur = xfs_rmapbt_init_cursor(mp, sc->tp, agf_bp, sc->sa.agno);
+	error = xfs_rmap_query_all(cur, xrep_agfl_walk_rmap, &ra);
+	if (error)
+		goto err;
+	xfs_btree_del_cursor(cur, error);
+
+	/* Find all blocks currently being used by the bnobt. */
+	cur = xfs_allocbt_init_cursor(mp, sc->tp, agf_bp, sc->sa.agno,
+			XFS_BTNUM_BNO);
+	error = xfs_bitmap_set_btblocks(&ra.agmetablocks, cur);
+	if (error)
+		goto err;
+	xfs_btree_del_cursor(cur, error);
+
+	/* Find all blocks currently being used by the cntbt. */
+	cur = xfs_allocbt_init_cursor(mp, sc->tp, agf_bp, sc->sa.agno,
+			XFS_BTNUM_CNT);
+	error = xfs_bitmap_set_btblocks(&ra.agmetablocks, cur);
+	if (error)
+		goto err;
+
+	xfs_btree_del_cursor(cur, error);
+
+	/*
+	 * Drop the freesp meta blocks that are in use by btrees.
+	 * The remaining blocks /should/ be AGFL blocks.
+	 */
+	error = xfs_bitmap_disunion(agfl_extents, &ra.agmetablocks);
+	xfs_bitmap_destroy(&ra.agmetablocks);
+	if (error)
+		return error;
+
+	/*
+	 * Calculate the new AGFL size.  If we found more blocks than fit in
+	 * the AGFL we'll free them later.
+	 */
+	*flcount = 0;
+	for_each_xfs_bitmap_extent(br, n, agfl_extents) {
+		*flcount += br->len;
+		if (*flcount > xfs_agfl_size(mp))
+			break;
+	}
+	if (*flcount > xfs_agfl_size(mp))
+		*flcount = xfs_agfl_size(mp);
+	return 0;
+
+err:
+	xfs_bitmap_destroy(&ra.agmetablocks);
+	xfs_btree_del_cursor(cur, error);
+	return error;
+}
+
+/* Update the AGF and reset the in-core state. */
+STATIC int
+xrep_agfl_update_agf(
+	struct xfs_scrub	*sc,
+	struct xfs_buf		*agf_bp,
+	xfs_agblock_t		flcount)
+{
+	struct xfs_agf		*agf = XFS_BUF_TO_AGF(agf_bp);
+
+	ASSERT(flcount <= xfs_agfl_size(sc->mp));
+
+	/* Trigger fdblocks recalculation */
+	xfs_force_summary_recalc(sc->mp);
+
+	/* Update the AGF counters. */
+	if (sc->sa.pag->pagf_init)
+		sc->sa.pag->pagf_flcount = flcount;
+	agf->agf_flfirst = cpu_to_be32(0);
+	agf->agf_flcount = cpu_to_be32(flcount);
+	agf->agf_fllast = cpu_to_be32(flcount - 1);
+
+	xfs_alloc_log_agf(sc->tp, agf_bp,
+			XFS_AGF_FLFIRST | XFS_AGF_FLLAST | XFS_AGF_FLCOUNT);
+	return 0;
+}
+
+/* Write out a totally new AGFL. */
+STATIC void
+xrep_agfl_init_header(
+	struct xfs_scrub	*sc,
+	struct xfs_buf		*agfl_bp,
+	struct xfs_bitmap	*agfl_extents,
+	xfs_agblock_t		flcount)
+{
+	struct xfs_mount	*mp = sc->mp;
+	__be32			*agfl_bno;
+	struct xfs_bitmap_range	*br;
+	struct xfs_bitmap_range	*n;
+	struct xfs_agfl		*agfl;
+	xfs_agblock_t		agbno;
+	unsigned int		fl_off;
+
+	ASSERT(flcount <= xfs_agfl_size(mp));
+
+	/* Start rewriting the header. */
+	agfl = XFS_BUF_TO_AGFL(agfl_bp);
+	memset(agfl, 0xFF, BBTOB(agfl_bp->b_length));
+	agfl->agfl_magicnum = cpu_to_be32(XFS_AGFL_MAGIC);
+	agfl->agfl_seqno = cpu_to_be32(sc->sa.agno);
+	uuid_copy(&agfl->agfl_uuid, &mp->m_sb.sb_meta_uuid);
+
+	/*
+	 * Fill the AGFL with the remaining blocks.  If agfl_extents has more
+	 * blocks than fit in the AGFL, they will be freed in a subsequent
+	 * step.
+	 */
+	fl_off = 0;
+	agfl_bno = XFS_BUF_TO_AGFL_BNO(mp, agfl_bp);
+	for_each_xfs_bitmap_extent(br, n, agfl_extents) {
+		agbno = XFS_FSB_TO_AGBNO(mp, br->start);
+
+		trace_xrep_agfl_insert(mp, sc->sa.agno, agbno, br->len);
+
+		while (br->len > 0 && fl_off < flcount) {
+			agfl_bno[fl_off] = cpu_to_be32(agbno);
+			fl_off++;
+			agbno++;
+			br->start++;
+			br->len--;
+		}
+
+		if (br->len)
+			break;
+		list_del(&br->list);
+		kmem_free(br);
+	}
+
+	/* Write new AGFL to disk. */
+	xfs_trans_buf_set_type(sc->tp, agfl_bp, XFS_BLFT_AGFL_BUF);
+	xfs_trans_log_buf(sc->tp, agfl_bp, 0, BBTOB(agfl_bp->b_length) - 1);
+}
+
+/* Repair the AGFL. */
+int
+xrep_agfl(
+	struct xfs_scrub	*sc)
+{
+	struct xfs_owner_info	oinfo;
+	struct xfs_bitmap	agfl_extents;
+	struct xfs_mount	*mp = sc->mp;
+	struct xfs_buf		*agf_bp;
+	struct xfs_buf		*agfl_bp;
+	xfs_agblock_t		flcount;
+	int			error;
+
+	/* We require the rmapbt to rebuild anything. */
+	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
+		return -EOPNOTSUPP;
+
+	xchk_perag_get(sc->mp, &sc->sa);
+	xfs_bitmap_init(&agfl_extents);
+
+	/*
+	 * Read the AGF so that we can query the rmapbt.  We hope that there's
+	 * nothing wrong with the AGF, but all the AG header repair functions
+	 * have this chicken-and-egg problem.
+	 */
+	error = xfs_alloc_read_agf(mp, sc->tp, sc->sa.agno, 0, &agf_bp);
+	if (error)
+		return error;
+	if (!agf_bp)
+		return -ENOMEM;
+
+	/*
+	 * Make sure we have the AGFL buffer, as scrub might have decided it
+	 * was corrupt after xfs_alloc_read_agfl failed with -EFSCORRUPTED.
+	 */
+	error = xfs_trans_read_buf(mp, sc->tp, mp->m_ddev_targp,
+			XFS_AG_DADDR(mp, sc->sa.agno, XFS_AGFL_DADDR(mp)),
+			XFS_FSS_TO_BB(mp, 1), 0, &agfl_bp, NULL);
+	if (error)
+		return error;
+	agfl_bp->b_ops = &xfs_agfl_buf_ops;
+
+	/* Gather all the extents we're going to put on the new AGFL. */
+	error = xrep_agfl_collect_blocks(sc, agf_bp, &agfl_extents, &flcount);
+	if (error)
+		goto err;
+
+	/*
+	 * Update AGF and AGFL.  We reset the global free block counter when
+	 * we adjust the AGF flcount (which can fail) so avoid updating any
+	 * buffers until we know that part works.
+	 */
+	error = xrep_agfl_update_agf(sc, agf_bp, flcount);
+	if (error)
+		goto err;
+	xrep_agfl_init_header(sc, agfl_bp, &agfl_extents, flcount);
+
+	/*
+	 * Ok, the AGFL should be ready to go now.  Roll the transaction to
+	 * make the new AGFL permanent before we start using it to return
+	 * freespace overflow to the freespace btrees.
+	 */
+	sc->sa.agf_bp = agf_bp;
+	sc->sa.agfl_bp = agfl_bp;
+	error = xrep_roll_ag_trans(sc);
+	if (error)
+		goto err;
+
+	/* Dump any AGFL overflow. */
+	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_AG);
+	return xrep_reap_extents(sc, &agfl_extents, &oinfo, XFS_AG_RESV_AGFL);
+err:
+	xfs_bitmap_destroy(&agfl_extents);
+	return error;
+}
diff --git a/fs/xfs/scrub/bitmap.c b/fs/xfs/scrub/bitmap.c
index c770e2d0b6aa..fdadc9e1dc49 100644
--- a/fs/xfs/scrub/bitmap.c
+++ b/fs/xfs/scrub/bitmap.c
@@ -9,6 +9,7 @@
 #include "xfs_format.h"
 #include "xfs_trans_resv.h"
 #include "xfs_mount.h"
+#include "xfs_btree.h"
 #include "scrub/xfs_scrub.h"
 #include "scrub/scrub.h"
 #include "scrub/common.h"
@@ -209,3 +210,94 @@ xfs_bitmap_disunion(
 }
 #undef LEFT_ALIGNED
 #undef RIGHT_ALIGNED
+
+/*
+ * Record all btree blocks seen while iterating all records of a btree.
+ *
+ * We know that the btree query_all function starts at the left edge and walks
+ * towards the right edge of the tree.  Therefore, we know that we can walk up
+ * the btree cursor towards the root; if the pointer for a given level points
+ * to the first record/key in that block, we haven't seen this block before;
+ * and therefore we need to remember that we saw this block in the btree.
+ *
+ * So if our btree is:
+ *
+ *    4
+ *  / | \
+ * 1  2  3
+ *
+ * Pretend for this example that each leaf block has 100 btree records.  For
+ * the first btree record, we'll observe that bc_ptrs[0] == 1, so we record
+ * that we saw block 1.  Then we observe that bc_ptrs[1] == 1, so we record
+ * block 4.  The list is [1, 4].
+ *
+ * For the second btree record, we see that bc_ptrs[0] == 2, so we exit the
+ * loop.  The list remains [1, 4].
+ *
+ * For the 101st btree record, we've moved onto leaf block 2.  Now
+ * bc_ptrs[0] == 1 again, so we record that we saw block 2.  We see that
+ * bc_ptrs[1] == 2, so we exit the loop.  The list is now [1, 4, 2].
+ *
+ * For the 102nd record, bc_ptrs[0] == 2, so we continue.
+ *
+ * For the 201st record, we've moved on to leaf block 3.  bc_ptrs[0] == 1, so
+ * we add 3 to the list.  Now it is [1, 4, 2, 3].
+ *
+ * For the 300th record we just exit, with the list being [1, 4, 2, 3].
+ */
+
+/*
+ * Record all the buffers pointed to by the btree cursor.  Callers already
+ * engaged in a btree walk should call this function to capture the list of
+ * blocks going from the leaf towards the root.
+ */
+int
+xfs_bitmap_set_btcur_path(
+	struct xfs_bitmap	*bitmap,
+	struct xfs_btree_cur	*cur)
+{
+	struct xfs_buf		*bp;
+	xfs_fsblock_t		fsb;
+	int			i;
+	int			error;
+
+	for (i = 0; i < cur->bc_nlevels && cur->bc_ptrs[i] == 1; i++) {
+		xfs_btree_get_block(cur, i, &bp);
+		if (!bp)
+			continue;
+		fsb = XFS_DADDR_TO_FSB(cur->bc_mp, bp->b_bn);
+		error = xfs_bitmap_set(bitmap, fsb, 1);
+		if (error)
+			return error;
+	}
+
+	return 0;
+}
+
+/* Collect a btree's block in the bitmap. */
+STATIC int
+xfs_bitmap_collect_btblock(
+	struct xfs_btree_cur	*cur,
+	int			level,
+	void			*priv)
+{
+	struct xfs_bitmap	*bitmap = priv;
+	struct xfs_buf		*bp;
+	xfs_fsblock_t		fsbno;
+
+	xfs_btree_get_block(cur, level, &bp);
+	if (!bp)
+		return 0;
+
+	fsbno = XFS_DADDR_TO_FSB(cur->bc_mp, bp->b_bn);
+	return xfs_bitmap_set(bitmap, fsbno, 1);
+}
+
+/* Walk the btree and mark the bitmap wherever a btree block is found. */
+int
+xfs_bitmap_set_btblocks(
+	struct xfs_bitmap	*bitmap,
+	struct xfs_btree_cur	*cur)
+{
+	return xfs_btree_visit_blocks(cur, xfs_bitmap_collect_btblock, bitmap);
+}
diff --git a/fs/xfs/scrub/bitmap.h b/fs/xfs/scrub/bitmap.h
index dad652ee9177..ae8ecbce6fa6 100644
--- a/fs/xfs/scrub/bitmap.h
+++ b/fs/xfs/scrub/bitmap.h
@@ -28,5 +28,9 @@ void xfs_bitmap_destroy(struct xfs_bitmap *bitmap);
 
 int xfs_bitmap_set(struct xfs_bitmap *bitmap, uint64_t start, uint64_t len);
 int xfs_bitmap_disunion(struct xfs_bitmap *bitmap, struct xfs_bitmap *sub);
+int xfs_bitmap_set_btcur_path(struct xfs_bitmap *bitmap,
+		struct xfs_btree_cur *cur);
+int xfs_bitmap_set_btblocks(struct xfs_bitmap *bitmap,
+		struct xfs_btree_cur *cur);
 
 #endif	/* __XFS_SCRUB_BITMAP_H__ */
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 1e8a17c8e2b9..2670f4cf62f4 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -220,7 +220,7 @@ static const struct xchk_meta_ops meta_scrub_ops[] = {
 		.type	= ST_PERAG,
 		.setup	= xchk_setup_fs,
 		.scrub	= xchk_agfl,
-		.repair	= xrep_notsupported,
+		.repair	= xrep_agfl,
 	},
 	[XFS_SCRUB_TYPE_AGI] = {	/* agi */
 		.type	= ST_PERAG,



* [PATCH 04/14] xfs: repair the AGI
  2018-07-30  5:47 [PATCH v17.1 00/14] xfs-4.19: online repair support Darrick J. Wong
                   ` (2 preceding siblings ...)
  2018-07-30  5:48 ` [PATCH 03/14] xfs: repair the AGFL Darrick J. Wong
@ 2018-07-30  5:48 ` Darrick J. Wong
  2018-07-30 18:20   ` Brian Foster
  2018-07-30  5:48 ` [PATCH 05/14] xfs: repair free space btrees Darrick J. Wong
                   ` (9 subsequent siblings)
  13 siblings, 1 reply; 53+ messages in thread
From: Darrick J. Wong @ 2018-07-30  5:48 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs, bfoster, david, allison.henderson

From: Darrick J. Wong <darrick.wong@oracle.com>

Rebuild the AGI header items with some help from the rmapbt.
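
The repair follows a snapshot, reinitialize, and revert-on-failure
pattern.  A minimal sketch of that control flow, assuming a made-up
four-field header (struct sketch_hdr, SKETCH_MAGIC, and
sketch_rebuild() are hypothetical and stand in for the real AGI
structures and helpers):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical header; the real AGI has many more fields. */
struct sketch_hdr {
	uint32_t	magic;
	uint32_t	root;
	uint32_t	level;
	uint32_t	count;
};

/* Made-up magic number for this sketch only. */
#define SKETCH_MAGIC	0x534b4348U

/*
 * Rebuild *hdr from a root and level found elsewhere.  count_error stands
 * in for the btree counting step, which can fail; on failure the old
 * header contents are restored and the error is returned.
 */
static int
sketch_rebuild(
	struct sketch_hdr	*hdr,
	uint32_t		root,
	uint32_t		level,
	int			count_error,
	uint32_t		count)
{
	struct sketch_hdr	old;

	assert(hdr != NULL);

	/* Snapshot the old contents so a failed rebuild can be undone. */
	memcpy(&old, hdr, sizeof(old));

	/* Start from a clean header and implant the discovered root. */
	memset(hdr, 0, sizeof(*hdr));
	hdr->magic = SKETCH_MAGIC;
	hdr->root = root;
	hdr->level = level;

	/* The counting step failed: put the old header back. */
	if (count_error) {
		memcpy(hdr, &old, sizeof(old));
		return count_error;
	}

	hdr->count = count;
	return 0;
}
```

The point of keeping the copy is that nothing is logged until every
step has succeeded, so a failure leaves the buffer exactly as it was
found.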

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/scrub/agheader_repair.c |  220 ++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/repair.h          |    2 
 fs/xfs/scrub/scrub.c           |    2 
 3 files changed, 223 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/scrub/agheader_repair.c b/fs/xfs/scrub/agheader_repair.c
index bfef066c87c3..921e7d42a2ef 100644
--- a/fs/xfs/scrub/agheader_repair.c
+++ b/fs/xfs/scrub/agheader_repair.c
@@ -700,3 +700,223 @@ xrep_agfl(
 	xfs_bitmap_destroy(&agfl_extents);
 	return error;
 }
+
+/* AGI */
+
+/*
+ * Offset within the xrep_find_ag_btree array for each btree type.  Avoid the
+ * XFS_BTNUM_ names here to avoid creating a sparse array.
+ */
+enum {
+	XREP_AGI_INOBT = 0,
+	XREP_AGI_FINOBT,
+	XREP_AGI_END,
+	XREP_AGI_MAX
+};
+
+/*
+ * Given the inode btree roots described by *fab, find the roots, check them
+ * for sanity, and pass the root data back out via *fab.
+ */
+STATIC int
+xrep_agi_find_btrees(
+	struct xfs_scrub		*sc,
+	struct xrep_find_ag_btree	*fab)
+{
+	struct xfs_buf			*agf_bp;
+	struct xfs_mount		*mp = sc->mp;
+	int				error;
+
+	/* Read the AGF. */
+	error = xfs_alloc_read_agf(mp, sc->tp, sc->sa.agno, 0, &agf_bp);
+	if (error)
+		return error;
+	if (!agf_bp)
+		return -ENOMEM;
+
+	/* Find the btree roots. */
+	error = xrep_find_ag_btree_roots(sc, agf_bp, fab, NULL);
+	if (error)
+		return error;
+
+	/* We must find the inobt root. */
+	if (fab[XREP_AGI_INOBT].root == NULLAGBLOCK ||
+	    fab[XREP_AGI_INOBT].height > XFS_BTREE_MAXLEVELS)
+		return -EFSCORRUPTED;
+
+	/* We must find the finobt root if that feature is enabled. */
+	if (xfs_sb_version_hasfinobt(&mp->m_sb) &&
+	    (fab[XREP_AGI_FINOBT].root == NULLAGBLOCK ||
+	     fab[XREP_AGI_FINOBT].height > XFS_BTREE_MAXLEVELS))
+		return -EFSCORRUPTED;
+
+	return 0;
+}
+
+/*
+ * Reinitialize the AGI header, making an in-core copy of the old contents so
+ * that we know which in-core state needs to be reinitialized.
+ */
+STATIC void
+xrep_agi_init_header(
+	struct xfs_scrub	*sc,
+	struct xfs_buf		*agi_bp,
+	struct xfs_agi		*old_agi)
+{
+	struct xfs_agi		*agi = XFS_BUF_TO_AGI(agi_bp);
+	struct xfs_mount	*mp = sc->mp;
+
+	memcpy(old_agi, agi, sizeof(*old_agi));
+	memset(agi, 0, BBTOB(agi_bp->b_length));
+	agi->agi_magicnum = cpu_to_be32(XFS_AGI_MAGIC);
+	agi->agi_versionnum = cpu_to_be32(XFS_AGI_VERSION);
+	agi->agi_seqno = cpu_to_be32(sc->sa.agno);
+	agi->agi_length = cpu_to_be32(xfs_ag_block_count(mp, sc->sa.agno));
+	agi->agi_newino = cpu_to_be32(NULLAGINO);
+	agi->agi_dirino = cpu_to_be32(NULLAGINO);
+	if (xfs_sb_version_hascrc(&mp->m_sb))
+		uuid_copy(&agi->agi_uuid, &mp->m_sb.sb_meta_uuid);
+
+	/* We don't know how to fix the unlinked list yet. */
+	memcpy(&agi->agi_unlinked, &old_agi->agi_unlinked,
+			sizeof(agi->agi_unlinked));
+
+	/* Mark the incore AGI data stale until we're done fixing things. */
+	ASSERT(sc->sa.pag->pagi_init);
+	sc->sa.pag->pagi_init = 0;
+}
+
+/* Set btree root information in an AGI. */
+STATIC void
+xrep_agi_set_roots(
+	struct xfs_scrub		*sc,
+	struct xfs_agi			*agi,
+	struct xrep_find_ag_btree	*fab)
+{
+	agi->agi_root = cpu_to_be32(fab[XREP_AGI_INOBT].root);
+	agi->agi_level = cpu_to_be32(fab[XREP_AGI_INOBT].height);
+
+	if (xfs_sb_version_hasfinobt(&sc->mp->m_sb)) {
+		agi->agi_free_root = cpu_to_be32(fab[XREP_AGI_FINOBT].root);
+		agi->agi_free_level = cpu_to_be32(fab[XREP_AGI_FINOBT].height);
+	}
+}
+
+/* Update the AGI counters. */
+STATIC int
+xrep_agi_calc_from_btrees(
+	struct xfs_scrub	*sc,
+	struct xfs_buf		*agi_bp)
+{
+	struct xfs_btree_cur	*cur;
+	struct xfs_agi		*agi = XFS_BUF_TO_AGI(agi_bp);
+	struct xfs_mount	*mp = sc->mp;
+	xfs_agino_t		count;
+	xfs_agino_t		freecount;
+	int			error;
+
+	cur = xfs_inobt_init_cursor(mp, sc->tp, agi_bp, sc->sa.agno,
+			XFS_BTNUM_INO);
+	error = xfs_ialloc_count_inodes(cur, &count, &freecount);
+	if (error)
+		goto err;
+	xfs_btree_del_cursor(cur, error);
+
+	agi->agi_count = cpu_to_be32(count);
+	agi->agi_freecount = cpu_to_be32(freecount);
+	return 0;
+err:
+	xfs_btree_del_cursor(cur, error);
+	return error;
+}
+
+/* Trigger reinitialization of the in-core data. */
+STATIC int
+xrep_agi_commit_new(
+	struct xfs_scrub	*sc,
+	struct xfs_buf		*agi_bp,
+	const struct xfs_agi	*old_agi)
+{
+	struct xfs_perag	*pag;
+	struct xfs_agi		*agi = XFS_BUF_TO_AGI(agi_bp);
+
+	/* Trigger inode count recalculation */
+	xfs_force_summary_recalc(sc->mp);
+
+	/* Write this to disk. */
+	xfs_trans_buf_set_type(sc->tp, agi_bp, XFS_BLFT_AGI_BUF);
+	xfs_trans_log_buf(sc->tp, agi_bp, 0, BBTOB(agi_bp->b_length) - 1);
+
+	/* Now reinitialize the in-core counters if necessary. */
+	pag = sc->sa.pag;
+	sc->sa.pag->pagi_init = 1;
+	pag->pagi_count = be32_to_cpu(agi->agi_count);
+	pag->pagi_freecount = be32_to_cpu(agi->agi_freecount);
+
+	return 0;
+}
+
+/* Repair the AGI. */
+int
+xrep_agi(
+	struct xfs_scrub		*sc)
+{
+	struct xrep_find_ag_btree	fab[XREP_AGI_MAX] = {
+		[XREP_AGI_INOBT] = {
+			.rmap_owner = XFS_RMAP_OWN_INOBT,
+			.buf_ops = &xfs_inobt_buf_ops,
+			.magic = XFS_IBT_CRC_MAGIC,
+		},
+		[XREP_AGI_FINOBT] = {
+			.rmap_owner = XFS_RMAP_OWN_INOBT,
+			.buf_ops = &xfs_inobt_buf_ops,
+			.magic = XFS_FIBT_CRC_MAGIC,
+		},
+		[XREP_AGI_END] = {
+			.buf_ops = NULL
+		},
+	};
+	struct xfs_agi			old_agi;
+	struct xfs_mount		*mp = sc->mp;
+	struct xfs_buf			*agi_bp;
+	struct xfs_agi			*agi;
+	int				error;
+
+	/* We require the rmapbt to rebuild anything. */
+	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
+		return -EOPNOTSUPP;
+
+	xchk_perag_get(sc->mp, &sc->sa);
+	/*
+	 * Make sure we have the AGI buffer, as scrub might have decided it
+	 * was corrupt after xfs_ialloc_read_agi failed with -EFSCORRUPTED.
+	 */
+	error = xfs_trans_read_buf(mp, sc->tp, mp->m_ddev_targp,
+			XFS_AG_DADDR(mp, sc->sa.agno, XFS_AGI_DADDR(mp)),
+			XFS_FSS_TO_BB(mp, 1), 0, &agi_bp, NULL);
+	if (error)
+		return error;
+	agi_bp->b_ops = &xfs_agi_buf_ops;
+	agi = XFS_BUF_TO_AGI(agi_bp);
+
+	/* Find the AGI btree roots. */
+	error = xrep_agi_find_btrees(sc, fab);
+	if (error)
+		return error;
+
+	/* Start rewriting the header and implant the btrees we found. */
+	xrep_agi_init_header(sc, agi_bp, &old_agi);
+	xrep_agi_set_roots(sc, agi, fab);
+	error = xrep_agi_calc_from_btrees(sc, agi_bp);
+	if (error)
+		goto out_revert;
+
+	/* Reinitialize in-core state. */
+	return xrep_agi_commit_new(sc, agi_bp, &old_agi);
+
+out_revert:
+	/* Mark the incore AGI state stale and revert the AGI. */
+	sc->sa.pag->pagi_init = 0;
+	memcpy(agi, &old_agi, sizeof(old_agi));
+	return error;
+}
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index 1d283360b5ab..9de321eee4ab 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -60,6 +60,7 @@ int xrep_probe(struct xfs_scrub *sc);
 int xrep_superblock(struct xfs_scrub *sc);
 int xrep_agf(struct xfs_scrub *sc);
 int xrep_agfl(struct xfs_scrub *sc);
+int xrep_agi(struct xfs_scrub *sc);
 
 #else
 
@@ -85,6 +86,7 @@ xrep_calc_ag_resblks(
 #define xrep_superblock			xrep_notsupported
 #define xrep_agf			xrep_notsupported
 #define xrep_agfl			xrep_notsupported
+#define xrep_agi			xrep_notsupported
 
 #endif /* CONFIG_XFS_ONLINE_REPAIR */
 
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 2670f4cf62f4..4bfae1e61d30 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -226,7 +226,7 @@ static const struct xchk_meta_ops meta_scrub_ops[] = {
 		.type	= ST_PERAG,
 		.setup	= xchk_setup_fs,
 		.scrub	= xchk_agi,
-		.repair	= xrep_notsupported,
+		.repair	= xrep_agi,
 	},
 	[XFS_SCRUB_TYPE_BNOBT] = {	/* bnobt */
 		.type	= ST_PERAG,



* [PATCH 05/14] xfs: repair free space btrees
  2018-07-30  5:47 [PATCH v17.1 00/14] xfs-4.19: online repair support Darrick J. Wong
                   ` (3 preceding siblings ...)
  2018-07-30  5:48 ` [PATCH 04/14] xfs: repair the AGI Darrick J. Wong
@ 2018-07-30  5:48 ` Darrick J. Wong
  2018-07-31 17:47   ` Brian Foster
  2018-07-30  5:48 ` [PATCH 06/14] xfs: repair inode btrees Darrick J. Wong
                   ` (8 subsequent siblings)
  13 siblings, 1 reply; 53+ messages in thread
From: Darrick J. Wong @ 2018-07-30  5:48 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs, bfoster, david, allison.henderson

From: Darrick J. Wong <darrick.wong@oracle.com>

Rebuild the free space btrees from the gaps in the rmap btree.
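
To illustrate the gap search (a simplified sketch, not the code in this
patch): rmap records arrive sorted by start block and may overlap on
reflink filesystems, so the walk tracks the highest block seen so far
and reports free space only when the next record starts beyond it.
struct sketch_rmap, struct sketch_free, and sketch_find_gaps() are
hypothetical names; free space after the last record is deliberately
left out of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced forms of an rmap record and a free extent. */
struct sketch_rmap {
	uint32_t	start;
	uint32_t	len;
};

struct sketch_free {
	uint32_t	bno;
	uint32_t	len;
};

/*
 * Walk records sorted by start block and write each gap between them to
 * out[], which must have room for nr entries (at most one gap can precede
 * each record).  Returns the number of gaps found.
 */
static size_t
sketch_find_gaps(
	const struct sketch_rmap	*recs,
	size_t				nr,
	struct sketch_free		*out)
{
	uint32_t			next_bno = 0;
	uint32_t			end;
	size_t				n = 0;
	size_t				i;

	for (i = 0; i < nr; i++) {
		/* Enforce the sortedness assumption stated above. */
		assert(i == 0 || recs[i - 1].start <= recs[i].start);

		/* The record starts past everything seen so far: a gap. */
		if (recs[i].start > next_bno) {
			out[n].bno = next_bno;
			out[n].len = recs[i].start - next_bno;
			n++;
		}

		/* Overlapping records must never move next_bno backwards. */
		end = recs[i].start + recs[i].len;
		if (end > next_bno)
			next_bno = end;
	}
	return n;
}
```

For records (0,4), (2,6), (10,2), (11,5), (20,1), given as (start, len),
the gaps are (8,2) and (16,4).  The overlapping pairs do not produce
spurious gaps because next_bno only ever advances.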

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/Makefile             |    1 
 fs/xfs/scrub/alloc.c        |    1 
 fs/xfs/scrub/alloc_repair.c |  581 +++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/common.c       |    8 +
 fs/xfs/scrub/repair.h       |    2 
 fs/xfs/scrub/scrub.c        |    4 
 fs/xfs/scrub/trace.h        |    2 
 fs/xfs/xfs_extent_busy.c    |   14 +
 fs/xfs/xfs_extent_busy.h    |    2 
 9 files changed, 610 insertions(+), 5 deletions(-)
 create mode 100644 fs/xfs/scrub/alloc_repair.c


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 57ec46951ede..44ddd112acd2 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -164,6 +164,7 @@ xfs-$(CONFIG_XFS_QUOTA)		+= scrub/quota.o
 ifeq ($(CONFIG_XFS_ONLINE_REPAIR),y)
 xfs-y				+= $(addprefix scrub/, \
 				   agheader_repair.o \
+				   alloc_repair.o \
 				   bitmap.o \
 				   repair.o \
 				   )
diff --git a/fs/xfs/scrub/alloc.c b/fs/xfs/scrub/alloc.c
index 036b5c7021eb..c9b34ba312ab 100644
--- a/fs/xfs/scrub/alloc.c
+++ b/fs/xfs/scrub/alloc.c
@@ -15,7 +15,6 @@
 #include "xfs_log_format.h"
 #include "xfs_trans.h"
 #include "xfs_sb.h"
-#include "xfs_alloc.h"
 #include "xfs_rmap.h"
 #include "xfs_alloc.h"
 #include "scrub/xfs_scrub.h"
diff --git a/fs/xfs/scrub/alloc_repair.c b/fs/xfs/scrub/alloc_repair.c
new file mode 100644
index 000000000000..b228c2906de2
--- /dev/null
+++ b/fs/xfs/scrub/alloc_repair.c
@@ -0,0 +1,581 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Copyright (C) 2018 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_btree.h"
+#include "xfs_bit.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_sb.h"
+#include "xfs_alloc.h"
+#include "xfs_alloc_btree.h"
+#include "xfs_rmap.h"
+#include "xfs_rmap_btree.h"
+#include "xfs_inode.h"
+#include "xfs_refcount.h"
+#include "xfs_extent_busy.h"
+#include "scrub/xfs_scrub.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/btree.h"
+#include "scrub/trace.h"
+#include "scrub/repair.h"
+#include "scrub/bitmap.h"
+
+/*
+ * Free Space Btree Repair
+ * =======================
+ *
+ * The reverse mappings are supposed to record all space usage for the entire
+ * AG.  Therefore, we can recalculate the free extents in an AG by looking for
+ * gaps in the physical extents recorded in the rmapbt.  On a reflink
+ * filesystem this is a little more tricky in that we have to be aware that
+ * the rmap records are allowed to overlap.
+ *
+ * We derive which blocks belonged to the old bnobt/cntbt by recording all the
+ * OWN_AG extents and subtracting out the blocks owned by all other OWN_AG
+ * metadata: the rmapbt blocks visited while iterating the reverse mappings
+ * and the AGFL blocks.
+ *
+ * Once we have both of those pieces, we can reconstruct the bnobt and cntbt
+ * by blowing out the free block state and freeing all the extents that we
+ * found.  This adds the requirement that we can't have any busy extents in
+ * the AG because the busy code cannot handle duplicate records.
+ *
+ * Note that we can only rebuild both free space btrees at the same time
+ * because the regular extent freeing infrastructure loads both btrees at the
+ * same time.
+ *
+ * We use the prefix 'xrep_abt' here because we regenerate both free space
+ * allocation btrees at the same time.
+ */
+
+struct xrep_abt_extent {
+	struct list_head	list;
+	xfs_agblock_t		bno;
+	xfs_extlen_t		len;
+};
+
+struct xrep_abt {
+	/* Blocks owned by the rmapbt or the agfl. */
+	struct xfs_bitmap	nobtlist;
+
+	/* All OWN_AG blocks. */
+	struct xfs_bitmap	*btlist;
+
+	/* Free space extents. */
+	struct list_head	*extlist;
+
+	struct xfs_scrub	*sc;
+
+	/* Length of extlist. */
+	uint64_t		nr_records;
+
+	/*
+	 * Next block we anticipate seeing in the rmap records.  If the next
+	 * rmap record is greater than next_bno, we have found unused space.
+	 */
+	xfs_agblock_t		next_bno;
+
+	/* Number of free blocks in this AG. */
+	xfs_agblock_t		nr_blocks;
+};
+
+/* Record extents that aren't in use from gaps in the rmap records. */
+STATIC int
+xrep_abt_walk_rmap(
+	struct xfs_btree_cur	*cur,
+	struct xfs_rmap_irec	*rec,
+	void			*priv)
+{
+	struct xrep_abt		*ra = priv;
+	struct xrep_abt_extent	*rae;
+	xfs_fsblock_t		fsb;
+	int			error;
+
+	/* Record all the OWN_AG blocks... */
+	if (rec->rm_owner == XFS_RMAP_OWN_AG) {
+		fsb = XFS_AGB_TO_FSB(cur->bc_mp, cur->bc_private.a.agno,
+				rec->rm_startblock);
+		error = xfs_bitmap_set(ra->btlist, fsb, rec->rm_blockcount);
+		if (error)
+			return error;
+	}
+
+	/* ...and all the rmapbt blocks... */
+	error = xfs_bitmap_set_btcur_path(&ra->nobtlist, cur);
+	if (error)
+		return error;
+
+	/* ...and all the free space. */
+	if (rec->rm_startblock > ra->next_bno) {
+		trace_xrep_abt_walk_rmap(cur->bc_mp, cur->bc_private.a.agno,
+				ra->next_bno, rec->rm_startblock - ra->next_bno,
+				XFS_RMAP_OWN_NULL, 0, 0);
+
+		rae = kmem_alloc(sizeof(struct xrep_abt_extent), KM_MAYFAIL);
+		if (!rae)
+			return -ENOMEM;
+		INIT_LIST_HEAD(&rae->list);
+		rae->bno = ra->next_bno;
+		rae->len = rec->rm_startblock - ra->next_bno;
+		list_add_tail(&rae->list, ra->extlist);
+		ra->nr_records++;
+		ra->nr_blocks += rae->len;
+	}
+	ra->next_bno = max_t(xfs_agblock_t, ra->next_bno,
+			rec->rm_startblock + rec->rm_blockcount);
+	return 0;
+}
+
+/* Collect an AGFL block for the not-to-release list. */
+static int
+xrep_abt_walk_agfl(
+	struct xfs_mount	*mp,
+	xfs_agblock_t		bno,
+	void			*priv)
+{
+	struct xrep_abt		*ra = priv;
+	xfs_fsblock_t		fsb;
+
+	fsb = XFS_AGB_TO_FSB(mp, ra->sc->sa.agno, bno);
+	return xfs_bitmap_set(&ra->nobtlist, fsb, 1);
+}
+
+/* Compare two free space extents. */
+static int
+xrep_abt_extent_cmp(
+	void			*priv,
+	struct list_head	*a,
+	struct list_head	*b)
+{
+	struct xrep_abt_extent	*ap;
+	struct xrep_abt_extent	*bp;
+
+	ap = container_of(a, struct xrep_abt_extent, list);
+	bp = container_of(b, struct xrep_abt_extent, list);
+
+	if (ap->bno > bp->bno)
+		return 1;
+	else if (ap->bno < bp->bno)
+		return -1;
+	return 0;
+}
+
+/* Free an extent, which creates a record in the bnobt/cntbt. */
+STATIC int
+xrep_abt_free_extent(
+	struct xfs_scrub	*sc,
+	xfs_fsblock_t		fsbno,
+	xfs_extlen_t		len,
+	struct xfs_owner_info	*oinfo)
+{
+	int			error;
+
+	error = xfs_free_extent(sc->tp, fsbno, len, oinfo, 0);
+	if (error)
+		return error;
+	error = xrep_roll_ag_trans(sc);
+	if (error)
+		return error;
+	return xfs_mod_fdblocks(sc->mp, -(int64_t)len, false);
+}
+
+/* Find the longest free extent in the list. */
+static struct xrep_abt_extent *
+xrep_abt_get_longest(
+	struct list_head	*free_extents)
+{
+	struct xrep_abt_extent	*rae;
+	struct xrep_abt_extent	*res = NULL;
+
+	list_for_each_entry(rae, free_extents, list) {
+		if (!res || rae->len > res->len)
+			res = rae;
+	}
+	return res;
+}
+
+/*
+ * Allocate a block from the (cached) first extent in the AG.  In theory
+ * this should never fail, since we already checked that there was enough
+ * space to handle the new btrees.
+ */
+STATIC xfs_fsblock_t
+xrep_abt_alloc_block(
+	struct xfs_scrub	*sc,
+	struct list_head	*free_extents)
+{
+	struct xrep_abt_extent	*ext;
+
+	/* Pull the first free space extent off the list, and... */
+	ext = list_first_entry(free_extents, struct xrep_abt_extent, list);
+
+	/* ...take its first block. */
+	ext->bno++;
+	ext->len--;
+	if (ext->len == 0) {
+		list_del(&ext->list);
+		kmem_free(ext);
+	}
+
+	return XFS_AGB_TO_FSB(sc->mp, sc->sa.agno, ext->bno - 1);
+}
+
+/* Free every record in the extent list. */
+STATIC void
+xrep_abt_cancel_freelist(
+	struct list_head	*extlist)
+{
+	struct xrep_abt_extent	*rae;
+	struct xrep_abt_extent	*n;
+
+	list_for_each_entry_safe(rae, n, extlist, list) {
+		list_del(&rae->list);
+		kmem_free(rae);
+	}
+}
+
+/*
+ * Iterate all reverse mappings to find (1) the free extents, (2) the OWN_AG
+ * extents, (3) the rmapbt blocks, and (4) the AGFL blocks.  The free space is
+ * (1) + (2) - (3) - (4).  Figure out if we have enough free space to
+ * reconstruct the free space btrees.  Caller must clean up the input lists
+ * if something goes wrong.
+ */
+STATIC int
+xrep_abt_find_freespace(
+	struct xfs_scrub	*sc,
+	struct list_head	*free_extents,
+	struct xfs_bitmap	*old_allocbt_blocks)
+{
+	struct xrep_abt		ra;
+	struct xrep_abt_extent	*rae;
+	struct xfs_btree_cur	*cur;
+	struct xfs_mount	*mp = sc->mp;
+	xfs_agblock_t		agend;
+	xfs_agblock_t		nr_blocks;
+	int			error;
+
+	ra.extlist = free_extents;
+	ra.btlist = old_allocbt_blocks;
+	xfs_bitmap_init(&ra.nobtlist);
+	ra.next_bno = 0;
+	ra.nr_records = 0;
+	ra.nr_blocks = 0;
+	ra.sc = sc;
+
+	/*
+	 * Iterate all the reverse mappings to find gaps in the physical
+	 * mappings, all the OWN_AG blocks, and all the rmapbt extents.
+	 */
+	cur = xfs_rmapbt_init_cursor(mp, sc->tp, sc->sa.agf_bp, sc->sa.agno);
+	error = xfs_rmap_query_all(cur, xrep_abt_walk_rmap, &ra);
+	if (error)
+		goto err;
+	xfs_btree_del_cursor(cur, error);
+	cur = NULL;
+
+	/* Insert a record for space between the last rmap and EOAG. */
+	agend = be32_to_cpu(XFS_BUF_TO_AGF(sc->sa.agf_bp)->agf_length);
+	if (ra.next_bno < agend) {
+		rae = kmem_alloc(sizeof(struct xrep_abt_extent), KM_MAYFAIL);
+		if (!rae) {
+			error = -ENOMEM;
+			goto err;
+		}
+		INIT_LIST_HEAD(&rae->list);
+		rae->bno = ra.next_bno;
+		rae->len = agend - ra.next_bno;
+		list_add_tail(&rae->list, free_extents);
+		ra.nr_records++;
+		ra.nr_blocks += rae->len;
+	}
+
+	/* Collect all the AGFL blocks. */
+	error = xfs_agfl_walk(mp, XFS_BUF_TO_AGF(sc->sa.agf_bp),
+			sc->sa.agfl_bp, xrep_abt_walk_agfl, &ra);
+	if (error)
+		goto err;
+
+	/* Do we have enough space to rebuild both freespace btrees? */
+	nr_blocks = 2 * xfs_allocbt_calc_size(mp, ra.nr_records);
+	if (!xrep_ag_has_space(sc->sa.pag, nr_blocks, XFS_AG_RESV_NONE) ||
+	    ra.nr_blocks < nr_blocks) {
+		error = -ENOSPC;
+		goto err;
+	}
+
+	/* Compute the old bnobt/cntbt blocks. */
+	error = xfs_bitmap_disunion(old_allocbt_blocks, &ra.nobtlist);
+err:
+	xfs_bitmap_destroy(&ra.nobtlist);
+	if (cur)
+		xfs_btree_del_cursor(cur, error);
+	return error;
+}
+
+/*
+ * Reset the global free block counter and the per-AG counters to make it look
+ * like this AG has no free space.
+ */
+STATIC int
+xrep_abt_reset_counters(
+	struct xfs_scrub	*sc,
+	int			*log_flags)
+{
+	struct xfs_perag	*pag = sc->sa.pag;
+	struct xfs_agf		*agf;
+	xfs_agblock_t		new_btblks;
+	xfs_agblock_t		to_free;
+	int			error;
+
+	/*
+	 * Since we're abandoning the old bnobt/cntbt, we have to decrease
+	 * fdblocks by the # of blocks in those trees.  btreeblks counts the
+	 * non-root blocks of the free space and rmap btrees.  Do this before
+	 * resetting the AGF counters.
+	 */
+	agf = XFS_BUF_TO_AGF(sc->sa.agf_bp);
+
+	/* rmap_blocks accounts root block, btreeblks doesn't */
+	new_btblks = be32_to_cpu(agf->agf_rmap_blocks) - 1;
+
+	/* btreeblks doesn't account bno/cnt root blocks */
+	to_free = pag->pagf_btreeblks + 2;
+
+	/* and don't account for the blocks we aren't freeing */
+	to_free -= new_btblks;
+
+	error = xfs_mod_fdblocks(sc->mp, -(int64_t)to_free, false);
+	if (error)
+		return error;
+
+	/*
+	 * Reset the per-AG info, both incore and ondisk.  Mark the incore
+	 * state stale in case we fail out of here.
+	 */
+	ASSERT(pag->pagf_init);
+	pag->pagf_init = 0;
+	pag->pagf_btreeblks = new_btblks;
+	pag->pagf_freeblks = 0;
+	pag->pagf_longest = 0;
+
+	agf->agf_btreeblks = cpu_to_be32(new_btblks);
+	agf->agf_freeblks = 0;
+	agf->agf_longest = 0;
+	*log_flags |= XFS_AGF_BTREEBLKS | XFS_AGF_LONGEST | XFS_AGF_FREEBLKS;
+
+	return 0;
+}
+
+/* Initialize a new free space btree root and implant into AGF. */
+STATIC int
+xrep_abt_reset_btree(
+	struct xfs_scrub	*sc,
+	xfs_btnum_t		btnum,
+	struct list_head	*free_extents)
+{
+	struct xfs_owner_info	oinfo;
+	struct xfs_buf		*bp;
+	struct xfs_perag	*pag = sc->sa.pag;
+	struct xfs_mount	*mp = sc->mp;
+	struct xfs_agf		*agf = XFS_BUF_TO_AGF(sc->sa.agf_bp);
+	xfs_fsblock_t		fsbno;
+	int			error;
+
+	/* Allocate new root block. */
+	fsbno = xrep_abt_alloc_block(sc, free_extents);
+	if (fsbno == NULLFSBLOCK)
+		return -ENOSPC;
+
+	/* Initialize new tree root. */
+	error = xrep_init_btblock(sc, fsbno, &bp, btnum, &xfs_allocbt_buf_ops);
+	if (error)
+		return error;
+
+	/* Implant into AGF. */
+	agf->agf_roots[btnum] = cpu_to_be32(XFS_FSB_TO_AGBNO(mp, fsbno));
+	agf->agf_levels[btnum] = cpu_to_be32(1);
+
+	/* Add rmap records for the btree roots */
+	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_AG);
+	error = xfs_rmap_alloc(sc->tp, sc->sa.agf_bp, sc->sa.agno,
+			XFS_FSB_TO_AGBNO(mp, fsbno), 1, &oinfo);
+	if (error)
+		return error;
+
+	/* Reset the incore state. */
+	pag->pagf_levels[btnum] = 1;
+
+	return 0;
+}
+
+/* Initialize new bnobt/cntbt roots and implant them into the AGF. */
+STATIC int
+xrep_abt_reset_btrees(
+	struct xfs_scrub	*sc,
+	struct list_head	*free_extents,
+	int			*log_flags)
+{
+	int			error;
+
+	error = xrep_abt_reset_btree(sc, XFS_BTNUM_BNOi, free_extents);
+	if (error)
+		return error;
+	error = xrep_abt_reset_btree(sc, XFS_BTNUM_CNTi, free_extents);
+	if (error)
+		return error;
+
+	*log_flags |= XFS_AGF_ROOTS | XFS_AGF_LEVELS;
+	return 0;
+}
+
+/*
+ * Make our new freespace btree roots permanent so that we can start freeing
+ * unused space back into the AG.
+ */
+STATIC int
+xrep_abt_commit_new(
+	struct xfs_scrub	*sc,
+	struct xfs_bitmap	*old_allocbt_blocks,
+	int			log_flags)
+{
+	int			error;
+
+	xfs_alloc_log_agf(sc->tp, sc->sa.agf_bp, log_flags);
+
+	/* Invalidate the old freespace btree blocks and commit. */
+	error = xrep_invalidate_blocks(sc, old_allocbt_blocks);
+	if (error)
+		return error;
+	error = xrep_roll_ag_trans(sc);
+	if (error)
+		return error;
+
+	/* Now that we've succeeded, mark the incore state valid again. */
+	sc->sa.pag->pagf_init = 1;
+	return 0;
+}
+
+/* Build new free space btrees and dispose of the old ones. */
+STATIC int
+xrep_abt_rebuild_trees(
+	struct xfs_scrub	*sc,
+	struct list_head	*free_extents,
+	struct xfs_bitmap	*old_allocbt_blocks)
+{
+	struct xfs_owner_info	oinfo;
+	struct xrep_abt_extent	*rae;
+	struct xrep_abt_extent	*n;
+	struct xrep_abt_extent	*longest;
+	int			error;
+
+	xfs_rmap_skip_owner_update(&oinfo);
+
+	/*
+	 * Insert the longest free extent first in case it's necessary to
+	 * refresh the AGFL with multiple blocks.  If there is no longest
+	 * extent, we had exactly the free space we needed; we're done.
+	 */
+	longest = xrep_abt_get_longest(free_extents);
+	if (!longest)
+		goto done;
+	error = xrep_abt_free_extent(sc,
+			XFS_AGB_TO_FSB(sc->mp, sc->sa.agno, longest->bno),
+			longest->len, &oinfo);
+	list_del(&longest->list);
+	kmem_free(longest);
+	if (error)
+		return error;
+
+	/* Insert records into the new btrees. */
+	list_for_each_entry_safe(rae, n, free_extents, list) {
+		error = xrep_abt_free_extent(sc,
+				XFS_AGB_TO_FSB(sc->mp, sc->sa.agno, rae->bno),
+				rae->len, &oinfo);
+		if (error)
+			return error;
+		list_del(&rae->list);
+		kmem_free(rae);
+	}
+
+done:
+	/* Free all the OWN_AG blocks that are not in the rmapbt/agfl. */
+	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_AG);
+	return xrep_reap_extents(sc, old_allocbt_blocks, &oinfo,
+			XFS_AG_RESV_NONE);
+}
+
+/* Repair the freespace btrees for some AG. */
+int
+xrep_allocbt(
+	struct xfs_scrub	*sc)
+{
+	struct list_head	free_extents;
+	struct xfs_bitmap	old_allocbt_blocks;
+	struct xfs_mount	*mp = sc->mp;
+	int			log_flags = 0;
+	int			error;
+
+	/* We require the rmapbt to rebuild anything. */
+	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
+		return -EOPNOTSUPP;
+
+	xchk_perag_get(sc->mp, &sc->sa);
+
+	/*
+	 * Make sure the busy extent list is clear because we can't put
+	 * extents on there twice.
+	 */
+	if (!xfs_extent_busy_list_empty(sc->sa.pag))
+		return -EDEADLOCK;
+
+	/* Collect the free space data and find the old btree blocks. */
+	INIT_LIST_HEAD(&free_extents);
+	xfs_bitmap_init(&old_allocbt_blocks);
+	error = xrep_abt_find_freespace(sc, &free_extents, &old_allocbt_blocks);
+	if (error)
+		goto out;
+
+	/* Make sure we got some free space. */
+	if (list_empty(&free_extents)) {
+		error = -ENOSPC;
+		goto out;
+	}
+
+	/*
+	 * Sort the free extents by block number to avoid bnobt splits when we
+	 * rebuild the free space btrees.
+	 */
+	list_sort(NULL, &free_extents, xrep_abt_extent_cmp);
+
+	/*
+	 * Blow out the old free space btrees.  This is the point at which
+	 * we are no longer able to bail out gracefully.
+	 */
+	error = xrep_abt_reset_counters(sc, &log_flags);
+	if (error)
+		goto out;
+	error = xrep_abt_reset_btrees(sc, &free_extents, &log_flags);
+	if (error)
+		goto out;
+	error = xrep_abt_commit_new(sc, &old_allocbt_blocks, log_flags);
+	if (error)
+		goto out;
+
+	/* Now rebuild the freespace information. */
+	error = xrep_abt_rebuild_trees(sc, &free_extents, &old_allocbt_blocks);
+out:
+	xrep_abt_cancel_freelist(&free_extents);
+	xfs_bitmap_destroy(&old_allocbt_blocks);
+	return error;
+}
diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
index 346b02abccf7..0fb949afaca9 100644
--- a/fs/xfs/scrub/common.c
+++ b/fs/xfs/scrub/common.c
@@ -623,8 +623,14 @@ xchk_setup_ag_btree(
 	 * expensive operation should be performed infrequently and only
 	 * as a last resort.  Any caller that sets force_log should
 	 * document why they need to do so.
+	 *
+	 * Force everything in memory out to disk if we're repairing.
+	 * This ensures we won't get tripped up by btree blocks sitting
+	 * in memory waiting to have LSNs stamped in.  The AGF/AGI repair
+	 * routines use any available rmap data to try to find a btree
+	 * root that also passes the read verifiers.
 	 */
-	if (force_log) {
+	if (force_log || (sc->sm->sm_flags & XFS_SCRUB_IFLAG_REPAIR)) {
 		error = xchk_checkpoint_log(mp);
 		if (error)
 			return error;
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index 9de321eee4ab..bc1a5f1cbcdc 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -61,6 +61,7 @@ int xrep_superblock(struct xfs_scrub *sc);
 int xrep_agf(struct xfs_scrub *sc);
 int xrep_agfl(struct xfs_scrub *sc);
 int xrep_agi(struct xfs_scrub *sc);
+int xrep_allocbt(struct xfs_scrub *sc);
 
 #else
 
@@ -87,6 +88,7 @@ xrep_calc_ag_resblks(
 #define xrep_agf			xrep_notsupported
 #define xrep_agfl			xrep_notsupported
 #define xrep_agi			xrep_notsupported
+#define xrep_allocbt			xrep_notsupported
 
 #endif /* CONFIG_XFS_ONLINE_REPAIR */
 
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 4bfae1e61d30..2133a3199372 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -232,13 +232,13 @@ static const struct xchk_meta_ops meta_scrub_ops[] = {
 		.type	= ST_PERAG,
 		.setup	= xchk_setup_ag_allocbt,
 		.scrub	= xchk_bnobt,
-		.repair	= xrep_notsupported,
+		.repair	= xrep_allocbt,
 	},
 	[XFS_SCRUB_TYPE_CNTBT] = {	/* cntbt */
 		.type	= ST_PERAG,
 		.setup	= xchk_setup_ag_allocbt,
 		.scrub	= xchk_cntbt,
-		.repair	= xrep_notsupported,
+		.repair	= xrep_allocbt,
 	},
 	[XFS_SCRUB_TYPE_INOBT] = {	/* inobt */
 		.type	= ST_PERAG,
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index 4e20f0e48232..26bd5dc68efe 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -551,7 +551,7 @@ DEFINE_EVENT(xrep_rmap_class, name, \
 		 xfs_agblock_t agbno, xfs_extlen_t len, \
 		 uint64_t owner, uint64_t offset, unsigned int flags), \
 	TP_ARGS(mp, agno, agbno, len, owner, offset, flags))
-DEFINE_REPAIR_RMAP_EVENT(xrep_alloc_extent_fn);
+DEFINE_REPAIR_RMAP_EVENT(xrep_abt_walk_rmap);
 DEFINE_REPAIR_RMAP_EVENT(xrep_ialloc_extent_fn);
 DEFINE_REPAIR_RMAP_EVENT(xrep_rmap_extent_fn);
 DEFINE_REPAIR_RMAP_EVENT(xrep_bmap_extent_fn);
diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c
index 0ed68379e551..82f99633a597 100644
--- a/fs/xfs/xfs_extent_busy.c
+++ b/fs/xfs/xfs_extent_busy.c
@@ -657,3 +657,17 @@ xfs_extent_busy_ag_cmp(
 		diff = b1->bno - b2->bno;
 	return diff;
 }
+
+/* Are there any busy extents in this AG? */
+bool
+xfs_extent_busy_list_empty(
+	struct xfs_perag	*pag)
+{
+	spin_lock(&pag->pagb_lock);
+	if (pag->pagb_tree.rb_node) {
+		spin_unlock(&pag->pagb_lock);
+		return false;
+	}
+	spin_unlock(&pag->pagb_lock);
+	return true;
+}
diff --git a/fs/xfs/xfs_extent_busy.h b/fs/xfs/xfs_extent_busy.h
index 990ab3891971..2f8c73c712c6 100644
--- a/fs/xfs/xfs_extent_busy.h
+++ b/fs/xfs/xfs_extent_busy.h
@@ -65,4 +65,6 @@ static inline void xfs_extent_busy_sort(struct list_head *list)
 	list_sort(NULL, list, xfs_extent_busy_ag_cmp);
 }
 
+bool xfs_extent_busy_list_empty(struct xfs_perag *pag);
+
 #endif /* __XFS_EXTENT_BUSY_H__ */
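Not part of the patch, only an illustration: the gap detection that the header comment and xrep_abt_walk_rmap() describe can be modelled in userspace roughly as follows. This is a minimal sketch, assuming the rmap records arrive sorted by start block and may overlap. struct demo_rmap, struct demo_free and demo_find_gaps() are hypothetical names made up for the example, not kernel interfaces, and the fixed-size output array stands in for the list allocation done by the real code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for an rmap record and a free space extent. */
struct demo_rmap {
	uint32_t	start;	/* first AG block covered by the mapping */
	uint32_t	len;	/* number of blocks covered */
};

struct demo_free {
	uint32_t	bno;	/* first free AG block */
	uint32_t	len;	/* number of free blocks */
};

/*
 * Walk rmap records sorted by start block and report every gap between
 * them, plus the space between the last record and the end of the AG.
 * Overlapping records are handled by only ever moving next_bno forward.
 * Returns the number of free extents stored, or -1 if out[] is too small.
 */
static int
demo_find_gaps(
	const struct demo_rmap	*recs,
	size_t			nr,
	uint32_t		agend,
	struct demo_free	*out,
	size_t			max_out)
{
	uint32_t		next_bno = 0;
	size_t			n = 0;
	size_t			i;

	for (i = 0; i < nr; i++) {
		uint32_t	end = recs[i].start + recs[i].len;

		/* A record starting past next_bno exposes unused space. */
		if (recs[i].start > next_bno) {
			if (n == max_out)
				return -1;
			out[n].bno = next_bno;
			out[n].len = recs[i].start - next_bno;
			n++;
		}

		/* Records may overlap, so never move next_bno backwards. */
		if (end > next_bno)
			next_bno = end;
	}

	/* Space between the last mapping and the end of the AG. */
	if (next_bno < agend) {
		if (n == max_out)
			return -1;
		out[n].bno = next_bno;
		out[n].len = agend - next_bno;
		n++;
	}
	return (int)n;
}
```

With records {0,4}, {2,6}, {10,2}, {11,1} and a 20-block AG, the walk reports the gap before block 10 and the tail of the AG after block 11.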



* [PATCH 06/14] xfs: repair inode btrees
  2018-07-30  5:47 [PATCH v17.1 00/14] xfs-4.19: online repair support Darrick J. Wong
                   ` (4 preceding siblings ...)
  2018-07-30  5:48 ` [PATCH 05/14] xfs: repair free space btrees Darrick J. Wong
@ 2018-07-30  5:48 ` Darrick J. Wong
  2018-08-02 14:54   ` Brian Foster
  2018-07-30  5:48 ` [PATCH 07/14] xfs: repair refcount btrees Darrick J. Wong
                   ` (7 subsequent siblings)
  13 siblings, 1 reply; 53+ messages in thread
From: Darrick J. Wong @ 2018-07-30  5:48 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs, bfoster, david, allison.henderson

From: Darrick J. Wong <darrick.wong@oracle.com>

Use the rmapbt to find inode chunks, query the chunks to compute
hole and free masks, and with that information rebuild the inobt
and finobt.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/Makefile              |    1 
 fs/xfs/scrub/common.c        |    2 
 fs/xfs/scrub/ialloc_repair.c |  673 ++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/repair.c        |   20 +
 fs/xfs/scrub/repair.h        |   11 +
 fs/xfs/scrub/scrub.c         |    4 
 fs/xfs/scrub/scrub.h         |    1 
 fs/xfs/scrub/trace.h         |    4 
 8 files changed, 712 insertions(+), 4 deletions(-)
 create mode 100644 fs/xfs/scrub/ialloc_repair.c
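Not part of the patch, only an illustration of how the hole and free masks are accumulated one cluster at a time. This is a minimal sketch, assuming the constants stated in the comment in ialloc_repair.c (64 inodes per chunk, 4 inodes per holemask bit). All demo_* names are hypothetical and do not exist in the kernel. The record starts out as "all holes, all free", and each cluster that is found clears its holemask bits and the freemask bits of its in-use inodes.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_INODES_PER_CHUNK		64
#define DEMO_INODES_PER_HOLEMASK_BIT	4

/* Hypothetical in-memory copy of one reconstructed inode chunk record. */
struct demo_inorec {
	uint64_t	freemask;	/* bit set: inode is free (or a hole) */
	uint16_t	holemask;	/* bit set: 4-inode slot has no inodes */
	unsigned int	count;		/* inodes physically present */
	unsigned int	usedcount;	/* inodes in use */
};

/* Mask of n contiguous set bits starting at bit i; n must be 1..64. */
static uint64_t
demo_maskn(unsigned int i, unsigned int n)
{
	uint64_t	m = (n >= 64) ? ~0ULL : ((1ULL << n) - 1);

	return m << i;
}

/* Start with a record that is entirely holes and entirely free. */
static void
demo_inorec_init(struct demo_inorec *r)
{
	r->freemask = ~0ULL;
	r->holemask = 0xFFFF;
	r->count = 0;
	r->usedcount = 0;
}

/*
 * Fold one inode cluster into the record.  clusteroff is the offset of
 * the cluster within the chunk, in inodes.  Bit k of used is set if the
 * k-th inode of the cluster is in use.
 */
static void
demo_inorec_add_cluster(
	struct demo_inorec	*r,
	unsigned int		clusteroff,
	unsigned int		nr_inodes,
	uint64_t		used)
{
	uint16_t		fillmask;
	unsigned int		k;

	assert(nr_inodes > 0);
	assert(clusteroff + nr_inodes <= DEMO_INODES_PER_CHUNK);
	assert(clusteroff % DEMO_INODES_PER_HOLEMASK_BIT == 0);
	assert(nr_inodes % DEMO_INODES_PER_HOLEMASK_BIT == 0);

	/* Every holemask slot covered by this cluster is filled. */
	fillmask = (uint16_t)demo_maskn(
			clusteroff / DEMO_INODES_PER_HOLEMASK_BIT,
			nr_inodes / DEMO_INODES_PER_HOLEMASK_BIT);
	r->holemask &= (uint16_t)~fillmask;

	/* In-use inodes are no longer free. */
	r->freemask &= ~(used << clusteroff);
	r->count += nr_inodes;
	for (k = 0; k < nr_inodes; k++)
		if (used & (1ULL << k))
			r->usedcount++;
}
```

Feeding a chunk as two 32-inode clusters, the first call clears the low eight holemask bits and the second clears the rest, which mirrors the "update the last list entry if it is our chunk" branch in xrep_ibt_process_cluster().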


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 44ddd112acd2..af1dc9aeb1a7 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -166,6 +166,7 @@ xfs-y				+= $(addprefix scrub/, \
 				   agheader_repair.o \
 				   alloc_repair.o \
 				   bitmap.o \
+				   ialloc_repair.o \
 				   repair.o \
 				   )
 endif
diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c
index 0fb949afaca9..67df7ea8798d 100644
--- a/fs/xfs/scrub/common.c
+++ b/fs/xfs/scrub/common.c
@@ -516,6 +516,8 @@ xchk_ag_free(
 	struct xchk_ag		*sa)
 {
 	xchk_ag_btcur_free(sa);
+	if (sa->pag != NULL && sc->reset_perag_resv)
+		xrep_reset_perag_resv(sc);
 	if (sa->agfl_bp) {
 		xfs_trans_brelse(sc->tp, sa->agfl_bp);
 		sa->agfl_bp = NULL;
diff --git a/fs/xfs/scrub/ialloc_repair.c b/fs/xfs/scrub/ialloc_repair.c
new file mode 100644
index 000000000000..126135c1a147
--- /dev/null
+++ b/fs/xfs/scrub/ialloc_repair.c
@@ -0,0 +1,673 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Copyright (C) 2018 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_btree.h"
+#include "xfs_bit.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_sb.h"
+#include "xfs_inode.h"
+#include "xfs_alloc.h"
+#include "xfs_ialloc.h"
+#include "xfs_ialloc_btree.h"
+#include "xfs_icache.h"
+#include "xfs_rmap.h"
+#include "xfs_rmap_btree.h"
+#include "xfs_log.h"
+#include "xfs_trans_priv.h"
+#include "xfs_error.h"
+#include "scrub/xfs_scrub.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/btree.h"
+#include "scrub/trace.h"
+#include "scrub/repair.h"
+#include "scrub/bitmap.h"
+
+/*
+ * Inode Btree Repair
+ * ==================
+ *
+ * A quick refresher of inode btrees on a v5 filesystem:
+ *
+ * - Each inode btree record can describe a single 'inode chunk'.  The chunk
+ *   size is defined to be 64 inodes.  If sparse inodes are enabled, every
+ *   inobt record must be aligned to the chunk size.  A chunk can be smaller
+ *   than a fs block.  One must be careful with 64k-block filesystems whose
+ *   inodes are smaller than 1k.
+ *
+ * - Inode buffers are read into memory in units of 'inode clusters'.  The
+ *   number of inodes that fit in a cluster buffer is the smallest number of
+ *   inodes that can be allocated or freed.  Clusters are never larger than a
+ *   chunk and never smaller than a fs block.  If sparse inodes are not
+ *   enabled, then records can be aligned to a cluster.
+ *
+ * - If sparse inodes are enabled, the holemask field will be active.  Each
+ *   bit of the holemask represents 4 potential inodes; if set, the
+ *   corresponding space does *not* contain inodes and must be left alone.
+ *
+ * So what's the rebuild algorithm?
+ *
+ * Iterate the reverse mapping records looking for OWN_INODES and OWN_INOBT
+ * records.  The OWN_INOBT records are the old inode btree blocks and will be
+ * cleared out after we've rebuilt the tree.  Each possible inode chunk within
+ * an OWN_INODES record will be read in and the freemask calculated from the
+ * i_mode data in the inode chunk.  For sparse inodes the holemask will be
+ * calculated by creating the properly aligned inobt record and punching out
+ * any chunk that's missing.  Inode allocations and frees grab the AGI first,
+ * so repair protects itself from concurrent access by locking the AGI.
+ *
+ * Once we've reconstructed all the inode records, we can create new inode
+ * btree roots and reload the btrees.  We rebuild both inode trees at the same
+ * time because they have the same rmap owner and it would be more complex to
+ * figure out if the other tree isn't in need of a rebuild and which OWN_INOBT
+ * blocks it owns.  We have all the data we need to build both, so dump
+ * everything and start over.
+ *
+ * We use the prefix 'xrep_ibt' because we rebuild both inode btrees.
+ */
+
+struct xrep_ibt_extent {
+	struct list_head	list;
+	xfs_inofree_t		freemask;
+	xfs_agino_t		startino;
+	unsigned int		count;
+	unsigned int		usedcount;
+	uint16_t		holemask;
+};
+
+struct xrep_ibt {
+	/* Reconstructed inode records. */
+	struct list_head	*extlist;
+
+	/* Old inode btree blocks we found in the rmap. */
+	struct xfs_bitmap	*btlist;
+
+	struct xfs_scrub	*sc;
+
+	/* Number of reconstructed inode records. */
+	uint64_t		nr_records;
+};
+
+/*
+ * Is this inode in use?  If the inode is in memory we can tell from i_mode,
+ * otherwise we have to check di_mode in the on-disk buffer.  We only care
+ * that the high (i.e. non-permission) bits of _mode are zero.  This should be
+ * safe because repair keeps all AG headers locked until the end, and any
+ * process trying to perform an inode allocation/free must lock the AGI.
+ */
+STATIC int
+xrep_ibt_check_free(
+	struct xfs_scrub	*sc,
+	struct xfs_buf		*bp,
+	xfs_ino_t		fsino,
+	xfs_agino_t		bpino,
+	bool			*inuse)
+{
+	struct xfs_mount	*mp = sc->mp;
+	struct xfs_dinode	*dip;
+	int			error;
+
+	/* Will the in-core inode tell us if it's in use? */
+	error = xfs_icache_inode_is_allocated(mp, sc->tp, fsino, inuse);
+	if (!error)
+		return 0;
+
+	/* Inode uncached or half assembled, read disk buffer */
+	dip = xfs_buf_offset(bp, bpino * mp->m_sb.sb_inodesize);
+	if (be16_to_cpu(dip->di_magic) != XFS_DINODE_MAGIC)
+		return -EFSCORRUPTED;
+
+	if (dip->di_version >= 3 && be64_to_cpu(dip->di_ino) != fsino)
+		return -EFSCORRUPTED;
+
+	*inuse = dip->di_mode != 0;
+	return 0;
+}
+
+/*
+ * For each inode cluster covering the physical extent recorded by the rmapbt,
+ * we must calculate the properly aligned startino of that cluster, then
+ * iterate each cluster to fill in used and filled masks appropriately.  We
+ * then use the (startino, used, filled) information to construct the
+ * appropriate inode records.
+ */
+STATIC int
+xrep_ibt_process_cluster(
+	struct xrep_ibt		*ri,
+	xfs_agblock_t		agbno,
+	int			blks_per_cluster,
+	xfs_agino_t		rec_agino)
+{
+	struct xfs_imap		imap;
+	struct xrep_ibt_extent	*rie;
+	struct xfs_dinode	*dip;
+	struct xfs_buf		*bp;
+	struct xfs_scrub	*sc = ri->sc;
+	struct xfs_mount	*mp = sc->mp;
+	xfs_ino_t		fsino;
+	xfs_inofree_t		usedmask;
+	xfs_agino_t		nr_inodes;
+	xfs_agino_t		startino;
+	xfs_agino_t		clusterino;
+	xfs_agino_t		clusteroff;
+	xfs_agino_t		agino;
+	uint16_t		fillmask;
+	bool			inuse;
+	int			usedcount;
+	int			error;
+
+	/* The per-AG inum of this inode cluster. */
+	agino = XFS_OFFBNO_TO_AGINO(mp, agbno, 0);
+
+	/* The per-AG inum of the inobt record. */
+	startino = rec_agino + rounddown(agino - rec_agino,
+			XFS_INODES_PER_CHUNK);
+
+	/* The offset, in inodes, of the cluster within the inobt record. */
+	clusteroff = agino - startino;
+
+	/* Every inode in this holemask slot is filled. */
+	nr_inodes = XFS_OFFBNO_TO_AGINO(mp, blks_per_cluster, 0);
+	fillmask = xfs_inobt_maskn(clusteroff / XFS_INODES_PER_HOLEMASK_BIT,
+			nr_inodes / XFS_INODES_PER_HOLEMASK_BIT);
+
+	/*
+	 * Grab the inode cluster buffer.  This is safe to do with a broken
+	 * inobt because imap_to_bp directly maps the buffer without touching
+	 * either inode btree.
+	 */
+	imap.im_blkno = XFS_AGB_TO_DADDR(mp, sc->sa.agno, agbno);
+	imap.im_len = XFS_FSB_TO_BB(mp, blks_per_cluster);
+	imap.im_boffset = 0;
+	error = xfs_imap_to_bp(mp, sc->tp, &imap, &dip, &bp, 0,
+			XFS_IGET_UNTRUSTED);
+	if (error)
+		return error;
+
+	usedmask = 0;
+	usedcount = 0;
+	/* Which inodes within this cluster are free? */
+	for (clusterino = 0; clusterino < nr_inodes; clusterino++) {
+		fsino = XFS_AGINO_TO_INO(mp, sc->sa.agno, agino + clusterino);
+		error = xrep_ibt_check_free(sc, bp, fsino,
+				clusterino, &inuse);
+		if (error) {
+			xfs_trans_brelse(sc->tp, bp);
+			return error;
+		}
+		if (inuse) {
+			usedcount++;
+			usedmask |= XFS_INOBT_MASK(clusteroff + clusterino);
+		}
+	}
+	xfs_trans_brelse(sc->tp, bp);
+
+	/*
+	 * If the last item in the list is our chunk record,
+	 * update that.
+	 */
+	if (!list_empty(ri->extlist)) {
+		rie = list_last_entry(ri->extlist, struct xrep_ibt_extent,
+				list);
+		if (rie->startino + XFS_INODES_PER_CHUNK > startino) {
+			rie->freemask &= ~usedmask;
+			rie->holemask &= ~fillmask;
+			rie->count += nr_inodes;
+			rie->usedcount += usedcount;
+			return 0;
+		}
+	}
+
+	/* New inode chunk; add to the list. */
+	rie = kmem_alloc(sizeof(struct xrep_ibt_extent), KM_MAYFAIL);
+	if (!rie)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&rie->list);
+	rie->startino = startino;
+	rie->freemask = XFS_INOBT_ALL_FREE & ~usedmask;
+	rie->holemask = XFS_INOBT_ALL_FREE & ~fillmask;
+	rie->count = nr_inodes;
+	rie->usedcount = usedcount;
+	list_add_tail(&rie->list, ri->extlist);
+	ri->nr_records++;
+
+	return 0;
+}
+
+/* Record extents that belong to inode btrees. */
+STATIC int
+xrep_ibt_walk_rmap(
+	struct xfs_btree_cur	*cur,
+	struct xfs_rmap_irec	*rec,
+	void			*priv)
+{
+	struct xrep_ibt		*ri = priv;
+	struct xfs_mount	*mp = cur->bc_mp;
+	xfs_fsblock_t		fsbno;
+	xfs_agblock_t		agbno = rec->rm_startblock;
+	xfs_agino_t		inoalign;
+	xfs_agino_t		agino;
+	xfs_agino_t		rec_agino;
+	int			blks_per_cluster;
+	int			error = 0;
+
+	if (xchk_should_terminate(ri->sc, &error))
+		return error;
+
+	/* Fragment of the old btrees; dispose of them later. */
+	if (rec->rm_owner == XFS_RMAP_OWN_INOBT) {
+		fsbno = XFS_AGB_TO_FSB(mp, ri->sc->sa.agno, agbno);
+		return xfs_bitmap_set(ri->btlist, fsbno, rec->rm_blockcount);
+	}
+
+	/* Skip extents that do not contain inode chunks. */
+	if (rec->rm_owner != XFS_RMAP_OWN_INODES)
+		return 0;
+
+	blks_per_cluster = xfs_icluster_size_fsb(mp);
+
+	if (agbno % blks_per_cluster != 0)
+		return -EFSCORRUPTED;
+
+	trace_xrep_ibt_walk_rmap(mp, ri->sc->sa.agno, rec->rm_startblock,
+			rec->rm_blockcount, rec->rm_owner, rec->rm_offset,
+			rec->rm_flags);
+
+	/*
+	 * Determine the inode block alignment, and where the block
+	 * ought to start if it's aligned properly.  On a sparse inode
+	 * system the rmap doesn't have to start on an alignment boundary,
+	 * but the record does.  On pre-sparse filesystems, we /must/
+	 * start both rmap and inobt on an alignment boundary.
+	 */
+	inoalign = xfs_ialloc_cluster_alignment(mp);
+	agino = XFS_OFFBNO_TO_AGINO(mp, agbno, 0);
+	rec_agino = XFS_OFFBNO_TO_AGINO(mp, rounddown(agbno, inoalign), 0);
+	if (!xfs_sb_version_hassparseinodes(&mp->m_sb) && agino != rec_agino)
+		return -EFSCORRUPTED;
+
+	/*
+	 * Set up the free/hole masks for each inode cluster that could be
+	 * mapped by this rmap record.
+	 */
+	for (;
+	     agbno < rec->rm_startblock + rec->rm_blockcount;
+	     agbno += blks_per_cluster) {
+		error = xrep_ibt_process_cluster(ri, agbno, blks_per_cluster,
+				rec_agino);
+		if (error)
+			return error;
+	}
+
+	return 0;
+}
+
+/* Compare two ialloc extents. */
+static int
+xrep_ibt_extent_cmp(
+	void			*priv,
+	struct list_head	*a,
+	struct list_head	*b)
+{
+	struct xrep_ibt_extent	*ap;
+	struct xrep_ibt_extent	*bp;
+
+	ap = container_of(a, struct xrep_ibt_extent, list);
+	bp = container_of(b, struct xrep_ibt_extent, list);
+
+	if (ap->startino > bp->startino)
+		return 1;
+	else if (ap->startino < bp->startino)
+		return -1;
+	return 0;
+}
+
+/* Insert an inode chunk record into a given btree. */
+static int
+xrep_ibt_insert_btrec(
+	struct xfs_btree_cur	*cur,
+	struct xrep_ibt_extent	*rie)
+{
+	int			stat;
+	int			error;
+
+	error = xfs_inobt_lookup(cur, rie->startino, XFS_LOOKUP_EQ, &stat);
+	if (error)
+		return error;
+	XFS_WANT_CORRUPTED_RETURN(cur->bc_mp, stat == 0);
+	error = xfs_inobt_insert_rec(cur, rie->holemask, rie->count,
+			rie->count - rie->usedcount, rie->freemask, &stat);
+	if (error)
+		return error;
+	XFS_WANT_CORRUPTED_RETURN(cur->bc_mp, stat == 1);
+	return error;
+}
+
+/* Insert an inode chunk record into both inode btrees. */
+static int
+xrep_ibt_insert_rec(
+	struct xfs_scrub	*sc,
+	struct xrep_ibt_extent	*rie)
+{
+	struct xfs_btree_cur	*cur;
+	int			error;
+
+	trace_xrep_ibt_insert(sc->mp, sc->sa.agno, rie->startino,
+			rie->holemask, rie->count, rie->count - rie->usedcount,
+			rie->freemask);
+
+	/* Insert into the inobt. */
+	cur = xfs_inobt_init_cursor(sc->mp, sc->tp, sc->sa.agi_bp, sc->sa.agno,
+			XFS_BTNUM_INO);
+	error = xrep_ibt_insert_btrec(cur, rie);
+	if (error)
+		goto out_cur;
+	xfs_btree_del_cursor(cur, error);
+
+	/* Insert into the finobt if chunk has free inodes. */
+	if (xfs_sb_version_hasfinobt(&sc->mp->m_sb) &&
+	    rie->count != rie->usedcount) {
+		cur = xfs_inobt_init_cursor(sc->mp, sc->tp, sc->sa.agi_bp,
+				sc->sa.agno, XFS_BTNUM_FINO);
+		error = xrep_ibt_insert_btrec(cur, rie);
+		if (error)
+			goto out_cur;
+		xfs_btree_del_cursor(cur, error);
+	}
+
+	return xrep_roll_ag_trans(sc);
+out_cur:
+	xfs_btree_del_cursor(cur, error);
+	return error;
+}
+
+/* Free every record in the inode list. */
+STATIC void
+xrep_ibt_cancel_inorecs(
+	struct list_head	*reclist)
+{
+	struct xrep_ibt_extent	*rie;
+	struct xrep_ibt_extent	*n;
+
+	list_for_each_entry_safe(rie, n, reclist, list) {
+		list_del(&rie->list);
+		kmem_free(rie);
+	}
+}
+
+/*
+ * Iterate all reverse mappings to find the inodes (OWN_INODES) and the inode
+ * btrees (OWN_INOBT).  Figure out if we have enough free space to reconstruct
+ * the inode btrees.  The caller must clean up the lists if anything goes
+ * wrong.
+ */
+STATIC int
+xrep_ibt_find_inodes(
+	struct xfs_scrub	*sc,
+	struct list_head	*inode_records,
+	struct xfs_bitmap	*old_iallocbt_blocks)
+{
+	struct xrep_ibt		ri;
+	struct xfs_mount	*mp = sc->mp;
+	struct xfs_btree_cur	*cur;
+	xfs_agblock_t		nr_blocks;
+	int			error;
+
+	/* Collect all reverse mappings for inode blocks. */
+	ri.extlist = inode_records;
+	ri.btlist = old_iallocbt_blocks;
+	ri.nr_records = 0;
+	ri.sc = sc;
+
+	cur = xfs_rmapbt_init_cursor(mp, sc->tp, sc->sa.agf_bp, sc->sa.agno);
+	error = xfs_rmap_query_all(cur, xrep_ibt_walk_rmap, &ri);
+	if (error)
+		goto err;
+	xfs_btree_del_cursor(cur, error);
+
+	/* Do we have enough space to rebuild all inode trees? */
+	nr_blocks = xfs_iallocbt_calc_size(mp, ri.nr_records);
+	if (xfs_sb_version_hasfinobt(&mp->m_sb))
+		nr_blocks *= 2;
+	if (!xrep_ag_has_space(sc->sa.pag, nr_blocks, XFS_AG_RESV_NONE))
+		return -ENOSPC;
+
+	return 0;
+
+err:
+	xfs_btree_del_cursor(cur, error);
+	return error;
+}
+
+/* Update the AGI counters. */
+STATIC int
+xrep_ibt_reset_counters(
+	struct xfs_scrub	*sc,
+	struct list_head	*inode_records,
+	int			*log_flags)
+{
+	struct xfs_agi		*agi;
+	struct xrep_ibt_extent	*rie;
+	struct xfs_perag	*pag = sc->sa.pag;
+	unsigned int		count = 0;
+	unsigned int		usedcount = 0;
+	unsigned int		freecount;
+
+	/* Figure out the new counters. */
+	list_for_each_entry(rie, inode_records, list) {
+		count += rie->count;
+		usedcount += rie->usedcount;
+	}
+
+	agi = XFS_BUF_TO_AGI(sc->sa.agi_bp);
+	freecount = count - usedcount;
+
+	/* Trigger inode count recalculation */
+	xfs_force_summary_recalc(sc->mp);
+
+	/*
+	 * Reset the per-AG info, both incore and ondisk.  Mark the incore
+	 * state stale in case we fail out of here.
+	 */
+	ASSERT(pag->pagi_init);
+	pag->pagi_init = 0;
+	pag->pagi_count = count;
+	pag->pagi_freecount = freecount;
+
+	agi->agi_count = cpu_to_be32(count);
+	agi->agi_freecount = cpu_to_be32(freecount);
+	*log_flags |= XFS_AGI_COUNT | XFS_AGI_FREECOUNT;
+
+	return 0;
+}
+
+/* Initialize a new inode btree root and implant it into the AGI. */
+STATIC int
+xrep_ibt_reset_btree(
+	struct xfs_scrub	*sc,
+	xfs_btnum_t		btnum,
+	struct xfs_owner_info	*oinfo,
+	enum xfs_ag_resv_type	resv,
+	int			*log_flags)
+{
+	struct xfs_agi		*agi;
+	struct xfs_buf		*bp;
+	struct xfs_mount	*mp = sc->mp;
+	xfs_fsblock_t		fsbno;
+	int			error;
+
+	agi = XFS_BUF_TO_AGI(sc->sa.agi_bp);
+
+	/* Initialize new btree root. */
+	error = xrep_alloc_ag_block(sc, oinfo, &fsbno, resv);
+	if (error)
+		return error;
+	error = xrep_init_btblock(sc, fsbno, &bp, btnum, &xfs_inobt_buf_ops);
+	if (error)
+		return error;
+
+	switch (btnum) {
+	case XFS_BTNUM_INOi:
+		agi->agi_root = cpu_to_be32(XFS_FSB_TO_AGBNO(mp, fsbno));
+		agi->agi_level = cpu_to_be32(1);
+		*log_flags |= XFS_AGI_ROOT | XFS_AGI_LEVEL;
+		break;
+	case XFS_BTNUM_FINOi:
+		agi->agi_free_root = cpu_to_be32(XFS_FSB_TO_AGBNO(mp, fsbno));
+		agi->agi_free_level = cpu_to_be32(1);
+		*log_flags |= XFS_AGI_FREE_ROOT | XFS_AGI_FREE_LEVEL;
+		break;
+	default:
+		ASSERT(0);
+	}
+
+	return 0;
+}
+
+/* Initialize new inobt/finobt roots and implant them into the AGI. */
+STATIC int
+xrep_ibt_reset_btrees(
+	struct xfs_scrub	*sc,
+	struct xfs_owner_info	*oinfo,
+	int			*log_flags)
+{
+	enum xfs_ag_resv_type	resv;
+	int			error;
+
+	resv = XFS_AG_RESV_NONE;
+	error = xrep_ibt_reset_btree(sc, XFS_BTNUM_INO, oinfo, XFS_AG_RESV_NONE,
+			log_flags);
+	if (error || !xfs_sb_version_hasfinobt(&sc->mp->m_sb))
+		return error;
+
+	/*
+	 * If we made a per-AG reservation for the finobt then we must account
+	 * the new block correctly.
+	 */
+	if (!sc->mp->m_inotbt_nores)
+		resv = XFS_AG_RESV_METADATA;
+	return xrep_ibt_reset_btree(sc, XFS_BTNUM_FINO, oinfo, resv, log_flags);
+}
+
+/* Build new inode btrees and dispose of the old ones. */
+STATIC int
+xrep_ibt_rebuild_trees(
+	struct xfs_scrub	*sc,
+	struct list_head	*inode_records,
+	struct xfs_owner_info	*oinfo,
+	struct xfs_bitmap	*old_iallocbt_blocks)
+{
+	struct xrep_ibt_extent	*rie;
+	struct xrep_ibt_extent	*n;
+	int			error;
+
+	/* Add all records. */
+	list_sort(NULL, inode_records, xrep_ibt_extent_cmp);
+	list_for_each_entry_safe(rie, n, inode_records, list) {
+		error = xrep_ibt_insert_rec(sc, rie);
+		if (error)
+			return error;
+
+		list_del(&rie->list);
+		kmem_free(rie);
+	}
+
+	/* Free the old inode btree blocks if they're not in use. */
+	return xrep_reap_extents(sc, old_iallocbt_blocks, oinfo,
+			XFS_AG_RESV_NONE);
+}
+
+/*
+ * Make our new inode btree roots permanent so that we can start re-adding
+ * inode records back into the AG.
+ */
+STATIC int
+xrep_ibt_commit_new(
+	struct xfs_scrub	*sc,
+	struct xfs_bitmap	*old_iallocbt_blocks,
+	int			log_flags)
+{
+	int			error;
+
+	xfs_ialloc_log_agi(sc->tp, sc->sa.agi_bp, log_flags);
+
+	/* Invalidate all the inobt/finobt blocks in btlist. */
+	error = xrep_invalidate_blocks(sc, old_iallocbt_blocks);
+	if (error)
+		return error;
+	error = xrep_roll_ag_trans(sc);
+	if (error)
+		return error;
+
+	/*
+	 * Now that we've succeeded, mark the incore state valid again.  If the
+	 * finobt is enabled, make sure we reinitialize the per-AG reservations
+	 * when we're done.
+	 */
+	sc->sa.pag->pagi_init = 1;
+	if (xfs_sb_version_hasfinobt(&sc->mp->m_sb))
+		sc->reset_perag_resv = true;
+	return 0;
+}
+
+/* Repair both inode btrees. */
+int
+xrep_iallocbt(
+	struct xfs_scrub	*sc)
+{
+	struct xfs_owner_info	oinfo;
+	struct list_head	inode_records;
+	struct xfs_bitmap	old_iallocbt_blocks;
+	struct xfs_mount	*mp = sc->mp;
+	int			log_flags = 0;
+	int			error = 0;
+
+	/* We require the rmapbt to rebuild anything. */
+	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
+		return -EOPNOTSUPP;
+
+	xchk_perag_get(sc->mp, &sc->sa);
+
+	/* Collect the inode records and find the old btree blocks. */
+	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_INOBT);
+	INIT_LIST_HEAD(&inode_records);
+	xfs_bitmap_init(&old_iallocbt_blocks);
+	error = xrep_ibt_find_inodes(sc, &inode_records, &old_iallocbt_blocks);
+	if (error)
+		goto out;
+
+	/*
+	 * Blow out the old inode btrees.  This is the point at which
+	 * we are no longer able to bail out gracefully.
+	 */
+	error = xrep_ibt_reset_counters(sc, &inode_records, &log_flags);
+	if (error)
+		goto out;
+	error = xrep_ibt_reset_btrees(sc, &oinfo, &log_flags);
+	if (error)
+		goto out;
+	error = xrep_ibt_commit_new(sc, &old_iallocbt_blocks, log_flags);
+	if (error)
+		goto out;
+
+	/* Now rebuild the inode information. */
+	error = xrep_ibt_rebuild_trees(sc, &inode_records, &oinfo,
+			&old_iallocbt_blocks);
+	if (error)
+		goto out;
+out:
+	xrep_ibt_cancel_inorecs(&inode_records);
+	xfs_bitmap_destroy(&old_iallocbt_blocks);
+	return error;
+}
diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
index 17cf48564390..a44deb6f06ab 100644
--- a/fs/xfs/scrub/repair.c
+++ b/fs/xfs/scrub/repair.c
@@ -880,3 +880,23 @@ xrep_ino_dqattach(
 
 	return error;
 }
+
+/*
+ * Reinitialize the per-AG block reservation for the AG we just fixed.
+ */
+int
+xrep_reset_perag_resv(
+	struct xfs_scrub	*sc)
+{
+	int			error;
+
+	ASSERT(sc->ops->type == ST_PERAG);
+	ASSERT(sc->tp);
+
+	error = xfs_ag_resv_free(sc->sa.pag);
+	if (error)
+		goto out;
+	error = xfs_ag_resv_init(sc->sa.pag, sc->tp);
+out:
+	return error;
+}
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index bc1a5f1cbcdc..0cc53dee3228 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -53,6 +53,7 @@ int xrep_find_ag_btree_roots(struct xfs_scrub *sc, struct xfs_buf *agf_bp,
 		struct xrep_find_ag_btree *btree_info, struct xfs_buf *agfl_bp);
 void xrep_force_quotacheck(struct xfs_scrub *sc, uint dqtype);
 int xrep_ino_dqattach(struct xfs_scrub *sc);
+int xrep_reset_perag_resv(struct xfs_scrub *sc);
 
 /* Metadata repairers */
 
@@ -62,6 +63,7 @@ int xrep_agf(struct xfs_scrub *sc);
 int xrep_agfl(struct xfs_scrub *sc);
 int xrep_agi(struct xfs_scrub *sc);
 int xrep_allocbt(struct xfs_scrub *sc);
+int xrep_iallocbt(struct xfs_scrub *sc);
 
 #else
 
@@ -83,12 +85,21 @@ xrep_calc_ag_resblks(
 	return 0;
 }
 
+static inline int
+xrep_reset_perag_resv(
+	struct xfs_scrub	*sc)
+{
+	ASSERT(0);
+	return -EOPNOTSUPP;
+}
+
 #define xrep_probe			xrep_notsupported
 #define xrep_superblock			xrep_notsupported
 #define xrep_agf			xrep_notsupported
 #define xrep_agfl			xrep_notsupported
 #define xrep_agi			xrep_notsupported
 #define xrep_allocbt			xrep_notsupported
+#define xrep_iallocbt			xrep_notsupported
 
 #endif /* CONFIG_XFS_ONLINE_REPAIR */
 
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 2133a3199372..631b0b06db99 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -244,14 +244,14 @@ static const struct xchk_meta_ops meta_scrub_ops[] = {
 		.type	= ST_PERAG,
 		.setup	= xchk_setup_ag_iallocbt,
 		.scrub	= xchk_inobt,
-		.repair	= xrep_notsupported,
+		.repair	= xrep_iallocbt,
 	},
 	[XFS_SCRUB_TYPE_FINOBT] = {	/* finobt */
 		.type	= ST_PERAG,
 		.setup	= xchk_setup_ag_iallocbt,
 		.scrub	= xchk_finobt,
 		.has	= xfs_sb_version_hasfinobt,
-		.repair	= xrep_notsupported,
+		.repair	= xrep_iallocbt,
 	},
 	[XFS_SCRUB_TYPE_RMAPBT] = {	/* rmapbt */
 		.type	= ST_PERAG,
diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h
index af323b229c4b..762db46fd696 100644
--- a/fs/xfs/scrub/scrub.h
+++ b/fs/xfs/scrub/scrub.h
@@ -64,6 +64,7 @@ struct xfs_scrub {
 	uint				ilock_flags;
 	bool				try_harder;
 	bool				has_quotaofflock;
+	bool				reset_perag_resv;
 
 	/* State tracking for single-AG operations. */
 	struct xchk_ag			sa;
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index 26bd5dc68efe..9126dc66f726 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -552,7 +552,7 @@ DEFINE_EVENT(xrep_rmap_class, name, \
 		 uint64_t owner, uint64_t offset, unsigned int flags), \
 	TP_ARGS(mp, agno, agbno, len, owner, offset, flags))
 DEFINE_REPAIR_RMAP_EVENT(xrep_abt_walk_rmap);
-DEFINE_REPAIR_RMAP_EVENT(xrep_ialloc_extent_fn);
+DEFINE_REPAIR_RMAP_EVENT(xrep_ibt_walk_rmap);
 DEFINE_REPAIR_RMAP_EVENT(xrep_rmap_extent_fn);
 DEFINE_REPAIR_RMAP_EVENT(xrep_bmap_extent_fn);
 
@@ -700,7 +700,7 @@ TRACE_EVENT(xrep_reset_counters,
 		  MAJOR(__entry->dev), MINOR(__entry->dev))
 )
 
-TRACE_EVENT(xrep_ialloc_insert,
+TRACE_EVENT(xrep_ibt_insert,
 	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
 		 xfs_agino_t startino, uint16_t holemask, uint8_t count,
 		 uint8_t freecount, uint64_t freemask),



* [PATCH 07/14] xfs: repair refcount btrees
  2018-07-30  5:47 [PATCH v17.1 00/14] xfs-4.19: online repair support Darrick J. Wong
                   ` (5 preceding siblings ...)
  2018-07-30  5:48 ` [PATCH 06/14] xfs: repair inode btrees Darrick J. Wong
@ 2018-07-30  5:48 ` Darrick J. Wong
  2018-07-30  5:48 ` [PATCH 08/14] xfs: repair inode records Darrick J. Wong
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 53+ messages in thread
From: Darrick J. Wong @ 2018-07-30  5:48 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs, bfoster, david, allison.henderson

From: Darrick J. Wong <darrick.wong@oracle.com>

Reconstruct the refcount data from the rmap btree.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
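
The sweep described in the header comment of refcount_repair.c can be
sketched in standalone userspace C, roughly as below.  This is only an
illustration under stated assumptions, not the code in this patch:
struct rmap, struct refc, next_change(), build_refcounts() and the
fixed-size BAG_MAX array are names invented for the sketch.  It assumes
the input is already sorted by start block and that every length is
nonzero, and it ignores CoW staging extents, owner filtering, and
transaction rolling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical simplified records; not the kernel's irec structures. */
struct rmap { uint32_t start; uint32_t len; };
struct refc { uint32_t start; uint32_t len; uint32_t refcount; };

#define BAG_MAX		64		/* arbitrary limit for the sketch */
#define NO_BLOCK	UINT32_MAX

static uint32_t rmap_end(const struct rmap *r)
{
	return r->start + r->len;
}

/* Next block at which the bag size can change. */
static uint32_t next_change(const struct rmap *rmaps, size_t nr, size_t next,
			    const struct rmap *bag, size_t bag_sz)
{
	uint32_t np = next < nr ? rmaps[next].start : NO_BLOCK;
	size_t i;

	for (i = 0; i < bag_sz; i++)
		if (rmap_end(&bag[i]) < np)
			np = rmap_end(&bag[i]);
	return np;
}

/*
 * Emit one record per run of blocks covered by two or more rmaps.
 * Returns the number of records written, or -1 if the bag or the
 * output array would overflow.
 */
static int build_refcounts(const struct rmap *rmaps, size_t nr,
			   struct refc *out, size_t out_max)
{
	struct rmap bag[BAG_MAX];
	size_t bag_sz = 0, old_sz, next = 0, nout = 0, i;
	uint32_t cbno, np;

	while (next < nr) {
		/* Push every rmap that starts where this run starts. */
		cbno = rmaps[next].start;
		while (next < nr && rmaps[next].start == cbno) {
			if (bag_sz == BAG_MAX)
				return -1;
			bag[bag_sz++] = rmaps[next++];
		}
		np = next_change(rmaps, nr, next, bag, bag_sz);
		old_sz = bag_sz;

		while (bag_sz) {
			/* Pop every rmap that ends at np. */
			for (i = 0; i < bag_sz; ) {
				if (rmap_end(&bag[i]) == np)
					bag[i] = bag[--bag_sz];
				else
					i++;
			}
			/* Push every rmap that starts at np. */
			while (next < nr && rmaps[next].start == np) {
				if (bag_sz == BAG_MAX)
					return -1;
				bag[bag_sz++] = rmaps[next++];
			}
			/* The level changed: emit [cbno, np) if shared. */
			if (bag_sz != old_sz) {
				if (old_sz > 1) {
					if (nout == out_max)
						return -1;
					out[nout].start = cbno;
					out[nout].len = np - cbno;
					out[nout].refcount = (uint32_t)old_sz;
					nout++;
				}
				cbno = np;
			}
			if (bag_sz == 0)
				break;
			old_sz = bag_sz;
			np = next_change(rmaps, nr, next, bag, bag_sz);
		}
	}
	return (int)nout;
}
```

Feeding it the rmaps (0,10), (2,4), (4,4), (20,5), (20,5) should yield
four records: (2,2,2), (4,2,3), (6,2,2) and (20,5,2).  Blocks with only
one owner produce no record, matching the note that refcount < 2 is not
stored.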
 fs/xfs/Makefile                |    1 
 fs/xfs/scrub/refcount_repair.c |  586 ++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/repair.h          |    2 
 fs/xfs/scrub/scrub.c           |    2 
 4 files changed, 590 insertions(+), 1 deletion(-)
 create mode 100644 fs/xfs/scrub/refcount_repair.c


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index af1dc9aeb1a7..4ca97e026f94 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -167,6 +167,7 @@ xfs-y				+= $(addprefix scrub/, \
 				   alloc_repair.o \
 				   bitmap.o \
 				   ialloc_repair.o \
+				   refcount_repair.o \
 				   repair.o \
 				   )
 endif
diff --git a/fs/xfs/scrub/refcount_repair.c b/fs/xfs/scrub/refcount_repair.c
new file mode 100644
index 000000000000..f6076083dd94
--- /dev/null
+++ b/fs/xfs/scrub/refcount_repair.c
@@ -0,0 +1,586 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Copyright (C) 2018 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_btree.h"
+#include "xfs_bit.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_sb.h"
+#include "xfs_itable.h"
+#include "xfs_alloc.h"
+#include "xfs_ialloc.h"
+#include "xfs_rmap.h"
+#include "xfs_rmap_btree.h"
+#include "xfs_refcount.h"
+#include "xfs_refcount_btree.h"
+#include "xfs_error.h"
+#include "scrub/xfs_scrub.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/btree.h"
+#include "scrub/trace.h"
+#include "scrub/repair.h"
+#include "scrub/bitmap.h"
+
+/*
+ * Rebuilding the Reference Count Btree
+ * ====================================
+ *
+ * This algorithm is "borrowed" from xfs_repair.  Imagine the rmap
+ * entries as rectangles representing extents of physical blocks, and
+ * that the rectangles can be laid down to allow them to overlap each
+ * other; then we know that we must emit a refcnt btree entry wherever
+ * the amount of overlap changes, i.e. the emission stimulus is
+ * level-triggered:
+ *
+ *                 -    ---
+ *       --      ----- ----   ---        ------
+ * --   ----     ----------- ----     ---------
+ * -------------------------------- -----------
+ * ^ ^  ^^ ^^    ^ ^^ ^^^  ^^^^  ^ ^^ ^  ^     ^
+ * 2 1  23 21    3 43 234  2123  1 01 2  3     0
+ *
+ * For our purposes, an rmap is a tuple (startblock, len, fileoff, owner).
+ *
+ * Note that in the actual refcnt btree we don't store the refcount < 2
+ * cases because the bnobt tells us which blocks are free; single-use
+ * blocks aren't recorded in the bnobt or the refcntbt.  If the rmapbt
+ * supports storing multiple entries covering a given block we could
+ * theoretically dispense with the refcntbt and simply count rmaps, but
+ * that's inefficient in the (hot) write path, so we'll take the cost of
+ * the extra tree to save time.  Also there's no guarantee that rmap
+ * will be enabled.
+ *
+ * Given an array of rmaps sorted by physical block number, a starting
+ * physical block (sp), a bag to hold rmaps that cover sp, and the next
+ * physical block where the level changes (np), we can reconstruct the
+ * refcount btree as follows:
+ *
+ * While there are still unprocessed rmaps in the array,
+ *  - Set sp to the physical block (pblk) of the next unprocessed rmap.
+ *  - Add to the bag all rmaps in the array where startblock == sp.
+ *  - Set np to the physical block where the bag size will change.  This
+ *    is the minimum of (the pblk of the next unprocessed rmap) and
+ *    (startblock + len of each rmap in the bag).
+ *  - Record the bag size as old_bag_size.
+ *
+ *  - While the bag isn't empty,
+ *     - Remove from the bag all rmaps where startblock + len == np.
+ *     - Add to the bag all rmaps in the array where startblock == np.
+ *     - If the bag size isn't old_bag_size, store the refcount entry
+ *       (sp, np - sp, old_bag_size) in the refcnt btree.
+ *     - If the bag is empty, break out of the inner loop.
+ *     - Set old_bag_size to the bag size
+ *     - Set sp = np.
+ *     - Set np to the physical block where the bag size will change.
+ *       This is the minimum of (the pblk of the next unprocessed rmap)
+ *       and (startblock + len of each rmap in the bag).
+ *
+ * Like all the other repairers, we make a list of all the refcount
+ * records we need, then reinitialize the refcount btree root and
+ * insert all the records.
+ */
+
+struct xrep_refc_rmap {
+	struct list_head	list;
+	struct xfs_rmap_irec	rmap;
+};
+
+struct xrep_refc_extent {
+	struct list_head		list;
+	struct xfs_refcount_irec	refc;
+};
+
+struct xrep_refc {
+	struct list_head	rmap_bag;  /* rmaps we're tracking */
+	struct list_head	rmap_idle; /* idle rmaps */
+	struct list_head	*extlist;  /* refcount extents */
+	struct xfs_bitmap	*btlist;   /* old refcountbt blocks */
+	struct xfs_scrub	*sc;
+	unsigned long		nr_records;/* nr refcount extents */
+	xfs_extlen_t		btblocks;  /* # of refcountbt blocks */
+};
+
+/* Grab the next record from the rmapbt. */
+STATIC int
+xrep_refc_next_rmap(
+	struct xfs_btree_cur	*cur,
+	struct xrep_refc	*rr,
+	struct xfs_rmap_irec	*rec,
+	bool			*have_rec)
+{
+	struct xfs_rmap_irec	rmap;
+	struct xfs_mount	*mp = cur->bc_mp;
+	struct xrep_refc_extent	*rre;
+	xfs_fsblock_t		fsbno;
+	int			have_gt;
+	int			error = 0;
+
+	*have_rec = false;
+	/*
+	 * Loop through the remaining rmaps.  Remember CoW staging
+	 * extents and the refcountbt blocks from the old tree for later
+	 * disposal.  We can only share written data fork extents, so
+	 * keep looping until we find an rmap for one.
+	 */
+	do {
+		if (xchk_should_terminate(rr->sc, &error))
+			goto out_error;
+
+		error = xfs_btree_increment(cur, 0, &have_gt);
+		if (error)
+			goto out_error;
+		if (!have_gt)
+			return 0;
+
+		error = xfs_rmap_get_rec(cur, &rmap, &have_gt);
+		if (error)
+			goto out_error;
+		XFS_WANT_CORRUPTED_GOTO(mp, have_gt == 1, out_error);
+
+		if (rmap.rm_owner == XFS_RMAP_OWN_COW) {
+			/* Pass CoW staging extents right through. */
+			rre = kmem_alloc(sizeof(struct xrep_refc_extent),
+					KM_MAYFAIL);
+			if (!rre)
+				return -ENOMEM;
+
+			INIT_LIST_HEAD(&rre->list);
+			rre->refc.rc_startblock = rmap.rm_startblock +
+					XFS_REFC_COW_START;
+			rre->refc.rc_blockcount = rmap.rm_blockcount;
+			rre->refc.rc_refcount = 1;
+			list_add_tail(&rre->list, rr->extlist);
+		} else if (rmap.rm_owner == XFS_RMAP_OWN_REFC) {
+			/* refcountbt block, dump it when we're done. */
+			rr->btblocks += rmap.rm_blockcount;
+			fsbno = XFS_AGB_TO_FSB(cur->bc_mp,
+					cur->bc_private.a.agno,
+					rmap.rm_startblock);
+			error = xfs_bitmap_set(rr->btlist, fsbno,
+					rmap.rm_blockcount);
+			if (error)
+				goto out_error;
+		}
+	} while (XFS_RMAP_NON_INODE_OWNER(rmap.rm_owner) ||
+		 xfs_internal_inum(mp, rmap.rm_owner) ||
+		 (rmap.rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK |
+				   XFS_RMAP_UNWRITTEN)));
+
+	*rec = rmap;
+	*have_rec = true;
+	return 0;
+
+out_error:
+	return error;
+}
+
+/* Recycle an idle rmap or allocate a new one. */
+static struct xrep_refc_rmap *
+xrep_refc_get_rmap(
+	struct xrep_refc	*rr)
+{
+	struct xrep_refc_rmap	*rrm;
+
+	if (list_empty(&rr->rmap_idle)) {
+		rrm = kmem_alloc(sizeof(struct xrep_refc_rmap), KM_MAYFAIL);
+		if (!rrm)
+			return NULL;
+		INIT_LIST_HEAD(&rrm->list);
+		return rrm;
+	}
+
+	rrm = list_first_entry(&rr->rmap_idle, struct xrep_refc_rmap, list);
+	list_del_init(&rrm->list);
+	return rrm;
+}
+
+/* Compare two btree extents. */
+static int
+xrep_refcount_extent_cmp(
+	void			*priv,
+	struct list_head	*a,
+	struct list_head	*b)
+{
+	struct xrep_refc_extent	*ap;
+	struct xrep_refc_extent	*bp;
+
+	ap = container_of(a, struct xrep_refc_extent, list);
+	bp = container_of(b, struct xrep_refc_extent, list);
+
+	if (ap->refc.rc_startblock > bp->refc.rc_startblock)
+		return 1;
+	else if (ap->refc.rc_startblock < bp->refc.rc_startblock)
+		return -1;
+	return 0;
+}
+
+/* Record a reference count extent. */
+STATIC int
+xrep_refc_new_refc(
+	struct xfs_scrub		*sc,
+	struct xrep_refc		*rr,
+	xfs_agblock_t			agbno,
+	xfs_extlen_t			len,
+	xfs_nlink_t			refcount)
+{
+	struct xrep_refc_extent		*rre;
+	struct xfs_refcount_irec	irec;
+
+	irec.rc_startblock = agbno;
+	irec.rc_blockcount = len;
+	irec.rc_refcount = refcount;
+
+	trace_xrep_refcount_extent_fn(sc->mp, sc->sa.agno, &irec);
+
+	rre = kmem_alloc(sizeof(struct xrep_refc_extent), KM_MAYFAIL);
+	if (!rre)
+		return -ENOMEM;
+	INIT_LIST_HEAD(&rre->list);
+	rre->refc = irec;
+	list_add_tail(&rre->list, rr->extlist);
+
+	return 0;
+}
+
+/* Iterate all the rmap records to generate reference count data. */
+#define RMAP_NEXT(r)	((r).rm_startblock + (r).rm_blockcount)
+STATIC int
+xrep_refc_generate_refcounts(
+	struct xfs_scrub	*sc,
+	struct xrep_refc	*rr)
+{
+	struct xfs_rmap_irec	rmap;
+	struct xfs_btree_cur	*cur;
+	struct xrep_refc_rmap	*rrm;
+	struct xrep_refc_rmap	*n;
+	xfs_agblock_t		sbno;
+	xfs_agblock_t		cbno;
+	xfs_agblock_t		nbno;
+	size_t			old_stack_sz;
+	size_t			stack_sz = 0;
+	bool			have;
+	int			have_gt;
+	int			error;
+
+	/* Start the rmapbt cursor to the left of all records. */
+	cur = xfs_rmapbt_init_cursor(sc->mp, sc->tp, sc->sa.agf_bp,
+			sc->sa.agno);
+	error = xfs_rmap_lookup_le(cur, 0, 0, 0, 0, 0, &have_gt);
+	if (error)
+		goto out;
+	ASSERT(have_gt == 0);
+
+	/* Process reverse mappings into refcount data. */
+	while (xfs_btree_has_more_records(cur)) {
+		/* Push all rmaps with pblk == sbno onto the stack */
+		error = xrep_refc_next_rmap(cur, rr, &rmap, &have);
+		if (error)
+			goto out;
+		if (!have)
+			break;
+		sbno = cbno = rmap.rm_startblock;
+		while (have && rmap.rm_startblock == sbno) {
+			rrm = xrep_refc_get_rmap(rr);
+			if (!rrm)
+				goto out;
+			rrm->rmap = rmap;
+			list_add_tail(&rrm->list, &rr->rmap_bag);
+			stack_sz++;
+			error = xrep_refc_next_rmap(cur, rr, &rmap, &have);
+			if (error)
+				goto out;
+		}
+		error = xfs_btree_decrement(cur, 0, &have_gt);
+		if (error)
+			goto out;
+		XFS_WANT_CORRUPTED_GOTO(sc->mp, have_gt, out);
+
+		/* Set nbno to the bno of the next refcount change */
+		nbno = have ? rmap.rm_startblock : NULLAGBLOCK;
+		list_for_each_entry(rrm, &rr->rmap_bag, list)
+			nbno = min_t(xfs_agblock_t, nbno, RMAP_NEXT(rrm->rmap));
+
+		ASSERT(nbno > sbno);
+		old_stack_sz = stack_sz;
+
+		/* While stack isn't empty... */
+		while (stack_sz) {
+			/* Pop all rmaps that end at nbno */
+			list_for_each_entry_safe(rrm, n, &rr->rmap_bag, list) {
+				if (RMAP_NEXT(rrm->rmap) != nbno)
+					continue;
+				stack_sz--;
+				list_move(&rrm->list, &rr->rmap_idle);
+			}
+
+			/* Push array items that start at nbno */
+			error = xrep_refc_next_rmap(cur, rr, &rmap, &have);
+			if (error)
+				goto out;
+			while (have && rmap.rm_startblock == nbno) {
+				rrm = xrep_refc_get_rmap(rr);
+				if (!rrm)
+					goto out;
+				rrm->rmap = rmap;
+				list_add_tail(&rrm->list, &rr->rmap_bag);
+				stack_sz++;
+				error = xrep_refc_next_rmap(cur, rr, &rmap,
+						&have);
+				if (error)
+					goto out;
+			}
+			error = xfs_btree_decrement(cur, 0, &have_gt);
+			if (error)
+				goto out;
+			XFS_WANT_CORRUPTED_GOTO(sc->mp, have_gt, out);
+
+			/* Emit refcount if necessary */
+			ASSERT(nbno > cbno);
+			if (stack_sz != old_stack_sz) {
+				if (old_stack_sz > 1) {
+					error = xrep_refc_new_refc(sc, rr, cbno,
+							nbno - cbno,
+							old_stack_sz);
+					if (error)
+						goto out;
+					rr->nr_records++;
+				}
+				cbno = nbno;
+			}
+
+			/* Stack empty, go find the next rmap */
+			if (stack_sz == 0)
+				break;
+			old_stack_sz = stack_sz;
+			sbno = nbno;
+
+			/* Set nbno to the bno of the next refcount change */
+			nbno = have ? rmap.rm_startblock : NULLAGBLOCK;
+			list_for_each_entry(rrm, &rr->rmap_bag, list)
+				nbno = min_t(xfs_agblock_t, nbno,
+						RMAP_NEXT(rrm->rmap));
+
+			ASSERT(nbno > sbno);
+		}
+	}
+
+	/* Free all the leftover rmap records. */
+	list_for_each_entry_safe(rrm, n, &rr->rmap_idle, list) {
+		list_del(&rrm->list);
+		kmem_free(rrm);
+	}
+
+	ASSERT(list_empty(&rr->rmap_bag));
+out:
+	xfs_btree_del_cursor(cur, error);
+	return error;
+}
+#undef RMAP_NEXT
+
+/*
+ * Generate all the reference counts for this AG and a list of the old
+ * refcount btree blocks.  Figure out if we have enough free space to
+ * reconstruct the refcount btree.  The caller must clean up the lists if
+ * anything goes wrong.
+ */
+STATIC int
+xrep_refc_find_refcounts(
+	struct xfs_scrub	*sc,
+	struct list_head	*refcount_records,
+	struct xfs_bitmap	*old_refcountbt_blocks)
+{
+	struct xrep_refc	rr;
+	struct xrep_refc_rmap	*rrm;
+	struct xrep_refc_rmap	*n;
+	struct xfs_mount	*mp = sc->mp;
+	int			error;
+
+	INIT_LIST_HEAD(&rr.rmap_bag);
+	INIT_LIST_HEAD(&rr.rmap_idle);
+	rr.extlist = refcount_records;
+	rr.btlist = old_refcountbt_blocks;
+	rr.btblocks = 0;
+	rr.sc = sc;
+	rr.nr_records = 0;
+
+	/* Generate all the refcount records. */
+	error = xrep_refc_generate_refcounts(sc, &rr);
+	if (error)
+		goto out;
+
+	/* Do we actually have enough space to do this? */
+	if (!xrep_ag_has_space(sc->sa.pag,
+			xfs_refcountbt_calc_size(mp, rr.nr_records),
+			XFS_AG_RESV_METADATA)) {
+		error = -ENOSPC;
+		goto out;
+	}
+
+out:
+	list_for_each_entry_safe(rrm, n, &rr.rmap_idle, list) {
+		list_del(&rrm->list);
+		kmem_free(rrm);
+	}
+	list_for_each_entry_safe(rrm, n, &rr.rmap_bag, list) {
+		list_del(&rrm->list);
+		kmem_free(rrm);
+	}
+	return error;
+}
+
+/* Initialize new refcountbt root and implant it into the AGF. */
+STATIC int
+xrep_refc_reset_btree(
+	struct xfs_scrub	*sc,
+	struct xfs_owner_info	*oinfo,
+	int			*log_flags)
+{
+	struct xfs_buf		*bp;
+	struct xfs_agf		*agf;
+	xfs_fsblock_t		btfsb;
+	int			error;
+
+	agf = XFS_BUF_TO_AGF(sc->sa.agf_bp);
+
+	/* Initialize a new refcountbt root. */
+	error = xrep_alloc_ag_block(sc, oinfo, &btfsb, XFS_AG_RESV_METADATA);
+	if (error)
+		return error;
+	error = xrep_init_btblock(sc, btfsb, &bp, XFS_BTNUM_REFC,
+			&xfs_refcountbt_buf_ops);
+	if (error)
+		return error;
+	agf->agf_refcount_root = cpu_to_be32(XFS_FSB_TO_AGBNO(sc->mp, btfsb));
+	agf->agf_refcount_level = cpu_to_be32(1);
+	agf->agf_refcount_blocks = cpu_to_be32(1);
+	*log_flags |= XFS_AGF_REFCOUNT_BLOCKS | XFS_AGF_REFCOUNT_ROOT |
+		      XFS_AGF_REFCOUNT_LEVEL;
+
+	return 0;
+}
+
+/* Build new refcount btree and dispose of the old one. */
+STATIC int
+xrep_refc_rebuild_tree(
+	struct xfs_scrub	*sc,
+	struct list_head	*refcount_records,
+	struct xfs_owner_info	*oinfo,
+	struct xfs_bitmap	*old_refcountbt_blocks)
+{
+	struct xrep_refc_extent	*rre;
+	struct xrep_refc_extent	*n;
+	struct xfs_mount	*mp = sc->mp;
+	struct xfs_btree_cur	*cur;
+	int			have_gt;
+	int			error;
+
+	/* Add all records. */
+	list_sort(NULL, refcount_records, xrep_refcount_extent_cmp);
+	list_for_each_entry_safe(rre, n, refcount_records, list) {
+		/* Insert into the refcountbt. */
+		cur = xfs_refcountbt_init_cursor(mp, sc->tp, sc->sa.agf_bp,
+				sc->sa.agno);
+		error = xfs_refcount_lookup_eq(cur, rre->refc.rc_startblock,
+				&have_gt);
+		if (error)
+			return error;
+		XFS_WANT_CORRUPTED_RETURN(mp, have_gt == 0);
+		error = xfs_refcount_insert(cur, &rre->refc, &have_gt);
+		if (error)
+			return error;
+		XFS_WANT_CORRUPTED_RETURN(mp, have_gt == 1);
+		xfs_btree_del_cursor(cur, error);
+		cur = NULL;
+
+		error = xrep_roll_ag_trans(sc);
+		if (error)
+			return error;
+
+		list_del(&rre->list);
+		kmem_free(rre);
+	}
+
+	/* Free the old refcountbt blocks if they're not in use. */
+	return xrep_reap_extents(sc, old_refcountbt_blocks, oinfo,
+			XFS_AG_RESV_METADATA);
+}
+
+/* Free every record in the refcount list. */
+STATIC void
+xrep_refc_cancel_recs(
+	struct list_head	*recs)
+{
+	struct xrep_refc_extent	*rre;
+	struct xrep_refc_extent	*n;
+
+	list_for_each_entry_safe(rre, n, recs, list) {
+		list_del(&rre->list);
+		kmem_free(rre);
+	}
+}
+
+/* Rebuild the refcount btree. */
+int
+xrep_refcountbt(
+	struct xfs_scrub	*sc)
+{
+	struct xfs_owner_info	oinfo;
+	struct list_head	refcount_records;
+	struct xfs_bitmap	old_refcountbt_blocks;
+	struct xfs_mount	*mp = sc->mp;
+	int			log_flags = 0;
+	int			error;
+
+	/* We require the rmapbt to rebuild anything. */
+	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
+		return -EOPNOTSUPP;
+
+	xchk_perag_get(sc->mp, &sc->sa);
+
+	/* Collect all reference counts. */
+	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_REFC);
+	INIT_LIST_HEAD(&refcount_records);
+	xfs_bitmap_init(&old_refcountbt_blocks);
+	error = xrep_refc_find_refcounts(sc, &refcount_records,
+			&old_refcountbt_blocks);
+	if (error)
+		goto out;
+
+	/*
+	 * Blow out the old refcount btree.  This is the point at which
+	 * we are no longer able to bail out gracefully.
+	 */
+	error = xrep_refc_reset_btree(sc, &oinfo, &log_flags);
+	if (error)
+		goto out;
+	xfs_alloc_log_agf(sc->tp, sc->sa.agf_bp, log_flags);
+
+	/* Invalidate all the old refcountbt blocks in btlist. */
+	error = xrep_invalidate_blocks(sc, &old_refcountbt_blocks);
+	if (error)
+		goto out;
+	error = xrep_roll_ag_trans(sc);
+	if (error)
+		goto out;
+
+	/* Now rebuild the refcount information. */
+	error = xrep_refc_rebuild_tree(sc, &refcount_records, &oinfo,
+			&old_refcountbt_blocks);
+	if (error)
+		goto out;
+	sc->reset_perag_resv = true;
+out:
+	xfs_bitmap_destroy(&old_refcountbt_blocks);
+	xrep_refc_cancel_recs(&refcount_records);
+	return error;
+}
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index 0cc53dee3228..da12c20376ae 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -64,6 +64,7 @@ int xrep_agfl(struct xfs_scrub *sc);
 int xrep_agi(struct xfs_scrub *sc);
 int xrep_allocbt(struct xfs_scrub *sc);
 int xrep_iallocbt(struct xfs_scrub *sc);
+int xrep_refcountbt(struct xfs_scrub *sc);
 
 #else
 
@@ -100,6 +101,7 @@ xrep_reset_perag_resv(
 #define xrep_agi			xrep_notsupported
 #define xrep_allocbt			xrep_notsupported
 #define xrep_iallocbt			xrep_notsupported
+#define xrep_refcountbt			xrep_notsupported
 
 #endif /* CONFIG_XFS_ONLINE_REPAIR */
 
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 631b0b06db99..843eafe0acef 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -265,7 +265,7 @@ static const struct xchk_meta_ops meta_scrub_ops[] = {
 		.setup	= xchk_setup_ag_refcountbt,
 		.scrub	= xchk_refcountbt,
 		.has	= xfs_sb_version_hasreflink,
-		.repair	= xrep_notsupported,
+		.repair	= xrep_refcountbt,
 	},
 	[XFS_SCRUB_TYPE_INODE] = {	/* inode record */
 		.type	= ST_INODE,



* [PATCH 08/14] xfs: repair inode records
  2018-07-30  5:47 [PATCH v17.1 00/14] xfs-4.19: online repair support Darrick J. Wong
                   ` (6 preceding siblings ...)
  2018-07-30  5:48 ` [PATCH 07/14] xfs: repair refcount btrees Darrick J. Wong
@ 2018-07-30  5:48 ` Darrick J. Wong
  2018-07-30  5:48 ` [PATCH 09/14] xfs: zap broken inode forks Darrick J. Wong
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 53+ messages in thread
From: Darrick J. Wong @ 2018-07-30  5:48 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs, bfoster, david, allison.henderson

From: Darrick J. Wong <darrick.wong@oracle.com>

Try to reinitialize corrupt inodes, or clear the reflink flag
if it's not needed.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/Makefile             |    1 
 fs/xfs/libxfs/xfs_format.h  |    3 
 fs/xfs/scrub/inode_repair.c |  659 +++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/repair.h       |    2 
 fs/xfs/scrub/scrub.c        |    2 
 5 files changed, 665 insertions(+), 2 deletions(-)
 create mode 100644 fs/xfs/scrub/inode_repair.c


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 4ca97e026f94..e01b5003d543 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -167,6 +167,7 @@ xfs-y				+= $(addprefix scrub/, \
 				   alloc_repair.o \
 				   bitmap.o \
 				   ialloc_repair.o \
+				   inode_repair.o \
 				   refcount_repair.o \
 				   repair.o \
 				   )
diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 059bc44c27e8..d4ebf1a4f3e8 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -973,7 +973,8 @@ typedef enum xfs_dinode_fmt {
 #define XFS_DFORK_APTR(dip)	\
 	(XFS_DFORK_DPTR(dip) + XFS_DFORK_BOFF(dip))
 #define XFS_DFORK_PTR(dip,w)	\
-	((w) == XFS_DATA_FORK ? XFS_DFORK_DPTR(dip) : XFS_DFORK_APTR(dip))
+	((void *)((w) == XFS_DATA_FORK ? XFS_DFORK_DPTR(dip) : \
+					 XFS_DFORK_APTR(dip)))
 
 #define XFS_DFORK_FORMAT(dip,w) \
 	((w) == XFS_DATA_FORK ? \
diff --git a/fs/xfs/scrub/inode_repair.c b/fs/xfs/scrub/inode_repair.c
new file mode 100644
index 000000000000..ec9d94d1e5d8
--- /dev/null
+++ b/fs/xfs/scrub/inode_repair.c
@@ -0,0 +1,659 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Copyright (C) 2018 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_btree.h"
+#include "xfs_bit.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_sb.h"
+#include "xfs_inode.h"
+#include "xfs_icache.h"
+#include "xfs_inode_buf.h"
+#include "xfs_inode_fork.h"
+#include "xfs_ialloc.h"
+#include "xfs_da_format.h"
+#include "xfs_reflink.h"
+#include "xfs_rmap.h"
+#include "xfs_bmap.h"
+#include "xfs_bmap_util.h"
+#include "xfs_dir2.h"
+#include "xfs_quota_defs.h"
+#include "scrub/xfs_scrub.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/btree.h"
+#include "scrub/trace.h"
+#include "scrub/repair.h"
+
+/*
+ * Inode Repair
+ *
+ * Roughly speaking, inode problems can be classified based on whether or not
+ * they trip the dinode verifiers.  If those trip, then we won't be able to
+ * _iget ourselves the inode.
+ *
+ * Therefore, the xrep_dinode_* functions fix anything that will cause the
+ * inode buffer verifier or the dinode verifier to fail.  The xrep_inode_*
+ * functions fix things on live incore inodes.
+ */
+
+/* Make sure this buffer can pass the inode buffer verifier. */
+STATIC void
+xrep_dinode_buf(
+	struct xfs_scrub	*sc,
+	struct xfs_buf		*bp)
+{
+	struct xfs_mount	*mp = sc->mp;
+	struct xfs_trans	*tp = sc->tp;
+	struct xfs_dinode	*dip;
+	xfs_agnumber_t		agno;
+	xfs_agino_t		agino;
+	int			ioff;
+	int			i;
+	int			ni;
+	bool			crc_ok;
+	bool			magic_ok;
+	bool			unlinked_ok;
+
+	ni = XFS_BB_TO_FSB(mp, bp->b_length) * mp->m_sb.sb_inopblock;
+	agno = xfs_daddr_to_agno(mp, XFS_BUF_ADDR(bp));
+	for (i = 0; i < ni; i++) {
+		ioff = i << mp->m_sb.sb_inodelog;
+		dip = xfs_buf_offset(bp, ioff);
+		agino = be32_to_cpu(dip->di_next_unlinked);
+
+		unlinked_ok = magic_ok = crc_ok = false;
+
+		if (agino == NULLAGINO || xfs_verify_agino(sc->mp, agno, agino))
+			unlinked_ok = true;
+
+		if (dip->di_magic == cpu_to_be16(XFS_DINODE_MAGIC) &&
+		    xfs_dinode_good_version(mp, dip->di_version))
+			magic_ok = true;
+
+		if (xfs_verify_cksum((char *)dip, mp->m_sb.sb_inodesize,
+				XFS_DINODE_CRC_OFF))
+			crc_ok = true;
+
+		if (magic_ok && unlinked_ok && crc_ok)
+			continue;
+
+		if (!magic_ok) {
+			dip->di_magic = cpu_to_be16(XFS_DINODE_MAGIC);
+			dip->di_version = 3;
+		}
+		if (!unlinked_ok)
+			dip->di_next_unlinked = cpu_to_be32(NULLAGINO);
+		xfs_dinode_calc_crc(mp, dip);
+		xfs_trans_buf_set_type(tp, bp, XFS_BLFT_DINO_BUF);
+		xfs_trans_log_buf(tp, bp, ioff, ioff + sizeof(*dip) - 1);
+	}
+}
+
+/* Reinitialize things that never change in an inode. */
+STATIC void
+xrep_dinode_header(
+	struct xfs_scrub	*sc,
+	struct xfs_dinode	*dip)
+{
+	dip->di_magic = cpu_to_be16(XFS_DINODE_MAGIC);
+	if (!xfs_dinode_good_version(sc->mp, dip->di_version))
+		dip->di_version = 3;
+	dip->di_ino = cpu_to_be64(sc->sm->sm_ino);
+	uuid_copy(&dip->di_uuid, &sc->mp->m_sb.sb_meta_uuid);
+	dip->di_gen = cpu_to_be32(sc->sm->sm_gen);
+}
+
+/*
+ * Turn di_mode into /something/ recognizable.
+ *
+ * XXX: Ideally we'd try to read data block 0 to see if it's a directory.
+ */
+STATIC void
+xrep_dinode_mode(
+	struct xfs_dinode	*dip)
+{
+	uint16_t		mode;
+
+	mode = be16_to_cpu(dip->di_mode);
+	if (mode == 0 || xfs_mode_to_ftype(mode) != XFS_DIR3_FT_UNKNOWN)
+		return;
+
+	/* bad mode, so we set it to a file that only root can read */
+	mode = S_IFREG;
+	dip->di_mode = cpu_to_be16(mode);
+	dip->di_uid = 0;
+	dip->di_gid = 0;
+}
+
+/* Fix any conflicting flags that the verifiers complain about. */
+STATIC void
+xrep_dinode_flags(
+	struct xfs_scrub	*sc,
+	struct xfs_dinode	*dip)
+{
+	struct xfs_mount	*mp = sc->mp;
+	uint64_t		flags2;
+	uint16_t		mode;
+	uint16_t		flags;
+
+	mode = be16_to_cpu(dip->di_mode);
+	flags = be16_to_cpu(dip->di_flags);
+	flags2 = be64_to_cpu(dip->di_flags2);
+
+	if (xfs_sb_version_hasreflink(&mp->m_sb) && S_ISREG(mode))
+		flags2 |= XFS_DIFLAG2_REFLINK;
+	else
+		flags2 &= ~(XFS_DIFLAG2_REFLINK | XFS_DIFLAG2_COWEXTSIZE);
+	if (flags & XFS_DIFLAG_REALTIME)
+		flags2 &= ~XFS_DIFLAG2_REFLINK;
+	if (flags2 & XFS_DIFLAG2_REFLINK)
+		flags2 &= ~XFS_DIFLAG2_DAX;
+	dip->di_flags = cpu_to_be16(flags);
+	dip->di_flags2 = cpu_to_be64(flags2);
+}
+
+/*
+ * Blow out symlink; now it points to the current dir.  We don't have to worry
+ * about incore state because this inode is failing the verifiers.
+ */
+STATIC void
+xrep_dinode_zap_symlink(
+	struct xfs_dinode	*dip)
+{
+	char			*p;
+
+	dip->di_format = XFS_DINODE_FMT_LOCAL;
+	dip->di_size = cpu_to_be64(1);
+	p = XFS_DFORK_PTR(dip, XFS_DATA_FORK);
+	*p = '.';
+}
+
+/*
+ * Blow out dir, make it point to the root.  In the future repair will
+ * reconstruct this directory for us.  Note that there's no in-core directory
+ * inode because the sf verifier tripped, so we don't have to worry about the
+ * dentry cache.
+ */
+STATIC void
+xrep_dinode_zap_dir(
+	struct xfs_mount		*mp,
+	struct xfs_dinode		*dip)
+{
+	const struct xfs_dir_ops	*ops;
+	struct xfs_dir2_sf_hdr		*sfp;
+	int				i8count;
+
+	dip->di_format = XFS_DINODE_FMT_LOCAL;
+	i8count = mp->m_sb.sb_rootino > XFS_DIR2_MAX_SHORT_INUM;
+	ops = xfs_dir_get_ops(mp, NULL);
+	sfp = XFS_DFORK_PTR(dip, XFS_DATA_FORK);
+	sfp->count = 0;
+	sfp->i8count = i8count;
+	ops->sf_put_parent_ino(sfp, mp->m_sb.sb_rootino);
+	dip->di_size = cpu_to_be64(xfs_dir2_sf_hdr_size(i8count));
+}
+
+/* Make sure we don't have a garbage file size. */
+STATIC void
+xrep_dinode_size(
+	struct xfs_mount	*mp,
+	struct xfs_dinode	*dip)
+{
+	uint64_t		size;
+	uint16_t		mode;
+
+	mode = be16_to_cpu(dip->di_mode);
+	size = be64_to_cpu(dip->di_size);
+	switch (mode & S_IFMT) {
+	case S_IFIFO:
+	case S_IFCHR:
+	case S_IFBLK:
+	case S_IFSOCK:
+		/* di_size can't be nonzero for special files */
+		dip->di_size = 0;
+		break;
+	case S_IFREG:
+		/* Regular files can't be larger than 2^63-1 bytes. */
+		dip->di_size = cpu_to_be64(size & ~(1ULL << 63));
+		break;
+	case S_IFLNK:
+		/*
+		 * Truncate ridiculously oversized symlinks.  If the size is
+		 * zero, reset it to point to the current directory.  Both of
+		 * these conditions trigger dinode verifier errors, so there
+		 * is no in-core state to reset.
+		 */
+		if (size > XFS_SYMLINK_MAXLEN)
+			dip->di_size = cpu_to_be64(XFS_SYMLINK_MAXLEN);
+		else if (size == 0)
+			xrep_dinode_zap_symlink(dip);
+		break;
+	case S_IFDIR:
+		/*
+		 * Directories can't have a size larger than 32G.  If the size
+		 * is zero, reset it to an empty directory.  Both of these
+		 * conditions trigger dinode verifier errors, so there is no
+		 * in-core state to reset.
+		 */
+		if (size > XFS_DIR2_SPACE_SIZE)
+			dip->di_size = cpu_to_be64(XFS_DIR2_SPACE_SIZE);
+		else if (size == 0)
+			xrep_dinode_zap_dir(mp, dip);
+		break;
+	}
+}
+
+/* Fix extent size hints. */
+STATIC void
+xrep_dinode_extsize_hints(
+	struct xfs_scrub	*sc,
+	struct xfs_dinode	*dip)
+{
+	struct xfs_mount	*mp = sc->mp;
+	uint64_t		flags2;
+	uint16_t		flags;
+	uint16_t		mode;
+	xfs_failaddr_t		fa;
+
+	mode = be16_to_cpu(dip->di_mode);
+	flags = be16_to_cpu(dip->di_flags);
+	flags2 = be64_to_cpu(dip->di_flags2);
+
+	fa = xfs_inode_validate_extsize(mp, be32_to_cpu(dip->di_extsize),
+			mode, flags);
+	if (fa) {
+		dip->di_extsize = 0;
+		dip->di_flags &= ~cpu_to_be16(XFS_DIFLAG_EXTSIZE |
+					      XFS_DIFLAG_EXTSZINHERIT);
+	}
+
+	if (dip->di_version < 3)
+		return;
+
+	fa = xfs_inode_validate_cowextsize(mp, be32_to_cpu(dip->di_cowextsize),
+			mode, flags, flags2);
+	if (fa) {
+		dip->di_cowextsize = 0;
+		dip->di_flags2 &= ~cpu_to_be64(XFS_DIFLAG2_COWEXTSIZE);
+	}
+}
+
+/* Inode didn't pass verifiers, so fix the raw buffer and retry iget. */
+STATIC int
+xrep_dinode_core(
+	struct xfs_scrub	*sc)
+{
+	struct xfs_imap		imap;
+	struct xfs_buf		*bp;
+	struct xfs_dinode	*dip;
+	xfs_ino_t		ino;
+	bool			inuse;
+	int			error;
+
+	/* Map & read inode. */
+	ino = sc->sm->sm_ino;
+	error = xfs_imap(sc->mp, sc->tp, ino, &imap, XFS_IGET_UNTRUSTED);
+	if (error)
+		return error;
+
+	error = xfs_trans_read_buf(sc->mp, sc->tp, sc->mp->m_ddev_targp,
+			imap.im_blkno, imap.im_len, XBF_UNMAPPED, &bp, NULL);
+	if (error)
+		return error;
+
+	/* Make absolutely sure this inode isn't in core. */
+	error = xfs_icache_inode_is_allocated(sc->mp, sc->tp, ino, &inuse);
+	if (error == 0) {
+		ASSERT(0);
+		return -EFSCORRUPTED;
+	}
+
+	/* Make sure we can pass the inode buffer verifier. */
+	xrep_dinode_buf(sc, bp);
+	bp->b_ops = &xfs_inode_buf_ops;
+
+	/* Fix everything the verifier will complain about. */
+	dip = xfs_buf_offset(bp, imap.im_boffset);
+	xrep_dinode_header(sc, dip);
+	xrep_dinode_mode(dip);
+	xrep_dinode_flags(sc, dip);
+	xrep_dinode_size(sc->mp, dip);
+	xrep_dinode_extsize_hints(sc, dip);
+
+	/* Write out the inode... */
+	xfs_dinode_calc_crc(sc->mp, dip);
+	xfs_trans_buf_set_type(sc->tp, bp, XFS_BLFT_DINO_BUF);
+	xfs_trans_log_buf(sc->tp, bp, imap.im_boffset,
+			imap.im_boffset + sc->mp->m_sb.sb_inodesize - 1);
+	error = xfs_trans_commit(sc->tp);
+	if (error)
+		return error;
+	sc->tp = NULL;
+
+	/* ...and reload it? */
+	error = xfs_iget(sc->mp, sc->tp, ino,
+			XFS_IGET_UNTRUSTED | XFS_IGET_DONTCACHE, 0, &sc->ip);
+	if (error)
+		return error;
+	sc->ilock_flags = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
+	xfs_ilock(sc->ip, sc->ilock_flags);
+	error = xchk_trans_alloc(sc, 0);
+	if (error)
+		return error;
+	sc->ilock_flags |= XFS_ILOCK_EXCL;
+	xfs_ilock(sc->ip, XFS_ILOCK_EXCL);
+
+	return 0;
+}
+
+/* Fix everything xfs_dinode_verify cares about. */
+STATIC int
+xrep_dinode_problems(
+	struct xfs_scrub	*sc)
+{
+	int			error;
+
+	error = xrep_dinode_core(sc);
+	if (error)
+		return error;
+
+	/* We had to fix a totally busted inode, schedule quotacheck. */
+	if (XFS_IS_UQUOTA_ON(sc->mp))
+		xrep_force_quotacheck(sc, XFS_DQ_USER);
+	if (XFS_IS_GQUOTA_ON(sc->mp))
+		xrep_force_quotacheck(sc, XFS_DQ_GROUP);
+	if (XFS_IS_PQUOTA_ON(sc->mp))
+		xrep_force_quotacheck(sc, XFS_DQ_PROJ);
+
+	return 0;
+}
+
+/*
+ * Fix problems that the verifiers don't care about.  In general these are
+ * errors that don't cause problems elsewhere in the kernel that we can easily
+ * detect, so we don't check them all that rigorously.
+ */
+
+/* Make sure block and extent counts are ok. */
+STATIC int
+xrep_inode_blockcounts(
+	struct xfs_scrub	*sc)
+{
+	xfs_filblks_t		count;
+	xfs_filblks_t		acount;
+	xfs_extnum_t		nextents;
+	int			error;
+
+	/* Set data fork counters from the data fork mappings. */
+	error = xfs_bmap_count_blocks(sc->tp, sc->ip, XFS_DATA_FORK,
+			&nextents, &count);
+	if (error)
+		return error;
+	if (XFS_IS_REALTIME_INODE(sc->ip)) {
+		if (count >= sc->mp->m_sb.sb_rblocks)
+			return -EFSCORRUPTED;
+	} else if (!xfs_sb_version_hasreflink(&sc->mp->m_sb)) {
+		if (count >= sc->mp->m_sb.sb_dblocks)
+			return -EFSCORRUPTED;
+	}
+	sc->ip->i_d.di_nextents = nextents;
+
+	/* Set attr fork counters from the attr fork mappings. */
+	error = xfs_bmap_count_blocks(sc->tp, sc->ip, XFS_ATTR_FORK,
+			&nextents, &acount);
+	if (error)
+		return error;
+	if (acount >= sc->mp->m_sb.sb_dblocks)
+		return -EFSCORRUPTED;
+	if (nextents >= (uint16_t)-1U)
+		return -EFSCORRUPTED;
+	sc->ip->i_d.di_anextents = nextents;
+
+	sc->ip->i_d.di_nblocks = count + acount;
+
+	/*
+	 * If we found attr fork extents but no attr fork root, zero the
+	 * attr fork extent count so that the attr fork repair will run.
+	 */
+	if (sc->ip->i_d.di_anextents != 0 && sc->ip->i_d.di_forkoff == 0)
+		sc->ip->i_d.di_anextents = 0;
+
+	return 0;
+}
+
+/* Check for invalid uid/gid.  Note that a -1U projid is allowed. */
+STATIC void
+xrep_inode_ids(
+	struct xfs_scrub	*sc)
+{
+	if (sc->ip->i_d.di_uid == -1U) {
+		sc->ip->i_d.di_uid = 0;
+		VFS_I(sc->ip)->i_mode &= ~(S_ISUID | S_ISGID);
+		if (XFS_IS_UQUOTA_ON(sc->mp))
+			xrep_force_quotacheck(sc, XFS_DQ_USER);
+	}
+
+	if (sc->ip->i_d.di_gid == -1U) {
+		sc->ip->i_d.di_gid = 0;
+		VFS_I(sc->ip)->i_mode &= ~(S_ISUID | S_ISGID);
+		if (XFS_IS_GQUOTA_ON(sc->mp))
+			xrep_force_quotacheck(sc, XFS_DQ_GROUP);
+	}
+}
+
+/* Nanosecond counters must be less than one billion. */
+STATIC void
+xrep_inode_timestamps(
+	struct xfs_inode	*ip)
+{
+	if ((unsigned long)VFS_I(ip)->i_atime.tv_nsec >= NSEC_PER_SEC)
+		VFS_I(ip)->i_atime.tv_nsec = 0;
+	if ((unsigned long)VFS_I(ip)->i_mtime.tv_nsec >= NSEC_PER_SEC)
+		VFS_I(ip)->i_mtime.tv_nsec = 0;
+	if ((unsigned long)VFS_I(ip)->i_ctime.tv_nsec >= NSEC_PER_SEC)
+		VFS_I(ip)->i_ctime.tv_nsec = 0;
+	if (ip->i_d.di_version > 2 &&
+	    (unsigned long)ip->i_d.di_crtime.t_nsec >= NSEC_PER_SEC)
+		ip->i_d.di_crtime.t_nsec = 0;
+}
+
+/* Fix inode flags that don't make sense together. */
+STATIC void
+xrep_inode_flags(
+	struct xfs_scrub	*sc)
+{
+	uint16_t		mode;
+
+	mode = VFS_I(sc->ip)->i_mode;
+
+	/* Clear junk flags */
+	if (sc->ip->i_d.di_flags & ~XFS_DIFLAG_ANY)
+		sc->ip->i_d.di_flags &= XFS_DIFLAG_ANY;
+
+	/* NEWRTBM only applies to realtime bitmaps */
+	if (sc->ip->i_ino == sc->mp->m_sb.sb_rbmino)
+		sc->ip->i_d.di_flags |= XFS_DIFLAG_NEWRTBM;
+	else
+		sc->ip->i_d.di_flags &= ~XFS_DIFLAG_NEWRTBM;
+
+	/* These only make sense for directories. */
+	if (!S_ISDIR(mode))
+		sc->ip->i_d.di_flags &= ~(XFS_DIFLAG_RTINHERIT |
+					  XFS_DIFLAG_EXTSZINHERIT |
+					  XFS_DIFLAG_PROJINHERIT |
+					  XFS_DIFLAG_NOSYMLINKS);
+
+	/* These only make sense for files. */
+	if (!S_ISREG(mode))
+		sc->ip->i_d.di_flags &= ~(XFS_DIFLAG_REALTIME |
+					  XFS_DIFLAG_EXTSIZE);
+
+	/* These only make sense for non-rt files. */
+	if (sc->ip->i_d.di_flags & XFS_DIFLAG_REALTIME)
+		sc->ip->i_d.di_flags &= ~XFS_DIFLAG_FILESTREAM;
+
+	/* Immutable and append only?  Drop the append. */
+	if ((sc->ip->i_d.di_flags & XFS_DIFLAG_IMMUTABLE) &&
+	    (sc->ip->i_d.di_flags & XFS_DIFLAG_APPEND))
+		sc->ip->i_d.di_flags &= ~XFS_DIFLAG_APPEND;
+
+	if (sc->ip->i_d.di_version < 3)
+		return;
+
+	/* Clear junk flags. */
+	if (sc->ip->i_d.di_flags2 & ~XFS_DIFLAG2_ANY)
+		sc->ip->i_d.di_flags2 &= XFS_DIFLAG2_ANY;
+
+	/* No reflink flag unless we support it and it's a file. */
+	if (!xfs_sb_version_hasreflink(&sc->mp->m_sb) ||
+	    !S_ISREG(mode))
+		sc->ip->i_d.di_flags2 &= ~XFS_DIFLAG2_REFLINK;
+
+	/* DAX only applies to files and dirs. */
+	if (!(S_ISREG(mode) || S_ISDIR(mode)))
+		sc->ip->i_d.di_flags2 &= ~XFS_DIFLAG2_DAX;
+
+	/* No reflink files on the realtime device. */
+	if (sc->ip->i_d.di_flags & XFS_DIFLAG_REALTIME)
+		sc->ip->i_d.di_flags2 &= ~XFS_DIFLAG2_REFLINK;
+
+	/* No mixing reflink and DAX yet. */
+	if (sc->ip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK)
+		sc->ip->i_d.di_flags2 &= ~XFS_DIFLAG2_DAX;
+}
+
+/*
+ * Fix size problems with block/node format directories.  If we fail to find
+ * the extent list, just bail out and let the bmapbtd repair functions clean
+ * up that mess.
+ */
+STATIC void
+xrep_inode_blockdir_size(
+	struct xfs_scrub	*sc)
+{
+	struct xfs_iext_cursor	icur;
+	struct xfs_bmbt_irec	got;
+	struct xfs_ifork	*ifp;
+	xfs_fileoff_t		off;
+	int			error;
+
+	/* Find the last block before 32G; this is the dir size. */
+	ifp = XFS_IFORK_PTR(sc->ip, XFS_DATA_FORK);
+	if (!(ifp->if_flags & XFS_IFEXTENTS)) {
+		error = xfs_iread_extents(sc->tp, sc->ip, XFS_DATA_FORK);
+		if (error)
+			return;
+	}
+
+	off = XFS_B_TO_FSB(sc->mp, XFS_DIR2_SPACE_SIZE);
+	if (!xfs_iext_lookup_extent_before(sc->ip, ifp, &off, &icur, &got)) {
+		/* zero-extents directory? */
+		return;
+	}
+
+	off = got.br_startoff + got.br_blockcount;
+	sc->ip->i_d.di_size = min_t(loff_t, XFS_DIR2_SPACE_SIZE,
+			XFS_FSB_TO_B(sc->mp, off));
+}
+
+/* Fix size problems with short format directories. */
+STATIC void
+xrep_inode_sfdir_size(
+	struct xfs_scrub	*sc)
+{
+	struct xfs_ifork	*ifp;
+
+	ifp = XFS_IFORK_PTR(sc->ip, XFS_DATA_FORK);
+	sc->ip->i_d.di_size = ifp->if_bytes;
+}
+
+/*
+ * Fix any irregularities in an inode's size now that we can iterate extent
+ * maps and access other regular inode data.
+ */
+STATIC void
+xrep_inode_size(
+	struct xfs_scrub	*sc)
+{
+	/*
+	 * Currently we only support fixing the size of directories, whether
+	 * local, extents, or btree format.  Files can be any size, and sizes
+	 * for the other special inode types are fixed by xrep_dinode_size.
+	 */
+	if (!S_ISDIR(VFS_I(sc->ip)->i_mode))
+		return;
+	switch (XFS_IFORK_FORMAT(sc->ip, XFS_DATA_FORK)) {
+	case XFS_DINODE_FMT_EXTENTS:
+	case XFS_DINODE_FMT_BTREE:
+		xrep_inode_blockdir_size(sc);
+		break;
+	case XFS_DINODE_FMT_LOCAL:
+		xrep_inode_sfdir_size(sc);
+		break;
+	}
+}
+
+/* Fix any irregularities in an inode that the verifiers don't catch. */
+STATIC int
+xrep_inode_problems(
+	struct xfs_scrub	*sc)
+{
+	int			error;
+
+	error = xrep_inode_blockcounts(sc);
+	if (error)
+		return error;
+	xrep_inode_timestamps(sc->ip);
+	xrep_inode_flags(sc);
+	xrep_inode_ids(sc);
+	xrep_inode_size(sc);
+	xfs_trans_log_inode(sc->tp, sc->ip, XFS_ILOG_CORE);
+	return xfs_trans_roll_inode(&sc->tp, sc->ip);
+}
+
+/* Repair an inode's fields. */
+int
+xrep_inode(
+	struct xfs_scrub	*sc)
+{
+	int			error = 0;
+
+	/*
+	 * No inode?  That means we failed the _iget verifiers.  Repair all
+	 * the things that the inode verifiers care about, then retry _iget.
+	 */
+	if (!sc->ip) {
+		error = xrep_dinode_problems(sc);
+		if (error)
+			goto out;
+	}
+
+	/* By this point we had better have a working incore inode. */
+	ASSERT(sc->ip);
+	xfs_trans_ijoin(sc->tp, sc->ip, 0);
+
+	/* If we found corruption of any kind, try to fix it. */
+	if ((sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT) ||
+	    (sc->sm->sm_flags & XFS_SCRUB_OFLAG_XCORRUPT)) {
+		error = xrep_inode_problems(sc);
+		if (error)
+			goto out;
+	}
+
+	/* See if we can clear the reflink flag. */
+	if (xfs_is_reflink_inode(sc->ip))
+		return xfs_reflink_clear_inode_flag(sc->ip, &sc->tp);
+
+out:
+	return error;
+}
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index da12c20376ae..20e449c7a0df 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -65,6 +65,7 @@ int xrep_agi(struct xfs_scrub *sc);
 int xrep_allocbt(struct xfs_scrub *sc);
 int xrep_iallocbt(struct xfs_scrub *sc);
 int xrep_refcountbt(struct xfs_scrub *sc);
+int xrep_inode(struct xfs_scrub *sc);
 
 #else
 
@@ -102,6 +103,7 @@ xrep_reset_perag_resv(
 #define xrep_allocbt			xrep_notsupported
 #define xrep_iallocbt			xrep_notsupported
 #define xrep_refcountbt			xrep_notsupported
+#define xrep_inode			xrep_notsupported
 
 #endif /* CONFIG_XFS_ONLINE_REPAIR */
 
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 843eafe0acef..ae922801808d 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -271,7 +271,7 @@ static const struct xchk_meta_ops meta_scrub_ops[] = {
 		.type	= ST_INODE,
 		.setup	= xchk_setup_inode,
 		.scrub	= xchk_inode,
-		.repair	= xrep_notsupported,
+		.repair	= xrep_inode,
 	},
 	[XFS_SCRUB_TYPE_BMBTD] = {	/* inode data fork */
 		.type	= ST_INODE,
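
For readers following the on-disk size fixes in xrep_dinode_size above, here
is a rough userspace model of the per-type rules.  It is only a sketch: the
enum, the model_* names, and the three constants are stand-ins invented for
illustration (the limits are assumed values for XFS_SYMLINK_MAXLEN,
XFS_DIR2_SPACE_SIZE, and an empty shortform directory with a 4-byte parent
pointer), and none of it is kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical file classes; the patch switches on (mode & S_IFMT). */
enum model_ftype {
	MODEL_FT_SPECIAL,	/* fifo, char/block device, socket */
	MODEL_FT_REG,
	MODEL_FT_SYMLINK,
	MODEL_FT_DIR,
};

/* Assumed limits, chosen to match what the patch comments describe. */
#define MODEL_SYMLINK_MAXLEN	1024ULL		/* assumed symlink limit */
#define MODEL_DIR_SPACE_SIZE	(1ULL << 35)	/* 32G directory limit */
#define MODEL_EMPTY_SFDIR_SIZE	6ULL		/* assumed empty sf dir size */

/* Return the size an inode of this type would be reset to. */
static uint64_t
model_sanitize_size(
	enum model_ftype	ftype,
	uint64_t		size)
{
	switch (ftype) {
	case MODEL_FT_SPECIAL:
		/* Special files carry no data, so the size must be zero. */
		return 0;
	case MODEL_FT_REG:
		/* Clear the top bit so the size fits in 2^63-1. */
		return size & ~(1ULL << 63);
	case MODEL_FT_SYMLINK:
		if (size > MODEL_SYMLINK_MAXLEN)
			return MODEL_SYMLINK_MAXLEN;
		/* A zapped symlink becomes the one-byte target ".". */
		return size == 0 ? 1 : size;
	case MODEL_FT_DIR:
		if (size > MODEL_DIR_SPACE_SIZE)
			return MODEL_DIR_SPACE_SIZE;
		/* A zapped directory becomes an empty shortform dir. */
		return size == 0 ? MODEL_EMPTY_SFDIR_SIZE : size;
	}
	return size;
}

/* Small usage example exercising each branch. */
static void
model_sanitize_size_demo(void)
{
	assert(model_sanitize_size(MODEL_FT_SPECIAL, 4096) == 0);
	assert(model_sanitize_size(MODEL_FT_REG, UINT64_MAX) ==
			(uint64_t)INT64_MAX);
	assert(model_sanitize_size(MODEL_FT_SYMLINK, 5000) ==
			MODEL_SYMLINK_MAXLEN);
	assert(model_sanitize_size(MODEL_FT_DIR, 0) ==
			MODEL_EMPTY_SFDIR_SIZE);
}
```

The model returns a value instead of editing a dinode in place, which keeps
it testable outside the kernel; the real function also rewrites the fork
contents when it zaps a symlink or directory.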



* [PATCH 09/14] xfs: zap broken inode forks
  2018-07-30  5:47 [PATCH v17.1 00/14] xfs-4.19: online repair support Darrick J. Wong
                   ` (7 preceding siblings ...)
  2018-07-30  5:48 ` [PATCH 08/14] xfs: repair inode records Darrick J. Wong
@ 2018-07-30  5:48 ` Darrick J. Wong
  2018-07-30  5:49 ` [PATCH 10/14] xfs: repair inode block maps Darrick J. Wong
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 53+ messages in thread
From: Darrick J. Wong @ 2018-07-30  5:48 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs, bfoster, david, allison.henderson

From: Darrick J. Wong <darrick.wong@oracle.com>

Determine if inode fork damage is responsible for the inode being unable
to pass the ifork verifiers in xfs_iget and zap the fork contents if
this is true.  Once this is done the fork will be empty but we'll be
able to construct an in-core inode, and a subsequent call to the inode
fork repair ioctl will search the rmapbt to rebuild the records that
were in the fork.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_attr_leaf.c |   32 ++-
 fs/xfs/libxfs/xfs_attr_leaf.h |    2 
 fs/xfs/libxfs/xfs_bmap.c      |   21 ++
 fs/xfs/libxfs/xfs_bmap.h      |    2 
 fs/xfs/scrub/inode_repair.c   |  401 +++++++++++++++++++++++++++++++++++++++++
 5 files changed, 439 insertions(+), 19 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_attr_leaf.c b/fs/xfs/libxfs/xfs_attr_leaf.c
index 088ffcd22fa2..51c62dfe3059 100644
--- a/fs/xfs/libxfs/xfs_attr_leaf.c
+++ b/fs/xfs/libxfs/xfs_attr_leaf.c
@@ -896,23 +896,16 @@ xfs_attr_shortform_allfit(
 	return xfs_attr_shortform_bytesfit(dp, bytes);
 }
 
-/* Verify the consistency of an inline attribute fork. */
+/* Verify the consistency of a raw inline attribute fork. */
 xfs_failaddr_t
-xfs_attr_shortform_verify(
-	struct xfs_inode		*ip)
+xfs_attr_shortform_verify_struct(
+	struct xfs_attr_shortform	*sfp,
+	size_t				size)
 {
-	struct xfs_attr_shortform	*sfp;
 	struct xfs_attr_sf_entry	*sfep;
 	struct xfs_attr_sf_entry	*next_sfep;
 	char				*endp;
-	struct xfs_ifork		*ifp;
 	int				i;
-	int				size;
-
-	ASSERT(ip->i_d.di_aformat == XFS_DINODE_FMT_LOCAL);
-	ifp = XFS_IFORK_PTR(ip, XFS_ATTR_FORK);
-	sfp = (struct xfs_attr_shortform *)ifp->if_u1.if_data;
-	size = ifp->if_bytes;
 
 	/*
 	 * Give up if the attribute is way too short.
@@ -970,6 +963,23 @@ xfs_attr_shortform_verify(
 	return NULL;
 }
 
+/* Verify the consistency of an inline attribute fork. */
+xfs_failaddr_t
+xfs_attr_shortform_verify(
+	struct xfs_inode		*ip)
+{
+	struct xfs_attr_shortform	*sfp;
+	struct xfs_ifork		*ifp;
+	int				size;
+
+	ASSERT(ip->i_d.di_aformat == XFS_DINODE_FMT_LOCAL);
+	ifp = XFS_IFORK_PTR(ip, XFS_ATTR_FORK);
+	sfp = (struct xfs_attr_shortform *)ifp->if_u1.if_data;
+	size = ifp->if_bytes;
+
+	return xfs_attr_shortform_verify_struct(sfp, size);
+}
+
 /*
  * Convert a leaf attribute list to shortform attribute list
  */
diff --git a/fs/xfs/libxfs/xfs_attr_leaf.h b/fs/xfs/libxfs/xfs_attr_leaf.h
index 7b74e18becff..728af25a1738 100644
--- a/fs/xfs/libxfs/xfs_attr_leaf.h
+++ b/fs/xfs/libxfs/xfs_attr_leaf.h
@@ -41,6 +41,8 @@ int	xfs_attr_shortform_to_leaf(struct xfs_da_args *args,
 int	xfs_attr_shortform_remove(struct xfs_da_args *args);
 int	xfs_attr_shortform_allfit(struct xfs_buf *bp, struct xfs_inode *dp);
 int	xfs_attr_shortform_bytesfit(struct xfs_inode *dp, int bytes);
+xfs_failaddr_t xfs_attr_shortform_verify_struct(struct xfs_attr_shortform *sfp,
+		size_t size);
 xfs_failaddr_t xfs_attr_shortform_verify(struct xfs_inode *ip);
 void	xfs_attr_fork_remove(struct xfs_inode *ip, struct xfs_trans *tp);
 
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 92cd064a2589..649ce0a407dc 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -6096,18 +6096,16 @@ xfs_bmap_finish_one(
 	return error;
 }
 
-/* Check that an inode's extent does not have invalid flags or bad ranges. */
+/* Check that an extent does not have invalid flags or bad ranges. */
 xfs_failaddr_t
-xfs_bmap_validate_extent(
-	struct xfs_inode	*ip,
+xfs_bmap_validate_extent_raw(
+	struct xfs_mount	*mp,
+	bool			isrt,
 	int			whichfork,
 	struct xfs_bmbt_irec	*irec)
 {
-	struct xfs_mount	*mp = ip->i_mount;
 	xfs_fsblock_t		endfsb;
-	bool			isrt;
 
-	isrt = XFS_IS_REALTIME_INODE(ip);
 	endfsb = irec->br_startblock + irec->br_blockcount - 1;
 	if (isrt) {
 		if (!xfs_verify_rtbno(mp, irec->br_startblock))
@@ -6131,3 +6129,14 @@ xfs_bmap_validate_extent(
 	}
 	return NULL;
 }
+
+/* Check that an inode's extent does not have invalid flags or bad ranges. */
+xfs_failaddr_t
+xfs_bmap_validate_extent(
+	struct xfs_inode	*ip,
+	int			whichfork,
+	struct xfs_bmbt_irec	*irec)
+{
+	return xfs_bmap_validate_extent_raw(ip->i_mount,
+			XFS_IS_REALTIME_INODE(ip), whichfork, irec);
+}
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index 2e8555c1229a..a1bf6d1fec2d 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -273,6 +273,8 @@ static inline int xfs_bmap_fork_to_state(int whichfork)
 	}
 }
 
+xfs_failaddr_t xfs_bmap_validate_extent_raw(struct xfs_mount *mp, bool isrt,
+		int whichfork, struct xfs_bmbt_irec *irec);
 xfs_failaddr_t xfs_bmap_validate_extent(struct xfs_inode *ip, int whichfork,
 		struct xfs_bmbt_irec *irec);
 
diff --git a/fs/xfs/scrub/inode_repair.c b/fs/xfs/scrub/inode_repair.c
index ec9d94d1e5d8..3c9ac9e046fd 100644
--- a/fs/xfs/scrub/inode_repair.c
+++ b/fs/xfs/scrub/inode_repair.c
@@ -22,11 +22,15 @@
 #include "xfs_ialloc.h"
 #include "xfs_da_format.h"
 #include "xfs_reflink.h"
+#include "xfs_alloc.h"
 #include "xfs_rmap.h"
+#include "xfs_rmap_btree.h"
 #include "xfs_bmap.h"
+#include "xfs_bmap_btree.h"
 #include "xfs_bmap_util.h"
 #include "xfs_dir2.h"
 #include "xfs_quota_defs.h"
+#include "xfs_attr_leaf.h"
 #include "scrub/xfs_scrub.h"
 #include "scrub/scrub.h"
 #include "scrub/common.h"
@@ -139,7 +143,8 @@ xrep_dinode_mode(
 STATIC void
 xrep_dinode_flags(
 	struct xfs_scrub	*sc,
-	struct xfs_dinode	*dip)
+	struct xfs_dinode	*dip,
+	bool			is_rt_file)
 {
 	struct xfs_mount	*mp = sc->mp;
 	uint64_t		flags2;
@@ -150,6 +155,11 @@ xrep_dinode_flags(
 	flags = be16_to_cpu(dip->di_flags);
 	flags2 = be64_to_cpu(dip->di_flags2);
 
+	if (is_rt_file)
+		flags |= XFS_DIFLAG_REALTIME;
+	else
+		flags &= ~XFS_DIFLAG_REALTIME;
+
 	if (xfs_sb_version_hasreflink(&mp->m_sb) && S_ISREG(mode))
 		flags2 |= XFS_DIFLAG2_REFLINK;
 	else
@@ -288,11 +298,392 @@ xrep_dinode_extsize_hints(
 	}
 }
 
+/* Blocks and extents associated with an inode, according to rmap records. */
+struct xrep_dinode_stats {
+	struct xfs_scrub	*sc;
+
+	/* Blocks in use on the data device by data extents or bmbt blocks. */
+	xfs_rfsblock_t		data_blocks;
+
+	/* Blocks in use on the rt device. */
+	xfs_rfsblock_t		rt_blocks;
+
+	/* Blocks in use by the attr fork. */
+	xfs_rfsblock_t		attr_blocks;
+
+	/* Number of data device extents for the data fork. */
+	xfs_extnum_t		data_extents;
+
+	/*
+	 * Number of realtime device extents for the data fork.  If
+	 * data_extents and rt_extents indicate that the data fork has extents
+	 * on both devices, we'll just back away slowly.
+	 */
+	xfs_extnum_t		rt_extents;
+
+	/* Number of (data device) extents for the attr fork. */
+	xfs_aextnum_t		attr_extents;
+};
+
+/* Count extents and blocks for an inode given an rmap. */
+STATIC int
+xrep_dinode_walk_rmap(
+	struct xfs_btree_cur		*cur,
+	struct xfs_rmap_irec		*rec,
+	void				*priv)
+{
+	struct xrep_dinode_stats	*dis = priv;
+
+	/* Does this rmap record even belong to our inode? */
+	if (rec->rm_owner != dis->sc->sm->sm_ino)
+		return 0;
+	if (rec->rm_flags & XFS_RMAP_ATTR_FORK) {
+		dis->attr_blocks += rec->rm_blockcount;
+		if (!(rec->rm_flags & XFS_RMAP_BMBT_BLOCK))
+			dis->attr_extents++;
+	} else {
+		dis->data_blocks += rec->rm_blockcount;
+		if (!(rec->rm_flags & XFS_RMAP_BMBT_BLOCK))
+			dis->data_extents++;
+	}
+	return 0;
+}
+
+/* Count extents and blocks for an inode from all AG rmap data. */
+STATIC int
+xrep_dinode_count_ag_rmaps(
+	struct xrep_dinode_stats	*dis,
+	xfs_agnumber_t			agno)
+{
+	struct xfs_btree_cur		*cur;
+	struct xfs_buf			*agf;
+	int				error;
+
+	error = xfs_alloc_read_agf(dis->sc->mp, dis->sc->tp, agno, 0, &agf);
+	if (error)
+		return error;
+
+	cur = xfs_rmapbt_init_cursor(dis->sc->mp, dis->sc->tp, agf, agno);
+	if (!cur) {
+		error = -ENOMEM;
+		goto out_agf;
+	}
+
+	error = xfs_rmap_query_all(cur, xrep_dinode_walk_rmap, dis);
+	if (error == XFS_BTREE_QUERY_RANGE_ABORT)
+		error = 0;
+
+	xfs_btree_del_cursor(cur, error);
+out_agf:
+	xfs_trans_brelse(dis->sc->tp, agf);
+	return error;
+}
+
+/* Count extents and blocks for a given inode from all rmap data. */
+STATIC int
+xrep_dinode_count_rmaps(
+	struct xrep_dinode_stats	*dis)
+{
+	xfs_agnumber_t			agno;
+	int				error;
+
+	if (!xfs_sb_version_hasrmapbt(&dis->sc->mp->m_sb) ||
+	    xfs_sb_version_hasrealtime(&dis->sc->mp->m_sb))
+		return -EOPNOTSUPP;
+
+	/* XXX: find rt blocks too */
+	if (dis->rt_extents != 0) {
+		ASSERT(0);
+		return -EOPNOTSUPP;
+	}
+
+	for (agno = 0; agno < dis->sc->mp->m_sb.sb_agcount; agno++) {
+		error = xrep_dinode_count_ag_rmaps(dis, agno);
+		if (error)
+			return error;
+	}
+
+	/* Can't have extents on both the rt and the data device. */
+	if (dis->data_extents && dis->rt_extents)
+		return -EFSCORRUPTED;
+
+	return 0;
+}
+
+/* Return true if this extents-format ifork looks like garbage. */
+STATIC bool
+xrep_dinode_bad_extents_fork(
+	struct xfs_scrub	*sc,
+	struct xfs_dinode	*dip,
+	int			dfork_size,
+	int			whichfork)
+{
+	struct xfs_bmbt_irec	new;
+	struct xfs_bmbt_rec	*dp;
+	bool			isrt;
+	int			i;
+	int			nex;
+	int			fork_size;
+
+	nex = XFS_DFORK_NEXTENTS(dip, whichfork);
+	fork_size = nex * sizeof(struct xfs_bmbt_rec);
+	if (fork_size < 0 || fork_size > dfork_size)
+		return true;
+	if (whichfork == XFS_ATTR_FORK && nex > ((uint16_t)-1U))
+		return true;
+	dp = XFS_DFORK_PTR(dip, whichfork);
+
+	isrt = dip->di_flags & cpu_to_be16(XFS_DIFLAG_REALTIME);
+	for (i = 0; i < nex; i++, dp++) {
+		xfs_failaddr_t	fa;
+
+		xfs_bmbt_disk_get_all(dp, &new);
+		fa = xfs_bmap_validate_extent_raw(sc->mp, isrt, whichfork,
+				&new);
+		if (fa)
+			return true;
+	}
+
+	return false;
+}
+
+/* Return true if this btree-format ifork looks like garbage. */
+STATIC bool
+xrep_dinode_bad_btree_fork(
+	struct xfs_scrub	*sc,
+	struct xfs_dinode	*dip,
+	int			dfork_size,
+	int			whichfork)
+{
+	struct xfs_bmdr_block	*dfp;
+	int			nrecs;
+	int			level;
+
+	if (XFS_DFORK_NEXTENTS(dip, whichfork) <=
+			dfork_size / sizeof(struct xfs_bmbt_rec))
+		return true;
+
+	dfp = XFS_DFORK_PTR(dip, whichfork);
+	nrecs = be16_to_cpu(dfp->bb_numrecs);
+	level = be16_to_cpu(dfp->bb_level);
+
+	if (nrecs == 0 || XFS_BMDR_SPACE_CALC(nrecs) > dfork_size)
+		return true;
+	if (level == 0 || level > XFS_BTREE_MAXLEVELS)
+		return true;
+	return false;
+}
+
+/*
+ * Check the data fork for things that will fail the ifork verifiers or the
+ * ifork formatters.
+ */
+STATIC bool
+xrep_dinode_check_dfork(
+	struct xfs_scrub	*sc,
+	struct xfs_dinode	*dip,
+	uint16_t		mode)
+{
+	uint64_t		size;
+	unsigned int		fmt;
+	int			dfork_size;
+
+	fmt = XFS_DFORK_FORMAT(dip, XFS_DATA_FORK);
+	size = be64_to_cpu(dip->di_size);
+	switch (mode & S_IFMT) {
+	case S_IFIFO:
+	case S_IFCHR:
+	case S_IFBLK:
+	case S_IFSOCK:
+		if (fmt != XFS_DINODE_FMT_DEV)
+			return true;
+		break;
+	case S_IFREG:
+		if (fmt == XFS_DINODE_FMT_LOCAL)
+			return true;
+		/* fall through */
+	case S_IFLNK:
+	case S_IFDIR:
+		switch (fmt) {
+		case XFS_DINODE_FMT_LOCAL:
+		case XFS_DINODE_FMT_EXTENTS:
+		case XFS_DINODE_FMT_BTREE:
+			break;
+		default:
+			return true;
+		}
+		break;
+	default:
+		return true;
+	}
+	dfork_size = XFS_DFORK_SIZE(dip, sc->mp, XFS_DATA_FORK);
+	switch (fmt) {
+	case XFS_DINODE_FMT_DEV:
+		break;
+	case XFS_DINODE_FMT_LOCAL:
+		if (size > dfork_size)
+			return true;
+		break;
+	case XFS_DINODE_FMT_EXTENTS:
+		if (xrep_dinode_bad_extents_fork(sc, dip, dfork_size,
+				XFS_DATA_FORK))
+			return true;
+		break;
+	case XFS_DINODE_FMT_BTREE:
+		if (xrep_dinode_bad_btree_fork(sc, dip, dfork_size,
+				XFS_DATA_FORK))
+			return true;
+		break;
+	default:
+		return true;
+	}
+
+	return false;
+}
+
+/* Reset the data fork to something sane. */
+STATIC void
+xrep_dinode_zap_dfork(
+	struct xfs_scrub		*sc,
+	struct xfs_dinode		*dip,
+	uint16_t			mode,
+	struct xrep_dinode_stats	*dis)
+{
+	/* Special files always get reset to DEV */
+	switch (mode & S_IFMT) {
+	case S_IFIFO:
+	case S_IFCHR:
+	case S_IFBLK:
+	case S_IFSOCK:
+		dip->di_format = XFS_DINODE_FMT_DEV;
+		dip->di_size = 0;
+		return;
+	}
+
+	/*
+	 * If we have data extents, reset to an empty map and hope the user
+	 * will run the bmapbtd checker next.
+	 */
+	if (dis->data_extents || dis->rt_extents || S_ISREG(mode)) {
+		dip->di_format = XFS_DINODE_FMT_EXTENTS;
+		dip->di_nextents = 0;
+		return;
+	}
+
+	/* Otherwise, reset the local format to the minimum. */
+	switch (mode & S_IFMT) {
+	case S_IFLNK:
+		xrep_dinode_zap_symlink(dip);
+		break;
+	case S_IFDIR:
+		xrep_dinode_zap_dir(sc->mp, dip);
+		break;
+	}
+}
+
+/*
+ * Check the attr fork for things that will fail the ifork verifiers or the
+ * ifork formatters.
+ */
+STATIC bool
+xrep_dinode_check_afork(
+	struct xfs_scrub		*sc,
+	struct xfs_dinode		*dip)
+{
+	struct xfs_attr_shortform	*sfp;
+	int				size;
+
+	if (XFS_DFORK_BOFF(dip) == 0)
+		return dip->di_aformat != XFS_DINODE_FMT_EXTENTS ||
+		       dip->di_anextents != 0;
+
+	size = XFS_DFORK_SIZE(dip, sc->mp, XFS_ATTR_FORK);
+	switch (XFS_DFORK_FORMAT(dip, XFS_ATTR_FORK)) {
+	case XFS_DINODE_FMT_LOCAL:
+		sfp = XFS_DFORK_PTR(dip, XFS_ATTR_FORK);
+		return xfs_attr_shortform_verify_struct(sfp, size) != NULL;
+	case XFS_DINODE_FMT_EXTENTS:
+		if (xrep_dinode_bad_extents_fork(sc, dip, size, XFS_ATTR_FORK))
+			return true;
+		break;
+	case XFS_DINODE_FMT_BTREE:
+		if (xrep_dinode_bad_btree_fork(sc, dip, size, XFS_ATTR_FORK))
+			return true;
+		break;
+	default:
+		return true;
+	}
+
+	return false;
+}
+
+/* Reset the attr fork to something sane. */
+STATIC void
+xrep_dinode_zap_afork(
+	struct xfs_scrub		*sc,
+	struct xfs_dinode		*dip,
+	struct xrep_dinode_stats	*dis)
+{
+	dip->di_aformat = XFS_DINODE_FMT_EXTENTS;
+	dip->di_anextents = 0;
+	/*
+	 * We leave a nonzero forkoff so that the bmap scrub will look for
+	 * attr rmaps.
+	 */
+	dip->di_forkoff = dis->attr_extents ? 1 : 0;
+}
+
+/*
+ * Zap the data/attr forks if we spot anything that isn't going to pass the
+ * ifork verifiers or the ifork formatters, because we need to get the inode
+ * into good enough shape that the higher level repair functions can run.
+ */
+STATIC void
+xrep_dinode_zap_forks(
+	struct xfs_scrub		*sc,
+	struct xfs_dinode		*dip,
+	struct xrep_dinode_stats	*dis)
+{
+	uint16_t			mode;
+	bool				zap_datafork = false;
+	bool				zap_attrfork = false;
+
+	mode = be16_to_cpu(dip->di_mode);
+
+	/* Inode counters don't make sense? */
+	if (be32_to_cpu(dip->di_nextents) > be64_to_cpu(dip->di_nblocks))
+		zap_datafork = true;
+	if (be16_to_cpu(dip->di_anextents) > be64_to_cpu(dip->di_nblocks))
+		zap_attrfork = true;
+	if (be32_to_cpu(dip->di_nextents) + be16_to_cpu(dip->di_anextents) >
+			be64_to_cpu(dip->di_nblocks))
+		zap_datafork = zap_attrfork = true;
+
+	if (!zap_datafork)
+		zap_datafork = xrep_dinode_check_dfork(sc, dip, mode);
+	if (!zap_attrfork)
+		zap_attrfork = xrep_dinode_check_afork(sc, dip);
+
+	/* Zap whatever's bad. */
+	if (zap_attrfork)
+		xrep_dinode_zap_afork(sc, dip, dis);
+	if (zap_datafork)
+		xrep_dinode_zap_dfork(sc, dip, mode, dis);
+	dip->di_nblocks = 0;
+	if (!zap_attrfork)
+		be64_add_cpu(&dip->di_nblocks, dis->attr_blocks);
+	if (!zap_datafork) {
+		be64_add_cpu(&dip->di_nblocks, dis->data_blocks);
+		be64_add_cpu(&dip->di_nblocks, dis->rt_blocks);
+	}
+}
+
 /* Inode didn't pass verifiers, so fix the raw buffer and retry iget. */
 STATIC int
 xrep_dinode_core(
 	struct xfs_scrub	*sc)
 {
+	struct xrep_dinode_stats	dis = { .sc = sc };
 	struct xfs_imap		imap;
 	struct xfs_buf		*bp;
 	struct xfs_dinode	*dip;
@@ -300,6 +691,11 @@ xrep_dinode_core(
 	bool			inuse;
 	int			error;
 
+	/* Figure out what this inode had mapped in both forks. */
+	error = xrep_dinode_count_rmaps(&dis);
+	if (error)
+		return error;
+
 	/* Map & read inode. */
 	ino = sc->sm->sm_ino;
 	error = xfs_imap(sc->mp, sc->tp, ino, &imap, XFS_IGET_UNTRUSTED);
@@ -326,9 +722,10 @@ xrep_dinode_core(
 	dip = xfs_buf_offset(bp, imap.im_boffset);
 	xrep_dinode_header(sc, dip);
 	xrep_dinode_mode(dip);
-	xrep_dinode_flags(sc, dip);
+	xrep_dinode_flags(sc, dip, dis.rt_extents > 0);
 	xrep_dinode_size(sc->mp, dip);
 	xrep_dinode_extsize_hints(sc, dip);
+	xrep_dinode_zap_forks(sc, dip, &dis);
 
 	/* Write out the inode... */
 	xfs_dinode_calc_crc(sc->mp, dip);
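
Note for readers following the fork-zapping logic above: below is a minimal
userspace sketch of the counter sanity check and the di_nblocks recomputation
that xrep_dinode_zap_forks() performs.  Every sketch_* name is hypothetical,
and the counters are assumed to have been converted from big-endian already.
It only illustrates the decision logic and is not the kernel code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the on-disk counters, in CPU byte order. */
struct sketch_fork_counts {
	uint32_t	nextents;	/* data fork extent count */
	uint16_t	anextents;	/* attr fork extent count */
	uint64_t	nblocks;	/* blocks the inode claims to own */
};

struct sketch_zap {
	bool		zap_data;
	bool		zap_attr;
};

/* A fork cannot have more extents than the inode has blocks. */
static struct sketch_zap
sketch_check_counters(const struct sketch_fork_counts *fc)
{
	struct sketch_zap	zd = { false, false };

	if (fc->nextents > fc->nblocks)
		zd.zap_data = true;
	if (fc->anextents > fc->nblocks)
		zd.zap_attr = true;
	/* If the sum is impossible we cannot tell which fork is lying. */
	if ((uint64_t)fc->nextents + fc->anextents > fc->nblocks)
		zd.zap_data = zd.zap_attr = true;
	return zd;
}

/* Rebuild the block count from rmap-derived totals of surviving forks. */
static uint64_t
sketch_new_nblocks(struct sketch_zap zd, uint64_t data_blocks,
		   uint64_t rt_blocks, uint64_t attr_blocks)
{
	uint64_t		nblocks = 0;

	if (!zd.zap_attr)
		nblocks += attr_blocks;
	if (!zd.zap_data)
		nblocks += data_blocks + rt_blocks;
	return nblocks;
}
```

In this sketch a zapped fork contributes nothing to the new block count; the
expectation is that a later bmap repair re-adds its blocks.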



* [PATCH 10/14] xfs: repair inode block maps
  2018-07-30  5:47 [PATCH v17.1 00/14] xfs-4.19: online repair support Darrick J. Wong
                   ` (8 preceding siblings ...)
  2018-07-30  5:48 ` [PATCH 09/14] xfs: zap broken inode forks Darrick J. Wong
@ 2018-07-30  5:49 ` Darrick J. Wong
  2018-07-30  5:49 ` [PATCH 11/14] xfs: repair damaged symlinks Darrick J. Wong
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 53+ messages in thread
From: Darrick J. Wong @ 2018-07-30  5:49 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs, bfoster, david, allison.henderson

From: Darrick J. Wong <darrick.wong@oracle.com>

Use the reverse-mapping btree information to rebuild an inode fork.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/Makefile            |    1 
 fs/xfs/scrub/bmap.c        |   22 ++
 fs/xfs/scrub/bmap_repair.c |  513 ++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/repair.h      |    4 
 fs/xfs/scrub/scrub.c       |    4 
 fs/xfs/scrub/trace.h       |    2 
 fs/xfs/xfs_trans.c         |   54 +++++
 fs/xfs/xfs_trans.h         |    2 
 8 files changed, 599 insertions(+), 3 deletions(-)
 create mode 100644 fs/xfs/scrub/bmap_repair.c
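
A rough userspace sketch of the rebuild strategy, for reviewers who want the
shape of the algorithm before reading the diff: filter the reverse mappings
down to one inode and fork, sort them by file offset, and re-add each one in
pieces no longer than the maximum extent length.  All sketch_* names and the
tiny SKETCH_MAXEXTLEN value are assumptions made for illustration; the kernel
code walks the rmap btrees and calls xfs_bmapi_remap() instead of filling an
array.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_MAXEXTLEN	4u	/* hypothetical; kept tiny to show splitting */

struct sketch_rmap {
	uint64_t	owner;		/* inode number that owns the blocks */
	uint64_t	offset;		/* file offset, in blocks */
	uint64_t	startblock;	/* physical start block */
	uint64_t	blockcount;
	int		attr_fork;	/* nonzero if this maps the attr fork */
};

struct sketch_bmap {
	uint64_t	startoff;
	uint64_t	startblock;
	uint64_t	blockcount;
};

static int
sketch_cmp(const void *a, const void *b)
{
	const struct sketch_rmap	*ra = a;
	const struct sketch_rmap	*rb = b;

	if (ra->offset > rb->offset)
		return 1;
	if (ra->offset < rb->offset)
		return -1;
	return 0;
}

/* Returns the number of mappings written to @out, or -1 if @out is full. */
static int
sketch_rebuild(struct sketch_rmap *recs, size_t nrecs, uint64_t ino,
	       int attr_fork, struct sketch_bmap *out, size_t out_max)
{
	size_t		kept = 0, nout = 0, i;

	/* Keep only the records for this inode and fork. */
	for (i = 0; i < nrecs; i++)
		if (recs[i].owner == ino && !recs[i].attr_fork == !attr_fork)
			recs[kept++] = recs[i];

	/* Mappings must be re-added in file offset order. */
	qsort(recs, kept, sizeof(*recs), sketch_cmp);

	for (i = 0; i < kept; i++) {
		struct sketch_rmap	r = recs[i];

		/* One rmap may be longer than a single bmap record allows. */
		while (r.blockcount > 0) {
			uint64_t	len = r.blockcount;

			if (len > SKETCH_MAXEXTLEN)
				len = SKETCH_MAXEXTLEN;
			if (nout == out_max)
				return -1;
			out[nout].startoff = r.offset;
			out[nout].startblock = r.startblock;
			out[nout].blockcount = len;
			nout++;
			r.offset += len;
			r.startblock += len;
			r.blockcount -= len;
		}
	}
	return (int)nout;
}
```

The sketch omits the transaction rolls and the disposal of old bmbt blocks,
which the real code has to interleave with the re-adds.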


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index e01b5003d543..7f5467bb18b9 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -166,6 +166,7 @@ xfs-y				+= $(addprefix scrub/, \
 				   agheader_repair.o \
 				   alloc_repair.o \
 				   bitmap.o \
+				   bmap_repair.o \
 				   ialloc_repair.o \
 				   inode_repair.o \
 				   refcount_repair.o \
diff --git a/fs/xfs/scrub/bmap.c b/fs/xfs/scrub/bmap.c
index e1d11f3223e3..6659f41e7b4c 100644
--- a/fs/xfs/scrub/bmap.c
+++ b/fs/xfs/scrub/bmap.c
@@ -37,6 +37,7 @@ xchk_setup_inode_bmap(
 	struct xfs_scrub	*sc,
 	struct xfs_inode	*ip)
 {
+	bool			is_repair = false;
 	int			error;
 
 	error = xchk_get_inode(sc, ip);
@@ -46,6 +47,10 @@ xchk_setup_inode_bmap(
 	sc->ilock_flags = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
 	xfs_ilock(sc->ip, sc->ilock_flags);
 
+#ifdef CONFIG_XFS_ONLINE_REPAIR
+	is_repair = (sc->sm->sm_flags & XFS_SCRUB_IFLAG_REPAIR);
+#endif
+
 	/*
 	 * We don't want any ephemeral data fork updates sitting around
 	 * while we inspect block mappings, so wait for directio to finish
@@ -53,10 +58,27 @@ xchk_setup_inode_bmap(
 	 */
 	if (S_ISREG(VFS_I(sc->ip)->i_mode) &&
 	    sc->sm->sm_type == XFS_SCRUB_TYPE_BMBTD) {
+		/* Break all our leases; we're going to mess with things. */
+		if (is_repair) {
+			error = xfs_break_layouts(VFS_I(sc->ip),
+					&sc->ilock_flags, BREAK_UNMAP);
+			if (error)
+				goto out;
+		}
+
 		inode_dio_wait(VFS_I(sc->ip));
 		error = filemap_write_and_wait(VFS_I(sc->ip)->i_mapping);
 		if (error)
 			goto out;
+
+		/* Drop the page cache if we're repairing block mappings. */
+		if (is_repair) {
+			error = invalidate_inode_pages2(
+					VFS_I(sc->ip)->i_mapping);
+			if (error)
+				goto out;
+		}
+
 	}
 
 	/* Got the inode, lock it and we're ready to go. */
diff --git a/fs/xfs/scrub/bmap_repair.c b/fs/xfs/scrub/bmap_repair.c
new file mode 100644
index 000000000000..00907b97512e
--- /dev/null
+++ b/fs/xfs/scrub/bmap_repair.c
@@ -0,0 +1,513 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Copyright (C) 2018 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_btree.h"
+#include "xfs_bit.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_sb.h"
+#include "xfs_inode.h"
+#include "xfs_inode_fork.h"
+#include "xfs_alloc.h"
+#include "xfs_rtalloc.h"
+#include "xfs_bmap.h"
+#include "xfs_bmap_util.h"
+#include "xfs_bmap_btree.h"
+#include "xfs_rmap.h"
+#include "xfs_rmap_btree.h"
+#include "xfs_refcount.h"
+#include "xfs_quota.h"
+#include "scrub/xfs_scrub.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/btree.h"
+#include "scrub/trace.h"
+#include "scrub/repair.h"
+#include "scrub/bitmap.h"
+
+/*
+ * Inode fork block mapping (BMBT) repair.
+ *
+ * Basically, we gather all the rmap records for the inode and fork we're
+ * fixing, reset the incore fork, then re-add all the records.
+ */
+
+struct xrep_bmap_extent {
+	struct list_head	list;
+	struct xfs_rmap_irec	rmap;
+	xfs_agnumber_t		agno;
+};
+
+struct xrep_bmap {
+	/* List of new bmap records. */
+	struct list_head	*extlist;
+
+	/* Old bmbt blocks */
+	struct xfs_bitmap	*btlist;
+
+	struct xfs_scrub	*sc;
+
+	/* Inode we're fixing. */
+	xfs_ino_t		ino;
+
+	/* How many blocks did we find in the other fork? */
+	xfs_rfsblock_t		otherfork_blocks;
+
+	/* How many bmbt blocks did we find for this fork? */
+	xfs_rfsblock_t		bmbt_blocks;
+
+	/* How many extents did we find for this fork? */
+	xfs_extnum_t		extents;
+
+	/* Which fork are we fixing? */
+	int			whichfork;
+};
+
+/* Record extents that belong to this inode's fork. */
+STATIC int
+xrep_bmap_walk_rmap(
+	struct xfs_btree_cur	*cur,
+	struct xfs_rmap_irec	*rec,
+	void			*priv)
+{
+	struct xrep_bmap	*rb = priv;
+	struct xrep_bmap_extent	*rbe;
+	struct xfs_mount	*mp = cur->bc_mp;
+	xfs_fsblock_t		fsbno;
+	int			error = 0;
+
+	if (xchk_should_terminate(rb->sc, &error))
+		return error;
+
+	/* Skip extents which are not owned by this inode and fork. */
+	if (rec->rm_owner != rb->ino) {
+		return 0;
+	} else if (rb->whichfork == XFS_DATA_FORK &&
+		 (rec->rm_flags & XFS_RMAP_ATTR_FORK)) {
+		rb->otherfork_blocks += rec->rm_blockcount;
+		return 0;
+	} else if (rb->whichfork == XFS_ATTR_FORK &&
+		 !(rec->rm_flags & XFS_RMAP_ATTR_FORK)) {
+		rb->otherfork_blocks += rec->rm_blockcount;
+		return 0;
+	}
+
+	/* Delete the old bmbt blocks later. */
+	if (rec->rm_flags & XFS_RMAP_BMBT_BLOCK) {
+		fsbno = XFS_AGB_TO_FSB(mp, cur->bc_private.a.agno,
+				rec->rm_startblock);
+		rb->bmbt_blocks += rec->rm_blockcount;
+		return xfs_bitmap_set(rb->btlist, fsbno, rec->rm_blockcount);
+	}
+
+	/* Remember this rmap. */
+	rb->extents++;
+	trace_xrep_bmap_walk_rmap(mp, cur->bc_private.a.agno,
+			rec->rm_startblock, rec->rm_blockcount, rec->rm_owner,
+			rec->rm_offset, rec->rm_flags);
+
+	rbe = kmem_alloc(sizeof(struct xrep_bmap_extent), KM_MAYFAIL);
+	if (!rbe)
+		return -ENOMEM;
+
+	INIT_LIST_HEAD(&rbe->list);
+	rbe->rmap = *rec;
+	rbe->agno = cur->bc_private.a.agno;
+	list_add_tail(&rbe->list, rb->extlist);
+
+	return 0;
+}
+
+/* Compare two bmap extents. */
+static int
+xrep_bmap_extent_cmp(
+	void			*priv,
+	struct list_head	*a,
+	struct list_head	*b)
+{
+	struct xrep_bmap_extent	*ap;
+	struct xrep_bmap_extent	*bp;
+
+	ap = container_of(a, struct xrep_bmap_extent, list);
+	bp = container_of(b, struct xrep_bmap_extent, list);
+
+	if (ap->rmap.rm_offset > bp->rmap.rm_offset)
+		return 1;
+	else if (ap->rmap.rm_offset < bp->rmap.rm_offset)
+		return -1;
+	return 0;
+}
+
+/* Scan one AG for reverse mappings that we can turn into extent maps. */
+STATIC int
+xrep_bmap_scan_ag(
+	struct xrep_bmap	*rb,
+	xfs_agnumber_t		agno)
+{
+	struct xfs_scrub	*sc = rb->sc;
+	struct xfs_mount	*mp = sc->mp;
+	struct xfs_buf		*agf_bp = NULL;
+	struct xfs_btree_cur	*cur;
+	int			error;
+
+	error = xfs_alloc_read_agf(mp, sc->tp, agno, 0, &agf_bp);
+	if (error)
+		return error;
+	if (!agf_bp)
+		return -ENOMEM;
+	cur = xfs_rmapbt_init_cursor(mp, sc->tp, agf_bp, agno);
+	error = xfs_rmap_query_all(cur, xrep_bmap_walk_rmap, rb);
+	if (error == XFS_BTREE_QUERY_RANGE_ABORT)
+		error = 0;
+	xfs_btree_del_cursor(cur, error);
+	xfs_trans_brelse(sc->tp, agf_bp);
+	return error;
+}
+
+/* Insert bmap records into an inode fork, given an rmap. */
+STATIC int
+xrep_bmap_insert_rec(
+	struct xfs_scrub	*sc,
+	struct xrep_bmap_extent	*rbe,
+	int			baseflags)
+{
+	struct xfs_bmbt_irec	bmap;
+	xfs_extlen_t		extlen;
+	int			flags;
+	int			error = 0;
+
+	/* Form the "new" mapping... */
+	bmap.br_startblock = XFS_AGB_TO_FSB(sc->mp, rbe->agno,
+			rbe->rmap.rm_startblock);
+	bmap.br_startoff = rbe->rmap.rm_offset;
+
+	flags = 0;
+	if (rbe->rmap.rm_flags & XFS_RMAP_UNWRITTEN)
+		flags = XFS_BMAPI_PREALLOC;
+	while (rbe->rmap.rm_blockcount > 0) {
+		extlen = min_t(xfs_extlen_t, rbe->rmap.rm_blockcount,
+				MAXEXTLEN);
+		bmap.br_blockcount = extlen;
+
+		/* Re-add the extent to the fork. */
+		error = xfs_bmapi_remap(sc->tp, sc->ip, bmap.br_startoff,
+				extlen, bmap.br_startblock, baseflags | flags);
+		if (error)
+			goto out;
+
+		bmap.br_startblock += extlen;
+		bmap.br_startoff += extlen;
+		rbe->rmap.rm_blockcount -= extlen;
+		error = xfs_defer_ijoin(sc->tp->t_dfops, sc->ip);
+		if (error)
+			goto out;
+		error = xfs_defer_finish(&sc->tp);
+		if (error)
+			goto out;
+		/* Make sure we roll the transaction. */
+		error = xfs_trans_roll_inode(&sc->tp, sc->ip);
+		if (error)
+			goto out;
+	}
+
+out:
+	return error;
+}
+
+/* Check for garbage inputs. */
+STATIC int
+xrep_bmap_check_inputs(
+	struct xfs_scrub	*sc,
+	int			whichfork)
+{
+	ASSERT(whichfork == XFS_DATA_FORK || whichfork == XFS_ATTR_FORK);
+
+	/* Don't know how to repair the other fork formats. */
+	if (XFS_IFORK_FORMAT(sc->ip, whichfork) != XFS_DINODE_FMT_EXTENTS &&
+	    XFS_IFORK_FORMAT(sc->ip, whichfork) != XFS_DINODE_FMT_BTREE)
+		return -EOPNOTSUPP;
+
+	/*
+	 * If there's no attr fork area in the inode, there's no attr fork to
+	 * rebuild.
+	 */
+	if (whichfork == XFS_ATTR_FORK) {
+		if (!XFS_IFORK_Q(sc->ip))
+			return -ENOENT;
+		return 0;
+	}
+
+	/* Only files, symlinks, and directories get to have data forks. */
+	switch (VFS_I(sc->ip)->i_mode & S_IFMT) {
+	case S_IFREG:
+	case S_IFDIR:
+	case S_IFLNK:
+		/* ok */
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	/* If we somehow have delalloc extents, forget it. */
+	if (sc->ip->i_delayed_blks)
+		return -EBUSY;
+
+	/* Don't know how to rebuild realtime data forks. */
+	if (XFS_IS_REALTIME_INODE(sc->ip))
+		return -EOPNOTSUPP;
+
+	return 0;
+}
+
+/*
+ * Collect block mappings for this fork of this inode and decide if we have
+ * enough space to rebuild.  Caller is responsible for cleaning up the list if
+ * anything goes wrong.
+ */
+STATIC int
+xrep_bmap_find_mappings(
+	struct xfs_scrub	*sc,
+	int			whichfork,
+	struct list_head	*mapping_records,
+	struct xfs_bitmap	*old_bmbt_blocks,
+	xfs_rfsblock_t		*old_bmbt_block_count,
+	xfs_rfsblock_t		*otherfork_blocks)
+{
+	struct xrep_bmap	rb;
+	xfs_agnumber_t		agno;
+	unsigned int		resblks;
+	int			error;
+
+	memset(&rb, 0, sizeof(rb));
+	rb.extlist = mapping_records;
+	rb.btlist = old_bmbt_blocks;
+	rb.ino = sc->ip->i_ino;
+	rb.whichfork = whichfork;
+	rb.sc = sc;
+
+	/* Iterate the rmaps for extents. */
+	for (agno = 0; agno < sc->mp->m_sb.sb_agcount; agno++) {
+		error = xrep_bmap_scan_ag(&rb, agno);
+		if (error)
+			return error;
+	}
+
+	/*
+	 * Guess how many blocks we're going to need to rebuild an entire bmap
+	 * from the number of extents we found, and pump up our transaction to
+	 * have sufficient block reservation.
+	 */
+	resblks = xfs_bmbt_calc_size(sc->mp, rb.extents);
+	error = xfs_trans_reserve_more(sc->tp, resblks, 0);
+	if (error)
+		return error;
+
+	*otherfork_blocks = rb.otherfork_blocks;
+	*old_bmbt_block_count = rb.bmbt_blocks;
+	return 0;
+}
+
+/* Update the inode counters. */
+STATIC int
+xrep_bmap_reset_counters(
+	struct xfs_scrub	*sc,
+	xfs_rfsblock_t		old_bmbt_block_count,
+	xfs_rfsblock_t		otherfork_blocks,
+	int			*log_flags)
+{
+	int			error;
+
+	xfs_trans_ijoin(sc->tp, sc->ip, 0);
+
+	/*
+	 * We're going to use the bmap routines to reconstruct a fork from rmap
+	 * records.  Those functions increment di_nblocks for us, so we need to
+	 * subtract out all the data and bmbt blocks from the fork we're about
+	 * to rebuild.  otherfork_blocks reflects all the data and bmbt blocks
+	 * for the other fork, so this assignment effectively performs the
+	 * subtraction for us.
+	 */
+	sc->ip->i_d.di_nblocks = otherfork_blocks;
+	*log_flags |= XFS_ILOG_CORE;
+
+	if (!old_bmbt_block_count)
+		return 0;
+
+	/* Release quota counts for the old bmbt blocks. */
+	error = xrep_ino_dqattach(sc);
+	if (error)
+		return error;
+	xfs_trans_mod_dquot_byino(sc->tp, sc->ip, XFS_TRANS_DQ_BCOUNT,
+			-(int64_t)old_bmbt_block_count);
+	return 0;
+}
+
+/* Initialize a new fork and implant it in the inode. */
+STATIC void
+xrep_bmap_reset_fork(
+	struct xfs_scrub	*sc,
+	int			whichfork,
+	bool			no_mappings,
+	int			*log_flags)
+{
+	/* Set us back to extents format with zero records. */
+	XFS_IFORK_FMT_SET(sc->ip, whichfork, XFS_DINODE_FMT_EXTENTS);
+	XFS_IFORK_NEXT_SET(sc->ip, whichfork, 0);
+
+	/* Reinitialize the in-core fork. */
+	if (XFS_IFORK_PTR(sc->ip, whichfork) != NULL)
+		xfs_idestroy_fork(sc->ip, whichfork);
+	if (whichfork == XFS_DATA_FORK) {
+		memset(&sc->ip->i_df, 0, sizeof(struct xfs_ifork));
+		sc->ip->i_df.if_flags |= XFS_IFEXTENTS;
+	} else if (whichfork == XFS_ATTR_FORK) {
+		if (no_mappings) {
+			sc->ip->i_afp = NULL;
+		} else {
+			sc->ip->i_afp = kmem_zone_zalloc(xfs_ifork_zone,
+					KM_SLEEP);
+			sc->ip->i_afp->if_flags |= XFS_IFEXTENTS;
+		}
+	}
+
+	/*
+	 * Now that we've reinitialized the in-memory fork and set the inode
+	 * back to extents format with zero extents, any extents that we
+	 * subsequently map into the file will reinitialize the on-disk fork
+	 * area for us.  All we have to do is log the inode core to preserve
+	 * the format and extent count fields.
+	 */
+	*log_flags |= XFS_ILOG_CORE;
+}
+
+/* Make our changes permanent so that we can start rebuilding the fork. */
+STATIC int
+xrep_bmap_commit_new(
+	struct xfs_scrub	*sc,
+	int			log_flags)
+{
+	xfs_trans_log_inode(sc->tp, sc->ip, log_flags);
+	return xfs_trans_roll_inode(&sc->tp, sc->ip);
+}
+
+/* Build new fork mappings and dispose of the old bmbt blocks. */
+STATIC int
+xrep_bmap_rebuild_tree(
+	struct xfs_scrub	*sc,
+	int			whichfork,
+	struct list_head	*mapping_records,
+	struct xfs_bitmap	*old_bmbt_blocks)
+{
+	struct xfs_owner_info	oinfo;
+	struct xrep_bmap_extent	*rbe;
+	struct xrep_bmap_extent	*n;
+	int			baseflags;
+	int			error;
+
+	baseflags = XFS_BMAPI_NORMAP;
+	if (whichfork == XFS_ATTR_FORK)
+		baseflags |= XFS_BMAPI_ATTRFORK;
+
+	/* "Remap" the extents into the fork. */
+	list_sort(NULL, mapping_records, xrep_bmap_extent_cmp);
+	list_for_each_entry_safe(rbe, n, mapping_records, list) {
+		error = xrep_bmap_insert_rec(sc, rbe, baseflags);
+		if (error)
+			return error;
+		list_del(&rbe->list);
+		kmem_free(rbe);
+	}
+
+	/* Dispose of all the old bmbt blocks. */
+	xfs_rmap_ino_bmbt_owner(&oinfo, sc->ip->i_ino, whichfork);
+	return xrep_reap_extents(sc, old_bmbt_blocks, &oinfo,
+			XFS_AG_RESV_NONE);
+}
+
+/* Free every record in the mapping list. */
+STATIC void
+xrep_bmap_cancel_recs(
+	struct list_head	*recs)
+{
+	struct xrep_bmap_extent	*rbe;
+	struct xrep_bmap_extent	*n;
+
+	list_for_each_entry_safe(rbe, n, recs, list) {
+		list_del(&rbe->list);
+		kmem_free(rbe);
+	}
+}
+
+/* Repair an inode fork. */
+STATIC int
+xrep_bmap(
+	struct xfs_scrub	*sc,
+	int			whichfork)
+{
+	struct list_head	mapping_records;
+	struct xfs_bitmap	old_bmbt_blocks;
+	xfs_rfsblock_t		old_bmbt_block_count;
+	xfs_rfsblock_t		otherfork_blocks;
+	int			log_flags = 0;
+	int			error = 0;
+
+	error = xrep_bmap_check_inputs(sc, whichfork);
+	if (error)
+		return error;
+
+	/* Collect all reverse mappings for this fork's extents. */
+	INIT_LIST_HEAD(&mapping_records);
+	xfs_bitmap_init(&old_bmbt_blocks);
+	error = xrep_bmap_find_mappings(sc, whichfork, &mapping_records,
+			&old_bmbt_blocks, &old_bmbt_block_count,
+			&otherfork_blocks);
+	if (error)
+		goto out;
+
+	/*
+	 * Blow out the in-core fork and zero the on-disk fork.  This is the
+	 * point at which we are no longer able to bail out gracefully.
+	 */
+	error = xrep_bmap_reset_counters(sc, old_bmbt_block_count,
+			otherfork_blocks, &log_flags);
+	if (error)
+		goto out;
+	xrep_bmap_reset_fork(sc, whichfork, list_empty(&mapping_records),
+			&log_flags);
+	error = xrep_bmap_commit_new(sc, log_flags);
+	if (error)
+		goto out;
+
+	/* Now rebuild the fork extent map information. */
+	error = xrep_bmap_rebuild_tree(sc, whichfork, &mapping_records,
+			&old_bmbt_blocks);
+out:
+	xfs_bitmap_destroy(&old_bmbt_blocks);
+	xrep_bmap_cancel_recs(&mapping_records);
+	return error;
+}
+
+/* Repair an inode's data fork. */
+int
+xrep_bmap_data(
+	struct xfs_scrub	*sc)
+{
+	return xrep_bmap(sc, XFS_DATA_FORK);
+}
+
+/* Repair an inode's attr fork. */
+int
+xrep_bmap_attr(
+	struct xfs_scrub	*sc)
+{
+	return xrep_bmap(sc, XFS_ATTR_FORK);
+}
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index 20e449c7a0df..38444fec70db 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -66,6 +66,8 @@ int xrep_allocbt(struct xfs_scrub *sc);
 int xrep_iallocbt(struct xfs_scrub *sc);
 int xrep_refcountbt(struct xfs_scrub *sc);
 int xrep_inode(struct xfs_scrub *sc);
+int xrep_bmap_data(struct xfs_scrub *sc);
+int xrep_bmap_attr(struct xfs_scrub *sc);
 
 #else
 
@@ -104,6 +106,8 @@ xrep_reset_perag_resv(
 #define xrep_iallocbt			xrep_notsupported
 #define xrep_refcountbt			xrep_notsupported
 #define xrep_inode			xrep_notsupported
+#define xrep_bmap_data			xrep_notsupported
+#define xrep_bmap_attr			xrep_notsupported
 
 #endif /* CONFIG_XFS_ONLINE_REPAIR */
 
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index ae922801808d..45af20a3ab50 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -277,13 +277,13 @@ static const struct xchk_meta_ops meta_scrub_ops[] = {
 		.type	= ST_INODE,
 		.setup	= xchk_setup_inode_bmap,
 		.scrub	= xchk_bmap_data,
-		.repair	= xrep_notsupported,
+		.repair	= xrep_bmap_data,
 	},
 	[XFS_SCRUB_TYPE_BMBTA] = {	/* inode attr fork */
 		.type	= ST_INODE,
 		.setup	= xchk_setup_inode_bmap,
 		.scrub	= xchk_bmap_attr,
-		.repair	= xrep_notsupported,
+		.repair	= xrep_bmap_attr,
 	},
 	[XFS_SCRUB_TYPE_BMBTC] = {	/* inode CoW fork */
 		.type	= ST_INODE,
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index 9126dc66f726..3383b14fd0c0 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -554,7 +554,7 @@ DEFINE_EVENT(xrep_rmap_class, name, \
 DEFINE_REPAIR_RMAP_EVENT(xrep_abt_walk_rmap);
 DEFINE_REPAIR_RMAP_EVENT(xrep_ibt_walk_rmap);
 DEFINE_REPAIR_RMAP_EVENT(xrep_rmap_extent_fn);
-DEFINE_REPAIR_RMAP_EVENT(xrep_bmap_extent_fn);
+DEFINE_REPAIR_RMAP_EVENT(xrep_bmap_walk_rmap);
 
 TRACE_EVENT(xrep_refcount_extent_fn,
 	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 7bf5c1202719..226a30465def 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -133,6 +133,60 @@ xfs_trans_dup(
 	return ntp;
 }
 
+/*
+ * Try to reserve more blocks for a transaction.  The single use case we
+ * support is for online repair -- use a transaction to gather data without
+ * fear of btree cycle deadlocks; calculate how many blocks we really need
+ * from that data; and only then start modifying data.  This can fail due to
+ * ENOSPC, so we have to be able to cancel the transaction.
+ */
+int
+xfs_trans_reserve_more(
+	struct xfs_trans	*tp,
+	uint			blocks,
+	uint			rtextents)
+{
+	struct xfs_mount	*mp = tp->t_mountp;
+	bool			rsvd = (tp->t_flags & XFS_TRANS_RESERVE) != 0;
+	int			error = 0;
+
+	ASSERT(!(tp->t_flags & XFS_TRANS_DIRTY));
+
+	/*
+	 * Attempt to reserve the needed disk blocks by decrementing
+	 * the number needed from the number available.  This will
+	 * fail if the count would go below zero.
+	 */
+	if (blocks > 0) {
+		error = xfs_mod_fdblocks(mp, -((int64_t)blocks), rsvd);
+		if (error)
+			return -ENOSPC;
+		tp->t_blk_res += blocks;
+	}
+
+	/*
+	 * Attempt to reserve the needed realtime extents by decrementing
+	 * the number needed from the number available.  This will
+	 * fail if the count would go below zero.
+	 */
+	if (rtextents > 0) {
+		error = xfs_mod_frextents(mp, -((int64_t)rtextents));
+		if (error) {
+			error = -ENOSPC;
+			goto out_blocks;
+		}
+		tp->t_rtx_res += rtextents;
+	}
+
+	return 0;
+out_blocks:
+	if (blocks > 0) {
+		xfs_mod_fdblocks(mp, (int64_t)blocks, rsvd);
+		tp->t_blk_res -= blocks;
+	}
+	return error;
+}
+
 /*
  * This is called to reserve free disk blocks and log space for the
  * given transaction.  This must be done before allocating any resources
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index 5170e89bec02..8708bd2bbdba 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -169,6 +169,8 @@ typedef struct xfs_trans {
 int		xfs_trans_alloc(struct xfs_mount *mp, struct xfs_trans_res *resp,
 			uint blocks, uint rtextents, uint flags,
 			struct xfs_trans **tpp);
+int		xfs_trans_reserve_more(struct xfs_trans *tp, uint blocks,
+			uint rtextents);
 int		xfs_trans_alloc_empty(struct xfs_mount *mp,
 			struct xfs_trans **tpp);
 void		xfs_trans_mod_sb(xfs_trans_t *, uint, int64_t);
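
For anyone skimming xfs_trans_reserve_more() above, here is a small userspace
sketch of its reserve-then-roll-back pattern: take blocks first, then realtime
extents, and give the blocks back if the second step fails.  The sketch_*
types and the plain counters are assumptions standing in for the mount's
free-space accounting; the kernel helpers behave differently in detail (for
example, reserve pool handling).

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical free-space pool shared by all transactions. */
struct sketch_pool {
	int64_t		free_blocks;
	int64_t		free_rtextents;
};

struct sketch_trans {
	struct sketch_pool	*pool;
	uint64_t		blk_res;	/* blocks reserved so far */
	uint64_t		rtx_res;	/* rt extents reserved so far */
};

/* Subtract @want from @counter, failing if it would go below zero. */
static int
sketch_take(int64_t *counter, uint32_t want)
{
	if ((int64_t)want > *counter)
		return -ENOSPC;
	*counter -= want;
	return 0;
}

static int
sketch_reserve_more(struct sketch_trans *tp, uint32_t blocks,
		    uint32_t rtextents)
{
	int		error;

	if (blocks > 0) {
		error = sketch_take(&tp->pool->free_blocks, blocks);
		if (error)
			return error;
		tp->blk_res += blocks;
	}

	if (rtextents > 0) {
		error = sketch_take(&tp->pool->free_rtextents, rtextents);
		if (error) {
			/* Undo the block reservation we just made. */
			if (blocks > 0) {
				tp->pool->free_blocks += blocks;
				tp->blk_res -= blocks;
			}
			return error;
		}
		tp->rtx_res += rtextents;
	}
	return 0;
}
```

The point of the pattern is that a failed call leaves both the pool and the
transaction exactly as they were, so the caller can still cancel cleanly.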



* [PATCH 11/14] xfs: repair damaged symlinks
  2018-07-30  5:47 [PATCH v17.1 00/14] xfs-4.19: online repair support Darrick J. Wong
                   ` (9 preceding siblings ...)
  2018-07-30  5:49 ` [PATCH 10/14] xfs: repair inode block maps Darrick J. Wong
@ 2018-07-30  5:49 ` Darrick J. Wong
  2018-07-30  5:49 ` [PATCH 12/14] xfs: repair extended attributes Darrick J. Wong
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 53+ messages in thread
From: Darrick J. Wong @ 2018-07-30  5:49 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs, bfoster, david, allison.henderson

From: Darrick J. Wong <darrick.wong@oracle.com>

Repair inconsistent symbolic link data.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/Makefile               |    1 
 fs/xfs/scrub/repair.h         |    2 
 fs/xfs/scrub/scrub.c          |    2 
 fs/xfs/scrub/symlink.c        |    5 +
 fs/xfs/scrub/symlink_repair.c |  244 +++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_symlink.c          |  150 ++++++++++++++-----------
 fs/xfs/xfs_symlink.h          |    3 +
 7 files changed, 339 insertions(+), 68 deletions(-)
 create mode 100644 fs/xfs/scrub/symlink_repair.c
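
To make the salvage strategy concrete, here is a minimal userspace sketch:
copy target bytes out of each remote block while its header checks out, stop
at the first bad block, NUL-terminate, and fall back to "." if nothing was
recovered.  The sketch_* names, the hdr_ok flag and the small SKETCH_MAXLEN
are hypothetical stand-ins; the sketch also assumes the copy length is clamped
to the buffer size.  It is an illustration rather than the code in this patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_MAXLEN	16	/* hypothetical; stands in for XFS_SYMLINK_MAXLEN */

/* One remote block: whether its header verified, plus its payload. */
struct sketch_chunk {
	bool		hdr_ok;
	const char	*data;
	size_t		len;
};

/*
 * Salvage a target into @buf, which must hold SKETCH_MAXLEN + 1 bytes.
 * Returns the number of bytes stored, not counting the trailing NUL.
 */
static size_t
sketch_salvage(const struct sketch_chunk *chunks, size_t nchunks,
	       size_t claimed_len, char *buf)
{
	size_t		remaining = claimed_len;
	size_t		offset = 0;
	size_t		i;

	/* Never copy more than the buffer can hold. */
	if (remaining > SKETCH_MAXLEN)
		remaining = SKETCH_MAXLEN;

	for (i = 0; i < nchunks && remaining > 0; i++) {
		size_t	n = chunks[i].len;

		/* Stop at the first block we cannot trust. */
		if (!chunks[i].hdr_ok)
			break;
		if (n > remaining)
			n = remaining;
		memcpy(buf + offset, chunks[i].data, n);
		offset += n;
		remaining -= n;
	}

	/* Always leave a terminated, non-empty target behind. */
	if (offset == 0) {
		strcpy(buf, ".");
		return 1;
	}
	buf[offset] = '\0';
	return offset;
}
```

Whatever prefix survives becomes the new target, so a partially damaged link
may end up pointing somewhere shorter than it originally did.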


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 7f5467bb18b9..e25cde969d99 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -171,6 +171,7 @@ xfs-y				+= $(addprefix scrub/, \
 				   inode_repair.o \
 				   refcount_repair.o \
 				   repair.o \
+				   symlink_repair.o \
 				   )
 endif
 endif
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index 38444fec70db..17769efb20d9 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -68,6 +68,7 @@ int xrep_refcountbt(struct xfs_scrub *sc);
 int xrep_inode(struct xfs_scrub *sc);
 int xrep_bmap_data(struct xfs_scrub *sc);
 int xrep_bmap_attr(struct xfs_scrub *sc);
+int xrep_symlink(struct xfs_scrub *sc);
 
 #else
 
@@ -108,6 +109,7 @@ xrep_reset_perag_resv(
 #define xrep_inode			xrep_notsupported
 #define xrep_bmap_data			xrep_notsupported
 #define xrep_bmap_attr			xrep_notsupported
+#define xrep_symlink			xrep_notsupported
 
 #endif /* CONFIG_XFS_ONLINE_REPAIR */
 
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 45af20a3ab50..0a8eea77e58f 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -307,7 +307,7 @@ static const struct xchk_meta_ops meta_scrub_ops[] = {
 		.type	= ST_INODE,
 		.setup	= xchk_setup_symlink,
 		.scrub	= xchk_symlink,
-		.repair	= xrep_notsupported,
+		.repair	= xrep_symlink,
 	},
 	[XFS_SCRUB_TYPE_PARENT] = {	/* parent pointers */
 		.type	= ST_INODE,
diff --git a/fs/xfs/scrub/symlink.c b/fs/xfs/scrub/symlink.c
index f7ebaa946999..ee968c62d0f2 100644
--- a/fs/xfs/scrub/symlink.c
+++ b/fs/xfs/scrub/symlink.c
@@ -29,12 +29,15 @@ xchk_setup_symlink(
 	struct xfs_scrub	*sc,
 	struct xfs_inode	*ip)
 {
+	uint			resblks;
+
 	/* Allocate the buffer without the inode lock held. */
 	sc->buf = kmem_zalloc_large(XFS_SYMLINK_MAXLEN + 1, KM_SLEEP);
 	if (!sc->buf)
 		return -ENOMEM;
 
-	return xchk_setup_inode_contents(sc, ip, 0);
+	resblks = xfs_symlink_blocks(sc->mp, XFS_SYMLINK_MAXLEN);
+	return xchk_setup_inode_contents(sc, ip, resblks);
 }
 
 /* Symbolic links. */
diff --git a/fs/xfs/scrub/symlink_repair.c b/fs/xfs/scrub/symlink_repair.c
new file mode 100644
index 000000000000..6888094cf941
--- /dev/null
+++ b/fs/xfs/scrub/symlink_repair.c
@@ -0,0 +1,244 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Copyright (C) 2018 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_btree.h"
+#include "xfs_bit.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_sb.h"
+#include "xfs_inode.h"
+#include "xfs_inode_fork.h"
+#include "xfs_symlink.h"
+#include "xfs_bmap.h"
+#include "xfs_quota.h"
+#include "xfs_da_format.h"
+#include "xfs_da_btree.h"
+#include "xfs_bmap_btree.h"
+#include "xfs_trans_space.h"
+#include "scrub/xfs_scrub.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/trace.h"
+#include "scrub/repair.h"
+
+/*
+ * Symbolic Link Repair
+ * ====================
+ *
+ * There's not much we can do to repair symbolic links -- we truncate them to
+ * the first NUL byte and reinitialize the target.  Zero-length symlinks are
+ * turned into links to the current dir.
+ */
+
+/* Try to salvage the pathname from rmt blocks. */
+STATIC int
+xrep_symlink_salvage_remote(
+	struct xfs_scrub	*sc)
+{
+	struct xfs_bmbt_irec	mval[XFS_SYMLINK_MAPS];
+	struct xfs_inode	*ip = sc->ip;
+	struct xfs_buf		*bp;
+	char			*target_buf = sc->buf;
+	xfs_failaddr_t		fa;
+	xfs_filblks_t		fsblocks;
+	xfs_daddr_t		d;
+	loff_t			len;
+	loff_t			offset;
+	unsigned int		byte_cnt;
+	bool			magic_ok;
+	bool			hdr_ok;
+	int			n;
+	int			nmaps = XFS_SYMLINK_MAPS;
+	int			error;
+
+	/* We'll only read until the buffer is full. */
+	len = min_t(loff_t, ip->i_d.di_size, XFS_SYMLINK_MAXLEN);
+	fsblocks = xfs_symlink_blocks(sc->mp, len);
+	error = xfs_bmapi_read(ip, 0, fsblocks, mval, &nmaps, 0);
+	if (error)
+		return error;
+
+	offset = 0;
+	for (n = 0; n < nmaps; n++) {
+		struct xfs_dsymlink_hdr	*dsl;
+
+		d = XFS_FSB_TO_DADDR(sc->mp, mval[n].br_startblock);
+
+		/* Read the rmt block.  We'll run the verifiers manually. */
+		error = xfs_trans_read_buf(sc->mp, sc->tp, sc->mp->m_ddev_targp,
+				d, XFS_FSB_TO_BB(sc->mp, mval[n].br_blockcount),
+				0, &bp, NULL);
+		if (error)
+			return error;
+		bp->b_ops = &xfs_symlink_buf_ops;
+
+		/* How many bytes do we expect to get out of this buffer? */
+		byte_cnt = XFS_FSB_TO_B(sc->mp, mval[n].br_blockcount);
+		byte_cnt = XFS_SYMLINK_BUF_SPACE(sc->mp, byte_cnt);
+		byte_cnt = min_t(unsigned int, byte_cnt, len);
+
+		/*
+		 * See if the verifiers accept this block.  We're willing to
+		 * salvage if the offset/byte/ino are ok and either the
+		 * verifier passed or the magic is ok.  Anything else and we
+		 * stop dead in our tracks.
+		 */
+		fa = bp->b_ops->verify_struct(bp);
+		dsl = bp->b_addr;
+		magic_ok = dsl->sl_magic == cpu_to_be32(XFS_SYMLINK_MAGIC);
+		hdr_ok = xfs_symlink_hdr_ok(ip->i_ino, offset, byte_cnt, bp);
+		if (!hdr_ok || (fa != NULL && !magic_ok))
+			break;
+
+		memcpy(target_buf + offset, dsl + 1, byte_cnt);
+
+		len -= byte_cnt;
+		offset += byte_cnt;
+	}
+
+	/* Ensure we have a zero at the end, and /some/ contents. */
+	if (offset == 0)
+		sprintf(target_buf, ".");
+	else
+		target_buf[offset] = 0;
+	return 0;
+}
+
+/*
+ * Try to salvage an inline symlink's contents.  Empty symlinks become a link
+ * to the current directory.
+ */
+STATIC void
+xrep_symlink_salvage_inline(
+	struct xfs_scrub	*sc)
+{
+	struct xfs_inode	*ip = sc->ip;
+	struct xfs_ifork	*ifp;
+
+	ifp = XFS_IFORK_PTR(ip, XFS_DATA_FORK);
+	if (ifp->if_u1.if_data)
+		strncpy(sc->buf, ifp->if_u1.if_data, XFS_IFORK_DSIZE(ip));
+	if (strlen(sc->buf) == 0)
+		sprintf(sc->buf, ".");
+}
+
+/* Reset an inline symlink to its fresh configuration. */
+STATIC void
+xrep_symlink_truncate_inline(
+	struct xfs_inode	*ip)
+{
+	xfs_idestroy_fork(ip, XFS_DATA_FORK);
+	ip->i_d.di_format = XFS_DINODE_FMT_EXTENTS;
+	ip->i_d.di_nextents = 0;
+	memset(&ip->i_df, 0, sizeof(struct xfs_ifork));
+	ip->i_df.if_flags |= XFS_IFEXTENTS;
+}
+
+/*
+ * Salvage an inline symlink's contents and reset data fork.
+ * Returns with the inode joined to the transaction.
+ */
+STATIC int
+xrep_symlink_inline(
+	struct xfs_scrub	*sc)
+{
+	/* Salvage whatever link target information we can find. */
+	xrep_symlink_salvage_inline(sc);
+
+	/* Truncate the symlink. */
+	xrep_symlink_truncate_inline(sc->ip);
+
+	xfs_trans_ijoin(sc->tp, sc->ip, 0);
+	return 0;
+}
+
+/*
+ * Salvage a remote symlink's contents and truncate the data fork.
+ * Returns with the inode joined to the transaction.
+ */
+STATIC int
+xrep_symlink_remote(
+	struct xfs_scrub	*sc)
+{
+	int			error;
+
+	/* Salvage whatever link target information we can find. */
+	error = xrep_symlink_salvage_remote(sc);
+	if (error)
+		return error;
+
+	/* Truncate the symlink. */
+	xfs_trans_ijoin(sc->tp, sc->ip, 0);
+	return xfs_itruncate_extents(&sc->tp, sc->ip, XFS_DATA_FORK, 0);
+}
+
+/*
+ * Reinitialize a link target.  Caller must ensure the inode is joined to
+ * the transaction.
+ */
+STATIC int
+xrep_symlink_reinitialize(
+	struct xfs_scrub	*sc)
+{
+	xfs_fsblock_t		fs_blocks;
+	unsigned int		target_len;
+	uint			resblks;
+	int			error;
+
+	/* How many blocks do we need? */
+	target_len = strlen(sc->buf);
+	ASSERT(target_len != 0);
+	if (target_len == 0 || target_len > XFS_SYMLINK_MAXLEN)
+		return -EFSCORRUPTED;
+
+	/* Set up to reinitialize the target. */
+	fs_blocks = xfs_symlink_blocks(sc->mp, target_len);
+	resblks = XFS_SYMLINK_SPACE_RES(sc->mp, target_len, fs_blocks);
+	error = xfs_trans_reserve_quota_nblks(sc->tp, sc->ip, resblks, 0,
+			XFS_QMOPT_RES_REGBLKS);
+
+	/* Try to write the new target back out. */
+	xfs_defer_ijoin(sc->tp->t_dfops, sc->ip);
+	error = xfs_symlink_write_target(sc->tp, sc->ip, sc->buf, target_len,
+			fs_blocks, resblks);
+	if (error)
+		return error;
+
+	/* Finish up any block mapping activities. */
+	return xfs_defer_finish(&sc->tp);
+}
+
+/* Repair a symbolic link. */
+int
+xrep_symlink(
+	struct xfs_scrub	*sc)
+{
+	struct xfs_ifork	*ifp;
+	int			error;
+
+	error = xfs_qm_dqattach_locked(sc->ip, false);
+	if (error)
+		return error;
+
+	/* Salvage whatever we can of the target. */
+	*((char *)sc->buf) = 0;
+	ifp = XFS_IFORK_PTR(sc->ip, XFS_DATA_FORK);
+	if (ifp->if_flags & XFS_IFINLINE)
+		error = xrep_symlink_inline(sc);
+	else
+		error = xrep_symlink_remote(sc);
+	if (error)
+		return error;
+
+	/* Now reset the target. */
+	return xrep_symlink_reinitialize(sc);
+}
diff --git a/fs/xfs/xfs_symlink.c b/fs/xfs/xfs_symlink.c
index 2bfe7fbbedb2..1097d9062ef3 100644
--- a/fs/xfs/xfs_symlink.c
+++ b/fs/xfs/xfs_symlink.c
@@ -150,6 +150,86 @@ xfs_readlink(
 	return error;
 }
 
+/* Write the symlink target into the inode. */
+int
+xfs_symlink_write_target(
+	struct xfs_trans	*tp,
+	struct xfs_inode	*ip,
+	const char		*target_path,
+	int			pathlen,
+	xfs_fsblock_t		fs_blocks,
+	uint			resblks)
+{
+	struct xfs_bmbt_irec	mval[XFS_SYMLINK_MAPS];
+	struct xfs_mount	*mp = tp->t_mountp;
+	const char		*cur_chunk;
+	struct xfs_buf		*bp;
+	xfs_daddr_t		d;
+	int			byte_cnt;
+	int			nmaps;
+	int			offset;
+	int			n;
+	int			error;
+
+	/*
+	 * If the symlink will fit into the inode, write it inline.
+	 */
+	if (pathlen <= XFS_IFORK_DSIZE(ip)) {
+		xfs_init_local_fork(ip, XFS_DATA_FORK, target_path, pathlen);
+
+		ip->i_d.di_size = pathlen;
+		ip->i_d.di_format = XFS_DINODE_FMT_LOCAL;
+		xfs_trans_log_inode(tp, ip, XFS_ILOG_DDATA | XFS_ILOG_CORE);
+
+		return 0;
+	}
+
+	/* Write target to remote blocks. */
+	nmaps = XFS_SYMLINK_MAPS;
+	error = xfs_bmapi_write(tp, ip, 0, fs_blocks, XFS_BMAPI_METADATA,
+			resblks, mval, &nmaps);
+	if (error)
+		return error;
+
+	if (resblks)
+		resblks -= fs_blocks;
+	ip->i_d.di_size = pathlen;
+	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+
+	cur_chunk = target_path;
+	offset = 0;
+	for (n = 0; n < nmaps; n++) {
+		char	*buf;
+
+		d = XFS_FSB_TO_DADDR(mp, mval[n].br_startblock);
+		byte_cnt = XFS_FSB_TO_B(mp, mval[n].br_blockcount);
+		bp = xfs_trans_get_buf(tp, mp->m_ddev_targp, d,
+				BTOBB(byte_cnt), 0);
+		if (!bp)
+			return -ENOMEM;
+		bp->b_ops = &xfs_symlink_buf_ops;
+
+		byte_cnt = XFS_SYMLINK_BUF_SPACE(mp, byte_cnt);
+		byte_cnt = min(byte_cnt, pathlen);
+
+		buf = bp->b_addr;
+		buf += xfs_symlink_hdr_set(mp, ip->i_ino, offset,
+					   byte_cnt, bp);
+
+		memcpy(buf, cur_chunk, byte_cnt);
+
+		cur_chunk += byte_cnt;
+		pathlen -= byte_cnt;
+		offset += byte_cnt;
+
+		xfs_trans_buf_set_type(tp, bp, XFS_BLFT_SYMLINK_BUF);
+		xfs_trans_log_buf(tp, bp, 0, (buf + byte_cnt - 1) -
+						(char *)bp->b_addr);
+	}
+	ASSERT(pathlen == 0);
+	return 0;
+}
+
 int
 xfs_symlink(
 	struct xfs_inode	*dp,
@@ -164,15 +244,7 @@ xfs_symlink(
 	int			error = 0;
 	int			pathlen;
 	bool                    unlock_dp_on_error = false;
-	xfs_fileoff_t		first_fsb;
 	xfs_filblks_t		fs_blocks;
-	int			nmaps;
-	struct xfs_bmbt_irec	mval[XFS_SYMLINK_MAPS];
-	xfs_daddr_t		d;
-	const char		*cur_chunk;
-	int			byte_cnt;
-	int			n;
-	xfs_buf_t		*bp;
 	prid_t			prid;
 	struct xfs_dquot	*udqp = NULL;
 	struct xfs_dquot	*gdqp = NULL;
@@ -265,65 +337,11 @@ xfs_symlink(
 
 	if (resblks)
 		resblks -= XFS_IALLOC_SPACE_RES(mp);
-	/*
-	 * If the symlink will fit into the inode, write it inline.
-	 */
-	if (pathlen <= XFS_IFORK_DSIZE(ip)) {
-		xfs_init_local_fork(ip, XFS_DATA_FORK, target_path, pathlen);
-
-		ip->i_d.di_size = pathlen;
-		ip->i_d.di_format = XFS_DINODE_FMT_LOCAL;
-		xfs_trans_log_inode(tp, ip, XFS_ILOG_DDATA | XFS_ILOG_CORE);
-	} else {
-		int	offset;
-
-		first_fsb = 0;
-		nmaps = XFS_SYMLINK_MAPS;
-
-		error = xfs_bmapi_write(tp, ip, first_fsb, fs_blocks,
-				  XFS_BMAPI_METADATA, resblks, mval, &nmaps);
-		if (error)
-			goto out_trans_cancel;
-
-		if (resblks)
-			resblks -= fs_blocks;
-		ip->i_d.di_size = pathlen;
-		xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
-
-		cur_chunk = target_path;
-		offset = 0;
-		for (n = 0; n < nmaps; n++) {
-			char	*buf;
-
-			d = XFS_FSB_TO_DADDR(mp, mval[n].br_startblock);
-			byte_cnt = XFS_FSB_TO_B(mp, mval[n].br_blockcount);
-			bp = xfs_trans_get_buf(tp, mp->m_ddev_targp, d,
-					       BTOBB(byte_cnt), 0);
-			if (!bp) {
-				error = -ENOMEM;
-				goto out_trans_cancel;
-			}
-			bp->b_ops = &xfs_symlink_buf_ops;
-
-			byte_cnt = XFS_SYMLINK_BUF_SPACE(mp, byte_cnt);
-			byte_cnt = min(byte_cnt, pathlen);
-
-			buf = bp->b_addr;
-			buf += xfs_symlink_hdr_set(mp, ip->i_ino, offset,
-						   byte_cnt, bp);
-
-			memcpy(buf, cur_chunk, byte_cnt);
 
-			cur_chunk += byte_cnt;
-			pathlen -= byte_cnt;
-			offset += byte_cnt;
-
-			xfs_trans_buf_set_type(tp, bp, XFS_BLFT_SYMLINK_BUF);
-			xfs_trans_log_buf(tp, bp, 0, (buf + byte_cnt - 1) -
-							(char *)bp->b_addr);
-		}
-		ASSERT(pathlen == 0);
-	}
+	error = xfs_symlink_write_target(tp, ip, target_path, pathlen,
+			fs_blocks, resblks);
+	if (error)
+		goto out_trans_cancel;
 
 	/*
 	 * Create the directory entry for the symlink.
diff --git a/fs/xfs/xfs_symlink.h b/fs/xfs/xfs_symlink.h
index 9743d8c9394b..d7252f9cab41 100644
--- a/fs/xfs/xfs_symlink.h
+++ b/fs/xfs/xfs_symlink.h
@@ -12,5 +12,8 @@ int xfs_symlink(struct xfs_inode *dp, struct xfs_name *link_name,
 int xfs_readlink_bmap_ilocked(struct xfs_inode *ip, char *link);
 int xfs_readlink(struct xfs_inode *ip, char *link);
 int xfs_inactive_symlink(struct xfs_inode *ip);
+int xfs_symlink_write_target(struct xfs_trans *tp, struct xfs_inode *ip,
+		const char *target_path, int pathlen, xfs_fsblock_t fs_blocks,
+		uint resblks);
 
 #endif /* __XFS_SYMLINK_H */
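
For readers following the remote-target path: xfs_symlink_write_target
copies the path into each mapped block after a per-block header, taking
min(space in block, bytes remaining) each time, and the salvage side walks
the blocks the same way and falls back to "." when nothing is recovered.
Below is a minimal user-space sketch of that chunking, not the kernel code.
The block and header sizes and all demo_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sizes for illustration; not the XFS on-disk values. */
#define DEMO_BLOCK_SIZE	16
#define DEMO_HDR_SIZE	4
#define DEMO_BUF_SPACE	(DEMO_BLOCK_SIZE - DEMO_HDR_SIZE)

/*
 * Copy the target into up to nblocks blocks, leaving DEMO_HDR_SIZE bytes at
 * the start of each block for a header.  Returns the number of bytes that
 * did not fit, so 0 means the whole target was written.
 */
static size_t
demo_write_target(
	char		blocks[][DEMO_BLOCK_SIZE],
	size_t		nblocks,
	const char	*target,
	size_t		pathlen)
{
	const char	*cur_chunk = target;
	size_t		n;

	assert(target != NULL);
	for (n = 0; n < nblocks && pathlen > 0; n++) {
		size_t	byte_cnt = DEMO_BUF_SPACE;

		if (byte_cnt > pathlen)
			byte_cnt = pathlen;
		memset(blocks[n], 0, DEMO_BLOCK_SIZE);
		memcpy(blocks[n] + DEMO_HDR_SIZE, cur_chunk, byte_cnt);
		cur_chunk += byte_cnt;
		pathlen -= byte_cnt;
	}
	return pathlen;
}

/*
 * Reassemble up to pathlen bytes from the blocks into out, which must hold
 * at least pathlen + 1 bytes and never fewer than 2.  An empty result
 * becomes a link to the current directory.
 */
static void
demo_salvage_target(
	char		blocks[][DEMO_BLOCK_SIZE],
	size_t		nblocks,
	size_t		pathlen,
	char		*out)
{
	size_t		offset = 0;
	size_t		n;

	assert(out != NULL);
	for (n = 0; n < nblocks && pathlen > 0; n++) {
		size_t	byte_cnt = DEMO_BUF_SPACE;

		if (byte_cnt > pathlen)
			byte_cnt = pathlen;
		memcpy(out + offset, blocks[n] + DEMO_HDR_SIZE, byte_cnt);
		offset += byte_cnt;
		pathlen -= byte_cnt;
	}

	/* Ensure we have a zero at the end, and some contents. */
	if (offset == 0)
		strcpy(out, ".");
	else
		out[offset] = '\0';
}
```

A 21-byte target with these sizes occupies two blocks (12 + 9 bytes), and
salvaging with a length of zero yields ".".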



* [PATCH 12/14] xfs: repair extended attributes
  2018-07-30  5:47 [PATCH v17.1 00/14] xfs-4.19: online repair support Darrick J. Wong
                   ` (10 preceding siblings ...)
  2018-07-30  5:49 ` [PATCH 11/14] xfs: repair damaged symlinks Darrick J. Wong
@ 2018-07-30  5:49 ` Darrick J. Wong
  2018-07-30  5:49 ` [PATCH 13/14] xfs: scrub should set preen if attr leaf has holes Darrick J. Wong
  2018-07-30  5:49 ` [PATCH 14/14] xfs: repair quotas Darrick J. Wong
  13 siblings, 0 replies; 53+ messages in thread
From: Darrick J. Wong @ 2018-07-30  5:49 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs, bfoster, david, allison.henderson

From: Darrick J. Wong <darrick.wong@oracle.com>

If the extended attributes look bad, try to sift through the rubble to
find whatever keys/values we can, zap the attr tree, and re-add the
values.
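
To illustrate the "sift through the rubble" step: each candidate record is
accepted only if it lies inside the block and does not overlap space that
the header or an earlier record already claimed; anything else is skipped
rather than treated as fatal.  The sketch below is a simplified user-space
stand-in, not the kernel's xchk_xattr_set_map (which this sketch does not
reproduce exactly).  The block size, the byte-per-flag map and all demo_*
names are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define DEMO_BLKSIZE	64	/* hypothetical block size in bytes */

/* One flag per byte of the block; simpler than a packed bitmap. */
struct demo_usedmap {
	bool	used[DEMO_BLKSIZE];
};

/* Hypothetical record descriptor: where the name/value bytes live. */
struct demo_rec {
	unsigned int	off;
	unsigned int	len;
};

/*
 * Claim bytes [start, start + len).  Returns false, claiming nothing, if
 * the range is empty, leaves the block, or overlaps an earlier claim.
 */
static bool
demo_set_map(
	struct demo_usedmap	*map,
	unsigned int		start,
	unsigned int		len)
{
	unsigned int		i;

	if (len == 0 || start >= DEMO_BLKSIZE || len > DEMO_BLKSIZE - start)
		return false;
	for (i = start; i < start + len; i++)
		if (map->used[i])
			return false;
	for (i = start; i < start + len; i++)
		map->used[i] = true;
	return true;
}

/* Count the records worth salvaging; damaged ones are skipped. */
static unsigned int
demo_salvage(
	const struct demo_rec	*recs,
	unsigned int		nr,
	unsigned int		hdrsize)
{
	struct demo_usedmap	map;
	unsigned int		saved = 0;
	unsigned int		i;

	assert(recs != NULL || nr == 0);
	memset(&map, 0, sizeof(map));

	/* The block header claims its space before any record does. */
	if (!demo_set_map(&map, 0, hdrsize))
		return 0;

	for (i = 0; i < nr; i++) {
		if (!demo_set_map(&map, recs[i].off, recs[i].len))
			continue;
		saved++;
	}
	return saved;
}
```

Skipping a bad record instead of failing the whole pass is the design
choice that matters here: there is no second copy of the data, so every
record that still looks sane is worth keeping.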

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/Makefile            |    1 
 fs/xfs/scrub/attr.c        |    2 
 fs/xfs/scrub/attr_repair.c |  611 ++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/repair.h      |    2 
 fs/xfs/scrub/scrub.c       |    2 
 fs/xfs/scrub/scrub.h       |    3 
 6 files changed, 619 insertions(+), 2 deletions(-)
 create mode 100644 fs/xfs/scrub/attr_repair.c


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index e25cde969d99..c3963c88f952 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -164,6 +164,7 @@ xfs-$(CONFIG_XFS_QUOTA)		+= scrub/quota.o
 ifeq ($(CONFIG_XFS_ONLINE_REPAIR),y)
 xfs-y				+= $(addprefix scrub/, \
 				   agheader_repair.o \
+				   attr_repair.o \
 				   alloc_repair.o \
 				   bitmap.o \
 				   bmap_repair.o \
diff --git a/fs/xfs/scrub/attr.c b/fs/xfs/scrub/attr.c
index 81d5e90547a1..e20074c241b5 100644
--- a/fs/xfs/scrub/attr.c
+++ b/fs/xfs/scrub/attr.c
@@ -125,7 +125,7 @@ xchk_xattr_listent(
  * Within a char, the lowest bit of the char represents the byte with
  * the smallest address
  */
-STATIC bool
+bool
 xchk_xattr_set_map(
 	struct xfs_scrub	*sc,
 	unsigned long		*map,
diff --git a/fs/xfs/scrub/attr_repair.c b/fs/xfs/scrub/attr_repair.c
new file mode 100644
index 000000000000..5bacfb88f25e
--- /dev/null
+++ b/fs/xfs/scrub/attr_repair.c
@@ -0,0 +1,611 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Copyright (C) 2018 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_btree.h"
+#include "xfs_bit.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_sb.h"
+#include "xfs_inode.h"
+#include "xfs_da_format.h"
+#include "xfs_da_btree.h"
+#include "xfs_dir2.h"
+#include "xfs_attr.h"
+#include "xfs_attr_leaf.h"
+#include "xfs_attr_sf.h"
+#include "xfs_attr_remote.h"
+#include "scrub/xfs_scrub.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/trace.h"
+#include "scrub/repair.h"
+
+/*
+ * Extended Attribute Repair
+ * =========================
+ *
+ * We repair extended attributes by reading the attribute fork blocks looking
+ * for keys and values, then truncate the entire attr fork and reinsert all
+ * the attributes.  Unfortunately, there's no secondary copy of most extended
+ * attribute data, which means that if we blow up midway through there's
+ * little we can do.
+ */
+
+struct xrep_xattr_key {
+	struct list_head	list;
+	unsigned char		*value;
+	int			valuelen;
+	int			flags;
+	int			namelen;
+	unsigned char		name[0];
+};
+
+#define XREP_XATTR_KEY_LEN(namelen) \
+	(sizeof(struct xrep_xattr_key) + (namelen) + 1)
+
+struct xrep_xattr {
+	struct list_head	*attrlist;
+	struct xfs_scrub	*sc;
+};
+
+/*
+ * Iterate each block in an attr fork extent.  The m_attr_geo fsbcount is
+ * always 1 for now, but code defensively in case this ever changes.
+ */
+#define for_each_xfs_attr_block(mp, irec, dabno) \
+	for ((dabno) = roundup((xfs_dablk_t)(irec)->br_startoff, \
+			(mp)->m_attr_geo->fsbcount); \
+	     (dabno) < (irec)->br_startoff + (irec)->br_blockcount; \
+	     (dabno) += (mp)->m_attr_geo->fsbcount)
+
+/*
+ * Decide if we want to salvage this attribute.  We don't bother with
+ * incomplete or oversized keys or values.
+ */
+STATIC bool
+xrep_xattr_want_salvage(
+	int			flags,
+	int			namelen,
+	int			valuelen)
+{
+	if (flags & XFS_ATTR_INCOMPLETE)
+		return false;
+	if (namelen > XATTR_NAME_MAX || namelen <= 0)
+		return false;
+	if (valuelen > XATTR_SIZE_MAX || valuelen < 0)
+		return false;
+	return true;
+}
+
+/* Allocate an in-core record to hold xattrs while we rebuild the xattr data. */
+STATIC struct xrep_xattr_key *
+xrep_xattr_salvage_key(
+	int			flags,
+	unsigned char		*name,
+	int			namelen,
+	int			valuelen)
+{
+	struct xrep_xattr_key	*key;
+
+	/* Store attr key. */
+	key = kmem_alloc(XREP_XATTR_KEY_LEN(namelen), KM_MAYFAIL);
+	if (!key)
+		return NULL;
+	INIT_LIST_HEAD(&key->list);
+	key->valuelen = valuelen;
+	key->flags = flags & (ATTR_ROOT | ATTR_SECURE);
+	key->namelen = namelen;
+	key->name[namelen] = 0;
+	memcpy(key->name, name, namelen);
+	key->value = NULL;
+	if (valuelen) {
+		key->value = kmem_alloc_large(valuelen, KM_MAYFAIL);
+		if (!key->value) {
+			kmem_free(key);
+			return NULL;
+		}
+	}
+	return key;
+}
+
+/*
+ * Record a shortform extended attribute key & value for later reinsertion
+ * into the inode.
+ */
+STATIC int
+xrep_xattr_salvage_sf_attr(
+	struct xrep_xattr		*rx,
+	struct xfs_attr_sf_entry	*sfe)
+{
+	unsigned char			*value = &sfe->nameval[sfe->namelen];
+	struct xrep_xattr_key		*key;
+
+	if (!xrep_xattr_want_salvage(sfe->flags, sfe->namelen, sfe->valuelen))
+		return 0;
+	key = xrep_xattr_salvage_key(sfe->flags, sfe->nameval, sfe->namelen,
+			sfe->valuelen);
+	if (!key)
+		return -ENOMEM;
+	if (sfe->valuelen)
+		memcpy(key->value, value, sfe->valuelen);
+	list_add_tail(&key->list, rx->attrlist);
+	return 0;
+}
+
+/*
+ * Record a local format extended attribute key & value for later reinsertion
+ * into the inode.
+ */
+STATIC int
+xrep_xattr_salvage_local_attr(
+	struct xrep_xattr		*rx,
+	struct xfs_attr_leaf_entry	*ent,
+	unsigned int			nameidx,
+	const char			*buf_end,
+	struct xfs_attr_leaf_name_local	*lentry)
+{
+	struct xrep_xattr_key		*key;
+	unsigned long			*usedmap = rx->sc->buf;
+	unsigned int			valuelen;
+	unsigned int			namesize;
+
+	/*
+	 * Decode the leaf local entry format.  If something seems wrong, we
+	 * junk the attribute.
+	 */
+	valuelen = be16_to_cpu(lentry->valuelen);
+	namesize = xfs_attr_leaf_entsize_local(lentry->namelen, valuelen);
+	if ((char *)lentry + namesize > buf_end)
+		return 0;
+	if (!xrep_xattr_want_salvage(ent->flags, lentry->namelen, valuelen))
+		return 0;
+	if (!xchk_xattr_set_map(rx->sc, usedmap, nameidx, namesize))
+		return 0;
+
+	/* Try to save this attribute. */
+	key = xrep_xattr_salvage_key(ent->flags, lentry->nameval,
+			lentry->namelen, valuelen);
+	if (!key)
+		return -ENOMEM;
+	if (valuelen)
+		memcpy(key->value, &lentry->nameval[lentry->namelen], valuelen);
+	list_add_tail(&key->list, rx->attrlist);
+	return 0;
+}
+
+/*
+ * Record a remote format extended attribute key & value for later reinsertion
+ * into the inode.
+ */
+STATIC int
+xrep_xattr_salvage_remote_attr(
+	struct xrep_xattr		*rx,
+	struct xfs_attr_leaf_entry	*ent,
+	unsigned int			nameidx,
+	const char			*buf_end,
+	struct xfs_attr_leaf_name_remote *rentry,
+	unsigned int			ent_idx,
+	struct xfs_buf			*leaf_bp)
+{
+	struct xfs_da_args		args = {
+		.trans		= rx->sc->tp,
+		.dp		= rx->sc->ip,
+		.index		= ent_idx,
+		.geo		= rx->sc->mp->m_attr_geo,
+	};
+	struct xrep_xattr_key		*key;
+	unsigned long			*usedmap = rx->sc->buf;
+	unsigned int			valuelen;
+	unsigned int			namesize;
+	int				error;
+
+	/*
+	 * Decode the leaf remote entry format.  If something seems wrong, we
+	 * junk the attribute.  Note that we should never find a zero-length
+	 * remote attribute value.
+	 */
+	valuelen = be32_to_cpu(rentry->valuelen);
+	namesize = xfs_attr_leaf_entsize_remote(rentry->namelen);
+	if ((char *)rentry + namesize > buf_end)
+		return 0;
+	if (valuelen == 0 ||
+	    !xrep_xattr_want_salvage(ent->flags, rentry->namelen, valuelen))
+		return 0;
+	if (!xchk_xattr_set_map(rx->sc, usedmap, nameidx, namesize))
+		return 0;
+
+	/* Try to save this attribute. */
+	key = xrep_xattr_salvage_key(ent->flags, rentry->name, rentry->namelen,
+			valuelen);
+	if (!key)
+		return -ENOMEM;
+
+	/* Look up the remote value and stash it for reconstruction. */
+	args.valuelen = valuelen;
+	args.namelen = rentry->namelen;
+	args.name = key->name;
+	args.value = key->value;
+	error = xfs_attr3_leaf_getvalue(leaf_bp, &args);
+	if (error || args.rmtblkno == 0)
+		goto err_free;
+
+	error = xfs_attr_rmtval_get(&args);
+	if (error == 0) {
+		/* Got the value, add the attr and get out. */
+		list_add_tail(&key->list, rx->attrlist);
+		return 0;
+	}
+
+err_free:
+	/* remote value was garbage, junk it */
+	if (error == -EFSBADCRC || error == -EFSCORRUPTED)
+		error = 0;
+	kmem_free(key->value);
+	kmem_free(key);
+	return error;
+}
+
+/* Extract every xattr key that we can from this attr fork block. */
+STATIC int
+xrep_xattr_recover_leaf(
+	struct xrep_xattr		*rx,
+	struct xfs_buf			*bp)
+{
+	struct xfs_attr3_icleaf_hdr	leafhdr;
+	struct xfs_scrub		*sc = rx->sc;
+	struct xfs_mount		*mp = sc->mp;
+	struct xfs_attr_leafblock	*leaf;
+	unsigned long			*usedmap = sc->buf;
+	struct xfs_attr_leaf_name_local	*lentry;
+	struct xfs_attr_leaf_name_remote *rentry;
+	struct xfs_attr_leaf_entry	*ent;
+	struct xfs_attr_leaf_entry	*entries;
+	char				*buf_end;
+	size_t				off;
+	unsigned int			nameidx;
+	unsigned int			hdrsize;
+	int				i;
+	int				error = 0;
+
+	bitmap_zero(usedmap, mp->m_attr_geo->blksize);
+
+	/* Check the leaf header */
+	leaf = bp->b_addr;
+	xfs_attr3_leaf_hdr_from_disk(mp->m_attr_geo, &leafhdr, leaf);
+	hdrsize = xfs_attr3_leaf_hdr_size(leaf);
+	xchk_xattr_set_map(sc, usedmap, 0, hdrsize);
+	entries = xfs_attr3_leaf_entryp(leaf);
+
+	buf_end = (char *)bp->b_addr + mp->m_attr_geo->blksize;
+	for (i = 0, ent = entries; i < leafhdr.count; ent++, i++) {
+		/* Skip key if it conflicts with something else? */
+		off = (char *)ent - (char *)leaf;
+		if (!xchk_xattr_set_map(sc, usedmap, off,
+				sizeof(xfs_attr_leaf_entry_t)))
+			continue;
+
+		/* Check the name information. */
+		nameidx = be16_to_cpu(ent->nameidx);
+		if (nameidx < leafhdr.firstused ||
+		    nameidx >= mp->m_attr_geo->blksize)
+			continue;
+
+		if (ent->flags & XFS_ATTR_LOCAL) {
+			lentry = xfs_attr3_leaf_name_local(leaf, i);
+			error = xrep_xattr_salvage_local_attr(rx, ent, nameidx,
+					buf_end, lentry);
+		} else {
+			rentry = xfs_attr3_leaf_name_remote(leaf, i);
+			error = xrep_xattr_salvage_remote_attr(rx, ent, nameidx,
+					buf_end, rentry, i, bp);
+		}
+		if (error)
+			break;
+	}
+
+	return error;
+}
+
+/* Try to recover shortform attrs. */
+STATIC int
+xrep_xattr_recover_sf(
+	struct xrep_xattr		*rx)
+{
+	struct xfs_attr_shortform	*sf;
+	struct xfs_attr_sf_entry	*sfe;
+	struct xfs_attr_sf_entry	*next;
+	struct xfs_ifork		*ifp;
+	unsigned char			*end;
+	int				i;
+	int				error;
+
+	ifp = XFS_IFORK_PTR(rx->sc->ip, XFS_ATTR_FORK);
+	sf = (struct xfs_attr_shortform *)rx->sc->ip->i_afp->if_u1.if_data;
+	end = (unsigned char *)ifp->if_u1.if_data + ifp->if_bytes;
+
+	for (i = 0, sfe = &sf->list[0]; i < sf->hdr.count; i++) {
+		next = XFS_ATTR_SF_NEXTENTRY(sfe);
+		if ((unsigned char *)next > end)
+			break;
+
+		/* Ok, let's save this key/value. */
+		error = xrep_xattr_salvage_sf_attr(rx, sfe);
+		if (error)
+			return error;
+
+		sfe = next;
+	}
+
+	return 0;
+}
+
+/* Extract as many attribute keys and values as we can. */
+STATIC int
+xrep_xattr_recover(
+	struct xrep_xattr	*rx)
+{
+	struct xfs_iext_cursor	icur;
+	struct xfs_bmbt_irec	got;
+	struct xfs_scrub	*sc = rx->sc;
+	struct xfs_ifork	*ifp;
+	struct xfs_da_blkinfo	*info;
+	struct xfs_buf		*bp;
+	xfs_dablk_t		dabno;
+	int			error = 0;
+
+	if (sc->ip->i_d.di_aformat == XFS_DINODE_FMT_LOCAL)
+		return xrep_xattr_recover_sf(rx);
+
+	/* Iterate each attr block in the attr fork. */
+	ifp = XFS_IFORK_PTR(sc->ip, XFS_ATTR_FORK);
+	for_each_xfs_iext(ifp, &icur, &got) {
+		for_each_xfs_attr_block(sc->mp, &got, dabno) {
+			/*
+			 * Try to read buffer.  We invalidate them in the next
+			 * step so we don't bother to set a buffer type or
+			 * ops.
+			 */
+			error = xfs_da_read_buf(sc->tp, sc->ip, dabno, -1, &bp,
+					XFS_ATTR_FORK, NULL);
+			if (error || !bp)
+				continue;
+
+			/* Screen out non-leaves & other garbage. */
+			info = bp->b_addr;
+			if (info->magic != cpu_to_be16(XFS_ATTR3_LEAF_MAGIC) ||
+			    xfs_attr3_leaf_buf_ops.verify_struct(bp) != NULL)
+				continue;
+
+			error = xrep_xattr_recover_leaf(rx, bp);
+			if (error)
+				return error;
+		}
+	}
+
+	return error;
+}
+
+/* Free all the attribute fork blocks and delete the fork. */
+STATIC int
+xrep_xattr_reset_btree(
+	struct xfs_scrub	*sc)
+{
+	struct xfs_iext_cursor	icur;
+	struct xfs_bmbt_irec	got;
+	struct xfs_ifork	*ifp;
+	struct xfs_buf		*bp;
+	xfs_fileoff_t		lblk;
+	int			error;
+
+	xfs_trans_ijoin(sc->tp, sc->ip, 0);
+
+	if (sc->ip->i_d.di_aformat == XFS_DINODE_FMT_LOCAL)
+		goto out_fork_remove;
+
+	/* Invalidate each attr block in the attr fork. */
+	ifp = XFS_IFORK_PTR(sc->ip, XFS_ATTR_FORK);
+	for_each_xfs_iext(ifp, &icur, &got) {
+		for_each_xfs_attr_block(sc->mp, &got, lblk) {
+			error = xfs_da_get_buf(sc->tp, sc->ip, lblk, -1, &bp,
+					XFS_ATTR_FORK);
+			if (error || !bp)
+				continue;
+			xfs_trans_binval(sc->tp, bp);
+			error = xfs_trans_roll_inode(&sc->tp, sc->ip);
+			if (error)
+				return error;
+		}
+	}
+
+	/* Now free all the blocks. */
+	error = xfs_itruncate_extents(&sc->tp, sc->ip, XFS_ATTR_FORK, 0);
+	if (error)
+		return error;
+
+out_fork_remove:
+	/* Reset the attribute fork - this also destroys the in-core fork */
+	xfs_attr_fork_remove(sc->ip, sc->tp);
+	return 0;
+}
+
+/*
+ * Compare two xattr keys.  Keys sort by ascending flags value, so user attrs
+ * come before ATTR_ROOT and ATTR_ROOT before ATTR_SECURE; ties sort by hash.
+ */
+static int
+xrep_xattr_key_cmp(
+	void			*priv,
+	struct list_head	*a,
+	struct list_head	*b)
+{
+	struct xrep_xattr_key	*ap;
+	struct xrep_xattr_key	*bp;
+	uint			ahash;
+	uint			bhash;
+
+	ap = container_of(a, struct xrep_xattr_key, list);
+	bp = container_of(b, struct xrep_xattr_key, list);
+
+	if (ap->flags > bp->flags)
+		return 1;
+	else if (ap->flags < bp->flags)
+		return -1;
+
+	ahash = xfs_da_hashname(ap->name, ap->namelen);
+	bhash = xfs_da_hashname(bp->name, bp->namelen);
+	if (ahash > bhash)
+		return 1;
+	else if (ahash < bhash)
+		return -1;
+	return 0;
+}
+
+/*
+ * Find all the extended attributes for this inode by scraping them out of the
+ * attribute key blocks by hand.  The caller must clean up the lists if
+ * anything goes wrong.
+ */
+STATIC int
+xrep_xattr_find_attributes(
+	struct xfs_scrub	*sc,
+	struct list_head	*attrlist)
+{
+	struct xrep_xattr	rx;
+	struct xfs_ifork	*ifp;
+	int			error;
+
+	error = xrep_ino_dqattach(sc);
+	if (error)
+		return error;
+
+	/* Extent map should be loaded. */
+	ifp = XFS_IFORK_PTR(sc->ip, XFS_ATTR_FORK);
+	if (XFS_IFORK_FORMAT(sc->ip, XFS_ATTR_FORK) != XFS_DINODE_FMT_LOCAL &&
+	    !(ifp->if_flags & XFS_IFEXTENTS)) {
+		error = xfs_iread_extents(sc->tp, sc->ip, XFS_ATTR_FORK);
+		if (error)
+			return error;
+	}
+
+	rx.attrlist = attrlist;
+	rx.sc = sc;
+
+	/* Read every attr key and value and record them in memory. */
+	return xrep_xattr_recover(&rx);
+}
+
+/* Free all the attributes. */
+STATIC void
+xrep_xattr_cancel_attrs(
+	struct list_head	*attrlist)
+{
+	struct xrep_xattr_key	*key;
+	struct xrep_xattr_key	*n;
+
+	list_for_each_entry_safe(key, n, attrlist, list) {
+		list_del(&key->list);
+		kmem_free(key->value);
+		kmem_free(key);
+	}
+}
+
+/*
+ * Insert all the attributes that we collected.
+ *
+ * Commit the repair transaction and drop the ilock because the attribute
+ * setting code needs to be able to allocate special transactions and take the
+ * ilock on its own.  Some day we'll have deferred attribute setting, at which
+ * point we'll be able to use that to replace the attributes atomically and
+ * safely.
+ */
+STATIC int
+xrep_xattr_rebuild_tree(
+	struct xfs_scrub	*sc,
+	struct list_head	*attrlist)
+{
+	struct xrep_xattr_key	*key;
+	struct xrep_xattr_key	*n;
+	int			error;
+
+	error = xfs_trans_commit(sc->tp);
+	sc->tp = NULL;
+	if (error)
+		return error;
+
+	xfs_iunlock(sc->ip, XFS_ILOCK_EXCL);
+	sc->ilock_flags &= ~XFS_ILOCK_EXCL;
+
+	/* Re-add every attr to the file. */
+	list_sort(NULL, attrlist, xrep_xattr_key_cmp);
+	list_for_each_entry_safe(key, n, attrlist, list) {
+		error = xfs_attr_set(sc->ip, key->name, key->value,
+				key->valuelen, key->flags);
+		if (error)
+			return error;
+
+		/*
+		 * If the attr value is larger than a single page, free the
+		 * key now so that we aren't hogging memory while doing a lot
+		 * of metadata updates.  Otherwise, we want to spend as little
+		 * time reconstructing the attrs as we possibly can.
+		 */
+		if (key->valuelen <= PAGE_SIZE)
+			continue;
+		list_del(&key->list);
+		kmem_free(key->value);
+		kmem_free(key);
+	}
+
+	xrep_xattr_cancel_attrs(attrlist);
+	return 0;
+}
+
+/*
+ * Repair the extended attribute metadata.
+ *
+ * XXX: Remote attribute value buffers encompass the entire (up to 64k) buffer.
+ * The buffer cache in XFS can't handle aliased multiblock buffers, so this
+ * might misbehave if the attr fork is crosslinked with other filesystem
+ * metadata.
+ */
+int
+xrep_xattr(
+	struct xfs_scrub	*sc)
+{
+	struct list_head	attrlist;
+	int			error;
+
+	if (!xfs_inode_hasattr(sc->ip))
+		return -ENOENT;
+
+	/* Collect extended attributes by parsing raw blocks. */
+	INIT_LIST_HEAD(&attrlist);
+	error = xrep_xattr_find_attributes(sc, &attrlist);
+	if (error)
+		goto out;
+
+	/*
+	 * Invalidate and truncate all attribute fork extents.  This is the
+	 * point at which we are no longer able to bail out gracefully.
+	 * The rebuild step commits the transaction because xfs_attr_set
+	 * allocates its own transactions.
+	 */
+	error = xrep_xattr_reset_btree(sc);
+	if (error)
+		goto out;
+
+	/* Now rebuild the attribute information. */
+	error = xrep_xattr_rebuild_tree(sc, &attrlist);
+out:
+	xrep_xattr_cancel_attrs(&attrlist);
+	return error;
+}
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index 17769efb20d9..b630084d0f39 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -69,6 +69,7 @@ int xrep_inode(struct xfs_scrub *sc);
 int xrep_bmap_data(struct xfs_scrub *sc);
 int xrep_bmap_attr(struct xfs_scrub *sc);
 int xrep_symlink(struct xfs_scrub *sc);
+int xrep_xattr(struct xfs_scrub *sc);
 
 #else
 
@@ -110,6 +111,7 @@ xrep_reset_perag_resv(
 #define xrep_bmap_data			xrep_notsupported
 #define xrep_bmap_attr			xrep_notsupported
 #define xrep_symlink			xrep_notsupported
+#define xrep_xattr			xrep_notsupported
 
 #endif /* CONFIG_XFS_ONLINE_REPAIR */
 
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 0a8eea77e58f..537636d789fb 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -301,7 +301,7 @@ static const struct xchk_meta_ops meta_scrub_ops[] = {
 		.type	= ST_INODE,
 		.setup	= xchk_setup_xattr,
 		.scrub	= xchk_xattr,
-		.repair	= xrep_notsupported,
+		.repair	= xrep_xattr,
 	},
 	[XFS_SCRUB_TYPE_SYMLINK] = {	/* symbolic link */
 		.type	= ST_INODE,
diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h
index 762db46fd696..d7ad8fad9318 100644
--- a/fs/xfs/scrub/scrub.h
+++ b/fs/xfs/scrub/scrub.h
@@ -139,4 +139,7 @@ void xchk_xref_is_used_rt_space(struct xfs_scrub *sc, xfs_rtblock_t rtbno,
 # define xchk_xref_is_used_rt_space(sc, rtbno, len) do { } while (0)
 #endif
 
+bool xchk_xattr_set_map(struct xfs_scrub *sc, unsigned long *map,
+		unsigned int start, unsigned int len);
+
 #endif	/* __XFS_SCRUB_SCRUB_H__ */
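
A note on the reinsertion order used by xrep_xattr_key_cmp: keys compare
first on the numeric flags value and then on the name hash, both ascending.
The following user-space sketch shows the same two-level comparison with
qsort.  The flag values are assumed for illustration, demo_hash is a toy
hash rather than xfs_da_hashname, and all demo_* names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Assumed flag values for illustration only. */
#define DEMO_ATTR_ROOT		0x2
#define DEMO_ATTR_SECURE	0x8

struct demo_key {
	int		flags;
	const char	*name;
};

/* Toy string hash; stands in for the real directory/attr name hash. */
static unsigned int
demo_hash(
	const char	*name)
{
	unsigned int	h = 5381;

	while (*name)
		h = h * 33 + (unsigned char)*name++;
	return h;
}

/* Order by ascending flags value, then by ascending name hash. */
static int
demo_key_cmp(
	const void		*a,
	const void		*b)
{
	const struct demo_key	*ap = a;
	const struct demo_key	*bp = b;
	unsigned int		ahash;
	unsigned int		bhash;

	if (ap->flags != bp->flags)
		return ap->flags > bp->flags ? 1 : -1;

	ahash = demo_hash(ap->name);
	bhash = demo_hash(bp->name);
	if (ahash != bhash)
		return ahash > bhash ? 1 : -1;
	return 0;
}

static void
demo_sort_keys(
	struct demo_key	*keys,
	size_t		nr)
{
	assert(keys != NULL || nr == 0);
	qsort(keys, nr, sizeof(*keys), demo_key_cmp);
}
```

With the values assumed here, unflagged keys land first, then
DEMO_ATTR_ROOT, then DEMO_ATTR_SECURE, and keys within one group are
inserted in hash order.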



* [PATCH 13/14] xfs: scrub should set preen if attr leaf has holes
  2018-07-30  5:47 [PATCH v17.1 00/14] xfs-4.19: online repair support Darrick J. Wong
                   ` (11 preceding siblings ...)
  2018-07-30  5:49 ` [PATCH 12/14] xfs: repair extended attributes Darrick J. Wong
@ 2018-07-30  5:49 ` Darrick J. Wong
  2018-07-30  5:49 ` [PATCH 14/14] xfs: repair quotas Darrick J. Wong
  13 siblings, 0 replies; 53+ messages in thread
From: Darrick J. Wong @ 2018-07-30  5:49 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs, bfoster, david, allison.henderson, Dave Chinner

From: Darrick J. Wong <darrick.wong@oracle.com>

If an attr block indicates that it could use compaction, set the preen
flag to have the attr fork rebuilt, since the attr fork rebuilder can
take care of that for us.
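
As a rough model of the distinction drawn here: a leaf that fails a sanity
check is marked corrupt, while a leaf that is valid but fragmented is only
marked as a candidate for optimization, and either mark is enough to send
the attr fork to the rebuilder.  The sketch below is user-space only.  The
flag names, the header structure and the entry-count check are hypothetical
and do not mirror the scrub ioctl flags or the on-disk leaf header.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcome flags for illustration. */
#define DEMO_OFLAG_CORRUPT	(1u << 0)
#define DEMO_OFLAG_PREEN	(1u << 1)

/* Hypothetical in-core leaf header. */
struct demo_leafhdr {
	unsigned int	count;	/* number of entries in the leaf */
	bool		holes;	/* name area has gaps; compaction would help */
};

/* Check one leaf header and return the outcome flags it raises. */
static unsigned int
demo_check_leaf(
	const struct demo_leafhdr	*hdr,
	unsigned int			max_entries)
{
	unsigned int			flags = 0;

	assert(hdr != NULL);
	if (hdr->count > max_entries)
		flags |= DEMO_OFLAG_CORRUPT;
	if (hdr->holes)
		flags |= DEMO_OFLAG_PREEN;
	return flags;
}

/* Either a corruption or a preen request warrants a rebuild. */
static bool
demo_needs_repair(
	unsigned int	flags)
{
	return (flags & (DEMO_OFLAG_CORRUPT | DEMO_OFLAG_PREEN)) != 0;
}
```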

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/scrub/attr.c    |    2 ++
 fs/xfs/scrub/dabtree.c |   15 +++++++++++++++
 fs/xfs/scrub/dabtree.h |    1 +
 fs/xfs/scrub/trace.h   |    1 +
 4 files changed, 19 insertions(+)


diff --git a/fs/xfs/scrub/attr.c b/fs/xfs/scrub/attr.c
index e20074c241b5..0956d4588dc5 100644
--- a/fs/xfs/scrub/attr.c
+++ b/fs/xfs/scrub/attr.c
@@ -293,6 +293,8 @@ xchk_xattr_block(
 		xchk_da_set_corrupt(ds, level);
 	if (!xchk_xattr_set_map(ds->sc, usedmap, 0, hdrsize))
 		xchk_da_set_corrupt(ds, level);
+	if (leafhdr.holes)
+		xchk_da_set_preen(ds, level);
 
 	if (ds->sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT)
 		goto out;
diff --git a/fs/xfs/scrub/dabtree.c b/fs/xfs/scrub/dabtree.c
index f1260b4bfdee..e2ecf9c77010 100644
--- a/fs/xfs/scrub/dabtree.c
+++ b/fs/xfs/scrub/dabtree.c
@@ -85,6 +85,21 @@ xchk_da_set_corrupt(
 			__return_address);
 }
 
+/* Flag a da btree node in need of optimization. */
+void
+xchk_da_set_preen(
+	struct xchk_da_btree	*ds,
+	int			level)
+{
+	struct xfs_scrub	*sc = ds->sc;
+
+	sc->sm->sm_flags |= XFS_SCRUB_OFLAG_PREEN;
+	trace_xchk_fblock_preen(sc, ds->dargs.whichfork,
+			xfs_dir2_da_to_db(ds->dargs.geo,
+				ds->state->path.blk[level].blkno),
+			__return_address);
+}
+
 /* Find an entry at a certain level in a da btree. */
 STATIC void *
 xchk_da_btree_entry(
diff --git a/fs/xfs/scrub/dabtree.h b/fs/xfs/scrub/dabtree.h
index cb3f0003245b..b367bf87a183 100644
--- a/fs/xfs/scrub/dabtree.h
+++ b/fs/xfs/scrub/dabtree.h
@@ -36,6 +36,7 @@ bool xchk_da_process_error(struct xchk_da_btree *ds, int level, int *error);
 
 /* Check for da btree corruption. */
 void xchk_da_set_corrupt(struct xchk_da_btree *ds, int level);
+void xchk_da_set_preen(struct xchk_da_btree *ds, int level);
 
 int xchk_da_btree_hash(struct xchk_da_btree *ds, int level, __be32 *hashp);
 int xchk_da_btree(struct xfs_scrub *sc, int whichfork,
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index 3383b14fd0c0..d7133d1d23d6 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -230,6 +230,7 @@ DEFINE_EVENT(xchk_fblock_error_class, name, \
 
 DEFINE_SCRUB_FBLOCK_ERROR_EVENT(xchk_fblock_error);
 DEFINE_SCRUB_FBLOCK_ERROR_EVENT(xchk_fblock_warning);
+DEFINE_SCRUB_FBLOCK_ERROR_EVENT(xchk_fblock_preen);
 
 TRACE_EVENT(xchk_incomplete,
 	TP_PROTO(struct xfs_scrub *sc, void *ret_ip),



* [PATCH 14/14] xfs: repair quotas
  2018-07-30  5:47 [PATCH v17.1 00/14] xfs-4.19: online repair support Darrick J. Wong
                   ` (12 preceding siblings ...)
  2018-07-30  5:49 ` [PATCH 13/14] xfs: scrub should set preen if attr leaf has holes Darrick J. Wong
@ 2018-07-30  5:49 ` Darrick J. Wong
  13 siblings, 0 replies; 53+ messages in thread
From: Darrick J. Wong @ 2018-07-30  5:49 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs, bfoster, david, allison.henderson

From: Darrick J. Wong <darrick.wong@oracle.com>

Fix anything that causes the quota verifiers to fail.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/Makefile             |    1 
 fs/xfs/scrub/attr_repair.c  |    2 
 fs/xfs/scrub/common.h       |    9 +
 fs/xfs/scrub/quota.c        |    2 
 fs/xfs/scrub/quota_repair.c |  363 +++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/repair.c       |   58 +++++++
 fs/xfs/scrub/repair.h       |    8 +
 fs/xfs/scrub/scrub.c        |   11 +
 8 files changed, 446 insertions(+), 8 deletions(-)
 create mode 100644 fs/xfs/scrub/quota_repair.c


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index c3963c88f952..ed1fc827ed15 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -174,5 +174,6 @@ xfs-y				+= $(addprefix scrub/, \
 				   repair.o \
 				   symlink_repair.o \
 				   )
+xfs-$(CONFIG_XFS_QUOTA)		+= scrub/quota_repair.o
 endif
 endif
diff --git a/fs/xfs/scrub/attr_repair.c b/fs/xfs/scrub/attr_repair.c
index 5bacfb88f25e..e01ca4350857 100644
--- a/fs/xfs/scrub/attr_repair.c
+++ b/fs/xfs/scrub/attr_repair.c
@@ -395,7 +395,7 @@ xrep_xattr_recover(
 }
 
 /* Free all the attribute fork blocks and delete the fork. */
-STATIC int
+int
 xrep_xattr_reset_btree(
 	struct xfs_scrub	*sc)
 {
diff --git a/fs/xfs/scrub/common.h b/fs/xfs/scrub/common.h
index 2d4324d12f9a..aab82f7f9a67 100644
--- a/fs/xfs/scrub/common.h
+++ b/fs/xfs/scrub/common.h
@@ -138,4 +138,13 @@ static inline bool xchk_skip_xref(struct xfs_scrub_metadata *sm)
 int xchk_metadata_inode_forks(struct xfs_scrub *sc);
 int xchk_ilock_inverted(struct xfs_inode *ip, uint lock_mode);
 
+/* Do we need to invoke the repair tool? */
+static inline bool xfs_scrub_needs_repair(struct xfs_scrub_metadata *sm)
+{
+	return sm->sm_flags & (XFS_SCRUB_OFLAG_CORRUPT |
+			       XFS_SCRUB_OFLAG_XCORRUPT |
+			       XFS_SCRUB_OFLAG_PREEN);
+}
+uint xchk_quota_to_dqtype(struct xfs_scrub *sc);
+
 #endif	/* __XFS_SCRUB_COMMON_H__ */
diff --git a/fs/xfs/scrub/quota.c b/fs/xfs/scrub/quota.c
index 782d582d3edd..0e5578ab088e 100644
--- a/fs/xfs/scrub/quota.c
+++ b/fs/xfs/scrub/quota.c
@@ -29,7 +29,7 @@
 #include "scrub/trace.h"
 
 /* Convert a scrub type code to a DQ flag, or return 0 if error. */
-static inline uint
+uint
 xchk_quota_to_dqtype(
 	struct xfs_scrub	*sc)
 {
diff --git a/fs/xfs/scrub/quota_repair.c b/fs/xfs/scrub/quota_repair.c
new file mode 100644
index 000000000000..36635f7ca217
--- /dev/null
+++ b/fs/xfs/scrub/quota_repair.c
@@ -0,0 +1,363 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Copyright (C) 2018 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_btree.h"
+#include "xfs_bit.h"
+#include "xfs_log_format.h"
+#include "xfs_trans.h"
+#include "xfs_sb.h"
+#include "xfs_inode.h"
+#include "xfs_inode_fork.h"
+#include "xfs_alloc.h"
+#include "xfs_bmap.h"
+#include "xfs_quota.h"
+#include "xfs_qm.h"
+#include "xfs_dquot.h"
+#include "xfs_dquot_item.h"
+#include "scrub/xfs_scrub.h"
+#include "scrub/scrub.h"
+#include "scrub/common.h"
+#include "scrub/trace.h"
+#include "scrub/repair.h"
+
+/*
+ * Quota Repair
+ * ============
+ *
+ * Quota repairs are fairly simplistic; we fix everything that the dquot
+ * verifiers complain about, cap any counters or limits that make no sense,
+ * and schedule a quotacheck if we had to fix anything.  We also repair any
+ * data fork extent records that don't apply to metadata files.
+ */
+
+struct xrep_quota_info {
+	struct xfs_scrub	*sc;
+	bool			need_quotacheck;
+};
+
+/* Repair the fields in an individual quota item. */
+STATIC int
+xrep_quota_item(
+	struct xfs_dquot	*dq,
+	uint			dqtype,
+	void			*priv)
+{
+	struct xrep_quota_info	*rqi = priv;
+	struct xfs_scrub	*sc = rqi->sc;
+	struct xfs_mount	*mp = sc->mp;
+	struct xfs_disk_dquot	*d = &dq->q_core;
+	unsigned long long	bsoft;
+	unsigned long long	isoft;
+	unsigned long long	rsoft;
+	unsigned long long	bhard;
+	unsigned long long	ihard;
+	unsigned long long	rhard;
+	unsigned long long	bcount;
+	unsigned long long	icount;
+	unsigned long long	rcount;
+	xfs_ino_t		fs_icount;
+	bool			dirty = false;
+	int			error;
+
+	/* Did we get the dquot type we wanted? */
+	if (dqtype != (d->d_flags & XFS_DQ_ALLTYPES)) {
+		d->d_flags = dqtype;
+		dirty = true;
+	}
+
+	if (d->d_pad0 || d->d_pad) {
+		d->d_pad0 = 0;
+		d->d_pad = 0;
+		dirty = true;
+	}
+
+	/* Check the limits. */
+	bhard = be64_to_cpu(d->d_blk_hardlimit);
+	ihard = be64_to_cpu(d->d_ino_hardlimit);
+	rhard = be64_to_cpu(d->d_rtb_hardlimit);
+
+	bsoft = be64_to_cpu(d->d_blk_softlimit);
+	isoft = be64_to_cpu(d->d_ino_softlimit);
+	rsoft = be64_to_cpu(d->d_rtb_softlimit);
+
+	if (bsoft > bhard) {
+		d->d_blk_softlimit = d->d_blk_hardlimit;
+		dirty = true;
+	}
+
+	if (isoft > ihard) {
+		d->d_ino_softlimit = d->d_ino_hardlimit;
+		dirty = true;
+	}
+
+	if (rsoft > rhard) {
+		d->d_rtb_softlimit = d->d_rtb_hardlimit;
+		dirty = true;
+	}
+
+	/* Check the resource counts. */
+	bcount = be64_to_cpu(d->d_bcount);
+	icount = be64_to_cpu(d->d_icount);
+	rcount = be64_to_cpu(d->d_rtbcount);
+	fs_icount = percpu_counter_sum(&mp->m_icount);
+
+	/*
+	 * Check that usage doesn't exceed physical limits.  However, on
+	 * a reflink filesystem we're allowed to exceed physical space
+	 * if there are no quota limits.  We don't know what the real number
+	 * is, but we can make quotacheck find out for us.
+	 */
+	if (!xfs_sb_version_hasreflink(&mp->m_sb) &&
+	    mp->m_sb.sb_dblocks < bcount) {
+		dq->q_res_bcount -= be64_to_cpu(dq->q_core.d_bcount);
+		dq->q_res_bcount += mp->m_sb.sb_dblocks;
+		d->d_bcount = cpu_to_be64(mp->m_sb.sb_dblocks);
+		rqi->need_quotacheck = true;
+		dirty = true;
+	}
+	if (icount > fs_icount) {
+		dq->q_res_icount -= be64_to_cpu(dq->q_core.d_icount);
+		dq->q_res_icount += fs_icount;
+		d->d_icount = cpu_to_be64(fs_icount);
+		rqi->need_quotacheck = true;
+		dirty = true;
+	}
+	if (rcount > mp->m_sb.sb_rblocks) {
+		dq->q_res_rtbcount -= be64_to_cpu(dq->q_core.d_rtbcount);
+		dq->q_res_rtbcount += mp->m_sb.sb_rblocks;
+		d->d_rtbcount = cpu_to_be64(mp->m_sb.sb_rblocks);
+		rqi->need_quotacheck = true;
+		dirty = true;
+	}
+
+	if (!dirty)
+		return 0;
+
+	dq->dq_flags |= XFS_DQ_DIRTY;
+	xfs_trans_dqjoin(sc->tp, dq);
+	xfs_trans_log_dquot(sc->tp, dq);
+	error = xfs_trans_roll(&sc->tp);
+	xfs_dqlock(dq);
+	return error;
+}
+
+/* Fix a quota timer so that we can pass the verifier. */
+STATIC void
+xrep_quota_fix_timer(
+	__be64			softlimit,
+	__be64			countnow,
+	__be32			*timer,
+	time_t			timelimit)
+{
+	uint64_t		soft = be64_to_cpu(softlimit);
+	uint64_t		count = be64_to_cpu(countnow);
+
+	if (soft && count > soft && *timer == 0)
+		*timer = cpu_to_be32(get_seconds() + timelimit);
+}
+
+/* Fix anything the verifiers complain about. */
+STATIC int
+xrep_quota_block(
+	struct xfs_scrub	*sc,
+	struct xfs_buf		*bp,
+	uint			dqtype,
+	xfs_dqid_t		id)
+{
+	struct xfs_dqblk	*d = (struct xfs_dqblk *)bp->b_addr;
+	struct xfs_disk_dquot	*ddq;
+	struct xfs_quotainfo	*qi = sc->mp->m_quotainfo;
+	enum xfs_blft		buftype = 0;
+	int			i;
+
+	bp->b_ops = &xfs_dquot_buf_ops;
+	for (i = 0; i < qi->qi_dqperchunk; i++) {
+		ddq = &d[i].dd_diskdq;
+
+		ddq->d_magic = cpu_to_be16(XFS_DQUOT_MAGIC);
+		ddq->d_version = XFS_DQUOT_VERSION;
+		ddq->d_flags = dqtype;
+		ddq->d_id = cpu_to_be32(id + i);
+
+		xrep_quota_fix_timer(ddq->d_blk_softlimit,
+				ddq->d_bcount, &ddq->d_btimer,
+				qi->qi_btimelimit);
+		xrep_quota_fix_timer(ddq->d_ino_softlimit,
+				ddq->d_icount, &ddq->d_itimer,
+				qi->qi_itimelimit);
+		xrep_quota_fix_timer(ddq->d_rtb_softlimit,
+				ddq->d_rtbcount, &ddq->d_rtbtimer,
+				qi->qi_rtbtimelimit);
+
+		/* We only support v5 filesystems so always set these. */
+		uuid_copy(&d[i].dd_uuid, &sc->mp->m_sb.sb_meta_uuid);
+		d[i].dd_lsn = 0;
+		xfs_update_cksum((char *)&d[i], sizeof(struct xfs_dqblk),
+				 XFS_DQUOT_CRC_OFF);
+	}
+	switch (dqtype) {
+	case XFS_DQ_USER:
+		buftype = XFS_BLFT_UDQUOT_BUF;
+		break;
+	case XFS_DQ_GROUP:
+		buftype = XFS_BLFT_GDQUOT_BUF;
+		break;
+	case XFS_DQ_PROJ:
+		buftype = XFS_BLFT_PDQUOT_BUF;
+		break;
+	}
+	xfs_trans_buf_set_type(sc->tp, bp, buftype);
+	xfs_trans_log_buf(sc->tp, bp, 0, BBTOB(bp->b_length) - 1);
+	return xfs_trans_roll(&sc->tp);
+}
+
+/* Repair quota's data fork. */
+STATIC int
+xrep_quota_data_fork(
+	struct xfs_scrub	*sc,
+	uint			dqtype)
+{
+	struct xfs_bmbt_irec	irec = { 0 };
+	struct xfs_iext_cursor	icur;
+	struct xfs_quotainfo	*qi = sc->mp->m_quotainfo;
+	struct xfs_ifork	*ifp;
+	struct xfs_buf		*bp;
+	struct xfs_dqblk	*d;
+	xfs_dqid_t		id;
+	xfs_fileoff_t		max_dqid_off;
+	xfs_fileoff_t		off;
+	xfs_fsblock_t		fsbno;
+	bool			truncate = false;
+	int			error = 0;
+
+	error = xrep_metadata_inode_forks(sc);
+	if (error)
+		goto out;
+
+	/* Check for data fork problems that apply only to quota files. */
+	max_dqid_off = ((xfs_dqid_t)-1) / qi->qi_dqperchunk;
+	ifp = XFS_IFORK_PTR(sc->ip, XFS_DATA_FORK);
+	for_each_xfs_iext(ifp, &icur, &irec) {
+		if (isnullstartblock(irec.br_startblock)) {
+			error = -EFSCORRUPTED;
+			goto out;
+		}
+
+		if (irec.br_startoff > max_dqid_off ||
+		    irec.br_startoff + irec.br_blockcount - 1 > max_dqid_off) {
+			truncate = true;
+			break;
+		}
+	}
+	if (truncate) {
+		error = xfs_itruncate_extents(&sc->tp, sc->ip, XFS_DATA_FORK,
+				max_dqid_off * sc->mp->m_sb.sb_blocksize);
+		if (error)
+			goto out;
+	}
+
+	/* Now go fix anything that fails the verifiers. */
+	for_each_xfs_iext(ifp, &icur, &irec) {
+		for (fsbno = irec.br_startblock, off = irec.br_startoff;
+		     fsbno < irec.br_startblock + irec.br_blockcount;
+		     fsbno += XFS_DQUOT_CLUSTER_SIZE_FSB,
+				off += XFS_DQUOT_CLUSTER_SIZE_FSB) {
+			id = off * qi->qi_dqperchunk;
+			error = xfs_trans_read_buf(sc->mp, sc->tp,
+					sc->mp->m_ddev_targp,
+					XFS_FSB_TO_DADDR(sc->mp, fsbno),
+					qi->qi_dqchunklen,
+					0, &bp, &xfs_dquot_buf_ops);
+			if (error == 0) {
+				d = (struct xfs_dqblk *)bp->b_addr;
+				if (id == be32_to_cpu(d->dd_diskdq.d_id)) {
+					xfs_trans_brelse(sc->tp, bp);
+					continue;
+				}
+				error = -EFSCORRUPTED;
+				xfs_trans_brelse(sc->tp, bp);
+			}
+			if (error != -EFSBADCRC && error != -EFSCORRUPTED)
+				goto out;
+
+			/* Failed verifier, try again. */
+			error = xfs_trans_read_buf(sc->mp, sc->tp,
+					sc->mp->m_ddev_targp,
+					XFS_FSB_TO_DADDR(sc->mp, fsbno),
+					qi->qi_dqchunklen,
+					0, &bp, NULL);
+			if (error)
+				goto out;
+
+			/*
+			 * Fix the quota block, which will roll our transaction
+			 * and release bp.
+			 */
+			error = xrep_quota_block(sc, bp, dqtype, id);
+			if (error)
+				goto out;
+		}
+	}
+
+out:
+	return error;
+}
+
+/*
+ * Go fix anything in the quota items that we could have been mad about.  Now
+ * that we've checked the quota inode data fork we have to drop ILOCK_EXCL to
+ * use the regular dquot functions.
+ */
+STATIC int
+xrep_quota_problems(
+	struct xfs_scrub	*sc,
+	uint			dqtype)
+{
+	struct xrep_quota_info	rqi;
+	int			error;
+
+	rqi.sc = sc;
+	rqi.need_quotacheck = false;
+	error = xfs_qm_dqiterate(sc->mp, dqtype, xrep_quota_item, &rqi);
+	if (error)
+		return error;
+
+	/* Make a quotacheck happen. */
+	if (rqi.need_quotacheck)
+		xrep_force_quotacheck(sc, dqtype);
+	return 0;
+}
+
+/* Repair all of a quota type's items. */
+int
+xrep_quota(
+	struct xfs_scrub	*sc)
+{
+	uint			dqtype;
+	int			error;
+
+	dqtype = xchk_quota_to_dqtype(sc);
+
+	/* Fix problematic data fork mappings. */
+	error = xrep_quota_data_fork(sc, dqtype);
+	if (error)
+		goto out;
+
+	/* Unlock quota inode; we play only with dquots from now on. */
+	xfs_iunlock(sc->ip, sc->ilock_flags);
+	sc->ilock_flags = 0;
+
+	/* Fix anything the dquot verifiers complain about. */
+	error = xrep_quota_problems(sc, dqtype);
+out:
+	return error;
+}
diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
index a44deb6f06ab..27cc50178d86 100644
--- a/fs/xfs/scrub/repair.c
+++ b/fs/xfs/scrub/repair.c
@@ -29,6 +29,8 @@
 #include "xfs_ag_resv.h"
 #include "xfs_trans_space.h"
 #include "xfs_quota.h"
+#include "xfs_attr.h"
+#include "xfs_reflink.h"
 #include "scrub/xfs_scrub.h"
 #include "scrub/scrub.h"
 #include "scrub/common.h"
@@ -900,3 +902,59 @@ xrep_reset_perag_resv(
 out:
 	return error;
 }
+
+/*
+ * Repair the attr/data forks of a metadata inode.  The metadata inode must be
+ * pointed to by sc->ip and the ILOCK must be held.
+ */
+int
+xrep_metadata_inode_forks(
+	struct xfs_scrub	*sc)
+{
+	__u32			smtype;
+	__u32			smflags;
+	int			error;
+
+	smtype = sc->sm->sm_type;
+	smflags = sc->sm->sm_flags;
+
+	/* Let's see if the forks need repair. */
+	sc->sm->sm_flags &= ~XFS_SCRUB_FLAGS_OUT;
+	error = xchk_metadata_inode_forks(sc);
+	if (error || !xfs_scrub_needs_repair(sc->sm))
+		goto out;
+
+	xfs_trans_ijoin(sc->tp, sc->ip, 0);
+
+	/* Clear the reflink flag & attr forks that we shouldn't have. */
+	if (xfs_is_reflink_inode(sc->ip)) {
+		error = xfs_reflink_clear_inode_flag(sc->ip, &sc->tp);
+		if (error)
+			goto out;
+	}
+
+	if (xfs_inode_hasattr(sc->ip)) {
+		error = xrep_xattr_reset_btree(sc);
+		if (error)
+			goto out;
+	}
+
+	/* Repair the data fork. */
+	sc->sm->sm_type = XFS_SCRUB_TYPE_BMBTD;
+	error = xrep_bmap_data(sc);
+	sc->sm->sm_type = smtype;
+	if (error)
+		goto out;
+
+	/* Bail out if we still need repairs. */
+	sc->sm->sm_flags &= ~XFS_SCRUB_FLAGS_OUT;
+	error = xchk_metadata_inode_forks(sc);
+	if (error)
+		goto out;
+	if (xfs_scrub_needs_repair(sc->sm))
+		error = -EFSCORRUPTED;
+out:
+	sc->sm->sm_type = smtype;
+	sc->sm->sm_flags = smflags;
+	return error;
+}
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index b630084d0f39..aa032a7b99d0 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -54,6 +54,8 @@ int xrep_find_ag_btree_roots(struct xfs_scrub *sc, struct xfs_buf *agf_bp,
 void xrep_force_quotacheck(struct xfs_scrub *sc, uint dqtype);
 int xrep_ino_dqattach(struct xfs_scrub *sc);
 int xrep_reset_perag_resv(struct xfs_scrub *sc);
+int xrep_xattr_reset_btree(struct xfs_scrub *sc);
+int xrep_metadata_inode_forks(struct xfs_scrub *sc);
 
 /* Metadata repairers */
 
@@ -70,6 +72,11 @@ int xrep_bmap_data(struct xfs_scrub *sc);
 int xrep_bmap_attr(struct xfs_scrub *sc);
 int xrep_symlink(struct xfs_scrub *sc);
 int xrep_xattr(struct xfs_scrub *sc);
+#ifdef CONFIG_XFS_QUOTA
+int xrep_quota(struct xfs_scrub *sc);
+#else
+# define xrep_quota			xrep_notsupported
+#endif /* CONFIG_XFS_QUOTA */
 
 #else
 
@@ -112,6 +119,7 @@ xrep_reset_perag_resv(
 #define xrep_bmap_attr			xrep_notsupported
 #define xrep_symlink			xrep_notsupported
 #define xrep_xattr			xrep_notsupported
+#define xrep_quota			xrep_notsupported
 
 #endif /* CONFIG_XFS_ONLINE_REPAIR */
 
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 537636d789fb..a9f969214e69 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -333,19 +333,19 @@ static const struct xchk_meta_ops meta_scrub_ops[] = {
 		.type	= ST_FS,
 		.setup	= xchk_setup_quota,
 		.scrub	= xchk_quota,
-		.repair	= xrep_notsupported,
+		.repair	= xrep_quota,
 	},
 	[XFS_SCRUB_TYPE_GQUOTA] = {	/* group quota */
 		.type	= ST_FS,
 		.setup	= xchk_setup_quota,
 		.scrub	= xchk_quota,
-		.repair	= xrep_notsupported,
+		.repair	= xrep_quota,
 	},
 	[XFS_SCRUB_TYPE_PQUOTA] = {	/* project quota */
 		.type	= ST_FS,
 		.setup	= xchk_setup_quota,
 		.scrub	= xchk_quota,
-		.repair	= xrep_notsupported,
+		.repair	= xrep_quota,
 	},
 };
 
@@ -539,9 +539,8 @@ xfs_scrub_metadata(
 		if (XFS_TEST_ERROR(false, mp, XFS_ERRTAG_FORCE_SCRUB_REPAIR))
 			sc.sm->sm_flags |= XFS_SCRUB_OFLAG_CORRUPT;
 
-		needs_fix = (sc.sm->sm_flags & (XFS_SCRUB_OFLAG_CORRUPT |
-						XFS_SCRUB_OFLAG_XCORRUPT |
-						XFS_SCRUB_OFLAG_PREEN));
+		needs_fix = xfs_scrub_needs_repair(sc.sm);
+
 		/*
 		 * If userspace asked for a repair but it wasn't necessary,
 		 * report that back to userspace.
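
The offset arithmetic in xrep_quota_data_fork can be sketched in user
space roughly as follows.  This is only an illustration under a few
assumptions: one filesystem block per dquot cluster, 32-bit dquot ids,
and helper names (dqblock_first_id, dqblock_max_off,
dqmapping_needs_truncate) that are hypothetical and do not exist in the
kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Largest dquot id, assuming ids are 32 bits wide like xfs_dqid_t. */
#define DQID_MAX	UINT32_MAX

/* First dquot id stored in quota file block @off. */
static uint64_t
dqblock_first_id(
	uint64_t	off,
	uint32_t	dqperchunk)
{
	assert(dqperchunk > 0);
	return off * dqperchunk;
}

/* Highest file block offset that can still hold a valid dquot id. */
static uint64_t
dqblock_max_off(
	uint32_t	dqperchunk)
{
	assert(dqperchunk > 0);
	return DQID_MAX / dqperchunk;
}

/* Does the mapping [startoff, startoff + blockcount) run past that block? */
static bool
dqmapping_needs_truncate(
	uint64_t	startoff,
	uint64_t	blockcount,
	uint32_t	dqperchunk)
{
	uint64_t	max_off = dqblock_max_off(dqperchunk);

	assert(blockcount > 0);
	return startoff > max_off || startoff + blockcount - 1 > max_off;
}
```

With an illustrative 30 dquots per block, block 2 starts at id 60 and
the last block that can hold a valid id is 4294967295 / 30 = 143165576.
A mapping that reaches past that block is what makes the repair code
truncate the data fork.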



* Re: [PATCH 01/14] xfs: refactor the xrep_extent_list into xfs_bitmap
  2018-07-30  5:47 ` [PATCH 01/14] xfs: refactor the xrep_extent_list into xfs_bitmap Darrick J. Wong
@ 2018-07-30 16:21   ` Brian Foster
  0 siblings, 0 replies; 53+ messages in thread
From: Brian Foster @ 2018-07-30 16:21 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, david, allison.henderson

On Sun, Jul 29, 2018 at 10:47:54PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> As mentioned previously, the xrep_extent_list basically implements a
> bitmap with two functions: set and disjoint union.  Rename all these
> functions to xfs_bitmap to shorten the name and make it more obvious
> what we're doing.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/scrub/bitmap.c |  183 +++++++++++++++++++++++++------------------------
>  fs/xfs/scrub/bitmap.h |   35 ++++-----
>  fs/xfs/scrub/repair.c |   85 ++++++++++-------------
>  fs/xfs/scrub/repair.h |    8 +-
>  fs/xfs/scrub/trace.h  |    1 
>  5 files changed, 149 insertions(+), 163 deletions(-)
> 
> 
> diff --git a/fs/xfs/scrub/bitmap.c b/fs/xfs/scrub/bitmap.c
> index a7c2f4773f98..c770e2d0b6aa 100644
> --- a/fs/xfs/scrub/bitmap.c
> +++ b/fs/xfs/scrub/bitmap.c
> @@ -16,183 +16,186 @@
>  #include "scrub/repair.h"
>  #include "scrub/bitmap.h"
>  
> -/* Collect a dead btree extent for later disposal. */
> +/*
> + * Set a range of this bitmap.  Caller must ensure the range is not set.
> + *
> + * This is the logical equivalent of bitmap |= mask(start, len).
> + */
>  int
> -xrep_collect_btree_extent(
> -	struct xfs_scrub	*sc,
> -	struct xrep_extent_list	*exlist,
> -	xfs_fsblock_t		fsbno,
> -	xfs_extlen_t		len)
> +xfs_bitmap_set(
> +	struct xfs_bitmap	*bitmap,
> +	uint64_t		start,
> +	uint64_t		len)
>  {
> -	struct xrep_extent	*rex;
> +	struct xfs_bitmap_range	*bmr;
>  
> -	trace_xrep_collect_btree_extent(sc->mp,
> -			XFS_FSB_TO_AGNO(sc->mp, fsbno),
> -			XFS_FSB_TO_AGBNO(sc->mp, fsbno), len);
> -
> -	rex = kmem_alloc(sizeof(struct xrep_extent), KM_MAYFAIL);
> -	if (!rex)
> +	bmr = kmem_alloc(sizeof(struct xfs_bitmap_range), KM_MAYFAIL);
> +	if (!bmr)
>  		return -ENOMEM;
>  
> -	INIT_LIST_HEAD(&rex->list);
> -	rex->fsbno = fsbno;
> -	rex->len = len;
> -	list_add_tail(&rex->list, &exlist->list);
> +	INIT_LIST_HEAD(&bmr->list);
> +	bmr->start = start;
> +	bmr->len = len;
> +	list_add_tail(&bmr->list, &bitmap->list);
>  
>  	return 0;
>  }
>  
> -/*
> - * An error happened during the rebuild so the transaction will be cancelled.
> - * The fs will shut down, and the administrator has to unmount and run repair.
> - * Therefore, free all the memory associated with the list so we can die.
> - */
> +/* Free everything related to this bitmap. */
>  void
> -xrep_cancel_btree_extents(
> -	struct xfs_scrub	*sc,
> -	struct xrep_extent_list	*exlist)
> +xfs_bitmap_destroy(
> +	struct xfs_bitmap	*bitmap)
>  {
> -	struct xrep_extent	*rex;
> -	struct xrep_extent	*n;
> +	struct xfs_bitmap_range	*bmr;
> +	struct xfs_bitmap_range	*n;
>  
> -	for_each_xrep_extent_safe(rex, n, exlist) {
> -		list_del(&rex->list);
> -		kmem_free(rex);
> +	for_each_xfs_bitmap_extent(bmr, n, bitmap) {
> +		list_del(&bmr->list);
> +		kmem_free(bmr);
>  	}
>  }
>  
> +/* Set up a per-AG block bitmap. */
> +void
> +xfs_bitmap_init(
> +	struct xfs_bitmap	*bitmap)
> +{
> +	INIT_LIST_HEAD(&bitmap->list);
> +}
> +
>  /* Compare two btree extents. */
>  static int
> -xrep_btree_extent_cmp(
> +xfs_bitmap_range_cmp(
>  	void			*priv,
>  	struct list_head	*a,
>  	struct list_head	*b)
>  {
> -	struct xrep_extent	*ap;
> -	struct xrep_extent	*bp;
> +	struct xfs_bitmap_range	*ap;
> +	struct xfs_bitmap_range	*bp;
>  
> -	ap = container_of(a, struct xrep_extent, list);
> -	bp = container_of(b, struct xrep_extent, list);
> +	ap = container_of(a, struct xfs_bitmap_range, list);
> +	bp = container_of(b, struct xfs_bitmap_range, list);
>  
> -	if (ap->fsbno > bp->fsbno)
> +	if (ap->start > bp->start)
>  		return 1;
> -	if (ap->fsbno < bp->fsbno)
> +	if (ap->start < bp->start)
>  		return -1;
>  	return 0;
>  }
>  
>  /*
> - * Remove all the blocks mentioned in @sublist from the extents in @exlist.
> + * Remove all the blocks mentioned in @sub from the extents in @bitmap.
>   *
>   * The intent is that callers will iterate the rmapbt for all of its records
> - * for a given owner to generate @exlist; and iterate all the blocks of the
> + * for a given owner to generate @bitmap; and iterate all the blocks of the
>   * metadata structures that are not being rebuilt and have the same rmapbt
> - * owner to generate @sublist.  This routine subtracts all the extents
> - * mentioned in sublist from all the extents linked in @exlist, which leaves
> - * @exlist as the list of blocks that are not accounted for, which we assume
> + * owner to generate @sub.  This routine subtracts all the extents
> + * mentioned in sub from all the extents linked in @bitmap, which leaves
> + * @bitmap as the list of blocks that are not accounted for, which we assume
>   * are the dead blocks of the old metadata structure.  The blocks mentioned in
> - * @exlist can be reaped.
> + * @bitmap can be reaped.
> + *
> + * This is the logical equivalent of bitmap &= ~sub.
>   */
>  #define LEFT_ALIGNED	(1 << 0)
>  #define RIGHT_ALIGNED	(1 << 1)
>  int
> -xrep_subtract_extents(
> -	struct xfs_scrub	*sc,
> -	struct xrep_extent_list	*exlist,
> -	struct xrep_extent_list	*sublist)
> +xfs_bitmap_disunion(
> +	struct xfs_bitmap	*bitmap,
> +	struct xfs_bitmap	*sub)
>  {
>  	struct list_head	*lp;
> -	struct xrep_extent	*ex;
> -	struct xrep_extent	*newex;
> -	struct xrep_extent	*subex;
> -	xfs_fsblock_t		sub_fsb;
> -	xfs_extlen_t		sub_len;
> +	struct xfs_bitmap_range	*br;
> +	struct xfs_bitmap_range	*new_br;
> +	struct xfs_bitmap_range	*sub_br;
> +	uint64_t		sub_start;
> +	uint64_t		sub_len;
>  	int			state;
>  	int			error = 0;
>  
> -	if (list_empty(&exlist->list) || list_empty(&sublist->list))
> +	if (list_empty(&bitmap->list) || list_empty(&sub->list))
>  		return 0;
> -	ASSERT(!list_empty(&sublist->list));
> +	ASSERT(!list_empty(&sub->list));
>  
> -	list_sort(NULL, &exlist->list, xrep_btree_extent_cmp);
> -	list_sort(NULL, &sublist->list, xrep_btree_extent_cmp);
> +	list_sort(NULL, &bitmap->list, xfs_bitmap_range_cmp);
> +	list_sort(NULL, &sub->list, xfs_bitmap_range_cmp);
>  
>  	/*
> -	 * Now that we've sorted both lists, we iterate exlist once, rolling
> -	 * forward through sublist and/or exlist as necessary until we find an
> +	 * Now that we've sorted both lists, we iterate bitmap once, rolling
> +	 * forward through sub and/or bitmap as necessary until we find an
>  	 * overlap or reach the end of either list.  We do not reset lp to the
> -	 * head of exlist nor do we reset subex to the head of sublist.  The
> +	 * head of bitmap nor do we reset sub_br to the head of sub.  The
>  	 * list traversal is similar to merge sort, but we're deleting
>  	 * instead.  In this manner we avoid O(n^2) operations.
>  	 */
> -	subex = list_first_entry(&sublist->list, struct xrep_extent,
> +	sub_br = list_first_entry(&sub->list, struct xfs_bitmap_range,
>  			list);
> -	lp = exlist->list.next;
> -	while (lp != &exlist->list) {
> -		ex = list_entry(lp, struct xrep_extent, list);
> +	lp = bitmap->list.next;
> +	while (lp != &bitmap->list) {
> +		br = list_entry(lp, struct xfs_bitmap_range, list);
>  
>  		/*
> -		 * Advance subex and/or ex until we find a pair that
> +		 * Advance sub_br and/or br until we find a pair that
>  		 * intersect or we run out of extents.
>  		 */
> -		while (subex->fsbno + subex->len <= ex->fsbno) {
> -			if (list_is_last(&subex->list, &sublist->list))
> +		while (sub_br->start + sub_br->len <= br->start) {
> +			if (list_is_last(&sub_br->list, &sub->list))
>  				goto out;
> -			subex = list_next_entry(subex, list);
> +			sub_br = list_next_entry(sub_br, list);
>  		}
> -		if (subex->fsbno >= ex->fsbno + ex->len) {
> +		if (sub_br->start >= br->start + br->len) {
>  			lp = lp->next;
>  			continue;
>  		}
>  
> -		/* trim subex to fit the extent we have */
> -		sub_fsb = subex->fsbno;
> -		sub_len = subex->len;
> -		if (subex->fsbno < ex->fsbno) {
> -			sub_len -= ex->fsbno - subex->fsbno;
> -			sub_fsb = ex->fsbno;
> +		/* trim sub_br to fit the extent we have */
> +		sub_start = sub_br->start;
> +		sub_len = sub_br->len;
> +		if (sub_br->start < br->start) {
> +			sub_len -= br->start - sub_br->start;
> +			sub_start = br->start;
>  		}
> -		if (sub_len > ex->len)
> -			sub_len = ex->len;
> +		if (sub_len > br->len)
> +			sub_len = br->len;
>  
>  		state = 0;
> -		if (sub_fsb == ex->fsbno)
> +		if (sub_start == br->start)
>  			state |= LEFT_ALIGNED;
> -		if (sub_fsb + sub_len == ex->fsbno + ex->len)
> +		if (sub_start + sub_len == br->start + br->len)
>  			state |= RIGHT_ALIGNED;
>  		switch (state) {
>  		case LEFT_ALIGNED:
>  			/* Coincides with only the left. */
> -			ex->fsbno += sub_len;
> -			ex->len -= sub_len;
> +			br->start += sub_len;
> +			br->len -= sub_len;
>  			break;
>  		case RIGHT_ALIGNED:
>  			/* Coincides with only the right. */
> -			ex->len -= sub_len;
> +			br->len -= sub_len;
>  			lp = lp->next;
>  			break;
>  		case LEFT_ALIGNED | RIGHT_ALIGNED:
>  			/* Total overlap, just delete ex. */
>  			lp = lp->next;
> -			list_del(&ex->list);
> -			kmem_free(ex);
> +			list_del(&br->list);
> +			kmem_free(br);
>  			break;
>  		case 0:
>  			/*
>  			 * Deleting from the middle: add the new right extent
>  			 * and then shrink the left extent.
>  			 */
> -			newex = kmem_alloc(sizeof(struct xrep_extent),
> +			new_br = kmem_alloc(sizeof(struct xfs_bitmap_range),
>  					KM_MAYFAIL);
> -			if (!newex) {
> +			if (!new_br) {
>  				error = -ENOMEM;
>  				goto out;
>  			}
> -			INIT_LIST_HEAD(&newex->list);
> -			newex->fsbno = sub_fsb + sub_len;
> -			newex->len = ex->fsbno + ex->len - newex->fsbno;
> -			list_add(&newex->list, &ex->list);
> -			ex->len = sub_fsb - ex->fsbno;
> +			INIT_LIST_HEAD(&new_br->list);
> +			new_br->start = sub_start + sub_len;
> +			new_br->len = br->start + br->len - new_br->start;
> +			list_add(&new_br->list, &br->list);
> +			br->len = sub_start - br->start;
>  			lp = lp->next;
>  			break;
>  		default:
> diff --git a/fs/xfs/scrub/bitmap.h b/fs/xfs/scrub/bitmap.h
> index 1038157695a8..dad652ee9177 100644
> --- a/fs/xfs/scrub/bitmap.h
> +++ b/fs/xfs/scrub/bitmap.h
> @@ -6,32 +6,27 @@
>  #ifndef __XFS_SCRUB_BITMAP_H__
>  #define __XFS_SCRUB_BITMAP_H__
>  
> -struct xrep_extent {
> +struct xfs_bitmap_range {
>  	struct list_head	list;
> -	xfs_fsblock_t		fsbno;
> -	xfs_extlen_t		len;
> +	uint64_t		start;
> +	uint64_t		len;
>  };
>  
> -struct xrep_extent_list {
> +struct xfs_bitmap {
>  	struct list_head	list;
>  };
>  
> -static inline void
> -xrep_init_extent_list(
> -	struct xrep_extent_list		*exlist)
> -{
> -	INIT_LIST_HEAD(&exlist->list);
> -}
> +void xfs_bitmap_init(struct xfs_bitmap *bitmap);
> +void xfs_bitmap_destroy(struct xfs_bitmap *bitmap);
>  
> -#define for_each_xrep_extent_safe(rbe, n, exlist) \
> -	list_for_each_entry_safe((rbe), (n), &(exlist)->list, list)
> -int xrep_collect_btree_extent(struct xfs_scrub *sc,
> -		struct xrep_extent_list *btlist, xfs_fsblock_t fsbno,
> -		xfs_extlen_t len);
> -void xrep_cancel_btree_extents(struct xfs_scrub *sc,
> -		struct xrep_extent_list *btlist);
> -int xrep_subtract_extents(struct xfs_scrub *sc,
> -		struct xrep_extent_list *exlist,
> -		struct xrep_extent_list *sublist);
> +#define for_each_xfs_bitmap_extent(bex, n, bitmap) \
> +	list_for_each_entry_safe((bex), (n), &(bitmap)->list, list)
> +
> +#define for_each_xfs_bitmap_block(b, bex, n, bitmap) \
> +	list_for_each_entry_safe((bex), (n), &(bitmap)->list, list) \
> +		for ((b) = bex->start; (b) < bex->start + bex->len; (b)++)
> +
> +int xfs_bitmap_set(struct xfs_bitmap *bitmap, uint64_t start, uint64_t len);
> +int xfs_bitmap_disunion(struct xfs_bitmap *bitmap, struct xfs_bitmap *sub);
>  
>  #endif	/* __XFS_SCRUB_BITMAP_H__ */
> diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
> index 27a904ef6189..85b048b341a0 100644
> --- a/fs/xfs/scrub/repair.c
> +++ b/fs/xfs/scrub/repair.c
> @@ -368,17 +368,17 @@ xrep_init_btblock(
>   *
>   * However, that leaves the matter of removing all the metadata describing the
>   * old broken structure.  For primary metadata we use the rmap data to collect
> - * every extent with a matching rmap owner (exlist); we then iterate all other
> + * every extent with a matching rmap owner (bitmap); we then iterate all other
>   * metadata structures with the same rmap owner to collect the extents that
> - * cannot be removed (sublist).  We then subtract sublist from exlist to
> + * cannot be removed (sublist).  We then subtract sublist from bitmap to
>   * derive the blocks that were used by the old btree.  These blocks can be
>   * reaped.
>   *
>   * For rmapbt reconstructions we must use different tactics for extent
>   * collection.  First we iterate all primary metadata (this excludes the old
>   * rmapbt, obviously) to generate new rmap records.  The gaps in the rmap
> - * records are collected as exlist.  The bnobt records are collected as
> - * sublist.  As with the other btrees we subtract sublist from exlist, and the
> + * records are collected as bitmap.  The bnobt records are collected as
> + * sublist.  As with the other btrees we subtract sublist from bitmap, and the
>   * result (since the rmapbt lives in the free space) are the blocks from the
>   * old rmapbt.
>   *
> @@ -386,11 +386,11 @@ xrep_init_btblock(
>   *
>   * Now that we've constructed a new btree to replace the damaged one, we want
>   * to dispose of the blocks that (we think) the old btree was using.
> - * Previously, we used the rmapbt to collect the extents (exlist) with the
> + * Previously, we used the rmapbt to collect the extents (bitmap) with the
>   * rmap owner corresponding to the tree we rebuilt, collected extents for any
>   * blocks with the same rmap owner that are owned by another data structure
> - * (sublist), and subtracted sublist from exlist.  In theory the extents
> - * remaining in exlist are the old btree's blocks.
> + * (sublist), and subtracted sublist from bitmap.  In theory the extents
> + * remaining in bitmap are the old btree's blocks.
>   *
>   * Unfortunately, it's possible that the btree was crosslinked with other
>   * blocks on disk.  The rmap data can tell us if there are multiple owners, so
> @@ -406,7 +406,7 @@ xrep_init_btblock(
>   * If there are no rmap records at all, we also free the block.  If the btree
>   * being rebuilt lives in the free space (bnobt/cntbt/rmapbt) then there isn't
>   * supposed to be a rmap record and everything is ok.  For other btrees there
> - * had to have been an rmap entry for the block to have ended up on @exlist,
> + * had to have been an rmap entry for the block to have ended up on @bitmap,
>   * so if it's gone now there's something wrong and the fs will shut down.
>   *
>   * Note: If there are multiple rmap records with only the same rmap owner as
> @@ -419,7 +419,7 @@ xrep_init_btblock(
>   * The caller is responsible for locking the AG headers for the entire rebuild
>   * operation so that nothing else can sneak in and change the AG state while
>   * we're not looking.  We also assume that the caller already invalidated any
> - * buffers associated with @exlist.
> + * buffers associated with @bitmap.
>   */
>  
>  /*
> @@ -429,13 +429,12 @@ xrep_init_btblock(
>  int
>  xrep_invalidate_blocks(
>  	struct xfs_scrub	*sc,
> -	struct xrep_extent_list	*exlist)
> +	struct xfs_bitmap	*bitmap)
>  {
> -	struct xrep_extent	*rex;
> -	struct xrep_extent	*n;
> +	struct xfs_bitmap_range	*bmr;
> +	struct xfs_bitmap_range	*n;
>  	struct xfs_buf		*bp;
>  	xfs_fsblock_t		fsbno;
> -	xfs_agblock_t		i;
>  
>  	/*
>  	 * For each block in each extent, see if there's an incore buffer for
> @@ -445,18 +444,16 @@ xrep_invalidate_blocks(
>  	 * because we never own those; and if we can't TRYLOCK the buffer we
>  	 * assume it's owned by someone else.
>  	 */
> -	for_each_xrep_extent_safe(rex, n, exlist) {
> -		for (fsbno = rex->fsbno, i = rex->len; i > 0; fsbno++, i--) {
> -			/* Skip AG headers and post-EOFS blocks */
> -			if (!xfs_verify_fsbno(sc->mp, fsbno))
> -				continue;
> -			bp = xfs_buf_incore(sc->mp->m_ddev_targp,
> -					XFS_FSB_TO_DADDR(sc->mp, fsbno),
> -					XFS_FSB_TO_BB(sc->mp, 1), XBF_TRYLOCK);
> -			if (bp) {
> -				xfs_trans_bjoin(sc->tp, bp);
> -				xfs_trans_binval(sc->tp, bp);
> -			}
> +	for_each_xfs_bitmap_block(fsbno, bmr, n, bitmap) {
> +		/* Skip AG headers and post-EOFS blocks */
> +		if (!xfs_verify_fsbno(sc->mp, fsbno))
> +			continue;
> +		bp = xfs_buf_incore(sc->mp->m_ddev_targp,
> +				XFS_FSB_TO_DADDR(sc->mp, fsbno),
> +				XFS_FSB_TO_BB(sc->mp, 1), XBF_TRYLOCK);
> +		if (bp) {
> +			xfs_trans_bjoin(sc->tp, bp);
> +			xfs_trans_binval(sc->tp, bp);
>  		}
>  	}
>  
> @@ -519,9 +516,9 @@ xrep_put_freelist(
>  	return 0;
>  }
>  
> -/* Dispose of a single metadata block. */
> +/* Dispose of a single block. */
>  STATIC int
> -xrep_dispose_btree_block(
> +xrep_reap_block(
>  	struct xfs_scrub	*sc,
>  	xfs_fsblock_t		fsbno,
>  	struct xfs_owner_info	*oinfo,
> @@ -593,41 +590,35 @@ xrep_dispose_btree_block(
>  	return error;
>  }
>  
> -/* Dispose of btree blocks from an old per-AG btree. */
> +/* Dispose of every block of every extent in the bitmap. */
>  int
> -xrep_reap_btree_extents(
> +xrep_reap_extents(
>  	struct xfs_scrub	*sc,
> -	struct xrep_extent_list	*exlist,
> +	struct xfs_bitmap	*bitmap,
>  	struct xfs_owner_info	*oinfo,
>  	enum xfs_ag_resv_type	type)
>  {
> -	struct xrep_extent	*rex;
> -	struct xrep_extent	*n;
> +	struct xfs_bitmap_range	*bmr;
> +	struct xfs_bitmap_range	*n;
> +	xfs_fsblock_t		fsbno;
>  	int			error = 0;
>  
>  	ASSERT(xfs_sb_version_hasrmapbt(&sc->mp->m_sb));
>  
> -	/* Dispose of every block from the old btree. */
> -	for_each_xrep_extent_safe(rex, n, exlist) {
> +	for_each_xfs_bitmap_block(fsbno, bmr, n, bitmap) {
>  		ASSERT(sc->ip != NULL ||
> -		       XFS_FSB_TO_AGNO(sc->mp, rex->fsbno) == sc->sa.agno);
> -
> +		       XFS_FSB_TO_AGNO(sc->mp, fsbno) == sc->sa.agno);
>  		trace_xrep_dispose_btree_extent(sc->mp,
> -				XFS_FSB_TO_AGNO(sc->mp, rex->fsbno),
> -				XFS_FSB_TO_AGBNO(sc->mp, rex->fsbno), rex->len);
> +				XFS_FSB_TO_AGNO(sc->mp, fsbno),
> +				XFS_FSB_TO_AGBNO(sc->mp, fsbno), 1);
>  
> -		for (; rex->len > 0; rex->len--, rex->fsbno++) {
> -			error = xrep_dispose_btree_block(sc, rex->fsbno,
> -					oinfo, type);
> -			if (error)
> -				goto out;
> -		}
> -		list_del(&rex->list);
> -		kmem_free(rex);
> +		error = xrep_reap_block(sc, fsbno, oinfo, type);
> +		if (error)
> +			goto out;
>  	}
>  
>  out:
> -	xrep_cancel_btree_extents(sc, exlist);
> +	xfs_bitmap_destroy(bitmap);
>  	return error;
>  }
>  
> diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
> index a3d491a438f4..5a4e92221916 100644
> --- a/fs/xfs/scrub/repair.h
> +++ b/fs/xfs/scrub/repair.h
> @@ -27,13 +27,11 @@ int xrep_init_btblock(struct xfs_scrub *sc, xfs_fsblock_t fsb,
>  		struct xfs_buf **bpp, xfs_btnum_t btnum,
>  		const struct xfs_buf_ops *ops);
>  
> -struct xrep_extent_list;
> +struct xfs_bitmap;
>  
>  int xrep_fix_freelist(struct xfs_scrub *sc, bool can_shrink);
> -int xrep_invalidate_blocks(struct xfs_scrub *sc,
> -		struct xrep_extent_list *btlist);
> -int xrep_reap_btree_extents(struct xfs_scrub *sc,
> -		struct xrep_extent_list *exlist,
> +int xrep_invalidate_blocks(struct xfs_scrub *sc, struct xfs_bitmap *btlist);
> +int xrep_reap_extents(struct xfs_scrub *sc, struct xfs_bitmap *exlist,
>  		struct xfs_owner_info *oinfo, enum xfs_ag_resv_type type);
>  
>  struct xrep_find_ag_btree {
> diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
> index 93db22c39b51..4e20f0e48232 100644
> --- a/fs/xfs/scrub/trace.h
> +++ b/fs/xfs/scrub/trace.h
> @@ -511,7 +511,6 @@ DEFINE_EVENT(xrep_extent_class, name, \
>  		 xfs_agblock_t agbno, xfs_extlen_t len), \
>  	TP_ARGS(mp, agno, agbno, len))
>  DEFINE_REPAIR_EXTENT_EVENT(xrep_dispose_btree_extent);
> -DEFINE_REPAIR_EXTENT_EVENT(xrep_collect_btree_extent);
>  DEFINE_REPAIR_EXTENT_EVENT(xrep_agfl_insert);
>  
>  DECLARE_EVENT_CLASS(xrep_rmap_class,
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [PATCH 02/14] xfs: repair the AGF
  2018-07-30  5:48 ` [PATCH 02/14] xfs: repair the AGF Darrick J. Wong
@ 2018-07-30 16:22   ` Brian Foster
  2018-07-30 17:31     ` Darrick J. Wong
  0 siblings, 1 reply; 53+ messages in thread
From: Brian Foster @ 2018-07-30 16:22 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, david, allison.henderson

On Sun, Jul 29, 2018 at 10:48:02PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Regenerate the AGF from the rmap data.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/scrub/agheader_repair.c |  370 ++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/scrub/repair.c          |   27 ++-
>  fs/xfs/scrub/repair.h          |    4 
>  fs/xfs/scrub/scrub.c           |    2 
>  4 files changed, 393 insertions(+), 10 deletions(-)
> 
> 
> diff --git a/fs/xfs/scrub/agheader_repair.c b/fs/xfs/scrub/agheader_repair.c
> index 1e96621ece3a..4842fc598c9b 100644
> --- a/fs/xfs/scrub/agheader_repair.c
> +++ b/fs/xfs/scrub/agheader_repair.c
...
> @@ -54,3 +61,366 @@ xrep_superblock(
...
> +/* Repair the AGF. v5 filesystems only. */
> +int
> +xrep_agf(
> +	struct xfs_scrub		*sc)
> +{
...
> +	/* Start rewriting the header and implant the btrees we found. */
> +	xrep_agf_init_header(sc, agf_bp, &old_agf);
> +	xrep_agf_set_roots(sc, agf, fab);
> +	error = xrep_agf_calc_from_btrees(sc, agf_bp);
> +	if (error)
> +		goto out_revert;
> +
> +	/* Commit the changes and reinitialize incore state. */
> +	return xrep_agf_commit_new(sc, agf_bp);
> +
> +out_revert:
> +	/* Mark the incore AGF state stale and revert the AGF. */
> +	sc->sa.pag->pagf_init = 0;

Hmm, looking at this again I'm not sure it's safe to reset ->pagf_init
like this. The contexts where we hold agf might be Ok because I think
that might prevent some other thread from actually coming in and
resetting it, but look at what xfs_alloc_read_agf() does in this case if the
agf becomes available with !pagf_init. Specifically, are we at risk of
corrupting a populated ->pagb_tree or causing other problems by
reinitializing the spinlock? Perhaps we need another patch to separate
out some of those fields that should only ever be initialized once.

With something like that, it might subsequently make sense to factor the
reinit from disk bits into a helper to be shared between
xrep_agf_commit_new() and xfs_alloc_read_agf(). I also wonder if it's
sufficient to just update the agf on disk and leave pagf_init == 0.

Otherwise the rest of this patch seems Ok to me.

Brian

> +	memcpy(agf, &old_agf, sizeof(old_agf));
> +	return error;
> +}
> diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
> index 85b048b341a0..17cf48564390 100644
> --- a/fs/xfs/scrub/repair.c
> +++ b/fs/xfs/scrub/repair.c
> @@ -128,9 +128,12 @@ xrep_roll_ag_trans(
>  	int			error;
>  
>  	/* Keep the AG header buffers locked so we can keep going. */
> -	xfs_trans_bhold(sc->tp, sc->sa.agi_bp);
> -	xfs_trans_bhold(sc->tp, sc->sa.agf_bp);
> -	xfs_trans_bhold(sc->tp, sc->sa.agfl_bp);
> +	if (sc->sa.agi_bp)
> +		xfs_trans_bhold(sc->tp, sc->sa.agi_bp);
> +	if (sc->sa.agf_bp)
> +		xfs_trans_bhold(sc->tp, sc->sa.agf_bp);
> +	if (sc->sa.agfl_bp)
> +		xfs_trans_bhold(sc->tp, sc->sa.agfl_bp);
>  
>  	/* Roll the transaction. */
>  	error = xfs_trans_roll(&sc->tp);
> @@ -138,9 +141,12 @@ xrep_roll_ag_trans(
>  		goto out_release;
>  
>  	/* Join AG headers to the new transaction. */
> -	xfs_trans_bjoin(sc->tp, sc->sa.agi_bp);
> -	xfs_trans_bjoin(sc->tp, sc->sa.agf_bp);
> -	xfs_trans_bjoin(sc->tp, sc->sa.agfl_bp);
> +	if (sc->sa.agi_bp)
> +		xfs_trans_bjoin(sc->tp, sc->sa.agi_bp);
> +	if (sc->sa.agf_bp)
> +		xfs_trans_bjoin(sc->tp, sc->sa.agf_bp);
> +	if (sc->sa.agfl_bp)
> +		xfs_trans_bjoin(sc->tp, sc->sa.agfl_bp);
>  
>  	return 0;
>  
> @@ -150,9 +156,12 @@ xrep_roll_ag_trans(
>  	 * buffers will be released during teardown on our way out
>  	 * of the kernel.
>  	 */
> -	xfs_trans_bhold_release(sc->tp, sc->sa.agi_bp);
> -	xfs_trans_bhold_release(sc->tp, sc->sa.agf_bp);
> -	xfs_trans_bhold_release(sc->tp, sc->sa.agfl_bp);
> +	if (sc->sa.agi_bp)
> +		xfs_trans_bhold_release(sc->tp, sc->sa.agi_bp);
> +	if (sc->sa.agf_bp)
> +		xfs_trans_bhold_release(sc->tp, sc->sa.agf_bp);
> +	if (sc->sa.agfl_bp)
> +		xfs_trans_bhold_release(sc->tp, sc->sa.agfl_bp);
>  
>  	return error;
>  }
> diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
> index 5a4e92221916..1d283360b5ab 100644
> --- a/fs/xfs/scrub/repair.h
> +++ b/fs/xfs/scrub/repair.h
> @@ -58,6 +58,8 @@ int xrep_ino_dqattach(struct xfs_scrub *sc);
>  
>  int xrep_probe(struct xfs_scrub *sc);
>  int xrep_superblock(struct xfs_scrub *sc);
> +int xrep_agf(struct xfs_scrub *sc);
> +int xrep_agfl(struct xfs_scrub *sc);
>  
>  #else
>  
> @@ -81,6 +83,8 @@ xrep_calc_ag_resblks(
>  
>  #define xrep_probe			xrep_notsupported
>  #define xrep_superblock			xrep_notsupported
> +#define xrep_agf			xrep_notsupported
> +#define xrep_agfl			xrep_notsupported
>  
>  #endif /* CONFIG_XFS_ONLINE_REPAIR */
>  
> diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
> index 6efb926f3cf8..1e8a17c8e2b9 100644
> --- a/fs/xfs/scrub/scrub.c
> +++ b/fs/xfs/scrub/scrub.c
> @@ -214,7 +214,7 @@ static const struct xchk_meta_ops meta_scrub_ops[] = {
>  		.type	= ST_PERAG,
>  		.setup	= xchk_setup_fs,
>  		.scrub	= xchk_agf,
> -		.repair	= xrep_notsupported,
> +		.repair	= xrep_agf,
>  	},
>  	[XFS_SCRUB_TYPE_AGFL]= {	/* agfl */
>  		.type	= ST_PERAG,
> 


* Re: [PATCH 03/14] xfs: repair the AGFL
  2018-07-30  5:48 ` [PATCH 03/14] xfs: repair the AGFL Darrick J. Wong
@ 2018-07-30 16:25   ` Brian Foster
  2018-07-30 17:22     ` Darrick J. Wong
  0 siblings, 1 reply; 53+ messages in thread
From: Brian Foster @ 2018-07-30 16:25 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, david, allison.henderson

On Sun, Jul 29, 2018 at 10:48:08PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Repair the AGFL from the rmap data.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---

FWIW, I tried tweaking a couple agfl values via xfs_db and xfs_scrub
seems to always dump a cross-referencing failed error and not want to
deal with it. Expected? Is there a good way to unit test some of this
stuff with simple/localized corruptions?

Otherwise this looks sane, a couple comments..

>  fs/xfs/scrub/agheader_repair.c |  276 ++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/scrub/bitmap.c          |   92 +++++++++++++
>  fs/xfs/scrub/bitmap.h          |    4 +
>  fs/xfs/scrub/scrub.c           |    2 
>  4 files changed, 373 insertions(+), 1 deletion(-)
> 
> 
> diff --git a/fs/xfs/scrub/agheader_repair.c b/fs/xfs/scrub/agheader_repair.c
> index 4842fc598c9b..bfef066c87c3 100644
> --- a/fs/xfs/scrub/agheader_repair.c
> +++ b/fs/xfs/scrub/agheader_repair.c
> @@ -424,3 +424,279 @@ xrep_agf(
>  	memcpy(agf, &old_agf, sizeof(old_agf));
>  	return error;
>  }
> +
...
> +/* Write out a totally new AGFL. */
> +STATIC void
> +xrep_agfl_init_header(
> +	struct xfs_scrub	*sc,
> +	struct xfs_buf		*agfl_bp,
> +	struct xfs_bitmap	*agfl_extents,
> +	xfs_agblock_t		flcount)
> +{
> +	struct xfs_mount	*mp = sc->mp;
> +	__be32			*agfl_bno;
> +	struct xfs_bitmap_range	*br;
> +	struct xfs_bitmap_range	*n;
> +	struct xfs_agfl		*agfl;
> +	xfs_agblock_t		agbno;
> +	unsigned int		fl_off;
> +
> +	ASSERT(flcount <= xfs_agfl_size(mp));
> +
> +	/* Start rewriting the header. */
> +	agfl = XFS_BUF_TO_AGFL(agfl_bp);
> +	memset(agfl, 0xFF, BBTOB(agfl_bp->b_length));

What's the purpose behind 0xFF? Related to NULLAGBLOCK/NULLCOMMITLSN..?

> +	agfl->agfl_magicnum = cpu_to_be32(XFS_AGFL_MAGIC);
> +	agfl->agfl_seqno = cpu_to_be32(sc->sa.agno);
> +	uuid_copy(&agfl->agfl_uuid, &mp->m_sb.sb_meta_uuid);
> +
> +	/*
> +	 * Fill the AGFL with the remaining blocks.  If agfl_extents has more
> +	 * blocks than fit in the AGFL, they will be freed in a subsequent
> +	 * step.
> +	 */
> +	fl_off = 0;
> +	agfl_bno = XFS_BUF_TO_AGFL_BNO(mp, agfl_bp);
> +	for_each_xfs_bitmap_extent(br, n, agfl_extents) {
> +		agbno = XFS_FSB_TO_AGBNO(mp, br->start);
> +
> +		trace_xrep_agfl_insert(mp, sc->sa.agno, agbno, br->len);
> +
> +		while (br->len > 0 && fl_off < flcount) {
> +			agfl_bno[fl_off] = cpu_to_be32(agbno);
> +			fl_off++;
> +			agbno++;

			/* bump br so we don't reap blocks we've used */

(i.e., took me a sec to realize why we bother with ->start)

> +			br->start++;
> +			br->len--;
> +		}
> +
> +		if (br->len)
> +			break;
> +		list_del(&br->list);
> +		kmem_free(br);
> +	}
> +
> +	/* Write new AGFL to disk. */
> +	xfs_trans_buf_set_type(sc->tp, agfl_bp, XFS_BLFT_AGFL_BUF);
> +	xfs_trans_log_buf(sc->tp, agfl_bp, 0, BBTOB(agfl_bp->b_length) - 1);
> +}
> +
...
> diff --git a/fs/xfs/scrub/bitmap.c b/fs/xfs/scrub/bitmap.c
> index c770e2d0b6aa..fdadc9e1dc49 100644
> --- a/fs/xfs/scrub/bitmap.c
> +++ b/fs/xfs/scrub/bitmap.c
> @@ -9,6 +9,7 @@
>  #include "xfs_format.h"
>  #include "xfs_trans_resv.h"
>  #include "xfs_mount.h"
> +#include "xfs_btree.h"
>  #include "scrub/xfs_scrub.h"
>  #include "scrub/scrub.h"
>  #include "scrub/common.h"
> @@ -209,3 +210,94 @@ xfs_bitmap_disunion(
>  }
>  #undef LEFT_ALIGNED
>  #undef RIGHT_ALIGNED
> +
> +/*
> + * Record all btree blocks seen while iterating all records of a btree.
> + *
> + * We know that the btree query_all function starts at the left edge and walks
> + * towards the right edge of the tree.  Therefore, we know that we can walk up
> + * the btree cursor towards the root; if the pointer for a given level points
> + * to the first record/key in that block, we haven't seen this block before;
> + * and therefore we need to remember that we saw this block in the btree.
> + *
> + * So if our btree is:
> + *
> + *    4
> + *  / | \
> + * 1  2  3
> + *
> + * Pretend for this example that each leaf block has 100 btree records.  For
> + * the first btree record, we'll observe that bc_ptrs[0] == 1, so we record
> + * that we saw block 1.  Then we observe that bc_ptrs[1] == 1, so we record
> + * block 4.  The list is [1, 4].
> + *
> + * For the second btree record, we see that bc_ptrs[0] == 2, so we exit the
> + * loop.  The list remains [1, 4].
> + *
> + * For the 101st btree record, we've moved onto leaf block 2.  Now
> + * bc_ptrs[0] == 1 again, so we record that we saw block 2.  We see that
> + * bc_ptrs[1] == 2, so we exit the loop.  The list is now [1, 4, 2].
> + *
> + * For the 102nd record, bc_ptrs[0] == 2, so we continue.
> + *
> + * For the 201st record, we've moved on to leaf block 3.  bc_ptrs[0] == 1, so
> + * we add 3 to the list.  Now it is [1, 4, 2, 3].
> + *
> + * For the 300th record we just exit, with the list being [1, 4, 2, 3].
> + */
> +
> +/*
> + * Record all the buffers pointed to by the btree cursor.  Callers already
> + * engaged in a btree walk should call this function to capture the list of
> + * blocks going from the leaf towards the root.
> + */
> +int
> +xfs_bitmap_set_btcur_path(
> +	struct xfs_bitmap	*bitmap,
> +	struct xfs_btree_cur	*cur)
> +{
> +	struct xfs_buf		*bp;
> +	xfs_fsblock_t		fsb;
> +	int			i;
> +	int			error;
> +
> +	for (i = 0; i < cur->bc_nlevels && cur->bc_ptrs[i] == 1; i++) {
> +		xfs_btree_get_block(cur, i, &bp);
> +		if (!bp)
> +			continue;
> +		fsb = XFS_DADDR_TO_FSB(cur->bc_mp, bp->b_bn);
> +		error = xfs_bitmap_set(bitmap, fsb, 1);

Thanks for the comment. It helps explain the bc_ptrs == 1 check above,
but also highlights that xfs_bitmap_set() essentially allocates entries
for duplicate values if they exist. Is this handled by the broader
mechanism, for example, if the rmapbt was corrupted to have multiple
entries for a particular unused OWN_AG block? Or could we end up leaking
that corruption over to the agfl?
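
The walk described in the quoted comment can be replayed in userspace. The
following is only a sketch under stated assumptions, not the kernel code: the
tree shape (three 100-record leaves numbered 1-3 under root block 4) comes
from the comment's example, while walk_and_collect(), the ptrs[] and blocks[]
arrays, and the constants are hypothetical names made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define RECS_PER_LEAF	100	/* from the comment's example */
#define NR_LEAVES	3	/* leaf blocks 1, 2 and 3 */
#define ROOT_BLOCK	4	/* the single root block */

/*
 * Visit every record left to right.  For each record, climb from the leaf
 * (level 0) towards the root for as long as the cursor points at the first
 * entry of the block at that level, recording each such block.  Returns the
 * number of blocks recorded; stores at most max_seen of them in seen[].
 */
static size_t
walk_and_collect(int *seen, size_t max_seen)
{
	size_t	nseen = 0;
	int	rec;

	for (rec = 0; rec < RECS_PER_LEAF * NR_LEAVES; rec++) {
		/* 1-based position within the leaf and within the root */
		int	ptrs[2] = { rec % RECS_PER_LEAF + 1,
				    rec / RECS_PER_LEAF + 1 };
		/* block number at each level for this record */
		int	blocks[2] = { rec / RECS_PER_LEAF + 1, ROOT_BLOCK };
		int	level;

		for (level = 0; level < 2 && ptrs[level] == 1; level++) {
			if (nseen < max_seen)
				seen[nseen] = blocks[level];
			nseen++;
		}
	}
	return nseen;
}
```

Each block is recorded exactly once in this model because a left-to-right
walk reaches position 1 of any given block only one time. The walk itself
produces no duplicates; any duplicate entries would have to come from the
records being walked.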

I also wonder a bit about memory consumption on filesystems with large
metadata footprints. We essentially have to allocate one of these for
every allocation btree block before we can do the disunion and locate
the agfl-appropriate blocks. If we had a more lookup friendly structure,
perhaps this could be optimized by filtering out bnobt/cntbt blocks
during the associated btree walks..?

Have you thought about reusing something like the new in-core extent
tree mechanism as a pure in-memory extent store? It's certainly not
worth reworking something like that right now, but I wonder if we could
save memory via the denser format (and perhaps benefit from code
flexibility, reuse, etc.).

Brian

> +		if (error)
> +			return error;
> +	}
> +
> +	return 0;
> +}
> +
> +/* Collect a btree's block in the bitmap. */
> +STATIC int
> +xfs_bitmap_collect_btblock(
> +	struct xfs_btree_cur	*cur,
> +	int			level,
> +	void			*priv)
> +{
> +	struct xfs_bitmap	*bitmap = priv;
> +	struct xfs_buf		*bp;
> +	xfs_fsblock_t		fsbno;
> +
> +	xfs_btree_get_block(cur, level, &bp);
> +	if (!bp)
> +		return 0;
> +
> +	fsbno = XFS_DADDR_TO_FSB(cur->bc_mp, bp->b_bn);
> +	return xfs_bitmap_set(bitmap, fsbno, 1);
> +}
> +
> +/* Walk the btree and mark the bitmap wherever a btree block is found. */
> +int
> +xfs_bitmap_set_btblocks(
> +	struct xfs_bitmap	*bitmap,
> +	struct xfs_btree_cur	*cur)
> +{
> +	return xfs_btree_visit_blocks(cur, xfs_bitmap_collect_btblock, bitmap);
> +}
> diff --git a/fs/xfs/scrub/bitmap.h b/fs/xfs/scrub/bitmap.h
> index dad652ee9177..ae8ecbce6fa6 100644
> --- a/fs/xfs/scrub/bitmap.h
> +++ b/fs/xfs/scrub/bitmap.h
> @@ -28,5 +28,9 @@ void xfs_bitmap_destroy(struct xfs_bitmap *bitmap);
>  
>  int xfs_bitmap_set(struct xfs_bitmap *bitmap, uint64_t start, uint64_t len);
>  int xfs_bitmap_disunion(struct xfs_bitmap *bitmap, struct xfs_bitmap *sub);
> +int xfs_bitmap_set_btcur_path(struct xfs_bitmap *bitmap,
> +		struct xfs_btree_cur *cur);
> +int xfs_bitmap_set_btblocks(struct xfs_bitmap *bitmap,
> +		struct xfs_btree_cur *cur);
>  
>  #endif	/* __XFS_SCRUB_BITMAP_H__ */
> diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
> index 1e8a17c8e2b9..2670f4cf62f4 100644
> --- a/fs/xfs/scrub/scrub.c
> +++ b/fs/xfs/scrub/scrub.c
> @@ -220,7 +220,7 @@ static const struct xchk_meta_ops meta_scrub_ops[] = {
>  		.type	= ST_PERAG,
>  		.setup	= xchk_setup_fs,
>  		.scrub	= xchk_agfl,
> -		.repair	= xrep_notsupported,
> +		.repair	= xrep_agfl,
>  	},
>  	[XFS_SCRUB_TYPE_AGI] = {	/* agi */
>  		.type	= ST_PERAG,
> 


* Re: [PATCH 03/14] xfs: repair the AGFL
  2018-07-30 16:25   ` Brian Foster
@ 2018-07-30 17:22     ` Darrick J. Wong
  2018-07-31 15:10       ` Brian Foster
  0 siblings, 1 reply; 53+ messages in thread
From: Darrick J. Wong @ 2018-07-30 17:22 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, david, allison.henderson

On Mon, Jul 30, 2018 at 12:25:24PM -0400, Brian Foster wrote:
> On Sun, Jul 29, 2018 at 10:48:08PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Repair the AGFL from the rmap data.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> 
> FWIW, I tried tweaking a couple agfl values via xfs_db and xfs_scrub
> seems to always dump a cross-referencing failed error and not want to
> deal with it. Expected? Is there a good way to unit test some of this
> stuff with simple/localized corruptions?

I usually pick one of the corruptions from xfs/355...

$ SCRATCH_XFS_LIST_FUZZ_VERBS=random \
SCRATCH_XFS_LIST_METADATA_FIELDS=somefield \
./check xfs/355

> Otherwise this looks sane, a couple comments..
> 
> >  fs/xfs/scrub/agheader_repair.c |  276 ++++++++++++++++++++++++++++++++++++++++
> >  fs/xfs/scrub/bitmap.c          |   92 +++++++++++++
> >  fs/xfs/scrub/bitmap.h          |    4 +
> >  fs/xfs/scrub/scrub.c           |    2 
> >  4 files changed, 373 insertions(+), 1 deletion(-)
> > 
> > 
> > diff --git a/fs/xfs/scrub/agheader_repair.c b/fs/xfs/scrub/agheader_repair.c
> > index 4842fc598c9b..bfef066c87c3 100644
> > --- a/fs/xfs/scrub/agheader_repair.c
> > +++ b/fs/xfs/scrub/agheader_repair.c
> > @@ -424,3 +424,279 @@ xrep_agf(
> >  	memcpy(agf, &old_agf, sizeof(old_agf));
> >  	return error;
> >  }
> > +
> ...
> > +/* Write out a totally new AGFL. */
> > +STATIC void
> > +xrep_agfl_init_header(
> > +	struct xfs_scrub	*sc,
> > +	struct xfs_buf		*agfl_bp,
> > +	struct xfs_bitmap	*agfl_extents,
> > +	xfs_agblock_t		flcount)
> > +{
> > +	struct xfs_mount	*mp = sc->mp;
> > +	__be32			*agfl_bno;
> > +	struct xfs_bitmap_range	*br;
> > +	struct xfs_bitmap_range	*n;
> > +	struct xfs_agfl		*agfl;
> > +	xfs_agblock_t		agbno;
> > +	unsigned int		fl_off;
> > +
> > +	ASSERT(flcount <= xfs_agfl_size(mp));
> > +
> > +	/* Start rewriting the header. */
> > +	agfl = XFS_BUF_TO_AGFL(agfl_bp);
> > +	memset(agfl, 0xFF, BBTOB(agfl_bp->b_length));
> 
> What's the purpose behind 0xFF? Related to NULLAGBLOCK/NULLCOMMITLSN..?

Yes, it prepopulates the AGFL bno[] array with NULLAGBLOCK, then writes
in the header fields.
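
A small userspace check of that memset trick, as a sketch: NULLAGBLOCK is
all-ones in 32 bits, so filling the buffer with 0xFF bytes leaves every slot
holding the null sentinel, and the value reads the same in either byte order.
NULLAGBLOCK_MODEL and count_null_slots_after_fill() are hypothetical names
used only for this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for the kernel's NULLAGBLOCK, which is (xfs_agblock_t)-1. */
#define NULLAGBLOCK_MODEL	((uint32_t)-1)

/*
 * Fill an array of 32-bit slots with 0xFF bytes, then count how many slots
 * compare equal to the null-block sentinel.
 */
static unsigned int
count_null_slots_after_fill(uint32_t *slots, unsigned int nr)
{
	unsigned int	i;
	unsigned int	nulls = 0;

	memset(slots, 0xFF, nr * sizeof(*slots));
	for (i = 0; i < nr; i++)
		if (slots[i] == NULLAGBLOCK_MODEL)
			nulls++;
	return nulls;
}
```

Because 0xFFFFFFFF is byte-order invariant, no cpu_to_be32() conversion is
needed for the prefilled slots; only the entries written afterwards are
converted.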

> > +	agfl->agfl_magicnum = cpu_to_be32(XFS_AGFL_MAGIC);
> > +	agfl->agfl_seqno = cpu_to_be32(sc->sa.agno);
> > +	uuid_copy(&agfl->agfl_uuid, &mp->m_sb.sb_meta_uuid);
> > +
> > +	/*
> > +	 * Fill the AGFL with the remaining blocks.  If agfl_extents has more
> > +	 * blocks than fit in the AGFL, they will be freed in a subsequent
> > +	 * step.
> > +	 */
> > +	fl_off = 0;
> > +	agfl_bno = XFS_BUF_TO_AGFL_BNO(mp, agfl_bp);
> > +	for_each_xfs_bitmap_extent(br, n, agfl_extents) {
> > +		agbno = XFS_FSB_TO_AGBNO(mp, br->start);
> > +
> > +		trace_xrep_agfl_insert(mp, sc->sa.agno, agbno, br->len);
> > +
> > +		while (br->len > 0 && fl_off < flcount) {
> > +			agfl_bno[fl_off] = cpu_to_be32(agbno);
> > +			fl_off++;
> > +			agbno++;
> 
> 			/* bump br so we don't reap blocks we've used */
> 
> (i.e., took me a sec to realize why we bother with ->start)
> 
> > +			br->start++;
> > +			br->len--;
> > +		}
> > +
> > +		if (br->len)
> > +			break;
> > +		list_del(&br->list);
> > +		kmem_free(br);
> > +	}
> > +
> > +	/* Write new AGFL to disk. */
> > +	xfs_trans_buf_set_type(sc->tp, agfl_bp, XFS_BLFT_AGFL_BUF);
> > +	xfs_trans_log_buf(sc->tp, agfl_bp, 0, BBTOB(agfl_bp->b_length) - 1);
> > +}
> > +
> ...
> > diff --git a/fs/xfs/scrub/bitmap.c b/fs/xfs/scrub/bitmap.c
> > index c770e2d0b6aa..fdadc9e1dc49 100644
> > --- a/fs/xfs/scrub/bitmap.c
> > +++ b/fs/xfs/scrub/bitmap.c
> > @@ -9,6 +9,7 @@
> >  #include "xfs_format.h"
> >  #include "xfs_trans_resv.h"
> >  #include "xfs_mount.h"
> > +#include "xfs_btree.h"
> >  #include "scrub/xfs_scrub.h"
> >  #include "scrub/scrub.h"
> >  #include "scrub/common.h"
> > @@ -209,3 +210,94 @@ xfs_bitmap_disunion(
> >  }
> >  #undef LEFT_ALIGNED
> >  #undef RIGHT_ALIGNED
> > +
> > +/*
> > + * Record all btree blocks seen while iterating all records of a btree.
> > + *
> > + * We know that the btree query_all function starts at the left edge and walks
> > + * towards the right edge of the tree.  Therefore, we know that we can walk up
> > + * the btree cursor towards the root; if the pointer for a given level points
> > + * to the first record/key in that block, we haven't seen this block before;
> > + * and therefore we need to remember that we saw this block in the btree.
> > + *
> > + * So if our btree is:
> > + *
> > + *    4
> > + *  / | \
> > + * 1  2  3
> > + *
> > + * Pretend for this example that each leaf block has 100 btree records.  For
> > + * the first btree record, we'll observe that bc_ptrs[0] == 1, so we record
> > + * that we saw block 1.  Then we observe that bc_ptrs[1] == 1, so we record
> > + * block 4.  The list is [1, 4].
> > + *
> > + * For the second btree record, we see that bc_ptrs[0] == 2, so we exit the
> > + * loop.  The list remains [1, 4].
> > + *
> > + * For the 101st btree record, we've moved onto leaf block 2.  Now
> > + * bc_ptrs[0] == 1 again, so we record that we saw block 2.  We see that
> > + * bc_ptrs[1] == 2, so we exit the loop.  The list is now [1, 4, 2].
> > + *
> > + * For the 102nd record, bc_ptrs[0] == 2, so we continue.
> > + *
> > + * For the 201st record, we've moved on to leaf block 3.  bc_ptrs[0] == 1, so
> > + * we add 3 to the list.  Now it is [1, 4, 2, 3].
> > + *
> > + * For the 300th record we just exit, with the list being [1, 4, 2, 3].
> > + */
> > +
> > +/*
> > + * Record all the buffers pointed to by the btree cursor.  Callers already
> > + * engaged in a btree walk should call this function to capture the list of
> > + * blocks going from the leaf towards the root.
> > + */
> > +int
> > +xfs_bitmap_set_btcur_path(
> > +	struct xfs_bitmap	*bitmap,
> > +	struct xfs_btree_cur	*cur)
> > +{
> > +	struct xfs_buf		*bp;
> > +	xfs_fsblock_t		fsb;
> > +	int			i;
> > +	int			error;
> > +
> > +	for (i = 0; i < cur->bc_nlevels && cur->bc_ptrs[i] == 1; i++) {
> > +		xfs_btree_get_block(cur, i, &bp);
> > +		if (!bp)
> > +			continue;
> > +		fsb = XFS_DADDR_TO_FSB(cur->bc_mp, bp->b_bn);
> > +		error = xfs_bitmap_set(bitmap, fsb, 1);
> 
> Thanks for the comment. It helps explain the bc_ptrs == 1 check above,
> but also highlights that xfs_bitmap_set() essentially allocates entries
> for duplicate values if they exist. Is this handled by the broader
> mechanism, for example, if the rmapbt was corrupted to have multiple
> entries for a particular unused OWN_AG block? Or could we end up leaking
> that corruption over to the agfl?

Right now we're totally dependent on the rmapbt being sane to rebuild
the space metadata.
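
To make the duplicate concern concrete, here is a userspace sketch of a
list-backed range set whose set operation appends without merging, which is
the behaviour Brian describes above. It is an assumption-laden model, not the
xfs_bitmap code: struct range, struct range_list and the range_list_*()
helpers are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct range {
	uint64_t	start;
	uint64_t	len;
	struct range	*next;
};

struct range_list {
	struct range	*head;
	struct range	*tail;
};

/* Append a range; no attempt is made to merge with existing entries. */
static int
range_list_set(struct range_list *rl, uint64_t start, uint64_t len)
{
	struct range	*r = malloc(sizeof(*r));

	if (!r)
		return -1;
	r->start = start;
	r->len = len;
	r->next = NULL;
	if (rl->tail)
		rl->tail->next = r;
	else
		rl->head = r;
	rl->tail = r;
	return 0;
}

/* Sum the lengths of all entries, counting overlapping blocks repeatedly. */
static uint64_t
range_list_count_blocks(const struct range_list *rl)
{
	const struct range	*r;
	uint64_t		total = 0;

	for (r = rl->head; r; r = r->next)
		total += r->len;
	return total;
}

/* Free every entry and reset the list to empty. */
static void
range_list_destroy(struct range_list *rl)
{
	struct range	*r = rl->head;

	while (r) {
		struct range	*next = r->next;

		free(r);
		r = next;
	}
	rl->head = rl->tail = NULL;
}
```

Setting the same block twice leaves two entries in this model, so a consumer
that copies entries into a fixed-size array would see the block twice unless
the input records are already free of duplicates.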

> I also wonder a bit about memory consumption on filesystems with large
> metadata footprints. We essentially have to allocate one of these for
> every allocation btree block before we can do the disunion and locate
> the agfl-appropriate blocks. If we had a more lookup friendly structure,
> perhaps this could be optimized by filtering out bnobt/cntbt blocks
> during the associated btree walks..?
> 
> Have you thought about reusing something like the new in-core extent
> tree mechanism as a pure in-memory extent store? It's certainly not
> worth reworking something like that right now, but I wonder if we could
> save memory via the denser format (and perhaps benefit from code
> flexibility, reuse, etc.).

Yes, I was thinking about refactoring the iext btree into a more generic
in-core index with 64-bit key so that I could adapt xfs_bitmap to use
it.  In the longer term I would /also/ like to use xfs_bitmap to detect
xfs_buf cache aliasing when multi-block buffers are in use, but that's
further off. :)

As for the memory-intensive record lists in all the btree rebuilders, I
have some ideas around that too -- either find a way to build an
alternate btree and switch the roots over, or (once we gain the ability
to mark an AG unavailable for new allocations) allocate an unlinked
inode, store the records in the page cache pages for the file, and
release it when we're done.

But, that can wait until I've gotten more of this merged, or get bored.
:)

--D

> Brian
> 
> > +		if (error)
> > +			return error;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +/* Collect a btree's block in the bitmap. */
> > +STATIC int
> > +xfs_bitmap_collect_btblock(
> > +	struct xfs_btree_cur	*cur,
> > +	int			level,
> > +	void			*priv)
> > +{
> > +	struct xfs_bitmap	*bitmap = priv;
> > +	struct xfs_buf		*bp;
> > +	xfs_fsblock_t		fsbno;
> > +
> > +	xfs_btree_get_block(cur, level, &bp);
> > +	if (!bp)
> > +		return 0;
> > +
> > +	fsbno = XFS_DADDR_TO_FSB(cur->bc_mp, bp->b_bn);
> > +	return xfs_bitmap_set(bitmap, fsbno, 1);
> > +}
> > +
> > +/* Walk the btree and mark the bitmap wherever a btree block is found. */
> > +int
> > +xfs_bitmap_set_btblocks(
> > +	struct xfs_bitmap	*bitmap,
> > +	struct xfs_btree_cur	*cur)
> > +{
> > +	return xfs_btree_visit_blocks(cur, xfs_bitmap_collect_btblock, bitmap);
> > +}
> > diff --git a/fs/xfs/scrub/bitmap.h b/fs/xfs/scrub/bitmap.h
> > index dad652ee9177..ae8ecbce6fa6 100644
> > --- a/fs/xfs/scrub/bitmap.h
> > +++ b/fs/xfs/scrub/bitmap.h
> > @@ -28,5 +28,9 @@ void xfs_bitmap_destroy(struct xfs_bitmap *bitmap);
> >  
> >  int xfs_bitmap_set(struct xfs_bitmap *bitmap, uint64_t start, uint64_t len);
> >  int xfs_bitmap_disunion(struct xfs_bitmap *bitmap, struct xfs_bitmap *sub);
> > +int xfs_bitmap_set_btcur_path(struct xfs_bitmap *bitmap,
> > +		struct xfs_btree_cur *cur);
> > +int xfs_bitmap_set_btblocks(struct xfs_bitmap *bitmap,
> > +		struct xfs_btree_cur *cur);
> >  
> >  #endif	/* __XFS_SCRUB_BITMAP_H__ */
> > diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
> > index 1e8a17c8e2b9..2670f4cf62f4 100644
> > --- a/fs/xfs/scrub/scrub.c
> > +++ b/fs/xfs/scrub/scrub.c
> > @@ -220,7 +220,7 @@ static const struct xchk_meta_ops meta_scrub_ops[] = {
> >  		.type	= ST_PERAG,
> >  		.setup	= xchk_setup_fs,
> >  		.scrub	= xchk_agfl,
> > -		.repair	= xrep_notsupported,
> > +		.repair	= xrep_agfl,
> >  	},
> >  	[XFS_SCRUB_TYPE_AGI] = {	/* agi */
> >  		.type	= ST_PERAG,
> > 


* Re: [PATCH 02/14] xfs: repair the AGF
  2018-07-30 16:22   ` Brian Foster
@ 2018-07-30 17:31     ` Darrick J. Wong
  2018-07-30 18:19       ` Brian Foster
  0 siblings, 1 reply; 53+ messages in thread
From: Darrick J. Wong @ 2018-07-30 17:31 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, david, allison.henderson

On Mon, Jul 30, 2018 at 12:22:24PM -0400, Brian Foster wrote:
> On Sun, Jul 29, 2018 at 10:48:02PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Regenerate the AGF from the rmap data.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/scrub/agheader_repair.c |  370 ++++++++++++++++++++++++++++++++++++++++
> >  fs/xfs/scrub/repair.c          |   27 ++-
> >  fs/xfs/scrub/repair.h          |    4 
> >  fs/xfs/scrub/scrub.c           |    2 
> >  4 files changed, 393 insertions(+), 10 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/scrub/agheader_repair.c b/fs/xfs/scrub/agheader_repair.c
> > index 1e96621ece3a..4842fc598c9b 100644
> > --- a/fs/xfs/scrub/agheader_repair.c
> > +++ b/fs/xfs/scrub/agheader_repair.c
> ...
> > @@ -54,3 +61,366 @@ xrep_superblock(
> ...
> > +/* Repair the AGF. v5 filesystems only. */
> > +int
> > +xrep_agf(
> > +	struct xfs_scrub		*sc)
> > +{
> ...
> > +	/* Start rewriting the header and implant the btrees we found. */
> > +	xrep_agf_init_header(sc, agf_bp, &old_agf);
> > +	xrep_agf_set_roots(sc, agf, fab);
> > +	error = xrep_agf_calc_from_btrees(sc, agf_bp);
> > +	if (error)
> > +		goto out_revert;
> > +
> > +	/* Commit the changes and reinitialize incore state. */
> > +	return xrep_agf_commit_new(sc, agf_bp);
> > +
> > +out_revert:
> > +	/* Mark the incore AGF state stale and revert the AGF. */
> > +	sc->sa.pag->pagf_init = 0;
> 
> Hmm, looking at this again I'm not sure it's safe to reset ->pagf_init
> like this. The contexts where we hold agf might be Ok because I think
> that might prevent some other thread from actually coming in and
> resetting it, but look at what xfs_alloc_read_agf() does in this case if the
> agf becomes available with !pagf_init. Specifically, are we at risk of
> corrupting a populated ->pagb_tree or causing other problems by
> reinitializing the spinlock? Perhaps we need another patch to separate
> out some of those fields that should only ever be initialized once.

Yikes, the pagb_tree & spinlock should not get reinitialized.  I don't
see where we ever tear them down except for unmount, so I *think* we can
move it to xfs_initialize_perag.  It's a little mystifying why we don't
initialize those things there like we do for the incore inode radix tree.

Also it would finally fix the discrepancy with xfsprogs libxfs where
they comment out the RB_ROOT initialization.
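
Roughly the split being discussed, as a userspace sketch rather than a patch:
fields that must only ever be set up once move into a mount-time initializer,
and the read path refreshes only what can be rebuilt from disk when pagf_init
is clear. struct perag_model, its fields and both functions are hypothetical
stand-ins; busy_extents plays the role of pagb_tree, and once_inits exists
only so the model can show how often the one-time setup ran.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct perag_model {
	/* Set up once at mount time and never reset afterwards. */
	int		busy_extents;	/* stand-in for pagb_tree contents */
	unsigned int	once_inits;	/* times the one-time init has run */

	/* Rebuilt from the on-disk AGF whenever pagf_init is clear. */
	bool		pagf_init;
	uint32_t	freeblks;
};

/* Mount-time setup: the only place the long-lived fields are initialized. */
static void
perag_model_init_once(struct perag_model *pag)
{
	pag->busy_extents = 0;
	pag->once_inits++;
	pag->pagf_init = false;
	pag->freeblks = 0;
}

/* Read path: refresh disk-derived state only, leave everything else alone. */
static void
perag_model_read_agf(struct perag_model *pag, uint32_t ondisk_freeblks)
{
	if (pag->pagf_init)
		return;
	pag->freeblks = ondisk_freeblks;
	pag->pagf_init = true;
}
```

With this shape, clearing pagf_init on the repair error path only forces the
counters to be reloaded on the next read; the long-lived state is left
untouched.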

> With something like that, it might subsequently make sense to factor the
> reinit from disk bits into a helper to be shared between
> xrep_agf_commit_new() and xfs_alloc_read_agf(). I also wonder if it's
> sufficient to just update the agf on disk and leave pagf_init == 0.

Hmm, wasn't there some verifier that used pag*_init (can't remember
which one) to decide if we were in log recovery?

--D

> Otherwise the rest of this patch seems Ok to me.
> 
> Brian
> 
> > +	memcpy(agf, &old_agf, sizeof(old_agf));
> > +	return error;
> > +}
> > diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
> > index 85b048b341a0..17cf48564390 100644
> > --- a/fs/xfs/scrub/repair.c
> > +++ b/fs/xfs/scrub/repair.c
> > @@ -128,9 +128,12 @@ xrep_roll_ag_trans(
> >  	int			error;
> >  
> >  	/* Keep the AG header buffers locked so we can keep going. */
> > -	xfs_trans_bhold(sc->tp, sc->sa.agi_bp);
> > -	xfs_trans_bhold(sc->tp, sc->sa.agf_bp);
> > -	xfs_trans_bhold(sc->tp, sc->sa.agfl_bp);
> > +	if (sc->sa.agi_bp)
> > +		xfs_trans_bhold(sc->tp, sc->sa.agi_bp);
> > +	if (sc->sa.agf_bp)
> > +		xfs_trans_bhold(sc->tp, sc->sa.agf_bp);
> > +	if (sc->sa.agfl_bp)
> > +		xfs_trans_bhold(sc->tp, sc->sa.agfl_bp);
> >  
> >  	/* Roll the transaction. */
> >  	error = xfs_trans_roll(&sc->tp);
> > @@ -138,9 +141,12 @@ xrep_roll_ag_trans(
> >  		goto out_release;
> >  
> >  	/* Join AG headers to the new transaction. */
> > -	xfs_trans_bjoin(sc->tp, sc->sa.agi_bp);
> > -	xfs_trans_bjoin(sc->tp, sc->sa.agf_bp);
> > -	xfs_trans_bjoin(sc->tp, sc->sa.agfl_bp);
> > +	if (sc->sa.agi_bp)
> > +		xfs_trans_bjoin(sc->tp, sc->sa.agi_bp);
> > +	if (sc->sa.agf_bp)
> > +		xfs_trans_bjoin(sc->tp, sc->sa.agf_bp);
> > +	if (sc->sa.agfl_bp)
> > +		xfs_trans_bjoin(sc->tp, sc->sa.agfl_bp);
> >  
> >  	return 0;
> >  
> > @@ -150,9 +156,12 @@ xrep_roll_ag_trans(
> >  	 * buffers will be released during teardown on our way out
> >  	 * of the kernel.
> >  	 */
> > -	xfs_trans_bhold_release(sc->tp, sc->sa.agi_bp);
> > -	xfs_trans_bhold_release(sc->tp, sc->sa.agf_bp);
> > -	xfs_trans_bhold_release(sc->tp, sc->sa.agfl_bp);
> > +	if (sc->sa.agi_bp)
> > +		xfs_trans_bhold_release(sc->tp, sc->sa.agi_bp);
> > +	if (sc->sa.agf_bp)
> > +		xfs_trans_bhold_release(sc->tp, sc->sa.agf_bp);
> > +	if (sc->sa.agfl_bp)
> > +		xfs_trans_bhold_release(sc->tp, sc->sa.agfl_bp);
> >  
> >  	return error;
> >  }
> > diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
> > index 5a4e92221916..1d283360b5ab 100644
> > --- a/fs/xfs/scrub/repair.h
> > +++ b/fs/xfs/scrub/repair.h
> > @@ -58,6 +58,8 @@ int xrep_ino_dqattach(struct xfs_scrub *sc);
> >  
> >  int xrep_probe(struct xfs_scrub *sc);
> >  int xrep_superblock(struct xfs_scrub *sc);
> > +int xrep_agf(struct xfs_scrub *sc);
> > +int xrep_agfl(struct xfs_scrub *sc);
> >  
> >  #else
> >  
> > @@ -81,6 +83,8 @@ xrep_calc_ag_resblks(
> >  
> >  #define xrep_probe			xrep_notsupported
> >  #define xrep_superblock			xrep_notsupported
> > +#define xrep_agf			xrep_notsupported
> > +#define xrep_agfl			xrep_notsupported
> >  
> >  #endif /* CONFIG_XFS_ONLINE_REPAIR */
> >  
> > diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
> > index 6efb926f3cf8..1e8a17c8e2b9 100644
> > --- a/fs/xfs/scrub/scrub.c
> > +++ b/fs/xfs/scrub/scrub.c
> > @@ -214,7 +214,7 @@ static const struct xchk_meta_ops meta_scrub_ops[] = {
> >  		.type	= ST_PERAG,
> >  		.setup	= xchk_setup_fs,
> >  		.scrub	= xchk_agf,
> > -		.repair	= xrep_notsupported,
> > +		.repair	= xrep_agf,
> >  	},
> >  	[XFS_SCRUB_TYPE_AGFL]= {	/* agfl */
> >  		.type	= ST_PERAG,
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html

* Re: [PATCH 02/14] xfs: repair the AGF
  2018-07-30 17:31     ` Darrick J. Wong
@ 2018-07-30 18:19       ` Brian Foster
  2018-07-30 18:22         ` Darrick J. Wong
  0 siblings, 1 reply; 53+ messages in thread
From: Brian Foster @ 2018-07-30 18:19 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, david, allison.henderson

On Mon, Jul 30, 2018 at 10:31:11AM -0700, Darrick J. Wong wrote:
> On Mon, Jul 30, 2018 at 12:22:24PM -0400, Brian Foster wrote:
> > On Sun, Jul 29, 2018 at 10:48:02PM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > Regenerate the AGF from the rmap data.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > ---
> > >  fs/xfs/scrub/agheader_repair.c |  370 ++++++++++++++++++++++++++++++++++++++++
> > >  fs/xfs/scrub/repair.c          |   27 ++-
> > >  fs/xfs/scrub/repair.h          |    4 
> > >  fs/xfs/scrub/scrub.c           |    2 
> > >  4 files changed, 393 insertions(+), 10 deletions(-)
> > > 
> > > 
> > > diff --git a/fs/xfs/scrub/agheader_repair.c b/fs/xfs/scrub/agheader_repair.c
> > > index 1e96621ece3a..4842fc598c9b 100644
> > > --- a/fs/xfs/scrub/agheader_repair.c
> > > +++ b/fs/xfs/scrub/agheader_repair.c
> > ...
> > > @@ -54,3 +61,366 @@ xrep_superblock(
> > ...
> > > +/* Repair the AGF. v5 filesystems only. */
> > > +int
> > > +xrep_agf(
> > > +	struct xfs_scrub		*sc)
> > > +{
> > ...
> > > +	/* Start rewriting the header and implant the btrees we found. */
> > > +	xrep_agf_init_header(sc, agf_bp, &old_agf);
> > > +	xrep_agf_set_roots(sc, agf, fab);
> > > +	error = xrep_agf_calc_from_btrees(sc, agf_bp);
> > > +	if (error)
> > > +		goto out_revert;
> > > +
> > > +	/* Commit the changes and reinitialize incore state. */
> > > +	return xrep_agf_commit_new(sc, agf_bp);
> > > +
> > > +out_revert:
> > > +	/* Mark the incore AGF state stale and revert the AGF. */
> > > +	sc->sa.pag->pagf_init = 0;
> > 
> > Hmm, looking at this again I'm not sure it's safe to reset ->pagf_init
> > like this. The contexts where we hold agf might be Ok because I think
> > that might prevent some other thread from actually coming in and
> > resetting it, but look at what xfs_alloc_read_agf() does in this case if the
> > agf becomes available with !pagf_init. Specifically, are we at risk of
> > corrupting a populated ->pagb_tree or causing other problems by
> > reinitializing the spinlock? Perhaps we need another patch to separate
> > out some of those fields that should only ever be initialized once.
> 
> Yikes, the pagb_tree & spinlock should not get reinitialized.  I don't
> see where we ever tear them down except for unmount, so I *think* we can
> move it to xfs_initialize_perag.  It's a little mystifying why we don't
> initialize those things there like we do for the incore inode radix tree.
> 
> Also it would finally fix the discrepancy with xfsprogs libxfs where
> they comment out the RB_ROOT initialization.
> 
> > With something like that, it might subsequently make sense to factor the
> > reinit from disk bits into a helper to be shared between
> > xrep_agf_commit_new() and xfs_alloc_read_agf(). I also wonder if it's
> > sufficient to just update the agf on disk and leave pagf_init == 0.
> 
> Hmm, wasn't there some verifier that used pag*_init (can't remember
> which one) to decide if we were in log recovery?
> 

It looks like xfs_attr3_leaf_verify() might do something like that. But
don't we have to handle that either way if the error path leaves
pagf_init == 0 on return? Actually we might have to address it
regardless if we want to use pagf_init if that path isn't holding the
agf.

Brian

> --D
> 
> > Otherwise the rest of this patch seems Ok to me.
> > 
> > Brian
> > 
> > > +	memcpy(agf, &old_agf, sizeof(old_agf));
> > > +	return error;
> > > +}
> > > diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
> > > index 85b048b341a0..17cf48564390 100644
> > > --- a/fs/xfs/scrub/repair.c
> > > +++ b/fs/xfs/scrub/repair.c
> > > @@ -128,9 +128,12 @@ xrep_roll_ag_trans(
> > >  	int			error;
> > >  
> > >  	/* Keep the AG header buffers locked so we can keep going. */
> > > -	xfs_trans_bhold(sc->tp, sc->sa.agi_bp);
> > > -	xfs_trans_bhold(sc->tp, sc->sa.agf_bp);
> > > -	xfs_trans_bhold(sc->tp, sc->sa.agfl_bp);
> > > +	if (sc->sa.agi_bp)
> > > +		xfs_trans_bhold(sc->tp, sc->sa.agi_bp);
> > > +	if (sc->sa.agf_bp)
> > > +		xfs_trans_bhold(sc->tp, sc->sa.agf_bp);
> > > +	if (sc->sa.agfl_bp)
> > > +		xfs_trans_bhold(sc->tp, sc->sa.agfl_bp);
> > >  
> > >  	/* Roll the transaction. */
> > >  	error = xfs_trans_roll(&sc->tp);
> > > @@ -138,9 +141,12 @@ xrep_roll_ag_trans(
> > >  		goto out_release;
> > >  
> > >  	/* Join AG headers to the new transaction. */
> > > -	xfs_trans_bjoin(sc->tp, sc->sa.agi_bp);
> > > -	xfs_trans_bjoin(sc->tp, sc->sa.agf_bp);
> > > -	xfs_trans_bjoin(sc->tp, sc->sa.agfl_bp);
> > > +	if (sc->sa.agi_bp)
> > > +		xfs_trans_bjoin(sc->tp, sc->sa.agi_bp);
> > > +	if (sc->sa.agf_bp)
> > > +		xfs_trans_bjoin(sc->tp, sc->sa.agf_bp);
> > > +	if (sc->sa.agfl_bp)
> > > +		xfs_trans_bjoin(sc->tp, sc->sa.agfl_bp);
> > >  
> > >  	return 0;
> > >  
> > > @@ -150,9 +156,12 @@ xrep_roll_ag_trans(
> > >  	 * buffers will be released during teardown on our way out
> > >  	 * of the kernel.
> > >  	 */
> > > -	xfs_trans_bhold_release(sc->tp, sc->sa.agi_bp);
> > > -	xfs_trans_bhold_release(sc->tp, sc->sa.agf_bp);
> > > -	xfs_trans_bhold_release(sc->tp, sc->sa.agfl_bp);
> > > +	if (sc->sa.agi_bp)
> > > +		xfs_trans_bhold_release(sc->tp, sc->sa.agi_bp);
> > > +	if (sc->sa.agf_bp)
> > > +		xfs_trans_bhold_release(sc->tp, sc->sa.agf_bp);
> > > +	if (sc->sa.agfl_bp)
> > > +		xfs_trans_bhold_release(sc->tp, sc->sa.agfl_bp);
> > >  
> > >  	return error;
> > >  }
> > > diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
> > > index 5a4e92221916..1d283360b5ab 100644
> > > --- a/fs/xfs/scrub/repair.h
> > > +++ b/fs/xfs/scrub/repair.h
> > > @@ -58,6 +58,8 @@ int xrep_ino_dqattach(struct xfs_scrub *sc);
> > >  
> > >  int xrep_probe(struct xfs_scrub *sc);
> > >  int xrep_superblock(struct xfs_scrub *sc);
> > > +int xrep_agf(struct xfs_scrub *sc);
> > > +int xrep_agfl(struct xfs_scrub *sc);
> > >  
> > >  #else
> > >  
> > > @@ -81,6 +83,8 @@ xrep_calc_ag_resblks(
> > >  
> > >  #define xrep_probe			xrep_notsupported
> > >  #define xrep_superblock			xrep_notsupported
> > > +#define xrep_agf			xrep_notsupported
> > > +#define xrep_agfl			xrep_notsupported
> > >  
> > >  #endif /* CONFIG_XFS_ONLINE_REPAIR */
> > >  
> > > diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
> > > index 6efb926f3cf8..1e8a17c8e2b9 100644
> > > --- a/fs/xfs/scrub/scrub.c
> > > +++ b/fs/xfs/scrub/scrub.c
> > > @@ -214,7 +214,7 @@ static const struct xchk_meta_ops meta_scrub_ops[] = {
> > >  		.type	= ST_PERAG,
> > >  		.setup	= xchk_setup_fs,
> > >  		.scrub	= xchk_agf,
> > > -		.repair	= xrep_notsupported,
> > > +		.repair	= xrep_agf,
> > >  	},
> > >  	[XFS_SCRUB_TYPE_AGFL]= {	/* agfl */
> > >  		.type	= ST_PERAG,
> > > 

* Re: [PATCH 04/14] xfs: repair the AGI
  2018-07-30  5:48 ` [PATCH 04/14] xfs: repair the AGI Darrick J. Wong
@ 2018-07-30 18:20   ` Brian Foster
  2018-07-30 18:44     ` Darrick J. Wong
  0 siblings, 1 reply; 53+ messages in thread
From: Brian Foster @ 2018-07-30 18:20 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, david, allison.henderson

On Sun, Jul 29, 2018 at 10:48:15PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Rebuild the AGI header items with some help from the rmapbt.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---

A couple of nits and future thoughts...

>  fs/xfs/scrub/agheader_repair.c |  220 ++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/scrub/repair.h          |    2 
>  fs/xfs/scrub/scrub.c           |    2 
>  3 files changed, 223 insertions(+), 1 deletion(-)
> 
> 
> diff --git a/fs/xfs/scrub/agheader_repair.c b/fs/xfs/scrub/agheader_repair.c
> index bfef066c87c3..921e7d42a2ef 100644
> --- a/fs/xfs/scrub/agheader_repair.c
> +++ b/fs/xfs/scrub/agheader_repair.c
> @@ -700,3 +700,223 @@ xrep_agfl(
>  	xfs_bitmap_destroy(&agfl_extents);
>  	return error;
>  }
...
> +STATIC int
> +xrep_agi_find_btrees(
> +	struct xfs_scrub		*sc,
> +	struct xrep_find_ag_btree	*fab)
> +{
> +	struct xfs_buf			*agf_bp;
> +	struct xfs_mount		*mp = sc->mp;
> +	int				error;
> +
> +	/* Read the AGF. */
> +	error = xfs_alloc_read_agf(mp, sc->tp, sc->sa.agno, 0, &agf_bp);
> +	if (error)
> +		return error;
> +	if (!agf_bp)
> +		return -ENOMEM;
> +
> +	/* Find the btree roots. */
> +	error = xrep_find_ag_btree_roots(sc, agf_bp, fab, NULL);
> +	if (error)
> +		return error;
> +
> +	/* We must find the inobt root. */
> +	if (fab[XREP_AGI_INOBT].root == NULLAGBLOCK ||
> +	    fab[XREP_AGI_INOBT].height > XFS_BTREE_MAXLEVELS)
> +		return -EFSCORRUPTED;
> +
> +	/* We must find the finobt root if that feature is enabled. */
> +	if (xfs_sb_version_hasfinobt(&mp->m_sb) &&
> +	    (fab[XREP_AGI_FINOBT].root == NULLAGBLOCK ||
> +	     fab[XREP_AGI_FINOBT].height > XFS_BTREE_MAXLEVELS))
> +		return -EFSCORRUPTED;

Skimming around some of the existing code to find the btree roots, I
notice that the .root field is going to be != NULLAGBLOCK so long as we
find at least one appropriately typed block in the rmapbt. I know you
mentioned that we depend on a correct rmapbt atm, but I'm wondering if
there's room for slightly more robust error checks to at least prevent
us from doing something damaging. For example, perhaps we could unset
.root when we've found a second block at the same level as the current
"root," and/or check some of the generic characteristics of a root btree
block (no left/right siblings) once we're done..?
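
To make the suggestion concrete, here is a rough userspace sketch of both
checks.  It is not the xrep_find_ag_btree_roots() logic; the model_* types
and the visit function are hypothetical, and it assumes each candidate block
reports its level and sibling pointers.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_NULLAGBLOCK	((uint32_t)-1)

/* Hypothetical per-btree search state. */
struct model_root_cand {
	uint32_t	root;		/* candidate root, or MODEL_NULLAGBLOCK */
	unsigned int	height;		/* highest level seen so far, plus one */
	bool		ambiguous;	/* saw two blocks at the top level */
};

/* Hypothetical summary of one btree block found via the rmapbt. */
struct model_btblock {
	uint32_t	agbno;
	unsigned int	level;
	uint32_t	leftsib;
	uint32_t	rightsib;
};

static void model_cand_init(struct model_root_cand *c)
{
	c->root = MODEL_NULLAGBLOCK;
	c->height = 0;
	c->ambiguous = false;
}

/* Feed one matching block into the search. */
static void model_cand_visit(struct model_root_cand *c,
			     const struct model_btblock *b)
{
	/* Blocks below the current top level cannot be the root. */
	if (b->level + 1 < c->height)
		return;

	/* A second block at the top level means neither one is a root. */
	if (b->level + 1 == c->height) {
		c->root = MODEL_NULLAGBLOCK;
		c->ambiguous = true;
		return;
	}

	/* New highest level; accept it only if it has no siblings. */
	c->height = b->level + 1;
	c->ambiguous = false;
	if (b->leftsib == MODEL_NULLAGBLOCK &&
	    b->rightsib == MODEL_NULLAGBLOCK)
		c->root = b->agbno;
	else
		c->root = MODEL_NULLAGBLOCK;
}
```

A caller would then treat root == MODEL_NULLAGBLOCK after the walk as
"no trustworthy root found", which is the same condition the patch already
maps to -EFSCORRUPTED.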

> +
> +	return 0;
> +}
> +
...
> +/* Trigger reinitialization of the in-core data. */
> +STATIC int
> +xrep_agi_commit_new(
> +	struct xfs_scrub	*sc,
> +	struct xfs_buf		*agi_bp,
> +	const struct xfs_agi	*old_agi)

old_agi is unused here.

> +{
> +	struct xfs_perag	*pag;
> +	struct xfs_agi		*agi = XFS_BUF_TO_AGI(agi_bp);
> +
> +	/* Trigger inode count recalculation */
> +	xfs_force_summary_recalc(sc->mp);
> +
> +	/* Write this to disk. */
> +	xfs_trans_buf_set_type(sc->tp, agi_bp, XFS_BLFT_AGI_BUF);
> +	xfs_trans_log_buf(sc->tp, agi_bp, 0, BBTOB(agi_bp->b_length) - 1);
> +
> +	/* Now reinitialize the in-core counters if necessary. */
> +	pag = sc->sa.pag;
> +	sc->sa.pag->pagi_init = 1;

Same s/sc->sa.pag/pag/ nit here as before.

> +	pag->pagi_count = be32_to_cpu(agi->agi_count);
> +	pag->pagi_freecount = be32_to_cpu(agi->agi_freecount);
> +
> +	return 0;
> +}
> +
> +/* Repair the AGI. */
> +int
> +xrep_agi(
> +	struct xfs_scrub		*sc)
> +{
> +	struct xrep_find_ag_btree	fab[XREP_AGI_MAX] = {
> +		[XREP_AGI_INOBT] = {
> +			.rmap_owner = XFS_RMAP_OWN_INOBT,
> +			.buf_ops = &xfs_inobt_buf_ops,
> +			.magic = XFS_IBT_CRC_MAGIC,
> +		},
> +		[XREP_AGI_FINOBT] = {
> +			.rmap_owner = XFS_RMAP_OWN_INOBT,
> +			.buf_ops = &xfs_inobt_buf_ops,
> +			.magic = XFS_FIBT_CRC_MAGIC,
> +		},
> +		[XREP_AGI_END] = {
> +			.buf_ops = NULL
> +		},
> +	};
> +	struct xfs_agi			old_agi;

It's not immediately clear to me how much of a danger this is here, if
at all, but FWIW xfs_agi is one of our larger structures at 336 bytes
(mostly due to agi_unlinked). I'm not terribly concerned if this isn't
currently exploding, but it might be worth thinking about another
technique to preserve original behavior without the stack usage. Perhaps
we could use an uncached buffer to preserve the original data and
implement an xfs_buf_copy() helper to facilitate, for example.
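
As a rough illustration of the general shape (save off-stack, revert on
error), here is a userspace sketch.  It swaps in plain malloc() for the
uncached buffer idea above, and the model_* names are hypothetical; only
the 336-byte size comes from the discussion.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a 336-byte on-disk header. */
struct model_hdr {
	unsigned char	bytes[336];
};

/* Save a heap copy of the header; returns NULL if allocation fails. */
static struct model_hdr *model_hdr_save(const struct model_hdr *hdr)
{
	struct model_hdr *copy = malloc(sizeof(*copy));

	if (copy)
		memcpy(copy, hdr, sizeof(*copy));
	return copy;
}

/* Put the saved contents back and release the copy. */
static void model_hdr_revert(struct model_hdr *hdr, struct model_hdr *saved)
{
	memcpy(hdr, saved, sizeof(*hdr));
	free(saved);
}
```

The cost is an allocation that can fail before the repair starts, so the
caller needs an early error return that the on-stack copy does not.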

Brian

> +	struct xfs_mount		*mp = sc->mp;
> +	struct xfs_buf			*agi_bp;
> +	struct xfs_agi			*agi;
> +	int				error;
> +
> +	/* We require the rmapbt to rebuild anything. */
> +	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
> +		return -EOPNOTSUPP;
> +
> +	xchk_perag_get(sc->mp, &sc->sa);
> +	/*
> +	 * Make sure we have the AGI buffer, as scrub might have decided it
> +	 * was corrupt after xfs_ialloc_read_agi failed with -EFSCORRUPTED.
> +	 */
> +	error = xfs_trans_read_buf(mp, sc->tp, mp->m_ddev_targp,
> +			XFS_AG_DADDR(mp, sc->sa.agno, XFS_AGI_DADDR(mp)),
> +			XFS_FSS_TO_BB(mp, 1), 0, &agi_bp, NULL);
> +	if (error)
> +		return error;
> +	agi_bp->b_ops = &xfs_agi_buf_ops;
> +	agi = XFS_BUF_TO_AGI(agi_bp);
> +
> +	/* Find the AGI btree roots. */
> +	error = xrep_agi_find_btrees(sc, fab);
> +	if (error)
> +		return error;
> +
> +	/* Start rewriting the header and implant the btrees we found. */
> +	xrep_agi_init_header(sc, agi_bp, &old_agi);
> +	xrep_agi_set_roots(sc, agi, fab);
> +	error = xrep_agi_calc_from_btrees(sc, agi_bp);
> +	if (error)
> +		goto out_revert;
> +
> +	/* Reinitialize in-core state. */
> +	return xrep_agi_commit_new(sc, agi_bp, &old_agi);
> +
> +out_revert:
> +	/* Mark the incore AGI state stale and revert the AGI. */
> +	sc->sa.pag->pagi_init = 0;
> +	memcpy(agi, &old_agi, sizeof(old_agi));
> +	return error;
> +}
> diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
> index 1d283360b5ab..9de321eee4ab 100644
> --- a/fs/xfs/scrub/repair.h
> +++ b/fs/xfs/scrub/repair.h
> @@ -60,6 +60,7 @@ int xrep_probe(struct xfs_scrub *sc);
>  int xrep_superblock(struct xfs_scrub *sc);
>  int xrep_agf(struct xfs_scrub *sc);
>  int xrep_agfl(struct xfs_scrub *sc);
> +int xrep_agi(struct xfs_scrub *sc);
>  
>  #else
>  
> @@ -85,6 +86,7 @@ xrep_calc_ag_resblks(
>  #define xrep_superblock			xrep_notsupported
>  #define xrep_agf			xrep_notsupported
>  #define xrep_agfl			xrep_notsupported
> +#define xrep_agi			xrep_notsupported
>  
>  #endif /* CONFIG_XFS_ONLINE_REPAIR */
>  
> diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
> index 2670f4cf62f4..4bfae1e61d30 100644
> --- a/fs/xfs/scrub/scrub.c
> +++ b/fs/xfs/scrub/scrub.c
> @@ -226,7 +226,7 @@ static const struct xchk_meta_ops meta_scrub_ops[] = {
>  		.type	= ST_PERAG,
>  		.setup	= xchk_setup_fs,
>  		.scrub	= xchk_agi,
> -		.repair	= xrep_notsupported,
> +		.repair	= xrep_agi,
>  	},
>  	[XFS_SCRUB_TYPE_BNOBT] = {	/* bnobt */
>  		.type	= ST_PERAG,
> 

* Re: [PATCH 02/14] xfs: repair the AGF
  2018-07-30 18:19       ` Brian Foster
@ 2018-07-30 18:22         ` Darrick J. Wong
  2018-07-30 18:33           ` Brian Foster
  0 siblings, 1 reply; 53+ messages in thread
From: Darrick J. Wong @ 2018-07-30 18:22 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, david, allison.henderson

On Mon, Jul 30, 2018 at 02:19:15PM -0400, Brian Foster wrote:
> On Mon, Jul 30, 2018 at 10:31:11AM -0700, Darrick J. Wong wrote:
> > On Mon, Jul 30, 2018 at 12:22:24PM -0400, Brian Foster wrote:
> > > On Sun, Jul 29, 2018 at 10:48:02PM -0700, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > 
> > > > Regenerate the AGF from the rmap data.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > ---
> > > >  fs/xfs/scrub/agheader_repair.c |  370 ++++++++++++++++++++++++++++++++++++++++
> > > >  fs/xfs/scrub/repair.c          |   27 ++-
> > > >  fs/xfs/scrub/repair.h          |    4 
> > > >  fs/xfs/scrub/scrub.c           |    2 
> > > >  4 files changed, 393 insertions(+), 10 deletions(-)
> > > > 
> > > > 
> > > > diff --git a/fs/xfs/scrub/agheader_repair.c b/fs/xfs/scrub/agheader_repair.c
> > > > index 1e96621ece3a..4842fc598c9b 100644
> > > > --- a/fs/xfs/scrub/agheader_repair.c
> > > > +++ b/fs/xfs/scrub/agheader_repair.c
> > > ...
> > > > @@ -54,3 +61,366 @@ xrep_superblock(
> > > ...
> > > > +/* Repair the AGF. v5 filesystems only. */
> > > > +int
> > > > +xrep_agf(
> > > > +	struct xfs_scrub		*sc)
> > > > +{
> > > ...
> > > > +	/* Start rewriting the header and implant the btrees we found. */
> > > > +	xrep_agf_init_header(sc, agf_bp, &old_agf);
> > > > +	xrep_agf_set_roots(sc, agf, fab);
> > > > +	error = xrep_agf_calc_from_btrees(sc, agf_bp);
> > > > +	if (error)
> > > > +		goto out_revert;
> > > > +
> > > > +	/* Commit the changes and reinitialize incore state. */
> > > > +	return xrep_agf_commit_new(sc, agf_bp);
> > > > +
> > > > +out_revert:
> > > > +	/* Mark the incore AGF state stale and revert the AGF. */
> > > > +	sc->sa.pag->pagf_init = 0;
> > > 
> > > Hmm, looking at this again I'm not sure it's safe to reset ->pagf_init
> > > like this. The contexts where we hold agf might be Ok because I think
> > > that might prevent some other thread from actually coming in and
> > > resetting it, but look at what xfs_alloc_read_agf() does in this case if the
> > > agf becomes available with !pagf_init. Specifically, are we at risk of
> > > corrupting a populated ->pagb_tree or causing other problems by
> > > reinitializing the spinlock? Perhaps we need another patch to separate
> > > out some of those fields that should only ever be initialized once.
> > 
> > Yikes, the pagb_tree & spinlock should not get reinitialized.  I don't
> > see where we ever tear them down except for unmount, so I *think* we can
> > move it to xfs_initialize_perag.  It's a little mystifying why we don't
> > initialize those things there like we do for the incore inode radix tree.
> > 
> > Also it would finally fix the discrepancy with xfsprogs libxfs where
> > they comment out the RB_ROOT initialization.
> > 
> > > With something like that, it might subsequently make sense to factor the
> > > reinit from disk bits into a helper to be shared between
> > > xrep_agf_commit_new() and xfs_alloc_read_agf(). I also wonder if it's
> > > sufficient to just update the agf on disk and leave pagf_init == 0.
> > 
> > Hmm, wasn't there some verifier that used pag*_init (can't remember
> > which one) to decide if we were in log recovery?
> > 
> 
> It looks like xfs_attr3_leaf_verify() might do something like that. But
> don't we have to handle that either way if the error path leaves
> pagf_init == 0 on return? Actually we might have to address it
> regardless if we want to use pagf_init if that path isn't holding the
> agf.

<nod> I think we should simply add an xlog helper that decides if the log
is in recovery and call it from xfs_attr3_leaf_verify rather than having
an open-coded check on some other data structure.
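
Something along these lines, sketched in userspace with hypothetical
model_* names rather than the real xlog structures.  It assumes the check
in the verifier is the one that rejects an empty leaf outside of recovery;
treat that as an assumption, not a reading of the current source.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_LOG_ACTIVE_RECOVERY	(1u << 0)

/* Hypothetical stand-ins for the log and mount structures. */
struct model_log {
	unsigned int	flags;
};

struct model_mount {
	struct model_log	*log;
};

/* The helper: one place that answers "is the log being recovered?" */
static bool model_log_in_recovery(const struct model_mount *mp)
{
	return mp->log != NULL &&
	       (mp->log->flags & MODEL_LOG_ACTIVE_RECOVERY) != 0;
}

/*
 * Verifier sketch: an empty leaf is assumed to be tolerated only while
 * recovery is running.  Returns true if the leaf passes.
 */
static bool model_leaf_verify(const struct model_mount *mp, unsigned int count)
{
	if (!model_log_in_recovery(mp) && count == 0)
		return false;
	return true;
}
```

The verifier then no longer cares whether repair has cleared pagf_init,
because it asks the log directly.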

--D

> Brian
> 
> > --D
> > 
> > > Otherwise the rest of this patch seems Ok to me.
> > > 
> > > Brian
> > > 
> > > > +	memcpy(agf, &old_agf, sizeof(old_agf));
> > > > +	return error;
> > > > +}
> > > > diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
> > > > index 85b048b341a0..17cf48564390 100644
> > > > --- a/fs/xfs/scrub/repair.c
> > > > +++ b/fs/xfs/scrub/repair.c
> > > > @@ -128,9 +128,12 @@ xrep_roll_ag_trans(
> > > >  	int			error;
> > > >  
> > > >  	/* Keep the AG header buffers locked so we can keep going. */
> > > > -	xfs_trans_bhold(sc->tp, sc->sa.agi_bp);
> > > > -	xfs_trans_bhold(sc->tp, sc->sa.agf_bp);
> > > > -	xfs_trans_bhold(sc->tp, sc->sa.agfl_bp);
> > > > +	if (sc->sa.agi_bp)
> > > > +		xfs_trans_bhold(sc->tp, sc->sa.agi_bp);
> > > > +	if (sc->sa.agf_bp)
> > > > +		xfs_trans_bhold(sc->tp, sc->sa.agf_bp);
> > > > +	if (sc->sa.agfl_bp)
> > > > +		xfs_trans_bhold(sc->tp, sc->sa.agfl_bp);
> > > >  
> > > >  	/* Roll the transaction. */
> > > >  	error = xfs_trans_roll(&sc->tp);
> > > > @@ -138,9 +141,12 @@ xrep_roll_ag_trans(
> > > >  		goto out_release;
> > > >  
> > > >  	/* Join AG headers to the new transaction. */
> > > > -	xfs_trans_bjoin(sc->tp, sc->sa.agi_bp);
> > > > -	xfs_trans_bjoin(sc->tp, sc->sa.agf_bp);
> > > > -	xfs_trans_bjoin(sc->tp, sc->sa.agfl_bp);
> > > > +	if (sc->sa.agi_bp)
> > > > +		xfs_trans_bjoin(sc->tp, sc->sa.agi_bp);
> > > > +	if (sc->sa.agf_bp)
> > > > +		xfs_trans_bjoin(sc->tp, sc->sa.agf_bp);
> > > > +	if (sc->sa.agfl_bp)
> > > > +		xfs_trans_bjoin(sc->tp, sc->sa.agfl_bp);
> > > >  
> > > >  	return 0;
> > > >  
> > > > @@ -150,9 +156,12 @@ xrep_roll_ag_trans(
> > > >  	 * buffers will be released during teardown on our way out
> > > >  	 * of the kernel.
> > > >  	 */
> > > > -	xfs_trans_bhold_release(sc->tp, sc->sa.agi_bp);
> > > > -	xfs_trans_bhold_release(sc->tp, sc->sa.agf_bp);
> > > > -	xfs_trans_bhold_release(sc->tp, sc->sa.agfl_bp);
> > > > +	if (sc->sa.agi_bp)
> > > > +		xfs_trans_bhold_release(sc->tp, sc->sa.agi_bp);
> > > > +	if (sc->sa.agf_bp)
> > > > +		xfs_trans_bhold_release(sc->tp, sc->sa.agf_bp);
> > > > +	if (sc->sa.agfl_bp)
> > > > +		xfs_trans_bhold_release(sc->tp, sc->sa.agfl_bp);
> > > >  
> > > >  	return error;
> > > >  }
> > > > diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
> > > > index 5a4e92221916..1d283360b5ab 100644
> > > > --- a/fs/xfs/scrub/repair.h
> > > > +++ b/fs/xfs/scrub/repair.h
> > > > @@ -58,6 +58,8 @@ int xrep_ino_dqattach(struct xfs_scrub *sc);
> > > >  
> > > >  int xrep_probe(struct xfs_scrub *sc);
> > > >  int xrep_superblock(struct xfs_scrub *sc);
> > > > +int xrep_agf(struct xfs_scrub *sc);
> > > > +int xrep_agfl(struct xfs_scrub *sc);
> > > >  
> > > >  #else
> > > >  
> > > > @@ -81,6 +83,8 @@ xrep_calc_ag_resblks(
> > > >  
> > > >  #define xrep_probe			xrep_notsupported
> > > >  #define xrep_superblock			xrep_notsupported
> > > > +#define xrep_agf			xrep_notsupported
> > > > +#define xrep_agfl			xrep_notsupported
> > > >  
> > > >  #endif /* CONFIG_XFS_ONLINE_REPAIR */
> > > >  
> > > > diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
> > > > index 6efb926f3cf8..1e8a17c8e2b9 100644
> > > > --- a/fs/xfs/scrub/scrub.c
> > > > +++ b/fs/xfs/scrub/scrub.c
> > > > @@ -214,7 +214,7 @@ static const struct xchk_meta_ops meta_scrub_ops[] = {
> > > >  		.type	= ST_PERAG,
> > > >  		.setup	= xchk_setup_fs,
> > > >  		.scrub	= xchk_agf,
> > > > -		.repair	= xrep_notsupported,
> > > > +		.repair	= xrep_agf,
> > > >  	},
> > > >  	[XFS_SCRUB_TYPE_AGFL]= {	/* agfl */
> > > >  		.type	= ST_PERAG,
> > > > 

* Re: [PATCH 02/14] xfs: repair the AGF
  2018-07-30 18:22         ` Darrick J. Wong
@ 2018-07-30 18:33           ` Brian Foster
  0 siblings, 0 replies; 53+ messages in thread
From: Brian Foster @ 2018-07-30 18:33 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, david, allison.henderson

On Mon, Jul 30, 2018 at 11:22:31AM -0700, Darrick J. Wong wrote:
> On Mon, Jul 30, 2018 at 02:19:15PM -0400, Brian Foster wrote:
> > On Mon, Jul 30, 2018 at 10:31:11AM -0700, Darrick J. Wong wrote:
> > > On Mon, Jul 30, 2018 at 12:22:24PM -0400, Brian Foster wrote:
> > > > On Sun, Jul 29, 2018 at 10:48:02PM -0700, Darrick J. Wong wrote:
> > > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > 
> > > > > Regenerate the AGF from the rmap data.
> > > > > 
> > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > ---
> > > > >  fs/xfs/scrub/agheader_repair.c |  370 ++++++++++++++++++++++++++++++++++++++++
> > > > >  fs/xfs/scrub/repair.c          |   27 ++-
> > > > >  fs/xfs/scrub/repair.h          |    4 
> > > > >  fs/xfs/scrub/scrub.c           |    2 
> > > > >  4 files changed, 393 insertions(+), 10 deletions(-)
> > > > > 
> > > > > 
> > > > > diff --git a/fs/xfs/scrub/agheader_repair.c b/fs/xfs/scrub/agheader_repair.c
> > > > > index 1e96621ece3a..4842fc598c9b 100644
> > > > > --- a/fs/xfs/scrub/agheader_repair.c
> > > > > +++ b/fs/xfs/scrub/agheader_repair.c
> > > > ...
> > > > > @@ -54,3 +61,366 @@ xrep_superblock(
> > > > ...
> > > > > +/* Repair the AGF. v5 filesystems only. */
> > > > > +int
> > > > > +xrep_agf(
> > > > > +	struct xfs_scrub		*sc)
> > > > > +{
> > > > ...
> > > > > +	/* Start rewriting the header and implant the btrees we found. */
> > > > > +	xrep_agf_init_header(sc, agf_bp, &old_agf);
> > > > > +	xrep_agf_set_roots(sc, agf, fab);
> > > > > +	error = xrep_agf_calc_from_btrees(sc, agf_bp);
> > > > > +	if (error)
> > > > > +		goto out_revert;
> > > > > +
> > > > > +	/* Commit the changes and reinitialize incore state. */
> > > > > +	return xrep_agf_commit_new(sc, agf_bp);
> > > > > +
> > > > > +out_revert:
> > > > > +	/* Mark the incore AGF state stale and revert the AGF. */
> > > > > +	sc->sa.pag->pagf_init = 0;
> > > > 
> > > > Hmm, looking at this again I'm not sure it's safe to reset ->pagf_init
> > > > like this. The contexts where we hold agf might be Ok because I think
> > > > that might prevent some other thread from actually coming in and
> > > > resetting it, but look at what xfs_alloc_read_agf() does in this case if the
> > > > agf becomes available with !pagf_init. Specifically, are we at risk of
> > > > corrupting a populated ->pagb_tree or causing other problems by
> > > > reinitializing the spinlock? Perhaps we need another patch to separate
> > > > out some of those fields that should only ever be initialized once.
> > > 
> > > Yikes, the pagb_tree & spinlock should not get reinitialized.  I don't
> > > see where we ever tear them down except for unmount, so I *think* we can
> > > move it to xfs_initialize_perag.  It's a little mystifying why we don't
> > > initialize those things there like we do for the incore inode radix tree.
> > > 
> > > Also it would finally fix the discrepancy with xfsprogs libxfs where
> > > they comment out the RB_ROOT initialization.
> > > 
> > > > With something like that, it might subsequently make sense to factor the
> > > > reinit from disk bits into a helper to be shared between
> > > > xrep_agf_commit_new() and xfs_alloc_read_agf(). I also wonder if it's
> > > > sufficient to just update the agf on disk and leave pagf_init == 0.
> > > 
> > > Hmm, wasn't there some verifier that used pag*_init (can't remember
> > > which one) to decide if we were in log recovery?
> > > 
> > 
> > It looks like xfs_attr3_leaf_verify() might do something like that. But
> > don't we have to handle that either way if the error path leaves
> > pagf_init == 0 on return? Actually we might have to address it
> > regardless if we want to use pagf_init if that path isn't holding the
> > agf.
> 
> <nod> I think we should simply add an xlog helper that decides if the log
> is in recovery and call it from xfs_attr3_leaf_verify rather than having
> an open-coded check on some other data structure.
> 

Agreed.

Brian

> --D
> 
> > Brian
> > 
> > > --D
> > > 
> > > > Otherwise the rest of this patch seems Ok to me.
> > > > 
> > > > Brian
> > > > 
> > > > > +	memcpy(agf, &old_agf, sizeof(old_agf));
> > > > > +	return error;
> > > > > +}
> > > > > diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
> > > > > index 85b048b341a0..17cf48564390 100644
> > > > > --- a/fs/xfs/scrub/repair.c
> > > > > +++ b/fs/xfs/scrub/repair.c
> > > > > @@ -128,9 +128,12 @@ xrep_roll_ag_trans(
> > > > >  	int			error;
> > > > >  
> > > > >  	/* Keep the AG header buffers locked so we can keep going. */
> > > > > -	xfs_trans_bhold(sc->tp, sc->sa.agi_bp);
> > > > > -	xfs_trans_bhold(sc->tp, sc->sa.agf_bp);
> > > > > -	xfs_trans_bhold(sc->tp, sc->sa.agfl_bp);
> > > > > +	if (sc->sa.agi_bp)
> > > > > +		xfs_trans_bhold(sc->tp, sc->sa.agi_bp);
> > > > > +	if (sc->sa.agf_bp)
> > > > > +		xfs_trans_bhold(sc->tp, sc->sa.agf_bp);
> > > > > +	if (sc->sa.agfl_bp)
> > > > > +		xfs_trans_bhold(sc->tp, sc->sa.agfl_bp);
> > > > >  
> > > > >  	/* Roll the transaction. */
> > > > >  	error = xfs_trans_roll(&sc->tp);
> > > > > @@ -138,9 +141,12 @@ xrep_roll_ag_trans(
> > > > >  		goto out_release;
> > > > >  
> > > > >  	/* Join AG headers to the new transaction. */
> > > > > -	xfs_trans_bjoin(sc->tp, sc->sa.agi_bp);
> > > > > -	xfs_trans_bjoin(sc->tp, sc->sa.agf_bp);
> > > > > -	xfs_trans_bjoin(sc->tp, sc->sa.agfl_bp);
> > > > > +	if (sc->sa.agi_bp)
> > > > > +		xfs_trans_bjoin(sc->tp, sc->sa.agi_bp);
> > > > > +	if (sc->sa.agf_bp)
> > > > > +		xfs_trans_bjoin(sc->tp, sc->sa.agf_bp);
> > > > > +	if (sc->sa.agfl_bp)
> > > > > +		xfs_trans_bjoin(sc->tp, sc->sa.agfl_bp);
> > > > >  
> > > > >  	return 0;
> > > > >  
> > > > > @@ -150,9 +156,12 @@ xrep_roll_ag_trans(
> > > > >  	 * buffers will be released during teardown on our way out
> > > > >  	 * of the kernel.
> > > > >  	 */
> > > > > -	xfs_trans_bhold_release(sc->tp, sc->sa.agi_bp);
> > > > > -	xfs_trans_bhold_release(sc->tp, sc->sa.agf_bp);
> > > > > -	xfs_trans_bhold_release(sc->tp, sc->sa.agfl_bp);
> > > > > +	if (sc->sa.agi_bp)
> > > > > +		xfs_trans_bhold_release(sc->tp, sc->sa.agi_bp);
> > > > > +	if (sc->sa.agf_bp)
> > > > > +		xfs_trans_bhold_release(sc->tp, sc->sa.agf_bp);
> > > > > +	if (sc->sa.agfl_bp)
> > > > > +		xfs_trans_bhold_release(sc->tp, sc->sa.agfl_bp);
> > > > >  
> > > > >  	return error;
> > > > >  }
> > > > > diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
> > > > > index 5a4e92221916..1d283360b5ab 100644
> > > > > --- a/fs/xfs/scrub/repair.h
> > > > > +++ b/fs/xfs/scrub/repair.h
> > > > > @@ -58,6 +58,8 @@ int xrep_ino_dqattach(struct xfs_scrub *sc);
> > > > >  
> > > > >  int xrep_probe(struct xfs_scrub *sc);
> > > > >  int xrep_superblock(struct xfs_scrub *sc);
> > > > > +int xrep_agf(struct xfs_scrub *sc);
> > > > > +int xrep_agfl(struct xfs_scrub *sc);
> > > > >  
> > > > >  #else
> > > > >  
> > > > > @@ -81,6 +83,8 @@ xrep_calc_ag_resblks(
> > > > >  
> > > > >  #define xrep_probe			xrep_notsupported
> > > > >  #define xrep_superblock			xrep_notsupported
> > > > > +#define xrep_agf			xrep_notsupported
> > > > > +#define xrep_agfl			xrep_notsupported
> > > > >  
> > > > >  #endif /* CONFIG_XFS_ONLINE_REPAIR */
> > > > >  
> > > > > diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
> > > > > index 6efb926f3cf8..1e8a17c8e2b9 100644
> > > > > --- a/fs/xfs/scrub/scrub.c
> > > > > +++ b/fs/xfs/scrub/scrub.c
> > > > > @@ -214,7 +214,7 @@ static const struct xchk_meta_ops meta_scrub_ops[] = {
> > > > >  		.type	= ST_PERAG,
> > > > >  		.setup	= xchk_setup_fs,
> > > > >  		.scrub	= xchk_agf,
> > > > > -		.repair	= xrep_notsupported,
> > > > > +		.repair	= xrep_agf,
> > > > >  	},
> > > > >  	[XFS_SCRUB_TYPE_AGFL]= {	/* agfl */
> > > > >  		.type	= ST_PERAG,
> > > > > 
> > > > > --
> > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > > the body of a message to majordomo@vger.kernel.org
> > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > the body of a message to majordomo@vger.kernel.org
> > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 04/14] xfs: repair the AGI
  2018-07-30 18:20   ` Brian Foster
@ 2018-07-30 18:44     ` Darrick J. Wong
  0 siblings, 0 replies; 53+ messages in thread
From: Darrick J. Wong @ 2018-07-30 18:44 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, david, allison.henderson

On Mon, Jul 30, 2018 at 02:20:51PM -0400, Brian Foster wrote:
> On Sun, Jul 29, 2018 at 10:48:15PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Rebuild the AGI header items with some help from the rmapbt.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> 
> A couple nits and future thoughts..
> 
> >  fs/xfs/scrub/agheader_repair.c |  220 ++++++++++++++++++++++++++++++++++++++++
> >  fs/xfs/scrub/repair.h          |    2 
> >  fs/xfs/scrub/scrub.c           |    2 
> >  3 files changed, 223 insertions(+), 1 deletion(-)
> > 
> > 
> > diff --git a/fs/xfs/scrub/agheader_repair.c b/fs/xfs/scrub/agheader_repair.c
> > index bfef066c87c3..921e7d42a2ef 100644
> > --- a/fs/xfs/scrub/agheader_repair.c
> > +++ b/fs/xfs/scrub/agheader_repair.c
> > @@ -700,3 +700,223 @@ xrep_agfl(
> >  	xfs_bitmap_destroy(&agfl_extents);
> >  	return error;
> >  }
> ...
> > +STATIC int
> > +xrep_agi_find_btrees(
> > +	struct xfs_scrub		*sc,
> > +	struct xrep_find_ag_btree	*fab)
> > +{
> > +	struct xfs_buf			*agf_bp;
> > +	struct xfs_mount		*mp = sc->mp;
> > +	int				error;
> > +
> > +	/* Read the AGF. */
> > +	error = xfs_alloc_read_agf(mp, sc->tp, sc->sa.agno, 0, &agf_bp);
> > +	if (error)
> > +		return error;
> > +	if (!agf_bp)
> > +		return -ENOMEM;
> > +
> > +	/* Find the btree roots. */
> > +	error = xrep_find_ag_btree_roots(sc, agf_bp, fab, NULL);
> > +	if (error)
> > +		return error;
> > +
> > +	/* We must find the inobt root. */
> > +	if (fab[XREP_AGI_INOBT].root == NULLAGBLOCK ||
> > +	    fab[XREP_AGI_INOBT].height > XFS_BTREE_MAXLEVELS)
> > +		return -EFSCORRUPTED;
> > +
> > +	/* We must find the finobt root if that feature is enabled. */
> > +	if (xfs_sb_version_hasfinobt(&mp->m_sb) &&
> > +	    (fab[XREP_AGI_FINOBT].root == NULLAGBLOCK ||
> > +	     fab[XREP_AGI_FINOBT].height > XFS_BTREE_MAXLEVELS))
> > +		return -EFSCORRUPTED;
> 
> Skimming around some of the existing code to find the btree roots, I
> notice that the .root field is going to be != NULLAGBLOCK so long as we
> find at least one appropriately typed block in the rmapbt. I know you
> mentioned that we depend on a correct rmapbt atm, but I'm wondering if
> there's room for slightly more robust error checks to at least prevent
> us from doing something damaging. For example, perhaps we could unset
> .root when we've found a second block at the same level as the current
> "root," and/or check some of the generic characteristics of a root btree
> block (no left/right siblings) once we're done..?

Good suggestion, I'll add it to xrep_findroot_block.
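
For illustration only, the two checks suggested above could be sketched
roughly as follows.  This is a standalone userspace sketch, not the code
in xrep_findroot_block; struct blk_hdr, struct root_cand and both
functions are hypothetical names, and the block header is reduced to the
three fields the checks need.

```c
#include <stdint.h>

#define NULLAGBLOCK	((uint32_t)-1)

/* Hypothetical, simplified view of a btree block header. */
struct blk_hdr {
	uint32_t	level;		/* 0 for a leaf block */
	uint32_t	leftsib;	/* NULLAGBLOCK if there is none */
	uint32_t	rightsib;	/* NULLAGBLOCK if there is none */
};

/* Hypothetical tracker for the best root candidate seen so far. */
struct root_cand {
	uint32_t	root;		/* NULLAGBLOCK if no usable candidate */
	uint32_t	height;		/* level of the tallest block seen + 1 */
};

static void
root_cand_init(struct root_cand *rc)
{
	rc->root = NULLAGBLOCK;
	rc->height = 0;
}

/* Feed in one appropriately typed block found through the rmapbt. */
static void
root_cand_see_block(struct root_cand *rc, uint32_t agbno,
		    const struct blk_hdr *hdr)
{
	uint32_t	height = hdr->level + 1;

	/* Blocks below the current candidate's level tell us nothing. */
	if (height < rc->height)
		return;

	/* A second block at the same level means neither is the root. */
	if (height == rc->height) {
		rc->root = NULLAGBLOCK;
		return;
	}

	/* Taller block: accept it only if it has no siblings. */
	rc->height = height;
	if (hdr->leftsib == NULLAGBLOCK && hdr->rightsib == NULLAGBLOCK)
		rc->root = agbno;
	else
		rc->root = NULLAGBLOCK;
}
```

With this shape, a caller that finishes the walk with root == NULLAGBLOCK
would bail out with -EFSCORRUPTED, just as the existing checks do.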

> > +
> > +	return 0;
> > +}
> > +
> ...
> > +/* Trigger reinitialization of the in-core data. */
> > +STATIC int
> > +xrep_agi_commit_new(
> > +	struct xfs_scrub	*sc,
> > +	struct xfs_buf		*agi_bp,
> > +	const struct xfs_agi	*old_agi)
> 
> old_agi is unused here.
> 
> > +{
> > +	struct xfs_perag	*pag;
> > +	struct xfs_agi		*agi = XFS_BUF_TO_AGI(agi_bp);
> > +
> > +	/* Trigger inode count recalculation */
> > +	xfs_force_summary_recalc(sc->mp);
> > +
> > +	/* Write this to disk. */
> > +	xfs_trans_buf_set_type(sc->tp, agi_bp, XFS_BLFT_AGI_BUF);
> > +	xfs_trans_log_buf(sc->tp, agi_bp, 0, BBTOB(agi_bp->b_length) - 1);
> > +
> > +	/* Now reinitialize the in-core counters if necessary. */
> > +	pag = sc->sa.pag;
> > +	sc->sa.pag->pagi_init = 1;
> 
> Same s/sc->sa.pag/pag/ nit here as before.

Both fixed.

> > +	pag->pagi_count = be32_to_cpu(agi->agi_count);
> > +	pag->pagi_freecount = be32_to_cpu(agi->agi_freecount);
> > +
> > +	return 0;
> > +}
> > +
> > +/* Repair the AGI. */
> > +int
> > +xrep_agi(
> > +	struct xfs_scrub		*sc)
> > +{
> > +	struct xrep_find_ag_btree	fab[XREP_AGI_MAX] = {
> > +		[XREP_AGI_INOBT] = {
> > +			.rmap_owner = XFS_RMAP_OWN_INOBT,
> > +			.buf_ops = &xfs_inobt_buf_ops,
> > +			.magic = XFS_IBT_CRC_MAGIC,
> > +		},
> > +		[XREP_AGI_FINOBT] = {
> > +			.rmap_owner = XFS_RMAP_OWN_INOBT,
> > +			.buf_ops = &xfs_inobt_buf_ops,
> > +			.magic = XFS_FIBT_CRC_MAGIC,
> > +		},
> > +		[XREP_AGI_END] = {
> > +			.buf_ops = NULL
> > +		},
> > +	};
> > +	struct xfs_agi			old_agi;
> 
> It's not immediately clear to me how much of a danger this is here, if
> at all, but FWIW xfs_agi is one of our larger structures at 336 bytes
> (mostly due to agi_unlinked). I'm not terribly concerned if this isn't
> currently exploding, but it might be worth thinking about another
> technique to preserve original behavior without the stack usage. Perhaps
> we could use an uncached buffer to preserve the original data and
> implement an xfs_buf_copy() helper to facilitate, for example.

It's not a huge deal since (vmap) kernel stacks are 16K(!) these days on
64-bit machines, but if it ever becomes a problem we can simply allocate
some memory in sc->buf in the setup routine.

--D

> Brian
> 
> > +	struct xfs_mount		*mp = sc->mp;
> > +	struct xfs_buf			*agi_bp;
> > +	struct xfs_agi			*agi;
> > +	int				error;
> > +
> > +	/* We require the rmapbt to rebuild anything. */
> > +	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
> > +		return -EOPNOTSUPP;
> > +
> > +	xchk_perag_get(sc->mp, &sc->sa);
> > +	/*
> > +	 * Make sure we have the AGI buffer, as scrub might have decided it
> > +	 * was corrupt after xfs_ialloc_read_agi failed with -EFSCORRUPTED.
> > +	 */
> > +	error = xfs_trans_read_buf(mp, sc->tp, mp->m_ddev_targp,
> > +			XFS_AG_DADDR(mp, sc->sa.agno, XFS_AGI_DADDR(mp)),
> > +			XFS_FSS_TO_BB(mp, 1), 0, &agi_bp, NULL);
> > +	if (error)
> > +		return error;
> > +	agi_bp->b_ops = &xfs_agi_buf_ops;
> > +	agi = XFS_BUF_TO_AGI(agi_bp);
> > +
> > +	/* Find the AGI btree roots. */
> > +	error = xrep_agi_find_btrees(sc, fab);
> > +	if (error)
> > +		return error;
> > +
> > +	/* Start rewriting the header and implant the btrees we found. */
> > +	xrep_agi_init_header(sc, agi_bp, &old_agi);
> > +	xrep_agi_set_roots(sc, agi, fab);
> > +	error = xrep_agi_calc_from_btrees(sc, agi_bp);
> > +	if (error)
> > +		goto out_revert;
> > +
> > +	/* Reinitialize in-core state. */
> > +	return xrep_agi_commit_new(sc, agi_bp, &old_agi);
> > +
> > +out_revert:
> > +	/* Mark the incore AGI state stale and revert the AGI. */
> > +	sc->sa.pag->pagi_init = 0;
> > +	memcpy(agi, &old_agi, sizeof(old_agi));
> > +	return error;
> > +}
> > diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
> > index 1d283360b5ab..9de321eee4ab 100644
> > --- a/fs/xfs/scrub/repair.h
> > +++ b/fs/xfs/scrub/repair.h
> > @@ -60,6 +60,7 @@ int xrep_probe(struct xfs_scrub *sc);
> >  int xrep_superblock(struct xfs_scrub *sc);
> >  int xrep_agf(struct xfs_scrub *sc);
> >  int xrep_agfl(struct xfs_scrub *sc);
> > +int xrep_agi(struct xfs_scrub *sc);
> >  
> >  #else
> >  
> > @@ -85,6 +86,7 @@ xrep_calc_ag_resblks(
> >  #define xrep_superblock			xrep_notsupported
> >  #define xrep_agf			xrep_notsupported
> >  #define xrep_agfl			xrep_notsupported
> > +#define xrep_agi			xrep_notsupported
> >  
> >  #endif /* CONFIG_XFS_ONLINE_REPAIR */
> >  
> > diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
> > index 2670f4cf62f4..4bfae1e61d30 100644
> > --- a/fs/xfs/scrub/scrub.c
> > +++ b/fs/xfs/scrub/scrub.c
> > @@ -226,7 +226,7 @@ static const struct xchk_meta_ops meta_scrub_ops[] = {
> >  		.type	= ST_PERAG,
> >  		.setup	= xchk_setup_fs,
> >  		.scrub	= xchk_agi,
> > -		.repair	= xrep_notsupported,
> > +		.repair	= xrep_agi,
> >  	},
> >  	[XFS_SCRUB_TYPE_BNOBT] = {	/* bnobt */
> >  		.type	= ST_PERAG,
> > 

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 03/14] xfs: repair the AGFL
  2018-07-30 17:22     ` Darrick J. Wong
@ 2018-07-31 15:10       ` Brian Foster
  2018-08-07 22:02         ` Darrick J. Wong
  0 siblings, 1 reply; 53+ messages in thread
From: Brian Foster @ 2018-07-31 15:10 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, david, allison.henderson

On Mon, Jul 30, 2018 at 10:22:16AM -0700, Darrick J. Wong wrote:
> On Mon, Jul 30, 2018 at 12:25:24PM -0400, Brian Foster wrote:
> > On Sun, Jul 29, 2018 at 10:48:08PM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > Repair the AGFL from the rmap data.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > ---
> > 
> > FWIW, I tried tweaking a couple agfl values via xfs_db and xfs_scrub
> > seems to always dump a cross-referencing failed error and not want to
> > deal with it. Expected? Is there a good way to unit test some of this
> > stuff with simple/localized corruptions?
> 
> I usually pick one of the corruptions from xfs/355...
> 
> $ SCRATCH_XFS_LIST_FUZZ_VERBS=random \
> SCRATCH_XFS_LIST_METADATA_FIELDS=somefield \
> ./check xfs/355
> 

It looks like similar behavior if I do that, but tbh I'm not sure if I'm
using this correctly. E.g., if I do:

# SCRATCH_XFS_LIST_FUZZ_VERBS=random SCRATCH_XFS_LIST_METADATA_FIELDS=bno[0] ./check xfs/355
FSTYP         -- xfs (debug)
PLATFORM      -- Linux/x86_64 localhost 4.18.0-rc4+
MKFS_OPTIONS  -- -f -mrmapbt=1,reflink=1 /dev/mapper/test-scratch
MOUNT_OPTIONS -- -o context=system_u:object_r:root_t:s0 /dev/mapper/test-scratch /mnt/scratch

xfs/355 - output mismatch (see /root/xfstests-dev/results//xfs/355.out.bad)
    ...
    (Run 'diff -u tests/xfs/355.out /root/xfstests-dev/results//xfs/355.out.bad'  to see the entire diff)
Ran: xfs/355
Failures: xfs/355
Failed 1 of 1 tests
# diff -u tests/xfs/355.out /root/xfstests-dev/results//xfs/355.out.bad
--- tests/xfs/355.out   2018-07-25 07:47:23.739575416 -0400
+++ /root/xfstests-dev/results//xfs/355.out.bad 2018-07-31 10:55:18.466178944 -0400
@@ -1,6 +1,10 @@
 QA output created by 355
 Format and populate
 Fuzz AGFL
+online re-scrub (1) with bno[0] = random.
 Done fuzzing AGFL
 Fuzz AGFL flfirst
+offline re-scrub (1) with bno[14] = random.
+online re-scrub (1) with bno[14] = random.
+re-repair failed (1) with bno[14] = random.
 Done fuzzing AGFL flfirst

If I run xfs_scrub directly on the scratch mount after the test I get a
stream of inode cross-referencing errors and it doesn't seem to fix
anything up.

Brian

> > Otherwise this looks sane, a couple comments..
> > 
> > >  fs/xfs/scrub/agheader_repair.c |  276 ++++++++++++++++++++++++++++++++++++++++
> > >  fs/xfs/scrub/bitmap.c          |   92 +++++++++++++
> > >  fs/xfs/scrub/bitmap.h          |    4 +
> > >  fs/xfs/scrub/scrub.c           |    2 
> > >  4 files changed, 373 insertions(+), 1 deletion(-)
> > > 
> > > 
> > > diff --git a/fs/xfs/scrub/agheader_repair.c b/fs/xfs/scrub/agheader_repair.c
> > > index 4842fc598c9b..bfef066c87c3 100644
> > > --- a/fs/xfs/scrub/agheader_repair.c
> > > +++ b/fs/xfs/scrub/agheader_repair.c
> > > @@ -424,3 +424,279 @@ xrep_agf(
> > >  	memcpy(agf, &old_agf, sizeof(old_agf));
> > >  	return error;
> > >  }
> > > +
> > ...
> > > +/* Write out a totally new AGFL. */
> > > +STATIC void
> > > +xrep_agfl_init_header(
> > > +	struct xfs_scrub	*sc,
> > > +	struct xfs_buf		*agfl_bp,
> > > +	struct xfs_bitmap	*agfl_extents,
> > > +	xfs_agblock_t		flcount)
> > > +{
> > > +	struct xfs_mount	*mp = sc->mp;
> > > +	__be32			*agfl_bno;
> > > +	struct xfs_bitmap_range	*br;
> > > +	struct xfs_bitmap_range	*n;
> > > +	struct xfs_agfl		*agfl;
> > > +	xfs_agblock_t		agbno;
> > > +	unsigned int		fl_off;
> > > +
> > > +	ASSERT(flcount <= xfs_agfl_size(mp));
> > > +
> > > +	/* Start rewriting the header. */
> > > +	agfl = XFS_BUF_TO_AGFL(agfl_bp);
> > > +	memset(agfl, 0xFF, BBTOB(agfl_bp->b_length));
> > 
> > What's the purpose behind 0xFF? Related to NULLAGBLOCK/NULLCOMMITLSN..?
> 
> Yes, it prepopulates the AGFL bno[] array with NULLAGBLOCK, then writes
> in the header fields.
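
As a small standalone illustration of why the 0xFF fill works:
NULLAGBLOCK is all ones in a 32-bit value, so every 4-byte slot reads
back as NULLAGBLOCK regardless of byte order.  The helper name below is
hypothetical and a plain uint32_t array stands in for the __be32 bno[]
array.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same value as the kernel's NULLAGBLOCK, i.e. (xfs_agblock_t)-1. */
#define NULLAGBLOCK	((uint32_t)-1)

/*
 * Fill an array of 32-bit block numbers with 0xFF bytes.  Because the
 * pattern is all ones, the result is identical for big-endian and
 * little-endian representations.
 */
static void
fill_null_agblocks(uint32_t *bno, size_t nr)
{
	memset(bno, 0xFF, nr * sizeof(*bno));
}
```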
> 
> > > +	agfl->agfl_magicnum = cpu_to_be32(XFS_AGFL_MAGIC);
> > > +	agfl->agfl_seqno = cpu_to_be32(sc->sa.agno);
> > > +	uuid_copy(&agfl->agfl_uuid, &mp->m_sb.sb_meta_uuid);
> > > +
> > > +	/*
> > > +	 * Fill the AGFL with the remaining blocks.  If agfl_extents has more
> > > +	 * blocks than fit in the AGFL, they will be freed in a subsequent
> > > +	 * step.
> > > +	 */
> > > +	fl_off = 0;
> > > +	agfl_bno = XFS_BUF_TO_AGFL_BNO(mp, agfl_bp);
> > > +	for_each_xfs_bitmap_extent(br, n, agfl_extents) {
> > > +		agbno = XFS_FSB_TO_AGBNO(mp, br->start);
> > > +
> > > +		trace_xrep_agfl_insert(mp, sc->sa.agno, agbno, br->len);
> > > +
> > > +		while (br->len > 0 && fl_off < flcount) {
> > > +			agfl_bno[fl_off] = cpu_to_be32(agbno);
> > > +			fl_off++;
> > > +			agbno++;
> > 
> > 			/* bump br so we don't reap blocks we've used */
> > 
> > (i.e., took me a sec to realize why we bother with ->start)
> > 
> > > +			br->start++;
> > > +			br->len--;
> > > +		}
> > > +
> > > +		if (br->len)
> > > +			break;
> > > +		list_del(&br->list);
> > > +		kmem_free(br);
> > > +	}
> > > +
> > > +	/* Write new AGFL to disk. */
> > > +	xfs_trans_buf_set_type(sc->tp, agfl_bp, XFS_BLFT_AGFL_BUF);
> > > +	xfs_trans_log_buf(sc->tp, agfl_bp, 0, BBTOB(agfl_bp->b_length) - 1);
> > > +}
> > > +
> > ...
> > > diff --git a/fs/xfs/scrub/bitmap.c b/fs/xfs/scrub/bitmap.c
> > > index c770e2d0b6aa..fdadc9e1dc49 100644
> > > --- a/fs/xfs/scrub/bitmap.c
> > > +++ b/fs/xfs/scrub/bitmap.c
> > > @@ -9,6 +9,7 @@
> > >  #include "xfs_format.h"
> > >  #include "xfs_trans_resv.h"
> > >  #include "xfs_mount.h"
> > > +#include "xfs_btree.h"
> > >  #include "scrub/xfs_scrub.h"
> > >  #include "scrub/scrub.h"
> > >  #include "scrub/common.h"
> > > @@ -209,3 +210,94 @@ xfs_bitmap_disunion(
> > >  }
> > >  #undef LEFT_ALIGNED
> > >  #undef RIGHT_ALIGNED
> > > +
> > > +/*
> > > + * Record all btree blocks seen while iterating all records of a btree.
> > > + *
> > > + * We know that the btree query_all function starts at the left edge and walks
> > > + * towards the right edge of the tree.  Therefore, we know that we can walk up
> > > + * the btree cursor towards the root; if the pointer for a given level points
> > > + * to the first record/key in that block, we haven't seen this block before;
> > > + * and therefore we need to remember that we saw this block in the btree.
> > > + *
> > > + * So if our btree is:
> > > + *
> > > + *    4
> > > + *  / | \
> > > + * 1  2  3
> > > + *
> > > + * Pretend for this example that each leaf block has 100 btree records.  For
> > > + * the first btree record, we'll observe that bc_ptrs[0] == 1, so we record
> > > + * that we saw block 1.  Then we observe that bc_ptrs[1] == 1, so we record
> > > + * block 4.  The list is [1, 4].
> > > + *
> > > + * For the second btree record, we see that bc_ptrs[0] == 2, so we exit the
> > > + * loop.  The list remains [1, 4].
> > > + *
> > > + * For the 101st btree record, we've moved onto leaf block 2.  Now
> > > + * bc_ptrs[0] == 1 again, so we record that we saw block 2.  We see that
> > > + * bc_ptrs[1] == 2, so we exit the loop.  The list is now [1, 4, 2].
> > > + *
> > > + * For the 102nd record, bc_ptrs[0] == 2, so we continue.
> > > + *
> > > + * For the 201st record, we've moved on to leaf block 3.  bc_ptrs[0] == 1, so
> > > + * we add 3 to the list.  Now it is [1, 4, 2, 3].
> > > + *
> > > + * For the 300th record we just exit, with the list being [1, 4, 2, 3].
> > > + */
> > > +
> > > +/*
> > > + * Record all the buffers pointed to by the btree cursor.  Callers already
> > > + * engaged in a btree walk should call this function to capture the list of
> > > + * blocks going from the leaf towards the root.
> > > + */
> > > +int
> > > +xfs_bitmap_set_btcur_path(
> > > +	struct xfs_bitmap	*bitmap,
> > > +	struct xfs_btree_cur	*cur)
> > > +{
> > > +	struct xfs_buf		*bp;
> > > +	xfs_fsblock_t		fsb;
> > > +	int			i;
> > > +	int			error;
> > > +
> > > +	for (i = 0; i < cur->bc_nlevels && cur->bc_ptrs[i] == 1; i++) {
> > > +		xfs_btree_get_block(cur, i, &bp);
> > > +		if (!bp)
> > > +			continue;
> > > +		fsb = XFS_DADDR_TO_FSB(cur->bc_mp, bp->b_bn);
> > > +		error = xfs_bitmap_set(bitmap, fsb, 1);
> > 
> > Thanks for the comment. It helps explain the bc_ptrs == 1 check above,
> > but also highlights that xfs_bitmap_set() essentially allocates entries
> > for duplicate values if they exist. Is this handled by the broader
> > mechanism, for example, if the rmapbt was corrupted to have multiple
> > entries for a particular unused OWN_AG block? Or could we end up leaking
> > that corruption over to the agfl?
> 
> Right now we're totally dependent on the rmapbt being sane to rebuild
> the space metadata.
> 
> > I also wonder a bit about memory consumption on filesystems with large
> > metadata footprints. We essentially have to allocate one of these for
> > every allocation btree block before we can do the disunion and locate
> > the agfl-appropriate blocks. If we had a more lookup friendly structure,
> > perhaps this could be optimized by filtering out bnobt/cntbt blocks
> > during the associated btree walks..?
> > 
> > Have you thought about reusing something like the new in-core extent
> > tree mechanism as a pure in-memory extent store? It's certainly not
> > worth reworking something like that right now, but I wonder if we could
> > save memory via the denser format (and perhaps benefit from code
> > flexibility, reuse, etc.).
> 
> Yes, I was thinking about refactoring the iext btree into a more generic
> in-core index with 64-bit key so that I could adapt xfs_bitmap to use
> it.  In the longer term I would /also/ like to use xfs_bitmap to detect
> xfs_buf cache aliasing when multi-block buffers are in use, but that's
> further off. :)
> 
> As for the memory-intensive record lists in all the btree rebuilders, I
> have some ideas around that too -- either find a way to build an
> alternate btree and switch the roots over, or (once we gain the ability
> to mark an AG unavailable for new allocations) allocate an unlinked
> inode, store the records in the page cache pages for the file, and
> release it when we're done.
> 
> But, that can wait until I've gotten more of this merged, or get bored.
> :)
> 
> --D
> 
> > Brian
> > 
> > > +		if (error)
> > > +			return error;
> > > +	}
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +/* Collect a btree's block in the bitmap. */
> > > +STATIC int
> > > +xfs_bitmap_collect_btblock(
> > > +	struct xfs_btree_cur	*cur,
> > > +	int			level,
> > > +	void			*priv)
> > > +{
> > > +	struct xfs_bitmap	*bitmap = priv;
> > > +	struct xfs_buf		*bp;
> > > +	xfs_fsblock_t		fsbno;
> > > +
> > > +	xfs_btree_get_block(cur, level, &bp);
> > > +	if (!bp)
> > > +		return 0;
> > > +
> > > +	fsbno = XFS_DADDR_TO_FSB(cur->bc_mp, bp->b_bn);
> > > +	return xfs_bitmap_set(bitmap, fsbno, 1);
> > > +}
> > > +
> > > +/* Walk the btree and mark the bitmap wherever a btree block is found. */
> > > +int
> > > +xfs_bitmap_set_btblocks(
> > > +	struct xfs_bitmap	*bitmap,
> > > +	struct xfs_btree_cur	*cur)
> > > +{
> > > +	return xfs_btree_visit_blocks(cur, xfs_bitmap_collect_btblock, bitmap);
> > > +}
> > > diff --git a/fs/xfs/scrub/bitmap.h b/fs/xfs/scrub/bitmap.h
> > > index dad652ee9177..ae8ecbce6fa6 100644
> > > --- a/fs/xfs/scrub/bitmap.h
> > > +++ b/fs/xfs/scrub/bitmap.h
> > > @@ -28,5 +28,9 @@ void xfs_bitmap_destroy(struct xfs_bitmap *bitmap);
> > >  
> > >  int xfs_bitmap_set(struct xfs_bitmap *bitmap, uint64_t start, uint64_t len);
> > >  int xfs_bitmap_disunion(struct xfs_bitmap *bitmap, struct xfs_bitmap *sub);
> > > +int xfs_bitmap_set_btcur_path(struct xfs_bitmap *bitmap,
> > > +		struct xfs_btree_cur *cur);
> > > +int xfs_bitmap_set_btblocks(struct xfs_bitmap *bitmap,
> > > +		struct xfs_btree_cur *cur);
> > >  
> > >  #endif	/* __XFS_SCRUB_BITMAP_H__ */
> > > diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
> > > index 1e8a17c8e2b9..2670f4cf62f4 100644
> > > --- a/fs/xfs/scrub/scrub.c
> > > +++ b/fs/xfs/scrub/scrub.c
> > > @@ -220,7 +220,7 @@ static const struct xchk_meta_ops meta_scrub_ops[] = {
> > >  		.type	= ST_PERAG,
> > >  		.setup	= xchk_setup_fs,
> > >  		.scrub	= xchk_agfl,
> > > -		.repair	= xrep_notsupported,
> > > +		.repair	= xrep_agfl,
> > >  	},
> > >  	[XFS_SCRUB_TYPE_AGI] = {	/* agi */
> > >  		.type	= ST_PERAG,
> > > 

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 05/14] xfs: repair free space btrees
  2018-07-30  5:48 ` [PATCH 05/14] xfs: repair free space btrees Darrick J. Wong
@ 2018-07-31 17:47   ` Brian Foster
  2018-07-31 22:01     ` Darrick J. Wong
  0 siblings, 1 reply; 53+ messages in thread
From: Brian Foster @ 2018-07-31 17:47 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, david, allison.henderson

On Sun, Jul 29, 2018 at 10:48:21PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Rebuild the free space btrees from the gaps in the rmap btree.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/Makefile             |    1 
>  fs/xfs/scrub/alloc.c        |    1 
>  fs/xfs/scrub/alloc_repair.c |  581 +++++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/scrub/common.c       |    8 +
>  fs/xfs/scrub/repair.h       |    2 
>  fs/xfs/scrub/scrub.c        |    4 
>  fs/xfs/scrub/trace.h        |    2 
>  fs/xfs/xfs_extent_busy.c    |   14 +
>  fs/xfs/xfs_extent_busy.h    |    2 
>  9 files changed, 610 insertions(+), 5 deletions(-)
>  create mode 100644 fs/xfs/scrub/alloc_repair.c
> 
> 
...
> diff --git a/fs/xfs/scrub/alloc_repair.c b/fs/xfs/scrub/alloc_repair.c
> new file mode 100644
> index 000000000000..b228c2906de2
> --- /dev/null
> +++ b/fs/xfs/scrub/alloc_repair.c
> @@ -0,0 +1,581 @@
...
> +/* Record extents that aren't in use from gaps in the rmap records. */
> +STATIC int
> +xrep_abt_walk_rmap(
> +	struct xfs_btree_cur	*cur,
> +	struct xfs_rmap_irec	*rec,
> +	void			*priv)
> +{
> +	struct xrep_abt		*ra = priv;
> +	struct xrep_abt_extent	*rae;
> +	xfs_fsblock_t		fsb;
> +	int			error;
> +
> +	/* Record all the OWN_AG blocks... */
> +	if (rec->rm_owner == XFS_RMAP_OWN_AG) {
> +		fsb = XFS_AGB_TO_FSB(cur->bc_mp, cur->bc_private.a.agno,
> +				rec->rm_startblock);
> +		error = xfs_bitmap_set(ra->btlist, fsb, rec->rm_blockcount);
> +		if (error)
> +			return error;
> +	}
> +
> +	/* ...and all the rmapbt blocks... */
> +	error = xfs_bitmap_set_btcur_path(&ra->nobtlist, cur);
> +	if (error)
> +		return error;
> +
> +	/* ...and all the free space. */
> +	if (rec->rm_startblock > ra->next_bno) {
> +		trace_xrep_abt_walk_rmap(cur->bc_mp, cur->bc_private.a.agno,
> +				ra->next_bno, rec->rm_startblock - ra->next_bno,
> +				XFS_RMAP_OWN_NULL, 0, 0);
> +
> +		rae = kmem_alloc(sizeof(struct xrep_abt_extent), KM_MAYFAIL);
> +		if (!rae)
> +			return -ENOMEM;
> +		INIT_LIST_HEAD(&rae->list);
> +		rae->bno = ra->next_bno;
> +		rae->len = rec->rm_startblock - ra->next_bno;
> +		list_add_tail(&rae->list, ra->extlist);

Any reason we don't use a bitmap for this one?

> +		ra->nr_records++;
> +		ra->nr_blocks += rae->len;
> +	}
> +	ra->next_bno = max_t(xfs_agblock_t, ra->next_bno,
> +			rec->rm_startblock + rec->rm_blockcount);

The max_t() is to cover the record overlap case, right? If so, another
one liner comment would be good.
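
To make the overlap case concrete, here is a standalone sketch of the
same gap-finding walk over records sorted by start block.  struct rec,
struct gap and find_gaps() are hypothetical names, and trailing free
space after the last record is deliberately not handled here.

```c
#include <stddef.h>
#include <stdint.h>

struct rec { uint32_t start; uint32_t len; };	/* simplified rmap record */
struct gap { uint32_t start; uint32_t len; };	/* free extent found */

/*
 * Walk records sorted by start block and emit the gaps between them.
 * Returns the number of gaps written to out (at most max_out).
 */
static size_t
find_gaps(const struct rec *recs, size_t nr, struct gap *out, size_t max_out)
{
	uint32_t	next = 0;	/* first block not yet covered */
	size_t		n = 0;
	size_t		i;

	for (i = 0; i < nr; i++) {
		uint32_t	end = recs[i].start + recs[i].len;

		if (recs[i].start > next && n < max_out) {
			out[n].start = next;
			out[n].len = recs[i].start - next;
			n++;
		}

		/*
		 * Records can overlap (e.g. shared blocks), so a later
		 * record may end before an earlier one does.  Never move
		 * next backwards, or we would report used space as free.
		 */
		if (end > next)
			next = end;
	}
	return n;
}
```

For records {0,10}, {5,20}, {8,2}, {30,5}, the third record ends at
block 10 while the second already covers up to block 25; without the
"never move backwards" rule the walk would report blocks 10-29 as free
instead of only blocks 25-29.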

> +	return 0;
> +}
> +
...
> +/* Free an extent, which creates a record in the bnobt/cntbt. */
> +STATIC int
> +xrep_abt_free_extent(
> +	struct xfs_scrub	*sc,
> +	xfs_fsblock_t		fsbno,
> +	xfs_extlen_t		len,
> +	struct xfs_owner_info	*oinfo)
> +{
> +	int			error;
> +
> +	error = xfs_free_extent(sc->tp, fsbno, len, oinfo, 0);
> +	if (error)
> +		return error;
> +	error = xrep_roll_ag_trans(sc);
> +	if (error)
> +		return error;
> +	return xfs_mod_fdblocks(sc->mp, -(int64_t)len, false);

What's this call for? Is it because the blocks we're freeing were
already free? (Similar question on the other xfs_mod_fdblocks() call
further down).

BTW, what prevents some other task from coming along and screwing with
this? For example, could a large falloc or buffered write come in and
allocate these global blocks before we take them away here (causing the
whole sequence to fail)?

> +}
> +
...
> +/*
> + * Allocate a block from the (cached) first extent in the AG.  In theory
> + * this should never fail, since we already checked that there was enough
> + * space to handle the new btrees.
> + */
> +STATIC xfs_fsblock_t
> +xrep_abt_alloc_block(
> +	struct xfs_scrub	*sc,
> +	struct list_head	*free_extents)
> +{
> +	struct xrep_abt_extent	*ext;
> +
> +	/* Pull the first free space extent off the list, and... */
> +	ext = list_first_entry(free_extents, struct xrep_abt_extent, list);
> +
> +	/* ...take its first block. */
> +	ext->bno++;
> +	ext->len--;
> +	if (ext->len == 0) {
> +		list_del(&ext->list);
> +		kmem_free(ext);
> +	}
> +
> +	return XFS_AGB_TO_FSB(sc->mp, sc->sa.agno, ext->bno - 1);

Looks like a potential use after free of ext.
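
One way to avoid that would be to save the block number before the
extent can be freed.  The following is a standalone userspace sketch of
the idea, using a hypothetical singly linked struct ext and free() in
place of the kernel list and kmem_free(); it is not a proposed patch.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct xrep_abt_extent. */
struct ext {
	struct ext	*next;
	uint32_t	bno;
	uint32_t	len;
};

/*
 * Take the first block of the first extent on the list.  The caller
 * must pass a non-empty list whose first extent has len > 0.
 */
static uint32_t
take_first_block(struct ext **head)
{
	struct ext	*e = *head;
	uint32_t	bno = e->bno;	/* save before e might be freed */

	e->bno++;
	e->len--;
	if (e->len == 0) {
		*head = e->next;
		free(e);
	}

	/* e must not be dereferenced past this point. */
	return bno;
}
```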

> +}
> +
...
> +/*
> + * Reset the global free block counter and the per-AG counters to make it look
> + * like this AG has no free space.
> + */
> +STATIC int
> +xrep_abt_reset_counters(
> +	struct xfs_scrub	*sc,
> +	int			*log_flags)
> +{
> +	struct xfs_perag	*pag = sc->sa.pag;
> +	struct xfs_agf		*agf;
> +	xfs_agblock_t		new_btblks;
> +	xfs_agblock_t		to_free;
> +	int			error;
> +
> +	/*
> +	 * Since we're abandoning the old bnobt/cntbt, we have to decrease
> +	 * fdblocks by the # of blocks in those trees.  btreeblks counts the
> +	 * non-root blocks of the free space and rmap btrees.  Do this before
> +	 * resetting the AGF counters.
> +	 */

Hmm, I'm not quite following the comment wrt the xfs_mod_fdblocks()
below. to_free looks like it's the count of all current btree blocks
minus rmap blocks (i.e., old bno/cnt btree blocks). Are we "allocating"
those blocks here because we're going to free them later?

> +	agf = XFS_BUF_TO_AGF(sc->sa.agf_bp);
> +
> +	/* rmap_blocks accounts root block, btreeblks doesn't */
> +	new_btblks = be32_to_cpu(agf->agf_rmap_blocks) - 1;
> +
> +	/* btreeblks doesn't account bno/cnt root blocks */
> +	to_free = pag->pagf_btreeblks + 2;
> +
> +	/* and don't account for the blocks we aren't freeing */
> +	to_free -= new_btblks;
> +
> +	error = xfs_mod_fdblocks(sc->mp, -(int64_t)to_free, false);
> +	if (error)
> +		return error;
> +
> +	/*
> +	 * Reset the per-AG info, both incore and ondisk.  Mark the incore
> +	 * state stale in case we fail out of here.
> +	 */
> +	ASSERT(pag->pagf_init);
> +	pag->pagf_init = 0;
> +	pag->pagf_btreeblks = new_btblks;
> +	pag->pagf_freeblks = 0;
> +	pag->pagf_longest = 0;
> +
> +	agf->agf_btreeblks = cpu_to_be32(new_btblks);
> +	agf->agf_freeblks = 0;
> +	agf->agf_longest = 0;
> +	*log_flags |= XFS_AGF_BTREEBLKS | XFS_AGF_LONGEST | XFS_AGF_FREEBLKS;
> +
> +	return 0;
> +}
> +
> +/* Initialize a new free space btree root and implant into AGF. */
> +STATIC int
> +xrep_abt_reset_btree(
> +	struct xfs_scrub	*sc,
> +	xfs_btnum_t		btnum,
> +	struct list_head	*free_extents)
> +{
> +	struct xfs_owner_info	oinfo;
> +	struct xfs_buf		*bp;
> +	struct xfs_perag	*pag = sc->sa.pag;
> +	struct xfs_mount	*mp = sc->mp;
> +	struct xfs_agf		*agf = XFS_BUF_TO_AGF(sc->sa.agf_bp);
> +	xfs_fsblock_t		fsbno;
> +	int			error;
> +
> +	/* Allocate new root block. */
> +	fsbno = xrep_abt_alloc_block(sc, free_extents);

xrep_abt_alloc_block() converts an agbno to return an fsb. This function
passes the fsb to the init call just below and then converts it back to
an agbno in two places. It seems like there might be fewer conversions to
follow if the above just returned an agbno and we converted it to an fsb
once for xrep_init_btblock().

> +	if (fsbno == NULLFSBLOCK)
> +		return -ENOSPC;
> +
> +	/* Initialize new tree root. */
> +	error = xrep_init_btblock(sc, fsbno, &bp, btnum, &xfs_allocbt_buf_ops);
> +	if (error)
> +		return error;
> +
> +	/* Implant into AGF. */
> +	agf->agf_roots[btnum] = cpu_to_be32(XFS_FSB_TO_AGBNO(mp, fsbno));
> +	agf->agf_levels[btnum] = cpu_to_be32(1);
> +
> +	/* Add rmap records for the btree roots */
> +	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_AG);
> +	error = xfs_rmap_alloc(sc->tp, sc->sa.agf_bp, sc->sa.agno,
> +			XFS_FSB_TO_AGBNO(mp, fsbno), 1, &oinfo);
> +	if (error)
> +		return error;
> +
> +	/* Reset the incore state. */
> +	pag->pagf_levels[btnum] = 1;
> +
> +	return 0;
> +}
> +
...
> +
> +/*
> + * Make our new freespace btree roots permanent so that we can start freeing
> + * unused space back into the AG.
> + */
> +STATIC int
> +xrep_abt_commit_new(
> +	struct xfs_scrub	*sc,
> +	struct xfs_bitmap	*old_allocbt_blocks,
> +	int			log_flags)
> +{
> +	int			error;
> +
> +	xfs_alloc_log_agf(sc->tp, sc->sa.agf_bp, log_flags);
> +
> +	/* Invalidate the old freespace btree blocks and commit. */
> +	error = xrep_invalidate_blocks(sc, old_allocbt_blocks);
> +	if (error)
> +		return error;

It looks like the above invalidation all happens in the same
transaction. Those aren't logging buffer data or anything, but any idea
how many log formats we can get away with in this single transaction?

> +	error = xrep_roll_ag_trans(sc);
> +	if (error)
> +		return error;
> +
> +	/* Now that we've succeeded, mark the incore state valid again. */
> +	sc->sa.pag->pagf_init = 1;
> +	return 0;
> +}
> +
> +/* Build new free space btrees and dispose of the old one. */
> +STATIC int
> +xrep_abt_rebuild_trees(
> +	struct xfs_scrub	*sc,
> +	struct list_head	*free_extents,
> +	struct xfs_bitmap	*old_allocbt_blocks)
> +{
> +	struct xfs_owner_info	oinfo;
> +	struct xrep_abt_extent	*rae;
> +	struct xrep_abt_extent	*n;
> +	struct xrep_abt_extent	*longest;
> +	int			error;
> +
> +	xfs_rmap_skip_owner_update(&oinfo);
> +
> +	/*
> +	 * Insert the longest free extent in case it's necessary to
> +	 * refresh the AGFL with multiple blocks.  If there is no longest
> +	 * extent, we had exactly the free space we needed; we're done.
> +	 */

I'm confused by the last sentence. longest should only be NULL if the
free space list is empty and haven't we already bailed out with -ENOSPC
if that's the case?

> +	longest = xrep_abt_get_longest(free_extents);
> +	if (!longest)
> +		goto done;
> +	error = xrep_abt_free_extent(sc,
> +			XFS_AGB_TO_FSB(sc->mp, sc->sa.agno, longest->bno),
> +			longest->len, &oinfo);
> +	list_del(&longest->list);
> +	kmem_free(longest);
> +	if (error)
> +		return error;
> +
> +	/* Insert records into the new btrees. */
> +	list_for_each_entry_safe(rae, n, free_extents, list) {
> +		error = xrep_abt_free_extent(sc,
> +				XFS_AGB_TO_FSB(sc->mp, sc->sa.agno, rae->bno),
> +				rae->len, &oinfo);
> +		if (error)
> +			return error;
> +		list_del(&rae->list);
> +		kmem_free(rae);
> +	}

Ok, at this point we've reset the btree roots and we start freeing the
free ranges that were discovered via the rmapbt analysis. AFAICT, if we
fail or crash at this point, we leave the allocbts in a partially
constructed state. I take it that is Ok with respect to the broader
repair algorithm because we'd essentially start over by inspecting the
rmapbt again on a retry.

The blocks allocated for the btrees that we've begun to construct here
end up mapped in the rmapbt as we go, right? IIUC, that means we don't
necessarily have infinite retries to make sure this completes. IOW,
suppose that a first repair attempt finds just enough free space to
construct new trees, gets far enough along to consume most of that free
space and then crashes. Is it possible that a subsequent repair attempt
includes the btree blocks allocated during the previous failed repair
attempt in the sum of "old btree blocks" and determines we don't have
enough free space to repair?

> +
> +done:
> +	/* Free all the OWN_AG blocks that are not in the rmapbt/agfl. */
> +	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_AG);
> +	return xrep_reap_extents(sc, old_allocbt_blocks, &oinfo,
> +			XFS_AG_RESV_NONE);
> +}
> +
...
> diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c
> index 0ed68379e551..82f99633a597 100644
> --- a/fs/xfs/xfs_extent_busy.c
> +++ b/fs/xfs/xfs_extent_busy.c
> @@ -657,3 +657,17 @@ xfs_extent_busy_ag_cmp(
>  		diff = b1->bno - b2->bno;
>  	return diff;
>  }
> +
> +/* Are there any busy extents in this AG? */
> +bool
> +xfs_extent_busy_list_empty(
> +	struct xfs_perag	*pag)
> +{
> +	spin_lock(&pag->pagb_lock);
> +	if (pag->pagb_tree.rb_node) {

RB_EMPTY_ROOT()?

Brian

> +		spin_unlock(&pag->pagb_lock);
> +		return false;
> +	}
> +	spin_unlock(&pag->pagb_lock);
> +	return true;
> +}
> diff --git a/fs/xfs/xfs_extent_busy.h b/fs/xfs/xfs_extent_busy.h
> index 990ab3891971..2f8c73c712c6 100644
> --- a/fs/xfs/xfs_extent_busy.h
> +++ b/fs/xfs/xfs_extent_busy.h
> @@ -65,4 +65,6 @@ static inline void xfs_extent_busy_sort(struct list_head *list)
>  	list_sort(NULL, list, xfs_extent_busy_ag_cmp);
>  }
>  
> +bool xfs_extent_busy_list_empty(struct xfs_perag *pag);
> +
>  #endif /* __XFS_EXTENT_BUSY_H__ */
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [PATCH 05/14] xfs: repair free space btrees
  2018-07-31 17:47   ` Brian Foster
@ 2018-07-31 22:01     ` Darrick J. Wong
  2018-08-01 11:54       ` Brian Foster
  0 siblings, 1 reply; 53+ messages in thread
From: Darrick J. Wong @ 2018-07-31 22:01 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, david, allison.henderson

On Tue, Jul 31, 2018 at 01:47:23PM -0400, Brian Foster wrote:
> On Sun, Jul 29, 2018 at 10:48:21PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Rebuild the free space btrees from the gaps in the rmap btree.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/Makefile             |    1 
> >  fs/xfs/scrub/alloc.c        |    1 
> >  fs/xfs/scrub/alloc_repair.c |  581 +++++++++++++++++++++++++++++++++++++++++++
> >  fs/xfs/scrub/common.c       |    8 +
> >  fs/xfs/scrub/repair.h       |    2 
> >  fs/xfs/scrub/scrub.c        |    4 
> >  fs/xfs/scrub/trace.h        |    2 
> >  fs/xfs/xfs_extent_busy.c    |   14 +
> >  fs/xfs/xfs_extent_busy.h    |    2 
> >  9 files changed, 610 insertions(+), 5 deletions(-)
> >  create mode 100644 fs/xfs/scrub/alloc_repair.c
> > 
> > 
> ...
> > diff --git a/fs/xfs/scrub/alloc_repair.c b/fs/xfs/scrub/alloc_repair.c
> > new file mode 100644
> > index 000000000000..b228c2906de2
> > --- /dev/null
> > +++ b/fs/xfs/scrub/alloc_repair.c
> > @@ -0,0 +1,581 @@
> ...
> > +/* Record extents that aren't in use from gaps in the rmap records. */
> > +STATIC int
> > +xrep_abt_walk_rmap(
> > +	struct xfs_btree_cur	*cur,
> > +	struct xfs_rmap_irec	*rec,
> > +	void			*priv)
> > +{
> > +	struct xrep_abt		*ra = priv;
> > +	struct xrep_abt_extent	*rae;
> > +	xfs_fsblock_t		fsb;
> > +	int			error;
> > +
> > +	/* Record all the OWN_AG blocks... */
> > +	if (rec->rm_owner == XFS_RMAP_OWN_AG) {
> > +		fsb = XFS_AGB_TO_FSB(cur->bc_mp, cur->bc_private.a.agno,
> > +				rec->rm_startblock);
> > +		error = xfs_bitmap_set(ra->btlist, fsb, rec->rm_blockcount);
> > +		if (error)
> > +			return error;
> > +	}
> > +
> > +	/* ...and all the rmapbt blocks... */
> > +	error = xfs_bitmap_set_btcur_path(&ra->nobtlist, cur);
> > +	if (error)
> > +		return error;
> > +
> > +	/* ...and all the free space. */
> > +	if (rec->rm_startblock > ra->next_bno) {
> > +		trace_xrep_abt_walk_rmap(cur->bc_mp, cur->bc_private.a.agno,
> > +				ra->next_bno, rec->rm_startblock - ra->next_bno,
> > +				XFS_RMAP_OWN_NULL, 0, 0);
> > +
> > +		rae = kmem_alloc(sizeof(struct xrep_abt_extent), KM_MAYFAIL);
> > +		if (!rae)
> > +			return -ENOMEM;
> > +		INIT_LIST_HEAD(&rae->list);
> > +		rae->bno = ra->next_bno;
> > +		rae->len = rec->rm_startblock - ra->next_bno;
> > +		list_add_tail(&rae->list, ra->extlist);
> 
> Any reason we don't use a bitmap for this one?
> 
> > +		ra->nr_records++;
> > +		ra->nr_blocks += rae->len;
> > +	}
> > +	ra->next_bno = max_t(xfs_agblock_t, ra->next_bno,
> > +			rec->rm_startblock + rec->rm_blockcount);
> 
> The max_t() is to cover the record overlap case, right? If so, another
> one liner comment would be good.

Right.  Will add a comment.
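
To illustrate the overlap case, here is a minimal userspace sketch of the gap walk. The types and the find_gaps() name are simplified stand-ins invented for this example, not the kernel structures; the point is only that next_bno must never move backwards when a later record ends before an earlier one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for an rmap record covering [start, start + len). */
struct rec {
	uint32_t	start;
	uint32_t	len;
};

/* Simplified stand-in for a free extent found in a gap between records. */
struct gap {
	uint32_t	bno;
	uint32_t	len;
};

static uint32_t
max_u32(uint32_t a, uint32_t b)
{
	return a > b ? a : b;
}

/*
 * Walk records sorted by start block and record the gaps between them.
 * Returns the number of gaps written to out[] (at most max_out).
 */
static size_t
find_gaps(const struct rec *recs, size_t nr, struct gap *out, size_t max_out)
{
	uint32_t	next_bno = 0;
	size_t		n = 0;
	size_t		i;

	for (i = 0; i < nr; i++) {
		if (recs[i].start > next_bno && n < max_out) {
			out[n].bno = next_bno;
			out[n].len = recs[i].start - next_bno;
			n++;
		}
		/*
		 * Records can overlap, so a later record may end before an
		 * earlier one does.  Taking the max keeps next_bno from
		 * moving backwards and reporting in-use blocks as free.
		 */
		next_bno = max_u32(next_bno, recs[i].start + recs[i].len);
	}
	return n;
}
```

With records (0,4), (2,8), (6,2), (12,3), the third record lies entirely inside the second; without the max, next_bno would drop back to 8 and blocks 8-9 would be misreported as free.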

> > +	return 0;
> > +}
> > +
> ...
> > +/* Free an extent, which creates a record in the bnobt/cntbt. */
> > +STATIC int
> > +xrep_abt_free_extent(
> > +	struct xfs_scrub	*sc,
> > +	xfs_fsblock_t		fsbno,
> > +	xfs_extlen_t		len,
> > +	struct xfs_owner_info	*oinfo)
> > +{
> > +	int			error;
> > +
> > +	error = xfs_free_extent(sc->tp, fsbno, len, oinfo, 0);
> > +	if (error)
> > +		return error;
> > +	error = xrep_roll_ag_trans(sc);
> > +	if (error)
> > +		return error;
> > +	return xfs_mod_fdblocks(sc->mp, -(int64_t)len, false);
> 
> What's this call for? Is it because the blocks we're freeing were
> already free? (Similar question on the other xfs_mod_fdblocks() call
> further down).

Yes.  The goal here is to free the (already free) extent with no net
change in fdblocks...

> BTW, what prevents some other task from coming along and screwing with
> this? For example, could a large falloc or buffered write come in and
> allocate these global blocks before we take them away here (causing the
> whole sequence to fail)?

...but you're right that there is a window of opportunity for someone to
swoop in and reserve the blocks while we still have the AGF locked,
which means that we'll fail here even though that other process will
never get the space.

Thinking about this a bit more, what we really want to do is to skip the
xfs_trans_mod_sb(len) that happens after xfs_free_ag_extent inserts the
record into the bno/cntbt.  Hm.  If a record insertion requires an
expansion of the bnobt/cntbt, we'll pull blocks from the AGFL, but we
separately force those to be accounted to XFS_AG_RESV_AGFL.  Therefore,
we could make a "fake" per-AG reservation type that would skip the
fdblocks update.  That avoids the problem where we commit the free space
record but someone else reserves all the free space and then we blow out
with ENOSPC and a half-rebuilt bnobt.
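
As a rough illustration of that idea only: the sketch below uses made-up userspace names (RESV_IGNORE, struct counters, free_extent) and is not the kernel's per-AG reservation code. It shows a free path that always inserts the per-AG record but skips the global counter update when told to, so there is nothing to subtract back afterwards and no window for another task to race with.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reservation types; RESV_IGNORE is the "fake" one. */
enum resv_type {
	RESV_NONE,	/* ordinary free: credit the global counter */
	RESV_IGNORE,	/* repair: the space was already counted as free */
};

/* Stand-ins for the global and per-AG free block counters. */
struct counters {
	int64_t		fdblocks;
	int64_t		ag_freeblks;
};

static void
free_extent(struct counters *c, int64_t len, enum resv_type type)
{
	/* The free space record is always inserted into the AG. */
	c->ag_freeblks += len;

	/* Only ordinary frees change the global free block count. */
	if (type == RESV_NONE)
		c->fdblocks += len;
}
```

Compared with "free, then subtract", the global counter is never temporarily inflated, so no other reservation can consume the blocks in between.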

For the second case (which I assume is xrep_abt_reset_counters?) I'll
respond below.

> > +}
> > +
> ...
> > +/*
> > + * Allocate a block from the (cached) first extent in the AG.  In theory
> > + * this should never fail, since we already checked that there was enough
> > + * space to handle the new btrees.
> > + */
> > +STATIC xfs_fsblock_t
> > +xrep_abt_alloc_block(
> > +	struct xfs_scrub	*sc,
> > +	struct list_head	*free_extents)
> > +{
> > +	struct xrep_abt_extent	*ext;
> > +
> > +	/* Pull the first free space extent off the list, and... */
> > +	ext = list_first_entry(free_extents, struct xrep_abt_extent, list);

Missing an if (!ext) return NULLFSBLOCK; here for some reason...

> > +	/* ...take its first block. */
> > +	ext->bno++;
> > +	ext->len--;
> > +	if (ext->len == 0) {
> > +		list_del(&ext->list);
> > +		kmem_free(ext);
> > +	}
> > +
> > +	return XFS_AGB_TO_FSB(sc->mp, sc->sa.agno, ext->bno - 1);
> 
> Looks like a potential use after free of ext.

Oops, good catch!  I'll add a temporary variable to hold the value for
the return.
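
A minimal userspace sketch of that fix, assuming a simple singly linked list in place of the kernel's list_head (the struct ext, alloc_block and NULLBLOCK names are stand-ins for this example). Note that in the kernel, list_first_entry() does not return NULL on an empty list, so the real check would more likely use list_empty() or list_first_entry_or_null().

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define NULLBLOCK	UINT32_MAX	/* stand-in for "no block available" */

/* Stand-in for a free extent on a singly linked list. */
struct ext {
	struct ext	*next;
	uint32_t	bno;
	uint32_t	len;
};

/* Take the first block of the first extent; free the extent once empty. */
static uint32_t
alloc_block(struct ext **head)
{
	struct ext	*ext = *head;
	uint32_t	bno;

	/* Empty list: nothing to hand out. */
	if (!ext)
		return NULLBLOCK;

	/* Save the block number before ext can be freed. */
	bno = ext->bno;

	ext->bno++;
	ext->len--;
	if (ext->len == 0) {
		*head = ext->next;
		free(ext);
	}

	/* Safe: we no longer dereference ext here. */
	return bno;
}
```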

> > +}
> > +
> ...
> > +/*
> > + * Reset the global free block counter and the per-AG counters to make it look
> > + * like this AG has no free space.
> > + */
> > +STATIC int
> > +xrep_abt_reset_counters(
> > +	struct xfs_scrub	*sc,
> > +	int			*log_flags)
> > +{
> > +	struct xfs_perag	*pag = sc->sa.pag;
> > +	struct xfs_agf		*agf;
> > +	xfs_agblock_t		new_btblks;
> > +	xfs_agblock_t		to_free;
> > +	int			error;
> > +
> > +	/*
> > +	 * Since we're abandoning the old bnobt/cntbt, we have to decrease
> > +	 * fdblocks by the # of blocks in those trees.  btreeblks counts the
> > +	 * non-root blocks of the free space and rmap btrees.  Do this before
> > +	 * resetting the AGF counters.
> > +	 */
> 
> Hmm, I'm not quite following the comment wrt to the xfs_mod_fdblocks()
> below. to_free looks like it's the count of all current btree blocks
> minus rmap blocks (i.e., old bno/cnt btree blocks). Are we "allocating"
> those blocks here because we're going to free them later?

Yes.  Though now that I have a XFS_AG_RESV_IGNORE, maybe I should just
pass that to xrep_reap_extents in xrep_abt_rebuild_trees and then I can
skip the racy mod_fdblocks thing here too.

> > +	agf = XFS_BUF_TO_AGF(sc->sa.agf_bp);
> > +
> > +	/* rmap_blocks accounts root block, btreeblks doesn't */
> > +	new_btblks = be32_to_cpu(agf->agf_rmap_blocks) - 1;
> > +
> > +	/* btreeblks doesn't account bno/cnt root blocks */
> > +	to_free = pag->pagf_btreeblks + 2;
> > +
> > +	/* and don't account for the blocks we aren't freeing */
> > +	to_free -= new_btblks;
> > +
> > +	error = xfs_mod_fdblocks(sc->mp, -(int64_t)to_free, false);
> > +	if (error)
> > +		return error;
> > +
> > +	/*
> > +	 * Reset the per-AG info, both incore and ondisk.  Mark the incore
> > +	 * state stale in case we fail out of here.
> > +	 */
> > +	ASSERT(pag->pagf_init);
> > +	pag->pagf_init = 0;
> > +	pag->pagf_btreeblks = new_btblks;
> > +	pag->pagf_freeblks = 0;
> > +	pag->pagf_longest = 0;
> > +
> > +	agf->agf_btreeblks = cpu_to_be32(new_btblks);
> > +	agf->agf_freeblks = 0;
> > +	agf->agf_longest = 0;
> > +	*log_flags |= XFS_AGF_BTREEBLKS | XFS_AGF_LONGEST | XFS_AGF_FREEBLKS;
> > +
> > +	return 0;
> > +}
> > +
> > +/* Initialize a new free space btree root and implant into AGF. */
> > +STATIC int
> > +xrep_abt_reset_btree(
> > +	struct xfs_scrub	*sc,
> > +	xfs_btnum_t		btnum,
> > +	struct list_head	*free_extents)
> > +{
> > +	struct xfs_owner_info	oinfo;
> > +	struct xfs_buf		*bp;
> > +	struct xfs_perag	*pag = sc->sa.pag;
> > +	struct xfs_mount	*mp = sc->mp;
> > +	struct xfs_agf		*agf = XFS_BUF_TO_AGF(sc->sa.agf_bp);
> > +	xfs_fsblock_t		fsbno;
> > +	int			error;
> > +
> > +	/* Allocate new root block. */
> > +	fsbno = xrep_abt_alloc_block(sc, free_extents);
> 
> xrep_abt_alloc_block() converts an agbno to return an fsb. This function
> passes the fsb to the init call just below and then converts it back to
> an agbno in two places. It seems like there might be less conversions to
> follow if the above just returned an agbno and we converted it to an fsb
> once for xrep_init_btblock().

Yep, will fix.

> > +	if (fsbno == NULLFSBLOCK)
> > +		return -ENOSPC;
> > +
> > +	/* Initialize new tree root. */
> > +	error = xrep_init_btblock(sc, fsbno, &bp, btnum, &xfs_allocbt_buf_ops);
> > +	if (error)
> > +		return error;
> > +
> > +	/* Implant into AGF. */
> > +	agf->agf_roots[btnum] = cpu_to_be32(XFS_FSB_TO_AGBNO(mp, fsbno));
> > +	agf->agf_levels[btnum] = cpu_to_be32(1);
> > +
> > +	/* Add rmap records for the btree roots */
> > +	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_AG);
> > +	error = xfs_rmap_alloc(sc->tp, sc->sa.agf_bp, sc->sa.agno,
> > +			XFS_FSB_TO_AGBNO(mp, fsbno), 1, &oinfo);
> > +	if (error)
> > +		return error;
> > +
> > +	/* Reset the incore state. */
> > +	pag->pagf_levels[btnum] = 1;
> > +
> > +	return 0;
> > +}
> > +
> ...
> > +
> > +/*
> > + * Make our new freespace btree roots permanent so that we can start freeing
> > + * unused space back into the AG.
> > + */
> > +STATIC int
> > +xrep_abt_commit_new(
> > +	struct xfs_scrub	*sc,
> > +	struct xfs_bitmap	*old_allocbt_blocks,
> > +	int			log_flags)
> > +{
> > +	int			error;
> > +
> > +	xfs_alloc_log_agf(sc->tp, sc->sa.agf_bp, log_flags);
> > +
> > +	/* Invalidate the old freespace btree blocks and commit. */
> > +	error = xrep_invalidate_blocks(sc, old_allocbt_blocks);
> > +	if (error)
> > +		return error;
> 
> It looks like the above invalidation all happens in the same
> transaction. Those aren't logging buffer data or anything, but any idea
> how many log formats we can get away with in this single transaction?

Hm... well, on my computer a log format is ~88 bytes.  Assuming 4K
blocks, the max AG size of 1TB, maximum free space fragmentation, and
two btrees, the tree could be up to ~270 million records.  Assuming ~505
records per block, that's ... ~531,000 leaf blocks and ~1100 node blocks
for both btrees.  If we invalidate both, that's ~46M of RAM?
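
The arithmetic can be reproduced with the small sketch below. It bakes in the same assumptions as the estimate above (every other block free, 505 entries per block at every level, 88 bytes per log format item); the function and struct names are invented for this example.

```c
#include <assert.h>
#include <stdint.h>

struct abt_estimate {
	uint64_t	records;	/* records across both btrees */
	uint64_t	leaf_blocks;	/* leaf blocks across both btrees */
	uint64_t	node_blocks;	/* node blocks across both btrees */
	uint64_t	bytes;		/* log format memory for all blocks */
};

static uint64_t
div_round_up(uint64_t n, uint64_t d)
{
	return (n + d - 1) / d;
}

static struct abt_estimate
estimate_invalidation(uint64_t ag_bytes, uint64_t block_size,
		uint64_t recs_per_block, uint64_t fmt_bytes)
{
	struct abt_estimate	est = { 0, 0, 0, 0 };
	/* Worst case fragmentation: every other block is free. */
	uint64_t		per_tree_recs = (ag_bytes / block_size) / 2;
	uint64_t		leaves = div_round_up(per_tree_recs,
						recs_per_block);
	uint64_t		level = leaves;
	uint64_t		nodes = 0;

	/* Count interior levels until a single root block remains. */
	while (level > 1) {
		level = div_round_up(level, recs_per_block);
		nodes += level;
	}

	/* Two btrees (bnobt and cntbt) index the same records. */
	est.records = 2 * per_tree_recs;
	est.leaf_blocks = 2 * leaves;
	est.node_blocks = 2 * nodes;
	est.bytes = (est.leaf_blocks + est.node_blocks) * fmt_bytes;
	return est;
}
```

Calling estimate_invalidation(1ULL << 40, 4096, 505, 88) gives 268,435,456 records, 531,556 leaf blocks, 1,060 node blocks and 46,870,208 bytes, in line with the rounded figures above.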

> > +	error = xrep_roll_ag_trans(sc);
> > +	if (error)
> > +		return error;
> > +
> > +	/* Now that we've succeeded, mark the incore state valid again. */
> > +	sc->sa.pag->pagf_init = 1;
> > +	return 0;
> > +}
> > +
> > +/* Build new free space btrees and dispose of the old one. */
> > +STATIC int
> > +xrep_abt_rebuild_trees(
> > +	struct xfs_scrub	*sc,
> > +	struct list_head	*free_extents,
> > +	struct xfs_bitmap	*old_allocbt_blocks)
> > +{
> > +	struct xfs_owner_info	oinfo;
> > +	struct xrep_abt_extent	*rae;
> > +	struct xrep_abt_extent	*n;
> > +	struct xrep_abt_extent	*longest;
> > +	int			error;
> > +
> > +	xfs_rmap_skip_owner_update(&oinfo);
> > +
> > +	/*
> > +	 * Insert the longest free extent in case it's necessary to
> > +	 * refresh the AGFL with multiple blocks.  If there is no longest
> > +	 * extent, we had exactly the free space we needed; we're done.
> > +	 */
> 
> I'm confused by the last sentence. longest should only be NULL if the
> free space list is empty and haven't we already bailed out with -ENOSPC
> if that's the case?
> 
> > +	longest = xrep_abt_get_longest(free_extents);

xrep_abt_rebuild_trees is called after we allocate and initialize two
new btree roots in xrep_abt_reset_btrees.  If free_extents is an empty
list here, then we found exactly two blocks worth of free space and used
them to set up new btree roots.

> > +	if (!longest)
> > +		goto done;
> > +	error = xrep_abt_free_extent(sc,
> > +			XFS_AGB_TO_FSB(sc->mp, sc->sa.agno, longest->bno),
> > +			longest->len, &oinfo);
> > +	list_del(&longest->list);
> > +	kmem_free(longest);
> > +	if (error)
> > +		return error;
> > +
> > +	/* Insert records into the new btrees. */
> > +	list_for_each_entry_safe(rae, n, free_extents, list) {
> > +		error = xrep_abt_free_extent(sc,
> > +				XFS_AGB_TO_FSB(sc->mp, sc->sa.agno, rae->bno),
> > +				rae->len, &oinfo);
> > +		if (error)
> > +			return error;
> > +		list_del(&rae->list);
> > +		kmem_free(rae);
> > +	}
> 
> Ok, at this point we've reset the btree roots and we start freeing the
> free ranges that were discovered via the rmapbt analysis. AFAICT, if we
> fail or crash at this point, we leave the allocbts in a partially
> constructed state. I take it that is Ok with respect to the broader
> repair algorithm because we'd essentially start over by inspecting the
> rmapbt again on a retry.

Right.  Though in the crash/shutdown case, you'll end up with the
filesystem in an offline state at some point before you can retry the
scrub, so it's probably faster to run xfs_repair to fix the damage.

> The blocks allocated for the btrees that we've begun to construct here
> end up mapped in the rmapbt as we go, right? IIUC, that means we don't
> necessarily have infinite retries to make sure this completes. IOW,
> suppose that a first repair attempt finds just enough free space to
> construct new trees, gets far enough along to consume most of that free
> space and then crashes. Is it possible that a subsequent repair attempt
> includes the btree blocks allocated during the previous failed repair
> attempt in the sum of "old btree blocks" and determines we don't have
> enough free space to repair?

Yes, that's a risk of running the free space repair.

> > +
> > +done:
> > +	/* Free all the OWN_AG blocks that are not in the rmapbt/agfl. */
> > +	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_AG);
> > +	return xrep_reap_extents(sc, old_allocbt_blocks, &oinfo,
> > +			XFS_AG_RESV_NONE);
> > +}
> > +
> ...
> > diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c
> > index 0ed68379e551..82f99633a597 100644
> > --- a/fs/xfs/xfs_extent_busy.c
> > +++ b/fs/xfs/xfs_extent_busy.c
> > @@ -657,3 +657,17 @@ xfs_extent_busy_ag_cmp(
> >  		diff = b1->bno - b2->bno;
> >  	return diff;
> >  }
> > +
> > +/* Are there any busy extents in this AG? */
> > +bool
> > +xfs_extent_busy_list_empty(
> > +	struct xfs_perag	*pag)
> > +{
> > +	spin_lock(&pag->pagb_lock);
> > +	if (pag->pagb_tree.rb_node) {
> 
> RB_EMPTY_ROOT()?

Good suggestion, thank you!
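
For reference, a userspace sketch of what the helper might look like with that change. Everything suffixed _stub is a stand-in defined here so the example compiles outside the kernel; the real code would use spin_lock()/spin_unlock() and RB_EMPTY_ROOT() from the kernel headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for the kernel's rb_node, rb_root and RB_EMPTY_ROOT(). */
struct rb_node_stub {
	int			unused;
};
struct rb_root_stub {
	struct rb_node_stub	*rb_node;
};
#define RB_EMPTY_ROOT_STUB(root)	((root)->rb_node == NULL)

/* Stand-in for a spinlock that only tracks lock/unlock balance. */
struct lock_stub {
	int			held;
};
static void lock_stub_lock(struct lock_stub *l) { l->held++; }
static void lock_stub_unlock(struct lock_stub *l) { l->held--; }

/* Stand-in for the two xfs_perag fields the helper touches. */
struct perag_stub {
	struct lock_stub	pagb_lock;
	struct rb_root_stub	pagb_tree;
};

/* Are there any busy extents in this AG? */
static bool
busy_list_empty(struct perag_stub *pag)
{
	bool			res;

	/* Sample the tree under the lock, with a single unlock path. */
	lock_stub_lock(&pag->pagb_lock);
	res = RB_EMPTY_ROOT_STUB(&pag->pagb_tree);
	lock_stub_unlock(&pag->pagb_lock);
	return res;
}
```

Using the macro also collapses the two unlock paths in the original into one.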

--D

> Brian
> 
> > +		spin_unlock(&pag->pagb_lock);
> > +		return false;
> > +	}
> > +	spin_unlock(&pag->pagb_lock);
> > +	return true;
> > +}
> > diff --git a/fs/xfs/xfs_extent_busy.h b/fs/xfs/xfs_extent_busy.h
> > index 990ab3891971..2f8c73c712c6 100644
> > --- a/fs/xfs/xfs_extent_busy.h
> > +++ b/fs/xfs/xfs_extent_busy.h
> > @@ -65,4 +65,6 @@ static inline void xfs_extent_busy_sort(struct list_head *list)
> >  	list_sort(NULL, list, xfs_extent_busy_ag_cmp);
> >  }
> >  
> > +bool xfs_extent_busy_list_empty(struct xfs_perag *pag);
> > +
> >  #endif /* __XFS_EXTENT_BUSY_H__ */
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [PATCH 05/14] xfs: repair free space btrees
  2018-07-31 22:01     ` Darrick J. Wong
@ 2018-08-01 11:54       ` Brian Foster
  2018-08-01 16:23         ` Darrick J. Wong
  0 siblings, 1 reply; 53+ messages in thread
From: Brian Foster @ 2018-08-01 11:54 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, david, allison.henderson

On Tue, Jul 31, 2018 at 03:01:25PM -0700, Darrick J. Wong wrote:
> On Tue, Jul 31, 2018 at 01:47:23PM -0400, Brian Foster wrote:
> > On Sun, Jul 29, 2018 at 10:48:21PM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > Rebuild the free space btrees from the gaps in the rmap btree.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > ---
> > >  fs/xfs/Makefile             |    1 
> > >  fs/xfs/scrub/alloc.c        |    1 
> > >  fs/xfs/scrub/alloc_repair.c |  581 +++++++++++++++++++++++++++++++++++++++++++
> > >  fs/xfs/scrub/common.c       |    8 +
> > >  fs/xfs/scrub/repair.h       |    2 
> > >  fs/xfs/scrub/scrub.c        |    4 
> > >  fs/xfs/scrub/trace.h        |    2 
> > >  fs/xfs/xfs_extent_busy.c    |   14 +
> > >  fs/xfs/xfs_extent_busy.h    |    2 
> > >  9 files changed, 610 insertions(+), 5 deletions(-)
> > >  create mode 100644 fs/xfs/scrub/alloc_repair.c
> > > 
> > > 
> > ...
> > > diff --git a/fs/xfs/scrub/alloc_repair.c b/fs/xfs/scrub/alloc_repair.c
> > > new file mode 100644
> > > index 000000000000..b228c2906de2
> > > --- /dev/null
> > > +++ b/fs/xfs/scrub/alloc_repair.c
> > > @@ -0,0 +1,581 @@
> > ...
> > > +/* Record extents that aren't in use from gaps in the rmap records. */
> > > +STATIC int
> > > +xrep_abt_walk_rmap(
> > > +	struct xfs_btree_cur	*cur,
> > > +	struct xfs_rmap_irec	*rec,
> > > +	void			*priv)
> > > +{
> > > +	struct xrep_abt		*ra = priv;
> > > +	struct xrep_abt_extent	*rae;
> > > +	xfs_fsblock_t		fsb;
> > > +	int			error;
> > > +
> > > +	/* Record all the OWN_AG blocks... */
> > > +	if (rec->rm_owner == XFS_RMAP_OWN_AG) {
> > > +		fsb = XFS_AGB_TO_FSB(cur->bc_mp, cur->bc_private.a.agno,
> > > +				rec->rm_startblock);
> > > +		error = xfs_bitmap_set(ra->btlist, fsb, rec->rm_blockcount);
> > > +		if (error)
> > > +			return error;
> > > +	}
> > > +
> > > +	/* ...and all the rmapbt blocks... */
> > > +	error = xfs_bitmap_set_btcur_path(&ra->nobtlist, cur);
> > > +	if (error)
> > > +		return error;
> > > +
> > > +	/* ...and all the free space. */
> > > +	if (rec->rm_startblock > ra->next_bno) {
> > > +		trace_xrep_abt_walk_rmap(cur->bc_mp, cur->bc_private.a.agno,
> > > +				ra->next_bno, rec->rm_startblock - ra->next_bno,
> > > +				XFS_RMAP_OWN_NULL, 0, 0);
> > > +
> > > +		rae = kmem_alloc(sizeof(struct xrep_abt_extent), KM_MAYFAIL);
> > > +		if (!rae)
> > > +			return -ENOMEM;
> > > +		INIT_LIST_HEAD(&rae->list);
> > > +		rae->bno = ra->next_bno;
> > > +		rae->len = rec->rm_startblock - ra->next_bno;
> > > +		list_add_tail(&rae->list, ra->extlist);
> > 
> > Any reason we don't use a bitmap for this one?
> > 

??

> > > +		ra->nr_records++;
> > > +		ra->nr_blocks += rae->len;
> > > +	}
> > > +	ra->next_bno = max_t(xfs_agblock_t, ra->next_bno,
> > > +			rec->rm_startblock + rec->rm_blockcount);
> > 
> > The max_t() is to cover the record overlap case, right? If so, another
> > one liner comment would be good.
> 
> Right.  Will add a comment.
> 
> > > +	return 0;
> > > +}
> > > +
> > ...
> > > +/* Free an extent, which creates a record in the bnobt/cntbt. */
> > > +STATIC int
> > > +xrep_abt_free_extent(
> > > +	struct xfs_scrub	*sc,
> > > +	xfs_fsblock_t		fsbno,
> > > +	xfs_extlen_t		len,
> > > +	struct xfs_owner_info	*oinfo)
> > > +{
> > > +	int			error;
> > > +
> > > +	error = xfs_free_extent(sc->tp, fsbno, len, oinfo, 0);
> > > +	if (error)
> > > +		return error;
> > > +	error = xrep_roll_ag_trans(sc);
> > > +	if (error)
> > > +		return error;
> > > +	return xfs_mod_fdblocks(sc->mp, -(int64_t)len, false);
> > 
> > What's this call for? Is it because the blocks we're freeing were
> > already free? (Similar question on the other xfs_mod_fdblocks() call
> > further down).
> 
> Yes.  The goal here is to free the (already free) extent with no net
> change in fdblocks...
> 
> > BTW, what prevents some other task from coming along and screwing with
> > this? For example, could a large falloc or buffered write come in and
> > allocate these global blocks before we take them away here (causing the
> > whole sequence to fail)?
> 
> ...but you're right that here is a window of opportunity for someone to
> swoop in and reserve the blocks while we still have the AGF locked,
> which means that we'll fail here even though that other process will
> never get the space.
> 
> Thinking about this a bit more, what we really want to do is to skip the
> xfs_trans_mod_sb(len) that happens after xfs_free_ag_extent inserts the
> record into the bno/cntbt.  Hm.  If a record insertion requires an
> expansion of the bnobt/cntbt, we'll pull blocks from the AGFL, but we
> separately force those to be accounted to XFS_AG_RESV_AGFL.  Therefore,
> we could make a "fake" per-AG reservation type that would skip the
> fdblocks update.  That avoids the problem where we commit the free space
> record but someone else reserves all the free space and then we blow out
> with ENOSPC and a half-rebuilt bnobt.
> 

Ok, that sounds a bit more straightforward to me.

> For the second case (which I assume is xrep_abt_reset_counters?) I'll
> respond below.
> 
> > > +}
> > > +
...
> > > +/*
> > > + * Reset the global free block counter and the per-AG counters to make it look
> > > + * like this AG has no free space.
> > > + */
> > > +STATIC int
> > > +xrep_abt_reset_counters(
> > > +	struct xfs_scrub	*sc,
> > > +	int			*log_flags)
> > > +{
> > > +	struct xfs_perag	*pag = sc->sa.pag;
> > > +	struct xfs_agf		*agf;
> > > +	xfs_agblock_t		new_btblks;
> > > +	xfs_agblock_t		to_free;
> > > +	int			error;
> > > +
> > > +	/*
> > > +	 * Since we're abandoning the old bnobt/cntbt, we have to decrease
> > > +	 * fdblocks by the # of blocks in those trees.  btreeblks counts the
> > > +	 * non-root blocks of the free space and rmap btrees.  Do this before
> > > +	 * resetting the AGF counters.
> > > +	 */
> > 
> > Hmm, I'm not quite following the comment wrt to the xfs_mod_fdblocks()
> > below. to_free looks like it's the count of all current btree blocks
> > minus rmap blocks (i.e., old bno/cnt btree blocks). Are we "allocating"
> > those blocks here because we're going to free them later?
> 
> Yes.  Though now that I have a XFS_AG_RESV_IGNORE, maybe I should just
> pass that to xrep_reap_extents in xrep_abt_rebuild_trees and then I can
> skip the racy mod_fdblocks thing here too.
> 

I think I'll ultimately need to see the code to make sure I follow the
ignore thing correctly, but that overall sounds better to me. If we do
retain these kinds of calls to undo/work-around underlying
infrastructure, I think we need a bit more specific comments that
describe precisely what behavior the call is offsetting.

> > > +	agf = XFS_BUF_TO_AGF(sc->sa.agf_bp);
> > > +
> > > +	/* rmap_blocks accounts root block, btreeblks doesn't */
> > > +	new_btblks = be32_to_cpu(agf->agf_rmap_blocks) - 1;
> > > +
> > > +	/* btreeblks doesn't account bno/cnt root blocks */
> > > +	to_free = pag->pagf_btreeblks + 2;
> > > +
> > > +	/* and don't account for the blocks we aren't freeing */
> > > +	to_free -= new_btblks;
> > > +
> > > +	error = xfs_mod_fdblocks(sc->mp, -(int64_t)to_free, false);
> > > +	if (error)
> > > +		return error;
> > > +
> > > +	/*
> > > +	 * Reset the per-AG info, both incore and ondisk.  Mark the incore
> > > +	 * state stale in case we fail out of here.
> > > +	 */
> > > +	ASSERT(pag->pagf_init);
> > > +	pag->pagf_init = 0;
> > > +	pag->pagf_btreeblks = new_btblks;
> > > +	pag->pagf_freeblks = 0;
> > > +	pag->pagf_longest = 0;
> > > +
> > > +	agf->agf_btreeblks = cpu_to_be32(new_btblks);
> > > +	agf->agf_freeblks = 0;
> > > +	agf->agf_longest = 0;
> > > +	*log_flags |= XFS_AGF_BTREEBLKS | XFS_AGF_LONGEST | XFS_AGF_FREEBLKS;
> > > +
> > > +	return 0;
> > > +}
> > > +
...
> > > +
> > > +/*
> > > + * Make our new freespace btree roots permanent so that we can start freeing
> > > + * unused space back into the AG.
> > > + */
> > > +STATIC int
> > > +xrep_abt_commit_new(
> > > +	struct xfs_scrub	*sc,
> > > +	struct xfs_bitmap	*old_allocbt_blocks,
> > > +	int			log_flags)
> > > +{
> > > +	int			error;
> > > +
> > > +	xfs_alloc_log_agf(sc->tp, sc->sa.agf_bp, log_flags);
> > > +
> > > +	/* Invalidate the old freespace btree blocks and commit. */
> > > +	error = xrep_invalidate_blocks(sc, old_allocbt_blocks);
> > > +	if (error)
> > > +		return error;
> > 
> > It looks like the above invalidation all happens in the same
> > transaction. Those aren't logging buffer data or anything, but any idea
> > how many log formats we can get away with in this single transaction?
> 
> Hm... well, on my computer a log format is ~88 bytes.  Assuming 4K
> blocks, the max AG size of 1TB, maximum free space fragmentation, and
> two btrees, the tree could be up to ~270 million records.  Assuming ~505
> records per block, that's ... ~531,000 leaf blocks and ~1100 node blocks
> for both btrees.  If we invalidate both, that's ~46M of RAM?
> 

I was thinking more about transaction reservation than RAM. It may not
currently be an issue, but it might be worth putting something down in a
comment to note that this is a single transaction and we expect to not
have to invalidate more than N (ballpark) blocks in a single go,
whatever that value happens to be.

> > > +	error = xrep_roll_ag_trans(sc);
> > > +	if (error)
> > > +		return error;
> > > +
> > > +	/* Now that we've succeeded, mark the incore state valid again. */
> > > +	sc->sa.pag->pagf_init = 1;
> > > +	return 0;
> > > +}
> > > +
> > > +/* Build new free space btrees and dispose of the old one. */
> > > +STATIC int
> > > +xrep_abt_rebuild_trees(
> > > +	struct xfs_scrub	*sc,
> > > +	struct list_head	*free_extents,
> > > +	struct xfs_bitmap	*old_allocbt_blocks)
> > > +{
> > > +	struct xfs_owner_info	oinfo;
> > > +	struct xrep_abt_extent	*rae;
> > > +	struct xrep_abt_extent	*n;
> > > +	struct xrep_abt_extent	*longest;
> > > +	int			error;
> > > +
> > > +	xfs_rmap_skip_owner_update(&oinfo);
> > > +
> > > +	/*
> > > +	 * Insert the longest free extent in case it's necessary to
> > > +	 * refresh the AGFL with multiple blocks.  If there is no longest
> > > +	 * extent, we had exactly the free space we needed; we're done.
> > > +	 */
> > 
> > I'm confused by the last sentence. longest should only be NULL if the
> > free space list is empty and haven't we already bailed out with -ENOSPC
> > if that's the case?
> > 
> > > +	longest = xrep_abt_get_longest(free_extents);
> 
> xrep_abt_rebuild_trees is called after we allocate and initialize two
> new btree roots in xrep_abt_reset_btrees.  If free_extents is an empty
> list here, then we found exactly two blocks worth of free space and used
> them to set up new btree roots.
> 

Got it, thanks.

> > > +	if (!longest)
> > > +		goto done;
> > > +	error = xrep_abt_free_extent(sc,
> > > +			XFS_AGB_TO_FSB(sc->mp, sc->sa.agno, longest->bno),
> > > +			longest->len, &oinfo);
> > > +	list_del(&longest->list);
> > > +	kmem_free(longest);
> > > +	if (error)
> > > +		return error;
> > > +
> > > +	/* Insert records into the new btrees. */
> > > +	list_for_each_entry_safe(rae, n, free_extents, list) {
> > > +		error = xrep_abt_free_extent(sc,
> > > +				XFS_AGB_TO_FSB(sc->mp, sc->sa.agno, rae->bno),
> > > +				rae->len, &oinfo);
> > > +		if (error)
> > > +			return error;
> > > +		list_del(&rae->list);
> > > +		kmem_free(rae);
> > > +	}
> > 
> > Ok, at this point we've reset the btree roots and we start freeing the
> > free ranges that were discovered via the rmapbt analysis. AFAICT, if we
> > fail or crash at this point, we leave the allocbts in a partially
> > constructed state. I take it that is Ok with respect to the broader
> > repair algorithm because we'd essentially start over by inspecting the
> > rmapbt again on a retry.
> 
> Right.  Though in the crash/shutdown case, you'll end up with the
> filesystem in an offline state at some point before you can retry the
> scrub, it's probably faster to run xfs_repair to fix the damage.
> 

Can we really assume that if we're already up and running an online
repair? The filesystem has to be mountable in that case in the first
place. If we've already reset and started reconstructing the allocation
btrees then I'd think those transactions would recover just fine on a
power loss or something (perhaps not in the event of some other
corruption related shutdown).

> > The blocks allocated for the btrees that we've begun to construct here
> > end up mapped in the rmapbt as we go, right? IIUC, that means we don't
> > necessarily have infinite retries to make sure this completes. IOW,
> > suppose that a first repair attempt finds just enough free space to
> > construct new trees, gets far enough along to consume most of that free
> > space and then crashes. Is it possible that a subsequent repair attempt
> > includes the btree blocks allocated during the previous failed repair
> > attempt in the sum of "old btree blocks" and determines we don't have
> > enough free space to repair?
> 
> Yes, that's a risk of running the free space repair.
> 

Can we improve on that? For example, are the rmapbt entries for the old
allocation btree blocks necessary once we commit the btree resets? If
not, could we remove those entries before we start tree reconstruction?

Alternatively, could we incorporate use of the old btree blocks? As it
is, we discover those blocks simply so we can free them at the end.
Perhaps we could free them sooner or find a more clever means to
reallocate directly from that in-core list? I guess we have to consider
whether they were really valid/sane btree blocks, but either way ISTM
that the old blocks list is essentially invalidated once we reset the
btrees.

Brian

> > > +
> > > +done:
> > > +	/* Free all the OWN_AG blocks that are not in the rmapbt/agfl. */
> > > +	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_AG);
> > > +	return xrep_reap_extents(sc, old_allocbt_blocks, &oinfo,
> > > +			XFS_AG_RESV_NONE);
> > > +}
> > > +
> > ...
> > > diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c
> > > index 0ed68379e551..82f99633a597 100644
> > > --- a/fs/xfs/xfs_extent_busy.c
> > > +++ b/fs/xfs/xfs_extent_busy.c
> > > @@ -657,3 +657,17 @@ xfs_extent_busy_ag_cmp(
> > >  		diff = b1->bno - b2->bno;
> > >  	return diff;
> > >  }
> > > +
> > > +/* Are there any busy extents in this AG? */
> > > +bool
> > > +xfs_extent_busy_list_empty(
> > > +	struct xfs_perag	*pag)
> > > +{
> > > +	spin_lock(&pag->pagb_lock);
> > > +	if (pag->pagb_tree.rb_node) {
> > 
> > RB_EMPTY_ROOT()?
> 
> Good suggestion, thank you!
> 
> --D
> 
> > Brian
> > 
> > > +		spin_unlock(&pag->pagb_lock);
> > > +		return false;
> > > +	}
> > > +	spin_unlock(&pag->pagb_lock);
> > > +	return true;
> > > +}
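
For anyone following along, here is a rough sketch of how the helper might read with the RB_EMPTY_ROOT() suggestion applied. It is a userspace mock-up so that it builds on its own: the struct layouts, the no-op spin_lock()/spin_unlock() stubs, and the simplified RB_EMPTY_ROOT() definition (the kernel macro wraps the load in READ_ONCE()) are stand-ins assumed for illustration, not the real kernel headers, and this is not necessarily what the next revision will contain.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel types; illustration only. */
struct rb_node { struct rb_node *rb_left, *rb_right; };
struct rb_root { struct rb_node *rb_node; };
typedef struct { int unused; } spinlock_t;

/* Simplified; the kernel macro reads rb_node via READ_ONCE(). */
#define RB_EMPTY_ROOT(root)	((root)->rb_node == NULL)

/* No-op lock stubs so the sketch builds outside the kernel. */
static void spin_lock(spinlock_t *lock) { (void)lock; }
static void spin_unlock(spinlock_t *lock) { (void)lock; }

/* Mock of the two xfs_perag fields that the helper touches. */
struct xfs_perag {
	spinlock_t	pagb_lock;
	struct rb_root	pagb_tree;
};

/* Are there any busy extents in this AG? */
bool
xfs_extent_busy_list_empty(
	struct xfs_perag	*pag)
{
	bool			res;

	assert(pag != NULL);
	spin_lock(&pag->pagb_lock);
	res = RB_EMPTY_ROOT(&pag->pagb_tree);
	spin_unlock(&pag->pagb_lock);
	return res;
}
```

Besides being shorter, this form has a single unlock and a single return, so there is no early-exit path that could forget to drop pagb_lock.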
> > > diff --git a/fs/xfs/xfs_extent_busy.h b/fs/xfs/xfs_extent_busy.h
> > > index 990ab3891971..2f8c73c712c6 100644
> > > --- a/fs/xfs/xfs_extent_busy.h
> > > +++ b/fs/xfs/xfs_extent_busy.h
> > > @@ -65,4 +65,6 @@ static inline void xfs_extent_busy_sort(struct list_head *list)
> > >  	list_sort(NULL, list, xfs_extent_busy_ag_cmp);
> > >  }
> > >  
> > > +bool xfs_extent_busy_list_empty(struct xfs_perag *pag);
> > > +
> > >  #endif /* __XFS_EXTENT_BUSY_H__ */
> > > 
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [PATCH 05/14] xfs: repair free space btrees
  2018-08-01 11:54       ` Brian Foster
@ 2018-08-01 16:23         ` Darrick J. Wong
  2018-08-01 18:39           ` Brian Foster
  0 siblings, 1 reply; 53+ messages in thread
From: Darrick J. Wong @ 2018-08-01 16:23 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, david, allison.henderson

On Wed, Aug 01, 2018 at 07:54:09AM -0400, Brian Foster wrote:
> On Tue, Jul 31, 2018 at 03:01:25PM -0700, Darrick J. Wong wrote:
> > On Tue, Jul 31, 2018 at 01:47:23PM -0400, Brian Foster wrote:
> > > On Sun, Jul 29, 2018 at 10:48:21PM -0700, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > 
> > > > Rebuild the free space btrees from the gaps in the rmap btree.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > ---
> > > >  fs/xfs/Makefile             |    1 
> > > >  fs/xfs/scrub/alloc.c        |    1 
> > > >  fs/xfs/scrub/alloc_repair.c |  581 +++++++++++++++++++++++++++++++++++++++++++
> > > >  fs/xfs/scrub/common.c       |    8 +
> > > >  fs/xfs/scrub/repair.h       |    2 
> > > >  fs/xfs/scrub/scrub.c        |    4 
> > > >  fs/xfs/scrub/trace.h        |    2 
> > > >  fs/xfs/xfs_extent_busy.c    |   14 +
> > > >  fs/xfs/xfs_extent_busy.h    |    2 
> > > >  9 files changed, 610 insertions(+), 5 deletions(-)
> > > >  create mode 100644 fs/xfs/scrub/alloc_repair.c
> > > > 
> > > > 
> > > ...
> > > > diff --git a/fs/xfs/scrub/alloc_repair.c b/fs/xfs/scrub/alloc_repair.c
> > > > new file mode 100644
> > > > index 000000000000..b228c2906de2
> > > > --- /dev/null
> > > > +++ b/fs/xfs/scrub/alloc_repair.c
> > > > @@ -0,0 +1,581 @@
> > > ...
> > > > +/* Record extents that aren't in use from gaps in the rmap records. */
> > > > +STATIC int
> > > > +xrep_abt_walk_rmap(
> > > > +	struct xfs_btree_cur	*cur,
> > > > +	struct xfs_rmap_irec	*rec,
> > > > +	void			*priv)
> > > > +{
> > > > +	struct xrep_abt		*ra = priv;
> > > > +	struct xrep_abt_extent	*rae;
> > > > +	xfs_fsblock_t		fsb;
> > > > +	int			error;
> > > > +
> > > > +	/* Record all the OWN_AG blocks... */
> > > > +	if (rec->rm_owner == XFS_RMAP_OWN_AG) {
> > > > +		fsb = XFS_AGB_TO_FSB(cur->bc_mp, cur->bc_private.a.agno,
> > > > +				rec->rm_startblock);
> > > > +		error = xfs_bitmap_set(ra->btlist, fsb, rec->rm_blockcount);
> > > > +		if (error)
> > > > +			return error;
> > > > +	}
> > > > +
> > > > +	/* ...and all the rmapbt blocks... */
> > > > +	error = xfs_bitmap_set_btcur_path(&ra->nobtlist, cur);
> > > > +	if (error)
> > > > +		return error;
> > > > +
> > > > +	/* ...and all the free space. */
> > > > +	if (rec->rm_startblock > ra->next_bno) {
> > > > +		trace_xrep_abt_walk_rmap(cur->bc_mp, cur->bc_private.a.agno,
> > > > +				ra->next_bno, rec->rm_startblock - ra->next_bno,
> > > > +				XFS_RMAP_OWN_NULL, 0, 0);
> > > > +
> > > > +		rae = kmem_alloc(sizeof(struct xrep_abt_extent), KM_MAYFAIL);
> > > > +		if (!rae)
> > > > +			return -ENOMEM;
> > > > +		INIT_LIST_HEAD(&rae->list);
> > > > +		rae->bno = ra->next_bno;
> > > > +		rae->len = rec->rm_startblock - ra->next_bno;
> > > > +		list_add_tail(&rae->list, ra->extlist);
> > > 
> > > Any reason we don't use a bitmap for this one?
> > > 
> 
> ??

Yes, I could probably do that, let's see if it works...

> > > > +		ra->nr_records++;
> > > > +		ra->nr_blocks += rae->len;
> > > > +	}
> > > > +	ra->next_bno = max_t(xfs_agblock_t, ra->next_bno,
> > > > +			rec->rm_startblock + rec->rm_blockcount);
> > > 
> > > The max_t() is to cover the record overlap case, right? If so, another
> > > one liner comment would be good.
> > 
> > Right.  Will add a comment.
> > 
> > > > +	return 0;
> > > > +}
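
To illustrate the gap-walking idea and why the max_t() matters, here is a minimal userspace sketch. The struct names, find_rmap_gaps(), and the choice to start the cursor at block 0 are assumptions made for the example; the real function also records OWN_AG and rmapbt blocks, which is left out here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the rmap record and extent. */
struct rmap_rec { uint32_t startblock; uint32_t blockcount; };
struct free_ext { uint32_t bno; uint32_t len; };

/*
 * Walk rmap records sorted by startblock and emit every gap between them
 * as a free extent.  Returns the number of gaps written to out[], or -1
 * if out[] is too small.  Space after the last record is not handled here.
 */
int
find_rmap_gaps(
	const struct rmap_rec	*recs,
	size_t			nr_recs,
	struct free_ext		*out,
	size_t			max_out)
{
	uint32_t		next_bno = 0;
	uint32_t		end;
	size_t			nr_out = 0;
	size_t			i;

	assert(recs != NULL && out != NULL);
	for (i = 0; i < nr_recs; i++) {
		if (recs[i].startblock > next_bno) {
			if (nr_out == max_out)
				return -1;
			out[nr_out].bno = next_bno;
			out[nr_out].len = recs[i].startblock - next_bno;
			nr_out++;
		}
		/*
		 * Records may overlap, so only ever move next_bno
		 * forward; this plays the role of max_t() in the patch.
		 */
		end = recs[i].startblock + recs[i].blockcount;
		if (end > next_bno)
			next_bno = end;
	}
	return (int)nr_out;
}
```

With records {0,10}, {2,3}, {12,3}, {20,5}, the second record ends at block 5, inside the first. Without the forward-only update the cursor would drop back to 5 and blocks 5-9 would be reported as free even though the first record still covers them.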
> > > > +
> > > ...
> > > > +/* Free an extent, which creates a record in the bnobt/cntbt. */
> > > > +STATIC int
> > > > +xrep_abt_free_extent(
> > > > +	struct xfs_scrub	*sc,
> > > > +	xfs_fsblock_t		fsbno,
> > > > +	xfs_extlen_t		len,
> > > > +	struct xfs_owner_info	*oinfo)
> > > > +{
> > > > +	int			error;
> > > > +
> > > > +	error = xfs_free_extent(sc->tp, fsbno, len, oinfo, 0);
> > > > +	if (error)
> > > > +		return error;
> > > > +	error = xrep_roll_ag_trans(sc);
> > > > +	if (error)
> > > > +		return error;
> > > > +	return xfs_mod_fdblocks(sc->mp, -(int64_t)len, false);
> > > 
> > > What's this call for? Is it because the blocks we're freeing were
> > > already free? (Similar question on the other xfs_mod_fdblocks() call
> > > further down).
> > 
> > Yes.  The goal here is to free the (already free) extent with no net
> > change in fdblocks...
> > 
> > > BTW, what prevents some other task from coming along and screwing with
> > > this? For example, could a large falloc or buffered write come in and
> > > allocate these global blocks before we take them away here (causing the
> > > whole sequence to fail)?
> > 
> > ...but you're right that here is a window of opportunity for someone to
> > swoop in and reserve the blocks while we still have the AGF locked,
> > which means that we'll fail here even though that other process will
> > never get the space.
> > 
> > Thinking about this a bit more, what we really want to do is to skip the
> > xfs_trans_mod_sb(len) that happens after xfs_free_ag_extent inserts the
> > record into the bno/cntbt.  Hm.  If a record insertion requires an
> > expansion of the bnobt/cntbt, we'll pull blocks from the AGFL, but we
> > separately force those to be accounted to XFS_AG_RESV_AGFL.  Therefore,
> > we could make a "fake" per-AG reservation type that would skip the
> > fdblocks update.  That avoids the problem where we commit the free space
> > record but someone else reserves all the free space and then we blow out
> > with ENOSPC and a half-rebuilt bnobt.
> > 
> 
> Ok, that sounds a bit more straightforward to me.
> 
> > For the second case (which I assume is xrep_abt_reset_counters?) I'll
> > respond below.
> > 
> > > > +}
> > > > +
> ...
> > > > +/*
> > > > + * Reset the global free block counter and the per-AG counters to make it look
> > > > + * like this AG has no free space.
> > > > + */
> > > > +STATIC int
> > > > +xrep_abt_reset_counters(
> > > > +	struct xfs_scrub	*sc,
> > > > +	int			*log_flags)
> > > > +{
> > > > +	struct xfs_perag	*pag = sc->sa.pag;
> > > > +	struct xfs_agf		*agf;
> > > > +	xfs_agblock_t		new_btblks;
> > > > +	xfs_agblock_t		to_free;
> > > > +	int			error;
> > > > +
> > > > +	/*
> > > > +	 * Since we're abandoning the old bnobt/cntbt, we have to decrease
> > > > +	 * fdblocks by the # of blocks in those trees.  btreeblks counts the
> > > > +	 * non-root blocks of the free space and rmap btrees.  Do this before
> > > > +	 * resetting the AGF counters.
> > > > +	 */
> > > 
> > > Hmm, I'm not quite following the comment wrt to the xfs_mod_fdblocks()
> > > below. to_free looks like it's the count of all current btree blocks
> > > minus rmap blocks (i.e., old bno/cnt btree blocks). Are we "allocating"
> > > those blocks here because we're going to free them later?
> > 
> > Yes.  Though now that I have a XFS_AG_RESV_IGNORE, maybe I should just
> > pass that to xrep_reap_extents in xrep_abt_rebuild_trees and then I can
> > skip the racy mod_fdblocks thing here too.
> > 
> 
> I think I'll ultimately need to see the code to make sure I follow the
> ignore thing correctly, but that overall sounds better to me. If we do
> retain these kind of calls to undo/work-around underlying
> infrastructure, I think we need a bit more specific comments that
> describe precisely what behavior the call is offsetting.

I'll push out a new revision after I finish rebasing everything atop
your latest dfops refactoring series.

> > > > +	agf = XFS_BUF_TO_AGF(sc->sa.agf_bp);
> > > > +
> > > > +	/* rmap_blocks accounts root block, btreeblks doesn't */
> > > > +	new_btblks = be32_to_cpu(agf->agf_rmap_blocks) - 1;
> > > > +
> > > > +	/* btreeblks doesn't account bno/cnt root blocks */
> > > > +	to_free = pag->pagf_btreeblks + 2;
> > > > +
> > > > +	/* and don't account for the blocks we aren't freeing */
> > > > +	to_free -= new_btblks;
> > > > +
> > > > +	error = xfs_mod_fdblocks(sc->mp, -(int64_t)to_free, false);
> > > > +	if (error)
> > > > +		return error;
> > > > +
> > > > +	/*
> > > > +	 * Reset the per-AG info, both incore and ondisk.  Mark the incore
> > > > +	 * state stale in case we fail out of here.
> > > > +	 */
> > > > +	ASSERT(pag->pagf_init);
> > > > +	pag->pagf_init = 0;
> > > > +	pag->pagf_btreeblks = new_btblks;
> > > > +	pag->pagf_freeblks = 0;
> > > > +	pag->pagf_longest = 0;
> > > > +
> > > > +	agf->agf_btreeblks = cpu_to_be32(new_btblks);
> > > > +	agf->agf_freeblks = 0;
> > > > +	agf->agf_longest = 0;
> > > > +	*log_flags |= XFS_AGF_BTREEBLKS | XFS_AGF_LONGEST | XFS_AGF_FREEBLKS;
> > > > +
> > > > +	return 0;
> > > > +}
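
A worked example of the to_free arithmetic in the function above may help. The helper name abt_blocks_to_free() is made up for this sketch, which only restates the three steps from the patch in standalone form.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of old bnobt/cntbt blocks being abandoned.
 *   pagf_btreeblks:  non-root blocks of the bnobt, cntbt and rmapbt
 *   agf_rmap_blocks: all rmapbt blocks, root included
 */
uint32_t
abt_blocks_to_free(uint32_t pagf_btreeblks, uint32_t agf_rmap_blocks)
{
	uint32_t	new_btblks;
	uint32_t	to_free;

	/* rmap_blocks accounts the root block, btreeblks doesn't */
	assert(agf_rmap_blocks >= 1);
	new_btblks = agf_rmap_blocks - 1;

	/* btreeblks doesn't account the bno/cnt root blocks */
	to_free = pagf_btreeblks + 2;

	/* and don't account for the rmapbt blocks we aren't freeing */
	assert(to_free >= new_btblks);
	return to_free - new_btblks;
}
```

Suppose the bnobt and cntbt each have 4 blocks (1 root plus 3 others) and the rmapbt has 6 (1 root plus 5 others). Then btreeblks is 3 + 3 + 5 = 11, and to_free comes out as 11 + 2 - 5 = 8, which is exactly the 4 + 4 blocks of the two old free space btrees.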
> > > > +
> ...
> > > > +
> > > > +/*
> > > > + * Make our new freespace btree roots permanent so that we can start freeing
> > > > + * unused space back into the AG.
> > > > + */
> > > > +STATIC int
> > > > +xrep_abt_commit_new(
> > > > +	struct xfs_scrub	*sc,
> > > > +	struct xfs_bitmap	*old_allocbt_blocks,
> > > > +	int			log_flags)
> > > > +{
> > > > +	int			error;
> > > > +
> > > > +	xfs_alloc_log_agf(sc->tp, sc->sa.agf_bp, log_flags);
> > > > +
> > > > +	/* Invalidate the old freespace btree blocks and commit. */
> > > > +	error = xrep_invalidate_blocks(sc, old_allocbt_blocks);
> > > > +	if (error)
> > > > +		return error;
> > > 
> > > It looks like the above invalidation all happens in the same
> > > transaction. Those aren't logging buffer data or anything, but any idea
> > > how many log formats we can get away with in this single transaction?
> > 
> > Hm... well, on my computer a log format is ~88 bytes.  Assuming 4K
> > blocks, the max AG size of 1TB, maximum free space fragmentation, and
> > two btrees, the tree could be up to ~270 million records.  Assuming ~505
> > records per block, that's ... ~531,000 leaf blocks and ~1100 node blocks
> > for both btrees.  If we invalidate both, that's ~46M of RAM?
> > 
> 
> I was thinking more about transaction reservation than RAM. It may not

Hmm.  tr_itruncate is ~650K on my 2TB SSD, assuming 88 bytes per, that's
about ... ~7300 log format items?  Not a lot, maybe it should roll the
transaction every 1000 invalidations or so...
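
Spelling out that arithmetic as a sketch. The function names are hypothetical, "~650K" is taken as 650,000 bytes, and the batch size of 1000 is just the ballpark figure from above; none of this is the real invalidation code.

```c
#include <assert.h>
#include <stddef.h>

/* How many fixed-size log format items fit in resv_bytes of reservation? */
size_t
max_log_items(size_t resv_bytes, size_t item_bytes)
{
	assert(item_bytes > 0);
	return resv_bytes / item_bytes;
}

/*
 * Count how many transaction rolls are needed to invalidate nr_blocks
 * blocks if we roll after every batch invalidations.  The final, partial
 * batch is committed by the caller and is not counted here.
 */
size_t
rolls_needed(size_t nr_blocks, size_t batch)
{
	size_t	rolls = 0;
	size_t	in_batch = 0;
	size_t	i;

	assert(batch > 0);
	for (i = 0; i < nr_blocks; i++) {
		/* invalidate block i here */
		if (++in_batch == batch) {
			rolls++;	/* roll the transaction */
			in_batch = 0;
		}
	}
	return rolls;
}
```

max_log_items(650000, 88) gives 7386, in the same ballpark as the ~7300 above. For the worst case from earlier in the thread (~531,000 leaf blocks plus ~1100 node blocks, so roughly 532,100 invalidations), rolling every 1000 would mean 532 rolls, each staying well under the reservation.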

> currently be an issue, but it might be worth putting something down in a
> comment to note that this is a single transaction and we expect to not
> have to invalidate more than N (ballpark) blocks in a single go,
> whatever that value happens to be.
> 
> > > > +	error = xrep_roll_ag_trans(sc);
> > > > +	if (error)
> > > > +		return error;
> > > > +
> > > > +	/* Now that we've succeeded, mark the incore state valid again. */
> > > > +	sc->sa.pag->pagf_init = 1;
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +/* Build new free space btrees and dispose of the old one. */
> > > > +STATIC int
> > > > +xrep_abt_rebuild_trees(
> > > > +	struct xfs_scrub	*sc,
> > > > +	struct list_head	*free_extents,
> > > > +	struct xfs_bitmap	*old_allocbt_blocks)
> > > > +{
> > > > +	struct xfs_owner_info	oinfo;
> > > > +	struct xrep_abt_extent	*rae;
> > > > +	struct xrep_abt_extent	*n;
> > > > +	struct xrep_abt_extent	*longest;
> > > > +	int			error;
> > > > +
> > > > +	xfs_rmap_skip_owner_update(&oinfo);
> > > > +
> > > > +	/*
> > > > +	 * Insert the longest free extent in case it's necessary to
> > > > +	 * refresh the AGFL with multiple blocks.  If there is no longest
> > > > +	 * extent, we had exactly the free space we needed; we're done.
> > > > +	 */
> > > 
> > > I'm confused by the last sentence. longest should only be NULL if the
> > > free space list is empty and haven't we already bailed out with -ENOSPC
> > > if that's the case?
> > > 
> > > > +	longest = xrep_abt_get_longest(free_extents);
> > 
> > xrep_abt_rebuild_trees is called after we allocate and initialize two
> > new btree roots in xrep_abt_reset_btrees.  If free_extents is an empty
> > list here, then we found exactly two blocks worth of free space and used
> > them to set up new btree roots.
> > 
> 
> Got it, thanks.
> 
> > > > +	if (!longest)
> > > > +		goto done;
> > > > +	error = xrep_abt_free_extent(sc,
> > > > +			XFS_AGB_TO_FSB(sc->mp, sc->sa.agno, longest->bno),
> > > > +			longest->len, &oinfo);
> > > > +	list_del(&longest->list);
> > > > +	kmem_free(longest);
> > > > +	if (error)
> > > > +		return error;
> > > > +
> > > > +	/* Insert records into the new btrees. */
> > > > +	list_for_each_entry_safe(rae, n, free_extents, list) {
> > > > +		error = xrep_abt_free_extent(sc,
> > > > +				XFS_AGB_TO_FSB(sc->mp, sc->sa.agno, rae->bno),
> > > > +				rae->len, &oinfo);
> > > > +		if (error)
> > > > +			return error;
> > > > +		list_del(&rae->list);
> > > > +		kmem_free(rae);
> > > > +	}
> > > 
> > > Ok, at this point we've reset the btree roots and we start freeing the
> > > free ranges that were discovered via the rmapbt analysis. AFAICT, if we
> > > fail or crash at this point, we leave the allocbts in a partially
> > > constructed state. I take it that is Ok with respect to the broader
> > > repair algorithm because we'd essentially start over by inspecting the
> > > rmapbt again on a retry.
> > 
> > Right.  Though in the crash/shutdown case, you'll end up with the
> > filesystem in an offline state at some point before you can retry the
> > scrub, it's probably faster to run xfs_repair to fix the damage.
> > 
> 
> Can we really assume that if we're already up and running an online
> repair? The filesystem has to be mountable in that case in the first
> place. If we've already reset and started reconstructing the allocation
> btrees then I'd think those transactions would recover just fine on a
> power loss or something (perhaps not in the event of some other
> corruption related shutdown).

Right, for the system crash case, whatever transactions committed should
replay just fine, and you can even start up the online repair again, and
if the AG isn't particularly close to ENOSPC then (barring rmap
corruption) it should work just fine.

If the fs went down because either (a) repair hit other corruption or
(b) some other thread hit an error in some other part of the filesystem,
then it's not so clear -- in (b) you could probably try again, but for
(a) you'll definitely have to unmount and run xfs_repair.

Perhaps the guideline here is that if the fs goes down more than once
during online repair then unmount it and run xfs_repair.

> > > The blocks allocated for the btrees that we've begun to construct here
> > > end up mapped in the rmapbt as we go, right? IIUC, that means we don't
> > > necessarily have infinite retries to make sure this completes. IOW,
> > > suppose that a first repair attempt finds just enough free space to
> > > construct new trees, gets far enough along to consume most of that free
> > > space and then crashes. Is it possible that a subsequent repair attempt
> > > includes the btree blocks allocated during the previous failed repair
> > > attempt in the sum of "old btree blocks" and determines we don't have
> > > enough free space to repair?
> > 
> > Yes, that's a risk of running the free space repair.
> > 
> 
> Can we improve on that? For example, are the rmapbt entries for the old
> allocation btree blocks necessary once we commit the btree resets? If
> not, could we remove those entries before we start tree reconstruction?
> 
> Alternatively, could we incorporate use of the old btree blocks? As it
> is, we discover those blocks simply so we can free them at the end.
> Perhaps we could free them sooner or find a more clever means to
> reallocate directly from that in-core list? I guess we have to consider
> whether they were really valid/sane btree blocks, but either way ISTM
> that the old blocks list is essentially invalidated once we reset the
> btrees.

Hmm, it's a little tricky to do that -- we could reap the old bnobt and
cntbt blocks (in the old_allocbt_blocks bitmap) first, but if adding a
record causes a btree split we'll pull blocks from the AGFL, and if
there aren't enough blocks in the bnobt to fill the AGFL back up then
fix_freelist won't succeed.  That complication is why it finds the
longest extent in the unclaimed list and pushes that in first, then
works on the rest of the extents.

I suppose one could try to avoid ENOSPC by pushing that longest extent
in first (since we know that won't trigger a split), then reap the old
alloc btree blocks, and then add everything else back in...
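
As a rough sketch of that ordering (the struct and function names here are hypothetical, and the real code works on a linked list rather than an array): pick the longest extent, move it to the front so it goes in first, and leave the remaining extents for after the old blocks have been reaped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one free extent found by the rmap walk. */
struct abt_extent { uint32_t bno; uint32_t len; };

/* Return the index of the longest extent, or -1 if the list is empty. */
int
longest_extent(const struct abt_extent *exts, size_t nr)
{
	size_t	best = 0;
	size_t	i;

	if (nr == 0)
		return -1;
	assert(exts != NULL);
	for (i = 1; i < nr; i++)
		if (exts[i].len > exts[best].len)
			best = i;
	return (int)best;
}

/*
 * Move the longest extent to slot 0 so that it is inserted first.
 * Returns the number of extents, or 0 if there is nothing to insert.
 * Intended sequence for the caller:
 *   1. insert exts[0] (the longest)
 *   2. reap the old bnobt/cntbt blocks
 *   3. insert exts[1..nr-1]
 */
size_t
order_for_insert(struct abt_extent *exts, size_t nr)
{
	struct abt_extent	tmp;
	int			idx = longest_extent(exts, nr);

	if (idx < 0)
		return 0;
	tmp = exts[0];
	exts[0] = exts[idx];
	exts[idx] = tmp;
	return nr;
}
```

The empty-list case corresponds to the "exactly two blocks of free space" situation discussed earlier: there is nothing to insert, so the caller goes straight to reaping.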

--D

> Brian
> 
> > > > +
> > > > +done:
> > > > +	/* Free all the OWN_AG blocks that are not in the rmapbt/agfl. */
> > > > +	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_AG);
> > > > +	return xrep_reap_extents(sc, old_allocbt_blocks, &oinfo,
> > > > +			XFS_AG_RESV_NONE);
> > > > +}
> > > > +
> > > ...
> > > > diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c
> > > > index 0ed68379e551..82f99633a597 100644
> > > > --- a/fs/xfs/xfs_extent_busy.c
> > > > +++ b/fs/xfs/xfs_extent_busy.c
> > > > @@ -657,3 +657,17 @@ xfs_extent_busy_ag_cmp(
> > > >  		diff = b1->bno - b2->bno;
> > > >  	return diff;
> > > >  }
> > > > +
> > > > +/* Are there any busy extents in this AG? */
> > > > +bool
> > > > +xfs_extent_busy_list_empty(
> > > > +	struct xfs_perag	*pag)
> > > > +{
> > > > +	spin_lock(&pag->pagb_lock);
> > > > +	if (pag->pagb_tree.rb_node) {
> > > 
> > > RB_EMPTY_ROOT()?
> > 
> > Good suggestion, thank you!
> > 
> > --D
> > 
> > > Brian
> > > 
> > > > +		spin_unlock(&pag->pagb_lock);
> > > > +		return false;
> > > > +	}
> > > > +	spin_unlock(&pag->pagb_lock);
> > > > +	return true;
> > > > +}
> > > > diff --git a/fs/xfs/xfs_extent_busy.h b/fs/xfs/xfs_extent_busy.h
> > > > index 990ab3891971..2f8c73c712c6 100644
> > > > --- a/fs/xfs/xfs_extent_busy.h
> > > > +++ b/fs/xfs/xfs_extent_busy.h
> > > > @@ -65,4 +65,6 @@ static inline void xfs_extent_busy_sort(struct list_head *list)
> > > >  	list_sort(NULL, list, xfs_extent_busy_ag_cmp);
> > > >  }
> > > >  
> > > > +bool xfs_extent_busy_list_empty(struct xfs_perag *pag);
> > > > +
> > > >  #endif /* __XFS_EXTENT_BUSY_H__ */
> > > > 
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > the body of a message to majordomo@vger.kernel.org
> > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [PATCH 05/14] xfs: repair free space btrees
  2018-08-01 16:23         ` Darrick J. Wong
@ 2018-08-01 18:39           ` Brian Foster
  2018-08-02  6:28             ` Darrick J. Wong
  0 siblings, 1 reply; 53+ messages in thread
From: Brian Foster @ 2018-08-01 18:39 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, david, allison.henderson

On Wed, Aug 01, 2018 at 09:23:16AM -0700, Darrick J. Wong wrote:
> On Wed, Aug 01, 2018 at 07:54:09AM -0400, Brian Foster wrote:
> > On Tue, Jul 31, 2018 at 03:01:25PM -0700, Darrick J. Wong wrote:
> > > On Tue, Jul 31, 2018 at 01:47:23PM -0400, Brian Foster wrote:
> > > > On Sun, Jul 29, 2018 at 10:48:21PM -0700, Darrick J. Wong wrote:
> > > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > 
> > > > > Rebuild the free space btrees from the gaps in the rmap btree.
> > > > > 
> > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > ---
> > > > >  fs/xfs/Makefile             |    1 
> > > > >  fs/xfs/scrub/alloc.c        |    1 
> > > > >  fs/xfs/scrub/alloc_repair.c |  581 +++++++++++++++++++++++++++++++++++++++++++
> > > > >  fs/xfs/scrub/common.c       |    8 +
> > > > >  fs/xfs/scrub/repair.h       |    2 
> > > > >  fs/xfs/scrub/scrub.c        |    4 
> > > > >  fs/xfs/scrub/trace.h        |    2 
> > > > >  fs/xfs/xfs_extent_busy.c    |   14 +
> > > > >  fs/xfs/xfs_extent_busy.h    |    2 
> > > > >  9 files changed, 610 insertions(+), 5 deletions(-)
> > > > >  create mode 100644 fs/xfs/scrub/alloc_repair.c
> > > > > 
> > > > > 
> > > > ...
> > > > > diff --git a/fs/xfs/scrub/alloc_repair.c b/fs/xfs/scrub/alloc_repair.c
> > > > > new file mode 100644
> > > > > index 000000000000..b228c2906de2
> > > > > --- /dev/null
> > > > > +++ b/fs/xfs/scrub/alloc_repair.c
> > > > > @@ -0,0 +1,581 @@
...
> > > > > +
> > > > > +/*
> > > > > + * Make our new freespace btree roots permanent so that we can start freeing
> > > > > + * unused space back into the AG.
> > > > > + */
> > > > > +STATIC int
> > > > > +xrep_abt_commit_new(
> > > > > +	struct xfs_scrub	*sc,
> > > > > +	struct xfs_bitmap	*old_allocbt_blocks,
> > > > > +	int			log_flags)
> > > > > +{
> > > > > +	int			error;
> > > > > +
> > > > > +	xfs_alloc_log_agf(sc->tp, sc->sa.agf_bp, log_flags);
> > > > > +
> > > > > +	/* Invalidate the old freespace btree blocks and commit. */
> > > > > +	error = xrep_invalidate_blocks(sc, old_allocbt_blocks);
> > > > > +	if (error)
> > > > > +		return error;
> > > > 
> > > > It looks like the above invalidation all happens in the same
> > > > transaction. Those aren't logging buffer data or anything, but any idea
> > > > how many log formats we can get away with in this single transaction?
> > > 
> > > Hm... well, on my computer a log format is ~88 bytes.  Assuming 4K
> > > blocks, the max AG size of 1TB, maximum free space fragmentation, and
> > > two btrees, the tree could be up to ~270 million records.  Assuming ~505
> > > records per block, that's ... ~531,000 leaf blocks and ~1100 node blocks
> > > for both btrees.  If we invalidate both, that's ~46M of RAM?
> > > 
> > 
> > I was thinking more about transaction reservation than RAM. It may not
> 
> Hmm.  tr_itruncate is ~650K on my 2TB SSD, assuming 88 bytes per, that's
> about ... ~7300 log format items?  Not a lot, maybe it should roll the
> transaction every 1000 invalidations or so...
> 

I'm not really sure what counts as a lot here given that the blocks
would need to be in-core, but rolling on some fixed/safe interval sounds
reasonable to me.

> > currently be an issue, but it might be worth putting something down in a
> > comment to note that this is a single transaction and we expect to not
> > have to invalidate more than N (ballpark) blocks in a single go,
> > whatever that value happens to be.
> > 
> > > > > +	error = xrep_roll_ag_trans(sc);
> > > > > +	if (error)
> > > > > +		return error;
> > > > > +
> > > > > +	/* Now that we've succeeded, mark the incore state valid again. */
> > > > > +	sc->sa.pag->pagf_init = 1;
> > > > > +	return 0;
> > > > > +}
> > > > > +
> > > > > +/* Build new free space btrees and dispose of the old one. */
> > > > > +STATIC int
> > > > > +xrep_abt_rebuild_trees(
> > > > > +	struct xfs_scrub	*sc,
> > > > > +	struct list_head	*free_extents,
> > > > > +	struct xfs_bitmap	*old_allocbt_blocks)
> > > > > +{
> > > > > +	struct xfs_owner_info	oinfo;
> > > > > +	struct xrep_abt_extent	*rae;
> > > > > +	struct xrep_abt_extent	*n;
> > > > > +	struct xrep_abt_extent	*longest;
> > > > > +	int			error;
> > > > > +
> > > > > +	xfs_rmap_skip_owner_update(&oinfo);
> > > > > +
> > > > > +	/*
> > > > > +	 * Insert the longest free extent in case it's necessary to
> > > > > +	 * refresh the AGFL with multiple blocks.  If there is no longest
> > > > > +	 * extent, we had exactly the free space we needed; we're done.
> > > > > +	 */
> > > > 
> > > > I'm confused by the last sentence. longest should only be NULL if the
> > > > free space list is empty and haven't we already bailed out with -ENOSPC
> > > > if that's the case?
> > > > 
> > > > > +	longest = xrep_abt_get_longest(free_extents);
> > > 
> > > xrep_abt_rebuild_trees is called after we allocate and initialize two
> > > new btree roots in xrep_abt_reset_btrees.  If free_extents is an empty
> > > list here, then we found exactly two blocks worth of free space and used
> > > them to set up new btree roots.
> > > 
> > 
> > Got it, thanks.
> > 
> > > > > +	if (!longest)
> > > > > +		goto done;
> > > > > +	error = xrep_abt_free_extent(sc,
> > > > > +			XFS_AGB_TO_FSB(sc->mp, sc->sa.agno, longest->bno),
> > > > > +			longest->len, &oinfo);
> > > > > +	list_del(&longest->list);
> > > > > +	kmem_free(longest);
> > > > > +	if (error)
> > > > > +		return error;
> > > > > +
> > > > > +	/* Insert records into the new btrees. */
> > > > > +	list_for_each_entry_safe(rae, n, free_extents, list) {
> > > > > +		error = xrep_abt_free_extent(sc,
> > > > > +				XFS_AGB_TO_FSB(sc->mp, sc->sa.agno, rae->bno),
> > > > > +				rae->len, &oinfo);
> > > > > +		if (error)
> > > > > +			return error;
> > > > > +		list_del(&rae->list);
> > > > > +		kmem_free(rae);
> > > > > +	}
> > > > 
> > > > Ok, at this point we've reset the btree roots and we start freeing the
> > > > free ranges that were discovered via the rmapbt analysis. AFAICT, if we
> > > > fail or crash at this point, we leave the allocbts in a partially
> > > > constructed state. I take it that is Ok with respect to the broader
> > > > repair algorithm because we'd essentially start over by inspecting the
> > > > rmapbt again on a retry.
> > > 
> > > Right.  Though in the crash/shutdown case, you'll end up with the
> > > filesystem in an offline state at some point before you can retry the
> > > scrub, it's probably faster to run xfs_repair to fix the damage.
> > > 
> > 
> > Can we really assume that if we're already up and running an online
> > repair? The filesystem has to be mountable in that case in the first
> > place. If we've already reset and started reconstructing the allocation
> > btrees then I'd think those transactions would recover just fine on a
> > power loss or something (perhaps not in the event of some other
> > corruption related shutdown).
> 
> Right, for the system crash case, whatever transactions committed should
> replay just fine, and you can even start up the online repair again, and
> if the AG isn't particularly close to ENOSPC then (barring rmap
> corruption) it should work just fine.
> 
> If the fs went down because either (a) repair hit other corruption or
> (b) some other thread hit an error in some other part of the filesystem,
> then it's not so clear -- in (b) you could probably try again, but for
> (a) you'll definitely have to unmount and run xfs_repair.
> 

Indeed, there are certainly cases where we simply won't be able to do an
online repair. I'm trying to think about scenarios where we should be
able to do an online repair, but we lose power or hit some kind of
transient error like a memory allocation failure before it completes. It
would be nice if the online repair itself didn't contribute (within
reason) to the inability to simply try again just because the fs was
close to -ENOSPC.

For one, I think it's potentially confusing behavior. Second, it might
be concerning to regular users who perceive it as an online repair
leaving the fs in a worse off state. Us fs devs know that may not really
be the case, but I think we're better for addressing it if we can
reasonably do so.

> Perhaps the guideline here is that if the fs goes down more than once
> during online repair then unmount it and run xfs_repair.
> 

Yep, I think that makes sense if the filesystem or repair itself is
tripping over other corruptions that fail to keep it active for the
duration of the repair.

> > > > The blocks allocated for the btrees that we've begun to construct here
> > > > end up mapped in the rmapbt as we go, right? IIUC, that means we don't
> > > > necessarily have infinite retries to make sure this completes. IOW,
> > > > suppose that a first repair attempt finds just enough free space to
> > > > construct new trees, gets far enough along to consume most of that free
> > > > space and then crashes. Is it possible that a subsequent repair attempt
> > > > includes the btree blocks allocated during the previous failed repair
> > > > attempt in the sum of "old btree blocks" and determines we don't have
> > > > enough free space to repair?
> > > 
> > > Yes, that's a risk of running the free space repair.
> > > 
> > 
> > Can we improve on that? For example, are the rmapbt entries for the old
> > allocation btree blocks necessary once we commit the btree resets? If
> > not, could we remove those entries before we start tree reconstruction?
> > 
> > Alternatively, could we incorporate use of the old btree blocks? As it
> > is, we discover those blocks simply so we can free them at the end.
> > Perhaps we could free them sooner or find a more clever means to
> > reallocate directly from that in-core list? I guess we have to consider
> > whether they were really valid/sane btree blocks, but either way ISTM
> > that the old blocks list is essentially invalidated once we reset the
> > btrees.
> 
> Hmm, it's a little tricky to do that -- we could reap the old bnobt and
> cntbt blocks (in the old_allocbt_blocks bitmap) first, but if adding a
> record causes a btree split we'll pull blocks from the AGFL, and if
> there aren't enough blocks in the bnobt to fill the AGFL back up then
> fix_freelist won't succeed.  That complication is why it finds the
> longest extent in the unclaimed list and pushes that in first, then
> works on the rest of the extents.
> 

Hmm, but doesn't a btree split require at least one full space btree
block per-level? In conjunction, the agfl minimum size requirement grows
with the height of the tree, which implies available free space..? I
could be missing something, perhaps we have to account for the rmapbt in
that case as well? Regardless...

> I suppose one could try to avoid ENOSPC by pushing that longest extent
> in first (since we know that won't trigger a split), then reap the old
> alloc btree blocks, and then add everything else back in...
> 

I think it would be reasonable to seed the btree with the longest record
or some fixed number of longest records (~1/2 a root block, for example)
before making actual use of the btrees to reap the old blocks. I think
then you'd only have a very short window of a single block leak on a
poorly timed power loss and repair retry sequence before you start
actually freeing originally used space (which in practice, I think
solves the problem).
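
To make the ordering concrete, here is a rough userspace sketch of the
sequence being proposed: seed with the longest extent, reap the old
btree blocks, then insert everything else.  This is illustration only,
not code from the patch; the toy_* names and types are made up, and it
only plans the order of operations rather than touching any btree.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a free extent record; not the patch's struct. */
struct toy_extent {
	unsigned int	bno;	/* start block within the AG */
	unsigned int	len;	/* length in blocks */
};

enum toy_step_type { STEP_INSERT, STEP_REAP_OLD };

struct toy_step {
	enum toy_step_type	type;
	size_t			idx;	/* index into ext[]; unused for STEP_REAP_OLD */
};

/* Return the index of the longest extent; the first one wins ties. */
static size_t toy_longest(const struct toy_extent *ext, size_t nr)
{
	size_t	i, best = 0;

	assert(nr > 0);
	for (i = 1; i < nr; i++)
		if (ext[i].len > ext[best].len)
			best = i;
	return best;
}

/*
 * Fill steps[] (capacity nr + 1) with the proposed order: the longest
 * extent first, then the reap of the old btree blocks, then the rest.
 * With no extents at all, the only step is the reap.  Returns the
 * number of steps written.
 */
static size_t toy_plan(const struct toy_extent *ext, size_t nr,
		       struct toy_step *steps)
{
	size_t	i, n = 0, longest;

	if (nr == 0) {
		steps[n++] = (struct toy_step){ STEP_REAP_OLD, 0 };
		return n;
	}

	longest = toy_longest(ext, nr);
	steps[n++] = (struct toy_step){ STEP_INSERT, longest };
	steps[n++] = (struct toy_step){ STEP_REAP_OLD, 0 };
	for (i = 0; i < nr; i++)
		if (i != longest)
			steps[n++] = (struct toy_step){ STEP_INSERT, i };
	return n;
}
```

The point of the ordering is that the reap step runs while only one
record is in the new trees, so a crash after that point has already
returned the old btree blocks to free space.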

Given that we're starting from empty, I wonder if another option may be
to over fill the agfl with old btree blocks or something. The first real
free should shift enough blocks back into the btrees to ensure the agfl
can be managed from that point forward, right? That may be more work
than it's worth though and/or a job for another patch. (FWIW, we also
have that NOSHRINK agfl fixup flag for userspace repair.)

Brian

> --D
> 
> > Brian
> > 
> > > > > +
> > > > > +done:
> > > > > +	/* Free all the OWN_AG blocks that are not in the rmapbt/agfl. */
> > > > > +	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_AG);
> > > > > +	return xrep_reap_extents(sc, old_allocbt_blocks, &oinfo,
> > > > > +			XFS_AG_RESV_NONE);
> > > > > +}
> > > > > +
> > > > ...
> > > > > diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c
> > > > > index 0ed68379e551..82f99633a597 100644
> > > > > --- a/fs/xfs/xfs_extent_busy.c
> > > > > +++ b/fs/xfs/xfs_extent_busy.c
> > > > > @@ -657,3 +657,17 @@ xfs_extent_busy_ag_cmp(
> > > > >  		diff = b1->bno - b2->bno;
> > > > >  	return diff;
> > > > >  }
> > > > > +
> > > > > +/* Are there any busy extents in this AG? */
> > > > > +bool
> > > > > +xfs_extent_busy_list_empty(
> > > > > +	struct xfs_perag	*pag)
> > > > > +{
> > > > > +	spin_lock(&pag->pagb_lock);
> > > > > +	if (pag->pagb_tree.rb_node) {
> > > > 
> > > > RB_EMPTY_ROOT()?
> > > 
> > > Good suggestion, thank you!
> > > 
> > > --D
> > > 
> > > > Brian
> > > > 
> > > > > +		spin_unlock(&pag->pagb_lock);
> > > > > +		return false;
> > > > > +	}
> > > > > +	spin_unlock(&pag->pagb_lock);
> > > > > +	return true;
> > > > > +}
> > > > > diff --git a/fs/xfs/xfs_extent_busy.h b/fs/xfs/xfs_extent_busy.h
> > > > > index 990ab3891971..2f8c73c712c6 100644
> > > > > --- a/fs/xfs/xfs_extent_busy.h
> > > > > +++ b/fs/xfs/xfs_extent_busy.h
> > > > > @@ -65,4 +65,6 @@ static inline void xfs_extent_busy_sort(struct list_head *list)
> > > > >  	list_sort(NULL, list, xfs_extent_busy_ag_cmp);
> > > > >  }
> > > > >  
> > > > > +bool xfs_extent_busy_list_empty(struct xfs_perag *pag);
> > > > > +
> > > > >  #endif /* __XFS_EXTENT_BUSY_H__ */
> > > > > 
> > > > > --
> > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > > the body of a message to majordomo@vger.kernel.org
> > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 05/14] xfs: repair free space btrees
  2018-08-01 18:39           ` Brian Foster
@ 2018-08-02  6:28             ` Darrick J. Wong
  2018-08-02 13:48               ` Brian Foster
  0 siblings, 1 reply; 53+ messages in thread
From: Darrick J. Wong @ 2018-08-02  6:28 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, david, allison.henderson

On Wed, Aug 01, 2018 at 02:39:20PM -0400, Brian Foster wrote:
> On Wed, Aug 01, 2018 at 09:23:16AM -0700, Darrick J. Wong wrote:
> > On Wed, Aug 01, 2018 at 07:54:09AM -0400, Brian Foster wrote:
> > > On Tue, Jul 31, 2018 at 03:01:25PM -0700, Darrick J. Wong wrote:
> > > > On Tue, Jul 31, 2018 at 01:47:23PM -0400, Brian Foster wrote:
> > > > > On Sun, Jul 29, 2018 at 10:48:21PM -0700, Darrick J. Wong wrote:
> > > > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > > 
> > > > > > Rebuild the free space btrees from the gaps in the rmap btree.
> > > > > > 
> > > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > > ---
> > > > > >  fs/xfs/Makefile             |    1 
> > > > > >  fs/xfs/scrub/alloc.c        |    1 
> > > > > >  fs/xfs/scrub/alloc_repair.c |  581 +++++++++++++++++++++++++++++++++++++++++++
> > > > > >  fs/xfs/scrub/common.c       |    8 +
> > > > > >  fs/xfs/scrub/repair.h       |    2 
> > > > > >  fs/xfs/scrub/scrub.c        |    4 
> > > > > >  fs/xfs/scrub/trace.h        |    2 
> > > > > >  fs/xfs/xfs_extent_busy.c    |   14 +
> > > > > >  fs/xfs/xfs_extent_busy.h    |    2 
> > > > > >  9 files changed, 610 insertions(+), 5 deletions(-)
> > > > > >  create mode 100644 fs/xfs/scrub/alloc_repair.c
> > > > > > 
> > > > > > 
> > > > > ...
> > > > > > diff --git a/fs/xfs/scrub/alloc_repair.c b/fs/xfs/scrub/alloc_repair.c
> > > > > > new file mode 100644
> > > > > > index 000000000000..b228c2906de2
> > > > > > --- /dev/null
> > > > > > +++ b/fs/xfs/scrub/alloc_repair.c
> > > > > > @@ -0,0 +1,581 @@
> ...
> > > > > > +
> > > > > > +/*
> > > > > > + * Make our new freespace btree roots permanent so that we can start freeing
> > > > > > + * unused space back into the AG.
> > > > > > + */
> > > > > > +STATIC int
> > > > > > +xrep_abt_commit_new(
> > > > > > +	struct xfs_scrub	*sc,
> > > > > > +	struct xfs_bitmap	*old_allocbt_blocks,
> > > > > > +	int			log_flags)
> > > > > > +{
> > > > > > +	int			error;
> > > > > > +
> > > > > > +	xfs_alloc_log_agf(sc->tp, sc->sa.agf_bp, log_flags);
> > > > > > +
> > > > > > +	/* Invalidate the old freespace btree blocks and commit. */
> > > > > > +	error = xrep_invalidate_blocks(sc, old_allocbt_blocks);
> > > > > > +	if (error)
> > > > > > +		return error;
> > > > > 
> > > > > It looks like the above invalidation all happens in the same
> > > > > transaction. Those aren't logging buffer data or anything, but any idea
> > > > > how many log formats we can get away with in this single transaction?
> > > > 
> > > > Hm... well, on my computer a log format is ~88 bytes.  Assuming 4K
> > > > blocks, the max AG size of 1TB, maximum free space fragmentation, and
> > > > two btrees, the tree could be up to ~270 million records.  Assuming ~505
> > > > records per block, that's ... ~531,000 leaf blocks and ~1100 node blocks
> > > > for both btrees.  If we invalidate both, that's ~46M of RAM?
> > > > 
> > > 
> > > I was thinking more about transaction reservation than RAM. It may not
> > 
> > Hmm.  tr_itruncate is ~650K on my 2TB SSD, assuming 88 bytes per, that's
> > about ... ~7300 log format items?  Not a lot, maybe it should roll the
> > transaction every 1000 invalidations or so...
> > 
> 
> I'm not really sure what categorizes as a lot here given that the blocks
> would need to be in-core, but rolling on some fixed/safe interval sounds
> reasonable to me.
> 
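For reference, the back-of-the-envelope numbers above can be reproduced
with the sketch below.  It is plain userspace C for illustration only,
not code from the patch.  The 88-byte log format size and the 505
records per block are the ballpark figures quoted in this thread, the
helper names are made up, and treating node blocks as having the same
fanout as leaf blocks is a simplifying assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Ballpark figures from the thread; not constants from the patch. */
#define LOG_FORMAT_BYTES	88ULL	/* approx. size of one log format */
#define RECS_PER_BLOCK		505ULL	/* approx. allocbt records per 4K block */

/* Blocks needed to hold nr_items at per_block items each, rounded up. */
static uint64_t blocks_for(uint64_t nr_items, uint64_t per_block)
{
	assert(per_block > 0);
	return (nr_items + per_block - 1) / per_block;
}

/* How many log format items fit in a reservation of resv_bytes. */
static uint64_t max_log_formats(uint64_t resv_bytes)
{
	return resv_bytes / LOG_FORMAT_BYTES;
}

/* Transaction rolls needed when rolling every 'interval' invalidations. */
static uint64_t rolls_needed(uint64_t nr_blocks, uint64_t interval)
{
	return blocks_for(nr_blocks, interval);
}
```

A 1TB AG of 4K blocks has 268,435,456 blocks.  Taking that as the record
count across both trees, blocks_for(268435456, RECS_PER_BLOCK) gives
531,556 leaf blocks.  A reservation of 650,000 bytes holds
max_log_formats(650000) = 7,386 log formats, so rolling every 1000
invalidations stays well inside it.
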
> > > currently be an issue, but it might be worth putting something down in a
> > > comment to note that this is a single transaction and we expect to not
> > > have to invalidate more than N (ballpark) blocks in a single go,
> > > whatever that value happens to be.
> > > 
> > > > > > +	error = xrep_roll_ag_trans(sc);
> > > > > > +	if (error)
> > > > > > +		return error;
> > > > > > +
> > > > > > +	/* Now that we've succeeded, mark the incore state valid again. */
> > > > > > +	sc->sa.pag->pagf_init = 1;
> > > > > > +	return 0;
> > > > > > +}
> > > > > > +
> > > > > > +/* Build new free space btrees and dispose of the old one. */
> > > > > > +STATIC int
> > > > > > +xrep_abt_rebuild_trees(
> > > > > > +	struct xfs_scrub	*sc,
> > > > > > +	struct list_head	*free_extents,
> > > > > > +	struct xfs_bitmap	*old_allocbt_blocks)
> > > > > > +{
> > > > > > +	struct xfs_owner_info	oinfo;
> > > > > > +	struct xrep_abt_extent	*rae;
> > > > > > +	struct xrep_abt_extent	*n;
> > > > > > +	struct xrep_abt_extent	*longest;
> > > > > > +	int			error;
> > > > > > +
> > > > > > +	xfs_rmap_skip_owner_update(&oinfo);
> > > > > > +
> > > > > > +	/*
> > > > > > +	 * Insert the longest free extent in case it's necessary to
> > > > > > +	 * refresh the AGFL with multiple blocks.  If there is no longest
> > > > > > +	 * extent, we had exactly the free space we needed; we're done.
> > > > > > +	 */
> > > > > 
> > > > > I'm confused by the last sentence. longest should only be NULL if the
> > > > > free space list is empty and haven't we already bailed out with -ENOSPC
> > > > > if that's the case?
> > > > > 
> > > > > > +	longest = xrep_abt_get_longest(free_extents);
> > > > 
> > > > xrep_abt_rebuild_trees is called after we allocate and initialize two
> > > > new btree roots in xrep_abt_reset_btrees.  If free_extents is an empty
> > > > list here, then we found exactly two blocks worth of free space and used
> > > > them to set up new btree roots.
> > > > 
> > > 
> > > Got it, thanks.
> > > 
> > > > > > +	if (!longest)
> > > > > > +		goto done;
> > > > > > +	error = xrep_abt_free_extent(sc,
> > > > > > +			XFS_AGB_TO_FSB(sc->mp, sc->sa.agno, longest->bno),
> > > > > > +			longest->len, &oinfo);
> > > > > > +	list_del(&longest->list);
> > > > > > +	kmem_free(longest);
> > > > > > +	if (error)
> > > > > > +		return error;
> > > > > > +
> > > > > > +	/* Insert records into the new btrees. */
> > > > > > +	list_for_each_entry_safe(rae, n, free_extents, list) {
> > > > > > +		error = xrep_abt_free_extent(sc,
> > > > > > +				XFS_AGB_TO_FSB(sc->mp, sc->sa.agno, rae->bno),
> > > > > > +				rae->len, &oinfo);
> > > > > > +		if (error)
> > > > > > +			return error;
> > > > > > +		list_del(&rae->list);
> > > > > > +		kmem_free(rae);
> > > > > > +	}
> > > > > 
> > > > > Ok, at this point we've reset the btree roots and we start freeing the
> > > > > free ranges that were discovered via the rmapbt analysis. AFAICT, if we
> > > > > fail or crash at this point, we leave the allocbts in a partially
> > > > > constructed state. I take it that is Ok with respect to the broader
> > > > > repair algorithm because we'd essentially start over by inspecting the
> > > > > rmapbt again on a retry.
> > > > 
> > > > Right.  Though in the crash/shutdown case, you'll end up with the
> > > > filesystem in an offline state at some point before you can retry the
> > > > scrub, it's probably faster to run xfs_repair to fix the damage.
> > > > 
> > > 
> > > Can we really assume that if we're already up and running an online
> > > repair? The filesystem has to be mountable in that case in the first
> > > place. If we've already reset and started reconstructing the allocation
> > > btrees then I'd think those transactions would recover just fine on a
> > > power loss or something (perhaps not in the event of some other
> > > corruption related shutdown).
> > 
> > Right, for the system crash case, whatever transactions committed should
> > replay just fine, and you can even start up the online repair again, and
> > if the AG isn't particularly close to ENOSPC then (barring rmap
> > corruption) it should work just fine.
> > 
> > If the fs went down because either (a) repair hit other corruption or
> > (b) some other thread hit an error in some other part of the filesystem,
> > then it's not so clear -- in (b) you could probably try again, but for
> > (a) you'll definitely have to unmount and run xfs_repair.
> > 
> 
> Indeed, there are certainly cases where we simply won't be able to do an
> online repair. I'm trying to think about scenarios where we should be
> able to do an online repair, but we lose power or hit some kind of
> transient error like a memory allocation failure before it completes. It
> would be nice if the online repair itself didn't contribute (within
> reason) to the inability to simply try again just because the fs was
> close to -ENOSPC.

Agreed.  Most of the, uh, opportunities to hit ENOMEM happen before we
start modifying on-disk metadata.  If that happens, we just free all the
memory and bail out having done nothing.
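
Roughly, the shape is "gather everything that needs memory first, then
modify".  Here is a toy userspace sketch of that pattern, not the repair
code itself; toy_disk, toy_alloc and toy_repair are made-up names, and
the allocation budget exists only so the failure path can be exercised.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for on-disk state; counts records written. */
struct toy_disk {
	int	nr_written;
};

/* Allocator with a budget so that ENOMEM can be simulated. */
static void *toy_alloc(size_t size, int *budget)
{
	if (*budget <= 0)
		return NULL;
	(*budget)--;
	return malloc(size);
}

/*
 * Phase 1 allocates every in-memory record.  If any allocation fails,
 * everything is freed and the "disk" is left untouched.  Phase 2 does
 * the modifications and allocates nothing.
 */
static int toy_repair(struct toy_disk *disk, int nr_recs, int alloc_budget)
{
	int	**recs;
	int	i, error = 0;

	assert(nr_recs > 0);

	/* Phase 1: gather.  calloc leaves unused slots NULL. */
	recs = calloc((size_t)nr_recs, sizeof(*recs));
	if (!recs)
		return -ENOMEM;
	for (i = 0; i < nr_recs; i++) {
		recs[i] = toy_alloc(sizeof(**recs), &alloc_budget);
		if (!recs[i]) {
			error = -ENOMEM;
			goto out;
		}
		*recs[i] = i;
	}

	/* Phase 2: commit.  No allocations from here on. */
	for (i = 0; i < nr_recs; i++)
		disk->nr_written++;
out:
	for (i = 0; i < nr_recs; i++)
		free(recs[i]);	/* free(NULL) is a no-op */
	free(recs);
	return error;
}
```

With a budget of 2 and 4 records, toy_repair() returns -ENOMEM and
nr_written stays at zero; with a budget of 4 it writes all four.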

> For one, I think it's potentially confusing behavior. Second, it might
> be concerning to regular users who perceive it as an online repair
> leaving the fs in a worse off state. Us fs devs know that may not really
> be the case, but I think we're better for addressing it if we can
> reasonably do so.

<nod> Further in the future I want to add the ability to offline an AG,
so the worst that happens is that scrub turns the AG off, repair doesn't
fix it, and the AG simply stays offline.  That might give us the
ability to survive cancelling the repair transaction, since if the AG's
offline already anyway we could just throw away the dirty buffers and
resurrect the AG later.  I don't know, that's purely speculative.

> > Perhaps the guideline here is that if the fs goes down more than once
> > during online repair then unmount it and run xfs_repair.
> > 
> 
> Yep, I think that makes sense if the filesystem or repair itself is
> tripping over other corruptions that fail to keep it active for the
> duration of the repair.

<nod>

> > > > > The blocks allocated for the btrees that we've begun to construct here
> > > > > end up mapped in the rmapbt as we go, right? IIUC, that means we don't
> > > > > necessarily have infinite retries to make sure this completes. IOW,
> > > > > suppose that a first repair attempt finds just enough free space to
> > > > > construct new trees, gets far enough along to consume most of that free
> > > > > space and then crashes. Is it possible that a subsequent repair attempt
> > > > > includes the btree blocks allocated during the previous failed repair
> > > > > attempt in the sum of "old btree blocks" and determines we don't have
> > > > > enough free space to repair?
> > > > 
> > > > Yes, that's a risk of running the free space repair.
> > > > 
> > > 
> > > Can we improve on that? For example, are the rmapbt entries for the old
> > > allocation btree blocks necessary once we commit the btree resets? If
> > > not, could we remove those entries before we start tree reconstruction?
> > > 
> > > Alternatively, could we incorporate use of the old btree blocks? As it
> > > is, we discover those blocks simply so we can free them at the end.
> > > Perhaps we could free them sooner or find a more clever means to
> > > reallocate directly from that in-core list? I guess we have to consider
> > > whether they were really valid/sane btree blocks, but either way ISTM
> > > that the old blocks list is essentially invalidated once we reset the
> > > btrees.
> > 
> > Hmm, it's a little tricky to do that -- we could reap the old bnobt and
> > cntbt blocks (in the old_allocbt_blocks bitmap) first, but if adding a
> > record causes a btree split we'll pull blocks from the AGFL, and if
> > there aren't enough blocks in the bnobt to fill the AGFL back up then
> > fix_freelist won't succeed.  That complication is why it finds the
> > longest extent in the unclaimed list and pushes that in first, then
> > works on the rest of the extents.
> > 
> 
> Hmm, but doesn't a btree split require at least one full space btree
> block per-level? In conjunction, the agfl minimum size requirement grows
> with the height of the tree, which implies available free space..? I
> could be missing something, perhaps we have to account for the rmapbt in
> that case as well? Regardless...
> 
> > I suppose one could try to avoid ENOSPC by pushing that longest extent
> > in first (since we know that won't trigger a split), then reap the old
> > alloc btree blocks, and then add everything else back in...
> > 
> 
> I think it would be reasonable to seed the btree with the longest record
> or some fixed number of longest records (~1/2 a root block, for example)
> before making actual use of the btrees to reap the old blocks. I think
> then you'd only have a very short window of a single block leak on a
> poorly timed power loss and repair retry sequence before you start
> actually freeing originally used space (which in practice, I think
> solves the problem).
> 
> Given that we're starting from empty, I wonder if another option may be
> to over fill the agfl with old btree blocks or something. The first real
> free should shift enough blocks back into the btrees to ensure the agfl
> can be managed from that point forward, right? That may be more work
> than it's worth though and/or a job for another patch. (FWIW, we also
> have that NOSHRINK agfl fixup flag for userspace repair.)

Yes, I'll give that a try tomorrow, now that I've finished porting all
the 4.19 stuff to xfsprogs. :)

Looping back to something we discussed earlier in this thread, I'd
prefer to hold off on converting the list of already-freed extents to
xfs_bitmap.  The same problem exists in all the repair functions: each
one has to store a large number of records for the rebuilt btree.  Maybe
there's some way to <cough> use pageable memory for that, since the
access patterns are append, sort, and iterate; for those three uses we
don't necessarily require all the records to be in memory all the time.
For the allocbt repair I expect the free space records to be far more
numerous than the list of old bnobt/cntbt blocks.
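
To show what I mean by the access pattern, here is a rough userspace
sketch of a record store that exposes only append, sort, and iterate.
It is not proposed kernel code; the toy_* names are made up, and the
backing here is an ordinary heap array.  The idea is only that callers
which stick to these three operations would not care if the backing
were swapped for something pageable.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical free space record. */
struct toy_rec {
	uint32_t	bno;
	uint32_t	len;
};

struct toy_recstore {
	struct toy_rec	*recs;
	size_t		nr;
	size_t		cap;
};

typedef int (*toy_rec_fn)(const struct toy_rec *rec, void *priv);

/* Append one record, growing the array as needed. */
static int toy_recstore_append(struct toy_recstore *rs, struct toy_rec rec)
{
	if (rs->nr == rs->cap) {
		size_t		newcap = rs->cap ? rs->cap * 2 : 16;
		struct toy_rec	*p;

		p = realloc(rs->recs, newcap * sizeof(*p));
		if (!p)
			return -1;
		rs->recs = p;
		rs->cap = newcap;
	}
	rs->recs[rs->nr++] = rec;
	return 0;
}

static int toy_rec_cmp(const void *a, const void *b)
{
	const struct toy_rec	*ra = a, *rb = b;

	if (ra->bno < rb->bno)
		return -1;
	return ra->bno > rb->bno;
}

/* Sort all records by starting block. */
static void toy_recstore_sort(struct toy_recstore *rs)
{
	if (rs->nr > 1)
		qsort(rs->recs, rs->nr, sizeof(*rs->recs), toy_rec_cmp);
}

/* Call fn on each record in order; stop at the first nonzero return. */
static int toy_recstore_iterate(const struct toy_recstore *rs,
				toy_rec_fn fn, void *priv)
{
	size_t	i;
	int	error;

	assert(fn != NULL);
	for (i = 0; i < rs->nr; i++) {
		error = fn(&rs->recs[i], priv);
		if (error)
			return error;
	}
	return 0;
}

static void toy_recstore_destroy(struct toy_recstore *rs)
{
	free(rs->recs);
	rs->recs = NULL;
	rs->nr = rs->cap = 0;
}

/* Example callback: add up the lengths into a uint64_t total. */
static int toy_sum_len(const struct toy_rec *rec, void *priv)
{
	uint64_t	*total = priv;

	*total += rec->len;
	return 0;
}
```

Nothing outside the store indexes the array directly, which is what
would leave room to change how the records are held.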

--D

> Brian
> 
> > --D
> > 
> > > Brian
> > > 
> > > > > > +
> > > > > > +done:
> > > > > > +	/* Free all the OWN_AG blocks that are not in the rmapbt/agfl. */
> > > > > > +	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_AG);
> > > > > > +	return xrep_reap_extents(sc, old_allocbt_blocks, &oinfo,
> > > > > > +			XFS_AG_RESV_NONE);
> > > > > > +}
> > > > > > +
> > > > > ...
> > > > > > diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c
> > > > > > index 0ed68379e551..82f99633a597 100644
> > > > > > --- a/fs/xfs/xfs_extent_busy.c
> > > > > > +++ b/fs/xfs/xfs_extent_busy.c
> > > > > > @@ -657,3 +657,17 @@ xfs_extent_busy_ag_cmp(
> > > > > >  		diff = b1->bno - b2->bno;
> > > > > >  	return diff;
> > > > > >  }
> > > > > > +
> > > > > > +/* Are there any busy extents in this AG? */
> > > > > > +bool
> > > > > > +xfs_extent_busy_list_empty(
> > > > > > +	struct xfs_perag	*pag)
> > > > > > +{
> > > > > > +	spin_lock(&pag->pagb_lock);
> > > > > > +	if (pag->pagb_tree.rb_node) {
> > > > > 
> > > > > RB_EMPTY_ROOT()?
> > > > 
> > > > Good suggestion, thank you!
> > > > 
> > > > --D
> > > > 
> > > > > Brian
> > > > > 
> > > > > > +		spin_unlock(&pag->pagb_lock);
> > > > > > +		return false;
> > > > > > +	}
> > > > > > +	spin_unlock(&pag->pagb_lock);
> > > > > > +	return true;
> > > > > > +}
> > > > > > diff --git a/fs/xfs/xfs_extent_busy.h b/fs/xfs/xfs_extent_busy.h
> > > > > > index 990ab3891971..2f8c73c712c6 100644
> > > > > > --- a/fs/xfs/xfs_extent_busy.h
> > > > > > +++ b/fs/xfs/xfs_extent_busy.h
> > > > > > @@ -65,4 +65,6 @@ static inline void xfs_extent_busy_sort(struct list_head *list)
> > > > > >  	list_sort(NULL, list, xfs_extent_busy_ag_cmp);
> > > > > >  }
> > > > > >  
> > > > > > +bool xfs_extent_busy_list_empty(struct xfs_perag *pag);
> > > > > > +
> > > > > >  #endif /* __XFS_EXTENT_BUSY_H__ */
> > > > > > 
> > > > > > --
> > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > > > the body of a message to majordomo@vger.kernel.org
> > > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > > --
> > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > > the body of a message to majordomo@vger.kernel.org
> > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > the body of a message to majordomo@vger.kernel.org
> > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 05/14] xfs: repair free space btrees
  2018-08-02  6:28             ` Darrick J. Wong
@ 2018-08-02 13:48               ` Brian Foster
  2018-08-02 19:22                 ` Darrick J. Wong
  0 siblings, 1 reply; 53+ messages in thread
From: Brian Foster @ 2018-08-02 13:48 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, david, allison.henderson

On Wed, Aug 01, 2018 at 11:28:45PM -0700, Darrick J. Wong wrote:
> On Wed, Aug 01, 2018 at 02:39:20PM -0400, Brian Foster wrote:
> > On Wed, Aug 01, 2018 at 09:23:16AM -0700, Darrick J. Wong wrote:
> > > On Wed, Aug 01, 2018 at 07:54:09AM -0400, Brian Foster wrote:
> > > > On Tue, Jul 31, 2018 at 03:01:25PM -0700, Darrick J. Wong wrote:
> > > > > On Tue, Jul 31, 2018 at 01:47:23PM -0400, Brian Foster wrote:
> > > > > > On Sun, Jul 29, 2018 at 10:48:21PM -0700, Darrick J. Wong wrote:
> > > > > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > > > 
> > > > > > > Rebuild the free space btrees from the gaps in the rmap btree.
> > > > > > > 
> > > > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > > > ---
> > > > > > >  fs/xfs/Makefile             |    1 
> > > > > > >  fs/xfs/scrub/alloc.c        |    1 
> > > > > > >  fs/xfs/scrub/alloc_repair.c |  581 +++++++++++++++++++++++++++++++++++++++++++
> > > > > > >  fs/xfs/scrub/common.c       |    8 +
> > > > > > >  fs/xfs/scrub/repair.h       |    2 
> > > > > > >  fs/xfs/scrub/scrub.c        |    4 
> > > > > > >  fs/xfs/scrub/trace.h        |    2 
> > > > > > >  fs/xfs/xfs_extent_busy.c    |   14 +
> > > > > > >  fs/xfs/xfs_extent_busy.h    |    2 
> > > > > > >  9 files changed, 610 insertions(+), 5 deletions(-)
> > > > > > >  create mode 100644 fs/xfs/scrub/alloc_repair.c
> > > > > > > 
> > > > > > > 
> > > > > > ...
> > > > > > > diff --git a/fs/xfs/scrub/alloc_repair.c b/fs/xfs/scrub/alloc_repair.c
> > > > > > > new file mode 100644
> > > > > > > index 000000000000..b228c2906de2
> > > > > > > --- /dev/null
> > > > > > > +++ b/fs/xfs/scrub/alloc_repair.c
> > > > > > > @@ -0,0 +1,581 @@
> > ...
> > > > > > > +
> > > > > > > +/*
> > > > > > > + * Make our new freespace btree roots permanent so that we can start freeing
> > > > > > > + * unused space back into the AG.
> > > > > > > + */
> > > > > > > +STATIC int
> > > > > > > +xrep_abt_commit_new(
> > > > > > > +	struct xfs_scrub	*sc,
> > > > > > > +	struct xfs_bitmap	*old_allocbt_blocks,
> > > > > > > +	int			log_flags)
> > > > > > > +{
> > > > > > > +	int			error;
> > > > > > > +
> > > > > > > +	xfs_alloc_log_agf(sc->tp, sc->sa.agf_bp, log_flags);
> > > > > > > +
> > > > > > > +	/* Invalidate the old freespace btree blocks and commit. */
> > > > > > > +	error = xrep_invalidate_blocks(sc, old_allocbt_blocks);
> > > > > > > +	if (error)
> > > > > > > +		return error;
> > > > > > 
> > > > > > It looks like the above invalidation all happens in the same
> > > > > > transaction. Those aren't logging buffer data or anything, but any idea
> > > > > > how many log formats we can get away with in this single transaction?
> > > > > 
> > > > > Hm... well, on my computer a log format is ~88 bytes.  Assuming 4K
> > > > > blocks, the max AG size of 1TB, maximum free space fragmentation, and
> > > > > two btrees, the tree could be up to ~270 million records.  Assuming ~505
> > > > > records per block, that's ... ~531,000 leaf blocks and ~1100 node blocks
> > > > > for both btrees.  If we invalidate both, that's ~46M of RAM?
> > > > > 
> > > > 
> > > > I was thinking more about transaction reservation than RAM. It may not
> > > 
> > > Hmm.  tr_itruncate is ~650K on my 2TB SSD, assuming 88 bytes per, that's
> > > about ... ~7300 log format items?  Not a lot, maybe it should roll the
> > > transaction every 1000 invalidations or so...
> > > 
> > 
> > I'm not really sure what categorizes as a lot here given that the blocks
> > would need to be in-core, but rolling on some fixed/safe interval sounds
> > reasonable to me.
> > 
> > > > currently be an issue, but it might be worth putting something down in a
> > > > comment to note that this is a single transaction and we expect to not
> > > > have to invalidate more than N (ballpark) blocks in a single go,
> > > > whatever that value happens to be.
> > > > 
> > > > > > > +	error = xrep_roll_ag_trans(sc);
> > > > > > > +	if (error)
> > > > > > > +		return error;
> > > > > > > +
> > > > > > > +	/* Now that we've succeeded, mark the incore state valid again. */
> > > > > > > +	sc->sa.pag->pagf_init = 1;
> > > > > > > +	return 0;
> > > > > > > +}
> > > > > > > +
> > > > > > > +/* Build new free space btrees and dispose of the old one. */
> > > > > > > +STATIC int
> > > > > > > +xrep_abt_rebuild_trees(
> > > > > > > +	struct xfs_scrub	*sc,
> > > > > > > +	struct list_head	*free_extents,
> > > > > > > +	struct xfs_bitmap	*old_allocbt_blocks)
> > > > > > > +{
> > > > > > > +	struct xfs_owner_info	oinfo;
> > > > > > > +	struct xrep_abt_extent	*rae;
> > > > > > > +	struct xrep_abt_extent	*n;
> > > > > > > +	struct xrep_abt_extent	*longest;
> > > > > > > +	int			error;
> > > > > > > +
> > > > > > > +	xfs_rmap_skip_owner_update(&oinfo);
> > > > > > > +
> > > > > > > +	/*
> > > > > > > +	 * Insert the longest free extent in case it's necessary to
> > > > > > > +	 * refresh the AGFL with multiple blocks.  If there is no longest
> > > > > > > +	 * extent, we had exactly the free space we needed; we're done.
> > > > > > > +	 */
> > > > > > 
> > > > > > I'm confused by the last sentence. longest should only be NULL if the
> > > > > > free space list is empty and haven't we already bailed out with -ENOSPC
> > > > > > if that's the case?
> > > > > > 
> > > > > > > +	longest = xrep_abt_get_longest(free_extents);
> > > > > 
> > > > > xrep_abt_rebuild_trees is called after we allocate and initialize two
> > > > > new btree roots in xrep_abt_reset_btrees.  If free_extents is an empty
> > > > > list here, then we found exactly two blocks worth of free space and used
> > > > > them to set up new btree roots.
> > > > > 
> > > > 
> > > > Got it, thanks.
> > > > 
> > > > > > > +	if (!longest)
> > > > > > > +		goto done;
> > > > > > > +	error = xrep_abt_free_extent(sc,
> > > > > > > +			XFS_AGB_TO_FSB(sc->mp, sc->sa.agno, longest->bno),
> > > > > > > +			longest->len, &oinfo);
> > > > > > > +	list_del(&longest->list);
> > > > > > > +	kmem_free(longest);
> > > > > > > +	if (error)
> > > > > > > +		return error;
> > > > > > > +
> > > > > > > +	/* Insert records into the new btrees. */
> > > > > > > +	list_for_each_entry_safe(rae, n, free_extents, list) {
> > > > > > > +		error = xrep_abt_free_extent(sc,
> > > > > > > +				XFS_AGB_TO_FSB(sc->mp, sc->sa.agno, rae->bno),
> > > > > > > +				rae->len, &oinfo);
> > > > > > > +		if (error)
> > > > > > > +			return error;
> > > > > > > +		list_del(&rae->list);
> > > > > > > +		kmem_free(rae);
> > > > > > > +	}
> > > > > > 
> > > > > > Ok, at this point we've reset the btree roots and we start freeing the
> > > > > > free ranges that were discovered via the rmapbt analysis. AFAICT, if we
> > > > > > fail or crash at this point, we leave the allocbts in a partially
> > > > > > constructed state. I take it that is Ok with respect to the broader
> > > > > > repair algorithm because we'd essentially start over by inspecting the
> > > > > > rmapbt again on a retry.
> > > > > 
> > > > > Right.  Though in the crash/shutdown case, you'll end up with the
> > > > > filesystem in an offline state at some point before you can retry the
> > > > > scrub, it's probably faster to run xfs_repair to fix the damage.
> > > > > 
> > > > 
> > > > Can we really assume that if we're already up and running an online
> > > > repair? The filesystem has to be mountable in that case in the first
> > > > place. If we've already reset and started reconstructing the allocation
> > > > btrees then I'd think those transactions would recover just fine on a
> > > > power loss or something (perhaps not in the event of some other
> > > > corruption related shutdown).
> > > 
> > > Right, for the system crash case, whatever transactions committed should
> > > replay just fine, and you can even start up the online repair again, and
> > > if the AG isn't particularly close to ENOSPC then (barring rmap
> > > corruption) it should work just fine.
> > > 
> > > If the fs went down because either (a) repair hit other corruption or
> > > (b) some other thread hit an error in some other part of the filesystem,
> > > then it's not so clear -- in (b) you could probably try again, but for
> > > (a) you'll definitely have to unmount and run xfs_repair.
> > > 
> > 
> > Indeed, there are certainly cases where we simply won't be able to do an
> > online repair. I'm trying to think about scenarios where we should be
> > able to do an online repair, but we lose power or hit some kind of
> > transient error like a memory allocation failure before it completes. It
> > would be nice if the online repair itself didn't contribute (within
> > reason) to the inability to simply try again just because the fs was
> > close to -ENOSPC.
> 
> Agreed.  Most of the, uh, opportunities to hit ENOMEM happen before we
> start modifying on-disk metadata.  If that happens, we just free all the
> memory and bail out having done nothing.
> 
> > For one, I think it's potentially confusing behavior. Second, it might
> > be concerning to regular users who perceive it as an online repair
> > leaving the fs in a worse off state. Us fs devs know that may not really
> > be the case, but I think we're better for addressing it if we can
> > reasonably do so.
> 
> <nod> Further in the future I want to add the ability to offline an AG,
> so the worst that happens is that scrub turns the AG off, repair doesn't
> fix it, and the AG simply stays offline.  That might give us the
> ability to survive cancelling the repair transaction, since if the AG's
> offline already anyway we could just throw away the dirty buffers and
> resurrect the AG later.  I don't know, that's purely speculative.
> 
> > > Perhaps the guideline here is that if the fs goes down more than once
> > > during online repair then unmount it and run xfs_repair.
> > > 
> > 
> > Yep, I think that makes sense if the filesystem or repair itself is
> > tripping over other corruptions that fail to keep it active for the
> > duration of the repair.
> 
> <nod>
> 
> > > > > > The blocks allocated for the btrees that we've begun to construct here
> > > > > > end up mapped in the rmapbt as we go, right? IIUC, that means we don't
> > > > > > necessarily have infinite retries to make sure this completes. IOW,
> > > > > > suppose that a first repair attempt finds just enough free space to
> > > > > > construct new trees, gets far enough along to consume most of that free
> > > > > > space and then crashes. Is it possible that a subsequent repair attempt
> > > > > > includes the btree blocks allocated during the previous failed repair
> > > > > > attempt in the sum of "old btree blocks" and determines we don't have
> > > > > > enough free space to repair?
> > > > > 
> > > > > Yes, that's a risk of running the free space repair.
> > > > > 
> > > > 
> > > > Can we improve on that? For example, are the rmapbt entries for the old
> > > > allocation btree blocks necessary once we commit the btree resets? If
> > > > not, could we remove those entries before we start tree reconstruction?
> > > > 
> > > > Alternatively, could we incorporate use of the old btree blocks? As it
> > > > is, we discover those blocks simply so we can free them at the end.
> > > > Perhaps we could free them sooner or find a more clever means to
> > > > reallocate directly from that in-core list? I guess we have to consider
> > > > whether they were really valid/sane btree blocks, but either way ISTM
> > > > that the old blocks list is essentially invalidated once we reset the
> > > > btrees.
> > > 
> > > Hmm, it's a little tricky to do that -- we could reap the old bnobt and
> > > cntbt blocks (in the old_allocbt_blocks bitmap) first, but if adding a
> > > record causes a btree split we'll pull blocks from the AGFL, and if
> > > there aren't enough blocks in the bnobt to fill the AGFL back up then
> > > fix_freelist won't succeed.  That complication is why it finds the
> > > longest extent in the unclaimed list and pushes that in first, then
> > > works on the rest of the extents.
> > > 
> > 
> > Hmm, but doesn't a btree split require at least one full space btree
> > block per-level? In conjunction, the agfl minimum size requirement grows
> > with the height of the tree, which implies available free space..? I
> > could be missing something, perhaps we have to account for the rmapbt in
> > that case as well? Regardless...
> > 
> > > I suppose one could try to avoid ENOSPC by pushing that longest extent
> > > in first (since we know that won't trigger a split), then reap the old
> > > alloc btree blocks, and then add everything else back in...
> > > 
> > 
> > I think it would be reasonable to seed the btree with the longest record
> > or some fixed number of longest records (~1/2 a root block, for example)
> > before making actual use of the btrees to reap the old blocks. I think
> > then you'd only have a very short window of a single block leak on a
> > poorly timed power loss and repair retry sequence before you start
> > actually freeing originally used space (which in practice, I think
> > solves the problem).
> > 
> > Given that we're starting from empty, I wonder if another option may be
> > to over fill the agfl with old btree blocks or something. The first real
> > free should shift enough blocks back into the btrees to ensure the agfl
> > can be managed from that point forward, right? That may be more work
> > than it's worth though and/or a job for another patch. (FWIW, we also
> > have that NOSHRINK agfl fixup flag for userspace repair.)
> 
> Yes, I'll give that a try tomorrow, now that I've finished porting all
> the 4.19 stuff to xfsprogs. :)
> 
> Looping back to something we discussed earlier in this thread, I'd
> prefer to hold off on converting the list of already-freed extents to
> xfs_bitmap because the same problem exists in all the repair functions
> of having to store a large number of records for the rebuilt btree, and
> maybe there's some way to <cough> use pageable memory for that, since
> the access patterns for that are append, sort, and iterate; for those
> three uses we don't necessarily require all the records to be in memory
> all the time.  For the allocbt repair I expect the free space records to
> be far more numerous than the list of old bnobt/cntbt blocks.
> 

Ok, it's fair enough that we'll probably want to find some kind of
generic, more efficient technique for handling this across the various
applicable repair algorithms.

One other high level thing that crossed my mind with regard to the
general btree reconstruction algorithms is whether we need to build up
this kind of central record list at all. For example, rather than slurp
up the entire list of btree records in-core, sort it and dump it back
out, could we take advantage of the fact that our existing on-disk
structure insertion mechanisms already handle out of order records
(simply stated, an extent free knows how to insert the associated record
at the right place in the space btrees)? For example, suppose we reset
the existing btrees first, then scanned the rmapbt and repopulated the
new btrees as records are discovered..?

The obvious problem is that we still have some checks that allow the
whole repair operation to bail out before we determine whether we can
start to rebuild the on-disk btrees. These are things like making sure
we can actually read the associated rmapbt blocks (i.e., no read errors
or verifier failures), basic record sanity checks, etc. But ISTM that
isn't anything we couldn't get around with a multi-pass implementation.
Secondary issues might be things like no longer being able to easily
insert the longest free extent range(s) first (meaning we'd have to
stuff the agfl with old btree blocks or figure out some other approach).
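
As a toy illustration of the two approaches contrasted above (collect,
sort, then insert versus insert each record as it is discovered), here
is a minimal userspace sketch. The toy_* names and the flat array are
assumptions made up for the example; none of this is the kernel's btree
code. It only shows that both orders of operation end up with the same
sorted record set.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical free-space record; not a kernel structure. */
struct toy_rec {
	uint32_t	bno;
	uint32_t	len;
};

static int toy_rec_cmp(const void *a, const void *b)
{
	const struct toy_rec	*ra = a;
	const struct toy_rec	*rb = b;

	if (ra->bno < rb->bno)
		return -1;
	return ra->bno > rb->bno;
}

/* Approach A: slurp every record into memory, then sort once. */
static void toy_collect_and_sort(const struct toy_rec *in, size_t n,
		struct toy_rec *out)
{
	memcpy(out, in, n * sizeof(*in));
	qsort(out, n, sizeof(*out), toy_rec_cmp);
}

/* Approach B: place each record where it belongs as it is found. */
static void toy_insert_as_found(const struct toy_rec *in, size_t n,
		struct toy_rec *out)
{
	size_t	used, pos;

	for (used = 0; used < n; used++) {
		/* Shift larger records up to open a slot. */
		for (pos = used;
		     pos > 0 && out[pos - 1].bno > in[used].bno;
		     pos--)
			out[pos] = out[pos - 1];
		out[pos] = in[used];
	}
}

/* Do both approaches yield the same ordering (unique bno assumed)? */
static int toy_same_result(const struct toy_rec *in, size_t n)
{
	struct toy_rec	a[16], b[16];

	if (n > 16)
		return 0;
	toy_collect_and_sort(in, n, a);
	toy_insert_as_found(in, n, b);
	return memcmp(a, b, n * sizeof(*in)) == 0;
}
```

The trade-off is where the cost lands: approach A needs memory for the
whole list up front, while approach B pays per insertion and cannot
easily pick the longest extent first.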

BTW, isn't the typical scrub sequence already multi-pass by virtue of
the xfs_scrub_metadata() implementation? I'm wondering if the ->scrub()
callout could not only detect corruption, but validate whether repair
(if requested) is possible based on the kind of checks that are
currently in the repair side rmapbt walkers. Thoughts? Are there future
changes that are better supported by an in-core tracking structure in
general (assuming we'll eventually replace the linked lists with
something more efficient) as opposed to attempting to optimize out the
need for that tracking at all?

Brian

> --D
> 
> > Brian
> > 
> > > --D
> > > 
> > > > Brian
> > > > 
> > > > > > > +
> > > > > > > +done:
> > > > > > > +	/* Free all the OWN_AG blocks that are not in the rmapbt/agfl. */
> > > > > > > +	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_AG);
> > > > > > > +	return xrep_reap_extents(sc, old_allocbt_blocks, &oinfo,
> > > > > > > +			XFS_AG_RESV_NONE);
> > > > > > > +}
> > > > > > > +
> > > > > > ...
> > > > > > > diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c
> > > > > > > index 0ed68379e551..82f99633a597 100644
> > > > > > > --- a/fs/xfs/xfs_extent_busy.c
> > > > > > > +++ b/fs/xfs/xfs_extent_busy.c
> > > > > > > @@ -657,3 +657,17 @@ xfs_extent_busy_ag_cmp(
> > > > > > >  		diff = b1->bno - b2->bno;
> > > > > > >  	return diff;
> > > > > > >  }
> > > > > > > +
> > > > > > > +/* Are there any busy extents in this AG? */
> > > > > > > +bool
> > > > > > > +xfs_extent_busy_list_empty(
> > > > > > > +	struct xfs_perag	*pag)
> > > > > > > +{
> > > > > > > +	spin_lock(&pag->pagb_lock);
> > > > > > > +	if (pag->pagb_tree.rb_node) {
> > > > > > 
> > > > > > RB_EMPTY_ROOT()?
> > > > > 
> > > > > Good suggestion, thank you!
> > > > > 
> > > > > --D
> > > > > 
> > > > > > Brian
> > > > > > 
> > > > > > > +		spin_unlock(&pag->pagb_lock);
> > > > > > > +		return false;
> > > > > > > +	}
> > > > > > > +	spin_unlock(&pag->pagb_lock);
> > > > > > > +	return true;
> > > > > > > +}
> > > > > > > diff --git a/fs/xfs/xfs_extent_busy.h b/fs/xfs/xfs_extent_busy.h
> > > > > > > index 990ab3891971..2f8c73c712c6 100644
> > > > > > > --- a/fs/xfs/xfs_extent_busy.h
> > > > > > > +++ b/fs/xfs/xfs_extent_busy.h
> > > > > > > @@ -65,4 +65,6 @@ static inline void xfs_extent_busy_sort(struct list_head *list)
> > > > > > >  	list_sort(NULL, list, xfs_extent_busy_ag_cmp);
> > > > > > >  }
> > > > > > >  
> > > > > > > +bool xfs_extent_busy_list_empty(struct xfs_perag *pag);
> > > > > > > +
> > > > > > >  #endif /* __XFS_EXTENT_BUSY_H__ */
> > > > > > > 
> > > > > > > --
> > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > > > > the body of a message to majordomo@vger.kernel.org
> > > > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html

* Re: [PATCH 06/14] xfs: repair inode btrees
  2018-07-30  5:48 ` [PATCH 06/14] xfs: repair inode btrees Darrick J. Wong
@ 2018-08-02 14:54   ` Brian Foster
  2018-11-06  2:16     ` Darrick J. Wong
  0 siblings, 1 reply; 53+ messages in thread
From: Brian Foster @ 2018-08-02 14:54 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, david, allison.henderson

On Sun, Jul 29, 2018 at 10:48:28PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Use the rmapbt to find inode chunks, query the chunks to compute
> hole and free masks, and with that information rebuild the inobt
> and finobt.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/Makefile              |    1 
>  fs/xfs/scrub/common.c        |    2 
>  fs/xfs/scrub/ialloc_repair.c |  673 ++++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/scrub/repair.c        |   20 +
>  fs/xfs/scrub/repair.h        |   11 +
>  fs/xfs/scrub/scrub.c         |    4 
>  fs/xfs/scrub/scrub.h         |    1 
>  fs/xfs/scrub/trace.h         |    4 
>  8 files changed, 712 insertions(+), 4 deletions(-)
>  create mode 100644 fs/xfs/scrub/ialloc_repair.c
> 
> 
...
> diff --git a/fs/xfs/scrub/ialloc_repair.c b/fs/xfs/scrub/ialloc_repair.c
> new file mode 100644
> index 000000000000..126135c1a147
> --- /dev/null
> +++ b/fs/xfs/scrub/ialloc_repair.c
> @@ -0,0 +1,673 @@
...
> +
> +/*
> + * Inode Btree Repair
> + * ==================
> + *
> + * A quick refresher of inode btrees on a v5 filesystem:
> + *
> + * - Each inode btree record can describe a single 'inode chunk'.  The chunk
> + *   size is defined to be 64 inodes.  If sparse inodes are enabled, every
> + *   inobt record must be aligned to the chunk size.  A chunk can be smaller
> + *   than a fs block.  One must be careful with 64k-block filesystems whose
> + *   inodes are smaller than 1k.
> + *
> + * - Inode buffers are read into memory in units of 'inode clusters'.  However
> + *   many inodes fit in a cluster buffer is the smallest number of inodes that
> + *   can be allocated or freed.  Clusters are never larger than a chunk and
> + *   never smaller than a fs block.  If sparse inodes are not enabled, then
> + *   records can be aligned to a cluster.
> + *

I find the wording around alignment in the above two sections a little
confusing. We distinguish between sparse=0/1 on some points but not
others. For example, the cluster buffer is described as the smallest
possible allocation unit of inodes, but IIUC that is only the case with
sparse=1.

My general understanding is that inode records should always be aligned
to sb_inoalignmt, regardless of sparse inodes. For non-sparse
filesystems, this value can be smaller than the chunk size. For sparse
filesystems, it must match the chunk size and sb_spino_align defines the
sparse chunk allocation alignment, which must match the cluster size.
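
To make that reading concrete, here is a small sketch that encodes the
rules exactly as stated in the paragraph above. It is only my
understanding written down as code: the struct, its field names and the
toy_* helpers are hypothetical, all values are in fs blocks, and it
assumes a chunk spans at least one block.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified inode geometry; units are fs blocks. */
struct toy_ino_geom {
	bool		sparse;		/* sparse inodes enabled? */
	uint32_t	inoalignmt;	/* stands in for sb_inoalignmt */
	uint32_t	spino_align;	/* stands in for sb_spino_align */
	uint32_t	chunk_blocks;	/* blocks per 64-inode chunk */
	uint32_t	cluster_blocks;	/* blocks per inode cluster */
};

/* Is the geometry self-consistent under the rules described above? */
static bool toy_geom_ok(const struct toy_ino_geom *g)
{
	if (g->inoalignmt == 0)
		return false;
	if (!g->sparse)
		return true;	/* alignment may be below chunk size */
	return g->inoalignmt == g->chunk_blocks &&
	       g->spino_align == g->cluster_blocks;
}

/* Inode records start on an inoalignmt boundary in either mode. */
static bool toy_rec_start_aligned(const struct toy_ino_geom *g,
		uint32_t agbno)
{
	return g->inoalignmt != 0 && agbno % g->inoalignmt == 0;
}

/* Sparse allocations start on a spino_align boundary. */
static bool toy_sparse_alloc_aligned(const struct toy_ino_geom *g,
		uint32_t agbno)
{
	return g->sparse && g->spino_align != 0 &&
	       agbno % g->spino_align == 0;
}
```

With 8-block chunks and 4-block clusters, block 12 would be a valid
record start on a non-sparse fs with inoalignmt = 4, but only a valid
sparse allocation start (not a record start) on a sparse fs.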

> + * - If sparse inodes are enabled, the holemask field will be active.  Each
> + *   bit of the holemask represents 4 potential inodes; if set, the
> + *   corresponding space does *not* contain inodes and must be left alone.
> + *
> + * So what's the rebuild algorithm?
> + *
> + * Iterate the reverse mapping records looking for OWN_INODES and OWN_INOBT
> + * records.  The OWN_INOBT records are the old inode btree blocks and will be
> + * cleared out after we've rebuilt the tree.  Each possible inode chunk within
> + * an OWN_INODES record will be read in and the freemask calculated from the
> + * i_mode data in the inode chunk.  For sparse inodes the holemask will be
> + * calculated by creating the properly aligned inobt record and punching out
> + * any chunk that's missing.  Inode allocations and frees grab the AGI first,
> + * so repair protects itself from concurrent access by locking the AGI.
> + *
> + * Once we've reconstructed all the inode records, we can create new inode
> + * btree roots and reload the btrees.  We rebuild both inode trees at the same
> + * time because they have the same rmap owner and it would be more complex to
> + * figure out if the other tree isn't in need of a rebuild and which OWN_INOBT
> + * blocks it owns.  We have all the data we need to build both, so dump
> + * everything and start over.
> + *
> + * We use the prefix 'xrep_ibt' because we rebuild both inode btrees.
> + */
> +
> +struct xrep_ibt_extent {
> +	struct list_head	list;
> +	xfs_inofree_t		freemask;
> +	xfs_agino_t		startino;
> +	unsigned int		count;
> +	unsigned int		usedcount;
> +	uint16_t		holemask;

I'm curious why we wouldn't just reuse xfs_inobt_rec_incore here.

> +};
> +
...
> +
> +/*
> + * For each inode cluster covering the physical extent recorded by the rmapbt,
> + * we must calculate the properly aligned startino of that cluster, then
> + * iterate each cluster to fill in used and filled masks appropriately.  We
> + * then use the (startino, used, filled) information to construct the
> + * appropriate inode records.
> + */
> +STATIC int
> +xrep_ibt_process_cluster(
> +	struct xrep_ibt		*ri,
> +	xfs_agblock_t		agbno,
> +	int			blks_per_cluster,
> +	xfs_agino_t		rec_agino)
> +{
> +	struct xfs_imap		imap;
> +	struct xrep_ibt_extent	*rie;
> +	struct xfs_dinode	*dip;
> +	struct xfs_buf		*bp;
> +	struct xfs_scrub	*sc = ri->sc;
> +	struct xfs_mount	*mp = sc->mp;
> +	xfs_ino_t		fsino;
> +	xfs_inofree_t		usedmask;
> +	xfs_agino_t		nr_inodes;
> +	xfs_agino_t		startino;
> +	xfs_agino_t		clusterino;
> +	xfs_agino_t		clusteroff;
> +	xfs_agino_t		agino;
> +	uint16_t		fillmask;
> +	bool			inuse;
> +	int			usedcount;
> +	int			error;
> +
> +	/* The per-AG inum of this inode cluster. */
> +	agino = XFS_OFFBNO_TO_AGINO(mp, agbno, 0);
> +
> +	/* The per-AG inum of the inobt record. */
> +	startino = rec_agino + rounddown(agino - rec_agino,
> +			XFS_INODES_PER_CHUNK);

Hmm, I'm not following what this does. When does startino != rec_agino
here? Is this related to the multi-chunk-per-block case on large block
sizes, since I'm not quite following how we handle that case either...?
Don't we need to factor in inodes_per_block somewhere in here to cover
the multi-chunk case?

BTW, that second line could use another indent or two to clarify it's
part of the rounddown() call.
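
For reference, here is the arithmetic of that startino computation
pulled out into a standalone sketch. The helper names are made up and
the chunk size is hardcoded to 64; it only shows what values the
expression produces, not why the patch needs them. By the arithmetic
alone, startino differs from rec_agino once the cluster sits 64 or more
inodes past rec_agino.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_INODES_PER_CHUNK	64u

/* Round x down to a multiple of y (y must be nonzero). */
static uint32_t toy_rounddown(uint32_t x, uint32_t y)
{
	return x - (x % y);
}

/*
 * Mirror of the quoted expression:
 *   startino = rec_agino + rounddown(agino - rec_agino, 64)
 * Assumes agino >= rec_agino.
 */
static uint32_t toy_chunk_startino(uint32_t rec_agino, uint32_t agino)
{
	return rec_agino + toy_rounddown(agino - rec_agino,
					 TOY_INODES_PER_CHUNK);
}
```

For example, with rec_agino = 128, a cluster at agino 160 maps back to
startino 128, while a cluster at agino 200 maps to startino 192.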

> +
> +	/* The per-AG inum of the cluster within the inobt record. */
> +	clusteroff = agino - startino;
> +
> +	/* Every inode in this holemask slot is filled. */
> +	nr_inodes = XFS_OFFBNO_TO_AGINO(mp, blks_per_cluster, 0);
> +	fillmask = xfs_inobt_maskn(clusteroff / XFS_INODES_PER_HOLEMASK_BIT,
> +			nr_inodes / XFS_INODES_PER_HOLEMASK_BIT);
> +
> +	/*
> +	 * Grab the inode cluster buffer.  This is safe to do with a broken
> +	 * inobt because imap_to_bp directly maps the buffer without touching
> +	 * either inode btree.
> +	 */
> +	imap.im_blkno = XFS_AGB_TO_DADDR(mp, sc->sa.agno, agbno);
> +	imap.im_len = XFS_FSB_TO_BB(mp, blks_per_cluster);
> +	imap.im_boffset = 0;
> +	error = xfs_imap_to_bp(mp, sc->tp, &imap, &dip, &bp, 0,
> +			XFS_IGET_UNTRUSTED);
> +	if (error)
> +		return error;
> +
> +	usedmask = 0;
> +	usedcount = 0;
> +	/* Which inodes within this cluster are free? */
> +	for (clusterino = 0; clusterino < nr_inodes; clusterino++) {
> +		fsino = XFS_AGINO_TO_INO(mp, sc->sa.agno, agino + clusterino);
> +		error = xrep_ibt_check_free(sc, bp, fsino,
> +				clusterino, &inuse);
> +		if (error) {
> +			xfs_trans_brelse(sc->tp, bp);
> +			return error;
> +		}
> +		if (inuse) {
> +			usedcount++;
> +			usedmask |= XFS_INOBT_MASK(clusteroff + clusterino);
> +		}
> +	}
> +	xfs_trans_brelse(sc->tp, bp);
> +
> +	/*
> +	 * If the last item in the list is our chunk record,
> +	 * update that.
> +	 */
> +	if (!list_empty(ri->extlist)) {
> +		rie = list_last_entry(ri->extlist, struct xrep_ibt_extent,
> +				list);
> +		if (rie->startino + XFS_INODES_PER_CHUNK > startino) {
> +			rie->freemask &= ~usedmask;
> +			rie->holemask &= ~fillmask;
> +			rie->count += nr_inodes;
> +			rie->usedcount += usedcount;
> +			return 0;
> +		}

And I think if we used the existing in-core record data structure we
could also reuse existing helpers like __xfs_inobt_rec_merge().

Alternatively, could we allocate/lookup the xrep_ibt_extent earlier and
update the associated fields directly rather than via the indirection of
the various local vars?

BTW, I initially thought this was a sparse inode thing but I see a bit
further down that we process a cluster at a time regardless. That seems
Ok, but I do wonder if some of this list hackery and whatnot could be
simplified by walking the clusters here. I guess we'd still need to
account for separate rmapbt records for sparse chunks, however.
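
To sketch what a merge-style update of two partial records for the same
chunk might look like: the struct below is a hypothetical stand-in for
an in-core inobt record, and the merge rule (AND the masks, add the
counts) is an assumption for illustration rather than a copy of the
existing kernel helper. It also assumes the sparse convention that
inodes inside a hole have their free bit set.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for an in-core inobt record. */
struct toy_inorec {
	uint32_t	startino;
	uint16_t	holemask;	/* 1 bit per 4 inodes; set = hole */
	uint8_t		count;		/* inodes physically present */
	uint8_t		freecount;	/* present inodes that are free */
	uint64_t	free;		/* 1 bit per inode; set = free */
};

/*
 * Fold src into dst.  Returns 0 on success or -1 if the two records
 * describe different chunks or claim the same 4-inode slot.
 */
static int toy_inorec_merge(struct toy_inorec *dst,
		const struct toy_inorec *src)
{
	if (dst->startino != src->startino)
		return -1;
	/* Every slot must be a hole in at least one of the records. */
	if ((dst->holemask | src->holemask) != 0xFFFF)
		return -1;

	dst->holemask &= src->holemask;
	dst->free &= src->free;
	dst->count += src->count;
	dst->freecount += src->freecount;
	return 0;
}
```

Merging a record covering the low 32 inodes with one covering the high
32 inodes yields a full 64-inode record with an empty holemask.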

> +	}
> +
> +	/* New inode chunk; add to the list. */
> +	rie = kmem_alloc(sizeof(struct xrep_ibt_extent), KM_MAYFAIL);
> +	if (!rie)
> +		return -ENOMEM;
> +
> +	INIT_LIST_HEAD(&rie->list);
> +	rie->startino = startino;
> +	rie->freemask = XFS_INOBT_ALL_FREE & ~usedmask;
> +	rie->holemask = XFS_INOBT_ALL_FREE & ~fillmask;

I'm not sure we need the ALL_FREE thing here..? We don't use it in the
update case above. (Though it would make sense if we allocated this
structure earlier and initialized it.)

> +	rie->count = nr_inodes;
> +	rie->usedcount = usedcount;
> +	list_add_tail(&rie->list, ri->extlist);
> +	ri->nr_records++;
> +
> +	return 0;
> +}
> +
> +/* Record extents that belong to inode btrees. */
> +STATIC int
> +xrep_ibt_walk_rmap(
> +	struct xfs_btree_cur	*cur,
> +	struct xfs_rmap_irec	*rec,
> +	void			*priv)
> +{
> +	struct xrep_ibt		*ri = priv;
> +	struct xfs_mount	*mp = cur->bc_mp;
> +	xfs_fsblock_t		fsbno;
> +	xfs_agblock_t		agbno = rec->rm_startblock;
> +	xfs_agino_t		inoalign;
> +	xfs_agino_t		agino;
> +	xfs_agino_t		rec_agino;
> +	int			blks_per_cluster;
> +	int			error = 0;
> +
> +	if (xchk_should_terminate(ri->sc, &error))
> +		return error;
> +
> +	/* Fragment of the old btrees; dispose of them later. */
> +	if (rec->rm_owner == XFS_RMAP_OWN_INOBT) {
> +		fsbno = XFS_AGB_TO_FSB(mp, ri->sc->sa.agno, agbno);
> +		return xfs_bitmap_set(ri->btlist, fsbno, rec->rm_blockcount);
> +	}
> +
> +	/* Skip extents which are not owned by this inode and fork. */
> +	if (rec->rm_owner != XFS_RMAP_OWN_INODES)
> +		return 0;
> +
> +	blks_per_cluster = xfs_icluster_size_fsb(mp);
> +
> +	if (agbno % blks_per_cluster != 0)
> +		return -EFSCORRUPTED;
> +

Ok, so we check that agbno is at least cluster aligned...

Shouldn't we verify that blockcount is sane as well?
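
One possible shape for such a check, purely as an assumption about what
"sane" might mean here: a nonzero length that is a whole number of
clusters and stays inside the AG. The function name and the ag_blocks
parameter are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical sanity check for an OWN_INODES rmap extent.
 * All values are in fs blocks; ag_blocks is the size of the AG.
 */
static bool toy_inode_rmap_len_sane(uint32_t agbno, uint32_t blockcount,
		uint32_t blks_per_cluster, uint32_t ag_blocks)
{
	if (blks_per_cluster == 0 || blockcount == 0)
		return false;
	/* The extent should cover whole clusters only. */
	if (blockcount % blks_per_cluster != 0)
		return false;
	/* The extent must not run off the end of the AG. */
	if (agbno >= ag_blocks || blockcount > ag_blocks - agbno)
		return false;
	return true;
}
```

The end-of-AG test is written as a subtraction so that it cannot
overflow when agbno + blockcount would wrap.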

> +	trace_xrep_ibt_walk_rmap(mp, ri->sc->sa.agno, rec->rm_startblock,
> +			rec->rm_blockcount, rec->rm_owner, rec->rm_offset,
> +			rec->rm_flags);
> +
> +	/*
> +	 * Determine the inode block alignment, and where the block
> +	 * ought to start if it's aligned properly.  On a sparse inode
> +	 * system the rmap doesn't have to start on an alignment boundary,
> +	 * but the record does.  On pre-sparse filesystems, we /must/
> +	 * start both rmap and inobt on an alignment boundary.
> +	 */
> +	inoalign = xfs_ialloc_cluster_alignment(mp);
> +	agino = XFS_OFFBNO_TO_AGINO(mp, agbno, 0);
> +	rec_agino = XFS_OFFBNO_TO_AGINO(mp, rounddown(agbno, inoalign), 0);
> +	if (!xfs_sb_version_hassparseinodes(&mp->m_sb) && agino != rec_agino)
> +		return -EFSCORRUPTED;
> +

... then if I follow correctly, verify the block is aligned
appropriately on !sparse. Firstly, isn't the above logically equivalent
to the following? I.e., I'm not sure why we need agino here.

	if (!sparse && (agbno % inoalign != 0))
		return -EFSCORRUPTED;

I take it that since we're walking the rmap, agbno could refer to a
sparse cluster. Perhaps we should also check against sb_spino_align in
the sparse case. FWIW, I think the comment above could be more clear as
well:

/*
 * On a sparse inode fs, agbno could refer to a partial chunk. This
 * should be aligned to the sparse chunk alignment. On a non-sparse fs,
 * agbno must always refer to the first block of an inode chunk and so
 * should be chunk aligned.
 */
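
A quick userspace sketch of why I think the two forms of the !sparse
check are equivalent. It assumes XFS_OFFBNO_TO_AGINO(mp, agbno, 0)
amounts to a left shift of agbno by log2(inodes per block); the toy_*
helpers and the inopblog parameter are stand-ins for illustration, and
agbno is kept small enough that the shift cannot overflow.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed model of XFS_OFFBNO_TO_AGINO(mp, agbno, 0). */
static uint32_t toy_agbno_to_agino(uint32_t agbno, unsigned int inopblog)
{
	return agbno << inopblog;
}

/* The check as written in the patch: compare two inode numbers. */
static bool toy_misaligned_via_agino(uint32_t agbno, uint32_t inoalign,
		unsigned int inopblog)
{
	uint32_t	agino = toy_agbno_to_agino(agbno, inopblog);
	uint32_t	rec_agino;

	rec_agino = toy_agbno_to_agino(agbno - (agbno % inoalign),
				       inopblog);
	return agino != rec_agino;
}

/* The simpler form: test the block number directly. */
static bool toy_misaligned_via_agbno(uint32_t agbno, uint32_t inoalign)
{
	return agbno % inoalign != 0;
}

/* Do both checks agree for every agbno below max_agbno? */
static bool toy_checks_agree(uint32_t max_agbno, uint32_t inoalign,
		unsigned int inopblog)
{
	uint32_t	agbno;

	for (agbno = 0; agbno < max_agbno; agbno++) {
		if (toy_misaligned_via_agino(agbno, inoalign, inopblog) !=
		    toy_misaligned_via_agbno(agbno, inoalign))
			return false;
	}
	return true;
}
```

Since the shift is injective for values that do not overflow, the two
inode numbers differ exactly when agbno is not a multiple of inoalign.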

> +	/*
> +	 * Set up the free/hole masks for each inode cluster that could be
> +	 * mapped by this rmap record.
> +	 */
> +	for (;
> +	     agbno < rec->rm_startblock + rec->rm_blockcount;
> +	     agbno += blks_per_cluster) {
> +		error = xrep_ibt_process_cluster(ri, agbno, blks_per_cluster,
> +				rec_agino);
> +		if (error)
> +			return error;
> +	}

Hmm, Ok. We're processing inodes a cluster size at a time regardless of
the extent length. That makes sense since we presumably need to read and
process the inode cluster itself.

> +
> +	return 0;
> +}
> +
...
> +/* Build new inode btrees and dispose of the old one. */
> +STATIC int
> +xrep_ibt_rebuild_trees(
> +	struct xfs_scrub	*sc,
> +	struct list_head	*inode_records,
> +	struct xfs_owner_info	*oinfo,
> +	struct xfs_bitmap	*old_iallocbt_blocks)
> +{
> +	struct xrep_ibt_extent	*rie;
> +	struct xrep_ibt_extent	*n;
> +	int			error;
> +
> +	/* Add all records. */
> +	list_sort(NULL, inode_records, xrep_ibt_extent_cmp);
> +	list_for_each_entry_safe(rie, n, inode_records, list) {
> +		error = xrep_ibt_insert_rec(sc, rie);
> +		if (error)
> +			return error;
> +
> +		list_del(&rie->list);
> +		kmem_free(rie);
> +	}

Same general thoughts here around freeing old blocks and whatnot as for
the allocbt repairs. Though I assume if we end up tweaking that behavior
we'll do so across the board.

Brian

> +
> +	/* Free the old inode btree blocks if they're not in use. */
> +	return xrep_reap_extents(sc, old_iallocbt_blocks, oinfo,
> +			XFS_AG_RESV_NONE);
> +}
> +
> +/*
> + * Make our new inode btree roots permanent so that we can start re-adding
> + * inode records back into the AG.
> + */
> +STATIC int
> +xrep_ibt_commit_new(
> +	struct xfs_scrub	*sc,
> +	struct xfs_bitmap	*old_iallocbt_blocks,
> +	int			log_flags)
> +{
> +	int			error;
> +
> +	xfs_ialloc_log_agi(sc->tp, sc->sa.agi_bp, log_flags);
> +
> +	/* Invalidate all the inobt/finobt blocks in btlist. */
> +	error = xrep_invalidate_blocks(sc, old_iallocbt_blocks);
> +	if (error)
> +		return error;
> +	error = xrep_roll_ag_trans(sc);
> +	if (error)
> +		return error;
> +
> +	/*
> +	 * Now that we've succeeded, mark the incore state valid again.  If the
> +	 * finobt is enabled, make sure we reinitialize the per-AG reservations
> +	 * when we're done.
> +	 */
> +	sc->sa.pag->pagi_init = 1;
> +	if (xfs_sb_version_hasfinobt(&sc->mp->m_sb))
> +		sc->reset_perag_resv = true;
> +	return 0;
> +}
> +
> +/* Repair both inode btrees. */
> +int
> +xrep_iallocbt(
> +	struct xfs_scrub	*sc)
> +{
> +	struct xfs_owner_info	oinfo;
> +	struct list_head	inode_records;
> +	struct xfs_bitmap	old_iallocbt_blocks;
> +	struct xfs_mount	*mp = sc->mp;
> +	int			log_flags = 0;
> +	int			error = 0;
> +
> +	/* We require the rmapbt to rebuild anything. */
> +	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
> +		return -EOPNOTSUPP;
> +
> +	xchk_perag_get(sc->mp, &sc->sa);
> +
> +	/* Collect the free space data and find the old btree blocks. */
> +	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_INOBT);
> +	INIT_LIST_HEAD(&inode_records);
> +	xfs_bitmap_init(&old_iallocbt_blocks);
> +	error = xrep_ibt_find_inodes(sc, &inode_records, &old_iallocbt_blocks);
> +	if (error)
> +		goto out;
> +
> +	/*
> +	 * Blow out the old inode btrees.  This is the point at which
> +	 * we are no longer able to bail out gracefully.
> +	 */
> +	error = xrep_ibt_reset_counters(sc, &inode_records, &log_flags);
> +	if (error)
> +		goto out;
> +	error = xrep_ibt_reset_btrees(sc, &oinfo, &log_flags);
> +	if (error)
> +		goto out;
> +	error = xrep_ibt_commit_new(sc, &old_iallocbt_blocks, log_flags);
> +	if (error)
> +		goto out;
> +
> +	/* Now rebuild the inode information. */
> +	error = xrep_ibt_rebuild_trees(sc, &inode_records, &oinfo,
> +			&old_iallocbt_blocks);
> +	if (error)
> +		goto out;
> +out:
> +	xrep_ibt_cancel_inorecs(&inode_records);
> +	xfs_bitmap_destroy(&old_iallocbt_blocks);
> +	return error;
> +}
> diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
> index 17cf48564390..a44deb6f06ab 100644
> --- a/fs/xfs/scrub/repair.c
> +++ b/fs/xfs/scrub/repair.c
> @@ -880,3 +880,23 @@ xrep_ino_dqattach(
>  
>  	return error;
>  }
> +
> +/*
> + * Reinitialize the per-AG block reservation for the AG we just fixed.
> + */
> +int
> +xrep_reset_perag_resv(
> +	struct xfs_scrub	*sc)
> +{
> +	int			error;
> +
> +	ASSERT(sc->ops->type == ST_PERAG);
> +	ASSERT(sc->tp);
> +
> +	error = xfs_ag_resv_free(sc->sa.pag);
> +	if (error)
> +		goto out;
> +	error = xfs_ag_resv_init(sc->sa.pag, sc->tp);
> +out:
> +	return error;
> +}
> diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
> index bc1a5f1cbcdc..0cc53dee3228 100644
> --- a/fs/xfs/scrub/repair.h
> +++ b/fs/xfs/scrub/repair.h
> @@ -53,6 +53,7 @@ int xrep_find_ag_btree_roots(struct xfs_scrub *sc, struct xfs_buf *agf_bp,
>  		struct xrep_find_ag_btree *btree_info, struct xfs_buf *agfl_bp);
>  void xrep_force_quotacheck(struct xfs_scrub *sc, uint dqtype);
>  int xrep_ino_dqattach(struct xfs_scrub *sc);
> +int xrep_reset_perag_resv(struct xfs_scrub *sc);
>  
>  /* Metadata repairers */
>  
> @@ -62,6 +63,7 @@ int xrep_agf(struct xfs_scrub *sc);
>  int xrep_agfl(struct xfs_scrub *sc);
>  int xrep_agi(struct xfs_scrub *sc);
>  int xrep_allocbt(struct xfs_scrub *sc);
> +int xrep_iallocbt(struct xfs_scrub *sc);
>  
>  #else
>  
> @@ -83,12 +85,21 @@ xrep_calc_ag_resblks(
>  	return 0;
>  }
>  
> +static inline int
> +xrep_reset_perag_resv(
> +	struct xfs_scrub	*sc)
> +{
> +	ASSERT(0);
> +	return -EOPNOTSUPP;
> +}
> +
>  #define xrep_probe			xrep_notsupported
>  #define xrep_superblock			xrep_notsupported
>  #define xrep_agf			xrep_notsupported
>  #define xrep_agfl			xrep_notsupported
>  #define xrep_agi			xrep_notsupported
>  #define xrep_allocbt			xrep_notsupported
> +#define xrep_iallocbt			xrep_notsupported
>  
>  #endif /* CONFIG_XFS_ONLINE_REPAIR */
>  
> diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
> index 2133a3199372..631b0b06db99 100644
> --- a/fs/xfs/scrub/scrub.c
> +++ b/fs/xfs/scrub/scrub.c
> @@ -244,14 +244,14 @@ static const struct xchk_meta_ops meta_scrub_ops[] = {
>  		.type	= ST_PERAG,
>  		.setup	= xchk_setup_ag_iallocbt,
>  		.scrub	= xchk_inobt,
> -		.repair	= xrep_notsupported,
> +		.repair	= xrep_iallocbt,
>  	},
>  	[XFS_SCRUB_TYPE_FINOBT] = {	/* finobt */
>  		.type	= ST_PERAG,
>  		.setup	= xchk_setup_ag_iallocbt,
>  		.scrub	= xchk_finobt,
>  		.has	= xfs_sb_version_hasfinobt,
> -		.repair	= xrep_notsupported,
> +		.repair	= xrep_iallocbt,
>  	},
>  	[XFS_SCRUB_TYPE_RMAPBT] = {	/* rmapbt */
>  		.type	= ST_PERAG,
> diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h
> index af323b229c4b..762db46fd696 100644
> --- a/fs/xfs/scrub/scrub.h
> +++ b/fs/xfs/scrub/scrub.h
> @@ -64,6 +64,7 @@ struct xfs_scrub {
>  	uint				ilock_flags;
>  	bool				try_harder;
>  	bool				has_quotaofflock;
> +	bool				reset_perag_resv;
>  
>  	/* State tracking for single-AG operations. */
>  	struct xchk_ag			sa;
> diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
> index 26bd5dc68efe..9126dc66f726 100644
> --- a/fs/xfs/scrub/trace.h
> +++ b/fs/xfs/scrub/trace.h
> @@ -552,7 +552,7 @@ DEFINE_EVENT(xrep_rmap_class, name, \
>  		 uint64_t owner, uint64_t offset, unsigned int flags), \
>  	TP_ARGS(mp, agno, agbno, len, owner, offset, flags))
>  DEFINE_REPAIR_RMAP_EVENT(xrep_abt_walk_rmap);
> -DEFINE_REPAIR_RMAP_EVENT(xrep_ialloc_extent_fn);
> +DEFINE_REPAIR_RMAP_EVENT(xrep_ibt_walk_rmap);
>  DEFINE_REPAIR_RMAP_EVENT(xrep_rmap_extent_fn);
>  DEFINE_REPAIR_RMAP_EVENT(xrep_bmap_extent_fn);
>  
> @@ -700,7 +700,7 @@ TRACE_EVENT(xrep_reset_counters,
>  		  MAJOR(__entry->dev), MINOR(__entry->dev))
>  )
>  
> -TRACE_EVENT(xrep_ialloc_insert,
> +TRACE_EVENT(xrep_ibt_insert,
>  	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
>  		 xfs_agino_t startino, uint16_t holemask, uint8_t count,
>  		 uint8_t freecount, uint64_t freemask),
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 05/14] xfs: repair free space btrees
  2018-08-02 13:48               ` Brian Foster
@ 2018-08-02 19:22                 ` Darrick J. Wong
  2018-08-03 10:49                   ` Brian Foster
  0 siblings, 1 reply; 53+ messages in thread
From: Darrick J. Wong @ 2018-08-02 19:22 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, david, allison.henderson

On Thu, Aug 02, 2018 at 09:48:24AM -0400, Brian Foster wrote:
> On Wed, Aug 01, 2018 at 11:28:45PM -0700, Darrick J. Wong wrote:
> > On Wed, Aug 01, 2018 at 02:39:20PM -0400, Brian Foster wrote:
> > > On Wed, Aug 01, 2018 at 09:23:16AM -0700, Darrick J. Wong wrote:
> > > > On Wed, Aug 01, 2018 at 07:54:09AM -0400, Brian Foster wrote:
> > > > > On Tue, Jul 31, 2018 at 03:01:25PM -0700, Darrick J. Wong wrote:
> > > > > > On Tue, Jul 31, 2018 at 01:47:23PM -0400, Brian Foster wrote:
> > > > > > > On Sun, Jul 29, 2018 at 10:48:21PM -0700, Darrick J. Wong wrote:
> > > > > > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > > > > 
> > > > > > > > Rebuild the free space btrees from the gaps in the rmap btree.
> > > > > > > > 
> > > > > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > > > > ---
> > > > > > > >  fs/xfs/Makefile             |    1 
> > > > > > > >  fs/xfs/scrub/alloc.c        |    1 
> > > > > > > >  fs/xfs/scrub/alloc_repair.c |  581 +++++++++++++++++++++++++++++++++++++++++++
> > > > > > > >  fs/xfs/scrub/common.c       |    8 +
> > > > > > > >  fs/xfs/scrub/repair.h       |    2 
> > > > > > > >  fs/xfs/scrub/scrub.c        |    4 
> > > > > > > >  fs/xfs/scrub/trace.h        |    2 
> > > > > > > >  fs/xfs/xfs_extent_busy.c    |   14 +
> > > > > > > >  fs/xfs/xfs_extent_busy.h    |    2 
> > > > > > > >  9 files changed, 610 insertions(+), 5 deletions(-)
> > > > > > > >  create mode 100644 fs/xfs/scrub/alloc_repair.c
> > > > > > > > 
> > > > > > > > 
> > > > > > > ...
> > > > > > > > diff --git a/fs/xfs/scrub/alloc_repair.c b/fs/xfs/scrub/alloc_repair.c
> > > > > > > > new file mode 100644
> > > > > > > > index 000000000000..b228c2906de2
> > > > > > > > --- /dev/null
> > > > > > > > +++ b/fs/xfs/scrub/alloc_repair.c
> > > > > > > > @@ -0,0 +1,581 @@
> > > ...
> > > > > > > > +
> > > > > > > > +/*
> > > > > > > > + * Make our new freespace btree roots permanent so that we can start freeing
> > > > > > > > + * unused space back into the AG.
> > > > > > > > + */
> > > > > > > > +STATIC int
> > > > > > > > +xrep_abt_commit_new(
> > > > > > > > +	struct xfs_scrub	*sc,
> > > > > > > > +	struct xfs_bitmap	*old_allocbt_blocks,
> > > > > > > > +	int			log_flags)
> > > > > > > > +{
> > > > > > > > +	int			error;
> > > > > > > > +
> > > > > > > > +	xfs_alloc_log_agf(sc->tp, sc->sa.agf_bp, log_flags);
> > > > > > > > +
> > > > > > > > +	/* Invalidate the old freespace btree blocks and commit. */
> > > > > > > > +	error = xrep_invalidate_blocks(sc, old_allocbt_blocks);
> > > > > > > > +	if (error)
> > > > > > > > +		return error;
> > > > > > > 
> > > > > > > It looks like the above invalidation all happens in the same
> > > > > > > transaction. Those aren't logging buffer data or anything, but any idea
> > > > > > > how many log formats we can get away with in this single transaction?
> > > > > > 
> > > > > > Hm... well, on my computer a log format is ~88 bytes.  Assuming 4K
> > > > > > blocks, the max AG size of 1TB, maximum free space fragmentation, and
> > > > > > two btrees, the tree could be up to ~270 million records.  Assuming ~505
> > > > > > records per block, that's ... ~531,000 leaf blocks and ~1100 node blocks
> > > > > > for both btrees.  If we invalidate both, that's ~46M of RAM?
> > > > > > 
> > > > > 
> > > > > I was thinking more about transaction reservation than RAM. It may not
> > > > 
> > > > Hmm.  tr_itruncate is ~650K on my 2TB SSD, assuming 88 bytes per, that's
> > > > about ... ~7300 log format items?  Not a lot, maybe it should roll the
> > > > transaction every 1000 invalidations or so...
> > > > 
> > > 
> > > I'm not really sure what categorizes as a lot here given that the blocks
> > > would need to be in-core, but rolling on some fixed/safe interval sounds
> > > reasonable to me.
> > > 
> > > > > currently be an issue, but it might be worth putting something down in a
> > > > > comment to note that this is a single transaction and we expect to not
> > > > > have to invalidate more than N (ballpark) blocks in a single go,
> > > > > whatever that value happens to be.
> > > > > 
> > > > > > > > +	error = xrep_roll_ag_trans(sc);
> > > > > > > > +	if (error)
> > > > > > > > +		return error;
> > > > > > > > +
> > > > > > > > +	/* Now that we've succeeded, mark the incore state valid again. */
> > > > > > > > +	sc->sa.pag->pagf_init = 1;
> > > > > > > > +	return 0;
> > > > > > > > +}
> > > > > > > > +
> > > > > > > > +/* Build new free space btrees and dispose of the old one. */
> > > > > > > > +STATIC int
> > > > > > > > +xrep_abt_rebuild_trees(
> > > > > > > > +	struct xfs_scrub	*sc,
> > > > > > > > +	struct list_head	*free_extents,
> > > > > > > > +	struct xfs_bitmap	*old_allocbt_blocks)
> > > > > > > > +{
> > > > > > > > +	struct xfs_owner_info	oinfo;
> > > > > > > > +	struct xrep_abt_extent	*rae;
> > > > > > > > +	struct xrep_abt_extent	*n;
> > > > > > > > +	struct xrep_abt_extent	*longest;
> > > > > > > > +	int			error;
> > > > > > > > +
> > > > > > > > +	xfs_rmap_skip_owner_update(&oinfo);
> > > > > > > > +
> > > > > > > > +	/*
> > > > > > > > +	 * Insert the longest free extent in case it's necessary to
> > > > > > > > +	 * refresh the AGFL with multiple blocks.  If there is no longest
> > > > > > > > +	 * extent, we had exactly the free space we needed; we're done.
> > > > > > > > +	 */
> > > > > > > 
> > > > > > > I'm confused by the last sentence. longest should only be NULL if the
> > > > > > > free space list is empty and haven't we already bailed out with -ENOSPC
> > > > > > > if that's the case?
> > > > > > > 
> > > > > > > > +	longest = xrep_abt_get_longest(free_extents);
> > > > > > 
> > > > > > xrep_abt_rebuild_trees is called after we allocate and initialize two
> > > > > > new btree roots in xrep_abt_reset_btrees.  If free_extents is an empty
> > > > > > list here, then we found exactly two blocks worth of free space and used
> > > > > > them to set up new btree roots.
> > > > > > 
> > > > > 
> > > > > Got it, thanks.
> > > > > 
> > > > > > > > +	if (!longest)
> > > > > > > > +		goto done;
> > > > > > > > +	error = xrep_abt_free_extent(sc,
> > > > > > > > +			XFS_AGB_TO_FSB(sc->mp, sc->sa.agno, longest->bno),
> > > > > > > > +			longest->len, &oinfo);
> > > > > > > > +	list_del(&longest->list);
> > > > > > > > +	kmem_free(longest);
> > > > > > > > +	if (error)
> > > > > > > > +		return error;
> > > > > > > > +
> > > > > > > > +	/* Insert records into the new btrees. */
> > > > > > > > +	list_for_each_entry_safe(rae, n, free_extents, list) {
> > > > > > > > +		error = xrep_abt_free_extent(sc,
> > > > > > > > +				XFS_AGB_TO_FSB(sc->mp, sc->sa.agno, rae->bno),
> > > > > > > > +				rae->len, &oinfo);
> > > > > > > > +		if (error)
> > > > > > > > +			return error;
> > > > > > > > +		list_del(&rae->list);
> > > > > > > > +		kmem_free(rae);
> > > > > > > > +	}
> > > > > > > 
> > > > > > > Ok, at this point we've reset the btree roots and we start freeing the
> > > > > > > free ranges that were discovered via the rmapbt analysis. AFAICT, if we
> > > > > > > fail or crash at this point, we leave the allocbts in a partially
> > > > > > > constructed state. I take it that is Ok with respect to the broader
> > > > > > > repair algorithm because we'd essentially start over by inspecting the
> > > > > > > rmapbt again on a retry.
> > > > > > 
> > > > > > Right.  Though in the crash/shutdown case, you'll end up with the
> > > > > > filesystem in an offline state at some point before you can retry the
> > > > > > scrub, it's probably faster to run xfs_repair to fix the damage.
> > > > > > 
> > > > > 
> > > > > Can we really assume that if we're already up and running an online
> > > > > repair? The filesystem has to be mountable in that case in the first
> > > > > place. If we've already reset and started reconstructing the allocation
> > > > > btrees then I'd think those transactions would recover just fine on a
> > > > > power loss or something (perhaps not in the event of some other
> > > > > corruption related shutdown).
> > > > 
> > > > Right, for the system crash case, whatever transactions committed should
> > > > replay just fine, and you can even start up the online repair again, and
> > > > if the AG isn't particularly close to ENOSPC then (barring rmap
> > > > corruption) it should work just fine.
> > > > 
> > > > If the fs went down because either (a) repair hit other corruption or
> > > > (b) some other thread hit an error in some other part of the filesystem,
> > > > then it's not so clear -- in (b) you could probably try again, but for
> > > > (a) you'll definitely have to unmount and run xfs_repair.
> > > > 
> > > 
> > > Indeed, there are certainly cases where we simply won't be able to do an
> > > online repair. I'm trying to think about scenarios where we should be
> > > able to do an online repair, but we lose power or hit some kind of
> > > transient error like a memory allocation failure before it completes. It
> > > would be nice if the online repair itself didn't contribute (within
> > > reason) to the inability to simply try again just because the fs was
> > > close to -ENOSPC.
> > 
> > Agreed.  Most of the, uh, opportunities to hit ENOMEM happen before we
> > start modifying on-disk metadata.  If that happens, we just free all the
> > memory and bail out having done nothing.
> > 
> > > For one, I think it's potentially confusing behavior. Second, it might
> > > be concerning to regular users who perceive it as an online repair
> > > leaving the fs in a worse off state. Us fs devs know that may not really
> > > be the case, but I think we're better for addressing it if we can
> > > reasonably do so.
> > 
> > <nod> Further in the future I want to add the ability to offline an AG,
> > so the worst that happens is that scrub turns the AG off, repair doesn't
> > fix it, and the AG simply stays offline.  That might give us the
> > ability to survive cancelling the repair transaction, since if the AG's
> > offline already anyway we could just throw away the dirty buffers and
> > resurrect the AG later.  I don't know, that's purely speculative.
> > 
> > > > Perhaps the guideline here is that if the fs goes down more than once
> > > > during online repair then unmount it and run xfs_repair.
> > > > 
> > > 
> > > Yep, I think that makes sense if the filesystem or repair itself is
> > > tripping over other corruptions that fail to keep it active for the
> > > duration of the repair.
> > 
> > <nod>
> > 
> > > > > > > The blocks allocated for the btrees that we've begun to construct here
> > > > > > > end up mapped in the rmapbt as we go, right? IIUC, that means we don't
> > > > > > > necessarily have infinite retries to make sure this completes. IOW,
> > > > > > > suppose that a first repair attempt finds just enough free space to
> > > > > > > construct new trees, gets far enough along to consume most of that free
> > > > > > > space and then crashes. Is it possible that a subsequent repair attempt
> > > > > > > includes the btree blocks allocated during the previous failed repair
> > > > > > > attempt in the sum of "old btree blocks" and determines we don't have
> > > > > > > enough free space to repair?
> > > > > > 
> > > > > > Yes, that's a risk of running the free space repair.
> > > > > > 
> > > > > 
> > > > > Can we improve on that? For example, are the rmapbt entries for the old
> > > > > allocation btree blocks necessary once we commit the btree resets? If
> > > > > not, could we remove those entries before we start tree reconstruction?
> > > > > 
> > > > > Alternatively, could we incorporate use of the old btree blocks? As it
> > > > > is, we discover those blocks simply so we can free them at the end.
> > > > > Perhaps we could free them sooner or find a more clever means to
> > > > > reallocate directly from that in-core list? I guess we have to consider
> > > > > whether they were really valid/sane btree blocks, but either way ISTM
> > > > > that the old blocks list is essentially invalidated once we reset the
> > > > > btrees.
> > > > 
> > > > Hmm, it's a little tricky to do that -- we could reap the old bnobt and
> > > > cntbt blocks (in the old_allocbt_blocks bitmap) first, but if adding a
> > > > record causes a btree split we'll pull blocks from the AGFL, and if
> > > > there aren't enough blocks in the bnobt to fill the AGFL back up then
> > > > fix_freelist won't succeed.  That complication is why it finds the
> > > > longest extent in the unclaimed list and pushes that in first, then
> > > > works on the rest of the extents.
> > > > 
> > > 
> > > Hmm, but doesn't a btree split require at least one full space btree
> > > block per-level? In conjunction, the agfl minimum size requirement grows
> > > with the height of the tree, which implies available free space..? I
> > > could be missing something, perhaps we have to account for the rmapbt in
> > > that case as well? Regardless...
> > > 
> > > > I suppose one could try to avoid ENOSPC by pushing that longest extent
> > > > in first (since we know that won't trigger a split), then reap the old
> > > > alloc btree blocks, and then add everything else back in...
> > > > 
> > > 
> > > I think it would be reasonable to seed the btree with the longest record
> > > or some fixed number of longest records (~1/2 a root block, for example)
> > > before making actual use of the btrees to reap the old blocks. I think
> > > then you'd only have a very short window of a single block leak on a
> > > poorly timed power loss and repair retry sequence before you start
> > > actually freeing originally used space (which in practice, I think
> > > solves the problem).
> > > 
> > > Given that we're starting from empty, I wonder if another option may be
> > > to over fill the agfl with old btree blocks or something. The first real
> > > free should shift enough blocks back into the btrees to ensure the agfl
> > > can be managed from that point forward, right? That may be more work
> > > than it's worth though and/or a job for another patch. (FWIW, we also
> > > have that NOSHRINK agfl fixup flag for userspace repair.)
> > 
> > Yes, I'll give that a try tomorrow, now that I've finished porting all
> > the 4.19 stuff to xfsprogs. :)
> > 
> > Looping back to something we discussed earlier in this thread, I'd
> > prefer to hold off on converting the list of already-freed extents to
> > xfs_bitmap because the same problem exists in all the repair functions
> > of having to store a large number of records for the rebuilt btree, and
> > maybe there's some way to <cough> use pageable memory for that, since
> > the access patterns for that are append, sort, and iterate; for those
> > three uses we don't necessarily require all the records to be in memory
> > all the time.  For the allocbt repair I expect the free space records to
> > be far more numerous than the list of old bnobt/cntbt blocks.
> > 
> 
> Ok, it's fair enough that we'll probably want to find some kind of
> generic, more efficient technique for handling this across the various
> applicable repair algorithms.
> 
> One other high level thing that crossed my mind with regard to the
> general btree reconstruction algorithms is whether we need to build up
> this kind of central record list at all. For example, rather than slurp
> up the entire list of btree records in-core, sort it and dump it back
> out, could we take advantage of the fact that our existing on-disk
> structure insertion mechanisms already handle out of order records
> (simply stated, an extent free knows how to insert the associated record
> at the right place in the space btrees)? For example, suppose we reset
> the existing btrees first, then scanned the rmapbt and repopulated the
> new btrees as records are discovered..?

I tried that in an earlier draft of the bnobt repair function.  The
biggest problem with inserting as we go is dealing with the inevitable
transaction rolls (right now we roll after every record insertion to avoid
playing games with guessing how much reservation is left).  Btree
cursor state can't survive transaction rolls because the transaction
commit releases all the buffers that aren't bhold'en, and we can't bhold
that many buffers across a _defer_finish.

So, that early draft spent a lot of time tearing down and reconstructing
rmapbt cursors since the standard _btree_query_all isn't suited to that
kind of usage.  It was easily twice as slow on a RAM-backed disk just
from the rmap cursor overhead and much more complex, so I rewrote it to
be simpler.  I also have a slight preference for not touching anything
until we're absolutely sure we have all the data we need to repair the
structure.

For other repair functions (like the data/attr fork repairs) we have to
scan all the rmapbts for extents, and I'd prefer to lock those AGs only
for as long as necessary to extract the extents we want.
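
The roll-interval arithmetic quoted earlier in this thread (~88 bytes per
log format item against a ~650K tr_itruncate reservation) can be sketched
in plain userspace C.  This is only an illustration: max_log_items() and
count_rolls() are hypothetical names, not XFS functions, and "650K" is
assumed here to mean 650,000 bytes.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not an XFS function): how many fixed-size log
 * format items fit in a transaction reservation of resv_bytes.
 */
static size_t
max_log_items(size_t resv_bytes, size_t item_bytes)
{
	if (item_bytes == 0)
		return 0;
	return resv_bytes / item_bytes;
}

/*
 * Hypothetical batching loop: "invalidate" nr_blocks blocks, rolling the
 * transaction after every roll_interval invalidations and once more at
 * the end if anything is still pending.  Returns the number of rolls.
 */
static size_t
count_rolls(size_t nr_blocks, size_t roll_interval)
{
	size_t	pending = 0;
	size_t	rolls = 0;
	size_t	i;

	if (roll_interval == 0)
		return 0;

	for (i = 0; i < nr_blocks; i++) {
		pending++;		/* stand-in for invalidating one block */
		if (pending == roll_interval) {
			rolls++;	/* stand-in for rolling the transaction */
			pending = 0;
		}
	}
	if (pending)
		rolls++;
	return rolls;
}
```

Under those assumptions max_log_items(650000, 88) lands in the same
ballpark as the ~7300 figure above, so a fixed interval of 1000 leaves a
wide margin.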

> The obvious problem is that we still have some checks that allow the
> whole repair operation to bail out before we determine whether we can
> start to rebuild the on-disk btrees. These are things like making sure
> we can actually read the associated rmapbt blocks (i.e., no read errors
> or verifier failures), basic record sanity checks, etc. But ISTM that
> isn't anything we couldn't get around with a multi-pass implementation.
> Secondary issues might be things like no longer being able to easily
> insert the longest free extent range(s) first (meaning we'd have to
> stuff the agfl with old btree blocks or figure out some other approach).

Well, you could scan the rmapbt twice -- once to find the longest
record, then again to do the actual insertion.
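
A minimal userspace sketch of that two-pass idea, assuming the records
sit in a plain array rather than being produced by an rmapbt walk.
struct free_extent, find_longest() and insert_longest_first() are
hypothetical names for illustration only; the real code would call the
extent-freeing path instead of copying into an output array.

```c
#include <stddef.h>

/* Hypothetical stand-in for a free extent record found via the rmapbt. */
struct free_extent {
	unsigned int	bno;	/* start block within the AG */
	unsigned int	len;	/* length in blocks */
};

/* Pass 1: return the index of the longest extent, or -1 if there are none. */
static int
find_longest(const struct free_extent *recs, size_t nr)
{
	int	longest = -1;
	size_t	i;

	for (i = 0; i < nr; i++) {
		if (longest < 0 || recs[i].len > recs[longest].len)
			longest = (int)i;
	}
	return longest;
}

/*
 * Pass 2: emit the longest extent first, then every other extent in scan
 * order.  "out" must have room for nr records; returns the number written.
 */
static size_t
insert_longest_first(const struct free_extent *recs, size_t nr,
		     struct free_extent *out)
{
	int	longest = find_longest(recs, nr);
	size_t	n = 0;
	size_t	i;

	if (longest < 0)
		return 0;

	out[n++] = recs[longest];
	for (i = 0; i < nr; i++) {
		if ((int)i != longest)
			out[n++] = recs[i];
	}
	return n;
}
```

The cost is a second full scan, in exchange for not having to keep every
record in memory just to pick out the longest one.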

> BTW, isn't the typical scrub sequence already multi-pass by virtue of
> the xfs_scrub_metadata() implementation? I'm wondering if the ->scrub()
> callout could not only detect corruption, but validate whether repair
> (if requested) is possible based on the kind of checks that are
> currently in the repair side rmapbt walkers. Thoughts?

Yes, scrub basically validates that for us now, with the notable
exception of the notorious rmapbt scrubber, which doesn't
cross-reference with inode block mappings because that would be a
locking nightmare.

> Are there future
> changes that are better supported by an in-core tracking structure in
> general (assuming we'll eventually replace the linked lists with
> something more efficient) as opposed to attempting to optimize out the
> need for that tracking at all?

Well, I was thinking that we could just allocate a memfd (or a file on
the same xfs once we have AG offlining) and store the records in there.
That saves us the list_head overhead and potentially enables access to a
lot more storage than pinning things in RAM.
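
As a rough userspace illustration of the append/sort/iterate pattern on
file-backed storage, the sketch below uses tmpfile() as a stand-in for
the memfd (or on-filesystem file) and sorts through a shared mapping so
the kernel can page records in and out.  struct rec and the recfile_*()
helpers are hypothetical names; none of this is kernel code, and an
in-kernel version would look quite different.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

/* Hypothetical fixed-size record; no list_head needed per record. */
struct rec {
	unsigned int	bno;
	unsigned int	len;
};

static int
rec_cmp(const void *a, const void *b)
{
	const struct rec *ra = a;
	const struct rec *rb = b;

	if (ra->bno < rb->bno)
		return -1;
	if (ra->bno > rb->bno)
		return 1;
	return 0;
}

/* Create an unbuffered temporary backing file; returns NULL on failure. */
static FILE *
recfile_open(void)
{
	FILE	*fp = tmpfile();

	if (fp)
		setvbuf(fp, NULL, _IONBF, 0);
	return fp;
}

/* Append one record at the end of the file; returns 0 on success. */
static int
recfile_append(FILE *fp, unsigned int bno, unsigned int len)
{
	struct rec	r = { bno, len };

	if (fseek(fp, 0, SEEK_END))
		return -1;
	return fwrite(&r, sizeof(r), 1, fp) == 1 ? 0 : -1;
}

/* Sort the first nr records by bno through a shared mapping of the file. */
static int
recfile_sort(FILE *fp, size_t nr)
{
	size_t	bytes = nr * sizeof(struct rec);
	void	*p;

	if (nr == 0)
		return 0;
	p = mmap(NULL, bytes, PROT_READ | PROT_WRITE, MAP_SHARED,
		 fileno(fp), 0);
	if (p == MAP_FAILED)
		return -1;
	qsort(p, nr, sizeof(struct rec), rec_cmp);
	return munmap(p, bytes);
}

/* Read record number idx back from the file; returns 0 on success. */
static int
recfile_get(FILE *fp, size_t idx, struct rec *out)
{
	if (fseek(fp, (long)(idx * sizeof(*out)), SEEK_SET))
		return -1;
	return fread(out, sizeof(*out), 1, fp) == 1 ? 0 : -1;
}
```

Each record costs only its own eight bytes in the file, and only the
pages being touched need to be resident at any one time.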

--D

> Brian
> 
> > --D
> > 
> > > Brian
> > > 
> > > > --D
> > > > 
> > > > > Brian
> > > > > 
> > > > > > > > +
> > > > > > > > +done:
> > > > > > > > +	/* Free all the OWN_AG blocks that are not in the rmapbt/agfl. */
> > > > > > > > +	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_AG);
> > > > > > > > +	return xrep_reap_extents(sc, old_allocbt_blocks, &oinfo,
> > > > > > > > +			XFS_AG_RESV_NONE);
> > > > > > > > +}
> > > > > > > > +
> > > > > > > ...
> > > > > > > > diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c
> > > > > > > > index 0ed68379e551..82f99633a597 100644
> > > > > > > > --- a/fs/xfs/xfs_extent_busy.c
> > > > > > > > +++ b/fs/xfs/xfs_extent_busy.c
> > > > > > > > @@ -657,3 +657,17 @@ xfs_extent_busy_ag_cmp(
> > > > > > > >  		diff = b1->bno - b2->bno;
> > > > > > > >  	return diff;
> > > > > > > >  }
> > > > > > > > +
> > > > > > > > +/* Are there any busy extents in this AG? */
> > > > > > > > +bool
> > > > > > > > +xfs_extent_busy_list_empty(
> > > > > > > > +	struct xfs_perag	*pag)
> > > > > > > > +{
> > > > > > > > +	spin_lock(&pag->pagb_lock);
> > > > > > > > +	if (pag->pagb_tree.rb_node) {
> > > > > > > 
> > > > > > > RB_EMPTY_ROOT()?
> > > > > > 
> > > > > > Good suggestion, thank you!
> > > > > > 
> > > > > > --D
> > > > > > 
> > > > > > > Brian
> > > > > > > 
> > > > > > > > +		spin_unlock(&pag->pagb_lock);
> > > > > > > > +		return false;
> > > > > > > > +	}
> > > > > > > > +	spin_unlock(&pag->pagb_lock);
> > > > > > > > +	return true;
> > > > > > > > +}
> > > > > > > > diff --git a/fs/xfs/xfs_extent_busy.h b/fs/xfs/xfs_extent_busy.h
> > > > > > > > index 990ab3891971..2f8c73c712c6 100644
> > > > > > > > --- a/fs/xfs/xfs_extent_busy.h
> > > > > > > > +++ b/fs/xfs/xfs_extent_busy.h
> > > > > > > > @@ -65,4 +65,6 @@ static inline void xfs_extent_busy_sort(struct list_head *list)
> > > > > > > >  	list_sort(NULL, list, xfs_extent_busy_ag_cmp);
> > > > > > > >  }
> > > > > > > >  
> > > > > > > > +bool xfs_extent_busy_list_empty(struct xfs_perag *pag);
> > > > > > > > +
> > > > > > > >  #endif /* __XFS_EXTENT_BUSY_H__ */
> > > > > > > > 
> > > > > > > > --
> > > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > > > > > the body of a message to majordomo@vger.kernel.org
> > > > > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > > > > --
> > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > > > > the body of a message to majordomo@vger.kernel.org
> > > > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > > > --
> > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > > > the body of a message to majordomo@vger.kernel.org
> > > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > > --
> > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > > the body of a message to majordomo@vger.kernel.org
> > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 05/14] xfs: repair free space btrees
  2018-08-02 19:22                 ` Darrick J. Wong
@ 2018-08-03 10:49                   ` Brian Foster
  2018-08-07 23:34                     ` Darrick J. Wong
  0 siblings, 1 reply; 53+ messages in thread
From: Brian Foster @ 2018-08-03 10:49 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, david, allison.henderson

On Thu, Aug 02, 2018 at 12:22:05PM -0700, Darrick J. Wong wrote:
> On Thu, Aug 02, 2018 at 09:48:24AM -0400, Brian Foster wrote:
> > On Wed, Aug 01, 2018 at 11:28:45PM -0700, Darrick J. Wong wrote:
> > > On Wed, Aug 01, 2018 at 02:39:20PM -0400, Brian Foster wrote:
> > > > On Wed, Aug 01, 2018 at 09:23:16AM -0700, Darrick J. Wong wrote:
> > > > > On Wed, Aug 01, 2018 at 07:54:09AM -0400, Brian Foster wrote:
> > > > > > On Tue, Jul 31, 2018 at 03:01:25PM -0700, Darrick J. Wong wrote:
> > > > > > > On Tue, Jul 31, 2018 at 01:47:23PM -0400, Brian Foster wrote:
> > > > > > > > On Sun, Jul 29, 2018 at 10:48:21PM -0700, Darrick J. Wong wrote:
> > > > > > > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > > > > > 
> > > > > > > > > Rebuild the free space btrees from the gaps in the rmap btree.
> > > > > > > > > 
> > > > > > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > > > > > ---
> > > > > > > > >  fs/xfs/Makefile             |    1 
> > > > > > > > >  fs/xfs/scrub/alloc.c        |    1 
> > > > > > > > >  fs/xfs/scrub/alloc_repair.c |  581 +++++++++++++++++++++++++++++++++++++++++++
> > > > > > > > >  fs/xfs/scrub/common.c       |    8 +
> > > > > > > > >  fs/xfs/scrub/repair.h       |    2 
> > > > > > > > >  fs/xfs/scrub/scrub.c        |    4 
> > > > > > > > >  fs/xfs/scrub/trace.h        |    2 
> > > > > > > > >  fs/xfs/xfs_extent_busy.c    |   14 +
> > > > > > > > >  fs/xfs/xfs_extent_busy.h    |    2 
> > > > > > > > >  9 files changed, 610 insertions(+), 5 deletions(-)
> > > > > > > > >  create mode 100644 fs/xfs/scrub/alloc_repair.c
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > ...
> > > > > > > > > diff --git a/fs/xfs/scrub/alloc_repair.c b/fs/xfs/scrub/alloc_repair.c
> > > > > > > > > new file mode 100644
> > > > > > > > > index 000000000000..b228c2906de2
> > > > > > > > > --- /dev/null
> > > > > > > > > +++ b/fs/xfs/scrub/alloc_repair.c
> > > > > > > > > @@ -0,0 +1,581 @@
> > > > ...
> > > > > > > > > +
> > > > > > > > > +/*
> > > > > > > > > + * Make our new freespace btree roots permanent so that we can start freeing
> > > > > > > > > + * unused space back into the AG.
> > > > > > > > > + */
> > > > > > > > > +STATIC int
> > > > > > > > > +xrep_abt_commit_new(
> > > > > > > > > +	struct xfs_scrub	*sc,
> > > > > > > > > +	struct xfs_bitmap	*old_allocbt_blocks,
> > > > > > > > > +	int			log_flags)
> > > > > > > > > +{
> > > > > > > > > +	int			error;
> > > > > > > > > +
> > > > > > > > > +	xfs_alloc_log_agf(sc->tp, sc->sa.agf_bp, log_flags);
> > > > > > > > > +
> > > > > > > > > +	/* Invalidate the old freespace btree blocks and commit. */
> > > > > > > > > +	error = xrep_invalidate_blocks(sc, old_allocbt_blocks);
> > > > > > > > > +	if (error)
> > > > > > > > > +		return error;
> > > > > > > > 
> > > > > > > > It looks like the above invalidation all happens in the same
> > > > > > > > transaction. Those aren't logging buffer data or anything, but any idea
> > > > > > > > how many log formats we can get away with in this single transaction?
> > > > > > > 
> > > > > > > Hm... well, on my computer a log format is ~88 bytes.  Assuming 4K
> > > > > > > blocks, the max AG size of 1TB, maximum free space fragmentation, and
> > > > > > > two btrees, the tree could be up to ~270 million records.  Assuming ~505
> > > > > > > records per block, that's ... ~531,000 leaf blocks and ~1100 node blocks
> > > > > > > for both btrees.  If we invalidate both, that's ~46M of RAM?
> > > > > > > 
> > > > > > 
> > > > > > I was thinking more about transaction reservation than RAM. It may not
> > > > > 
> > > > > Hmm.  tr_itruncate is ~650K on my 2TB SSD, assuming 88 bytes per, that's
> > > > > about ... ~7300 log format items?  Not a lot, maybe it should roll the
> > > > > transaction every 1000 invalidations or so...
> > > > > 
> > > > 
> > > > I'm not really sure what categorizes as a lot here given that the blocks
> > > > would need to be in-core, but rolling on some fixed/safe interval sounds
> > > > reasonable to me.
> > > > 
> > > > > > currently be an issue, but it might be worth putting something down in a
> > > > > > comment to note that this is a single transaction and we expect to not
> > > > > > have to invalidate more than N (ballpark) blocks in a single go,
> > > > > > whatever that value happens to be.
> > > > > > 
> > > > > > > > > +	error = xrep_roll_ag_trans(sc);
> > > > > > > > > +	if (error)
> > > > > > > > > +		return error;
> > > > > > > > > +
> > > > > > > > > +	/* Now that we've succeeded, mark the incore state valid again. */
> > > > > > > > > +	sc->sa.pag->pagf_init = 1;
> > > > > > > > > +	return 0;
> > > > > > > > > +}
> > > > > > > > > +
> > > > > > > > > +/* Build new free space btrees and dispose of the old one. */
> > > > > > > > > +STATIC int
> > > > > > > > > +xrep_abt_rebuild_trees(
> > > > > > > > > +	struct xfs_scrub	*sc,
> > > > > > > > > +	struct list_head	*free_extents,
> > > > > > > > > +	struct xfs_bitmap	*old_allocbt_blocks)
> > > > > > > > > +{
> > > > > > > > > +	struct xfs_owner_info	oinfo;
> > > > > > > > > +	struct xrep_abt_extent	*rae;
> > > > > > > > > +	struct xrep_abt_extent	*n;
> > > > > > > > > +	struct xrep_abt_extent	*longest;
> > > > > > > > > +	int			error;
> > > > > > > > > +
> > > > > > > > > +	xfs_rmap_skip_owner_update(&oinfo);
> > > > > > > > > +
> > > > > > > > > +	/*
> > > > > > > > > +	 * Insert the longest free extent in case it's necessary to
> > > > > > > > > +	 * refresh the AGFL with multiple blocks.  If there is no longest
> > > > > > > > > +	 * extent, we had exactly the free space we needed; we're done.
> > > > > > > > > +	 */
> > > > > > > > 
> > > > > > > > I'm confused by the last sentence. longest should only be NULL if the
> > > > > > > > free space list is empty and haven't we already bailed out with -ENOSPC
> > > > > > > > if that's the case?
> > > > > > > > 
> > > > > > > > > +	longest = xrep_abt_get_longest(free_extents);
> > > > > > > 
> > > > > > > xrep_abt_rebuild_trees is called after we allocate and initialize two
> > > > > > > new btree roots in xrep_abt_reset_btrees.  If free_extents is an empty
> > > > > > > list here, then we found exactly two blocks worth of free space and used
> > > > > > > them to set up new btree roots.
> > > > > > > 
> > > > > > 
> > > > > > Got it, thanks.
> > > > > > 
> > > > > > > > > +	if (!longest)
> > > > > > > > > +		goto done;
> > > > > > > > > +	error = xrep_abt_free_extent(sc,
> > > > > > > > > +			XFS_AGB_TO_FSB(sc->mp, sc->sa.agno, longest->bno),
> > > > > > > > > +			longest->len, &oinfo);
> > > > > > > > > +	list_del(&longest->list);
> > > > > > > > > +	kmem_free(longest);
> > > > > > > > > +	if (error)
> > > > > > > > > +		return error;
> > > > > > > > > +
> > > > > > > > > +	/* Insert records into the new btrees. */
> > > > > > > > > +	list_for_each_entry_safe(rae, n, free_extents, list) {
> > > > > > > > > +		error = xrep_abt_free_extent(sc,
> > > > > > > > > +				XFS_AGB_TO_FSB(sc->mp, sc->sa.agno, rae->bno),
> > > > > > > > > +				rae->len, &oinfo);
> > > > > > > > > +		if (error)
> > > > > > > > > +			return error;
> > > > > > > > > +		list_del(&rae->list);
> > > > > > > > > +		kmem_free(rae);
> > > > > > > > > +	}
> > > > > > > > 
> > > > > > > > Ok, at this point we've reset the btree roots and we start freeing the
> > > > > > > > free ranges that were discovered via the rmapbt analysis. AFAICT, if we
> > > > > > > > fail or crash at this point, we leave the allocbts in a partially
> > > > > > > > constructed state. I take it that is Ok with respect to the broader
> > > > > > > > repair algorithm because we'd essentially start over by inspecting the
> > > > > > > > rmapbt again on a retry.
> > > > > > > 
> > > > > > > Right.  Though in the crash/shutdown case, you'll end up with the
> > > > > > > filesystem in an offline state at some point before you can retry the
> > > > > > > > scrub, so it's probably faster to run xfs_repair to fix the damage.
> > > > > > > 
> > > > > > 
> > > > > > Can we really assume that if we're already up and running an online
> > > > > > repair? The filesystem has to be mountable in that case in the first
> > > > > > place. If we've already reset and started reconstructing the allocation
> > > > > > btrees then I'd think those transactions would recover just fine on a
> > > > > > power loss or something (perhaps not in the event of some other
> > > > > > corruption related shutdown).
> > > > > 
> > > > > Right, for the system crash case, whatever transactions committed should
> > > > > replay just fine, and you can even start up the online repair again, and
> > > > > if the AG isn't particularly close to ENOSPC then (barring rmap
> > > > > corruption) it should work just fine.
> > > > > 
> > > > > If the fs went down because either (a) repair hit other corruption or
> > > > > (b) some other thread hit an error in some other part of the filesystem,
> > > > > then it's not so clear -- in (b) you could probably try again, but for
> > > > > (a) you'll definitely have to unmount and run xfs_repair.
> > > > > 
> > > > 
> > > > Indeed, there are certainly cases where we simply won't be able to do an
> > > > online repair. I'm trying to think about scenarios where we should be
> > > > able to do an online repair, but we lose power or hit some kind of
> > > > transient error like a memory allocation failure before it completes. It
> > > > would be nice if the online repair itself didn't contribute (within
> > > > reason) to the inability to simply try again just because the fs was
> > > > close to -ENOSPC.
> > > 
> > > Agreed.  Most of the, uh, opportunities to hit ENOMEM happen before we
> > > start modifying on-disk metadata.  If that happens, we just free all the
> > > memory and bail out having done nothing.
> > > 
> > > > For one, I think it's potentially confusing behavior. Second, it might
> > > > be concerning to regular users who perceive it as an online repair
> > > > leaving the fs in a worse off state. Us fs devs know that may not really
> > > > be the case, but I think we're better for addressing it if we can
> > > > reasonably do so.
> > > 
> > > <nod> Further in the future I want to add the ability to offline an AG,
> > > so the worst that happens is that scrub turns the AG off, repair doesn't
> > > fix it, and the AG simply stays offline.  That might give us the
> > > ability to survive cancelling the repair transaction, since if the AG's
> > > offline already anyway we could just throw away the dirty buffers and
> > > resurrect the AG later.  I don't know, that's purely speculative.
> > > 
> > > > > Perhaps the guideline here is that if the fs goes down more than once
> > > > > during online repair then unmount it and run xfs_repair.
> > > > > 
> > > > 
> > > > Yep, I think that makes sense if the filesystem or repair itself is
> > > > tripping over other corruptions that fail to keep it active for the
> > > > duration of the repair.
> > > 
> > > <nod>
> > > 
> > > > > > > > The blocks allocated for the btrees that we've begun to construct here
> > > > > > > > end up mapped in the rmapbt as we go, right? IIUC, that means we don't
> > > > > > > > necessarily have infinite retries to make sure this completes. IOW,
> > > > > > > > suppose that a first repair attempt finds just enough free space to
> > > > > > > > construct new trees, gets far enough along to consume most of that free
> > > > > > > > space and then crashes. Is it possible that a subsequent repair attempt
> > > > > > > > includes the btree blocks allocated during the previous failed repair
> > > > > > > > attempt in the sum of "old btree blocks" and determines we don't have
> > > > > > > > enough free space to repair?
> > > > > > > 
> > > > > > > Yes, that's a risk of running the free space repair.
> > > > > > > 
> > > > > > 
> > > > > > Can we improve on that? For example, are the rmapbt entries for the old
> > > > > > allocation btree blocks necessary once we commit the btree resets? If
> > > > > > not, could we remove those entries before we start tree reconstruction?
> > > > > > 
> > > > > > Alternatively, could we incorporate use of the old btree blocks? As it
> > > > > > is, we discover those blocks simply so we can free them at the end.
> > > > > > Perhaps we could free them sooner or find a more clever means to
> > > > > > reallocate directly from that in-core list? I guess we have to consider
> > > > > > whether they were really valid/sane btree blocks, but either way ISTM
> > > > > > that the old blocks list is essentially invalidated once we reset the
> > > > > > btrees.
> > > > > 
> > > > > Hmm, it's a little tricky to do that -- we could reap the old bnobt and
> > > > > cntbt blocks (in the old_allocbt_blocks bitmap) first, but if adding a
> > > > > record causes a btree split we'll pull blocks from the AGFL, and if
> > > > > there aren't enough blocks in the bnobt to fill the AGFL back up then
> > > > > fix_freelist won't succeed.  That complication is why it finds the
> > > > > longest extent in the unclaimed list and pushes that in first, then
> > > > > works on the rest of the extents.
> > > > > 
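
The ordering described above (longest extent first, then everything
else) can be sketched in userspace roughly as follows.  This is only an
illustration under stated assumptions: struct free_extent,
longest_extent() and insertion_order() are hypothetical names, a plain
array stands in for the list_head chain, and nothing here touches the
AGFL or the btrees.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a free-space record; not the kernel's struct. */
struct free_extent {
	uint32_t	bno;	/* AG block number */
	uint32_t	len;	/* length in blocks */
};

/* Return the index of the longest extent, or -1 if the array is empty. */
static int
longest_extent(const struct free_extent *ext, size_t nr)
{
	int	best = -1;
	size_t	i;

	for (i = 0; i < nr; i++) {
		if (best < 0 || ext[i].len > ext[best].len)
			best = (int)i;
	}
	return best;
}

/*
 * Write the order in which extents would be freed into order[]: the longest
 * one first, then the remainder in their original order.  Returns the number
 * of entries written.
 */
static size_t
insertion_order(const struct free_extent *ext, size_t nr, size_t *order)
{
	int	longest = longest_extent(ext, nr);
	size_t	n = 0;
	size_t	i;

	if (longest < 0)
		return 0;	/* nothing left over after the new roots */

	order[n++] = (size_t)longest;
	for (i = 0; i < nr; i++) {
		if (i != (size_t)longest)
			order[n++] = i;
	}
	return n;
}
```

With extents of length 2, 9 and 4, the second one goes in first, so the
free space btree already holds the largest run when a later insertion
needs to refill the AGFL.
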
> > > > 
> > > > Hmm, but doesn't a btree split require at least one full space btree
> > > > block per-level? In conjunction, the agfl minimum size requirement grows
> > > > with the height of the tree, which implies available free space..? I
> > > > could be missing something, perhaps we have to account for the rmapbt in
> > > > that case as well? Regardless...
> > > > 
> > > > > I suppose one could try to avoid ENOSPC by pushing that longest extent
> > > > > in first (since we know that won't trigger a split), then reap the old
> > > > > alloc btree blocks, and then add everything else back in...
> > > > > 
> > > > 
> > > > I think it would be reasonable to seed the btree with the longest record
> > > > or some fixed number of longest records (~1/2 a root block, for example)
> > > > before making actual use of the btrees to reap the old blocks. I think
> > > > then you'd only have a very short window of a single block leak on a
> > > > poorly timed power loss and repair retry sequence before you start
> > > > actually freeing originally used space (which in practice, I think
> > > > solves the problem).
> > > > 
> > > > Given that we're starting from empty, I wonder if another option may be
> > > > to over fill the agfl with old btree blocks or something. The first real
> > > > free should shift enough blocks back into the btrees to ensure the agfl
> > > > can be managed from that point forward, right? That may be more work
> > > > than it's worth though and/or a job for another patch. (FWIW, we also
> > > > have that NOSHRINK agfl fixup flag for userspace repair.)
> > > 
> > > Yes, I'll give that a try tomorrow, now that I've finished porting all
> > > the 4.19 stuff to xfsprogs. :)
> > > 
> > > Looping back to something we discussed earlier in this thread, I'd
> > > prefer to hold off on converting the list of already-freed extents to
> > > xfs_bitmap because the same problem exists in all the repair functions
> > > of having to store a large number of records for the rebuilt btree, and
> > > maybe there's some way to <cough> use pageable memory for that, since
> > > the access patterns for that are append, sort, and iterate; for those
> > > three uses we don't necessarily require all the records to be in memory
> > > all the time.  For the allocbt repair I expect the free space records to
> > > be far more numerous than the list of old bnobt/cntbt blocks.
> > > 
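
For reference, a store that only supports the three access patterns
named above (append, sort, iterate) might look like the sketch below.
It is a userspace illustration only: struct rec, struct rec_store and
the rec_store_*() helpers are hypothetical, and a realloc'd array stands
in for whatever pageable backing would eventually be used.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record type; the repair code discussed here uses linked lists. */
struct rec {
	uint32_t	bno;
	uint32_t	len;
};

/* Growable array that supports only append, sort, and in-order iteration. */
struct rec_store {
	struct rec	*recs;
	size_t		nr;
	size_t		cap;
};

/* Append one record, doubling the backing array when it is full. */
static int
rec_store_append(struct rec_store *rs, uint32_t bno, uint32_t len)
{
	if (rs->nr == rs->cap) {
		size_t		ncap = rs->cap ? rs->cap * 2 : 16;
		struct rec	*n = realloc(rs->recs, ncap * sizeof(*n));

		if (!n)
			return -1;
		rs->recs = n;
		rs->cap = ncap;
	}
	rs->recs[rs->nr].bno = bno;
	rs->recs[rs->nr].len = len;
	rs->nr++;
	return 0;
}

/* qsort comparator: order records by starting block. */
static int
rec_cmp(const void *a, const void *b)
{
	const struct rec *ra = a, *rb = b;

	return (ra->bno > rb->bno) - (ra->bno < rb->bno);
}

/* Sort all records by starting block. */
static void
rec_store_sort(struct rec_store *rs)
{
	if (rs->nr)
		qsort(rs->recs, rs->nr, sizeof(*rs->recs), rec_cmp);
}

/* Visit records in stored order, stopping at the first nonzero return. */
static int
rec_store_iterate(const struct rec_store *rs,
		  int (*fn)(const struct rec *, void *), void *priv)
{
	size_t	i;
	int	error;

	for (i = 0; i < rs->nr; i++) {
		error = fn(&rs->recs[i], priv);
		if (error)
			return error;
	}
	return 0;
}

/* Example callback: accumulate the total length into *priv. */
static int
rec_sum_len(const struct rec *r, void *priv)
{
	*(uint64_t *)priv += r->len;
	return 0;
}
```

Because callers never index into the middle of the store, the backing
array could in principle be swapped for something that does not keep
every record resident, without changing the three entry points.
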
> > 
> > Ok, it's fair enough that we'll probably want to find some kind of
> > generic, more efficient technique for handling this across the various
> > applicable repair algorithms.
> > 
> > One other high level thing that crossed my mind with regard to the
> > general btree reconstruction algorithms is whether we need to build up
> > this kind of central record list at all. For example, rather than slurp
> > up the entire list of btree records in-core, sort it and dump it back
> > out, could we take advantage of the fact that our existing on-disk
> > structure insertion mechanisms already handle out of order records
> > (simply stated, an extent free knows how to insert the associated record
> > at the right place in the space btrees)? For example, suppose we reset
> > the existing btrees first, then scanned the rmapbt and repopulated the
> > new btrees as records are discovered..?
> 
> I tried that in an earlier draft of the bnobt repair function.  The
> biggest problem with inserting as we go is dealing with the inevitable
> transaction rolls (right now we roll after every record insertion to avoid
> playing games with guessing how much reservation is left).  Btree
> cursor state can't survive transaction rolls because the transaction
> commit releases all the buffers that aren't bhold'en, and we can't bhold
> that many buffers across a _defer_finish.
> 

Ok, interesting.

Where do we need to run an xfs_defer_finish() during the reconstruction
sequence, btw? I thought that would only run on final commit as opposed
to intermediate rolls. We could just try and make the automatic buffer
relogging list a dynamic allocation if there are enough held buffers in
the transaction.

> So, that early draft spent a lot of time tearing down and reconstructing
> rmapbt cursors since the standard _btree_query_all isn't suited to that
> kind of usage.  It was easily twice as slow on a RAM-backed disk just
> from the rmap cursor overhead and much more complex, so I rewrote it to
> be simpler.  I also have a slight preference for not touching anything
> until we're absolutely sure we have all the data we need to repair the
> structure.
> 

Yes, I think that is sane in principle. I'm just a bit concerned about
how reliable that xfs_repair-like approach will be in the kernel longer
term, particularly once we start having to deal with large filesystems
and limited or contended memory, etc. We already have xfs_repair users
that need to tweak settings because there isn't enough memory available
to repair the fs. Granted that is for fs-wide repairs and the flipside
is that we know a single AG can only be up to 1TB. It's certainly
possible that putting some persistent backing behind the in-core data is
enough to resolve the problem (and the current approach is certainly
reasonable enough to me for the initial implementation).

bjoin limitations aside, I wonder if a cursor roll mechanism that held
all of the cursor buffers, rolled the transaction and then rejoined all
said buffers would help us get around that. (Not sure I follow the early
prototype behavior, but it sounds like we had to restart the rmapbt
lookup over and over...).

Another caveat with that approach may be that I think we'd need to be
sure that the reconstruction operation doesn't ever need to update the
rmapbt while we're mid walk of the latter. That may be an issue for
inode btree reconstruction, for example, since it looks like inobt block
allocation requires rmapbt updates. We'd probably need some way to share
(or invalidate) a cursor across different contexts to deal with that.

> For other repair functions (like the data/attr fork repairs) we have to
> scan all the rmapbts for extents, and I'd prefer to lock those AGs only
> for as long as necessary to extract the extents we want.
> 
> > The obvious problem is that we still have some checks that allow the
> > whole repair operation to bail out before we determine whether we can
> > start to rebuild the on-disk btrees. These are things like making sure
> > we can actually read the associated rmapbt blocks (i.e., no read errors
> > or verifier failures), basic record sanity checks, etc. But ISTM that
> > isn't anything we couldn't get around with a multi-pass implementation.
> > Secondary issues might be things like no longer being able to easily
> > insert the longest free extent range(s) first (meaning we'd have to
> > stuff the agfl with old btree blocks or figure out some other approach).
> 
> Well, you could scan the rmapbt twice -- once to find the longest
> record, then again to do the actual insertion.
> 

Yep, that's what I meant by multi-pass.

> > BTW, isn't the typical scrub sequence already multi-pass by virtue of
> > the xfs_scrub_metadata() implementation? I'm wondering if the ->scrub()
> > callout could not only detect corruption, but validate whether repair
> > (if requested) is possible based on the kind of checks that are
> > currently in the repair side rmapbt walkers. Thoughts?
> 
> Yes, scrub basically validates that for us now, with the notable
> exception of the notorious rmapbt scrubber, which doesn't
> cross-reference with inode block mappings because that would be a
> locking nightmare.
> 
> > Are there future
> > changes that are better supported by an in-core tracking structure in
> > general (assuming we'll eventually replace the linked lists with
> > something more efficient) as opposed to attempting to optimize out the
> > need for that tracking at all?
> 
> Well, I was thinking that we could just allocate a memfd (or a file on
> the same xfs once we have AG offlining) and store the records in there.
> That saves us the list_head overhead and potentially enables access to a
> lot more storage than pinning things in RAM.
> 

Would using the same fs mean we have to store the repair data in a
separate AG, or somehow locate/use free space in the target AG? I
presume either way we'd have to ensure that AG is either consistent or
locked out from outside I/O. If we have the total record count we can
preallocate the file and hope there is no such other free space
corruption or something that would allow some other task to mess with
our blocks. I'm a little skeptical overall on relying on a corrupted
filesystem to store repair data, but perhaps there are ways to mitigate
the risks.

I'm not familiar with memfd. The manpage suggests it's ram backed, is it
swappable or something? If so, that sounds like a reasonable option provided
the swap space requirement can be made clear to users and the failure
characteristics aren't more severe than for userspace. An online repair
that puts the broader system at risk of OOM as opposed to predictably
failing gracefully may not be the most useful tool.

Brian

> --D
> 
> > Brian
> > 
> > > --D
> > > 
> > > > Brian
> > > > 
> > > > > --D
> > > > > 
> > > > > > Brian
> > > > > > 
> > > > > > > > > +
> > > > > > > > > +done:
> > > > > > > > > +	/* Free all the OWN_AG blocks that are not in the rmapbt/agfl. */
> > > > > > > > > +	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_AG);
> > > > > > > > > +	return xrep_reap_extents(sc, old_allocbt_blocks, &oinfo,
> > > > > > > > > +			XFS_AG_RESV_NONE);
> > > > > > > > > +}
> > > > > > > > > +
> > > > > > > > ...
> > > > > > > > > diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c
> > > > > > > > > index 0ed68379e551..82f99633a597 100644
> > > > > > > > > --- a/fs/xfs/xfs_extent_busy.c
> > > > > > > > > +++ b/fs/xfs/xfs_extent_busy.c
> > > > > > > > > @@ -657,3 +657,17 @@ xfs_extent_busy_ag_cmp(
> > > > > > > > >  		diff = b1->bno - b2->bno;
> > > > > > > > >  	return diff;
> > > > > > > > >  }
> > > > > > > > > +
> > > > > > > > > +/* Are there any busy extents in this AG? */
> > > > > > > > > +bool
> > > > > > > > > +xfs_extent_busy_list_empty(
> > > > > > > > > +	struct xfs_perag	*pag)
> > > > > > > > > +{
> > > > > > > > > +	spin_lock(&pag->pagb_lock);
> > > > > > > > > +	if (pag->pagb_tree.rb_node) {
> > > > > > > > 
> > > > > > > > RB_EMPTY_ROOT()?
> > > > > > > 
> > > > > > > Good suggestion, thank you!
> > > > > > > 
> > > > > > > --D
> > > > > > > 
> > > > > > > > Brian
> > > > > > > > 
> > > > > > > > > +		spin_unlock(&pag->pagb_lock);
> > > > > > > > > +		return false;
> > > > > > > > > +	}
> > > > > > > > > +	spin_unlock(&pag->pagb_lock);
> > > > > > > > > +	return true;
> > > > > > > > > +}
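
A possible shape for the helper once the RB_EMPTY_ROOT() suggestion is
applied, sketched in userspace.  The rb_node/rb_root definitions and the
macro below are minimal stand-ins for the kernel's <linux/rbtree.h>,
struct perag_stub and busy_list_empty() are hypothetical names, and the
locking is reduced to comments, so treat this as an illustration rather
than the revised patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal userspace stand-ins for the kernel rbtree types. */
struct rb_node { struct rb_node *rb_left, *rb_right; };
struct rb_root { struct rb_node *rb_node; };
#define RB_EMPTY_ROOT(root)	((root)->rb_node == NULL)

/* Hypothetical cut-down perag; the real one also carries pagb_lock. */
struct perag_stub {
	struct rb_root	pagb_tree;
};

/* Are there any busy extents in this AG? */
static bool
busy_list_empty(struct perag_stub *pag)
{
	bool	res;

	/* the kernel version would take pagb_lock here */
	res = RB_EMPTY_ROOT(&pag->pagb_tree);
	/* ...and drop it here */
	return res;
}
```

Using the macro leaves a single unlock path instead of two.
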
> > > > > > > > > diff --git a/fs/xfs/xfs_extent_busy.h b/fs/xfs/xfs_extent_busy.h
> > > > > > > > > index 990ab3891971..2f8c73c712c6 100644
> > > > > > > > > --- a/fs/xfs/xfs_extent_busy.h
> > > > > > > > > +++ b/fs/xfs/xfs_extent_busy.h
> > > > > > > > > @@ -65,4 +65,6 @@ static inline void xfs_extent_busy_sort(struct list_head *list)
> > > > > > > > >  	list_sort(NULL, list, xfs_extent_busy_ag_cmp);
> > > > > > > > >  }
> > > > > > > > >  
> > > > > > > > > +bool xfs_extent_busy_list_empty(struct xfs_perag *pag);
> > > > > > > > > +
> > > > > > > > >  #endif /* __XFS_EXTENT_BUSY_H__ */
> > > > > > > > > 
> > > > > > > > > --
> > > > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > > > > > > the body of a message to majordomo@vger.kernel.org
> > > > > > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 03/14] xfs: repair the AGFL
  2018-07-31 15:10       ` Brian Foster
@ 2018-08-07 22:02         ` Darrick J. Wong
  2018-08-08 12:09           ` Brian Foster
  0 siblings, 1 reply; 53+ messages in thread
From: Darrick J. Wong @ 2018-08-07 22:02 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, david, allison.henderson

On Tue, Jul 31, 2018 at 11:10:00AM -0400, Brian Foster wrote:
> On Mon, Jul 30, 2018 at 10:22:16AM -0700, Darrick J. Wong wrote:
> > On Mon, Jul 30, 2018 at 12:25:24PM -0400, Brian Foster wrote:
> > > On Sun, Jul 29, 2018 at 10:48:08PM -0700, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > 
> > > > Repair the AGFL from the rmap data.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > ---
> > > 
> > > FWIW, I tried tweaking a couple agfl values via xfs_db and xfs_scrub
> > > seems to always dump a cross-referencing failed error and not want to
> > > deal with it. Expected? Is there a good way to unit test some of this
> > > stuff with simple/localized corruptions?
> > 
> > I usually pick one of the corruptions from xfs/355...
> > 
> > $ SCRATCH_XFS_LIST_FUZZ_VERBS=random \
> > SCRATCH_XFS_LIST_METADATA_FIELDS=somefield \
> > ./check xfs/355
> > 
> 
> It looks like similar behavior if I do that, but tbh I'm not sure if I'm
> using this correctly. E.g., if I do:

<urk> Sorry, I forgot to reply to this...

> # SCRATCH_XFS_LIST_FUZZ_VERBS=random SCRATCH_XFS_LIST_METADATA_FIELDS=bno[0] ./check xfs/355
> FSTYP         -- xfs (debug)
> PLATFORM      -- Linux/x86_64 localhost 4.18.0-rc4+
> MKFS_OPTIONS  -- -f -mrmapbt=1,reflink=1 /dev/mapper/test-scratch
> MOUNT_OPTIONS -- -o context=system_u:object_r:root_t:s0 /dev/mapper/test-scratch /mnt/scratch
> 
> xfs/355 - output mismatch (see /root/xfstests-dev/results//xfs/355.out.bad)
>     ...
>     (Run 'diff -u tests/xfs/355.out /root/xfstests-dev/results//xfs/355.out.bad'  to see the entire diff)
> Ran: xfs/355
> Failures: xfs/355
> Failed 1 of 1 tests
> # diff -u tests/xfs/355.out /root/xfstests-dev/results//xfs/355.out.bad
> --- tests/xfs/355.out   2018-07-25 07:47:23.739575416 -0400
> +++ /root/xfstests-dev/results//xfs/355.out.bad 2018-07-31 10:55:18.466178944 -0400
> @@ -1,6 +1,10 @@
>  QA output created by 355
>  Format and populate
>  Fuzz AGFL
> +online re-scrub (1) with bno[0] = random.
>  Done fuzzing AGFL
>  Fuzz AGFL flfirst
> +offline re-scrub (1) with bno[14] = random.
> +online re-scrub (1) with bno[14] = random.
> +re-repair failed (1) with bno[14] = random.
>  Done fuzzing AGFL flfirst
> 
> If I run xfs_scrub directly on the scratch mount after the test I get a
> stream of inode cross-referencing errors and it doesn't seem to fix
> anything up.

Hmm.  What is your xfsprogs head?  I think Eric committed the patches to
xfs_scrub to enable repairs in v4.18.0-rc1... which git says happened on
8/1.

--D

> 
> Brian
> 
> > > Otherwise this looks sane, a couple comments..
> > > 
> > > >  fs/xfs/scrub/agheader_repair.c |  276 ++++++++++++++++++++++++++++++++++++++++
> > > >  fs/xfs/scrub/bitmap.c          |   92 +++++++++++++
> > > >  fs/xfs/scrub/bitmap.h          |    4 +
> > > >  fs/xfs/scrub/scrub.c           |    2 
> > > >  4 files changed, 373 insertions(+), 1 deletion(-)
> > > > 
> > > > 
> > > > diff --git a/fs/xfs/scrub/agheader_repair.c b/fs/xfs/scrub/agheader_repair.c
> > > > index 4842fc598c9b..bfef066c87c3 100644
> > > > --- a/fs/xfs/scrub/agheader_repair.c
> > > > +++ b/fs/xfs/scrub/agheader_repair.c
> > > > @@ -424,3 +424,279 @@ xrep_agf(
> > > >  	memcpy(agf, &old_agf, sizeof(old_agf));
> > > >  	return error;
> > > >  }
> > > > +
> > > ...
> > > > +/* Write out a totally new AGFL. */
> > > > +STATIC void
> > > > +xrep_agfl_init_header(
> > > > +	struct xfs_scrub	*sc,
> > > > +	struct xfs_buf		*agfl_bp,
> > > > +	struct xfs_bitmap	*agfl_extents,
> > > > +	xfs_agblock_t		flcount)
> > > > +{
> > > > +	struct xfs_mount	*mp = sc->mp;
> > > > +	__be32			*agfl_bno;
> > > > +	struct xfs_bitmap_range	*br;
> > > > +	struct xfs_bitmap_range	*n;
> > > > +	struct xfs_agfl		*agfl;
> > > > +	xfs_agblock_t		agbno;
> > > > +	unsigned int		fl_off;
> > > > +
> > > > +	ASSERT(flcount <= xfs_agfl_size(mp));
> > > > +
> > > > +	/* Start rewriting the header. */
> > > > +	agfl = XFS_BUF_TO_AGFL(agfl_bp);
> > > > +	memset(agfl, 0xFF, BBTOB(agfl_bp->b_length));
> > > 
> > > What's the purpose behind 0xFF? Related to NULLAGBLOCK/NULLCOMMITLSN..?
> > 
> > Yes, it prepopulates the AGFL bno[] array with NULLAGBLOCK, then writes
> > in the header fields.
> > 
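
A small userspace check of why the 0xFF fill works: every byte of
NULLAGBLOCK is 0xFF, so the pattern reads back as NULLAGBLOCK in any
byte order and no cpu_to_be32() pass is needed over the empty slots.
agfl_slots_init() is a hypothetical helper name used only for this
sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NULLAGBLOCK	((uint32_t)-1)	/* all-ones "no block" value */

/* Fill an array of 32-bit block slots with the "no block" pattern. */
static void
agfl_slots_init(uint32_t *slots, size_t nr)
{
	memset(slots, 0xFF, nr * sizeof(*slots));
}
```
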
> > > > +	agfl->agfl_magicnum = cpu_to_be32(XFS_AGFL_MAGIC);
> > > > +	agfl->agfl_seqno = cpu_to_be32(sc->sa.agno);
> > > > +	uuid_copy(&agfl->agfl_uuid, &mp->m_sb.sb_meta_uuid);
> > > > +
> > > > +	/*
> > > > +	 * Fill the AGFL with the remaining blocks.  If agfl_extents has more
> > > > +	 * blocks than fit in the AGFL, they will be freed in a subsequent
> > > > +	 * step.
> > > > +	 */
> > > > +	fl_off = 0;
> > > > +	agfl_bno = XFS_BUF_TO_AGFL_BNO(mp, agfl_bp);
> > > > +	for_each_xfs_bitmap_extent(br, n, agfl_extents) {
> > > > +		agbno = XFS_FSB_TO_AGBNO(mp, br->start);
> > > > +
> > > > +		trace_xrep_agfl_insert(mp, sc->sa.agno, agbno, br->len);
> > > > +
> > > > +		while (br->len > 0 && fl_off < flcount) {
> > > > +			agfl_bno[fl_off] = cpu_to_be32(agbno);
> > > > +			fl_off++;
> > > > +			agbno++;
> > > 
> > > 			/* bump br so we don't reap blocks we've used */
> > > 
> > > (i.e., took me a sec to realize why we bother with ->start)
> > > 
> > > > +			br->start++;
> > > > +			br->len--;
> > > > +		}
> > > > +
> > > > +		if (br->len)
> > > > +			break;
> > > > +		list_del(&br->list);
> > > > +		kmem_free(br);
> > > > +	}
> > > > +
> > > > +	/* Write new AGFL to disk. */
> > > > +	xfs_trans_buf_set_type(sc->tp, agfl_bp, XFS_BLFT_AGFL_BUF);
> > > > +	xfs_trans_log_buf(sc->tp, agfl_bp, 0, BBTOB(agfl_bp->b_length) - 1);
> > > > +}
> > > > +
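
The fill loop above, and the reason it bumps ->start as it goes, can be
modelled in userspace as below.  This is a sketch under assumptions:
struct range is a hypothetical stand-in for struct xfs_bitmap_range
without the list linkage, fill_slots() is not a kernel function, and the
endian conversion and tracing are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct xfs_bitmap_range, minus the list linkage. */
struct range {
	uint64_t	start;
	uint64_t	len;
};

/*
 * Copy up to flcount block numbers out of ranges[] into slots[].  Each block
 * taken is also trimmed off the front of its range, so whatever remains in
 * ranges[] afterwards is exactly the set of blocks that was NOT used.
 * Returns the number of slots filled.
 */
static size_t
fill_slots(struct range *ranges, size_t nr_ranges,
	   uint32_t *slots, size_t flcount)
{
	size_t	fl_off = 0;
	size_t	i;

	for (i = 0; i < nr_ranges; i++) {
		struct range	*br = &ranges[i];

		while (br->len > 0 && fl_off < flcount) {
			slots[fl_off++] = (uint32_t)br->start;
			br->start++;	/* bump so the leftovers exclude this block */
			br->len--;
		}
		if (br->len)
			break;		/* slots are full; leave the rest alone */
	}
	return fl_off;
}
```

Filling five slots from the ranges (100,3), (200,4), (300,2) uses blocks
100-102 and 200-201, and leaves (202,2) and (300,2) behind for the later
reap step.
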
> > > ...
> > > > diff --git a/fs/xfs/scrub/bitmap.c b/fs/xfs/scrub/bitmap.c
> > > > index c770e2d0b6aa..fdadc9e1dc49 100644
> > > > --- a/fs/xfs/scrub/bitmap.c
> > > > +++ b/fs/xfs/scrub/bitmap.c
> > > > @@ -9,6 +9,7 @@
> > > >  #include "xfs_format.h"
> > > >  #include "xfs_trans_resv.h"
> > > >  #include "xfs_mount.h"
> > > > +#include "xfs_btree.h"
> > > >  #include "scrub/xfs_scrub.h"
> > > >  #include "scrub/scrub.h"
> > > >  #include "scrub/common.h"
> > > > @@ -209,3 +210,94 @@ xfs_bitmap_disunion(
> > > >  }
> > > >  #undef LEFT_ALIGNED
> > > >  #undef RIGHT_ALIGNED
> > > > +
> > > > +/*
> > > > + * Record all btree blocks seen while iterating all records of a btree.
> > > > + *
> > > > + * We know that the btree query_all function starts at the left edge and walks
> > > > + * towards the right edge of the tree.  Therefore, we know that we can walk up
> > > > + * the btree cursor towards the root; if the pointer for a given level points
> > > > + * to the first record/key in that block, we haven't seen this block before;
> > > > + * and therefore we need to remember that we saw this block in the btree.
> > > > + *
> > > > + * So if our btree is:
> > > > + *
> > > > + *    4
> > > > + *  / | \
> > > > + * 1  2  3
> > > > + *
> > > > + * Pretend for this example that each leaf block has 100 btree records.  For
> > > > + * the first btree record, we'll observe that bc_ptrs[0] == 1, so we record
> > > > + * that we saw block 1.  Then we observe that bc_ptrs[1] == 1, so we record
> > > > + * block 4.  The list is [1, 4].
> > > > + *
> > > > + * For the second btree record, we see that bc_ptrs[0] == 2, so we exit the
> > > > + * loop.  The list remains [1, 4].
> > > > + *
> > > > + * For the 101st btree record, we've moved onto leaf block 2.  Now
> > > > + * bc_ptrs[0] == 1 again, so we record that we saw block 2.  We see that
> > > > + * bc_ptrs[1] == 2, so we exit the loop.  The list is now [1, 4, 2].
> > > > + *
> > > > + * For the 102nd record, bc_ptrs[0] == 2, so we continue.
> > > > + *
> > > > + * For the 201st record, we've moved on to leaf block 3.  bc_ptrs[0] == 1, so
> > > > + * we add 3 to the list.  Now it is [1, 4, 2, 3].
> > > > + *
> > > > + * For the 300th record we just exit, with the list being [1, 4, 2, 3].
> > > > + */
> > > > +
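
The worked example in the comment above can be replayed with a small
userspace simulation.  It assumes the same toy tree (leaf blocks 1-3
with 100 records each under root block 4); collect_blocks() and the
ptrs[]/blocks[] arrays are hypothetical stand-ins for the cursor's
bc_ptrs and buffers, not kernel code.

```c
#include <assert.h>
#include <stddef.h>

#define RECS_PER_LEAF	100
#define NR_LEAVES	3
#define ROOT_BLOCK	4	/* leaves are blocks 1..3, as in the comment */
#define NR_LEVELS	2

/*
 * Walk every record left to right, as query_all would, and append to seen[]
 * each block whose cursor pointer is 1 at that moment.  Returns the number
 * of blocks recorded.
 */
static size_t
collect_blocks(int *seen, size_t max)
{
	size_t	nr = 0;
	int	rec;

	for (rec = 0; rec < NR_LEAVES * RECS_PER_LEAF; rec++) {
		int	leaf = rec / RECS_PER_LEAF;
		/* 1-based positions within the leaf and within the root */
		int	ptrs[NR_LEVELS] = { rec % RECS_PER_LEAF + 1, leaf + 1 };
		int	blocks[NR_LEVELS] = { leaf + 1, ROOT_BLOCK };
		int	i;

		/* climb towards the root only while we sit on a first entry */
		for (i = 0; i < NR_LEVELS && ptrs[i] == 1; i++) {
			assert(nr < max);
			seen[nr++] = blocks[i];
		}
	}
	return nr;
}
```

Each block is recorded exactly once, in the order 1, 4, 2, 3, which
matches the list built up step by step in the comment.
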
> > > > +/*
> > > > + * Record all the buffers pointed to by the btree cursor.  Callers already
> > > > + * engaged in a btree walk should call this function to capture the list of
> > > > + * blocks going from the leaf towards the root.
> > > > + */
> > > > +int
> > > > +xfs_bitmap_set_btcur_path(
> > > > +	struct xfs_bitmap	*bitmap,
> > > > +	struct xfs_btree_cur	*cur)
> > > > +{
> > > > +	struct xfs_buf		*bp;
> > > > +	xfs_fsblock_t		fsb;
> > > > +	int			i;
> > > > +	int			error;
> > > > +
> > > > +	for (i = 0; i < cur->bc_nlevels && cur->bc_ptrs[i] == 1; i++) {
> > > > +		xfs_btree_get_block(cur, i, &bp);
> > > > +		if (!bp)
> > > > +			continue;
> > > > +		fsb = XFS_DADDR_TO_FSB(cur->bc_mp, bp->b_bn);
> > > > +		error = xfs_bitmap_set(bitmap, fsb, 1);
> > > 
> > > Thanks for the comment. It helps explain the bc_ptrs == 1 check above,
> > > but also highlights that xfs_bitmap_set() essentially allocates entries
> > > for duplicate values if they exist. Is this handled by the broader
> > > mechanism, for example, if the rmapbt was corrupted to have multiple
> > > entries for a particular unused OWN_AG block? Or could we end up leaking
> > > that corruption over to the agfl?
> > 
> > Right now we're totally dependent on the rmapbt being sane to rebuild
> > the space metadata.
> > 
> > > I also wonder a bit about memory consumption on filesystems with large
> > > metadata footprints. We essentially have to allocate one of these for
> > > every allocation btree block before we can do the disunion and locate
> > > the agfl-appropriate blocks. If we had a more lookup friendly structure,
> > > perhaps this could be optimized by filtering out bnobt/cntbt blocks
> > > during the associated btree walks..?
> > > 
> > > Have you thought about reusing something like the new in-core extent
> > > tree mechanism as a pure in-memory extent store? It's certainly not
> > > worth reworking something like that right now, but I wonder if we could
> > > save memory via the denser format (and perhaps benefit from code
> > > flexibility, reuse, etc.).
> > 
> > Yes, I was thinking about refactoring the iext btree into a more generic
> > in-core index with 64-bit key so that I could adapt xfs_bitmap to use
> > it.  In the longer term I would /also/ like to use xfs_bitmap to detect
> > xfs_buf cache aliasing when multi-block buffers are in use, but that's
> > further off. :)
> > 
> > As for the memory-intensive record lists in all the btree rebuilders, I
> > have some ideas around that too -- either find a way to build an
> > alternate btree and switch the roots over, or (once we gain the ability
> > to mark an AG unavailable for new allocations) allocate an unlinked
> > inode, store the records in the page cache pages for the file, and
> > release it when we're done.
> > 
> > But, that can wait until I've gotten more of this merged, or get bored.
> > :)
> > 
> > --D
> > 
> > > Brian
> > > 
> > > > +		if (error)
> > > > +			return error;
> > > > +	}
> > > > +
> > > > +	return 0;
> > > > +}
> > > > +
> > > > +/* Collect a btree's block in the bitmap. */
> > > > +STATIC int
> > > > +xfs_bitmap_collect_btblock(
> > > > +	struct xfs_btree_cur	*cur,
> > > > +	int			level,
> > > > +	void			*priv)
> > > > +{
> > > > +	struct xfs_bitmap	*bitmap = priv;
> > > > +	struct xfs_buf		*bp;
> > > > +	xfs_fsblock_t		fsbno;
> > > > +
> > > > +	xfs_btree_get_block(cur, level, &bp);
> > > > +	if (!bp)
> > > > +		return 0;
> > > > +
> > > > +	fsbno = XFS_DADDR_TO_FSB(cur->bc_mp, bp->b_bn);
> > > > +	return xfs_bitmap_set(bitmap, fsbno, 1);
> > > > +}
> > > > +
> > > > +/* Walk the btree and mark the bitmap wherever a btree block is found. */
> > > > +int
> > > > +xfs_bitmap_set_btblocks(
> > > > +	struct xfs_bitmap	*bitmap,
> > > > +	struct xfs_btree_cur	*cur)
> > > > +{
> > > > +	return xfs_btree_visit_blocks(cur, xfs_bitmap_collect_btblock, bitmap);
> > > > +}
> > > > diff --git a/fs/xfs/scrub/bitmap.h b/fs/xfs/scrub/bitmap.h
> > > > index dad652ee9177..ae8ecbce6fa6 100644
> > > > --- a/fs/xfs/scrub/bitmap.h
> > > > +++ b/fs/xfs/scrub/bitmap.h
> > > > @@ -28,5 +28,9 @@ void xfs_bitmap_destroy(struct xfs_bitmap *bitmap);
> > > >  
> > > >  int xfs_bitmap_set(struct xfs_bitmap *bitmap, uint64_t start, uint64_t len);
> > > >  int xfs_bitmap_disunion(struct xfs_bitmap *bitmap, struct xfs_bitmap *sub);
> > > > +int xfs_bitmap_set_btcur_path(struct xfs_bitmap *bitmap,
> > > > +		struct xfs_btree_cur *cur);
> > > > +int xfs_bitmap_set_btblocks(struct xfs_bitmap *bitmap,
> > > > +		struct xfs_btree_cur *cur);
> > > >  
> > > >  #endif	/* __XFS_SCRUB_BITMAP_H__ */
> > > > diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
> > > > index 1e8a17c8e2b9..2670f4cf62f4 100644
> > > > --- a/fs/xfs/scrub/scrub.c
> > > > +++ b/fs/xfs/scrub/scrub.c
> > > > @@ -220,7 +220,7 @@ static const struct xchk_meta_ops meta_scrub_ops[] = {
> > > >  		.type	= ST_PERAG,
> > > >  		.setup	= xchk_setup_fs,
> > > >  		.scrub	= xchk_agfl,
> > > > -		.repair	= xrep_notsupported,
> > > > +		.repair	= xrep_agfl,
> > > >  	},
> > > >  	[XFS_SCRUB_TYPE_AGI] = {	/* agi */
> > > >  		.type	= ST_PERAG,
> > > > 
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > the body of a message to majordomo@vger.kernel.org
> > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 05/14] xfs: repair free space btrees
  2018-08-03 10:49                   ` Brian Foster
@ 2018-08-07 23:34                     ` Darrick J. Wong
  2018-08-08 12:29                       ` Brian Foster
  0 siblings, 1 reply; 53+ messages in thread
From: Darrick J. Wong @ 2018-08-07 23:34 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, david, allison.henderson

On Fri, Aug 03, 2018 at 06:49:40AM -0400, Brian Foster wrote:
> On Thu, Aug 02, 2018 at 12:22:05PM -0700, Darrick J. Wong wrote:
> > On Thu, Aug 02, 2018 at 09:48:24AM -0400, Brian Foster wrote:
> > > On Wed, Aug 01, 2018 at 11:28:45PM -0700, Darrick J. Wong wrote:
> > > > On Wed, Aug 01, 2018 at 02:39:20PM -0400, Brian Foster wrote:
> > > > > On Wed, Aug 01, 2018 at 09:23:16AM -0700, Darrick J. Wong wrote:
> > > > > > On Wed, Aug 01, 2018 at 07:54:09AM -0400, Brian Foster wrote:
> > > > > > > On Tue, Jul 31, 2018 at 03:01:25PM -0700, Darrick J. Wong wrote:
> > > > > > > > On Tue, Jul 31, 2018 at 01:47:23PM -0400, Brian Foster wrote:
> > > > > > > > > On Sun, Jul 29, 2018 at 10:48:21PM -0700, Darrick J. Wong wrote:
> > > > > > > > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > > > > > > 
> > > > > > > > > > Rebuild the free space btrees from the gaps in the rmap btree.
> > > > > > > > > > 
> > > > > > > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > > > > > > ---
> > > > > > > > > >  fs/xfs/Makefile             |    1 
> > > > > > > > > >  fs/xfs/scrub/alloc.c        |    1 
> > > > > > > > > >  fs/xfs/scrub/alloc_repair.c |  581 +++++++++++++++++++++++++++++++++++++++++++
> > > > > > > > > >  fs/xfs/scrub/common.c       |    8 +
> > > > > > > > > >  fs/xfs/scrub/repair.h       |    2 
> > > > > > > > > >  fs/xfs/scrub/scrub.c        |    4 
> > > > > > > > > >  fs/xfs/scrub/trace.h        |    2 
> > > > > > > > > >  fs/xfs/xfs_extent_busy.c    |   14 +
> > > > > > > > > >  fs/xfs/xfs_extent_busy.h    |    2 
> > > > > > > > > >  9 files changed, 610 insertions(+), 5 deletions(-)
> > > > > > > > > >  create mode 100644 fs/xfs/scrub/alloc_repair.c
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > ...
> > > > > > > > > > diff --git a/fs/xfs/scrub/alloc_repair.c b/fs/xfs/scrub/alloc_repair.c
> > > > > > > > > > new file mode 100644
> > > > > > > > > > index 000000000000..b228c2906de2
> > > > > > > > > > --- /dev/null
> > > > > > > > > > +++ b/fs/xfs/scrub/alloc_repair.c
> > > > > > > > > > @@ -0,0 +1,581 @@
> > > > > ...
> > > > > > > > > > +
> > > > > > > > > > +/*
> > > > > > > > > > + * Make our new freespace btree roots permanent so that we can start freeing
> > > > > > > > > > + * unused space back into the AG.
> > > > > > > > > > + */
> > > > > > > > > > +STATIC int
> > > > > > > > > > +xrep_abt_commit_new(
> > > > > > > > > > +	struct xfs_scrub	*sc,
> > > > > > > > > > +	struct xfs_bitmap	*old_allocbt_blocks,
> > > > > > > > > > +	int			log_flags)
> > > > > > > > > > +{
> > > > > > > > > > +	int			error;
> > > > > > > > > > +
> > > > > > > > > > +	xfs_alloc_log_agf(sc->tp, sc->sa.agf_bp, log_flags);
> > > > > > > > > > +
> > > > > > > > > > +	/* Invalidate the old freespace btree blocks and commit. */
> > > > > > > > > > +	error = xrep_invalidate_blocks(sc, old_allocbt_blocks);
> > > > > > > > > > +	if (error)
> > > > > > > > > > +		return error;
> > > > > > > > > 
> > > > > > > > > It looks like the above invalidation all happens in the same
> > > > > > > > > transaction. Those aren't logging buffer data or anything, but any idea
> > > > > > > > > how many log formats we can get away with in this single transaction?
> > > > > > > > 
> > > > > > > > Hm... well, on my computer a log format is ~88 bytes.  Assuming 4K
> > > > > > > > blocks, the max AG size of 1TB, maximum free space fragmentation, and
> > > > > > > > two btrees, the tree could be up to ~270 million records.  Assuming ~505
> > > > > > > > records per block, that's ... ~531,000 leaf blocks and ~1100 node blocks
> > > > > > > > for both btrees.  If we invalidate both, that's ~46M of RAM?
> > > > > > > > 
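For reference, the estimate above can be reproduced with a few lines of
arithmetic.  This is a minimal userspace sketch, not kernel code: the
function names are hypothetical, the 88-byte, 4K, 1TB and ~505 figures
are the ones quoted in the message, "maximum fragmentation" is assumed
to mean that every other block is free, and node blocks are assumed to
hold the same ~505 entries as leaf blocks.

```c
#include <stdint.h>

/* Figures quoted in the message above; treat them as ballpark values. */
#define AG_BYTES		(1ULL << 40)	/* 1TB maximum AG size */
#define BLOCK_BYTES		4096ULL		/* 4K filesystem blocks */
#define RECS_PER_BLOCK		505ULL		/* approx. entries per btree block */
#define LOG_FORMAT_BYTES	88ULL		/* approx. size of one log format */

/* Integer division, rounding up. */
static uint64_t
div_round_up(uint64_t n, uint64_t d)
{
	return (n + d - 1) / d;
}

/*
 * Count the blocks of one btree holding nrecs records.  Leaf and node
 * blocks are assumed to have the same fanout, which is a simplification.
 */
uint64_t
btree_blocks(uint64_t nrecs, uint64_t *leaves, uint64_t *nodes)
{
	uint64_t	level = div_round_up(nrecs, RECS_PER_BLOCK);

	*leaves = level;
	*nodes = 0;
	while (level > 1) {
		level = div_round_up(level, RECS_PER_BLOCK);
		*nodes += level;
	}
	return *leaves + *nodes;
}

/*
 * Bytes of log formats needed to invalidate both free space btrees in
 * the assumed worst case: one free extent for every two blocks, with
 * each extent recorded once in the bnobt and once in the cntbt.
 */
uint64_t
worst_case_invalidation_bytes(void)
{
	uint64_t	recs_per_tree = (AG_BYTES / BLOCK_BYTES) / 2;
	uint64_t	leaves, nodes;

	return 2 * btree_blocks(recs_per_tree, &leaves, &nodes) *
			LOG_FORMAT_BYTES;
}
```

Under these assumptions each btree has 265,778 leaf blocks and 530 node
blocks, so both trees together come to 532,616 blocks and 46,870,208
bytes of log formats, in line with the ~531,000, ~1100 and ~46M figures
quoted above.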
> > > > > > > 
> > > > > > > I was thinking more about transaction reservation than RAM. It may not
> > > > > > 
> > > > > > Hmm.  tr_itruncate is ~650K on my 2TB SSD, assuming 88 bytes per, that's
> > > > > > about ... ~7300 log format items?  Not a lot, maybe it should roll the
> > > > > > transaction every 1000 invalidations or so...
> > > > > > 
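The reservation arithmetic and the "roll every N invalidations" idea
above can be modelled in plain C.  This is a sketch under stated
assumptions rather than the kernel implementation: max_log_formats()
and count_rolls() are hypothetical names, the 88-byte and ~650K figures
come from the message above, and bumping a counter stands in for rolling
the transaction.

```c
#include <stdint.h>

/* How many log format items fit in a reservation of resv_bytes? */
uint64_t
max_log_formats(uint64_t resv_bytes, uint64_t format_bytes)
{
	return resv_bytes / format_bytes;
}

/*
 * Model invalidating nblocks buffers while rolling the transaction
 * after every "batch" invalidations.  Returns the number of rolls;
 * *pending is set to the number of invalidations still sitting in the
 * last transaction, which the caller would commit or roll afterwards.
 */
uint64_t
count_rolls(uint64_t nblocks, uint64_t batch, uint64_t *pending)
{
	uint64_t	in_trans = 0;
	uint64_t	rolls = 0;
	uint64_t	i;

	for (i = 0; i < nblocks; i++) {
		in_trans++;		/* stands in for one invalidation */
		if (in_trans == batch) {
			rolls++;	/* stands in for a transaction roll */
			in_trans = 0;
		}
	}
	*pending = in_trans;
	return rolls;
}
```

max_log_formats(650000, 88) evaluates to 7386, and reading "650K" as
650 * 1024 gives 7563, so the ~7300 ballpark holds either way.  A batch
size of 1000 stays well under that budget.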
> > > > > 
> > > > > I'm not really sure what qualifies as a lot here given that the blocks
> > > > > would need to be in-core, but rolling on some fixed/safe interval sounds
> > > > > reasonable to me.
> > > > > 
> > > > > > > currently be an issue, but it might be worth putting something down in a
> > > > > > > comment to note that this is a single transaction and we expect to not
> > > > > > > have to invalidate more than N (ballpark) blocks in a single go,
> > > > > > > whatever that value happens to be.
> > > > > > > 
> > > > > > > > > > +	error = xrep_roll_ag_trans(sc);
> > > > > > > > > > +	if (error)
> > > > > > > > > > +		return error;
> > > > > > > > > > +
> > > > > > > > > > +	/* Now that we've succeeded, mark the incore state valid again. */
> > > > > > > > > > +	sc->sa.pag->pagf_init = 1;
> > > > > > > > > > +	return 0;
> > > > > > > > > > +}
> > > > > > > > > > +
> > > > > > > > > > +/* Build new free space btrees and dispose of the old one. */
> > > > > > > > > > +STATIC int
> > > > > > > > > > +xrep_abt_rebuild_trees(
> > > > > > > > > > +	struct xfs_scrub	*sc,
> > > > > > > > > > +	struct list_head	*free_extents,
> > > > > > > > > > +	struct xfs_bitmap	*old_allocbt_blocks)
> > > > > > > > > > +{
> > > > > > > > > > +	struct xfs_owner_info	oinfo;
> > > > > > > > > > +	struct xrep_abt_extent	*rae;
> > > > > > > > > > +	struct xrep_abt_extent	*n;
> > > > > > > > > > +	struct xrep_abt_extent	*longest;
> > > > > > > > > > +	int			error;
> > > > > > > > > > +
> > > > > > > > > > +	xfs_rmap_skip_owner_update(&oinfo);
> > > > > > > > > > +
> > > > > > > > > > +	/*
> > > > > > > > > > +	 * Insert the longest free extent in case it's necessary to
> > > > > > > > > > +	 * refresh the AGFL with multiple blocks.  If there is no longest
> > > > > > > > > > +	 * extent, we had exactly the free space we needed; we're done.
> > > > > > > > > > +	 */
> > > > > > > > > 
> > > > > > > > > I'm confused by the last sentence. longest should only be NULL if the
> > > > > > > > > free space list is empty and haven't we already bailed out with -ENOSPC
> > > > > > > > > if that's the case?
> > > > > > > > > 
> > > > > > > > > > +	longest = xrep_abt_get_longest(free_extents);
> > > > > > > > 
> > > > > > > > xrep_abt_rebuild_trees is called after we allocate and initialize two
> > > > > > > > new btree roots in xrep_abt_reset_btrees.  If free_extents is an empty
> > > > > > > > list here, then we found exactly two blocks worth of free space and used
> > > > > > > > them to set up new btree roots.
> > > > > > > > 
> > > > > > > 
> > > > > > > Got it, thanks.
> > > > > > > 
> > > > > > > > > > +	if (!longest)
> > > > > > > > > > +		goto done;
> > > > > > > > > > +	error = xrep_abt_free_extent(sc,
> > > > > > > > > > +			XFS_AGB_TO_FSB(sc->mp, sc->sa.agno, longest->bno),
> > > > > > > > > > +			longest->len, &oinfo);
> > > > > > > > > > +	list_del(&longest->list);
> > > > > > > > > > +	kmem_free(longest);
> > > > > > > > > > +	if (error)
> > > > > > > > > > +		return error;
> > > > > > > > > > +
> > > > > > > > > > +	/* Insert records into the new btrees. */
> > > > > > > > > > +	list_for_each_entry_safe(rae, n, free_extents, list) {
> > > > > > > > > > +		error = xrep_abt_free_extent(sc,
> > > > > > > > > > +				XFS_AGB_TO_FSB(sc->mp, sc->sa.agno, rae->bno),
> > > > > > > > > > +				rae->len, &oinfo);
> > > > > > > > > > +		if (error)
> > > > > > > > > > +			return error;
> > > > > > > > > > +		list_del(&rae->list);
> > > > > > > > > > +		kmem_free(rae);
> > > > > > > > > > +	}
> > > > > > > > > 
> > > > > > > > > Ok, at this point we've reset the btree roots and we start freeing the
> > > > > > > > > free ranges that were discovered via the rmapbt analysis. AFAICT, if we
> > > > > > > > > fail or crash at this point, we leave the allocbts in a partially
> > > > > > > > > constructed state. I take it that is Ok with respect to the broader
> > > > > > > > > repair algorithm because we'd essentially start over by inspecting the
> > > > > > > > > rmapbt again on a retry.
> > > > > > > > 
> > > > > > > > Right.  Though in the crash/shutdown case, you'll end up with the
> > > > > > > > filesystem in an offline state at some point before you can retry the
> > > > > > > > scrub, so it's probably faster to run xfs_repair to fix the damage.
> > > > > > > > 
> > > > > > > 
> > > > > > > Can we really assume that if we're already up and running an online
> > > > > > > repair? The filesystem has to be mountable in that case in the first
> > > > > > > place. If we've already reset and started reconstructing the allocation
> > > > > > > btrees then I'd think those transactions would recover just fine on a
> > > > > > > power loss or something (perhaps not in the event of some other
> > > > > > > corruption related shutdown).
> > > > > > 
> > > > > > Right, for the system crash case, whatever transactions committed should
> > > > > > replay just fine, and you can even start up the online repair again, and
> > > > > > if the AG isn't particularly close to ENOSPC then (barring rmap
> > > > > > corruption) it should work just fine.
> > > > > > 
> > > > > > If the fs went down because either (a) repair hit other corruption or
> > > > > > (b) some other thread hit an error in some other part of the filesystem,
> > > > > > then it's not so clear -- in (b) you could probably try again, but for
> > > > > > (a) you'll definitely have to unmount and run xfs_repair.
> > > > > > 
> > > > > 
> > > > > Indeed, there are certainly cases where we simply won't be able to do an
> > > > > online repair. I'm trying to think about scenarios where we should be
> > > > > able to do an online repair, but we lose power or hit some kind of
> > > > > transient error like a memory allocation failure before it completes. It
> > > > > would be nice if the online repair itself didn't contribute (within
> > > > > reason) to the inability to simply try again just because the fs was
> > > > > close to -ENOSPC.
> > > > 
> > > > Agreed.  Most of the, uh, opportunities to hit ENOMEM happen before we
> > > > start modifying on-disk metadata.  If that happens, we just free all the
> > > > memory and bail out having done nothing.
> > > > 
> > > > > For one, I think it's potentially confusing behavior. Second, it might
> > > > > be concerning to regular users who perceive it as an online repair
> > > > > leaving the fs in a worse off state. Us fs devs know that may not really
> > > > > be the case, but I think we're better for addressing it if we can
> > > > > reasonably do so.
> > > > 
> > > > <nod> Further in the future I want to add the ability to offline an AG,
> > > > so the worst that happens is that scrub turns the AG off, repair doesn't
> > > > fix it, and the AG simply stays offline.  That might give us the
> > > > ability to survive cancelling the repair transaction, since if the AG's
> > > > offline already anyway we could just throw away the dirty buffers and
> > > > resurrect the AG later.  I don't know, that's purely speculative.
> > > > 
> > > > > > Perhaps the guideline here is that if the fs goes down more than once
> > > > > > during online repair then unmount it and run xfs_repair.
> > > > > > 
> > > > > 
> > > > > Yep, I think that makes sense if the filesystem or repair itself is
> > > > > tripping over other corruptions that fail to keep it active for the
> > > > > duration of the repair.
> > > > 
> > > > <nod>
> > > > 
> > > > > > > > > The blocks allocated for the btrees that we've begun to construct here
> > > > > > > > > end up mapped in the rmapbt as we go, right? IIUC, that means we don't
> > > > > > > > > necessarily have infinite retries to make sure this completes. IOW,
> > > > > > > > > suppose that a first repair attempt finds just enough free space to
> > > > > > > > > construct new trees, gets far enough along to consume most of that free
> > > > > > > > > space and then crashes. Is it possible that a subsequent repair attempt
> > > > > > > > > includes the btree blocks allocated during the previous failed repair
> > > > > > > > > attempt in the sum of "old btree blocks" and determines we don't have
> > > > > > > > > enough free space to repair?
> > > > > > > > 
> > > > > > > > Yes, that's a risk of running the free space repair.
> > > > > > > > 
> > > > > > > 
> > > > > > > Can we improve on that? For example, are the rmapbt entries for the old
> > > > > > > allocation btree blocks necessary once we commit the btree resets? If
> > > > > > > not, could we remove those entries before we start tree reconstruction?
> > > > > > > 
> > > > > > > Alternatively, could we incorporate use of the old btree blocks? As it
> > > > > > > is, we discover those blocks simply so we can free them at the end.
> > > > > > > Perhaps we could free them sooner or find a more clever means to
> > > > > > > reallocate directly from that in-core list? I guess we have to consider
> > > > > > > whether they were really valid/sane btree blocks, but either way ISTM
> > > > > > > that the old blocks list is essentially invalidated once we reset the
> > > > > > > btrees.
> > > > > > 
> > > > > > Hmm, it's a little tricky to do that -- we could reap the old bnobt and
> > > > > > cntbt blocks (in the old_allocbt_blocks bitmap) first, but if adding a
> > > > > > record causes a btree split we'll pull blocks from the AGFL, and if
> > > > > > there aren't enough blocks in the bnobt to fill the AGFL back up then
> > > > > > fix_freelist won't succeed.  That complication is why it finds the
> > > > > > longest extent in the unclaimed list and pushes that in first, then
> > > > > > works on the rest of the extents.
> > > > > > 
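The "push the longest extent in first" ordering described above amounts
to pulling the maximum-length entry off the list before walking the
rest.  A minimal sketch, assuming a simple singly linked list: struct
free_extent and take_longest() are hypothetical stand-ins, not the
xrep_abt_extent list or xrep_abt_get_longest() from the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one entry in the list of free extents. */
struct free_extent {
	uint32_t		bno;	/* start block within the AG */
	uint32_t		len;	/* length in blocks */
	struct free_extent	*next;
};

/*
 * Unlink and return the longest extent on the list, or NULL if the list
 * is empty.  On ties the extent closest to the head wins.
 */
struct free_extent *
take_longest(struct free_extent **head)
{
	struct free_extent	**pp;
	struct free_extent	**longest = NULL;
	struct free_extent	*ret;

	for (pp = head; *pp != NULL; pp = &(*pp)->next) {
		if (longest == NULL || (*pp)->len > (*longest)->len)
			longest = pp;
	}
	if (longest == NULL)
		return NULL;

	ret = *longest;
	*longest = ret->next;	/* splice the entry out of the list */
	ret->next = NULL;
	return ret;
}
```

A caller would insert the returned extent first and then iterate over
whatever remains on the list; a NULL return corresponds to the "we had
exactly the free space we needed" case discussed earlier in the thread.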
> > > > > 
> > > > > Hmm, but doesn't a btree split require at least one full space btree
> > > > > block per-level? In conjunction, the agfl minimum size requirement grows
> > > > > with the height of the tree, which implies available free space..? I
> > > > > could be missing something, perhaps we have to account for the rmapbt in
> > > > > that case as well? Regardless...
> > > > > 
> > > > > > I suppose one could try to avoid ENOSPC by pushing that longest extent
> > > > > > in first (since we know that won't trigger a split), then reap the old
> > > > > > alloc btree blocks, and then add everything else back in...
> > > > > > 
> > > > > 
> > > > > I think it would be reasonable to seed the btree with the longest record
> > > > > or some fixed number of longest records (~1/2 a root block, for example)
> > > > > before making actual use of the btrees to reap the old blocks. I think
> > > > > then you'd only have a very short window of a single block leak on a
> > > > > poorly timed power loss and repair retry sequence before you start
> > > > > actually freeing originally used space (which in practice, I think
> > > > > solves the problem).
> > > > > 
> > > > > Given that we're starting from empty, I wonder if another option may be
> > > > > to over fill the agfl with old btree blocks or something. The first real
> > > > > free should shift enough blocks back into the btrees to ensure the agfl
> > > > > can be managed from that point forward, right? That may be more work
> > > > > than it's worth though and/or a job for another patch. (FWIW, we also
> > > > > have that NOSHRINK agfl fixup flag for userspace repair.)
> > > > 
> > > > Yes, I'll give that a try tomorrow, now that I've finished porting all
> > > > the 4.19 stuff to xfsprogs. :)
> > > > 
> > > > Looping back to something we discussed earlier in this thread, I'd
> > > > prefer to hold off on converting the list of already-freed extents to
> > > > xfs_bitmap because the same problem exists in all the repair functions
> > > > of having to store a large number of records for the rebuilt btree, and
> > > > maybe there's some way to <cough> use pageable memory for that, since
> > > > the access patterns for that are append, sort, and iterate; for those
> > > > three uses we don't necessarily require all the records to be in memory
> > > > all the time.  For the allocbt repair I expect the free space records to
> > > > be far more numerous than the list of old bnobt/cntbt blocks.
> > > > 
> > > 
> > > Ok, it's fair enough that we'll probably want to find some kind of
> > > generic, more efficient technique for handling this across the various
> > > applicable repair algorithms.
> > > 
> > > One other high level thing that crossed my mind with regard to the
> > > general btree reconstruction algorithms is whether we need to build up
> > > this kind of central record list at all. For example, rather than slurp
> > > up the entire list of btree records in-core, sort it and dump it back
> > > out, could we take advantage of the fact that our existing on-disk
> > > structure insertion mechanisms already handle out of order records
> > > (simply stated, an extent free knows how to insert the associated record
> > > at the right place in the space btrees)? For example, suppose we reset
> > > the existing btrees first, then scanned the rmapbt and repopulated the
> > > new btrees as records are discovered..?
> > 
> > I tried that in an earlier draft of the bnobt repair function.  The
> > biggest problem with inserting as we go is dealing with the inevitable
> > transaction rolls (right now we roll after every record insertion to avoid
> > playing games with guessing how much reservation is left).  Btree
> > cursor state can't survive transaction rolls because the transaction
> > commit releases all the buffers that aren't bhold'en, and we can't bhold
> > that many buffers across a _defer_finish.
> > 
> 
> Ok, interesting.
> 
> Where do we need to run an xfs_defer_finish() during the reconstruction
> sequence, btw?

Not here, as I'm sure you were thinking. :)  For the AG btrees
themselves it's sufficient to roll the transaction.  I suppose we could
simply have a xfs_btree_bhold function that would bhold every buffer so
that a cursor could survive a roll.

Inode fork reconstruction is going to require _defer_finish, however.

> I thought that would only run on final commit as opposed to
> intermediate rolls.

We could let the deferred items sit around until final commit, but I
think I'd prefer to process them as soon as possible since iirc deferred
items pin the log until they're finished.  I would hope that userspace
isn't banging on the log while repair runs, but it's certainly possible.

> We could just try and make the automatic buffer relogging list a
> dynamic allocation if there are enough held buffers in the
> transaction.

Hmm.  Might be worth pursuing...

> > So, that early draft spent a lot of time tearing down and reconstructing
> > rmapbt cursors since the standard _btree_query_all isn't suited to that
> > kind of usage.  It was easily twice as slow on a RAM-backed disk just
> > from the rmap cursor overhead and much more complex, so I rewrote it to
> > be simpler.  I also have a slight preference for not touching anything
> > until we're absolutely sure we have all the data we need to repair the
> > structure.
> > 
> 
> Yes, I think that is sane in principle. I'm just a bit concerned about
> how reliable that xfs_repair-like approach will be in the kernel longer
> term, particularly once we start having to deal with large filesystems
> and limited or contended memory, etc. We already have xfs_repair users
> that need to tweak settings because there isn't enough memory available
> to repair the fs. Granted that is for fs-wide repairs and the flipside
> is that we know a single AG can only be up to 1TB. It's certainly
> possible that putting some persistent backing behind the in-core data is
> enough to resolve the problem (and the current approach is certainly
> reasonable enough to me for the initial implementation).
> 
> bjoin limitations aside, I wonder if a cursor roll mechanism that held
> all of the cursor buffers, rolled the transaction and then rejoined all
> said buffers would help us get around that. (Not sure I follow the early
> prototype behavior, but it sounds like we had to restart the rmapbt
> lookup over and over...).

Correct.
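
The hold/roll behaviour discussed above can be illustrated with a toy
model.  Everything here is hypothetical (the toy_* names, the flags, the
four-level limit); it only encodes the rule stated in this thread that a
roll releases every buffer that is not held, and it assumes that a hold
is consumed by the roll.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAXLEVELS	4

/* Toy buffer: "locked" means the cursor may still use it. */
struct toy_buf {
	bool	locked;
	bool	held;
};

/* Toy cursor: one buffer per btree level. */
struct toy_cursor {
	struct toy_buf	*bufs[TOY_MAXLEVELS];
	int		nlevels;
};

/* Model of a roll: buffers that are not held get released. */
void
toy_roll(struct toy_buf *bufs, size_t nr)
{
	size_t	i;

	for (i = 0; i < nr; i++) {
		if (!bufs[i].held)
			bufs[i].locked = false;
		bufs[i].held = false;	/* assumption: a hold is one-shot */
	}
}

/* Hold every buffer the cursor points at, one per level. */
void
toy_cursor_bhold(struct toy_cursor *cur)
{
	int	i;

	for (i = 0; i < cur->nlevels; i++)
		cur->bufs[i]->held = true;
}

/* A cursor is usable only if all of its buffers are still locked. */
bool
toy_cursor_usable(const struct toy_cursor *cur)
{
	int	i;

	for (i = 0; i < cur->nlevels; i++) {
		if (!cur->bufs[i]->locked)
			return false;
	}
	return true;
}
```

In this model a cursor survives toy_roll() only if toy_cursor_bhold()
was called first.  Real code would also have to rejoin the held buffers
to the new transaction, as the quoted paragraph notes, and would be
bounded by however many buffers can be held across a roll.
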

> Another caveat with that approach may be that I think we'd need to be
> sure that the reconstruction operation doesn't ever need to update the
> rmapbt while we're mid walk of the latter.

<nod> Looking even farther back in my notes, that was also an issue --
fixing the free list causes blocks to go on or off the agfl, which
causes rmapbt updates, which meant that the only way I could get
in-place updates to work was to re-lookup where we were in the btree and
also try to deal with any rmapbt entries that might have crept in as a
result of the record insertion.

Getting the concurrency right for each repair function looked like a
difficult problem to solve, but amassing all the records elsewhere and
rebuilding was easy to understand.

> That may be an issue for inode btree reconstruction, for example,
> since it looks like inobt block allocation requires rmapbt updates.
> We'd probably need some way to share (or invalidate) a cursor across
> different contexts to deal with that.

I might pursue that strategy if we ever hit the point where we can't
find space to store the records (see below).  Another option could be to
divert all deferred items for an AG, build a replacement btree in new
space, then finish all the deferred items... but that's starting to get
into offlineable AGs, which is its own project that I want to tackle
later.

(Not that much later, just not this cycle.)

> > For other repair functions (like the data/attr fork repairs) we have to
> > scan all the rmapbts for extents, and I'd prefer to lock those AGs only
> > for as long as necessary to extract the extents we want.
> > 
> > > The obvious problem is that we still have some checks that allow the
> > > whole repair operation to bail out before we determine whether we can
> > > start to rebuild the on-disk btrees. These are things like making sure
> > > we can actually read the associated rmapbt blocks (i.e., no read errors
> > > or verifier failures), basic record sanity checks, etc. But ISTM that
> > > isn't anything we couldn't get around with a multi-pass implementation.
> > > Secondary issues might be things like no longer being able to easily
> > > insert the longest free extent range(s) first (meaning we'd have to
> > > stuff the agfl with old btree blocks or figure out some other approach).
> > 
> > Well, you could scan the rmapbt twice -- once to find the longest
> > record, then again to do the actual insertion.
> > 
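The two-pass idea above can be sketched over a plain array standing in
for the rmapbt.  This is an illustration under assumptions, not the
kernel walker: struct extent, walk_gaps() and both callbacks are
hypothetical, the records are assumed to be sorted by start block, and
overlapping records are tolerated by tracking the highest end seen.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record: a run of blocks within one AG. */
struct extent {
	uint32_t	bno;
	uint32_t	len;
};

typedef void (*gap_fn)(const struct extent *gap, void *priv);

/*
 * Walk records sorted by start block and call fn for every gap between
 * them, including the gap between the last record and the end of the AG.
 */
void
walk_gaps(const struct extent *recs, size_t nr, uint32_t ag_blocks,
	  gap_fn fn, void *priv)
{
	uint64_t	next = 0;	/* first block not known to be in use */
	size_t		i;

	for (i = 0; i < nr; i++) {
		uint64_t	end = (uint64_t)recs[i].bno + recs[i].len;

		if (recs[i].bno > next) {
			struct extent	gap = {
				.bno = (uint32_t)next,
				.len = (uint32_t)(recs[i].bno - next),
			};
			fn(&gap, priv);
		}
		if (end > next)
			next = end;
	}
	if (next < ag_blocks) {
		struct extent	gap = {
			.bno = (uint32_t)next,
			.len = (uint32_t)(ag_blocks - next),
		};
		fn(&gap, priv);
	}
}

/* Pass 1 callback: remember the longest gap seen so far. */
void
track_longest(const struct extent *gap, void *priv)
{
	struct extent	*longest = priv;

	if (gap->len > longest->len)
		*longest = *gap;
}

/* Pass 2 callback: stands in for inserting the gap; it sums the blocks. */
void
sum_gap_blocks(const struct extent *gap, void *priv)
{
	*(uint64_t *)priv += gap->len;
}
```

The first call to walk_gaps() with track_longest finds the extent to
seed the new btree with; the second call does the per-record work.  The
cost is reading the records twice instead of keeping them all in memory.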
> 
> Yep, that's what I meant by multi-pass.
> 
> > > BTW, isn't the typical scrub sequence already multi-pass by virtue of
> > > the xfs_scrub_metadata() implementation? I'm wondering if the ->scrub()
> > > callout could not only detect corruption, but validate whether repair
> > > (if requested) is possible based on the kind of checks that are
> > > currently in the repair side rmapbt walkers. Thoughts?
> > 
> > Yes, scrub basically validates that for us now, with the notable
> > exception of the notorious rmapbt scrubber, which doesn't
> > cross-reference with inode block mappings because that would be a
> > locking nightmare.
> > 
> > > Are there future
> > > changes that are better supported by an in-core tracking structure in
> > > general (assuming we'll eventually replace the linked lists with
> > > something more efficient) as opposed to attempting to optimize out the
> > > need for that tracking at all?
> > 
> > Well, I was thinking that we could just allocate a memfd (or a file on
> > the same xfs once we have AG offlining) and store the records in there.
> > That saves us the list_head overhead and potentially enables access to a
> > lot more storage than pinning things in RAM.
> > 
> 
> Would using the same fs mean we have to store the repair data in a
> separate AG, or somehow locate/use free space in the target AG?

As part of building an "offline AG" feature we'd presumably have to
teach the allocators to avoid the offline AGs for allocations, which
would make it so that we could host the repair data files in the same
XFS that's being fixed.  That seems a little risky to me, but the disk
is probably larger than mem+swap.

> I presume either way we'd have to ensure that AG is either consistent or
> locked out from outside I/O. If we have the total record count we can

We usually don't, but for the btrees that have their own record/blocks
counters we might be able to guess a number, fallocate it, and see if
that doesn't ENOSPC.

> preallocate the file and hope there is no such other free space
> corruption or something that would allow some other task to mess with
> our blocks. I'm a little skeptical overall on relying on a corrupted
> filesystem to store repair data, but perhaps there are ways to mitigate
> the risks.

Store it elsewhere?  /home for root repairs, /root for any other
repair... though if we're going to do that, why not just add a swap file
temporarily?

> I'm not familiar with memfd. The manpage suggests it's ram backed, is it
> swappable or something?

It's supposed to be.  The quick test I ran (allocate a memfd, write 1GB
of junk to it on a VM with 400M of RAM) seemed to push about 980MB into
the swap file.
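
A userspace analogue of the memfd idea, covering the append, sort and
iterate access pattern mentioned earlier in this thread, might look like
the sketch below.  It is not the kernel design: struct fsp_rec and the
recstore_* helpers are hypothetical, the memfd name is arbitrary, and
the raw syscall is used so the sketch does not depend on a libc wrapper
being available.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical free space record. */
struct fsp_rec {
	uint32_t	bno;
	uint32_t	len;
};

/* Create an anonymous, swappable memory file; returns an fd or -1. */
int
recstore_create(void)
{
	return (int)syscall(SYS_memfd_create, "repair-records", 0);
}

/* Append one record; a short or failed write is reported as -1. */
int
recstore_append(int fd, uint32_t bno, uint32_t len)
{
	struct fsp_rec	rec = { .bno = bno, .len = len };

	if (lseek(fd, 0, SEEK_END) < 0)
		return -1;
	return write(fd, &rec, sizeof(rec)) == (ssize_t)sizeof(rec) ? 0 : -1;
}

static int
fsp_rec_cmp(const void *a, const void *b)
{
	const struct fsp_rec	*ra = a;
	const struct fsp_rec	*rb = b;

	if (ra->bno < rb->bno)
		return -1;
	return ra->bno > rb->bno;
}

/* Sort all records by start block; returns the record count or -1. */
long
recstore_sort(int fd)
{
	off_t		size = lseek(fd, 0, SEEK_END);
	struct fsp_rec	*recs;
	long		nr;

	if (size < 0)
		return -1;
	nr = (long)((size_t)size / sizeof(struct fsp_rec));
	if (nr == 0)
		return 0;
	recs = mmap(NULL, (size_t)size, PROT_READ | PROT_WRITE, MAP_SHARED,
			fd, 0);
	if (recs == MAP_FAILED)
		return -1;
	qsort(recs, (size_t)nr, sizeof(*recs), fsp_rec_cmp);
	munmap(recs, (size_t)size);
	return nr;
}

/* Read record number idx back out of the store. */
int
recstore_get(int fd, long idx, struct fsp_rec *out)
{
	ssize_t	ret = pread(fd, out, sizeof(*out),
			(off_t)idx * (off_t)sizeof(*out));

	return ret == (ssize_t)sizeof(*out) ? 0 : -1;
}
```

The caller appends records as they are discovered, calls
recstore_sort() once, and then reads them back in order with
recstore_get().  Nothing here addresses the concern raised later in this
message about the VM pushing back once the data has been written.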

> If so, that sounds like a reasonable option provided the swap space
> requirement can be made clear to users

We can document it.  I don't think it's any worse than xfs_repair being
able to use up all the memory + swap... and since we're probably only
going to be repairing one thing at a time, most likely scrub won't need
as much memory.

> and the failure characteristics aren't more severe than for userspace.
> An online repair that puts the broader system at risk of OOM as
> opposed to predictably failing gracefully may not be the most useful
> tool.

Agreed.  One huge downside of memfd seems to be the lack of a mechanism
for the vm to push back on us if we successfully write all we need to
the memfd but then other processes need some memory.  Obviously, if the
memfd write itself comes up short or fails then we dump the memfd and
error back to userspace.  We might simply have to free array memory
while we iterate the records to minimize the time spent at peak memory
usage.

--D

> 
> Brian
> 
> > --D
> > 
> > > Brian
> > > 
> > > > --D
> > > > 
> > > > > Brian
> > > > > 
> > > > > > --D
> > > > > > 
> > > > > > > Brian
> > > > > > > 
> > > > > > > > > > +
> > > > > > > > > > +done:
> > > > > > > > > > +	/* Free all the OWN_AG blocks that are not in the rmapbt/agfl. */
> > > > > > > > > > +	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_AG);
> > > > > > > > > > +	return xrep_reap_extents(sc, old_allocbt_blocks, &oinfo,
> > > > > > > > > > +			XFS_AG_RESV_NONE);
> > > > > > > > > > +}
> > > > > > > > > > +
> > > > > > > > > ...
> > > > > > > > > > diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c
> > > > > > > > > > index 0ed68379e551..82f99633a597 100644
> > > > > > > > > > --- a/fs/xfs/xfs_extent_busy.c
> > > > > > > > > > +++ b/fs/xfs/xfs_extent_busy.c
> > > > > > > > > > @@ -657,3 +657,17 @@ xfs_extent_busy_ag_cmp(
> > > > > > > > > >  		diff = b1->bno - b2->bno;
> > > > > > > > > >  	return diff;
> > > > > > > > > >  }
> > > > > > > > > > +
> > > > > > > > > > +/* Are there any busy extents in this AG? */
> > > > > > > > > > +bool
> > > > > > > > > > +xfs_extent_busy_list_empty(
> > > > > > > > > > +	struct xfs_perag	*pag)
> > > > > > > > > > +{
> > > > > > > > > > +	spin_lock(&pag->pagb_lock);
> > > > > > > > > > +	if (pag->pagb_tree.rb_node) {
> > > > > > > > > 
> > > > > > > > > RB_EMPTY_ROOT()?
> > > > > > > > 
> > > > > > > > Good suggestion, thank you!
> > > > > > > > 
> > > > > > > > --D
> > > > > > > > 
> > > > > > > > > Brian
> > > > > > > > > 
> > > > > > > > > > +		spin_unlock(&pag->pagb_lock);
> > > > > > > > > > +		return false;
> > > > > > > > > > +	}
> > > > > > > > > > +	spin_unlock(&pag->pagb_lock);
> > > > > > > > > > +	return true;
> > > > > > > > > > +}
> > > > > > > > > > diff --git a/fs/xfs/xfs_extent_busy.h b/fs/xfs/xfs_extent_busy.h
> > > > > > > > > > index 990ab3891971..2f8c73c712c6 100644
> > > > > > > > > > --- a/fs/xfs/xfs_extent_busy.h
> > > > > > > > > > +++ b/fs/xfs/xfs_extent_busy.h
> > > > > > > > > > @@ -65,4 +65,6 @@ static inline void xfs_extent_busy_sort(struct list_head *list)
> > > > > > > > > >  	list_sort(NULL, list, xfs_extent_busy_ag_cmp);
> > > > > > > > > >  }
> > > > > > > > > >  
> > > > > > > > > > +bool xfs_extent_busy_list_empty(struct xfs_perag *pag);
> > > > > > > > > > +
> > > > > > > > > >  #endif /* __XFS_EXTENT_BUSY_H__ */
> > > > > > > > > > 

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 03/14] xfs: repair the AGFL
  2018-08-07 22:02         ` Darrick J. Wong
@ 2018-08-08 12:09           ` Brian Foster
  2018-08-08 21:26             ` Darrick J. Wong
  0 siblings, 1 reply; 53+ messages in thread
From: Brian Foster @ 2018-08-08 12:09 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, david, allison.henderson

On Tue, Aug 07, 2018 at 03:02:24PM -0700, Darrick J. Wong wrote:
> On Tue, Jul 31, 2018 at 11:10:00AM -0400, Brian Foster wrote:
> > On Mon, Jul 30, 2018 at 10:22:16AM -0700, Darrick J. Wong wrote:
> > > On Mon, Jul 30, 2018 at 12:25:24PM -0400, Brian Foster wrote:
> > > > On Sun, Jul 29, 2018 at 10:48:08PM -0700, Darrick J. Wong wrote:
> > > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > 
> > > > > Repair the AGFL from the rmap data.
> > > > > 
> > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > ---
> > > > 
> > > > FWIW, I tried tweaking a couple agfl values via xfs_db and xfs_scrub
> > > > seems to always dump a cross-referencing failed error and not want to
> > > > deal with it. Expected? Is there a good way to unit test some of this
> > > > stuff with simple/localized corruptions?
> > > 
> > > I usually pick one of the corruptions from xfs/355...
> > > 
> > > $ SCRATCH_XFS_LIST_FUZZ_VERBS=random \
> > > SCRATCH_XFS_LIST_METADATA_FIELDS=somefield \
> > > ./check xfs/355
> > > 
> > 
> > It looks like similar behavior if I do that, but tbh I'm not sure if I'm
> > using this correctly. E.g., if I do:
> 
> <urk> Sorry, I forgot to reply to this...
> 
> > # SCRATCH_XFS_LIST_FUZZ_VERBS=random SCRATCH_XFS_LIST_METADATA_FIELDS=bno[0] ./check xfs/355
> > FSTYP         -- xfs (debug)
> > PLATFORM      -- Linux/x86_64 localhost 4.18.0-rc4+
> > MKFS_OPTIONS  -- -f -mrmapbt=1,reflink=1 /dev/mapper/test-scratch
> > MOUNT_OPTIONS -- -o context=system_u:object_r:root_t:s0 /dev/mapper/test-scratch /mnt/scratch
> > 
> > xfs/355 - output mismatch (see /root/xfstests-dev/results//xfs/355.out.bad)
> >     ...
> >     (Run 'diff -u tests/xfs/355.out /root/xfstests-dev/results//xfs/355.out.bad'  to see the entire diff)
> > Ran: xfs/355
> > Failures: xfs/355
> > Failed 1 of 1 tests
> > # diff -u tests/xfs/355.out /root/xfstests-dev/results//xfs/355.out.bad
> > --- tests/xfs/355.out   2018-07-25 07:47:23.739575416 -0400
> > +++ /root/xfstests-dev/results//xfs/355.out.bad 2018-07-31 10:55:18.466178944 -0400
> > @@ -1,6 +1,10 @@
> >  QA output created by 355
> >  Format and populate
> >  Fuzz AGFL
> > +online re-scrub (1) with bno[0] = random.
> >  Done fuzzing AGFL
> >  Fuzz AGFL flfirst
> > +offline re-scrub (1) with bno[14] = random.
> > +online re-scrub (1) with bno[14] = random.
> > +re-repair failed (1) with bno[14] = random.
> >  Done fuzzing AGFL flfirst
> > 
> > If I run xfs_scrub directly on the scratch mount after the test I get a
> > stream of inode cross-referencing errors and it doesn't seem to fix
> > anything up.
> 
> Hmm.  What is your xfsprogs head?  I think Eric committed the patches to
> xfs_scrub to enable repairs in v4.18.0-rc1... which git says happened on
> 8/1.
> 

I think it was just for-next. Regardless, I was really just looking for
a way to trigger a specific repair cycle and got around it once I
discovered the XFS_ERRTAG_FORCE_SCRUB_REPAIR tag. I did have to stick
the repair flag in the xfs_io scrub calls as well to trigger it that
way, IIRC.

Any thoughts on allowing that, perhaps with an extra scrub command flag
(and/or in experimental mode)?

Brian

> --D
> 
> > 
> > Brian
> > 
> > > > Otherwise this looks sane, a couple comments..
> > > > 
> > > > >  fs/xfs/scrub/agheader_repair.c |  276 ++++++++++++++++++++++++++++++++++++++++
> > > > >  fs/xfs/scrub/bitmap.c          |   92 +++++++++++++
> > > > >  fs/xfs/scrub/bitmap.h          |    4 +
> > > > >  fs/xfs/scrub/scrub.c           |    2 
> > > > >  4 files changed, 373 insertions(+), 1 deletion(-)
> > > > > 
> > > > > 
> > > > > diff --git a/fs/xfs/scrub/agheader_repair.c b/fs/xfs/scrub/agheader_repair.c
> > > > > index 4842fc598c9b..bfef066c87c3 100644
> > > > > --- a/fs/xfs/scrub/agheader_repair.c
> > > > > +++ b/fs/xfs/scrub/agheader_repair.c
> > > > > @@ -424,3 +424,279 @@ xrep_agf(
> > > > >  	memcpy(agf, &old_agf, sizeof(old_agf));
> > > > >  	return error;
> > > > >  }
> > > > > +
> > > > ...
> > > > > +/* Write out a totally new AGFL. */
> > > > > +STATIC void
> > > > > +xrep_agfl_init_header(
> > > > > +	struct xfs_scrub	*sc,
> > > > > +	struct xfs_buf		*agfl_bp,
> > > > > +	struct xfs_bitmap	*agfl_extents,
> > > > > +	xfs_agblock_t		flcount)
> > > > > +{
> > > > > +	struct xfs_mount	*mp = sc->mp;
> > > > > +	__be32			*agfl_bno;
> > > > > +	struct xfs_bitmap_range	*br;
> > > > > +	struct xfs_bitmap_range	*n;
> > > > > +	struct xfs_agfl		*agfl;
> > > > > +	xfs_agblock_t		agbno;
> > > > > +	unsigned int		fl_off;
> > > > > +
> > > > > +	ASSERT(flcount <= xfs_agfl_size(mp));
> > > > > +
> > > > > +	/* Start rewriting the header. */
> > > > > +	agfl = XFS_BUF_TO_AGFL(agfl_bp);
> > > > > +	memset(agfl, 0xFF, BBTOB(agfl_bp->b_length));
> > > > 
> > > > What's the purpose behind 0xFF? Related to NULLAGBLOCK/NULLCOMMITLSN..?
> > > 
> > > Yes, it prepopulates the AGFL bno[] array with NULLAGBLOCK, then writes
> > > in the header fields.
> > > 
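A minimal userspace sketch of the 0xFF point made in the reply above: in
XFS, NULLAGBLOCK is the all-ones 32-bit value, so a byte-wise memset of
0xFF leaves every bno[] slot reading as NULLAGBLOCK in either byte order.
The toy_* names and the slot count are assumptions made up for this
illustration; they are not kernel interfaces.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_NULLAGBLOCK	((uint32_t)-1)	/* same value as XFS's NULLAGBLOCK */
#define TOY_AGFL_SLOTS	8		/* hypothetical; real AGFLs are larger */

/* Unconditional 32-bit byte swap, standing in for an endian conversion. */
static uint32_t
toy_swab32(uint32_t v)
{
	return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
	       ((v << 8) & 0x00ff0000u) | (v << 24);
}

/* Prepopulate every slot with all-ones bytes, as the patch's memset does. */
static void
toy_agfl_reset(uint32_t *bno)
{
	memset(bno, 0xFF, TOY_AGFL_SLOTS * sizeof(*bno));
}

/*
 * Return 1 if every slot reads as the null block number both as stored
 * and after a byte swap, i.e. regardless of the host's endianness.
 */
static int
toy_agfl_all_null(const uint32_t *bno)
{
	int	i;

	for (i = 0; i < TOY_AGFL_SLOTS; i++) {
		if (bno[i] != TOY_NULLAGBLOCK ||
		    toy_swab32(bno[i]) != TOY_NULLAGBLOCK)
			return 0;
	}
	return 1;
}
```

Because the pattern is the same in both byte orders, the slots need no
endian conversion before the header fields are written over the start of
the buffer.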
> > > > > +	agfl->agfl_magicnum = cpu_to_be32(XFS_AGFL_MAGIC);
> > > > > +	agfl->agfl_seqno = cpu_to_be32(sc->sa.agno);
> > > > > +	uuid_copy(&agfl->agfl_uuid, &mp->m_sb.sb_meta_uuid);
> > > > > +
> > > > > +	/*
> > > > > +	 * Fill the AGFL with the remaining blocks.  If agfl_extents has more
> > > > > +	 * blocks than fit in the AGFL, they will be freed in a subsequent
> > > > > +	 * step.
> > > > > +	 */
> > > > > +	fl_off = 0;
> > > > > +	agfl_bno = XFS_BUF_TO_AGFL_BNO(mp, agfl_bp);
> > > > > +	for_each_xfs_bitmap_extent(br, n, agfl_extents) {
> > > > > +		agbno = XFS_FSB_TO_AGBNO(mp, br->start);
> > > > > +
> > > > > +		trace_xrep_agfl_insert(mp, sc->sa.agno, agbno, br->len);
> > > > > +
> > > > > +		while (br->len > 0 && fl_off < flcount) {
> > > > > +			agfl_bno[fl_off] = cpu_to_be32(agbno);
> > > > > +			fl_off++;
> > > > > +			agbno++;
> > > > 
> > > > 			/* bump br so we don't reap blocks we've used */
> > > > 
> > > > (i.e., took me a sec to realize why we bother with ->start)
> > > > 
> > > > > +			br->start++;
> > > > > +			br->len--;
> > > > > +		}
> > > > > +
> > > > > +		if (br->len)
> > > > > +			break;
> > > > > +		list_del(&br->list);
> > > > > +		kmem_free(br);
> > > > > +	}
> > > > > +
> > > > > +	/* Write new AGFL to disk. */
> > > > > +	xfs_trans_buf_set_type(sc->tp, agfl_bp, XFS_BLFT_AGFL_BUF);
> > > > > +	xfs_trans_log_buf(sc->tp, agfl_bp, 0, BBTOB(agfl_bp->b_length) - 1);
> > > > > +}
> > > > > +
> > > > ...
> > > > > diff --git a/fs/xfs/scrub/bitmap.c b/fs/xfs/scrub/bitmap.c
> > > > > index c770e2d0b6aa..fdadc9e1dc49 100644
> > > > > --- a/fs/xfs/scrub/bitmap.c
> > > > > +++ b/fs/xfs/scrub/bitmap.c
> > > > > @@ -9,6 +9,7 @@
> > > > >  #include "xfs_format.h"
> > > > >  #include "xfs_trans_resv.h"
> > > > >  #include "xfs_mount.h"
> > > > > +#include "xfs_btree.h"
> > > > >  #include "scrub/xfs_scrub.h"
> > > > >  #include "scrub/scrub.h"
> > > > >  #include "scrub/common.h"
> > > > > @@ -209,3 +210,94 @@ xfs_bitmap_disunion(
> > > > >  }
> > > > >  #undef LEFT_ALIGNED
> > > > >  #undef RIGHT_ALIGNED
> > > > > +
> > > > > +/*
> > > > > + * Record all btree blocks seen while iterating all records of a btree.
> > > > > + *
> > > > > + * We know that the btree query_all function starts at the left edge and walks
> > > > > + * towards the right edge of the tree.  Therefore, we know that we can walk up
> > > > > + * the btree cursor towards the root; if the pointer for a given level points
> > > > > + * to the first record/key in that block, we haven't seen this block before;
> > > > > + * and therefore we need to remember that we saw this block in the btree.
> > > > > + *
> > > > > + * So if our btree is:
> > > > > + *
> > > > > + *    4
> > > > > + *  / | \
> > > > > + * 1  2  3
> > > > > + *
> > > > > + * Pretend for this example that each leaf block has 100 btree records.  For
> > > > > + * the first btree record, we'll observe that bc_ptrs[0] == 1, so we record
> > > > > + * that we saw block 1.  Then we observe that bc_ptrs[1] == 1, so we record
> > > > > + * block 4.  The list is [1, 4].
> > > > > + *
> > > > > + * For the second btree record, we see that bc_ptrs[0] == 2, so we exit the
> > > > > + * loop.  The list remains [1, 4].
> > > > > + *
> > > > > + * For the 101st btree record, we've moved onto leaf block 2.  Now
> > > > > + * bc_ptrs[0] == 1 again, so we record that we saw block 2.  We see that
> > > > > + * bc_ptrs[1] == 2, so we exit the loop.  The list is now [1, 4, 2].
> > > > > + *
> > > > > + * For the 102nd record, bc_ptrs[0] == 2, so we continue.
> > > > > + *
> > > > > + * For the 201st record, we've moved on to leaf block 3.  bc_ptrs[0] == 1, so
> > > > > + * we add 3 to the list.  Now it is [1, 4, 2, 3].
> > > > > + *
> > > > > + * For the 300th record we just exit, with the list being [1, 4, 2, 3].
> > > > > + */
> > > > > +
> > > > > +/*
> > > > > + * Record all the buffers pointed to by the btree cursor.  Callers already
> > > > > + * engaged in a btree walk should call this function to capture the list of
> > > > > + * blocks going from the leaf towards the root.
> > > > > + */
> > > > > +int
> > > > > +xfs_bitmap_set_btcur_path(
> > > > > +	struct xfs_bitmap	*bitmap,
> > > > > +	struct xfs_btree_cur	*cur)
> > > > > +{
> > > > > +	struct xfs_buf		*bp;
> > > > > +	xfs_fsblock_t		fsb;
> > > > > +	int			i;
> > > > > +	int			error;
> > > > > +
> > > > > +	for (i = 0; i < cur->bc_nlevels && cur->bc_ptrs[i] == 1; i++) {
> > > > > +		xfs_btree_get_block(cur, i, &bp);
> > > > > +		if (!bp)
> > > > > +			continue;
> > > > > +		fsb = XFS_DADDR_TO_FSB(cur->bc_mp, bp->b_bn);
> > > > > +		error = xfs_bitmap_set(bitmap, fsb, 1);
> > > > 
> > > > Thanks for the comment. It helps explain the bc_ptrs == 1 check above,
> > > > but also highlights that xfs_bitmap_set() essentially allocates entries
> > > > for duplicate values if they exist. Is this handled by the broader
> > > > mechanism, for example, if the rmapbt was corrupted to have multiple
> > > > entries for a particular unused OWN_AG block? Or could we end up leaking
> > > > that corruption over to the agfl?
> > > 
> > > Right now we're totally dependent on the rmapbt being sane to rebuild
> > > the space metadata.
> > > 
> > > > I also wonder a bit about memory consumption on filesystems with large
> > > > metadata footprints. We essentially have to allocate one of these for
> > > > every allocation btree block before we can do the disunion and locate
> > > > the agfl-appropriate blocks. If we had a more lookup friendly structure,
> > > > perhaps this could be optimized by filtering out bnobt/cntbt blocks
> > > > during the associated btree walks..?
> > > > 
> > > > Have you thought about reusing something like the new in-core extent
> > > > tree mechanism as a pure in-memory extent store? It's certainly not
> > > > worth reworking something like that right now, but I wonder if we could
> > > > save memory via the denser format (and perhaps benefit from code
> > > > flexibility, reuse, etc.).
> > > 
> > > Yes, I was thinking about refactoring the iext btree into a more generic
> > > in-core index with 64-bit key so that I could adapt xfs_bitmap to use
> > > it.  In the longer term I would /also/ like to use xfs_bitmap to detect
> > > xfs_buf cache aliasing when multi-block buffers are in use, but that's
> > > further off. :)
> > > 
> > > As for the memory-intensive record lists in all the btree rebuilders, I
> > > have some ideas around that too -- either find a way to build an
> > > alternate btree and switch the roots over, or (once we gain the ability
> > > to mark an AG unavailable for new allocations) allocate an unlinked
> > > inode, store the records in the page cache pages for the file, and
> > > release it when we're done.
> > > 
> > > But, that can wait until I've gotten more of this merged, or get bored.
> > > :)
> > > 
> > > --D
> > > 
> > > > Brian
> > > > 
> > > > > +		if (error)
> > > > > +			return error;
> > > > > +	}
> > > > > +
> > > > > +	return 0;
> > > > > +}
> > > > > +
> > > > > +/* Collect a btree's block in the bitmap. */
> > > > > +STATIC int
> > > > > +xfs_bitmap_collect_btblock(
> > > > > +	struct xfs_btree_cur	*cur,
> > > > > +	int			level,
> > > > > +	void			*priv)
> > > > > +{
> > > > > +	struct xfs_bitmap	*bitmap = priv;
> > > > > +	struct xfs_buf		*bp;
> > > > > +	xfs_fsblock_t		fsbno;
> > > > > +
> > > > > +	xfs_btree_get_block(cur, level, &bp);
> > > > > +	if (!bp)
> > > > > +		return 0;
> > > > > +
> > > > > +	fsbno = XFS_DADDR_TO_FSB(cur->bc_mp, bp->b_bn);
> > > > > +	return xfs_bitmap_set(bitmap, fsbno, 1);
> > > > > +}
> > > > > +
> > > > > +/* Walk the btree and mark the bitmap wherever a btree block is found. */
> > > > > +int
> > > > > +xfs_bitmap_set_btblocks(
> > > > > +	struct xfs_bitmap	*bitmap,
> > > > > +	struct xfs_btree_cur	*cur)
> > > > > +{
> > > > > +	return xfs_btree_visit_blocks(cur, xfs_bitmap_collect_btblock, bitmap);
> > > > > +}
> > > > > diff --git a/fs/xfs/scrub/bitmap.h b/fs/xfs/scrub/bitmap.h
> > > > > index dad652ee9177..ae8ecbce6fa6 100644
> > > > > --- a/fs/xfs/scrub/bitmap.h
> > > > > +++ b/fs/xfs/scrub/bitmap.h
> > > > > @@ -28,5 +28,9 @@ void xfs_bitmap_destroy(struct xfs_bitmap *bitmap);
> > > > >  
> > > > >  int xfs_bitmap_set(struct xfs_bitmap *bitmap, uint64_t start, uint64_t len);
> > > > >  int xfs_bitmap_disunion(struct xfs_bitmap *bitmap, struct xfs_bitmap *sub);
> > > > > +int xfs_bitmap_set_btcur_path(struct xfs_bitmap *bitmap,
> > > > > +		struct xfs_btree_cur *cur);
> > > > > +int xfs_bitmap_set_btblocks(struct xfs_bitmap *bitmap,
> > > > > +		struct xfs_btree_cur *cur);
> > > > >  
> > > > >  #endif	/* __XFS_SCRUB_BITMAP_H__ */
> > > > > diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
> > > > > index 1e8a17c8e2b9..2670f4cf62f4 100644
> > > > > --- a/fs/xfs/scrub/scrub.c
> > > > > +++ b/fs/xfs/scrub/scrub.c
> > > > > @@ -220,7 +220,7 @@ static const struct xchk_meta_ops meta_scrub_ops[] = {
> > > > >  		.type	= ST_PERAG,
> > > > >  		.setup	= xchk_setup_fs,
> > > > >  		.scrub	= xchk_agfl,
> > > > > -		.repair	= xrep_notsupported,
> > > > > +		.repair	= xrep_agfl,
> > > > >  	},
> > > > >  	[XFS_SCRUB_TYPE_AGI] = {	/* agi */
> > > > >  		.type	= ST_PERAG,
> > > > > 
> > > > > --
> > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > > the body of a message to majordomo@vger.kernel.org
> > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > the body of a message to majordomo@vger.kernel.org
> > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 53+ messages in thread
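
The bc_ptrs walkthrough in the bitmap.c comment quoted in the message
above can be replayed with a small userspace sketch.  It models the same
two-level tree (leaf blocks 1-3 under root block 4, 100 records per leaf)
and uses the same loop shape as xfs_bitmap_set_btcur_path().  The toy_*
names and the struct are assumptions invented for this illustration;
they are not the kernel's cursor API.

```c
#include <assert.h>
#include <stddef.h>

#define NLEVELS		2	/* leaf level plus root level */
#define RECS_PER_LEAF	100
#define NLEAVES		3
#define ROOT_BLOCK	4	/* block numbers follow the comment's diagram */
#define MAX_SEEN	(NLEAVES + 1)

/*
 * Toy stand-in for the btree cursor: ptrs[i] is the 1-based slot within
 * the block at level i, and blocks[i] is that block's number.
 */
struct toy_cur {
	int	ptrs[NLEVELS];
	int	blocks[NLEVELS];
};

/* Walk from the leaf towards the root, stopping at the first ptr != 1. */
static size_t
toy_set_btcur_path(int *seen, size_t nseen, const struct toy_cur *cur)
{
	int	i;

	for (i = 0; i < NLEVELS && cur->ptrs[i] == 1; i++) {
		assert(nseen < MAX_SEEN);
		seen[nseen++] = cur->blocks[i];
	}
	return nseen;
}

/* Visit all 300 records from left to right, as a query_all walk would. */
static size_t
toy_walk_all(int *seen)
{
	size_t	nseen = 0;
	int	rec;

	for (rec = 0; rec < NLEAVES * RECS_PER_LEAF; rec++) {
		struct toy_cur	cur;
		int		leaf = rec / RECS_PER_LEAF + 1;

		cur.ptrs[0] = rec % RECS_PER_LEAF + 1;	/* record slot in leaf */
		cur.ptrs[1] = leaf;			/* key slot in root */
		cur.blocks[0] = leaf;
		cur.blocks[1] = ROOT_BLOCK;
		nseen = toy_set_btcur_path(seen, nseen, &cur);
	}
	return nseen;
}
```

Calling toy_walk_all() with an int array of MAX_SEEN entries records the
blocks in the order 1, 4, 2, 3, matching the list in the comment.  Each
block is recorded exactly once only because the walk is strictly left to
right; the sketch does nothing to deduplicate entries, which is the
concern raised about xfs_bitmap_set() in the review.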

* Re: [PATCH 05/14] xfs: repair free space btrees
  2018-08-07 23:34                     ` Darrick J. Wong
@ 2018-08-08 12:29                       ` Brian Foster
  2018-08-08 22:42                         ` Darrick J. Wong
  0 siblings, 1 reply; 53+ messages in thread
From: Brian Foster @ 2018-08-08 12:29 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, david, allison.henderson

On Tue, Aug 07, 2018 at 04:34:58PM -0700, Darrick J. Wong wrote:
> On Fri, Aug 03, 2018 at 06:49:40AM -0400, Brian Foster wrote:
> > On Thu, Aug 02, 2018 at 12:22:05PM -0700, Darrick J. Wong wrote:
> > > On Thu, Aug 02, 2018 at 09:48:24AM -0400, Brian Foster wrote:
> > > > On Wed, Aug 01, 2018 at 11:28:45PM -0700, Darrick J. Wong wrote:
> > > > > On Wed, Aug 01, 2018 at 02:39:20PM -0400, Brian Foster wrote:
> > > > > > On Wed, Aug 01, 2018 at 09:23:16AM -0700, Darrick J. Wong wrote:
> > > > > > > On Wed, Aug 01, 2018 at 07:54:09AM -0400, Brian Foster wrote:
> > > > > > > > On Tue, Jul 31, 2018 at 03:01:25PM -0700, Darrick J. Wong wrote:
> > > > > > > > > On Tue, Jul 31, 2018 at 01:47:23PM -0400, Brian Foster wrote:
> > > > > > > > > > On Sun, Jul 29, 2018 at 10:48:21PM -0700, Darrick J. Wong wrote:
> > > > > > > > > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > > > > > > > 
> > > > > > > > > > > Rebuild the free space btrees from the gaps in the rmap btree.
> > > > > > > > > > > 
> > > > > > > > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > > > > > > > ---
> > > > > > > > > > >  fs/xfs/Makefile             |    1 
> > > > > > > > > > >  fs/xfs/scrub/alloc.c        |    1 
> > > > > > > > > > >  fs/xfs/scrub/alloc_repair.c |  581 +++++++++++++++++++++++++++++++++++++++++++
> > > > > > > > > > >  fs/xfs/scrub/common.c       |    8 +
> > > > > > > > > > >  fs/xfs/scrub/repair.h       |    2 
> > > > > > > > > > >  fs/xfs/scrub/scrub.c        |    4 
> > > > > > > > > > >  fs/xfs/scrub/trace.h        |    2 
> > > > > > > > > > >  fs/xfs/xfs_extent_busy.c    |   14 +
> > > > > > > > > > >  fs/xfs/xfs_extent_busy.h    |    2 
> > > > > > > > > > >  9 files changed, 610 insertions(+), 5 deletions(-)
> > > > > > > > > > >  create mode 100644 fs/xfs/scrub/alloc_repair.c
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > ...
> > > > > > > > > > > diff --git a/fs/xfs/scrub/alloc_repair.c b/fs/xfs/scrub/alloc_repair.c
> > > > > > > > > > > new file mode 100644
> > > > > > > > > > > index 000000000000..b228c2906de2
> > > > > > > > > > > --- /dev/null
> > > > > > > > > > > +++ b/fs/xfs/scrub/alloc_repair.c
> > > > > > > > > > > @@ -0,0 +1,581 @@
> > > > > > ...
> > > > > > > > > > > +
> > > > > > > > > > > +/*
> > > > > > > > > > > + * Make our new freespace btree roots permanent so that we can start freeing
> > > > > > > > > > > + * unused space back into the AG.
> > > > > > > > > > > + */
> > > > > > > > > > > +STATIC int
> > > > > > > > > > > +xrep_abt_commit_new(
> > > > > > > > > > > +	struct xfs_scrub	*sc,
> > > > > > > > > > > +	struct xfs_bitmap	*old_allocbt_blocks,
> > > > > > > > > > > +	int			log_flags)
> > > > > > > > > > > +{
> > > > > > > > > > > +	int			error;
> > > > > > > > > > > +
> > > > > > > > > > > +	xfs_alloc_log_agf(sc->tp, sc->sa.agf_bp, log_flags);
> > > > > > > > > > > +
> > > > > > > > > > > +	/* Invalidate the old freespace btree blocks and commit. */
> > > > > > > > > > > +	error = xrep_invalidate_blocks(sc, old_allocbt_blocks);
> > > > > > > > > > > +	if (error)
> > > > > > > > > > > +		return error;
> > > > > > > > > > 
> > > > > > > > > > It looks like the above invalidation all happens in the same
> > > > > > > > > > transaction. Those aren't logging buffer data or anything, but any idea
> > > > > > > > > > how many log formats we can get away with in this single transaction?
> > > > > > > > > 
> > > > > > > > > Hm... well, on my computer a log format is ~88 bytes.  Assuming 4K
> > > > > > > > > blocks, the max AG size of 1TB, maximum free space fragmentation, and
> > > > > > > > > two btrees, the tree could be up to ~270 million records.  Assuming ~505
> > > > > > > > > records per block, that's ... ~531,000 leaf blocks and ~1100 node blocks
> > > > > > > > > for both btrees.  If we invalidate both, that's ~46M of RAM?
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > I was thinking more about transaction reservation than RAM. It may not
> > > > > > > 
> > > > > > > Hmm.  tr_itruncate is ~650K on my 2TB SSD, assuming 88 bytes per, that's
> > > > > > > about ... ~7300 log format items?  Not a lot, maybe it should roll the
> > > > > > > transaction every 1000 invalidations or so...
> > > > > > > 
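The back-of-the-envelope numbers in the exchange above can be reproduced
with a short userspace sketch.  It takes the thread's figures as given
(88 bytes per log format item, ~505 records per block) and adds two
assumptions of its own: worst-case fragmentation means every other block
is free, and node blocks have the same fanout as leaves.  The toy_* names
are made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_LOG_FORMAT_BYTES	88	/* per-item size quoted in the thread */
#define TOY_RECS_PER_BLOCK	505	/* records per block quoted in the thread */

struct toy_abt_estimate {
	uint64_t	leaf_blocks;	/* both free space btrees */
	uint64_t	node_blocks;	/* both free space btrees */
	uint64_t	format_bytes;	/* one log format item per block */
};

static uint64_t
toy_div_round_up(uint64_t a, uint64_t b)
{
	return (a + b - 1) / b;
}

/* Count interior blocks above a level of nblocks, up to a single root. */
static uint64_t
toy_node_blocks(uint64_t nblocks)
{
	uint64_t	total = 0;

	while (nblocks > 1) {
		nblocks = toy_div_round_up(nblocks, TOY_RECS_PER_BLOCK);
		total += nblocks;
	}
	return total;
}

/* Worst case: every other AG block is free, giving ag_blocks / 2 records. */
static struct toy_abt_estimate
toy_estimate(uint64_t ag_blocks)
{
	struct toy_abt_estimate	e;
	uint64_t		leaves;

	leaves = toy_div_round_up(ag_blocks / 2, TOY_RECS_PER_BLOCK);
	e.leaf_blocks = 2 * leaves;			/* bnobt + cntbt */
	e.node_blocks = 2 * toy_node_blocks(leaves);
	e.format_bytes = (e.leaf_blocks + e.node_blocks) *
			 TOY_LOG_FORMAT_BYTES;
	return e;
}

/* How many log format items fit in a reservation of resv_bytes. */
static uint64_t
toy_max_log_formats(uint64_t resv_bytes)
{
	return resv_bytes / TOY_LOG_FORMAT_BYTES;
}
```

For a 1TB AG with 4K blocks (2^28 blocks) this gives 531,556 leaf blocks,
1,060 node blocks and about 46.9 million bytes of log format items, in
line with the ~531,000 / ~1100 / ~46M estimate.  A 650,000 byte
reservation holds 7,386 items, close to the ~7300 figure, so a worst-case
invalidation exceeds one reservation by roughly a factor of 70, which is
why rolling the transaction at a fixed interval comes up.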
> > > > > > 
> > > > > > I'm not really sure what qualifies as a lot here given that the blocks
> > > > > > would need to be in-core, but rolling on some fixed/safe interval sounds
> > > > > > reasonable to me.
> > > > > > 
> > > > > > > > currently be an issue, but it might be worth putting something down in a
> > > > > > > > comment to note that this is a single transaction and we expect to not
> > > > > > > > have to invalidate more than N (ballpark) blocks in a single go,
> > > > > > > > whatever that value happens to be.
> > > > > > > > 
> > > > > > > > > > > +	error = xrep_roll_ag_trans(sc);
> > > > > > > > > > > +	if (error)
> > > > > > > > > > > +		return error;
> > > > > > > > > > > +
> > > > > > > > > > > +	/* Now that we've succeeded, mark the incore state valid again. */
> > > > > > > > > > > +	sc->sa.pag->pagf_init = 1;
> > > > > > > > > > > +	return 0;
> > > > > > > > > > > +}
> > > > > > > > > > > +
> > > > > > > > > > > +/* Build new free space btrees and dispose of the old one. */
> > > > > > > > > > > +STATIC int
> > > > > > > > > > > +xrep_abt_rebuild_trees(
> > > > > > > > > > > +	struct xfs_scrub	*sc,
> > > > > > > > > > > +	struct list_head	*free_extents,
> > > > > > > > > > > +	struct xfs_bitmap	*old_allocbt_blocks)
> > > > > > > > > > > +{
> > > > > > > > > > > +	struct xfs_owner_info	oinfo;
> > > > > > > > > > > +	struct xrep_abt_extent	*rae;
> > > > > > > > > > > +	struct xrep_abt_extent	*n;
> > > > > > > > > > > +	struct xrep_abt_extent	*longest;
> > > > > > > > > > > +	int			error;
> > > > > > > > > > > +
> > > > > > > > > > > +	xfs_rmap_skip_owner_update(&oinfo);
> > > > > > > > > > > +
> > > > > > > > > > > +	/*
> > > > > > > > > > > +	 * Insert the longest free extent in case it's necessary to
> > > > > > > > > > > +	 * refresh the AGFL with multiple blocks.  If there is no longest
> > > > > > > > > > > +	 * extent, we had exactly the free space we needed; we're done.
> > > > > > > > > > > +	 */
> > > > > > > > > > 
> > > > > > > > > > I'm confused by the last sentence. longest should only be NULL if the
> > > > > > > > > > free space list is empty and haven't we already bailed out with -ENOSPC
> > > > > > > > > > if that's the case?
> > > > > > > > > > 
> > > > > > > > > > > +	longest = xrep_abt_get_longest(free_extents);
> > > > > > > > > 
> > > > > > > > > xrep_abt_rebuild_trees is called after we allocate and initialize two
> > > > > > > > > new btree roots in xrep_abt_reset_btrees.  If free_extents is an empty
> > > > > > > > > list here, then we found exactly two blocks worth of free space and used
> > > > > > > > > them to set up new btree roots.
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > Got it, thanks.
> > > > > > > > 
> > > > > > > > > > > +	if (!longest)
> > > > > > > > > > > +		goto done;
> > > > > > > > > > > +	error = xrep_abt_free_extent(sc,
> > > > > > > > > > > +			XFS_AGB_TO_FSB(sc->mp, sc->sa.agno, longest->bno),
> > > > > > > > > > > +			longest->len, &oinfo);
> > > > > > > > > > > +	list_del(&longest->list);
> > > > > > > > > > > +	kmem_free(longest);
> > > > > > > > > > > +	if (error)
> > > > > > > > > > > +		return error;
> > > > > > > > > > > +
> > > > > > > > > > > +	/* Insert records into the new btrees. */
> > > > > > > > > > > +	list_for_each_entry_safe(rae, n, free_extents, list) {
> > > > > > > > > > > +		error = xrep_abt_free_extent(sc,
> > > > > > > > > > > +				XFS_AGB_TO_FSB(sc->mp, sc->sa.agno, rae->bno),
> > > > > > > > > > > +				rae->len, &oinfo);
> > > > > > > > > > > +		if (error)
> > > > > > > > > > > +			return error;
> > > > > > > > > > > +		list_del(&rae->list);
> > > > > > > > > > > +		kmem_free(rae);
> > > > > > > > > > > +	}
> > > > > > > > > > 
> > > > > > > > > > Ok, at this point we've reset the btree roots and we start freeing the
> > > > > > > > > > free ranges that were discovered via the rmapbt analysis. AFAICT, if we
> > > > > > > > > > fail or crash at this point, we leave the allocbts in a partially
> > > > > > > > > > constructed state. I take it that is Ok with respect to the broader
> > > > > > > > > > repair algorithm because we'd essentially start over by inspecting the
> > > > > > > > > > rmapbt again on a retry.
> > > > > > > > > 
> > > > > > > > > Right.  Though in the crash/shutdown case, you'll end up with the
> > > > > > > > > filesystem in an offline state at some point before you can retry the
> > > > > > > > > > scrub, so it's probably faster to run xfs_repair to fix the damage.
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > Can we really assume that if we're already up and running an online
> > > > > > > > repair? The filesystem has to be mountable in that case in the first
> > > > > > > > place. If we've already reset and started reconstructing the allocation
> > > > > > > > btrees then I'd think those transactions would recover just fine on a
> > > > > > > > power loss or something (perhaps not in the event of some other
> > > > > > > > corruption related shutdown).
> > > > > > > 
> > > > > > > Right, for the system crash case, whatever transactions committed should
> > > > > > > replay just fine, and you can even start up the online repair again, and
> > > > > > > if the AG isn't particularly close to ENOSPC then (barring rmap
> > > > > > > corruption) it should work just fine.
> > > > > > > 
> > > > > > > If the fs went down because either (a) repair hit other corruption or
> > > > > > > (b) some other thread hit an error in some other part of the filesystem,
> > > > > > > then it's not so clear -- in (b) you could probably try again, but for
> > > > > > > (a) you'll definitely have to unmount and run xfs_repair.
> > > > > > > 
> > > > > > 
> > > > > > Indeed, there are certainly cases where we simply won't be able to do an
> > > > > > online repair. I'm trying to think about scenarios where we should be
> > > > > > able to do an online repair, but we lose power or hit some kind of
> > > > > > transient error like a memory allocation failure before it completes. It
> > > > > > would be nice if the online repair itself didn't contribute (within
> > > > > > reason) to the inability to simply try again just because the fs was
> > > > > > close to -ENOSPC.
> > > > > 
> > > > > Agreed.  Most of the, uh, opportunities to hit ENOMEM happen before we
> > > > > start modifying on-disk metadata.  If that happens, we just free all the
> > > > > memory and bail out having done nothing.
> > > > > 
> > > > > > For one, I think it's potentially confusing behavior. Second, it might
> > > > > > be concerning to regular users who perceive it as an online repair
> > > > > > leaving the fs in a worse off state. Us fs devs know that may not really
> > > > > > be the case, but I think we're better for addressing it if we can
> > > > > > reasonably do so.
> > > > > 
> > > > > <nod> Further in the future I want to add the ability to offline an AG,
> > > > > so the worst that happens is that scrub turns the AG off, repair doesn't
> > > > > fix it, and the AG simply stays offline.  That might give us the
> > > > > ability to survive cancelling the repair transaction, since if the AG's
> > > > > offline already anyway we could just throw away the dirty buffers and
> > > > > resurrect the AG later.  I don't know, that's purely speculative.
> > > > > 
> > > > > > > Perhaps the guideline here is that if the fs goes down more than once
> > > > > > > during online repair then unmount it and run xfs_repair.
> > > > > > > 
> > > > > > 
> > > > > > Yep, I think that makes sense if the filesystem or repair itself is
> > > > > > tripping over other corruptions that fail to keep it active for the
> > > > > > duration of the repair.
> > > > > 
> > > > > <nod>
> > > > > 
> > > > > > > > > > The blocks allocated for the btrees that we've begun to construct here
> > > > > > > > > > end up mapped in the rmapbt as we go, right? IIUC, that means we don't
> > > > > > > > > > necessarily have infinite retries to make sure this completes. IOW,
> > > > > > > > > > suppose that a first repair attempt finds just enough free space to
> > > > > > > > > > construct new trees, gets far enough along to consume most of that free
> > > > > > > > > > space and then crashes. Is it possible that a subsequent repair attempt
> > > > > > > > > > includes the btree blocks allocated during the previous failed repair
> > > > > > > > > > attempt in the sum of "old btree blocks" and determines we don't have
> > > > > > > > > > enough free space to repair?
> > > > > > > > > 
> > > > > > > > > Yes, that's a risk of running the free space repair.
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > Can we improve on that? For example, are the rmapbt entries for the old
> > > > > > > > allocation btree blocks necessary once we commit the btree resets? If
> > > > > > > > not, could we remove those entries before we start tree reconstruction?
> > > > > > > > 
> > > > > > > > Alternatively, could we incorporate use of the old btree blocks? As it
> > > > > > > > is, we discover those blocks simply so we can free them at the end.
> > > > > > > > Perhaps we could free them sooner or find a more clever means to
> > > > > > > > reallocate directly from that in-core list? I guess we have to consider
> > > > > > > > whether they were really valid/sane btree blocks, but either way ISTM
> > > > > > > > that the old blocks list is essentially invalidated once we reset the
> > > > > > > > btrees.
> > > > > > > 
> > > > > > > Hmm, it's a little tricky to do that -- we could reap the old bnobt and
> > > > > > > cntbt blocks (in the old_allocbt_blocks bitmap) first, but if adding a
> > > > > > > record causes a btree split we'll pull blocks from the AGFL, and if
> > > > > > > there aren't enough blocks in the bnobt to fill the AGFL back up then
> > > > > > > fix_freelist won't succeed.  That complication is why it finds the
> > > > > > > longest extent in the unclaimed list and pushes that in first, then
> > > > > > > works on the rest of the extents.
> > > > > > > 
> > > > > > 
> > > > > > Hmm, but doesn't a btree split require at least one full space btree
> > > > > > block per-level? In conjunction, the agfl minimum size requirement grows
> > > > > > with the height of the tree, which implies available free space..? I
> > > > > > could be missing something, perhaps we have to account for the rmapbt in
> > > > > > that case as well? Regardless...
> > > > > > 
> > > > > > > I suppose one could try to avoid ENOSPC by pushing that longest extent
> > > > > > > in first (since we know that won't trigger a split), then reap the old
> > > > > > > alloc btree blocks, and then add everything else back in...
> > > > > > > 
> > > > > > 
> > > > > > I think it would be reasonable to seed the btree with the longest record
> > > > > > or some fixed number of longest records (~1/2 a root block, for example)
> > > > > > before making actual use of the btrees to reap the old blocks. I think
> > > > > > then you'd only have a very short window of a single block leak on a
> > > > > > poorly timed power loss and repair retry sequence before you start
> > > > > > actually freeing originally used space (which in practice, I think
> > > > > > solves the problem).
> > > > > > 
> > > > > > Given that we're starting from empty, I wonder if another option may be
> > > > > > > to overfill the agfl with old btree blocks or something. The first real
> > > > > > free should shift enough blocks back into the btrees to ensure the agfl
> > > > > > can be managed from that point forward, right? That may be more work
> > > > > > than it's worth though and/or a job for another patch. (FWIW, we also
> > > > > > have that NOSHRINK agfl fixup flag for userspace repair.)
> > > > > 
> > > > > Yes, I'll give that a try tomorrow, now that I've finished porting all
> > > > > the 4.19 stuff to xfsprogs. :)
> > > > > 
> > > > > Looping back to something we discussed earlier in this thread, I'd
> > > > > prefer to hold off on converting the list of already-freed extents to
> > > > > xfs_bitmap because the same problem exists in all the repair functions
> > > > > of having to store a large number of records for the rebuilt btree, and
> > > > > maybe there's some way to <cough> use pageable memory for that, since
> > > > > the access patterns for that are append, sort, and iterate; for those
> > > > > three uses we don't necessarily require all the records to be in memory
> > > > > all the time.  For the allocbt repair I expect the free space records to
> > > > > be far more numerous than the list of old bnobt/cntbt blocks.
> > > > > 
> > > > 
> > > > Ok, it's fair enough that we'll probably want to find some kind of
> > > > generic, more efficient technique for handling this across the various
> > > > applicable repair algorithms.
> > > > 
> > > > One other high level thing that crossed my mind with regard to the
> > > > general btree reconstruction algorithms is whether we need to build up
> > > > this kind of central record list at all. For example, rather than slurp
> > > > up the entire list of btree records in-core, sort it and dump it back
> > > > out, could we take advantage of the fact that our existing on-disk
> > > > structure insertion mechanisms already handle out of order records
> > > > (simply stated, an extent free knows how to insert the associated record
> > > > at the right place in the space btrees)? For example, suppose we reset
> > > > the existing btrees first, then scanned the rmapbt and repopulated the
> > > > new btrees as records are discovered..?
> > > 
> > > I tried that in an earlier draft of the bnobt repair function.  The
> > > biggest problem with inserting as we go is dealing with the inevitable
> > > transaction rolls (right now we roll after every record insertion to avoid
> > > playing games with guessing how much reservation is left).  Btree
> > > cursor state can't survive transaction rolls because the transaction
> > > commit releases all the buffers that aren't bhold'en, and we can't bhold
> > > that many buffers across a _defer_finish.
> > > 
> > 
> > Ok, interesting.
> > 
> > Where do we need to run an xfs_defer_finish() during the reconstruction
> > sequence, btw?
> 
> Not here, as I'm sure you were thinking. :)  For the AG btrees
> themselves it's sufficient to roll the transaction.  I suppose we could
> simply have a xfs_btree_bhold function that would bhold every buffer so
> that a cursor could survive a roll.
> 
> Inode fork reconstruction is going to require _defer_finish, however.
> 

Ok, just wasn't sure if I missed something in the bits I've looked
through so far..

> > I thought that would only run on final commit as opposed to
> > intermediate rolls.
> 
> We could let the deferred items sit around until final commit, but I
> think I'd prefer to process them as soon as possible since iirc deferred
> items pin the log until they're finished.  I would hope that userspace
> isn't banging on the log while repair runs, but it's certainly possible.
> 

I was just surmising in general, not necessarily suggesting we change
behavior.

> > We could just try and make the automatic buffer relogging list a
> > dynamic allocation if there are enough held buffers in the
> > transaction.
> 
> Hmm.  Might be worth pursuing...
> 
> > > So, that early draft spent a lot of time tearing down and reconstructing
> > > rmapbt cursors since the standard _btree_query_all isn't suited to that
> > > kind of usage.  It was easily twice as slow on a RAM-backed disk just
> > > from the rmap cursor overhead and much more complex, so I rewrote it to
> > > be simpler.  I also have a slight preference for not touching anything
> > > until we're absolutely sure we have all the data we need to repair the
> > > structure.
> > > 
> > 
> > Yes, I think that is sane in principle. I'm just a bit concerned about
> > how reliable that xfs_repair-like approach will be in the kernel longer
> > term, particularly once we start having to deal with large filesystems
> > and limited or contended memory, etc. We already have xfs_repair users
> > that need to tweak settings because there isn't enough memory available
> > to repair the fs. Granted that is for fs-wide repairs and the flipside
> > is that we know a single AG can only be up to 1TB. It's certainly
> > possible that putting some persistent backing behind the in-core data is
> > enough to resolve the problem (and the current approach is certainly
> > reasonable enough to me for the initial implementation).
> > 
> > bjoin limitations aside, I wonder if a cursor roll mechanism that held
> > all of the cursor buffers, rolled the transaction and then rejoined all
> > said buffers would help us get around that. (Not sure I follow the early
> > prototype behavior, but it sounds like we had to restart the rmapbt
> > lookup over and over...).
> 
> Correct.
> 
> > Another caveat with that approach may be that I think we'd need to be
> > sure that the reconstruction operation doesn't ever need to update the
> > rmapbt while we're mid walk of the latter.
> 
> <nod> Looking even farther back in my notes, that was also an issue --
> fixing the free list causes blocks to go on or off the agfl, which
> causes rmapbt updates, which meant that the only way I could get
> in-place updates to work was to re-lookup where we were in the btree and
> also try to deal with any rmapbt entries that might have crept in as a
> result of the record insertion.
> 
> Getting the concurrency right for each repair function looked like a
> difficult problem to solve, but amassing all the records elsewhere and
> rebuilding was easy to understand.
> 

Yeah. This all points to this kind of strategy being too complex to be
worth the prospective benefits in the short term. Clearly we have
several, potentially tricky roadblocks to work through before this can
be made feasible. Thanks for the background, it's still useful to have
this context to compare with whatever we may have to do to support a
reclaimable memory approach.

> > That may be an issue for inode btree reconstruction, for example,
> > since it looks like inobt block allocation requires rmapbt updates.
> > We'd probably need some way to share (or invalidate) a cursor across
> > different contexts to deal with that.
> 
> I might pursue that strategy if we ever hit the point where we can't
> find space to store the records (see below).  Another option could be to
> divert all deferred items for an AG, build a replacement btree in new
> space, then finish all the deferred items... but that's starting to get
> into offlineable AGs, which is its own project that I want to tackle
> later.
> 
> (Not that much later, just not this cycle.)
> 

*nod*

> > > For other repair functions (like the data/attr fork repairs) we have to
> > > scan all the rmapbts for extents, and I'd prefer to lock those AGs only
> > > for as long as necessary to extract the extents we want.
> > > 
> > > > The obvious problem is that we still have some checks that allow the
> > > > whole repair operation to bail out before we determine whether we can
> > > > start to rebuild the on-disk btrees. These are things like making sure
> > > > we can actually read the associated rmapbt blocks (i.e., no read errors
> > > > or verifier failures), basic record sanity checks, etc. But ISTM that
> > > > isn't anything we couldn't get around with a multi-pass implementation.
> > > > Secondary issues might be things like no longer being able to easily
> > > > insert the longest free extent range(s) first (meaning we'd have to
> > > > stuff the agfl with old btree blocks or figure out some other approach).
> > > 
> > > Well, you could scan the rmapbt twice -- once to find the longest
> > > record, then again to do the actual insertion.
> > > 
> > 
> > Yep, that's what I meant by multi-pass.
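The two-pass idea discussed above (one scan to find the longest free extent, a second scan to insert everything else) can be sketched in plain userspace C. This is only an illustration under stated assumptions: struct free_extent, longest_extent() and insertion_order() are hypothetical names, the extents live in an ordinary array rather than in the rmapbt, and "insertion" is reduced to computing the order in which records would be fed to the btree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical free space record: starting block and length in blocks. */
struct free_extent {
	uint32_t start;
	uint32_t len;
};

/* Pass one: return the index of the longest extent, or n if there are none. */
size_t longest_extent(const struct free_extent *ext, size_t n)
{
	size_t best = n;

	for (size_t i = 0; i < n; i++)
		if (best == n || ext[i].len > ext[best].len)
			best = i;
	return best;
}

/*
 * Pass two: write the insertion order into order[] (which must hold n
 * entries).  The longest extent goes first so that later insertions have
 * free space to draw on; the rest follow in scan order.  Returns the number
 * of entries written.
 */
size_t insertion_order(const struct free_extent *ext, size_t n, size_t *order)
{
	size_t seed = longest_extent(ext, n);
	size_t out = 0;

	if (seed == n)
		return 0;

	order[out++] = seed;
	for (size_t i = 0; i < n; i++)
		if (i != seed)
			order[out++] = i;
	return out;
}
```

With three extents of lengths 3, 50 and 7, the order comes out as index 1 first, then 0 and 2. The cost of the second scan is the price paid for not holding the whole record list in memory.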
> > 
> > > > BTW, isn't the typical scrub sequence already multi-pass by virtue of
> > > > the xfs_scrub_metadata() implementation? I'm wondering if the ->scrub()
> > > > callout could not only detect corruption, but validate whether repair
> > > > (if requested) is possible based on the kind of checks that are
> > > > currently in the repair side rmapbt walkers. Thoughts?
> > > 
> > > Yes, scrub basically validates that for us now, with the notable
> > > exception of the notorious rmapbt scrubber, which doesn't
> > > cross-reference with inode block mappings because that would be a
> > > locking nightmare.
> > > 
> > > > Are there future
> > > > changes that are better supported by an in-core tracking structure in
> > > > general (assuming we'll eventually replace the linked lists with
> > > > something more efficient) as opposed to attempting to optimize out the
> > > > need for that tracking at all?
> > > 
> > > Well, I was thinking that we could just allocate a memfd (or a file on
> > > the same xfs once we have AG offlining) and store the records in there.
> > > That saves us the list_head overhead and potentially enables access to a
> > > lot more storage than pinning things in RAM.
> > > 
> > 
> > Would using the same fs mean we have to store the repair data in a
> > separate AG, or somehow locate/use free space in the target AG?
> 
> As part of building an "offline AG" feature we'd presumably have to
> teach the allocators to avoid the offline AGs for allocations, which
> would make it so that we could host the repair data files in the same
> XFS that's being fixed.  That seems a little risky to me, but the disk
> is probably larger than mem+swap.
> 

Got it, so we'd use the remaining space in the fs outside of the target
AG. ISTM that still presumes the rest of the fs is coherent, but I
suppose the offline AG thing helps us with that. We'd just have to make
sure we've shut down all currently corrupted AGs before we start to
repair a particular corrupted one, and then hope there's still enough
free space in the fs to proceed.

That makes more sense, but I still agree that it seems risky in general.
Technical risk aside, there are also usability concerns in that the local
free space requirement is another bit of non-determinism around the
ability to online repair vs. having to punt to xfs_repair, or if the
repair consumes whatever free space remains in the fs to the detriment
of whatever workload the user presumably wanted to keep the fs online
for, etc.

> > presume either way we'd have to ensure that AG is either consistent or
> > locked out from outside I/O. If we have the total record count we can
> 
> We usually don't, but for the btrees that have their own record/blocks
> counters we might be able to guess a number, fallocate it, and see if
> that doesn't ENOSPC.
> 
> > preallocate the file and hope there is no such other free space
> > corruption or something that would allow some other task to mess with
> > our blocks. I'm a little skeptical overall on relying on a corrupted
> > filesystem to store repair data, but perhaps there are ways to mitigate
> > the risks.
> 
> Store it elsewhere?  /home for root repairs, /root for any other
> repair... though if we're going to do that, why not just add a swap file
> temporarily?
> 

Indeed. The thought crossed my mind about whether we could do something
like have an internal/isolated swap file for dedicated XFS allocations
to avoid contention with the traditional swap. Userspace could somehow
set it up or communicate to the kernel. I have no idea how realistic
that is though or if there's a better interface for that kind of thing
(i.e., file backed kmem cache?). What _seems_ beneficial about that
approach is we get (potentially external) persistent backing and memory
reclaim ability with the traditional memory allocation model.

ISTM that if we used a regular file, we'd need to deal with the
traditional file interface somehow or another (file read/pagecache
lookup -> record ??). We could repurpose some existing mechanism like
the directory code or quota inode mechanism to use xfs buffers for that
purpose, but I think that would require us to always use an internal
inode. Allowing userspace to pass an fd/file passes that consideration
on to the user, which might be more flexible. We could always warn about
additional limitations if that fd happens to be based on the target fs.

> > I'm not familiar with memfd. The manpage suggests it's ram backed, is it
> > swappable or something?
> 
> It's supposed to be.  The quick test I ran (allocate a memfd, write 1GB
> of junk to it on a VM with 400M of RAM) seemed to push about 980MB into
> the swap file.
> 

Ok.
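For readers unfamiliar with memfd, the append, sort, and iterate access pattern mentioned earlier in the thread can be tried from userspace. The sketch below is an assumption-laden illustration, not the kernel-side design: memfd_create(2) is the real Linux call, but struct rec and the recstore_* helpers are hypothetical names, and the records are sorted through a shared mapping with qsort() purely for brevity.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical fixed-size record, e.g. one free space extent. */
struct rec {
	uint32_t start;
	uint32_t len;
};

static int rec_cmp(const void *a, const void *b)
{
	const struct rec *ra = a, *rb = b;

	return (ra->start > rb->start) - (ra->start < rb->start);
}

/* Create an anonymous, swappable memory file.  Returns an fd or -1. */
int recstore_open(void)
{
	return memfd_create("repair-records", MFD_CLOEXEC);
}

/* Append one record.  Returns 0 on success, -1 on error or short write. */
int recstore_append(int fd, const struct rec *r)
{
	if (lseek(fd, 0, SEEK_END) < 0)
		return -1;
	return write(fd, r, sizeof(*r)) == (ssize_t)sizeof(*r) ? 0 : -1;
}

/* Sort the first nr records in place through a shared mapping. */
int recstore_sort(int fd, size_t nr)
{
	struct rec *map;

	if (nr == 0)
		return 0;
	map = mmap(NULL, nr * sizeof(*map), PROT_READ | PROT_WRITE,
		   MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		return -1;
	qsort(map, nr, sizeof(*map), rec_cmp);
	return munmap(map, nr * sizeof(*map));
}

/* Read record number idx (zero-based).  Returns 0 on success, -1 on error. */
int recstore_get(int fd, size_t idx, struct rec *r)
{
	off_t pos = (off_t)(idx * sizeof(*r));

	return pread(fd, r, sizeof(*r), pos) == (ssize_t)sizeof(*r) ? 0 : -1;
}
```

A caller would open the store, append records as they are discovered, sort once, and then walk them with recstore_get(). Nothing here addresses the pushback problem raised in the thread: once the data is written, the only relief for memory pressure is swap.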

> > If so, that sounds like a reasonable option provided the swap space
> > requirement can be made clear to users
> 
> We can document it.  I don't think it's any worse than xfs_repair being
> able to use up all the memory + swap... and since we're probably only
> going to be repairing one thing at a time, most likely scrub won't need
> as much memory.
> 

Right, but as noted below, my concerns with the xfs_repair comparison
are that 1.) the kernel generally has more of a limit on anonymous
memory allocations than userspace (i.e., not swappable AFAIU?) and 2.)
it's not clear how effectively running the system out of memory via the
kernel will behave from a failure perspective.

IOW, xfs_repair can run the system out of memory but for the most part
that ends up being a simple problem for the system: OOM kill the bloated
xfs_repair process. For an online repair in a similar situation, I have
no idea what's going to happen. The hope is that the online repair hits
-ENOMEM and unwinds, but ISTM we'd still be at risk of other subsystems
running into memory allocation problems, filling up swap, the OOM killer
going after unrelated processes, etc.

What if, for example, the OOM killer starts picking off processes in
service to a running online repair that immediately consumes freed up
memory until the system is borked? I don't know how likely that is or if
it really ends up much different from the analogous xfs_repair
situation. My only point right now is that the failure scenario is something
we should explore for any solution we ultimately consider because it may
be an unexpected use case of the underlying mechanism. (By contrast,
just using a cached file seems a natural fit from that perspective.)

> > and the failure characteristics aren't more severe than for userspace.
> > An online repair that puts the broader system at risk of OOM as
> > opposed to predictably failing gracefully may not be the most useful
> > tool.
> 
> Agreed.  One huge downside of memfd seems to be the lack of a mechanism
> for the vm to push back on us if we successfully write all we need to
> the memfd but then other processes need some memory.  Obviously, if the
> memfd write itself comes up short or fails then we dump the memfd and
> error back to userspace.  We might simply have to free array memory
> while we iterate the records to minimize the time spent at peak memory
> usage.
> 

Hm, yeah. Some kind of fixed/relative size in-core memory pool approach
may simplify things because we could allocate it up front and know right
away whether we just don't have enough memory available to repair.

Brian
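The fixed-size pool idea in the last paragraph can be illustrated with a small userspace sketch. Everything here is hypothetical (struct rec_pool and the rec_pool_* names do not come from the patchset), and calloc() merely stands in for whatever allocator a kernel implementation would use; the point is only that the capacity is reserved before any repair work starts, so the caller learns immediately whether it can proceed.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record type held by the pool. */
struct pool_rec {
	uint32_t start;
	uint32_t len;
};

/* Hypothetical preallocated record pool. */
struct rec_pool {
	struct pool_rec *recs;
	size_t cap;	/* records reserved up front */
	size_t used;	/* records stored so far */
};

/* Reserve room for cap records now.  Returns 0 or -ENOMEM. */
int rec_pool_init(struct rec_pool *p, size_t cap)
{
	p->recs = calloc(cap, sizeof(*p->recs));
	if (!p->recs && cap)
		return -ENOMEM;
	p->cap = cap;
	p->used = 0;
	return 0;
}

/* Store one record.  Returns 0, or -ENOSPC once the pool is full. */
int rec_pool_add(struct rec_pool *p, struct pool_rec r)
{
	if (p->used == p->cap)
		return -ENOSPC;
	p->recs[p->used++] = r;
	return 0;
}

/* Release the pool. */
void rec_pool_free(struct rec_pool *p)
{
	free(p->recs);
	p->recs = NULL;
	p->cap = 0;
	p->used = 0;
}
```

A repair function built on this would fail at rec_pool_init() time if memory is short, and would stop with -ENOSPC instead of growing without bound when it meets more records than expected. Note that a userspace calloc() may succeed through overcommit, so this sketch shows the interface rather than a real guarantee.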

> --D
> 
> > 
> > Brian
> > 
> > > --D
> > > 
> > > > Brian
> > > > 
> > > > > --D
> > > > > 
> > > > > > Brian
> > > > > > 
> > > > > > > --D
> > > > > > > 
> > > > > > > > Brian
> > > > > > > > 
> > > > > > > > > > > +
> > > > > > > > > > > +done:
> > > > > > > > > > > +	/* Free all the OWN_AG blocks that are not in the rmapbt/agfl. */
> > > > > > > > > > > +	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_AG);
> > > > > > > > > > > +	return xrep_reap_extents(sc, old_allocbt_blocks, &oinfo,
> > > > > > > > > > > +			XFS_AG_RESV_NONE);
> > > > > > > > > > > +}
> > > > > > > > > > > +
> > > > > > > > > > ...
> > > > > > > > > > > diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c
> > > > > > > > > > > index 0ed68379e551..82f99633a597 100644
> > > > > > > > > > > --- a/fs/xfs/xfs_extent_busy.c
> > > > > > > > > > > +++ b/fs/xfs/xfs_extent_busy.c
> > > > > > > > > > > @@ -657,3 +657,17 @@ xfs_extent_busy_ag_cmp(
> > > > > > > > > > >  		diff = b1->bno - b2->bno;
> > > > > > > > > > >  	return diff;
> > > > > > > > > > >  }
> > > > > > > > > > > +
> > > > > > > > > > > +/* Are there any busy extents in this AG? */
> > > > > > > > > > > +bool
> > > > > > > > > > > +xfs_extent_busy_list_empty(
> > > > > > > > > > > +	struct xfs_perag	*pag)
> > > > > > > > > > > +{
> > > > > > > > > > > +	spin_lock(&pag->pagb_lock);
> > > > > > > > > > > +	if (pag->pagb_tree.rb_node) {
> > > > > > > > > > 
> > > > > > > > > > RB_EMPTY_ROOT()?
> > > > > > > > > 
> > > > > > > > > Good suggestion, thank you!
> > > > > > > > > 
> > > > > > > > > --D
> > > > > > > > > 
> > > > > > > > > > Brian
> > > > > > > > > > 
> > > > > > > > > > > +		spin_unlock(&pag->pagb_lock);
> > > > > > > > > > > +		return false;
> > > > > > > > > > > +	}
> > > > > > > > > > > +	spin_unlock(&pag->pagb_lock);
> > > > > > > > > > > +	return true;
> > > > > > > > > > > +}
> > > > > > > > > > > diff --git a/fs/xfs/xfs_extent_busy.h b/fs/xfs/xfs_extent_busy.h
> > > > > > > > > > > index 990ab3891971..2f8c73c712c6 100644
> > > > > > > > > > > --- a/fs/xfs/xfs_extent_busy.h
> > > > > > > > > > > +++ b/fs/xfs/xfs_extent_busy.h
> > > > > > > > > > > @@ -65,4 +65,6 @@ static inline void xfs_extent_busy_sort(struct list_head *list)
> > > > > > > > > > >  	list_sort(NULL, list, xfs_extent_busy_ag_cmp);
> > > > > > > > > > >  }
> > > > > > > > > > >  
> > > > > > > > > > > +bool xfs_extent_busy_list_empty(struct xfs_perag *pag);
> > > > > > > > > > > +
> > > > > > > > > > >  #endif /* __XFS_EXTENT_BUSY_H__ */
> > > > > > > > > > > 
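The RB_EMPTY_ROOT() suggestion made in the review of xfs_extent_busy_list_empty() above lets the function collapse to a single locked test. The following is a userspace mock of that shape, not the patch itself: struct rb_node, struct rb_root and RB_EMPTY_ROOT are minimal stand-ins for the kernel definitions, a pthread mutex stands in for the pagb_lock spinlock, and struct sketch_perag and sketch_busy_list_empty() are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal stand-ins for the kernel rbtree types (assumption). */
struct rb_node {
	struct rb_node *rb_left;
	struct rb_node *rb_right;
};

struct rb_root {
	struct rb_node *rb_node;
};

#define RB_EMPTY_ROOT(root)	((root)->rb_node == NULL)

/* Hypothetical per-AG structure holding the busy extent tree. */
struct sketch_perag {
	pthread_mutex_t	busy_lock;	/* stands in for pagb_lock */
	struct rb_root	busy_tree;	/* stands in for pagb_tree */
};

/* Are there any busy extents in this AG? */
bool sketch_busy_list_empty(struct sketch_perag *pag)
{
	bool empty;

	pthread_mutex_lock(&pag->busy_lock);
	empty = RB_EMPTY_ROOT(&pag->busy_tree);
	pthread_mutex_unlock(&pag->busy_lock);
	return empty;
}
```

Compared with the version in the patch, there is one unlock path instead of two, and the emptiness test names its intent instead of open-coding the rb_node check.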
> > > > > > > > > > > --
> > > > > > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > > > > > > > > the body of a message to majordomo@vger.kernel.org
> > > > > > > > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 03/14] xfs: repair the AGFL
  2018-08-08 12:09           ` Brian Foster
@ 2018-08-08 21:26             ` Darrick J. Wong
  2018-08-09 11:14               ` Brian Foster
  0 siblings, 1 reply; 53+ messages in thread
From: Darrick J. Wong @ 2018-08-08 21:26 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, david, allison.henderson

On Wed, Aug 08, 2018 at 08:09:39AM -0400, Brian Foster wrote:
> On Tue, Aug 07, 2018 at 03:02:24PM -0700, Darrick J. Wong wrote:
> > On Tue, Jul 31, 2018 at 11:10:00AM -0400, Brian Foster wrote:
> > > On Mon, Jul 30, 2018 at 10:22:16AM -0700, Darrick J. Wong wrote:
> > > > On Mon, Jul 30, 2018 at 12:25:24PM -0400, Brian Foster wrote:
> > > > > On Sun, Jul 29, 2018 at 10:48:08PM -0700, Darrick J. Wong wrote:
> > > > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > > 
> > > > > > Repair the AGFL from the rmap data.
> > > > > > 
> > > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > > ---
> > > > > 
> > > > > FWIW, I tried tweaking a couple agfl values via xfs_db and xfs_scrub
> > > > > seems to always dump a cross-referencing failed error and not want to
> > > > > deal with it. Expected? Is there a good way to unit test some of this
> > > > > stuff with simple/localized corruptions?
> > > > 
> > > > I usually pick one of the corruptions from xfs/355...
> > > > 
> > > > $ SCRATCH_XFS_LIST_FUZZ_VERBS=random \
> > > > SCRATCH_XFS_LIST_METADATA_FIELDS=somefield \
> > > > ./check xfs/355
> > > > 
> > > 
> > > It looks like similar behavior if I do that, but tbh I'm not sure if I'm
> > > using this correctly. E.g., if I do:
> > 
> > <urk> Sorry, I forgot to reply to this...
> > 
> > > # SCRATCH_XFS_LIST_FUZZ_VERBS=random SCRATCH_XFS_LIST_METADATA_FIELDS=bno[0] ./check xfs/355
> > > FSTYP         -- xfs (debug)
> > > PLATFORM      -- Linux/x86_64 localhost 4.18.0-rc4+
> > > MKFS_OPTIONS  -- -f -mrmapbt=1,reflink=1 /dev/mapper/test-scratch
> > > MOUNT_OPTIONS -- -o context=system_u:object_r:root_t:s0 /dev/mapper/test-scratch /mnt/scratch
> > > 
> > > xfs/355 - output mismatch (see /root/xfstests-dev/results//xfs/355.out.bad)
> > >     ...
> > >     (Run 'diff -u tests/xfs/355.out /root/xfstests-dev/results//xfs/355.out.bad'  to see the entire diff)
> > > Ran: xfs/355
> > > Failures: xfs/355
> > > Failed 1 of 1 tests
> > > # diff -u tests/xfs/355.out /root/xfstests-dev/results//xfs/355.out.bad
> > > --- tests/xfs/355.out   2018-07-25 07:47:23.739575416 -0400
> > > +++ /root/xfstests-dev/results//xfs/355.out.bad 2018-07-31 10:55:18.466178944 -0400
> > > @@ -1,6 +1,10 @@
> > >  QA output created by 355
> > >  Format and populate
> > >  Fuzz AGFL
> > > +online re-scrub (1) with bno[0] = random.
> > >  Done fuzzing AGFL
> > >  Fuzz AGFL flfirst
> > > +offline re-scrub (1) with bno[14] = random.
> > > +online re-scrub (1) with bno[14] = random.
> > > +re-repair failed (1) with bno[14] = random.
> > >  Done fuzzing AGFL flfirst
> > > 
> > > If I run xfs_scrub directly on the scratch mount after the test I get a
> > > stream of inode cross-referencing errors and it doesn't seem to fix
> > > anything up.
> > 
> > Hmm.  What is your xfsprogs head?  I think Eric committed the patches to
> > xfs_scrub to enable repairs in v4.18.0-rc1... which git says happened on
> > 8/1.
> > 
> 
> I think it was just for-next. Regardless, I was really just looking for
> a way to trigger a specific repair cycle and got around it once I
> discovered the XFS_ERRTAG_FORCE_SCRUB_REPAIR tag. I did have to stick
> the repair flag in the xfs_io scrub calls as well to trigger it that
> way, IIRC.
>
> Any thoughts on allowing that, perhaps with an extra scrub command flag
> (and/or in experimental mode)?

I'm a little confused by what you meant by having to "stick in the
repair flag"-- did you mean XFS_SCRUB_IFLAG_REPAIR?  Repair gets its own
xfs_io command (only in -x mode) "repair"; which should be in commit
bec810e8b483 ("xfs_io: wire up repair ioctl stuff").

Or did you mean you had to stick in the errortag to force a repair?
That was added to the 'inject' command in 52818844f1 ("xfs: implement
the metadata repair ioctl flag").

Either way...

# xfs_io -x -c 'inject force_repair' -c 'repair agfl 0' /mnt

...should do the trick.

--D

> 
> Brian
> 
> > --D
> > 
> > > 
> > > Brian
> > > 
> > > > > Otherwise this looks sane, a couple comments..
> > > > > 
> > > > > >  fs/xfs/scrub/agheader_repair.c |  276 ++++++++++++++++++++++++++++++++++++++++
> > > > > >  fs/xfs/scrub/bitmap.c          |   92 +++++++++++++
> > > > > >  fs/xfs/scrub/bitmap.h          |    4 +
> > > > > >  fs/xfs/scrub/scrub.c           |    2 
> > > > > >  4 files changed, 373 insertions(+), 1 deletion(-)
> > > > > > 
> > > > > > 
> > > > > > diff --git a/fs/xfs/scrub/agheader_repair.c b/fs/xfs/scrub/agheader_repair.c
> > > > > > index 4842fc598c9b..bfef066c87c3 100644
> > > > > > --- a/fs/xfs/scrub/agheader_repair.c
> > > > > > +++ b/fs/xfs/scrub/agheader_repair.c
> > > > > > @@ -424,3 +424,279 @@ xrep_agf(
> > > > > >  	memcpy(agf, &old_agf, sizeof(old_agf));
> > > > > >  	return error;
> > > > > >  }
> > > > > > +
> > > > > ...
> > > > > > +/* Write out a totally new AGFL. */
> > > > > > +STATIC void
> > > > > > +xrep_agfl_init_header(
> > > > > > +	struct xfs_scrub	*sc,
> > > > > > +	struct xfs_buf		*agfl_bp,
> > > > > > +	struct xfs_bitmap	*agfl_extents,
> > > > > > +	xfs_agblock_t		flcount)
> > > > > > +{
> > > > > > +	struct xfs_mount	*mp = sc->mp;
> > > > > > +	__be32			*agfl_bno;
> > > > > > +	struct xfs_bitmap_range	*br;
> > > > > > +	struct xfs_bitmap_range	*n;
> > > > > > +	struct xfs_agfl		*agfl;
> > > > > > +	xfs_agblock_t		agbno;
> > > > > > +	unsigned int		fl_off;
> > > > > > +
> > > > > > +	ASSERT(flcount <= xfs_agfl_size(mp));
> > > > > > +
> > > > > > +	/* Start rewriting the header. */
> > > > > > +	agfl = XFS_BUF_TO_AGFL(agfl_bp);
> > > > > > +	memset(agfl, 0xFF, BBTOB(agfl_bp->b_length));
> > > > > 
> > > > > What's the purpose behind 0xFF? Related to NULLAGBLOCK/NULLCOMMITLSN..?
> > > > 
> > > > Yes, it prepopulates the AGFL bno[] array with NULLAGBLOCK, then writes
> > > > in the header fields.
> > > > 
> > > > > > +	agfl->agfl_magicnum = cpu_to_be32(XFS_AGFL_MAGIC);
> > > > > > +	agfl->agfl_seqno = cpu_to_be32(sc->sa.agno);
> > > > > > +	uuid_copy(&agfl->agfl_uuid, &mp->m_sb.sb_meta_uuid);
> > > > > > +
> > > > > > +	/*
> > > > > > +	 * Fill the AGFL with the remaining blocks.  If agfl_extents has more
> > > > > > +	 * blocks than fit in the AGFL, they will be freed in a subsequent
> > > > > > +	 * step.
> > > > > > +	 */
> > > > > > +	fl_off = 0;
> > > > > > +	agfl_bno = XFS_BUF_TO_AGFL_BNO(mp, agfl_bp);
> > > > > > +	for_each_xfs_bitmap_extent(br, n, agfl_extents) {
> > > > > > +		agbno = XFS_FSB_TO_AGBNO(mp, br->start);
> > > > > > +
> > > > > > +		trace_xrep_agfl_insert(mp, sc->sa.agno, agbno, br->len);
> > > > > > +
> > > > > > +		while (br->len > 0 && fl_off < flcount) {
> > > > > > +			agfl_bno[fl_off] = cpu_to_be32(agbno);
> > > > > > +			fl_off++;
> > > > > > +			agbno++;
> > > > > 
> > > > > 			/* bump br so we don't reap blocks we've used */
> > > > > 
> > > > > (i.e., took me a sec to realize why we bother with ->start)
> > > > > 
> > > > > > +			br->start++;
> > > > > > +			br->len--;
> > > > > > +		}
> > > > > > +
> > > > > > +		if (br->len)
> > > > > > +			break;
> > > > > > +		list_del(&br->list);
> > > > > > +		kmem_free(br);
> > > > > > +	}
> > > > > > +
> > > > > > +	/* Write new AGFL to disk. */
> > > > > > +	xfs_trans_buf_set_type(sc->tp, agfl_bp, XFS_BLFT_AGFL_BUF);
> > > > > > +	xfs_trans_log_buf(sc->tp, agfl_bp, 0, BBTOB(agfl_bp->b_length) - 1);
> > > > > > +}
> > > > > > +
> > > > > ...
> > > > > > diff --git a/fs/xfs/scrub/bitmap.c b/fs/xfs/scrub/bitmap.c
> > > > > > index c770e2d0b6aa..fdadc9e1dc49 100644
> > > > > > --- a/fs/xfs/scrub/bitmap.c
> > > > > > +++ b/fs/xfs/scrub/bitmap.c
> > > > > > @@ -9,6 +9,7 @@
> > > > > >  #include "xfs_format.h"
> > > > > >  #include "xfs_trans_resv.h"
> > > > > >  #include "xfs_mount.h"
> > > > > > +#include "xfs_btree.h"
> > > > > >  #include "scrub/xfs_scrub.h"
> > > > > >  #include "scrub/scrub.h"
> > > > > >  #include "scrub/common.h"
> > > > > > @@ -209,3 +210,94 @@ xfs_bitmap_disunion(
> > > > > >  }
> > > > > >  #undef LEFT_ALIGNED
> > > > > >  #undef RIGHT_ALIGNED
> > > > > > +
> > > > > > +/*
> > > > > > + * Record all btree blocks seen while iterating all records of a btree.
> > > > > > + *
> > > > > > + * We know that the btree query_all function starts at the left edge and walks
> > > > > > + * towards the right edge of the tree.  Therefore, we know that we can walk up
> > > > > > + * the btree cursor towards the root; if the pointer for a given level points
> > > > > > + * to the first record/key in that block, we haven't seen this block before;
> > > > > > + * and therefore we need to remember that we saw this block in the btree.
> > > > > > + *
> > > > > > + * So if our btree is:
> > > > > > + *
> > > > > > + *    4
> > > > > > + *  / | \
> > > > > > + * 1  2  3
> > > > > > + *
> > > > > > + * Pretend for this example that each leaf block has 100 btree records.  For
> > > > > > + * the first btree record, we'll observe that bc_ptrs[0] == 1, so we record
> > > > > > + * that we saw block 1.  Then we observe that bc_ptrs[1] == 1, so we record
> > > > > > + * block 4.  The list is [1, 4].
> > > > > > + *
> > > > > > + * For the second btree record, we see that bc_ptrs[0] == 2, so we exit the
> > > > > > + * loop.  The list remains [1, 4].
> > > > > > + *
> > > > > > + * For the 101st btree record, we've moved onto leaf block 2.  Now
> > > > > > + * bc_ptrs[0] == 1 again, so we record that we saw block 2.  We see that
> > > > > > + * bc_ptrs[1] == 2, so we exit the loop.  The list is now [1, 4, 2].
> > > > > > + *
> > > > > > + * For the 102nd record, bc_ptrs[0] == 2, so we continue.
> > > > > > + *
> > > > > > + * For the 201st record, we've moved on to leaf block 3.  bc_ptrs[0] == 1, so
> > > > > > + * we add 3 to the list.  Now it is [1, 4, 2, 3].
> > > > > > + *
> > > > > > + * For the 300th record we just exit, with the list being [1, 4, 2, 3].
> > > > > > + */
> > > > > > +
> > > > > > +/*
> > > > > > + * Record all the buffers pointed to by the btree cursor.  Callers already
> > > > > > + * engaged in a btree walk should call this function to capture the list of
> > > > > > + * blocks going from the leaf towards the root.
> > > > > > + */
> > > > > > +int
> > > > > > +xfs_bitmap_set_btcur_path(
> > > > > > +	struct xfs_bitmap	*bitmap,
> > > > > > +	struct xfs_btree_cur	*cur)
> > > > > > +{
> > > > > > +	struct xfs_buf		*bp;
> > > > > > +	xfs_fsblock_t		fsb;
> > > > > > +	int			i;
> > > > > > +	int			error;
> > > > > > +
> > > > > > +	for (i = 0; i < cur->bc_nlevels && cur->bc_ptrs[i] == 1; i++) {
> > > > > > +		xfs_btree_get_block(cur, i, &bp);
> > > > > > +		if (!bp)
> > > > > > +			continue;
> > > > > > +		fsb = XFS_DADDR_TO_FSB(cur->bc_mp, bp->b_bn);
> > > > > > +		error = xfs_bitmap_set(bitmap, fsb, 1);
> > > > > 
> > > > > Thanks for the comment. It helps explain the bc_ptrs == 1 check above,
> > > > > but also highlights that xfs_bitmap_set() essentially allocates entries
> > > > > for duplicate values if they exist. Is this handled by the broader
> > > > > mechanism, for example, if the rmapbt was corrupted to have multiple
> > > > > entries for a particular unused OWN_AG block? Or could we end up leaking
> > > > > that corruption over to the agfl?
> > > > 
> > > > Right now we're totally dependent on the rmapbt being sane to rebuild
> > > > the space metadata.
> > > > 
> > > > > I also wonder a bit about memory consumption on filesystems with large
> > > > > metadata footprints. We essentially have to allocate one of these for
> > > > > every allocation btree block before we can do the disunion and locate
> > > > > the agfl-appropriate blocks. If we had a more lookup friendly structure,
> > > > > perhaps this could be optimized by filtering out bnobt/cntbt blocks
> > > > > during the associated btree walks..?
> > > > > 
> > > > > Have you thought about reusing something like the new in-core extent
> > > > > tree mechanism as a pure in-memory extent store? It's certainly not
> > > > > worth reworking something like that right now, but I wonder if we could
> > > > > save memory via the denser format (and perhaps benefit from code
> > > > > flexibility, reuse, etc.).
> > > > 
> > > > Yes, I was thinking about refactoring the iext btree into a more generic
> > > > in-core index with 64-bit key so that I could adapt xfs_bitmap to use
> > > > it.  In the longer term I would /also/ like to use xfs_bitmap to detect
> > > > xfs_buf cache aliasing when multi-block buffers are in use, but that's
> > > > further off. :)
> > > > 
> > > > As for the memory-intensive record lists in all the btree rebuilders, I
> > > > have some ideas around that too -- either find a way to build an
> > > > alternate btree and switch the roots over, or (once we gain the ability
> > > > to mark an AG unavailable for new allocations) allocate an unlinked
> > > > inode, store the records in the page cache pages for the file, and
> > > > release it when we're done.
> > > > 
> > > > But, that can wait until I've gotten more of this merged, or get bored.
> > > > :)
> > > > 
> > > > --D
> > > > 
> > > > > Brian
> > > > > 
> > > > > > +		if (error)
> > > > > > +			return error;
> > > > > > +	}
> > > > > > +
> > > > > > +	return 0;
> > > > > > +}
> > > > > > +
> > > > > > +/* Collect a btree's block in the bitmap. */
> > > > > > +STATIC int
> > > > > > +xfs_bitmap_collect_btblock(
> > > > > > +	struct xfs_btree_cur	*cur,
> > > > > > +	int			level,
> > > > > > +	void			*priv)
> > > > > > +{
> > > > > > +	struct xfs_bitmap	*bitmap = priv;
> > > > > > +	struct xfs_buf		*bp;
> > > > > > +	xfs_fsblock_t		fsbno;
> > > > > > +
> > > > > > +	xfs_btree_get_block(cur, level, &bp);
> > > > > > +	if (!bp)
> > > > > > +		return 0;
> > > > > > +
> > > > > > +	fsbno = XFS_DADDR_TO_FSB(cur->bc_mp, bp->b_bn);
> > > > > > +	return xfs_bitmap_set(bitmap, fsbno, 1);
> > > > > > +}
> > > > > > +
> > > > > > +/* Walk the btree and mark the bitmap wherever a btree block is found. */
> > > > > > +int
> > > > > > +xfs_bitmap_set_btblocks(
> > > > > > +	struct xfs_bitmap	*bitmap,
> > > > > > +	struct xfs_btree_cur	*cur)
> > > > > > +{
> > > > > > +	return xfs_btree_visit_blocks(cur, xfs_bitmap_collect_btblock, bitmap);
> > > > > > +}
> > > > > > diff --git a/fs/xfs/scrub/bitmap.h b/fs/xfs/scrub/bitmap.h
> > > > > > index dad652ee9177..ae8ecbce6fa6 100644
> > > > > > --- a/fs/xfs/scrub/bitmap.h
> > > > > > +++ b/fs/xfs/scrub/bitmap.h
> > > > > > @@ -28,5 +28,9 @@ void xfs_bitmap_destroy(struct xfs_bitmap *bitmap);
> > > > > >  
> > > > > >  int xfs_bitmap_set(struct xfs_bitmap *bitmap, uint64_t start, uint64_t len);
> > > > > >  int xfs_bitmap_disunion(struct xfs_bitmap *bitmap, struct xfs_bitmap *sub);
> > > > > > +int xfs_bitmap_set_btcur_path(struct xfs_bitmap *bitmap,
> > > > > > +		struct xfs_btree_cur *cur);
> > > > > > +int xfs_bitmap_set_btblocks(struct xfs_bitmap *bitmap,
> > > > > > +		struct xfs_btree_cur *cur);
> > > > > >  
> > > > > >  #endif	/* __XFS_SCRUB_BITMAP_H__ */
> > > > > > diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
> > > > > > index 1e8a17c8e2b9..2670f4cf62f4 100644
> > > > > > --- a/fs/xfs/scrub/scrub.c
> > > > > > +++ b/fs/xfs/scrub/scrub.c
> > > > > > @@ -220,7 +220,7 @@ static const struct xchk_meta_ops meta_scrub_ops[] = {
> > > > > >  		.type	= ST_PERAG,
> > > > > >  		.setup	= xchk_setup_fs,
> > > > > >  		.scrub	= xchk_agfl,
> > > > > > -		.repair	= xrep_notsupported,
> > > > > > +		.repair	= xrep_agfl,
> > > > > >  	},
> > > > > >  	[XFS_SCRUB_TYPE_AGI] = {	/* agi */
> > > > > >  		.type	= ST_PERAG,
> > > > > > 
> > > > > > --
> > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > > > the body of a message to majordomo@vger.kernel.org
> > > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 05/14] xfs: repair free space btrees
  2018-08-08 12:29                       ` Brian Foster
@ 2018-08-08 22:42                         ` Darrick J. Wong
  2018-08-09 12:00                           ` Brian Foster
  0 siblings, 1 reply; 53+ messages in thread
From: Darrick J. Wong @ 2018-08-08 22:42 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, david, allison.henderson

On Wed, Aug 08, 2018 at 08:29:54AM -0400, Brian Foster wrote:
> On Tue, Aug 07, 2018 at 04:34:58PM -0700, Darrick J. Wong wrote:
> > On Fri, Aug 03, 2018 at 06:49:40AM -0400, Brian Foster wrote:
> > > On Thu, Aug 02, 2018 at 12:22:05PM -0700, Darrick J. Wong wrote:
> > > > On Thu, Aug 02, 2018 at 09:48:24AM -0400, Brian Foster wrote:
> > > > > On Wed, Aug 01, 2018 at 11:28:45PM -0700, Darrick J. Wong wrote:
> > > > > > On Wed, Aug 01, 2018 at 02:39:20PM -0400, Brian Foster wrote:
> > > > > > > On Wed, Aug 01, 2018 at 09:23:16AM -0700, Darrick J. Wong wrote:
> > > > > > > > On Wed, Aug 01, 2018 at 07:54:09AM -0400, Brian Foster wrote:
> > > > > > > > > On Tue, Jul 31, 2018 at 03:01:25PM -0700, Darrick J. Wong wrote:
> > > > > > > > > > On Tue, Jul 31, 2018 at 01:47:23PM -0400, Brian Foster wrote:
> > > > > > > > > > > On Sun, Jul 29, 2018 at 10:48:21PM -0700, Darrick J. Wong wrote:
> > > > > > > > > > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > > > > > > > > 
> > > > > > > > > > > > Rebuild the free space btrees from the gaps in the rmap btree.
> > > > > > > > > > > > 
> > > > > > > > > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > > > > > > > > ---
> > > > > > > > > > > >  fs/xfs/Makefile             |    1 
> > > > > > > > > > > >  fs/xfs/scrub/alloc.c        |    1 
> > > > > > > > > > > >  fs/xfs/scrub/alloc_repair.c |  581 +++++++++++++++++++++++++++++++++++++++++++
> > > > > > > > > > > >  fs/xfs/scrub/common.c       |    8 +
> > > > > > > > > > > >  fs/xfs/scrub/repair.h       |    2 
> > > > > > > > > > > >  fs/xfs/scrub/scrub.c        |    4 
> > > > > > > > > > > >  fs/xfs/scrub/trace.h        |    2 
> > > > > > > > > > > >  fs/xfs/xfs_extent_busy.c    |   14 +
> > > > > > > > > > > >  fs/xfs/xfs_extent_busy.h    |    2 
> > > > > > > > > > > >  9 files changed, 610 insertions(+), 5 deletions(-)
> > > > > > > > > > > >  create mode 100644 fs/xfs/scrub/alloc_repair.c
> > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > ...
> > > > > > > > > > > > diff --git a/fs/xfs/scrub/alloc_repair.c b/fs/xfs/scrub/alloc_repair.c
> > > > > > > > > > > > new file mode 100644
> > > > > > > > > > > > index 000000000000..b228c2906de2
> > > > > > > > > > > > --- /dev/null
> > > > > > > > > > > > +++ b/fs/xfs/scrub/alloc_repair.c
> > > > > > > > > > > > @@ -0,0 +1,581 @@
> > > > > > > ...
> > > > > > > > > > > > +
> > > > > > > > > > > > +/*
> > > > > > > > > > > > + * Make our new freespace btree roots permanent so that we can start freeing
> > > > > > > > > > > > + * unused space back into the AG.
> > > > > > > > > > > > + */
> > > > > > > > > > > > +STATIC int
> > > > > > > > > > > > +xrep_abt_commit_new(
> > > > > > > > > > > > +	struct xfs_scrub	*sc,
> > > > > > > > > > > > +	struct xfs_bitmap	*old_allocbt_blocks,
> > > > > > > > > > > > +	int			log_flags)
> > > > > > > > > > > > +{
> > > > > > > > > > > > +	int			error;
> > > > > > > > > > > > +
> > > > > > > > > > > > +	xfs_alloc_log_agf(sc->tp, sc->sa.agf_bp, log_flags);
> > > > > > > > > > > > +
> > > > > > > > > > > > +	/* Invalidate the old freespace btree blocks and commit. */
> > > > > > > > > > > > +	error = xrep_invalidate_blocks(sc, old_allocbt_blocks);
> > > > > > > > > > > > +	if (error)
> > > > > > > > > > > > +		return error;
> > > > > > > > > > > 
> > > > > > > > > > > It looks like the above invalidation all happens in the same
> > > > > > > > > > > transaction. Those aren't logging buffer data or anything, but any idea
> > > > > > > > > > > how many log formats we can get away with in this single transaction?
> > > > > > > > > > 
> > > > > > > > > > Hm... well, on my computer a log format is ~88 bytes.  Assuming 4K
> > > > > > > > > > blocks, the max AG size of 1TB, maximum free space fragmentation, and
> > > > > > > > > > two btrees, the tree could be up to ~270 million records.  Assuming ~505
> > > > > > > > > > records per block, that's ... ~531,000 leaf blocks and ~1100 node blocks
> > > > > > > > > > for both btrees.  If we invalidate both, that's ~46M of RAM?
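The back-of-the-envelope estimate above can be written out as a short calculation. This is a sketch, not the kernel's own accounting: the struct and function names are hypothetical, "maximum fragmentation" is taken to mean every other block is free, and node blocks are assumed to have the same fanout as leaf blocks, so the node count comes out slightly lower than the ~1100 quoted.

```c
#include <stdint.h>

/* Hypothetical container for the worst-case estimate. */
struct abt_estimate {
	uint64_t records;	/* free space records across both btrees */
	uint64_t leaf_blocks;
	uint64_t node_blocks;
	uint64_t log_bytes;	/* memory for one log format per block */
};

static uint64_t div_round_up(uint64_t n, uint64_t d)
{
	return (n + d - 1) / d;
}

struct abt_estimate
estimate_invalidation(uint64_t ag_bytes, uint64_t block_bytes,
		      uint64_t recs_per_block, uint64_t log_format_bytes)
{
	struct abt_estimate e;
	uint64_t ag_blocks = ag_bytes / block_bytes;

	/*
	 * Assumption: every other block is free, so each btree holds
	 * ag_blocks / 2 records, and there are two btrees (bnobt, cntbt).
	 */
	e.records = (ag_blocks / 2) * 2;
	e.leaf_blocks = div_round_up(e.records, recs_per_block);
	/* Assumption: one level of node blocks with the leaf fanout. */
	e.node_blocks = div_round_up(e.leaf_blocks, recs_per_block);
	e.log_bytes = (e.leaf_blocks + e.node_blocks) * log_format_bytes;
	return e;
}
```

With a 1TB AG (1ULL << 40 bytes), 4096-byte blocks, 505 records per block and 88 bytes per log format, this gives roughly 268 million records, about 531,000 leaf blocks, and just under 47 million bytes, which is in line with the figures in the message.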
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > I was thinking more about transaction reservation than RAM. It may not
> > > > > > > > 
> > > > > > > > Hmm.  tr_itruncate is ~650K on my 2TB SSD, assuming 88 bytes per, that's
> > > > > > > > about ... ~7300 log format items?  Not a lot, maybe it should roll the
> > > > > > > > transaction every 1000 invalidations or so...
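The "roll every N invalidations" idea can be sketched in userspace as follows. Everything here is a stand-in: struct fake_tx, fake_invalidate() and fake_roll() are hypothetical and only count items, they are not the xrep_* or xfs_trans_* interfaces, and the 650K reservation is read as 650,000 bytes for the division.

```c
#include <stddef.h>

/* Hypothetical transaction stand-in: counts logged items and rolls. */
struct fake_tx {
	size_t	items;	/* log format items in the current transaction */
	size_t	rolls;	/* number of times the transaction was rolled */
};

/* How many fixed-size log formats fit in a reservation of this size? */
size_t max_log_formats(size_t reservation_bytes, size_t log_format_bytes)
{
	return reservation_bytes / log_format_bytes;
}

static void fake_invalidate(struct fake_tx *tx)
{
	tx->items++;
}

static void fake_roll(struct fake_tx *tx)
{
	tx->items = 0;
	tx->rolls++;
}

/*
 * Invalidate nblocks blocks, rolling the transaction after every 'batch'
 * invalidations.  Returns the largest number of items any single
 * transaction had to carry, which must stay below the reservation limit.
 */
size_t invalidate_in_batches(struct fake_tx *tx, size_t nblocks, size_t batch)
{
	size_t max_items = 0;

	for (size_t i = 0; i < nblocks; i++) {
		fake_invalidate(tx);
		if (tx->items > max_items)
			max_items = tx->items;
		if (tx->items == batch)
			fake_roll(tx);
	}
	return max_items;
}
```

With a batch of 1000, no transaction ever carries more than 1000 items, comfortably below the roughly 7300 that max_log_formats() suggests for a reservation of that size.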
> > > > > > > > 
> > > > > > > 
> > > > > > > I'm not really sure what categorizes as a lot here given that the blocks
> > > > > > > would need to be in-core, but rolling on some fixed/safe interval sounds
> > > > > > > reasonable to me.
> > > > > > > 
> > > > > > > > > currently be an issue, but it might be worth putting something down in a
> > > > > > > > > comment to note that this is a single transaction and we expect to not
> > > > > > > > > have to invalidate more than N (ballpark) blocks in a single go,
> > > > > > > > > whatever that value happens to be.
> > > > > > > > > 
> > > > > > > > > > > > +	error = xrep_roll_ag_trans(sc);
> > > > > > > > > > > > +	if (error)
> > > > > > > > > > > > +		return error;
> > > > > > > > > > > > +
> > > > > > > > > > > > +	/* Now that we've succeeded, mark the incore state valid again. */
> > > > > > > > > > > > +	sc->sa.pag->pagf_init = 1;
> > > > > > > > > > > > +	return 0;
> > > > > > > > > > > > +}
> > > > > > > > > > > > +
> > > > > > > > > > > > +/* Build new free space btrees and dispose of the old one. */
> > > > > > > > > > > > +STATIC int
> > > > > > > > > > > > +xrep_abt_rebuild_trees(
> > > > > > > > > > > > +	struct xfs_scrub	*sc,
> > > > > > > > > > > > +	struct list_head	*free_extents,
> > > > > > > > > > > > +	struct xfs_bitmap	*old_allocbt_blocks)
> > > > > > > > > > > > +{
> > > > > > > > > > > > +	struct xfs_owner_info	oinfo;
> > > > > > > > > > > > +	struct xrep_abt_extent	*rae;
> > > > > > > > > > > > +	struct xrep_abt_extent	*n;
> > > > > > > > > > > > +	struct xrep_abt_extent	*longest;
> > > > > > > > > > > > +	int			error;
> > > > > > > > > > > > +
> > > > > > > > > > > > +	xfs_rmap_skip_owner_update(&oinfo);
> > > > > > > > > > > > +
> > > > > > > > > > > > +	/*
> > > > > > > > > > > > +	 * Insert the longest free extent in case it's necessary to
> > > > > > > > > > > > +	 * refresh the AGFL with multiple blocks.  If there is no longest
> > > > > > > > > > > > +	 * extent, we had exactly the free space we needed; we're done.
> > > > > > > > > > > > +	 */
> > > > > > > > > > > 
> > > > > > > > > > > I'm confused by the last sentence. longest should only be NULL if the
> > > > > > > > > > > free space list is empty and haven't we already bailed out with -ENOSPC
> > > > > > > > > > > if that's the case?
> > > > > > > > > > > 
> > > > > > > > > > > > +	longest = xrep_abt_get_longest(free_extents);
> > > > > > > > > > 
> > > > > > > > > > xrep_abt_rebuild_trees is called after we allocate and initialize two
> > > > > > > > > > new btree roots in xrep_abt_reset_btrees.  If free_extents is an empty
> > > > > > > > > > list here, then we found exactly two blocks worth of free space and used
> > > > > > > > > > them to set up new btree roots.
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Got it, thanks.
> > > > > > > > > 
> > > > > > > > > > > > +	if (!longest)
> > > > > > > > > > > > +		goto done;
> > > > > > > > > > > > +	error = xrep_abt_free_extent(sc,
> > > > > > > > > > > > +			XFS_AGB_TO_FSB(sc->mp, sc->sa.agno, longest->bno),
> > > > > > > > > > > > +			longest->len, &oinfo);
> > > > > > > > > > > > +	list_del(&longest->list);
> > > > > > > > > > > > +	kmem_free(longest);
> > > > > > > > > > > > +	if (error)
> > > > > > > > > > > > +		return error;
> > > > > > > > > > > > +
> > > > > > > > > > > > +	/* Insert records into the new btrees. */
> > > > > > > > > > > > +	list_for_each_entry_safe(rae, n, free_extents, list) {
> > > > > > > > > > > > +		error = xrep_abt_free_extent(sc,
> > > > > > > > > > > > +				XFS_AGB_TO_FSB(sc->mp, sc->sa.agno, rae->bno),
> > > > > > > > > > > > +				rae->len, &oinfo);
> > > > > > > > > > > > +		if (error)
> > > > > > > > > > > > +			return error;
> > > > > > > > > > > > +		list_del(&rae->list);
> > > > > > > > > > > > +		kmem_free(rae);
> > > > > > > > > > > > +	}
> > > > > > > > > > > 
> > > > > > > > > > > Ok, at this point we've reset the btree roots and we start freeing the
> > > > > > > > > > > free ranges that were discovered via the rmapbt analysis. AFAICT, if we
> > > > > > > > > > > fail or crash at this point, we leave the allocbts in a partially
> > > > > > > > > > > constructed state. I take it that is Ok with respect to the broader
> > > > > > > > > > > repair algorithm because we'd essentially start over by inspecting the
> > > > > > > > > > > rmapbt again on a retry.
> > > > > > > > > > 
> > > > > > > > > > Right.  Though in the crash/shutdown case, you'll end up with the
> > > > > > > > > > filesystem in an offline state at some point before you can retry the
> > > > > > > > > > scrub, it's probably faster to run xfs_repair to fix the damage.
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Can we really assume that if we're already up and running an online
> > > > > > > > > repair? The filesystem has to be mountable in that case in the first
> > > > > > > > > place. If we've already reset and started reconstructing the allocation
> > > > > > > > > btrees then I'd think those transactions would recover just fine on a
> > > > > > > > > power loss or something (perhaps not in the event of some other
> > > > > > > > > corruption related shutdown).
> > > > > > > > 
> > > > > > > > Right, for the system crash case, whatever transactions committed should
> > > > > > > > replay just fine, and you can even start up the online repair again, and
> > > > > > > > if the AG isn't particularly close to ENOSPC then (barring rmap
> > > > > > > > corruption) it should work just fine.
> > > > > > > > 
> > > > > > > > If the fs went down because either (a) repair hit other corruption or
> > > > > > > > (b) some other thread hit an error in some other part of the filesystem,
> > > > > > > > then it's not so clear -- in (b) you could probably try again, but for
> > > > > > > > (a) you'll definitely have to unmount and run xfs_repair.
> > > > > > > > 
> > > > > > > 
> > > > > > > Indeed, there are certainly cases where we simply won't be able to do an
> > > > > > > online repair. I'm trying to think about scenarios where we should be
> > > > > > > able to do an online repair, but we lose power or hit some kind of
> > > > > > > transient error like a memory allocation failure before it completes. It
> > > > > > > would be nice if the online repair itself didn't contribute (within
> > > > > > > reason) to the inability to simply try again just because the fs was
> > > > > > > close to -ENOSPC.
> > > > > > 
> > > > > > Agreed.  Most of the, uh, opportunities to hit ENOMEM happen before we
> > > > > > start modifying on-disk metadata.  If that happens, we just free all the
> > > > > > memory and bail out having done nothing.
> > > > > > 
> > > > > > > For one, I think it's potentially confusing behavior. Second, it might
> > > > > > > be concerning to regular users who perceive it as an online repair
> > > > > > > leaving the fs in a worse off state. Us fs devs know that may not really
> > > > > > > be the case, but I think we're better for addressing it if we can
> > > > > > > reasonably do so.
> > > > > > 
> > > > > > <nod> Further in the future I want to add the ability to offline an AG,
> > > > > > so the worst that happens is that scrub turns the AG off, repair doesn't
> > > > > > fix it, and the AG simply stays offline.  That might give us the
> > > > > > ability to survive cancelling the repair transaction, since if the AG's
> > > > > > offline already anyway we could just throw away the dirty buffers and
> > > > > > resurrect the AG later.  I don't know, that's purely speculative.
> > > > > > 
> > > > > > > > Perhaps the guideline here is that if the fs goes down more than once
> > > > > > > > during online repair then unmount it and run xfs_repair.
> > > > > > > > 
> > > > > > > 
> > > > > > > Yep, I think that makes sense if the filesystem or repair itself is
> > > > > > > tripping over other corruptions that fail to keep it active for the
> > > > > > > duration of the repair.
> > > > > > 
> > > > > > <nod>
> > > > > > 
> > > > > > > > > > > The blocks allocated for the btrees that we've begun to construct here
> > > > > > > > > > > end up mapped in the rmapbt as we go, right? IIUC, that means we don't
> > > > > > > > > > > necessarily have infinite retries to make sure this completes. IOW,
> > > > > > > > > > > suppose that a first repair attempt finds just enough free space to
> > > > > > > > > > > construct new trees, gets far enough along to consume most of that free
> > > > > > > > > > > space and then crashes. Is it possible that a subsequent repair attempt
> > > > > > > > > > > includes the btree blocks allocated during the previous failed repair
> > > > > > > > > > > attempt in the sum of "old btree blocks" and determines we don't have
> > > > > > > > > > > enough free space to repair?
> > > > > > > > > > 
> > > > > > > > > > Yes, that's a risk of running the free space repair.
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Can we improve on that? For example, are the rmapbt entries for the old
> > > > > > > > > allocation btree blocks necessary once we commit the btree resets? If
> > > > > > > > > not, could we remove those entries before we start tree reconstruction?
> > > > > > > > > 
> > > > > > > > > Alternatively, could we incorporate use of the old btree blocks? As it
> > > > > > > > > is, we discover those blocks simply so we can free them at the end.
> > > > > > > > > Perhaps we could free them sooner or find a more clever means to
> > > > > > > > > reallocate directly from that in-core list? I guess we have to consider
> > > > > > > > > whether they were really valid/sane btree blocks, but either way ISTM
> > > > > > > > > that the old blocks list is essentially invalidated once we reset the
> > > > > > > > > btrees.
> > > > > > > > 
> > > > > > > > Hmm, it's a little tricky to do that -- we could reap the old bnobt and
> > > > > > > > cntbt blocks (in the old_allocbt_blocks bitmap) first, but if adding a
> > > > > > > > record causes a btree split we'll pull blocks from the AGFL, and if
> > > > > > > > there aren't enough blocks in the bnobt to fill the AGFL back up then
> > > > > > > > fix_freelist won't succeed.  That complication is why it finds the
> > > > > > > > longest extent in the unclaimed list and pushes that in first, then
> > > > > > > > works on the rest of the extents.
> > > > > > > > 
> > > > > > > 
> > > > > > > Hmm, but doesn't a btree split require at least one full space btree
> > > > > > > block per-level? In conjunction, the agfl minimum size requirement grows
> > > > > > > with the height of the tree, which implies available free space..? I
> > > > > > > could be missing something, perhaps we have to account for the rmapbt in
> > > > > > > that case as well? Regardless...
> > > > > > > 
> > > > > > > > I suppose one could try to avoid ENOSPC by pushing that longest extent
> > > > > > > > in first (since we know that won't trigger a split), then reap the old
> > > > > > > > alloc btree blocks, and then add everything else back in...
> > > > > > > > 
> > > > > > > 
> > > > > > > I think it would be reasonable to seed the btree with the longest record
> > > > > > > or some fixed number of longest records (~1/2 a root block, for example)
> > > > > > > before making actual use of the btrees to reap the old blocks. I think
> > > > > > > then you'd only have a very short window of a single block leak on a
> > > > > > > poorly timed power loss and repair retry sequence before you start
> > > > > > > actually freeing originally used space (which in practice, I think
> > > > > > > solves the problem).
> > > > > > > 
> > > > > > > Given that we're starting from empty, I wonder if another option may be
> > > > > > > to over fill the agfl with old btree blocks or something. The first real
> > > > > > > free should shift enough blocks back into the btrees to ensure the agfl
> > > > > > > can be managed from that point forward, right? That may be more work
> > > > > > > than it's worth though and/or a job for another patch. (FWIW, we also
> > > > > > > have that NOSHRINK agfl fixup flag for userspace repair.)
> > > > > > 
> > > > > > Yes, I'll give that a try tomorrow, now that I've finished porting all
> > > > > > the 4.19 stuff to xfsprogs. :)
> > > > > > 
> > > > > > Looping back to something we discussed earlier in this thread, I'd
> > > > > > prefer to hold off on converting the list of already-freed extents to
> > > > > > xfs_bitmap because the same problem exists in all the repair functions
> > > > > > of having to store a large number of records for the rebuilt btree, and
> > > > > > maybe there's some way to <cough> use pageable memory for that, since
> > > > > > the access patterns for that are append, sort, and iterate; for those
> > > > > > three uses we don't necessarily require all the records to be in memory
> > > > > > all the time.  For the allocbt repair I expect the free space records to
> > > > > > be far more numerous than the list of old bnobt/cntbt blocks.
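The three access patterns named above (append, sort, iterate) can be illustrated with a plain growable array. This is a userspace sketch with hypothetical names (struct free_rec, struct rec_array); it keeps everything in memory and uses realloc() and qsort(), so it shows the access pattern only, not the pageable storage being discussed.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical free space record: start block and length. */
struct free_rec {
	uint32_t	bno;
	uint32_t	len;
};

struct rec_array {
	struct free_rec	*recs;
	size_t		nr;
	size_t		cap;
};

/* Append: records arrive in discovery order, not sorted order. */
int rec_array_append(struct rec_array *ra, uint32_t bno, uint32_t len)
{
	if (ra->nr == ra->cap) {
		size_t ncap = ra->cap ? ra->cap * 2 : 16;
		struct free_rec *n = realloc(ra->recs, ncap * sizeof(*n));

		if (!n)
			return -1;
		ra->recs = n;
		ra->cap = ncap;
	}
	ra->recs[ra->nr].bno = bno;
	ra->recs[ra->nr].len = len;
	ra->nr++;
	return 0;
}

static int cmp_bno(const void *a, const void *b)
{
	const struct free_rec *ra = a;
	const struct free_rec *rb = b;

	return (ra->bno > rb->bno) - (ra->bno < rb->bno);
}

/* Sort: order by start block before feeding records to the rebuild. */
void rec_array_sort(struct rec_array *ra)
{
	if (ra->nr > 1)
		qsort(ra->recs, ra->nr, sizeof(*ra->recs), cmp_bno);
}

/* Iterate: a single forward pass, here summing the free blocks. */
uint64_t rec_array_total_len(const struct rec_array *ra)
{
	uint64_t total = 0;

	for (size_t i = 0; i < ra->nr; i++)
		total += ra->recs[i].len;
	return total;
}

void rec_array_free(struct rec_array *ra)
{
	free(ra->recs);
	ra->recs = NULL;
	ra->nr = ra->cap = 0;
}
```

Since none of the three operations needs random access to old records while new ones are being appended, the same interface could in principle sit on top of storage that is not resident all the time.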
> > > > > > 
> > > > > 
> > > > > Ok, it's fair enough that we'll probably want to find some kind of
> > > > > generic, more efficient technique for handling this across the various
> > > > > applicable repair algorithms.
> > > > > 
> > > > > One other high level thing that crossed my mind with regard to the
> > > > > general btree reconstruction algorithms is whether we need to build up
> > > > > this kind of central record list at all. For example, rather than slurp
> > > > > up the entire list of btree records in-core, sort it and dump it back
> > > > > out, could we take advantage of the fact that our existing on-disk
> > > > > structure insertion mechanisms already handle out of order records
> > > > > (simply stated, an extent free knows how to insert the associated record
> > > > > at the right place in the space btrees)? For example, suppose we reset
> > > > > the existing btrees first, then scanned the rmapbt and repopulated the
> > > > > new btrees as records are discovered..?
> > > > 
> > > > I tried that in an earlier draft of the bnobt repair function.  The
> > > > biggest problem with inserting as we go is dealing with the inevitable
> > > > transaction rolls (right now we do after every record insertion to avoid
> > > > playing games with guessing how much reservation is left).  Btree
> > > > cursor state can't survive transaction rolls because the transaction
> > > > commit releases all the buffers that aren't bhold'en, and we can't bhold
> > > > that many buffers across a _defer_finish.
> > > > 
> > > 
> > > Ok, interesting.
> > > 
> > > Where do we need to run an xfs_defer_finish() during the reconstruction
> > > sequence, btw?
> > 
> > Not here, as I'm sure you were thinking. :)  For the AG btrees
> > themselves it's sufficient to roll the transaction.  I suppose we could
> > simply have a xfs_btree_bhold function that would bhold every buffer so
> > that a cursor could survive a roll.
> > 
> > Inode fork reconstruction is going to require _defer_finish, however.
> > 
> 
> Ok, just wasn't sure if I missed something in the bits I've looked
> through so far..
> 
> > > I thought that would only run on final commit as opposed to
> > > intermediate rolls.
> > 
> > We could let the deferred items sit around until final commit, but I
> > think I'd prefer to process them as soon as possible since iirc deferred
> > items pin the log until they're finished.  I would hope that userspace
> > isn't banging on the log while repair runs, but it's certainly possible.
> > 
> 
> I was just surmising in general, not necessarily suggesting we change
> behavior.

Oh, ok.  Sorry, I misinterpreted you. :)

> > > We could just try and make the automatic buffer relogging list a
> > > dynamic allocation if there are enough held buffers in the
> > > transaction.
> > 
> > Hmm.  Might be worth pursuing...
> > 
> > > > So, that early draft spent a lot of time tearing down and reconstructing
> > > > rmapbt cursors since the standard _btree_query_all isn't suited to that
> > > > kind of usage.  It was easily twice as slow on a RAM-backed disk just
> > > > from the rmap cursor overhead and much more complex, so I rewrote it to
> > > > be simpler.  I also have a slight preference for not touching anything
> > > > until we're absolutely sure we have all the data we need to repair the
> > > > structure.
> > > > 
> > > 
> > > Yes, I think that is sane in principle. I'm just a bit concerned about
> > > how reliable that xfs_repair-like approach will be in the kernel longer
> > > term, particularly once we start having to deal with large filesystems
> > > and limited or contended memory, etc. We already have xfs_repair users
> > > that need to tweak settings because there isn't enough memory available
> > > to repair the fs. Granted that is for fs-wide repairs and the flipside
> > > is that we know a single AG can only be up to 1TB. It's certainly
> > > possible that putting some persistent backing behind the in-core data is
> > > enough to resolve the problem (and the current approach is certainly
> > > reasonable enough to me for the initial implementation).
> > > 
> > > bjoin limitations aside, I wonder if a cursor roll mechanism that held
> > > all of the cursor buffers, rolled the transaction and then rejoined all
> > > said buffers would help us get around that. (Not sure I follow the early
> > > prototype behavior, but it sounds like we had to restart the rmapbt
> > > lookup over and over...).
> > 
> > Correct.
> > 
> > > Another caveat with that approach may be that I think we'd need to be
> > > sure that the reconstruction operation doesn't ever need to update the
> > > rmapbt while we're mid walk of the latter.
> > 
> > <nod> Looking even farther back in my notes, that was also an issue --
> > fixing the free list causes blocks to go on or off the agfl, which
> > causes rmapbt updates, which meant that the only way I could get
> > in-place updates to work was to re-lookup where we were in the btree and
> > also try to deal with any rmapbt entries that might have crept in as
> > result of the record insertion.
> > 
> > Getting the concurrency right for each repair function looked like a
> > difficult problem to solve, but amassing all the records elsewhere and
> > rebuilding was easy to understand.
> > 
> 
> Yeah. This all points to this kind of strategy being too complex to be
> worth the prospective benefits in the short term. Clearly we have
> several, potentially tricky roadblocks to work through before this can
> be made feasible. Thanks for the background, it's still useful to have
> this context to compare with whatever we may have to do to support a
> reclaimable memory approach.

<nod>  Reclaimable memfd "memory" isn't too difficult, we can call
kernel_read and kernel_write, though lockdep gets pretty mad about xfs
taking sb_start_write (on the memfd filesystem) at the same time it has
sb_start_write on the xfs (not to mention the stack usage) so I had to
throw in the extra twist of delegating the actual file io to a workqueue
item (a la xfs_btree_split).
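
As a rough userspace analogy for keeping the records in a file rather than in pinned memory, the sketch below stores fixed-size records in an anonymous temporary file. It is only an illustration under stated assumptions: tmpfile(), fwrite() and fread() stand in for the memfd plus kernel_read/kernel_write described above, the struct and function names are hypothetical, and none of the locking or workqueue delegation is modelled.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical free space record: start block and length. */
struct free_rec {
	uint32_t	bno;
	uint32_t	len;
};

/* Record store backed by an anonymous temporary file. */
struct rec_store {
	FILE	*fp;
	size_t	nr;	/* number of records written so far */
};

int rec_store_init(struct rec_store *rs)
{
	rs->fp = tmpfile();	/* unlinked file, removed on close */
	rs->nr = 0;
	return rs->fp ? 0 : -1;
}

/* Append one record at the end of the file. */
int rec_store_append(struct rec_store *rs, const struct free_rec *rec)
{
	if (fseek(rs->fp, 0, SEEK_END) != 0)
		return -1;
	if (fwrite(rec, sizeof(*rec), 1, rs->fp) != 1)
		return -1;
	rs->nr++;
	return 0;
}

/* Read record number idx back from the file. */
int rec_store_load(struct rec_store *rs, size_t idx, struct free_rec *rec)
{
	if (idx >= rs->nr)
		return -1;
	if (fseek(rs->fp, (long)(idx * sizeof(*rec)), SEEK_SET) != 0)
		return -1;
	return fread(rec, sizeof(*rec), 1, rs->fp) == 1 ? 0 : -1;
}

void rec_store_destroy(struct rec_store *rs)
{
	if (rs->fp)
		fclose(rs->fp);
	rs->fp = NULL;
	rs->nr = 0;
}
```

Compared with a linked list, each record costs only its own size in the file, with no per-record list_head.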

> > > That may be an issue for inode btree reconstruction, for example,
> > > since it looks like inobt block allocation requires rmapbt updates.
> > > We'd probably need some way to share (or invalidate) a cursor across
> > > different contexts to deal with that.
> > 
> > I might pursue that strategy if we ever hit the point where we can't
> > find space to store the records (see below).  Another option could be to
> > divert all deferred items for an AG, build a replacement btree in new
> > space, then finish all the deferred items... but that's starting to get
> > into offlineable AGs, which is its own project that I want to tackle
> > later.
> > 
> > (Not that much later, just not this cycle.)
> > 
> 
> *nod*
> 
> > > > For other repair functions (like the data/attr fork repairs) we have to
> > > > scan all the rmapbts for extents, and I'd prefer to lock those AGs only
> > > > for as long as necessary to extract the extents we want.
> > > > 
> > > > > The obvious problem is that we still have some checks that allow the
> > > > > whole repair operation to bail out before we determine whether we can
> > > > > start to rebuild the on-disk btrees. These are things like making sure
> > > > > we can actually read the associated rmapbt blocks (i.e., no read errors
> > > > > or verifier failures), basic record sanity checks, etc. But ISTM that
> > > > > isn't anything we couldn't get around with a multi-pass implementation.
> > > > > Secondary issues might be things like no longer being able to easily
> > > > > insert the longest free extent range(s) first (meaning we'd have to
> > > > > stuff the agfl with old btree blocks or figure out some other approach).
> > > > 
> > > > Well, you could scan the rmapbt twice -- once to find the longest
> > > > record, then again to do the actual insertion.
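The two-pass idea can be sketched in userspace as below: free space is whatever lies between the used extents, the first pass finds the longest gap, and the second pass emits the remaining gaps after it. All names here (struct extent, walk_gaps, order_gaps_longest_first) are hypothetical, the used[] array stands in for rmapbt records already sorted by start block, and overflow of bno + len is not handled.

```c
#include <stddef.h>
#include <stdint.h>

struct extent {
	uint32_t	bno;
	uint32_t	len;
};

typedef void (*gap_fn)(uint32_t bno, uint32_t len, void *priv);

/* Call fn for every gap between used[] (sorted by bno) in [0, ag_blocks). */
static void walk_gaps(const struct extent *used, size_t nr_used,
		      uint32_t ag_blocks, gap_fn fn, void *priv)
{
	uint32_t next = 0;	/* first block not yet known to be used */

	for (size_t i = 0; i < nr_used; i++) {
		uint32_t end = used[i].bno + used[i].len;

		if (used[i].bno > next)
			fn(next, used[i].bno - next, priv);
		if (end > next)
			next = end;
	}
	if (next < ag_blocks)
		fn(next, ag_blocks - next, priv);
}

struct order_ctx {
	struct extent	longest;
	struct extent	*out;
	size_t		nr;
	size_t		cap;
};

static void track_longest(uint32_t bno, uint32_t len, void *priv)
{
	struct extent *longest = priv;

	if (len > longest->len) {
		longest->bno = bno;
		longest->len = len;
	}
}

static void emit_unless_longest(uint32_t bno, uint32_t len, void *priv)
{
	struct order_ctx *ctx = priv;

	if (bno == ctx->longest.bno)
		return;		/* already emitted first */
	if (ctx->nr < ctx->cap) {
		ctx->out[ctx->nr].bno = bno;
		ctx->out[ctx->nr].len = len;
		ctx->nr++;
	}
}

/* Fill out[] with the gaps, longest first.  Returns the number stored. */
size_t order_gaps_longest_first(const struct extent *used, size_t nr_used,
				uint32_t ag_blocks, struct extent *out,
				size_t cap)
{
	struct order_ctx ctx = { .out = out, .cap = cap };

	/* Pass 1: find the longest gap. */
	walk_gaps(used, nr_used, ag_blocks, track_longest, &ctx.longest);
	if (ctx.longest.len == 0 || cap == 0)
		return 0;
	out[0] = ctx.longest;
	ctx.nr = 1;

	/* Pass 2: emit every other gap in block order. */
	walk_gaps(used, nr_used, ag_blocks, emit_unless_longest, &ctx);
	return ctx.nr;
}
```

The cost of this approach is that the source records are walked twice, in exchange for not having to hold the whole list of gaps in memory just to pick the longest one.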
> > > > 
> > > 
> > > Yep, that's what I meant by multi-pass.
> > > 
> > > > > BTW, isn't the typical scrub sequence already multi-pass by virtue of
> > > > > the xfs_scrub_metadata() implementation? I'm wondering if the ->scrub()
> > > > > callout could not only detect corruption, but validate whether repair
> > > > > (if requested) is possible based on the kind of checks that are
> > > > > currently in the repair side rmapbt walkers. Thoughts?
> > > > 
> > > > Yes, scrub basically validates that for us now, with the notable
> > > > exception of the notorious rmapbt scrubber, which doesn't
> > > > cross-reference with inode block mappings because that would be a
> > > > locking nightmare.
> > > > 
> > > > > Are there future
> > > > > changes that are better supported by an in-core tracking structure in
> > > > > general (assuming we'll eventually replace the linked lists with
> > > > > something more efficient) as opposed to attempting to optimize out the
> > > > > need for that tracking at all?
> > > > 
> > > > Well, I was thinking that we could just allocate a memfd (or a file on
> > > > the same xfs once we have AG offlining) and store the records in there.
> > > > That saves us the list_head overhead and potentially enables access to a
> > > > lot more storage than pinning things in RAM.
> > > > 
> > > 
> > > Would using the same fs mean we have to store the repair data in a
> > > separate AG, or somehow locate/use free space in the target AG?
> > 
> > As part of building an "offline AG" feature we'd presumably have to
> > teach the allocators to avoid the offline AGs for allocations, which
> > would make it so that we could host the repair data files in the same
> > XFS that's being fixed.  That seems a little risky to me, but the disk
> > is probably larger than mem+swap.
> > 
> 
> Got it, so we'd use the remaining space in the fs outside of the target
> AG. ISTM that still presumes the rest of the fs is coherent, but I
> suppose the offline AG thing helps us with that. We'd just have to make
> sure we've shut down all currently corrupted AGs before we start to
> repair a particular corrupted one, and then hope there's still enough
> free space in the fs to proceed.

That's a pretty big hope. :)  I think for now 

> That makes more sense, but I still agree that it seems risky in general.
> Technical risk aside, there's also usability concerns in that the local
> free space requirement is another bit of non-determinism

I don't think it's non-deterministic, it's just hard for the filesystem
to communicate to the user/admin ahead of time.  Roughly speaking, we
need to have about as much disk space for the new btree as we had
allocated for the old one.

As far as memory requirements go, in last week's revising of the patches
I compressed the in-memory record structs down about as far as possible;
with the removal of the list heads, the memory requirements drop by
30-60%.  We require the same amount of memory as would be needed to
store all of the records in the leaf nodes, and no more, and we can use
swap space to do it.

> around the ability to online repair vs. having to punt to xfs_repair,
> or if the repair consumes whatever free space remains in the fs to the
> detriment of whatever workload the user presumably wanted to keep the
> fs online for, etc.

I've occasionally thought that future xfs_scrub could ask the kernel to
estimate how much disk and memory it will need for the repair (and
whether the disk space requirement is fs-scope or AG-scope); then it
could forgo a repair action and recommend xfs_repair if running the
online repair would take the system below some configurable threshold.
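
Roughly, the userspace side of that check might look like the sketch
below.  The structs, the function name, and the percentage-based reserve
are all made up for illustration; no such estimation interface exists in
this series.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical estimate that a future kernel interface might return. */
struct repair_estimate {
	uint64_t	disk_bytes;	/* space needed for the new btree */
	uint64_t	mem_bytes;	/* memory (or swap) needed for records */
};

/* Hypothetical snapshot of what is available right now. */
struct free_resources {
	uint64_t	disk_bytes;	/* free space in the relevant scope */
	uint64_t	mem_bytes;	/* free memory plus swap */
};

/*
 * Return true if running the repair would still leave at least
 * reserve_pct percent of each free resource untouched.
 */
static bool
should_repair_online(
	const struct repair_estimate	*est,
	const struct free_resources	*avail,
	unsigned int			reserve_pct)
{
	uint64_t			disk_floor;
	uint64_t			mem_floor;

	if (reserve_pct > 100)
		return false;

	/* Divide first so that large byte counts cannot overflow. */
	disk_floor = avail->disk_bytes / 100 * reserve_pct;
	mem_floor = avail->mem_bytes / 100 * reserve_pct;

	if (est->disk_bytes > avail->disk_bytes - disk_floor)
		return false;
	if (est->mem_bytes > avail->mem_bytes - mem_floor)
		return false;
	return true;
}
```

If this returns false, xfs_scrub would skip the repair and tell the
admin to run xfs_repair instead.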

> > > presume either way we'd have to ensure that AG is either consistent or
> > > locked out from outside I/O. If we have the total record count we can
> > 
> > We usually don't, but for the btrees that have their own record/blocks
> > counters we might be able to guess a number, fallocate it, and see if
> > that doesn't ENOSPC.
> > 
> > > preallocate the file and hope there is no such other free space
> > > corruption or something that would allow some other task to mess with
> > > our blocks. I'm a little skeptical overall on relying on a corrupted
> > > filesystem to store repair data, but perhaps there are ways to mitigate
> > > the risks.
> > 
> > Store it elsewhere?  /home for root repairs, /root for any other
> > repair... though if we're going to do that, why not just add a swap file
> > temporarily?
> > 
> 
> Indeed. The thought crossed my mind about whether we could do something
> like have an internal/isolated swap file for dedicated XFS allocations
> to avoid contention with the traditional swap.

Heh, I think e2fsck has some feature like that where you can pass it a
swap file.  No idea how much good that does on modern systems where
there's one huge partition... :)

> Userspace could somehow set it up or communicate to the kernel. I have
> no idea how realistic that is though or if there's a better interface
> for that kind of thing (i.e., file backed kmem cache?).

I looked, and there aren't any other mechanisms for unpinned kernel
memory allocations.

> What _seems_ beneficial about that approach is we get (potentially
> external) persistent backing and memory reclaim ability with the
> traditional memory allocation model.
>
> ISTM that if we used a regular file, we'd need to deal with the
> traditional file interface somehow or another (file read/pagecache
> lookup -> record ??).

Yes, that's all neatly wrapped up in kernel_read() and kernel_write() so
all we need is a (struct file *).

> We could repurpose some existing mechanism like the directory code or
> quota inode mechanism to use xfs buffers for that purpose, but I think
> that would require us to always use an internal inode. Allowing
> userspace to pass an fd/file passes that consideration on to the user,
> which might be more flexible. We could always warn about additional
> limitations if that fd happens to be based on the target fs.

<nod> A second advantage of the struct file/kernel_{read,write} approach
is that if we ever decide to let userspace pass in an fd, it's trivial
to feed that struct file to the kernel io routines instead of a memfd
one.
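
As a userspace analogue of the idea (a sketch only; the record layout
and the demo_* names are invented, and the in-kernel version would go
through kernel_read()/kernel_write() on a struct file rather than
pread()/pwrite() on an fd), a fixed-size record array backed by a file
could look like this:

```c
#include <errno.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical fixed-size record; real repair records would differ. */
struct demo_rec {
	uint32_t	startblock;
	uint32_t	blockcount;
};

/* Store record number idx in the backing file. */
static int
demo_rec_store(
	int			fd,
	uint64_t		idx,
	const struct demo_rec	*rec)
{
	ssize_t			ret;

	ret = pwrite(fd, rec, sizeof(*rec), (off_t)(idx * sizeof(*rec)));
	if (ret < 0)
		return -errno;
	/* Treat a short write as running out of backing space. */
	if ((size_t)ret != sizeof(*rec))
		return -ENOSPC;
	return 0;
}

/* Load record number idx from the backing file. */
static int
demo_rec_load(
	int			fd,
	uint64_t		idx,
	struct demo_rec		*rec)
{
	ssize_t			ret;

	ret = pread(fd, rec, sizeof(*rec), (off_t)(idx * sizeof(*rec)));
	if (ret < 0)
		return -errno;
	/* Reading past the last stored record returns fewer bytes. */
	if ((size_t)ret != sizeof(*rec))
		return -ENODATA;
	return 0;
}
```

Nothing in the two helpers cares where the fd came from, which is the
point: a memfd, a swap-backed file, or something the caller handed in
all look the same to the record code.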

> > > I'm not familiar with memfd. The manpage suggests it's ram backed, is it
> > > swappable or something?
> > 
> > It's supposed to be.  The quick test I ran (allocate a memfd, write 1GB
> > of junk to it on a VM with 400M of RAM) seemed to push about 980MB into
> > the swap file.
> > 
> 
> Ok.
> 
> > > If so, that sounds like a reasonable option provided the swap space
> > > requirement can be made clear to users
> > 
> > We can document it.  I don't think it's any worse than xfs_repair being
> > able to use up all the memory + swap... and since we're probably only
> > going to be repairing one thing at a time, most likely scrub won't need
> > as much memory.
> > 
> 
> Right, but as noted below, my concerns with the xfs_repair comparison
> are that 1.) the kernel generally has more of a limit on anonymous
> memory allocations than userspace (i.e., not swappable AFAIU?) and 2.)
> it's not clear how effectively running the system out of memory via the
> kernel will behave from a failure perspective.
> 
> IOW, xfs_repair can run the system out of memory but for the most part
> that ends up being a simple problem for the system: OOM kill the bloated
> xfs_repair process. For an online repair in a similar situation, I have
> no idea what's going to happen.

Back in the days of the huge linked lists the oom killer would target
other processes because it doesn't know that the online repair thread is
sitting on a ton of pinned kernel memory...

> The hope is that the online repair hits -ENOMEM and unwinds, but ISTM
> we'd still be at risk of other subsystems running into memory
> allocation problems, filling up swap, the OOM killer going after
> unrelated processes, etc.  What if, for example, the OOM killer starts
> picking off processes in service to a running online repair that
> immediately consumes freed up memory until the system is borked?

Yeah.  One thing we /could/ do is register an oom notifier that would
urge any running repair threads to bail out if they can.  It seems to me
that the oom killer blocks on the oom_notify_list chain, so our handler
could wait until at least one thread exits before returning.
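
The bail-out half of that can be sketched in plain userspace C.  This is
only an analogue: the demo_* names are made up, demo_memory_pressure()
stands in for whatever a handler hung off register_oom_notifier() would
do, and the "wait for a thread to exit" part is not modelled at all.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Set by the (hypothetical) memory-pressure handler. */
static atomic_bool demo_repair_should_abort;

/* Stand-in for what an oom notifier callback would do. */
static void
demo_memory_pressure(void)
{
	atomic_store(&demo_repair_should_abort, true);
}

/*
 * Stand-in for a record-collection loop.  Checks the abort flag before
 * handling each record and unwinds with -ENOMEM if asked to stop.
 * *done reports how many records were handled.
 */
static int
demo_collect_records(
	size_t		nr_records,
	size_t		*done)
{
	size_t		i;

	for (i = 0; i < nr_records; i++) {
		if (atomic_load(&demo_repair_should_abort)) {
			*done = i;
			return -ENOMEM;
		}
		/* ... stash record i somewhere ... */
	}
	*done = nr_records;
	return 0;
}
```

The repair thread only notices the request at record boundaries, so how
quickly memory actually comes back depends on how much work sits between
two checks.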

> I don't know how likely that is or if it really ends up much different
> from the analogous xfs_repair situation. My only point right now is
> that failure scenario is something we should explore for any solution
> we ultimately consider because it may be an unexpected use case of the
> underlying mechanism.

Ideally, online repair would always be the victim since we know we have
a reasonable fallback.  At least for memfd, however, I think the only
clues we have to decide the question "is this memfd getting in the way
of other threads?" are seeing ENOMEM, short writes, or getting
kicked by an oom notification.  Maybe that'll be enough?

> (To the contrary, just using a cached file seems a natural fit from
> that perspective.)

Same here.

> > > and the failure characteristics aren't more severe than for userspace.
> > > An online repair that puts the broader system at risk of OOM as
> > > opposed to predictably failing gracefully may not be the most useful
> > > tool.
> > 
> > Agreed.  One huge downside of memfd seems to be the lack of a mechanism
> > for the vm to push back on us if we successfully write all we need to
> > the memfd but then other processes need some memory.  Obviously, if the
> > memfd write itself comes up short or fails then we dump the memfd and
> > error back to userspace.  We might simply have to free array memory
> > while we iterate the records to minimize the time spent at peak memory
> > usage.
> > 
> 
> Hm, yeah. Some kind of fixed/relative size in-core memory pool approach
> may simplify things because we could allocate it up front and know right
> away whether we just don't have enough memory available to repair.

Hmm.  Apparently we actually /can/ call fallocate on memfd to grab all
the pages at once, provided we have some guesstimate beforehand of how
much space we think we'll need.

So long as my earlier statement about the memory requirements being no
more than the size of the btree leaves is actually true (I haven't
rigorously tried to prove it), we need about (xrep_calc_ag_resblks() *
blocksize) worth of space in the memfd file.  Maybe we ask for 1.5x
that and if we don't get it, we kill the memfd and exit.
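
From userspace, the reserve-or-bail step looks something like the sketch
below.  The function name and the "demo-repair" label are invented, and
est_blocks merely stands in for whatever xrep_calc_ag_resblks() would
report; the in-kernel version would not use these syscalls directly.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Create a memfd and reserve 1.5x the estimated record space up front.
 * Returns an open fd on success or a negative errno if the memory
 * cannot be reserved, in which case nothing is left allocated.
 */
static int
demo_reserve_memfd(
	uint64_t	est_blocks,
	uint32_t	blocksize)
{
	off_t		want = (off_t)(est_blocks * blocksize / 2 * 3);
	int		fd;
	int		error;

	fd = memfd_create("demo-repair", MFD_CLOEXEC);
	if (fd < 0)
		return -errno;

	/* Mode 0 allocates the pages and extends the file size. */
	if (fallocate(fd, 0, 0, want) < 0) {
		error = -errno;
		close(fd);	/* kill the memfd and bail out */
		return error;
	}
	return fd;
}
```

A zero estimate fails here with EINVAL because fallocate rejects a zero
length, so a real caller would want to special-case an empty btree.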

--D

> 
> Brian
> 
> > --D
> > 
> > > 
> > > Brian
> > > 
> > > > --D
> > > > 
> > > > > Brian
> > > > > 
> > > > > > --D
> > > > > > 
> > > > > > > Brian
> > > > > > > 
> > > > > > > > --D
> > > > > > > > 
> > > > > > > > > Brian
> > > > > > > > > 
> > > > > > > > > > > > +
> > > > > > > > > > > > +done:
> > > > > > > > > > > > +	/* Free all the OWN_AG blocks that are not in the rmapbt/agfl. */
> > > > > > > > > > > > +	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_AG);
> > > > > > > > > > > > +	return xrep_reap_extents(sc, old_allocbt_blocks, &oinfo,
> > > > > > > > > > > > +			XFS_AG_RESV_NONE);
> > > > > > > > > > > > +}
> > > > > > > > > > > > +
> > > > > > > > > > > ...
> > > > > > > > > > > > diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c
> > > > > > > > > > > > index 0ed68379e551..82f99633a597 100644
> > > > > > > > > > > > --- a/fs/xfs/xfs_extent_busy.c
> > > > > > > > > > > > +++ b/fs/xfs/xfs_extent_busy.c
> > > > > > > > > > > > @@ -657,3 +657,17 @@ xfs_extent_busy_ag_cmp(
> > > > > > > > > > > >  		diff = b1->bno - b2->bno;
> > > > > > > > > > > >  	return diff;
> > > > > > > > > > > >  }
> > > > > > > > > > > > +
> > > > > > > > > > > > +/* Are there any busy extents in this AG? */
> > > > > > > > > > > > +bool
> > > > > > > > > > > > +xfs_extent_busy_list_empty(
> > > > > > > > > > > > +	struct xfs_perag	*pag)
> > > > > > > > > > > > +{
> > > > > > > > > > > > +	spin_lock(&pag->pagb_lock);
> > > > > > > > > > > > +	if (pag->pagb_tree.rb_node) {
> > > > > > > > > > > 
> > > > > > > > > > > RB_EMPTY_ROOT()?
> > > > > > > > > > 
> > > > > > > > > > Good suggestion, thank you!
> > > > > > > > > > 
> > > > > > > > > > --D
> > > > > > > > > > 
> > > > > > > > > > > Brian
> > > > > > > > > > > 
> > > > > > > > > > > > +		spin_unlock(&pag->pagb_lock);
> > > > > > > > > > > > +		return false;
> > > > > > > > > > > > +	}
> > > > > > > > > > > > +	spin_unlock(&pag->pagb_lock);
> > > > > > > > > > > > +	return true;
> > > > > > > > > > > > +}
> > > > > > > > > > > > diff --git a/fs/xfs/xfs_extent_busy.h b/fs/xfs/xfs_extent_busy.h
> > > > > > > > > > > > index 990ab3891971..2f8c73c712c6 100644
> > > > > > > > > > > > --- a/fs/xfs/xfs_extent_busy.h
> > > > > > > > > > > > +++ b/fs/xfs/xfs_extent_busy.h
> > > > > > > > > > > > @@ -65,4 +65,6 @@ static inline void xfs_extent_busy_sort(struct list_head *list)
> > > > > > > > > > > >  	list_sort(NULL, list, xfs_extent_busy_ag_cmp);
> > > > > > > > > > > >  }
> > > > > > > > > > > >  
> > > > > > > > > > > > +bool xfs_extent_busy_list_empty(struct xfs_perag *pag);
> > > > > > > > > > > > +
> > > > > > > > > > > >  #endif /* __XFS_EXTENT_BUSY_H__ */
> > > > > > > > > > > > 
> > > > > > > > > > > > --
> > > > > > > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > > > > > > > > > the body of a message to majordomo@vger.kernel.org
> > > > > > > > > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > > > > > > > > --
> > > > > > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > > > > > > > > the body of a message to majordomo@vger.kernel.org
> > > > > > > > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > > > > > > > --
> > > > > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > > > > > > > the body of a message to majordomo@vger.kernel.org
> > > > > > > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > > > > > > --
> > > > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > > > > > > the body of a message to majordomo@vger.kernel.org
> > > > > > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > > > > --
> > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > > > > the body of a message to majordomo@vger.kernel.org
> > > > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > > > --
> > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > > > the body of a message to majordomo@vger.kernel.org
> > > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > > --
> > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > > the body of a message to majordomo@vger.kernel.org
> > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > the body of a message to majordomo@vger.kernel.org
> > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 03/14] xfs: repair the AGFL
  2018-08-08 21:26             ` Darrick J. Wong
@ 2018-08-09 11:14               ` Brian Foster
  0 siblings, 0 replies; 53+ messages in thread
From: Brian Foster @ 2018-08-09 11:14 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, david, allison.henderson

On Wed, Aug 08, 2018 at 02:26:55PM -0700, Darrick J. Wong wrote:
> On Wed, Aug 08, 2018 at 08:09:39AM -0400, Brian Foster wrote:
> > On Tue, Aug 07, 2018 at 03:02:24PM -0700, Darrick J. Wong wrote:
> > > On Tue, Jul 31, 2018 at 11:10:00AM -0400, Brian Foster wrote:
> > > > On Mon, Jul 30, 2018 at 10:22:16AM -0700, Darrick J. Wong wrote:
> > > > > On Mon, Jul 30, 2018 at 12:25:24PM -0400, Brian Foster wrote:
> > > > > > On Sun, Jul 29, 2018 at 10:48:08PM -0700, Darrick J. Wong wrote:
> > > > > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > > > 
> > > > > > > Repair the AGFL from the rmap data.
> > > > > > > 
> > > > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > > > ---
> > > > > > 
> > > > > > FWIW, I tried tweaking a couple agfl values via xfs_db and xfs_scrub
> > > > > > seems to always dump a cross-referencing failed error and not want to
> > > > > > deal with it. Expected? Is there a good way to unit test some of this
> > > > > > stuff with simple/localized corruptions?
> > > > > 
> > > > > I usually pick one of the corruptions from xfs/355...
> > > > > 
> > > > > $ SCRATCH_XFS_LIST_FUZZ_VERBS=random \
> > > > > SCRATCH_XFS_LIST_METADATA_FIELDS=somefield \
> > > > > ./check xfs/355
> > > > > 
> > > > 
> > > > It looks like similar behavior if I do that, but tbh I'm not sure if I'm
> > > > using this correctly. E.g., if I do:
> > > 
> > > <urk> Sorry, I forgot to reply to this...
> > > 
> > > > # SCRATCH_XFS_LIST_FUZZ_VERBS=random SCRATCH_XFS_LIST_METADATA_FIELDS=bno[0] ./check xfs/355
> > > > FSTYP         -- xfs (debug)
> > > > PLATFORM      -- Linux/x86_64 localhost 4.18.0-rc4+
> > > > MKFS_OPTIONS  -- -f -mrmapbt=1,reflink=1 /dev/mapper/test-scratch
> > > > MOUNT_OPTIONS -- -o context=system_u:object_r:root_t:s0 /dev/mapper/test-scratch /mnt/scratch
> > > > 
> > > > xfs/355 - output mismatch (see /root/xfstests-dev/results//xfs/355.out.bad)
> > > >     ...
> > > >     (Run 'diff -u tests/xfs/355.out /root/xfstests-dev/results//xfs/355.out.bad'  to see the entire diff)
> > > > Ran: xfs/355
> > > > Failures: xfs/355
> > > > Failed 1 of 1 tests
> > > > # diff -u tests/xfs/355.out /root/xfstests-dev/results//xfs/355.out.bad
> > > > --- tests/xfs/355.out   2018-07-25 07:47:23.739575416 -0400
> > > > +++ /root/xfstests-dev/results//xfs/355.out.bad 2018-07-31 10:55:18.466178944 -0400
> > > > @@ -1,6 +1,10 @@
> > > >  QA output created by 355
> > > >  Format and populate
> > > >  Fuzz AGFL
> > > > +online re-scrub (1) with bno[0] = random.
> > > >  Done fuzzing AGFL
> > > >  Fuzz AGFL flfirst
> > > > +offline re-scrub (1) with bno[14] = random.
> > > > +online re-scrub (1) with bno[14] = random.
> > > > +re-repair failed (1) with bno[14] = random.
> > > >  Done fuzzing AGFL flfirst
> > > > 
> > > > If I run xfs_scrub directly on the scratch mount after the test I get a
> > > > stream of inode cross-referencing errors and it doesn't seem to fix
> > > > anything up.
> > > 
> > > Hmm.  What is your xfsprogs head?  I think Eric committed the patches to
> > > xfs_scrub to enable repairs in v4.18.0-rc1... which git says happened on
> > > 8/1.
> > > 
> > 
> > I think it was just for-next. Regardless, I was really just looking for
> > a way to trigger a specific repair cycle and got around it once I
> > discovered the XFS_ERRTAG_FORCE_SCRUB_REPAIR tag. I did have to stick
> > the repair flag in the xfs_io scrub calls as well to trigger it that
> > way, IIRC.
> >
> > Any thoughts on allowing that, perhaps with an extra scrub command flag
> > (and/or in experimental mode)?
> 
> I'm a little confused by what you meant by having to "stick in the
> repair flag"-- did you mean XFS_SCRUB_IFLAG_REPAIR?  Repair gets its own
> xfs_io command (only in -x mode) "repair"; which should be in commit
> bec810e8b483 ("xfs_io: wire up repair ioctl stuff").
> 
> Or did you mean you had to stick in the errortag to force a repair?
> That was added to the 'inject' command in 52818844f1 ("xfs: implement
> the metadata repair ioctl flag").
> 

Both..

> Either way...
> 
> # xfs_io -x -c 'inject force_repair' -c 'repair agfl 0' /mnt
> 
> ...should do the trick.
> 

I was basically doing the above with a scrub command with a hacked in
IFLAG_REPAIR flag because I just missed that there was a repair command.
This is pretty much what I was looking for, so disregard my previous
comment. Thanks!

Brian

> --D
> 
> > 
> > Brian
> > 
> > > --D
> > > 
> > > > 
> > > > Brian
> > > > 
> > > > > > Otherwise this looks sane, a couple comments..
> > > > > > 
> > > > > > >  fs/xfs/scrub/agheader_repair.c |  276 ++++++++++++++++++++++++++++++++++++++++
> > > > > > >  fs/xfs/scrub/bitmap.c          |   92 +++++++++++++
> > > > > > >  fs/xfs/scrub/bitmap.h          |    4 +
> > > > > > >  fs/xfs/scrub/scrub.c           |    2 
> > > > > > >  4 files changed, 373 insertions(+), 1 deletion(-)
> > > > > > > 
> > > > > > > 
> > > > > > > diff --git a/fs/xfs/scrub/agheader_repair.c b/fs/xfs/scrub/agheader_repair.c
> > > > > > > index 4842fc598c9b..bfef066c87c3 100644
> > > > > > > --- a/fs/xfs/scrub/agheader_repair.c
> > > > > > > +++ b/fs/xfs/scrub/agheader_repair.c
> > > > > > > @@ -424,3 +424,279 @@ xrep_agf(
> > > > > > >  	memcpy(agf, &old_agf, sizeof(old_agf));
> > > > > > >  	return error;
> > > > > > >  }
> > > > > > > +
> > > > > > ...
> > > > > > > +/* Write out a totally new AGFL. */
> > > > > > > +STATIC void
> > > > > > > +xrep_agfl_init_header(
> > > > > > > +	struct xfs_scrub	*sc,
> > > > > > > +	struct xfs_buf		*agfl_bp,
> > > > > > > +	struct xfs_bitmap	*agfl_extents,
> > > > > > > +	xfs_agblock_t		flcount)
> > > > > > > +{
> > > > > > > +	struct xfs_mount	*mp = sc->mp;
> > > > > > > +	__be32			*agfl_bno;
> > > > > > > +	struct xfs_bitmap_range	*br;
> > > > > > > +	struct xfs_bitmap_range	*n;
> > > > > > > +	struct xfs_agfl		*agfl;
> > > > > > > +	xfs_agblock_t		agbno;
> > > > > > > +	unsigned int		fl_off;
> > > > > > > +
> > > > > > > +	ASSERT(flcount <= xfs_agfl_size(mp));
> > > > > > > +
> > > > > > > +	/* Start rewriting the header. */
> > > > > > > +	agfl = XFS_BUF_TO_AGFL(agfl_bp);
> > > > > > > +	memset(agfl, 0xFF, BBTOB(agfl_bp->b_length));
> > > > > > 
> > > > > > What's the purpose behind 0xFF? Related to NULLAGBLOCK/NULLCOMMITLSN..?
> > > > > 
> > > > > Yes, it prepopulates the AGFL bno[] array with NULLAGBLOCK, then writes
> > > > > in the header fields.
> > > > > 
> > > > > > > +	agfl->agfl_magicnum = cpu_to_be32(XFS_AGFL_MAGIC);
> > > > > > > +	agfl->agfl_seqno = cpu_to_be32(sc->sa.agno);
> > > > > > > +	uuid_copy(&agfl->agfl_uuid, &mp->m_sb.sb_meta_uuid);
> > > > > > > +
> > > > > > > +	/*
> > > > > > > +	 * Fill the AGFL with the remaining blocks.  If agfl_extents has more
> > > > > > > +	 * blocks than fit in the AGFL, they will be freed in a subsequent
> > > > > > > +	 * step.
> > > > > > > +	 */
> > > > > > > +	fl_off = 0;
> > > > > > > +	agfl_bno = XFS_BUF_TO_AGFL_BNO(mp, agfl_bp);
> > > > > > > +	for_each_xfs_bitmap_extent(br, n, agfl_extents) {
> > > > > > > +		agbno = XFS_FSB_TO_AGBNO(mp, br->start);
> > > > > > > +
> > > > > > > +		trace_xrep_agfl_insert(mp, sc->sa.agno, agbno, br->len);
> > > > > > > +
> > > > > > > +		while (br->len > 0 && fl_off < flcount) {
> > > > > > > +			agfl_bno[fl_off] = cpu_to_be32(agbno);
> > > > > > > +			fl_off++;
> > > > > > > +			agbno++;
> > > > > > 
> > > > > > 			/* bump br so we don't reap blocks we've used */
> > > > > > 
> > > > > > (i.e., took me a sec to realize why we bother with ->start)
> > > > > > 
> > > > > > > +			br->start++;
> > > > > > > +			br->len--;
> > > > > > > +		}
> > > > > > > +
> > > > > > > +		if (br->len)
> > > > > > > +			break;
> > > > > > > +		list_del(&br->list);
> > > > > > > +		kmem_free(br);
> > > > > > > +	}
> > > > > > > +
> > > > > > > +	/* Write new AGFL to disk. */
> > > > > > > +	xfs_trans_buf_set_type(sc->tp, agfl_bp, XFS_BLFT_AGFL_BUF);
> > > > > > > +	xfs_trans_log_buf(sc->tp, agfl_bp, 0, BBTOB(agfl_bp->b_length) - 1);
> > > > > > > +}
> > > > > > > +
> > > > > > ...
> > > > > > > diff --git a/fs/xfs/scrub/bitmap.c b/fs/xfs/scrub/bitmap.c
> > > > > > > index c770e2d0b6aa..fdadc9e1dc49 100644
> > > > > > > --- a/fs/xfs/scrub/bitmap.c
> > > > > > > +++ b/fs/xfs/scrub/bitmap.c
> > > > > > > @@ -9,6 +9,7 @@
> > > > > > >  #include "xfs_format.h"
> > > > > > >  #include "xfs_trans_resv.h"
> > > > > > >  #include "xfs_mount.h"
> > > > > > > +#include "xfs_btree.h"
> > > > > > >  #include "scrub/xfs_scrub.h"
> > > > > > >  #include "scrub/scrub.h"
> > > > > > >  #include "scrub/common.h"
> > > > > > > @@ -209,3 +210,94 @@ xfs_bitmap_disunion(
> > > > > > >  }
> > > > > > >  #undef LEFT_ALIGNED
> > > > > > >  #undef RIGHT_ALIGNED
> > > > > > > +
> > > > > > > +/*
> > > > > > > + * Record all btree blocks seen while iterating all records of a btree.
> > > > > > > + *
> > > > > > > + * We know that the btree query_all function starts at the left edge and walks
> > > > > > > + * towards the right edge of the tree.  Therefore, we know that we can walk up
> > > > > > > + * the btree cursor towards the root; if the pointer for a given level points
> > > > > > > + * to the first record/key in that block, we haven't seen this block before;
> > > > > > > + * and therefore we need to remember that we saw this block in the btree.
> > > > > > > + *
> > > > > > > + * So if our btree is:
> > > > > > > + *
> > > > > > > + *    4
> > > > > > > + *  / | \
> > > > > > > + * 1  2  3
> > > > > > > + *
> > > > > > > + * Pretend for this example that each leaf block has 100 btree records.  For
> > > > > > > + * the first btree record, we'll observe that bc_ptrs[0] == 1, so we record
> > > > > > > + * that we saw block 1.  Then we observe that bc_ptrs[1] == 1, so we record
> > > > > > > + * block 4.  The list is [1, 4].
> > > > > > > + *
> > > > > > > + * For the second btree record, we see that bc_ptrs[0] == 2, so we exit the
> > > > > > > + * loop.  The list remains [1, 4].
> > > > > > > + *
> > > > > > > + * For the 101st btree record, we've moved onto leaf block 2.  Now
> > > > > > > + * bc_ptrs[0] == 1 again, so we record that we saw block 2.  We see that
> > > > > > > + * bc_ptrs[1] == 2, so we exit the loop.  The list is now [1, 4, 2].
> > > > > > > + *
> > > > > > > + * For the 102nd record, bc_ptrs[0] == 2, so we continue.
> > > > > > > + *
> > > > > > > + * For the 201st record, we've moved on to leaf block 3.  bc_ptrs[0] == 1, so
> > > > > > > + * we add 3 to the list.  Now it is [1, 4, 2, 3].
> > > > > > > + *
> > > > > > > + * For the 300th record we just exit, with the list being [1, 4, 2, 3].
> > > > > > > + */
> > > > > > > +
> > > > > > > +/*
> > > > > > > + * Record all the buffers pointed to by the btree cursor.  Callers already
> > > > > > > + * engaged in a btree walk should call this function to capture the list of
> > > > > > > + * blocks going from the leaf towards the root.
> > > > > > > + */
> > > > > > > +int
> > > > > > > +xfs_bitmap_set_btcur_path(
> > > > > > > +	struct xfs_bitmap	*bitmap,
> > > > > > > +	struct xfs_btree_cur	*cur)
> > > > > > > +{
> > > > > > > +	struct xfs_buf		*bp;
> > > > > > > +	xfs_fsblock_t		fsb;
> > > > > > > +	int			i;
> > > > > > > +	int			error;
> > > > > > > +
> > > > > > > +	for (i = 0; i < cur->bc_nlevels && cur->bc_ptrs[i] == 1; i++) {
> > > > > > > +		xfs_btree_get_block(cur, i, &bp);
> > > > > > > +		if (!bp)
> > > > > > > +			continue;
> > > > > > > +		fsb = XFS_DADDR_TO_FSB(cur->bc_mp, bp->b_bn);
> > > > > > > +		error = xfs_bitmap_set(bitmap, fsb, 1);
> > > > > > 
> > > > > > Thanks for the comment. It helps explain the bc_ptrs == 1 check above,
> > > > > > but also highlights that xfs_bitmap_set() essentially allocates entries
> > > > > > for duplicate values if they exist. Is this handled by the broader
> > > > > > mechanism, for example, if the rmapbt was corrupted to have multiple
> > > > > > entries for a particular unused OWN_AG block? Or could we end up leaking
> > > > > > that corruption over to the agfl?
> > > > > 
> > > > > Right now we're totally dependent on the rmapbt being sane to rebuild
> > > > > the space metadata.
> > > > > 
> > > > > > I also wonder a bit about memory consumption on filesystems with large
> > > > > > metadata footprints. We essentially have to allocate one of these for
> > > > > > every allocation btree block before we can do the disunion and locate
> > > > > > the agfl-appropriate blocks. If we had a more lookup friendly structure,
> > > > > > perhaps this could be optimized by filtering out bnobt/cntbt blocks
> > > > > > during the associated btree walks..?
> > > > > > 
> > > > > > Have you thought about reusing something like the new in-core extent
> > > > > > tree mechanism as a pure in-memory extent store? It's certainly not
> > > > > > worth reworking something like that right now, but I wonder if we could
> > > > > > save memory via the denser format (and perhaps benefit from code
> > > > > > flexibility, reuse, etc.).
> > > > > 
> > > > > Yes, I was thinking about refactoring the iext btree into a more generic
> > > > > in-core index with 64-bit key so that I could adapt xfs_bitmap to use
> > > > > it.  In the longer term I would /also/ like to use xfs_bitmap to detect
> > > > > xfs_buf cache aliasing when multi-block buffers are in use, but that's
> > > > > further off. :)
> > > > > 
> > > > > As for the memory-intensive record lists in all the btree rebuilders, I
> > > > > have some ideas around that too -- either find a way to build an
> > > > > alternate btree and switch the roots over, or (once we gain the ability
> > > > > to mark an AG unavailable for new allocations) allocate an unlinked
> > > > > inode, store the records in the page cache pages for the file, and
> > > > > release it when we're done.
> > > > > 
> > > > > But, that can wait until I've gotten more of this merged, or get bored.
> > > > > :)
> > > > > 
> > > > > --D
> > > > > 
> > > > > > Brian
> > > > > > 
> > > > > > > +		if (error)
> > > > > > > +			return error;
> > > > > > > +	}
> > > > > > > +
> > > > > > > +	return 0;
> > > > > > > +}
> > > > > > > +
> > > > > > > +/* Collect a btree's block in the bitmap. */
> > > > > > > +STATIC int
> > > > > > > +xfs_bitmap_collect_btblock(
> > > > > > > +	struct xfs_btree_cur	*cur,
> > > > > > > +	int			level,
> > > > > > > +	void			*priv)
> > > > > > > +{
> > > > > > > +	struct xfs_bitmap	*bitmap = priv;
> > > > > > > +	struct xfs_buf		*bp;
> > > > > > > +	xfs_fsblock_t		fsbno;
> > > > > > > +
> > > > > > > +	xfs_btree_get_block(cur, level, &bp);
> > > > > > > +	if (!bp)
> > > > > > > +		return 0;
> > > > > > > +
> > > > > > > +	fsbno = XFS_DADDR_TO_FSB(cur->bc_mp, bp->b_bn);
> > > > > > > +	return xfs_bitmap_set(bitmap, fsbno, 1);
> > > > > > > +}
> > > > > > > +
> > > > > > > +/* Walk the btree and mark the bitmap wherever a btree block is found. */
> > > > > > > +int
> > > > > > > +xfs_bitmap_set_btblocks(
> > > > > > > +	struct xfs_bitmap	*bitmap,
> > > > > > > +	struct xfs_btree_cur	*cur)
> > > > > > > +{
> > > > > > > +	return xfs_btree_visit_blocks(cur, xfs_bitmap_collect_btblock, bitmap);
> > > > > > > +}
> > > > > > > diff --git a/fs/xfs/scrub/bitmap.h b/fs/xfs/scrub/bitmap.h
> > > > > > > index dad652ee9177..ae8ecbce6fa6 100644
> > > > > > > --- a/fs/xfs/scrub/bitmap.h
> > > > > > > +++ b/fs/xfs/scrub/bitmap.h
> > > > > > > @@ -28,5 +28,9 @@ void xfs_bitmap_destroy(struct xfs_bitmap *bitmap);
> > > > > > >  
> > > > > > >  int xfs_bitmap_set(struct xfs_bitmap *bitmap, uint64_t start, uint64_t len);
> > > > > > >  int xfs_bitmap_disunion(struct xfs_bitmap *bitmap, struct xfs_bitmap *sub);
> > > > > > > +int xfs_bitmap_set_btcur_path(struct xfs_bitmap *bitmap,
> > > > > > > +		struct xfs_btree_cur *cur);
> > > > > > > +int xfs_bitmap_set_btblocks(struct xfs_bitmap *bitmap,
> > > > > > > +		struct xfs_btree_cur *cur);
> > > > > > >  
> > > > > > >  #endif	/* __XFS_SCRUB_BITMAP_H__ */
> > > > > > > diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
> > > > > > > index 1e8a17c8e2b9..2670f4cf62f4 100644
> > > > > > > --- a/fs/xfs/scrub/scrub.c
> > > > > > > +++ b/fs/xfs/scrub/scrub.c
> > > > > > > @@ -220,7 +220,7 @@ static const struct xchk_meta_ops meta_scrub_ops[] = {
> > > > > > >  		.type	= ST_PERAG,
> > > > > > >  		.setup	= xchk_setup_fs,
> > > > > > >  		.scrub	= xchk_agfl,
> > > > > > > -		.repair	= xrep_notsupported,
> > > > > > > +		.repair	= xrep_agfl,
> > > > > > >  	},
> > > > > > >  	[XFS_SCRUB_TYPE_AGI] = {	/* agi */
> > > > > > >  		.type	= ST_PERAG,
> > > > > > > 
> > > > > > > --
> > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > > > > the body of a message to majordomo@vger.kernel.org
> > > > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > > > --
> > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > > > the body of a message to majordomo@vger.kernel.org
> > > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > > --
> > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > > the body of a message to majordomo@vger.kernel.org
> > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > the body of a message to majordomo@vger.kernel.org
> > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 05/14] xfs: repair free space btrees
  2018-08-08 22:42                         ` Darrick J. Wong
@ 2018-08-09 12:00                           ` Brian Foster
  2018-08-09 15:59                             ` Darrick J. Wong
  0 siblings, 1 reply; 53+ messages in thread
From: Brian Foster @ 2018-08-09 12:00 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, david, allison.henderson

On Wed, Aug 08, 2018 at 03:42:32PM -0700, Darrick J. Wong wrote:
> On Wed, Aug 08, 2018 at 08:29:54AM -0400, Brian Foster wrote:
> > On Tue, Aug 07, 2018 at 04:34:58PM -0700, Darrick J. Wong wrote:
> > > On Fri, Aug 03, 2018 at 06:49:40AM -0400, Brian Foster wrote:
> > > > On Thu, Aug 02, 2018 at 12:22:05PM -0700, Darrick J. Wong wrote:
> > > > > On Thu, Aug 02, 2018 at 09:48:24AM -0400, Brian Foster wrote:
> > > > > > On Wed, Aug 01, 2018 at 11:28:45PM -0700, Darrick J. Wong wrote:
> > > > > > > On Wed, Aug 01, 2018 at 02:39:20PM -0400, Brian Foster wrote:
> > > > > > > > On Wed, Aug 01, 2018 at 09:23:16AM -0700, Darrick J. Wong wrote:
> > > > > > > > > On Wed, Aug 01, 2018 at 07:54:09AM -0400, Brian Foster wrote:
> > > > > > > > > > On Tue, Jul 31, 2018 at 03:01:25PM -0700, Darrick J. Wong wrote:
> > > > > > > > > > > On Tue, Jul 31, 2018 at 01:47:23PM -0400, Brian Foster wrote:
> > > > > > > > > > > > On Sun, Jul 29, 2018 at 10:48:21PM -0700, Darrick J. Wong wrote:
...
> > > > > So, that early draft spent a lot of time tearing down and reconstructing
> > > > > rmapbt cursors since the standard _btree_query_all isn't suited to that
> > > > > kind of usage.  It was easily twice as slow on a RAM-backed disk just
> > > > > from the rmap cursor overhead and much more complex, so I rewrote it to
> > > > > be simpler.  I also have a slight preference for not touching anything
> > > > > until we're absolutely sure we have all the data we need to repair the
> > > > > structure.
> > > > > 
> > > > 
> > > > Yes, I think that is sane in principle. I'm just a bit concerned about
> > > > how reliable that xfs_repair-like approach will be in the kernel longer
> > > > term, particularly once we start having to deal with large filesystems
> > > > and limited or contended memory, etc. We already have xfs_repair users
> > > > that need to tweak settings because there isn't enough memory available
> > > > to repair the fs. Granted that is for fs-wide repairs and the flipside
> > > > is that we know a single AG can only be up to 1TB. It's certainly
> > > > possible that putting some persistent backing behind the in-core data is
> > > > enough to resolve the problem (and the current approach is certainly
> > > > reasonable enough to me for the initial implementation).
> > > > 
> > > > bjoin limitations aside, I wonder if a cursor roll mechanism that held
> > > > all of the cursor buffers, rolled the transaction and then rejoined all
> > > > said buffers would help us get around that. (Not sure I follow the early
> > > > prototype behavior, but it sounds like we had to restart the rmapbt
> > > > lookup over and over...).
> > > 
> > > Correct.
> > > 
> > > > Another caveat with that approach may be that I think we'd need to be
> > > > sure that the reconstruction operation doesn't ever need to update the
> > > > rmapbt while we're mid walk of the latter.
> > > 
> > > <nod> Looking even farther back in my notes, that was also an issue --
> > > fixing the free list causes blocks to go on or off the agfl, which
> > > causes rmapbt updates, which meant that the only way I could get
> > > in-place updates to work was to re-lookup where we were in the btree and
> > > also try to deal with any rmapbt entries that might have crept in as
> > > a result of the record insertion.
> > > 
> > > Getting the concurrency right for each repair function looked like a
> > > difficult problem to solve, but amassing all the records elsewhere and
> > > rebuilding was easy to understand.
> > > 
> > 
> > Yeah. This all points to this kind of strategy being too complex to be
> > worth the prospective benefits in the short term. Clearly we have
> > several, potentially tricky roadblocks to work through before this can
> > be made feasible. Thanks for the background, it's still useful to have
> > this context to compare with whatever we may have to do to support a
> > reclaimable memory approach.
> 
> <nod>  Reclaimable memfd "memory" isn't too difficult, we can call
> kernel_read and kernel_write, though lockdep gets pretty mad about xfs
> taking sb_start_write (on the memfd filesystem) at the same time it has
> sb_start_write on the xfs (not to mention the stack usage) so I had to
> throw in the extra twist of delegating the actual file io to a workqueue
> item (a la xfs_btree_split).
> 

Ok, I'm more curious what the surrounding code looks like around
managing the underlying file pages. Now that I think of it, the primary
usage was to dump everything into the file and read it back
sequentially, so perhaps this really isn't that difficult to deal with
since the file content is presumably fixed size data structures. (Hmm,
was there a sort in there somewhere as well?).

> > > > That may be an issue for inode btree reconstruction, for example,
> > > > since it looks like inobt block allocation requires rmapbt updates.
> > > > We'd probably need some way to share (or invalidate) a cursor across
> > > > different contexts to deal with that.
> > > 
> > > I might pursue that strategy if we ever hit the point where we can't
> > > find space to store the records (see below).  Another option could be to
> > > divert all deferred items for an AG, build a replacement btree in new
> > > space, then finish all the deferred items... but that's starting to get
> > > into offlineable AGs, which is its own project that I want to tackle
> > > later.
> > > 
> > > (Not that much later, just not this cycle.)
> > > 
> > 
> > *nod*
> > 
> > > > > For other repair functions (like the data/attr fork repairs) we have to
> > > > > scan all the rmapbts for extents, and I'd prefer to lock those AGs only
> > > > > for as long as necessary to extract the extents we want.
> > > > > 
> > > > > > The obvious problem is that we still have some checks that allow the
> > > > > > whole repair operation to bail out before we determine whether we can
> > > > > > start to rebuild the on-disk btrees. These are things like making sure
> > > > > > we can actually read the associated rmapbt blocks (i.e., no read errors
> > > > > > or verifier failures), basic record sanity checks, etc. But ISTM that
> > > > > > isn't anything we couldn't get around with a multi-pass implementation.
> > > > > > Secondary issues might be things like no longer being able to easily
> > > > > > insert the longest free extent range(s) first (meaning we'd have to
> > > > > > stuff the agfl with old btree blocks or figure out some other approach).
> > > > > 
> > > > > Well, you could scan the rmapbt twice -- once to find the longest
> > > > > record, then again to do the actual insertion.
> > > > > 
> > > > 
> > > > Yep, that's what I meant by multi-pass.
> > > > 
> > > > > > BTW, isn't the typical scrub sequence already multi-pass by virtue of
> > > > > > the xfs_scrub_metadata() implementation? I'm wondering if the ->scrub()
> > > > > > callout could not only detect corruption, but validate whether repair
> > > > > > (if requested) is possible based on the kind of checks that are
> > > > > > > currently in the repair side rmapbt walkers. Thoughts?
> > > > > 
> > > > > Yes, scrub basically validates that for us now, with the notable
> > > > > exception of the notorious rmapbt scrubber, which doesn't
> > > > > cross-reference with inode block mappings because that would be a
> > > > > locking nightmare.
> > > > > 
> > > > > > Are there future
> > > > > > changes that are better supported by an in-core tracking structure in
> > > > > > general (assuming we'll eventually replace the linked lists with
> > > > > > something more efficient) as opposed to attempting to optimize out the
> > > > > > need for that tracking at all?
> > > > > 
> > > > > Well, I was thinking that we could just allocate a memfd (or a file on
> > > > > the same xfs once we have AG offlining) and store the records in there.
> > > > > That saves us the list_head overhead and potentially enables access to a
> > > > > lot more storage than pinning things in RAM.
> > > > > 
> > > > 
> > > > Would using the same fs mean we have to store the repair data in a
> > > > separate AG, or somehow locate/use free space in the target AG?
> > > 
> > > As part of building an "offline AG" feature we'd presumably have to
> > > teach the allocators to avoid the offline AGs for allocations, which
> > > would make it so that we could host the repair data files in the same
> > > XFS that's being fixed.  That seems a little risky to me, but the disk
> > > is probably larger than mem+swap.
> > > 
> > 
> > Got it, so we'd use the remaining space in the fs outside of the target
> > AG. ISTM that still presumes the rest of the fs is coherent, but I
> > suppose the offline AG thing helps us with that. We'd just have to make
> > sure we've shut down all currently corrupted AGs before we start to
> > repair a particular corrupted one, and then hope there's still enough
> > free space in the fs to proceed.
> 
> That's a pretty big hope. :)  I think for now 
> 
> > That makes more sense, but I still agree that it seems risky in general.
> > Technical risk aside, there's also usability concerns in that the local
> > free space requirement is another bit of non-determinism
> 
> I don't think it's non-deterministic, it's just hard for the filesystem
> to communicate to the user/admin ahead of time.  Roughly speaking, we
> need to have about as much disk space for the new btree as we had
> allocated for the old one.
> 

Right, maybe non-deterministic is not the best term. What I mean is that
it's not clear to the user why a particular filesystem may not be able
to run a repair (e.g., if it has plenty of reported free space but
enough AGs may be shut down due to corruption). So in certain scenarios
an unrelated corruption or particular ordering of AG repairs could be
the difference between whether an online repair succeeds or defers to
offline repair on the otherwise same filesystem. 

> As far as memory requirements go, in last week's revising of the patches
> I compressed the in-memory record structs down about as far as possible;
> with the removal of the list heads, the memory requirements drop by
> 30-60%.  We require the same amount of memory as would be needed to
> store all of the records in the leaf nodes, and no more, and we can use
> swap space to do it.
> 

Nice. When looking at the existing structures it looked like a worst
case (1TB AG, every other 1k block allocated) could require up to
10-12GB RAM (but I could have easily messed that up). That's not insane
on its own, it's just the question of allocating that much memory in the
kernel. Slimming that down and pushing it into something swappable
doesn't _sound_ too overbearing. I'm not really sure what default distro
swap sizes are these days (some % of RAM?), but it shouldn't be that
hard to find ~10GB of disk space somewhere to facilitate a repair.
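
For a rough feel of where a figure in that range could come from, here is
a back-of-the-envelope sketch.  The worst case is the one described above
(1TB AG, 1k blocks, every other block allocated).  The 24-byte in-core
record size used in the note below is purely an assumption for
illustration (say, a 16-byte list_head plus two 32-bit fields), not a
number taken from the patches, and the helper names are made up.

```c
#include <stdint.h>

/*
 * Worst case from the discussion: every other block is allocated, so half
 * of the blocks in the AG are single-block free extents, one record each.
 */
static uint64_t worst_case_free_extents(uint64_t ag_bytes, uint64_t block_bytes)
{
	return (ag_bytes / block_bytes) / 2;
}

/*
 * Memory needed to hold every record in core at once.  record_bytes is
 * whatever the in-core struct happens to cost; it is an input here, not
 * a fact about the real code.
 */
static uint64_t incore_bytes(uint64_t nr_records, uint64_t record_bytes)
{
	return nr_records * record_bytes;
}
```

Under those assumptions, worst_case_free_extents(1ULL << 40, 1024) gives
2^29 records, and incore_bytes() at 24 bytes per record gives 12GiB, which
is at least in the same ballpark as the estimate above.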

> > around the ability to online repair vs. having to punt to xfs_repair,
> > or if the repair consumes whatever free space remains in the fs to the
> > detriment of whatever workload the user presumably wanted to keep the
> > fs online for, etc.
> 
> I've occasionally thought that future xfs_scrub could ask the kernel to
> estimate how much disk and memory it will need for the repair (and
> whether the disk space requirement is fs-scope or AG-scope); then it
> could forego a repair action and recommend xfs_repair if running the
> online repair would take the system below some configurable threshold.
> 

I think something like that would improve usability once we nail down
the core mechanism.

> > > > presume either way we'd have to ensure that AG is either consistent or
> > > > locked out from outside I/O. If we have the total record count we can
> > > 
> > > We usually don't, but for the btrees that have their own record/blocks
> > > counters we might be able to guess a number, fallocate it, and see if
> > > that doesn't ENOSPC.
> > > 
> > > > preallocate the file and hope there is no such other free space
> > > > corruption or something that would allow some other task to mess with
> > > > our blocks. I'm a little skeptical overall on relying on a corrupted
> > > > filesystem to store repair data, but perhaps there are ways to mitigate
> > > > the risks.
> > > 
> > > Store it elsewhere?  /home for root repairs, /root for any other
> > > repair... though if we're going to do that, why not just add a swap file
> > > temporarily?
> > > 
> > 
> > Indeed. The thought crossed my mind about whether we could do something
> > like have an internal/isolated swap file for dedicated XFS allocations
> > to avoid contention with the traditional swap.
> 
> Heh, I think e2fsck has some feature like that where you can pass it a
> swap file.  No idea how much good that does on modern systems where
> there's one huge partition... :)
> 

Interesting. Couldn't you always create an additional swap file, run the
repair then kill it off when it's no longer needed?

> > Userspace could somehow set it up or communicate to the kernel. I have
> > no idea how realistic that is though or if there's a better interface
> > for that kind of thing (i.e., file backed kmem cache?).
> 
> I looked, and there aren't any other mechanisms for unpinned kernel
> memory allocations.
> 

Ok, it looks like swap or traditional files it is then. ;P

> > What _seems_ beneficial about that approach is we get (potentially
> > external) persistent backing and memory reclaim ability with the
> > traditional memory allocation model.
> >
> > ISTM that if we used a regular file, we'd need to deal with the
> > traditional file interface somehow or another (file read/pagecache
> > lookup -> record ??).
> 
> Yes, that's all neatly wrapped up in kernel_read() and kernel_write() so
> all we need is a (struct file *).
> 
> > We could repurpose some existing mechanism like the directory code or
> > quota inode mechanism to use xfs buffers for that purpose, but I think
> > that would require us to always use an internal inode. Allowing
> > userspace to pass an fd/file passes that consideration on to the user,
> > which might be more flexible. We could always warn about additional
> > limitations if that fd happens to be based on the target fs.
> 
> <nod> A second advantage of the struct file/kernel_{read,write} approach
> is that if we ever decide to let userspace pass in a fd, it's trivial
> to feed that struct file to the kernel io routines instead of a memfd
> one.
> 

Yeah, I like this flexibility. In fact, I'm wondering why we wouldn't do
something like this anyways. Could/should xfs_scrub be responsible for
allocating a memfd and passing along the fd? Another advantage of doing
that is whatever logic we may need to clean up old repair files or
whatever is pushed to userspace.

> > > > I'm not familiar with memfd. The manpage suggests it's ram backed, is it
> > > > swappable or something?
> > > 
> > > It's supposed to be.  The quick test I ran (allocate a memfd, write 1GB
> > > of junk to it on a VM with 400M of RAM) seemed to push about 980MB into
> > > the swap file.
> > > 
> > 
> > Ok.
> > 
> > > > If so, that sounds like a reasonable option provided the swap space
> > > > requirement can be made clear to users
> > > 
> > > We can document it.  I don't think it's any worse than xfs_repair being
> > > able to use up all the memory + swap... and since we're probably only
> > > going to be repairing one thing at a time, most likely scrub won't need
> > > as much memory.
> > > 
> > 
> > Right, but as noted below, my concerns with the xfs_repair comparison
> > are that 1.) the kernel generally has more of a limit on anonymous
> > memory allocations than userspace (i.e., not swappable AFAIU?) and 2.)
> > it's not clear how effectively running the system out of memory via the
> > kernel will behave from a failure perspective.
> > 
> > IOW, xfs_repair can run the system out of memory but for the most part
> > that ends up being a simple problem for the system: OOM kill the bloated
> > xfs_repair process. For an online repair in a similar situation, I have
> > no idea what's going to happen.
> 
> Back in the days of the huge linked lists the oom killer would target
> other processes because it doesn't know that the online repair thread is
> sitting on a ton of pinned kernel memory...
> 

Makes sense, kind of what I'd expect...

> > The hope is that the online repair hits -ENOMEM and unwinds, but ISTM
> > we'd still be at risk of other subsystems running into memory
> > allocation problems, filling up swap, the OOM killer going after
> > unrelated processes, etc.  What if, for example, the OOM killer starts
> > picking off processes in service to a running online repair that
> > immediately consumes freed up memory until the system is borked?
> 
> Yeah.  One thing we /could/ do is register an oom notifier that would
> urge any running repair threads to bail out if they can.  It seems to me
> that the oom killer blocks on the oom_notify_list chain, so our handler
> could wait until at least one thread exits before returning.
> 

Ok, something like that could be useful. I agree that we probably don't
need to go that far until the mechanism is nailed down and testing shows
that OOM is a problem.

> > I don't know how likely that is or if it really ends up much different
> > from the analogous xfs_repair situation. My only point right now is
> > that failure scenario is something we should explore for any solution
> > we ultimately consider because it may be an unexpected use case of the
> > underlying mechanism.
> 
> Ideally, online repair would always be the victim since we know we have
> a reasonable fallback.  At least for memfd, however, I think the only
> clues we have to decide the question "is this memfd getting in the way
> of other threads?" is either seeing ENOMEM, short writes, or getting
> kicked by an oom notification.  Maybe that'll be enough?
> 

Hm, yeah. It may be challenging to track memfd usage as such. If
userspace has access to the fd on an OOM notification or whatever, it
might be able to do more accurate analysis based on an fstat() or
something.

Related question... is the online repair sequence currently
interruptible, if xfs_scrub receives a fatal signal while pulling in
entries during an allocbt scan for example?

> > (To the contrary, just using a cached file seems a natural fit from
> > that perspective.)
> 
> Same here.
> 
> > > > and the failure characteristics aren't more severe than for userspace.
> > > > An online repair that puts the broader system at risk of OOM as
> > > > opposed to predictably failing gracefully may not be the most useful
> > > > tool.
> > > 
> > > Agreed.  One huge downside of memfd seems to be the lack of a mechanism
> > > for the vm to push back on us if we successfully write all we need to
> > > the memfd but then other processes need some memory.  Obviously, if the
> > > memfd write itself comes up short or fails then we dump the memfd and
> > > error back to userspace.  We might simply have to free array memory
> > > while we iterate the records to minimize the time spent at peak memory
> > > usage.
> > > 
> > 
> > Hm, yeah. Some kind of fixed/relative size in-core memory pool approach
> > may simplify things because we could allocate it up front and know right
> > away whether we just don't have enough memory available to repair.
> 
> Hmm.  Apparently we actually /can/ call fallocate on memfd to grab all
> the pages at once, provided we have some guesstimate beforehand of how
> much space we think we'll need.
> 
> So long as my earlier statement about the memory requirements being no
> more than the size of the btree leaves is actually true (I haven't
> rigorously tried to prove it), we need about (xrep_calc_ag_resblks() *
> blocksize) worth of space in the memfd file.  Maybe we ask for 1.5x
> that and if we don't get it, we kill the memfd and exit.
> 

Indeed. It would be nice if we could do all of the file management bits
in userspace.
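
As a minimal sketch of what the userspace side of that might look like
(assuming the caller already has some size estimate in bytes; the function
name, the memfd name, and the idea that xfs_scrub would be the one doing
this are all hypothetical), the "ask for 1.5x and give up if we don't get
it" step could be:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create a memfd and reserve 1.5x the estimated space up front.
 * Returns the fd on success, or -1 if either the memfd or the
 * reservation could not be obtained (in which case nothing is leaked).
 */
static int repair_memfd_prealloc(off_t est_bytes)
{
	off_t want;
	int fd;

	assert(est_bytes > 0);	/* caller must supply a real estimate */
	want = est_bytes + est_bytes / 2;

	fd = memfd_create("repair-records", MFD_CLOEXEC);
	if (fd < 0)
		return -1;

	/* mode 0: allocate the pages now and extend the file to 'want'. */
	if (fallocate(fd, 0, 0, want) < 0) {
		close(fd);	/* kill the memfd and bail out */
		return -1;
	}
	return fd;
}
```

Failing here happens before any repair work has started, so the caller can
fall back to recommending xfs_repair without having touched anything.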

Brian

> --D
> 
> > 
> > Brian
> > 
> > > --D
> > > 
> > > > 
> > > > Brian
> > > > 
> > > > > --D
> > > > > 
> > > > > > Brian
> > > > > > 
> > > > > > > --D
> > > > > > > 
> > > > > > > > Brian
> > > > > > > > 
> > > > > > > > > --D
> > > > > > > > > 
> > > > > > > > > > Brian
> > > > > > > > > > 
> > > > > > > > > > > > > +
> > > > > > > > > > > > > +done:
> > > > > > > > > > > > > +	/* Free all the OWN_AG blocks that are not in the rmapbt/agfl. */
> > > > > > > > > > > > > +	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_AG);
> > > > > > > > > > > > > +	return xrep_reap_extents(sc, old_allocbt_blocks, &oinfo,
> > > > > > > > > > > > > +			XFS_AG_RESV_NONE);
> > > > > > > > > > > > > +}
> > > > > > > > > > > > > +
> > > > > > > > > > > > ...
> > > > > > > > > > > > > diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c
> > > > > > > > > > > > > index 0ed68379e551..82f99633a597 100644
> > > > > > > > > > > > > --- a/fs/xfs/xfs_extent_busy.c
> > > > > > > > > > > > > +++ b/fs/xfs/xfs_extent_busy.c
> > > > > > > > > > > > > @@ -657,3 +657,17 @@ xfs_extent_busy_ag_cmp(
> > > > > > > > > > > > >  		diff = b1->bno - b2->bno;
> > > > > > > > > > > > >  	return diff;
> > > > > > > > > > > > >  }
> > > > > > > > > > > > > +
> > > > > > > > > > > > > +/* Are there any busy extents in this AG? */
> > > > > > > > > > > > > +bool
> > > > > > > > > > > > > +xfs_extent_busy_list_empty(
> > > > > > > > > > > > > +	struct xfs_perag	*pag)
> > > > > > > > > > > > > +{
> > > > > > > > > > > > > +	spin_lock(&pag->pagb_lock);
> > > > > > > > > > > > > +	if (pag->pagb_tree.rb_node) {
> > > > > > > > > > > > 
> > > > > > > > > > > > RB_EMPTY_ROOT()?
> > > > > > > > > > > 
> > > > > > > > > > > Good suggestion, thank you!
> > > > > > > > > > > 
> > > > > > > > > > > --D
> > > > > > > > > > > 
> > > > > > > > > > > > Brian
> > > > > > > > > > > > 
> > > > > > > > > > > > > +		spin_unlock(&pag->pagb_lock);
> > > > > > > > > > > > > +		return false;
> > > > > > > > > > > > > +	}
> > > > > > > > > > > > > +	spin_unlock(&pag->pagb_lock);
> > > > > > > > > > > > > +	return true;
> > > > > > > > > > > > > +}
> > > > > > > > > > > > > diff --git a/fs/xfs/xfs_extent_busy.h b/fs/xfs/xfs_extent_busy.h
> > > > > > > > > > > > > index 990ab3891971..2f8c73c712c6 100644
> > > > > > > > > > > > > --- a/fs/xfs/xfs_extent_busy.h
> > > > > > > > > > > > > +++ b/fs/xfs/xfs_extent_busy.h
> > > > > > > > > > > > > @@ -65,4 +65,6 @@ static inline void xfs_extent_busy_sort(struct list_head *list)
> > > > > > > > > > > > >  	list_sort(NULL, list, xfs_extent_busy_ag_cmp);
> > > > > > > > > > > > >  }
> > > > > > > > > > > > >  
> > > > > > > > > > > > > +bool xfs_extent_busy_list_empty(struct xfs_perag *pag);
> > > > > > > > > > > > > +
> > > > > > > > > > > > >  #endif /* __XFS_EXTENT_BUSY_H__ */
> > > > > > > > > > > > > 

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 05/14] xfs: repair free space btrees
  2018-08-09 12:00                           ` Brian Foster
@ 2018-08-09 15:59                             ` Darrick J. Wong
  2018-08-10 10:33                               ` Brian Foster
  0 siblings, 1 reply; 53+ messages in thread
From: Darrick J. Wong @ 2018-08-09 15:59 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, david, allison.henderson

On Thu, Aug 09, 2018 at 08:00:28AM -0400, Brian Foster wrote:
> On Wed, Aug 08, 2018 at 03:42:32PM -0700, Darrick J. Wong wrote:
> > On Wed, Aug 08, 2018 at 08:29:54AM -0400, Brian Foster wrote:
> > > On Tue, Aug 07, 2018 at 04:34:58PM -0700, Darrick J. Wong wrote:
> > > > On Fri, Aug 03, 2018 at 06:49:40AM -0400, Brian Foster wrote:
> > > > > On Thu, Aug 02, 2018 at 12:22:05PM -0700, Darrick J. Wong wrote:
> > > > > > On Thu, Aug 02, 2018 at 09:48:24AM -0400, Brian Foster wrote:
> > > > > > > On Wed, Aug 01, 2018 at 11:28:45PM -0700, Darrick J. Wong wrote:
> > > > > > > > On Wed, Aug 01, 2018 at 02:39:20PM -0400, Brian Foster wrote:
> > > > > > > > > On Wed, Aug 01, 2018 at 09:23:16AM -0700, Darrick J. Wong wrote:
> > > > > > > > > > On Wed, Aug 01, 2018 at 07:54:09AM -0400, Brian Foster wrote:
> > > > > > > > > > > On Tue, Jul 31, 2018 at 03:01:25PM -0700, Darrick J. Wong wrote:
> > > > > > > > > > > > On Tue, Jul 31, 2018 at 01:47:23PM -0400, Brian Foster wrote:
> > > > > > > > > > > > > On Sun, Jul 29, 2018 at 10:48:21PM -0700, Darrick J. Wong wrote:
> ...
> > > > > > So, that early draft spent a lot of time tearing down and reconstructing
> > > > > > rmapbt cursors since the standard _btree_query_all isn't suited to that
> > > > > > kind of usage.  It was easily twice as slow on a RAM-backed disk just
> > > > > > from the rmap cursor overhead and much more complex, so I rewrote it to
> > > > > > be simpler.  I also have a slight preference for not touching anything
> > > > > > until we're absolutely sure we have all the data we need to repair the
> > > > > > structure.
> > > > > > 
> > > > > 
> > > > > Yes, I think that is sane in principle. I'm just a bit concerned about
> > > > > how reliable that xfs_repair-like approach will be in the kernel longer
> > > > > term, particularly once we start having to deal with large filesystems
> > > > > and limited or contended memory, etc. We already have xfs_repair users
> > > > > that need to tweak settings because there isn't enough memory available
> > > > > to repair the fs. Granted that is for fs-wide repairs and the flipside
> > > > > is that we know a single AG can only be up to 1TB. It's certainly
> > > > > possible that putting some persistent backing behind the in-core data is
> > > > > enough to resolve the problem (and the current approach is certainly
> > > > > reasonable enough to me for the initial implementation).
> > > > > 
> > > > > bjoin limitations aside, I wonder if a cursor roll mechanism that held
> > > > > all of the cursor buffers, rolled the transaction and then rejoined all
> > > > > said buffers would help us get around that. (Not sure I follow the early
> > > > > prototype behavior, but it sounds like we had to restart the rmapbt
> > > > > lookup over and over...).
> > > > 
> > > > Correct.
> > > > 
> > > > > Another caveat with that approach may be that I think we'd need to be
> > > > > sure that the reconstruction operation doesn't ever need to update the
> > > > > rmapbt while we're mid walk of the latter.
> > > > 
> > > > <nod> Looking even farther back in my notes, that was also an issue --
> > > > fixing the free list causes blocks to go on or off the agfl, which
> > > > causes rmapbt updates, which meant that the only way I could get
> > > > in-place updates to work was to re-lookup where we were in the btree and
> > > > also try to deal with any rmapbt entries that might have crept in as
> > > > a result of the record insertion.
> > > > 
> > > > Getting the concurrency right for each repair function looked like a
> > > > difficult problem to solve, but amassing all the records elsewhere and
> > > > rebuilding was easy to understand.
> > > > 
> > > 
> > > Yeah. This all points to this kind of strategy being too complex to be
> > > worth the prospective benefits in the short term. Clearly we have
> > > several, potentially tricky roadblocks to work through before this can
> > > be made feasible. Thanks for the background, it's still useful to have
> > > this context to compare with whatever we may have to do to support a
> > > reclaimable memory approach.
> > 
> > <nod>  Reclaimable memfd "memory" isn't too difficult, we can call
> > kernel_read and kernel_write, though lockdep gets pretty mad about xfs
> > taking sb_start_write (on the memfd filesystem) at the same time it has
> > sb_start_write on the xfs (not to mention the stack usage) so I had to
> > throw in the extra twist of delegating the actual file io to a workqueue
> > item (a la xfs_btree_split).
> > 
> 
> Ok, I'm more curious what the surrounding code looks like around
> managing the underlying file pages. Now that I think of it, the primary
> usage was to dump everything into the file and read it back
> sequentially,

Yep.  Simplified, the code is more or less:

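array_init(array)
{
	/* anonymous, swappable backing file */
	array->filp = shmem_file_create(...);
}

array_destroy(array)
{
	/* dropping the last reference frees the backing pages */
	fput(array->filp);
}

array_set(array, nr, ptr)
{
	/* kernel_write takes the file position by pointer */
	loff_t pos = nr * array->obj_size;

	kernel_write(array->filp, ptr, array->obj_size, &pos);
}

array_get(array, nr, ptr)
{
	/* kernel_read takes the file position by pointer */
	loff_t pos = nr * array->obj_size;

	kernel_read(array->filp, ptr, array->obj_size, &pos);
}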

That's leaving out all the bookkeeping and other weird details to show
pseudocode versions of the file manipulation calls.
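
For anyone who wants to poke at the idea outside the kernel, here is a
minimal userspace sketch of the same fixed-record array.  It is only an
analogue, not the code from the patches: memfd_create/pwrite/pread stand
in for the in-kernel file calls, and the farray_* names are made up for
illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <sys/mman.h>	/* memfd_create */
#include <sys/types.h>
#include <unistd.h>	/* pread, pwrite, close */

/* Hypothetical userspace stand-in for the fixed-record array. */
struct farray {
	int	fd;		/* memfd backing store */
	size_t	obj_size;	/* size of every record, in bytes */
};

/* Create the backing memfd; returns 0 on success, -1 on failure. */
static int farray_init(struct farray *a, size_t obj_size)
{
	assert(obj_size > 0);
	a->obj_size = obj_size;
	a->fd = memfd_create("farray", MFD_CLOEXEC);
	return a->fd < 0 ? -1 : 0;
}

/* Closing the only descriptor releases the backing pages. */
static void farray_destroy(struct farray *a)
{
	close(a->fd);
	a->fd = -1;
}

/* Store record number nr; a short write counts as failure. */
static int farray_set(struct farray *a, uint64_t nr, const void *rec)
{
	off_t pos = (off_t)(nr * a->obj_size);

	return pwrite(a->fd, rec, a->obj_size, pos) ==
			(ssize_t)a->obj_size ? 0 : -1;
}

/* Load record number nr; a short read counts as failure. */
static int farray_get(struct farray *a, uint64_t nr, void *rec)
{
	off_t pos = (off_t)(nr * a->obj_size);

	return pread(a->fd, rec, a->obj_size, pos) ==
			(ssize_t)a->obj_size ? 0 : -1;
}
```

Records are addressed purely by index, so the only state besides the
descriptor is the record size.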

I did end up playing a bit of sleight-of-hand with the file io, however
-- all the io is deferred to a workqueue for the dual purpose of
avoiding stack overflows in the memfd file's io paths and to avoid some
sort of deadlock in the page fault handler of the memfd write.  I didn't
investigate the deadlock too deeply, as solving the first problem seemed
to make the second go away.

> so perhaps this really isn't that difficult to deal with since the
> file content is presumably fixed size data structures.

Correct.  There is one user that needs variable-sized records (the
extended attribute repair) for which I've constructed the 'xblob' data
structure which stores blobs in a second memfd and returns the file
offset of a blob as a magic cookie that is recorded in the (fixed size)
attr keys.

Presumably the future directory rebuilder will use xblob too.
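
A rough userspace sketch of the blob-store idea, for illustration only.
The description above says just that blobs live in a second memfd and
that the file offset is handed back as a cookie; the length-prefixed
layout and the blob_* names below are assumptions, not the xblob code
itself.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <sys/mman.h>	/* memfd_create */
#include <sys/types.h>
#include <unistd.h>	/* pread, pwrite, close */

typedef int64_t blob_cookie_t;	/* file offset of a blob's length header */

/* Hypothetical append-only blob store backed by its own memfd. */
struct blobstore {
	int	fd;	/* memfd holding all blobs */
	off_t	next;	/* offset where the next blob will be written */
};

static int blob_init(struct blobstore *b)
{
	b->next = 0;
	b->fd = memfd_create("blobs", MFD_CLOEXEC);
	return b->fd < 0 ? -1 : 0;
}

static void blob_destroy(struct blobstore *b)
{
	close(b->fd);
	b->fd = -1;
}

/* Append a blob as [u32 length][bytes]; returns its cookie or -1. */
static blob_cookie_t blob_put(struct blobstore *b, const void *data,
			      uint32_t len)
{
	off_t pos = b->next;

	if (pwrite(b->fd, &len, sizeof(len), pos) != (ssize_t)sizeof(len))
		return -1;
	if (len && pwrite(b->fd, data, len, pos + (off_t)sizeof(len)) !=
			(ssize_t)len)
		return -1;
	b->next = pos + (off_t)sizeof(len) + (off_t)len;
	return pos;
}

/* Read back the blob named by cookie; returns its length or -1. */
static ssize_t blob_get(struct blobstore *b, blob_cookie_t cookie,
			void *buf, size_t bufsz)
{
	uint32_t len;

	if (pread(b->fd, &len, sizeof(len), cookie) != (ssize_t)sizeof(len))
		return -1;
	if (len > bufsz)
		return -1;
	if (len && pread(b->fd, buf, len, cookie + (off_t)sizeof(len)) !=
			(ssize_t)len)
		return -1;
	return (ssize_t)len;
}
```

The fixed-size key record then only needs to carry the cookie, which
keeps the key array itself fixed-width.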

> (Hmm, was there a sort in there somewhere as well?).

Yes.  I spent a couple of days implementing a hybrid quicksort/insertion
sort that won't blow out the call stack.
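
To illustrate the general technique (not the implementation from the
patches), here is a sketch of a quicksort that keeps its own small
stack of pending ranges and finishes short ranges with insertion sort.
It works on a plain in-memory array of uint64_t rather than the
file-backed array, and the cutoff and stack size are assumed values.

```c
#include <stddef.h>
#include <stdint.h>

#define SMALL_RUN	8	/* assumed cutoff for insertion sort */
#define MAX_STACK	64	/* ample for any range of 64-bit indices */

/* Sort a[lo..hi] (inclusive) by insertion; does nothing if lo >= hi. */
static void insertion_sort(uint64_t *a, ptrdiff_t lo, ptrdiff_t hi)
{
	for (ptrdiff_t i = lo + 1; i <= hi; i++) {
		uint64_t v = a[i];
		ptrdiff_t j = i;

		while (j > lo && a[j - 1] > v) {
			a[j] = a[j - 1];
			j--;
		}
		a[j] = v;
	}
}

/* Quicksort with an explicit stack instead of recursion. */
static void hybrid_sort(uint64_t *a, size_t n)
{
	ptrdiff_t lo_stack[MAX_STACK], hi_stack[MAX_STACK];
	int top = 0;

	if (n < 2)
		return;
	lo_stack[top] = 0;
	hi_stack[top++] = (ptrdiff_t)n - 1;

	while (top > 0) {
		ptrdiff_t lo = lo_stack[--top];
		ptrdiff_t hi = hi_stack[top];

		while (hi - lo + 1 > SMALL_RUN) {
			uint64_t pivot = a[lo + (hi - lo) / 2];
			ptrdiff_t i = lo, j = hi;

			/* Partition around the middle element. */
			while (i <= j) {
				while (a[i] < pivot)
					i++;
				while (a[j] > pivot)
					j--;
				if (i <= j) {
					uint64_t t = a[i];

					a[i++] = a[j];
					a[j--] = t;
				}
			}

			/*
			 * Defer the larger side and keep working on the
			 * smaller one, so the stack depth stays
			 * logarithmic in n.
			 */
			if (j - lo < hi - i) {
				if (i < hi) {
					lo_stack[top] = i;
					hi_stack[top++] = hi;
				}
				hi = j;
			} else {
				if (lo < j) {
					lo_stack[top] = lo;
					hi_stack[top++] = j;
				}
				lo = i;
			}
		}
		insertion_sort(a, lo, hi);
	}
}
```

Always continuing with the smaller partition is what bounds the pending
stack; the insertion sort pass just avoids partitioning tiny ranges.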

> > > > > That may be an issue for inode btree reconstruction, for example,
> > > > > since it looks like inobt block allocation requires rmapbt updates.
> > > > > We'd probably need some way to share (or invalidate) a cursor across
> > > > > different contexts to deal with that.
> > > > 
> > > > I might pursue that strategy if we ever hit the point where we can't
> > > > find space to store the records (see below).  Another option could be to
> > > > divert all deferred items for an AG, build a replacement btree in new
> > > > space, then finish all the deferred items... but that's starting to get
> > > > into offlineable AGs, which is its own project that I want to tackle
> > > > later.
> > > > 
> > > > (Not that much later, just not this cycle.)
> > > > 
> > > 
> > > *nod*
> > > 
> > > > > > For other repair functions (like the data/attr fork repairs) we have to
> > > > > > scan all the rmapbts for extents, and I'd prefer to lock those AGs only
> > > > > > for as long as necessary to extract the extents we want.
> > > > > > 
> > > > > > > The obvious problem is that we still have some checks that allow the
> > > > > > > whole repair operation to bail out before we determine whether we can
> > > > > > > start to rebuild the on-disk btrees. These are things like making sure
> > > > > > > we can actually read the associated rmapbt blocks (i.e., no read errors
> > > > > > > or verifier failures), basic record sanity checks, etc. But ISTM that
> > > > > > > isn't anything we couldn't get around with a multi-pass implementation.
> > > > > > > Secondary issues might be things like no longer being able to easily
> > > > > > > insert the longest free extent range(s) first (meaning we'd have to
> > > > > > > stuff the agfl with old btree blocks or figure out some other approach).
> > > > > > 
> > > > > > Well, you could scan the rmapbt twice -- once to find the longest
> > > > > > record, then again to do the actual insertion.
> > > > > > 
> > > > > 
> > > > > Yep, that's what I meant by multi-pass.
> > > > > 
> > > > > > > BTW, isn't the typical scrub sequence already multi-pass by virtue of
> > > > > > > the xfs_scrub_metadata() implementation? I'm wondering if the ->scrub()
> > > > > > > callout could not only detect corruption, but validate whether repair
> > > > > > > (if requested) is possible based on the kind of checks that are
> > > > > > > currently in the repair side rmapbt walkers. Thoughts?
> > > > > > 
> > > > > > Yes, scrub basically validates that for us now, with the notable
> > > > > > exception of the notorious rmapbt scrubber, which doesn't
> > > > > > cross-reference with inode block mappings because that would be a
> > > > > > locking nightmare.
> > > > > > 
> > > > > > > Are there future
> > > > > > > changes that are better supported by an in-core tracking structure in
> > > > > > > general (assuming we'll eventually replace the linked lists with
> > > > > > > something more efficient) as opposed to attempting to optimize out the
> > > > > > > need for that tracking at all?
> > > > > > 
> > > > > > Well, I was thinking that we could just allocate a memfd (or a file on
> > > > > > the same xfs once we have AG offlining) and store the records in there.
> > > > > > That saves us the list_head overhead and potentially enables access to a
> > > > > > lot more storage than pinning things in RAM.
> > > > > > 
> > > > > 
> > > > > Would using the same fs mean we have to store the repair data in a
> > > > > separate AG, or somehow locate/use free space in the target AG?
> > > > 
> > > > As part of building an "offline AG" feature we'd presumably have to
> > > > teach the allocators to avoid the offline AGs for allocations, which
> > > > would make it so that we could host the repair data files in the same
> > > > XFS that's being fixed.  That seems a little risky to me, but the disk
> > > > is probably larger than mem+swap.
> > > > 
> > > 
> > > Got it, so we'd use the remaining space in the fs outside of the target
> > > AG. ISTM that still presumes the rest of the fs is coherent, but I
> > > suppose the offline AG thing helps us with that. We'd just have to make
> > > sure we've shut down all currently corrupted AGs before we start to
> > > repair a particular corrupted one, and then hope there's still enough
> > > free space in the fs to proceed.
> > 
> > That's a pretty big hope. :)  I think for now 
> > 
> > > That makes more sense, but I still agree that it seems risky in general.
> > > Technical risk aside, there's also usability concerns in that the local
> > > free space requirement is another bit of non-determinism
> > 
> > I don't think it's non-deterministic, it's just hard for the filesystem
> > to communicate to the user/admin ahead of time.  Roughly speaking, we
> > need to have about as much disk space for the new btree as we had
> > allocated for the old one.
> > 
> 
> Right, maybe non-deterministic is not the best term. What I mean is that
> it's not clear to the user why a particular filesystem may not be able
> to run a repair (e.g., if it has plenty of reported free space but
> enough AGs may be shut down due to corruption). So in certain scenarios
> an unrelated corruption or particular ordering of AG repairs could be
> the difference between whether an online repair succeeds or defers to
> offline repair on the otherwise same filesystem. 

<nod>

> > As far as memory requirements go, in last week's revising of the patches
> > I compressed the in-memory record structs down about as far as possible;
> > with the removal of the list heads, the memory requirements drop by
> > 30-60%.  We require the same amount of memory as would be needed to
> > store all of the records in the leaf nodes, and no more, and we can use
> > swap space to do it.
> > 
> 
> Nice. When looking at the existing structures it looked like a worst
> case (1TB AG, every other 1k block allocated) could require up to
> 10-12GB RAM (but I could have easily messed that up).

Sounds about right.  1TB AG = 268 million 4k blocks

bnobt: 8-byte records, or ~2.1GB of memory
inobt: 16-byte records, or ~4.3GB of memory
refcountbt: 12-byte records, or ~3.2GB of memory
rmapbt: 24-byte records, or ~6.4GB of memory

Multiply by 4 for a 1k block filesystem, divide by 16 for a 64k block fs.

Note that if the AG is full and heavily shared then the rmapbt
requirements can exceed that, but that's a known property of rmap in
general.
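
The arithmetic behind those figures can be written out as a small
sketch.  It assumes the same worst case as the estimates above (one
record per filesystem block in the AG); the function names are made up
for illustration, and the record sizes are the ones listed above.

```c
#include <stdint.h>

/* Number of filesystem blocks in an AG of the given size. */
static uint64_t ag_blocks(uint64_t ag_bytes, uint32_t block_size)
{
	return ag_bytes / block_size;
}

/*
 * Bytes needed to hold one record per block, i.e. the assumed worst
 * case for a btree with rec_size-byte records.
 */
static uint64_t worst_case_record_bytes(uint64_t ag_bytes,
					uint32_t block_size,
					uint32_t rec_size)
{
	return ag_blocks(ag_bytes, block_size) * rec_size;
}
```

For a 1TB (2^40 byte) AG with 4k blocks that is 268,435,456 blocks, so
8-byte records need 2,147,483,648 bytes, 12-byte records 3,221,225,472,
16-byte records 4,294,967,296, and 24-byte records 6,442,450,944.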

> That's not insane on its own, it's just the question of allocating
> that much memory in the kernel. Slimming that down and pushing it into
> something swappable doesn't _sound_ too overbearing. I'm not really
> sure what default distro swap sizes are these days (some % of RAM?),

I think so?  I think RH/CentOS/OL default to the size of RAM + 2GB
nowadays, and Ubuntu seems to do RAM+sqrt(RAM)?

> but it shouldn't be that hard to find ~10GB of disk space somewhere to
> facilitate a repair.
>
> > > around the ability to online repair vs. having to punt to xfs_repair,
> > > or if the repair consumes whatever free space remains in the fs to the
> > > detriment of whatever workload the user presumably wanted to keep the
> > > fs online for, etc.
> > 
> > I've occasionally thought that future xfs_scrub could ask the kernel to
> > estimate how much disk and memory it will need for the repair (and
> > whether the disk space requirement is fs-scope or AG-scope); then it
> > could forego a repair action and recommend xfs_repair if running the
> > online repair would take the system below some configurable threshold.
> > 
> 
> I think something like that would improve usability once we nail down
> the core mechanism.

Ok, I'll put it on my list of things to do.

> > > > > presume either way we'd have to ensure that AG is either consistent or
> > > > > locked out from outside I/O. If we have the total record count we can
> > > > 
> > > > We usually don't, but for the btrees that have their own record/blocks
> > > > counters we might be able to guess a number, fallocate it, and see if
> > > > that doesn't ENOSPC.
> > > > 
> > > > > preallocate the file and hope there is no such other free space
> > > > > corruption or something that would allow some other task to mess with
> > > > > our blocks. I'm a little skeptical overall on relying on a corrupted
> > > > > filesystem to store repair data, but perhaps there are ways to mitigate
> > > > > the risks.
> > > > 
> > > > Store it elsewhere?  /home for root repairs, /root for any other
> > > > repair... though if we're going to do that, why not just add a swap file
> > > > temporarily?
> > > > 
> > > 
> > > Indeed. The thought crossed my mind about whether we could do something
> > > like have an internal/isolated swap file for dedicated XFS allocations
> > > to avoid contention with the traditional swap.
> > 
> > Heh, I think e2fsck has some feature like that where you can pass it a
> > swap file.  No idea how much good that does on modern systems where
> > there's one huge partition... :)
> > 
> 
> Interesting. Couldn't you always create an additional swap file, run the
> repair then kill it off when it's no longer needed?

Yes, though as I think you said in an earlier reply, it would be nice to
have our own private swap file instead of risking some other process
taking it.

> > > Userspace could somehow set it up or communicate to the kernel. I have
> > > no idea how realistic that is though or if there's a better interface
> > > for that kind of thing (i.e., file backed kmem cache?).
> > 
> > I looked, and there aren't any other mechanisms for unpinned kernel
> > memory allocations.
> > 
> 
> Ok, it looks like swap or traditional files it is then. ;P
> 
> > > What _seems_ beneficial about that approach is we get (potentially
> > > external) persistent backing and memory reclaim ability with the
> > > traditional memory allocation model.
> > >
> > > ISTM that if we used a regular file, we'd need to deal with the
> > > traditional file interface somehow or another (file read/pagecache
> > > lookup -> record ??).
> > 
> > Yes, that's all neatly wrapped up in kernel_read() and kernel_write() so
> > all we need is a (struct file *).
> > 
> > > We could repurpose some existing mechanism like the directory code or
> > > quota inode mechanism to use xfs buffers for that purpose, but I think
> > > that would require us to always use an internal inode. Allowing
> > > userspace to pass an fd/file passes that consideration on to the user,
> > > which might be more flexible. We could always warn about additional
> > > limitations if that fd happens to be based on the target fs.
> > 
> > <nod> A second advantage of the struct file/kernel_{read,write} approach
> > is that if we ever decide to let userspace pass in a fd, it's trivial
> > to feed that struct file to the kernel io routines instead of a memfd
> > one.
> > 
> 
> Yeah, I like this flexibility. In fact, I'm wondering why we wouldn't do
> something like this anyways. Could/should xfs_scrub be responsible for
> allocating a memfd and passing along the fd? Another advantage of doing
> that is whatever logic we may need to clean up old repair files or
> whatever is pushed to userspace.

There are two ways we could do this -- one is to have the kernel manage
the memfd creation internally (like my patches do now); the other is for
xfs_scrub to pass in creat(O_TMPFILE).

When repair fputs the file (or fdputs the fd if we switch to using
that), the kernel will perform the usual deletion of the zero-linkcount
zero-refcount file.  We get all the "cleanup" for free by closing the
file.

One other potential complication is that a couple of the repair
functions need two memfds.  The extended attribute repair creates a
fixed-record array for attr keys and an xblob to hold names and values;
each structure gets its own memfd.  The refcount repair creates two
fixed-record arrays, one for refcount records and another to act as a
stack of rmaps to compute reference counts.

(In theory the xbitmap could also be converted to use the fixed record
array, but in practice they haven't (yet) become large enough to warrant
it, and there's currently no way to insert or delete records from the
middle of the array.)

> > > > > I'm not familiar with memfd. The manpage suggests it's ram backed, is it
> > > > > swappable or something?
> > > > 
> > > > It's supposed to be.  The quick test I ran (allocate a memfd, write 1GB
> > > > of junk to it on a VM with 400M of RAM) seemed to push about 980MB into
> > > > the swap file.
> > > > 
> > > 
> > > Ok.
> > > 
> > > > > If so, that sounds a reasonable option provided the swap space
> > > > > requirement can be made clear to users
> > > > 
> > > > We can document it.  I don't think it's any worse than xfs_repair being
> > > > able to use up all the memory + swap... and since we're probably only
> > > > going to be repairing one thing at a time, most likely scrub won't need
> > > > as much memory.
> > > > 
> > > 
> > > Right, but as noted below, my concerns with the xfs_repair comparison
> > > are that 1.) the kernel generally has more of a limit on anonymous
> > > memory allocations than userspace (i.e., not swappable AFAIU?) and 2.)
> > > it's not clear how effectively running the system out of memory via the
> > > kernel will behave from a failure perspective.
> > > 
> > > IOW, xfs_repair can run the system out of memory but for the most part
> > > that ends up being a simple problem for the system: OOM kill the bloated
> > > xfs_repair process. For an online repair in a similar situation, I have
> > > no idea what's going to happen.
> > 
> > Back in the days of the huge linked lists the oom killer would target
> > other processes because it doesn't know that the online repair thread is
> > sitting on a ton of pinned kernel memory...
> > 
> 
> Makes sense, kind of what I'd expect...
> 
> > > The hope is that the online repair hits -ENOMEM and unwinds, but ISTM
> > > we'd still be at risk of other subsystems running into memory
> > > allocation problems, filling up swap, the OOM killer going after
> > > unrelated processes, etc.  What if, for example, the OOM killer starts
> > > picking off processes in service to a running online repair that
> > > immediately consumes freed up memory until the system is borked?
> > 
> > Yeah.  One thing we /could/ do is register an oom notifier that would
> > urge any running repair threads to bail out if they can.  It seems to me
> > that the oom killer blocks on the oom_notify_list chain, so our handler
> > could wait until at least one thread exits before returning.
> > 
> 
> Ok, something like that could be useful. I agree that we probably don't
> need to go that far until the mechanism is nailed down and testing shows
> that OOM is a problem.

It already is a problem on my contrived "2TB hardlink/reflink farm fs" +
"400M of RAM and no swap" scenario.  Granted, pretty much every other
xfs utility also blows out on that so I'm not sure how hard I really
need to try...

> > > I don't know how likely that is or if it really ends up much different
> > > from the analogous xfs_repair situation. My only point right now is
> > > that failure scenario is something we should explore for any solution
> > > we ultimately consider because it may be an unexpected use case of the
> > > underlying mechanism.
> > 
> > Ideally, online repair would always be the victim since we know we have
> > a reasonable fallback.  At least for memfd, however, I think the only
> > clues we have to decide the question "is this memfd getting in the way
> > of other threads?" is either seeing ENOMEM, short writes, or getting
> > kicked by an oom notification.  Maybe that'll be enough?
> > 
> 
> Hm, yeah. It may be challenging to track memfd usage as such. If
> userspace has access to the fd on an OOM notification or whatever, it
> might be able to do more accurate analysis based on an fstat() or
> something.
> 
> Related question... is the online repair sequence currently
> interruptible, if xfs_scrub receives a fatal signal while pulling in
> entries during an allocbt scan for example?

It's interruptible (fatal signals only) during the scan phase, but once
it starts logging metadata updates it will run all the way to
completion.

> > > (To the contrary, just using a cached file seems a natural fit from
> > > that perspective.)
> > 
> > Same here.
> > 
> > > > > and the failure characteristics aren't more severe than for userspace.
> > > > > An online repair that puts the broader system at risk of OOM as
> > > > > opposed to predictably failing gracefully may not be the most useful
> > > > > tool.
> > > > 
> > > > Agreed.  One huge downside of memfd seems to be the lack of a mechanism
> > > > for the vm to push back on us if we successfully write all we need to
> > > > the memfd but then other processes need some memory.  Obviously, if the
> > > > memfd write itself comes up short or fails then we dump the memfd and
> > > > error back to userspace.  We might simply have to free array memory
> > > > while we iterate the records to minimize the time spent at peak memory
> > > > usage.
> > > > 
> > > 
> > > Hm, yeah. Some kind of fixed/relative size in-core memory pool approach
> > > may simplify things because we could allocate it up front and know right
> > > away whether we just don't have enough memory available to repair.
> > 
> > Hmm.  Apparently we actually /can/ call fallocate on memfd to grab all
> > the pages at once, provided we have some guesstimate beforehand of how
> > much space we think we'll need.
> > 
> > So long as my earlier statement about the memory requirements being no
> > more than the size of the btree leaves is actually true (I haven't
> > rigorously tried to prove it), we need about (xrep_calc_ag_resblks() *
> > blocksize) worth of space in the memfd file.  Maybe we ask for 1.5x
> > that and if we don't get it, we kill the memfd and exit.
> > 
> 
> Indeed. It would be nice if we could do all of the file management bits
> in userspace.

Agreed, though no file management would be even better. :)

--D

> Brian
> 
> > --D
> > 
> > > 
> > > Brian
> > > 
> > > > --D
> > > > 
> > > > > 
> > > > > Brian
> > > > > 
> > > > > > --D
> > > > > > 
> > > > > > > Brian
> > > > > > > 
> > > > > > > > --D
> > > > > > > > 
> > > > > > > > > Brian
> > > > > > > > > 
> > > > > > > > > > --D
> > > > > > > > > > 
> > > > > > > > > > > Brian
> > > > > > > > > > > 
> > > > > > > > > > > > > > +
> > > > > > > > > > > > > > +done:
> > > > > > > > > > > > > > +	/* Free all the OWN_AG blocks that are not in the rmapbt/agfl. */
> > > > > > > > > > > > > > +	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_AG);
> > > > > > > > > > > > > > +	return xrep_reap_extents(sc, old_allocbt_blocks, &oinfo,
> > > > > > > > > > > > > > +			XFS_AG_RESV_NONE);
> > > > > > > > > > > > > > +}
> > > > > > > > > > > > > > +
> > > > > > > > > > > > > ...
> > > > > > > > > > > > > > diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c
> > > > > > > > > > > > > > index 0ed68379e551..82f99633a597 100644
> > > > > > > > > > > > > > --- a/fs/xfs/xfs_extent_busy.c
> > > > > > > > > > > > > > +++ b/fs/xfs/xfs_extent_busy.c
> > > > > > > > > > > > > > @@ -657,3 +657,17 @@ xfs_extent_busy_ag_cmp(
> > > > > > > > > > > > > >  		diff = b1->bno - b2->bno;
> > > > > > > > > > > > > >  	return diff;
> > > > > > > > > > > > > >  }
> > > > > > > > > > > > > > +
> > > > > > > > > > > > > > +/* Are there any busy extents in this AG? */
> > > > > > > > > > > > > > +bool
> > > > > > > > > > > > > > +xfs_extent_busy_list_empty(
> > > > > > > > > > > > > > +	struct xfs_perag	*pag)
> > > > > > > > > > > > > > +{
> > > > > > > > > > > > > > +	spin_lock(&pag->pagb_lock);
> > > > > > > > > > > > > > +	if (pag->pagb_tree.rb_node) {
> > > > > > > > > > > > > 
> > > > > > > > > > > > > RB_EMPTY_ROOT()?
> > > > > > > > > > > > 
> > > > > > > > > > > > Good suggestion, thank you!
> > > > > > > > > > > > 
> > > > > > > > > > > > --D
> > > > > > > > > > > > 
> > > > > > > > > > > > > Brian
> > > > > > > > > > > > > 
> > > > > > > > > > > > > > +		spin_unlock(&pag->pagb_lock);
> > > > > > > > > > > > > > +		return false;
> > > > > > > > > > > > > > +	}
> > > > > > > > > > > > > > +	spin_unlock(&pag->pagb_lock);
> > > > > > > > > > > > > > +	return true;
> > > > > > > > > > > > > > +}
> > > > > > > > > > > > > > diff --git a/fs/xfs/xfs_extent_busy.h b/fs/xfs/xfs_extent_busy.h
> > > > > > > > > > > > > > index 990ab3891971..2f8c73c712c6 100644
> > > > > > > > > > > > > > --- a/fs/xfs/xfs_extent_busy.h
> > > > > > > > > > > > > > +++ b/fs/xfs/xfs_extent_busy.h
> > > > > > > > > > > > > > @@ -65,4 +65,6 @@ static inline void xfs_extent_busy_sort(struct list_head *list)
> > > > > > > > > > > > > >  	list_sort(NULL, list, xfs_extent_busy_ag_cmp);
> > > > > > > > > > > > > >  }
> > > > > > > > > > > > > >  
> > > > > > > > > > > > > > +bool xfs_extent_busy_list_empty(struct xfs_perag *pag);
> > > > > > > > > > > > > > +
> > > > > > > > > > > > > >  #endif /* __XFS_EXTENT_BUSY_H__ */
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > --
> > > > > > > > > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > > > > > > > > > > > the body of a message to majordomo@vger.kernel.org
> > > > > > > > > > > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 05/14] xfs: repair free space btrees
  2018-08-09 15:59                             ` Darrick J. Wong
@ 2018-08-10 10:33                               ` Brian Foster
  2018-08-10 15:39                                 ` Darrick J. Wong
  0 siblings, 1 reply; 53+ messages in thread
From: Brian Foster @ 2018-08-10 10:33 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, david, allison.henderson

On Thu, Aug 09, 2018 at 08:59:59AM -0700, Darrick J. Wong wrote:
> On Thu, Aug 09, 2018 at 08:00:28AM -0400, Brian Foster wrote:
> > On Wed, Aug 08, 2018 at 03:42:32PM -0700, Darrick J. Wong wrote:
> > > On Wed, Aug 08, 2018 at 08:29:54AM -0400, Brian Foster wrote:
> > > > On Tue, Aug 07, 2018 at 04:34:58PM -0700, Darrick J. Wong wrote:
> > > > > On Fri, Aug 03, 2018 at 06:49:40AM -0400, Brian Foster wrote:
> > > > > > On Thu, Aug 02, 2018 at 12:22:05PM -0700, Darrick J. Wong wrote:
> > > > > > > On Thu, Aug 02, 2018 at 09:48:24AM -0400, Brian Foster wrote:
> > > > > > > > On Wed, Aug 01, 2018 at 11:28:45PM -0700, Darrick J. Wong wrote:
> > > > > > > > > On Wed, Aug 01, 2018 at 02:39:20PM -0400, Brian Foster wrote:
> > > > > > > > > > On Wed, Aug 01, 2018 at 09:23:16AM -0700, Darrick J. Wong wrote:
> > > > > > > > > > > On Wed, Aug 01, 2018 at 07:54:09AM -0400, Brian Foster wrote:
> > > > > > > > > > > > On Tue, Jul 31, 2018 at 03:01:25PM -0700, Darrick J. Wong wrote:
> > > > > > > > > > > > > On Tue, Jul 31, 2018 at 01:47:23PM -0400, Brian Foster wrote:
> > > > > > > > > > > > > > On Sun, Jul 29, 2018 at 10:48:21PM -0700, Darrick J. Wong wrote:
...
> > > > What _seems_ beneficial about that approach is we get (potentially
> > > > external) persistent backing and memory reclaim ability with the
> > > > traditional memory allocation model.
> > > >
> > > > ISTM that if we used a regular file, we'd need to deal with the
> > > > traditional file interface somehow or another (file read/pagecache
> > > > lookup -> record ??).
> > > 
> > > Yes, that's all neatly wrapped up in kernel_read() and kernel_write() so
> > > all we need is a (struct file *).
> > > 
> > > > We could repurpose some existing mechanism like the directory code or
> > > > quota inode mechanism to use xfs buffers for that purpose, but I think
> > > > that would require us to always use an internal inode. Allowing
> > > > userspace to pass an fd/file passes that consideration on to the user,
> > > > which might be more flexible. We could always warn about additional
> > > > limitations if that fd happens to be based on the target fs.
> > > 
> > > <nod> A second advantage of the struct file/kernel_{read,write} approach
> > > is that if we ever decide to let userspace pass in a fd, it's trivial
> > > to feed that struct file to the kernel io routines instead of a memfd
> > > one.
> > > 
> > 
> > Yeah, I like this flexibility. In fact, I'm wondering why we wouldn't do
> > something like this anyways. Could/should xfs_scrub be responsible for
> > allocating a memfd and passing along the fd? Another advantage of doing
> > that is whatever logic we may need to clean up old repair files or
> > whatever is pushed to userspace.
> 
> There are two ways we could do this -- one is to have the kernel manage
> the memfd creation internally (like my patches do now); the other is for
> xfs_scrub to pass in creat(O_TMPFILE).
> 
> When repair fputs the file (or fdputs the fd if we switch to using
> that), the kernel will perform the usual deletion of the zero-linkcount
> zero-refcount file.  We get all the "cleanup" for free by closing the
> file.
> 

Ok. FWIW, the latter approach where xfs_scrub creates a file and passes
the fd along to the kernel seems preferable to me, but perhaps others
have different opinions. We could accept a pathname from the user to
create the file or otherwise attempt to allocate a memfd by default and
pass that along.

> One other potential complication is that a couple of the repair
> functions need two memfds.  The extended attribute repair creates a
> fixed-record array for attr keys and an xblob to hold names and values;
> each structure gets its own memfd.  The refcount repair creates two
> fixed-record arrays, one for refcount records and another to act as a
> stack of rmaps to compute reference counts.
> 

Hmm, I guess there's nothing stopping scrub from passing in two fds.
Maybe it would make more sense for the userspace option to be a path
basename or directory where scrub is allowed to create whatever scratch
files it needs.

That aside, is there any reason the repair mechanism couldn't emulate
multiple files with a single fd via a magic offset delimiter or
something? E.g., "file 1" starts at offset 0, "file 2" starts at offset
1TB, etc. (1TB is probably overkill, but you get the idea..).

Brian

> (In theory the xbitmap could also be converted to use the fixed record
> array, but in practice they haven't (yet) become large enough to warrant
> it, and there's currently no way to insert or delete records from the
> middle of the array.)
> 
> > > > > > I'm not familiar with memfd. The manpage suggests it's ram backed, is it
> > > > > > swappable or something?
> > > > > 
> > > > > It's supposed to be.  The quick test I ran (allocate a memfd, write 1GB
> > > > > of junk to it on a VM with 400M of RAM) seemed to push about 980MB into
> > > > > the swap file.
> > > > > 
> > > > 
> > > > Ok.
> > > > 
> > > > > > If so, that sounds a reasonable option provided the swap space
> > > > > > requirement can be made clear to users
> > > > > 
> > > > > We can document it.  I don't think it's any worse than xfs_repair being
> > > > > able to use up all the memory + swap... and since we're probably only
> > > > > going to be repairing one thing at a time, most likely scrub won't need
> > > > > as much memory.
> > > > > 
> > > > 
> > > > Right, but as noted below, my concerns with the xfs_repair comparison
> > > > are that 1.) the kernel generally has more of a limit on anonymous
> > > > memory allocations than userspace (i.e., not swappable AFAIU?) and 2.)
> > > > it's not clear how effectively running the system out of memory via the
> > > > kernel will behave from a failure perspective.
> > > > 
> > > > IOW, xfs_repair can run the system out of memory but for the most part
> > > > that ends up being a simple problem for the system: OOM kill the bloated
> > > > xfs_repair process. For an online repair in a similar situation, I have
> > > > no idea what's going to happen.
> > > 
> > > Back in the days of the huge linked lists the oom killer would target
> > > other processes because it doesn't know that the online repair thread is
> > > sitting on a ton of pinned kernel memory...
> > > 
> > 
> > Makes sense, kind of what I'd expect...
> > 
> > > > The hope is that the online repair hits -ENOMEM and unwinds, but ISTM
> > > > we'd still be at risk of other subsystems running into memory
> > > > allocation problems, filling up swap, the OOM killer going after
> > > > unrelated processes, etc.  What if, for example, the OOM killer starts
> > > > picking off processes in service to a running online repair that
> > > > immediately consumes freed up memory until the system is borked?
> > > 
> > > Yeah.  One thing we /could/ do is register an oom notifier that would
> > > urge any running repair threads to bail out if they can.  It seems to me
> > > that the oom killer blocks on the oom_notify_list chain, so our handler
> > > could wait until at least one thread exits before returning.
> > > 
> > 
> > Ok, something like that could be useful. I agree that we probably don't
> > need to go that far until the mechanism is nailed down and testing shows
> > that OOM is a problem.
> 
> It already is a problem on my contrived "2TB hardlink/reflink farm fs" +
> "400M of RAM and no swap" scenario.  Granted, pretty much every other
> xfs utility also blows out on that so I'm not sure how hard I really
> need to try...
> 
> > > > I don't know how likely that is or if it really ends up much different
> > > > from the analogous xfs_repair situation. My only point right now is
> > > > that failure scenario is something we should explore for any solution
> > > > we ultimately consider because it may be an unexpected use case of the
> > > > underlying mechanism.
> > > 
> > > Ideally, online repair would always be the victim since we know we have
> > > a reasonable fallback.  At least for memfd, however, I think the only
> > > clues we have to decide the question "is this memfd getting in the way
> > > of other threads?" are seeing ENOMEM, short writes, or getting
> > > kicked by an oom notification.  Maybe that'll be enough?
> > > 
> > 
> > Hm, yeah. It may be challenging to track memfd usage as such. If
> > userspace has access to the fd on an OOM notification or whatever, it
> > might be able to do more accurate analysis based on an fstat() or
> > something.
> > 
> > Related question... is the online repair sequence currently
> > interruptible, if xfs_scrub receives a fatal signal while pulling in
> > entries during an allocbt scan for example?
> 
> It's interruptible (fatal signals only) during the scan phase, but once
> it starts logging metadata updates it will run all the way to
> completion.
> 
> > > > (To the contrary, just using a cached file seems a natural fit from
> > > > that perspective.)
> > > 
> > > Same here.
> > > 
> > > > > > and the failure characteristics aren't more severe than for userspace.
> > > > > > An online repair that puts the broader system at risk of OOM as
> > > > > > opposed to predictably failing gracefully may not be the most useful
> > > > > > tool.
> > > > > 
> > > > > Agreed.  One huge downside of memfd seems to be the lack of a mechanism
> > > > > for the vm to push back on us if we successfully write all we need to
> > > > > the memfd but then other processes need some memory.  Obviously, if the
> > > > > memfd write itself comes up short or fails then we dump the memfd and
> > > > > error back to userspace.  We might simply have to free array memory
> > > > > while we iterate the records to minimize the time spent at peak memory
> > > > > usage.
> > > > > 
> > > > 
> > > > Hm, yeah. Some kind of fixed/relative size in-core memory pool approach
> > > > may simplify things because we could allocate it up front and know right
> > > > away whether we just don't have enough memory available to repair.
> > > 
> > > Hmm.  Apparently we actually /can/ call fallocate on memfd to grab all
> > > the pages at once, provided we have some guesstimate beforehand of how
> > > much space we think we'll need.
> > > 
> > > So long as my earlier statement about the memory requirements being no
> > > more than the size of the btree leaves is actually true (I haven't
> > > rigorously tried to prove it), we need about (xrep_calc_ag_resblks() *
> > > blocksize) worth of space in the memfd file.  Maybe we ask for 1.5x
> > > that and if we don't get it, we kill the memfd and exit.
> > > 
> > 
> > Indeed. It would be nice if we could do all of the file management bits
> > in userspace.
> 
> Agreed, though no file management would be even better. :)
> 
> --D
> 
> > Brian
> > 
> > > --D
> > > 
> > > > 
> > > > Brian
> > > > 
> > > > > --D
> > > > > 
> > > > > > 
> > > > > > Brian
> > > > > > 
> > > > > > > --D
> > > > > > > 
> > > > > > > > Brian
> > > > > > > > 
> > > > > > > > > --D
> > > > > > > > > 
> > > > > > > > > > Brian
> > > > > > > > > > 
> > > > > > > > > > > --D
> > > > > > > > > > > 
> > > > > > > > > > > > Brian
> > > > > > > > > > > > 
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > +done:
> > > > > > > > > > > > > > > +	/* Free all the OWN_AG blocks that are not in the rmapbt/agfl. */
> > > > > > > > > > > > > > > +	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_AG);
> > > > > > > > > > > > > > > +	return xrep_reap_extents(sc, old_allocbt_blocks, &oinfo,
> > > > > > > > > > > > > > > +			XFS_AG_RESV_NONE);
> > > > > > > > > > > > > > > +}
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > ...
> > > > > > > > > > > > > > > diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c
> > > > > > > > > > > > > > > index 0ed68379e551..82f99633a597 100644
> > > > > > > > > > > > > > > --- a/fs/xfs/xfs_extent_busy.c
> > > > > > > > > > > > > > > +++ b/fs/xfs/xfs_extent_busy.c
> > > > > > > > > > > > > > > @@ -657,3 +657,17 @@ xfs_extent_busy_ag_cmp(
> > > > > > > > > > > > > > >  		diff = b1->bno - b2->bno;
> > > > > > > > > > > > > > >  	return diff;
> > > > > > > > > > > > > > >  }
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > +/* Are there any busy extents in this AG? */
> > > > > > > > > > > > > > > +bool
> > > > > > > > > > > > > > > +xfs_extent_busy_list_empty(
> > > > > > > > > > > > > > > +	struct xfs_perag	*pag)
> > > > > > > > > > > > > > > +{
> > > > > > > > > > > > > > > +	spin_lock(&pag->pagb_lock);
> > > > > > > > > > > > > > > +	if (pag->pagb_tree.rb_node) {
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > RB_EMPTY_ROOT()?
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Good suggestion, thank you!
> > > > > > > > > > > > > 
> > > > > > > > > > > > > --D
> > > > > > > > > > > > > 
> > > > > > > > > > > > > > Brian
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > +		spin_unlock(&pag->pagb_lock);
> > > > > > > > > > > > > > > +		return false;
> > > > > > > > > > > > > > > +	}
> > > > > > > > > > > > > > > +	spin_unlock(&pag->pagb_lock);
> > > > > > > > > > > > > > > +	return true;
> > > > > > > > > > > > > > > +}
> > > > > > > > > > > > > > > diff --git a/fs/xfs/xfs_extent_busy.h b/fs/xfs/xfs_extent_busy.h
> > > > > > > > > > > > > > > index 990ab3891971..2f8c73c712c6 100644
> > > > > > > > > > > > > > > --- a/fs/xfs/xfs_extent_busy.h
> > > > > > > > > > > > > > > +++ b/fs/xfs/xfs_extent_busy.h
> > > > > > > > > > > > > > > @@ -65,4 +65,6 @@ static inline void xfs_extent_busy_sort(struct list_head *list)
> > > > > > > > > > > > > > >  	list_sort(NULL, list, xfs_extent_busy_ag_cmp);
> > > > > > > > > > > > > > >  }
> > > > > > > > > > > > > > >  
> > > > > > > > > > > > > > > +bool xfs_extent_busy_list_empty(struct xfs_perag *pag);
> > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > >  #endif /* __XFS_EXTENT_BUSY_H__ */
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > > > > > > > > > > > > the body of a message to majordomo@vger.kernel.org
> > > > > > > > > > > > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [PATCH 05/14] xfs: repair free space btrees
  2018-08-10 10:33                               ` Brian Foster
@ 2018-08-10 15:39                                 ` Darrick J. Wong
  2018-08-10 19:07                                   ` Brian Foster
  0 siblings, 1 reply; 53+ messages in thread
From: Darrick J. Wong @ 2018-08-10 15:39 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, david, allison.henderson

On Fri, Aug 10, 2018 at 06:33:52AM -0400, Brian Foster wrote:
> On Thu, Aug 09, 2018 at 08:59:59AM -0700, Darrick J. Wong wrote:
> > On Thu, Aug 09, 2018 at 08:00:28AM -0400, Brian Foster wrote:
> > > On Wed, Aug 08, 2018 at 03:42:32PM -0700, Darrick J. Wong wrote:
> > > > On Wed, Aug 08, 2018 at 08:29:54AM -0400, Brian Foster wrote:
> > > > > On Tue, Aug 07, 2018 at 04:34:58PM -0700, Darrick J. Wong wrote:
> > > > > > On Fri, Aug 03, 2018 at 06:49:40AM -0400, Brian Foster wrote:
> > > > > > > On Thu, Aug 02, 2018 at 12:22:05PM -0700, Darrick J. Wong wrote:
> > > > > > > > On Thu, Aug 02, 2018 at 09:48:24AM -0400, Brian Foster wrote:
> > > > > > > > > On Wed, Aug 01, 2018 at 11:28:45PM -0700, Darrick J. Wong wrote:
> > > > > > > > > > On Wed, Aug 01, 2018 at 02:39:20PM -0400, Brian Foster wrote:
> > > > > > > > > > > On Wed, Aug 01, 2018 at 09:23:16AM -0700, Darrick J. Wong wrote:
> > > > > > > > > > > > On Wed, Aug 01, 2018 at 07:54:09AM -0400, Brian Foster wrote:
> > > > > > > > > > > > > On Tue, Jul 31, 2018 at 03:01:25PM -0700, Darrick J. Wong wrote:
> > > > > > > > > > > > > > On Tue, Jul 31, 2018 at 01:47:23PM -0400, Brian Foster wrote:
> > > > > > > > > > > > > > > On Sun, Jul 29, 2018 at 10:48:21PM -0700, Darrick J. Wong wrote:
> ...
> > > > > What _seems_ beneficial about that approach is we get (potentially
> > > > > external) persistent backing and memory reclaim ability with the
> > > > > traditional memory allocation model.
> > > > >
> > > > > ISTM that if we used a regular file, we'd need to deal with the
> > > > > traditional file interface somehow or another (file read/pagecache
> > > > > lookup -> record ??).
> > > > 
> > > > Yes, that's all neatly wrapped up in kernel_read() and kernel_write() so
> > > > all we need is a (struct file *).
> > > > 
> > > > > We could repurpose some existing mechanism like the directory code or
> > > > > quota inode mechanism to use xfs buffers for that purpose, but I think
> > > > > that would require us to always use an internal inode. Allowing
> > > > > userspace to pass an fd/file passes that consideration on to the user,
> > > > > which might be more flexible. We could always warn about additional
> > > > > limitations if that fd happens to be based on the target fs.
> > > > 
> > > > <nod> A second advantage of the struct file/kernel_{read,write} approach
> > > > is that if we ever decide to let userspace pass in a fd, it's trivial
> > > > to feed that struct file to the kernel io routines instead of a memfd
> > > > one.
> > > > 
> > > 
> > > Yeah, I like this flexibility. In fact, I'm wondering why we wouldn't do
> > > something like this anyways. Could/should xfs_scrub be responsible for
> > > allocating a memfd and passing along the fd? Another advantage of doing
> > > that is whatever logic we may need to clean up old repair files or
> > > whatever is pushed to userspace.
> > 
> > There are two ways we could do this -- one is to have the kernel manage
> > the memfd creation internally (like my patches do now); the other is for
> > xfs_scrub to pass in creat(O_TMPFILE).
> > 
> > When repair fputs the file (or fdputs the fd if we switch to using
> > that), the kernel will perform the usual deletion of the zero-linkcount
> > zero-refcount file.  We get all the "cleanup" for free by closing the
> > file.
> > 
> 
> Ok. FWIW, the latter approach where xfs_scrub creates a file and passes
> the fd along to the kernel seems preferable to me, but perhaps others
> have different opinions. We could accept a pathname from the user to
> create the file or otherwise attempt to allocate a memfd by default and
> pass that along.
> 
> > One other potential complication is that a couple of the repair
> > functions need two memfds.  The extended attribute repair creates a
> > fixed-record array for attr keys and an xblob to hold names and values;
> > each structure gets its own memfd.  The refcount repair creates two
> > fixed-record arrays, one for refcount records and another to act as a
> > stack of rmaps to compute reference counts.
> > 
> 
> Hmm, I guess there's nothing stopping scrub from passing in two fds.
> Maybe it would make more sense for the userspace option to be a path
> basename or directory where scrub is allowed to create whatever scratch
> files it needs.
> 
> That aside, is there any reason the repair mechanism couldn't emulate
> multiple files with a single fd via a magic offset delimiter or
> something? E.g., "file 1" starts at offset 0, "file 2" starts at offset
> 1TB, etc. (1TB is probably overkill, but you get the idea..).

Hmm, ok, so to summarize, I see five options:

1) Pass in a dirfd, repair can internally openat(dirfd, O_TMPFILE...)
however many files it needs.

2) Pass in however many file fds we need.

3) Pass in a single file fd and segment the space.

4) Let the repair code create as many memfd files as it wants.

5) Let the repair code create one memfd file and segment the space.
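
To make "segment the space" concrete, here is a rough userspace sketch
of the idea; it is not code from this series.  It assumes fixed-size
records and carves one fd into equal windows.  struct segfile,
seg_offset(), seg_store() and seg_load() are made-up names, and the
window size is whatever the caller picks (1TB in Brian's example).

```c
#define _FILE_OFFSET_BITS 64
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical: one backing fd carved into equal-sized logical "files". */
struct segfile {
	int	fd;		/* memfd or O_TMPFILE fd, owned by the caller */
	off_t	seg_size;	/* bytes reserved per logical file */
};

/* Map (segment number, offset inside that segment) to an offset in fd. */
static off_t
seg_offset(const struct segfile *sf, unsigned int seg, off_t pos)
{
	assert(pos >= 0 && pos < sf->seg_size);
	return (off_t)seg * sf->seg_size + pos;
}

/* Store one fixed-size record in slot idx; a short write is a failure. */
static int
seg_store(const struct segfile *sf, unsigned int seg, uint64_t idx,
	  const void *rec, size_t recsz)
{
	off_t	pos = seg_offset(sf, seg, (off_t)(idx * recsz));

	if (pwrite(sf->fd, rec, recsz, pos) != (ssize_t)recsz) {
		fprintf(stderr, "scratch write failed or came up short\n");
		return -1;
	}
	return 0;
}

/* Load the record in slot idx; a short read is a failure. */
static int
seg_load(const struct segfile *sf, unsigned int seg, uint64_t idx,
	 void *rec, size_t recsz)
{
	off_t	pos = seg_offset(sf, seg, (off_t)(idx * recsz));

	if (pread(sf->fd, rec, recsz, pos) != (ssize_t)recsz) {
		fprintf(stderr, "scratch read failed or came up short\n");
		return -1;
	}
	return 0;
}
```

With seg_size set to 1TB, segment 0 could hold the refcount records and
segment 1 the rmap stack, both in the same sparse file.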

I'm pretty sure we don't want to support (2) because that just seems
like a requirements communication nightmare and can burn up a lot of
space in struct xfs_scrub_metadata.

(3) and (5) are basically the same except for where the file comes from.
For (3) we'd have to make sure the fd filesystem supports large sparse
files (and presumably isn't the xfs we're trying to repair), which
shouldn't be too difficult to probe.  For (5) we know that tmpfs already
supports large sparse files.  Another difficulty might be that on 32-bit
the page cache only supports offsets as high as (ULONG_MAX * PAGE_SIZE),
though I suppose at this point we only need two files and 8TB should be
enough for anyone.
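
The arithmetic behind that 8TB figure, as a small sketch.  It assumes
4k pages and a 32-bit unsigned long, and takes the (ULONG_MAX *
PAGE_SIZE) limit above at face value; pagecache_limit_32bit() and
segment_size() are illustrative helpers, not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

/* Highest page cache offset when the page index is a 32-bit unsigned long. */
static uint64_t
pagecache_limit_32bit(uint64_t page_size)
{
	return (uint64_t)UINT32_MAX * page_size;
}

/* Split the limit evenly between nr_segs files, rounded down to a page. */
static uint64_t
segment_size(uint64_t limit, unsigned int nr_segs, uint64_t page_size)
{
	assert(nr_segs > 0 && page_size > 0);
	return (limit / nr_segs) / page_size * page_size;
}
```

For 4k pages the limit comes out one page short of 16TB, so two
segments get one page short of 8TB each.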

(I also think it's reasonable to consider not supporting online repair
on a 32-bit system with a large filesystem...)

In general, the "pass in a thing from userspace" variants come with the
complication that we have to check the functionality of whatever gets
passed in.  On the plus side it likely unlocks access to a lot more
storage than we could get with mem+swap.  On the minus side someone
passes in a fd to a drive-managed SMR on USB 2.0, and...

(1) seems like it would maximize the kernel's flexibility to create as
many (regular, non-sparse) files as it needs, but now we're calling
do_sys_open and managing files ourselves, which might be avoided.

(4) of course is what we do right now. :)

Soooo... the simplest userspace interface (I think) is to allow
userspace to pass in a single file fd.  Scrub can reject it if it
doesn't measure up (fs is the same, sparse not supported, high offsets
not supported, etc.).  If userspace doesn't pass in an fd then we create
a memfd and use that instead.  We end up with a hybrid between (3) and (5).
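
Roughly what that hybrid could look like, written as a userspace
analogue rather than kernel code.  Everything here is an assumption for
illustration: scratch_fd_ok() and pick_scratch_fd() are made-up names,
the "must be empty" rule is my own guard so the probe cannot clobber
user data, and comparing st_dev stands in for "is this the fs we're
repairing".  memfd_create is invoked through syscall(2) so the sketch
does not depend on a libc wrapper.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/memfd.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Return 1 if fd looks usable as scratch space, 0 if not.  target_dev is
 * the device of the filesystem under repair; needed is the highest offset
 * the caller expects to touch.
 */
static int
scratch_fd_ok(int fd, dev_t target_dev, off_t needed)
{
	struct stat	st;
	int		sparse = 0;

	assert(needed > 0);
	if (fstat(fd, &st) < 0 || !S_ISREG(st.st_mode))
		return 0;
	if (st.st_dev == target_dev)	/* lives on the fs being repaired */
		return 0;
	if (st.st_size != 0)		/* assumed rule: only empty files */
		return 0;
	if (ftruncate(fd, needed) < 0)	/* high offsets not supported */
		return 0;
	/* A sparse file grows in size without allocating all its blocks. */
	if (fstat(fd, &st) == 0)
		sparse = (off_t)st.st_blocks * 512 < needed;
	if (ftruncate(fd, 0) < 0)	/* undo the probe */
		return 0;
	return sparse;
}

/* Use the caller's fd if it measures up; with no fd, fall back to a memfd. */
static int
pick_scratch_fd(int user_fd, dev_t target_dev, off_t needed)
{
	if (user_fd >= 0) {
		if (scratch_fd_ok(user_fd, target_dev, needed))
			return user_fd;
		fprintf(stderr, "scratch fd rejected\n");
		return -1;
	}
	return (int)syscall(SYS_memfd_create, "repair-scratch", MFD_CLOEXEC);
}
```

Note that a rejected fd is an error, not a silent fallback to memfd, so
userspace finds out that the file it offered was unsuitable.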

--D

> Brian
> 
> > (In theory the xbitmap could also be converted to use the fixed record
> > array, but in practice they haven't (yet) become large enough to warrant
> > it, and there's currently no way to insert or delete records from the
> > middle of the array.)
> > 
> > > > > > > I'm not familiar with memfd. The manpage suggests it's ram backed, is it
> > > > > > > swappable or something?
> > > > > > 
> > > > > > It's supposed to be.  The quick test I ran (allocate a memfd, write 1GB
> > > > > > of junk to it on a VM with 400M of RAM) seemed to push about 980MB into
> > > > > > the swap file.
> > > > > > 
> > > > > 
> > > > > Ok.
> > > > > 
> > > > > > > If so, that sounds a reasonable option provided the swap space
> > > > > > > requirement can be made clear to users
> > > > > > 
> > > > > > We can document it.  I don't think it's any worse than xfs_repair being
> > > > > > able to use up all the memory + swap... and since we're probably only
> > > > > > going to be repairing one thing at a time, most likely scrub won't need
> > > > > > as much memory.
> > > > > > 
> > > > > 
> > > > > Right, but as noted below, my concerns with the xfs_repair comparison
> > > > > are that 1.) the kernel generally has more of a limit on anonymous
> > > > > memory allocations than userspace (i.e., not swappable AFAIU?) and 2.)
> > > > > it's not clear how effectively running the system out of memory via the
> > > > > kernel will behave from a failure perspective.
> > > > > 
> > > > > IOW, xfs_repair can run the system out of memory but for the most part
> > > > > that ends up being a simple problem for the system: OOM kill the bloated
> > > > > xfs_repair process. For an online repair in a similar situation, I have
> > > > > no idea what's going to happen.
> > > > 
> > > > Back in the days of the huge linked lists the oom killer would target
> > > > other processes because it doesn't know that the online repair thread is
> > > > sitting on a ton of pinned kernel memory...
> > > > 
> > > 
> > > Makes sense, kind of what I'd expect...
> > > 
> > > > > The hope is that the online repair hits -ENOMEM and unwinds, but ISTM
> > > > > we'd still be at risk of other subsystems running into memory
> > > > > allocation problems, filling up swap, the OOM killer going after
> > > > > unrelated processes, etc.  What if, for example, the OOM killer starts
> > > > > picking off processes in service to a running online repair that
> > > > > immediately consumes freed up memory until the system is borked?
> > > > 
> > > > Yeah.  One thing we /could/ do is register an oom notifier that would
> > > > urge any running repair threads to bail out if they can.  It seems to me
> > > > that the oom killer blocks on the oom_notify_list chain, so our handler
> > > > could wait until at least one thread exits before returning.
> > > > 
> > > 
> > > Ok, something like that could be useful. I agree that we probably don't
> > > need to go that far until the mechanism is nailed down and testing shows
> > > that OOM is a problem.
> > 
> > It already is a problem on my contrived "2TB hardlink/reflink farm fs" +
> > "400M of RAM and no swap" scenario.  Granted, pretty much every other
> > xfs utility also blows out on that so I'm not sure how hard I really
> > need to try...
> > 
> > > > > I don't know how likely that is or if it really ends up much different
> > > > > from the analogous xfs_repair situation. My only point right now is
> > > > > that failure scenario is something we should explore for any solution
> > > > > we ultimately consider because it may be an unexpected use case of the
> > > > > underlying mechanism.
> > > > 
> > > > Ideally, online repair would always be the victim since we know we have
> > > > a reasonable fallback.  At least for memfd, however, I think the only
> > > > clues we have to decide the question "is this memfd getting in the way
> > > > of other threads?" are seeing ENOMEM, short writes, or getting
> > > > kicked by an oom notification.  Maybe that'll be enough?
> > > > 
> > > 
> > > Hm, yeah. It may be challenging to track memfd usage as such. If
> > > userspace has access to the fd on an OOM notification or whatever, it
> > > might be able to do more accurate analysis based on an fstat() or
> > > something.
> > > 
> > > Related question... is the online repair sequence currently
> > > interruptible, if xfs_scrub receives a fatal signal while pulling in
> > > entries during an allocbt scan for example?
> > 
> > It's interruptible (fatal signals only) during the scan phase, but once
> > it starts logging metadata updates it will run all the way to
> > completion.
> > 
> > > > > (To the contrary, just using a cached file seems a natural fit from
> > > > > that perspective.)
> > > > 
> > > > Same here.
> > > > 
> > > > > > > and the failure characteristics aren't more severe than for userspace.
> > > > > > > An online repair that puts the broader system at risk of OOM as
> > > > > > > opposed to predictably failing gracefully may not be the most useful
> > > > > > > tool.
> > > > > > 
> > > > > > Agreed.  One huge downside of memfd seems to be the lack of a mechanism
> > > > > > for the vm to push back on us if we successfully write all we need to
> > > > > > the memfd but then other processes need some memory.  Obviously, if the
> > > > > > memfd write itself comes up short or fails then we dump the memfd and
> > > > > > error back to userspace.  We might simply have to free array memory
> > > > > > while we iterate the records to minimize the time spent at peak memory
> > > > > > usage.
> > > > > > 
> > > > > 
> > > > > Hm, yeah. Some kind of fixed/relative size in-core memory pool approach
> > > > > may simplify things because we could allocate it up front and know right
> > > > > away whether we just don't have enough memory available to repair.
> > > > 
> > > > Hmm.  Apparently we actually /can/ call fallocate on memfd to grab all
> > > > the pages at once, provided we have some guesstimate beforehand of how
> > > > much space we think we'll need.
> > > > 
> > > > So long as my earlier statement about the memory requirements being no
> > > > more than the size of the btree leaves is actually true (I haven't
> > > > rigorously tried to prove it), we need about (xrep_calc_ag_resblks() *
> > > > blocksize) worth of space in the memfd file.  Maybe we ask for 1.5x
> > > > that and if we don't get it, we kill the memfd and exit.
> > > > 
> > > 
> > > Indeed. It would be nice if we could do all of the file management bits
> > > in userspace.
> > 
> > Agreed, though no file management would be even better. :)
> > 
> > --D
> > 
> > > Brian
> > > 
> > > > --D
> > > > 
> > > > > 
> > > > > Brian
> > > > > 
> > > > > > --D
> > > > > > 
> > > > > > > 
> > > > > > > Brian
> > > > > > > 
> > > > > > > > --D
> > > > > > > > 
> > > > > > > > > Brian
> > > > > > > > > 
> > > > > > > > > > --D
> > > > > > > > > > 
> > > > > > > > > > > Brian
> > > > > > > > > > > 
> > > > > > > > > > > > --D
> > > > > > > > > > > > 
> > > > > > > > > > > > > Brian
> > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > +done:
> > > > > > > > > > > > > > > > +	/* Free all the OWN_AG blocks that are not in the rmapbt/agfl. */
> > > > > > > > > > > > > > > > +	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_AG);
> > > > > > > > > > > > > > > > +	return xrep_reap_extents(sc, old_allocbt_blocks, &oinfo,
> > > > > > > > > > > > > > > > +			XFS_AG_RESV_NONE);
> > > > > > > > > > > > > > > > +}
> > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > ...
> > > > > > > > > > > > > > > > diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c
> > > > > > > > > > > > > > > > index 0ed68379e551..82f99633a597 100644
> > > > > > > > > > > > > > > > --- a/fs/xfs/xfs_extent_busy.c
> > > > > > > > > > > > > > > > +++ b/fs/xfs/xfs_extent_busy.c
> > > > > > > > > > > > > > > > @@ -657,3 +657,17 @@ xfs_extent_busy_ag_cmp(
> > > > > > > > > > > > > > > >  		diff = b1->bno - b2->bno;
> > > > > > > > > > > > > > > >  	return diff;
> > > > > > > > > > > > > > > >  }
> > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > +/* Are there any busy extents in this AG? */
> > > > > > > > > > > > > > > > +bool
> > > > > > > > > > > > > > > > +xfs_extent_busy_list_empty(
> > > > > > > > > > > > > > > > +	struct xfs_perag	*pag)
> > > > > > > > > > > > > > > > +{
> > > > > > > > > > > > > > > > +	spin_lock(&pag->pagb_lock);
> > > > > > > > > > > > > > > > +	if (pag->pagb_tree.rb_node) {
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > RB_EMPTY_ROOT()?
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Good suggestion, thank you!
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > --D
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Brian
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > +		spin_unlock(&pag->pagb_lock);
> > > > > > > > > > > > > > > > +		return false;
> > > > > > > > > > > > > > > > +	}
> > > > > > > > > > > > > > > > +	spin_unlock(&pag->pagb_lock);
> > > > > > > > > > > > > > > > +	return true;
> > > > > > > > > > > > > > > > +}
> > > > > > > > > > > > > > > > diff --git a/fs/xfs/xfs_extent_busy.h b/fs/xfs/xfs_extent_busy.h
> > > > > > > > > > > > > > > > index 990ab3891971..2f8c73c712c6 100644
> > > > > > > > > > > > > > > > --- a/fs/xfs/xfs_extent_busy.h
> > > > > > > > > > > > > > > > +++ b/fs/xfs/xfs_extent_busy.h
> > > > > > > > > > > > > > > > @@ -65,4 +65,6 @@ static inline void xfs_extent_busy_sort(struct list_head *list)
> > > > > > > > > > > > > > > >  	list_sort(NULL, list, xfs_extent_busy_ag_cmp);
> > > > > > > > > > > > > > > >  }
> > > > > > > > > > > > > > > >  
> > > > > > > > > > > > > > > > +bool xfs_extent_busy_list_empty(struct xfs_perag *pag);
> > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > >  #endif /* __XFS_EXTENT_BUSY_H__ */
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > > > > > > > > > > > > > the body of a message to majordomo@vger.kernel.org
> > > > > > > > > > > > > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > > > > > > > > > > > > the body of a message to majordomo@vger.kernel.org
> > > > > > > > > > > > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > > > > > > > > > > > --
> > > > > > > > > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > > > > > > > > > > > the body of a message to majordomo@vger.kernel.org
> > > > > > > > > > > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > > > > > > > > > > --
> > > > > > > > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > > > > > > > > > > the body of a message to majordomo@vger.kernel.org
> > > > > > > > > > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > > > > > > > > --
> > > > > > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > > > > > > > > the body of a message to majordomo@vger.kernel.org
> > > > > > > > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > > > > > > > --
> > > > > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > > > > > > > the body of a message to majordomo@vger.kernel.org
> > > > > > > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > > > > > > --
> > > > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > > > > > > the body of a message to majordomo@vger.kernel.org
> > > > > > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > > > > > --
> > > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > > > > > the body of a message to majordomo@vger.kernel.org
> > > > > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > > > > --
> > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > > > > the body of a message to majordomo@vger.kernel.org
> > > > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > > > --
> > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > > > the body of a message to majordomo@vger.kernel.org
> > > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > > --
> > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > > the body of a message to majordomo@vger.kernel.org
> > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > the body of a message to majordomo@vger.kernel.org
> > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 05/14] xfs: repair free space btrees
  2018-08-10 15:39                                 ` Darrick J. Wong
@ 2018-08-10 19:07                                   ` Brian Foster
  2018-08-10 19:36                                     ` Darrick J. Wong
  0 siblings, 1 reply; 53+ messages in thread
From: Brian Foster @ 2018-08-10 19:07 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, david, allison.henderson

On Fri, Aug 10, 2018 at 08:39:44AM -0700, Darrick J. Wong wrote:
> On Fri, Aug 10, 2018 at 06:33:52AM -0400, Brian Foster wrote:
> > On Thu, Aug 09, 2018 at 08:59:59AM -0700, Darrick J. Wong wrote:
> > > On Thu, Aug 09, 2018 at 08:00:28AM -0400, Brian Foster wrote:
> > > > On Wed, Aug 08, 2018 at 03:42:32PM -0700, Darrick J. Wong wrote:
> > > > > On Wed, Aug 08, 2018 at 08:29:54AM -0400, Brian Foster wrote:
> > > > > > On Tue, Aug 07, 2018 at 04:34:58PM -0700, Darrick J. Wong wrote:
> > > > > > > On Fri, Aug 03, 2018 at 06:49:40AM -0400, Brian Foster wrote:
> > > > > > > > On Thu, Aug 02, 2018 at 12:22:05PM -0700, Darrick J. Wong wrote:
> > > > > > > > > On Thu, Aug 02, 2018 at 09:48:24AM -0400, Brian Foster wrote:
> > > > > > > > > > On Wed, Aug 01, 2018 at 11:28:45PM -0700, Darrick J. Wong wrote:
> > > > > > > > > > > On Wed, Aug 01, 2018 at 02:39:20PM -0400, Brian Foster wrote:
> > > > > > > > > > > > On Wed, Aug 01, 2018 at 09:23:16AM -0700, Darrick J. Wong wrote:
> > > > > > > > > > > > > On Wed, Aug 01, 2018 at 07:54:09AM -0400, Brian Foster wrote:
> > > > > > > > > > > > > > On Tue, Jul 31, 2018 at 03:01:25PM -0700, Darrick J. Wong wrote:
> > > > > > > > > > > > > > > On Tue, Jul 31, 2018 at 01:47:23PM -0400, Brian Foster wrote:
> > > > > > > > > > > > > > > > On Sun, Jul 29, 2018 at 10:48:21PM -0700, Darrick J. Wong wrote:
> > ...
> > > > > > What _seems_ beneficial about that approach is we get (potentially
> > > > > > external) persistent backing and memory reclaim ability with the
> > > > > > traditional memory allocation model.
> > > > > >
> > > > > > ISTM that if we used a regular file, we'd need to deal with the
> > > > > > traditional file interface somehow or another (file read/pagecache
> > > > > > lookup -> record ??).
> > > > > 
> > > > > Yes, that's all neatly wrapped up in kernel_read() and kernel_write() so
> > > > > all we need is a (struct file *).
> > > > > 
> > > > > > We could repurpose some existing mechanism like the directory code or
> > > > > > quota inode mechanism to use xfs buffers for that purpose, but I think
> > > > > > that would require us to always use an internal inode. Allowing
> > > > > > userspace to pass an fd/file passes that consideration on to the user,
> > > > > > which might be more flexible. We could always warn about additional
> > > > > > limitations if that fd happens to be based on the target fs.
> > > > > 
> > > > > <nod> A second advantage of the struct file/kernel_{read,write} approach
> > > > > is that if we ever decide to let userspace pass in a fd, it's trivial
> > > > > to feed that struct file to the kernel io routines instead of a memfd
> > > > > one.
> > > > > 
> > > > 
> > > > Yeah, I like this flexibility. In fact, I'm wondering why we wouldn't do
> > > > something like this anyways. Could/should xfs_scrub be responsible for
> > > > allocating a memfd and passing along the fd? Another advantage of doing
> > > > that is whatever logic we may need to clean up old repair files or
> > > > whatever is pushed to userspace.
> > > 
> > > There are two ways we could do this -- one is to have the kernel manage
> > > the memfd creation internally (like my patches do now); the other is for
> > > xfs_scrub to pass in an fd from open(O_TMPFILE).
> > > 
> > > When repair fputs the file (or fdputs the fd if we switch to using
> > > that), the kernel will perform the usual deletion of the zero-linkcount
> > > zero-refcount file.  We get all the "cleanup" for free by closing the
> > > file.
> > > 
> > 
> > Ok. FWIW, the latter approach where xfs_scrub creates a file and passes
> > the fd along to the kernel seems preferable to me, but perhaps others
> > have different opinions. We could accept a pathname from the user to
> > create the file or otherwise attempt to allocate an memfd by default and
> > pass that along.
> > 
> > > One other potential complication is that a couple of the repair
> > > functions need two memfds.  The extended attribute repair creates a
> > > fixed-record array for attr keys and an xblob to hold names and values;
> > > each structure gets its own memfd.  The refcount repair creates two
> > > fixed-record arrays, one for refcount records and another to act as a
> > > stack of rmaps to compute reference counts.
> > > 
> > 
> > Hmm, I guess there's nothing stopping scrub from passing in two fds.
> > Maybe it would make more sense for the userspace option to be a path
> > basename or directory where scrub is allowed to create whatever scratch
> > files it needs.
> > 
> > That aside, is there any reason the repair mechanism couldn't emulate
> > multiple files with a single fd via a magic offset delimiter or
> > something? E.g., "file 1" starts at offset 0, "file 2" starts at offset
> > 1TB, etc. (1TB is probably overkill, but you get the idea..).
> 
> Hmm, ok, so to summarize, I see five options:
> 
> 1) Pass in a dirfd, repair can internally openat(dirfd, O_TMPFILE...)
> however many files it needs.
> 
> 2) Pass in however many file fds we need and segment the space.
> 
> 3) Pass in a single file fd.
> 
> 4) Let the repair code create as many memfd files as it wants.
> 
> 5) Let the repair code create one memfd file and segment the space.
> 
> I'm pretty sure we don't want to support (2) because that just seems
> like a requirements communication nightmare and can burn up a lot of
> space in struct xfs_scrub_metadata.
> 
> (3) and (5) are basically the same except for where the file comes from.
> For (3) we'd have to make sure the fd filesystem supports large sparse
> files (and presumably isn't the xfs we're trying to repair), which
> shouldn't be too difficult to probe.  For (5) we know that tmpfs already
> supports large sparse files.  Another difficulty might be that on 32-bit
> the page cache only supports offsets as high as (ULONG_MAX * PAGE_SIZE),
> though I suppose at this point we only need two files and 8TB should be
> enough for anyone.
> 
> (I also think it's reasonable to consider not supporting online repair
> on a 32-bit system with a large filesystem...)
> 
> In general, the "pass in a thing from userspace" variants come with the
> complication that we have to check the functionality of whatever gets
> passed in.  On the plus side it likely unlocks access to a lot more
> storage than we could get with mem+swap.  On the minus side someone
> passes in a fd to a drive-managed SMR on USB 2.0, and...
> 
> (1) seems like it would maximize the kernel's flexibility to create as
> many (regular, non-sparse) files as it needs, but now we're calling
> do_sys_open and managing files ourselves, which might be avoided.
> 
> (4) of course is what we do right now. :)
> 
> Soooo... the simplest userspace interface (I think) is to allow
> userspace to pass in a single file fd.  Scrub can reject it if it
> doesn't measure up (fs is the same, sparse not supported, high offsets
> not supported, etc.).  If userspace doesn't pass in an fd then we create
> a memfd and use that instead.  We end up with a hybrid between (3) and (5).
> 
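To make the offset-segmenting variant in (3)/(5) concrete, here is a
minimal userspace sketch, not code from this series.  XSEG_SHIFT,
xseg_pos() and xseg_pwrite() are names invented for illustration, and
the 1TB spacing is simply the figure floated above.  The point is that
every access gets checked against its window, so one emulated file can
never scribble on the next:

```c
#define _FILE_OFFSET_BITS 64
#include <errno.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical layout: each emulated "file" owns a 1TB window. */
#define XSEG_SHIFT	40
#define XSEG_SIZE	((uint64_t)1 << XSEG_SHIFT)
#define XSEG_MAX	((unsigned int)(INT64_MAX >> XSEG_SHIFT))

/*
 * Map (segment, offset, length) to a position in the backing file.
 * Returns the position, or -EFBIG if the access would leave the window,
 * or -EOVERFLOW if the segment number is too large.
 */
int64_t
xseg_pos(unsigned int seg, uint64_t off, uint64_t len)
{
	if (seg >= XSEG_MAX)
		return -EOVERFLOW;
	if (off >= XSEG_SIZE || len > XSEG_SIZE - off)
		return -EFBIG;
	return (int64_t)(((uint64_t)seg << XSEG_SHIFT) + off);
}

/* Write into one window; fails with errno set instead of crossing over. */
ssize_t
xseg_pwrite(int fd, unsigned int seg, const void *buf, size_t len,
	    uint64_t off)
{
	int64_t	pos = xseg_pos(seg, off, len);

	if (pos < 0) {
		errno = (int)-pos;
		return -1;
	}
	return pwrite(fd, buf, len, (off_t)pos);
}
```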

That all sounds about right to me except I was thinking userspace would
do the memfd fallback of #5 rather than the kernel, just to keep the
policy out of the kernel as much as possible. Is there any major
advantage to doing it in the kernel? I guess it would slightly
complicate 'xfs_io -c repair' ...
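
If the fallback lives in userspace, the xfs_scrub side could be as small
as the sketch below.  This is only an illustration:
scrub_open_scratch() and the idea of a user-supplied scratch directory
are assumptions, and how the resulting fd reaches the kernel is still
undecided.  open(2) with O_TMPFILE and memfd_create(2) are the existing
Linux interfaces.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: return an fd for repair scratch space, or -1 with
 * errno set.  If the user named a directory, create an unlinked temporary
 * file there; otherwise use a memfd, which is backed by memory and swap.
 * Either way the storage goes away once the last reference is dropped.
 */
int
scrub_open_scratch(const char *scratch_dir)
{
	if (scratch_dir)
		return open(scratch_dir, O_TMPFILE | O_RDWR | O_CLOEXEC,
			    0600);

	return memfd_create("xfs_scrub_scratch", MFD_CLOEXEC);
}
```

Passing NULL picks the memfd default; passing a directory on some other
filesystem picks the O_TMPFILE path, and the kernel would still have to
vet whatever it is handed.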

Brian

> --D
> 
> > Brian
> > 
> > > (In theory the xbitmap could also be converted to use the fixed record
> > > array, but in practice they haven't (yet) become large enough to warrant
> > > it, and there's currently no way to insert or delete records from the
> > > middle of the array.)
> > > 
> > > > > > > > I'm not familiar with memfd. The manpage suggests it's ram backed, is it
> > > > > > > > swappable or something?
> > > > > > > 
> > > > > > > It's supposed to be.  The quick test I ran (allocate a memfd, write 1GB
> > > > > > > of junk to it on a VM with 400M of RAM) seemed to push about 980MB into
> > > > > > > the swap file.
> > > > > > > 
> > > > > > 
> > > > > > Ok.
> > > > > > 
> > > > > > > > If so, that sounds a reasonable option provided the swap space
> > > > > > > > requirement can be made clear to users
> > > > > > > 
> > > > > > > We can document it.  I don't think it's any worse than xfs_repair being
> > > > > > > able to use up all the memory + swap... and since we're probably only
> > > > > > > going to be repairing one thing at a time, most likely scrub won't need
> > > > > > > as much memory.
> > > > > > > 
> > > > > > 
> > > > > > Right, but as noted below, my concerns with the xfs_repair comparison
> > > > > > are that 1.) the kernel generally has more of a limit on anonymous
> > > > > > memory allocations than userspace (i.e., not swappable AFAIU?) and 2.)
> > > > > > it's not clear how effectively running the system out of memory via the
> > > > > > kernel will behave from a failure perspective.
> > > > > > 
> > > > > > IOW, xfs_repair can run the system out of memory but for the most part
> > > > > > that ends up being a simple problem for the system: OOM kill the bloated
> > > > > > xfs_repair process. For an online repair in a similar situation, I have
> > > > > > no idea what's going to happen.
> > > > > 
> > > > > Back in the days of the huge linked lists the oom killer would target
> > > > > other processes because it doesn't know that the online repair thread is
> > > > > sitting on a ton of pinned kernel memory...
> > > > > 
> > > > 
> > > > Makes sense, kind of what I'd expect...
> > > > 
> > > > > > The hope is that the online repair hits -ENOMEM and unwinds, but ISTM
> > > > > > we'd still be at risk of other subsystems running into memory
> > > > > > allocation problems, filling up swap, the OOM killer going after
> > > > > > unrelated processes, etc.  What if, for example, the OOM killer starts
> > > > > > picking off processes in service to a running online repair that
> > > > > > immediately consumes freed up memory until the system is borked?
> > > > > 
> > > > > Yeah.  One thing we /could/ do is register an oom notifier that would
> > > > > urge any running repair threads to bail out if they can.  It seems to me
> > > > > that the oom killer blocks on the oom_notify_list chain, so our handler
> > > > > could wait until at least one thread exits before returning.
> > > > > 
> > > > 
> > > > Ok, something like that could be useful. I agree that we probably don't
> > > > need to go that far until the mechanism is nailed down and testing shows
> > > > that OOM is a problem.
> > > 
> > > It already is a problem on my contrived "2TB hardlink/reflink farm fs" +
> > > "400M of RAM and no swap" scenario.  Granted, pretty much every other
> > > xfs utility also blows out on that so I'm not sure how hard I really
> > > need to try...
> > > 
> > > > > > I don't know how likely that is or if it really ends up much different
> > > > > > from the analogous xfs_repair situation. My only point right now is
> > > > > > that failure scenario is something we should explore for any solution
> > > > > > we ultimately consider because it may be an unexpected use case of the
> > > > > > underlying mechanism.
> > > > > 
> > > > > Ideally, online repair would always be the victim since we know we have
> > > > > a reasonable fallback.  At least for memfd, however, I think the only
> > > > > clues we have to decide the question "is this memfd getting in the way
> > > > > of other threads?" is either seeing ENOMEM, short writes, or getting
> > > > > kicked by an oom notification.  Maybe that'll be enough?
> > > > > 
> > > > 
> > > > Hm, yeah. It may be challenging to track memfd usage as such. If
> > > > userspace has access to the fd on an OOM notification or whatever, it
> > > > might be able to do more accurate analysis based on an fstat() or
> > > > something.
> > > > 
> > > > Related question... is the online repair sequence currently
> > > > interruptible, if xfs_scrub receives a fatal signal while pulling in
> > > > entries during an allocbt scan for example?
> > > 
> > > It's interruptible (fatal signals only) during the scan phase, but once
> > > it starts logging metadata updates it will run all the way to
> > > completion.
> > > 
> > > > > > (To the contrary, just using a cached file seems a natural fit from
> > > > > > that perspective.)
> > > > > 
> > > > > Same here.
> > > > > 
> > > > > > > > and the failure characteristics aren't more severe than for userspace.
> > > > > > > > An online repair that puts the broader system at risk of OOM as
> > > > > > > > opposed to predictably failing gracefully may not be the most useful
> > > > > > > > tool.
> > > > > > > 
> > > > > > > Agreed.  One huge downside of memfd seems to be the lack of a mechanism
> > > > > > > for the vm to push back on us if we successfully write all we need to
> > > > > > > the memfd but then other processes need some memory.  Obviously, if the
> > > > > > > memfd write itself comes up short or fails then we dump the memfd and
> > > > > > > error back to userspace.  We might simply have to free array memory
> > > > > > > while we iterate the records to minimize the time spent at peak memory
> > > > > > > usage.
> > > > > > > 
> > > > > > 
> > > > > > Hm, yeah. Some kind of fixed/relative size in-core memory pool approach
> > > > > > may simplify things because we could allocate it up front and know right
> > > > > > away whether we just don't have enough memory available to repair.
> > > > > 
> > > > > Hmm.  Apparently we actually /can/ call fallocate on memfd to grab all
> > > > > the pages at once, provided we have some guesstimate beforehand of how
> > > > > much space we think we'll need.
> > > > > 
> > > > > So long as my earlier statement about the memory requirements being no
> > > > > more than the size of the btree leaves is actually true (I haven't
> > > > > rigorously tried to prove it), we need about (xrep_calc_ag_resblks() *
> > > > > blocksize) worth of space in the memfd file.  Maybe we ask for 1.5x
> > > > > that and if we don't get it, we kill the memfd and exit.
> > > > > 
> > > > 
> > > > Indeed. It would be nice if we could do all of the file management bits
> > > > in userspace.
> > > 
> > > Agreed, though no file management would be even better. :)
> > > 
> > > --D
> > > 
> > > > Brian
> > > > 
> > > > > --D
> > > > > 
> > > > > > 
> > > > > > Brian
> > > > > > 
> > > > > > > --D
> > > > > > > 
> > > > > > > > 
> > > > > > > > Brian
> > > > > > > > 
> > > > > > > > > --D
> > > > > > > > > 
> > > > > > > > > > Brian
> > > > > > > > > > 
> > > > > > > > > > > --D
> > > > > > > > > > > 
> > > > > > > > > > > > Brian
> > > > > > > > > > > > 
> > > > > > > > > > > > > --D
> > > > > > > > > > > > > 
> > > > > > > > > > > > > > Brian
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > +done:
> > > > > > > > > > > > > > > > > +	/* Free all the OWN_AG blocks that are not in the rmapbt/agfl. */
> > > > > > > > > > > > > > > > > +	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_AG);
> > > > > > > > > > > > > > > > > +	return xrep_reap_extents(sc, old_allocbt_blocks, &oinfo,
> > > > > > > > > > > > > > > > > +			XFS_AG_RESV_NONE);
> > > > > > > > > > > > > > > > > +}
> > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > ...
> > > > > > > > > > > > > > > > > diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c
> > > > > > > > > > > > > > > > > index 0ed68379e551..82f99633a597 100644
> > > > > > > > > > > > > > > > > --- a/fs/xfs/xfs_extent_busy.c
> > > > > > > > > > > > > > > > > +++ b/fs/xfs/xfs_extent_busy.c
> > > > > > > > > > > > > > > > > @@ -657,3 +657,17 @@ xfs_extent_busy_ag_cmp(
> > > > > > > > > > > > > > > > >  		diff = b1->bno - b2->bno;
> > > > > > > > > > > > > > > > >  	return diff;
> > > > > > > > > > > > > > > > >  }
> > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > +/* Are there any busy extents in this AG? */
> > > > > > > > > > > > > > > > > +bool
> > > > > > > > > > > > > > > > > +xfs_extent_busy_list_empty(
> > > > > > > > > > > > > > > > > +	struct xfs_perag	*pag)
> > > > > > > > > > > > > > > > > +{
> > > > > > > > > > > > > > > > > +	spin_lock(&pag->pagb_lock);
> > > > > > > > > > > > > > > > > +	if (pag->pagb_tree.rb_node) {
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > RB_EMPTY_ROOT()?
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Good suggestion, thank you!
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > --D
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > Brian
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > +		spin_unlock(&pag->pagb_lock);
> > > > > > > > > > > > > > > > > +		return false;
> > > > > > > > > > > > > > > > > +	}
> > > > > > > > > > > > > > > > > +	spin_unlock(&pag->pagb_lock);
> > > > > > > > > > > > > > > > > +	return true;
> > > > > > > > > > > > > > > > > +}
> > > > > > > > > > > > > > > > > diff --git a/fs/xfs/xfs_extent_busy.h b/fs/xfs/xfs_extent_busy.h
> > > > > > > > > > > > > > > > > index 990ab3891971..2f8c73c712c6 100644
> > > > > > > > > > > > > > > > > --- a/fs/xfs/xfs_extent_busy.h
> > > > > > > > > > > > > > > > > +++ b/fs/xfs/xfs_extent_busy.h
> > > > > > > > > > > > > > > > > @@ -65,4 +65,6 @@ static inline void xfs_extent_busy_sort(struct list_head *list)
> > > > > > > > > > > > > > > > >  	list_sort(NULL, list, xfs_extent_busy_ag_cmp);
> > > > > > > > > > > > > > > > >  }
> > > > > > > > > > > > > > > > >  
> > > > > > > > > > > > > > > > > +bool xfs_extent_busy_list_empty(struct xfs_perag *pag);
> > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > >  #endif /* __XFS_EXTENT_BUSY_H__ */
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > > > > > > > > > > > > > > the body of a message to majordomo@vger.kernel.org
> > > > > > > > > > > > > > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > > > > > > > > > > > > > the body of a message to majordomo@vger.kernel.org
> > > > > > > > > > > > > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > > > > > > > > > > > > the body of a message to majordomo@vger.kernel.org
> > > > > > > > > > > > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > > > > > > > > > > > --
> > > > > > > > > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > > > > > > > > > > > the body of a message to majordomo@vger.kernel.org
> > > > > > > > > > > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > > > > > > > > > --
> > > > > > > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > > > > > > > > > the body of a message to majordomo@vger.kernel.org
> > > > > > > > > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > > > > > > > > --
> > > > > > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > > > > > > > > the body of a message to majordomo@vger.kernel.org
> > > > > > > > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > > > > > > > --
> > > > > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > > > > > > > the body of a message to majordomo@vger.kernel.org
> > > > > > > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > > > > > > --
> > > > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > > > > > > the body of a message to majordomo@vger.kernel.org
> > > > > > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > > > > > --
> > > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > > > > > the body of a message to majordomo@vger.kernel.org
> > > > > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > > > > --
> > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > > > > the body of a message to majordomo@vger.kernel.org
> > > > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > > > --
> > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > > > the body of a message to majordomo@vger.kernel.org
> > > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > > --
> > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > > the body of a message to majordomo@vger.kernel.org
> > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > the body of a message to majordomo@vger.kernel.org
> > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 05/14] xfs: repair free space btrees
  2018-08-10 19:07                                   ` Brian Foster
@ 2018-08-10 19:36                                     ` Darrick J. Wong
  2018-08-11 12:50                                       ` Brian Foster
  0 siblings, 1 reply; 53+ messages in thread
From: Darrick J. Wong @ 2018-08-10 19:36 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, david, allison.henderson

On Fri, Aug 10, 2018 at 03:07:40PM -0400, Brian Foster wrote:
> On Fri, Aug 10, 2018 at 08:39:44AM -0700, Darrick J. Wong wrote:
> > On Fri, Aug 10, 2018 at 06:33:52AM -0400, Brian Foster wrote:
> > > On Thu, Aug 09, 2018 at 08:59:59AM -0700, Darrick J. Wong wrote:
> > > > On Thu, Aug 09, 2018 at 08:00:28AM -0400, Brian Foster wrote:
> > > > > On Wed, Aug 08, 2018 at 03:42:32PM -0700, Darrick J. Wong wrote:
> > > > > > On Wed, Aug 08, 2018 at 08:29:54AM -0400, Brian Foster wrote:
> > > > > > > On Tue, Aug 07, 2018 at 04:34:58PM -0700, Darrick J. Wong wrote:
> > > > > > > > On Fri, Aug 03, 2018 at 06:49:40AM -0400, Brian Foster wrote:
> > > > > > > > > On Thu, Aug 02, 2018 at 12:22:05PM -0700, Darrick J. Wong wrote:
> > > > > > > > > > On Thu, Aug 02, 2018 at 09:48:24AM -0400, Brian Foster wrote:
> > > > > > > > > > > On Wed, Aug 01, 2018 at 11:28:45PM -0700, Darrick J. Wong wrote:
> > > > > > > > > > > > On Wed, Aug 01, 2018 at 02:39:20PM -0400, Brian Foster wrote:
> > > > > > > > > > > > > On Wed, Aug 01, 2018 at 09:23:16AM -0700, Darrick J. Wong wrote:
> > > > > > > > > > > > > > On Wed, Aug 01, 2018 at 07:54:09AM -0400, Brian Foster wrote:
> > > > > > > > > > > > > > > On Tue, Jul 31, 2018 at 03:01:25PM -0700, Darrick J. Wong wrote:
> > > > > > > > > > > > > > > > On Tue, Jul 31, 2018 at 01:47:23PM -0400, Brian Foster wrote:
> > > > > > > > > > > > > > > > > On Sun, Jul 29, 2018 at 10:48:21PM -0700, Darrick J. Wong wrote:
> > > ...
> > > > > > > What _seems_ beneficial about that approach is we get (potentially
> > > > > > > external) persistent backing and memory reclaim ability with the
> > > > > > > traditional memory allocation model.
> > > > > > >
> > > > > > > ISTM that if we used a regular file, we'd need to deal with the
> > > > > > > traditional file interface somehow or another (file read/pagecache
> > > > > > > lookup -> record ??).
> > > > > > 
> > > > > > Yes, that's all neatly wrapped up in kernel_read() and kernel_write() so
> > > > > > all we need is a (struct file *).
> > > > > > 
> > > > > > > We could repurpose some existing mechanism like the directory code or
> > > > > > > quota inode mechanism to use xfs buffers for that purpose, but I think
> > > > > > > that would require us to always use an internal inode. Allowing
> > > > > > > userspace to pass an fd/file passes that consideration on to the user,
> > > > > > > which might be more flexible. We could always warn about additional
> > > > > > > limitations if that fd happens to be based on the target fs.
> > > > > > 
> > > > > > <nod> A second advantage of the struct file/kernel_{read,write} approach
> > > > > > is that if we ever decide to let userspace pass in a fd, it's trivial
> > > > > > to feed that struct file to the kernel io routines instead of a memfd
> > > > > > one.
> > > > > > 
> > > > > 
> > > > > Yeah, I like this flexibility. In fact, I'm wondering why we wouldn't do
> > > > > something like this anyways. Could/should xfs_scrub be responsible for
> > > > > allocating a memfd and passing along the fd? Another advantage of doing
> > > > > that is whatever logic we may need to clean up old repair files or
> > > > > whatever is pushed to userspace.
> > > > 
> > > > There are two ways we could do this -- one is to have the kernel manage
> > > > the memfd creation internally (like my patches do now); the other is for
> > > > xfs_scrub to pass in an fd from open(O_TMPFILE).
> > > > 
> > > > When repair fputs the file (or fdputs the fd if we switch to using
> > > > that), the kernel will perform the usual deletion of the zero-linkcount
> > > > zero-refcount file.  We get all the "cleanup" for free by closing the
> > > > file.
> > > > 
> > > 
> > > Ok. FWIW, the latter approach where xfs_scrub creates a file and passes
> > > the fd along to the kernel seems preferable to me, but perhaps others
> > > have different opinions. We could accept a pathname from the user to
> > > create the file or otherwise attempt to allocate an memfd by default and
> > > pass that along.
> > > 
> > > > One other potential complication is that a couple of the repair
> > > > functions need two memfds.  The extended attribute repair creates a
> > > > fixed-record array for attr keys and an xblob to hold names and values;
> > > > each structure gets its own memfd.  The refcount repair creates two
> > > > fixed-record arrays, one for refcount records and another to act as a
> > > > stack of rmaps to compute reference counts.
> > > > 
> > > 
> > > Hmm, I guess there's nothing stopping scrub from passing in two fds.
> > > Maybe it would make more sense for the userspace option to be a path
> > > basename or directory where scrub is allowed to create whatever scratch
> > > files it needs.
> > > 
> > > That aside, is there any reason the repair mechanism couldn't emulate
> > > multiple files with a single fd via a magic offset delimiter or
> > > something? E.g., "file 1" starts at offset 0, "file 2" starts at offset
> > > 1TB, etc. (1TB is probably overkill, but you get the idea..).
> > 
> > Hmm, ok, so to summarize, I see five options:
> > 
> > 1) Pass in a dirfd, repair can internally openat(dirfd, O_TMPFILE...)
> > however many files it needs.
> > 
> > 2) Pass in however many file fds we need and segment the space.
> > 
> > 3) Pass in a single file fd.
> > 
> > 4) Let the repair code create as many memfd files as it wants.
> > 
> > 5) Let the repair code create one memfd file and segment the space.
> > 
> > I'm pretty sure we don't want to support (2) because that just seems
> > like a requirements communication nightmare and can burn up a lot of
> > space in struct xfs_scrub_metadata.
> > 
> > (3) and (5) are basically the same except for where the file comes from.
> > For (3) we'd have to make sure the fd filesystem supports large sparse
> > files (and presumably isn't the xfs we're trying to repair), which
> > shouldn't be too difficult to probe.  For (5) we know that tmpfs already
> > supports large sparse files.  Another difficulty might be that on 32-bit
> > the page cache only supports offsets as high as (ULONG_MAX * PAGE_SIZE),
> > though I suppose at this point we only need two files and 8TB should be
> > enough for anyone.
> > 
> > (I also think it's reasonable to consider not supporting online repair
> > on a 32-bit system with a large filesystem...)
> > 
> > In general, the "pass in a thing from userspace" variants come with the
> > complication that we have to check the functionality of whatever gets
> > passed in.  On the plus side it likely unlocks access to a lot more
> > storage than we could get with mem+swap.  On the minus side someone
> > passes in a fd to a drive-managed SMR on USB 2.0, and...
> > 
> > (1) seems like it would maximize the kernel's flexibility to create as
> > many (regular, non-sparse) files as it needs, but now we're calling
> > do_sys_open and managing files ourselves, which might be avoided.
> > 
> > (4) of course is what we do right now. :)
> > 
> > Soooo... the simplest userspace interface (I think) is to allow
> > userspace to pass in a single file fd.  Scrub can reject it if it
> > doesn't measure up (fs is the same, sparse not supported, high offsets
> > not supported, etc.).  If userspace doesn't pass in an fd then we create
> > a memfd and use that instead.  We end up with a hybrid between (3) and (5).
> > 
> 
> That all sounds about right to me except I was thinking userspace would
> do the memfd fallback of #5 rather than the kernel, just to keep the
> policy out of the kernel as much as possible. Is there any major
> advantage to doing it in the kernel? I guess it would slightly
> complicate 'xfs_io -c repair' ...

Hm.  We'll have to use one of the reserved areas of struct
xfs_scrub_metadata to pass in the file descriptor.  If we create a new
XFS_SCRUB_IFLAG_FD flag to indicate that we're passing in a file
descriptor then either we lose compatibility with old kernels (because
they reject unknown flags) or xfs_scrub will have to try a repair
without a fd (to see if the kernel even cares) and retry if the repair
fails with some prearranged error code that means "give me a swapfile,
please".  Alternately we simply require that the fd cannot be fd 0 since
using stdin for swap space is a stupid idea anyways.

Technically we're not supposed to have flag days, but otoh this is a
xfs-only ioctl for a feature that's still experimental, so perhaps it's
not crucial to maintain compatibility with old kernels where the feature
is incomplete and experimental?

Hmm.  We could define the fd field with the requirement that fd > 0, and
if the repair function requires an fd and one hasn't been provided, it
can fail out with ENOMEM.  If it doesn't need extra memory it can just
ignore the contents of the fd field.  xfs_scrub can then arrange to pass
in mem fds or file fds or whatever.
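
A minimal userspace sketch of that retry flow, under loud assumptions:
try_repair() is a hypothetical stand-in for the scrub ioctl (the fd field
does not exist in struct xfs_scrub_metadata today), and the helper names
are made up for illustration.  It only shows the "fd 0 means none, ENOMEM
means please supply scratch space" convention:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for the repair ioctl: fails with ENOMEM when the
 * repair needs scratch space and no usable fd (fd > 0) was supplied.
 */
int try_repair(int scratch_fd, int needs_scratch)
{
	if (needs_scratch && scratch_fd <= 0) {
		errno = ENOMEM;
		return -1;
	}
	return 0;
}

/* Create a memfd numbered > 0, because 0 is reserved for "no fd". */
int make_scratch_memfd(void)
{
	int fd = memfd_create("repair-scratch", MFD_CLOEXEC);

	if (fd == 0) {
		/* stdin was closed, so move the memfd off descriptor 0. */
		int newfd = fcntl(fd, F_DUPFD_CLOEXEC, 1);
		int saved = errno;

		close(fd);
		errno = saved;
		fd = newfd;
	}
	return fd;
}

/* Try without a scratch fd first; retry once with a memfd on ENOMEM. */
int repair_with_fallback(int needs_scratch)
{
	int fd, ret, saved;

	ret = try_repair(0, needs_scratch);
	if (ret == 0 || errno != ENOMEM)
		return ret;

	fd = make_scratch_memfd();
	if (fd < 0)
		return -1;

	ret = try_repair(fd, needs_scratch);
	saved = errno;
	close(fd);
	errno = saved;
	return ret;
}
```

The caller could just as well open a regular file instead of a memfd in
make_scratch_memfd(); the retry logic would not change.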

--D

> Brian
> 
> > --D
> > 
> > > Brian
> > > 
> > > > (In theory the xbitmap could also be converted to use the fixed record
> > > > array, but in practice they haven't (yet) become large enough to warrant
> > > > it, and there's currently no way to insert or delete records from the
> > > > middle of the array.)
> > > > 
> > > > > > > > > I'm not familiar with memfd. The manpage suggests it's ram backed, is it
> > > > > > > > > swappable or something?
> > > > > > > > 
> > > > > > > > It's supposed to be.  The quick test I ran (allocate a memfd, write 1GB
> > > > > > > > of junk to it on a VM with 400M of RAM) seemed to push about 980MB into
> > > > > > > > the swap file.
> > > > > > > > 
> > > > > > > 
> > > > > > > Ok.
> > > > > > > 
> > > > > > > > > If so, that sounds a reasonable option provided the swap space
> > > > > > > > > requirement can be made clear to users
> > > > > > > > 
> > > > > > > > We can document it.  I don't think it's any worse than xfs_repair being
> > > > > > > > able to use up all the memory + swap... and since we're probably only
> > > > > > > > going to be repairing one thing at a time, most likely scrub won't need
> > > > > > > > as much memory.
> > > > > > > > 
> > > > > > > 
> > > > > > > Right, but as noted below, my concerns with the xfs_repair comparison
> > > > > > > are that 1.) the kernel generally has more of a limit on anonymous
> > > > > > > memory allocations than userspace (i.e., not swappable AFAIU?) and 2.)
> > > > > > > it's not clear how effectively running the system out of memory via the
> > > > > > > kernel will behave from a failure perspective.
> > > > > > > 
> > > > > > > IOW, xfs_repair can run the system out of memory but for the most part
> > > > > > > that ends up being a simple problem for the system: OOM kill the bloated
> > > > > > > xfs_repair process. For an online repair in a similar situation, I have
> > > > > > > no idea what's going to happen.
> > > > > > 
> > > > > > Back in the days of the huge linked lists the oom killer would target
> > > > > > > other processes because it doesn't know that the online repair thread is
> > > > > > sitting on a ton of pinned kernel memory...
> > > > > > 
> > > > > 
> > > > > Makes sense, kind of what I'd expect...
> > > > > 
> > > > > > > The hope is that the online repair hits -ENOMEM and unwinds, but ISTM
> > > > > > > we'd still be at risk of other subsystems running into memory
> > > > > > > allocation problems, filling up swap, the OOM killer going after
> > > > > > > unrelated processes, etc.  What if, for example, the OOM killer starts
> > > > > > > picking off processes in service to a running online repair that
> > > > > > > immediately consumes freed up memory until the system is borked?
> > > > > > 
> > > > > > Yeah.  One thing we /could/ do is register an oom notifier that would
> > > > > > urge any running repair threads to bail out if they can.  It seems to me
> > > > > > that the oom killer blocks on the oom_notify_list chain, so our handler
> > > > > > could wait until at least one thread exits before returning.
> > > > > > 
> > > > > 
> > > > > Ok, something like that could be useful. I agree that we probably don't
> > > > > need to go that far until the mechanism is nailed down and testing shows
> > > > > that OOM is a problem.
> > > > 
> > > > It already is a problem on my contrived "2TB hardlink/reflink farm fs" +
> > > > "400M of RAM and no swap" scenario.  Granted, pretty much every other
> > > > xfs utility also blows out on that so I'm not sure how hard I really
> > > > need to try...
> > > > 
> > > > > > > I don't know how likely that is or if it really ends up much different
> > > > > > > from the analogous xfs_repair situation. My only point right now is
> > > > > > > that failure scenario is something we should explore for any solution
> > > > > > > we ultimately consider because it may be an unexpected use case of the
> > > > > > > underlying mechanism.
> > > > > > 
> > > > > > Ideally, online repair would always be the victim since we know we have
> > > > > > a reasonable fallback.  At least for memfd, however, I think the only
> > > > > > clues we have to decide the question "is this memfd getting in the way
> > > > > > of other threads?" is either seeing ENOMEM, short writes, or getting
> > > > > > kicked by an oom notification.  Maybe that'll be enough?
> > > > > > 
> > > > > 
> > > > > Hm, yeah. It may be challenging to track memfd usage as such. If
> > > > > userspace has access to the fd on an OOM notification or whatever, it
> > > > > might be able to do more accurate analysis based on an fstat() or
> > > > > something.
> > > > > 
> > > > > Related question... is the online repair sequence currently
> > > > > interruptible, if xfs_scrub receives a fatal signal while pulling in
> > > > > entries during an allocbt scan for example?
> > > > 
> > > > It's interruptible (fatal signals only) during the scan phase, but once
> > > > it starts logging metadata updates it will run all the way to
> > > > completion.
> > > > 
> > > > > > > (To the contrary, just using a cached file seems a natural fit from
> > > > > > > that perspective.)
> > > > > > 
> > > > > > Same here.
> > > > > > 
> > > > > > > > > and the failure characteristics aren't more severe than for userspace.
> > > > > > > > > An online repair that puts the broader system at risk of OOM as
> > > > > > > > > opposed to predictably failing gracefully may not be the most useful
> > > > > > > > > tool.
> > > > > > > > 
> > > > > > > > Agreed.  One huge downside of memfd seems to be the lack of a mechanism
> > > > > > > > for the vm to push back on us if we successfully write all we need to
> > > > > > > > the memfd but then other processes need some memory.  Obviously, if the
> > > > > > > > memfd write itself comes up short or fails then we dump the memfd and
> > > > > > > > error back to userspace.  We might simply have to free array memory
> > > > > > > > while we iterate the records to minimize the time spent at peak memory
> > > > > > > > usage.
> > > > > > > > 
> > > > > > > 
> > > > > > > Hm, yeah. Some kind of fixed/relative size in-core memory pool approach
> > > > > > > may simplify things because we could allocate it up front and know right
> > > > > > > away whether we just don't have enough memory available to repair.
> > > > > > 
> > > > > > Hmm.  Apparently we actually /can/ call fallocate on memfd to grab all
> > > > > > the pages at once, provided we have some guesstimate beforehand of how
> > > > > > much space we think we'll need.
> > > > > > 
> > > > > > So long as my earlier statement about the memory requirements being no
> > > > > > more than the size of the btree leaves is actually true (I haven't
> > > > > > rigorously tried to prove it), we need about (xrep_calc_ag_resblks() *
> > > > > > blocksize) worth of space in the memfd file.  Maybe we ask for 1.5x
> > > > > > that and if we don't get it, we kill the memfd and exit.
> > > > > > 
> > > > > 
> > > > > Indeed. It would be nice if we could do all of the file management bits
> > > > > in userspace.
> > > > 
> > > > Agreed, though no file management would be even better. :)
> > > > 
> > > > --D
> > > > 
> > > > > Brian
> > > > > 
> > > > > > --D
> > > > > > 
> > > > > > > 
> > > > > > > Brian
> > > > > > > 
> > > > > > > > --D
> > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Brian
> > > > > > > > > 
> > > > > > > > > > --D
> > > > > > > > > > 
> > > > > > > > > > > Brian
> > > > > > > > > > > 
> > > > > > > > > > > > --D
> > > > > > > > > > > > 
> > > > > > > > > > > > > Brian
> > > > > > > > > > > > > 
> > > > > > > > > > > > > > --D
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Brian
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > > +done:
> > > > > > > > > > > > > > > > > > +	/* Free all the OWN_AG blocks that are not in the rmapbt/agfl. */
> > > > > > > > > > > > > > > > > > +	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_AG);
> > > > > > > > > > > > > > > > > > +	return xrep_reap_extents(sc, old_allocbt_blocks, &oinfo,
> > > > > > > > > > > > > > > > > > +			XFS_AG_RESV_NONE);
> > > > > > > > > > > > > > > > > > +}
> > > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > ...
> > > > > > > > > > > > > > > > > > diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c
> > > > > > > > > > > > > > > > > > index 0ed68379e551..82f99633a597 100644
> > > > > > > > > > > > > > > > > > --- a/fs/xfs/xfs_extent_busy.c
> > > > > > > > > > > > > > > > > > +++ b/fs/xfs/xfs_extent_busy.c
> > > > > > > > > > > > > > > > > > @@ -657,3 +657,17 @@ xfs_extent_busy_ag_cmp(
> > > > > > > > > > > > > > > > > >  		diff = b1->bno - b2->bno;
> > > > > > > > > > > > > > > > > >  	return diff;
> > > > > > > > > > > > > > > > > >  }
> > > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > > +/* Are there any busy extents in this AG? */
> > > > > > > > > > > > > > > > > > +bool
> > > > > > > > > > > > > > > > > > +xfs_extent_busy_list_empty(
> > > > > > > > > > > > > > > > > > +	struct xfs_perag	*pag)
> > > > > > > > > > > > > > > > > > +{
> > > > > > > > > > > > > > > > > > +	spin_lock(&pag->pagb_lock);
> > > > > > > > > > > > > > > > > > +	if (pag->pagb_tree.rb_node) {
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > RB_EMPTY_ROOT()?
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > Good suggestion, thank you!
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > --D
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > Brian
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > +		spin_unlock(&pag->pagb_lock);
> > > > > > > > > > > > > > > > > > +		return false;
> > > > > > > > > > > > > > > > > > +	}
> > > > > > > > > > > > > > > > > > +	spin_unlock(&pag->pagb_lock);
> > > > > > > > > > > > > > > > > > +	return true;
> > > > > > > > > > > > > > > > > > +}
> > > > > > > > > > > > > > > > > > diff --git a/fs/xfs/xfs_extent_busy.h b/fs/xfs/xfs_extent_busy.h
> > > > > > > > > > > > > > > > > > index 990ab3891971..2f8c73c712c6 100644
> > > > > > > > > > > > > > > > > > --- a/fs/xfs/xfs_extent_busy.h
> > > > > > > > > > > > > > > > > > +++ b/fs/xfs/xfs_extent_busy.h
> > > > > > > > > > > > > > > > > > @@ -65,4 +65,6 @@ static inline void xfs_extent_busy_sort(struct list_head *list)
> > > > > > > > > > > > > > > > > >  	list_sort(NULL, list, xfs_extent_busy_ag_cmp);
> > > > > > > > > > > > > > > > > >  }
> > > > > > > > > > > > > > > > > >  
> > > > > > > > > > > > > > > > > > +bool xfs_extent_busy_list_empty(struct xfs_perag *pag);
> > > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > >  #endif /* __XFS_EXTENT_BUSY_H__ */
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > > > > > > > > > > > > > > > the body of a message to majordomo@vger.kernel.org
> > > > > > > > > > > > > > > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH 05/14] xfs: repair free space btrees
  2018-08-10 19:36                                     ` Darrick J. Wong
@ 2018-08-11 12:50                                       ` Brian Foster
  2018-08-11 15:48                                         ` Darrick J. Wong
  2018-08-13  2:46                                         ` Dave Chinner
  0 siblings, 2 replies; 53+ messages in thread
From: Brian Foster @ 2018-08-11 12:50 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, david, allison.henderson

On Fri, Aug 10, 2018 at 12:36:51PM -0700, Darrick J. Wong wrote:
> On Fri, Aug 10, 2018 at 03:07:40PM -0400, Brian Foster wrote:
> > On Fri, Aug 10, 2018 at 08:39:44AM -0700, Darrick J. Wong wrote:
> > > On Fri, Aug 10, 2018 at 06:33:52AM -0400, Brian Foster wrote:
> > > > On Thu, Aug 09, 2018 at 08:59:59AM -0700, Darrick J. Wong wrote:
> > > > > On Thu, Aug 09, 2018 at 08:00:28AM -0400, Brian Foster wrote:
> > > > > > On Wed, Aug 08, 2018 at 03:42:32PM -0700, Darrick J. Wong wrote:
> > > > > > > On Wed, Aug 08, 2018 at 08:29:54AM -0400, Brian Foster wrote:
> > > > > > > > On Tue, Aug 07, 2018 at 04:34:58PM -0700, Darrick J. Wong wrote:
> > > > > > > > > On Fri, Aug 03, 2018 at 06:49:40AM -0400, Brian Foster wrote:
> > > > > > > > > > On Thu, Aug 02, 2018 at 12:22:05PM -0700, Darrick J. Wong wrote:
> > > > > > > > > > > On Thu, Aug 02, 2018 at 09:48:24AM -0400, Brian Foster wrote:
> > > > > > > > > > > > On Wed, Aug 01, 2018 at 11:28:45PM -0700, Darrick J. Wong wrote:
> > > > > > > > > > > > > On Wed, Aug 01, 2018 at 02:39:20PM -0400, Brian Foster wrote:
> > > > > > > > > > > > > > On Wed, Aug 01, 2018 at 09:23:16AM -0700, Darrick J. Wong wrote:
> > > > > > > > > > > > > > > On Wed, Aug 01, 2018 at 07:54:09AM -0400, Brian Foster wrote:
> > > > > > > > > > > > > > > > On Tue, Jul 31, 2018 at 03:01:25PM -0700, Darrick J. Wong wrote:
> > > > > > > > > > > > > > > > > On Tue, Jul 31, 2018 at 01:47:23PM -0400, Brian Foster wrote:
> > > > > > > > > > > > > > > > > > On Sun, Jul 29, 2018 at 10:48:21PM -0700, Darrick J. Wong wrote:
...
> > > 
> > > Hmm, ok, so to summarize, I see five options:
> > > 
> > > 1) Pass in a dirfd, repair can internally openat(dirfd, O_TMPFILE...)
> > > however many files it needs.
> > > 
> > > 2) Pass in however many file fds we need and segment the space.
> > > 
> > > 3) Pass in a single file fd.
> > > 
> > > 4) Let the repair code create as many memfd files as it wants.
> > > 
> > > 5) Let the repair code create one memfd file and segment the space.
> > > 
> > > I'm pretty sure we don't want to support (2) because that just seems
> > > like a requirements communication nightmare and can burn up a lot of
> > > space in struct xfs_scrub_metadata.
> > > 
> > > (3) and (5) are basically the same except for where the file comes from.
> > > For (3) we'd have to make sure the fd filesystem supports large sparse
> > > files (and presumably isn't the xfs we're trying to repair), which
> > > shouldn't be too difficult to probe.  For (5) we know that tmpfs already
> > > supports large sparse files.  Another difficulty might be that on 32-bit
> > > the page cache only supports offsets as high as (ULONG_MAX * PAGE_SIZE),
> > > though I suppose at this point we only need two files and 8TB should be
> > > enough for anyone.
> > > 
> > > (I also think it's reasonable to consider not supporting online repair
> > > on a 32-bit system with a large filesystem...)
> > > 
> > > In general, the "pass in a thing from userspace" variants come with the
> > > complication that we have to check the functionality of whatever gets
> > > passed in.  On the plus side it likely unlocks access to a lot more
> > > storage than we could get with mem+swap.  On the minus side someone
> > > passes in a fd to a drive-managed SMR on USB 2.0, and...
> > > 
> > > (1) seems like it would maximize the kernel's flexibility to create as
> > > many (regular, non-sparse) files as it needs, but now we're calling
> > > do_sys_open and managing files ourselves, which might be avoided.
> > > 
> > > (4) of course is what we do right now. :)
> > > 
> > > Soooo... the simplest userspace interface (I think) is to allow
> > > userspace to pass in a single file fd.  Scrub can reject it if it
> > > doesn't measure up (fs is the same, sparse not supported, high offsets
> > > not supported, etc.).  If userspace doesn't pass in an fd then we create
> > > a memfd and use that instead.  We end up with a hybrid between (3) and (5).
> > > 
> > 
> > That all sounds about right to me except I was thinking userspace would
> > do the memfd fallback of #5 rather than the kernel, just to keep the
> > policy out of the kernel as much as possible. Is there any major
> > advantage to doing it in the kernel? I guess it would slightly
> > complicate 'xfs_io -c repair' ...
> 
> Hm.  We'll have to use one of the reserved areas of struct
> xfs_scrub_metadata to pass in the file descriptor.  If we create a new
> XFS_SCRUB_IFLAG_FD flag to indicate that we're passing in a file
> descriptor then either we lose compatibility with old kernels (because
> they reject unknown flags) or xfs_scrub will have to try a repair
> without a fd (to see if the kernel even cares) and retry if the repair
> fails with some prearranged error code that means "give me a swapfile,
> please".  Alternately we simply require that the fd cannot be fd 0 since
> using stdin for swap space is a stupid idea anyways.
> 

I'm assuming that the kernel would have some basic checks on the fd to
ensure it's usable (seekable, large offsets, etc.), as you mentioned
previously.
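
For illustration only, here is a userspace analogue of the sort of checks
I have in mind.  The kernel side would look quite different; the function
name, the 1 TiB probe offset and the choice of error codes are all
assumptions, not anything from the patch series.  It rejects non-regular
files, files on the filesystem under repair, non-empty files, and files
that can't take a sparse write at a high offset:

```c
#define _FILE_OFFSET_BITS 64
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define PROBE_OFFSET	((off_t)1 << 40)	/* 1 TiB, arbitrary */

/* Returns 0 if fd looks usable as scratch space, else a negative errno. */
int probe_scratch_fd(int fd, dev_t repair_dev)
{
	struct stat st;
	char byte = 0;
	ssize_t n;
	int ret = 0;

	if (fstat(fd, &st) < 0)
		return -errno;
	if (!S_ISREG(st.st_mode))
		return -EINVAL;		/* pipes, sockets, devices... */
	if (st.st_dev == repair_dev)
		return -EXDEV;		/* same fs we're trying to repair */
	if (st.st_size != 0)
		return -EINVAL;		/* don't clobber existing data */
	if (lseek(fd, 0, SEEK_SET) < 0)
		return -errno;		/* not seekable */

	/* Can we write a single byte far out in the file? */
	n = pwrite(fd, &byte, 1, PROBE_OFFSET);
	if (n != 1) {
		ret = n < 0 ? -errno : -EIO;
		goto out;
	}

	/* If the hole before that byte got allocated, it isn't sparse. */
	if (fstat(fd, &st) < 0) {
		ret = -errno;
		goto out;
	}
	if ((long long)st.st_blocks * 512 >= (long long)PROBE_OFFSET)
		ret = -EOPNOTSUPP;
out:
	/* Leave the file empty again whatever the outcome. */
	if (ftruncate(fd, 0) < 0 && ret == 0)
		ret = -errno;
	return ret;
}
```

A memfd should pass this probe, while something like a pipe fails at the
S_ISREG() test before any data is written.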

With regard to xfs_scrub_metadata, it sounds like we need to deal with
that regardless if we want to support the ability to specify an external
file. Is the issue backwards compatibility with the interface as it
exists today..?

> Technically we're not supposed to have flag days, but otoh this is a
> xfs-only ioctl for a feature that's still experimental, so perhaps it's
> not crucial to maintain compatibility with old kernels where the feature
> is incomplete and experimental?
> 

In my mind, I kind of take the experimental status as all bits/interface
may explode and are otherwise subject to change or disappear. Perhaps
others feel differently; it does seem we've kind of hinted towards the
contrary recently with respect to the per-inode dax bits and then now in
this discussion, but IMO that's kind of an inherent risk of doing
incremental work on complex features upstream.

I dunno, perhaps that's just a misunderstanding on my part. If so, I do
wonder if we should be a bit more cautious (in the future) about
exposing interfaces to experimental features (DEBUG mode only, for
example) for a period of time until the underlying mechanism is fleshed
out enough to establish confidence in the interface. It's one thing if
an experimental feature is shiny new and potentially unstable at the
time it is merged, but enough bits are there for reviewers to understand
the design and interface requirements. It's another thing if the
implementation is not yet complete, because then it's obviously harder
to surmise whether the interface is ultimately sufficient.

This of course is all higher level discussion from how to handle scrub..

> Hmm.  We could define the fd field with the requirement that fd > 0, and
> if the repair function requires an fd and one hasn't been provided, it
> can fail out with ENOMEM.  If it doesn't need extra memory it can just
> ignore the contents of the fd field.  xfs_scrub can then arrange to pass
> in mem fds or file fds or whatever.
> 

Is there a versioning mechanism to the interface? I thought we used that
approach (or planned to..) in other similar internal commands, so a
particular kernel could bump the version and appropriately decide how to
handle older versions.

Brian

> --D
> 
> > Brian
> > 
> > > --D
> > > 
> > > > Brian
> > > > 
> > > > > (In theory the xbitmap could also be converted to use the fixed record
> > > > > array, but in practice they haven't (yet) become large enough to warrant
> > > > > it, and there's currently no way to insert or delete records from the
> > > > > middle of the array.)
> > > > > 
> > > > > > > > > > I'm not familiar with memfd. The manpage suggests it's ram backed, is it
> > > > > > > > > > swappable or something?
> > > > > > > > > 
> > > > > > > > > It's supposed to be.  The quick test I ran (allocate a memfd, write 1GB
> > > > > > > > > of junk to it on a VM with 400M of RAM) seemed to push about 980MB into
> > > > > > > > > the swap file.
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > Ok.
> > > > > > > > 
> > > > > > > > > > If so, that sounds a reasonable option provided the swap space
> > > > > > > > > > requirement can be made clear to users
> > > > > > > > > 
> > > > > > > > > We can document it.  I don't think it's any worse than xfs_repair being
> > > > > > > > > able to use up all the memory + swap... and since we're probably only
> > > > > > > > > going to be repairing one thing at a time, most likely scrub won't need
> > > > > > > > > as much memory.
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > Right, but as noted below, my concerns with the xfs_repair comparison
> > > > > > > > are that 1.) the kernel generally has more of a limit on anonymous
> > > > > > > > memory allocations than userspace (i.e., not swappable AFAIU?) and 2.)
> > > > > > > > it's not clear how effectively running the system out of memory via the
> > > > > > > > kernel will behave from a failure perspective.
> > > > > > > > 
> > > > > > > > IOW, xfs_repair can run the system out of memory but for the most part
> > > > > > > > that ends up being a simple problem for the system: OOM kill the bloated
> > > > > > > > xfs_repair process. For an online repair in a similar situation, I have
> > > > > > > > no idea what's going to happen.
> > > > > > > 
> > > > > > > Back in the days of the huge linked lists the oom killer would target
> > > > > > > > other processes because it doesn't know that the online repair thread is
> > > > > > > sitting on a ton of pinned kernel memory...
> > > > > > > 
> > > > > > 
> > > > > > Makes sense, kind of what I'd expect...
> > > > > > 
> > > > > > > > The hope is that the online repair hits -ENOMEM and unwinds, but ISTM
> > > > > > > > we'd still be at risk of other subsystems running into memory
> > > > > > > > allocation problems, filling up swap, the OOM killer going after
> > > > > > > > unrelated processes, etc.  What if, for example, the OOM killer starts
> > > > > > > > picking off processes in service to a running online repair that
> > > > > > > > immediately consumes freed up memory until the system is borked?
> > > > > > > 
> > > > > > > Yeah.  One thing we /could/ do is register an oom notifier that would
> > > > > > > urge any running repair threads to bail out if they can.  It seems to me
> > > > > > > that the oom killer blocks on the oom_notify_list chain, so our handler
> > > > > > > could wait until at least one thread exits before returning.
> > > > > > > 
> > > > > > 
> > > > > > Ok, something like that could be useful. I agree that we probably don't
> > > > > > need to go that far until the mechanism is nailed down and testing shows
> > > > > > that OOM is a problem.
> > > > > 
> > > > > It already is a problem on my contrived "2TB hardlink/reflink farm fs" +
> > > > > "400M of RAM and no swap" scenario.  Granted, pretty much every other
> > > > > xfs utility also blows out on that so I'm not sure how hard I really
> > > > > need to try...
> > > > > 
> > > > > > > > I don't know how likely that is or if it really ends up much different
> > > > > > > > from the analogous xfs_repair situation. My only point right now is
> > > > > > > > that failure scenario is something we should explore for any solution
> > > > > > > > we ultimately consider because it may be an unexpected use case of the
> > > > > > > > underlying mechanism.
> > > > > > > 
> > > > > > > Ideally, online repair would always be the victim since we know we have
> > > > > > > a reasonable fallback.  At least for memfd, however, I think the only
> > > > > > > clues we have to decide the question "is this memfd getting in the way
> > > > > > > of other threads?" is either seeing ENOMEM, short writes, or getting
> > > > > > > kicked by an oom notification.  Maybe that'll be enough?
> > > > > > > 
> > > > > > 
> > > > > > Hm, yeah. It may be challenging to track memfd usage as such. If
> > > > > > userspace has access to the fd on an OOM notification or whatever, it
> > > > > > might be able to do more accurate analysis based on an fstat() or
> > > > > > something.
> > > > > > 
> > > > > > Related question... is the online repair sequence currently
> > > > > > interruptible, if xfs_scrub receives a fatal signal while pulling in
> > > > > > entries during an allocbt scan for example?
> > > > > 
> > > > > It's interruptible (fatal signals only) during the scan phase, but once
> > > > > it starts logging metadata updates it will run all the way to
> > > > > completion.
> > > > > 
> > > > > > > > (To the contrary, just using a cached file seems a natural fit from
> > > > > > > > that perspective.)
> > > > > > > 
> > > > > > > Same here.
> > > > > > > 
> > > > > > > > > > and the failure characteristics aren't more severe than for userspace.
> > > > > > > > > > An online repair that puts the broader system at risk of OOM as
> > > > > > > > > > opposed to predictably failing gracefully may not be the most useful
> > > > > > > > > > tool.
> > > > > > > > > 
> > > > > > > > > Agreed.  One huge downside of memfd seems to be the lack of a mechanism
> > > > > > > > > for the vm to push back on us if we successfully write all we need to
> > > > > > > > > the memfd but then other processes need some memory.  Obviously, if the
> > > > > > > > > memfd write itself comes up short or fails then we dump the memfd and
> > > > > > > > > error back to userspace.  We might simply have to free array memory
> > > > > > > > > while we iterate the records to minimize the time spent at peak memory
> > > > > > > > > usage.
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > Hm, yeah. Some kind of fixed/relative size in-core memory pool approach
> > > > > > > > may simplify things because we could allocate it up front and know right
> > > > > > > > away whether we just don't have enough memory available to repair.
> > > > > > > 
> > > > > > > Hmm.  Apparently we actually /can/ call fallocate on memfd to grab all
> > > > > > > the pages at once, provided we have some guesstimate beforehand of how
> > > > > > > much space we think we'll need.
> > > > > > > 
> > > > > > > So long as my earlier statement about the memory requirements being no
> > > > > > > more than the size of the btree leaves is actually true (I haven't
> > > > > > > rigorously tried to prove it), we need about (xrep_calc_ag_resblks() *
> > > > > > > blocksize) worth of space in the memfd file.  Maybe we ask for 1.5x
> > > > > > > that and if we don't get it, we kill the memfd and exit.
> > > > > > > 
> > > > > > 
> > > > > > Indeed. It would be nice if we could do all of the file management bits
> > > > > > in userspace.
> > > > > 
> > > > > Agreed, though no file management would be even better. :)
> > > > > 
> > > > > --D
> > > > > 
> > > > > > Brian
> > > > > > 
> > > > > > > --D
> > > > > > > 
> > > > > > > > 
> > > > > > > > Brian
> > > > > > > > 
> > > > > > > > > --D
> > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > Brian
> > > > > > > > > > 
> > > > > > > > > > > --D
> > > > > > > > > > > 
> > > > > > > > > > > > Brian
> > > > > > > > > > > > 
> > > > > > > > > > > > > --D
> > > > > > > > > > > > > 
> > > > > > > > > > > > > > Brian
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > --D
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > Brian
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > > > +done:
> > > > > > > > > > > > > > > > > > > +	/* Free all the OWN_AG blocks that are not in the rmapbt/agfl. */
> > > > > > > > > > > > > > > > > > > +	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_AG);
> > > > > > > > > > > > > > > > > > > +	return xrep_reap_extents(sc, old_allocbt_blocks, &oinfo,
> > > > > > > > > > > > > > > > > > > +			XFS_AG_RESV_NONE);
> > > > > > > > > > > > > > > > > > > +}
> > > > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > > ...
> > > > > > > > > > > > > > > > > > > diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c
> > > > > > > > > > > > > > > > > > > index 0ed68379e551..82f99633a597 100644
> > > > > > > > > > > > > > > > > > > --- a/fs/xfs/xfs_extent_busy.c
> > > > > > > > > > > > > > > > > > > +++ b/fs/xfs/xfs_extent_busy.c
> > > > > > > > > > > > > > > > > > > @@ -657,3 +657,17 @@ xfs_extent_busy_ag_cmp(
> > > > > > > > > > > > > > > > > > >  		diff = b1->bno - b2->bno;
> > > > > > > > > > > > > > > > > > >  	return diff;
> > > > > > > > > > > > > > > > > > >  }
> > > > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > > > +/* Are there any busy extents in this AG? */
> > > > > > > > > > > > > > > > > > > +bool
> > > > > > > > > > > > > > > > > > > +xfs_extent_busy_list_empty(
> > > > > > > > > > > > > > > > > > > +	struct xfs_perag	*pag)
> > > > > > > > > > > > > > > > > > > +{
> > > > > > > > > > > > > > > > > > > +	spin_lock(&pag->pagb_lock);
> > > > > > > > > > > > > > > > > > > +	if (pag->pagb_tree.rb_node) {
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > RB_EMPTY_ROOT()?
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > Good suggestion, thank you!
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > --D
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > Brian
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > +		spin_unlock(&pag->pagb_lock);
> > > > > > > > > > > > > > > > > > > +		return false;
> > > > > > > > > > > > > > > > > > > +	}
> > > > > > > > > > > > > > > > > > > +	spin_unlock(&pag->pagb_lock);
> > > > > > > > > > > > > > > > > > > +	return true;
> > > > > > > > > > > > > > > > > > > +}
> > > > > > > > > > > > > > > > > > > diff --git a/fs/xfs/xfs_extent_busy.h b/fs/xfs/xfs_extent_busy.h
> > > > > > > > > > > > > > > > > > > index 990ab3891971..2f8c73c712c6 100644
> > > > > > > > > > > > > > > > > > > --- a/fs/xfs/xfs_extent_busy.h
> > > > > > > > > > > > > > > > > > > +++ b/fs/xfs/xfs_extent_busy.h
> > > > > > > > > > > > > > > > > > > @@ -65,4 +65,6 @@ static inline void xfs_extent_busy_sort(struct list_head *list)
> > > > > > > > > > > > > > > > > > >  	list_sort(NULL, list, xfs_extent_busy_ag_cmp);
> > > > > > > > > > > > > > > > > > >  }
> > > > > > > > > > > > > > > > > > >  
> > > > > > > > > > > > > > > > > > > +bool xfs_extent_busy_list_empty(struct xfs_perag *pag);
> > > > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > > >  #endif /* __XFS_EXTENT_BUSY_H__ */
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > > > > > > > > > > > > > > > > > the body of a message to majordomo@vger.kernel.org
> > > > > > > > > > > > > > > > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [PATCH 05/14] xfs: repair free space btrees
  2018-08-11 12:50                                       ` Brian Foster
@ 2018-08-11 15:48                                         ` Darrick J. Wong
  2018-08-13  2:46                                         ` Dave Chinner
  1 sibling, 0 replies; 53+ messages in thread
From: Darrick J. Wong @ 2018-08-11 15:48 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, david, allison.henderson

On Sat, Aug 11, 2018 at 08:50:49AM -0400, Brian Foster wrote:
> On Fri, Aug 10, 2018 at 12:36:51PM -0700, Darrick J. Wong wrote:
> > On Fri, Aug 10, 2018 at 03:07:40PM -0400, Brian Foster wrote:
> > > On Fri, Aug 10, 2018 at 08:39:44AM -0700, Darrick J. Wong wrote:
> > > > On Fri, Aug 10, 2018 at 06:33:52AM -0400, Brian Foster wrote:
> > > > > On Thu, Aug 09, 2018 at 08:59:59AM -0700, Darrick J. Wong wrote:
> > > > > > On Thu, Aug 09, 2018 at 08:00:28AM -0400, Brian Foster wrote:
> > > > > > > On Wed, Aug 08, 2018 at 03:42:32PM -0700, Darrick J. Wong wrote:
> > > > > > > > On Wed, Aug 08, 2018 at 08:29:54AM -0400, Brian Foster wrote:
> > > > > > > > > On Tue, Aug 07, 2018 at 04:34:58PM -0700, Darrick J. Wong wrote:
> > > > > > > > > > On Fri, Aug 03, 2018 at 06:49:40AM -0400, Brian Foster wrote:
> > > > > > > > > > > On Thu, Aug 02, 2018 at 12:22:05PM -0700, Darrick J. Wong wrote:
> > > > > > > > > > > > On Thu, Aug 02, 2018 at 09:48:24AM -0400, Brian Foster wrote:
> > > > > > > > > > > > > On Wed, Aug 01, 2018 at 11:28:45PM -0700, Darrick J. Wong wrote:
> > > > > > > > > > > > > > On Wed, Aug 01, 2018 at 02:39:20PM -0400, Brian Foster wrote:
> > > > > > > > > > > > > > > On Wed, Aug 01, 2018 at 09:23:16AM -0700, Darrick J. Wong wrote:
> > > > > > > > > > > > > > > > On Wed, Aug 01, 2018 at 07:54:09AM -0400, Brian Foster wrote:
> > > > > > > > > > > > > > > > > On Tue, Jul 31, 2018 at 03:01:25PM -0700, Darrick J. Wong wrote:
> > > > > > > > > > > > > > > > > > On Tue, Jul 31, 2018 at 01:47:23PM -0400, Brian Foster wrote:
> > > > > > > > > > > > > > > > > > > On Sun, Jul 29, 2018 at 10:48:21PM -0700, Darrick J. Wong wrote:
> ...
> > > > 
> > > > Hmm, ok, so to summarize, I see five options:
> > > > 
> > > > 1) Pass in a dirfd, repair can internally openat(dirfd, O_TMPFILE...)
> > > > however many files it needs.
> > > > 
> > > > 2) Pass in however many file fds we need and segment the space.
> > > > 
> > > > 3) Pass in a single file fd.
> > > > 
> > > > 4) Let the repair code create as many memfd files as it wants.
> > > > 
> > > > 5) Let the repair code create one memfd file and segment the space.
> > > > 
> > > > I'm pretty sure we don't want to support (2) because that just seems
> > > > like a requirements communication nightmare and can burn up a lot of
> > > > space in struct xfs_scrub_metadata.
> > > > 
> > > > (3) and (5) are basically the same except for where the file comes from.
> > > > For (3) we'd have to make sure the fd filesystem supports large sparse
> > > > files (and presumably isn't the xfs we're trying to repair), which
> > > > shouldn't be too difficult to probe.  For (5) we know that tmpfs already
> > > > supports large sparse files.  Another difficulty might be that on 32-bit
> > > > the page cache only supports offsets as high as (ULONG_MAX * PAGE_SIZE),
> > > > though I suppose at this point we only need two files and 8TB should be
> > > > enough for anyone.
> > > > 
> > > > (I also think it's reasonable to consider not supporting online repair
> > > > on a 32-bit system with a large filesystem...)
> > > > 
> > > > In general, the "pass in a thing from userspace" variants come with the
> > > > complication that we have to check the functionality of whatever gets
> > > > passed in.  On the plus side it likely unlocks access to a lot more
> > > > storage than we could get with mem+swap.  On the minus side someone
> > > > passes in a fd to a drive-managed SMR on USB 2.0, and...
> > > > 
> > > > (1) seems like it would maximize the kernel's flexibility to create as
> > > > many (regular, non-sparse) files as it needs, but now we're calling
> > > > do_sys_open and managing files ourselves, which might be avoided.
> > > > 
> > > > (4) of course is what we do right now. :)
> > > > 
> > > > Soooo... the simplest userspace interface (I think) is to allow
> > > > userspace to pass in a single file fd.  Scrub can reject it if it
> > > > doesn't measure up (fs is the same, sparse not supported, high offsets
> > > > not supported, etc.).  If userspace doesn't pass in an fd then we create
> > > > a memfd and use that instead.  We end up with a hybrid between (3) and (5).
> > > > 
> > > 
> > > That all sounds about right to me except I was thinking userspace would
> > > do the memfd fallback of #5 rather than the kernel, just to keep the
> > > policy out of the kernel as much as possible. Is there any major
> > > advantage to doing it in the kernel? I guess it would slightly
> > > complicate 'xfs_io -c repair' ...
> > 
> > Hm.  We'll have to use one of the reserved areas of struct
> > xfs_scrub_metadata to pass in the file descriptor.  If we create a new
> > XFS_SCRUB_IFLAG_FD flag to indicate that we're passing in a file
> > descriptor then either we lose compatibility with old kernels (because
> > they reject unknown flags) or xfs_scrub will have to try a repair
> > without a fd (to see if the kernel even cares) and retry if the repair
> > fails with some prearranged error code that means "give me a swapfile,
> > please".  Alternately we simply require that the fd cannot be fd 0 since
> > using stdin for swap space is a stupid idea anyways.
> > 
> 
> I'm assuming that the kernel would have some basic checks on the fd to
> ensure it's usable (seekable, large offsets, etc.), as you mentioned
> previously.

Of course. :)
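
A minimal userspace sketch of the kind of basic fd checks discussed
above, for illustration only.  The helper name scratch_fd_usable() and
its fs_dev parameter are hypothetical and not part of this series; the
real checks would live in the kernel, and probing for sparse-file and
large-offset support is deliberately left out here.

```c
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: decide whether a caller-supplied fd looks usable
 * as repair scratch space.  fs_dev is the device of the filesystem
 * being repaired, so that a scratch file living on it can be rejected.
 */
static bool scratch_fd_usable(int fd, dev_t fs_dev)
{
	struct stat sb;

	/* fd 0 is treated as "no fd supplied", per the proposal. */
	if (fd <= 0)
		return false;
	if (fstat(fd, &sb) < 0)
		return false;
	/* Only regular files; pipes, sockets and ttys cannot seek. */
	if (!S_ISREG(sb.st_mode))
		return false;
	/* Don't stage repair data on the filesystem under repair. */
	if (sb.st_dev == fs_dev)
		return false;
	/* Confirm the fd really is seekable. */
	if (lseek(fd, 0, SEEK_CUR) == (off_t)-1)
		return false;
	return true;
}
```

A caller would fstat() the mount point of the filesystem being repaired
to obtain fs_dev before calling the helper.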

> With regard to xfs_scrub_metadata, it sounds like we need to deal with
> that regardless if we want to support the ability to specify an external
> file. Is the issue backwards compatibility with the interface as it
> exists today..?

Yes, my question is how hard do we try to maintain backwards
compatibility with an ioctl that controls an EXPERIMENTAL feature that
is disabled by default in Kconfig? :)

> > Technically we're not supposed to have flag days, but otoh this is a
> > xfs-only ioctl for a feature that's still experimental, so perhaps it's
> > not crucial to maintain compatibility with old kernels where the feature
> > is incomplete and experimental?
> > 
> 
> In my mind, I kind of take the experimental status as all bits/interface
> may explode and are otherwise subject to change or disappear. Perhaps
> others feel differently, it does seem we've kind of hinted towards the
> contrary recently with respect to the per-inode dax bits and then now in
> this discussion, but IMO that's kind of an inherent risk of doing
> incremental work on complex features upstream.
> 
> I dunno, perhaps that's just a misunderstanding on my part. If so, I do
> wonder if we should be a bit more cautious (in the future) about
> exposing interfaces to experimental features (DEBUG mode only, for
> example) for a period of time until the underlying mechanism is fleshed
> out enough to establish confidence in the interface.

That was my reason for hiding it all behind a 'default N' Kconfig
option -- to limit the number of users to those who build their own
kernels.

> It's one thing if an experimental feature is shiny new and potentially
> unstable at the time it is merged, but enough bits are there for
> reviewers to understand the design and interface requirements. It's
> another thing if the implementation is not yet complete, because then
> it's obviously harder to surmise whether the interface is ultimately
> sufficient.

<nod> I decided that if we left experimental warnings in dmesg and
the xfs_scrub output and forced users to rebuild their kernel to turn on
scrub/repair then it was reasonable that we could change the ioctl
interface without worrying too much about backwards compatibility.

I think it's fine to add a 's32 sm_fd' field that can't be zero and can
be picked up by scrub or repair if they want access to more space.
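
As a rough illustration of how userspace might fill in such a field,
here is a sketch that creates a memfd, preallocates it with fallocate,
and stores it in a nonzero fd slot.  Everything named here is an
assumption for the example: struct scrub_req_sketch stands in for the
real ioctl structure (which is not reproduced), and the 1.5x sizing
follows the guess made earlier in the thread rather than any proven
bound.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for the proposed field; not the real struct. */
struct scrub_req_sketch {
	int32_t	sm_fd;		/* 0 means "no scratch fd supplied" */
};

/* Size guess from the thread: 1.5x (reserved blocks * block size). */
static off_t scratch_size_estimate(uint64_t resblks, uint32_t blocksize)
{
	return (off_t)(resblks * blocksize * 3 / 2);
}

/*
 * Create a memfd and reserve 'bytes' of pages up front.  Returns 0 and
 * sets req->sm_fd on success, or a negative errno on failure.
 */
static int scrub_attach_memfd(struct scrub_req_sketch *req, off_t bytes)
{
	int fd, err;

	fd = memfd_create("repair-scratch", MFD_CLOEXEC);
	if (fd < 0)
		return -errno;

	/* sm_fd must be nonzero, so move the memfd off fd 0 if needed. */
	if (fd == 0) {
		fd = fcntl(0, F_DUPFD_CLOEXEC, 1);
		err = errno;
		close(0);
		if (fd < 0)
			return -err;
	}

	/* Grab all the pages now so we fail early, not at peak usage. */
	if (fallocate(fd, 0, 0, bytes) < 0) {
		err = errno;
		close(fd);
		return -err;
	}

	req->sm_fd = fd;
	return 0;
}
```

If the preallocation fails, the caller gets an error before any repair
work has started, which matches the "kill the memfd and exit" idea.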

> This of course is all higher level discussion from how to handle scrub..
> 
> > Hmm.  We could define the fd field with the requirement that fd > 0, and
> > if the repair function requires an fd and one hasn't been provided, it
> > can fail out with ENOMEM.  If it doesn't need extra memory it can just
> > ignore the contents of the fd field.  xfs_scrub can then arrange to pass
> > in mem fds or file fds or whatever.
> > 
> 
> Is there a versioning mechanism to the interface? I thought we used that
> approach (or planned to..) in other similar internal commands, so a
> particular kernel could bump the version and appropriately decide how to
> handle older versions.

There's plenty of space in the structure that's all required to be zero,
so we could easily add a u8 sm_version some day.  The IFLAG bit I
mentioned would be sufficient for the fd field.
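
To make the compatibility trade-off concrete, here is a small sketch of
input validation with a flag bit guarding the fd field.  The flag names,
the struct, and the choice of -EINVAL as the rejection code are all
assumptions for illustration; passing the set of known flags as a
parameter is just a way to model an old kernel and a new kernel with
the same function.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical flag bits, not the real XFS_SCRUB_IFLAG_* values. */
#define SKETCH_IFLAG_REPAIR	(1u << 0)
#define SKETCH_IFLAG_FD		(1u << 1)

/* Hypothetical request layout; sm_fd was previously reserved (zero). */
struct scrub_flags_sketch {
	uint32_t	sm_flags;
	int32_t		sm_fd;
};

/*
 * Validate a request against the flags this "kernel" knows about.
 * Returns 0 if acceptable, -EINVAL otherwise.
 */
static int sketch_validate(const struct scrub_flags_sketch *req,
			   uint32_t known_flags)
{
	/* Unknown flags are rejected, which is what breaks old kernels. */
	if (req->sm_flags & ~known_flags)
		return -EINVAL;
	/* With the flag set, the fd must be a real, nonzero descriptor. */
	if (req->sm_flags & SKETCH_IFLAG_FD)
		return req->sm_fd > 0 ? 0 : -EINVAL;
	/* Without the flag, the field is still reserved and must be zero. */
	return req->sm_fd == 0 ? 0 : -EINVAL;
}
```

Calling sketch_validate() with known_flags set to SKETCH_IFLAG_REPAIR
alone models an old kernel: a request carrying SKETCH_IFLAG_FD is
refused, so xfs_scrub would have to retry without the fd.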

--D

> 
> Brian
> 
> > --D
> > 
> > > Brian
> > > 
> > > > --D
> > > > 
> > > > > Brian
> > > > > 
> > > > > > (In theory the xbitmap could also be converted to use the fixed record
> > > > > > array, but in practice they haven't (yet) become large enough to warrant
> > > > > > it, and there's currently no way to insert or delete records from the
> > > > > > middle of the array.)
> > > > > > 
> > > > > > > > > > > I'm not familiar with memfd. The manpage suggests it's ram backed, is it
> > > > > > > > > > > swappable or something?
> > > > > > > > > > 
> > > > > > > > > > It's supposed to be.  The quick test I ran (allocate a memfd, write 1GB
> > > > > > > > > > of junk to it on a VM with 400M of RAM) seemed to push about 980MB into
> > > > > > > > > > the swap file.
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Ok.
> > > > > > > > > 
> > > > > > > > > > > If so, that sounds a reasonable option provided the swap space
> > > > > > > > > > > requirement can be made clear to users
> > > > > > > > > > 
> > > > > > > > > > We can document it.  I don't think it's any worse than xfs_repair being
> > > > > > > > > > able to use up all the memory + swap... and since we're probably only
> > > > > > > > > > going to be repairing one thing at a time, most likely scrub won't need
> > > > > > > > > > as much memory.
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Right, but as noted below, my concerns with the xfs_repair comparison
> > > > > > > > > are that 1.) the kernel generally has more of a limit on anonymous
> > > > > > > > > memory allocations than userspace (i.e., not swappable AFAIU?) and 2.)
> > > > > > > > > it's not clear how effectively running the system out of memory via the
> > > > > > > > > kernel will behave from a failure perspective.
> > > > > > > > > 
> > > > > > > > > IOW, xfs_repair can run the system out of memory but for the most part
> > > > > > > > > that ends up being a simple problem for the system: OOM kill the bloated
> > > > > > > > > xfs_repair process. For an online repair in a similar situation, I have
> > > > > > > > > no idea what's going to happen.
> > > > > > > > 
> > > > > > > > Back in the days of the huge linked lists the oom killer would target
> > > > > > > > other processes because it doesn't know that the online repair thread is
> > > > > > > > sitting on a ton of pinned kernel memory...
> > > > > > > > 
> > > > > > > 
> > > > > > > Makes sense, kind of what I'd expect...
> > > > > > > 
> > > > > > > > > The hope is that the online repair hits -ENOMEM and unwinds, but ISTM
> > > > > > > > > we'd still be at risk of other subsystems running into memory
> > > > > > > > > allocation problems, filling up swap, the OOM killer going after
> > > > > > > > > unrelated processes, etc.  What if, for example, the OOM killer starts
> > > > > > > > > picking off processes in service to a running online repair that
> > > > > > > > > immediately consumes freed up memory until the system is borked?
> > > > > > > > 
> > > > > > > > Yeah.  One thing we /could/ do is register an oom notifier that would
> > > > > > > > urge any running repair threads to bail out if they can.  It seems to me
> > > > > > > > that the oom killer blocks on the oom_notify_list chain, so our handler
> > > > > > > > could wait until at least one thread exits before returning.
> > > > > > > > 
> > > > > > > 
> > > > > > > Ok, something like that could be useful. I agree that we probably don't
> > > > > > > need to go that far until the mechanism is nailed down and testing shows
> > > > > > > that OOM is a problem.
> > > > > > 
> > > > > > It already is a problem on my contrived "2TB hardlink/reflink farm fs" +
> > > > > > "400M of RAM and no swap" scenario.  Granted, pretty much every other
> > > > > > xfs utility also blows out on that so I'm not sure how hard I really
> > > > > > need to try...
> > > > > > 
> > > > > > > > > I don't know how likely that is or if it really ends up much different
> > > > > > > > > from the analogous xfs_repair situation. My only point right now is
> > > > > > > > > that failure scenario is something we should explore for any solution
> > > > > > > > > we ultimately consider because it may be an unexpected use case of the
> > > > > > > > > underlying mechanism.
> > > > > > > > 
> > > > > > > > Ideally, online repair would always be the victim since we know we have
> > > > > > > > a reasonable fallback.  At least for memfd, however, I think the only
> > > > > > > > clues we have to decide the question "is this memfd getting in the way
> > > > > > > > of other threads?" is either seeing ENOMEM, short writes, or getting
> > > > > > > > kicked by an oom notification.  Maybe that'll be enough?
> > > > > > > > 
> > > > > > > 
> > > > > > > Hm, yeah. It may be challenging to track memfd usage as such. If
> > > > > > > userspace has access to the fd on an OOM notification or whatever, it
> > > > > > > might be able to do more accurate analysis based on an fstat() or
> > > > > > > something.
> > > > > > > 
> > > > > > > Related question... is the online repair sequence currently
> > > > > > > interruptible, if xfs_scrub receives a fatal signal while pulling in
> > > > > > > entries during an allocbt scan for example?
> > > > > > 
> > > > > > It's interruptible (fatal signals only) during the scan phase, but once
> > > > > > it starts logging metadata updates it will run all the way to
> > > > > > completion.
> > > > > > 
> > > > > > > > > (To the contrary, just using a cached file seems a natural fit from
> > > > > > > > > that perspective.)
> > > > > > > > 
> > > > > > > > Same here.
> > > > > > > > 
> > > > > > > > > > > and the failure characteristics aren't more severe than for userspace.
> > > > > > > > > > > An online repair that puts the broader system at risk of OOM as
> > > > > > > > > > > opposed to predictably failing gracefully may not be the most useful
> > > > > > > > > > > tool.
> > > > > > > > > > 
> > > > > > > > > > Agreed.  One huge downside of memfd seems to be the lack of a mechanism
> > > > > > > > > > for the vm to push back on us if we successfully write all we need to
> > > > > > > > > > the memfd but then other processes need some memory.  Obviously, if the
> > > > > > > > > > memfd write itself comes up short or fails then we dump the memfd and
> > > > > > > > > > error back to userspace.  We might simply have to free array memory
> > > > > > > > > > while we iterate the records to minimize the time spent at peak memory
> > > > > > > > > > usage.
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Hm, yeah. Some kind of fixed/relative size in-core memory pool approach
> > > > > > > > > may simplify things because we could allocate it up front and know right
> > > > > > > > > away whether we just don't have enough memory available to repair.
> > > > > > > > 
> > > > > > > > Hmm.  Apparently we actually /can/ call fallocate on memfd to grab all
> > > > > > > > the pages at once, provided we have some guesstimate beforehand of how
> > > > > > > > much space we think we'll need.
> > > > > > > > 
> > > > > > > > So long as my earlier statement about the memory requirements being no
> > > > > > > > more than the size of the btree leaves is actually true (I haven't
> > > > > > > > rigorously tried to prove it), we need about (xrep_calc_ag_resblks() *
> > > > > > > > blocksize) worth of space in the memfd file.  Maybe we ask for 1.5x
> > > > > > > > that and if we don't get it, we kill the memfd and exit.
> > > > > > > > 
> > > > > > > 
> > > > > > > Indeed. It would be nice if we could do all of the file management bits
> > > > > > > in userspace.
> > > > > > 
> > > > > > Agreed, though no file management would be even better. :)
> > > > > > 
> > > > > > --D
> > > > > > 
> > > > > > > Brian
> > > > > > > 
> > > > > > > > --D
> > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Brian
> > > > > > > > > 
> > > > > > > > > > --D
> > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > Brian
> > > > > > > > > > > 
> > > > > > > > > > > > --D
> > > > > > > > > > > > 
> > > > > > > > > > > > > Brian
> > > > > > > > > > > > > 
> > > > > > > > > > > > > > --D
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Brian
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > --D
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > Brian
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > > > > +done:
> > > > > > > > > > > > > > > > > > > > +	/* Free all the OWN_AG blocks that are not in the rmapbt/agfl. */
> > > > > > > > > > > > > > > > > > > > +	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_AG);
> > > > > > > > > > > > > > > > > > > > +	return xrep_reap_extents(sc, old_allocbt_blocks, &oinfo,
> > > > > > > > > > > > > > > > > > > > +			XFS_AG_RESV_NONE);
> > > > > > > > > > > > > > > > > > > > +}
> > > > > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > > > ...
> > > > > > > > > > > > > > > > > > > > diff --git a/fs/xfs/xfs_extent_busy.c b/fs/xfs/xfs_extent_busy.c
> > > > > > > > > > > > > > > > > > > > index 0ed68379e551..82f99633a597 100644
> > > > > > > > > > > > > > > > > > > > --- a/fs/xfs/xfs_extent_busy.c
> > > > > > > > > > > > > > > > > > > > +++ b/fs/xfs/xfs_extent_busy.c
> > > > > > > > > > > > > > > > > > > > @@ -657,3 +657,17 @@ xfs_extent_busy_ag_cmp(
> > > > > > > > > > > > > > > > > > > >  		diff = b1->bno - b2->bno;
> > > > > > > > > > > > > > > > > > > >  	return diff;
> > > > > > > > > > > > > > > > > > > >  }
> > > > > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > > > > +/* Are there any busy extents in this AG? */
> > > > > > > > > > > > > > > > > > > > +bool
> > > > > > > > > > > > > > > > > > > > +xfs_extent_busy_list_empty(
> > > > > > > > > > > > > > > > > > > > +	struct xfs_perag	*pag)
> > > > > > > > > > > > > > > > > > > > +{
> > > > > > > > > > > > > > > > > > > > +	spin_lock(&pag->pagb_lock);
> > > > > > > > > > > > > > > > > > > > +	if (pag->pagb_tree.rb_node) {
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > RB_EMPTY_ROOT()?
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > Good suggestion, thank you!
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > --D
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > Brian
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > +		spin_unlock(&pag->pagb_lock);
> > > > > > > > > > > > > > > > > > > > +		return false;
> > > > > > > > > > > > > > > > > > > > +	}
> > > > > > > > > > > > > > > > > > > > +	spin_unlock(&pag->pagb_lock);
> > > > > > > > > > > > > > > > > > > > +	return true;
> > > > > > > > > > > > > > > > > > > > +}
> > > > > > > > > > > > > > > > > > > > diff --git a/fs/xfs/xfs_extent_busy.h b/fs/xfs/xfs_extent_busy.h
> > > > > > > > > > > > > > > > > > > > index 990ab3891971..2f8c73c712c6 100644
> > > > > > > > > > > > > > > > > > > > --- a/fs/xfs/xfs_extent_busy.h
> > > > > > > > > > > > > > > > > > > > +++ b/fs/xfs/xfs_extent_busy.h
> > > > > > > > > > > > > > > > > > > > @@ -65,4 +65,6 @@ static inline void xfs_extent_busy_sort(struct list_head *list)
> > > > > > > > > > > > > > > > > > > >  	list_sort(NULL, list, xfs_extent_busy_ag_cmp);
> > > > > > > > > > > > > > > > > > > >  }
> > > > > > > > > > > > > > > > > > > >  
> > > > > > > > > > > > > > > > > > > > +bool xfs_extent_busy_list_empty(struct xfs_perag *pag);
> > > > > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > > > >  #endif /* __XFS_EXTENT_BUSY_H__ */
> > > > > > > > > > > > > > > > > > > > 

* Re: [PATCH 05/14] xfs: repair free space btrees
  2018-08-11 12:50                                       ` Brian Foster
  2018-08-11 15:48                                         ` Darrick J. Wong
@ 2018-08-13  2:46                                         ` Dave Chinner
  1 sibling, 0 replies; 53+ messages in thread
From: Dave Chinner @ 2018-08-13  2:46 UTC (permalink / raw)
  To: Brian Foster; +Cc: Darrick J. Wong, linux-xfs, allison.henderson

On Sat, Aug 11, 2018 at 08:50:49AM -0400, Brian Foster wrote:
> On Fri, Aug 10, 2018 at 12:36:51PM -0700, Darrick J. Wong wrote:
> > Technically we're not supposed to have flag days, but otoh this is a
> > xfs-only ioctl for a feature that's still experimental, so perhaps it's
> > not crucial to maintain compatibility with old kernels where the feature
> > is incomplete and experimental?
> > 
> 
> In my mind, I kind of take the experimental status as all bits/interface
> may explode and are otherwise subject to change or disappear. Perhaps
> others feel differently, it does seem we've kind of hinted towards the
> contrary recently with respect to the per-inode dax bits and then now in
> this discussion, but IMO that's kind of an inherent risk of doing
> incremental work on complex features upstream.

I've always considered that the experimental tag covers the
user/ioctl interfaces as much as it does the functionality and
on-disk format. i.e. like the on-disk format, the ioctl interfaces
are subject to change until we clear the exp. tag, at which point
they are essentially fixed forever. We /try/ not to have to change
them after the initial merge, but sometimes we screw up and need to
fix them before we commit to long term support.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 06/14] xfs: repair inode btrees
  2018-08-02 14:54   ` Brian Foster
@ 2018-11-06  2:16     ` Darrick J. Wong
  0 siblings, 0 replies; 53+ messages in thread
From: Darrick J. Wong @ 2018-11-06  2:16 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, david, allison.henderson

On Thu, Aug 02, 2018 at 10:54:03AM -0400, Brian Foster wrote:
> On Sun, Jul 29, 2018 at 10:48:28PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Use the rmapbt to find inode chunks, query the chunks to compute
> > hole and free masks, and with that information rebuild the inobt
> > and finobt.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/Makefile              |    1 
> >  fs/xfs/scrub/common.c        |    2 
> >  fs/xfs/scrub/ialloc_repair.c |  673 ++++++++++++++++++++++++++++++++++++++++++
> >  fs/xfs/scrub/repair.c        |   20 +
> >  fs/xfs/scrub/repair.h        |   11 +
> >  fs/xfs/scrub/scrub.c         |    4 
> >  fs/xfs/scrub/scrub.h         |    1 
> >  fs/xfs/scrub/trace.h         |    4 
> >  8 files changed, 712 insertions(+), 4 deletions(-)
> >  create mode 100644 fs/xfs/scrub/ialloc_repair.c
> > 
> > 
> ...
> > diff --git a/fs/xfs/scrub/ialloc_repair.c b/fs/xfs/scrub/ialloc_repair.c
> > new file mode 100644
> > index 000000000000..126135c1a147
> > --- /dev/null
> > +++ b/fs/xfs/scrub/ialloc_repair.c
> > @@ -0,0 +1,673 @@
> ...
> > +
> > +/*
> > + * Inode Btree Repair
> > + * ==================
> > + *
> > + * A quick refresher of inode btrees on a v5 filesystem:
> > + *
> > + * - Each inode btree record can describe a single 'inode chunk'.  The chunk
> > + *   size is defined to be 64 inodes.  If sparse inodes are enabled, every
> > + *   inobt record must be aligned to the chunk size.  A chunk can be smaller
> > + *   than a fs block.  One must be careful with 64k-block filesystems whose
> > + *   inodes are smaller than 1k.
> > + *
> > + * - Inode buffers are read into memory in units of 'inode clusters'.  However
> > + *   many inodes fit in a cluster buffer is the smallest number of inodes that
> > + *   can be allocated or freed.  Clusters are never larger than a chunk and
> > + *   never smaller than a fs block.  If sparse inodes are not enabled, then
> > + *   records can be aligned to a cluster.
> > + *
> 

It's been a while since this discussion trailed off.  Let's see how much
of it I remember.  I've been revising this repair function here and
there since August. :)

> I find the wording around alignment in the above two sections a little
> confusing. We distinguish between sparse=0/1 on some points but not
> others, like the cluster buffer being the smallest possible allocation
> unit of inodes, but IIUC that is only the case with sparse=1.
> 
> My general understanding is that inode records should always be aligned
> to sb_inoalignmt, regardless of sparse inodes. For non-sparse
> filesystems, this value can be smaller than the chunk size. For sparse
> filesystems, it must match the chunk size and sb_spino_align defines the
> sparse chunk allocation alignment, which must match the cluster size.

Yeah, the whole section was unclear and downright wrong.  I forgot that
it was possible for a single inode cluster to be mapped by multiple
inobt records, with the result that the whole thing breaks badly on 64k
block filesystems.  I've revised the comment considerably:

/*
 * A quick refresher of inode btrees on a v5 filesystem:
 *
 * - Inode records are read into memory in units of 'inode clusters'.  However
 *   many inodes fit in a cluster buffer is the smallest number of inodes that
 *   can be allocated or freed.  Clusters are never smaller than one fs block
 *   though they can span multiple blocks.  The size (in fs blocks) is
 *   computed with xfs_icluster_size_fsb().  The fs block alignment of a
 *   cluster is computed with xfs_ialloc_cluster_alignment().
 *
 * - Each inode btree record can describe a single 'inode chunk'.  The chunk
 *   size is defined to be 64 inodes.  If sparse inodes are enabled, every
 *   inobt record must be aligned to the chunk size; if not, every record must
 *   be aligned to the start of a cluster.  It is possible to construct an XFS
 *   geometry where one inobt record maps to multiple inode clusters; it is
 *   also possible to construct a geometry where multiple inobt records
 *   map to different parts of one inode cluster.
 *
 * - If sparse inodes are not enabled, the smallest unit of allocation for
 *   inode records is enough to contain one inode chunk's worth of inodes.
 *
 * - If sparse inodes are enabled, the holemask field will be active.  Each
 *   bit of the holemask represents 4 potential inodes; if set, the
 *   corresponding space does *not* contain inodes and must be left alone.
 *   Clusters cannot be smaller than 4 inodes.  The smallest unit of allocation
 *   of inode records is one inode cluster.
 */
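To make the two "it is possible to construct a geometry" cases concrete,
here is a minimal userspace sketch, not kernel code.  struct geom and the
helper names are hypothetical, and the block, inode, and cluster sizes are
illustrative values rather than anything mkfs is guaranteed to produce.

```c
#include <assert.h>
#include <stdint.h>

#define INODES_PER_CHUNK 64u	/* one inobt record describes 64 inodes */

/* Hypothetical, simplified geometry; not the kernel's struct xfs_mount. */
struct geom {
	uint32_t block_size;	/* fs block size, in bytes */
	uint32_t inode_size;	/* on-disk inode size, in bytes */
	uint32_t cluster_blocks;	/* fs blocks per inode cluster, >= 1 */
};

static uint32_t div_round_up(uint32_t a, uint32_t b)
{
	return (a + b - 1) / b;
}

/* Number of inodes held by one inode cluster buffer. */
static uint32_t inodes_per_cluster(const struct geom *g)
{
	assert(g->inode_size > 0 && g->cluster_blocks > 0);
	assert(g->block_size % g->inode_size == 0);
	return (g->block_size / g->inode_size) * g->cluster_blocks;
}

/* How many inobt records map into a single cluster. */
static uint32_t records_per_cluster(const struct geom *g)
{
	return div_round_up(inodes_per_cluster(g), INODES_PER_CHUNK);
}

/* How many clusters a single inobt record spans. */
static uint32_t clusters_per_record(const struct geom *g)
{
	return div_round_up(INODES_PER_CHUNK, inodes_per_cluster(g));
}
```

With 64k blocks, 512-byte inodes, and a one-block cluster, a cluster
holds 128 inodes, so two inobt records land in one cluster.  With 4k
blocks, 512-byte inodes, and a four-block cluster, a cluster holds 32
inodes, so one record spans two clusters.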


> > + * - If sparse inodes are enabled, the holemask field will be active.  Each
> > + *   bit of the holemask represents 4 potential inodes; if set, the
> > + *   corresponding space does *not* contain inodes and must be left alone.
> > + *
> > + * So what's the rebuild algorithm?
> > + *
> > + * Iterate the reverse mapping records looking for OWN_INODES and OWN_INOBT
> > + * records.  The OWN_INOBT records are the old inode btree blocks and will be
> > + * cleared out after we've rebuilt the tree.  Each possible inode chunk within

"Each possible inode cluster..."

> > + * an OWN_INODES record will be read in and the freemask calculated from the

It isn't enough to iterate each possible *cluster* of an OWN_INODES record;
we also have to iterate each possible *chunk* of each of those clusters.

> > + * i_mode data in the inode chunk.  For sparse inodes the holemask will be
> > + * calculated by creating the properly aligned inobt record and punching out
> > + * any chunk that's missing.  Inode allocations and frees grab the AGI first,
> > + * so repair protects itself from concurrent access by locking the AGI.
> > + *
> > + * Once we've reconstructed all the inode records, we can create new inode
> > + * btree roots and reload the btrees.  We rebuild both inode trees at the same
> > + * time because they have the same rmap owner and it would be more complex to
> > + * figure out if the other tree isn't in need of a rebuild and which OWN_INOBT
> > + * blocks it owns.  We have all the data we need to build both, so dump
> > + * everything and start over.
> > + *
> > + * We use the prefix 'xrep_ibt' because we rebuild both inode btrees.
> > + */
> > +
> > +struct xrep_ibt_extent {
> > +	struct list_head	list;
> > +	xfs_inofree_t		freemask;
> > +	xfs_agino_t		startino;
> > +	unsigned int		count;
> > +	unsigned int		usedcount;
> > +	uint16_t		holemask;
> 
> I'm curious why we wouldn't just reuse xfs_inobt_rec_incore here.

I think we can reuse that here.  The ir_freecount field can be computed
via hweight64(ir_free).
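A rough userspace sketch of that computation follows.  It assumes, as
the XFS_INOBT_ALL_FREE initialization in the patch suggests, that the
free-mask bits covering sparse holes stay set, so they have to be masked
off before counting.  popcount64() stands in for the kernel's
hweight64(), and the other helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define HOLEMASK_BITS		16
#define INODES_PER_HOLEMASK_BIT	4	/* 64 inodes / 16 holemask bits */

/* Portable population count; the kernel would use hweight64(). */
static unsigned int popcount64(uint64_t v)
{
	unsigned int n = 0;

	while (v) {
		v &= v - 1;	/* clear the lowest set bit */
		n++;
	}
	return n;
}

/* Expand a 16-bit holemask into a 64-bit mask of inodes that exist. */
static uint64_t holemask_to_allocmask(uint16_t holemask)
{
	uint64_t allocmask = 0;
	int bit;

	for (bit = 0; bit < HOLEMASK_BITS; bit++) {
		if (!(holemask & (1u << bit)))
			allocmask |= UINT64_C(0xF) <<
					(bit * INODES_PER_HOLEMASK_BIT);
	}
	return allocmask;
}

/* Free inodes are the free bits that do not fall inside a hole. */
static unsigned int rec_freecount(uint64_t free, uint16_t holemask)
{
	return popcount64(free & holemask_to_allocmask(holemask));
}

/* The old 'usedcount' falls out of count and the derived freecount. */
static unsigned int rec_usedcount(unsigned int count, uint64_t free,
		uint16_t holemask)
{
	unsigned int freecount = rec_freecount(free, holemask);

	assert(freecount <= count);
	return count - freecount;
}
```

With that, the separate count/usedcount bookkeeping in xrep_ibt_extent
would no longer need to be carried around by hand.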

> > +};
> > +
> ...
> > +
> > +/*
> > + * For each inode cluster covering the physical extent recorded by the rmapbt,
> > + * we must calculate the properly aligned startino of that cluster, then
> > + * iterate each cluster to fill in used and filled masks appropriately.  We
> > + * then use the (startino, used, filled) information to construct the
> > + * appropriate inode records.
> > + */
> > +STATIC int
> > +xrep_ibt_process_cluster(
> > +	struct xrep_ibt		*ri,
> > +	xfs_agblock_t		agbno,
> > +	int			blks_per_cluster,
> > +	xfs_agino_t		rec_agino)
> > +{
> > +	struct xfs_imap		imap;
> > +	struct xrep_ibt_extent	*rie;
> > +	struct xfs_dinode	*dip;
> > +	struct xfs_buf		*bp;
> > +	struct xfs_scrub	*sc = ri->sc;
> > +	struct xfs_mount	*mp = sc->mp;
> > +	xfs_ino_t		fsino;
> > +	xfs_inofree_t		usedmask;
> > +	xfs_agino_t		nr_inodes;
> > +	xfs_agino_t		startino;
> > +	xfs_agino_t		clusterino;
> > +	xfs_agino_t		clusteroff;
> > +	xfs_agino_t		agino;
> > +	uint16_t		fillmask;
> > +	bool			inuse;
> > +	int			usedcount;
> > +	int			error;
> > +
> > +	/* The per-AG inum of this inode cluster. */
> > +	agino = XFS_OFFBNO_TO_AGINO(mp, agbno, 0);
> > +
> > +	/* The per-AG inum of the inobt record. */
> > +	startino = rec_agino + rounddown(agino - rec_agino,
> > +			XFS_INODES_PER_CHUNK);
> 
> Hmm, I'm not following what this does. When does startino != rec_agino
> here? Is this related to the multi-chunk-per-block case on large block
> sizes, since I'm not quite following how we handle that case either...?
> Don't we need to factor in inodes_per_block somewhere in here to cover
> the multi-chunk case?
> 
> BTW, that second line could use another indent or two to clarify it's
> part of the rounddown() call.

Not sure how much of a reply I can make to this other than to say that
I refactored this whole section to accommodate a second-level loop that
covers each possible inode chunk within an inode cluster.

I also found this all very confusing and rewrote it; hopefully the new
version will be easier to understand.
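The shape of that double loop is roughly as follows.  This is only a
sketch under simplifying assumptions, not the rewritten function: it is
plain userspace C, the agino of a block is taken to be agbno times
inodes-per-block (which matches XFS_OFFBNO_TO_AGINO only for
power-of-two geometries), and walk_inode_extent(), piece_fn, and
count_visit() are made-up names.

```c
#include <assert.h>
#include <stdint.h>

#define INODES_PER_CHUNK 64u

/* Hypothetical accumulator used by the example visitor below. */
struct walk_stats {
	uint32_t pieces;	/* chunk-sized (or smaller) pieces visited */
	uint32_t inodes;	/* total inodes covered */
};

/* Hypothetical visitor: called once per piece of an inode cluster. */
typedef void (*piece_fn)(uint32_t agino, uint32_t nr_inodes, void *priv);

static void count_visit(uint32_t agino, uint32_t nr_inodes, void *priv)
{
	struct walk_stats *st = priv;

	(void)agino;
	st->pieces++;
	st->inodes += nr_inodes;
}

/*
 * Outer loop: every cluster covered by the rmap extent.
 * Inner loop: every chunk-sized piece within that cluster.
 */
static void walk_inode_extent(uint32_t startblock, uint32_t blockcount,
		uint32_t blks_per_cluster, uint32_t inodes_per_block,
		piece_fn fn, void *priv)
{
	uint32_t inodes_per_cluster = blks_per_cluster * inodes_per_block;
	uint32_t step = inodes_per_cluster < INODES_PER_CHUNK ?
			inodes_per_cluster : INODES_PER_CHUNK;
	uint32_t agbno, off;

	assert(blks_per_cluster > 0 && inodes_per_block > 0);
	assert(blockcount % blks_per_cluster == 0);
	assert(inodes_per_cluster % step == 0);

	for (agbno = startblock; agbno < startblock + blockcount;
	     agbno += blks_per_cluster) {
		uint32_t cluster_agino = agbno * inodes_per_block;

		for (off = 0; off < inodes_per_cluster; off += step)
			fn(cluster_agino + off, step, priv);
	}
}
```

When a cluster holds 128 inodes the inner loop runs twice per cluster;
when it holds 32 the inner loop runs once and the pieces from adjacent
clusters have to be accumulated into the same inobt record.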

> > +
> > +	/* The per-AG inum of the cluster within the inobt record. */
> > +	clusteroff = agino - startino;
> > +
> > +	/* Every inode in this holemask slot is filled. */
> > +	nr_inodes = XFS_OFFBNO_TO_AGINO(mp, blks_per_cluster, 0);
> > +	fillmask = xfs_inobt_maskn(clusteroff / XFS_INODES_PER_HOLEMASK_BIT,
> > +			nr_inodes / XFS_INODES_PER_HOLEMASK_BIT);
> > +
> > +	/*
> > +	 * Grab the inode cluster buffer.  This is safe to do with a broken
> > +	 * inobt because imap_to_bp directly maps the buffer without touching
> > +	 * either inode btree.
> > +	 */
> > +	imap.im_blkno = XFS_AGB_TO_DADDR(mp, sc->sa.agno, agbno);
> > +	imap.im_len = XFS_FSB_TO_BB(mp, blks_per_cluster);
> > +	imap.im_boffset = 0;
> > +	error = xfs_imap_to_bp(mp, sc->tp, &imap, &dip, &bp, 0,
> > +			XFS_IGET_UNTRUSTED);
> > +	if (error)
> > +		return error;
> > +
> > +	usedmask = 0;
> > +	usedcount = 0;
> > +	/* Which inodes within this cluster are free? */
> > +	for (clusterino = 0; clusterino < nr_inodes; clusterino++) {
> > +		fsino = XFS_AGINO_TO_INO(mp, sc->sa.agno, agino + clusterino);
> > +		error = xrep_ibt_check_free(sc, bp, fsino,
> > +				clusterino, &inuse);
> > +		if (error) {
> > +			xfs_trans_brelse(sc->tp, bp);
> > +			return error;
> > +		}
> > +		if (inuse) {
> > +			usedcount++;
> > +			usedmask |= XFS_INOBT_MASK(clusteroff + clusterino);
> > +		}
> > +	}
> > +	xfs_trans_brelse(sc->tp, bp);
> > +
> > +	/*
> > +	 * If the last item in the list is our chunk record,
> > +	 * update that.
> > +	 */
> > +	if (!list_empty(ri->extlist)) {
> > +		rie = list_last_entry(ri->extlist, struct xrep_ibt_extent,
> > +				list);
> > +		if (rie->startino + XFS_INODES_PER_CHUNK > startino) {
> > +			rie->freemask &= ~usedmask;
> > +			rie->holemask &= ~fillmask;
> > +			rie->count += nr_inodes;
> > +			rie->usedcount += usedcount;
> > +			return 0;
> > +		}
> 
> And I think if we used the existing in-core record data structure we
> could also reuse existing helpers like __xfs_inobt_rec_merge().

Hmm.  The initialization would still need to be hardcoded since there
isn't a helper for that, and I'm not sure if allocating another incore
rec is worth the trouble to save four lines of code.

> Alternatively, could we allocate/lookup the xrep_ibt_extent earlier and
> update the associated fields directly rather than via the indirection of
> the various local vars?

Yeah, I think this is possible too.

> BTW, I initially thought this was a sparse inode thing but I see a bit
> further down that we process a cluster at a time regardless. That seems
> Ok, but I do wonder if some of this list hackery and whatnot could be
> simplified by walking the clusters here. I guess we'd still need to
> account for separate rmapbt records for sparse chunks, however.

Yep, that's why we need the double-loop accumulated-inobt-record clunkery.

> > +	}
> > +
> > +	/* New inode chunk; add to the list. */
> > +	rie = kmem_alloc(sizeof(struct xrep_ibt_extent), KM_MAYFAIL);
> > +	if (!rie)
> > +		return -ENOMEM;
> > +
> > +	INIT_LIST_HEAD(&rie->list);
> > +	rie->startino = startino;
> > +	rie->freemask = XFS_INOBT_ALL_FREE & ~usedmask;
> > +	rie->holemask = XFS_INOBT_ALL_FREE & ~fillmask;
> 
> I'm not sure we need the ALL_FREE thing here..? We don't use it in the
> update case above. (Though it would make sense if we allocated this
> structure earlier and initialized it.)

Yes, we've possibly initialized the structure earlier, so that's why we
set the entire free/hole mask and then clear bits out of them as we
discover inodes that are actually present.

(Note that all the list handling crap falls out with the conversion to
the big memory array.)
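As an illustration of the "start with everything free and everything a
hole, then clear bits" approach, here is a small userspace sketch.  The
struct and helper names are hypothetical; maskn() is meant to behave
like the xfs_inobt_maskn() call in the quoted patch but is written from
scratch here.

```c
#include <assert.h>
#include <stdint.h>

#define INODES_PER_HOLEMASK_BIT	4u

/* Hypothetical pair of masks for one 64-inode chunk. */
struct chunk_masks {
	uint64_t freemask;	/* bit set: inode is free (or absent) */
	uint16_t holemask;	/* bit set: these 4 inodes are a hole */
};

/* Build a mask of n consecutive bits starting at bit 'first'. */
static uint64_t maskn(unsigned int first, unsigned int n)
{
	uint64_t m;

	assert(n >= 1 && first + n <= 64);
	m = (n == 64) ? UINT64_MAX : ((UINT64_C(1) << n) - 1);
	return m << first;
}

/* Start pessimistic: nothing is present, nothing is in use. */
static void chunk_masks_init(struct chunk_masks *cm)
{
	cm->freemask = UINT64_MAX;
	cm->holemask = UINT16_MAX;
}

/* Inodes [first, first + n) physically exist; clear their hole bits. */
static void chunk_masks_fill(struct chunk_masks *cm, unsigned int first,
		unsigned int n)
{
	assert(first % INODES_PER_HOLEMASK_BIT == 0);
	assert(n % INODES_PER_HOLEMASK_BIT == 0);
	cm->holemask &= (uint16_t)~maskn(first / INODES_PER_HOLEMASK_BIT,
					 n / INODES_PER_HOLEMASK_BIT);
}

/* Inode 'ino' (0-63 within the chunk) is in use; clear its free bit. */
static void chunk_masks_mark_used(struct chunk_masks *cm, unsigned int ino)
{
	cm->freemask &= ~maskn(ino, 1);
}
```

Because the update path only ever clears bits, the same code works
whether the record was just created or is being revisited for a later
cluster.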

> > +	rie->count = nr_inodes;
> > +	rie->usedcount = usedcount;
> > +	list_add_tail(&rie->list, ri->extlist);
> > +	ri->nr_records++;
> > +
> > +	return 0;
> > +}
> > +
> > +/* Record extents that belong to inode btrees. */
> > +STATIC int
> > +xrep_ibt_walk_rmap(
> > +	struct xfs_btree_cur	*cur,
> > +	struct xfs_rmap_irec	*rec,
> > +	void			*priv)
> > +{
> > +	struct xrep_ibt		*ri = priv;
> > +	struct xfs_mount	*mp = cur->bc_mp;
> > +	xfs_fsblock_t		fsbno;
> > +	xfs_agblock_t		agbno = rec->rm_startblock;
> > +	xfs_agino_t		inoalign;
> > +	xfs_agino_t		agino;
> > +	xfs_agino_t		rec_agino;
> > +	int			blks_per_cluster;
> > +	int			error = 0;
> > +
> > +	if (xchk_should_terminate(ri->sc, &error))
> > +		return error;
> > +
> > +	/* Fragment of the old btrees; dispose of them later. */
> > +	if (rec->rm_owner == XFS_RMAP_OWN_INOBT) {
> > +		fsbno = XFS_AGB_TO_FSB(mp, ri->sc->sa.agno, agbno);
> > +		return xfs_bitmap_set(ri->btlist, fsbno, rec->rm_blockcount);
> > +	}
> > +
> > +	/* Skip extents that do not contain inodes. */
> > +	if (rec->rm_owner != XFS_RMAP_OWN_INODES)
> > +		return 0;
> > +
> > +	blks_per_cluster = xfs_icluster_size_fsb(mp);
> > +
> > +	if (agbno % blks_per_cluster != 0)
> > +		return -EFSCORRUPTED;
> > +
> 
> Ok, so we check that agbno is at least cluster aligned...
> 
> Shouldn't we verify that blockcount is sane as well?

Yes, fixed.

> > +	trace_xrep_ibt_walk_rmap(mp, ri->sc->sa.agno, rec->rm_startblock,
> > +			rec->rm_blockcount, rec->rm_owner, rec->rm_offset,
> > +			rec->rm_flags);
> > +
> > +	/*
> > +	 * Determine the inode block alignment, and where the block
> > +	 * ought to start if it's aligned properly.  On a sparse inode
> > +	 * system the rmap doesn't have to start on an alignment boundary,
> > +	 * but the record does.  On pre-sparse filesystems, we /must/
> > +	 * start both rmap and inobt on an alignment boundary.
> > +	 */
> > +	inoalign = xfs_ialloc_cluster_alignment(mp);
> > +	agino = XFS_OFFBNO_TO_AGINO(mp, agbno, 0);
> > +	rec_agino = XFS_OFFBNO_TO_AGINO(mp, rounddown(agbno, inoalign), 0);
> > +	if (!xfs_sb_version_hassparseinodes(&mp->m_sb) && agino != rec_agino)
> > +		return -EFSCORRUPTED;
> > +
> 
> ... then if I follow correctly, verify the block is aligned
> appropriately on !sparse. Firstly, isn't the above logically equivalent
> to the following? E.g., I'm not sure why we need agino here.
> 
> 	if (!sparse && (agbno % inoalign != 0))
> 		return -EFSCORRUPTED;
> 
> I take it that since we're walking the rmap, agbno could refer to a
> sparse cluster. Perhaps we should also check against sb_spino_align in
> the sparse case. FWIW, I think the comment above could be more clear as
> well:
> 
> /*
>  * On a sparse inode fs, agbno could refer to a partial chunk. This
>  * should be aligned to the sparse chunk alignment. On a non-sparse fs,
>  * agbno must always refer to the first block of an inode chunk and so
>  * should be chunk aligned.

Ok.  I'll update it to check sparse chunk alignments.

>  */
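One way the combined blockcount and alignment checks could look is
sketched below.  It is a guess at the shape rather than the revised
patch: struct ino_geom and inode_extent_is_sane() are hypothetical, and
chunk_align and sparse_align merely stand in for values that would be
derived from sb_inoalignmt and sb_spino_align.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical geometry summary; all lengths are in fs blocks. */
struct ino_geom {
	bool		sparse;			/* sparse inodes enabled? */
	uint32_t	blks_per_cluster;
	uint32_t	chunk_align;		/* non-sparse record alignment */
	uint32_t	sparse_align;		/* sparse chunk alignment */
};

/* Decide whether an OWN_INODES rmap extent is plausible. */
static bool inode_extent_is_sane(const struct ino_geom *g, uint32_t agbno,
		uint32_t blockcount)
{
	uint32_t align = g->sparse ? g->sparse_align : g->chunk_align;

	assert(g->blks_per_cluster > 0 && align > 0);

	/* The extent must cover a whole number of inode clusters. */
	if (blockcount == 0 || blockcount % g->blks_per_cluster != 0)
		return false;
	if (agbno % g->blks_per_cluster != 0)
		return false;

	/*
	 * Sparse: agbno may point at a partial chunk, so only the
	 * sparse alignment applies.  Non-sparse: agbno must be the
	 * first block of a chunk.
	 */
	return agbno % align == 0;
}
```

A failed check would map to -EFSCORRUPTED in the rmap walker.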
> 
> > +	/*
> > +	 * Set up the free/hole masks for each inode cluster that could be
> > +	 * mapped by this rmap record.
> > +	 */
> > +	for (;
> > +	     agbno < rec->rm_startblock + rec->rm_blockcount;
> > +	     agbno += blks_per_cluster) {
> > +		error = xrep_ibt_process_cluster(ri, agbno, blks_per_cluster,
> > +				rec_agino);
> > +		if (error)
> > +			return error;
> > +	}
> 
> Hmm, Ok. We're processing inodes a cluster size at a time regardless of
> the extent length. That makes sense since we presumably need to read and
> process the inode cluster itself.

<nod>

> > +
> > +	return 0;
> > +}
> > +
> ...
> > +/* Build new inode btrees and dispose of the old one. */
> > +STATIC int
> > +xrep_ibt_rebuild_trees(
> > +	struct xfs_scrub	*sc,
> > +	struct list_head	*inode_records,
> > +	struct xfs_owner_info	*oinfo,
> > +	struct xfs_bitmap	*old_iallocbt_blocks)
> > +{
> > +	struct xrep_ibt_extent	*rie;
> > +	struct xrep_ibt_extent	*n;
> > +	int			error;
> > +
> > +	/* Add all records. */
> > +	list_sort(NULL, inode_records, xrep_ibt_extent_cmp);
> > +	list_for_each_entry_safe(rie, n, inode_records, list) {
> > +		error = xrep_ibt_insert_rec(sc, rie);
> > +		if (error)
> > +			return error;
> > +
> > +		list_del(&rie->list);
> > +		kmem_free(rie);
> > +	}
> 
> Same general thoughts here around freeing old blocks and whatnot as for
> the allocbt repairs. Though I assume if we end up tweaking that behavior
> we'll do so across the board.

Er... yes. :)

--D

> Brian
> 
> > +
> > +	/* Free the old inode btree blocks if they're not in use. */
> > +	return xrep_reap_extents(sc, old_iallocbt_blocks, oinfo,
> > +			XFS_AG_RESV_NONE);
> > +}
> > +
> > +/*
> > + * Make our new inode btree roots permanent so that we can start re-adding
> > + * inode records back into the AG.
> > + */
> > +STATIC int
> > +xrep_ibt_commit_new(
> > +	struct xfs_scrub	*sc,
> > +	struct xfs_bitmap	*old_iallocbt_blocks,
> > +	int			log_flags)
> > +{
> > +	int			error;
> > +
> > +	xfs_ialloc_log_agi(sc->tp, sc->sa.agi_bp, log_flags);
> > +
> > +	/* Invalidate all the inobt/finobt blocks in btlist. */
> > +	error = xrep_invalidate_blocks(sc, old_iallocbt_blocks);
> > +	if (error)
> > +		return error;
> > +	error = xrep_roll_ag_trans(sc);
> > +	if (error)
> > +		return error;
> > +
> > +	/*
> > +	 * Now that we've succeeded, mark the incore state valid again.  If the
> > +	 * finobt is enabled, make sure we reinitialize the per-AG reservations
> > +	 * when we're done.
> > +	 */
> > +	sc->sa.pag->pagi_init = 1;
> > +	if (xfs_sb_version_hasfinobt(&sc->mp->m_sb))
> > +		sc->reset_perag_resv = true;
> > +	return 0;
> > +}
> > +
> > +/* Repair both inode btrees. */
> > +int
> > +xrep_iallocbt(
> > +	struct xfs_scrub	*sc)
> > +{
> > +	struct xfs_owner_info	oinfo;
> > +	struct list_head	inode_records;
> > +	struct xfs_bitmap	old_iallocbt_blocks;
> > +	struct xfs_mount	*mp = sc->mp;
> > +	int			log_flags = 0;
> > +	int			error = 0;
> > +
> > +	/* We require the rmapbt to rebuild anything. */
> > +	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
> > +		return -EOPNOTSUPP;
> > +
> > +	xchk_perag_get(sc->mp, &sc->sa);
> > +
> > +	/* Collect the inode records and find the old btree blocks. */
> > +	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_INOBT);
> > +	INIT_LIST_HEAD(&inode_records);
> > +	xfs_bitmap_init(&old_iallocbt_blocks);
> > +	error = xrep_ibt_find_inodes(sc, &inode_records, &old_iallocbt_blocks);
> > +	if (error)
> > +		goto out;
> > +
> > +	/*
> > +	 * Blow out the old inode btrees.  This is the point at which
> > +	 * we are no longer able to bail out gracefully.
> > +	 */
> > +	error = xrep_ibt_reset_counters(sc, &inode_records, &log_flags);
> > +	if (error)
> > +		goto out;
> > +	error = xrep_ibt_reset_btrees(sc, &oinfo, &log_flags);
> > +	if (error)
> > +		goto out;
> > +	error = xrep_ibt_commit_new(sc, &old_iallocbt_blocks, log_flags);
> > +	if (error)
> > +		goto out;
> > +
> > +	/* Now rebuild the inode information. */
> > +	error = xrep_ibt_rebuild_trees(sc, &inode_records, &oinfo,
> > +			&old_iallocbt_blocks);
> > +	if (error)
> > +		goto out;
> > +out:
> > +	xrep_ibt_cancel_inorecs(&inode_records);
> > +	xfs_bitmap_destroy(&old_iallocbt_blocks);
> > +	return error;
> > +}
> > diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
> > index 17cf48564390..a44deb6f06ab 100644
> > --- a/fs/xfs/scrub/repair.c
> > +++ b/fs/xfs/scrub/repair.c
> > @@ -880,3 +880,23 @@ xrep_ino_dqattach(
> >  
> >  	return error;
> >  }
> > +
> > +/*
> > + * Reinitialize the per-AG block reservation for the AG we just fixed.
> > + */
> > +int
> > +xrep_reset_perag_resv(
> > +	struct xfs_scrub	*sc)
> > +{
> > +	int			error;
> > +
> > +	ASSERT(sc->ops->type == ST_PERAG);
> > +	ASSERT(sc->tp);
> > +
> > +	error = xfs_ag_resv_free(sc->sa.pag);
> > +	if (error)
> > +		goto out;
> > +	error = xfs_ag_resv_init(sc->sa.pag, sc->tp);
> > +out:
> > +	return error;
> > +}
> > diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
> > index bc1a5f1cbcdc..0cc53dee3228 100644
> > --- a/fs/xfs/scrub/repair.h
> > +++ b/fs/xfs/scrub/repair.h
> > @@ -53,6 +53,7 @@ int xrep_find_ag_btree_roots(struct xfs_scrub *sc, struct xfs_buf *agf_bp,
> >  		struct xrep_find_ag_btree *btree_info, struct xfs_buf *agfl_bp);
> >  void xrep_force_quotacheck(struct xfs_scrub *sc, uint dqtype);
> >  int xrep_ino_dqattach(struct xfs_scrub *sc);
> > +int xrep_reset_perag_resv(struct xfs_scrub *sc);
> >  
> >  /* Metadata repairers */
> >  
> > @@ -62,6 +63,7 @@ int xrep_agf(struct xfs_scrub *sc);
> >  int xrep_agfl(struct xfs_scrub *sc);
> >  int xrep_agi(struct xfs_scrub *sc);
> >  int xrep_allocbt(struct xfs_scrub *sc);
> > +int xrep_iallocbt(struct xfs_scrub *sc);
> >  
> >  #else
> >  
> > @@ -83,12 +85,21 @@ xrep_calc_ag_resblks(
> >  	return 0;
> >  }
> >  
> > +static inline int
> > +xrep_reset_perag_resv(
> > +	struct xfs_scrub	*sc)
> > +{
> > +	ASSERT(0);
> > +	return -EOPNOTSUPP;
> > +}
> > +
> >  #define xrep_probe			xrep_notsupported
> >  #define xrep_superblock			xrep_notsupported
> >  #define xrep_agf			xrep_notsupported
> >  #define xrep_agfl			xrep_notsupported
> >  #define xrep_agi			xrep_notsupported
> >  #define xrep_allocbt			xrep_notsupported
> > +#define xrep_iallocbt			xrep_notsupported
> >  
> >  #endif /* CONFIG_XFS_ONLINE_REPAIR */
> >  
> > diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
> > index 2133a3199372..631b0b06db99 100644
> > --- a/fs/xfs/scrub/scrub.c
> > +++ b/fs/xfs/scrub/scrub.c
> > @@ -244,14 +244,14 @@ static const struct xchk_meta_ops meta_scrub_ops[] = {
> >  		.type	= ST_PERAG,
> >  		.setup	= xchk_setup_ag_iallocbt,
> >  		.scrub	= xchk_inobt,
> > -		.repair	= xrep_notsupported,
> > +		.repair	= xrep_iallocbt,
> >  	},
> >  	[XFS_SCRUB_TYPE_FINOBT] = {	/* finobt */
> >  		.type	= ST_PERAG,
> >  		.setup	= xchk_setup_ag_iallocbt,
> >  		.scrub	= xchk_finobt,
> >  		.has	= xfs_sb_version_hasfinobt,
> > -		.repair	= xrep_notsupported,
> > +		.repair	= xrep_iallocbt,
> >  	},
> >  	[XFS_SCRUB_TYPE_RMAPBT] = {	/* rmapbt */
> >  		.type	= ST_PERAG,
> > diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h
> > index af323b229c4b..762db46fd696 100644
> > --- a/fs/xfs/scrub/scrub.h
> > +++ b/fs/xfs/scrub/scrub.h
> > @@ -64,6 +64,7 @@ struct xfs_scrub {
> >  	uint				ilock_flags;
> >  	bool				try_harder;
> >  	bool				has_quotaofflock;
> > +	bool				reset_perag_resv;
> >  
> >  	/* State tracking for single-AG operations. */
> >  	struct xchk_ag			sa;
> > diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
> > index 26bd5dc68efe..9126dc66f726 100644
> > --- a/fs/xfs/scrub/trace.h
> > +++ b/fs/xfs/scrub/trace.h
> > @@ -552,7 +552,7 @@ DEFINE_EVENT(xrep_rmap_class, name, \
> >  		 uint64_t owner, uint64_t offset, unsigned int flags), \
> >  	TP_ARGS(mp, agno, agbno, len, owner, offset, flags))
> >  DEFINE_REPAIR_RMAP_EVENT(xrep_abt_walk_rmap);
> > -DEFINE_REPAIR_RMAP_EVENT(xrep_ialloc_extent_fn);
> > +DEFINE_REPAIR_RMAP_EVENT(xrep_ibt_walk_rmap);
> >  DEFINE_REPAIR_RMAP_EVENT(xrep_rmap_extent_fn);
> >  DEFINE_REPAIR_RMAP_EVENT(xrep_bmap_extent_fn);
> >  
> > @@ -700,7 +700,7 @@ TRACE_EVENT(xrep_reset_counters,
> >  		  MAJOR(__entry->dev), MINOR(__entry->dev))
> >  )
> >  
> > -TRACE_EVENT(xrep_ialloc_insert,
> > +TRACE_EVENT(xrep_ibt_insert,
> >  	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
> >  		 xfs_agino_t startino, uint16_t holemask, uint8_t count,
> >  		 uint8_t freecount, uint64_t freemask),
> > 

Thread overview: 53+ messages
2018-07-30  5:47 [PATCH v17.1 00/14] xfs-4.19: online repair support Darrick J. Wong
2018-07-30  5:47 ` [PATCH 01/14] xfs: refactor the xrep_extent_list into xfs_bitmap Darrick J. Wong
2018-07-30 16:21   ` Brian Foster
2018-07-30  5:48 ` [PATCH 02/14] xfs: repair the AGF Darrick J. Wong
2018-07-30 16:22   ` Brian Foster
2018-07-30 17:31     ` Darrick J. Wong
2018-07-30 18:19       ` Brian Foster
2018-07-30 18:22         ` Darrick J. Wong
2018-07-30 18:33           ` Brian Foster
2018-07-30  5:48 ` [PATCH 03/14] xfs: repair the AGFL Darrick J. Wong
2018-07-30 16:25   ` Brian Foster
2018-07-30 17:22     ` Darrick J. Wong
2018-07-31 15:10       ` Brian Foster
2018-08-07 22:02         ` Darrick J. Wong
2018-08-08 12:09           ` Brian Foster
2018-08-08 21:26             ` Darrick J. Wong
2018-08-09 11:14               ` Brian Foster
2018-07-30  5:48 ` [PATCH 04/14] xfs: repair the AGI Darrick J. Wong
2018-07-30 18:20   ` Brian Foster
2018-07-30 18:44     ` Darrick J. Wong
2018-07-30  5:48 ` [PATCH 05/14] xfs: repair free space btrees Darrick J. Wong
2018-07-31 17:47   ` Brian Foster
2018-07-31 22:01     ` Darrick J. Wong
2018-08-01 11:54       ` Brian Foster
2018-08-01 16:23         ` Darrick J. Wong
2018-08-01 18:39           ` Brian Foster
2018-08-02  6:28             ` Darrick J. Wong
2018-08-02 13:48               ` Brian Foster
2018-08-02 19:22                 ` Darrick J. Wong
2018-08-03 10:49                   ` Brian Foster
2018-08-07 23:34                     ` Darrick J. Wong
2018-08-08 12:29                       ` Brian Foster
2018-08-08 22:42                         ` Darrick J. Wong
2018-08-09 12:00                           ` Brian Foster
2018-08-09 15:59                             ` Darrick J. Wong
2018-08-10 10:33                               ` Brian Foster
2018-08-10 15:39                                 ` Darrick J. Wong
2018-08-10 19:07                                   ` Brian Foster
2018-08-10 19:36                                     ` Darrick J. Wong
2018-08-11 12:50                                       ` Brian Foster
2018-08-11 15:48                                         ` Darrick J. Wong
2018-08-13  2:46                                         ` Dave Chinner
2018-07-30  5:48 ` [PATCH 06/14] xfs: repair inode btrees Darrick J. Wong
2018-08-02 14:54   ` Brian Foster
2018-11-06  2:16     ` Darrick J. Wong
2018-07-30  5:48 ` [PATCH 07/14] xfs: repair refcount btrees Darrick J. Wong
2018-07-30  5:48 ` [PATCH 08/14] xfs: repair inode records Darrick J. Wong
2018-07-30  5:48 ` [PATCH 09/14] xfs: zap broken inode forks Darrick J. Wong
2018-07-30  5:49 ` [PATCH 10/14] xfs: repair inode block maps Darrick J. Wong
2018-07-30  5:49 ` [PATCH 11/14] xfs: repair damaged symlinks Darrick J. Wong
2018-07-30  5:49 ` [PATCH 12/14] xfs: repair extended attributes Darrick J. Wong
2018-07-30  5:49 ` [PATCH 13/14] xfs: scrub should set preen if attr leaf has holes Darrick J. Wong
2018-07-30  5:49 ` [PATCH 14/14] xfs: repair quotas Darrick J. Wong