* [PATCH 0/9] xfsprogs: big, broken filesystems cause pain
@ 2015-12-21 21:37 Dave Chinner
  2015-12-21 21:37 ` [PATCH 1/9] metadump: clean up btree block region zeroing Dave Chinner
                   ` (8 more replies)
  0 siblings, 9 replies; 21+ messages in thread
From: Dave Chinner @ 2015-12-21 21:37 UTC (permalink / raw)
  To: xfs

Hi folks,

This is a work-in-progress patchset that I've been spending the last
week on, trying to get xfs_repair to run cleanly through a busted
30TB filesystem image. The first 2 patches were needed just to get
metadump to create the filesystem image; the third is helpful in
telling me exactly how much of the 38GB of metadata has been
restored.

The next two patches parallelise parts of the repair process;
uncertain inode processing in phase 3 was taking more than 20
minutes, and phase 7 was taking almost 2 hours. Both are trivially
parallelisable - phase 3 is now down to under 5 minutes, but I
haven't fully tested the phase 7 code because I haven't managed to
get a full repair of the original image past phase 6 since I wrote
this patch. I have run it through xfstests many times, but that's
not the same as having it process and correct the link counts on
several million inodes....
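
The pattern in both patches is the same: each AG's work is
independent of every other AG's, so the AGs can simply be handed out
to a pool of worker threads. As a rough standalone sketch of that
pattern (made-up names and a trivial hand-rolled dispatcher, not the
actual repair/threads.h work queue API):

#include <pthread.h>
#include <stdio.h>

#define NUM_AGS         16
#define NUM_THREADS     4

/* next AG to hand out; this counter is the only shared state */
static pthread_mutex_t  ag_lock = PTHREAD_MUTEX_INITIALIZER;
static int              next_agno;

/* stand-in for per-AG repair work (e.g. link count verification) */
static void
process_ag(int agno)
{
        printf("worker %lu processing ag %d\n",
               (unsigned long)pthread_self(), agno);
}

static void *
worker(void *arg)
{
        (void)arg;
        for (;;) {
                int agno;

                /* grab the next unprocessed AG */
                pthread_mutex_lock(&ag_lock);
                agno = next_agno++;
                pthread_mutex_unlock(&ag_lock);

                if (agno >= NUM_AGS)
                        return NULL;
                process_ag(agno);
        }
}

int
main(void)
{
        pthread_t       threads[NUM_THREADS];
        int             i;

        for (i = 0; i < NUM_THREADS; i++)
                pthread_create(&threads[i], NULL, worker, NULL);
        for (i = 0; i < NUM_THREADS; i++)
                pthread_join(threads[i], NULL);
        return 0;
}

The real patches use the existing create_work_queue()/queue_work()
infrastructure from repair/threads.h rather than hand-rolling the
dispatch like this.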

Patch 6 was the first crash problem I fixed - this is a 17 year old
bug in the directory code, and it will also need to be fixed in the
kernel.

Patches 7-9 fix the major problem that was causing issues - the
cache's handling of buffers that were dirty but still corrupt.
xfs_repair doesn't fix all the problems in a buffer in a single pass
- it may make modifications in early phases and then use those
modifications to trigger specific repairs in later phases. However,
when you have 38GB of metadata to check and correct, the buffer
cache is not going to hold all these buffers, and so the reclaim
algorithms are going to have an impact.

That impact was pretty bad - the partially corrected buffers were
being tossed away because their write verifiers were failing, and
hence they never made it to disk. When a later phase re-read such a
buffer, it pulled the original uncorrected, corrupt blocks back in
from disk, and so phases 5, 6 and 7 were tripping over corruptions
that were assumed to be fixed, causing random memory corruptions,
use-after-frees, etc.
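
To make the failure mode concrete, here's a toy model of the old
release path (invented structures and names, nothing like the real
libxfs buffer code, but it shows how partial repairs get lost):

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* toy model of one cached metadata buffer; not the real xfs_buf */
struct buf {
        char    data[16];
        bool    dirty;
        bool    cached;
};

/* the on-disk copy of the block */
static char     disk[16] = "corrupt";

/* write verifier: only fully repaired contents may hit the disk */
static bool
write_verifier_ok(struct buf *bp)
{
        return strcmp(bp->data, "fixed") == 0;
}

/* old behaviour: flush on release, then recycle the buffer whether
 * or not the write succeeded - partial repairs are lost */
static void
release_buf(struct buf *bp)
{
        if (bp->dirty && write_verifier_ok(bp)) {
                memcpy(disk, bp->data, sizeof(disk));
                bp->dirty = false;
        }
        bp->cached = false;     /* contents thrown away regardless */
}

static void
read_buf(struct buf *bp)
{
        if (!bp->cached) {
                memcpy(bp->data, disk, sizeof(bp->data));
                bp->cached = true;
        }
}

int
main(void)
{
        struct buf      bp = { .cached = false };

        read_buf(&bp);
        strcpy(bp.data, "part-fixed");  /* early phase repairs some of it */
        bp.dirty = true;

        release_buf(&bp);       /* verifier rejects it, repairs discarded */
        read_buf(&bp);          /* later phase pulls stale data from disk */

        printf("later phase sees: %s\n", bp.data);      /* "corrupt" */
        return 0;
}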

These three patches are a pretty nasty hack to keep the dirty
buffers around until they are fully repaired. The whole userspace
libxfs buffer cache is really showing its limitations here: it
doesn't scale effectively, it doesn't isolate operations between
independent threads (i.e. per-ag threads), it doesn't handle dirty
objects or writeback failures sanely, and it has an overly
complex cache abstraction that has only one user. Ultimately, we need
to rewrite it from scratch, but in the meantime we need to make
repair actually complete properly; hence these patches to hack
the necessary fixes into it.

With these, repair is getting deep into phase 6 on the original
image before failing while moving an inode to lost+found, because the
inode has a mismatch between the bmbt size and the number of records
supposedly in the bmbt. That's a new failure I haven't seen before,
so there are still more fixes to come....

-Dave.


* [PATCH 1/9] metadump: clean up btree block region zeroing
  2015-12-21 21:37 [PATCH 0/9] xfsprogs: big, broken filesystems cause pain Dave Chinner
@ 2015-12-21 21:37 ` Dave Chinner
  2016-01-04 19:11   ` Brian Foster
  2015-12-21 21:37 ` [PATCH 2/9] metadump: bounds check btree block regions being zeroed Dave Chinner
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 21+ messages in thread
From: Dave Chinner @ 2015-12-21 21:37 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

Abstract out all the common operations in the btree block zeroing
so that we only have one copy of offset/len calculations and don't
require lots of casts for the pointer arithmetic.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 db/metadump.c | 61 +++++++++++++++++++++++------------------------------------
 1 file changed, 24 insertions(+), 37 deletions(-)

diff --git a/db/metadump.c b/db/metadump.c
index 8455fdd..a185da5 100644
--- a/db/metadump.c
+++ b/db/metadump.c
@@ -258,8 +258,8 @@ zero_btree_node(
 	xfs_inobt_key_t		*ikp;
 	xfs_alloc_ptr_t		*app;
 	xfs_alloc_key_t		*akp;
-	void			*zp1, *zp2;
-	int			zlen1, zlen2;
+	char			*zp1, *zp2;
+	char			*key_end;
 
 	nrecs = be16_to_cpu(block->bb_numrecs);
 
@@ -268,43 +268,36 @@ zero_btree_node(
 	case TYP_BMAPBTD:
 		bkp = XFS_BMBT_KEY_ADDR(mp, block, 1);
 		bpp = XFS_BMBT_PTR_ADDR(mp, block, 1, mp->m_bmap_dmxr[1]);
-		zp1 = &bkp[nrecs];
-		zlen1 = (char *)&bpp[0] - (char *)&bkp[nrecs];
-		zp2 = &bpp[nrecs];
-		zlen2 = (char *)block + mp->m_sb.sb_blocksize -
-							(char *)&bpp[nrecs];
+		zp1 = (char *)&bkp[nrecs];
+		zp2 = (char *)&bpp[nrecs];
+		key_end = (char *)bpp;
 		break;
 	case TYP_INOBT:
 	case TYP_FINOBT:
 		ikp = XFS_INOBT_KEY_ADDR(mp, block, 1);
 		ipp = XFS_INOBT_PTR_ADDR(mp, block, 1, mp->m_inobt_mxr[1]);
-		zp1 = &ikp[nrecs];
-		zlen1 = (char *)&ipp[0] - (char *)&ikp[nrecs];
-		zp2 = &ipp[nrecs];
-		zlen2 = (char *)block + mp->m_sb.sb_blocksize -
-							(char *)&ipp[nrecs];
+		zp1 = (char *)&ikp[nrecs];
+		zp2 = (char *)&ipp[nrecs];
+		key_end = (char *)ipp;
 		break;
 	case TYP_BNOBT:
 	case TYP_CNTBT:
 		akp = XFS_ALLOC_KEY_ADDR(mp, block, 1);
 		app = XFS_ALLOC_PTR_ADDR(mp, block, 1, mp->m_alloc_mxr[1]);
-		zp1 = &akp[nrecs];
-		zlen1 = (char *)&app[0] - (char *)&akp[nrecs];
-		zp2 = &app[nrecs];
-		zlen2 = (char *)block + mp->m_sb.sb_blocksize -
-							(char *)&app[nrecs];
+		zp1 = (char *)&akp[nrecs];
+		zp2 = (char *)&app[nrecs];
+		key_end = (char *)app;
 		break;
 	default:
-		zp1 = NULL;
-		break;
+		return;
 	}
 
-	if (zp1 && zp2) {
-		/* Zero from end of keys to beginning of pointers */
-		memset(zp1, 0, zlen1);
-		/* Zero from end of pointers to end of block */
-		memset(zp2, 0, zlen2);
-	}
+
+	/* Zero from end of keys to beginning of pointers */
+	memset(zp1, 0, key_end - zp1);
+
+	/* Zero from end of pointers to end of block */
+	memset(zp2, 0, (char *)block + mp->m_sb.sb_blocksize - zp2);
 }
 
 static void
@@ -316,8 +309,7 @@ zero_btree_leaf(
 	struct xfs_bmbt_rec	*brp;
 	struct xfs_inobt_rec	*irp;
 	struct xfs_alloc_rec	*arp;
-	void			*zp;
-	int			zlen;
+	char			*zp;
 
 	nrecs = be16_to_cpu(block->bb_numrecs);
 
@@ -325,29 +317,24 @@ zero_btree_leaf(
 	case TYP_BMAPBTA:
 	case TYP_BMAPBTD:
 		brp = XFS_BMBT_REC_ADDR(mp, block, 1);
-		zp = &brp[nrecs];
-		zlen = (char *)block + mp->m_sb.sb_blocksize - (char *)&brp[nrecs];
+		zp = (char *)&brp[nrecs];
 		break;
 	case TYP_INOBT:
 	case TYP_FINOBT:
 		irp = XFS_INOBT_REC_ADDR(mp, block, 1);
-		zp = &irp[nrecs];
-		zlen = (char *)block + mp->m_sb.sb_blocksize - (char *)&irp[nrecs];
+		zp = (char *)&irp[nrecs];
 		break;
 	case TYP_BNOBT:
 	case TYP_CNTBT:
 		arp = XFS_ALLOC_REC_ADDR(mp, block, 1);
-		zp = &arp[nrecs];
-		zlen = (char *)block + mp->m_sb.sb_blocksize - (char *)&arp[nrecs];
+		zp = (char *)&arp[nrecs];
 		break;
 	default:
-		zp = NULL;
-		break;
+		return;
 	}
 
 	/* Zero from end of records to end of block */
-	if (zp && zlen < mp->m_sb.sb_blocksize)
-		memset(zp, 0, zlen);
+	memset(zp, 0, (char *)block + mp->m_sb.sb_blocksize - zp);
 }
 
 static void
-- 
2.5.0


* [PATCH 2/9] metadump: bounds check btree block regions being zeroed
  2015-12-21 21:37 [PATCH 0/9] xfsprogs: big, broken filesystems cause pain Dave Chinner
  2015-12-21 21:37 ` [PATCH 1/9] metadump: clean up btree block region zeroing Dave Chinner
@ 2015-12-21 21:37 ` Dave Chinner
  2016-01-04 19:11   ` Brian Foster
  2015-12-21 21:37 ` [PATCH 3/9] xfs_mdrestore: correctly account bytes read Dave Chinner
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 21+ messages in thread
From: Dave Chinner @ 2015-12-21 21:37 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

Arkadiusz Miskiewicz reported that metadump was crashing on one of
his corrupted filesystems, and the trace indicated that it was
zeroing unused regions in inode btree blocks when it failed. The
btree block had a corrupt nrecs field, which was resulting in an out
of bounds memset() occurring.  Ensure that the region being
generated for zeroing is within bounds before executing the zeroing.

Reported-by: Arkadiusz Miskiewicz <arekm@maven.pl>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 db/metadump.c | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/db/metadump.c b/db/metadump.c
index a185da5..1769fdf 100644
--- a/db/metadump.c
+++ b/db/metadump.c
@@ -246,6 +246,11 @@ write_buf(
 	return seenint() ? -EINTR : 0;
 }
 
+/*
+ * We could be processing a corrupt block, so we can't trust any of
+ * the offsets or lengths to be within the buffer range. Hence check
+ * carefully!
+ */
 static void
 zero_btree_node(
 	struct xfs_btree_block	*block,
@@ -262,10 +267,15 @@ zero_btree_node(
 	char			*key_end;
 
 	nrecs = be16_to_cpu(block->bb_numrecs);
+	if (nrecs < 0)
+		return;
 
 	switch (btype) {
 	case TYP_BMAPBTA:
 	case TYP_BMAPBTD:
+		if (nrecs > mp->m_bmap_dmxr[1])
+			return;
+
 		bkp = XFS_BMBT_KEY_ADDR(mp, block, 1);
 		bpp = XFS_BMBT_PTR_ADDR(mp, block, 1, mp->m_bmap_dmxr[1]);
 		zp1 = (char *)&bkp[nrecs];
@@ -274,6 +284,9 @@ zero_btree_node(
 		break;
 	case TYP_INOBT:
 	case TYP_FINOBT:
+		if (nrecs > mp->m_inobt_mxr[1])
+			return;
+
 		ikp = XFS_INOBT_KEY_ADDR(mp, block, 1);
 		ipp = XFS_INOBT_PTR_ADDR(mp, block, 1, mp->m_inobt_mxr[1]);
 		zp1 = (char *)&ikp[nrecs];
@@ -282,6 +295,9 @@ zero_btree_node(
 		break;
 	case TYP_BNOBT:
 	case TYP_CNTBT:
+		if (nrecs > mp->m_alloc_mxr[1])
+			return;
+
 		akp = XFS_ALLOC_KEY_ADDR(mp, block, 1);
 		app = XFS_ALLOC_PTR_ADDR(mp, block, 1, mp->m_alloc_mxr[1]);
 		zp1 = (char *)&akp[nrecs];
@@ -300,6 +316,11 @@ zero_btree_node(
 	memset(zp2, 0, (char *)block + mp->m_sb.sb_blocksize - zp2);
 }
 
+/*
+ * We could be processing a corrupt block, so we can't trust any of
+ * the offsets or lengths to be within the buffer range. Hence check
+ * carefully!
+ */
 static void
 zero_btree_leaf(
 	struct xfs_btree_block	*block,
@@ -312,20 +333,31 @@ zero_btree_leaf(
 	char			*zp;
 
 	nrecs = be16_to_cpu(block->bb_numrecs);
+	if (nrecs < 0)
+		return;
 
 	switch (btype) {
 	case TYP_BMAPBTA:
 	case TYP_BMAPBTD:
+		if (nrecs > mp->m_bmap_dmxr[1])
+			return;
+
 		brp = XFS_BMBT_REC_ADDR(mp, block, 1);
 		zp = (char *)&brp[nrecs];
 		break;
 	case TYP_INOBT:
 	case TYP_FINOBT:
+		if (nrecs > mp->m_inobt_mxr[1])
+			return;
+
 		irp = XFS_INOBT_REC_ADDR(mp, block, 1);
 		zp = (char *)&irp[nrecs];
 		break;
 	case TYP_BNOBT:
 	case TYP_CNTBT:
+		if (nrecs > mp->m_alloc_mxr[1])
+			return;
+
 		arp = XFS_ALLOC_REC_ADDR(mp, block, 1);
 		zp = (char *)&arp[nrecs];
 		break;
-- 
2.5.0


* [PATCH 3/9] xfs_mdrestore: correctly account bytes read
  2015-12-21 21:37 [PATCH 0/9] xfsprogs: big, broken filesystems cause pain Dave Chinner
  2015-12-21 21:37 ` [PATCH 1/9] metadump: clean up btree block region zeroing Dave Chinner
  2015-12-21 21:37 ` [PATCH 2/9] metadump: bounds check btree block regions being zeroed Dave Chinner
@ 2015-12-21 21:37 ` Dave Chinner
  2016-01-04 19:12   ` Brian Foster
  2015-12-21 21:37 ` [PATCH 4/9] repair: parallelise phase 7 Dave Chinner
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 21+ messages in thread
From: Dave Chinner @ 2015-12-21 21:37 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

Progress indication comes in the form of an "X MB read" output. This
doesn't match up with the actual number of bytes read from the
metadump file because it only accounts for header blocks in the file,
not the actual metadata blocks that are restored. Hence the number
reported is usually much lower than the size of the metadump file,
and hence it's impossible to use to guage progress of the restore.

While there, fix the progress output so that it overwrites the
previous progress output line correctly.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 mdrestore/xfs_mdrestore.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mdrestore/xfs_mdrestore.c b/mdrestore/xfs_mdrestore.c
index ebc5e54..70a160c 100644
--- a/mdrestore/xfs_mdrestore.c
+++ b/mdrestore/xfs_mdrestore.c
@@ -133,7 +133,7 @@ perform_restore(
 
 	for (;;) {
 		if (show_progress && (bytes_read & ((1 << 20) - 1)) == 0)
-			print_progress("%lld MB read\n", bytes_read >> 20);
+			print_progress("%lld MB read", bytes_read >> 20);
 
 		for (cur_index = 0; cur_index < mb_count; cur_index++) {
 			if (pwrite64(dst_fd, &block_buffer[cur_index <<
@@ -160,7 +160,7 @@ perform_restore(
 								1, src_f) != 1)
 			fatal("error reading from file: %s\n", strerror(errno));
 
-		bytes_read += block_size;
+		bytes_read += block_size + (mb_count << tmb.mb_blocklog);
 	}
 
 	if (progress_since_warning)
-- 
2.5.0


* [PATCH 4/9] repair: parallelise phase 7
  2015-12-21 21:37 [PATCH 0/9] xfsprogs: big, broken filesystems cause pain Dave Chinner
                   ` (2 preceding siblings ...)
  2015-12-21 21:37 ` [PATCH 3/9] xfs_mdrestore: correctly account bytes read Dave Chinner
@ 2015-12-21 21:37 ` Dave Chinner
  2016-01-04 19:12   ` Brian Foster
  2015-12-21 21:37 ` [PATCH 5/9] repair: parallelise uncertain inode processing in phase 3 Dave Chinner
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 21+ messages in thread
From: Dave Chinner @ 2015-12-21 21:37 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

Phase 7 operates on a single AG at a time, sequentially, doing inode updates.
All the data structures accessed and modified are per AG, as are the
modifications to on-disk structures. Hence we can run this phase concurrently
across multiple AGs.

This is important for large, broken filesystem repairs, where there can be
millions of inodes that need link counts updated. One such repair image takes
more than 45 minutes to run phase 7 as a single threaded operation.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 repair/phase7.c     | 77 ++++++++++++++++++++++++++++++++++-------------------
 repair/progress.c   |  4 +--
 repair/protos.h     |  2 +-
 repair/xfs_repair.c |  2 +-
 4 files changed, 54 insertions(+), 31 deletions(-)

diff --git a/repair/phase7.c b/repair/phase7.c
index b1e3a55..91dad02 100644
--- a/repair/phase7.c
+++ b/repair/phase7.c
@@ -26,6 +26,7 @@
 #include "dinode.h"
 #include "versions.h"
 #include "progress.h"
+#include "threads.h"
 
 /* dinoc is a pointer to the IN-CORE dinode core */
 static void
@@ -108,45 +109,67 @@ update_inode_nlinks(
 	IRELE(ip);
 }
 
-void
-phase7(xfs_mount_t *mp)
+/*
+ * for each ag, look at each inode 1 at a time. If the number of
+ * links is bad, reset it, log the inode core, commit the transaction
+ */
+static void
+do_link_updates(
+	struct work_queue	*wq,
+	xfs_agnumber_t		agno,
+	void			*arg)
 {
 	ino_tree_node_t		*irec;
-	int			i;
 	int			j;
 	__uint32_t		nrefs;
 
+	irec = findfirst_inode_rec(agno);
+
+	while (irec != NULL)  {
+		for (j = 0; j < XFS_INODES_PER_CHUNK; j++)  {
+			ASSERT(is_inode_confirmed(irec, j));
+
+			if (is_inode_free(irec, j))
+				continue;
+
+			ASSERT(no_modify || is_inode_reached(irec, j));
+
+			nrefs = num_inode_references(irec, j);
+			ASSERT(no_modify || nrefs > 0);
+
+			if (get_inode_disk_nlinks(irec, j) != nrefs)
+				update_inode_nlinks(wq->mp,
+					XFS_AGINO_TO_INO(wq->mp, agno,
+						irec->ino_startnum + j),
+					nrefs);
+		}
+		irec = next_ino_rec(irec);
+	}
+
+	PROG_RPT_INC(prog_rpt_done[agno], 1);
+}
+
+void
+phase7(
+	struct xfs_mount	*mp,
+	int			scan_threads)
+{
+	struct work_queue	wq;
+	int			agno;
+
 	if (!no_modify)
 		do_log(_("Phase 7 - verify and correct link counts...\n"));
 	else
 		do_log(_("Phase 7 - verify link counts...\n"));
 
-	/*
-	 * for each ag, look at each inode 1 at a time. If the number of
-	 * links is bad, reset it, log the inode core, commit the transaction
-	 */
-	for (i = 0; i < glob_agcount; i++)  {
-		irec = findfirst_inode_rec(i);
-
-		while (irec != NULL)  {
-			for (j = 0; j < XFS_INODES_PER_CHUNK; j++)  {
-				ASSERT(is_inode_confirmed(irec, j));
+	set_progress_msg(PROGRESS_FMT_CORR_LINK, (__uint64_t) glob_agcount);
 
-				if (is_inode_free(irec, j))
-					continue;
+	create_work_queue(&wq, mp, scan_threads);
 
-				ASSERT(no_modify || is_inode_reached(irec, j));
+	for (agno = 0; agno < mp->m_sb.sb_agcount; agno++)
+		queue_work(&wq, do_link_updates, agno, NULL);
 
-				nrefs = num_inode_references(irec, j);
-				ASSERT(no_modify || nrefs > 0);
+	destroy_work_queue(&wq);
 
-				if (get_inode_disk_nlinks(irec, j) != nrefs)
-					update_inode_nlinks(mp,
-						XFS_AGINO_TO_INO(mp, i,
-							irec->ino_startnum + j),
-						nrefs);
-			}
-			irec = next_ino_rec(irec);
-		}
-	}
+	print_final_rpt();
 }
diff --git a/repair/progress.c b/repair/progress.c
index 418b803..2a09b23 100644
--- a/repair/progress.c
+++ b/repair/progress.c
@@ -75,9 +75,9 @@ progress_rpt_t progress_rpt_reports[] = {
 {FMT2, N_("moving disconnected inodes to lost+found"),		/* 12 */
 	&rpt_fmts[FMT2], &rpt_types[TYPE_INODE]},
 {FMT1, N_("verify and correct link counts"),			/* 13 */
-	&rpt_fmts[FMT1], &rpt_types[TYPE_INODE]},
+	&rpt_fmts[FMT1], &rpt_types[TYPE_AG]},
 {FMT1, N_("verify link counts"),				/* 14 */
-	&rpt_fmts[FMT1], &rpt_types[TYPE_INODE]}
+	&rpt_fmts[FMT1], &rpt_types[TYPE_AG]}
 };
 
 pthread_t	report_thread;
diff --git a/repair/protos.h b/repair/protos.h
index 9d5a2a6..b113aca 100644
--- a/repair/protos.h
+++ b/repair/protos.h
@@ -50,7 +50,7 @@ void	phase3(struct xfs_mount *);
 void	phase4(struct xfs_mount *);
 void	phase5(struct xfs_mount *);
 void	phase6(struct xfs_mount *);
-void	phase7(struct xfs_mount *);
+void	phase7(struct xfs_mount *, int);
 
 int	verify_set_agheader(struct xfs_mount *, struct xfs_buf *,
 		struct xfs_sb *, struct xfs_agf *, struct xfs_agi *,
diff --git a/repair/xfs_repair.c b/repair/xfs_repair.c
index 3eaced4..fcdb212 100644
--- a/repair/xfs_repair.c
+++ b/repair/xfs_repair.c
@@ -893,7 +893,7 @@ main(int argc, char **argv)
 		phase6(mp);
 		timestamp(PHASE_END, 6, NULL);
 
-		phase7(mp);
+		phase7(mp, phase2_threads);
 		timestamp(PHASE_END, 7, NULL);
 	} else  {
 		do_warn(
-- 
2.5.0


* [PATCH 5/9] repair: parallelise uncertain inode processing in phase 3
  2015-12-21 21:37 [PATCH 0/9] xfsprogs: big, broken filesystems cause pain Dave Chinner
                   ` (3 preceding siblings ...)
  2015-12-21 21:37 ` [PATCH 4/9] repair: parallelise phase 7 Dave Chinner
@ 2015-12-21 21:37 ` Dave Chinner
  2016-01-04 19:12   ` Brian Foster
  2015-12-21 21:37 ` [PATCH 6/9] libxfs: directory node splitting does not have an extra block Dave Chinner
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 21+ messages in thread
From: Dave Chinner @ 2015-12-21 21:37 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

This can take a long time when there are millions of uncertain inodes in badly
broken filesystems. The processing is per-ag, the data structures are all
per-ag, and the load is mostly CPU time spent checking CRCs on each
uncertain inode. Parallelising reduced the runtime of this phase on a badly
broken filesystem from ~30 minutes to under 5 minutes.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 repair/phase3.c     | 59 ++++++++++++++++++++++++++++++++++++++++++++---------
 repair/protos.h     |  2 +-
 repair/xfs_repair.c |  2 +-
 3 files changed, 51 insertions(+), 12 deletions(-)

diff --git a/repair/phase3.c b/repair/phase3.c
index 76c9440..0890a27 100644
--- a/repair/phase3.c
+++ b/repair/phase3.c
@@ -28,6 +28,7 @@
 #include "dinode.h"
 #include "progress.h"
 #include "bmap.h"
+#include "threads.h"
 
 static void
 process_agi_unlinked(
@@ -87,10 +88,33 @@ process_ags(
 	do_inode_prefetch(mp, ag_stride, process_ag_func, false, false);
 }
 
+static void
+do_uncertain_aginodes(
+	work_queue_t	*wq,
+	xfs_agnumber_t	agno,
+	void		*arg)
+{
+	int		*count = arg;
+
+	*count = process_uncertain_aginodes(wq->mp, agno);
+
+#ifdef XR_INODE_TRACE
+	fprintf(stderr,
+		"\t\t phase 3 - ag %d process_uncertain_inodes returns %d\n",
+		*count, j);
+#endif
+
+	PROG_RPT_INC(prog_rpt_done[agno], 1);
+}
+
 void
-phase3(xfs_mount_t *mp)
+phase3(
+	struct xfs_mount *mp,
+	int		scan_threads)
 {
-	int 			i, j;
+	int			i, j;
+	int			*counts;
+	work_queue_t		wq;
 
 	do_log(_("Phase 3 - for each AG...\n"));
 	if (!no_modify)
@@ -129,20 +153,35 @@ phase3(xfs_mount_t *mp)
 	 */
 	do_log(_("        - process newly discovered inodes...\n"));
 	set_progress_msg(PROG_FMT_NEW_INODES, (__uint64_t) glob_agcount);
+
+	counts = calloc(sizeof(*counts), mp->m_sb.sb_agcount);
+	if (!counts) {
+		do_abort(_("no memory for uncertain inode counts\n"));
+		return;
+	}
+
 	do  {
 		/*
 		 * have to loop until no ag has any uncertain
 		 * inodes
 		 */
 		j = 0;
-		for (i = 0; i < mp->m_sb.sb_agcount; i++)  {
-			j += process_uncertain_aginodes(mp, i);
-#ifdef XR_INODE_TRACE
-			fprintf(stderr,
-				"\t\t phase 3 - process_uncertain_inodes returns %d\n", j);
-#endif
-			PROG_RPT_INC(prog_rpt_done[i], 1);
-		}
+		memset(counts, 0, mp->m_sb.sb_agcount * sizeof(*counts));
+
+		create_work_queue(&wq, mp, scan_threads);
+
+		for (i = 0; i < mp->m_sb.sb_agcount; i++)
+			queue_work(&wq, do_uncertain_aginodes, i, &counts[i]);
+
+		destroy_work_queue(&wq);
+
+		/* tally up the counts */
+		for (i = 0; i < mp->m_sb.sb_agcount; i++)
+			j += counts[i];
+
 	} while (j != 0);
+
+	free(counts);
+
 	print_final_rpt();
 }
diff --git a/repair/protos.h b/repair/protos.h
index b113aca..0290420 100644
--- a/repair/protos.h
+++ b/repair/protos.h
@@ -46,7 +46,7 @@ void	thread_init(void);
 
 void	phase1(struct xfs_mount *);
 void	phase2(struct xfs_mount *, int);
-void	phase3(struct xfs_mount *);
+void	phase3(struct xfs_mount *, int);
 void	phase4(struct xfs_mount *);
 void	phase5(struct xfs_mount *);
 void	phase6(struct xfs_mount *);
diff --git a/repair/xfs_repair.c b/repair/xfs_repair.c
index fcdb212..5d5f3aa 100644
--- a/repair/xfs_repair.c
+++ b/repair/xfs_repair.c
@@ -871,7 +871,7 @@ main(int argc, char **argv)
 	if (do_prefetch)
 		init_prefetch(mp);
 
-	phase3(mp);
+	phase3(mp, phase2_threads);
 	timestamp(PHASE_END, 3, NULL);
 
 	phase4(mp);
-- 
2.5.0


* [PATCH 6/9] libxfs: directory node splitting does not have an extra block
  2015-12-21 21:37 [PATCH 0/9] xfsprogs: big, broken filesystems cause pain Dave Chinner
                   ` (4 preceding siblings ...)
  2015-12-21 21:37 ` [PATCH 5/9] repair: parallelise uncertain inode processing in phase 3 Dave Chinner
@ 2015-12-21 21:37 ` Dave Chinner
  2016-01-05 18:34   ` Brian Foster
  2015-12-21 21:37 ` [PATCH 7/9] libxfs: don't discard dirty buffers Dave Chinner
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 21+ messages in thread
From: Dave Chinner @ 2015-12-21 21:37 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

xfs_da3_split() has to handle all three versions of the
directory/attribute btree structure. The attr tree is v1, the dir
tree is v2 or v3. The main difference between the v1 and v2/3 trees
is the way tree nodes are split - in the v1 tree we can require a
double split to occur because the object to be inserted may be
larger than the space made by splitting a leaf. In this case we need
to do a double split - one to split the full leaf, then another to
allocate an empty leaf block in the correct location for the new
entry.  This does not happen with dir (v2/v3) formats as the objects
being inserted are always guaranteed to fit into the new space in
the split blocks.

The problem is that when we are doing the first root split, there
can be three leaf blocks that need to be updated for an attr tree,
but there will only ever be two blocks for the dir tree. The code,
however, expects that there will always be an extra block (i.e.
three blocks) that needs updating. When rebuilding broken
directories, xfs_repair trips over this condition in the directory
code and SEGVs because state->extrablk.bp == NULL, i.e. there is no
extra block.

Indeed, for directories there *may* be an extra block on this buffer
pointer. However, it's guaranteed not to be a leaf block (i.e. a
directory data block) - the directory code only ever places hash
index or free space blocks in this pointer (as a cursor of
sorts), and so to use it as a directory data block will immediately
corrupt the directory. This manifests itself in repair as a
directory corruption in a repaired directory, leaving the directory
rebuild incomplete.

This is a dir v2 zero-day bug - it was in the initial dir v2 commit
that was made back in February 1998.

Fix this by making the update code aware of whether the extra block
is a valid node for linking into the tree, rather than asserting
(incorrectly) that the extra block must be valid. This brings the
code in line with xfs_da3_node_split() which, from the first dir v2
commit, handled the conditional extra block case correctly....

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 libxfs/xfs_da_btree.c | 39 ++++++++++++++++++++++++++-------------
 1 file changed, 26 insertions(+), 13 deletions(-)

diff --git a/libxfs/xfs_da_btree.c b/libxfs/xfs_da_btree.c
index bf5fe21..58a0361 100644
--- a/libxfs/xfs_da_btree.c
+++ b/libxfs/xfs_da_btree.c
@@ -356,6 +356,7 @@ xfs_da3_split(
 	int			action = 0;
 	int			error;
 	int			i;
+	bool			use_extra = false;
 
 	trace_xfs_da_split(state->args);
 
@@ -395,6 +396,7 @@ xfs_da3_split(
 			 * Entry wouldn't fit, split the leaf again.
 			 */
 			state->extravalid = 1;
+			use_extra = true;
 			if (state->inleaf) {
 				state->extraafter = 0;	/* before newblk */
 				trace_xfs_attr_leaf_split_before(state->args);
@@ -454,8 +456,13 @@ xfs_da3_split(
 	/*
 	 * Update pointers to the node which used to be block 0 and
 	 * just got bumped because of the addition of a new root node.
-	 * There might be three blocks involved if a double split occurred,
-	 * and the original block 0 could be at any position in the list.
+	 * There might be three blocks involved if a double split occurred and
+	 * in this case use_extra will be set. This can only happen for attr
+	 * trees, as extrablk for dir trees can only contain data or free space
+	 * blocks, never leaf/node blocks.
+	 *
+	 * Note that the original block 0 could be at any position in the list
+	 * of blocks in the tree.
 	 *
 	 * Note: the magic numbers and sibling pointers are in the same
 	 * physical place for both v2 and v3 headers (by design). Hence it
@@ -464,31 +471,37 @@ xfs_da3_split(
 	 */
 	node = oldblk->bp->b_addr;
 	if (node->hdr.info.forw) {
+		bp = NULL;
 		if (be32_to_cpu(node->hdr.info.forw) == addblk->blkno) {
 			bp = addblk->bp;
-		} else {
+		} else if (use_extra) {
 			ASSERT(state->extravalid);
 			bp = state->extrablk.bp;
 		}
-		node = bp->b_addr;
-		node->hdr.info.back = cpu_to_be32(oldblk->blkno);
-		xfs_trans_log_buf(state->args->trans, bp,
-		    XFS_DA_LOGRANGE(node, &node->hdr.info,
-		    sizeof(node->hdr.info)));
+		if (bp) {
+			node = bp->b_addr;
+			node->hdr.info.back = cpu_to_be32(oldblk->blkno);
+			xfs_trans_log_buf(state->args->trans, bp,
+					  XFS_DA_LOGRANGE(node, &node->hdr.info,
+					  sizeof(node->hdr.info)));
+		}
 	}
 	node = oldblk->bp->b_addr;
 	if (node->hdr.info.back) {
+		bp = NULL;
 		if (be32_to_cpu(node->hdr.info.back) == addblk->blkno) {
 			bp = addblk->bp;
 		} else {
 			ASSERT(state->extravalid);
 			bp = state->extrablk.bp;
 		}
-		node = bp->b_addr;
-		node->hdr.info.forw = cpu_to_be32(oldblk->blkno);
-		xfs_trans_log_buf(state->args->trans, bp,
-		    XFS_DA_LOGRANGE(node, &node->hdr.info,
-		    sizeof(node->hdr.info)));
+		if (bp) {
+			node = bp->b_addr;
+			node->hdr.info.forw = cpu_to_be32(oldblk->blkno);
+			xfs_trans_log_buf(state->args->trans, bp,
+					  XFS_DA_LOGRANGE(node, &node->hdr.info,
+					  sizeof(node->hdr.info)));
+		}
 	}
 	addblk->bp = NULL;
 	return 0;
-- 
2.5.0


* [PATCH 7/9] libxfs: don't discard dirty buffers
  2015-12-21 21:37 [PATCH 0/9] xfsprogs: big, broken filesystems cause pain Dave Chinner
                   ` (5 preceding siblings ...)
  2015-12-21 21:37 ` [PATCH 6/9] libxfs: directory node splitting does not have an extra block Dave Chinner
@ 2015-12-21 21:37 ` Dave Chinner
  2016-01-05 18:34   ` Brian Foster
  2015-12-21 21:37 ` [PATCH 8/9] libxfs: don't repeatedly shake unwritable buffers Dave Chinner
  2015-12-21 21:37 ` [PATCH 9/9] libxfs: keep unflushable buffers off the cache MRUs Dave Chinner
  8 siblings, 1 reply; 21+ messages in thread
From: Dave Chinner @ 2015-12-21 21:37 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

When we release a buffer from the cache, if it is dirty we write it
to disk and then put the buffer on the free list for recycling.
However, if the write fails (e.g. a verifier failure due to unfixed
corruption) we effectively throw the buffer and its contents away.
This causes all sorts of problems for xfs_repair, as it then re-reads
the buffer from disk on the next access and hence loses all the
corrections that had previously been made, resulting in tripping
over corruptions in code that assumes the corruptions have already
been fixed/flagged in the buffer it receives.

To fix this, we have to make the cache aware that writes can fail,
and keep the buffer in cache when writes fail. Hence we have to add
an explicit error notification to the flush operation, and we need
to do that before we release the buffer to the free list. This also
means that we need to remove the writeback code from the release
mechanisms, instead replacing them with assertions that the buffers
are already clean.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 include/cache.h |  2 +-
 libxfs/cache.c  | 15 ++++++++++++++-
 libxfs/rdwr.c   | 44 +++++++++++++++++++++++++++-----------------
 3 files changed, 42 insertions(+), 19 deletions(-)

diff --git a/include/cache.h b/include/cache.h
index 0a84c69..87826be 100644
--- a/include/cache.h
+++ b/include/cache.h
@@ -64,7 +64,7 @@ typedef void *cache_key_t;
 
 typedef void (*cache_walk_t)(struct cache_node *);
 typedef struct cache_node * (*cache_node_alloc_t)(cache_key_t);
-typedef void (*cache_node_flush_t)(struct cache_node *);
+typedef int (*cache_node_flush_t)(struct cache_node *);
 typedef void (*cache_node_relse_t)(struct cache_node *);
 typedef unsigned int (*cache_node_hash_t)(cache_key_t, unsigned int,
 					  unsigned int);
diff --git a/libxfs/cache.c b/libxfs/cache.c
index 4753a1d..a48ebd9 100644
--- a/libxfs/cache.c
+++ b/libxfs/cache.c
@@ -219,6 +219,12 @@ cache_shake(
 		if (pthread_mutex_trylock(&node->cn_mutex) != 0)
 			continue;
 
+		/* can't release dirty objects */
+		if (cache->flush(node)) {
+			pthread_mutex_unlock(&node->cn_mutex);
+			continue;
+		}
+
 		hash = cache->c_hash + node->cn_hashidx;
 		if (pthread_mutex_trylock(&hash->ch_mutex) != 0) {
 			pthread_mutex_unlock(&node->cn_mutex);
@@ -311,6 +317,13 @@ __cache_node_purge(
 		pthread_mutex_unlock(&node->cn_mutex);
 		return count;
 	}
+
+	/* can't purge dirty objects */
+	if (cache->flush(node)) {
+		pthread_mutex_unlock(&node->cn_mutex);
+		return 1;
+	}
+
 	mru = &cache->c_mrus[node->cn_priority];
 	pthread_mutex_lock(&mru->cm_mutex);
 	list_del_init(&node->cn_mru);
@@ -321,7 +334,7 @@ __cache_node_purge(
 	pthread_mutex_destroy(&node->cn_mutex);
 	list_del_init(&node->cn_hash);
 	cache->relse(node);
-	return count;
+	return 0;
 }
 
 /*
diff --git a/libxfs/rdwr.c b/libxfs/rdwr.c
index 7a04985..0337a21 100644
--- a/libxfs/rdwr.c
+++ b/libxfs/rdwr.c
@@ -659,6 +659,8 @@ __libxfs_getbufr(int blen)
 		bp = kmem_zone_zalloc(xfs_buf_zone, 0);
 	pthread_mutex_unlock(&xfs_buf_freelist.cm_mutex);
 	bp->b_ops = NULL;
+	if (bp->b_flags & LIBXFS_B_DIRTY)
+		fprintf(stderr, "found dirty buffer (bulk) on free list!");
 
 	return bp;
 }
@@ -1223,23 +1225,26 @@ libxfs_iomove(xfs_buf_t *bp, uint boff, int len, void *data, int flags)
 }
 
 static void
-libxfs_brelse(struct cache_node *node)
+libxfs_brelse(
+	struct cache_node	*node)
 {
-	xfs_buf_t		*bp = (xfs_buf_t *)node;
+	struct xfs_buf		*bp = (struct xfs_buf *)node;
 
-	if (bp != NULL) {
-		if (bp->b_flags & LIBXFS_B_DIRTY)
-			libxfs_writebufr(bp);
-		pthread_mutex_lock(&xfs_buf_freelist.cm_mutex);
-		list_add(&bp->b_node.cn_mru, &xfs_buf_freelist.cm_list);
-		pthread_mutex_unlock(&xfs_buf_freelist.cm_mutex);
-	}
+	if (!bp)
+		return;
+	if (bp->b_flags & LIBXFS_B_DIRTY)
+		fprintf(stderr,
+			"releasing dirty buffer to free list!");
+
+	pthread_mutex_lock(&xfs_buf_freelist.cm_mutex);
+	list_add(&bp->b_node.cn_mru, &xfs_buf_freelist.cm_list);
+	pthread_mutex_unlock(&xfs_buf_freelist.cm_mutex);
 }
 
 static unsigned int
 libxfs_bulkrelse(
-	struct cache 		*cache,
-	struct list_head 	*list)
+	struct cache		*cache,
+	struct list_head	*list)
 {
 	xfs_buf_t		*bp;
 	int			count = 0;
@@ -1249,7 +1254,8 @@ libxfs_bulkrelse(
 
 	list_for_each_entry(bp, list, b_node.cn_mru) {
 		if (bp->b_flags & LIBXFS_B_DIRTY)
-			libxfs_writebufr(bp);
+			fprintf(stderr,
+				"releasing dirty buffer (bulk) to free list!");
 		count++;
 	}
 
@@ -1260,18 +1266,22 @@ libxfs_bulkrelse(
 	return count;
 }
 
-static void
-libxfs_bflush(struct cache_node *node)
+static int
+libxfs_bflush(
+	struct cache_node	*node)
 {
-	xfs_buf_t		*bp = (xfs_buf_t *)node;
+	struct xfs_buf		*bp = (struct xfs_buf *)node;
 
-	if ((bp != NULL) && (bp->b_flags & LIBXFS_B_DIRTY))
-		libxfs_writebufr(bp);
+	if (bp->b_flags & LIBXFS_B_DIRTY)
+		return libxfs_writebufr(bp);
+	return 0;
 }
 
 void
 libxfs_putbufr(xfs_buf_t *bp)
 {
+	if (bp->b_flags & LIBXFS_B_DIRTY)
+		libxfs_writebufr(bp);
 	libxfs_brelse((struct cache_node *)bp);
 }
 
-- 
2.5.0


* [PATCH 8/9] libxfs: don't repeatedly shake unwritable buffers
  2015-12-21 21:37 [PATCH 0/9] xfsprogs: big, broken filesystems cause pain Dave Chinner
                   ` (6 preceding siblings ...)
  2015-12-21 21:37 ` [PATCH 7/9] libxfs: don't discard dirty buffers Dave Chinner
@ 2015-12-21 21:37 ` Dave Chinner
  2016-01-05 18:34   ` Brian Foster
  2015-12-21 21:37 ` [PATCH 9/9] libxfs: keep unflushable buffers off the cache MRUs Dave Chinner
  8 siblings, 1 reply; 21+ messages in thread
From: Dave Chinner @ 2015-12-21 21:37 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

Now that we try to write dirty buffers before we release them, we
can get a buildup of unwritable dirty buffers on the LRU lists. This
results in the cache shaker repeatedly trying to write out these
buffers every time the cache fills up, which produces more
corruption warnings and wastes a lot of time reclaiming nothing.
This can effectively livelock the processing parts of phase 4.

Fix this by not trying to write buffers with corruption errors on
them. These errors will get cleared when the buffer is re-read,
fixed and then marked dirty again, at which point we'll be able to
write them and so the cache can reclaim them successfully.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 libxfs/rdwr.c | 27 ++++++++++++++++-----------
 1 file changed, 16 insertions(+), 11 deletions(-)

diff --git a/libxfs/rdwr.c b/libxfs/rdwr.c
index 0337a21..a1f0029 100644
--- a/libxfs/rdwr.c
+++ b/libxfs/rdwr.c
@@ -1103,7 +1103,6 @@ int
 libxfs_writebufr(xfs_buf_t *bp)
 {
 	int	fd = libxfs_device_to_fd(bp->b_target->dev);
-	int	error = 0;
 
 	/*
 	 * we never write buffers that are marked stale. This indicates they
@@ -1134,7 +1133,7 @@ libxfs_writebufr(xfs_buf_t *bp)
 	}
 
 	if (!(bp->b_flags & LIBXFS_B_DISCONTIG)) {
-		error = __write_buf(fd, bp->b_addr, bp->b_bcount,
+		bp->b_error = __write_buf(fd, bp->b_addr, bp->b_bcount,
 				    LIBXFS_BBTOOFF64(bp->b_bn), bp->b_flags);
 	} else {
 		int	i;
@@ -1144,11 +1143,10 @@ libxfs_writebufr(xfs_buf_t *bp)
 			off64_t	offset = LIBXFS_BBTOOFF64(bp->b_map[i].bm_bn);
 			int len = BBTOB(bp->b_map[i].bm_len);
 
-			error = __write_buf(fd, buf, len, offset, bp->b_flags);
-			if (error) {
-				bp->b_error = error;
+			bp->b_error = __write_buf(fd, buf, len, offset,
+						  bp->b_flags);
+			if (bp->b_error)
 				break;
-			}
 			buf += len;
 		}
 	}
@@ -1157,14 +1155,14 @@ libxfs_writebufr(xfs_buf_t *bp)
 	printf("%lx: %s: wrote %u bytes, blkno=%llu(%llu), %p, error %d\n",
 			pthread_self(), __FUNCTION__, bp->b_bcount,
 			(long long)LIBXFS_BBTOOFF64(bp->b_bn),
-			(long long)bp->b_bn, bp, error);
+			(long long)bp->b_bn, bp, bp->b_error);
 #endif
-	if (!error) {
+	if (!bp->b_error) {
 		bp->b_flags |= LIBXFS_B_UPTODATE;
 		bp->b_flags &= ~(LIBXFS_B_DIRTY | LIBXFS_B_EXIT |
 				 LIBXFS_B_UNCHECKED);
 	}
-	return error;
+	return bp->b_error;
 }
 
 int
@@ -1266,15 +1264,22 @@ libxfs_bulkrelse(
 	return count;
 }
 
+/*
+ * When a buffer is marked dirty, the error is cleared. Hence if we are trying
+ * to flush a buffer prior to cache reclaim that has an error on it it means
+ * we've already tried to flush it and it failed. Prevent repeated corruption
+ * errors from being reported by skipping such buffers - when the corruption is
+ * fixed the buffer will be marked dirty again and we can write it again.
+ */
 static int
 libxfs_bflush(
 	struct cache_node	*node)
 {
 	struct xfs_buf		*bp = (struct xfs_buf *)node;
 
-	if (bp->b_flags & LIBXFS_B_DIRTY)
+	if (!bp->b_error && bp->b_flags & LIBXFS_B_DIRTY)
 		return libxfs_writebufr(bp);
-	return 0;
+	return bp->b_error;
 }
 
 void
-- 
2.5.0


* [PATCH 9/9] libxfs: keep unflushable buffers off the cache MRUs
  2015-12-21 21:37 [PATCH 0/9] xfsprogs: big, broken filesystems cause pain Dave Chinner
                   ` (7 preceding siblings ...)
  2015-12-21 21:37 ` [PATCH 8/9] libxfs: don't repeatedly shake unwritable buffers Dave Chinner
@ 2015-12-21 21:37 ` Dave Chinner
  2016-01-05 18:34   ` Brian Foster
  8 siblings, 1 reply; 21+ messages in thread
From: Dave Chinner @ 2015-12-21 21:37 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

There's no point trying to free buffers that are dirty and return
errors on flush, as we have to keep them around until the corruption
is fixed. Hence if we fail to flush a buffer during a cache shake,
move the buffer to a special dirty MRU list that the cache does not
shake. This prevents memory pressure from seeing these buffers, but
allows subsequent cache lookups to still find them through the hash.
This ensures we don't waste huge amounts of CPU trying to flush and
reclaim buffers that cannot be flushed or reclaimed.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 include/cache.h |  3 ++-
 libxfs/cache.c  | 34 +++++++++++++++++++++++++++++-----
 2 files changed, 31 insertions(+), 6 deletions(-)

diff --git a/include/cache.h b/include/cache.h
index 87826be..55761d3 100644
--- a/include/cache.h
+++ b/include/cache.h
@@ -51,6 +51,7 @@ enum {
 #define CACHE_BASE_PRIORITY	0
 #define CACHE_PREFETCH_PRIORITY	8
 #define CACHE_MAX_PRIORITY	15
+#define CACHE_DIRTY_PRIORITY	(CACHE_MAX_PRIORITY + 1)
 
 /*
  * Simple, generic implementation of a cache (arbitrary data).
@@ -115,7 +116,7 @@ struct cache {
 	unsigned int		c_hashsize;	/* hash bucket count */
 	unsigned int		c_hashshift;	/* hash key shift */
 	struct cache_hash	*c_hash;	/* hash table buckets */
-	struct cache_mru	c_mrus[CACHE_MAX_PRIORITY + 1];
+	struct cache_mru	c_mrus[CACHE_DIRTY_PRIORITY + 1];
 	unsigned long long	c_misses;	/* cache misses */
 	unsigned long long	c_hits;		/* cache hits */
 	unsigned int 		c_max;		/* max nodes ever used */
diff --git a/libxfs/cache.c b/libxfs/cache.c
index a48ebd9..d5ea461 100644
--- a/libxfs/cache.c
+++ b/libxfs/cache.c
@@ -81,7 +81,7 @@ cache_init(
 		pthread_mutex_init(&cache->c_hash[i].ch_mutex, NULL);
 	}
 
-	for (i = 0; i <= CACHE_MAX_PRIORITY; i++) {
+	for (i = 0; i <= CACHE_DIRTY_PRIORITY; i++) {
 		list_head_init(&cache->c_mrus[i].cm_list);
 		cache->c_mrus[i].cm_count = 0;
 		pthread_mutex_init(&cache->c_mrus[i].cm_mutex, NULL);
@@ -154,7 +154,7 @@ cache_destroy(
 		list_head_destroy(&cache->c_hash[i].ch_list);
 		pthread_mutex_destroy(&cache->c_hash[i].ch_mutex);
 	}
-	for (i = 0; i <= CACHE_MAX_PRIORITY; i++) {
+	for (i = 0; i <= CACHE_DIRTY_PRIORITY; i++) {
 		list_head_destroy(&cache->c_mrus[i].cm_list);
 		pthread_mutex_destroy(&cache->c_mrus[i].cm_mutex);
 	}
@@ -183,6 +183,27 @@ cache_generic_bulkrelse(
 }
 
 /*
+ * Park unflushable nodes on their own special MRU so that cache_shake() doesn't
+ * end up repeatedly scanning them in the futile attempt to clean them before
+ * reclaim.
+ */
+static void
+cache_move_to_dirty_mru(
+	struct cache		*cache,
+	struct cache_node	*node)
+{
+	struct cache_mru	*mru;
+
+	mru = &cache->c_mrus[CACHE_DIRTY_PRIORITY];
+
+	pthread_mutex_lock(&mru->cm_mutex);
+	node->cn_priority = CACHE_DIRTY_PRIORITY;
+	list_move(&node->cn_mru, &mru->cm_list);
+	mru->cm_count++;
+	pthread_mutex_unlock(&mru->cm_mutex);
+}
+
+/*
  * We've hit the limit on cache size, so we need to start reclaiming
  * nodes we've used. The MRU specified by the priority is shaken.
  * Returns new priority at end of the call (in case we call again).
@@ -202,10 +223,11 @@ cache_shake(
 	struct cache_node *	node;
 	unsigned int		count;
 
-	ASSERT(priority <= CACHE_MAX_PRIORITY);
-	if (priority > CACHE_MAX_PRIORITY)
+	ASSERT(priority <= CACHE_DIRTY_PRIORITY);
+	if (priority > CACHE_MAX_PRIORITY && !all)
 		priority = 0;
 
+
 	mru = &cache->c_mrus[priority];
 	count = 0;
 	list_head_init(&temp);
@@ -221,6 +243,8 @@ cache_shake(
 
 		/* can't release dirty objects */
 		if (cache->flush(node)) {
+			cache_move_to_dirty_mru(cache, node);
+			mru->cm_count--;
 			pthread_mutex_unlock(&node->cn_mutex);
 			continue;
 		}
@@ -578,7 +602,7 @@ cache_purge(
 {
 	int			i;
 
-	for (i = 0; i <= CACHE_MAX_PRIORITY; i++)
+	for (i = 0; i <= CACHE_DIRTY_PRIORITY; i++)
 		cache_shake(cache, i, 1);
 
 #ifdef CACHE_DEBUG
-- 
2.5.0


* Re: [PATCH 1/9] metadump: clean up btree block region zeroing
  2015-12-21 21:37 ` [PATCH 1/9] metadump: clean up btree block region zeroing Dave Chinner
@ 2016-01-04 19:11   ` Brian Foster
  0 siblings, 0 replies; 21+ messages in thread
From: Brian Foster @ 2016-01-04 19:11 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Tue, Dec 22, 2015 at 08:37:01AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Abstract out all the common operations in the btree block zeroing
> so that we only have one copy of offset/len calculations and don't
> require lots of casts for the pointer arithmetic.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  db/metadump.c | 61 +++++++++++++++++++++++------------------------------------
>  1 file changed, 24 insertions(+), 37 deletions(-)
> 
> diff --git a/db/metadump.c b/db/metadump.c
> index 8455fdd..a185da5 100644
> --- a/db/metadump.c
> +++ b/db/metadump.c
> @@ -258,8 +258,8 @@ zero_btree_node(
>  	xfs_inobt_key_t		*ikp;
>  	xfs_alloc_ptr_t		*app;
>  	xfs_alloc_key_t		*akp;
> -	void			*zp1, *zp2;
> -	int			zlen1, zlen2;
> +	char			*zp1, *zp2;
> +	char			*key_end;
>  
>  	nrecs = be16_to_cpu(block->bb_numrecs);
>  
> @@ -268,43 +268,36 @@ zero_btree_node(
>  	case TYP_BMAPBTD:
>  		bkp = XFS_BMBT_KEY_ADDR(mp, block, 1);
>  		bpp = XFS_BMBT_PTR_ADDR(mp, block, 1, mp->m_bmap_dmxr[1]);
> -		zp1 = &bkp[nrecs];
> -		zlen1 = (char *)&bpp[0] - (char *)&bkp[nrecs];
> -		zp2 = &bpp[nrecs];
> -		zlen2 = (char *)block + mp->m_sb.sb_blocksize -
> -							(char *)&bpp[nrecs];
> +		zp1 = (char *)&bkp[nrecs];
> +		zp2 = (char *)&bpp[nrecs];
> +		key_end = (char *)bpp;
>  		break;
>  	case TYP_INOBT:
>  	case TYP_FINOBT:
>  		ikp = XFS_INOBT_KEY_ADDR(mp, block, 1);
>  		ipp = XFS_INOBT_PTR_ADDR(mp, block, 1, mp->m_inobt_mxr[1]);
> -		zp1 = &ikp[nrecs];
> -		zlen1 = (char *)&ipp[0] - (char *)&ikp[nrecs];
> -		zp2 = &ipp[nrecs];
> -		zlen2 = (char *)block + mp->m_sb.sb_blocksize -
> -							(char *)&ipp[nrecs];
> +		zp1 = (char *)&ikp[nrecs];
> +		zp2 = (char *)&ipp[nrecs];
> +		key_end = (char *)ipp;
>  		break;
>  	case TYP_BNOBT:
>  	case TYP_CNTBT:
>  		akp = XFS_ALLOC_KEY_ADDR(mp, block, 1);
>  		app = XFS_ALLOC_PTR_ADDR(mp, block, 1, mp->m_alloc_mxr[1]);
> -		zp1 = &akp[nrecs];
> -		zlen1 = (char *)&app[0] - (char *)&akp[nrecs];
> -		zp2 = &app[nrecs];
> -		zlen2 = (char *)block + mp->m_sb.sb_blocksize -
> -							(char *)&app[nrecs];
> +		zp1 = (char *)&akp[nrecs];
> +		zp2 = (char *)&app[nrecs];
> +		key_end = (char *)app;
>  		break;
>  	default:
> -		zp1 = NULL;
> -		break;
> +		return;
>  	}
>  
> -	if (zp1 && zp2) {
> -		/* Zero from end of keys to beginning of pointers */
> -		memset(zp1, 0, zlen1);
> -		/* Zero from end of pointers to end of block */
> -		memset(zp2, 0, zlen2);
> -	}
> +
> +	/* Zero from end of keys to beginning of pointers */
> +	memset(zp1, 0, key_end - zp1);
> +
> +	/* Zero from end of pointers to end of block */
> +	memset(zp2, 0, (char *)block + mp->m_sb.sb_blocksize - zp2);
>  }
>  
>  static void
> @@ -316,8 +309,7 @@ zero_btree_leaf(
>  	struct xfs_bmbt_rec	*brp;
>  	struct xfs_inobt_rec	*irp;
>  	struct xfs_alloc_rec	*arp;
> -	void			*zp;
> -	int			zlen;
> +	char			*zp;
>  
>  	nrecs = be16_to_cpu(block->bb_numrecs);
>  
> @@ -325,29 +317,24 @@ zero_btree_leaf(
>  	case TYP_BMAPBTA:
>  	case TYP_BMAPBTD:
>  		brp = XFS_BMBT_REC_ADDR(mp, block, 1);
> -		zp = &brp[nrecs];
> -		zlen = (char *)block + mp->m_sb.sb_blocksize - (char *)&brp[nrecs];
> +		zp = (char *)&brp[nrecs];
>  		break;
>  	case TYP_INOBT:
>  	case TYP_FINOBT:
>  		irp = XFS_INOBT_REC_ADDR(mp, block, 1);
> -		zp = &irp[nrecs];
> -		zlen = (char *)block + mp->m_sb.sb_blocksize - (char *)&irp[nrecs];
> +		zp = (char *)&irp[nrecs];
>  		break;
>  	case TYP_BNOBT:
>  	case TYP_CNTBT:
>  		arp = XFS_ALLOC_REC_ADDR(mp, block, 1);
> -		zp = &arp[nrecs];
> -		zlen = (char *)block + mp->m_sb.sb_blocksize - (char *)&arp[nrecs];
> +		zp = (char *)&arp[nrecs];
>  		break;
>  	default:
> -		zp = NULL;
> -		break;
> +		return;
>  	}
>  
>  	/* Zero from end of records to end of block */
> -	if (zp && zlen < mp->m_sb.sb_blocksize)
> -		memset(zp, 0, zlen);
> +	memset(zp, 0, (char *)block + mp->m_sb.sb_blocksize - zp);
>  }
>  
>  static void
> -- 
> 2.5.0
> 

* Re: [PATCH 2/9] metadump: bounds check btree block regions being zeroed
  2015-12-21 21:37 ` [PATCH 2/9] metadump: bounds check btree block regions being zeroed Dave Chinner
@ 2016-01-04 19:11   ` Brian Foster
  0 siblings, 0 replies; 21+ messages in thread
From: Brian Foster @ 2016-01-04 19:11 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Tue, Dec 22, 2015 at 08:37:02AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Arkadiusz Miskiewicz reported that metadump was crashing on one of
> his corrupted filesystems, and the trace indicated that it was
> zeroing unused regions in inode btree blocks when it failed. The
> btree block had a corrupt nrecs field, which was resulting in an out
> of bounds memset() occurring.  Ensure that the region being
> generated for zeroing is within bounds before executing the zeroing.
> 
> Reported-by: Arkadiusz Miskiewicz <arekm@maven.pl>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  db/metadump.c | 32 ++++++++++++++++++++++++++++++++
>  1 file changed, 32 insertions(+)
> 
> diff --git a/db/metadump.c b/db/metadump.c
> index a185da5..1769fdf 100644
> --- a/db/metadump.c
> +++ b/db/metadump.c
...
> @@ -300,6 +316,11 @@ zero_btree_node(
>  	memset(zp2, 0, (char *)block + mp->m_sb.sb_blocksize - zp2);
>  }
>  
> +/*
> + * We could be processing a corrupt block, so we can't trust any of
> + * the offsets or lengths to be within the buffer range. Hence check
> + * carefully!
> + */
>  static void
>  zero_btree_leaf(
>  	struct xfs_btree_block	*block,
> @@ -312,20 +333,31 @@ zero_btree_leaf(
>  	char			*zp;
>  
>  	nrecs = be16_to_cpu(block->bb_numrecs);
> +	if (nrecs < 0)
> +		return;
>  
>  	switch (btype) {
>  	case TYP_BMAPBTA:
>  	case TYP_BMAPBTD:
> +		if (nrecs > mp->m_bmap_dmxr[1])
> +			return;
> +

Shouldn't we use the 0 index max recs value (for leaf blocks) throughout
this function? (e.g, mp->m_bmap_dmxr[0])

Brian

>  		brp = XFS_BMBT_REC_ADDR(mp, block, 1);
>  		zp = (char *)&brp[nrecs];
>  		break;
>  	case TYP_INOBT:
>  	case TYP_FINOBT:
> +		if (nrecs > mp->m_inobt_mxr[1])
> +			return;
> +
>  		irp = XFS_INOBT_REC_ADDR(mp, block, 1);
>  		zp = (char *)&irp[nrecs];
>  		break;
>  	case TYP_BNOBT:
>  	case TYP_CNTBT:
> +		if (nrecs > mp->m_alloc_mxr[1])
> +			return;
> +
>  		arp = XFS_ALLOC_REC_ADDR(mp, block, 1);
>  		zp = (char *)&arp[nrecs];
>  		break;
> -- 
> 2.5.0
> 

* Re: [PATCH 3/9] xfs_mdrestore: correctly account bytes read
  2015-12-21 21:37 ` [PATCH 3/9] xfs_mdrestore: correctly account bytes read Dave Chinner
@ 2016-01-04 19:12   ` Brian Foster
  0 siblings, 0 replies; 21+ messages in thread
From: Brian Foster @ 2016-01-04 19:12 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Tue, Dec 22, 2015 at 08:37:03AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Progess indication comes in the form of a "X MB read" output. This
> doesn't match up with the actual number of bytes read from the
> metadump file because it only accounts header blocks in the file,
> not actual metadata blocks that are restored, Hence the number
> reported is usually much lower than the size of the metadump file,
> hence it's impossible to use to guage progress of the restore.

				  gauge

Otherwise looks good:

Reviewed-by: Brian Foster <bfoster@redhat.com>

> 
> While there, fix the progress output so that it overwrites the
> previous progress output line correctly.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  mdrestore/xfs_mdrestore.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/mdrestore/xfs_mdrestore.c b/mdrestore/xfs_mdrestore.c
> index ebc5e54..70a160c 100644
> --- a/mdrestore/xfs_mdrestore.c
> +++ b/mdrestore/xfs_mdrestore.c
> @@ -133,7 +133,7 @@ perform_restore(
>  
>  	for (;;) {
>  		if (show_progress && (bytes_read & ((1 << 20) - 1)) == 0)
> -			print_progress("%lld MB read\n", bytes_read >> 20);
> +			print_progress("%lld MB read", bytes_read >> 20);
>  
>  		for (cur_index = 0; cur_index < mb_count; cur_index++) {
>  			if (pwrite64(dst_fd, &block_buffer[cur_index <<
> @@ -160,7 +160,7 @@ perform_restore(
>  								1, src_f) != 1)
>  			fatal("error reading from file: %s\n", strerror(errno));
>  
> -		bytes_read += block_size;
> +		bytes_read += block_size + (mb_count << tmb.mb_blocklog);
>  	}
>  
>  	if (progress_since_warning)
> -- 
> 2.5.0
> 

* Re: [PATCH 4/9] repair: parallelise phase 7
  2015-12-21 21:37 ` [PATCH 4/9] repair: parallelise phase 7 Dave Chinner
@ 2016-01-04 19:12   ` Brian Foster
  0 siblings, 0 replies; 21+ messages in thread
From: Brian Foster @ 2016-01-04 19:12 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Tue, Dec 22, 2015 at 08:37:04AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> It operates on a single AG at a time, sequentially, doing inode updates. All the
> data structures accessed and modified are per AG, as are the modification to
> on-disk structures. Hence we can run this phase concurrently across multiple
> AGs.
> 
> This is important for large, broken filesystem repairs, where there can be
> millions of inodes that need link counts updated. Once such repair image takes
> more than 45 minutes to run phase 7 as a single threaded operation.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  repair/phase7.c     | 77 ++++++++++++++++++++++++++++++++++-------------------
>  repair/progress.c   |  4 +--
>  repair/protos.h     |  2 +-
>  repair/xfs_repair.c |  2 +-
>  4 files changed, 54 insertions(+), 31 deletions(-)
> 
> diff --git a/repair/phase7.c b/repair/phase7.c
> index b1e3a55..91dad02 100644
> --- a/repair/phase7.c
> +++ b/repair/phase7.c
> @@ -26,6 +26,7 @@
>  #include "dinode.h"
>  #include "versions.h"
>  #include "progress.h"
> +#include "threads.h"
>  
>  /* dinoc is a pointer to the IN-CORE dinode core */
>  static void
> @@ -108,45 +109,67 @@ update_inode_nlinks(
>  	IRELE(ip);
>  }
>  
> -void
> -phase7(xfs_mount_t *mp)
> +/*
> + * for each ag, look at each inode 1 at a time. If the number of
> + * links is bad, reset it, log the inode core, commit the transaction
> + */
> +static void
> +do_link_updates(
> +	struct work_queue	*wq,
> +	xfs_agnumber_t		agno,
> +	void			*arg)
>  {
>  	ino_tree_node_t		*irec;
> -	int			i;
>  	int			j;
>  	__uint32_t		nrefs;
>  
> +	irec = findfirst_inode_rec(agno);
> +
> +	while (irec != NULL)  {
> +		for (j = 0; j < XFS_INODES_PER_CHUNK; j++)  {
> +			ASSERT(is_inode_confirmed(irec, j));
> +
> +			if (is_inode_free(irec, j))
> +				continue;
> +
> +			ASSERT(no_modify || is_inode_reached(irec, j));
> +
> +			nrefs = num_inode_references(irec, j);
> +			ASSERT(no_modify || nrefs > 0);
> +
> +			if (get_inode_disk_nlinks(irec, j) != nrefs)
> +				update_inode_nlinks(wq->mp,
> +					XFS_AGINO_TO_INO(wq->mp, agno,
> +						irec->ino_startnum + j),
> +					nrefs);
> +		}
> +		irec = next_ino_rec(irec);
> +	}
> +
> +	PROG_RPT_INC(prog_rpt_done[agno], 1);
> +}
> +
> +void
> +phase7(
> +	struct xfs_mount	*mp,
> +	int			scan_threads)
> +{
> +	struct work_queue	wq;
> +	int			agno;
> +
>  	if (!no_modify)
>  		do_log(_("Phase 7 - verify and correct link counts...\n"));
>  	else
>  		do_log(_("Phase 7 - verify link counts...\n"));
>  
> -	/*
> -	 * for each ag, look at each inode 1 at a time. If the number of
> -	 * links is bad, reset it, log the inode core, commit the transaction
> -	 */
> -	for (i = 0; i < glob_agcount; i++)  {
> -		irec = findfirst_inode_rec(i);
> -
> -		while (irec != NULL)  {
> -			for (j = 0; j < XFS_INODES_PER_CHUNK; j++)  {
> -				ASSERT(is_inode_confirmed(irec, j));
> +	set_progress_msg(PROGRESS_FMT_CORR_LINK, (__uint64_t) glob_agcount);
>  
> -				if (is_inode_free(irec, j))
> -					continue;
> +	create_work_queue(&wq, mp, scan_threads);
>  
> -				ASSERT(no_modify || is_inode_reached(irec, j));
> +	for (agno = 0; agno < mp->m_sb.sb_agcount; agno++)
> +		queue_work(&wq, do_link_updates, agno, NULL);
>  
> -				nrefs = num_inode_references(irec, j);
> -				ASSERT(no_modify || nrefs > 0);
> +	destroy_work_queue(&wq);
>  
> -				if (get_inode_disk_nlinks(irec, j) != nrefs)
> -					update_inode_nlinks(mp,
> -						XFS_AGINO_TO_INO(mp, i,
> -							irec->ino_startnum + j),
> -						nrefs);
> -			}
> -			irec = next_ino_rec(irec);
> -		}
> -	}
> +	print_final_rpt();
>  }
> diff --git a/repair/progress.c b/repair/progress.c
> index 418b803..2a09b23 100644
> --- a/repair/progress.c
> +++ b/repair/progress.c
> @@ -75,9 +75,9 @@ progress_rpt_t progress_rpt_reports[] = {
>  {FMT2, N_("moving disconnected inodes to lost+found"),		/* 12 */
>  	&rpt_fmts[FMT2], &rpt_types[TYPE_INODE]},
>  {FMT1, N_("verify and correct link counts"),			/* 13 */
> -	&rpt_fmts[FMT1], &rpt_types[TYPE_INODE]},
> +	&rpt_fmts[FMT1], &rpt_types[TYPE_AG]},
>  {FMT1, N_("verify link counts"),				/* 14 */
> -	&rpt_fmts[FMT1], &rpt_types[TYPE_INODE]}
> +	&rpt_fmts[FMT1], &rpt_types[TYPE_AG]}
>  };
>  
>  pthread_t	report_thread;
> diff --git a/repair/protos.h b/repair/protos.h
> index 9d5a2a6..b113aca 100644
> --- a/repair/protos.h
> +++ b/repair/protos.h
> @@ -50,7 +50,7 @@ void	phase3(struct xfs_mount *);
>  void	phase4(struct xfs_mount *);
>  void	phase5(struct xfs_mount *);
>  void	phase6(struct xfs_mount *);
> -void	phase7(struct xfs_mount *);
> +void	phase7(struct xfs_mount *, int);
>  
>  int	verify_set_agheader(struct xfs_mount *, struct xfs_buf *,
>  		struct xfs_sb *, struct xfs_agf *, struct xfs_agi *,
> diff --git a/repair/xfs_repair.c b/repair/xfs_repair.c
> index 3eaced4..fcdb212 100644
> --- a/repair/xfs_repair.c
> +++ b/repair/xfs_repair.c
> @@ -893,7 +893,7 @@ main(int argc, char **argv)
>  		phase6(mp);
>  		timestamp(PHASE_END, 6, NULL);
>  
> -		phase7(mp);
> +		phase7(mp, phase2_threads);
>  		timestamp(PHASE_END, 7, NULL);
>  	} else  {
>  		do_warn(
> -- 
> 2.5.0
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 5/9] repair: parallelise uncertain inode processing in phase 3
  2015-12-21 21:37 ` [PATCH 5/9] repair: parallelise uncertain inode processing in phase 3 Dave Chinner
@ 2016-01-04 19:12   ` Brian Foster
  0 siblings, 0 replies; 21+ messages in thread
From: Brian Foster @ 2016-01-04 19:12 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Tue, Dec 22, 2015 at 08:37:05AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> This can take a long time when there are millions of uncertain inodes in badly
> broken filesystems. The processing is per-ag, the data structures are all
> per-ag, and the load is mostly CPU time spent checking CRCs on each
> uncertain inode. Parallelising reduced the runtime of this phase on a badly
> broken filesystem from ~30 minutes to under 5 minutes.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

This one seems a bit more scary simply because of the amount of work
involved in phase 3, such as inode processing and whatnot. I don't think
that everything in there is necessarily AG local as the commit log
description above implies. On a skim through, I do see that we have
ag_locks[agno].lock for cases that can cross AG boundaries, such as
checking block allocation state (get_bmap()/set_bmap()) of bmapbt blocks
(iiuc?), for example.
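
To be concrete about the pattern I mean, here is a minimal sketch. All
of the names below - ag_locks, set_bmap(), XR_E_INUSE and the made-up
mark_block_used() helper - are assumptions for illustration, not the
real repair prototypes:

/*
 * Illustrative only: a worker scanning one AG can encounter bmapbt
 * blocks owned by a different AG, so the block state update has to
 * take the owning AG's lock, not the scanning thread's AG lock.
 */
#include <pthread.h>

struct aglock {
	pthread_mutex_t	lock;
};
extern struct aglock	ag_locks[];	/* assumed per-AG lock array */
extern void	set_bmap(int agno, long agbno, int state);
#define XR_E_INUSE	2		/* assumed block state value */

static void
mark_block_used(int owner_agno, long agbno)
{
	pthread_mutex_lock(&ag_locks[owner_agno].lock);
	set_bmap(owner_agno, agbno, XR_E_INUSE);
	pthread_mutex_unlock(&ag_locks[owner_agno].lock);
}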

So I can't really spot any actual problems, but there's a lot of code
down in there. I'm fine with it as long as testing bears out and this
gets a reasonable amount of soak time such that we can hopefully catch
any serious issues or areas currently lacking sufficient locking:

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  repair/phase3.c     | 59 ++++++++++++++++++++++++++++++++++++++++++++---------
>  repair/protos.h     |  2 +-
>  repair/xfs_repair.c |  2 +-
>  3 files changed, 51 insertions(+), 12 deletions(-)
> 
> diff --git a/repair/phase3.c b/repair/phase3.c
> index 76c9440..0890a27 100644
> --- a/repair/phase3.c
> +++ b/repair/phase3.c
> @@ -28,6 +28,7 @@
>  #include "dinode.h"
>  #include "progress.h"
>  #include "bmap.h"
> +#include "threads.h"
>  
>  static void
>  process_agi_unlinked(
> @@ -87,10 +88,33 @@ process_ags(
>  	do_inode_prefetch(mp, ag_stride, process_ag_func, false, false);
>  }
>  
> +static void
> +do_uncertain_aginodes(
> +	work_queue_t	*wq,
> +	xfs_agnumber_t	agno,
> +	void		*arg)
> +{
> +	int		*count = arg;
> +
> +	*count = process_uncertain_aginodes(wq->mp, agno);
> +
> +#ifdef XR_INODE_TRACE
> +	fprintf(stderr,
> +		"\t\t phase 3 - ag %d process_uncertain_inodes returns %d\n",
> +		agno, *count);
> +#endif
> +
> +	PROG_RPT_INC(prog_rpt_done[agno], 1);
> +}
> +
>  void
> -phase3(xfs_mount_t *mp)
> +phase3(
> +	struct xfs_mount *mp,
> +	int		scan_threads)
>  {
> -	int 			i, j;
> +	int			i, j;
> +	int			*counts;
> +	work_queue_t		wq;
>  
>  	do_log(_("Phase 3 - for each AG...\n"));
>  	if (!no_modify)
> @@ -129,20 +153,35 @@ phase3(xfs_mount_t *mp)
>  	 */
>  	do_log(_("        - process newly discovered inodes...\n"));
>  	set_progress_msg(PROG_FMT_NEW_INODES, (__uint64_t) glob_agcount);
> +
> +	counts = calloc(mp->m_sb.sb_agcount, sizeof(*counts));
> +	if (!counts) {
> +		do_abort(_("no memory for uncertain inode counts\n"));
> +		return;
> +	}
> +
>  	do  {
>  		/*
>  		 * have to loop until no ag has any uncertain
>  		 * inodes
>  		 */
>  		j = 0;
> -		for (i = 0; i < mp->m_sb.sb_agcount; i++)  {
> -			j += process_uncertain_aginodes(mp, i);
> -#ifdef XR_INODE_TRACE
> -			fprintf(stderr,
> -				"\t\t phase 3 - process_uncertain_inodes returns %d\n", j);
> -#endif
> -			PROG_RPT_INC(prog_rpt_done[i], 1);
> -		}
> +		memset(counts, 0, mp->m_sb.sb_agcount * sizeof(*counts));
> +
> +		create_work_queue(&wq, mp, scan_threads);
> +
> +		for (i = 0; i < mp->m_sb.sb_agcount; i++)
> +			queue_work(&wq, do_uncertain_aginodes, i, &counts[i]);
> +
> +		destroy_work_queue(&wq);
> +
> +		/* tally up the counts */
> +		for (i = 0; i < mp->m_sb.sb_agcount; i++)
> +			j += counts[i];
> +
>  	} while (j != 0);
> +
> +	free(counts);
> +
>  	print_final_rpt();
>  }
> diff --git a/repair/protos.h b/repair/protos.h
> index b113aca..0290420 100644
> --- a/repair/protos.h
> +++ b/repair/protos.h
> @@ -46,7 +46,7 @@ void	thread_init(void);
>  
>  void	phase1(struct xfs_mount *);
>  void	phase2(struct xfs_mount *, int);
> -void	phase3(struct xfs_mount *);
> +void	phase3(struct xfs_mount *, int);
>  void	phase4(struct xfs_mount *);
>  void	phase5(struct xfs_mount *);
>  void	phase6(struct xfs_mount *);
> diff --git a/repair/xfs_repair.c b/repair/xfs_repair.c
> index fcdb212..5d5f3aa 100644
> --- a/repair/xfs_repair.c
> +++ b/repair/xfs_repair.c
> @@ -871,7 +871,7 @@ main(int argc, char **argv)
>  	if (do_prefetch)
>  		init_prefetch(mp);
>  
> -	phase3(mp);
> +	phase3(mp, phase2_threads);
>  	timestamp(PHASE_END, 3, NULL);
>  
>  	phase4(mp);
> -- 
> 2.5.0
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 6/9] libxfs: directory node splitting does not have an extra block
  2015-12-21 21:37 ` [PATCH 6/9] libxfs: directory node splitting does not have an extra block Dave Chinner
@ 2016-01-05 18:34   ` Brian Foster
  2016-01-05 22:07     ` Dave Chinner
  0 siblings, 1 reply; 21+ messages in thread
From: Brian Foster @ 2016-01-05 18:34 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Tue, Dec 22, 2015 at 08:37:06AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> xfs_da3_split() has to handle all three versions of the
> directory/attribute btree structure. The attr tree is v1, the dir
> tree is v2 or v3. The main difference between the v1 and v2/3 trees
> is the way tree nodes are split - in the v1 tree we can require a
> double split to occur because the object to be inserted may be
> larger than the space made by splitting a leaf. In this case we need
> to do a double split - one to split the full leaf, then another to
> allocate an empty leaf block in the correct location for the new
> entry.  This does not happen with dir (v2/v3) formats as the objects
> being inserted are always guaranteed to fit into the new space in
> the split blocks.
> 
> The problem is that when we are doing the first root split, there
> can be three leaf blocks that need to be updated for an attr tree,
> but there will ony ever be two blocks for the dir tree. The code,
> however, expects that there will always be an extra block (i.e.
> three blocks) that needs updating. When rebuilding broken
> directories, xfs_repair trips over this condition in the directory
> code and SEGVs because state->extrablk.bp == NULL. i.e. there is no
> extra block.
> 
> Indeed, for directories there *may* be an extra block on this buffer
> pointer. However, it's guaranteed not to be a leaf block (i.e. a
> directory data block) - the directory code only ever places hash
> index or free space blocks in this pointer (as a cursor of
> sorts), and so to use it as a directory data block will immediately
> corrupt the directory. This manifests itself in repair as a
> directory corruption in a repaired directory, leaving the directory
> rebuild incomplete.
> 
> This is a dir v2 zero-day bug - it was in the initial dir v2 commit
> that was made back in February 1998.
> 
> Fix this by making the update code aware of whether the extra block
> is a valid node for linking into the tree, rather than asserting
> (incorrectly) that the extra block must be valid. This brings the
> code in line with xfs_da3_node_split() which, from the first dir v2
> commit, handled the conditional extra block case correctly....
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Thanks for the explanation. The logic in xfs_da3_split() is a bit
confusing and I'm not sure I'm following it quite correctly. A couple
questions below...

>  libxfs/xfs_da_btree.c | 39 ++++++++++++++++++++++++++-------------
>  1 file changed, 26 insertions(+), 13 deletions(-)
> 
> diff --git a/libxfs/xfs_da_btree.c b/libxfs/xfs_da_btree.c
> index bf5fe21..58a0361 100644
> --- a/libxfs/xfs_da_btree.c
> +++ b/libxfs/xfs_da_btree.c
> @@ -356,6 +356,7 @@ xfs_da3_split(
>  	int			action = 0;
>  	int			error;
>  	int			i;
> +	bool			use_extra = false;
>  
>  	trace_xfs_da_split(state->args);
>  
> @@ -395,6 +396,7 @@ xfs_da3_split(
>  			 * Entry wouldn't fit, split the leaf again.
>  			 */
>  			state->extravalid = 1;
> +			use_extra = true;
>  			if (state->inleaf) {
>  				state->extraafter = 0;	/* before newblk */
>  				trace_xfs_attr_leaf_split_before(state->args);

So here we walk up through a tree, doing splits as necessary. In this
particular case, we have the potential attr leaf double-split case
described above. If this happens, we set extrablk and use_extra. Is it
the case that if this occurs, we know the loop is good for at least one
more iteration (since this is a leaf block)?

If so, we get to a node block that might be the root and call
xfs_da3_node_split(). That function potentially splits the node block
and adds the entries for the previous split, "consuming" extrablk if it
had been set as well. In fact, if the node/root doesn't split, we don't
carry on to any of the below code at all since addblk is set to NULL
(even if the double split had occurred).

Given that, I'm not seeing how extrablk should be used by the following
root split code at all under normal circumstances. Wouldn't it always be
handled before that point? To frame the question another way, if the
double split is only a leaf block operation, shouldn't only addblk be
relevant to the root block split code? I could very easily be missing
something here...

Brian

> @@ -454,8 +456,13 @@ xfs_da3_split(
>  	/*
>  	 * Update pointers to the node which used to be block 0 and
>  	 * just got bumped because of the addition of a new root node.
> -	 * There might be three blocks involved if a double split occurred,
> -	 * and the original block 0 could be at any position in the list.
> +	 * There might be three blocks involved if a double split occurred and
> +	 * in this case use_extra will be set. This can only happen for attr
> +	 * trees, as extrablk for dir trees can only contain data or free space
> +	 * blocks, never leaf/node blocks.
> +	 *
> +	 * Note that the original block 0 could be at any position in the list
> +	 * of blocks in the tree.
>  	 *
>  	 * Note: the magic numbers and sibling pointers are in the same
>  	 * physical place for both v2 and v3 headers (by design). Hence it
> @@ -464,31 +471,37 @@ xfs_da3_split(
>  	 */
>  	node = oldblk->bp->b_addr;
>  	if (node->hdr.info.forw) {
> +		bp = NULL;
>  		if (be32_to_cpu(node->hdr.info.forw) == addblk->blkno) {
>  			bp = addblk->bp;
> -		} else {
> +		} else if (use_extra) {
>  			ASSERT(state->extravalid);
>  			bp = state->extrablk.bp;
>  		}
> -		node = bp->b_addr;
> -		node->hdr.info.back = cpu_to_be32(oldblk->blkno);
> -		xfs_trans_log_buf(state->args->trans, bp,
> -		    XFS_DA_LOGRANGE(node, &node->hdr.info,
> -		    sizeof(node->hdr.info)));
> +		if (bp) {
> +			node = bp->b_addr;
> +			node->hdr.info.back = cpu_to_be32(oldblk->blkno);
> +			xfs_trans_log_buf(state->args->trans, bp,
> +					  XFS_DA_LOGRANGE(node, &node->hdr.info,
> +					  sizeof(node->hdr.info)));
> +		}
>  	}
>  	node = oldblk->bp->b_addr;
>  	if (node->hdr.info.back) {
> +		bp = NULL;
>  		if (be32_to_cpu(node->hdr.info.back) == addblk->blkno) {
>  			bp = addblk->bp;
>  		} else {
>  			ASSERT(state->extravalid);
>  			bp = state->extrablk.bp;
>  		}
> -		node = bp->b_addr;
> -		node->hdr.info.forw = cpu_to_be32(oldblk->blkno);
> -		xfs_trans_log_buf(state->args->trans, bp,
> -		    XFS_DA_LOGRANGE(node, &node->hdr.info,
> -		    sizeof(node->hdr.info)));
> +		if (bp) {
> +			node = bp->b_addr;
> +			node->hdr.info.forw = cpu_to_be32(oldblk->blkno);
> +			xfs_trans_log_buf(state->args->trans, bp,
> +					  XFS_DA_LOGRANGE(node, &node->hdr.info,
> +					  sizeof(node->hdr.info)));
> +		}
>  	}
>  	addblk->bp = NULL;
>  	return 0;
> -- 
> 2.5.0
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 7/9] libxfs: don't discard dirty buffers
  2015-12-21 21:37 ` [PATCH 7/9] libxfs: don't discard dirty buffers Dave Chinner
@ 2016-01-05 18:34   ` Brian Foster
  0 siblings, 0 replies; 21+ messages in thread
From: Brian Foster @ 2016-01-05 18:34 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Tue, Dec 22, 2015 at 08:37:07AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When we release a buffer from the cache, if it is dirty we write it
> to disk then put the buffer on the free list for recycling. However,
> if the write fails (e.g. verifier failure due to unfixed corruption) we
> effectively throw the buffer and its contents away. This causes all
> sorts of problems for xfs_repair as it then re-reads the buffer from
> disk on the next access and hence loses all the corrections that had
> previously been made, resulting in tripping over corruptions in code
> that assumes the corruptions have already been fixed/flagged in the
> buffer it receives.
> 
> To fix this, we have to make the cache aware that writes can fail,
> and keep the buffer in cache when writes fail. Hence we have to add
> an explicit error notification to the flush operation, and we need
> to do that before we release the buffer to the free list. This also
> means that we need to remove the writeback code from the release
> mechanisms, instead replacing them with assertions that the buffers
> are already clean.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  include/cache.h |  2 +-
>  libxfs/cache.c  | 15 ++++++++++++++-
>  libxfs/rdwr.c   | 44 +++++++++++++++++++++++++++-----------------
>  3 files changed, 42 insertions(+), 19 deletions(-)
> 
> diff --git a/include/cache.h b/include/cache.h
> index 0a84c69..87826be 100644
> --- a/include/cache.h
> +++ b/include/cache.h
> @@ -64,7 +64,7 @@ typedef void *cache_key_t;
>  
>  typedef void (*cache_walk_t)(struct cache_node *);
>  typedef struct cache_node * (*cache_node_alloc_t)(cache_key_t);
> -typedef void (*cache_node_flush_t)(struct cache_node *);
> +typedef int (*cache_node_flush_t)(struct cache_node *);
>  typedef void (*cache_node_relse_t)(struct cache_node *);
>  typedef unsigned int (*cache_node_hash_t)(cache_key_t, unsigned int,
>  					  unsigned int);
> diff --git a/libxfs/cache.c b/libxfs/cache.c
> index 4753a1d..a48ebd9 100644
> --- a/libxfs/cache.c
> +++ b/libxfs/cache.c
> @@ -219,6 +219,12 @@ cache_shake(
>  		if (pthread_mutex_trylock(&node->cn_mutex) != 0)
>  			continue;
>  
> +		/* can't release dirty objects */
> +		if (cache->flush(node)) {
> +			pthread_mutex_unlock(&node->cn_mutex);
> +			continue;
> +		}
> +
>  		hash = cache->c_hash + node->cn_hashidx;
>  		if (pthread_mutex_trylock(&hash->ch_mutex) != 0) {
>  			pthread_mutex_unlock(&node->cn_mutex);
> @@ -311,6 +317,13 @@ __cache_node_purge(
>  		pthread_mutex_unlock(&node->cn_mutex);
>  		return count;
>  	}
> +
> +	/* can't purge dirty objects */
> +	if (cache->flush(node)) {
> +		pthread_mutex_unlock(&node->cn_mutex);
> +		return 1;
> +	}
> +
>  	mru = &cache->c_mrus[node->cn_priority];
>  	pthread_mutex_lock(&mru->cm_mutex);
>  	list_del_init(&node->cn_mru);
> @@ -321,7 +334,7 @@ __cache_node_purge(
>  	pthread_mutex_destroy(&node->cn_mutex);
>  	list_del_init(&node->cn_hash);
>  	cache->relse(node);
> -	return count;
> +	return 0;
>  }
>  
>  /*
> diff --git a/libxfs/rdwr.c b/libxfs/rdwr.c
> index 7a04985..0337a21 100644
> --- a/libxfs/rdwr.c
> +++ b/libxfs/rdwr.c
> @@ -659,6 +659,8 @@ __libxfs_getbufr(int blen)
>  		bp = kmem_zone_zalloc(xfs_buf_zone, 0);
>  	pthread_mutex_unlock(&xfs_buf_freelist.cm_mutex);
>  	bp->b_ops = NULL;
> +	if (bp->b_flags & LIBXFS_B_DIRTY)
> +		fprintf(stderr, "found dirty buffer (bulk) on free list!");
>  
>  	return bp;
>  }
> @@ -1223,23 +1225,26 @@ libxfs_iomove(xfs_buf_t *bp, uint boff, int len, void *data, int flags)
>  }
>  
>  static void
> -libxfs_brelse(struct cache_node *node)
> +libxfs_brelse(
> +	struct cache_node	*node)
>  {
> -	xfs_buf_t		*bp = (xfs_buf_t *)node;
> +	struct xfs_buf		*bp = (struct xfs_buf *)node;
>  
> -	if (bp != NULL) {
> -		if (bp->b_flags & LIBXFS_B_DIRTY)
> -			libxfs_writebufr(bp);
> -		pthread_mutex_lock(&xfs_buf_freelist.cm_mutex);
> -		list_add(&bp->b_node.cn_mru, &xfs_buf_freelist.cm_list);
> -		pthread_mutex_unlock(&xfs_buf_freelist.cm_mutex);
> -	}
> +	if (!bp)
> +		return;
> +	if (bp->b_flags & LIBXFS_B_DIRTY)
> +		fprintf(stderr,
> +			"releasing dirty buffer to free list!");
> +
> +	pthread_mutex_lock(&xfs_buf_freelist.cm_mutex);
> +	list_add(&bp->b_node.cn_mru, &xfs_buf_freelist.cm_list);
> +	pthread_mutex_unlock(&xfs_buf_freelist.cm_mutex);
>  }
>  
>  static unsigned int
>  libxfs_bulkrelse(
> -	struct cache 		*cache,
> -	struct list_head 	*list)
> +	struct cache		*cache,
> +	struct list_head	*list)
>  {
>  	xfs_buf_t		*bp;
>  	int			count = 0;
> @@ -1249,7 +1254,8 @@ libxfs_bulkrelse(
>  
>  	list_for_each_entry(bp, list, b_node.cn_mru) {
>  		if (bp->b_flags & LIBXFS_B_DIRTY)
> -			libxfs_writebufr(bp);
> +			fprintf(stderr,
> +				"releasing dirty buffer (bulk) to free list!");
>  		count++;
>  	}
>  
> @@ -1260,18 +1266,22 @@ libxfs_bulkrelse(
>  	return count;
>  }
>  
> -static void
> -libxfs_bflush(struct cache_node *node)
> +static int
> +libxfs_bflush(
> +	struct cache_node	*node)
>  {
> -	xfs_buf_t		*bp = (xfs_buf_t *)node;
> +	struct xfs_buf		*bp = (struct xfs_buf *)node;
>  
> -	if ((bp != NULL) && (bp->b_flags & LIBXFS_B_DIRTY))
> -		libxfs_writebufr(bp);
> +	if (bp->b_flags & LIBXFS_B_DIRTY)
> +		return libxfs_writebufr(bp);
> +	return 0;
>  }
>  
>  void
>  libxfs_putbufr(xfs_buf_t *bp)
>  {
> +	if (bp->b_flags & LIBXFS_B_DIRTY)
> +		libxfs_writebufr(bp);
>  	libxfs_brelse((struct cache_node *)bp);
>  }
>  
> -- 
> 2.5.0
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 8/9] libxfs: don't repeatedly shake unwritable buffers
  2015-12-21 21:37 ` [PATCH 8/9] libxfs: don't repeatedly shake unwritable buffers Dave Chinner
@ 2016-01-05 18:34   ` Brian Foster
  0 siblings, 0 replies; 21+ messages in thread
From: Brian Foster @ 2016-01-05 18:34 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Tue, Dec 22, 2015 at 08:37:08AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Now that we try to write dirty buffers before we release them, we
> can get a buildup of unwritable dirty buffers on the LRU lists. This
> results in the cache shaker repeatedly trying to write out these
> buffers every time the cache fills up. This results in more
> corruption warnings and wastes a lot of time reclaiming
> nothing. This can effectively livelock the processing parts of phase
> 4.
> 
> Fix this by not trying to write buffers with corruption errors on
> them. These errors will get cleared when the buffer is re-read and
> fixed and then marked dirty again. At that point, we'll be able to
> write them and so the cache can reclaim them successfully.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  libxfs/rdwr.c | 27 ++++++++++++++++-----------
>  1 file changed, 16 insertions(+), 11 deletions(-)
> 
> diff --git a/libxfs/rdwr.c b/libxfs/rdwr.c
> index 0337a21..a1f0029 100644
> --- a/libxfs/rdwr.c
> +++ b/libxfs/rdwr.c
> @@ -1103,7 +1103,6 @@ int
>  libxfs_writebufr(xfs_buf_t *bp)
>  {
>  	int	fd = libxfs_device_to_fd(bp->b_target->dev);
> -	int	error = 0;
>  
>  	/*
>  	 * we never write buffers that are marked stale. This indicates they
> @@ -1134,7 +1133,7 @@ libxfs_writebufr(xfs_buf_t *bp)
>  	}
>  
>  	if (!(bp->b_flags & LIBXFS_B_DISCONTIG)) {
> -		error = __write_buf(fd, bp->b_addr, bp->b_bcount,
> +		bp->b_error = __write_buf(fd, bp->b_addr, bp->b_bcount,
>  				    LIBXFS_BBTOOFF64(bp->b_bn), bp->b_flags);
>  	} else {
>  		int	i;
> @@ -1144,11 +1143,10 @@ libxfs_writebufr(xfs_buf_t *bp)
>  			off64_t	offset = LIBXFS_BBTOOFF64(bp->b_map[i].bm_bn);
>  			int len = BBTOB(bp->b_map[i].bm_len);
>  
> -			error = __write_buf(fd, buf, len, offset, bp->b_flags);
> -			if (error) {
> -				bp->b_error = error;
> +			bp->b_error = __write_buf(fd, buf, len, offset,
> +						  bp->b_flags);
> +			if (bp->b_error)
>  				break;
> -			}
>  			buf += len;
>  		}
>  	}
> @@ -1157,14 +1155,14 @@ libxfs_writebufr(xfs_buf_t *bp)
>  	printf("%lx: %s: wrote %u bytes, blkno=%llu(%llu), %p, error %d\n",
>  			pthread_self(), __FUNCTION__, bp->b_bcount,
>  			(long long)LIBXFS_BBTOOFF64(bp->b_bn),
> -			(long long)bp->b_bn, bp, error);
> +			(long long)bp->b_bn, bp, bp->b_error);
>  #endif
> -	if (!error) {
> +	if (!bp->b_error) {
>  		bp->b_flags |= LIBXFS_B_UPTODATE;
>  		bp->b_flags &= ~(LIBXFS_B_DIRTY | LIBXFS_B_EXIT |
>  				 LIBXFS_B_UNCHECKED);
>  	}
> -	return error;
> +	return bp->b_error;
>  }
>  
>  int
> @@ -1266,15 +1264,22 @@ libxfs_bulkrelse(
>  	return count;
>  }
>  
> +/*
> + * When a buffer is marked dirty, the error is cleared. Hence if we are trying
> + * to flush a buffer prior to cache reclaim that has an error on it, it means
> + * we've already tried to flush it and it failed. Prevent repeated corruption
> + * errors from being reported by skipping such buffers - when the corruption is
> + * fixed the buffer will be marked dirty again and we can write it again.
> + */
>  static int
>  libxfs_bflush(
>  	struct cache_node	*node)
>  {
>  	struct xfs_buf		*bp = (struct xfs_buf *)node;
>  
> -	if (bp->b_flags & LIBXFS_B_DIRTY)
> +	if (!bp->b_error && bp->b_flags & LIBXFS_B_DIRTY)
>  		return libxfs_writebufr(bp);
> -	return 0;
> +	return bp->b_error;
>  }
>  
>  void
> -- 
> 2.5.0
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 9/9] libxfs: keep unflushable buffers off the cache MRUs
  2015-12-21 21:37 ` [PATCH 9/9] libxfs: keep unflushable buffers off the cache MRUs Dave Chinner
@ 2016-01-05 18:34   ` Brian Foster
  2016-01-05 23:58     ` Dave Chinner
  0 siblings, 1 reply; 21+ messages in thread
From: Brian Foster @ 2016-01-05 18:34 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Tue, Dec 22, 2015 at 08:37:09AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> There's no point trying to free buffers that are dirty and return
> errors on flush as we have to keep them around until the corruption
> is fixed. Hence if we fail to flush a buffer during a cache shake,
> move the buffer to a special dirty MRU list that the cache does not
> shake. This prevents memory pressure from seeing these buffers, but
> allows subsequent cache lookups to still find them through the hash.
> This ensures we don't waste huge amounts of CPU trying to flush and
> reclaim buffers that cannot be flushed or reclaimed.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  include/cache.h |  3 ++-
>  libxfs/cache.c  | 34 +++++++++++++++++++++++++++++-----
>  2 files changed, 31 insertions(+), 6 deletions(-)
> 
...
> diff --git a/libxfs/cache.c b/libxfs/cache.c
> index a48ebd9..d5ea461 100644
> --- a/libxfs/cache.c
> +++ b/libxfs/cache.c
...
> @@ -202,10 +223,11 @@ cache_shake(
>  	struct cache_node *	node;
>  	unsigned int		count;
>  
> -	ASSERT(priority <= CACHE_MAX_PRIORITY);
> -	if (priority > CACHE_MAX_PRIORITY)
> +	ASSERT(priority <= CACHE_DIRTY_PRIORITY);
> +	if (priority > CACHE_MAX_PRIORITY && !all)
>  		priority = 0;
>  
> +

Extra newline. FWIW, it also looks like the only cache_shake() caller
where all == 0 already prevents calling with priority >
CACHE_MAX_PRIORITY. I think a brief comment in one or both places with
regard to why >max priority is skipped unless 'all == 1' would be good,
though.

Also, it looks like the loop in cache_report() could be updated to dump
the dirty priority mru entry count.
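
Something like this, say (an untested sketch - I'm assuming the shape
of the existing per-MRU loop in cache_report() and its format string,
and that c_mrus[] is now sized to CACHE_DIRTY_PRIORITY + 1):

	/* untested: extend the per-MRU dump to cover the dirty slot */
	for (i = 0; i <= CACHE_DIRTY_PRIORITY; i++)
		fprintf(fp, "MRU %d entries = %6u (%3u%%)\n", i,
			cache->c_mrus[i].cm_count,
			cache->c_mrus[i].cm_count * 100 / cache->c_count);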

Finally, what happens once a buffer on the dirty mru is fully repaired,
rewritten and released? Is it placed right back on the "unshakeable"
dirty mru or is cn_priority updated somewhere? On further digging, it
looks like a subsequent buffer lookup (__cache_lookup()) drops the
priority, though it appears to be designed to deal with prefetched
buffers and the associated B_INODE..B_DIR_BMAP mappings.

Brian

>  	mru = &cache->c_mrus[priority];
>  	count = 0;
>  	list_head_init(&temp);
> @@ -221,6 +243,8 @@ cache_shake(
>  
>  		/* can't release dirty objects */
>  		if (cache->flush(node)) {
> +			cache_move_to_dirty_mru(cache, node);
> +			mru->cm_count--;
>  			pthread_mutex_unlock(&node->cn_mutex);
>  			continue;
>  		}
> @@ -578,7 +602,7 @@ cache_purge(
>  {
>  	int			i;
>  
> -	for (i = 0; i <= CACHE_MAX_PRIORITY; i++)
> +	for (i = 0; i <= CACHE_DIRTY_PRIORITY; i++)
>  		cache_shake(cache, i, 1);
>  
>  #ifdef CACHE_DEBUG
> -- 
> 2.5.0
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 6/9] libxfs: directory node splitting does not have an extra block
  2016-01-05 18:34   ` Brian Foster
@ 2016-01-05 22:07     ` Dave Chinner
  0 siblings, 0 replies; 21+ messages in thread
From: Dave Chinner @ 2016-01-05 22:07 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Tue, Jan 05, 2016 at 01:34:02PM -0500, Brian Foster wrote:
> On Tue, Dec 22, 2015 at 08:37:06AM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > xfs_da3_split() has to handle all three versions of the
> > directory/attribute btree structure. The attr tree is v1, the dir
> > tree is v2 or v3. The main difference between the v1 and v2/3 trees
> > is the way tree nodes are split - in the v1 tree we can require a
> > double split to occur because the object to be inserted may be
> > larger than the space made by splitting a leaf. In this case we need
> > to do a double split - one to split the full leaf, then another to
> > allocate an empty leaf block in the correct location for the new
> > entry.  This does not happen with dir (v2/v3) formats as the objects
> > being inserted are always guaranteed to fit into the new space in
> > the split blocks.
> > 
> > The problem is that when we are doing the first root split, there
> > can be three leaf blocks that need to be updated for an attr tree,
> > but there will ony ever be two blocks for the dir tree. The code,
> > however, expects that there will always be an extra block (i.e.
> > three blocks) that needs updating. When rebuilding broken
> > directories, xfs_repair trips over this condition in the directory
> > code and SEGVs because state->extrablk.bp == NULL. i.e. there is no
> > extra block.
> > 
> > Indeed, for directories there *may* be an extra block on this buffer
> > pointer. However, it's guaranteed not to be a leaf block (i.e. a
> > directory data block) - the directory code only ever places hash
> > index or free space blocks in this pointer (as a cursor of
> > sorts), and so to use it as a directory data block will immediately
> > corrupt the directory. This manifests itself in repair as a
> > directory corruption in a repaired directory, leaving the directory
> > rebuild incomplete.
> > 
> > This is a dir v2 zero-day bug - it was in the initial dir v2 commit
> > that was made back in February 1998.
> > 
> > Fix this by making the update code aware of whether the extra block
> > is a valid node for linking into the tree, rather than asserting
> > (incorrectly) that the extra block must be valid. This brings the
> > code in line with xfs_da3_node_split() which, from the first dir v2
> > commit, handled the conditional extra block case correctly....
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> 
> Thanks for the explanation. The logic in xfs_da3_split() is a bit
> confusing and I'm not sure I'm following it quite correctly. A couple
> questions below...
> 
> >  libxfs/xfs_da_btree.c | 39 ++++++++++++++++++++++++++-------------
> >  1 file changed, 26 insertions(+), 13 deletions(-)
> > 
> > diff --git a/libxfs/xfs_da_btree.c b/libxfs/xfs_da_btree.c
> > index bf5fe21..58a0361 100644
> > --- a/libxfs/xfs_da_btree.c
> > +++ b/libxfs/xfs_da_btree.c
> > @@ -356,6 +356,7 @@ xfs_da3_split(
> >  	int			action = 0;
> >  	int			error;
> >  	int			i;
> > +	bool			use_extra = false;
> >  
> >  	trace_xfs_da_split(state->args);
> >  
> > @@ -395,6 +396,7 @@ xfs_da3_split(
> >  			 * Entry wouldn't fit, split the leaf again.
> >  			 */
> >  			state->extravalid = 1;
> > +			use_extra = true;
> >  			if (state->inleaf) {
> >  				state->extraafter = 0;	/* before newblk */
> >  				trace_xfs_attr_leaf_split_before(state->args);
> 
> So here we walk up through a tree, doing splits as necessary. In this
> particular case, we have the potential attr leaf double-split case
> described above. If this happens, we set extrablk and use_extra. Is it
> the case that if this occurs, we know the loop is good for at least one
> more iteration (since this is a leaf block)?

Yes, I think so, because it's only a "leaf" format attr tree if it's
got a single block. If we need more than one leaf block, we have to
add another level which adds a node. Hence if we are splitting a
leaf, we already have to be in node format.

> If so, we get to a node block that might be the root and call
> xfs_da3_node_split(). That function potentially splits the node block
> and adds the entries for the previous split, "consuming" extrablk if it
> had been set as well.

Right - it is the direct parent that consumes the extra block.  So
we split the node block, which allocates a new block that is stored
in newblk, and that then gets stored in addblk, all the way back up
to the root.

> In fact, if the node/root doesn't split, we don't
> carry on to any of the below code at all since addblk is set to NULL
> (even if the double split had occurred).

Yup, because we don't have any pointers that we need to fix up after
the node split.  It's only when we split the root node that we need
to do the pointer fixup in xfs_da3_split().

> Given that, I'm not seeing how extrablk should be used by the following
> root split code at all under normal circumstances. Wouldn't it always be
> handled before that point?

I think you're right. Which means I can remove all the extrablk
handling from the code, rather than just from the directory block
handling.

Thanks for taking the time to understand the code, the change and
asking a really good question, Brian. :)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 9/9] libxfs: keep unflushable buffers off the cache MRUs
  2016-01-05 18:34   ` Brian Foster
@ 2016-01-05 23:58     ` Dave Chinner
  0 siblings, 0 replies; 21+ messages in thread
From: Dave Chinner @ 2016-01-05 23:58 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Tue, Jan 05, 2016 at 01:34:17PM -0500, Brian Foster wrote:
> On Tue, Dec 22, 2015 at 08:37:09AM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > There's no point trying to free buffers that are dirty and return
> > errors on flush as we have to keep them around until the corruption
> > is fixed. Hence if we fail to flush a buffer during a cache shake,
> > move the buffer to a special dirty MRU list that the cache does not
> > shake. This prevents memory pressure from seeing these buffers, but
> > allows subsequent cache lookups to still find them through the hash.
> > This ensures we don't waste huge amounts of CPU trying to flush and
> > reclaim buffers that cannot be flushed or reclaimed.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  include/cache.h |  3 ++-
> >  libxfs/cache.c  | 34 +++++++++++++++++++++++++++++-----
> >  2 files changed, 31 insertions(+), 6 deletions(-)
> > 
> ...
> > diff --git a/libxfs/cache.c b/libxfs/cache.c
> > index a48ebd9..d5ea461 100644
> > --- a/libxfs/cache.c
> > +++ b/libxfs/cache.c
> ...
> > @@ -202,10 +223,11 @@ cache_shake(
> >  	struct cache_node *	node;
> >  	unsigned int		count;
> >  
> > -	ASSERT(priority <= CACHE_MAX_PRIORITY);
> > -	if (priority > CACHE_MAX_PRIORITY)
> > +	ASSERT(priority <= CACHE_DIRTY_PRIORITY);
> > +	if (priority > CACHE_MAX_PRIORITY && !all)
> >  		priority = 0;
> >  
> > +
> 
> Extra newline. FWIW, it also looks like the only cache_shake() caller
> where all == 0 already prevents calling with priority >
> CACHE_MAX_PRIORITY. I think a brief comment in one or both places with
> regard to why >max priority is skipped unless 'all == 1' would be good,
> though.

OK, can add.

> Also, it looks like the loop in cache_report() could be updated to dump
> the dirty priority mru entry count.

OK.

> 
> Finally, what happens once a buffer on the dirty mru is fully repaired,
> rewritten and released? Is it placed right back on the "unshakeable"
> dirty mru or is cn_priority updated somewhere? On further digging, it
> looks like a subsequent buffer lookup (__cache_lookup()) drops the
> priority, though it appears to be designed to deal with prefetched
> buffers and the associated B_INODE..B_DIR_BMAP mappings.

Somehow I missed sending the last patch in the series which
addresses this exact issue (libxfs: reset dirty buffer priority on
lookup) by saving the priority when the buffer is moved to the dirty
MRU and restoring it when the buffer is removed from the dirty MRU
on the next successful lookup. I'll make sure that's in the updated
patch series ;)
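
For the archives, the shape of that follow-up is roughly the sketch
below. Treat it as an outline of the approach rather than the patch
itself - in particular the cn_old_priority field name and the
cache_move_mru() helper used on the lookup side are assumptions:

/* on shake failure: park the node, remembering where it came from */
static void
cache_move_to_dirty_mru(
	struct cache		*cache,
	struct cache_node	*node)
{
	struct cache_mru	*mru = &cache->c_mrus[CACHE_DIRTY_PRIORITY];

	node->cn_old_priority = node->cn_priority;
	node->cn_priority = CACHE_DIRTY_PRIORITY;

	pthread_mutex_lock(&mru->cm_mutex);
	mru->cm_count++;
	list_move(&node->cn_mru, &mru->cm_list);
	pthread_mutex_unlock(&mru->cm_mutex);
}

/* and on the next successful lookup, once the node is locked: */
	if (node->cn_priority == CACHE_DIRTY_PRIORITY) {
		/* buffer found again: make it visible to the shaker */
		cache_move_mru(node, node->cn_old_priority);
		node->cn_old_priority = -1;
	}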

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2016-01-05 23:59 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-12-21 21:37 [PATCH 0/9] xfsprogs: big, broken filesystems cause pain Dave Chinner
2015-12-21 21:37 ` [PATCH 1/9] metadump: clean up btree block region zeroing Dave Chinner
2016-01-04 19:11   ` Brian Foster
2015-12-21 21:37 ` [PATCH 2/9] metadump: bounds check btree block regions being zeroed Dave Chinner
2016-01-04 19:11   ` Brian Foster
2015-12-21 21:37 ` [PATCH 3/9] xfs_mdrestore: correctly account bytes read Dave Chinner
2016-01-04 19:12   ` Brian Foster
2015-12-21 21:37 ` [PATCH 4/9] repair: parallelise phase 7 Dave Chinner
2016-01-04 19:12   ` Brian Foster
2015-12-21 21:37 ` [PATCH 5/9] repair: parallelise uncertain inode processing in phase 3 Dave Chinner
2016-01-04 19:12   ` Brian Foster
2015-12-21 21:37 ` [PATCH 6/9] libxfs: directory node splitting does not have an extra block Dave Chinner
2016-01-05 18:34   ` Brian Foster
2016-01-05 22:07     ` Dave Chinner
2015-12-21 21:37 ` [PATCH 7/9] libxfs: don't discard dirty buffers Dave Chinner
2016-01-05 18:34   ` Brian Foster
2015-12-21 21:37 ` [PATCH 8/9] libxfs: don't repeatedly shake unwritable buffers Dave Chinner
2016-01-05 18:34   ` Brian Foster
2015-12-21 21:37 ` [PATCH 9/9] libxfs: keep unflushable buffers off the cache MRUs Dave Chinner
2016-01-05 18:34   ` Brian Foster
2016-01-05 23:58     ` Dave Chinner
