* [PATCHSET 0/4] xfs_scrub: improve balancing of threads for inode scan
@ 2022-05-05 16:08 Darrick J. Wong
  2022-05-05 16:08 ` [PATCH 1/4] xfs_scrub: widen action list length variables Darrick J. Wong
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Darrick J. Wong @ 2022-05-05 16:08 UTC (permalink / raw)
  To: sandeen, djwong; +Cc: linux-xfs

Hi all,

As part of the phase 6 cleanups for xfs_repair, Dave Chinner introduced
a patch to the userspace workqueues that allows callers to control the
maximum queue depth.  This lets us fix a problem in xfs_scrub wherein we
fail to parallelize inode scans to the maximum extent possible if a
filesystem's inode usage isn't evenly balanced across AGs.

Resolve this problem by using two workqueues for the inode scan -- one
that calls INUMBERS to find all the inobt records for the filesystem and
creates a separate work item for each record; and a second workqueue
that turns the inobt records into BULKSTAT calls to do the actual
scanning.  We use the queue depth control to avoid excessive queuing of
inode chunks.  This creates more threads to manage, but it means that we
avoid the problem of one AG's inode scan continuing long after the other
threads have run out of work.
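
In sketch form (illustration only -- the chunk allocation and error
handling are elided; the real implementation is in patch 3,
scrub/inodes.c):

/* Stage 2: BULKSTAT one inobt record's worth of inodes and scan them. */
static void
scan_ag_bulkstat(struct workqueue *wq, xfs_agnumber_t agno, void *arg)
{
	/* arg is one queued inode chunk; run the caller's fn on each inode */
}

/* Stage 1: walk one AG's inode btree with INUMBERS. */
static void
scan_ag_inumbers(struct workqueue *wq, xfs_agnumber_t agno, void *arg)
{
	struct scan_inodes	*si = arg;

	/*
	 * For every inobt record INUMBERS returns, allocate an ichunk
	 * (elided here) and queue it on the bounded bulkstat workqueue.
	 */
	workqueue_add(&si->wq_bulkstat, scan_ag_bulkstat, agno, ichunk);
}

The bulkstat workqueue is created with workqueue_create_bound() so that
at most one inobt block's worth of inode chunk records is queued per
worker thread; the inumbers workqueue is a regular per-AG workqueue.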

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-iscan-rebalance
---
 scrub/inodes.c |  343 ++++++++++++++++++++++++++++++++++++++++++--------------
 scrub/phase3.c |   44 ++++++-
 scrub/phase4.c |    6 -
 scrub/repair.c |    2 
 scrub/repair.h |    4 -
 5 files changed, 300 insertions(+), 99 deletions(-)



* [PATCH 1/4] xfs_scrub: widen action list length variables
  2022-05-05 16:08 [PATCHSET 0/4] xfs_scrub: improve balancing of threads for inode scan Darrick J. Wong
@ 2022-05-05 16:08 ` Darrick J. Wong
  2022-05-16 21:46   ` Eric Sandeen
  2022-05-05 16:08 ` [PATCH 2/4] xfs_scrub: prepare phase3 for per-inogrp worker threads Darrick J. Wong
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 9+ messages in thread
From: Darrick J. Wong @ 2022-05-05 16:08 UTC (permalink / raw)
  To: sandeen, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

On a 32-bit system it's possible for there to be so many items in the
repair list that we overflow a size_t.  Widen this to unsigned long
long.
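
As an illustration (not part of the patch): on a typical 32-bit build,
size_t is only 32 bits wide, so a long enough repair list can wrap the
counter, while C guarantees that unsigned long long is at least 64 bits:

#include <stdio.h>

int main(void)
{
	/* prints "4 8" on i386, "8 8" on x86_64 */
	printf("%zu %zu\n", sizeof(size_t), sizeof(unsigned long long));
	return 0;
}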

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 scrub/phase4.c |    6 +++---
 scrub/repair.c |    2 +-
 scrub/repair.h |    4 ++--
 3 files changed, 6 insertions(+), 6 deletions(-)


diff --git a/scrub/phase4.c b/scrub/phase4.c
index 559b2779..f97847d8 100644
--- a/scrub/phase4.c
+++ b/scrub/phase4.c
@@ -30,8 +30,8 @@ repair_ag(
 	struct scrub_ctx		*ctx = (struct scrub_ctx *)wq->wq_ctx;
 	bool				*aborted = priv;
 	struct action_list		*alist;
-	size_t				unfixed;
-	size_t				new_unfixed;
+	unsigned long long		unfixed;
+	unsigned long long		new_unfixed;
 	unsigned int			flags = 0;
 	int				ret;
 
@@ -168,7 +168,7 @@ phase4_estimate(
 	int			*rshift)
 {
 	xfs_agnumber_t		agno;
-	size_t			need_fixing = 0;
+	unsigned long long	need_fixing = 0;
 
 	for (agno = 0; agno < ctx->mnt.fsgeom.agcount; agno++)
 		need_fixing += action_list_length(&ctx->action_lists[agno]);
diff --git a/scrub/repair.c b/scrub/repair.c
index bb026101..67900ea4 100644
--- a/scrub/repair.c
+++ b/scrub/repair.c
@@ -189,7 +189,7 @@ action_list_init(
 }
 
 /* Number of repairs in this list. */
-size_t
+unsigned long long
 action_list_length(
 	struct action_list		*alist)
 {
diff --git a/scrub/repair.h b/scrub/repair.h
index 4261be49..102e5779 100644
--- a/scrub/repair.h
+++ b/scrub/repair.h
@@ -8,7 +8,7 @@
 
 struct action_list {
 	struct list_head	list;
-	size_t			nr;
+	unsigned long long	nr;
 	bool			sorted;
 };
 
@@ -22,7 +22,7 @@ static inline bool action_list_empty(const struct action_list *alist)
 	return list_empty(&alist->list);
 }
 
-size_t action_list_length(struct action_list *alist);
+unsigned long long action_list_length(struct action_list *alist);
 void action_list_add(struct action_list *dest, struct action_item *item);
 void action_list_splice(struct action_list *dest, struct action_list *src);
 



* [PATCH 2/4] xfs_scrub: prepare phase3 for per-inogrp worker threads
  2022-05-05 16:08 [PATCHSET 0/4] xfs_scrub: improve balancing of threads for inode scan Darrick J. Wong
  2022-05-05 16:08 ` [PATCH 1/4] xfs_scrub: widen action list length variables Darrick J. Wong
@ 2022-05-05 16:08 ` Darrick J. Wong
  2022-05-16 22:33   ` Eric Sandeen
  2022-05-05 16:08 ` [PATCH 3/4] xfs_scrub: balance inode chunk scan across CPUs Darrick J. Wong
  2022-05-05 16:08 ` [PATCH 4/4] xfs_scrub: don't revisit scanned inodes when reprocessing a stale inode Darrick J. Wong
  3 siblings, 1 reply; 9+ messages in thread
From: Darrick J. Wong @ 2022-05-05 16:08 UTC (permalink / raw)
  To: sandeen, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

In the next patch, we're going to rewrite scrub_scan_all_inodes to
schedule per-inogrp workqueue items that will run the iterator function.
In other words, the worker threads in phase 3 will soon cease to be
per-AG threads.

To prepare for this, we must modify phase 3 so that any writes to shared
state are protected by the appropriate per-AG locks.  As far as I can
tell, the only updates to shared state are the per-AG action lists, so
create some per-AG locks for phase 3 and create locked wrappers for the
action_list_* functions if we find things to repair.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 scrub/phase3.c |   44 +++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 39 insertions(+), 5 deletions(-)


diff --git a/scrub/phase3.c b/scrub/phase3.c
index d659a779..65e903f2 100644
--- a/scrub/phase3.c
+++ b/scrub/phase3.c
@@ -26,6 +26,9 @@ struct scrub_inode_ctx {
 	/* Number of inodes scanned. */
 	struct ptcounter	*icount;
 
+	/* per-AG locks to protect the repair lists */
+	pthread_mutex_t		*locks;
+
 	/* Set to true to abort all threads. */
 	bool			aborted;
 
@@ -48,6 +51,24 @@ report_close_error(
 	str_errno(ctx, descr);
 }
 
+/*
+ * Defer all the repairs until phase 4, being careful about locking since the
+ * inode scrub threads are not per-AG.
+ */
+static void
+defer_inode_repair(
+	struct scrub_inode_ctx	*ictx,
+	xfs_agnumber_t		agno,
+	struct action_list	*alist)
+{
+	if (alist->nr == 0)
+		return;
+
+	pthread_mutex_lock(&ictx->locks[agno]);
+	action_list_defer(ictx->ctx, agno, alist);
+	pthread_mutex_unlock(&ictx->locks[agno]);
+}
+
 /* Run repair actions now and defer unfinished items for later. */
 static int
 try_inode_repair(
@@ -71,7 +92,7 @@ try_inode_repair(
 	if (ret)
 		return ret;
 
-	action_list_defer(ictx->ctx, agno, alist);
+	defer_inode_repair(ictx, agno, alist);
 	return 0;
 }
 
@@ -184,7 +205,7 @@ scrub_inode(
 	progress_add(1);
 
 	if (!error && !ictx->aborted)
-		action_list_defer(ctx, agno, &alist);
+		defer_inode_repair(ictx, agno, &alist);
 
 	if (fd >= 0) {
 		int	err2;
@@ -217,12 +238,21 @@ phase3_func(
 		return err;
 	}
 
+	ictx.locks = calloc(ctx->mnt.fsgeom.agcount, sizeof(pthread_mutex_t));
+	if (!ictx.locks) {
+		str_errno(ctx, _("creating per-AG repair list locks"));
+		err = ENOMEM;
+		goto out_ptcounter;
+	}
+
 	/*
 	 * If we already have ag/fs metadata to repair from previous phases,
 	 * we would rather not try to repair file metadata until we've tried
 	 * to repair the space metadata.
 	 */
 	for (agno = 0; agno < ctx->mnt.fsgeom.agcount; agno++) {
+		pthread_mutex_init(&ictx.locks[agno], NULL);
+
 		if (!action_list_empty(&ctx->action_lists[agno]))
 			ictx.always_defer_repairs = true;
 	}
@@ -231,17 +261,21 @@ phase3_func(
 	if (!err && ictx.aborted)
 		err = ECANCELED;
 	if (err)
-		goto free;
+		goto out_locks;
 
 	scrub_report_preen_triggers(ctx);
 	err = ptcounter_value(ictx.icount, &val);
 	if (err) {
 		str_liberror(ctx, err, _("summing scanned inode counter"));
-		return err;
+		goto out_locks;
 	}
 
 	ctx->inodes_checked = val;
-free:
+out_locks:
+	for (agno = 0; agno < ctx->mnt.fsgeom.agcount; agno++)
+		pthread_mutex_destroy(&ictx.locks[agno]);
+	free(ictx.locks);
+out_ptcounter:
 	ptcounter_free(ictx.icount);
 	return err;
 }



* [PATCH 3/4] xfs_scrub: balance inode chunk scan across CPUs
  2022-05-05 16:08 [PATCHSET 0/4] xfs_scrub: improve balancing of threads for inode scan Darrick J. Wong
  2022-05-05 16:08 ` [PATCH 1/4] xfs_scrub: widen action list length variables Darrick J. Wong
  2022-05-05 16:08 ` [PATCH 2/4] xfs_scrub: prepare phase3 for per-inogrp worker threads Darrick J. Wong
@ 2022-05-05 16:08 ` Darrick J. Wong
  2022-05-17 14:11   ` Eric Sandeen
  2022-05-05 16:08 ` [PATCH 4/4] xfs_scrub: don't revisit scanned inodes when reprocessing a stale inode Darrick J. Wong
  3 siblings, 1 reply; 9+ messages in thread
From: Darrick J. Wong @ 2022-05-05 16:08 UTC (permalink / raw)
  To: sandeen, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Use the bounded workqueue functionality to spread the inode chunk scan
load across the CPUs more evenly.  First, we create per-AG workers that
walk each AG's inode btree and create an inode batch work item for each
inobt record.  These items are added to a (second) bounded workqueue
that calls BULKSTAT and invokes the caller's function on each bulkstat
record.

By splitting the work into batches of 64 inodes instead of one thread
per AG, we keep parallelism reasonably high almost all the way to the
end of the inode scan, even if the inodes are not evenly divided across
AGs or a few files have far more extent records than average.
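
For a sense of the queue bound this introduces (the numbers below are
assumed example values, not taken from the patch), each inobt record
describes 64 inodes and occupies 16 bytes on disk, so the bulkstat
queue depth works out to:

#include <stdio.h>

int main(void)
{
	unsigned int blocksize = 4096;	/* example fs block size */
	unsigned int nr_threads = 4;	/* example bulkstat worker count */

	/* roughly one inobt block's worth: blocksize / 16 records per thread */
	unsigned int max_bulkstat = nr_threads * (blocksize / 16);

	printf("at most %u queued inode chunks\n", max_bulkstat); /* 1024 */
	return 0;
}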

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 scrub/inodes.c |  336 +++++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 248 insertions(+), 88 deletions(-)


diff --git a/scrub/inodes.c b/scrub/inodes.c
index 80af8a74..41e5fdc7 100644
--- a/scrub/inodes.c
+++ b/scrub/inodes.c
@@ -16,6 +16,7 @@
 #include "xfs_scrub.h"
 #include "common.h"
 #include "inodes.h"
+#include "descr.h"
 #include "libfrog/fsgeom.h"
 #include "libfrog/bulkstat.h"
 
@@ -49,7 +50,7 @@
 static void
 bulkstat_for_inumbers(
 	struct scrub_ctx	*ctx,
-	const char		*descr,
+	struct descr		*dsc,
 	const struct xfs_inumbers *inumbers,
 	struct xfs_bulkstat_req	*breq)
 {
@@ -65,7 +66,7 @@ bulkstat_for_inumbers(
 	if (error) {
 		char	errbuf[DESCR_BUFSZ];
 
-		str_info(ctx, descr, "%s",
+		str_info(ctx, descr_render(dsc), "%s",
 			 strerror_r(error, errbuf, DESCR_BUFSZ));
 	}
 
@@ -95,61 +96,206 @@ bulkstat_for_inumbers(
 
 /* BULKSTAT wrapper routines. */
 struct scan_inodes {
+	struct workqueue	wq_bulkstat;
 	scrub_inode_iter_fn	fn;
 	void			*arg;
+	unsigned int		nr_threads;
 	bool			aborted;
 };
 
 /*
- * Call into the filesystem for inode/bulkstat information and call our
- * iterator function.  We'll try to fill the bulkstat information in batches,
- * but we also can detect iget failures.
+ * A single unit of inode scan work.  This contains a pointer to the parent
+ * information, followed by an INUMBERS request structure, followed by a
+ * BULKSTAT request structure.  The last two are VLAs, so we can't represent
+ * them here.
  */
-static void
-scan_ag_inodes(
-	struct workqueue	*wq,
-	xfs_agnumber_t		agno,
-	void			*arg)
+struct scan_ichunk {
+	struct scan_inodes	*si;
+};
+
+static inline struct xfs_inumbers_req *
+ichunk_to_inumbers(
+	struct scan_ichunk	*ichunk)
 {
-	struct xfs_handle	handle = { };
-	char			descr[DESCR_BUFSZ];
+	char			*p = (char *)ichunk;
+
+	return (struct xfs_inumbers_req *)(p + sizeof(struct scan_ichunk));
+}
+
+static inline struct xfs_bulkstat_req *
+ichunk_to_bulkstat(
+	struct scan_ichunk	*ichunk)
+{
+	char			*p = (char *)ichunk_to_inumbers(ichunk);
+
+	return (struct xfs_bulkstat_req *)(p + XFS_INUMBERS_REQ_SIZE(1));
+}
+
+static inline int
+alloc_ichunk(
+	struct scan_inodes	*si,
+	uint32_t		agno,
+	uint64_t		startino,
+	struct scan_ichunk	**ichunkp)
+{
+	struct scan_ichunk	*ichunk;
 	struct xfs_inumbers_req	*ireq;
 	struct xfs_bulkstat_req	*breq;
-	struct scan_inodes	*si = arg;
-	struct scrub_ctx	*ctx = (struct scrub_ctx *)wq->wq_ctx;
-	struct xfs_bulkstat	*bs;
-	struct xfs_inumbers	*inumbers;
-	uint64_t		nextino = cvt_agino_to_ino(&ctx->mnt, agno, 0);
-	int			i;
-	int			error;
-	int			stale_count = 0;
-
-	snprintf(descr, DESCR_BUFSZ, _("dev %d:%d AG %u inodes"),
+
+	ichunk = calloc(1, sizeof(struct scan_ichunk) +
+			   XFS_INUMBERS_REQ_SIZE(1) +
+			   XFS_BULKSTAT_REQ_SIZE(LIBFROG_BULKSTAT_CHUNKSIZE));
+	if (!ichunk)
+		return -errno;
+
+	ichunk->si = si;
+
+	ireq = ichunk_to_inumbers(ichunk);
+	ireq->hdr.icount = 1;
+	ireq->hdr.ino = startino;
+	ireq->hdr.agno = agno;
+	ireq->hdr.flags |= XFS_BULK_IREQ_AGNO;
+
+	breq = ichunk_to_bulkstat(ichunk);
+	breq->hdr.icount = LIBFROG_BULKSTAT_CHUNKSIZE;
+
+	*ichunkp = ichunk;
+	return 0;
+}
+
+int
+render_ino_from_bulkstat(
+	struct scrub_ctx	*ctx,
+	char			*buf,
+	size_t			buflen,
+	void			*data)
+{
+	struct xfs_bulkstat	*bstat = data;
+
+	return scrub_render_ino_descr(ctx, buf, buflen, bstat->bs_ino,
+			bstat->bs_gen, NULL);
+}
+
+static int
+render_inumbers_from_agno(
+	struct scrub_ctx	*ctx,
+	char			*buf,
+	size_t			buflen,
+	void			*data)
+{
+	xfs_agnumber_t		*agno = data;
+
+	return snprintf(buf, buflen, _("dev %d:%d AG %u inodes"),
 				major(ctx->fsinfo.fs_datadev),
 				minor(ctx->fsinfo.fs_datadev),
-				agno);
+				*agno);
+}
+
+/*
+ * Call BULKSTAT for information on a single chunk's worth of inodes and call
+ * our iterator function.  We'll try to fill the bulkstat information in
+ * batches, but we also can detect iget failures.
+ */
+static void
+scan_ag_bulkstat(
+	struct workqueue	*wq,
+	xfs_agnumber_t		agno,
+	void			*arg)
+{
+	struct xfs_handle	handle = { };
+	struct scrub_ctx	*ctx = (struct scrub_ctx *)wq->wq_ctx;
+	struct scan_ichunk	*ichunk = arg;
+	struct xfs_inumbers_req	*ireq = ichunk_to_inumbers(ichunk);
+	struct xfs_bulkstat_req	*breq = ichunk_to_bulkstat(ichunk);
+	struct scan_inodes	*si = ichunk->si;
+	struct xfs_bulkstat	*bs;
+	struct xfs_inumbers	*inumbers = &ireq->inumbers[0];
+	int			i;
+	int			error;
+	int			stale_count = 0;
+	DEFINE_DESCR(dsc_bulkstat, ctx, render_ino_from_bulkstat);
+	DEFINE_DESCR(dsc_inumbers, ctx, render_inumbers_from_agno);
+
+	descr_set(&dsc_inumbers, &agno);
 
 	memcpy(&handle.ha_fsid, ctx->fshandle, sizeof(handle.ha_fsid));
 	handle.ha_fid.fid_len = sizeof(xfs_fid_t) -
 			sizeof(handle.ha_fid.fid_len);
 	handle.ha_fid.fid_pad = 0;
 
-	error = -xfrog_bulkstat_alloc_req(LIBFROG_BULKSTAT_CHUNKSIZE, 0, &breq);
-	if (error) {
-		str_liberror(ctx, error, descr);
-		si->aborted = true;
-		return;
+retry:
+	bulkstat_for_inumbers(ctx, &dsc_inumbers, inumbers, breq);
+
+	/* Iterate all the inodes. */
+	bs = &breq->bulkstat[0];
+	for (i = 0; !si->aborted && i < inumbers->xi_alloccount; i++, bs++) {
+		descr_set(&dsc_bulkstat, bs);
+		handle.ha_fid.fid_ino = bs->bs_ino;
+		handle.ha_fid.fid_gen = bs->bs_gen;
+		error = si->fn(ctx, &handle, bs, si->arg);
+		switch (error) {
+		case 0:
+			break;
+		case ESTALE: {
+			stale_count++;
+			if (stale_count < 30) {
+				ireq->hdr.ino = inumbers->xi_startino;
+				error = -xfrog_inumbers(&ctx->mnt, ireq);
+				if (error)
+					goto err;
+				goto retry;
+			}
+			str_info(ctx, descr_render(&dsc_bulkstat),
+_("Changed too many times during scan; giving up."));
+			si->aborted = true;
+			goto out;
+		}
+		case ECANCELED:
+			error = 0;
+			fallthrough;
+		default:
+			goto err;
+		}
+		if (scrub_excessive_errors(ctx)) {
+			si->aborted = true;
+			goto out;
+		}
 	}
 
-	error = -xfrog_inumbers_alloc_req(1, 0, &ireq);
+err:
 	if (error) {
-		str_liberror(ctx, error, descr);
-		free(breq);
+		str_liberror(ctx, error, descr_render(&dsc_bulkstat));
 		si->aborted = true;
-		return;
 	}
-	inumbers = &ireq->inumbers[0];
-	xfrog_inumbers_set_ag(ireq, agno);
+out:
+	free(ichunk);
+}
+
+/*
+ * Call INUMBERS for information about inode chunks, then queue the inumbers
+ * responses in the bulkstat workqueue.  This helps us maximize CPU parallelism
+ * if the filesystem AGs are not evenly loaded.
+ */
+static void
+scan_ag_inumbers(
+	struct workqueue	*wq,
+	xfs_agnumber_t		agno,
+	void			*arg)
+{
+	struct scan_ichunk	*ichunk = NULL;
+	struct scan_inodes	*si = arg;
+	struct scrub_ctx	*ctx = (struct scrub_ctx *)wq->wq_ctx;
+	struct xfs_inumbers_req	*ireq;
+	uint64_t		nextino = cvt_agino_to_ino(&ctx->mnt, agno, 0);
+	int			error;
+	DEFINE_DESCR(dsc, ctx, render_inumbers_from_agno);
+
+	descr_set(&dsc, &agno);
+
+	error = alloc_ichunk(si, agno, 0, &ichunk);
+	if (error)
+		goto err;
+	ireq = ichunk_to_inumbers(ichunk);
 
 	/* Find the inode chunk & alloc mask */
 	error = -xfrog_inumbers(&ctx->mnt, ireq);
@@ -158,8 +304,8 @@ scan_ag_inodes(
 		 * Make sure that we always make forward progress while we
 		 * scan the inode btree.
 		 */
-		if (nextino > inumbers->xi_startino) {
-			str_corrupt(ctx, descr,
+		if (nextino > ireq->inumbers[0].xi_startino) {
+			str_corrupt(ctx, descr_render(&dsc),
 	_("AG %u inode btree is corrupt near agino %lu, got %lu"), agno,
 				cvt_ino_to_agino(&ctx->mnt, nextino),
 				cvt_ino_to_agino(&ctx->mnt,
@@ -169,64 +315,53 @@ scan_ag_inodes(
 		}
 		nextino = ireq->hdr.ino;
 
-		/*
-		 * We can have totally empty inode chunks on filesystems where
-		 * there are more than 64 inodes per block.  Skip these.
-		 */
-		if (inumbers->xi_alloccount == 0)
-			goto igrp_retry;
-
-		bulkstat_for_inumbers(ctx, descr, inumbers, breq);
-
-		/* Iterate all the inodes. */
-		for (i = 0, bs = breq->bulkstat;
-		     !si->aborted && i < inumbers->xi_alloccount;
-		     i++, bs++) {
-			handle.ha_fid.fid_ino = bs->bs_ino;
-			handle.ha_fid.fid_gen = bs->bs_gen;
-			error = si->fn(ctx, &handle, bs, si->arg);
-			switch (error) {
-			case 0:
-				break;
-			case ESTALE: {
-				char	idescr[DESCR_BUFSZ];
-
-				stale_count++;
-				if (stale_count < 30) {
-					ireq->hdr.ino = inumbers->xi_startino;
-					goto igrp_retry;
-				}
-				scrub_render_ino_descr(ctx, idescr, DESCR_BUFSZ,
-						bs->bs_ino, bs->bs_gen, NULL);
-				str_info(ctx, idescr,
-_("Changed too many times during scan; giving up."));
-				break;
-			}
-			case ECANCELED:
-				error = 0;
-				fallthrough;
-			default:
-				goto err;
-			}
-			if (scrub_excessive_errors(ctx)) {
+		if (ireq->inumbers[0].xi_alloccount == 0) {
+			/*
+			 * We can have totally empty inode chunks on
+			 * filesystems where there are more than 64 inodes per
+			 * block.  Skip these.
+			 */
+			;
+		} else if (si->nr_threads > 0) {
+			/* Queue this inode chunk on the bulkstat workqueue. */
+			error = -workqueue_add(&si->wq_bulkstat,
+					scan_ag_bulkstat, agno, ichunk);
+			if (error) {
 				si->aborted = true;
+				str_liberror(ctx, error,
+						_("queueing bulkstat work"));
 				goto out;
 			}
+			ichunk = NULL;
+		} else {
+			/*
+			 * Only one thread, call bulkstat directly.  Remember,
+			 * ichunk is freed by the worker before returning.
+			 */
+			scan_ag_bulkstat(wq, agno, ichunk);
+			ichunk = NULL;
+			if (si->aborted)
+				break;
 		}
 
-		stale_count = 0;
-igrp_retry:
+		if (!ichunk) {
+			error = alloc_ichunk(si, agno, nextino, &ichunk);
+			if (error)
+				goto err;
+		}
+		ireq = ichunk_to_inumbers(ichunk);
+
 		error = -xfrog_inumbers(&ctx->mnt, ireq);
 	}
 
 err:
 	if (error) {
-		str_liberror(ctx, error, descr);
+		str_liberror(ctx, error, descr_render(&dsc));
 		si->aborted = true;
 	}
 out:
-	free(ireq);
-	free(breq);
+	if (ichunk)
+		free(ichunk);
 }
 
 /*
@@ -242,33 +377,58 @@ scrub_scan_all_inodes(
 	struct scan_inodes	si = {
 		.fn		= fn,
 		.arg		= arg,
+		.nr_threads	= scrub_nproc_workqueue(ctx),
 	};
 	xfs_agnumber_t		agno;
-	struct workqueue	wq;
+	struct workqueue	wq_inumbers;
+	unsigned int		max_bulkstat;
 	int			ret;
 
-	ret = -workqueue_create(&wq, (struct xfs_mount *)ctx,
-			scrub_nproc_workqueue(ctx));
+	/*
+	 * The bulkstat workqueue should queue at most one inobt block's worth
+	 * of inode chunk records per worker thread.  If we're running in
+	 * single thread mode (nr_threads==0) then we skip the workqueues.
+	 */
+	max_bulkstat = si.nr_threads * (ctx->mnt.fsgeom.blocksize / 16);
+
+	ret = -workqueue_create_bound(&si.wq_bulkstat, (struct xfs_mount *)ctx,
+			si.nr_threads, max_bulkstat);
 	if (ret) {
 		str_liberror(ctx, ret, _("creating bulkstat workqueue"));
 		return -1;
 	}
 
+	ret = -workqueue_create(&wq_inumbers, (struct xfs_mount *)ctx,
+			si.nr_threads);
+	if (ret) {
+		str_liberror(ctx, ret, _("creating inumbers workqueue"));
+		si.aborted = true;
+		goto kill_bulkstat;
+	}
+
 	for (agno = 0; agno < ctx->mnt.fsgeom.agcount; agno++) {
-		ret = -workqueue_add(&wq, scan_ag_inodes, agno, &si);
+		ret = -workqueue_add(&wq_inumbers, scan_ag_inumbers, agno, &si);
 		if (ret) {
 			si.aborted = true;
-			str_liberror(ctx, ret, _("queueing bulkstat work"));
+			str_liberror(ctx, ret, _("queueing inumbers work"));
 			break;
 		}
 	}
 
-	ret = -workqueue_terminate(&wq);
+	ret = -workqueue_terminate(&wq_inumbers);
+	if (ret) {
+		si.aborted = true;
+		str_liberror(ctx, ret, _("finishing inumbers work"));
+	}
+	workqueue_destroy(&wq_inumbers);
+
+kill_bulkstat:
+	ret = -workqueue_terminate(&si.wq_bulkstat);
 	if (ret) {
 		si.aborted = true;
 		str_liberror(ctx, ret, _("finishing bulkstat work"));
 	}
-	workqueue_destroy(&wq);
+	workqueue_destroy(&si.wq_bulkstat);
 
 	return si.aborted ? -1 : 0;
 }



* [PATCH 4/4] xfs_scrub: don't revisit scanned inodes when reprocessing a stale inode
  2022-05-05 16:08 [PATCHSET 0/4] xfs_scrub: improve balancing of threads for inode scan Darrick J. Wong
                   ` (2 preceding siblings ...)
  2022-05-05 16:08 ` [PATCH 3/4] xfs_scrub: balance inode chunk scan across CPUs Darrick J. Wong
@ 2022-05-05 16:08 ` Darrick J. Wong
  2022-05-18  2:47   ` Eric Sandeen
  3 siblings, 1 reply; 9+ messages in thread
From: Darrick J. Wong @ 2022-05-05 16:08 UTC (permalink / raw)
  To: sandeen, djwong; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

If we decide to restart an inode chunk walk because one of the inodes is
stale, make sure that we don't walk an inode that's been scanned before.
This ensures we always make forward progress.

Found by observing backwards inode scan progress while running xfs/285.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 scrub/inodes.c |    9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)


diff --git a/scrub/inodes.c b/scrub/inodes.c
index 41e5fdc7..85bc1a06 100644
--- a/scrub/inodes.c
+++ b/scrub/inodes.c
@@ -210,6 +210,7 @@ scan_ag_bulkstat(
 	struct scan_inodes	*si = ichunk->si;
 	struct xfs_bulkstat	*bs;
 	struct xfs_inumbers	*inumbers = &ireq->inumbers[0];
+	uint64_t		last_ino = 0;
 	int			i;
 	int			error;
 	int			stale_count = 0;
@@ -229,8 +230,13 @@ scan_ag_bulkstat(
 	/* Iterate all the inodes. */
 	bs = &breq->bulkstat[0];
 	for (i = 0; !si->aborted && i < inumbers->xi_alloccount; i++, bs++) {
+		uint64_t	scan_ino = bs->bs_ino;
+
+		if (scan_ino < last_ino)
+			continue;
+
 		descr_set(&dsc_bulkstat, bs);
-		handle.ha_fid.fid_ino = bs->bs_ino;
+		handle.ha_fid.fid_ino = scan_ino;
 		handle.ha_fid.fid_gen = bs->bs_gen;
 		error = si->fn(ctx, &handle, bs, si->arg);
 		switch (error) {
@@ -260,6 +266,7 @@ _("Changed too many times during scan; giving up."));
 			si->aborted = true;
 			goto out;
 		}
+		last_ino = scan_ino;
 	}
 
 err:



* Re: [PATCH 1/4] xfs_scrub: widen action list length variables
  2022-05-05 16:08 ` [PATCH 1/4] xfs_scrub: widen action list length variables Darrick J. Wong
@ 2022-05-16 21:46   ` Eric Sandeen
  0 siblings, 0 replies; 9+ messages in thread
From: Eric Sandeen @ 2022-05-16 21:46 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On 5/5/22 11:08 AM, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> On a 32-bit system it's possible for there to be so many items in the
> repair list that we overflow a size_t.  Widen this to unsigned long
> long.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Reviewed-by: Eric Sandeen <sandeen@redhat.com>


* Re: [PATCH 2/4] xfs_scrub: prepare phase3 for per-inogrp worker threads
  2022-05-05 16:08 ` [PATCH 2/4] xfs_scrub: prepare phase3 for per-inogrp worker threads Darrick J. Wong
@ 2022-05-16 22:33   ` Eric Sandeen
  0 siblings, 0 replies; 9+ messages in thread
From: Eric Sandeen @ 2022-05-16 22:33 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On 5/5/22 11:08 AM, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> In the next patch, we're going to rewrite scrub_scan_all_inodes to
> schedule per-inogrp workqueue items that will run the iterator function.
> In other words, the worker threads in phase 3 will soon cease to be
> per-AG threads.
> 
> To prepare for this, we must modify phase 3 so that any writes to shared
> state are protected by the appropriate per-AG locks.  As far as I can
> tell, the only updates to shared state are the per-AG action lists, so
> create some per-AG locks for phase 3 and create locked wrappers for the
> action_list_* functions if we find things to repair.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Looks fine.

Reviewed-by: Eric Sandeen <sandeen@redhat.com>



* Re: [PATCH 3/4] xfs_scrub: balance inode chunk scan across CPUs
  2022-05-05 16:08 ` [PATCH 3/4] xfs_scrub: balance inode chunk scan across CPUs Darrick J. Wong
@ 2022-05-17 14:11   ` Eric Sandeen
  0 siblings, 0 replies; 9+ messages in thread
From: Eric Sandeen @ 2022-05-17 14:11 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On 5/5/22 11:08 AM, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Use the bounded workqueue functionality to spread the inode chunk scan
> load across the CPUs more evenly.  First, we create per-AG workers that
> walk each AG's inode btree and create an inode batch work item for each
> inobt record.  These items are added to a (second) bounded workqueue
> that calls BULKSTAT and invokes the caller's function on each bulkstat
> record.
> 
> By splitting the work into batches of 64 inodes instead of one thread
> per AG, we keep parallelism reasonably high almost all the way to the
> end of the inode scan, even if the inodes are not evenly divided across
> AGs or a few files have far more extent records than average.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Insofar as I at least read through it, applied it, built it, and
regression-tested it and didn't see anything wrong,

Reviewed-by: Eric Sandeen <sandeen@redhat.com>



* Re: [PATCH 4/4] xfs_scrub: don't revisit scanned inodes when reprocessing a stale inode
  2022-05-05 16:08 ` [PATCH 4/4] xfs_scrub: don't revisit scanned inodes when reprocessing a stale inode Darrick J. Wong
@ 2022-05-18  2:47   ` Eric Sandeen
  0 siblings, 0 replies; 9+ messages in thread
From: Eric Sandeen @ 2022-05-18  2:47 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On 5/5/22 11:08 AM, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> If we decide to restart an inode chunk walk because one of the inodes is
> stale, make sure that we don't walk an inode that's been scanned before.
> This ensures we always make forward progress.
> 
> Found by observing backwards inode scan progress while running xfs/285.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Would you mind if I nitpicked a:

/* ensure forward progress if we retried */

comment above the inode number test before I commit? 

Reviewed-by: Eric Sandeen <sandeen@redhat.com>

