* [PATCHSET 0/4] xfs_scrub: improve balancing of threads for inode scan
@ 2022-05-05 16:08 Darrick J. Wong
2022-05-05 16:08 ` [PATCH 1/4] xfs_scrub: widen action list length variables Darrick J. Wong
` (3 more replies)
0 siblings, 4 replies; 9+ messages in thread
From: Darrick J. Wong @ 2022-05-05 16:08 UTC (permalink / raw)
To: sandeen, djwong; +Cc: linux-xfs
Hi all,
Dave Chinner introduced a patch to the userspace workqueues that allows
for controlling the maximum queue depth as part of phase6 cleanups for
xfs_repair. This enables us to fix a problem in xfs_scrub wherein we
fail to parallelize inode scans to the maximum extent possible if a
filesystem's inode usage isn't evenly balanced across AGs.
Resolve this problem by using two workqueues for the inode scan -- one
that calls INUMBERS to find all the inobt records for the filesystem and
creates separate work items for each record; and a second workqueue to
turn the inobt records into BULKSTAT calls to do the actual scanning.
We use the queue depth control to avoid excessive queuing of inode
chunks. This creates more threads to manage, but it means that we avoid
the problem of one AG's inode scan continuing on long after the other
threads ran out of work.
If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.
This is an extraordinary way to destroy everything. Enjoy!
Comments and questions are, as always, welcome.
--D
xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-iscan-rebalance
---
scrub/inodes.c | 343 ++++++++++++++++++++++++++++++++++++++++++--------------
scrub/phase3.c | 44 ++++++-
scrub/phase4.c | 6 -
scrub/repair.c | 2
scrub/repair.h | 4 -
5 files changed, 300 insertions(+), 99 deletions(-)
^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH 1/4] xfs_scrub: widen action list length variables
2022-05-05 16:08 [PATCHSET 0/4] xfs_scrub: improve balancing of threads for inode scan Darrick J. Wong
@ 2022-05-05 16:08 ` Darrick J. Wong
2022-05-16 21:46 ` Eric Sandeen
2022-05-05 16:08 ` [PATCH 2/4] xfs_scrub: prepare phase3 for per-inogrp worker threads Darrick J. Wong
` (2 subsequent siblings)
3 siblings, 1 reply; 9+ messages in thread
From: Darrick J. Wong @ 2022-05-05 16:08 UTC (permalink / raw)
To: sandeen, djwong; +Cc: linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
On a 32-bit system it's possible for there to be so many items in the
repair list that we overflow a size_t. Widen this to unsigned long
long.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
scrub/phase4.c | 6 +++---
scrub/repair.c | 2 +-
scrub/repair.h | 4 ++--
3 files changed, 6 insertions(+), 6 deletions(-)
diff --git a/scrub/phase4.c b/scrub/phase4.c
index 559b2779..f97847d8 100644
--- a/scrub/phase4.c
+++ b/scrub/phase4.c
@@ -30,8 +30,8 @@ repair_ag(
struct scrub_ctx *ctx = (struct scrub_ctx *)wq->wq_ctx;
bool *aborted = priv;
struct action_list *alist;
- size_t unfixed;
- size_t new_unfixed;
+ unsigned long long unfixed;
+ unsigned long long new_unfixed;
unsigned int flags = 0;
int ret;
@@ -168,7 +168,7 @@ phase4_estimate(
int *rshift)
{
xfs_agnumber_t agno;
- size_t need_fixing = 0;
+ unsigned long long need_fixing = 0;
for (agno = 0; agno < ctx->mnt.fsgeom.agcount; agno++)
need_fixing += action_list_length(&ctx->action_lists[agno]);
diff --git a/scrub/repair.c b/scrub/repair.c
index bb026101..67900ea4 100644
--- a/scrub/repair.c
+++ b/scrub/repair.c
@@ -189,7 +189,7 @@ action_list_init(
}
/* Number of repairs in this list. */
-size_t
+unsigned long long
action_list_length(
struct action_list *alist)
{
diff --git a/scrub/repair.h b/scrub/repair.h
index 4261be49..102e5779 100644
--- a/scrub/repair.h
+++ b/scrub/repair.h
@@ -8,7 +8,7 @@
struct action_list {
struct list_head list;
- size_t nr;
+ unsigned long long nr;
bool sorted;
};
@@ -22,7 +22,7 @@ static inline bool action_list_empty(const struct action_list *alist)
return list_empty(&alist->list);
}
-size_t action_list_length(struct action_list *alist);
+unsigned long long action_list_length(struct action_list *alist);
void action_list_add(struct action_list *dest, struct action_item *item);
void action_list_splice(struct action_list *dest, struct action_list *src);
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [PATCH 2/4] xfs_scrub: prepare phase3 for per-inogrp worker threads
2022-05-05 16:08 [PATCHSET 0/4] xfs_scrub: improve balancing of threads for inode scan Darrick J. Wong
2022-05-05 16:08 ` [PATCH 1/4] xfs_scrub: widen action list length variables Darrick J. Wong
@ 2022-05-05 16:08 ` Darrick J. Wong
2022-05-16 22:33 ` Eric Sandeen
2022-05-05 16:08 ` [PATCH 3/4] xfs_scrub: balance inode chunk scan across CPUs Darrick J. Wong
2022-05-05 16:08 ` [PATCH 4/4] xfs_scrub: don't revisit scanned inodes when reprocessing a stale inode Darrick J. Wong
3 siblings, 1 reply; 9+ messages in thread
From: Darrick J. Wong @ 2022-05-05 16:08 UTC (permalink / raw)
To: sandeen, djwong; +Cc: linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
In the next patch, we're going to rewrite scrub_scan_all_inodes to
schedule per-inogrp workqueue items that will run the iterator function.
In other words, the worker threads in phase 3 wil soon cease to be
per-AG threads.
To prepare for this, we must modify phase 3 so that any writes to shared
state are protected by the appropriate per-AG locks. As far as I can
tell, the only updates to shared state are the per-AG action lists, so
create some per-AG locks for phase 3 and create locked wrappers for the
action_list_* functions if we find things to repair.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
scrub/phase3.c | 44 +++++++++++++++++++++++++++++++++++++++-----
1 file changed, 39 insertions(+), 5 deletions(-)
diff --git a/scrub/phase3.c b/scrub/phase3.c
index d659a779..65e903f2 100644
--- a/scrub/phase3.c
+++ b/scrub/phase3.c
@@ -26,6 +26,9 @@ struct scrub_inode_ctx {
/* Number of inodes scanned. */
struct ptcounter *icount;
+ /* per-AG locks to protect the repair lists */
+ pthread_mutex_t *locks;
+
/* Set to true to abort all threads. */
bool aborted;
@@ -48,6 +51,24 @@ report_close_error(
str_errno(ctx, descr);
}
+/*
+ * Defer all the repairs until phase 4, being careful about locking since the
+ * inode scrub threads are not per-AG.
+ */
+static void
+defer_inode_repair(
+ struct scrub_inode_ctx *ictx,
+ xfs_agnumber_t agno,
+ struct action_list *alist)
+{
+ if (alist->nr == 0)
+ return;
+
+ pthread_mutex_lock(&ictx->locks[agno]);
+ action_list_defer(ictx->ctx, agno, alist);
+ pthread_mutex_unlock(&ictx->locks[agno]);
+}
+
/* Run repair actions now and defer unfinished items for later. */
static int
try_inode_repair(
@@ -71,7 +92,7 @@ try_inode_repair(
if (ret)
return ret;
- action_list_defer(ictx->ctx, agno, alist);
+ defer_inode_repair(ictx, agno, alist);
return 0;
}
@@ -184,7 +205,7 @@ scrub_inode(
progress_add(1);
if (!error && !ictx->aborted)
- action_list_defer(ctx, agno, &alist);
+ defer_inode_repair(ictx, agno, &alist);
if (fd >= 0) {
int err2;
@@ -217,12 +238,21 @@ phase3_func(
return err;
}
+ ictx.locks = calloc(ctx->mnt.fsgeom.agcount, sizeof(pthread_mutex_t));
+ if (!ictx.locks) {
+ str_errno(ctx, _("creating per-AG repair list locks"));
+ err = ENOMEM;
+ goto out_ptcounter;
+ }
+
/*
* If we already have ag/fs metadata to repair from previous phases,
* we would rather not try to repair file metadata until we've tried
* to repair the space metadata.
*/
for (agno = 0; agno < ctx->mnt.fsgeom.agcount; agno++) {
+ pthread_mutex_init(&ictx.locks[agno], NULL);
+
if (!action_list_empty(&ctx->action_lists[agno]))
ictx.always_defer_repairs = true;
}
@@ -231,17 +261,21 @@ phase3_func(
if (!err && ictx.aborted)
err = ECANCELED;
if (err)
- goto free;
+ goto out_locks;
scrub_report_preen_triggers(ctx);
err = ptcounter_value(ictx.icount, &val);
if (err) {
str_liberror(ctx, err, _("summing scanned inode counter"));
- return err;
+ goto out_locks;
}
ctx->inodes_checked = val;
-free:
+out_locks:
+ for (agno = 0; agno < ctx->mnt.fsgeom.agcount; agno++)
+ pthread_mutex_destroy(&ictx.locks[agno]);
+ free(ictx.locks);
+out_ptcounter:
ptcounter_free(ictx.icount);
return err;
}
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [PATCH 3/4] xfs_scrub: balance inode chunk scan across CPUs
2022-05-05 16:08 [PATCHSET 0/4] xfs_scrub: improve balancing of threads for inode scan Darrick J. Wong
2022-05-05 16:08 ` [PATCH 1/4] xfs_scrub: widen action list length variables Darrick J. Wong
2022-05-05 16:08 ` [PATCH 2/4] xfs_scrub: prepare phase3 for per-inogrp worker threads Darrick J. Wong
@ 2022-05-05 16:08 ` Darrick J. Wong
2022-05-17 14:11 ` Eric Sandeen
2022-05-05 16:08 ` [PATCH 4/4] xfs_scrub: don't revisit scanned inodes when reprocessing a stale inode Darrick J. Wong
3 siblings, 1 reply; 9+ messages in thread
From: Darrick J. Wong @ 2022-05-05 16:08 UTC (permalink / raw)
To: sandeen, djwong; +Cc: linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
Use the bounded workqueue functionality to spread the inode chunk scan
load across the CPUs more evenly. First, we create per-AG workers to
walk each AG's inode btree to create inode batch work items for each
inobt record. These items are added to a (second) bounded workqueue
that invokes BULKSTAT and invokes the caller's function on each bulkstat
record.
By splitting the work items into batches of 64 inodes instead of one
thread per AG, we keep the level of parallelism at a reasonably high
level almost all the way to the end of the inode scan if the inodes are
not evenly divided across AGs or if a few files have far more extent
records than average.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
scrub/inodes.c | 336 +++++++++++++++++++++++++++++++++++++++++---------------
1 file changed, 248 insertions(+), 88 deletions(-)
diff --git a/scrub/inodes.c b/scrub/inodes.c
index 80af8a74..41e5fdc7 100644
--- a/scrub/inodes.c
+++ b/scrub/inodes.c
@@ -16,6 +16,7 @@
#include "xfs_scrub.h"
#include "common.h"
#include "inodes.h"
+#include "descr.h"
#include "libfrog/fsgeom.h"
#include "libfrog/bulkstat.h"
@@ -49,7 +50,7 @@
static void
bulkstat_for_inumbers(
struct scrub_ctx *ctx,
- const char *descr,
+ struct descr *dsc,
const struct xfs_inumbers *inumbers,
struct xfs_bulkstat_req *breq)
{
@@ -65,7 +66,7 @@ bulkstat_for_inumbers(
if (error) {
char errbuf[DESCR_BUFSZ];
- str_info(ctx, descr, "%s",
+ str_info(ctx, descr_render(dsc), "%s",
strerror_r(error, errbuf, DESCR_BUFSZ));
}
@@ -95,61 +96,206 @@ bulkstat_for_inumbers(
/* BULKSTAT wrapper routines. */
struct scan_inodes {
+ struct workqueue wq_bulkstat;
scrub_inode_iter_fn fn;
void *arg;
+ unsigned int nr_threads;
bool aborted;
};
/*
- * Call into the filesystem for inode/bulkstat information and call our
- * iterator function. We'll try to fill the bulkstat information in batches,
- * but we also can detect iget failures.
+ * A single unit of inode scan work. This contains a pointer to the parent
+ * information, followed by an INUMBERS request structure, followed by a
+ * BULKSTAT request structure. The last two are VLAs, so we can't represent
+ * them here.
*/
-static void
-scan_ag_inodes(
- struct workqueue *wq,
- xfs_agnumber_t agno,
- void *arg)
+struct scan_ichunk {
+ struct scan_inodes *si;
+};
+
+static inline struct xfs_inumbers_req *
+ichunk_to_inumbers(
+ struct scan_ichunk *ichunk)
{
- struct xfs_handle handle = { };
- char descr[DESCR_BUFSZ];
+ char *p = (char *)ichunk;
+
+ return (struct xfs_inumbers_req *)(p + sizeof(struct scan_ichunk));
+}
+
+static inline struct xfs_bulkstat_req *
+ichunk_to_bulkstat(
+ struct scan_ichunk *ichunk)
+{
+ char *p = (char *)ichunk_to_inumbers(ichunk);
+
+ return (struct xfs_bulkstat_req *)(p + XFS_INUMBERS_REQ_SIZE(1));
+}
+
+static inline int
+alloc_ichunk(
+ struct scan_inodes *si,
+ uint32_t agno,
+ uint64_t startino,
+ struct scan_ichunk **ichunkp)
+{
+ struct scan_ichunk *ichunk;
struct xfs_inumbers_req *ireq;
struct xfs_bulkstat_req *breq;
- struct scan_inodes *si = arg;
- struct scrub_ctx *ctx = (struct scrub_ctx *)wq->wq_ctx;
- struct xfs_bulkstat *bs;
- struct xfs_inumbers *inumbers;
- uint64_t nextino = cvt_agino_to_ino(&ctx->mnt, agno, 0);
- int i;
- int error;
- int stale_count = 0;
-
- snprintf(descr, DESCR_BUFSZ, _("dev %d:%d AG %u inodes"),
+
+ ichunk = calloc(1, sizeof(struct scan_ichunk) +
+ XFS_INUMBERS_REQ_SIZE(1) +
+ XFS_BULKSTAT_REQ_SIZE(LIBFROG_BULKSTAT_CHUNKSIZE));
+ if (!ichunk)
+ return -errno;
+
+ ichunk->si = si;
+
+ ireq = ichunk_to_inumbers(ichunk);
+ ireq->hdr.icount = 1;
+ ireq->hdr.ino = startino;
+ ireq->hdr.agno = agno;
+ ireq->hdr.flags |= XFS_BULK_IREQ_AGNO;
+
+ breq = ichunk_to_bulkstat(ichunk);
+ breq->hdr.icount = LIBFROG_BULKSTAT_CHUNKSIZE;
+
+ *ichunkp = ichunk;
+ return 0;
+}
+
+int
+render_ino_from_bulkstat(
+ struct scrub_ctx *ctx,
+ char *buf,
+ size_t buflen,
+ void *data)
+{
+ struct xfs_bulkstat *bstat = data;
+
+ return scrub_render_ino_descr(ctx, buf, buflen, bstat->bs_ino,
+ bstat->bs_gen, NULL);
+}
+
+static int
+render_inumbers_from_agno(
+ struct scrub_ctx *ctx,
+ char *buf,
+ size_t buflen,
+ void *data)
+{
+ xfs_agnumber_t *agno = data;
+
+ return snprintf(buf, buflen, _("dev %d:%d AG %u inodes"),
major(ctx->fsinfo.fs_datadev),
minor(ctx->fsinfo.fs_datadev),
- agno);
+ *agno);
+}
+
+/*
+ * Call BULKSTAT for information on a single chunk's worth of inodes and call
+ * our iterator function. We'll try to fill the bulkstat information in
+ * batches, but we also can detect iget failures.
+ */
+static void
+scan_ag_bulkstat(
+ struct workqueue *wq,
+ xfs_agnumber_t agno,
+ void *arg)
+{
+ struct xfs_handle handle = { };
+ struct scrub_ctx *ctx = (struct scrub_ctx *)wq->wq_ctx;
+ struct scan_ichunk *ichunk = arg;
+ struct xfs_inumbers_req *ireq = ichunk_to_inumbers(ichunk);
+ struct xfs_bulkstat_req *breq = ichunk_to_bulkstat(ichunk);
+ struct scan_inodes *si = ichunk->si;
+ struct xfs_bulkstat *bs;
+ struct xfs_inumbers *inumbers = &ireq->inumbers[0];
+ int i;
+ int error;
+ int stale_count = 0;
+ DEFINE_DESCR(dsc_bulkstat, ctx, render_ino_from_bulkstat);
+ DEFINE_DESCR(dsc_inumbers, ctx, render_inumbers_from_agno);
+
+ descr_set(&dsc_inumbers, &agno);
memcpy(&handle.ha_fsid, ctx->fshandle, sizeof(handle.ha_fsid));
handle.ha_fid.fid_len = sizeof(xfs_fid_t) -
sizeof(handle.ha_fid.fid_len);
handle.ha_fid.fid_pad = 0;
- error = -xfrog_bulkstat_alloc_req(LIBFROG_BULKSTAT_CHUNKSIZE, 0, &breq);
- if (error) {
- str_liberror(ctx, error, descr);
- si->aborted = true;
- return;
+retry:
+ bulkstat_for_inumbers(ctx, &dsc_inumbers, inumbers, breq);
+
+ /* Iterate all the inodes. */
+ bs = &breq->bulkstat[0];
+ for (i = 0; !si->aborted && i < inumbers->xi_alloccount; i++, bs++) {
+ descr_set(&dsc_bulkstat, bs);
+ handle.ha_fid.fid_ino = bs->bs_ino;
+ handle.ha_fid.fid_gen = bs->bs_gen;
+ error = si->fn(ctx, &handle, bs, si->arg);
+ switch (error) {
+ case 0:
+ break;
+ case ESTALE: {
+ stale_count++;
+ if (stale_count < 30) {
+ ireq->hdr.ino = inumbers->xi_startino;
+ error = -xfrog_inumbers(&ctx->mnt, ireq);
+ if (error)
+ goto err;
+ goto retry;
+ }
+ str_info(ctx, descr_render(&dsc_bulkstat),
+_("Changed too many times during scan; giving up."));
+ si->aborted = true;
+ goto out;
+ }
+ case ECANCELED:
+ error = 0;
+ fallthrough;
+ default:
+ goto err;
+ }
+ if (scrub_excessive_errors(ctx)) {
+ si->aborted = true;
+ goto out;
+ }
}
- error = -xfrog_inumbers_alloc_req(1, 0, &ireq);
+err:
if (error) {
- str_liberror(ctx, error, descr);
- free(breq);
+ str_liberror(ctx, error, descr_render(&dsc_bulkstat));
si->aborted = true;
- return;
}
- inumbers = &ireq->inumbers[0];
- xfrog_inumbers_set_ag(ireq, agno);
+out:
+ free(ichunk);
+}
+
+/*
+ * Call INUMBERS for information about inode chunks, then queue the inumbers
+ * responses in the bulkstat workqueue. This helps us maximize CPU parallelism
+ * if the filesystem AGs are not evenly loaded.
+ */
+static void
+scan_ag_inumbers(
+ struct workqueue *wq,
+ xfs_agnumber_t agno,
+ void *arg)
+{
+ struct scan_ichunk *ichunk = NULL;
+ struct scan_inodes *si = arg;
+ struct scrub_ctx *ctx = (struct scrub_ctx *)wq->wq_ctx;
+ struct xfs_inumbers_req *ireq;
+ uint64_t nextino = cvt_agino_to_ino(&ctx->mnt, agno, 0);
+ int error;
+ DEFINE_DESCR(dsc, ctx, render_inumbers_from_agno);
+
+ descr_set(&dsc, &agno);
+
+ error = alloc_ichunk(si, agno, 0, &ichunk);
+ if (error)
+ goto err;
+ ireq = ichunk_to_inumbers(ichunk);
/* Find the inode chunk & alloc mask */
error = -xfrog_inumbers(&ctx->mnt, ireq);
@@ -158,8 +304,8 @@ scan_ag_inodes(
* Make sure that we always make forward progress while we
* scan the inode btree.
*/
- if (nextino > inumbers->xi_startino) {
- str_corrupt(ctx, descr,
+ if (nextino > ireq->inumbers[0].xi_startino) {
+ str_corrupt(ctx, descr_render(&dsc),
_("AG %u inode btree is corrupt near agino %lu, got %lu"), agno,
cvt_ino_to_agino(&ctx->mnt, nextino),
cvt_ino_to_agino(&ctx->mnt,
@@ -169,64 +315,53 @@ scan_ag_inodes(
}
nextino = ireq->hdr.ino;
- /*
- * We can have totally empty inode chunks on filesystems where
- * there are more than 64 inodes per block. Skip these.
- */
- if (inumbers->xi_alloccount == 0)
- goto igrp_retry;
-
- bulkstat_for_inumbers(ctx, descr, inumbers, breq);
-
- /* Iterate all the inodes. */
- for (i = 0, bs = breq->bulkstat;
- !si->aborted && i < inumbers->xi_alloccount;
- i++, bs++) {
- handle.ha_fid.fid_ino = bs->bs_ino;
- handle.ha_fid.fid_gen = bs->bs_gen;
- error = si->fn(ctx, &handle, bs, si->arg);
- switch (error) {
- case 0:
- break;
- case ESTALE: {
- char idescr[DESCR_BUFSZ];
-
- stale_count++;
- if (stale_count < 30) {
- ireq->hdr.ino = inumbers->xi_startino;
- goto igrp_retry;
- }
- scrub_render_ino_descr(ctx, idescr, DESCR_BUFSZ,
- bs->bs_ino, bs->bs_gen, NULL);
- str_info(ctx, idescr,
-_("Changed too many times during scan; giving up."));
- break;
- }
- case ECANCELED:
- error = 0;
- fallthrough;
- default:
- goto err;
- }
- if (scrub_excessive_errors(ctx)) {
+ if (ireq->inumbers[0].xi_alloccount == 0) {
+ /*
+ * We can have totally empty inode chunks on
+ * filesystems where there are more than 64 inodes per
+ * block. Skip these.
+ */
+ ;
+ } else if (si->nr_threads > 0) {
+ /* Queue this inode chunk on the bulkstat workqueue. */
+ error = -workqueue_add(&si->wq_bulkstat,
+ scan_ag_bulkstat, agno, ichunk);
+ if (error) {
si->aborted = true;
+ str_liberror(ctx, error,
+ _("queueing bulkstat work"));
goto out;
}
+ ichunk = NULL;
+ } else {
+ /*
+ * Only one thread, call bulkstat directly. Remember,
+ * ichunk is freed by the worker before returning.
+ */
+ scan_ag_bulkstat(wq, agno, ichunk);
+ ichunk = NULL;
+ if (si->aborted)
+ break;
}
- stale_count = 0;
-igrp_retry:
+ if (!ichunk) {
+ error = alloc_ichunk(si, agno, nextino, &ichunk);
+ if (error)
+ goto err;
+ }
+ ireq = ichunk_to_inumbers(ichunk);
+
error = -xfrog_inumbers(&ctx->mnt, ireq);
}
err:
if (error) {
- str_liberror(ctx, error, descr);
+ str_liberror(ctx, error, descr_render(&dsc));
si->aborted = true;
}
out:
- free(ireq);
- free(breq);
+ if (ichunk)
+ free(ichunk);
}
/*
@@ -242,33 +377,58 @@ scrub_scan_all_inodes(
struct scan_inodes si = {
.fn = fn,
.arg = arg,
+ .nr_threads = scrub_nproc_workqueue(ctx),
};
xfs_agnumber_t agno;
- struct workqueue wq;
+ struct workqueue wq_inumbers;
+ unsigned int max_bulkstat;
int ret;
- ret = -workqueue_create(&wq, (struct xfs_mount *)ctx,
- scrub_nproc_workqueue(ctx));
+ /*
+ * The bulkstat workqueue should queue at most one inobt block's worth
+ * of inode chunk records per worker thread. If we're running in
+ * single thread mode (nr_threads==0) then we skip the workqueues.
+ */
+ max_bulkstat = si.nr_threads * (ctx->mnt.fsgeom.blocksize / 16);
+
+ ret = -workqueue_create_bound(&si.wq_bulkstat, (struct xfs_mount *)ctx,
+ si.nr_threads, max_bulkstat);
if (ret) {
str_liberror(ctx, ret, _("creating bulkstat workqueue"));
return -1;
}
+ ret = -workqueue_create(&wq_inumbers, (struct xfs_mount *)ctx,
+ si.nr_threads);
+ if (ret) {
+ str_liberror(ctx, ret, _("creating inumbers workqueue"));
+ si.aborted = true;
+ goto kill_bulkstat;
+ }
+
for (agno = 0; agno < ctx->mnt.fsgeom.agcount; agno++) {
- ret = -workqueue_add(&wq, scan_ag_inodes, agno, &si);
+ ret = -workqueue_add(&wq_inumbers, scan_ag_inumbers, agno, &si);
if (ret) {
si.aborted = true;
- str_liberror(ctx, ret, _("queueing bulkstat work"));
+ str_liberror(ctx, ret, _("queueing inumbers work"));
break;
}
}
- ret = -workqueue_terminate(&wq);
+ ret = -workqueue_terminate(&wq_inumbers);
+ if (ret) {
+ si.aborted = true;
+ str_liberror(ctx, ret, _("finishing inumbers work"));
+ }
+ workqueue_destroy(&wq_inumbers);
+
+kill_bulkstat:
+ ret = -workqueue_terminate(&si.wq_bulkstat);
if (ret) {
si.aborted = true;
str_liberror(ctx, ret, _("finishing bulkstat work"));
}
- workqueue_destroy(&wq);
+ workqueue_destroy(&si.wq_bulkstat);
return si.aborted ? -1 : 0;
}
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [PATCH 4/4] xfs_scrub: don't revisit scanned inodes when reprocessing a stale inode
2022-05-05 16:08 [PATCHSET 0/4] xfs_scrub: improve balancing of threads for inode scan Darrick J. Wong
` (2 preceding siblings ...)
2022-05-05 16:08 ` [PATCH 3/4] xfs_scrub: balance inode chunk scan across CPUs Darrick J. Wong
@ 2022-05-05 16:08 ` Darrick J. Wong
2022-05-18 2:47 ` Eric Sandeen
3 siblings, 1 reply; 9+ messages in thread
From: Darrick J. Wong @ 2022-05-05 16:08 UTC (permalink / raw)
To: sandeen, djwong; +Cc: linux-xfs
From: Darrick J. Wong <djwong@kernel.org>
If we decide to restart an inode chunk walk because one of the inodes is
stale, make sure that we don't walk an inode that's been scanned before.
This ensure we always make forward progress.
Found by observing backwards inode scan progress while running xfs/285.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
scrub/inodes.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/scrub/inodes.c b/scrub/inodes.c
index 41e5fdc7..85bc1a06 100644
--- a/scrub/inodes.c
+++ b/scrub/inodes.c
@@ -210,6 +210,7 @@ scan_ag_bulkstat(
struct scan_inodes *si = ichunk->si;
struct xfs_bulkstat *bs;
struct xfs_inumbers *inumbers = &ireq->inumbers[0];
+ uint64_t last_ino = 0;
int i;
int error;
int stale_count = 0;
@@ -229,8 +230,13 @@ scan_ag_bulkstat(
/* Iterate all the inodes. */
bs = &breq->bulkstat[0];
for (i = 0; !si->aborted && i < inumbers->xi_alloccount; i++, bs++) {
+ uint64_t scan_ino = bs->bs_ino;
+
+ if (scan_ino < last_ino)
+ continue;
+
descr_set(&dsc_bulkstat, bs);
- handle.ha_fid.fid_ino = bs->bs_ino;
+ handle.ha_fid.fid_ino = scan_ino;
handle.ha_fid.fid_gen = bs->bs_gen;
error = si->fn(ctx, &handle, bs, si->arg);
switch (error) {
@@ -260,6 +266,7 @@ _("Changed too many times during scan; giving up."));
si->aborted = true;
goto out;
}
+ last_ino = scan_ino;
}
err:
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH 1/4] xfs_scrub: widen action list length variables
2022-05-05 16:08 ` [PATCH 1/4] xfs_scrub: widen action list length variables Darrick J. Wong
@ 2022-05-16 21:46 ` Eric Sandeen
0 siblings, 0 replies; 9+ messages in thread
From: Eric Sandeen @ 2022-05-16 21:46 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
On 5/5/22 11:08 AM, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> On a 32-bit system it's possible for there to be so many items in the
> repair list that we overflow a size_t. Widen this to unsigned long
> long.
>
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 2/4] xfs_scrub: prepare phase3 for per-inogrp worker threads
2022-05-05 16:08 ` [PATCH 2/4] xfs_scrub: prepare phase3 for per-inogrp worker threads Darrick J. Wong
@ 2022-05-16 22:33 ` Eric Sandeen
0 siblings, 0 replies; 9+ messages in thread
From: Eric Sandeen @ 2022-05-16 22:33 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
On 5/5/22 11:08 AM, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> In the next patch, we're going to rewrite scrub_scan_all_inodes to
> schedule per-inogrp workqueue items that will run the iterator function.
> In other words, the worker threads in phase 3 wil soon cease to be
> per-AG threads.
>
> To prepare for this, we must modify phase 3 so that any writes to shared
> state are protected by the appropriate per-AG locks. As far as I can
> tell, the only updates to shared state are the per-AG action lists, so
> create some per-AG locks for phase 3 and create locked wrappers for the
> action_list_* functions if we find things to repair.
>
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Looks fine.
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 3/4] xfs_scrub: balance inode chunk scan across CPUs
2022-05-05 16:08 ` [PATCH 3/4] xfs_scrub: balance inode chunk scan across CPUs Darrick J. Wong
@ 2022-05-17 14:11 ` Eric Sandeen
0 siblings, 0 replies; 9+ messages in thread
From: Eric Sandeen @ 2022-05-17 14:11 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
On 5/5/22 11:08 AM, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> Use the bounded workqueue functionality to spread the inode chunk scan
> load across the CPUs more evenly. First, we create per-AG workers to
> walk each AG's inode btree to create inode batch work items for each
> inobt record. These items are added to a (second) bounded workqueue
> that invokes BULKSTAT and invokes the caller's function on each bulkstat
> record.
>
> By splitting the work items into batches of 64 inodes instead of one
> thread per AG, we keep the level of parallelism at a reasonably high
> level almost all the way to the end of the inode scan if the inodes are
> not evenly divided across AGs or if a few files have far more extent
> records than average.
>
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Insofar as I at least read through it, applied it, built it, and
regression-tested it and didn't see anything wrong,
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH 4/4] xfs_scrub: don't revisit scanned inodes when reprocessing a stale inode
2022-05-05 16:08 ` [PATCH 4/4] xfs_scrub: don't revisit scanned inodes when reprocessing a stale inode Darrick J. Wong
@ 2022-05-18 2:47 ` Eric Sandeen
0 siblings, 0 replies; 9+ messages in thread
From: Eric Sandeen @ 2022-05-18 2:47 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: linux-xfs
On 5/5/22 11:08 AM, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> If we decide to restart an inode chunk walk because one of the inodes is
> stale, make sure that we don't walk an inode that's been scanned before.
> This ensure we always make forward progress.
>
> Found by observing backwards inode scan progress while running xfs/285.
>
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Would you mind if I nitpicked a:
/* ensure forward progress if we retried */
comment above the inode number test before I commit?
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2022-05-18 2:47 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-05 16:08 [PATCHSET 0/4] xfs_scrub: improve balancing of threads for inode scan Darrick J. Wong
2022-05-05 16:08 ` [PATCH 1/4] xfs_scrub: widen action list length variables Darrick J. Wong
2022-05-16 21:46 ` Eric Sandeen
2022-05-05 16:08 ` [PATCH 2/4] xfs_scrub: prepare phase3 for per-inogrp worker threads Darrick J. Wong
2022-05-16 22:33 ` Eric Sandeen
2022-05-05 16:08 ` [PATCH 3/4] xfs_scrub: balance inode chunk scan across CPUs Darrick J. Wong
2022-05-17 14:11 ` Eric Sandeen
2022-05-05 16:08 ` [PATCH 4/4] xfs_scrub: don't revisit scanned inodes when reprocessing a stale inode Darrick J. Wong
2022-05-18 2:47 ` Eric Sandeen
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.