All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 00/10] xfs: refactor and improve inode iteration
@ 2019-06-04 21:49 Darrick J. Wong
  2019-06-04 21:49 ` [PATCH 01/10] xfs: create simplified inode walk function Darrick J. Wong
                   ` (9 more replies)
  0 siblings, 10 replies; 46+ messages in thread
From: Darrick J. Wong @ 2019-06-04 21:49 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

Hi all,

This next series refactors all the inode walking code in XFS into a
single set of helper functions.  The goal is to separate the mechanics
of iterating a subset of inode in the filesystem from bulkstat.

Next we introduce a parallel inode walk feature to speed up quotacheck
on large filesystems.  Finally, we port the existing bulkstat and
inumbers ioctls in preparation for the next series.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=parallel-iwalk

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=parallel-iwalk

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 01/10] xfs: create simplified inode walk function
  2019-06-04 21:49 [PATCH v2 00/10] xfs: refactor and improve inode iteration Darrick J. Wong
@ 2019-06-04 21:49 ` Darrick J. Wong
  2019-06-10 13:58   ` Brian Foster
  2019-06-04 21:49 ` [PATCH 02/10] xfs: convert quotacheck to use the new iwalk functions Darrick J. Wong
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 46+ messages in thread
From: Darrick J. Wong @ 2019-06-04 21:49 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Create a new iterator function to simplify walking inodes in an XFS
filesystem.  This new iterator will replace the existing open-coded
walking that goes on in various places.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/Makefile                  |    1 
 fs/xfs/libxfs/xfs_ialloc_btree.c |   31 +++
 fs/xfs/libxfs/xfs_ialloc_btree.h |    3 
 fs/xfs/xfs_itable.c              |    5 
 fs/xfs/xfs_itable.h              |    8 +
 fs/xfs/xfs_iwalk.c               |  400 ++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_iwalk.h               |   18 ++
 fs/xfs/xfs_trace.h               |   40 ++++
 8 files changed, 502 insertions(+), 4 deletions(-)
 create mode 100644 fs/xfs/xfs_iwalk.c
 create mode 100644 fs/xfs/xfs_iwalk.h


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 91831975363b..74d30ef0dbce 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -80,6 +80,7 @@ xfs-y				+= xfs_aops.o \
 				   xfs_iops.o \
 				   xfs_inode.o \
 				   xfs_itable.o \
+				   xfs_iwalk.o \
 				   xfs_message.o \
 				   xfs_mount.o \
 				   xfs_mru_cache.o \
diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
index ac4b65da4c2b..cb7eac2f51c0 100644
--- a/fs/xfs/libxfs/xfs_ialloc_btree.c
+++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
@@ -564,6 +564,34 @@ xfs_inobt_max_size(
 					XFS_INODES_PER_CHUNK);
 }
 
+/* Read AGI and create inobt cursor. */
+int
+xfs_inobt_cur(
+	struct xfs_mount	*mp,
+	struct xfs_trans	*tp,
+	xfs_agnumber_t		agno,
+	struct xfs_btree_cur	**curpp,
+	struct xfs_buf		**agi_bpp)
+{
+	struct xfs_btree_cur	*cur;
+	int			error;
+
+	ASSERT(*agi_bpp == NULL);
+
+	error = xfs_ialloc_read_agi(mp, tp, agno, agi_bpp);
+	if (error)
+		return error;
+
+	cur = xfs_inobt_init_cursor(mp, tp, *agi_bpp, agno, XFS_BTNUM_INO);
+	if (!cur) {
+		xfs_trans_brelse(tp, *agi_bpp);
+		*agi_bpp = NULL;
+		return -ENOMEM;
+	}
+	*curpp = cur;
+	return 0;
+}
+
 static int
 xfs_inobt_count_blocks(
 	struct xfs_mount	*mp,
@@ -576,11 +604,10 @@ xfs_inobt_count_blocks(
 	struct xfs_btree_cur	*cur;
 	int			error;
 
-	error = xfs_ialloc_read_agi(mp, tp, agno, &agbp);
+	error = xfs_inobt_cur(mp, tp, agno, &cur, &agbp);
 	if (error)
 		return error;
 
-	cur = xfs_inobt_init_cursor(mp, tp, agbp, agno, btnum);
 	error = xfs_btree_count_blocks(cur, tree_blocks);
 	xfs_btree_del_cursor(cur, error);
 	xfs_trans_brelse(tp, agbp);
diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.h b/fs/xfs/libxfs/xfs_ialloc_btree.h
index ebdd0c6b8766..1bc44b4a2b6c 100644
--- a/fs/xfs/libxfs/xfs_ialloc_btree.h
+++ b/fs/xfs/libxfs/xfs_ialloc_btree.h
@@ -64,5 +64,8 @@ int xfs_finobt_calc_reserves(struct xfs_mount *mp, struct xfs_trans *tp,
 		xfs_agnumber_t agno, xfs_extlen_t *ask, xfs_extlen_t *used);
 extern xfs_extlen_t xfs_iallocbt_calc_size(struct xfs_mount *mp,
 		unsigned long long len);
+int xfs_inobt_cur(struct xfs_mount *mp, struct xfs_trans *tp,
+		xfs_agnumber_t agno, struct xfs_btree_cur **curpp,
+		struct xfs_buf **agi_bpp);
 
 #endif	/* __XFS_IALLOC_BTREE_H__ */
diff --git a/fs/xfs/xfs_itable.c b/fs/xfs/xfs_itable.c
index eef307cf90a7..3ca1c454afe6 100644
--- a/fs/xfs/xfs_itable.c
+++ b/fs/xfs/xfs_itable.c
@@ -19,6 +19,7 @@
 #include "xfs_trace.h"
 #include "xfs_icache.h"
 #include "xfs_health.h"
+#include "xfs_iwalk.h"
 
 /*
  * Return stat information for one inode.
@@ -161,7 +162,7 @@ xfs_bulkstat_one(
  * Loop over all clusters in a chunk for a given incore inode allocation btree
  * record.  Do a readahead if there are any allocated inodes in that cluster.
  */
-STATIC void
+void
 xfs_bulkstat_ichunk_ra(
 	struct xfs_mount		*mp,
 	xfs_agnumber_t			agno,
@@ -195,7 +196,7 @@ xfs_bulkstat_ichunk_ra(
  * are some left allocated, update the data for the pointed-to record as well as
  * return the count of grabbed inodes.
  */
-STATIC int
+int
 xfs_bulkstat_grab_ichunk(
 	struct xfs_btree_cur		*cur,	/* btree cursor */
 	xfs_agino_t			agino,	/* starting inode of chunk */
diff --git a/fs/xfs/xfs_itable.h b/fs/xfs/xfs_itable.h
index 8a822285b671..369e3f159d4e 100644
--- a/fs/xfs/xfs_itable.h
+++ b/fs/xfs/xfs_itable.h
@@ -84,4 +84,12 @@ xfs_inumbers(
 	void			__user *buffer, /* buffer with inode info */
 	inumbers_fmt_pf		formatter);
 
+/* Temporarily needed while we refactor functions. */
+struct xfs_btree_cur;
+struct xfs_inobt_rec_incore;
+void xfs_bulkstat_ichunk_ra(struct xfs_mount *mp, xfs_agnumber_t agno,
+		struct xfs_inobt_rec_incore *irec);
+int xfs_bulkstat_grab_ichunk(struct xfs_btree_cur *cur, xfs_agino_t agino,
+		int *icount, struct xfs_inobt_rec_incore *irec);
+
 #endif	/* __XFS_ITABLE_H__ */
diff --git a/fs/xfs/xfs_iwalk.c b/fs/xfs/xfs_iwalk.c
new file mode 100644
index 000000000000..3e6c06e69c75
--- /dev/null
+++ b/fs/xfs/xfs_iwalk.c
@@ -0,0 +1,400 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Copyright (C) 2019 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_inode.h"
+#include "xfs_btree.h"
+#include "xfs_ialloc.h"
+#include "xfs_ialloc_btree.h"
+#include "xfs_itable.h"
+#include "xfs_error.h"
+#include "xfs_trace.h"
+#include "xfs_icache.h"
+#include "xfs_health.h"
+#include "xfs_trans.h"
+#include "xfs_iwalk.h"
+
+/*
+ * Walking Inodes in the Filesystem
+ * ================================
+ *
+ * This iterator function walks a subset of filesystem inodes in increasing
+ * order from @startino until there are no more inodes.  For each allocated
+ * inode it finds, it calls a walk function with the relevant inode number and
+ * a pointer to caller-provided data.  The walk function can return the usual
+ * negative error code to stop the iteration; 0 to continue the iteration; or
+ * XFS_IWALK_ABORT to stop the iteration.  This return value is returned to the
+ * caller.
+ *
+ * Internally, we allow the walk function to do anything, which means that we
+ * cannot maintain the inobt cursor or our lock on the AGI buffer.  We
+ * therefore cache the inobt records in kernel memory and only call the walk
+ * function when our memory buffer is full.  @nr_recs is the number of records
+ * that we've cached, and @sz_recs is the size of our cache.
+ *
+ * It is the responsibility of the walk function to ensure it accesses
+ * allocated inodes, as the inobt records may be stale by the time they are
+ * acted upon.
+ */
+
+struct xfs_iwalk_ag {
+	struct xfs_mount		*mp;
+	struct xfs_trans		*tp;
+
+	/* Where do we start the traversal? */
+	xfs_ino_t			startino;
+
+	/* Array of inobt records we cache. */
+	struct xfs_inobt_rec_incore	*recs;
+
+	/* Number of entries allocated for the @recs array. */
+	unsigned int			sz_recs;
+
+	/* Number of entries in the @recs array that are in use. */
+	unsigned int			nr_recs;
+
+	/* Inode walk function and data pointer. */
+	xfs_iwalk_fn			iwalk_fn;
+	void				*data;
+};
+
+/* Allocate memory for a walk. */
+STATIC int
+xfs_iwalk_alloc(
+	struct xfs_iwalk_ag	*iwag)
+{
+	size_t			size;
+
+	ASSERT(iwag->recs == NULL);
+	iwag->nr_recs = 0;
+
+	/* Allocate a prefetch buffer for inobt records. */
+	size = iwag->sz_recs * sizeof(struct xfs_inobt_rec_incore);
+	iwag->recs = kmem_alloc(size, KM_MAYFAIL);
+	if (iwag->recs == NULL)
+		return -ENOMEM;
+
+	return 0;
+}
+
+/* Free memory we allocated for a walk. */
+STATIC void
+xfs_iwalk_free(
+	struct xfs_iwalk_ag	*iwag)
+{
+	kmem_free(iwag->recs);
+}
+
+/* For each inuse inode in each cached inobt record, call our function. */
+STATIC int
+xfs_iwalk_ag_recs(
+	struct xfs_iwalk_ag		*iwag)
+{
+	struct xfs_mount		*mp = iwag->mp;
+	struct xfs_trans		*tp = iwag->tp;
+	xfs_ino_t			ino;
+	unsigned int			i, j;
+	xfs_agnumber_t			agno;
+	int				error;
+
+	agno = XFS_INO_TO_AGNO(mp, iwag->startino);
+	for (i = 0; i < iwag->nr_recs; i++) {
+		struct xfs_inobt_rec_incore	*irec = &iwag->recs[i];
+
+		trace_xfs_iwalk_ag_rec(mp, agno, irec);
+
+		for (j = 0; j < XFS_INODES_PER_CHUNK; j++) {
+			/* Skip if this inode is free */
+			if (XFS_INOBT_MASK(j) & irec->ir_free)
+				continue;
+
+			/* Otherwise call our function. */
+			ino = XFS_AGINO_TO_INO(mp, agno, irec->ir_startino + j);
+			error = iwag->iwalk_fn(mp, tp, ino, iwag->data);
+			if (error)
+				return error;
+		}
+	}
+
+	return 0;
+}
+
+/* Delete cursor and let go of AGI. */
+static inline void
+xfs_iwalk_del_inobt(
+	struct xfs_trans	*tp,
+	struct xfs_btree_cur	**curpp,
+	struct xfs_buf		**agi_bpp,
+	int			error)
+{
+	if (*curpp) {
+		xfs_btree_del_cursor(*curpp, error);
+		*curpp = NULL;
+	}
+	if (*agi_bpp) {
+		xfs_trans_brelse(tp, *agi_bpp);
+		*agi_bpp = NULL;
+	}
+}
+
+/*
+ * Set ourselves up for walking inobt records starting from a given point in
+ * the filesystem.
+ *
+ * If caller passed in a nonzero start inode number, load the record from the
+ * inobt and make the record look like all the inodes before agino are free so
+ * that we skip them, and then move the cursor to the next inobt record.  This
+ * is how we support starting an iwalk in the middle of an inode chunk.
+ *
+ * If the caller passed in a start number of zero, move the cursor to the first
+ * inobt record.
+ *
+ * The caller is responsible for cleaning up the cursor and buffer pointer
+ * regardless of the error status.
+ */
+STATIC int
+xfs_iwalk_ag_start(
+	struct xfs_iwalk_ag	*iwag,
+	xfs_agnumber_t		agno,
+	xfs_agino_t		agino,
+	struct xfs_btree_cur	**curpp,
+	struct xfs_buf		**agi_bpp,
+	int			*has_more)
+{
+	struct xfs_mount	*mp = iwag->mp;
+	struct xfs_trans	*tp = iwag->tp;
+	int			icount;
+	int			error;
+
+	/* Set up a fresh cursor and empty the inobt cache. */
+	iwag->nr_recs = 0;
+	error = xfs_inobt_cur(mp, tp, agno, curpp, agi_bpp);
+	if (error)
+		return error;
+
+	/* Starting at the beginning of the AG?  That's easy! */
+	if (agino == 0)
+		return xfs_inobt_lookup(*curpp, 0, XFS_LOOKUP_GE, has_more);
+
+	/*
+	 * Otherwise, we have to grab the inobt record where we left off, stuff
+	 * the record into our cache, and then see if there are more records.
+	 * We require a lookup cache of at least two elements so that we don't
+	 * have to deal with tearing down the cursor to walk the records.
+	 */
+	error = xfs_bulkstat_grab_ichunk(*curpp, agino - 1, &icount,
+			&iwag->recs[iwag->nr_recs]);
+	if (error)
+		return error;
+	if (icount)
+		iwag->nr_recs++;
+
+	/*
+	 * set_prefetch is supposed to give us a large enough inobt record
+	 * cache that grab_ichunk can stage a partial first record and the loop
+	 * body can cache a record without having to check for cache space
+	 * until after it reads an inobt record.
+	 */
+	ASSERT(iwag->nr_recs < iwag->sz_recs);
+
+	return xfs_btree_increment(*curpp, 0, has_more);
+}
+
+typedef int (*xfs_iwalk_ag_recs_fn)(struct xfs_iwalk_ag *iwag);
+
+/*
+ * The inobt record cache is full, so preserve the inobt cursor state and
+ * run callbacks on the cached inobt records.  When we're done, restore the
+ * cursor state to wherever the cursor would have been had the cache not been
+ * full (and therefore we could've just incremented the cursor).
+ */
+STATIC int
+xfs_iwalk_run_callbacks(
+	struct xfs_iwalk_ag		*iwag,
+	xfs_iwalk_ag_recs_fn		walk_ag_recs_fn,
+	xfs_agnumber_t			agno,
+	struct xfs_btree_cur		**curpp,
+	struct xfs_buf			**agi_bpp,
+	int				*has_more)
+{
+	struct xfs_mount		*mp = iwag->mp;
+	struct xfs_trans		*tp = iwag->tp;
+	struct xfs_inobt_rec_incore	*irec;
+	xfs_agino_t			restart;
+	int				error;
+
+	ASSERT(iwag->nr_recs == iwag->sz_recs);
+
+	/* Delete cursor but remember the last record we cached... */
+	xfs_iwalk_del_inobt(tp, curpp, agi_bpp, 0);
+	irec = &iwag->recs[iwag->nr_recs - 1];
+	restart = irec->ir_startino + XFS_INODES_PER_CHUNK - 1;
+
+	error = walk_ag_recs_fn(iwag);
+	if (error)
+		return error;
+
+	/* ...empty the cache... */
+	iwag->nr_recs = 0;
+
+	/* ...and recreate the cursor just past where we left off. */
+	error = xfs_inobt_cur(mp, tp, agno, curpp, agi_bpp);
+	if (error)
+		return error;
+
+	return xfs_inobt_lookup(*curpp, restart, XFS_LOOKUP_GE, has_more);
+}
+
+/* Walk all inodes in a single AG, from @iwag->startino to the end of the AG. */
+STATIC int
+xfs_iwalk_ag(
+	struct xfs_iwalk_ag		*iwag)
+{
+	struct xfs_mount		*mp = iwag->mp;
+	struct xfs_trans		*tp = iwag->tp;
+	struct xfs_buf			*agi_bp = NULL;
+	struct xfs_btree_cur		*cur = NULL;
+	xfs_agnumber_t			agno;
+	xfs_agino_t			agino;
+	int				has_more;
+	int				error = 0;
+
+	/* Set up our cursor at the right place in the inode btree. */
+	agno = XFS_INO_TO_AGNO(mp, iwag->startino);
+	agino = XFS_INO_TO_AGINO(mp, iwag->startino);
+	error = xfs_iwalk_ag_start(iwag, agno, agino, &cur, &agi_bp, &has_more);
+
+	while (!error && has_more) {
+		struct xfs_inobt_rec_incore	*irec;
+
+		cond_resched();
+
+		/* Fetch the inobt record. */
+		irec = &iwag->recs[iwag->nr_recs];
+		error = xfs_inobt_get_rec(cur, irec, &has_more);
+		if (error || !has_more)
+			break;
+
+		/* No allocated inodes in this chunk; skip it. */
+		if (irec->ir_freecount == irec->ir_count) {
+			error = xfs_btree_increment(cur, 0, &has_more);
+			if (error)
+				break;
+			continue;
+		}
+
+		/*
+		 * Start readahead for this inode chunk in anticipation of
+		 * walking the inodes.
+		 */
+		xfs_bulkstat_ichunk_ra(mp, agno, irec);
+
+		/*
+		 * If there's space in the buffer for more records, increment
+		 * the btree cursor and grab more.
+		 */
+		if (++iwag->nr_recs < iwag->sz_recs) {
+			error = xfs_btree_increment(cur, 0, &has_more);
+			if (error || !has_more)
+				break;
+			continue;
+		}
+
+		/*
+		 * Otherwise, we need to save cursor state and run the callback
+		 * function on the cached records.  The run_callbacks function
+		 * is supposed to return a cursor pointing to the record where
+		 * we would be if we had been able to increment like above.
+		 */
+		error = xfs_iwalk_run_callbacks(iwag, xfs_iwalk_ag_recs, agno,
+				&cur, &agi_bp, &has_more);
+	}
+
+	xfs_iwalk_del_inobt(tp, &cur, &agi_bp, error);
+
+	/* Walk any records left behind in the cache. */
+	if (iwag->nr_recs == 0 || error)
+		return error;
+
+	return xfs_iwalk_ag_recs(iwag);
+}
+
+/*
+ * Given the number of inodes to prefetch, set the number of inobt records that
+ * we cache in memory, which controls the number of inodes we try to read
+ * ahead.
+ *
+ * If no max prefetch was given, default to 4096 bytes' worth of inobt records;
+ * this should be plenty of inodes to read ahead.  This number (256 inobt
+ * records) was chosen so that the cache is never more than a single memory
+ * page.
+ */
+static inline void
+xfs_iwalk_set_prefetch(
+	struct xfs_iwalk_ag	*iwag,
+	unsigned int		max_prefetch)
+{
+	if (max_prefetch)
+		iwag->sz_recs = round_up(max_prefetch, XFS_INODES_PER_CHUNK) /
+					XFS_INODES_PER_CHUNK;
+	else
+		iwag->sz_recs = 4096 / sizeof(struct xfs_inobt_rec_incore);
+
+	/*
+	 * Allocate enough space to prefetch at least two records so that we
+	 * can cache both the inobt record where the iwalk started and the next
+	 * record.  This simplifies the AG inode walk loop setup code.
+	 */
+	iwag->sz_recs = max_t(unsigned int, iwag->sz_recs, 2);
+}
+
+/*
+ * Walk all inodes in the filesystem starting from @startino.  The @iwalk_fn
+ * will be called for each allocated inode, being passed the inode's number and
+ * @data.  @max_prefetch controls how many inobt records' worth of inodes we
+ * try to readahead.
+ */
+int
+xfs_iwalk(
+	struct xfs_mount	*mp,
+	struct xfs_trans	*tp,
+	xfs_ino_t		startino,
+	xfs_iwalk_fn		iwalk_fn,
+	unsigned int		max_prefetch,
+	void			*data)
+{
+	struct xfs_iwalk_ag	iwag = {
+		.mp		= mp,
+		.tp		= tp,
+		.iwalk_fn	= iwalk_fn,
+		.data		= data,
+		.startino	= startino,
+	};
+	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, startino);
+	int			error;
+
+	ASSERT(agno < mp->m_sb.sb_agcount);
+
+	xfs_iwalk_set_prefetch(&iwag, max_prefetch);
+	error = xfs_iwalk_alloc(&iwag);
+	if (error)
+		return error;
+
+	for (; agno < mp->m_sb.sb_agcount; agno++) {
+		error = xfs_iwalk_ag(&iwag);
+		if (error)
+			break;
+		iwag.startino = XFS_AGINO_TO_INO(mp, agno + 1, 0);
+	}
+
+	xfs_iwalk_free(&iwag);
+	return error;
+}
diff --git a/fs/xfs/xfs_iwalk.h b/fs/xfs/xfs_iwalk.h
new file mode 100644
index 000000000000..45b1baabcd2d
--- /dev/null
+++ b/fs/xfs/xfs_iwalk.h
@@ -0,0 +1,18 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Copyright (C) 2019 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ */
+#ifndef __XFS_IWALK_H__
+#define __XFS_IWALK_H__
+
+/* Walk all inodes in the filesystem starting from @startino. */
+typedef int (*xfs_iwalk_fn)(struct xfs_mount *mp, struct xfs_trans *tp,
+			    xfs_ino_t ino, void *data);
+/* Return value (for xfs_iwalk_fn) that aborts the walk immediately. */
+#define XFS_IWALK_ABORT	(1)
+
+int xfs_iwalk(struct xfs_mount *mp, struct xfs_trans *tp, xfs_ino_t startino,
+		xfs_iwalk_fn iwalk_fn, unsigned int max_prefetch, void *data);
+
+#endif /* __XFS_IWALK_H__ */
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 2464ea351f83..f9bb1d50bc0e 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -3516,6 +3516,46 @@ DEFINE_EVENT(xfs_inode_corrupt_class, name,	\
 DEFINE_INODE_CORRUPT_EVENT(xfs_inode_mark_sick);
 DEFINE_INODE_CORRUPT_EVENT(xfs_inode_mark_healthy);
 
+TRACE_EVENT(xfs_iwalk_ag,
+	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
+		 xfs_agino_t startino),
+	TP_ARGS(mp, agno, startino),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_agnumber_t, agno)
+		__field(xfs_agino_t, startino)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->agno = agno;
+		__entry->startino = startino;
+	),
+	TP_printk("dev %d:%d agno %d startino %u",
+		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->agno,
+		  __entry->startino)
+)
+
+TRACE_EVENT(xfs_iwalk_ag_rec,
+	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
+		 struct xfs_inobt_rec_incore *irec),
+	TP_ARGS(mp, agno, irec),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_agnumber_t, agno)
+		__field(xfs_agino_t, startino)
+		__field(uint64_t, freemask)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->agno = agno;
+		__entry->startino = irec->ir_startino;
+		__entry->freemask = irec->ir_free;
+	),
+	TP_printk("dev %d:%d agno %d startino %u freemask 0x%llx",
+		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->agno,
+		  __entry->startino, __entry->freemask)
+)
+
 #endif /* _TRACE_XFS_H */
 
 #undef TRACE_INCLUDE_PATH

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 02/10] xfs: convert quotacheck to use the new iwalk functions
  2019-06-04 21:49 [PATCH v2 00/10] xfs: refactor and improve inode iteration Darrick J. Wong
  2019-06-04 21:49 ` [PATCH 01/10] xfs: create simplified inode walk function Darrick J. Wong
@ 2019-06-04 21:49 ` Darrick J. Wong
  2019-06-10 13:58   ` Brian Foster
  2019-06-04 21:49 ` [PATCH 03/10] xfs: bulkstat should copy lastip whenever userspace supplies one Darrick J. Wong
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 46+ messages in thread
From: Darrick J. Wong @ 2019-06-04 21:49 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs, Dave Chinner

From: Darrick J. Wong <darrick.wong@oracle.com>

Convert quotacheck to use the new iwalk iterator to dig through the
inodes.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_qm.c |   62 ++++++++++++++++++-------------------------------------
 1 file changed, 20 insertions(+), 42 deletions(-)


diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
index aa6b6db3db0e..a5b2260406a8 100644
--- a/fs/xfs/xfs_qm.c
+++ b/fs/xfs/xfs_qm.c
@@ -26,6 +26,7 @@
 #include "xfs_trace.h"
 #include "xfs_icache.h"
 #include "xfs_cksum.h"
+#include "xfs_iwalk.h"
 
 /*
  * The global quota manager. There is only one of these for the entire
@@ -1118,17 +1119,15 @@ xfs_qm_quotacheck_dqadjust(
 /* ARGSUSED */
 STATIC int
 xfs_qm_dqusage_adjust(
-	xfs_mount_t	*mp,		/* mount point for filesystem */
-	xfs_ino_t	ino,		/* inode number to get data for */
-	void		__user *buffer,	/* not used */
-	int		ubsize,		/* not used */
-	int		*ubused,	/* not used */
-	int		*res)		/* result code value */
+	struct xfs_mount	*mp,
+	struct xfs_trans	*tp,
+	xfs_ino_t		ino,
+	void			*data)
 {
-	xfs_inode_t	*ip;
-	xfs_qcnt_t	nblks;
-	xfs_filblks_t	rtblks = 0;	/* total rt blks */
-	int		error;
+	struct xfs_inode	*ip;
+	xfs_qcnt_t		nblks;
+	xfs_filblks_t		rtblks = 0;	/* total rt blks */
+	int			error;
 
 	ASSERT(XFS_IS_QUOTA_RUNNING(mp));
 
@@ -1136,20 +1135,18 @@ xfs_qm_dqusage_adjust(
 	 * rootino must have its resources accounted for, not so with the quota
 	 * inodes.
 	 */
-	if (xfs_is_quota_inode(&mp->m_sb, ino)) {
-		*res = BULKSTAT_RV_NOTHING;
-		return -EINVAL;
-	}
+	if (xfs_is_quota_inode(&mp->m_sb, ino))
+		return 0;
 
 	/*
 	 * We don't _need_ to take the ilock EXCL here because quotacheck runs
 	 * at mount time and therefore nobody will be racing chown/chproj.
 	 */
-	error = xfs_iget(mp, NULL, ino, XFS_IGET_DONTCACHE, 0, &ip);
-	if (error) {
-		*res = BULKSTAT_RV_NOTHING;
+	error = xfs_iget(mp, tp, ino, XFS_IGET_DONTCACHE, 0, &ip);
+	if (error == -EINVAL || error == -ENOENT)
+		return 0;
+	if (error)
 		return error;
-	}
 
 	ASSERT(ip->i_delayed_blks == 0);
 
@@ -1157,7 +1154,7 @@ xfs_qm_dqusage_adjust(
 		struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, XFS_DATA_FORK);
 
 		if (!(ifp->if_flags & XFS_IFEXTENTS)) {
-			error = xfs_iread_extents(NULL, ip, XFS_DATA_FORK);
+			error = xfs_iread_extents(tp, ip, XFS_DATA_FORK);
 			if (error)
 				goto error0;
 		}
@@ -1200,13 +1197,8 @@ xfs_qm_dqusage_adjust(
 			goto error0;
 	}
 
-	xfs_irele(ip);
-	*res = BULKSTAT_RV_DIDONE;
-	return 0;
-
 error0:
 	xfs_irele(ip);
-	*res = BULKSTAT_RV_GIVEUP;
 	return error;
 }
 
@@ -1270,18 +1262,13 @@ STATIC int
 xfs_qm_quotacheck(
 	xfs_mount_t	*mp)
 {
-	int			done, count, error, error2;
-	xfs_ino_t		lastino;
-	size_t			structsz;
+	int			error, error2;
 	uint			flags;
 	LIST_HEAD		(buffer_list);
 	struct xfs_inode	*uip = mp->m_quotainfo->qi_uquotaip;
 	struct xfs_inode	*gip = mp->m_quotainfo->qi_gquotaip;
 	struct xfs_inode	*pip = mp->m_quotainfo->qi_pquotaip;
 
-	count = INT_MAX;
-	structsz = 1;
-	lastino = 0;
 	flags = 0;
 
 	ASSERT(uip || gip || pip);
@@ -1318,18 +1305,9 @@ xfs_qm_quotacheck(
 		flags |= XFS_PQUOTA_CHKD;
 	}
 
-	do {
-		/*
-		 * Iterate thru all the inodes in the file system,
-		 * adjusting the corresponding dquot counters in core.
-		 */
-		error = xfs_bulkstat(mp, &lastino, &count,
-				     xfs_qm_dqusage_adjust,
-				     structsz, NULL, &done);
-		if (error)
-			break;
-
-	} while (!done);
+	error = xfs_iwalk(mp, NULL, 0, xfs_qm_dqusage_adjust, 0, NULL);
+	if (error)
+		goto error_return;
 
 	/*
 	 * We've made all the changes that we need to make incore.  Flush them

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 03/10] xfs: bulkstat should copy lastip whenever userspace supplies one
  2019-06-04 21:49 [PATCH v2 00/10] xfs: refactor and improve inode iteration Darrick J. Wong
  2019-06-04 21:49 ` [PATCH 01/10] xfs: create simplified inode walk function Darrick J. Wong
  2019-06-04 21:49 ` [PATCH 02/10] xfs: convert quotacheck to use the new iwalk functions Darrick J. Wong
@ 2019-06-04 21:49 ` Darrick J. Wong
  2019-06-10 13:59   ` Brian Foster
  2019-06-04 21:49 ` [PATCH 04/10] xfs: convert bulkstat to new iwalk infrastructure Darrick J. Wong
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 46+ messages in thread
From: Darrick J. Wong @ 2019-06-04 21:49 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

When userspace passes in a @lastip pointer we should copy the results
back, even if the @ocount pointer is NULL.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_ioctl.c   |   13 ++++++-------
 fs/xfs/xfs_ioctl32.c |   13 ++++++-------
 2 files changed, 12 insertions(+), 14 deletions(-)


diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index d7dfc13f30f5..5ffbdcff3dba 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -768,14 +768,13 @@ xfs_ioc_bulkstat(
 	if (error)
 		return error;
 
-	if (bulkreq.ocount != NULL) {
-		if (copy_to_user(bulkreq.lastip, &inlast,
-						sizeof(xfs_ino_t)))
-			return -EFAULT;
+	if (bulkreq.lastip != NULL &&
+	    copy_to_user(bulkreq.lastip, &inlast, sizeof(xfs_ino_t)))
+		return -EFAULT;
 
-		if (copy_to_user(bulkreq.ocount, &count, sizeof(count)))
-			return -EFAULT;
-	}
+	if (bulkreq.ocount != NULL &&
+	    copy_to_user(bulkreq.ocount, &count, sizeof(count)))
+		return -EFAULT;
 
 	return 0;
 }
diff --git a/fs/xfs/xfs_ioctl32.c b/fs/xfs/xfs_ioctl32.c
index 614fc6886d24..814ffe6fbab7 100644
--- a/fs/xfs/xfs_ioctl32.c
+++ b/fs/xfs/xfs_ioctl32.c
@@ -310,14 +310,13 @@ xfs_compat_ioc_bulkstat(
 	if (error)
 		return error;
 
-	if (bulkreq.ocount != NULL) {
-		if (copy_to_user(bulkreq.lastip, &inlast,
-						sizeof(xfs_ino_t)))
-			return -EFAULT;
+	if (bulkreq.lastip != NULL &&
+	    copy_to_user(bulkreq.lastip, &inlast, sizeof(xfs_ino_t)))
+		return -EFAULT;
 
-		if (copy_to_user(bulkreq.ocount, &count, sizeof(count)))
-			return -EFAULT;
-	}
+	if (bulkreq.ocount != NULL &&
+	    copy_to_user(bulkreq.ocount, &count, sizeof(count)))
+		return -EFAULT;
 
 	return 0;
 }

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 04/10] xfs: convert bulkstat to new iwalk infrastructure
  2019-06-04 21:49 [PATCH v2 00/10] xfs: refactor and improve inode iteration Darrick J. Wong
                   ` (2 preceding siblings ...)
  2019-06-04 21:49 ` [PATCH 03/10] xfs: bulkstat should copy lastip whenever userspace supplies one Darrick J. Wong
@ 2019-06-04 21:49 ` Darrick J. Wong
  2019-06-10 14:02   ` Brian Foster
  2019-06-04 21:49 ` [PATCH 05/10] xfs: move bulkstat ichunk helpers to iwalk code Darrick J. Wong
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 46+ messages in thread
From: Darrick J. Wong @ 2019-06-04 21:49 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Create a new ibulk structure incore to help us deal with bulk inode stat
state tracking and then convert the bulkstat code to use the new iwalk
iterator.  This disentangles inode walking from bulk stat control for
simpler code and enables us to isolate the formatter functions to the
ioctl handling code.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_ioctl.c   |   65 ++++++--
 fs/xfs/xfs_ioctl.h   |    5 +
 fs/xfs/xfs_ioctl32.c |   88 +++++------
 fs/xfs/xfs_itable.c  |  407 ++++++++++++++------------------------------------
 fs/xfs/xfs_itable.h  |   79 ++++------
 5 files changed, 245 insertions(+), 399 deletions(-)


diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 5ffbdcff3dba..43734901aeb9 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -721,16 +721,28 @@ xfs_ioc_space(
 	return error;
 }
 
+/* Return 0 on success or positive error */
+int
+xfs_bulkstat_one_fmt(
+	struct xfs_ibulk	*breq,
+	const struct xfs_bstat	*bstat)
+{
+	if (copy_to_user(breq->ubuffer, bstat, sizeof(*bstat)))
+		return -EFAULT;
+	return xfs_ibulk_advance(breq, sizeof(struct xfs_bstat));
+}
+
 STATIC int
 xfs_ioc_bulkstat(
 	xfs_mount_t		*mp,
 	unsigned int		cmd,
 	void			__user *arg)
 {
-	xfs_fsop_bulkreq_t	bulkreq;
-	int			count;	/* # of records returned */
-	xfs_ino_t		inlast;	/* last inode number */
-	int			done;
+	struct xfs_fsop_bulkreq	bulkreq;
+	struct xfs_ibulk	breq = {
+		.mp		= mp,
+	};
+	xfs_ino_t		lastino;
 	int			error;
 
 	/* done = 1 if there are more stats to get and if bulkstat */
@@ -745,35 +757,54 @@ xfs_ioc_bulkstat(
 	if (copy_from_user(&bulkreq, arg, sizeof(xfs_fsop_bulkreq_t)))
 		return -EFAULT;
 
-	if (copy_from_user(&inlast, bulkreq.lastip, sizeof(__s64)))
+	if (copy_from_user(&lastino, bulkreq.lastip, sizeof(__s64)))
 		return -EFAULT;
 
-	if ((count = bulkreq.icount) <= 0)
+	if (bulkreq.icount <= 0)
 		return -EINVAL;
 
 	if (bulkreq.ubuffer == NULL)
 		return -EINVAL;
 
-	if (cmd == XFS_IOC_FSINUMBERS)
-		error = xfs_inumbers(mp, &inlast, &count,
+	breq.ubuffer = bulkreq.ubuffer;
+	breq.icount = bulkreq.icount;
+
+	/*
+	 * FSBULKSTAT_SINGLE expects that *lastip contains the inode number
+	 * that we want to stat.  However, FSINUMBERS and FSBULKSTAT expect
+	 * that *lastip contains either zero or the number of the last inode to
+	 * be examined by the previous call and return results starting with
+	 * the next inode after that.  The new bulk request functions take the
+	 * inode to start with, so we have to adjust the lastino/startino
+	 * parameter to maintain correct function.
+	 */
+	if (cmd == XFS_IOC_FSINUMBERS) {
+		int	count = breq.icount;
+
+		breq.startino = lastino;
+		error = xfs_inumbers(mp, &breq.startino, &count,
 					bulkreq.ubuffer, xfs_inumbers_fmt);
-	else if (cmd == XFS_IOC_FSBULKSTAT_SINGLE)
-		error = xfs_bulkstat_one(mp, inlast, bulkreq.ubuffer,
-					sizeof(xfs_bstat_t), NULL, &done);
-	else	/* XFS_IOC_FSBULKSTAT */
-		error = xfs_bulkstat(mp, &inlast, &count, xfs_bulkstat_one,
-				     sizeof(xfs_bstat_t), bulkreq.ubuffer,
-				     &done);
+		breq.ocount = count;
+		lastino = breq.startino;
+	} else if (cmd == XFS_IOC_FSBULKSTAT_SINGLE) {
+		breq.startino = lastino;
+		error = xfs_bulkstat_one(&breq, xfs_bulkstat_one_fmt);
+		lastino = breq.startino;
+	} else {	/* XFS_IOC_FSBULKSTAT */
+		breq.startino = lastino + 1;
+		error = xfs_bulkstat(&breq, xfs_bulkstat_one_fmt);
+		lastino = breq.startino - 1;
+	}
 
 	if (error)
 		return error;
 
 	if (bulkreq.lastip != NULL &&
-	    copy_to_user(bulkreq.lastip, &inlast, sizeof(xfs_ino_t)))
+	    copy_to_user(bulkreq.lastip, &lastino, sizeof(xfs_ino_t)))
 		return -EFAULT;
 
 	if (bulkreq.ocount != NULL &&
-	    copy_to_user(bulkreq.ocount, &count, sizeof(count)))
+	    copy_to_user(bulkreq.ocount, &breq.ocount, sizeof(__s32)))
 		return -EFAULT;
 
 	return 0;
diff --git a/fs/xfs/xfs_ioctl.h b/fs/xfs/xfs_ioctl.h
index 4b17f67c888a..f32c8aadfeba 100644
--- a/fs/xfs/xfs_ioctl.h
+++ b/fs/xfs/xfs_ioctl.h
@@ -77,4 +77,9 @@ xfs_set_dmattrs(
 	uint			evmask,
 	uint16_t		state);
 
+struct xfs_ibulk;
+struct xfs_bstat;
+
+int xfs_bulkstat_one_fmt(struct xfs_ibulk *breq, const struct xfs_bstat *bstat);
+
 #endif
diff --git a/fs/xfs/xfs_ioctl32.c b/fs/xfs/xfs_ioctl32.c
index 814ffe6fbab7..add15819daf3 100644
--- a/fs/xfs/xfs_ioctl32.c
+++ b/fs/xfs/xfs_ioctl32.c
@@ -172,15 +172,10 @@ xfs_bstime_store_compat(
 /* Return 0 on success or positive error (to xfs_bulkstat()) */
 STATIC int
 xfs_bulkstat_one_fmt_compat(
-	void			__user *ubuffer,
-	int			ubsize,
-	int			*ubused,
-	const xfs_bstat_t	*buffer)
+	struct xfs_ibulk	*breq,
+	const struct xfs_bstat	*buffer)
 {
-	compat_xfs_bstat_t	__user *p32 = ubuffer;
-
-	if (ubsize < sizeof(*p32))
-		return -ENOMEM;
+	struct compat_xfs_bstat	__user *p32 = breq->ubuffer;
 
 	if (put_user(buffer->bs_ino,	  &p32->bs_ino)		||
 	    put_user(buffer->bs_mode,	  &p32->bs_mode)	||
@@ -205,23 +200,8 @@ xfs_bulkstat_one_fmt_compat(
 	    put_user(buffer->bs_dmstate,  &p32->bs_dmstate)	||
 	    put_user(buffer->bs_aextents, &p32->bs_aextents))
 		return -EFAULT;
-	if (ubused)
-		*ubused = sizeof(*p32);
-	return 0;
-}
 
-STATIC int
-xfs_bulkstat_one_compat(
-	xfs_mount_t	*mp,		/* mount point for filesystem */
-	xfs_ino_t	ino,		/* inode number to get data for */
-	void		__user *buffer,	/* buffer to place output in */
-	int		ubsize,		/* size of buffer */
-	int		*ubused,	/* bytes used by me */
-	int		*stat)		/* BULKSTAT_RV_... */
-{
-	return xfs_bulkstat_one_int(mp, ino, buffer, ubsize,
-				    xfs_bulkstat_one_fmt_compat,
-				    ubused, stat);
+	return xfs_ibulk_advance(breq, sizeof(struct compat_xfs_bstat));
 }
 
 /* copied from xfs_ioctl.c */
@@ -232,10 +212,11 @@ xfs_compat_ioc_bulkstat(
 	compat_xfs_fsop_bulkreq_t __user *p32)
 {
 	u32			addr;
-	xfs_fsop_bulkreq_t	bulkreq;
-	int			count;	/* # of records returned */
-	xfs_ino_t		inlast;	/* last inode number */
-	int			done;
+	struct xfs_fsop_bulkreq	bulkreq;
+	struct xfs_ibulk	breq = {
+		.mp		= mp,
+	};
+	xfs_ino_t		lastino;
 	int			error;
 
 	/*
@@ -245,8 +226,7 @@ xfs_compat_ioc_bulkstat(
 	 * functions and structure size are the correct ones to use ...
 	 */
 	inumbers_fmt_pf inumbers_func = xfs_inumbers_fmt_compat;
-	bulkstat_one_pf	bs_one_func = xfs_bulkstat_one_compat;
-	size_t bs_one_size = sizeof(struct compat_xfs_bstat);
+	bulkstat_one_fmt_pf	bs_one_func = xfs_bulkstat_one_fmt_compat;
 
 #ifdef CONFIG_X86_X32
 	if (in_x32_syscall()) {
@@ -259,8 +239,7 @@ xfs_compat_ioc_bulkstat(
 		 * x32 userspace expects.
 		 */
 		inumbers_func = xfs_inumbers_fmt;
-		bs_one_func = xfs_bulkstat_one;
-		bs_one_size = sizeof(struct xfs_bstat);
+		bs_one_func = xfs_bulkstat_one_fmt;
 	}
 #endif
 
@@ -284,38 +263,57 @@ xfs_compat_ioc_bulkstat(
 		return -EFAULT;
 	bulkreq.ocount = compat_ptr(addr);
 
-	if (copy_from_user(&inlast, bulkreq.lastip, sizeof(__s64)))
+	if (copy_from_user(&lastino, bulkreq.lastip, sizeof(__s64)))
 		return -EFAULT;
+	breq.startino = lastino + 1;
 
-	if ((count = bulkreq.icount) <= 0)
+	if (bulkreq.icount <= 0)
 		return -EINVAL;
 
 	if (bulkreq.ubuffer == NULL)
 		return -EINVAL;
 
+	breq.ubuffer = bulkreq.ubuffer;
+	breq.icount = bulkreq.icount;
+
+	/*
+	 * FSBULKSTAT_SINGLE expects that *lastip contains the inode number
+	 * that we want to stat.  However, FSINUMBERS and FSBULKSTAT expect
+	 * that *lastip contains either zero or the number of the last inode to
+	 * be examined by the previous call and return results starting with
+	 * the next inode after that.  The new bulk request functions take the
+	 * inode to start with, so we have to adjust the lastino/startino
+	 * parameter to maintain correct function.
+	 */
 	if (cmd == XFS_IOC_FSINUMBERS_32) {
-		error = xfs_inumbers(mp, &inlast, &count,
+		int	count = breq.icount;
+
+		breq.startino = lastino;
+		error = xfs_inumbers(mp, &breq.startino, &count,
 				bulkreq.ubuffer, inumbers_func);
+		breq.ocount = count;
+		lastino = breq.startino;
 	} else if (cmd == XFS_IOC_FSBULKSTAT_SINGLE_32) {
-		int res;
-
-		error = bs_one_func(mp, inlast, bulkreq.ubuffer,
-				bs_one_size, NULL, &res);
+		breq.startino = lastino;
+		error = xfs_bulkstat_one(&breq, bs_one_func);
+		lastino = breq.startino;
 	} else if (cmd == XFS_IOC_FSBULKSTAT_32) {
-		error = xfs_bulkstat(mp, &inlast, &count,
-			bs_one_func, bs_one_size,
-			bulkreq.ubuffer, &done);
-	} else
+		breq.startino = lastino + 1;
+		error = xfs_bulkstat(&breq, bs_one_func);
+		lastino = breq.startino - 1;
+	} else {
 		error = -EINVAL;
+	}
 	if (error)
 		return error;
 
+	lastino = breq.startino - 1;
 	if (bulkreq.lastip != NULL &&
-	    copy_to_user(bulkreq.lastip, &inlast, sizeof(xfs_ino_t)))
+	    copy_to_user(bulkreq.lastip, &lastino, sizeof(xfs_ino_t)))
 		return -EFAULT;
 
 	if (bulkreq.ocount != NULL &&
-	    copy_to_user(bulkreq.ocount, &count, sizeof(count)))
+	    copy_to_user(bulkreq.ocount, &breq.ocount, sizeof(__s32)))
 		return -EFAULT;
 
 	return 0;
diff --git a/fs/xfs/xfs_itable.c b/fs/xfs/xfs_itable.c
index 3ca1c454afe6..87c597ea1df7 100644
--- a/fs/xfs/xfs_itable.c
+++ b/fs/xfs/xfs_itable.c
@@ -22,37 +22,63 @@
 #include "xfs_iwalk.h"
 
 /*
- * Return stat information for one inode.
- * Return 0 if ok, else errno.
+ * Bulk Stat
+ * =========
+ *
+ * Use the inode walking functions to fill out struct xfs_bstat for every
+ * allocated inode, then pass the stat information to some externally provided
+ * iteration function.
  */
-int
+
+struct xfs_bstat_chunk {
+	bulkstat_one_fmt_pf	formatter;
+	struct xfs_ibulk	*breq;
+};
+
+/*
+ * Fill out the bulkstat info for a single inode and report it somewhere.
+ *
+ * bc->breq->lastino is effectively the inode cursor as we walk through the
+ * filesystem.  Therefore, we update it any time we need to move the cursor
+ * forward, regardless of whether or not we're sending any bstat information
+ * back to userspace.  If the inode is internal metadata or, has been freed
+ * out from under us, we just simply keep going.
+ *
+ * However, if any other type of error happens we want to stop right where we
+ * are so that userspace will call back with exact number of the bad inode and
+ * we can send back an error code.
+ *
+ * Note that if the formatter tells us there's no space left in the buffer we
+ * move the cursor forward and abort the walk.
+ */
+STATIC int
 xfs_bulkstat_one_int(
-	struct xfs_mount	*mp,		/* mount point for filesystem */
-	xfs_ino_t		ino,		/* inode to get data for */
-	void __user		*buffer,	/* buffer to place output in */
-	int			ubsize,		/* size of buffer */
-	bulkstat_one_fmt_pf	formatter,	/* formatter, copy to user */
-	int			*ubused,	/* bytes used by me */
-	int			*stat)		/* BULKSTAT_RV_... */
+	struct xfs_mount	*mp,
+	struct xfs_trans	*tp,
+	xfs_ino_t		ino,
+	void			*data)
 {
+	struct xfs_bstat_chunk	*bc = data;
 	struct xfs_icdinode	*dic;		/* dinode core info pointer */
 	struct xfs_inode	*ip;		/* incore inode pointer */
 	struct inode		*inode;
 	struct xfs_bstat	*buf;		/* return buffer */
 	int			error = 0;	/* error value */
 
-	*stat = BULKSTAT_RV_NOTHING;
-
-	if (!buffer || xfs_internal_inum(mp, ino))
+	if (xfs_internal_inum(mp, ino)) {
+		bc->breq->startino = ino + 1;
 		return -EINVAL;
+	}
 
 	buf = kmem_zalloc(sizeof(*buf), KM_SLEEP | KM_MAYFAIL);
 	if (!buf)
 		return -ENOMEM;
 
-	error = xfs_iget(mp, NULL, ino,
+	error = xfs_iget(mp, tp, ino,
 			 (XFS_IGET_DONTCACHE | XFS_IGET_UNTRUSTED),
 			 XFS_ILOCK_SHARED, &ip);
+	if (error == -ENOENT || error == -EINVAL)
+		bc->breq->startino = ino + 1;
 	if (error)
 		goto out_free;
 
@@ -119,43 +145,45 @@ xfs_bulkstat_one_int(
 	xfs_iunlock(ip, XFS_ILOCK_SHARED);
 	xfs_irele(ip);
 
-	error = formatter(buffer, ubsize, ubused, buf);
-	if (!error)
-		*stat = BULKSTAT_RV_DIDONE;
-
- out_free:
+	error = bc->formatter(bc->breq, buf);
+	switch (error) {
+	case XFS_IBULK_BUFFER_FULL:
+		error = XFS_IWALK_ABORT;
+		/* fall through */
+	case 0:
+		bc->breq->startino = ino + 1;
+		break;
+	}
+out_free:
 	kmem_free(buf);
 	return error;
 }
 
-/* Return 0 on success or positive error */
-STATIC int
-xfs_bulkstat_one_fmt(
-	void			__user *ubuffer,
-	int			ubsize,
-	int			*ubused,
-	const xfs_bstat_t	*buffer)
-{
-	if (ubsize < sizeof(*buffer))
-		return -ENOMEM;
-	if (copy_to_user(ubuffer, buffer, sizeof(*buffer)))
-		return -EFAULT;
-	if (ubused)
-		*ubused = sizeof(*buffer);
-	return 0;
-}
-
+/* Bulkstat a single inode. */
 int
 xfs_bulkstat_one(
-	xfs_mount_t	*mp,		/* mount point for filesystem */
-	xfs_ino_t	ino,		/* inode number to get data for */
-	void		__user *buffer,	/* buffer to place output in */
-	int		ubsize,		/* size of buffer */
-	int		*ubused,	/* bytes used by me */
-	int		*stat)		/* BULKSTAT_RV_... */
+	struct xfs_ibulk	*breq,
+	bulkstat_one_fmt_pf	formatter)
 {
-	return xfs_bulkstat_one_int(mp, ino, buffer, ubsize,
-				    xfs_bulkstat_one_fmt, ubused, stat);
+	struct xfs_bstat_chunk	bc = {
+		.formatter	= formatter,
+		.breq		= breq,
+	};
+	int			error;
+
+	breq->icount = 1;
+	breq->ocount = 0;
+
+	error = xfs_bulkstat_one_int(breq->mp, NULL, breq->startino, &bc);
+
+	/*
+	 * If we reported one inode to userspace then we abort because we hit
+	 * the end of the buffer.  Don't leak that back to userspace.
+	 */
+	if (error == XFS_IWALK_ABORT)
+		error = 0;
+
+	return error;
 }
 
 /*
@@ -251,256 +279,65 @@ xfs_bulkstat_grab_ichunk(
 
 #define XFS_BULKSTAT_UBLEFT(ubleft)	((ubleft) >= statstruct_size)
 
-struct xfs_bulkstat_agichunk {
-	char		__user **ac_ubuffer;/* pointer into user's buffer */
-	int		ac_ubleft;	/* bytes left in user's buffer */
-	int		ac_ubelem;	/* spaces used in user's buffer */
-};
-
-/*
- * Process inodes in chunk with a pointer to a formatter function
- * that will iget the inode and fill in the appropriate structure.
- */
 static int
-xfs_bulkstat_ag_ichunk(
-	struct xfs_mount		*mp,
-	xfs_agnumber_t			agno,
-	struct xfs_inobt_rec_incore	*irbp,
-	bulkstat_one_pf			formatter,
-	size_t				statstruct_size,
-	struct xfs_bulkstat_agichunk	*acp,
-	xfs_agino_t			*last_agino)
+xfs_bulkstat_iwalk(
+	struct xfs_mount	*mp,
+	struct xfs_trans	*tp,
+	xfs_ino_t		ino,
+	void			*data)
 {
-	char				__user **ubufp = acp->ac_ubuffer;
-	int				chunkidx;
-	int				error = 0;
-	xfs_agino_t			agino = irbp->ir_startino;
-
-	for (chunkidx = 0; chunkidx < XFS_INODES_PER_CHUNK;
-	     chunkidx++, agino++) {
-		int		fmterror;
-		int		ubused;
-
-		/* inode won't fit in buffer, we are done */
-		if (acp->ac_ubleft < statstruct_size)
-			break;
-
-		/* Skip if this inode is free */
-		if (XFS_INOBT_MASK(chunkidx) & irbp->ir_free)
-			continue;
-
-		/* Get the inode and fill in a single buffer */
-		ubused = statstruct_size;
-		error = formatter(mp, XFS_AGINO_TO_INO(mp, agno, agino),
-				  *ubufp, acp->ac_ubleft, &ubused, &fmterror);
-
-		if (fmterror == BULKSTAT_RV_GIVEUP ||
-		    (error && error != -ENOENT && error != -EINVAL)) {
-			acp->ac_ubleft = 0;
-			ASSERT(error);
-			break;
-		}
-
-		/* be careful not to leak error if at end of chunk */
-		if (fmterror == BULKSTAT_RV_NOTHING || error) {
-			error = 0;
-			continue;
-		}
-
-		*ubufp += ubused;
-		acp->ac_ubleft -= ubused;
-		acp->ac_ubelem++;
-	}
-
-	/*
-	 * Post-update *last_agino. At this point, agino will always point one
-	 * inode past the last inode we processed successfully. Hence we
-	 * substract that inode when setting the *last_agino cursor so that we
-	 * return the correct cookie to userspace. On the next bulkstat call,
-	 * the inode under the lastino cookie will be skipped as we have already
-	 * processed it here.
-	 */
-	*last_agino = agino - 1;
+	int			error;
 
+	error = xfs_bulkstat_one_int(mp, tp, ino, data);
+	/* bulkstat just skips over missing inodes */
+	if (error == -ENOENT || error == -EINVAL)
+		return 0;
 	return error;
 }
 
 /*
- * Return stat information in bulk (by-inode) for the filesystem.
+ * Check the incoming lastino parameter.
+ *
+ * We allow any inode value that could map to physical space inside the
+ * filesystem because if there are no inodes there, bulkstat moves on to the
+ * next chunk.  In other words, the magic agino value of zero takes us to the
+ * first chunk in the AG, and an agino value past the end of the AG takes us to
+ * the first chunk in the next AG.
+ *
+ * Therefore we can end early if the requested inode is beyond the end of the
+ * filesystem or doesn't map properly.
  */
-int					/* error status */
-xfs_bulkstat(
-	xfs_mount_t		*mp,	/* mount point for filesystem */
-	xfs_ino_t		*lastinop, /* last inode returned */
-	int			*ubcountp, /* size of buffer/count returned */
-	bulkstat_one_pf		formatter, /* func that'd fill a single buf */
-	size_t			statstruct_size, /* sizeof struct filling */
-	char			__user *ubuffer, /* buffer with inode stats */
-	int			*done)	/* 1 if there are more stats to get */
+static inline bool
+xfs_bulkstat_already_done(
+	struct xfs_mount	*mp,
+	xfs_ino_t		startino)
 {
-	xfs_buf_t		*agbp;	/* agi header buffer */
-	xfs_agino_t		agino;	/* inode # in allocation group */
-	xfs_agnumber_t		agno;	/* allocation group number */
-	xfs_btree_cur_t		*cur;	/* btree cursor for ialloc btree */
-	xfs_inobt_rec_incore_t	*irbuf;	/* start of irec buffer */
-	int			nirbuf;	/* size of irbuf */
-	int			ubcount; /* size of user's buffer */
-	struct xfs_bulkstat_agichunk ac;
-	int			error = 0;
-
-	/*
-	 * Get the last inode value, see if there's nothing to do.
-	 */
-	agno = XFS_INO_TO_AGNO(mp, *lastinop);
-	agino = XFS_INO_TO_AGINO(mp, *lastinop);
-	if (agno >= mp->m_sb.sb_agcount ||
-	    *lastinop != XFS_AGINO_TO_INO(mp, agno, agino)) {
-		*done = 1;
-		*ubcountp = 0;
-		return 0;
-	}
-
-	ubcount = *ubcountp; /* statstruct's */
-	ac.ac_ubuffer = &ubuffer;
-	ac.ac_ubleft = ubcount * statstruct_size; /* bytes */;
-	ac.ac_ubelem = 0;
-
-	*ubcountp = 0;
-	*done = 0;
-
-	irbuf = kmem_zalloc_large(PAGE_SIZE * 4, KM_SLEEP);
-	if (!irbuf)
-		return -ENOMEM;
-	nirbuf = (PAGE_SIZE * 4) / sizeof(*irbuf);
-
-	/*
-	 * Loop over the allocation groups, starting from the last
-	 * inode returned; 0 means start of the allocation group.
-	 */
-	while (agno < mp->m_sb.sb_agcount) {
-		struct xfs_inobt_rec_incore	*irbp = irbuf;
-		struct xfs_inobt_rec_incore	*irbufend = irbuf + nirbuf;
-		bool				end_of_ag = false;
-		int				icount = 0;
-		int				stat;
-
-		error = xfs_ialloc_read_agi(mp, NULL, agno, &agbp);
-		if (error)
-			break;
-		/*
-		 * Allocate and initialize a btree cursor for ialloc btree.
-		 */
-		cur = xfs_inobt_init_cursor(mp, NULL, agbp, agno,
-					    XFS_BTNUM_INO);
-		if (agino > 0) {
-			/*
-			 * In the middle of an allocation group, we need to get
-			 * the remainder of the chunk we're in.
-			 */
-			struct xfs_inobt_rec_incore	r;
-
-			error = xfs_bulkstat_grab_ichunk(cur, agino, &icount, &r);
-			if (error)
-				goto del_cursor;
-			if (icount) {
-				irbp->ir_startino = r.ir_startino;
-				irbp->ir_holemask = r.ir_holemask;
-				irbp->ir_count = r.ir_count;
-				irbp->ir_freecount = r.ir_freecount;
-				irbp->ir_free = r.ir_free;
-				irbp++;
-			}
-			/* Increment to the next record */
-			error = xfs_btree_increment(cur, 0, &stat);
-		} else {
-			/* Start of ag.  Lookup the first inode chunk */
-			error = xfs_inobt_lookup(cur, 0, XFS_LOOKUP_GE, &stat);
-		}
-		if (error || stat == 0) {
-			end_of_ag = true;
-			goto del_cursor;
-		}
+	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, startino);
+	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, startino);
 
-		/*
-		 * Loop through inode btree records in this ag,
-		 * until we run out of inodes or space in the buffer.
-		 */
-		while (irbp < irbufend && icount < ubcount) {
-			struct xfs_inobt_rec_incore	r;
-
-			error = xfs_inobt_get_rec(cur, &r, &stat);
-			if (error || stat == 0) {
-				end_of_ag = true;
-				goto del_cursor;
-			}
-
-			/*
-			 * If this chunk has any allocated inodes, save it.
-			 * Also start read-ahead now for this chunk.
-			 */
-			if (r.ir_freecount < r.ir_count) {
-				xfs_bulkstat_ichunk_ra(mp, agno, &r);
-				irbp->ir_startino = r.ir_startino;
-				irbp->ir_holemask = r.ir_holemask;
-				irbp->ir_count = r.ir_count;
-				irbp->ir_freecount = r.ir_freecount;
-				irbp->ir_free = r.ir_free;
-				irbp++;
-				icount += r.ir_count - r.ir_freecount;
-			}
-			error = xfs_btree_increment(cur, 0, &stat);
-			if (error || stat == 0) {
-				end_of_ag = true;
-				goto del_cursor;
-			}
-			cond_resched();
-		}
+	return agno >= mp->m_sb.sb_agcount ||
+	       startino != XFS_AGINO_TO_INO(mp, agno, agino);
+}
 
-		/*
-		 * Drop the btree buffers and the agi buffer as we can't hold any
-		 * of the locks these represent when calling iget. If there is a
-		 * pending error, then we are done.
-		 */
-del_cursor:
-		xfs_btree_del_cursor(cur, error);
-		xfs_buf_relse(agbp);
-		if (error)
-			break;
-		/*
-		 * Now format all the good inodes into the user's buffer. The
-		 * call to xfs_bulkstat_ag_ichunk() sets up the agino pointer
-		 * for the next loop iteration.
-		 */
-		irbufend = irbp;
-		for (irbp = irbuf;
-		     irbp < irbufend && ac.ac_ubleft >= statstruct_size;
-		     irbp++) {
-			error = xfs_bulkstat_ag_ichunk(mp, agno, irbp,
-					formatter, statstruct_size, &ac,
-					&agino);
-			if (error)
-				break;
+/* Return stat information in bulk (by-inode) for the filesystem. */
+int
+xfs_bulkstat(
+	struct xfs_ibulk	*breq,
+	bulkstat_one_fmt_pf	formatter)
+{
+	struct xfs_bstat_chunk	bc = {
+		.formatter	= formatter,
+		.breq		= breq,
+	};
+	int			error;
 
-			cond_resched();
-		}
+	breq->ocount = 0;
 
-		/*
-		 * If we've run out of space or had a formatting error, we
-		 * are now done
-		 */
-		if (ac.ac_ubleft < statstruct_size || error)
-			break;
+	if (xfs_bulkstat_already_done(breq->mp, breq->startino))
+		return 0;
 
-		if (end_of_ag) {
-			agno++;
-			agino = 0;
-		}
-	}
-	/*
-	 * Done, we're either out of filesystem or space to put the data.
-	 */
-	kmem_free(irbuf);
-	*ubcountp = ac.ac_ubelem;
+	error = xfs_iwalk(breq->mp, NULL, breq->startino, xfs_bulkstat_iwalk,
+			breq->icount, &bc);
 
 	/*
 	 * We found some inodes, so clear the error status and return them.
@@ -509,17 +346,9 @@ xfs_bulkstat(
 	 * triggered again and propagated to userspace as there will be no
 	 * formatted inodes in the buffer.
 	 */
-	if (ac.ac_ubelem)
+	if (breq->ocount > 0)
 		error = 0;
 
-	/*
-	 * If we ran out of filesystem, lastino will point off the end of
-	 * the filesystem so the next call will return immediately.
-	 */
-	*lastinop = XFS_AGINO_TO_INO(mp, agno, agino);
-	if (agno >= mp->m_sb.sb_agcount)
-		*done = 1;
-
 	return error;
 }
 
diff --git a/fs/xfs/xfs_itable.h b/fs/xfs/xfs_itable.h
index 369e3f159d4e..366d391eb11f 100644
--- a/fs/xfs/xfs_itable.h
+++ b/fs/xfs/xfs_itable.h
@@ -5,63 +5,46 @@
 #ifndef __XFS_ITABLE_H__
 #define	__XFS_ITABLE_H__
 
-/*
- * xfs_bulkstat() is used to fill in xfs_bstat structures as well as dm_stat
- * structures (by the dmi library). This is a pointer to a formatter function
- * that will iget the inode and fill in the appropriate structure.
- * see xfs_bulkstat_one() and xfs_dm_bulkstat_one() in dmapi_xfs.c
- */
-typedef int (*bulkstat_one_pf)(struct xfs_mount	*mp,
-			       xfs_ino_t	ino,
-			       void		__user *buffer,
-			       int		ubsize,
-			       int		*ubused,
-			       int		*stat);
+/* In-memory representation of a userspace request for batch inode data. */
+struct xfs_ibulk {
+	struct xfs_mount	*mp;
+	void __user		*ubuffer; /* user output buffer */
+	xfs_ino_t		startino; /* start with this inode */
+	unsigned int		icount;   /* number of elements in ubuffer */
+	unsigned int		ocount;   /* number of records returned */
+};
+
+/* Return value that means we want to abort the walk. */
+#define XFS_IBULK_ABORT		(XFS_IWALK_ABORT)
+
+/* Return value that means the formatting buffer is now full. */
+#define XFS_IBULK_BUFFER_FULL	(2)
 
 /*
- * Values for stat return value.
+ * Advance the user buffer pointer by one record of the given size.  If the
+ * buffer is now full, return the appropriate error code.
  */
-#define BULKSTAT_RV_NOTHING	0
-#define BULKSTAT_RV_DIDONE	1
-#define BULKSTAT_RV_GIVEUP	2
+static inline int
+xfs_ibulk_advance(
+	struct xfs_ibulk	*breq,
+	size_t			bytes)
+{
+	char __user		*b = breq->ubuffer;
+
+	breq->ubuffer = b + bytes;
+	breq->ocount++;
+	return breq->ocount == breq->icount ? XFS_IBULK_BUFFER_FULL : 0;
+}
 
 /*
  * Return stat information in bulk (by-inode) for the filesystem.
  */
-int					/* error status */
-xfs_bulkstat(
-	xfs_mount_t	*mp,		/* mount point for filesystem */
-	xfs_ino_t	*lastino,	/* last inode returned */
-	int		*count,		/* size of buffer/count returned */
-	bulkstat_one_pf formatter,	/* func that'd fill a single buf */
-	size_t		statstruct_size,/* sizeof struct that we're filling */
-	char		__user *ubuffer,/* buffer with inode stats */
-	int		*done);		/* 1 if there are more stats to get */
 
-typedef int (*bulkstat_one_fmt_pf)(  /* used size in bytes or negative error */
-	void			__user *ubuffer, /* buffer to write to */
-	int			ubsize,		 /* remaining user buffer sz */
-	int			*ubused,	 /* bytes used by formatter */
-	const xfs_bstat_t	*buffer);        /* buffer to read from */
+typedef int (*bulkstat_one_fmt_pf)(struct xfs_ibulk *breq,
+		const struct xfs_bstat *bstat);
 
-int
-xfs_bulkstat_one_int(
-	xfs_mount_t		*mp,
-	xfs_ino_t		ino,
-	void			__user *buffer,
-	int			ubsize,
-	bulkstat_one_fmt_pf	formatter,
-	int			*ubused,
-	int			*stat);
-
-int
-xfs_bulkstat_one(
-	xfs_mount_t		*mp,
-	xfs_ino_t		ino,
-	void			__user *buffer,
-	int			ubsize,
-	int			*ubused,
-	int			*stat);
+int xfs_bulkstat_one(struct xfs_ibulk *breq, bulkstat_one_fmt_pf formatter);
+int xfs_bulkstat(struct xfs_ibulk *breq, bulkstat_one_fmt_pf formatter);
 
 typedef int (*inumbers_fmt_pf)(
 	void			__user *ubuffer, /* buffer to write to */

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 05/10] xfs: move bulkstat ichunk helpers to iwalk code
  2019-06-04 21:49 [PATCH v2 00/10] xfs: refactor and improve inode iteration Darrick J. Wong
                   ` (3 preceding siblings ...)
  2019-06-04 21:49 ` [PATCH 04/10] xfs: convert bulkstat to new iwalk infrastructure Darrick J. Wong
@ 2019-06-04 21:49 ` Darrick J. Wong
  2019-06-10 14:02   ` Brian Foster
  2019-06-04 21:50 ` [PATCH 06/10] xfs: change xfs_iwalk_grab_ichunk to use startino, not lastino Darrick J. Wong
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 46+ messages in thread
From: Darrick J. Wong @ 2019-06-04 21:49 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Now that we've reworked the bulkstat code to use iwalk, we can move the
old bulkstat ichunk helpers to xfs_iwalk.c.  No functional changes here.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_itable.c |   93 --------------------------------------------------
 fs/xfs/xfs_itable.h |    8 ----
 fs/xfs/xfs_iwalk.c  |   95 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 93 insertions(+), 103 deletions(-)


diff --git a/fs/xfs/xfs_itable.c b/fs/xfs/xfs_itable.c
index 87c597ea1df7..06abe5c9c0ee 100644
--- a/fs/xfs/xfs_itable.c
+++ b/fs/xfs/xfs_itable.c
@@ -186,99 +186,6 @@ xfs_bulkstat_one(
 	return error;
 }
 
-/*
- * Loop over all clusters in a chunk for a given incore inode allocation btree
- * record.  Do a readahead if there are any allocated inodes in that cluster.
- */
-void
-xfs_bulkstat_ichunk_ra(
-	struct xfs_mount		*mp,
-	xfs_agnumber_t			agno,
-	struct xfs_inobt_rec_incore	*irec)
-{
-	struct xfs_ino_geometry		*igeo = M_IGEO(mp);
-	xfs_agblock_t			agbno;
-	struct blk_plug			plug;
-	int				i;	/* inode chunk index */
-
-	agbno = XFS_AGINO_TO_AGBNO(mp, irec->ir_startino);
-
-	blk_start_plug(&plug);
-	for (i = 0;
-	     i < XFS_INODES_PER_CHUNK;
-	     i += igeo->inodes_per_cluster,
-			agbno += igeo->blocks_per_cluster) {
-		if (xfs_inobt_maskn(i, igeo->inodes_per_cluster) &
-		    ~irec->ir_free) {
-			xfs_btree_reada_bufs(mp, agno, agbno,
-					igeo->blocks_per_cluster,
-					&xfs_inode_buf_ops);
-		}
-	}
-	blk_finish_plug(&plug);
-}
-
-/*
- * Lookup the inode chunk that the given inode lives in and then get the record
- * if we found the chunk.  If the inode was not the last in the chunk and there
- * are some left allocated, update the data for the pointed-to record as well as
- * return the count of grabbed inodes.
- */
-int
-xfs_bulkstat_grab_ichunk(
-	struct xfs_btree_cur		*cur,	/* btree cursor */
-	xfs_agino_t			agino,	/* starting inode of chunk */
-	int				*icount,/* return # of inodes grabbed */
-	struct xfs_inobt_rec_incore	*irec)	/* btree record */
-{
-	int				idx;	/* index into inode chunk */
-	int				stat;
-	int				error = 0;
-
-	/* Lookup the inode chunk that this inode lives in */
-	error = xfs_inobt_lookup(cur, agino, XFS_LOOKUP_LE, &stat);
-	if (error)
-		return error;
-	if (!stat) {
-		*icount = 0;
-		return error;
-	}
-
-	/* Get the record, should always work */
-	error = xfs_inobt_get_rec(cur, irec, &stat);
-	if (error)
-		return error;
-	XFS_WANT_CORRUPTED_RETURN(cur->bc_mp, stat == 1);
-
-	/* Check if the record contains the inode in request */
-	if (irec->ir_startino + XFS_INODES_PER_CHUNK <= agino) {
-		*icount = 0;
-		return 0;
-	}
-
-	idx = agino - irec->ir_startino + 1;
-	if (idx < XFS_INODES_PER_CHUNK &&
-	    (xfs_inobt_maskn(idx, XFS_INODES_PER_CHUNK - idx) & ~irec->ir_free)) {
-		int	i;
-
-		/* We got a right chunk with some left inodes allocated at it.
-		 * Grab the chunk record.  Mark all the uninteresting inodes
-		 * free -- because they're before our start point.
-		 */
-		for (i = 0; i < idx; i++) {
-			if (XFS_INOBT_MASK(i) & ~irec->ir_free)
-				irec->ir_freecount++;
-		}
-
-		irec->ir_free |= xfs_inobt_maskn(0, idx);
-		*icount = irec->ir_count - irec->ir_freecount;
-	}
-
-	return 0;
-}
-
-#define XFS_BULKSTAT_UBLEFT(ubleft)	((ubleft) >= statstruct_size)
-
 static int
 xfs_bulkstat_iwalk(
 	struct xfs_mount	*mp,
diff --git a/fs/xfs/xfs_itable.h b/fs/xfs/xfs_itable.h
index 366d391eb11f..a2562fe8d282 100644
--- a/fs/xfs/xfs_itable.h
+++ b/fs/xfs/xfs_itable.h
@@ -67,12 +67,4 @@ xfs_inumbers(
 	void			__user *buffer, /* buffer with inode info */
 	inumbers_fmt_pf		formatter);
 
-/* Temporarily needed while we refactor functions. */
-struct xfs_btree_cur;
-struct xfs_inobt_rec_incore;
-void xfs_bulkstat_ichunk_ra(struct xfs_mount *mp, xfs_agnumber_t agno,
-		struct xfs_inobt_rec_incore *irec);
-int xfs_bulkstat_grab_ichunk(struct xfs_btree_cur *cur, xfs_agino_t agino,
-		int *icount, struct xfs_inobt_rec_incore *irec);
-
 #endif	/* __XFS_ITABLE_H__ */
diff --git a/fs/xfs/xfs_iwalk.c b/fs/xfs/xfs_iwalk.c
index 3e6c06e69c75..bef0c4907781 100644
--- a/fs/xfs/xfs_iwalk.c
+++ b/fs/xfs/xfs_iwalk.c
@@ -66,6 +66,97 @@ struct xfs_iwalk_ag {
 	void				*data;
 };
 
+/*
+ * Loop over all clusters in a chunk for a given incore inode allocation btree
+ * record.  Do a readahead if there are any allocated inodes in that cluster.
+ */
+STATIC void
+xfs_iwalk_ichunk_ra(
+	struct xfs_mount		*mp,
+	xfs_agnumber_t			agno,
+	struct xfs_inobt_rec_incore	*irec)
+{
+	struct xfs_ino_geometry		*igeo = M_IGEO(mp);
+	xfs_agblock_t			agbno;
+	struct blk_plug			plug;
+	int				i;	/* inode chunk index */
+
+	agbno = XFS_AGINO_TO_AGBNO(mp, irec->ir_startino);
+
+	blk_start_plug(&plug);
+	for (i = 0;
+	     i < XFS_INODES_PER_CHUNK;
+	     i += igeo->inodes_per_cluster,
+			agbno += igeo->blocks_per_cluster) {
+		if (xfs_inobt_maskn(i, igeo->inodes_per_cluster) &
+		    ~irec->ir_free) {
+			xfs_btree_reada_bufs(mp, agno, agbno,
+					igeo->blocks_per_cluster,
+					&xfs_inode_buf_ops);
+		}
+	}
+	blk_finish_plug(&plug);
+}
+
+/*
+ * Lookup the inode chunk that the given inode lives in and then get the record
+ * if we found the chunk.  If the inode was not the last in the chunk and there
+ * are some left allocated, update the data for the pointed-to record as well as
+ * return the count of grabbed inodes.
+ */
+STATIC int
+xfs_iwalk_grab_ichunk(
+	struct xfs_btree_cur		*cur,	/* btree cursor */
+	xfs_agino_t			agino,	/* starting inode of chunk */
+	int				*icount,/* return # of inodes grabbed */
+	struct xfs_inobt_rec_incore	*irec)	/* btree record */
+{
+	int				idx;	/* index into inode chunk */
+	int				stat;
+	int				error = 0;
+
+	/* Lookup the inode chunk that this inode lives in */
+	error = xfs_inobt_lookup(cur, agino, XFS_LOOKUP_LE, &stat);
+	if (error)
+		return error;
+	if (!stat) {
+		*icount = 0;
+		return error;
+	}
+
+	/* Get the record, should always work */
+	error = xfs_inobt_get_rec(cur, irec, &stat);
+	if (error)
+		return error;
+	XFS_WANT_CORRUPTED_RETURN(cur->bc_mp, stat == 1);
+
+	/* Check if the record contains the inode in request */
+	if (irec->ir_startino + XFS_INODES_PER_CHUNK <= agino) {
+		*icount = 0;
+		return 0;
+	}
+
+	idx = agino - irec->ir_startino + 1;
+	if (idx < XFS_INODES_PER_CHUNK &&
+	    (xfs_inobt_maskn(idx, XFS_INODES_PER_CHUNK - idx) & ~irec->ir_free)) {
+		int	i;
+
+		/* We got a right chunk with some left inodes allocated at it.
+		 * Grab the chunk record.  Mark all the uninteresting inodes
+		 * free -- because they're before our start point.
+		 */
+		for (i = 0; i < idx; i++) {
+			if (XFS_INOBT_MASK(i) & ~irec->ir_free)
+				irec->ir_freecount++;
+		}
+
+		irec->ir_free |= xfs_inobt_maskn(0, idx);
+		*icount = irec->ir_count - irec->ir_freecount;
+	}
+
+	return 0;
+}
+
 /* Allocate memory for a walk. */
 STATIC int
 xfs_iwalk_alloc(
@@ -190,7 +281,7 @@ xfs_iwalk_ag_start(
 	 * We require a lookup cache of at least two elements so that we don't
 	 * have to deal with tearing down the cursor to walk the records.
 	 */
-	error = xfs_bulkstat_grab_ichunk(*curpp, agino - 1, &icount,
+	error = xfs_iwalk_grab_ichunk(*curpp, agino - 1, &icount,
 			&iwag->recs[iwag->nr_recs]);
 	if (error)
 		return error;
@@ -295,7 +386,7 @@ xfs_iwalk_ag(
 		 * Start readahead for this inode chunk in anticipation of
 		 * walking the inodes.
 		 */
-		xfs_bulkstat_ichunk_ra(mp, agno, irec);
+		xfs_iwalk_ichunk_ra(mp, agno, irec);
 
 		/*
 		 * If there's space in the buffer for more records, increment

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 06/10] xfs: change xfs_iwalk_grab_ichunk to use startino, not lastino
  2019-06-04 21:49 [PATCH v2 00/10] xfs: refactor and improve inode iteration Darrick J. Wong
                   ` (4 preceding siblings ...)
  2019-06-04 21:49 ` [PATCH 05/10] xfs: move bulkstat ichunk helpers to iwalk code Darrick J. Wong
@ 2019-06-04 21:50 ` Darrick J. Wong
  2019-06-10 19:32   ` Brian Foster
  2019-06-04 21:50 ` [PATCH 07/10] xfs: clean up long conditionals in xfs_iwalk_ichunk_ra Darrick J. Wong
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 46+ messages in thread
From: Darrick J. Wong @ 2019-06-04 21:50 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Now that the inode chunk grabbing function is a static function in the
iwalk code, change its behavior so that @agino is the inode where we
want to /start/ the iteration.  This reduces cognitive friction with the
callers and simplifes the code.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_iwalk.c |   37 +++++++++++++++++--------------------
 1 file changed, 17 insertions(+), 20 deletions(-)


diff --git a/fs/xfs/xfs_iwalk.c b/fs/xfs/xfs_iwalk.c
index bef0c4907781..9ad017ddbae7 100644
--- a/fs/xfs/xfs_iwalk.c
+++ b/fs/xfs/xfs_iwalk.c
@@ -99,10 +99,10 @@ xfs_iwalk_ichunk_ra(
 }
 
 /*
- * Lookup the inode chunk that the given inode lives in and then get the record
- * if we found the chunk.  If the inode was not the last in the chunk and there
- * are some left allocated, update the data for the pointed-to record as well as
- * return the count of grabbed inodes.
+ * Lookup the inode chunk that the given @agino lives in and then get the
+ * record if we found the chunk.  Set the bits in @irec's free mask that
+ * correspond to the inodes before @agino so that we skip them.  This is how we
+ * restart an inode walk that was interrupted in the middle of an inode record.
  */
 STATIC int
 xfs_iwalk_grab_ichunk(
@@ -113,6 +113,7 @@ xfs_iwalk_grab_ichunk(
 {
 	int				idx;	/* index into inode chunk */
 	int				stat;
+	int				i;
 	int				error = 0;
 
 	/* Lookup the inode chunk that this inode lives in */
@@ -136,24 +137,20 @@ xfs_iwalk_grab_ichunk(
 		return 0;
 	}
 
-	idx = agino - irec->ir_startino + 1;
-	if (idx < XFS_INODES_PER_CHUNK &&
-	    (xfs_inobt_maskn(idx, XFS_INODES_PER_CHUNK - idx) & ~irec->ir_free)) {
-		int	i;
+	idx = agino - irec->ir_startino;
 
-		/* We got a right chunk with some left inodes allocated at it.
-		 * Grab the chunk record.  Mark all the uninteresting inodes
-		 * free -- because they're before our start point.
-		 */
-		for (i = 0; i < idx; i++) {
-			if (XFS_INOBT_MASK(i) & ~irec->ir_free)
-				irec->ir_freecount++;
-		}
-
-		irec->ir_free |= xfs_inobt_maskn(0, idx);
-		*icount = irec->ir_count - irec->ir_freecount;
+	/*
+	 * We got a right chunk with some left inodes allocated at it.  Grab
+	 * the chunk record.  Mark all the uninteresting inodes free because
+	 * they're before our start point.
+	 */
+	for (i = 0; i < idx; i++) {
+		if (XFS_INOBT_MASK(i) & ~irec->ir_free)
+			irec->ir_freecount++;
 	}
 
+	irec->ir_free |= xfs_inobt_maskn(0, idx);
+	*icount = irec->ir_count - irec->ir_freecount;
 	return 0;
 }
 
@@ -281,7 +278,7 @@ xfs_iwalk_ag_start(
 	 * We require a lookup cache of at least two elements so that we don't
 	 * have to deal with tearing down the cursor to walk the records.
 	 */
-	error = xfs_iwalk_grab_ichunk(*curpp, agino - 1, &icount,
+	error = xfs_iwalk_grab_ichunk(*curpp, agino, &icount,
 			&iwag->recs[iwag->nr_recs]);
 	if (error)
 		return error;

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 07/10] xfs: clean up long conditionals in xfs_iwalk_ichunk_ra
  2019-06-04 21:49 [PATCH v2 00/10] xfs: refactor and improve inode iteration Darrick J. Wong
                   ` (5 preceding siblings ...)
  2019-06-04 21:50 ` [PATCH 06/10] xfs: change xfs_iwalk_grab_ichunk to use startino, not lastino Darrick J. Wong
@ 2019-06-04 21:50 ` Darrick J. Wong
  2019-06-10 19:32   ` Brian Foster
  2019-06-04 21:50 ` [PATCH 08/10] xfs: multithreaded iwalk implementation Darrick J. Wong
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 46+ messages in thread
From: Darrick J. Wong @ 2019-06-04 21:50 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Refactor xfs_iwalk_ichunk_ra to avoid long conditionals.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_iwalk.c |   12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)


diff --git a/fs/xfs/xfs_iwalk.c b/fs/xfs/xfs_iwalk.c
index 9ad017ddbae7..8595258b5001 100644
--- a/fs/xfs/xfs_iwalk.c
+++ b/fs/xfs/xfs_iwalk.c
@@ -84,16 +84,16 @@ xfs_iwalk_ichunk_ra(
 	agbno = XFS_AGINO_TO_AGBNO(mp, irec->ir_startino);
 
 	blk_start_plug(&plug);
-	for (i = 0;
-	     i < XFS_INODES_PER_CHUNK;
-	     i += igeo->inodes_per_cluster,
-			agbno += igeo->blocks_per_cluster) {
-		if (xfs_inobt_maskn(i, igeo->inodes_per_cluster) &
-		    ~irec->ir_free) {
+	for (i = 0; i < XFS_INODES_PER_CHUNK; i += igeo->inodes_per_cluster) {
+		xfs_inofree_t	imask;
+
+		imask = xfs_inobt_maskn(i, igeo->inodes_per_cluster);
+		if (imask & ~irec->ir_free) {
 			xfs_btree_reada_bufs(mp, agno, agbno,
 					igeo->blocks_per_cluster,
 					&xfs_inode_buf_ops);
 		}
+		agbno += igeo->blocks_per_cluster;
 	}
 	blk_finish_plug(&plug);
 }

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 08/10] xfs: multithreaded iwalk implementation
  2019-06-04 21:49 [PATCH v2 00/10] xfs: refactor and improve inode iteration Darrick J. Wong
                   ` (6 preceding siblings ...)
  2019-06-04 21:50 ` [PATCH 07/10] xfs: clean up long conditionals in xfs_iwalk_ichunk_ra Darrick J. Wong
@ 2019-06-04 21:50 ` Darrick J. Wong
  2019-06-10 19:40   ` Brian Foster
  2019-06-04 21:50 ` [PATCH 09/10] xfs: poll waiting for quotacheck Darrick J. Wong
  2019-06-04 21:50 ` [PATCH 10/10] xfs: refactor INUMBERS to use iwalk functions Darrick J. Wong
  9 siblings, 1 reply; 46+ messages in thread
From: Darrick J. Wong @ 2019-06-04 21:50 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Create a parallel iwalk implementation and switch quotacheck to use it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/Makefile      |    1 
 fs/xfs/xfs_globals.c |    3 +
 fs/xfs/xfs_iwalk.c   |   76 ++++++++++++++++++++++++++++++-
 fs/xfs/xfs_iwalk.h   |    2 +
 fs/xfs/xfs_pwork.c   |  122 ++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_pwork.h   |   50 ++++++++++++++++++++
 fs/xfs/xfs_qm.c      |    2 -
 fs/xfs/xfs_sysctl.h  |    6 ++
 fs/xfs/xfs_sysfs.c   |   40 ++++++++++++++++
 fs/xfs/xfs_trace.h   |   18 +++++++
 10 files changed, 317 insertions(+), 3 deletions(-)
 create mode 100644 fs/xfs/xfs_pwork.c
 create mode 100644 fs/xfs/xfs_pwork.h


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 74d30ef0dbce..48940a27d4aa 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -84,6 +84,7 @@ xfs-y				+= xfs_aops.o \
 				   xfs_message.o \
 				   xfs_mount.o \
 				   xfs_mru_cache.o \
+				   xfs_pwork.o \
 				   xfs_reflink.o \
 				   xfs_stats.o \
 				   xfs_super.o \
diff --git a/fs/xfs/xfs_globals.c b/fs/xfs/xfs_globals.c
index d0d377384120..4f93f2c4dc38 100644
--- a/fs/xfs/xfs_globals.c
+++ b/fs/xfs/xfs_globals.c
@@ -31,6 +31,9 @@ xfs_param_t xfs_params = {
 	.fstrm_timer	= {	1,		30*100,		3600*100},
 	.eofb_timer	= {	1,		300,		3600*24},
 	.cowb_timer	= {	1,		1800,		3600*24},
+#ifdef DEBUG
+	.pwork_threads	= {	0,		0,		NR_CPUS	},
+#endif
 };
 
 struct xfs_globals xfs_globals = {
diff --git a/fs/xfs/xfs_iwalk.c b/fs/xfs/xfs_iwalk.c
index 8595258b5001..71ee1628aa70 100644
--- a/fs/xfs/xfs_iwalk.c
+++ b/fs/xfs/xfs_iwalk.c
@@ -21,6 +21,7 @@
 #include "xfs_health.h"
 #include "xfs_trans.h"
 #include "xfs_iwalk.h"
+#include "xfs_pwork.h"
 
 /*
  * Walking Inodes in the Filesystem
@@ -46,6 +47,9 @@
  */
 
 struct xfs_iwalk_ag {
+	/* parallel work control data; will be null if single threaded */
+	struct xfs_pwork		pwork;
+
 	struct xfs_mount		*mp;
 	struct xfs_trans		*tp;
 
@@ -200,6 +204,9 @@ xfs_iwalk_ag_recs(
 		trace_xfs_iwalk_ag_rec(mp, agno, irec);
 
 		for (j = 0; j < XFS_INODES_PER_CHUNK; j++) {
+			if (xfs_pwork_want_abort(&iwag->pwork))
+				return 0;
+
 			/* Skip if this inode is free */
 			if (XFS_INOBT_MASK(j) & irec->ir_free)
 				continue;
@@ -360,7 +367,7 @@ xfs_iwalk_ag(
 	agino = XFS_INO_TO_AGINO(mp, iwag->startino);
 	error = xfs_iwalk_ag_start(iwag, agno, agino, &cur, &agi_bp, &has_more);
 
-	while (!error && has_more) {
+	while (!error && has_more && !xfs_pwork_want_abort(&iwag->pwork)) {
 		struct xfs_inobt_rec_incore	*irec;
 
 		cond_resched();
@@ -409,7 +416,7 @@ xfs_iwalk_ag(
 	xfs_iwalk_del_inobt(tp, &cur, &agi_bp, error);
 
 	/* Walk any records left behind in the cache. */
-	if (iwag->nr_recs == 0 || error)
+	if (iwag->nr_recs == 0 || error || xfs_pwork_want_abort(&iwag->pwork))
 		return error;
 
 	return xfs_iwalk_ag_recs(iwag);
@@ -465,6 +472,7 @@ xfs_iwalk(
 		.iwalk_fn	= iwalk_fn,
 		.data		= data,
 		.startino	= startino,
+		.pwork		= XFS_PWORK_SINGLE_THREADED,
 	};
 	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, startino);
 	int			error;
@@ -486,3 +494,67 @@ xfs_iwalk(
 	xfs_iwalk_free(&iwag);
 	return error;
 }
+
+/* Run per-thread iwalk work. */
+static int
+xfs_iwalk_ag_work(
+	struct xfs_mount	*mp,
+	struct xfs_pwork	*pwork)
+{
+	struct xfs_iwalk_ag	*iwag;
+	int			error;
+
+	iwag = container_of(pwork, struct xfs_iwalk_ag, pwork);
+	error = xfs_iwalk_alloc(iwag);
+	if (error)
+		goto out;
+
+	error = xfs_iwalk_ag(iwag);
+	xfs_iwalk_free(iwag);
+out:
+	kmem_free(iwag);
+	return error;
+}
+
+/*
+ * Walk all the inodes in the filesystem using multiple threads to process each
+ * AG.
+ */
+int
+xfs_iwalk_threaded(
+	struct xfs_mount	*mp,
+	xfs_ino_t		startino,
+	xfs_iwalk_fn		iwalk_fn,
+	unsigned int		max_prefetch,
+	void			*data)
+{
+	struct xfs_pwork_ctl	pctl;
+	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, startino);
+	unsigned int		nr_threads;
+	int			error;
+
+	ASSERT(agno < mp->m_sb.sb_agcount);
+
+	nr_threads = xfs_pwork_guess_datadev_parallelism(mp);
+	error = xfs_pwork_init(mp, &pctl, xfs_iwalk_ag_work, "xfs_iwalk",
+			nr_threads);
+	if (error)
+		return error;
+
+	for (; agno < mp->m_sb.sb_agcount; agno++) {
+		struct xfs_iwalk_ag	*iwag;
+
+		iwag = kmem_alloc(sizeof(struct xfs_iwalk_ag), KM_SLEEP);
+		iwag->mp = mp;
+		iwag->tp = NULL;
+		iwag->iwalk_fn = iwalk_fn;
+		iwag->data = data;
+		iwag->startino = startino;
+		iwag->recs = NULL;
+		xfs_iwalk_set_prefetch(iwag, max_prefetch);
+		xfs_pwork_queue(&pctl, &iwag->pwork);
+		startino = XFS_AGINO_TO_INO(mp, agno + 1, 0);
+	}
+
+	return xfs_pwork_destroy(&pctl);
+}
diff --git a/fs/xfs/xfs_iwalk.h b/fs/xfs/xfs_iwalk.h
index 45b1baabcd2d..40233a05a766 100644
--- a/fs/xfs/xfs_iwalk.h
+++ b/fs/xfs/xfs_iwalk.h
@@ -14,5 +14,7 @@ typedef int (*xfs_iwalk_fn)(struct xfs_mount *mp, struct xfs_trans *tp,
 
 int xfs_iwalk(struct xfs_mount *mp, struct xfs_trans *tp, xfs_ino_t startino,
 		xfs_iwalk_fn iwalk_fn, unsigned int max_prefetch, void *data);
+int xfs_iwalk_threaded(struct xfs_mount *mp, xfs_ino_t startino,
+		xfs_iwalk_fn iwalk_fn, unsigned int max_prefetch, void *data);
 
 #endif /* __XFS_IWALK_H__ */
diff --git a/fs/xfs/xfs_pwork.c b/fs/xfs/xfs_pwork.c
new file mode 100644
index 000000000000..19605a3a2482
--- /dev/null
+++ b/fs/xfs/xfs_pwork.c
@@ -0,0 +1,122 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Copyright (C) 2019 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_trace.h"
+#include "xfs_sysctl.h"
+#include "xfs_pwork.h"
+
+/*
+ * Parallel Work Queue
+ * ===================
+ *
+ * Abstract away the details of running a large and "obviously" parallelizable
+ * task across multiple CPUs.  Callers initialize the pwork control object with
+ * a desired level of parallelization and a work function.  Next, they embed
+ * struct xfs_pwork in whatever structure they use to pass work context to a
+ * worker thread and queue that pwork.  The work function will be passed the
+ * pwork item when it is run (from process context) and any returned error will
+ * cause all threads to abort.
+ * 
+ * This is the rough equivalent of the xfsprogs workqueue code, though we can't
+ * reuse that name here.
+ */
+
+/* Invoke our caller's function. */
+static void
+xfs_pwork_work(
+	struct work_struct	*work)
+{
+	struct xfs_pwork	*pwork;
+	struct xfs_pwork_ctl	*pctl;
+	int			error;
+
+	pwork = container_of(work, struct xfs_pwork, work);
+	pctl = pwork->pctl;
+	error = pctl->work_fn(pctl->mp, pwork);
+	if (error && !pctl->error)
+		pctl->error = error;
+}
+
+/*
+ * Set up control data for parallel work.  @work_fn is the function that will
+ * be called.  @tag will be written into the kernel threads.  @nr_threads is
+ * the level of parallelism desired, or 0 for no limit.
+ */
+int
+xfs_pwork_init(
+	struct xfs_mount	*mp,
+	struct xfs_pwork_ctl	*pctl,
+	xfs_pwork_work_fn	work_fn,
+	const char		*tag,
+	unsigned int		nr_threads)
+{
+#ifdef DEBUG
+	if (xfs_globals.pwork_threads > 0)
+		nr_threads = xfs_globals.pwork_threads;
+#endif
+	trace_xfs_pwork_init(mp, nr_threads, current->pid);
+
+	pctl->wq = alloc_workqueue("%s-%d", WQ_FREEZABLE, nr_threads, tag,
+			current->pid);
+	if (!pctl->wq)
+		return -ENOMEM;
+	pctl->work_fn = work_fn;
+	pctl->error = 0;
+	pctl->mp = mp;
+
+	return 0;
+}
+
+/* Queue some parallel work. */
+void
+xfs_pwork_queue(
+	struct xfs_pwork_ctl	*pctl,
+	struct xfs_pwork	*pwork)
+{
+	INIT_WORK(&pwork->work, xfs_pwork_work);
+	pwork->pctl = pctl;
+	queue_work(pctl->wq, &pwork->work);
+}
+
+/* Wait for the work to finish and tear down the control structure. */
+int
+xfs_pwork_destroy(
+	struct xfs_pwork_ctl	*pctl)
+{
+	destroy_workqueue(pctl->wq);
+	pctl->wq = NULL;
+	return pctl->error;
+}
+
+/*
+ * Return the amount of parallelism that the data device can handle, or 0 for
+ * no limit.
+ */
+unsigned int
+xfs_pwork_guess_datadev_parallelism(
+	struct xfs_mount	*mp)
+{
+	struct xfs_buftarg	*btp = mp->m_ddev_targp;
+	int			iomin;
+	int			ioopt;
+
+	if (blk_queue_nonrot(btp->bt_bdev->bd_queue))
+		return num_online_cpus();
+	if (mp->m_sb.sb_width && mp->m_sb.sb_unit)
+		return mp->m_sb.sb_width / mp->m_sb.sb_unit;
+	iomin = bdev_io_min(btp->bt_bdev);
+	ioopt = bdev_io_opt(btp->bt_bdev);
+	if (iomin && ioopt)
+		return ioopt / iomin;
+
+	return 1;
+}
diff --git a/fs/xfs/xfs_pwork.h b/fs/xfs/xfs_pwork.h
new file mode 100644
index 000000000000..e0c1354a2d8c
--- /dev/null
+++ b/fs/xfs/xfs_pwork.h
@@ -0,0 +1,50 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Copyright (C) 2019 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ */
+#ifndef __XFS_PWORK_H__
+#define __XFS_PWORK_H__
+
+struct xfs_pwork;
+struct xfs_mount;
+
+typedef int (*xfs_pwork_work_fn)(struct xfs_mount *mp, struct xfs_pwork *pwork);
+
+/*
+ * Parallel work coordination structure.
+ */
+struct xfs_pwork_ctl {
+	struct workqueue_struct	*wq;
+	struct xfs_mount	*mp;
+	xfs_pwork_work_fn	work_fn;
+	int			error;
+};
+
+/*
+ * Embed this parallel work control item inside your own work structure,
+ * then queue work with it.
+ */
+struct xfs_pwork {
+	struct work_struct	work;
+	struct xfs_pwork_ctl	*pctl;
+};
+
+#define XFS_PWORK_SINGLE_THREADED	{ .pctl = NULL }
+
+/* Have we been told to abort? */
+static inline bool
+xfs_pwork_want_abort(
+	struct xfs_pwork	*pwork)
+{
+	return pwork->pctl && pwork->pctl->error;
+}
+
+int xfs_pwork_init(struct xfs_mount *mp, struct xfs_pwork_ctl *pctl,
+		xfs_pwork_work_fn work_fn, const char *tag,
+		unsigned int nr_threads);
+void xfs_pwork_queue(struct xfs_pwork_ctl *pctl, struct xfs_pwork *pwork);
+int xfs_pwork_destroy(struct xfs_pwork_ctl *pctl);
+unsigned int xfs_pwork_guess_datadev_parallelism(struct xfs_mount *mp);
+
+#endif /* __XFS_PWORK_H__ */
diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
index a5b2260406a8..e4f3785f7a64 100644
--- a/fs/xfs/xfs_qm.c
+++ b/fs/xfs/xfs_qm.c
@@ -1305,7 +1305,7 @@ xfs_qm_quotacheck(
 		flags |= XFS_PQUOTA_CHKD;
 	}
 
-	error = xfs_iwalk(mp, NULL, 0, xfs_qm_dqusage_adjust, 0, NULL);
+	error = xfs_iwalk_threaded(mp, 0, xfs_qm_dqusage_adjust, 0, NULL);
 	if (error)
 		goto error_return;
 
diff --git a/fs/xfs/xfs_sysctl.h b/fs/xfs/xfs_sysctl.h
index ad7f9be13087..b555e045e2f4 100644
--- a/fs/xfs/xfs_sysctl.h
+++ b/fs/xfs/xfs_sysctl.h
@@ -37,6 +37,9 @@ typedef struct xfs_param {
 	xfs_sysctl_val_t fstrm_timer;	/* Filestream dir-AG assoc'n timeout. */
 	xfs_sysctl_val_t eofb_timer;	/* Interval between eofb scan wakeups */
 	xfs_sysctl_val_t cowb_timer;	/* Interval between cowb scan wakeups */
+#ifdef DEBUG
+	xfs_sysctl_val_t pwork_threads;	/* Parallel workqueue thread count */
+#endif
 } xfs_param_t;
 
 /*
@@ -82,6 +85,9 @@ enum {
 extern xfs_param_t	xfs_params;
 
 struct xfs_globals {
+#ifdef DEBUG
+	int	pwork_threads;		/* parallel workqueue threads */
+#endif
 	int	log_recovery_delay;	/* log recovery delay (secs) */
 	int	mount_delay;		/* mount setup delay (secs) */
 	bool	bug_on_assert;		/* BUG() the kernel on assert failure */
diff --git a/fs/xfs/xfs_sysfs.c b/fs/xfs/xfs_sysfs.c
index cabda13f3c64..910e6b9cb1a7 100644
--- a/fs/xfs/xfs_sysfs.c
+++ b/fs/xfs/xfs_sysfs.c
@@ -206,11 +206,51 @@ always_cow_show(
 }
 XFS_SYSFS_ATTR_RW(always_cow);
 
+#ifdef DEBUG
+/*
+ * Override how many threads the parallel work queue is allowed to create.
+ * This has to be a debug-only global (instead of an errortag) because one of
+ * the main users of parallel workqueues is mount time quotacheck.
+ */
+STATIC ssize_t
+pwork_threads_store(
+	struct kobject	*kobject,
+	const char	*buf,
+	size_t		count)
+{
+	int		ret;
+	int		val;
+
+	ret = kstrtoint(buf, 0, &val);
+	if (ret)
+		return ret;
+
+	if (val < 0 || val > NR_CPUS)
+		return -EINVAL;
+
+	xfs_globals.pwork_threads = val;
+
+	return count;
+}
+
+STATIC ssize_t
+pwork_threads_show(
+	struct kobject	*kobject,
+	char		*buf)
+{
+	return snprintf(buf, PAGE_SIZE, "%d\n", xfs_globals.pwork_threads);
+}
+XFS_SYSFS_ATTR_RW(pwork_threads);
+#endif /* DEBUG */
+
 static struct attribute *xfs_dbg_attrs[] = {
 	ATTR_LIST(bug_on_assert),
 	ATTR_LIST(log_recovery_delay),
 	ATTR_LIST(mount_delay),
 	ATTR_LIST(always_cow),
+#ifdef DEBUG
+	ATTR_LIST(pwork_threads),
+#endif
 	NULL,
 };
 
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index f9bb1d50bc0e..658cbade1998 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -3556,6 +3556,24 @@ TRACE_EVENT(xfs_iwalk_ag_rec,
 		  __entry->startino, __entry->freemask)
 )
 
+TRACE_EVENT(xfs_pwork_init,
+	TP_PROTO(struct xfs_mount *mp, unsigned int nr_threads, pid_t pid),
+	TP_ARGS(mp, nr_threads, pid),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(unsigned int, nr_threads)
+		__field(pid_t, pid)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->nr_threads = nr_threads;
+		__entry->pid = pid;
+	),
+	TP_printk("dev %d:%d nr_threads %u pid %u",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->nr_threads, __entry->pid)
+)
+
 #endif /* _TRACE_XFS_H */
 
 #undef TRACE_INCLUDE_PATH

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 09/10] xfs: poll waiting for quotacheck
  2019-06-04 21:49 [PATCH v2 00/10] xfs: refactor and improve inode iteration Darrick J. Wong
                   ` (7 preceding siblings ...)
  2019-06-04 21:50 ` [PATCH 08/10] xfs: multithreaded iwalk implementation Darrick J. Wong
@ 2019-06-04 21:50 ` Darrick J. Wong
  2019-06-11 15:07   ` Brian Foster
  2019-06-04 21:50 ` [PATCH 10/10] xfs: refactor INUMBERS to use iwalk functions Darrick J. Wong
  9 siblings, 1 reply; 46+ messages in thread
From: Darrick J. Wong @ 2019-06-04 21:50 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Create a pwork destroy function that uses polling instead of
uninterruptible sleep to wait for work items to finish so that we can
touch the softlockup watchdog.  IOWs, gross hack.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_iwalk.c |    3 +++
 fs/xfs/xfs_iwalk.h |    3 ++-
 fs/xfs/xfs_pwork.c |   21 +++++++++++++++++++++
 fs/xfs/xfs_pwork.h |    2 ++
 fs/xfs/xfs_qm.c    |    2 +-
 5 files changed, 29 insertions(+), 2 deletions(-)


diff --git a/fs/xfs/xfs_iwalk.c b/fs/xfs/xfs_iwalk.c
index 71ee1628aa70..c4a9c4c246b7 100644
--- a/fs/xfs/xfs_iwalk.c
+++ b/fs/xfs/xfs_iwalk.c
@@ -526,6 +526,7 @@ xfs_iwalk_threaded(
 	xfs_ino_t		startino,
 	xfs_iwalk_fn		iwalk_fn,
 	unsigned int		max_prefetch,
+	bool			polled,
 	void			*data)
 {
 	struct xfs_pwork_ctl	pctl;
@@ -556,5 +557,7 @@ xfs_iwalk_threaded(
 		startino = XFS_AGINO_TO_INO(mp, agno + 1, 0);
 	}
 
+	if (polled)
+		return xfs_pwork_destroy_poll(&pctl);
 	return xfs_pwork_destroy(&pctl);
 }
diff --git a/fs/xfs/xfs_iwalk.h b/fs/xfs/xfs_iwalk.h
index 40233a05a766..76d8f87a39ef 100644
--- a/fs/xfs/xfs_iwalk.h
+++ b/fs/xfs/xfs_iwalk.h
@@ -15,6 +15,7 @@ typedef int (*xfs_iwalk_fn)(struct xfs_mount *mp, struct xfs_trans *tp,
 int xfs_iwalk(struct xfs_mount *mp, struct xfs_trans *tp, xfs_ino_t startino,
 		xfs_iwalk_fn iwalk_fn, unsigned int max_prefetch, void *data);
 int xfs_iwalk_threaded(struct xfs_mount *mp, xfs_ino_t startino,
-		xfs_iwalk_fn iwalk_fn, unsigned int max_prefetch, void *data);
+		xfs_iwalk_fn iwalk_fn, unsigned int max_prefetch, bool poll,
+		void *data);
 
 #endif /* __XFS_IWALK_H__ */
diff --git a/fs/xfs/xfs_pwork.c b/fs/xfs/xfs_pwork.c
index 19605a3a2482..3b885e0b52ac 100644
--- a/fs/xfs/xfs_pwork.c
+++ b/fs/xfs/xfs_pwork.c
@@ -13,6 +13,7 @@
 #include "xfs_trace.h"
 #include "xfs_sysctl.h"
 #include "xfs_pwork.h"
+#include <linux/nmi.h>
 
 /*
  * Parallel Work Queue
@@ -44,6 +45,7 @@ xfs_pwork_work(
 	error = pctl->work_fn(pctl->mp, pwork);
 	if (error && !pctl->error)
 		pctl->error = error;
+	atomic_dec(&pctl->nr_work);
 }
 
 /*
@@ -72,6 +74,7 @@ xfs_pwork_init(
 	pctl->work_fn = work_fn;
 	pctl->error = 0;
 	pctl->mp = mp;
+	atomic_set(&pctl->nr_work, 0);
 
 	return 0;
 }
@@ -84,6 +87,7 @@ xfs_pwork_queue(
 {
 	INIT_WORK(&pwork->work, xfs_pwork_work);
 	pwork->pctl = pctl;
+	atomic_inc(&pctl->nr_work);
 	queue_work(pctl->wq, &pwork->work);
 }
 
@@ -97,6 +101,23 @@ xfs_pwork_destroy(
 	return pctl->error;
 }
 
+/*
+ * Wait for the work to finish and tear down the control structure.
+ * Continually poll completion status and touch the soft lockup watchdog.
+ * This is for things like mount that hold locks.
+ */
+int
+xfs_pwork_destroy_poll(
+	struct xfs_pwork_ctl	*pctl)
+{
+	while (atomic_read(&pctl->nr_work) > 0) {
+		msleep(1);
+		touch_softlockup_watchdog();
+	}
+
+	return xfs_pwork_destroy(pctl);
+}
+
 /*
  * Return the amount of parallelism that the data device can handle, or 0 for
  * no limit.
diff --git a/fs/xfs/xfs_pwork.h b/fs/xfs/xfs_pwork.h
index e0c1354a2d8c..08da723a8dc9 100644
--- a/fs/xfs/xfs_pwork.h
+++ b/fs/xfs/xfs_pwork.h
@@ -18,6 +18,7 @@ struct xfs_pwork_ctl {
 	struct workqueue_struct	*wq;
 	struct xfs_mount	*mp;
 	xfs_pwork_work_fn	work_fn;
+	atomic_t		nr_work;
 	int			error;
 };
 
@@ -45,6 +46,7 @@ int xfs_pwork_init(struct xfs_mount *mp, struct xfs_pwork_ctl *pctl,
 		unsigned int nr_threads);
 void xfs_pwork_queue(struct xfs_pwork_ctl *pctl, struct xfs_pwork *pwork);
 int xfs_pwork_destroy(struct xfs_pwork_ctl *pctl);
+int xfs_pwork_destroy_poll(struct xfs_pwork_ctl *pctl);
 unsigned int xfs_pwork_guess_datadev_parallelism(struct xfs_mount *mp);
 
 #endif /* __XFS_PWORK_H__ */
diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
index e4f3785f7a64..de6a623ada02 100644
--- a/fs/xfs/xfs_qm.c
+++ b/fs/xfs/xfs_qm.c
@@ -1305,7 +1305,7 @@ xfs_qm_quotacheck(
 		flags |= XFS_PQUOTA_CHKD;
 	}
 
-	error = xfs_iwalk_threaded(mp, 0, xfs_qm_dqusage_adjust, 0, NULL);
+	error = xfs_iwalk_threaded(mp, 0, xfs_qm_dqusage_adjust, 0, true, NULL);
 	if (error)
 		goto error_return;
 

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 10/10] xfs: refactor INUMBERS to use iwalk functions
  2019-06-04 21:49 [PATCH v2 00/10] xfs: refactor and improve inode iteration Darrick J. Wong
                   ` (8 preceding siblings ...)
  2019-06-04 21:50 ` [PATCH 09/10] xfs: poll waiting for quotacheck Darrick J. Wong
@ 2019-06-04 21:50 ` Darrick J. Wong
  2019-06-11 15:08   ` Brian Foster
  9 siblings, 1 reply; 46+ messages in thread
From: Darrick J. Wong @ 2019-06-04 21:50 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Now that we have generic functions to walk inode records, refactor the
INUMBERS implementation to use it.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_ioctl.c   |   20 ++++--
 fs/xfs/xfs_ioctl.h   |    2 +
 fs/xfs/xfs_ioctl32.c |   35 ++++------
 fs/xfs/xfs_itable.c  |  168 ++++++++++++++++++++------------------------------
 fs/xfs/xfs_itable.h  |   22 +------
 fs/xfs/xfs_iwalk.c   |  161 ++++++++++++++++++++++++++++++++++++++++++++++--
 fs/xfs/xfs_iwalk.h   |   12 ++++
 7 files changed, 262 insertions(+), 158 deletions(-)


diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 43734901aeb9..4fa9a2c8b029 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -732,6 +732,16 @@ xfs_bulkstat_one_fmt(
 	return xfs_ibulk_advance(breq, sizeof(struct xfs_bstat));
 }
 
+int
+xfs_inumbers_fmt(
+	struct xfs_ibulk	*breq,
+	const struct xfs_inogrp	*igrp)
+{
+	if (copy_to_user(breq->ubuffer, igrp, sizeof(*igrp)))
+		return -EFAULT;
+	return xfs_ibulk_advance(breq, sizeof(struct xfs_inogrp));
+}
+
 STATIC int
 xfs_ioc_bulkstat(
 	xfs_mount_t		*mp,
@@ -779,13 +789,9 @@ xfs_ioc_bulkstat(
 	 * parameter to maintain correct function.
 	 */
 	if (cmd == XFS_IOC_FSINUMBERS) {
-		int	count = breq.icount;
-
-		breq.startino = lastino;
-		error = xfs_inumbers(mp, &breq.startino, &count,
-					bulkreq.ubuffer, xfs_inumbers_fmt);
-		breq.ocount = count;
-		lastino = breq.startino;
+		breq.startino = lastino + 1;
+		error = xfs_inumbers(&breq, xfs_inumbers_fmt);
+		lastino = breq.startino - 1;
 	} else if (cmd == XFS_IOC_FSBULKSTAT_SINGLE) {
 		breq.startino = lastino;
 		error = xfs_bulkstat_one(&breq, xfs_bulkstat_one_fmt);
diff --git a/fs/xfs/xfs_ioctl.h b/fs/xfs/xfs_ioctl.h
index f32c8aadfeba..fb303eaa8863 100644
--- a/fs/xfs/xfs_ioctl.h
+++ b/fs/xfs/xfs_ioctl.h
@@ -79,7 +79,9 @@ xfs_set_dmattrs(
 
 struct xfs_ibulk;
 struct xfs_bstat;
+struct xfs_inogrp;
 
 int xfs_bulkstat_one_fmt(struct xfs_ibulk *breq, const struct xfs_bstat *bstat);
+int xfs_inumbers_fmt(struct xfs_ibulk *breq, const struct xfs_inogrp *igrp);
 
 #endif
diff --git a/fs/xfs/xfs_ioctl32.c b/fs/xfs/xfs_ioctl32.c
index add15819daf3..dd53a9692e68 100644
--- a/fs/xfs/xfs_ioctl32.c
+++ b/fs/xfs/xfs_ioctl32.c
@@ -85,22 +85,17 @@ xfs_compat_growfs_rt_copyin(
 
 STATIC int
 xfs_inumbers_fmt_compat(
-	void			__user *ubuffer,
-	const struct xfs_inogrp	*buffer,
-	long			count,
-	long			*written)
+	struct xfs_ibulk	*breq,
+	const struct xfs_inogrp	*igrp)
 {
-	compat_xfs_inogrp_t	__user *p32 = ubuffer;
-	long			i;
+	struct compat_xfs_inogrp __user *p32 = breq->ubuffer;
 
-	for (i = 0; i < count; i++) {
-		if (put_user(buffer[i].xi_startino,   &p32[i].xi_startino) ||
-		    put_user(buffer[i].xi_alloccount, &p32[i].xi_alloccount) ||
-		    put_user(buffer[i].xi_allocmask,  &p32[i].xi_allocmask))
-			return -EFAULT;
-	}
-	*written = count * sizeof(*p32);
-	return 0;
+	if (put_user(igrp->xi_startino,   &p32->xi_startino) ||
+	    put_user(igrp->xi_alloccount, &p32->xi_alloccount) ||
+	    put_user(igrp->xi_allocmask,  &p32->xi_allocmask))
+		return -EFAULT;
+
+	return xfs_ibulk_advance(breq, sizeof(struct compat_xfs_inogrp));
 }
 
 #else
@@ -225,7 +220,7 @@ xfs_compat_ioc_bulkstat(
 	 * to userpace memory via bulkreq.ubuffer.  Normally the compat
 	 * functions and structure size are the correct ones to use ...
 	 */
-	inumbers_fmt_pf inumbers_func = xfs_inumbers_fmt_compat;
+	inumbers_fmt_pf		inumbers_func = xfs_inumbers_fmt_compat;
 	bulkstat_one_fmt_pf	bs_one_func = xfs_bulkstat_one_fmt_compat;
 
 #ifdef CONFIG_X86_X32
@@ -286,13 +281,9 @@ xfs_compat_ioc_bulkstat(
 	 * parameter to maintain correct function.
 	 */
 	if (cmd == XFS_IOC_FSINUMBERS_32) {
-		int	count = breq.icount;
-
-		breq.startino = lastino;
-		error = xfs_inumbers(mp, &breq.startino, &count,
-				bulkreq.ubuffer, inumbers_func);
-		breq.ocount = count;
-		lastino = breq.startino;
+		breq.startino = lastino + 1;
+		error = xfs_inumbers(&breq, inumbers_func);
+		lastino = breq.startino - 1;
 	} else if (cmd == XFS_IOC_FSBULKSTAT_SINGLE_32) {
 		breq.startino = lastino;
 		error = xfs_bulkstat_one(&breq, bs_one_func);
diff --git a/fs/xfs/xfs_itable.c b/fs/xfs/xfs_itable.c
index 06abe5c9c0ee..bade54d6ac64 100644
--- a/fs/xfs/xfs_itable.c
+++ b/fs/xfs/xfs_itable.c
@@ -259,121 +259,85 @@ xfs_bulkstat(
 	return error;
 }
 
-int
-xfs_inumbers_fmt(
-	void			__user *ubuffer, /* buffer to write to */
-	const struct xfs_inogrp	*buffer,	/* buffer to read from */
-	long			count,		/* # of elements to read */
-	long			*written)	/* # of bytes written */
+struct xfs_inumbers_chunk {
+	inumbers_fmt_pf		formatter;
+	struct xfs_ibulk	*breq;
+};
+
+/*
+ * INUMBERS
+ * ========
+ * This is how we export inode btree records to userspace, so that XFS tools
+ * can figure out where inodes are allocated.
+ */
+
+/*
+ * Format the inode group structure and report it somewhere.
+ *
+ * Similar to xfs_bulkstat_one_int, lastino is the inode cursor as we walk
+ * through the filesystem so we move it forward unless there was a runtime
+ * error.  If the formatter tells us the buffer is now full we also move the
+ * cursor forward and abort the walk.
+ */
+STATIC int
+xfs_inumbers_walk(
+	struct xfs_mount	*mp,
+	struct xfs_trans	*tp,
+	xfs_agnumber_t		agno,
+	const struct xfs_inobt_rec_incore *irec,
+	void			*data)
 {
-	if (copy_to_user(ubuffer, buffer, count * sizeof(*buffer)))
-		return -EFAULT;
-	*written = count * sizeof(*buffer);
-	return 0;
+	struct xfs_inogrp	inogrp = {
+		.xi_startino	= XFS_AGINO_TO_INO(mp, agno, irec->ir_startino),
+		.xi_alloccount	= irec->ir_count - irec->ir_freecount,
+		.xi_allocmask	= ~irec->ir_free,
+	};
+	struct xfs_inumbers_chunk *ic = data;
+	xfs_agino_t		agino;
+	int			error;
+
+	error = ic->formatter(ic->breq, &inogrp);
+	if (error && error != XFS_IBULK_BUFFER_FULL)
+		return error;
+	if (error == XFS_IBULK_BUFFER_FULL)
+		error = XFS_INOBT_WALK_ABORT;
+
+	agino = irec->ir_startino + XFS_INODES_PER_CHUNK;
+	ic->breq->startino = XFS_AGINO_TO_INO(mp, agno, agino);
+	return error;
 }
 
 /*
  * Return inode number table for the filesystem.
  */
-int					/* error status */
+int
 xfs_inumbers(
-	struct xfs_mount	*mp,/* mount point for filesystem */
-	xfs_ino_t		*lastino,/* last inode returned */
-	int			*count,/* size of buffer/count returned */
-	void			__user *ubuffer,/* buffer with inode descriptions */
+	struct xfs_ibulk	*breq,
 	inumbers_fmt_pf		formatter)
 {
-	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, *lastino);
-	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, *lastino);
-	struct xfs_btree_cur	*cur = NULL;
-	struct xfs_buf		*agbp = NULL;
-	struct xfs_inogrp	*buffer;
-	int			bcount;
-	int			left = *count;
-	int			bufidx = 0;
+	struct xfs_inumbers_chunk ic = {
+		.formatter	= formatter,
+		.breq		= breq,
+	};
 	int			error = 0;
 
-	*count = 0;
-	if (agno >= mp->m_sb.sb_agcount ||
-	    *lastino != XFS_AGINO_TO_INO(mp, agno, agino))
-		return error;
+	breq->ocount = 0;
 
-	bcount = min(left, (int)(PAGE_SIZE / sizeof(*buffer)));
-	buffer = kmem_zalloc(bcount * sizeof(*buffer), KM_SLEEP);
-	do {
-		struct xfs_inobt_rec_incore	r;
-		int				stat;
-
-		if (!agbp) {
-			error = xfs_ialloc_read_agi(mp, NULL, agno, &agbp);
-			if (error)
-				break;
-
-			cur = xfs_inobt_init_cursor(mp, NULL, agbp, agno,
-						    XFS_BTNUM_INO);
-			error = xfs_inobt_lookup(cur, agino, XFS_LOOKUP_GE,
-						 &stat);
-			if (error)
-				break;
-			if (!stat)
-				goto next_ag;
-		}
-
-		error = xfs_inobt_get_rec(cur, &r, &stat);
-		if (error)
-			break;
-		if (!stat)
-			goto next_ag;
-
-		agino = r.ir_startino + XFS_INODES_PER_CHUNK - 1;
-		buffer[bufidx].xi_startino =
-			XFS_AGINO_TO_INO(mp, agno, r.ir_startino);
-		buffer[bufidx].xi_alloccount = r.ir_count - r.ir_freecount;
-		buffer[bufidx].xi_allocmask = ~r.ir_free;
-		if (++bufidx == bcount) {
-			long	written;
-
-			error = formatter(ubuffer, buffer, bufidx, &written);
-			if (error)
-				break;
-			ubuffer += written;
-			*count += bufidx;
-			bufidx = 0;
-		}
-		if (!--left)
-			break;
-
-		error = xfs_btree_increment(cur, 0, &stat);
-		if (error)
-			break;
-		if (stat)
-			continue;
-
-next_ag:
-		xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
-		cur = NULL;
-		xfs_buf_relse(agbp);
-		agbp = NULL;
-		agino = 0;
-		agno++;
-	} while (agno < mp->m_sb.sb_agcount);
-
-	if (!error) {
-		if (bufidx) {
-			long	written;
-
-			error = formatter(ubuffer, buffer, bufidx, &written);
-			if (!error)
-				*count += bufidx;
-		}
-		*lastino = XFS_AGINO_TO_INO(mp, agno, agino);
-	}
+	if (xfs_bulkstat_already_done(breq->mp, breq->startino))
+		return 0;
+
+	error = xfs_inobt_walk(breq->mp, NULL, breq->startino,
+			xfs_inumbers_walk, breq->icount, &ic);
 
-	kmem_free(buffer);
-	if (cur)
-		xfs_btree_del_cursor(cur, error);
-	if (agbp)
-		xfs_buf_relse(agbp);
+	/*
+	 * We found some inode groups, so clear the error status and return
+	 * them.  The lastino pointer will point directly at the inode that
+	 * triggered any error that occurred, so on the next call the error
+	 * will be triggered again and propagated to userspace as there will be
+	 * no formatted inode groups in the buffer.
+	 */
+	if (breq->ocount > 0)
+		error = 0;
 
 	return error;
 }
diff --git a/fs/xfs/xfs_itable.h b/fs/xfs/xfs_itable.h
index a2562fe8d282..b4c89454e27a 100644
--- a/fs/xfs/xfs_itable.h
+++ b/fs/xfs/xfs_itable.h
@@ -46,25 +46,9 @@ typedef int (*bulkstat_one_fmt_pf)(struct xfs_ibulk *breq,
 int xfs_bulkstat_one(struct xfs_ibulk *breq, bulkstat_one_fmt_pf formatter);
 int xfs_bulkstat(struct xfs_ibulk *breq, bulkstat_one_fmt_pf formatter);
 
-typedef int (*inumbers_fmt_pf)(
-	void			__user *ubuffer, /* buffer to write to */
-	const xfs_inogrp_t	*buffer,	/* buffer to read from */
-	long			count,		/* # of elements to read */
-	long			*written);	/* # of bytes written */
+typedef int (*inumbers_fmt_pf)(struct xfs_ibulk *breq,
+		const struct xfs_inogrp *igrp);
 
-int
-xfs_inumbers_fmt(
-	void			__user *ubuffer, /* buffer to write to */
-	const xfs_inogrp_t	*buffer,	/* buffer to read from */
-	long			count,		/* # of elements to read */
-	long			*written);	/* # of bytes written */
-
-int					/* error status */
-xfs_inumbers(
-	xfs_mount_t		*mp,	/* mount point for filesystem */
-	xfs_ino_t		*last,	/* last inode returned */
-	int			*count,	/* size of buffer/count returned */
-	void			__user *buffer, /* buffer with inode info */
-	inumbers_fmt_pf		formatter);
+int xfs_inumbers(struct xfs_ibulk *breq, inumbers_fmt_pf formatter);
 
 #endif	/* __XFS_ITABLE_H__ */
diff --git a/fs/xfs/xfs_iwalk.c b/fs/xfs/xfs_iwalk.c
index c4a9c4c246b7..3a35d1cf7e14 100644
--- a/fs/xfs/xfs_iwalk.c
+++ b/fs/xfs/xfs_iwalk.c
@@ -66,7 +66,10 @@ struct xfs_iwalk_ag {
 	unsigned int			nr_recs;
 
 	/* Inode walk function and data pointer. */
-	xfs_iwalk_fn			iwalk_fn;
+	union {
+		xfs_iwalk_fn		iwalk_fn;
+		xfs_inobt_walk_fn	inobt_walk_fn;
+	};
 	void				*data;
 };
 
@@ -104,16 +107,18 @@ xfs_iwalk_ichunk_ra(
 
 /*
  * Lookup the inode chunk that the given @agino lives in and then get the
- * record if we found the chunk.  Set the bits in @irec's free mask that
- * correspond to the inodes before @agino so that we skip them.  This is how we
- * restart an inode walk that was interrupted in the middle of an inode record.
+ * record if we found the chunk.  If @trim is set, set the bits in @irec's free
+ * mask that correspond to the inodes before @agino so that we skip them.
+ * This is how we restart an inode walk that was interrupted in the middle of
+ * an inode record.
  */
 STATIC int
 xfs_iwalk_grab_ichunk(
 	struct xfs_btree_cur		*cur,	/* btree cursor */
 	xfs_agino_t			agino,	/* starting inode of chunk */
 	int				*icount,/* return # of inodes grabbed */
-	struct xfs_inobt_rec_incore	*irec)	/* btree record */
+	struct xfs_inobt_rec_incore	*irec,	/* btree record */
+	bool				trim)
 {
 	int				idx;	/* index into inode chunk */
 	int				stat;
@@ -141,6 +146,12 @@ xfs_iwalk_grab_ichunk(
 		return 0;
 	}
 
+	/* Return the entire record if the caller wants the whole thing. */
+	if (!trim) {
+		*icount = irec->ir_count;
+		return 0;
+	}
+
 	idx = agino - irec->ir_startino;
 
 	/*
@@ -262,7 +273,8 @@ xfs_iwalk_ag_start(
 	xfs_agino_t		agino,
 	struct xfs_btree_cur	**curpp,
 	struct xfs_buf		**agi_bpp,
-	int			*has_more)
+	int			*has_more,
+	bool			trim)
 {
 	struct xfs_mount	*mp = iwag->mp;
 	struct xfs_trans	*tp = iwag->tp;
@@ -286,7 +298,7 @@ xfs_iwalk_ag_start(
 	 * have to deal with tearing down the cursor to walk the records.
 	 */
 	error = xfs_iwalk_grab_ichunk(*curpp, agino, &icount,
-			&iwag->recs[iwag->nr_recs]);
+			&iwag->recs[iwag->nr_recs], trim);
 	if (error)
 		return error;
 	if (icount)
@@ -365,7 +377,8 @@ xfs_iwalk_ag(
 	/* Set up our cursor at the right place in the inode btree. */
 	agno = XFS_INO_TO_AGNO(mp, iwag->startino);
 	agino = XFS_INO_TO_AGINO(mp, iwag->startino);
-	error = xfs_iwalk_ag_start(iwag, agno, agino, &cur, &agi_bp, &has_more);
+	error = xfs_iwalk_ag_start(iwag, agno, agino, &cur, &agi_bp, &has_more,
+			true);
 
 	while (!error && has_more && !xfs_pwork_want_abort(&iwag->pwork)) {
 		struct xfs_inobt_rec_incore	*irec;
@@ -561,3 +574,135 @@ xfs_iwalk_threaded(
 		return xfs_pwork_destroy_poll(&pctl);
 	return xfs_pwork_destroy(&pctl);
 }
+
+/* For each inuse inode in each cached inobt record, call our function. */
+STATIC int
+xfs_inobt_walk_ag_recs(
+	struct xfs_iwalk_ag		*iwag)
+{
+	struct xfs_mount		*mp = iwag->mp;
+	struct xfs_trans		*tp = iwag->tp;
+	struct xfs_inobt_rec_incore	*irec;
+	unsigned int			i;
+	xfs_agnumber_t			agno;
+	int				error;
+
+	agno = XFS_INO_TO_AGNO(mp, iwag->startino);
+	for (i = 0, irec = iwag->recs; i < iwag->nr_recs; i++, irec++) {
+		trace_xfs_iwalk_ag_rec(mp, agno, irec);
+		error = iwag->inobt_walk_fn(mp, tp, agno, irec, iwag->data);
+		if (error)
+			return error;
+	}
+
+	iwag->nr_recs = 0;
+	return 0;
+}
+
+/*
+ * Walk all inode btree records in a single AG, from @iwag->startino to the end
+ * of the AG.
+ */
+STATIC int
+xfs_inobt_walk_ag(
+	struct xfs_iwalk_ag		*iwag)
+{
+	struct xfs_mount		*mp = iwag->mp;
+	struct xfs_trans		*tp = iwag->tp;
+	struct xfs_buf			*agi_bp = NULL;
+	struct xfs_btree_cur		*cur = NULL;
+	xfs_agnumber_t			agno;
+	xfs_agino_t			agino;
+	int				has_more;
+	int				error = 0;
+
+	/* Set up our cursor at the right place in the inode btree. */
+	agno = XFS_INO_TO_AGNO(mp, iwag->startino);
+	agino = XFS_INO_TO_AGINO(mp, iwag->startino);
+	error = xfs_iwalk_ag_start(iwag, agno, agino, &cur, &agi_bp, &has_more,
+			false);
+
+	while (!error && has_more && !xfs_pwork_want_abort(&iwag->pwork)) {
+		struct xfs_inobt_rec_incore	*irec;
+
+		cond_resched();
+
+		/* Fetch the inobt record. */
+		irec = &iwag->recs[iwag->nr_recs];
+		error = xfs_inobt_get_rec(cur, irec, &has_more);
+		if (error || !has_more)
+			break;
+
+		/*
+		 * If there's space in the buffer for more records, increment
+		 * the btree cursor and grab more.
+		 */
+		if (++iwag->nr_recs < iwag->sz_recs) {
+			error = xfs_btree_increment(cur, 0, &has_more);
+			if (error || !has_more)
+				break;
+			continue;
+		}
+
+		/*
+		 * Otherwise, we need to save cursor state and run the callback
+		 * function on the cached records.  The run_callbacks function
+		 * is supposed to return a cursor pointing to the record where
+		 * we would be if we had been able to increment like above.
+		 */
+		error = xfs_iwalk_run_callbacks(iwag, xfs_inobt_walk_ag_recs,
+				agno, &cur, &agi_bp, &has_more);
+	}
+
+	xfs_iwalk_del_inobt(tp, &cur, &agi_bp, error);
+
+	/* Walk any records left behind in the cache. */
+	if (iwag->nr_recs == 0 || error || xfs_pwork_want_abort(&iwag->pwork))
+		return error;
+
+	return xfs_inobt_walk_ag_recs(iwag);
+}
+
+/*
+ * Walk all inode btree records in the filesystem starting from @startino.  The
+ * @inobt_walk_fn will be called for each btree record, being passed the incore
+ * record and @data.  @max_prefetch controls how many inobt records we try to
+ * cache ahead of time.
+ */
+int
+xfs_inobt_walk(
+	struct xfs_mount	*mp,
+	struct xfs_trans	*tp,
+	xfs_ino_t		startino,
+	xfs_inobt_walk_fn	inobt_walk_fn,
+	unsigned int		max_prefetch,
+	void			*data)
+{
+	struct xfs_iwalk_ag	iwag = {
+		.mp		= mp,
+		.tp		= tp,
+		.inobt_walk_fn	= inobt_walk_fn,
+		.data		= data,
+		.startino	= startino,
+		.pwork		= XFS_PWORK_SINGLE_THREADED,
+	};
+	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, startino);
+	int			error;
+
+	ASSERT(agno < mp->m_sb.sb_agcount);
+
+	xfs_iwalk_set_prefetch(&iwag, max_prefetch * XFS_INODES_PER_CHUNK);
+	error = xfs_iwalk_alloc(&iwag);
+	if (error)
+		return error;
+
+	for (; agno < mp->m_sb.sb_agcount; agno++) {
+		error = xfs_inobt_walk_ag(&iwag);
+		if (error)
+			break;
+		iwag.startino = XFS_AGINO_TO_INO(mp, agno + 1, 0);
+	}
+
+	xfs_iwalk_free(&iwag);
+	return error;
+}
diff --git a/fs/xfs/xfs_iwalk.h b/fs/xfs/xfs_iwalk.h
index 76d8f87a39ef..20bee93d4676 100644
--- a/fs/xfs/xfs_iwalk.h
+++ b/fs/xfs/xfs_iwalk.h
@@ -18,4 +18,16 @@ int xfs_iwalk_threaded(struct xfs_mount *mp, xfs_ino_t startino,
 		xfs_iwalk_fn iwalk_fn, unsigned int max_prefetch, bool poll,
 		void *data);
 
+/* Walk all inode btree records in the filesystem starting from @startino. */
+typedef int (*xfs_inobt_walk_fn)(struct xfs_mount *mp, struct xfs_trans *tp,
+				 xfs_agnumber_t agno,
+				 const struct xfs_inobt_rec_incore *irec,
+				 void *data);
+/* Return value (for xfs_inobt_walk_fn) that aborts the walk immediately. */
+#define XFS_INOBT_WALK_ABORT	(XFS_IWALK_ABORT)
+
+int xfs_inobt_walk(struct xfs_mount *mp, struct xfs_trans *tp,
+		xfs_ino_t startino, xfs_inobt_walk_fn inobt_walk_fn,
+		unsigned int max_prefetch, void *data);
+
 #endif /* __XFS_IWALK_H__ */

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [PATCH 01/10] xfs: create simplified inode walk function
  2019-06-04 21:49 ` [PATCH 01/10] xfs: create simplified inode walk function Darrick J. Wong
@ 2019-06-10 13:58   ` Brian Foster
  2019-06-10 16:59     ` Darrick J. Wong
  0 siblings, 1 reply; 46+ messages in thread
From: Brian Foster @ 2019-06-10 13:58 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Jun 04, 2019 at 02:49:34PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Create a new iterator function to simplify walking inodes in an XFS
> filesystem.  This new iterator will replace the existing open-coded
> walking that goes on in various places.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/Makefile                  |    1 
>  fs/xfs/libxfs/xfs_ialloc_btree.c |   31 +++
>  fs/xfs/libxfs/xfs_ialloc_btree.h |    3 
>  fs/xfs/xfs_itable.c              |    5 
>  fs/xfs/xfs_itable.h              |    8 +
>  fs/xfs/xfs_iwalk.c               |  400 ++++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_iwalk.h               |   18 ++
>  fs/xfs/xfs_trace.h               |   40 ++++
>  8 files changed, 502 insertions(+), 4 deletions(-)
>  create mode 100644 fs/xfs/xfs_iwalk.c
>  create mode 100644 fs/xfs/xfs_iwalk.h
> 
> 
...
> diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
> index ac4b65da4c2b..cb7eac2f51c0 100644
> --- a/fs/xfs/libxfs/xfs_ialloc_btree.c
> +++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
> @@ -564,6 +564,34 @@ xfs_inobt_max_size(
>  					XFS_INODES_PER_CHUNK);
>  }
>  
> +/* Read AGI and create inobt cursor. */
> +int
> +xfs_inobt_cur(
> +	struct xfs_mount	*mp,
> +	struct xfs_trans	*tp,
> +	xfs_agnumber_t		agno,
> +	struct xfs_btree_cur	**curpp,
> +	struct xfs_buf		**agi_bpp)
> +{
> +	struct xfs_btree_cur	*cur;
> +	int			error;
> +
> +	ASSERT(*agi_bpp == NULL);
> +

FYI, the xfs_inobt_count_blocks() caller doesn't initialize the pointer
according to the assert.

> +	error = xfs_ialloc_read_agi(mp, tp, agno, agi_bpp);
> +	if (error)
> +		return error;
> +
> +	cur = xfs_inobt_init_cursor(mp, tp, *agi_bpp, agno, XFS_BTNUM_INO);
> +	if (!cur) {
> +		xfs_trans_brelse(tp, *agi_bpp);
> +		*agi_bpp = NULL;
> +		return -ENOMEM;
> +	}
> +	*curpp = cur;
> +	return 0;
> +}
> +
>  static int
>  xfs_inobt_count_blocks(
>  	struct xfs_mount	*mp,
...
> diff --git a/fs/xfs/xfs_iwalk.c b/fs/xfs/xfs_iwalk.c
> new file mode 100644
> index 000000000000..3e6c06e69c75
> --- /dev/null
> +++ b/fs/xfs/xfs_iwalk.c
> @@ -0,0 +1,400 @@
...
> +/* Walk all inodes in a single AG, from @iwag->startino to the end of the AG. */
> +STATIC int
> +xfs_iwalk_ag(
> +	struct xfs_iwalk_ag		*iwag)
> +{
> +	struct xfs_mount		*mp = iwag->mp;
> +	struct xfs_trans		*tp = iwag->tp;
> +	struct xfs_buf			*agi_bp = NULL;
> +	struct xfs_btree_cur		*cur = NULL;
> +	xfs_agnumber_t			agno;
> +	xfs_agino_t			agino;
> +	int				has_more;
> +	int				error = 0;
> +
> +	/* Set up our cursor at the right place in the inode btree. */
> +	agno = XFS_INO_TO_AGNO(mp, iwag->startino);
> +	agino = XFS_INO_TO_AGINO(mp, iwag->startino);
> +	error = xfs_iwalk_ag_start(iwag, agno, agino, &cur, &agi_bp, &has_more);
> +
> +	while (!error && has_more) {
> +		struct xfs_inobt_rec_incore	*irec;
> +
> +		cond_resched();
> +
> +		/* Fetch the inobt record. */
> +		irec = &iwag->recs[iwag->nr_recs];
> +		error = xfs_inobt_get_rec(cur, irec, &has_more);
> +		if (error || !has_more)
> +			break;
> +
> +		/* No allocated inodes in this chunk; skip it. */
> +		if (irec->ir_freecount == irec->ir_count) {
> +			error = xfs_btree_increment(cur, 0, &has_more);
> +			if (error)
> +				break;
> +			continue;
> +		}
> +
> +		/*
> +		 * Start readahead for this inode chunk in anticipation of
> +		 * walking the inodes.
> +		 */
> +		xfs_bulkstat_ichunk_ra(mp, agno, irec);
> +
> +		/*
> +		 * If there's space in the buffer for more records, increment
> +		 * the btree cursor and grab more.
> +		 */
> +		if (++iwag->nr_recs < iwag->sz_recs) {
> +			error = xfs_btree_increment(cur, 0, &has_more);
> +			if (error || !has_more)
> +				break;
> +			continue;
> +		}
> +
> +		/*
> +		 * Otherwise, we need to save cursor state and run the callback
> +		 * function on the cached records.  The run_callbacks function
> +		 * is supposed to return a cursor pointing to the record where
> +		 * we would be if we had been able to increment like above.
> +		 */
> +		error = xfs_iwalk_run_callbacks(iwag, xfs_iwalk_ag_recs, agno,
> +				&cur, &agi_bp, &has_more);
> +	}
> +
> +	xfs_iwalk_del_inobt(tp, &cur, &agi_bp, error);
> +
> +	/* Walk any records left behind in the cache. */
> +	if (iwag->nr_recs == 0 || error)
> +		return error;
> +
> +	return xfs_iwalk_ag_recs(iwag);

Hmm, I find the above pattern to process the leftover records a bit
confusing because of how it is open coded. Could we find a way to reuse
xfs_iwalk_run_callbacks() in both cases so it looks more obvious? For
example, pass a flag to indicate whether the callback helper should
recreate the cursor for continued processing. FWIW, it looks like
has_more already reflects that state in the current logic above.

> +}
> +
> +/*
> + * Given the number of inodes to prefetch, set the number of inobt records that
> + * we cache in memory, which controls the number of inodes we try to read
> + * ahead.
> + *
> + * If no max prefetch was given, default to 4096 bytes' worth of inobt records;
> + * this should be plenty of inodes to read ahead.  This number (256 inobt
> + * records) was chosen so that the cache is never more than a single memory
> + * page.
> + */
> +static inline void
> +xfs_iwalk_set_prefetch(
> +	struct xfs_iwalk_ag	*iwag,
> +	unsigned int		max_prefetch)
> +{
> +	if (max_prefetch)
> +		iwag->sz_recs = round_up(max_prefetch, XFS_INODES_PER_CHUNK) /
> +					XFS_INODES_PER_CHUNK;
> +	else
> +		iwag->sz_recs = 4096 / sizeof(struct xfs_inobt_rec_incore);
> +

Perhaps this should use PAGE_SIZE or a related macro?

Brian

> +	/*
> +	 * Allocate enough space to prefetch at least two records so that we
> +	 * can cache both the inobt record where the iwalk started and the next
> +	 * record.  This simplifies the AG inode walk loop setup code.
> +	 */
> +	iwag->sz_recs = max_t(unsigned int, iwag->sz_recs, 2);
> +}
> +
> +/*
> + * Walk all inodes in the filesystem starting from @startino.  The @iwalk_fn
> + * will be called for each allocated inode, being passed the inode's number and
> + * @data.  @max_prefetch controls how many inobt records' worth of inodes we
> + * try to readahead.
> + */
> +int
> +xfs_iwalk(
> +	struct xfs_mount	*mp,
> +	struct xfs_trans	*tp,
> +	xfs_ino_t		startino,
> +	xfs_iwalk_fn		iwalk_fn,
> +	unsigned int		max_prefetch,
> +	void			*data)
> +{
> +	struct xfs_iwalk_ag	iwag = {
> +		.mp		= mp,
> +		.tp		= tp,
> +		.iwalk_fn	= iwalk_fn,
> +		.data		= data,
> +		.startino	= startino,
> +	};
> +	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, startino);
> +	int			error;
> +
> +	ASSERT(agno < mp->m_sb.sb_agcount);
> +
> +	xfs_iwalk_set_prefetch(&iwag, max_prefetch);
> +	error = xfs_iwalk_alloc(&iwag);
> +	if (error)
> +		return error;
> +
> +	for (; agno < mp->m_sb.sb_agcount; agno++) {
> +		error = xfs_iwalk_ag(&iwag);
> +		if (error)
> +			break;
> +		iwag.startino = XFS_AGINO_TO_INO(mp, agno + 1, 0);
> +	}
> +
> +	xfs_iwalk_free(&iwag);
> +	return error;
> +}
> diff --git a/fs/xfs/xfs_iwalk.h b/fs/xfs/xfs_iwalk.h
> new file mode 100644
> index 000000000000..45b1baabcd2d
> --- /dev/null
> +++ b/fs/xfs/xfs_iwalk.h
> @@ -0,0 +1,18 @@
> +// SPDX-License-Identifier: GPL-2.0+
> +/*
> + * Copyright (C) 2019 Oracle.  All Rights Reserved.
> + * Author: Darrick J. Wong <darrick.wong@oracle.com>
> + */
> +#ifndef __XFS_IWALK_H__
> +#define __XFS_IWALK_H__
> +
> +/* Walk all inodes in the filesystem starting from @startino. */
> +typedef int (*xfs_iwalk_fn)(struct xfs_mount *mp, struct xfs_trans *tp,
> +			    xfs_ino_t ino, void *data);
> +/* Return value (for xfs_iwalk_fn) that aborts the walk immediately. */
> +#define XFS_IWALK_ABORT	(1)
> +
> +int xfs_iwalk(struct xfs_mount *mp, struct xfs_trans *tp, xfs_ino_t startino,
> +		xfs_iwalk_fn iwalk_fn, unsigned int max_prefetch, void *data);
> +
> +#endif /* __XFS_IWALK_H__ */
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index 2464ea351f83..f9bb1d50bc0e 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -3516,6 +3516,46 @@ DEFINE_EVENT(xfs_inode_corrupt_class, name,	\
>  DEFINE_INODE_CORRUPT_EVENT(xfs_inode_mark_sick);
>  DEFINE_INODE_CORRUPT_EVENT(xfs_inode_mark_healthy);
>  
> +TRACE_EVENT(xfs_iwalk_ag,
> +	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
> +		 xfs_agino_t startino),
> +	TP_ARGS(mp, agno, startino),
> +	TP_STRUCT__entry(
> +		__field(dev_t, dev)
> +		__field(xfs_agnumber_t, agno)
> +		__field(xfs_agino_t, startino)
> +	),
> +	TP_fast_assign(
> +		__entry->dev = mp->m_super->s_dev;
> +		__entry->agno = agno;
> +		__entry->startino = startino;
> +	),
> +	TP_printk("dev %d:%d agno %d startino %u",
> +		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->agno,
> +		  __entry->startino)
> +)
> +
> +TRACE_EVENT(xfs_iwalk_ag_rec,
> +	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
> +		 struct xfs_inobt_rec_incore *irec),
> +	TP_ARGS(mp, agno, irec),
> +	TP_STRUCT__entry(
> +		__field(dev_t, dev)
> +		__field(xfs_agnumber_t, agno)
> +		__field(xfs_agino_t, startino)
> +		__field(uint64_t, freemask)
> +	),
> +	TP_fast_assign(
> +		__entry->dev = mp->m_super->s_dev;
> +		__entry->agno = agno;
> +		__entry->startino = irec->ir_startino;
> +		__entry->freemask = irec->ir_free;
> +	),
> +	TP_printk("dev %d:%d agno %d startino %u freemask 0x%llx",
> +		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->agno,
> +		  __entry->startino, __entry->freemask)
> +)
> +
>  #endif /* _TRACE_XFS_H */
>  
>  #undef TRACE_INCLUDE_PATH
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 02/10] xfs: convert quotacheck to use the new iwalk functions
  2019-06-04 21:49 ` [PATCH 02/10] xfs: convert quotacheck to use the new iwalk functions Darrick J. Wong
@ 2019-06-10 13:58   ` Brian Foster
  2019-06-10 17:10     ` Darrick J. Wong
  2019-06-11 23:23     ` Dave Chinner
  0 siblings, 2 replies; 46+ messages in thread
From: Brian Foster @ 2019-06-10 13:58 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, Dave Chinner

On Tue, Jun 04, 2019 at 02:49:40PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Convert quotacheck to use the new iwalk iterator to dig through the
> inodes.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> Reviewed-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_qm.c |   62 ++++++++++++++++++-------------------------------------
>  1 file changed, 20 insertions(+), 42 deletions(-)
> 
> 
> diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
> index aa6b6db3db0e..a5b2260406a8 100644
> --- a/fs/xfs/xfs_qm.c
> +++ b/fs/xfs/xfs_qm.c
...
> @@ -1136,20 +1135,18 @@ xfs_qm_dqusage_adjust(
>  	 * rootino must have its resources accounted for, not so with the quota
>  	 * inodes.
>  	 */
> -	if (xfs_is_quota_inode(&mp->m_sb, ino)) {
> -		*res = BULKSTAT_RV_NOTHING;
> -		return -EINVAL;
> -	}
> +	if (xfs_is_quota_inode(&mp->m_sb, ino))
> +		return 0;
>  
>  	/*
>  	 * We don't _need_ to take the ilock EXCL here because quotacheck runs
>  	 * at mount time and therefore nobody will be racing chown/chproj.
>  	 */
> -	error = xfs_iget(mp, NULL, ino, XFS_IGET_DONTCACHE, 0, &ip);
> -	if (error) {
> -		*res = BULKSTAT_RV_NOTHING;
> +	error = xfs_iget(mp, tp, ino, XFS_IGET_DONTCACHE, 0, &ip);

I was wondering if we should start using IGET_UNTRUSTED here, but I
guess we're 1.) protected by quotacheck context and 2.) have the same
record validity semantics as the existing bulkstat walker. LGTM:

Reviewed-by: Brian Foster <bfoster@redhat.com>

> +	if (error == -EINVAL || error == -ENOENT)
> +		return 0;
> +	if (error)
>  		return error;
> -	}
>  
>  	ASSERT(ip->i_delayed_blks == 0);
>  
> @@ -1157,7 +1154,7 @@ xfs_qm_dqusage_adjust(
>  		struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, XFS_DATA_FORK);
>  
>  		if (!(ifp->if_flags & XFS_IFEXTENTS)) {
> -			error = xfs_iread_extents(NULL, ip, XFS_DATA_FORK);
> +			error = xfs_iread_extents(tp, ip, XFS_DATA_FORK);
>  			if (error)
>  				goto error0;
>  		}
> @@ -1200,13 +1197,8 @@ xfs_qm_dqusage_adjust(
>  			goto error0;
>  	}
>  
> -	xfs_irele(ip);
> -	*res = BULKSTAT_RV_DIDONE;
> -	return 0;
> -
>  error0:
>  	xfs_irele(ip);
> -	*res = BULKSTAT_RV_GIVEUP;
>  	return error;
>  }
>  
> @@ -1270,18 +1262,13 @@ STATIC int
>  xfs_qm_quotacheck(
>  	xfs_mount_t	*mp)
>  {
> -	int			done, count, error, error2;
> -	xfs_ino_t		lastino;
> -	size_t			structsz;
> +	int			error, error2;
>  	uint			flags;
>  	LIST_HEAD		(buffer_list);
>  	struct xfs_inode	*uip = mp->m_quotainfo->qi_uquotaip;
>  	struct xfs_inode	*gip = mp->m_quotainfo->qi_gquotaip;
>  	struct xfs_inode	*pip = mp->m_quotainfo->qi_pquotaip;
>  
> -	count = INT_MAX;
> -	structsz = 1;
> -	lastino = 0;
>  	flags = 0;
>  
>  	ASSERT(uip || gip || pip);
> @@ -1318,18 +1305,9 @@ xfs_qm_quotacheck(
>  		flags |= XFS_PQUOTA_CHKD;
>  	}
>  
> -	do {
> -		/*
> -		 * Iterate thru all the inodes in the file system,
> -		 * adjusting the corresponding dquot counters in core.
> -		 */
> -		error = xfs_bulkstat(mp, &lastino, &count,
> -				     xfs_qm_dqusage_adjust,
> -				     structsz, NULL, &done);
> -		if (error)
> -			break;
> -
> -	} while (!done);
> +	error = xfs_iwalk(mp, NULL, 0, xfs_qm_dqusage_adjust, 0, NULL);
> +	if (error)
> +		goto error_return;
>  
>  	/*
>  	 * We've made all the changes that we need to make incore.  Flush them
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 03/10] xfs: bulkstat should copy lastip whenever userspace supplies one
  2019-06-04 21:49 ` [PATCH 03/10] xfs: bulkstat should copy lastip whenever userspace supplies one Darrick J. Wong
@ 2019-06-10 13:59   ` Brian Foster
  0 siblings, 0 replies; 46+ messages in thread
From: Brian Foster @ 2019-06-10 13:59 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Jun 04, 2019 at 02:49:47PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> When userspace passes in a @lastip pointer we should copy the results
> back, even if the @ocount pointer is NULL.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_ioctl.c   |   13 ++++++-------
>  fs/xfs/xfs_ioctl32.c |   13 ++++++-------
>  2 files changed, 12 insertions(+), 14 deletions(-)
> 
> 
> diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> index d7dfc13f30f5..5ffbdcff3dba 100644
> --- a/fs/xfs/xfs_ioctl.c
> +++ b/fs/xfs/xfs_ioctl.c
> @@ -768,14 +768,13 @@ xfs_ioc_bulkstat(
>  	if (error)
>  		return error;
>  
> -	if (bulkreq.ocount != NULL) {
> -		if (copy_to_user(bulkreq.lastip, &inlast,
> -						sizeof(xfs_ino_t)))
> -			return -EFAULT;
> +	if (bulkreq.lastip != NULL &&
> +	    copy_to_user(bulkreq.lastip, &inlast, sizeof(xfs_ino_t)))
> +		return -EFAULT;
>  
> -		if (copy_to_user(bulkreq.ocount, &count, sizeof(count)))
> -			return -EFAULT;
> -	}
> +	if (bulkreq.ocount != NULL &&
> +	    copy_to_user(bulkreq.ocount, &count, sizeof(count)))
> +		return -EFAULT;
>  
>  	return 0;
>  }
> diff --git a/fs/xfs/xfs_ioctl32.c b/fs/xfs/xfs_ioctl32.c
> index 614fc6886d24..814ffe6fbab7 100644
> --- a/fs/xfs/xfs_ioctl32.c
> +++ b/fs/xfs/xfs_ioctl32.c
> @@ -310,14 +310,13 @@ xfs_compat_ioc_bulkstat(
>  	if (error)
>  		return error;
>  
> -	if (bulkreq.ocount != NULL) {
> -		if (copy_to_user(bulkreq.lastip, &inlast,
> -						sizeof(xfs_ino_t)))
> -			return -EFAULT;
> +	if (bulkreq.lastip != NULL &&
> +	    copy_to_user(bulkreq.lastip, &inlast, sizeof(xfs_ino_t)))
> +		return -EFAULT;
>  
> -		if (copy_to_user(bulkreq.ocount, &count, sizeof(count)))
> -			return -EFAULT;
> -	}
> +	if (bulkreq.ocount != NULL &&
> +	    copy_to_user(bulkreq.ocount, &count, sizeof(count)))
> +		return -EFAULT;
>  
>  	return 0;
>  }
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 04/10] xfs: convert bulkstat to new iwalk infrastructure
  2019-06-04 21:49 ` [PATCH 04/10] xfs: convert bulkstat to new iwalk infrastructure Darrick J. Wong
@ 2019-06-10 14:02   ` Brian Foster
  2019-06-10 17:38     ` Darrick J. Wong
  0 siblings, 1 reply; 46+ messages in thread
From: Brian Foster @ 2019-06-10 14:02 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Jun 04, 2019 at 02:49:53PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Create a new ibulk structure incore to help us deal with bulk inode stat
> state tracking and then convert the bulkstat code to use the new iwalk
> iterator.  This disentangles inode walking from bulk stat control for
> simpler code and enables us to isolate the formatter functions to the
> ioctl handling code.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/xfs_ioctl.c   |   65 ++++++--
>  fs/xfs/xfs_ioctl.h   |    5 +
>  fs/xfs/xfs_ioctl32.c |   88 +++++------
>  fs/xfs/xfs_itable.c  |  407 ++++++++++++++------------------------------------
>  fs/xfs/xfs_itable.h  |   79 ++++------
>  5 files changed, 245 insertions(+), 399 deletions(-)
> 
> 
> diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> index 5ffbdcff3dba..43734901aeb9 100644
> --- a/fs/xfs/xfs_ioctl.c
> +++ b/fs/xfs/xfs_ioctl.c
...
> @@ -745,35 +757,54 @@ xfs_ioc_bulkstat(
>  	if (copy_from_user(&bulkreq, arg, sizeof(xfs_fsop_bulkreq_t)))
>  		return -EFAULT;
>  
> -	if (copy_from_user(&inlast, bulkreq.lastip, sizeof(__s64)))
> +	if (copy_from_user(&lastino, bulkreq.lastip, sizeof(__s64)))
>  		return -EFAULT;
>  
> -	if ((count = bulkreq.icount) <= 0)
> +	if (bulkreq.icount <= 0)
>  		return -EINVAL;
>  
>  	if (bulkreq.ubuffer == NULL)
>  		return -EINVAL;
>  
> -	if (cmd == XFS_IOC_FSINUMBERS)
> -		error = xfs_inumbers(mp, &inlast, &count,
> +	breq.ubuffer = bulkreq.ubuffer;
> +	breq.icount = bulkreq.icount;
> +
> +	/*
> +	 * FSBULKSTAT_SINGLE expects that *lastip contains the inode number
> +	 * that we want to stat.  However, FSINUMBERS and FSBULKSTAT expect
> +	 * that *lastip contains either zero or the number of the last inode to
> +	 * be examined by the previous call and return results starting with
> +	 * the next inode after that.  The new bulk request functions take the
> +	 * inode to start with, so we have to adjust the lastino/startino
> +	 * parameter to maintain correct function.
> +	 */

It's kind of difficult to tell what's new or old when just looking at
the code. The comment suggests FSINUMBERS and FSBULKSTAT use the same
interface wrt to lastip, but they aren't handled the same below. I take
it this is because xfs_inumbers() still has the same behavior whereas
xfs_bulkstat() has been changed to operate based on breq rather than
lastip..?

> +	if (cmd == XFS_IOC_FSINUMBERS) {
> +		int	count = breq.icount;
> +
> +		breq.startino = lastino;
> +		error = xfs_inumbers(mp, &breq.startino, &count,
>  					bulkreq.ubuffer, xfs_inumbers_fmt);
> -	else if (cmd == XFS_IOC_FSBULKSTAT_SINGLE)
> -		error = xfs_bulkstat_one(mp, inlast, bulkreq.ubuffer,
> -					sizeof(xfs_bstat_t), NULL, &done);
> -	else	/* XFS_IOC_FSBULKSTAT */
> -		error = xfs_bulkstat(mp, &inlast, &count, xfs_bulkstat_one,
> -				     sizeof(xfs_bstat_t), bulkreq.ubuffer,
> -				     &done);
> +		breq.ocount = count;
> +		lastino = breq.startino;
> +	} else if (cmd == XFS_IOC_FSBULKSTAT_SINGLE) {
> +		breq.startino = lastino;
> +		error = xfs_bulkstat_one(&breq, xfs_bulkstat_one_fmt);
> +		lastino = breq.startino;
> +	} else {	/* XFS_IOC_FSBULKSTAT */
> +		breq.startino = lastino + 1;
> +		error = xfs_bulkstat(&breq, xfs_bulkstat_one_fmt);
> +		lastino = breq.startino - 1;
> +	}
>  
...
> diff --git a/fs/xfs/xfs_ioctl32.c b/fs/xfs/xfs_ioctl32.c
> index 814ffe6fbab7..add15819daf3 100644
> --- a/fs/xfs/xfs_ioctl32.c
> +++ b/fs/xfs/xfs_ioctl32.c
...
> @@ -284,38 +263,57 @@ xfs_compat_ioc_bulkstat(
>  		return -EFAULT;
>  	bulkreq.ocount = compat_ptr(addr);
>  
> -	if (copy_from_user(&inlast, bulkreq.lastip, sizeof(__s64)))
> +	if (copy_from_user(&lastino, bulkreq.lastip, sizeof(__s64)))
>  		return -EFAULT;
> +	breq.startino = lastino + 1;
>  
> -	if ((count = bulkreq.icount) <= 0)
> +	if (bulkreq.icount <= 0)
>  		return -EINVAL;
>  
>  	if (bulkreq.ubuffer == NULL)
>  		return -EINVAL;
>  
> +	breq.ubuffer = bulkreq.ubuffer;
> +	breq.icount = bulkreq.icount;
> +
> +	/*
> +	 * FSBULKSTAT_SINGLE expects that *lastip contains the inode number
> +	 * that we want to stat.  However, FSINUMBERS and FSBULKSTAT expect
> +	 * that *lastip contains either zero or the number of the last inode to
> +	 * be examined by the previous call and return results starting with
> +	 * the next inode after that.  The new bulk request functions take the
> +	 * inode to start with, so we have to adjust the lastino/startino
> +	 * parameter to maintain correct function.
> +	 */

(Same comment here.)

>  	if (cmd == XFS_IOC_FSINUMBERS_32) {
> -		error = xfs_inumbers(mp, &inlast, &count,
> +		int	count = breq.icount;
> +
> +		breq.startino = lastino;
> +		error = xfs_inumbers(mp, &breq.startino, &count,
>  				bulkreq.ubuffer, inumbers_func);
> +		breq.ocount = count;
> +		lastino = breq.startino;
>  	} else if (cmd == XFS_IOC_FSBULKSTAT_SINGLE_32) {
> -		int res;
> -
> -		error = bs_one_func(mp, inlast, bulkreq.ubuffer,
> -				bs_one_size, NULL, &res);
> +		breq.startino = lastino;
> +		error = xfs_bulkstat_one(&breq, bs_one_func);
> +		lastino = breq.startino;
>  	} else if (cmd == XFS_IOC_FSBULKSTAT_32) {
> -		error = xfs_bulkstat(mp, &inlast, &count,
> -			bs_one_func, bs_one_size,
> -			bulkreq.ubuffer, &done);
> -	} else
> +		breq.startino = lastino + 1;
> +		error = xfs_bulkstat(&breq, bs_one_func);
> +		lastino = breq.startino - 1;
> +	} else {
>  		error = -EINVAL;
> +	}
>  	if (error)
>  		return error;
>  
> +	lastino = breq.startino - 1;

Should this be here?

>  	if (bulkreq.lastip != NULL &&
> -	    copy_to_user(bulkreq.lastip, &inlast, sizeof(xfs_ino_t)))
> +	    copy_to_user(bulkreq.lastip, &lastino, sizeof(xfs_ino_t)))
>  		return -EFAULT;
>  
>  	if (bulkreq.ocount != NULL &&
> -	    copy_to_user(bulkreq.ocount, &count, sizeof(count)))
> +	    copy_to_user(bulkreq.ocount, &breq.ocount, sizeof(__s32)))
>  		return -EFAULT;
>  
>  	return 0;
> diff --git a/fs/xfs/xfs_itable.c b/fs/xfs/xfs_itable.c
> index 3ca1c454afe6..87c597ea1df7 100644
> --- a/fs/xfs/xfs_itable.c
> +++ b/fs/xfs/xfs_itable.c
> @@ -22,37 +22,63 @@
>  #include "xfs_iwalk.h"
>  
>  /*
> - * Return stat information for one inode.
> - * Return 0 if ok, else errno.
> + * Bulk Stat
> + * =========
> + *
> + * Use the inode walking functions to fill out struct xfs_bstat for every
> + * allocated inode, then pass the stat information to some externally provided
> + * iteration function.
>   */
> -int
> +
> +struct xfs_bstat_chunk {
> +	bulkstat_one_fmt_pf	formatter;
> +	struct xfs_ibulk	*breq;
> +};
> +
> +/*
> + * Fill out the bulkstat info for a single inode and report it somewhere.
> + *
> + * bc->breq->lastino is effectively the inode cursor as we walk through the
> + * filesystem.  Therefore, we update it any time we need to move the cursor
> + * forward, regardless of whether or not we're sending any bstat information
> + * back to userspace.  If the inode is internal metadata or, has been freed
> + * out from under us, we just simply keep going.
> + *
> + * However, if any other type of error happens we want to stop right where we
> + * are so that userspace will call back with exact number of the bad inode and
> + * we can send back an error code.
> + *
> + * Note that if the formatter tells us there's no space left in the buffer we
> + * move the cursor forward and abort the walk.
> + */
> +STATIC int
>  xfs_bulkstat_one_int(
> -	struct xfs_mount	*mp,		/* mount point for filesystem */
> -	xfs_ino_t		ino,		/* inode to get data for */
> -	void __user		*buffer,	/* buffer to place output in */
> -	int			ubsize,		/* size of buffer */
> -	bulkstat_one_fmt_pf	formatter,	/* formatter, copy to user */
> -	int			*ubused,	/* bytes used by me */
> -	int			*stat)		/* BULKSTAT_RV_... */
> +	struct xfs_mount	*mp,
> +	struct xfs_trans	*tp,
> +	xfs_ino_t		ino,
> +	void			*data)
>  {

Any reason this function takes an 'ino' param considering it's sourced
from breq->startino and we bump that value from within this function?
The latter seems slightly misplaced to me since it doesn't appear to
control the iteration.

It also looks like we bump startino in almost all cases. Exceptions are
memory allocation failure of the buffer and formatter error. Hmm.. could
we lift the buffer to the bc and reuse it to avoid that error entirely
(along with repeated allocs/frees)? From there, perhaps we could lift
the ->startino update to the callers based on something like error !=
-EFAULT..? (Alternatively, the caller could update it first and then
walk it back if error == -EFAULT).

> +	struct xfs_bstat_chunk	*bc = data;
>  	struct xfs_icdinode	*dic;		/* dinode core info pointer */
>  	struct xfs_inode	*ip;		/* incore inode pointer */
>  	struct inode		*inode;
>  	struct xfs_bstat	*buf;		/* return buffer */
>  	int			error = 0;	/* error value */
>  
> -	*stat = BULKSTAT_RV_NOTHING;
> -
> -	if (!buffer || xfs_internal_inum(mp, ino))
> +	if (xfs_internal_inum(mp, ino)) {
> +		bc->breq->startino = ino + 1;
>  		return -EINVAL;
> +	}
>  
>  	buf = kmem_zalloc(sizeof(*buf), KM_SLEEP | KM_MAYFAIL);
>  	if (!buf)
>  		return -ENOMEM;
>  
> -	error = xfs_iget(mp, NULL, ino,
> +	error = xfs_iget(mp, tp, ino,
>  			 (XFS_IGET_DONTCACHE | XFS_IGET_UNTRUSTED),
>  			 XFS_ILOCK_SHARED, &ip);
> +	if (error == -ENOENT || error == -EINVAL)
> +		bc->breq->startino = ino + 1;
>  	if (error)
>  		goto out_free;
>  
> @@ -119,43 +145,45 @@ xfs_bulkstat_one_int(
>  	xfs_iunlock(ip, XFS_ILOCK_SHARED);
>  	xfs_irele(ip);
>  
> -	error = formatter(buffer, ubsize, ubused, buf);
> -	if (!error)
> -		*stat = BULKSTAT_RV_DIDONE;
> -
> - out_free:
> +	error = bc->formatter(bc->breq, buf);
> +	switch (error) {
> +	case XFS_IBULK_BUFFER_FULL:
> +		error = XFS_IWALK_ABORT;
> +		/* fall through */
> +	case 0:
> +		bc->breq->startino = ino + 1;
> +		break;
> +	}
> +out_free:
>  	kmem_free(buf);
>  	return error;
>  }
>  
> -/* Return 0 on success or positive error */
> -STATIC int
> -xfs_bulkstat_one_fmt(
> -	void			__user *ubuffer,
> -	int			ubsize,
> -	int			*ubused,
> -	const xfs_bstat_t	*buffer)
> -{
> -	if (ubsize < sizeof(*buffer))
> -		return -ENOMEM;
> -	if (copy_to_user(ubuffer, buffer, sizeof(*buffer)))
> -		return -EFAULT;
> -	if (ubused)
> -		*ubused = sizeof(*buffer);
> -	return 0;
> -}
> -
> +/* Bulkstat a single inode. */
>  int
>  xfs_bulkstat_one(
> -	xfs_mount_t	*mp,		/* mount point for filesystem */
> -	xfs_ino_t	ino,		/* inode number to get data for */
> -	void		__user *buffer,	/* buffer to place output in */
> -	int		ubsize,		/* size of buffer */
> -	int		*ubused,	/* bytes used by me */
> -	int		*stat)		/* BULKSTAT_RV_... */
> +	struct xfs_ibulk	*breq,
> +	bulkstat_one_fmt_pf	formatter)
>  {
> -	return xfs_bulkstat_one_int(mp, ino, buffer, ubsize,
> -				    xfs_bulkstat_one_fmt, ubused, stat);
> +	struct xfs_bstat_chunk	bc = {
> +		.formatter	= formatter,
> +		.breq		= breq,
> +	};
> +	int			error;
> +
> +	breq->icount = 1;
> +	breq->ocount = 0;
> +

So ->icount is already set by the caller based on user input. I'd guess
this is set here to guarantee a single cycle in the event that the user
provided value is >1, but that seems unnecessary since we're calling the
internal helper to handle a single inode.

If we want to set ->icount for whatever reason, can we do it in the
caller where it's more obvious? That also shows that the ->ocount init
is unnecessary since the whole structure is initialized in the caller.

> +	error = xfs_bulkstat_one_int(breq->mp, NULL, breq->startino, &bc);
> +
> +	/*
> +	 * If we reported one inode to userspace then we abort because we hit
> +	 * the end of the buffer.  Don't leak that back to userspace.
> +	 */
> +	if (error == XFS_IWALK_ABORT)
> +		error = 0;
> +
> +	return error;
>  }
>  
>  /*
...
> diff --git a/fs/xfs/xfs_itable.h b/fs/xfs/xfs_itable.h
> index 369e3f159d4e..366d391eb11f 100644
> --- a/fs/xfs/xfs_itable.h
> +++ b/fs/xfs/xfs_itable.h
> @@ -5,63 +5,46 @@
>  #ifndef __XFS_ITABLE_H__
>  #define	__XFS_ITABLE_H__
>  
> -/*
> - * xfs_bulkstat() is used to fill in xfs_bstat structures as well as dm_stat
> - * structures (by the dmi library). This is a pointer to a formatter function
> - * that will iget the inode and fill in the appropriate structure.
> - * see xfs_bulkstat_one() and xfs_dm_bulkstat_one() in dmapi_xfs.c
> - */
> -typedef int (*bulkstat_one_pf)(struct xfs_mount	*mp,
> -			       xfs_ino_t	ino,
> -			       void		__user *buffer,
> -			       int		ubsize,
> -			       int		*ubused,
> -			       int		*stat);
> +/* In-memory representation of a userspace request for batch inode data. */
> +struct xfs_ibulk {
> +	struct xfs_mount	*mp;
> +	void __user		*ubuffer; /* user output buffer */
> +	xfs_ino_t		startino; /* start with this inode */
> +	unsigned int		icount;   /* number of elements in ubuffer */
> +	unsigned int		ocount;   /* number of records returned */
> +};
> +
> +/* Return value that means we want to abort the walk. */
> +#define XFS_IBULK_ABORT		(XFS_IWALK_ABORT)
> +
> +/* Return value that means the formatting buffer is now full. */
> +#define XFS_IBULK_BUFFER_FULL	(2)
>  

It might be wise to define this such that it's guaranteed not to be the
same as the abort value, since that is defined externally to this
header. IBULK_ABORT + 1 perhaps?

Brian

>  /*
> - * Values for stat return value.
> + * Advance the user buffer pointer by one record of the given size.  If the
> + * buffer is now full, return the appropriate error code.
>   */
> -#define BULKSTAT_RV_NOTHING	0
> -#define BULKSTAT_RV_DIDONE	1
> -#define BULKSTAT_RV_GIVEUP	2
> +static inline int
> +xfs_ibulk_advance(
> +	struct xfs_ibulk	*breq,
> +	size_t			bytes)
> +{
> +	char __user		*b = breq->ubuffer;
> +
> +	breq->ubuffer = b + bytes;
> +	breq->ocount++;
> +	return breq->ocount == breq->icount ? XFS_IBULK_BUFFER_FULL : 0;
> +}
>  
>  /*
>   * Return stat information in bulk (by-inode) for the filesystem.
>   */
> -int					/* error status */
> -xfs_bulkstat(
> -	xfs_mount_t	*mp,		/* mount point for filesystem */
> -	xfs_ino_t	*lastino,	/* last inode returned */
> -	int		*count,		/* size of buffer/count returned */
> -	bulkstat_one_pf formatter,	/* func that'd fill a single buf */
> -	size_t		statstruct_size,/* sizeof struct that we're filling */
> -	char		__user *ubuffer,/* buffer with inode stats */
> -	int		*done);		/* 1 if there are more stats to get */
>  
> -typedef int (*bulkstat_one_fmt_pf)(  /* used size in bytes or negative error */
> -	void			__user *ubuffer, /* buffer to write to */
> -	int			ubsize,		 /* remaining user buffer sz */
> -	int			*ubused,	 /* bytes used by formatter */
> -	const xfs_bstat_t	*buffer);        /* buffer to read from */
> +typedef int (*bulkstat_one_fmt_pf)(struct xfs_ibulk *breq,
> +		const struct xfs_bstat *bstat);
>  
> -int
> -xfs_bulkstat_one_int(
> -	xfs_mount_t		*mp,
> -	xfs_ino_t		ino,
> -	void			__user *buffer,
> -	int			ubsize,
> -	bulkstat_one_fmt_pf	formatter,
> -	int			*ubused,
> -	int			*stat);
> -
> -int
> -xfs_bulkstat_one(
> -	xfs_mount_t		*mp,
> -	xfs_ino_t		ino,
> -	void			__user *buffer,
> -	int			ubsize,
> -	int			*ubused,
> -	int			*stat);
> +int xfs_bulkstat_one(struct xfs_ibulk *breq, bulkstat_one_fmt_pf formatter);
> +int xfs_bulkstat(struct xfs_ibulk *breq, bulkstat_one_fmt_pf formatter);
>  
>  typedef int (*inumbers_fmt_pf)(
>  	void			__user *ubuffer, /* buffer to write to */
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 05/10] xfs: move bulkstat ichunk helpers to iwalk code
  2019-06-04 21:49 ` [PATCH 05/10] xfs: move bulkstat ichunk helpers to iwalk code Darrick J. Wong
@ 2019-06-10 14:02   ` Brian Foster
  0 siblings, 0 replies; 46+ messages in thread
From: Brian Foster @ 2019-06-10 14:02 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Jun 04, 2019 at 02:49:59PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Now that we've reworked the bulkstat code to use iwalk, we can move the
> old bulkstat ichunk helpers to xfs_iwalk.c.  No functional changes here.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_itable.c |   93 --------------------------------------------------
>  fs/xfs/xfs_itable.h |    8 ----
>  fs/xfs/xfs_iwalk.c  |   95 ++++++++++++++++++++++++++++++++++++++++++++++++++-
>  3 files changed, 93 insertions(+), 103 deletions(-)
> 
> 
> diff --git a/fs/xfs/xfs_itable.c b/fs/xfs/xfs_itable.c
> index 87c597ea1df7..06abe5c9c0ee 100644
> --- a/fs/xfs/xfs_itable.c
> +++ b/fs/xfs/xfs_itable.c
> @@ -186,99 +186,6 @@ xfs_bulkstat_one(
>  	return error;
>  }
>  
> -/*
> - * Loop over all clusters in a chunk for a given incore inode allocation btree
> - * record.  Do a readahead if there are any allocated inodes in that cluster.
> - */
> -void
> -xfs_bulkstat_ichunk_ra(
> -	struct xfs_mount		*mp,
> -	xfs_agnumber_t			agno,
> -	struct xfs_inobt_rec_incore	*irec)
> -{
> -	struct xfs_ino_geometry		*igeo = M_IGEO(mp);
> -	xfs_agblock_t			agbno;
> -	struct blk_plug			plug;
> -	int				i;	/* inode chunk index */
> -
> -	agbno = XFS_AGINO_TO_AGBNO(mp, irec->ir_startino);
> -
> -	blk_start_plug(&plug);
> -	for (i = 0;
> -	     i < XFS_INODES_PER_CHUNK;
> -	     i += igeo->inodes_per_cluster,
> -			agbno += igeo->blocks_per_cluster) {
> -		if (xfs_inobt_maskn(i, igeo->inodes_per_cluster) &
> -		    ~irec->ir_free) {
> -			xfs_btree_reada_bufs(mp, agno, agbno,
> -					igeo->blocks_per_cluster,
> -					&xfs_inode_buf_ops);
> -		}
> -	}
> -	blk_finish_plug(&plug);
> -}
> -
> -/*
> - * Lookup the inode chunk that the given inode lives in and then get the record
> - * if we found the chunk.  If the inode was not the last in the chunk and there
> - * are some left allocated, update the data for the pointed-to record as well as
> - * return the count of grabbed inodes.
> - */
> -int
> -xfs_bulkstat_grab_ichunk(
> -	struct xfs_btree_cur		*cur,	/* btree cursor */
> -	xfs_agino_t			agino,	/* starting inode of chunk */
> -	int				*icount,/* return # of inodes grabbed */
> -	struct xfs_inobt_rec_incore	*irec)	/* btree record */
> -{
> -	int				idx;	/* index into inode chunk */
> -	int				stat;
> -	int				error = 0;
> -
> -	/* Lookup the inode chunk that this inode lives in */
> -	error = xfs_inobt_lookup(cur, agino, XFS_LOOKUP_LE, &stat);
> -	if (error)
> -		return error;
> -	if (!stat) {
> -		*icount = 0;
> -		return error;
> -	}
> -
> -	/* Get the record, should always work */
> -	error = xfs_inobt_get_rec(cur, irec, &stat);
> -	if (error)
> -		return error;
> -	XFS_WANT_CORRUPTED_RETURN(cur->bc_mp, stat == 1);
> -
> -	/* Check if the record contains the inode in request */
> -	if (irec->ir_startino + XFS_INODES_PER_CHUNK <= agino) {
> -		*icount = 0;
> -		return 0;
> -	}
> -
> -	idx = agino - irec->ir_startino + 1;
> -	if (idx < XFS_INODES_PER_CHUNK &&
> -	    (xfs_inobt_maskn(idx, XFS_INODES_PER_CHUNK - idx) & ~irec->ir_free)) {
> -		int	i;
> -
> -		/* We got a right chunk with some left inodes allocated at it.
> -		 * Grab the chunk record.  Mark all the uninteresting inodes
> -		 * free -- because they're before our start point.
> -		 */
> -		for (i = 0; i < idx; i++) {
> -			if (XFS_INOBT_MASK(i) & ~irec->ir_free)
> -				irec->ir_freecount++;
> -		}
> -
> -		irec->ir_free |= xfs_inobt_maskn(0, idx);
> -		*icount = irec->ir_count - irec->ir_freecount;
> -	}
> -
> -	return 0;
> -}
> -
> -#define XFS_BULKSTAT_UBLEFT(ubleft)	((ubleft) >= statstruct_size)
> -
>  static int
>  xfs_bulkstat_iwalk(
>  	struct xfs_mount	*mp,
> diff --git a/fs/xfs/xfs_itable.h b/fs/xfs/xfs_itable.h
> index 366d391eb11f..a2562fe8d282 100644
> --- a/fs/xfs/xfs_itable.h
> +++ b/fs/xfs/xfs_itable.h
> @@ -67,12 +67,4 @@ xfs_inumbers(
>  	void			__user *buffer, /* buffer with inode info */
>  	inumbers_fmt_pf		formatter);
>  
> -/* Temporarily needed while we refactor functions. */
> -struct xfs_btree_cur;
> -struct xfs_inobt_rec_incore;
> -void xfs_bulkstat_ichunk_ra(struct xfs_mount *mp, xfs_agnumber_t agno,
> -		struct xfs_inobt_rec_incore *irec);
> -int xfs_bulkstat_grab_ichunk(struct xfs_btree_cur *cur, xfs_agino_t agino,
> -		int *icount, struct xfs_inobt_rec_incore *irec);
> -
>  #endif	/* __XFS_ITABLE_H__ */
> diff --git a/fs/xfs/xfs_iwalk.c b/fs/xfs/xfs_iwalk.c
> index 3e6c06e69c75..bef0c4907781 100644
> --- a/fs/xfs/xfs_iwalk.c
> +++ b/fs/xfs/xfs_iwalk.c
> @@ -66,6 +66,97 @@ struct xfs_iwalk_ag {
>  	void				*data;
>  };
>  
> +/*
> + * Loop over all clusters in a chunk for a given incore inode allocation btree
> + * record.  Do a readahead if there are any allocated inodes in that cluster.
> + */
> +STATIC void
> +xfs_iwalk_ichunk_ra(
> +	struct xfs_mount		*mp,
> +	xfs_agnumber_t			agno,
> +	struct xfs_inobt_rec_incore	*irec)
> +{
> +	struct xfs_ino_geometry		*igeo = M_IGEO(mp);
> +	xfs_agblock_t			agbno;
> +	struct blk_plug			plug;
> +	int				i;	/* inode chunk index */
> +
> +	agbno = XFS_AGINO_TO_AGBNO(mp, irec->ir_startino);
> +
> +	blk_start_plug(&plug);
> +	for (i = 0;
> +	     i < XFS_INODES_PER_CHUNK;
> +	     i += igeo->inodes_per_cluster,
> +			agbno += igeo->blocks_per_cluster) {
> +		if (xfs_inobt_maskn(i, igeo->inodes_per_cluster) &
> +		    ~irec->ir_free) {
> +			xfs_btree_reada_bufs(mp, agno, agbno,
> +					igeo->blocks_per_cluster,
> +					&xfs_inode_buf_ops);
> +		}
> +	}
> +	blk_finish_plug(&plug);
> +}
> +
> +/*
> + * Lookup the inode chunk that the given inode lives in and then get the record
> + * if we found the chunk.  If the inode was not the last in the chunk and there
> + * are some left allocated, update the data for the pointed-to record as well as
> + * return the count of grabbed inodes.
> + */
> +STATIC int
> +xfs_iwalk_grab_ichunk(
> +	struct xfs_btree_cur		*cur,	/* btree cursor */
> +	xfs_agino_t			agino,	/* starting inode of chunk */
> +	int				*icount,/* return # of inodes grabbed */
> +	struct xfs_inobt_rec_incore	*irec)	/* btree record */
> +{
> +	int				idx;	/* index into inode chunk */
> +	int				stat;
> +	int				error = 0;
> +
> +	/* Lookup the inode chunk that this inode lives in */
> +	error = xfs_inobt_lookup(cur, agino, XFS_LOOKUP_LE, &stat);
> +	if (error)
> +		return error;
> +	if (!stat) {
> +		*icount = 0;
> +		return error;
> +	}
> +
> +	/* Get the record, should always work */
> +	error = xfs_inobt_get_rec(cur, irec, &stat);
> +	if (error)
> +		return error;
> +	XFS_WANT_CORRUPTED_RETURN(cur->bc_mp, stat == 1);
> +
> +	/* Check if the record contains the inode in request */
> +	if (irec->ir_startino + XFS_INODES_PER_CHUNK <= agino) {
> +		*icount = 0;
> +		return 0;
> +	}
> +
> +	idx = agino - irec->ir_startino + 1;
> +	if (idx < XFS_INODES_PER_CHUNK &&
> +	    (xfs_inobt_maskn(idx, XFS_INODES_PER_CHUNK - idx) & ~irec->ir_free)) {
> +		int	i;
> +
> +		/* We got a right chunk with some left inodes allocated at it.
> +		 * Grab the chunk record.  Mark all the uninteresting inodes
> +		 * free -- because they're before our start point.
> +		 */
> +		for (i = 0; i < idx; i++) {
> +			if (XFS_INOBT_MASK(i) & ~irec->ir_free)
> +				irec->ir_freecount++;
> +		}
> +
> +		irec->ir_free |= xfs_inobt_maskn(0, idx);
> +		*icount = irec->ir_count - irec->ir_freecount;
> +	}
> +
> +	return 0;
> +}
> +
>  /* Allocate memory for a walk. */
>  STATIC int
>  xfs_iwalk_alloc(
> @@ -190,7 +281,7 @@ xfs_iwalk_ag_start(
>  	 * We require a lookup cache of at least two elements so that we don't
>  	 * have to deal with tearing down the cursor to walk the records.
>  	 */
> -	error = xfs_bulkstat_grab_ichunk(*curpp, agino - 1, &icount,
> +	error = xfs_iwalk_grab_ichunk(*curpp, agino - 1, &icount,
>  			&iwag->recs[iwag->nr_recs]);
>  	if (error)
>  		return error;
> @@ -295,7 +386,7 @@ xfs_iwalk_ag(
>  		 * Start readahead for this inode chunk in anticipation of
>  		 * walking the inodes.
>  		 */
> -		xfs_bulkstat_ichunk_ra(mp, agno, irec);
> +		xfs_iwalk_ichunk_ra(mp, agno, irec);
>  
>  		/*
>  		 * If there's space in the buffer for more records, increment
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 01/10] xfs: create simplified inode walk function
  2019-06-10 13:58   ` Brian Foster
@ 2019-06-10 16:59     ` Darrick J. Wong
  2019-06-10 17:55       ` Brian Foster
  0 siblings, 1 reply; 46+ messages in thread
From: Darrick J. Wong @ 2019-06-10 16:59 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Mon, Jun 10, 2019 at 09:58:19AM -0400, Brian Foster wrote:
> On Tue, Jun 04, 2019 at 02:49:34PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Create a new iterator function to simplify walking inodes in an XFS
> > filesystem.  This new iterator will replace the existing open-coded
> > walking that goes on in various places.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/Makefile                  |    1 
> >  fs/xfs/libxfs/xfs_ialloc_btree.c |   31 +++
> >  fs/xfs/libxfs/xfs_ialloc_btree.h |    3 
> >  fs/xfs/xfs_itable.c              |    5 
> >  fs/xfs/xfs_itable.h              |    8 +
> >  fs/xfs/xfs_iwalk.c               |  400 ++++++++++++++++++++++++++++++++++++++
> >  fs/xfs/xfs_iwalk.h               |   18 ++
> >  fs/xfs/xfs_trace.h               |   40 ++++
> >  8 files changed, 502 insertions(+), 4 deletions(-)
> >  create mode 100644 fs/xfs/xfs_iwalk.c
> >  create mode 100644 fs/xfs/xfs_iwalk.h
> > 
> > 
> ...
> > diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
> > index ac4b65da4c2b..cb7eac2f51c0 100644
> > --- a/fs/xfs/libxfs/xfs_ialloc_btree.c
> > +++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
> > @@ -564,6 +564,34 @@ xfs_inobt_max_size(
> >  					XFS_INODES_PER_CHUNK);
> >  }
> >  
> > +/* Read AGI and create inobt cursor. */
> > +int
> > +xfs_inobt_cur(
> > +	struct xfs_mount	*mp,
> > +	struct xfs_trans	*tp,
> > +	xfs_agnumber_t		agno,
> > +	struct xfs_btree_cur	**curpp,
> > +	struct xfs_buf		**agi_bpp)
> > +{
> > +	struct xfs_btree_cur	*cur;
> > +	int			error;
> > +
> > +	ASSERT(*agi_bpp == NULL);
> > +
> 
> FYI, the xfs_inobt_count_blocks() caller doesn't initialize the pointer
> according to the assert.

AgAGkgjwepth, there's a gcc plugin that makes all uninitialized stack
variables zero now, and so I never see these things anymore. :(

Thanks for catching this.

> 
> > +	error = xfs_ialloc_read_agi(mp, tp, agno, agi_bpp);
> > +	if (error)
> > +		return error;
> > +
> > +	cur = xfs_inobt_init_cursor(mp, tp, *agi_bpp, agno, XFS_BTNUM_INO);
> > +	if (!cur) {
> > +		xfs_trans_brelse(tp, *agi_bpp);
> > +		*agi_bpp = NULL;
> > +		return -ENOMEM;
> > +	}
> > +	*curpp = cur;
> > +	return 0;
> > +}
> > +
> >  static int
> >  xfs_inobt_count_blocks(
> >  	struct xfs_mount	*mp,
> ...
> > diff --git a/fs/xfs/xfs_iwalk.c b/fs/xfs/xfs_iwalk.c
> > new file mode 100644
> > index 000000000000..3e6c06e69c75
> > --- /dev/null
> > +++ b/fs/xfs/xfs_iwalk.c
> > @@ -0,0 +1,400 @@
> ...
> > +/* Walk all inodes in a single AG, from @iwag->startino to the end of the AG. */
> > +STATIC int
> > +xfs_iwalk_ag(
> > +	struct xfs_iwalk_ag		*iwag)
> > +{
> > +	struct xfs_mount		*mp = iwag->mp;
> > +	struct xfs_trans		*tp = iwag->tp;
> > +	struct xfs_buf			*agi_bp = NULL;
> > +	struct xfs_btree_cur		*cur = NULL;
> > +	xfs_agnumber_t			agno;
> > +	xfs_agino_t			agino;
> > +	int				has_more;
> > +	int				error = 0;
> > +
> > +	/* Set up our cursor at the right place in the inode btree. */
> > +	agno = XFS_INO_TO_AGNO(mp, iwag->startino);
> > +	agino = XFS_INO_TO_AGINO(mp, iwag->startino);
> > +	error = xfs_iwalk_ag_start(iwag, agno, agino, &cur, &agi_bp, &has_more);
> > +
> > +	while (!error && has_more) {
> > +		struct xfs_inobt_rec_incore	*irec;
> > +
> > +		cond_resched();
> > +
> > +		/* Fetch the inobt record. */
> > +		irec = &iwag->recs[iwag->nr_recs];
> > +		error = xfs_inobt_get_rec(cur, irec, &has_more);
> > +		if (error || !has_more)
> > +			break;
> > +
> > +		/* No allocated inodes in this chunk; skip it. */
> > +		if (irec->ir_freecount == irec->ir_count) {
> > +			error = xfs_btree_increment(cur, 0, &has_more);
> > +			if (error)
> > +				break;
> > +			continue;
> > +		}
> > +
> > +		/*
> > +		 * Start readahead for this inode chunk in anticipation of
> > +		 * walking the inodes.
> > +		 */
> > +		xfs_bulkstat_ichunk_ra(mp, agno, irec);
> > +
> > +		/*
> > +		 * If there's space in the buffer for more records, increment
> > +		 * the btree cursor and grab more.
> > +		 */
> > +		if (++iwag->nr_recs < iwag->sz_recs) {
> > +			error = xfs_btree_increment(cur, 0, &has_more);
> > +			if (error || !has_more)
> > +				break;
> > +			continue;
> > +		}
> > +
> > +		/*
> > +		 * Otherwise, we need to save cursor state and run the callback
> > +		 * function on the cached records.  The run_callbacks function
> > +		 * is supposed to return a cursor pointing to the record where
> > +		 * we would be if we had been able to increment like above.
> > +		 */
> > +		error = xfs_iwalk_run_callbacks(iwag, xfs_iwalk_ag_recs, agno,
> > +				&cur, &agi_bp, &has_more);
> > +	}
> > +
> > +	xfs_iwalk_del_inobt(tp, &cur, &agi_bp, error);
> > +
> > +	/* Walk any records left behind in the cache. */
> > +	if (iwag->nr_recs == 0 || error)
> > +		return error;
> > +
> > +	return xfs_iwalk_ag_recs(iwag);
> 
> Hmm, I find the above pattern to process the leftover records a bit
> confusing because of how it is open coded. Could we find a way to reuse
> xfs_iwalk_run_callbacks() in both cases so it looks more obvious? For
> example, pass a flag to indicate whether the callback helper should
> recreate the cursor for continued processing. FWIW, it looks like
> has_more already reflects that state in the current logic above.

Yeah, I think this can be done without making the function too
incohesive.

> > +}
> > +
> > +/*
> > + * Given the number of inodes to prefetch, set the number of inobt records that
> > + * we cache in memory, which controls the number of inodes we try to read
> > + * ahead.
> > + *
> > + * If no max prefetch was given, default to 4096 bytes' worth of inobt records;
> > + * this should be plenty of inodes to read ahead.  This number (256 inobt
> > + * records) was chosen so that the cache is never more than a single memory
> > + * page.
> > + */
> > +static inline void
> > +xfs_iwalk_set_prefetch(
> > +	struct xfs_iwalk_ag	*iwag,
> > +	unsigned int		max_prefetch)
> > +{
> > +	if (max_prefetch)
> > +		iwag->sz_recs = round_up(max_prefetch, XFS_INODES_PER_CHUNK) /
> > +					XFS_INODES_PER_CHUNK;
> > +	else
> > +		iwag->sz_recs = 4096 / sizeof(struct xfs_inobt_rec_incore);
> > +
> 
> Perhaps this should use PAGE_SIZE or a related macro?

It did in the previous revision, but Dave pointed out that sz_recs then
becomes quite large on a system with 64k pages...

65536 bytes / 16 bytes per inobt record = 4096 records
4096 records * 64 inodes per record = 262144 inodes
262144 inodes * 512 bytes per inode = 128MB of inode readahead

I could extend the comment to explain why we don't use PAGE_SIZE...

/*
 * Note: We hardcode 4096 here (instead of, say, PAGE_SIZE) because we want to
 * constrain the amount of inode readahead to 16k inodes regardless of CPU:
 *
 * 4096 bytes / 16 bytes per inobt record = 256 inobt records
 * 256 inobt records * 64 inodes per record = 16384 inodes
 * 16384 inodes * 512 bytes per inode(?) = 8MB of inode readahead
 */

--D

> Brian
> 
> > +	/*
> > +	 * Allocate enough space to prefetch at least two records so that we
> > +	 * can cache both the inobt record where the iwalk started and the next
> > +	 * record.  This simplifies the AG inode walk loop setup code.
> > +	 */
> > +	iwag->sz_recs = max_t(unsigned int, iwag->sz_recs, 2);
> > +}
> > +
> > +/*
> > + * Walk all inodes in the filesystem starting from @startino.  The @iwalk_fn
> > + * will be called for each allocated inode, being passed the inode's number and
> > + * @data.  @max_prefetch controls how many inobt records' worth of inodes we
> > + * try to readahead.
> > + */
> > +int
> > +xfs_iwalk(
> > +	struct xfs_mount	*mp,
> > +	struct xfs_trans	*tp,
> > +	xfs_ino_t		startino,
> > +	xfs_iwalk_fn		iwalk_fn,
> > +	unsigned int		max_prefetch,
> > +	void			*data)
> > +{
> > +	struct xfs_iwalk_ag	iwag = {
> > +		.mp		= mp,
> > +		.tp		= tp,
> > +		.iwalk_fn	= iwalk_fn,
> > +		.data		= data,
> > +		.startino	= startino,
> > +	};
> > +	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, startino);
> > +	int			error;
> > +
> > +	ASSERT(agno < mp->m_sb.sb_agcount);
> > +
> > +	xfs_iwalk_set_prefetch(&iwag, max_prefetch);
> > +	error = xfs_iwalk_alloc(&iwag);
> > +	if (error)
> > +		return error;
> > +
> > +	for (; agno < mp->m_sb.sb_agcount; agno++) {
> > +		error = xfs_iwalk_ag(&iwag);
> > +		if (error)
> > +			break;
> > +		iwag.startino = XFS_AGINO_TO_INO(mp, agno + 1, 0);
> > +	}
> > +
> > +	xfs_iwalk_free(&iwag);
> > +	return error;
> > +}
> > diff --git a/fs/xfs/xfs_iwalk.h b/fs/xfs/xfs_iwalk.h
> > new file mode 100644
> > index 000000000000..45b1baabcd2d
> > --- /dev/null
> > +++ b/fs/xfs/xfs_iwalk.h
> > @@ -0,0 +1,18 @@
> > +// SPDX-License-Identifier: GPL-2.0+
> > +/*
> > + * Copyright (C) 2019 Oracle.  All Rights Reserved.
> > + * Author: Darrick J. Wong <darrick.wong@oracle.com>
> > + */
> > +#ifndef __XFS_IWALK_H__
> > +#define __XFS_IWALK_H__
> > +
> > +/* Walk all inodes in the filesystem starting from @startino. */
> > +typedef int (*xfs_iwalk_fn)(struct xfs_mount *mp, struct xfs_trans *tp,
> > +			    xfs_ino_t ino, void *data);
> > +/* Return value (for xfs_iwalk_fn) that aborts the walk immediately. */
> > +#define XFS_IWALK_ABORT	(1)
> > +
> > +int xfs_iwalk(struct xfs_mount *mp, struct xfs_trans *tp, xfs_ino_t startino,
> > +		xfs_iwalk_fn iwalk_fn, unsigned int max_prefetch, void *data);
> > +
> > +#endif /* __XFS_IWALK_H__ */
> > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > index 2464ea351f83..f9bb1d50bc0e 100644
> > --- a/fs/xfs/xfs_trace.h
> > +++ b/fs/xfs/xfs_trace.h
> > @@ -3516,6 +3516,46 @@ DEFINE_EVENT(xfs_inode_corrupt_class, name,	\
> >  DEFINE_INODE_CORRUPT_EVENT(xfs_inode_mark_sick);
> >  DEFINE_INODE_CORRUPT_EVENT(xfs_inode_mark_healthy);
> >  
> > +TRACE_EVENT(xfs_iwalk_ag,
> > +	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
> > +		 xfs_agino_t startino),
> > +	TP_ARGS(mp, agno, startino),
> > +	TP_STRUCT__entry(
> > +		__field(dev_t, dev)
> > +		__field(xfs_agnumber_t, agno)
> > +		__field(xfs_agino_t, startino)
> > +	),
> > +	TP_fast_assign(
> > +		__entry->dev = mp->m_super->s_dev;
> > +		__entry->agno = agno;
> > +		__entry->startino = startino;
> > +	),
> > +	TP_printk("dev %d:%d agno %d startino %u",
> > +		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->agno,
> > +		  __entry->startino)
> > +)
> > +
> > +TRACE_EVENT(xfs_iwalk_ag_rec,
> > +	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
> > +		 struct xfs_inobt_rec_incore *irec),
> > +	TP_ARGS(mp, agno, irec),
> > +	TP_STRUCT__entry(
> > +		__field(dev_t, dev)
> > +		__field(xfs_agnumber_t, agno)
> > +		__field(xfs_agino_t, startino)
> > +		__field(uint64_t, freemask)
> > +	),
> > +	TP_fast_assign(
> > +		__entry->dev = mp->m_super->s_dev;
> > +		__entry->agno = agno;
> > +		__entry->startino = irec->ir_startino;
> > +		__entry->freemask = irec->ir_free;
> > +	),
> > +	TP_printk("dev %d:%d agno %d startino %u freemask 0x%llx",
> > +		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->agno,
> > +		  __entry->startino, __entry->freemask)
> > +)
> > +
> >  #endif /* _TRACE_XFS_H */
> >  
> >  #undef TRACE_INCLUDE_PATH
> > 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 02/10] xfs: convert quotacheck to use the new iwalk functions
  2019-06-10 13:58   ` Brian Foster
@ 2019-06-10 17:10     ` Darrick J. Wong
  2019-06-11 23:23     ` Dave Chinner
  1 sibling, 0 replies; 46+ messages in thread
From: Darrick J. Wong @ 2019-06-10 17:10 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, Dave Chinner

On Mon, Jun 10, 2019 at 09:58:52AM -0400, Brian Foster wrote:
> On Tue, Jun 04, 2019 at 02:49:40PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Convert quotacheck to use the new iwalk iterator to dig through the
> > inodes.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > Reviewed-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_qm.c |   62 ++++++++++++++++++-------------------------------------
> >  1 file changed, 20 insertions(+), 42 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
> > index aa6b6db3db0e..a5b2260406a8 100644
> > --- a/fs/xfs/xfs_qm.c
> > +++ b/fs/xfs/xfs_qm.c
> ...
> > @@ -1136,20 +1135,18 @@ xfs_qm_dqusage_adjust(
> >  	 * rootino must have its resources accounted for, not so with the quota
> >  	 * inodes.
> >  	 */
> > -	if (xfs_is_quota_inode(&mp->m_sb, ino)) {
> > -		*res = BULKSTAT_RV_NOTHING;
> > -		return -EINVAL;
> > -	}
> > +	if (xfs_is_quota_inode(&mp->m_sb, ino))
> > +		return 0;
> >  
> >  	/*
> >  	 * We don't _need_ to take the ilock EXCL here because quotacheck runs
> >  	 * at mount time and therefore nobody will be racing chown/chproj.
> >  	 */
> > -	error = xfs_iget(mp, NULL, ino, XFS_IGET_DONTCACHE, 0, &ip);
> > -	if (error) {
> > -		*res = BULKSTAT_RV_NOTHING;
> > +	error = xfs_iget(mp, tp, ino, XFS_IGET_DONTCACHE, 0, &ip);
> 
> I was wondering if we should start using IGET_UNTRUSTED here, but I
> guess we're 1.) protected by quotacheck context and 2.) have the same
> record validity semantics as the existing bulkstat walker. LGTM:

Yep.  There's nothing else running in the fs at quotacheck time so inode
records should not be changing while quotacheck runs.

> Reviewed-by: Brian Foster <bfoster@redhat.com>
> 
> > +	if (error == -EINVAL || error == -ENOENT)
> > +		return 0;
> > +	if (error)
> >  		return error;
> > -	}
> >  
> >  	ASSERT(ip->i_delayed_blks == 0);
> >  
> > @@ -1157,7 +1154,7 @@ xfs_qm_dqusage_adjust(
> >  		struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, XFS_DATA_FORK);
> >  
> >  		if (!(ifp->if_flags & XFS_IFEXTENTS)) {
> > -			error = xfs_iread_extents(NULL, ip, XFS_DATA_FORK);
> > +			error = xfs_iread_extents(tp, ip, XFS_DATA_FORK);
> >  			if (error)
> >  				goto error0;
> >  		}
> > @@ -1200,13 +1197,8 @@ xfs_qm_dqusage_adjust(
> >  			goto error0;
> >  	}
> >  
> > -	xfs_irele(ip);
> > -	*res = BULKSTAT_RV_DIDONE;
> > -	return 0;
> > -
> >  error0:
> >  	xfs_irele(ip);
> > -	*res = BULKSTAT_RV_GIVEUP;
> >  	return error;
> >  }
> >  
> > @@ -1270,18 +1262,13 @@ STATIC int
> >  xfs_qm_quotacheck(
> >  	xfs_mount_t	*mp)
> >  {
> > -	int			done, count, error, error2;
> > -	xfs_ino_t		lastino;
> > -	size_t			structsz;
> > +	int			error, error2;
> >  	uint			flags;
> >  	LIST_HEAD		(buffer_list);
> >  	struct xfs_inode	*uip = mp->m_quotainfo->qi_uquotaip;
> >  	struct xfs_inode	*gip = mp->m_quotainfo->qi_gquotaip;
> >  	struct xfs_inode	*pip = mp->m_quotainfo->qi_pquotaip;
> >  
> > -	count = INT_MAX;
> > -	structsz = 1;
> > -	lastino = 0;
> >  	flags = 0;
> >  
> >  	ASSERT(uip || gip || pip);
> > @@ -1318,18 +1305,9 @@ xfs_qm_quotacheck(
> >  		flags |= XFS_PQUOTA_CHKD;
> >  	}
> >  
> > -	do {
> > -		/*
> > -		 * Iterate thru all the inodes in the file system,
> > -		 * adjusting the corresponding dquot counters in core.
> > -		 */
> > -		error = xfs_bulkstat(mp, &lastino, &count,
> > -				     xfs_qm_dqusage_adjust,
> > -				     structsz, NULL, &done);
> > -		if (error)
> > -			break;
> > -
> > -	} while (!done);
> > +	error = xfs_iwalk(mp, NULL, 0, xfs_qm_dqusage_adjust, 0, NULL);
> > +	if (error)
> > +		goto error_return;
> >  
> >  	/*
> >  	 * We've made all the changes that we need to make incore.  Flush them
> > 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 04/10] xfs: convert bulkstat to new iwalk infrastructure
  2019-06-10 14:02   ` Brian Foster
@ 2019-06-10 17:38     ` Darrick J. Wong
  2019-06-10 18:29       ` Brian Foster
  0 siblings, 1 reply; 46+ messages in thread
From: Darrick J. Wong @ 2019-06-10 17:38 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Mon, Jun 10, 2019 at 10:02:29AM -0400, Brian Foster wrote:
> On Tue, Jun 04, 2019 at 02:49:53PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Create a new ibulk structure incore to help us deal with bulk inode stat
> > state tracking and then convert the bulkstat code to use the new iwalk
> > iterator.  This disentangles inode walking from bulk stat control for
> > simpler code and enables us to isolate the formatter functions to the
> > ioctl handling code.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/xfs_ioctl.c   |   65 ++++++--
> >  fs/xfs/xfs_ioctl.h   |    5 +
> >  fs/xfs/xfs_ioctl32.c |   88 +++++------
> >  fs/xfs/xfs_itable.c  |  407 ++++++++++++++------------------------------------
> >  fs/xfs/xfs_itable.h  |   79 ++++------
> >  5 files changed, 245 insertions(+), 399 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> > index 5ffbdcff3dba..43734901aeb9 100644
> > --- a/fs/xfs/xfs_ioctl.c
> > +++ b/fs/xfs/xfs_ioctl.c
> ...
> > @@ -745,35 +757,54 @@ xfs_ioc_bulkstat(
> >  	if (copy_from_user(&bulkreq, arg, sizeof(xfs_fsop_bulkreq_t)))
> >  		return -EFAULT;
> >  
> > -	if (copy_from_user(&inlast, bulkreq.lastip, sizeof(__s64)))
> > +	if (copy_from_user(&lastino, bulkreq.lastip, sizeof(__s64)))
> >  		return -EFAULT;
> >  
> > -	if ((count = bulkreq.icount) <= 0)
> > +	if (bulkreq.icount <= 0)
> >  		return -EINVAL;
> >  
> >  	if (bulkreq.ubuffer == NULL)
> >  		return -EINVAL;
> >  
> > -	if (cmd == XFS_IOC_FSINUMBERS)
> > -		error = xfs_inumbers(mp, &inlast, &count,
> > +	breq.ubuffer = bulkreq.ubuffer;
> > +	breq.icount = bulkreq.icount;
> > +
> > +	/*
> > +	 * FSBULKSTAT_SINGLE expects that *lastip contains the inode number
> > +	 * that we want to stat.  However, FSINUMBERS and FSBULKSTAT expect
> > +	 * that *lastip contains either zero or the number of the last inode to
> > +	 * be examined by the previous call and return results starting with
> > +	 * the next inode after that.  The new bulk request functions take the
> > +	 * inode to start with, so we have to adjust the lastino/startino
> > +	 * parameter to maintain correct function.
> > +	 */
> 
> It's kind of difficult to tell what's new or old when just looking at
> the code. The comment suggests FSINUMBERS and FSBULKSTAT use the same
> interface wrt to lastip, but they aren't handled the same below. I take
> it this is because xfs_inumbers() still has the same behavior whereas
> xfs_bulkstat() has been changed to operate based on breq rather than
> lastip..?

Yes.  By the end of the series we'll have converted FSINUMBERS too, but
for the ~5 or so patches until we get there, the only way to tell new
vs. old interface is whether it takes breq or pointers to stuff in breq.

> > +	if (cmd == XFS_IOC_FSINUMBERS) {
> > +		int	count = breq.icount;
> > +
> > +		breq.startino = lastino;
> > +		error = xfs_inumbers(mp, &breq.startino, &count,
> >  					bulkreq.ubuffer, xfs_inumbers_fmt);
> > -	else if (cmd == XFS_IOC_FSBULKSTAT_SINGLE)
> > -		error = xfs_bulkstat_one(mp, inlast, bulkreq.ubuffer,
> > -					sizeof(xfs_bstat_t), NULL, &done);
> > -	else	/* XFS_IOC_FSBULKSTAT */
> > -		error = xfs_bulkstat(mp, &inlast, &count, xfs_bulkstat_one,
> > -				     sizeof(xfs_bstat_t), bulkreq.ubuffer,
> > -				     &done);
> > +		breq.ocount = count;
> > +		lastino = breq.startino;
> > +	} else if (cmd == XFS_IOC_FSBULKSTAT_SINGLE) {
> > +		breq.startino = lastino;
> > +		error = xfs_bulkstat_one(&breq, xfs_bulkstat_one_fmt);
> > +		lastino = breq.startino;
> > +	} else {	/* XFS_IOC_FSBULKSTAT */
> > +		breq.startino = lastino + 1;
> > +		error = xfs_bulkstat(&breq, xfs_bulkstat_one_fmt);
> > +		lastino = breq.startino - 1;
> > +	}
> >  
> ...
> > diff --git a/fs/xfs/xfs_ioctl32.c b/fs/xfs/xfs_ioctl32.c
> > index 814ffe6fbab7..add15819daf3 100644
> > --- a/fs/xfs/xfs_ioctl32.c
> > +++ b/fs/xfs/xfs_ioctl32.c
> ...
> > @@ -284,38 +263,57 @@ xfs_compat_ioc_bulkstat(
> >  		return -EFAULT;
> >  	bulkreq.ocount = compat_ptr(addr);
> >  
> > -	if (copy_from_user(&inlast, bulkreq.lastip, sizeof(__s64)))
> > +	if (copy_from_user(&lastino, bulkreq.lastip, sizeof(__s64)))
> >  		return -EFAULT;
> > +	breq.startino = lastino + 1;
> >  
> > -	if ((count = bulkreq.icount) <= 0)
> > +	if (bulkreq.icount <= 0)
> >  		return -EINVAL;
> >  
> >  	if (bulkreq.ubuffer == NULL)
> >  		return -EINVAL;
> >  
> > +	breq.ubuffer = bulkreq.ubuffer;
> > +	breq.icount = bulkreq.icount;
> > +
> > +	/*
> > +	 * FSBULKSTAT_SINGLE expects that *lastip contains the inode number
> > +	 * that we want to stat.  However, FSINUMBERS and FSBULKSTAT expect
> > +	 * that *lastip contains either zero or the number of the last inode to
> > +	 * be examined by the previous call and return results starting with
> > +	 * the next inode after that.  The new bulk request functions take the
> > +	 * inode to start with, so we have to adjust the lastino/startino
> > +	 * parameter to maintain correct function.
> > +	 */
> 
> (Same comment here.)
> 
> >  	if (cmd == XFS_IOC_FSINUMBERS_32) {
> > -		error = xfs_inumbers(mp, &inlast, &count,
> > +		int	count = breq.icount;
> > +
> > +		breq.startino = lastino;
> > +		error = xfs_inumbers(mp, &breq.startino, &count,
> >  				bulkreq.ubuffer, inumbers_func);
> > +		breq.ocount = count;
> > +		lastino = breq.startino;
> >  	} else if (cmd == XFS_IOC_FSBULKSTAT_SINGLE_32) {
> > -		int res;
> > -
> > -		error = bs_one_func(mp, inlast, bulkreq.ubuffer,
> > -				bs_one_size, NULL, &res);
> > +		breq.startino = lastino;
> > +		error = xfs_bulkstat_one(&breq, bs_one_func);
> > +		lastino = breq.startino;
> >  	} else if (cmd == XFS_IOC_FSBULKSTAT_32) {
> > -		error = xfs_bulkstat(mp, &inlast, &count,
> > -			bs_one_func, bs_one_size,
> > -			bulkreq.ubuffer, &done);
> > -	} else
> > +		breq.startino = lastino + 1;
> > +		error = xfs_bulkstat(&breq, bs_one_func);
> > +		lastino = breq.startino - 1;
> > +	} else {
> >  		error = -EINVAL;
> > +	}
> >  	if (error)
> >  		return error;
> >  
> > +	lastino = breq.startino - 1;
> 
> Should this be here?

Nope.

> >  	if (bulkreq.lastip != NULL &&
> > -	    copy_to_user(bulkreq.lastip, &inlast, sizeof(xfs_ino_t)))
> > +	    copy_to_user(bulkreq.lastip, &lastino, sizeof(xfs_ino_t)))
> >  		return -EFAULT;
> >  
> >  	if (bulkreq.ocount != NULL &&
> > -	    copy_to_user(bulkreq.ocount, &count, sizeof(count)))
> > +	    copy_to_user(bulkreq.ocount, &breq.ocount, sizeof(__s32)))
> >  		return -EFAULT;
> >  
> >  	return 0;
> > diff --git a/fs/xfs/xfs_itable.c b/fs/xfs/xfs_itable.c
> > index 3ca1c454afe6..87c597ea1df7 100644
> > --- a/fs/xfs/xfs_itable.c
> > +++ b/fs/xfs/xfs_itable.c
> > @@ -22,37 +22,63 @@
> >  #include "xfs_iwalk.h"
> >  
> >  /*
> > - * Return stat information for one inode.
> > - * Return 0 if ok, else errno.
> > + * Bulk Stat
> > + * =========
> > + *
> > + * Use the inode walking functions to fill out struct xfs_bstat for every
> > + * allocated inode, then pass the stat information to some externally provided
> > + * iteration function.
> >   */
> > -int
> > +
> > +struct xfs_bstat_chunk {
> > +	bulkstat_one_fmt_pf	formatter;
> > +	struct xfs_ibulk	*breq;
> > +};
> > +
> > +/*
> > + * Fill out the bulkstat info for a single inode and report it somewhere.
> > + *
> > + * bc->breq->lastino is effectively the inode cursor as we walk through the
> > + * filesystem.  Therefore, we update it any time we need to move the cursor
> > + * forward, regardless of whether or not we're sending any bstat information
> > + * back to userspace.  If the inode is internal metadata or, has been freed
> > + * out from under us, we just simply keep going.
> > + *
> > + * However, if any other type of error happens we want to stop right where we
> > + * are so that userspace will call back with exact number of the bad inode and
> > + * we can send back an error code.
> > + *
> > + * Note that if the formatter tells us there's no space left in the buffer we
> > + * move the cursor forward and abort the walk.
> > + */
> > +STATIC int
> >  xfs_bulkstat_one_int(
> > -	struct xfs_mount	*mp,		/* mount point for filesystem */
> > -	xfs_ino_t		ino,		/* inode to get data for */
> > -	void __user		*buffer,	/* buffer to place output in */
> > -	int			ubsize,		/* size of buffer */
> > -	bulkstat_one_fmt_pf	formatter,	/* formatter, copy to user */
> > -	int			*ubused,	/* bytes used by me */
> > -	int			*stat)		/* BULKSTAT_RV_... */
> > +	struct xfs_mount	*mp,
> > +	struct xfs_trans	*tp,
> > +	xfs_ino_t		ino,
> > +	void			*data)
> >  {
> 
> Any reason this function takes an 'ino' param considering it's sourced
> from breq->startino and we bump that value from within this function?
> The latter seems slightly misplaced to me since it doesn't appear to
> control the iteration.
>
> It also looks like we bump startino in almost all cases. Exceptions are
> memory allocation failure of the buffer and formatter error. Hmm.. could
> we lift the buffer to the bc and reuse it to avoid that error entirely
> (along with repeated allocs/frees)? From there, perhaps we could lift
> the ->startino update to the callers based on something like error !=
> -EFAULT..? (Alternatively, the caller could update it first and then
> walk it back if error == -EFAULT).

xfs_iwalk doesn't know anything about the xfs_bstat_chunk or the
xfs_ibulk that are being passed to the iterator function via the void
*data parameter.  iwalk is a generic iterator which shouldn't ever know
about the bulkreq interface.

I plan to reuse the xfs_iwalk code for other parts of online repair in
the future, so that's why bulkstat_one takes the inode number that
_iwalk gives it, and then uses that to mess around with bc and breq.

I also sort of prefer to keep the startino update the way it is now
because I only want to push it forward for the ~4 cases where we do now
(internal, invalid number, past eofs, or successfully statted).
Any other runtime error leaves the cursor where it was.

> > +	struct xfs_bstat_chunk	*bc = data;
> >  	struct xfs_icdinode	*dic;		/* dinode core info pointer */
> >  	struct xfs_inode	*ip;		/* incore inode pointer */
> >  	struct inode		*inode;
> >  	struct xfs_bstat	*buf;		/* return buffer */
> >  	int			error = 0;	/* error value */
> >  
> > -	*stat = BULKSTAT_RV_NOTHING;
> > -
> > -	if (!buffer || xfs_internal_inum(mp, ino))
> > +	if (xfs_internal_inum(mp, ino)) {
> > +		bc->breq->startino = ino + 1;
> >  		return -EINVAL;
> > +	}
> >  
> >  	buf = kmem_zalloc(sizeof(*buf), KM_SLEEP | KM_MAYFAIL);
> >  	if (!buf)
> >  		return -ENOMEM;
> >  
> > -	error = xfs_iget(mp, NULL, ino,
> > +	error = xfs_iget(mp, tp, ino,
> >  			 (XFS_IGET_DONTCACHE | XFS_IGET_UNTRUSTED),
> >  			 XFS_ILOCK_SHARED, &ip);
> > +	if (error == -ENOENT || error == -EINVAL)
> > +		bc->breq->startino = ino + 1;
> >  	if (error)
> >  		goto out_free;
> >  
> > @@ -119,43 +145,45 @@ xfs_bulkstat_one_int(
> >  	xfs_iunlock(ip, XFS_ILOCK_SHARED);
> >  	xfs_irele(ip);
> >  
> > -	error = formatter(buffer, ubsize, ubused, buf);
> > -	if (!error)
> > -		*stat = BULKSTAT_RV_DIDONE;
> > -
> > - out_free:
> > +	error = bc->formatter(bc->breq, buf);
> > +	switch (error) {
> > +	case XFS_IBULK_BUFFER_FULL:
> > +		error = XFS_IWALK_ABORT;
> > +		/* fall through */
> > +	case 0:
> > +		bc->breq->startino = ino + 1;
> > +		break;
> > +	}
> > +out_free:
> >  	kmem_free(buf);
> >  	return error;
> >  }
> >  
> > -/* Return 0 on success or positive error */
> > -STATIC int
> > -xfs_bulkstat_one_fmt(
> > -	void			__user *ubuffer,
> > -	int			ubsize,
> > -	int			*ubused,
> > -	const xfs_bstat_t	*buffer)
> > -{
> > -	if (ubsize < sizeof(*buffer))
> > -		return -ENOMEM;
> > -	if (copy_to_user(ubuffer, buffer, sizeof(*buffer)))
> > -		return -EFAULT;
> > -	if (ubused)
> > -		*ubused = sizeof(*buffer);
> > -	return 0;
> > -}
> > -
> > +/* Bulkstat a single inode. */
> >  int
> >  xfs_bulkstat_one(
> > -	xfs_mount_t	*mp,		/* mount point for filesystem */
> > -	xfs_ino_t	ino,		/* inode number to get data for */
> > -	void		__user *buffer,	/* buffer to place output in */
> > -	int		ubsize,		/* size of buffer */
> > -	int		*ubused,	/* bytes used by me */
> > -	int		*stat)		/* BULKSTAT_RV_... */
> > +	struct xfs_ibulk	*breq,
> > +	bulkstat_one_fmt_pf	formatter)
> >  {
> > -	return xfs_bulkstat_one_int(mp, ino, buffer, ubsize,
> > -				    xfs_bulkstat_one_fmt, ubused, stat);
> > +	struct xfs_bstat_chunk	bc = {
> > +		.formatter	= formatter,
> > +		.breq		= breq,
> > +	};
> > +	int			error;
> > +
> > +	breq->icount = 1;
> > +	breq->ocount = 0;
> > +
> 
> So ->icount is already set by the caller based on user input. I'd guess
> this is set here to guarantee a single cycle in the event that the user
> provided value is >1, but that seems unnecessary since we're calling the
> internal helper to handle a single inode.
> 
> If we want to set ->icount for whatever reason, can we do it in the
> caller where it's more obvious? That also shows that the ->ocount init
> is unnecessary since the whole structure is initialized in the caller.

Ok.

> > +	error = xfs_bulkstat_one_int(breq->mp, NULL, breq->startino, &bc);
> > +
> > +	/*
> > +	 * If we reported one inode to userspace then we abort because we hit
> > +	 * the end of the buffer.  Don't leak that back to userspace.
> > +	 */
> > +	if (error == XFS_IWALK_ABORT)
> > +		error = 0;
> > +
> > +	return error;
> >  }
> >  
> >  /*
> ...
> > diff --git a/fs/xfs/xfs_itable.h b/fs/xfs/xfs_itable.h
> > index 369e3f159d4e..366d391eb11f 100644
> > --- a/fs/xfs/xfs_itable.h
> > +++ b/fs/xfs/xfs_itable.h
> > @@ -5,63 +5,46 @@
> >  #ifndef __XFS_ITABLE_H__
> >  #define	__XFS_ITABLE_H__
> >  
> > -/*
> > - * xfs_bulkstat() is used to fill in xfs_bstat structures as well as dm_stat
> > - * structures (by the dmi library). This is a pointer to a formatter function
> > - * that will iget the inode and fill in the appropriate structure.
> > - * see xfs_bulkstat_one() and xfs_dm_bulkstat_one() in dmapi_xfs.c
> > - */
> > -typedef int (*bulkstat_one_pf)(struct xfs_mount	*mp,
> > -			       xfs_ino_t	ino,
> > -			       void		__user *buffer,
> > -			       int		ubsize,
> > -			       int		*ubused,
> > -			       int		*stat);
> > +/* In-memory representation of a userspace request for batch inode data. */
> > +struct xfs_ibulk {
> > +	struct xfs_mount	*mp;
> > +	void __user		*ubuffer; /* user output buffer */
> > +	xfs_ino_t		startino; /* start with this inode */
> > +	unsigned int		icount;   /* number of elements in ubuffer */
> > +	unsigned int		ocount;   /* number of records returned */
> > +};
> > +
> > +/* Return value that means we want to abort the walk. */
> > +#define XFS_IBULK_ABORT		(XFS_IWALK_ABORT)
> > +
> > +/* Return value that means the formatting buffer is now full. */
> > +#define XFS_IBULK_BUFFER_FULL	(2)
> >  
> 
> It might be wise to define this such that it's guaranteed not to be the
> same as the abort value, since that is defined externally to this
> header. IBULK_ABORT + 1 perhaps?

FWIW I'm of half a mind to establish a generic "abort walk" error code
and key all of these iterator functions to use it instead of having all
these #define XFS_FOO_ABORT	1 things everywhere.

I'll do #define XFS_IBULK_BUFFER_FULL IBULK_ABORT+1 here.

> Brian
> 
> >  /*
> > - * Values for stat return value.
> > + * Advance the user buffer pointer by one record of the given size.  If the
> > + * buffer is now full, return the appropriate error code.
> >   */
> > -#define BULKSTAT_RV_NOTHING	0
> > -#define BULKSTAT_RV_DIDONE	1
> > -#define BULKSTAT_RV_GIVEUP	2
> > +static inline int
> > +xfs_ibulk_advance(
> > +	struct xfs_ibulk	*breq,
> > +	size_t			bytes)
> > +{
> > +	char __user		*b = breq->ubuffer;
> > +
> > +	breq->ubuffer = b + bytes;
> > +	breq->ocount++;
> > +	return breq->ocount == breq->icount ? XFS_IBULK_BUFFER_FULL : 0;
> > +}
> >  
> >  /*
> >   * Return stat information in bulk (by-inode) for the filesystem.
> >   */
> > -int					/* error status */
> > -xfs_bulkstat(
> > -	xfs_mount_t	*mp,		/* mount point for filesystem */
> > -	xfs_ino_t	*lastino,	/* last inode returned */
> > -	int		*count,		/* size of buffer/count returned */
> > -	bulkstat_one_pf formatter,	/* func that'd fill a single buf */
> > -	size_t		statstruct_size,/* sizeof struct that we're filling */
> > -	char		__user *ubuffer,/* buffer with inode stats */
> > -	int		*done);		/* 1 if there are more stats to get */
> >  
> > -typedef int (*bulkstat_one_fmt_pf)(  /* used size in bytes or negative error */
> > -	void			__user *ubuffer, /* buffer to write to */
> > -	int			ubsize,		 /* remaining user buffer sz */
> > -	int			*ubused,	 /* bytes used by formatter */
> > -	const xfs_bstat_t	*buffer);        /* buffer to read from */
> > +typedef int (*bulkstat_one_fmt_pf)(struct xfs_ibulk *breq,
> > +		const struct xfs_bstat *bstat);
> >  
> > -int
> > -xfs_bulkstat_one_int(
> > -	xfs_mount_t		*mp,
> > -	xfs_ino_t		ino,
> > -	void			__user *buffer,
> > -	int			ubsize,
> > -	bulkstat_one_fmt_pf	formatter,
> > -	int			*ubused,
> > -	int			*stat);
> > -
> > -int
> > -xfs_bulkstat_one(
> > -	xfs_mount_t		*mp,
> > -	xfs_ino_t		ino,
> > -	void			__user *buffer,
> > -	int			ubsize,
> > -	int			*ubused,
> > -	int			*stat);
> > +int xfs_bulkstat_one(struct xfs_ibulk *breq, bulkstat_one_fmt_pf formatter);
> > +int xfs_bulkstat(struct xfs_ibulk *breq, bulkstat_one_fmt_pf formatter);
> >  
> >  typedef int (*inumbers_fmt_pf)(
> >  	void			__user *ubuffer, /* buffer to write to */
> > 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 01/10] xfs: create simplified inode walk function
  2019-06-10 16:59     ` Darrick J. Wong
@ 2019-06-10 17:55       ` Brian Foster
  2019-06-10 23:11         ` Darrick J. Wong
  0 siblings, 1 reply; 46+ messages in thread
From: Brian Foster @ 2019-06-10 17:55 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Mon, Jun 10, 2019 at 09:59:09AM -0700, Darrick J. Wong wrote:
> On Mon, Jun 10, 2019 at 09:58:19AM -0400, Brian Foster wrote:
> > On Tue, Jun 04, 2019 at 02:49:34PM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > Create a new iterator function to simplify walking inodes in an XFS
> > > filesystem.  This new iterator will replace the existing open-coded
> > > walking that goes on in various places.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > ---
> > >  fs/xfs/Makefile                  |    1 
> > >  fs/xfs/libxfs/xfs_ialloc_btree.c |   31 +++
> > >  fs/xfs/libxfs/xfs_ialloc_btree.h |    3 
> > >  fs/xfs/xfs_itable.c              |    5 
> > >  fs/xfs/xfs_itable.h              |    8 +
> > >  fs/xfs/xfs_iwalk.c               |  400 ++++++++++++++++++++++++++++++++++++++
> > >  fs/xfs/xfs_iwalk.h               |   18 ++
> > >  fs/xfs/xfs_trace.h               |   40 ++++
> > >  8 files changed, 502 insertions(+), 4 deletions(-)
> > >  create mode 100644 fs/xfs/xfs_iwalk.c
> > >  create mode 100644 fs/xfs/xfs_iwalk.h
> > > 
> > > 
> > ...
> > > diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
> > > index ac4b65da4c2b..cb7eac2f51c0 100644
> > > --- a/fs/xfs/libxfs/xfs_ialloc_btree.c
> > > +++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
...
> > > +}
> > > +
> > > +/*
> > > + * Given the number of inodes to prefetch, set the number of inobt records that
> > > + * we cache in memory, which controls the number of inodes we try to read
> > > + * ahead.
> > > + *
> > > + * If no max prefetch was given, default to 4096 bytes' worth of inobt records;
> > > + * this should be plenty of inodes to read ahead.  This number (256 inobt
> > > + * records) was chosen so that the cache is never more than a single memory
> > > + * page.
> > > + */
> > > +static inline void
> > > +xfs_iwalk_set_prefetch(
> > > +	struct xfs_iwalk_ag	*iwag,
> > > +	unsigned int		max_prefetch)
> > > +{
> > > +	if (max_prefetch)
> > > +		iwag->sz_recs = round_up(max_prefetch, XFS_INODES_PER_CHUNK) /
> > > +					XFS_INODES_PER_CHUNK;
> > > +	else
> > > +		iwag->sz_recs = 4096 / sizeof(struct xfs_inobt_rec_incore);
> > > +
> > 
> > Perhaps this should use PAGE_SIZE or a related macro?
> 
> It did in the previous revision, but Dave pointed out that sz_recs then
> becomes quite large on a system with 64k pages...
> 
> 65536 bytes / 16 bytes per inobt record = 4096 records
> 4096 records * 64 inodes per record = 262144 inodes
> 262144 inodes * 512 bytes per inode = 128MB of inode readahead
> 

Ok, the comment just gave me the impression the intent was to fill a
single page.

> I could extend the comment to explain why we don't use PAGE_SIZE...
> 

Sounds good, though what I think would be better is to define a
IWALK_DEFAULT_RECS or some such somewhere and put the calculation
details with that.

Though now that you point out the readahead thing, aren't we at risk of
a similar problem for users who happen to pass a really large userspace
buffer? Should we cap the kernel allocation/readahead window in all
cases and not just the default case?

Brian

> /*
>  * Note: We hardcode 4096 here (instead of, say, PAGE_SIZE) because we want to
>  * constrain the amount of inode readahead to 16k inodes regardless of CPU:
>  *
>  * 4096 bytes / 16 bytes per inobt record = 256 inobt records
>  * 256 inobt records * 64 inodes per record = 16384 inodes
>  * 16384 inodes * 512 bytes per inode(?) = 8MB of inode readahead
>  */
> 
> --D
> 
> > Brian
> > 
> > > +	/*
> > > +	 * Allocate enough space to prefetch at least two records so that we
> > > +	 * can cache both the inobt record where the iwalk started and the next
> > > +	 * record.  This simplifies the AG inode walk loop setup code.
> > > +	 */
> > > +	iwag->sz_recs = max_t(unsigned int, iwag->sz_recs, 2);
> > > +}
> > > +
> > > +/*
> > > + * Walk all inodes in the filesystem starting from @startino.  The @iwalk_fn
> > > + * will be called for each allocated inode, being passed the inode's number and
> > > + * @data.  @max_prefetch controls how many inobt records' worth of inodes we
> > > + * try to readahead.
> > > + */
> > > +int
> > > +xfs_iwalk(
> > > +	struct xfs_mount	*mp,
> > > +	struct xfs_trans	*tp,
> > > +	xfs_ino_t		startino,
> > > +	xfs_iwalk_fn		iwalk_fn,
> > > +	unsigned int		max_prefetch,
> > > +	void			*data)
> > > +{
> > > +	struct xfs_iwalk_ag	iwag = {
> > > +		.mp		= mp,
> > > +		.tp		= tp,
> > > +		.iwalk_fn	= iwalk_fn,
> > > +		.data		= data,
> > > +		.startino	= startino,
> > > +	};
> > > +	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, startino);
> > > +	int			error;
> > > +
> > > +	ASSERT(agno < mp->m_sb.sb_agcount);
> > > +
> > > +	xfs_iwalk_set_prefetch(&iwag, max_prefetch);
> > > +	error = xfs_iwalk_alloc(&iwag);
> > > +	if (error)
> > > +		return error;
> > > +
> > > +	for (; agno < mp->m_sb.sb_agcount; agno++) {
> > > +		error = xfs_iwalk_ag(&iwag);
> > > +		if (error)
> > > +			break;
> > > +		iwag.startino = XFS_AGINO_TO_INO(mp, agno + 1, 0);
> > > +	}
> > > +
> > > +	xfs_iwalk_free(&iwag);
> > > +	return error;
> > > +}
> > > diff --git a/fs/xfs/xfs_iwalk.h b/fs/xfs/xfs_iwalk.h
> > > new file mode 100644
> > > index 000000000000..45b1baabcd2d
> > > --- /dev/null
> > > +++ b/fs/xfs/xfs_iwalk.h
> > > @@ -0,0 +1,18 @@
> > > +// SPDX-License-Identifier: GPL-2.0+
> > > +/*
> > > + * Copyright (C) 2019 Oracle.  All Rights Reserved.
> > > + * Author: Darrick J. Wong <darrick.wong@oracle.com>
> > > + */
> > > +#ifndef __XFS_IWALK_H__
> > > +#define __XFS_IWALK_H__
> > > +
> > > +/* Walk all inodes in the filesystem starting from @startino. */
> > > +typedef int (*xfs_iwalk_fn)(struct xfs_mount *mp, struct xfs_trans *tp,
> > > +			    xfs_ino_t ino, void *data);
> > > +/* Return value (for xfs_iwalk_fn) that aborts the walk immediately. */
> > > +#define XFS_IWALK_ABORT	(1)
> > > +
> > > +int xfs_iwalk(struct xfs_mount *mp, struct xfs_trans *tp, xfs_ino_t startino,
> > > +		xfs_iwalk_fn iwalk_fn, unsigned int max_prefetch, void *data);
> > > +
> > > +#endif /* __XFS_IWALK_H__ */
> > > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > > index 2464ea351f83..f9bb1d50bc0e 100644
> > > --- a/fs/xfs/xfs_trace.h
> > > +++ b/fs/xfs/xfs_trace.h
> > > @@ -3516,6 +3516,46 @@ DEFINE_EVENT(xfs_inode_corrupt_class, name,	\
> > >  DEFINE_INODE_CORRUPT_EVENT(xfs_inode_mark_sick);
> > >  DEFINE_INODE_CORRUPT_EVENT(xfs_inode_mark_healthy);
> > >  
> > > +TRACE_EVENT(xfs_iwalk_ag,
> > > +	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
> > > +		 xfs_agino_t startino),
> > > +	TP_ARGS(mp, agno, startino),
> > > +	TP_STRUCT__entry(
> > > +		__field(dev_t, dev)
> > > +		__field(xfs_agnumber_t, agno)
> > > +		__field(xfs_agino_t, startino)
> > > +	),
> > > +	TP_fast_assign(
> > > +		__entry->dev = mp->m_super->s_dev;
> > > +		__entry->agno = agno;
> > > +		__entry->startino = startino;
> > > +	),
> > > +	TP_printk("dev %d:%d agno %d startino %u",
> > > +		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->agno,
> > > +		  __entry->startino)
> > > +)
> > > +
> > > +TRACE_EVENT(xfs_iwalk_ag_rec,
> > > +	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
> > > +		 struct xfs_inobt_rec_incore *irec),
> > > +	TP_ARGS(mp, agno, irec),
> > > +	TP_STRUCT__entry(
> > > +		__field(dev_t, dev)
> > > +		__field(xfs_agnumber_t, agno)
> > > +		__field(xfs_agino_t, startino)
> > > +		__field(uint64_t, freemask)
> > > +	),
> > > +	TP_fast_assign(
> > > +		__entry->dev = mp->m_super->s_dev;
> > > +		__entry->agno = agno;
> > > +		__entry->startino = irec->ir_startino;
> > > +		__entry->freemask = irec->ir_free;
> > > +	),
> > > +	TP_printk("dev %d:%d agno %d startino %u freemask 0x%llx",
> > > +		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->agno,
> > > +		  __entry->startino, __entry->freemask)
> > > +)
> > > +
> > >  #endif /* _TRACE_XFS_H */
> > >  
> > >  #undef TRACE_INCLUDE_PATH
> > > 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 04/10] xfs: convert bulkstat to new iwalk infrastructure
  2019-06-10 17:38     ` Darrick J. Wong
@ 2019-06-10 18:29       ` Brian Foster
  2019-06-10 23:42         ` Darrick J. Wong
  0 siblings, 1 reply; 46+ messages in thread
From: Brian Foster @ 2019-06-10 18:29 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Mon, Jun 10, 2019 at 10:38:39AM -0700, Darrick J. Wong wrote:
> On Mon, Jun 10, 2019 at 10:02:29AM -0400, Brian Foster wrote:
> > On Tue, Jun 04, 2019 at 02:49:53PM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > Create a new ibulk structure incore to help us deal with bulk inode stat
> > > state tracking and then convert the bulkstat code to use the new iwalk
> > > iterator.  This disentangles inode walking from bulk stat control for
> > > simpler code and enables us to isolate the formatter functions to the
> > > ioctl handling code.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > ---
> > >  fs/xfs/xfs_ioctl.c   |   65 ++++++--
> > >  fs/xfs/xfs_ioctl.h   |    5 +
> > >  fs/xfs/xfs_ioctl32.c |   88 +++++------
> > >  fs/xfs/xfs_itable.c  |  407 ++++++++++++++------------------------------------
> > >  fs/xfs/xfs_itable.h  |   79 ++++------
> > >  5 files changed, 245 insertions(+), 399 deletions(-)
> > > 
> > > 
> > > diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> > > index 5ffbdcff3dba..43734901aeb9 100644
> > > --- a/fs/xfs/xfs_ioctl.c
> > > +++ b/fs/xfs/xfs_ioctl.c
> > ...
> > > @@ -745,35 +757,54 @@ xfs_ioc_bulkstat(
> > >  	if (copy_from_user(&bulkreq, arg, sizeof(xfs_fsop_bulkreq_t)))
> > >  		return -EFAULT;
> > >  
> > > -	if (copy_from_user(&inlast, bulkreq.lastip, sizeof(__s64)))
> > > +	if (copy_from_user(&lastino, bulkreq.lastip, sizeof(__s64)))
> > >  		return -EFAULT;
> > >  
> > > -	if ((count = bulkreq.icount) <= 0)
> > > +	if (bulkreq.icount <= 0)
> > >  		return -EINVAL;
> > >  
> > >  	if (bulkreq.ubuffer == NULL)
> > >  		return -EINVAL;
> > >  
> > > -	if (cmd == XFS_IOC_FSINUMBERS)
> > > -		error = xfs_inumbers(mp, &inlast, &count,
> > > +	breq.ubuffer = bulkreq.ubuffer;
> > > +	breq.icount = bulkreq.icount;
> > > +
> > > +	/*
> > > +	 * FSBULKSTAT_SINGLE expects that *lastip contains the inode number
> > > +	 * that we want to stat.  However, FSINUMBERS and FSBULKSTAT expect
> > > +	 * that *lastip contains either zero or the number of the last inode to
> > > +	 * be examined by the previous call and return results starting with
> > > +	 * the next inode after that.  The new bulk request functions take the
> > > +	 * inode to start with, so we have to adjust the lastino/startino
> > > +	 * parameter to maintain correct function.
> > > +	 */
> > 
> > It's kind of difficult to tell what's new or old when just looking at
> > the code. The comment suggests FSINUMBERS and FSBULKSTAT use the same
> > interface wrt to lastip, but they aren't handled the same below. I take
> > it this is because xfs_inumbers() still has the same behavior whereas
> > xfs_bulkstat() has been changed to operate based on breq rather than
> > lastip..?
> 
> Yes.  By the end of the series we'll have converted FSINUMBERS too, but
> for the ~5 or so patches until we get there, the only way to tell new
> vs. old interface is whether it takes breq or pointers to stuff in breq.
> 
> > > +	if (cmd == XFS_IOC_FSINUMBERS) {
> > > +		int	count = breq.icount;
> > > +
> > > +		breq.startino = lastino;
> > > +		error = xfs_inumbers(mp, &breq.startino, &count,
> > >  					bulkreq.ubuffer, xfs_inumbers_fmt);
> > > -	else if (cmd == XFS_IOC_FSBULKSTAT_SINGLE)
> > > -		error = xfs_bulkstat_one(mp, inlast, bulkreq.ubuffer,
> > > -					sizeof(xfs_bstat_t), NULL, &done);
> > > -	else	/* XFS_IOC_FSBULKSTAT */
> > > -		error = xfs_bulkstat(mp, &inlast, &count, xfs_bulkstat_one,
> > > -				     sizeof(xfs_bstat_t), bulkreq.ubuffer,
> > > -				     &done);
> > > +		breq.ocount = count;
> > > +		lastino = breq.startino;
> > > +	} else if (cmd == XFS_IOC_FSBULKSTAT_SINGLE) {
> > > +		breq.startino = lastino;
> > > +		error = xfs_bulkstat_one(&breq, xfs_bulkstat_one_fmt);
> > > +		lastino = breq.startino;
> > > +	} else {	/* XFS_IOC_FSBULKSTAT */
> > > +		breq.startino = lastino + 1;
> > > +		error = xfs_bulkstat(&breq, xfs_bulkstat_one_fmt);
> > > +		lastino = breq.startino - 1;
> > > +	}
> > >  
> > ...
> > > diff --git a/fs/xfs/xfs_ioctl32.c b/fs/xfs/xfs_ioctl32.c
> > > index 814ffe6fbab7..add15819daf3 100644
> > > --- a/fs/xfs/xfs_ioctl32.c
> > > +++ b/fs/xfs/xfs_ioctl32.c
> > ...
> > > @@ -284,38 +263,57 @@ xfs_compat_ioc_bulkstat(
> > >  		return -EFAULT;
> > >  	bulkreq.ocount = compat_ptr(addr);
> > >  
> > > -	if (copy_from_user(&inlast, bulkreq.lastip, sizeof(__s64)))
> > > +	if (copy_from_user(&lastino, bulkreq.lastip, sizeof(__s64)))
> > >  		return -EFAULT;
> > > +	breq.startino = lastino + 1;
> > >  
> > > -	if ((count = bulkreq.icount) <= 0)
> > > +	if (bulkreq.icount <= 0)
> > >  		return -EINVAL;
> > >  
> > >  	if (bulkreq.ubuffer == NULL)
> > >  		return -EINVAL;
> > >  
> > > +	breq.ubuffer = bulkreq.ubuffer;
> > > +	breq.icount = bulkreq.icount;
> > > +
> > > +	/*
> > > +	 * FSBULKSTAT_SINGLE expects that *lastip contains the inode number
> > > +	 * that we want to stat.  However, FSINUMBERS and FSBULKSTAT expect
> > > +	 * that *lastip contains either zero or the number of the last inode to
> > > +	 * be examined by the previous call and return results starting with
> > > +	 * the next inode after that.  The new bulk request functions take the
> > > +	 * inode to start with, so we have to adjust the lastino/startino
> > > +	 * parameter to maintain correct function.
> > > +	 */
> > 
> > (Same comment here.)
> > 
> > >  	if (cmd == XFS_IOC_FSINUMBERS_32) {
> > > -		error = xfs_inumbers(mp, &inlast, &count,
> > > +		int	count = breq.icount;
> > > +
> > > +		breq.startino = lastino;
> > > +		error = xfs_inumbers(mp, &breq.startino, &count,
> > >  				bulkreq.ubuffer, inumbers_func);
> > > +		breq.ocount = count;
> > > +		lastino = breq.startino;
> > >  	} else if (cmd == XFS_IOC_FSBULKSTAT_SINGLE_32) {
> > > -		int res;
> > > -
> > > -		error = bs_one_func(mp, inlast, bulkreq.ubuffer,
> > > -				bs_one_size, NULL, &res);
> > > +		breq.startino = lastino;
> > > +		error = xfs_bulkstat_one(&breq, bs_one_func);
> > > +		lastino = breq.startino;
> > >  	} else if (cmd == XFS_IOC_FSBULKSTAT_32) {
> > > -		error = xfs_bulkstat(mp, &inlast, &count,
> > > -			bs_one_func, bs_one_size,
> > > -			bulkreq.ubuffer, &done);
> > > -	} else
> > > +		breq.startino = lastino + 1;
> > > +		error = xfs_bulkstat(&breq, bs_one_func);
> > > +		lastino = breq.startino - 1;
> > > +	} else {
> > >  		error = -EINVAL;
> > > +	}
> > >  	if (error)
> > >  		return error;
> > >  
> > > +	lastino = breq.startino - 1;
> > 
> > Should this be here?
> 
> Nope.
> 
> > >  	if (bulkreq.lastip != NULL &&
> > > -	    copy_to_user(bulkreq.lastip, &inlast, sizeof(xfs_ino_t)))
> > > +	    copy_to_user(bulkreq.lastip, &lastino, sizeof(xfs_ino_t)))
> > >  		return -EFAULT;
> > >  
> > >  	if (bulkreq.ocount != NULL &&
> > > -	    copy_to_user(bulkreq.ocount, &count, sizeof(count)))
> > > +	    copy_to_user(bulkreq.ocount, &breq.ocount, sizeof(__s32)))
> > >  		return -EFAULT;
> > >  
> > >  	return 0;
> > > diff --git a/fs/xfs/xfs_itable.c b/fs/xfs/xfs_itable.c
> > > index 3ca1c454afe6..87c597ea1df7 100644
> > > --- a/fs/xfs/xfs_itable.c
> > > +++ b/fs/xfs/xfs_itable.c
> > > @@ -22,37 +22,63 @@
> > >  #include "xfs_iwalk.h"
> > >  
> > >  /*
> > > - * Return stat information for one inode.
> > > - * Return 0 if ok, else errno.
> > > + * Bulk Stat
> > > + * =========
> > > + *
> > > + * Use the inode walking functions to fill out struct xfs_bstat for every
> > > + * allocated inode, then pass the stat information to some externally provided
> > > + * iteration function.
> > >   */
> > > -int
> > > +
> > > +struct xfs_bstat_chunk {
> > > +	bulkstat_one_fmt_pf	formatter;
> > > +	struct xfs_ibulk	*breq;
> > > +};
> > > +
> > > +/*
> > > + * Fill out the bulkstat info for a single inode and report it somewhere.
> > > + *
> > > + * bc->breq->lastino is effectively the inode cursor as we walk through the
> > > + * filesystem.  Therefore, we update it any time we need to move the cursor
> > > + * forward, regardless of whether or not we're sending any bstat information
> > > + * back to userspace.  If the inode is internal metadata or, has been freed
> > > + * out from under us, we just simply keep going.
> > > + *
> > > + * However, if any other type of error happens we want to stop right where we
> > > + * are so that userspace will call back with exact number of the bad inode and
> > > + * we can send back an error code.
> > > + *
> > > + * Note that if the formatter tells us there's no space left in the buffer we
> > > + * move the cursor forward and abort the walk.
> > > + */
> > > +STATIC int
> > >  xfs_bulkstat_one_int(
> > > -	struct xfs_mount	*mp,		/* mount point for filesystem */
> > > -	xfs_ino_t		ino,		/* inode to get data for */
> > > -	void __user		*buffer,	/* buffer to place output in */
> > > -	int			ubsize,		/* size of buffer */
> > > -	bulkstat_one_fmt_pf	formatter,	/* formatter, copy to user */
> > > -	int			*ubused,	/* bytes used by me */
> > > -	int			*stat)		/* BULKSTAT_RV_... */
> > > +	struct xfs_mount	*mp,
> > > +	struct xfs_trans	*tp,
> > > +	xfs_ino_t		ino,
> > > +	void			*data)
> > >  {
> > 
> > Any reason this function takes an 'ino' param considering it's sourced
> > from breq->startino and we bump that value from within this function?
> > The latter seems slightly misplaced to me since it doesn't appear to
> > control the iteration.
> >
> > It also looks like we bump startino in almost all cases. Exceptions are
> > memory allocation failure of the buffer and formatter error. Hmm.. could
> > we lift the buffer to the bc and reuse it to avoid that error entirely
> > (along with repeated allocs/frees)? From there, perhaps we could lift
> > the ->startino update to the callers based on something like error !=
> > -EFAULT..? (Alternatively, the caller could update it first and then
> > walk it back if error == -EFAULT).
> 
> xfs_iwalk doesn't know anything about the xfs_bstat_chunk or the
> xfs_ibulk that are being passed to the iterator function via the void
> *data parameter.  iwalk is a generic iterator which shouldn't ever know
> about the bulkreq interface.
> 

Sure, I'm not suggesting to push anything into the iwalk code. That's
obviously a separate layer. I'm referring to the direct callers of
xfs_bulkstat_one_int(), which is still bulkstat code AFAICT.

> I plan to reuse the xfs_iwalk code for other parts of online repair in
> the future, so that's why bulkstat_one takes the inode number that
> _iwalk gives it, and then uses that to mess around with bc and breq.
> 

Ok. BTW, looking again I notice that xfs_bulkstat_one_int() passes bc as
a void pointer. That could probably be fixed up to take the
xfs_bstat_chunk from the current callers.

> I also sort of prefer to keep the startino update the way it is now
> because I only want to push it forward for the ~4 cases where we do now
> (internal, invalid number, past eofs, or successfully statted).
> Any other runtime error leaves the cursor where it was.
> 

Fair enough, though it could also be cleaned up a bit with an exit label
IMO.

Brian

> > > +	struct xfs_bstat_chunk	*bc = data;
> > >  	struct xfs_icdinode	*dic;		/* dinode core info pointer */
> > >  	struct xfs_inode	*ip;		/* incore inode pointer */
> > >  	struct inode		*inode;
> > >  	struct xfs_bstat	*buf;		/* return buffer */
> > >  	int			error = 0;	/* error value */
> > >  
> > > -	*stat = BULKSTAT_RV_NOTHING;
> > > -
> > > -	if (!buffer || xfs_internal_inum(mp, ino))
> > > +	if (xfs_internal_inum(mp, ino)) {
> > > +		bc->breq->startino = ino + 1;
> > >  		return -EINVAL;
> > > +	}
> > >  
> > >  	buf = kmem_zalloc(sizeof(*buf), KM_SLEEP | KM_MAYFAIL);
> > >  	if (!buf)
> > >  		return -ENOMEM;
> > >  
> > > -	error = xfs_iget(mp, NULL, ino,
> > > +	error = xfs_iget(mp, tp, ino,
> > >  			 (XFS_IGET_DONTCACHE | XFS_IGET_UNTRUSTED),
> > >  			 XFS_ILOCK_SHARED, &ip);
> > > +	if (error == -ENOENT || error == -EINVAL)
> > > +		bc->breq->startino = ino + 1;
> > >  	if (error)
> > >  		goto out_free;
> > >  
> > > @@ -119,43 +145,45 @@ xfs_bulkstat_one_int(
> > >  	xfs_iunlock(ip, XFS_ILOCK_SHARED);
> > >  	xfs_irele(ip);
> > >  
> > > -	error = formatter(buffer, ubsize, ubused, buf);
> > > -	if (!error)
> > > -		*stat = BULKSTAT_RV_DIDONE;
> > > -
> > > - out_free:
> > > +	error = bc->formatter(bc->breq, buf);
> > > +	switch (error) {
> > > +	case XFS_IBULK_BUFFER_FULL:
> > > +		error = XFS_IWALK_ABORT;
> > > +		/* fall through */
> > > +	case 0:
> > > +		bc->breq->startino = ino + 1;
> > > +		break;
> > > +	}
> > > +out_free:
> > >  	kmem_free(buf);
> > >  	return error;
> > >  }
> > >  
> > > -/* Return 0 on success or positive error */
> > > -STATIC int
> > > -xfs_bulkstat_one_fmt(
> > > -	void			__user *ubuffer,
> > > -	int			ubsize,
> > > -	int			*ubused,
> > > -	const xfs_bstat_t	*buffer)
> > > -{
> > > -	if (ubsize < sizeof(*buffer))
> > > -		return -ENOMEM;
> > > -	if (copy_to_user(ubuffer, buffer, sizeof(*buffer)))
> > > -		return -EFAULT;
> > > -	if (ubused)
> > > -		*ubused = sizeof(*buffer);
> > > -	return 0;
> > > -}
> > > -
> > > +/* Bulkstat a single inode. */
> > >  int
> > >  xfs_bulkstat_one(
> > > -	xfs_mount_t	*mp,		/* mount point for filesystem */
> > > -	xfs_ino_t	ino,		/* inode number to get data for */
> > > -	void		__user *buffer,	/* buffer to place output in */
> > > -	int		ubsize,		/* size of buffer */
> > > -	int		*ubused,	/* bytes used by me */
> > > -	int		*stat)		/* BULKSTAT_RV_... */
> > > +	struct xfs_ibulk	*breq,
> > > +	bulkstat_one_fmt_pf	formatter)
> > >  {
> > > -	return xfs_bulkstat_one_int(mp, ino, buffer, ubsize,
> > > -				    xfs_bulkstat_one_fmt, ubused, stat);
> > > +	struct xfs_bstat_chunk	bc = {
> > > +		.formatter	= formatter,
> > > +		.breq		= breq,
> > > +	};
> > > +	int			error;
> > > +
> > > +	breq->icount = 1;
> > > +	breq->ocount = 0;
> > > +
> > 
> > So ->icount is already set by the caller based on user input. I'd guess
> > this is set here to guarantee a single cycle in the event that the user
> > provided value is >1, but that seems unnecessary since we're calling the
> > internal helper to handle a single inode.
> > 
> > If we want to set ->icount for whatever reason, can we do it in the
> > caller where it's more obvious? That also shows that the ->ocount init
> > is unnecessary since the whole structure is initialized in the caller.
> 
> Ok.
> 
> > > +	error = xfs_bulkstat_one_int(breq->mp, NULL, breq->startino, &bc);
> > > +
> > > +	/*
> > > +	 * If we reported one inode to userspace then we abort because we hit
> > > +	 * the end of the buffer.  Don't leak that back to userspace.
> > > +	 */
> > > +	if (error == XFS_IWALK_ABORT)
> > > +		error = 0;
> > > +
> > > +	return error;
> > >  }
> > >  
> > >  /*
> > ...
> > > diff --git a/fs/xfs/xfs_itable.h b/fs/xfs/xfs_itable.h
> > > index 369e3f159d4e..366d391eb11f 100644
> > > --- a/fs/xfs/xfs_itable.h
> > > +++ b/fs/xfs/xfs_itable.h
> > > @@ -5,63 +5,46 @@
> > >  #ifndef __XFS_ITABLE_H__
> > >  #define	__XFS_ITABLE_H__
> > >  
> > > -/*
> > > - * xfs_bulkstat() is used to fill in xfs_bstat structures as well as dm_stat
> > > - * structures (by the dmi library). This is a pointer to a formatter function
> > > - * that will iget the inode and fill in the appropriate structure.
> > > - * see xfs_bulkstat_one() and xfs_dm_bulkstat_one() in dmapi_xfs.c
> > > - */
> > > -typedef int (*bulkstat_one_pf)(struct xfs_mount	*mp,
> > > -			       xfs_ino_t	ino,
> > > -			       void		__user *buffer,
> > > -			       int		ubsize,
> > > -			       int		*ubused,
> > > -			       int		*stat);
> > > +/* In-memory representation of a userspace request for batch inode data. */
> > > +struct xfs_ibulk {
> > > +	struct xfs_mount	*mp;
> > > +	void __user		*ubuffer; /* user output buffer */
> > > +	xfs_ino_t		startino; /* start with this inode */
> > > +	unsigned int		icount;   /* number of elements in ubuffer */
> > > +	unsigned int		ocount;   /* number of records returned */
> > > +};
> > > +
> > > +/* Return value that means we want to abort the walk. */
> > > +#define XFS_IBULK_ABORT		(XFS_IWALK_ABORT)
> > > +
> > > +/* Return value that means the formatting buffer is now full. */
> > > +#define XFS_IBULK_BUFFER_FULL	(2)
> > >  
> > 
> > It might be wise to define this such that it's guaranteed not to be the
> > same as the abort value, since that is defined externally to this
> > header. IBULK_ABORT + 1 perhaps?
> 
> FWIW I'm of half a mind to establish a generic "abort walk" error code
> and key all of these iterator functions to use it instead of having all
> these #define XFS_FOO_ABORT	1 things everywhere.
> 
> I'll do #define XFS_IBULK_BUFFER_FULL IBULK_ABORT+1 here.
> 
> > Brian
> > 
> > >  /*
> > > - * Values for stat return value.
> > > + * Advance the user buffer pointer by one record of the given size.  If the
> > > + * buffer is now full, return the appropriate error code.
> > >   */
> > > -#define BULKSTAT_RV_NOTHING	0
> > > -#define BULKSTAT_RV_DIDONE	1
> > > -#define BULKSTAT_RV_GIVEUP	2
> > > +static inline int
> > > +xfs_ibulk_advance(
> > > +	struct xfs_ibulk	*breq,
> > > +	size_t			bytes)
> > > +{
> > > +	char __user		*b = breq->ubuffer;
> > > +
> > > +	breq->ubuffer = b + bytes;
> > > +	breq->ocount++;
> > > +	return breq->ocount == breq->icount ? XFS_IBULK_BUFFER_FULL : 0;
> > > +}
> > >  
> > >  /*
> > >   * Return stat information in bulk (by-inode) for the filesystem.
> > >   */
> > > -int					/* error status */
> > > -xfs_bulkstat(
> > > -	xfs_mount_t	*mp,		/* mount point for filesystem */
> > > -	xfs_ino_t	*lastino,	/* last inode returned */
> > > -	int		*count,		/* size of buffer/count returned */
> > > -	bulkstat_one_pf formatter,	/* func that'd fill a single buf */
> > > -	size_t		statstruct_size,/* sizeof struct that we're filling */
> > > -	char		__user *ubuffer,/* buffer with inode stats */
> > > -	int		*done);		/* 1 if there are more stats to get */
> > >  
> > > -typedef int (*bulkstat_one_fmt_pf)(  /* used size in bytes or negative error */
> > > -	void			__user *ubuffer, /* buffer to write to */
> > > -	int			ubsize,		 /* remaining user buffer sz */
> > > -	int			*ubused,	 /* bytes used by formatter */
> > > -	const xfs_bstat_t	*buffer);        /* buffer to read from */
> > > +typedef int (*bulkstat_one_fmt_pf)(struct xfs_ibulk *breq,
> > > +		const struct xfs_bstat *bstat);
> > >  
> > > -int
> > > -xfs_bulkstat_one_int(
> > > -	xfs_mount_t		*mp,
> > > -	xfs_ino_t		ino,
> > > -	void			__user *buffer,
> > > -	int			ubsize,
> > > -	bulkstat_one_fmt_pf	formatter,
> > > -	int			*ubused,
> > > -	int			*stat);
> > > -
> > > -int
> > > -xfs_bulkstat_one(
> > > -	xfs_mount_t		*mp,
> > > -	xfs_ino_t		ino,
> > > -	void			__user *buffer,
> > > -	int			ubsize,
> > > -	int			*ubused,
> > > -	int			*stat);
> > > +int xfs_bulkstat_one(struct xfs_ibulk *breq, bulkstat_one_fmt_pf formatter);
> > > +int xfs_bulkstat(struct xfs_ibulk *breq, bulkstat_one_fmt_pf formatter);
> > >  
> > >  typedef int (*inumbers_fmt_pf)(
> > >  	void			__user *ubuffer, /* buffer to write to */
> > > 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 06/10] xfs: change xfs_iwalk_grab_ichunk to use startino, not lastino
  2019-06-04 21:50 ` [PATCH 06/10] xfs: change xfs_iwalk_grab_ichunk to use startino, not lastino Darrick J. Wong
@ 2019-06-10 19:32   ` Brian Foster
  0 siblings, 0 replies; 46+ messages in thread
From: Brian Foster @ 2019-06-10 19:32 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Jun 04, 2019 at 02:50:06PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Now that the inode chunk grabbing function is a static function in the
> iwalk code, change its behavior so that @agino is the inode where we
> want to /start/ the iteration.  This reduces cognitive friction with the
> callers and simplifes the code.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_iwalk.c |   37 +++++++++++++++++--------------------
>  1 file changed, 17 insertions(+), 20 deletions(-)
> 
> 
> diff --git a/fs/xfs/xfs_iwalk.c b/fs/xfs/xfs_iwalk.c
> index bef0c4907781..9ad017ddbae7 100644
> --- a/fs/xfs/xfs_iwalk.c
> +++ b/fs/xfs/xfs_iwalk.c
> @@ -99,10 +99,10 @@ xfs_iwalk_ichunk_ra(
>  }
>  
>  /*
> - * Lookup the inode chunk that the given inode lives in and then get the record
> - * if we found the chunk.  If the inode was not the last in the chunk and there
> - * are some left allocated, update the data for the pointed-to record as well as
> - * return the count of grabbed inodes.
> + * Lookup the inode chunk that the given @agino lives in and then get the
> + * record if we found the chunk.  Set the bits in @irec's free mask that
> + * correspond to the inodes before @agino so that we skip them.  This is how we
> + * restart an inode walk that was interrupted in the middle of an inode record.
>   */
>  STATIC int
>  xfs_iwalk_grab_ichunk(
> @@ -113,6 +113,7 @@ xfs_iwalk_grab_ichunk(
>  {
>  	int				idx;	/* index into inode chunk */
>  	int				stat;
> +	int				i;
>  	int				error = 0;
>  
>  	/* Lookup the inode chunk that this inode lives in */
> @@ -136,24 +137,20 @@ xfs_iwalk_grab_ichunk(
>  		return 0;
>  	}
>  
> -	idx = agino - irec->ir_startino + 1;
> -	if (idx < XFS_INODES_PER_CHUNK &&
> -	    (xfs_inobt_maskn(idx, XFS_INODES_PER_CHUNK - idx) & ~irec->ir_free)) {
> -		int	i;
> +	idx = agino - irec->ir_startino;
>  
> -		/* We got a right chunk with some left inodes allocated at it.
> -		 * Grab the chunk record.  Mark all the uninteresting inodes
> -		 * free -- because they're before our start point.
> -		 */
> -		for (i = 0; i < idx; i++) {
> -			if (XFS_INOBT_MASK(i) & ~irec->ir_free)
> -				irec->ir_freecount++;
> -		}
> -
> -		irec->ir_free |= xfs_inobt_maskn(0, idx);
> -		*icount = irec->ir_count - irec->ir_freecount;
> +	/*
> +	 * We got a right chunk with some left inodes allocated at it.  Grab
> +	 * the chunk record.  Mark all the uninteresting inodes free because
> +	 * they're before our start point.
> +	 */
> +	for (i = 0; i < idx; i++) {
> +		if (XFS_INOBT_MASK(i) & ~irec->ir_free)
> +			irec->ir_freecount++;
>  	}
>  
> +	irec->ir_free |= xfs_inobt_maskn(0, idx);
> +	*icount = irec->ir_count - irec->ir_freecount;
>  	return 0;
>  }
>  
> @@ -281,7 +278,7 @@ xfs_iwalk_ag_start(
>  	 * We require a lookup cache of at least two elements so that we don't
>  	 * have to deal with tearing down the cursor to walk the records.
>  	 */
> -	error = xfs_iwalk_grab_ichunk(*curpp, agino - 1, &icount,
> +	error = xfs_iwalk_grab_ichunk(*curpp, agino, &icount,
>  			&iwag->recs[iwag->nr_recs]);
>  	if (error)
>  		return error;
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 07/10] xfs: clean up long conditionals in xfs_iwalk_ichunk_ra
  2019-06-04 21:50 ` [PATCH 07/10] xfs: clean up long conditionals in xfs_iwalk_ichunk_ra Darrick J. Wong
@ 2019-06-10 19:32   ` Brian Foster
  0 siblings, 0 replies; 46+ messages in thread
From: Brian Foster @ 2019-06-10 19:32 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Jun 04, 2019 at 02:50:12PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Refactor xfs_iwalk_ichunk_ra to avoid long conditionals.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_iwalk.c |   12 ++++++------
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> 
> diff --git a/fs/xfs/xfs_iwalk.c b/fs/xfs/xfs_iwalk.c
> index 9ad017ddbae7..8595258b5001 100644
> --- a/fs/xfs/xfs_iwalk.c
> +++ b/fs/xfs/xfs_iwalk.c
> @@ -84,16 +84,16 @@ xfs_iwalk_ichunk_ra(
>  	agbno = XFS_AGINO_TO_AGBNO(mp, irec->ir_startino);
>  
>  	blk_start_plug(&plug);
> -	for (i = 0;
> -	     i < XFS_INODES_PER_CHUNK;
> -	     i += igeo->inodes_per_cluster,
> -			agbno += igeo->blocks_per_cluster) {
> -		if (xfs_inobt_maskn(i, igeo->inodes_per_cluster) &
> -		    ~irec->ir_free) {
> +	for (i = 0; i < XFS_INODES_PER_CHUNK; i += igeo->inodes_per_cluster) {
> +		xfs_inofree_t	imask;
> +
> +		imask = xfs_inobt_maskn(i, igeo->inodes_per_cluster);
> +		if (imask & ~irec->ir_free) {
>  			xfs_btree_reada_bufs(mp, agno, agbno,
>  					igeo->blocks_per_cluster,
>  					&xfs_inode_buf_ops);
>  		}
> +		agbno += igeo->blocks_per_cluster;
>  	}
>  	blk_finish_plug(&plug);
>  }
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 08/10] xfs: multithreaded iwalk implementation
  2019-06-04 21:50 ` [PATCH 08/10] xfs: multithreaded iwalk implementation Darrick J. Wong
@ 2019-06-10 19:40   ` Brian Foster
  2019-06-11  1:10     ` Darrick J. Wong
  0 siblings, 1 reply; 46+ messages in thread
From: Brian Foster @ 2019-06-10 19:40 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Jun 04, 2019 at 02:50:20PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Create a parallel iwalk implementation and switch quotacheck to use it.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---

Interesting.. is there any commonality here with the ktask mechanism
that's been in progress? I've not followed the details, but I thought it
was a similar idea. The last post I see for that is here:

https://marc.info/?l=linux-mm&m=154143701122927&w=2

That aside, this all looks mostly fine to me. A few random thoughts..

>  fs/xfs/Makefile      |    1 
>  fs/xfs/xfs_globals.c |    3 +
>  fs/xfs/xfs_iwalk.c   |   76 ++++++++++++++++++++++++++++++-
>  fs/xfs/xfs_iwalk.h   |    2 +
>  fs/xfs/xfs_pwork.c   |  122 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/xfs_pwork.h   |   50 ++++++++++++++++++++
>  fs/xfs/xfs_qm.c      |    2 -
>  fs/xfs/xfs_sysctl.h  |    6 ++
>  fs/xfs/xfs_sysfs.c   |   40 ++++++++++++++++
>  fs/xfs/xfs_trace.h   |   18 +++++++
>  10 files changed, 317 insertions(+), 3 deletions(-)
>  create mode 100644 fs/xfs/xfs_pwork.c
>  create mode 100644 fs/xfs/xfs_pwork.h
> 
> 
> diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
> index 74d30ef0dbce..48940a27d4aa 100644
> --- a/fs/xfs/Makefile
> +++ b/fs/xfs/Makefile
> @@ -84,6 +84,7 @@ xfs-y				+= xfs_aops.o \
>  				   xfs_message.o \
>  				   xfs_mount.o \
>  				   xfs_mru_cache.o \
> +				   xfs_pwork.o \
>  				   xfs_reflink.o \
>  				   xfs_stats.o \
>  				   xfs_super.o \
> diff --git a/fs/xfs/xfs_globals.c b/fs/xfs/xfs_globals.c
> index d0d377384120..4f93f2c4dc38 100644
> --- a/fs/xfs/xfs_globals.c
> +++ b/fs/xfs/xfs_globals.c
> @@ -31,6 +31,9 @@ xfs_param_t xfs_params = {
>  	.fstrm_timer	= {	1,		30*100,		3600*100},
>  	.eofb_timer	= {	1,		300,		3600*24},
>  	.cowb_timer	= {	1,		1800,		3600*24},
> +#ifdef DEBUG
> +	.pwork_threads	= {	0,		0,		NR_CPUS	},
> +#endif
>  };
>  
>  struct xfs_globals xfs_globals = {
> diff --git a/fs/xfs/xfs_iwalk.c b/fs/xfs/xfs_iwalk.c
> index 8595258b5001..71ee1628aa70 100644
> --- a/fs/xfs/xfs_iwalk.c
> +++ b/fs/xfs/xfs_iwalk.c
> @@ -21,6 +21,7 @@
>  #include "xfs_health.h"
>  #include "xfs_trans.h"
>  #include "xfs_iwalk.h"
> +#include "xfs_pwork.h"
>  
>  /*
>   * Walking Inodes in the Filesystem
> @@ -46,6 +47,9 @@
>   */
>  
>  struct xfs_iwalk_ag {
> +	/* parallel work control data; will be null if single threaded */
> +	struct xfs_pwork		pwork;
> +
>  	struct xfs_mount		*mp;
>  	struct xfs_trans		*tp;
>  
> @@ -200,6 +204,9 @@ xfs_iwalk_ag_recs(
>  		trace_xfs_iwalk_ag_rec(mp, agno, irec);
>  
>  		for (j = 0; j < XFS_INODES_PER_CHUNK; j++) {
> +			if (xfs_pwork_want_abort(&iwag->pwork))
> +				return 0;
> +
>  			/* Skip if this inode is free */
>  			if (XFS_INOBT_MASK(j) & irec->ir_free)
>  				continue;
> @@ -360,7 +367,7 @@ xfs_iwalk_ag(
>  	agino = XFS_INO_TO_AGINO(mp, iwag->startino);
>  	error = xfs_iwalk_ag_start(iwag, agno, agino, &cur, &agi_bp, &has_more);
>  
> -	while (!error && has_more) {
> +	while (!error && has_more && !xfs_pwork_want_abort(&iwag->pwork)) {
>  		struct xfs_inobt_rec_incore	*irec;
>  
>  		cond_resched();
> @@ -409,7 +416,7 @@ xfs_iwalk_ag(
>  	xfs_iwalk_del_inobt(tp, &cur, &agi_bp, error);
>  
>  	/* Walk any records left behind in the cache. */
> -	if (iwag->nr_recs == 0 || error)
> +	if (iwag->nr_recs == 0 || error || xfs_pwork_want_abort(&iwag->pwork))
>  		return error;
>  
>  	return xfs_iwalk_ag_recs(iwag);
> @@ -465,6 +472,7 @@ xfs_iwalk(
>  		.iwalk_fn	= iwalk_fn,
>  		.data		= data,
>  		.startino	= startino,
> +		.pwork		= XFS_PWORK_SINGLE_THREADED,
>  	};
>  	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, startino);
>  	int			error;
> @@ -486,3 +494,67 @@ xfs_iwalk(
>  	xfs_iwalk_free(&iwag);
>  	return error;
>  }
> +
> +/* Run per-thread iwalk work. */
> +static int
> +xfs_iwalk_ag_work(
> +	struct xfs_mount	*mp,
> +	struct xfs_pwork	*pwork)
> +{
> +	struct xfs_iwalk_ag	*iwag;
> +	int			error;
> +
> +	iwag = container_of(pwork, struct xfs_iwalk_ag, pwork);
> +	error = xfs_iwalk_alloc(iwag);
> +	if (error)
> +		goto out;

In most cases this will never fail, but the error path if it does looks
slightly painful. I was thinking if we could move this up into
xfs_iwalk_threaded() so we wouldn't continue to queue work jobs when
failure is imminent...

> +
> +	error = xfs_iwalk_ag(iwag);
> +	xfs_iwalk_free(iwag);
> +out:
> +	kmem_free(iwag);
> +	return error;
> +}
> +
> +/*
> + * Walk all the inodes in the filesystem using multiple threads to process each
> + * AG.
> + */
> +int
> +xfs_iwalk_threaded(
> +	struct xfs_mount	*mp,
> +	xfs_ino_t		startino,
> +	xfs_iwalk_fn		iwalk_fn,
> +	unsigned int		max_prefetch,
> +	void			*data)
> +{
> +	struct xfs_pwork_ctl	pctl;
> +	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, startino);
> +	unsigned int		nr_threads;
> +	int			error;
> +
> +	ASSERT(agno < mp->m_sb.sb_agcount);
> +
> +	nr_threads = xfs_pwork_guess_datadev_parallelism(mp);
> +	error = xfs_pwork_init(mp, &pctl, xfs_iwalk_ag_work, "xfs_iwalk",
> +			nr_threads);
> +	if (error)
> +		return error;
> +
> +	for (; agno < mp->m_sb.sb_agcount; agno++) {
> +		struct xfs_iwalk_ag	*iwag;
> +
> +		iwag = kmem_alloc(sizeof(struct xfs_iwalk_ag), KM_SLEEP);
> +		iwag->mp = mp;
> +		iwag->tp = NULL;
> +		iwag->iwalk_fn = iwalk_fn;
> +		iwag->data = data;
> +		iwag->startino = startino;
> +		iwag->recs = NULL;
> +		xfs_iwalk_set_prefetch(iwag, max_prefetch);
> +		xfs_pwork_queue(&pctl, &iwag->pwork);
> +		startino = XFS_AGINO_TO_INO(mp, agno + 1, 0);
> +	}

... but this is only bound by the number of AGs and so could result in a
large number of allocations. FWIW, I wouldn't expect that to be a
problem in the common case. I'm more thinking about the case of a
specially crafted filesystem designed to cause problems on mount.

> +
> +	return xfs_pwork_destroy(&pctl);
> +}
> diff --git a/fs/xfs/xfs_iwalk.h b/fs/xfs/xfs_iwalk.h
> index 45b1baabcd2d..40233a05a766 100644
> --- a/fs/xfs/xfs_iwalk.h
> +++ b/fs/xfs/xfs_iwalk.h
> @@ -14,5 +14,7 @@ typedef int (*xfs_iwalk_fn)(struct xfs_mount *mp, struct xfs_trans *tp,
>  
>  int xfs_iwalk(struct xfs_mount *mp, struct xfs_trans *tp, xfs_ino_t startino,
>  		xfs_iwalk_fn iwalk_fn, unsigned int max_prefetch, void *data);
> +int xfs_iwalk_threaded(struct xfs_mount *mp, xfs_ino_t startino,
> +		xfs_iwalk_fn iwalk_fn, unsigned int max_prefetch, void *data);
>  
>  #endif /* __XFS_IWALK_H__ */
> diff --git a/fs/xfs/xfs_pwork.c b/fs/xfs/xfs_pwork.c
> new file mode 100644
> index 000000000000..19605a3a2482
> --- /dev/null
> +++ b/fs/xfs/xfs_pwork.c
> @@ -0,0 +1,122 @@
> +// SPDX-License-Identifier: GPL-2.0+
> +/*
> + * Copyright (C) 2019 Oracle.  All Rights Reserved.
> + * Author: Darrick J. Wong <darrick.wong@oracle.com>
> + */
> +#include "xfs.h"
> +#include "xfs_fs.h"
> +#include "xfs_shared.h"
> +#include "xfs_format.h"
> +#include "xfs_log_format.h"
> +#include "xfs_trans_resv.h"
> +#include "xfs_mount.h"
> +#include "xfs_trace.h"
> +#include "xfs_sysctl.h"
> +#include "xfs_pwork.h"
> +
> +/*
> + * Parallel Work Queue
> + * ===================
> + *
> + * Abstract away the details of running a large and "obviously" parallelizable
> + * task across multiple CPUs.  Callers initialize the pwork control object with
> + * a desired level of parallelization and a work function.  Next, they embed
> + * struct xfs_pwork in whatever structure they use to pass work context to a
> + * worker thread and queue that pwork.  The work function will be passed the
> + * pwork item when it is run (from process context) and any returned error will
> + * cause all threads to abort.
> + * 

FYI minor whitespace damage (trailing space) on the line above.

> + * This is the rough equivalent of the xfsprogs workqueue code, though we can't
> + * reuse that name here.
> + */
> +
> +/* Invoke our caller's function. */
> +static void
> +xfs_pwork_work(
> +	struct work_struct	*work)
> +{
> +	struct xfs_pwork	*pwork;
> +	struct xfs_pwork_ctl	*pctl;
> +	int			error;
> +
> +	pwork = container_of(work, struct xfs_pwork, work);
> +	pctl = pwork->pctl;
> +	error = pctl->work_fn(pctl->mp, pwork);
> +	if (error && !pctl->error)
> +		pctl->error = error;
> +}
> +
> +/*
> + * Set up control data for parallel work.  @work_fn is the function that will
> + * be called.  @tag will be written into the kernel threads.  @nr_threads is
> + * the level of parallelism desired, or 0 for no limit.
> + */
> +int
> +xfs_pwork_init(
> +	struct xfs_mount	*mp,
> +	struct xfs_pwork_ctl	*pctl,
> +	xfs_pwork_work_fn	work_fn,
> +	const char		*tag,
> +	unsigned int		nr_threads)
> +{
> +#ifdef DEBUG
> +	if (xfs_globals.pwork_threads > 0)
> +		nr_threads = xfs_globals.pwork_threads;
> +#endif

Hmm, it might be useful to have the ability to force the no limit case
from the debug knob. Can we use -1 or something here for "disabled?"

> +	trace_xfs_pwork_init(mp, nr_threads, current->pid);
> +
> +	pctl->wq = alloc_workqueue("%s-%d", WQ_FREEZABLE, nr_threads, tag,
> +			current->pid);
> +	if (!pctl->wq)
> +		return -ENOMEM;
> +	pctl->work_fn = work_fn;
> +	pctl->error = 0;
> +	pctl->mp = mp;
> +
> +	return 0;
> +}
> +
> +/* Queue some parallel work. */
> +void
> +xfs_pwork_queue(
> +	struct xfs_pwork_ctl	*pctl,
> +	struct xfs_pwork	*pwork)
> +{
> +	INIT_WORK(&pwork->work, xfs_pwork_work);
> +	pwork->pctl = pctl;
> +	queue_work(pctl->wq, &pwork->work);
> +}
> +
> +/* Wait for the work to finish and tear down the control structure. */
> +int
> +xfs_pwork_destroy(
> +	struct xfs_pwork_ctl	*pctl)
> +{
> +	destroy_workqueue(pctl->wq);
> +	pctl->wq = NULL;
> +	return pctl->error;
> +}
> +
> +/*
> + * Return the amount of parallelism that the data device can handle, or 0 for
> + * no limit.
> + */
> +unsigned int
> +xfs_pwork_guess_datadev_parallelism(
> +	struct xfs_mount	*mp)
> +{
> +	struct xfs_buftarg	*btp = mp->m_ddev_targp;
> +	int			iomin;
> +	int			ioopt;
> +
> +	if (blk_queue_nonrot(btp->bt_bdev->bd_queue))
> +		return num_online_cpus();
> +	if (mp->m_sb.sb_width && mp->m_sb.sb_unit)
> +		return mp->m_sb.sb_width / mp->m_sb.sb_unit;
> +	iomin = bdev_io_min(btp->bt_bdev);
> +	ioopt = bdev_io_opt(btp->bt_bdev);
> +	if (iomin && ioopt)
> +		return ioopt / iomin;
> +
> +	return 1;

Have you collected any performance data related to these heuristics? I
assume the feature is generally a win, but this also seems like we have
a large window of variance here. E.g., an SSD on a server with hundreds
of CPUs will enable as many threads as CPUs, but a single xTB spindle on
the same box may run single threaded (a quick check of a few local
devices all return an optimal I/O size of 0). Is there really no benefit
parallelizing some of that work in the spinning rust case? What about in
the other direction where we might have a ton of threads for inodes
across AGs that all happen to be in the same project quota, for example?

Brian

> +}
> diff --git a/fs/xfs/xfs_pwork.h b/fs/xfs/xfs_pwork.h
> new file mode 100644
> index 000000000000..e0c1354a2d8c
> --- /dev/null
> +++ b/fs/xfs/xfs_pwork.h
> @@ -0,0 +1,50 @@
> +// SPDX-License-Identifier: GPL-2.0+
> +/*
> + * Copyright (C) 2019 Oracle.  All Rights Reserved.
> + * Author: Darrick J. Wong <darrick.wong@oracle.com>
> + */
> +#ifndef __XFS_PWORK_H__
> +#define __XFS_PWORK_H__
> +
> +struct xfs_pwork;
> +struct xfs_mount;
> +
> +typedef int (*xfs_pwork_work_fn)(struct xfs_mount *mp, struct xfs_pwork *pwork);
> +
> +/*
> + * Parallel work coordination structure.
> + */
> +struct xfs_pwork_ctl {
> +	struct workqueue_struct	*wq;
> +	struct xfs_mount	*mp;
> +	xfs_pwork_work_fn	work_fn;
> +	int			error;
> +};
> +
> +/*
> + * Embed this parallel work control item inside your own work structure,
> + * then queue work with it.
> + */
> +struct xfs_pwork {
> +	struct work_struct	work;
> +	struct xfs_pwork_ctl	*pctl;
> +};
> +
> +#define XFS_PWORK_SINGLE_THREADED	{ .pctl = NULL }
> +
> +/* Have we been told to abort? */
> +static inline bool
> +xfs_pwork_want_abort(
> +	struct xfs_pwork	*pwork)
> +{
> +	return pwork->pctl && pwork->pctl->error;
> +}
> +
> +int xfs_pwork_init(struct xfs_mount *mp, struct xfs_pwork_ctl *pctl,
> +		xfs_pwork_work_fn work_fn, const char *tag,
> +		unsigned int nr_threads);
> +void xfs_pwork_queue(struct xfs_pwork_ctl *pctl, struct xfs_pwork *pwork);
> +int xfs_pwork_destroy(struct xfs_pwork_ctl *pctl);
> +unsigned int xfs_pwork_guess_datadev_parallelism(struct xfs_mount *mp);
> +
> +#endif /* __XFS_PWORK_H__ */
> diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
> index a5b2260406a8..e4f3785f7a64 100644
> --- a/fs/xfs/xfs_qm.c
> +++ b/fs/xfs/xfs_qm.c
> @@ -1305,7 +1305,7 @@ xfs_qm_quotacheck(
>  		flags |= XFS_PQUOTA_CHKD;
>  	}
>  
> -	error = xfs_iwalk(mp, NULL, 0, xfs_qm_dqusage_adjust, 0, NULL);
> +	error = xfs_iwalk_threaded(mp, 0, xfs_qm_dqusage_adjust, 0, NULL);
>  	if (error)
>  		goto error_return;
>  
> diff --git a/fs/xfs/xfs_sysctl.h b/fs/xfs/xfs_sysctl.h
> index ad7f9be13087..b555e045e2f4 100644
> --- a/fs/xfs/xfs_sysctl.h
> +++ b/fs/xfs/xfs_sysctl.h
> @@ -37,6 +37,9 @@ typedef struct xfs_param {
>  	xfs_sysctl_val_t fstrm_timer;	/* Filestream dir-AG assoc'n timeout. */
>  	xfs_sysctl_val_t eofb_timer;	/* Interval between eofb scan wakeups */
>  	xfs_sysctl_val_t cowb_timer;	/* Interval between cowb scan wakeups */
> +#ifdef DEBUG
> +	xfs_sysctl_val_t pwork_threads;	/* Parallel workqueue thread count */
> +#endif
>  } xfs_param_t;
>  
>  /*
> @@ -82,6 +85,9 @@ enum {
>  extern xfs_param_t	xfs_params;
>  
>  struct xfs_globals {
> +#ifdef DEBUG
> +	int	pwork_threads;		/* parallel workqueue threads */
> +#endif
>  	int	log_recovery_delay;	/* log recovery delay (secs) */
>  	int	mount_delay;		/* mount setup delay (secs) */
>  	bool	bug_on_assert;		/* BUG() the kernel on assert failure */
> diff --git a/fs/xfs/xfs_sysfs.c b/fs/xfs/xfs_sysfs.c
> index cabda13f3c64..910e6b9cb1a7 100644
> --- a/fs/xfs/xfs_sysfs.c
> +++ b/fs/xfs/xfs_sysfs.c
> @@ -206,11 +206,51 @@ always_cow_show(
>  }
>  XFS_SYSFS_ATTR_RW(always_cow);
>  
> +#ifdef DEBUG
> +/*
> + * Override how many threads the parallel work queue is allowed to create.
> + * This has to be a debug-only global (instead of an errortag) because one of
> + * the main users of parallel workqueues is mount time quotacheck.
> + */
> +STATIC ssize_t
> +pwork_threads_store(
> +	struct kobject	*kobject,
> +	const char	*buf,
> +	size_t		count)
> +{
> +	int		ret;
> +	int		val;
> +
> +	ret = kstrtoint(buf, 0, &val);
> +	if (ret)
> +		return ret;
> +
> +	if (val < 0 || val > NR_CPUS)
> +		return -EINVAL;
> +
> +	xfs_globals.pwork_threads = val;
> +
> +	return count;
> +}
> +
> +STATIC ssize_t
> +pwork_threads_show(
> +	struct kobject	*kobject,
> +	char		*buf)
> +{
> +	return snprintf(buf, PAGE_SIZE, "%d\n", xfs_globals.pwork_threads);
> +}
> +XFS_SYSFS_ATTR_RW(pwork_threads);
> +#endif /* DEBUG */
> +
>  static struct attribute *xfs_dbg_attrs[] = {
>  	ATTR_LIST(bug_on_assert),
>  	ATTR_LIST(log_recovery_delay),
>  	ATTR_LIST(mount_delay),
>  	ATTR_LIST(always_cow),
> +#ifdef DEBUG
> +	ATTR_LIST(pwork_threads),
> +#endif
>  	NULL,
>  };
>  
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index f9bb1d50bc0e..658cbade1998 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -3556,6 +3556,24 @@ TRACE_EVENT(xfs_iwalk_ag_rec,
>  		  __entry->startino, __entry->freemask)
>  )
>  
> +TRACE_EVENT(xfs_pwork_init,
> +	TP_PROTO(struct xfs_mount *mp, unsigned int nr_threads, pid_t pid),
> +	TP_ARGS(mp, nr_threads, pid),
> +	TP_STRUCT__entry(
> +		__field(dev_t, dev)
> +		__field(unsigned int, nr_threads)
> +		__field(pid_t, pid)
> +	),
> +	TP_fast_assign(
> +		__entry->dev = mp->m_super->s_dev;
> +		__entry->nr_threads = nr_threads;
> +		__entry->pid = pid;
> +	),
> +	TP_printk("dev %d:%d nr_threads %u pid %u",
> +		  MAJOR(__entry->dev), MINOR(__entry->dev),
> +		  __entry->nr_threads, __entry->pid)
> +)
> +
>  #endif /* _TRACE_XFS_H */
>  
>  #undef TRACE_INCLUDE_PATH
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 01/10] xfs: create simplified inode walk function
  2019-06-10 17:55       ` Brian Foster
@ 2019-06-10 23:11         ` Darrick J. Wong
  2019-06-11 22:33           ` Dave Chinner
  0 siblings, 1 reply; 46+ messages in thread
From: Darrick J. Wong @ 2019-06-10 23:11 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Mon, Jun 10, 2019 at 01:55:10PM -0400, Brian Foster wrote:
> On Mon, Jun 10, 2019 at 09:59:09AM -0700, Darrick J. Wong wrote:
> > On Mon, Jun 10, 2019 at 09:58:19AM -0400, Brian Foster wrote:
> > > On Tue, Jun 04, 2019 at 02:49:34PM -0700, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > 
> > > > Create a new iterator function to simplify walking inodes in an XFS
> > > > filesystem.  This new iterator will replace the existing open-coded
> > > > walking that goes on in various places.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > ---
> > > >  fs/xfs/Makefile                  |    1 
> > > >  fs/xfs/libxfs/xfs_ialloc_btree.c |   31 +++
> > > >  fs/xfs/libxfs/xfs_ialloc_btree.h |    3 
> > > >  fs/xfs/xfs_itable.c              |    5 
> > > >  fs/xfs/xfs_itable.h              |    8 +
> > > >  fs/xfs/xfs_iwalk.c               |  400 ++++++++++++++++++++++++++++++++++++++
> > > >  fs/xfs/xfs_iwalk.h               |   18 ++
> > > >  fs/xfs/xfs_trace.h               |   40 ++++
> > > >  8 files changed, 502 insertions(+), 4 deletions(-)
> > > >  create mode 100644 fs/xfs/xfs_iwalk.c
> > > >  create mode 100644 fs/xfs/xfs_iwalk.h
> > > > 
> > > > 
> > > ...
> > > > diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
> > > > index ac4b65da4c2b..cb7eac2f51c0 100644
> > > > --- a/fs/xfs/libxfs/xfs_ialloc_btree.c
> > > > +++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
> ...
> > > > +}
> > > > +
> > > > +/*
> > > > + * Given the number of inodes to prefetch, set the number of inobt records that
> > > > + * we cache in memory, which controls the number of inodes we try to read
> > > > + * ahead.
> > > > + *
> > > > + * If no max prefetch was given, default to 4096 bytes' worth of inobt records;
> > > > + * this should be plenty of inodes to read ahead.  This number (256 inobt
> > > > + * records) was chosen so that the cache is never more than a single memory
> > > > + * page.
> > > > + */
> > > > +static inline void
> > > > +xfs_iwalk_set_prefetch(
> > > > +	struct xfs_iwalk_ag	*iwag,
> > > > +	unsigned int		max_prefetch)
> > > > +{
> > > > +	if (max_prefetch)
> > > > +		iwag->sz_recs = round_up(max_prefetch, XFS_INODES_PER_CHUNK) /
> > > > +					XFS_INODES_PER_CHUNK;
> > > > +	else
> > > > +		iwag->sz_recs = 4096 / sizeof(struct xfs_inobt_rec_incore);
> > > > +
> > > 
> > > Perhaps this should use PAGE_SIZE or a related macro?
> > 
> > It did in the previous revision, but Dave pointed out that sz_recs then
> > becomes quite large on a system with 64k pages...
> > 
> > 65536 bytes / 16 bytes per inobt record = 4096 records
> > 4096 records * 64 inodes per record = 262144 inodes
> > 262144 inodes * 512 bytes per inode = 128MB of inode readahead
> > 
> 
> Ok, the comment just gave me the impression the intent was to fill a
> single page.
> 
> > I could extend the comment to explain why we don't use PAGE_SIZE...
> > 
> 
> Sounds good, though what I think would be better is to define a
> IWALK_DEFAULT_RECS or some such somewhere and put the calculation
> details with that.
> 
> Though now that you point out the readahead thing, aren't we at risk of
> a similar problem for users who happen to pass a really large userspace
> buffer? Should we cap the kernel allocation/readahead window in all
> cases and not just the default case?

Hmm, that's right, we don't want to let userspace arbitrarily determine
the size of the buffer, and I think the current implementation caps it
the readahaead at ... oh, PAGE_SIZE / sizeof(xfs_inogrp_t).

Oh, right, and in the V1 patchset Dave said that we should constrain
readahead even further.

--D

> 
> Brian
> 
> > /*
> >  * Note: We hardcode 4096 here (instead of, say, PAGE_SIZE) because we want to
> >  * constrain the amount of inode readahead to 16k inodes regardless of CPU:
> >  *
> >  * 4096 bytes / 16 bytes per inobt record = 256 inobt records
> >  * 256 inobt records * 64 inodes per record = 16384 inodes
> >  * 16384 inodes * 512 bytes per inode(?) = 8MB of inode readahead
> >  */
> > 
> > --D
> > 
> > > Brian
> > > 
> > > > +	/*
> > > > +	 * Allocate enough space to prefetch at least two records so that we
> > > > +	 * can cache both the inobt record where the iwalk started and the next
> > > > +	 * record.  This simplifies the AG inode walk loop setup code.
> > > > +	 */
> > > > +	iwag->sz_recs = max_t(unsigned int, iwag->sz_recs, 2);
> > > > +}
> > > > +
> > > > +/*
> > > > + * Walk all inodes in the filesystem starting from @startino.  The @iwalk_fn
> > > > + * will be called for each allocated inode, being passed the inode's number and
> > > > + * @data.  @max_prefetch controls how many inobt records' worth of inodes we
> > > > + * try to readahead.
> > > > + */
> > > > +int
> > > > +xfs_iwalk(
> > > > +	struct xfs_mount	*mp,
> > > > +	struct xfs_trans	*tp,
> > > > +	xfs_ino_t		startino,
> > > > +	xfs_iwalk_fn		iwalk_fn,
> > > > +	unsigned int		max_prefetch,
> > > > +	void			*data)
> > > > +{
> > > > +	struct xfs_iwalk_ag	iwag = {
> > > > +		.mp		= mp,
> > > > +		.tp		= tp,
> > > > +		.iwalk_fn	= iwalk_fn,
> > > > +		.data		= data,
> > > > +		.startino	= startino,
> > > > +	};
> > > > +	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, startino);
> > > > +	int			error;
> > > > +
> > > > +	ASSERT(agno < mp->m_sb.sb_agcount);
> > > > +
> > > > +	xfs_iwalk_set_prefetch(&iwag, max_prefetch);
> > > > +	error = xfs_iwalk_alloc(&iwag);
> > > > +	if (error)
> > > > +		return error;
> > > > +
> > > > +	for (; agno < mp->m_sb.sb_agcount; agno++) {
> > > > +		error = xfs_iwalk_ag(&iwag);
> > > > +		if (error)
> > > > +			break;
> > > > +		iwag.startino = XFS_AGINO_TO_INO(mp, agno + 1, 0);
> > > > +	}
> > > > +
> > > > +	xfs_iwalk_free(&iwag);
> > > > +	return error;
> > > > +}
> > > > diff --git a/fs/xfs/xfs_iwalk.h b/fs/xfs/xfs_iwalk.h
> > > > new file mode 100644
> > > > index 000000000000..45b1baabcd2d
> > > > --- /dev/null
> > > > +++ b/fs/xfs/xfs_iwalk.h
> > > > @@ -0,0 +1,18 @@
> > > > +// SPDX-License-Identifier: GPL-2.0+
> > > > +/*
> > > > + * Copyright (C) 2019 Oracle.  All Rights Reserved.
> > > > + * Author: Darrick J. Wong <darrick.wong@oracle.com>
> > > > + */
> > > > +#ifndef __XFS_IWALK_H__
> > > > +#define __XFS_IWALK_H__
> > > > +
> > > > +/* Walk all inodes in the filesystem starting from @startino. */
> > > > +typedef int (*xfs_iwalk_fn)(struct xfs_mount *mp, struct xfs_trans *tp,
> > > > +			    xfs_ino_t ino, void *data);
> > > > +/* Return value (for xfs_iwalk_fn) that aborts the walk immediately. */
> > > > +#define XFS_IWALK_ABORT	(1)
> > > > +
> > > > +int xfs_iwalk(struct xfs_mount *mp, struct xfs_trans *tp, xfs_ino_t startino,
> > > > +		xfs_iwalk_fn iwalk_fn, unsigned int max_prefetch, void *data);
> > > > +
> > > > +#endif /* __XFS_IWALK_H__ */
> > > > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > > > index 2464ea351f83..f9bb1d50bc0e 100644
> > > > --- a/fs/xfs/xfs_trace.h
> > > > +++ b/fs/xfs/xfs_trace.h
> > > > @@ -3516,6 +3516,46 @@ DEFINE_EVENT(xfs_inode_corrupt_class, name,	\
> > > >  DEFINE_INODE_CORRUPT_EVENT(xfs_inode_mark_sick);
> > > >  DEFINE_INODE_CORRUPT_EVENT(xfs_inode_mark_healthy);
> > > >  
> > > > +TRACE_EVENT(xfs_iwalk_ag,
> > > > +	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
> > > > +		 xfs_agino_t startino),
> > > > +	TP_ARGS(mp, agno, startino),
> > > > +	TP_STRUCT__entry(
> > > > +		__field(dev_t, dev)
> > > > +		__field(xfs_agnumber_t, agno)
> > > > +		__field(xfs_agino_t, startino)
> > > > +	),
> > > > +	TP_fast_assign(
> > > > +		__entry->dev = mp->m_super->s_dev;
> > > > +		__entry->agno = agno;
> > > > +		__entry->startino = startino;
> > > > +	),
> > > > +	TP_printk("dev %d:%d agno %d startino %u",
> > > > +		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->agno,
> > > > +		  __entry->startino)
> > > > +)
> > > > +
> > > > +TRACE_EVENT(xfs_iwalk_ag_rec,
> > > > +	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
> > > > +		 struct xfs_inobt_rec_incore *irec),
> > > > +	TP_ARGS(mp, agno, irec),
> > > > +	TP_STRUCT__entry(
> > > > +		__field(dev_t, dev)
> > > > +		__field(xfs_agnumber_t, agno)
> > > > +		__field(xfs_agino_t, startino)
> > > > +		__field(uint64_t, freemask)
> > > > +	),
> > > > +	TP_fast_assign(
> > > > +		__entry->dev = mp->m_super->s_dev;
> > > > +		__entry->agno = agno;
> > > > +		__entry->startino = irec->ir_startino;
> > > > +		__entry->freemask = irec->ir_free;
> > > > +	),
> > > > +	TP_printk("dev %d:%d agno %d startino %u freemask 0x%llx",
> > > > +		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->agno,
> > > > +		  __entry->startino, __entry->freemask)
> > > > +)
> > > > +
> > > >  #endif /* _TRACE_XFS_H */
> > > >  
> > > >  #undef TRACE_INCLUDE_PATH
> > > > 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 04/10] xfs: convert bulkstat to new iwalk infrastructure
  2019-06-10 18:29       ` Brian Foster
@ 2019-06-10 23:42         ` Darrick J. Wong
  0 siblings, 0 replies; 46+ messages in thread
From: Darrick J. Wong @ 2019-06-10 23:42 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Mon, Jun 10, 2019 at 02:29:21PM -0400, Brian Foster wrote:
> On Mon, Jun 10, 2019 at 10:38:39AM -0700, Darrick J. Wong wrote:
> > On Mon, Jun 10, 2019 at 10:02:29AM -0400, Brian Foster wrote:
> > > On Tue, Jun 04, 2019 at 02:49:53PM -0700, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > 
> > > > Create a new ibulk structure incore to help us deal with bulk inode stat
> > > > state tracking and then convert the bulkstat code to use the new iwalk
> > > > iterator.  This disentangles inode walking from bulk stat control for
> > > > simpler code and enables us to isolate the formatter functions to the
> > > > ioctl handling code.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > ---
> > > >  fs/xfs/xfs_ioctl.c   |   65 ++++++--
> > > >  fs/xfs/xfs_ioctl.h   |    5 +
> > > >  fs/xfs/xfs_ioctl32.c |   88 +++++------
> > > >  fs/xfs/xfs_itable.c  |  407 ++++++++++++++------------------------------------
> > > >  fs/xfs/xfs_itable.h  |   79 ++++------
> > > >  5 files changed, 245 insertions(+), 399 deletions(-)
> > > > 
> > > > 
> > > > diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
> > > > index 5ffbdcff3dba..43734901aeb9 100644
> > > > --- a/fs/xfs/xfs_ioctl.c
> > > > +++ b/fs/xfs/xfs_ioctl.c
> > > ...
> > > > @@ -745,35 +757,54 @@ xfs_ioc_bulkstat(
> > > >  	if (copy_from_user(&bulkreq, arg, sizeof(xfs_fsop_bulkreq_t)))
> > > >  		return -EFAULT;
> > > >  
> > > > -	if (copy_from_user(&inlast, bulkreq.lastip, sizeof(__s64)))
> > > > +	if (copy_from_user(&lastino, bulkreq.lastip, sizeof(__s64)))
> > > >  		return -EFAULT;
> > > >  
> > > > -	if ((count = bulkreq.icount) <= 0)
> > > > +	if (bulkreq.icount <= 0)
> > > >  		return -EINVAL;
> > > >  
> > > >  	if (bulkreq.ubuffer == NULL)
> > > >  		return -EINVAL;
> > > >  
> > > > -	if (cmd == XFS_IOC_FSINUMBERS)
> > > > -		error = xfs_inumbers(mp, &inlast, &count,
> > > > +	breq.ubuffer = bulkreq.ubuffer;
> > > > +	breq.icount = bulkreq.icount;
> > > > +
> > > > +	/*
> > > > +	 * FSBULKSTAT_SINGLE expects that *lastip contains the inode number
> > > > +	 * that we want to stat.  However, FSINUMBERS and FSBULKSTAT expect
> > > > +	 * that *lastip contains either zero or the number of the last inode to
> > > > +	 * be examined by the previous call and return results starting with
> > > > +	 * the next inode after that.  The new bulk request functions take the
> > > > +	 * inode to start with, so we have to adjust the lastino/startino
> > > > +	 * parameter to maintain correct function.
> > > > +	 */
> > > 
> > > It's kind of difficult to tell what's new or old when just looking at
> > > the code. The comment suggests FSINUMBERS and FSBULKSTAT use the same
> > > interface wrt to lastip, but they aren't handled the same below. I take
> > > it this is because xfs_inumbers() still has the same behavior whereas
> > > xfs_bulkstat() has been changed to operate based on breq rather than
> > > lastip..?
> > 
> > Yes.  By the end of the series we'll have converted FSINUMBERS too, but
> > for the ~5 or so patches until we get there, the only way to tell new
> > vs. old interface is whether it takes breq or pointers to stuff in breq.
> > 
> > > > +	if (cmd == XFS_IOC_FSINUMBERS) {
> > > > +		int	count = breq.icount;
> > > > +
> > > > +		breq.startino = lastino;
> > > > +		error = xfs_inumbers(mp, &breq.startino, &count,
> > > >  					bulkreq.ubuffer, xfs_inumbers_fmt);
> > > > -	else if (cmd == XFS_IOC_FSBULKSTAT_SINGLE)
> > > > -		error = xfs_bulkstat_one(mp, inlast, bulkreq.ubuffer,
> > > > -					sizeof(xfs_bstat_t), NULL, &done);
> > > > -	else	/* XFS_IOC_FSBULKSTAT */
> > > > -		error = xfs_bulkstat(mp, &inlast, &count, xfs_bulkstat_one,
> > > > -				     sizeof(xfs_bstat_t), bulkreq.ubuffer,
> > > > -				     &done);
> > > > +		breq.ocount = count;
> > > > +		lastino = breq.startino;
> > > > +	} else if (cmd == XFS_IOC_FSBULKSTAT_SINGLE) {
> > > > +		breq.startino = lastino;
> > > > +		error = xfs_bulkstat_one(&breq, xfs_bulkstat_one_fmt);
> > > > +		lastino = breq.startino;
> > > > +	} else {	/* XFS_IOC_FSBULKSTAT */
> > > > +		breq.startino = lastino + 1;
> > > > +		error = xfs_bulkstat(&breq, xfs_bulkstat_one_fmt);
> > > > +		lastino = breq.startino - 1;
> > > > +	}
> > > >  
> > > ...
> > > > diff --git a/fs/xfs/xfs_ioctl32.c b/fs/xfs/xfs_ioctl32.c
> > > > index 814ffe6fbab7..add15819daf3 100644
> > > > --- a/fs/xfs/xfs_ioctl32.c
> > > > +++ b/fs/xfs/xfs_ioctl32.c
> > > ...
> > > > @@ -284,38 +263,57 @@ xfs_compat_ioc_bulkstat(
> > > >  		return -EFAULT;
> > > >  	bulkreq.ocount = compat_ptr(addr);
> > > >  
> > > > -	if (copy_from_user(&inlast, bulkreq.lastip, sizeof(__s64)))
> > > > +	if (copy_from_user(&lastino, bulkreq.lastip, sizeof(__s64)))
> > > >  		return -EFAULT;
> > > > +	breq.startino = lastino + 1;
> > > >  
> > > > -	if ((count = bulkreq.icount) <= 0)
> > > > +	if (bulkreq.icount <= 0)
> > > >  		return -EINVAL;
> > > >  
> > > >  	if (bulkreq.ubuffer == NULL)
> > > >  		return -EINVAL;
> > > >  
> > > > +	breq.ubuffer = bulkreq.ubuffer;
> > > > +	breq.icount = bulkreq.icount;
> > > > +
> > > > +	/*
> > > > +	 * FSBULKSTAT_SINGLE expects that *lastip contains the inode number
> > > > +	 * that we want to stat.  However, FSINUMBERS and FSBULKSTAT expect
> > > > +	 * that *lastip contains either zero or the number of the last inode to
> > > > +	 * be examined by the previous call and return results starting with
> > > > +	 * the next inode after that.  The new bulk request functions take the
> > > > +	 * inode to start with, so we have to adjust the lastino/startino
> > > > +	 * parameter to maintain correct function.
> > > > +	 */
> > > 
> > > (Same comment here.)
> > > 
> > > >  	if (cmd == XFS_IOC_FSINUMBERS_32) {
> > > > -		error = xfs_inumbers(mp, &inlast, &count,
> > > > +		int	count = breq.icount;
> > > > +
> > > > +		breq.startino = lastino;
> > > > +		error = xfs_inumbers(mp, &breq.startino, &count,
> > > >  				bulkreq.ubuffer, inumbers_func);
> > > > +		breq.ocount = count;
> > > > +		lastino = breq.startino;
> > > >  	} else if (cmd == XFS_IOC_FSBULKSTAT_SINGLE_32) {
> > > > -		int res;
> > > > -
> > > > -		error = bs_one_func(mp, inlast, bulkreq.ubuffer,
> > > > -				bs_one_size, NULL, &res);
> > > > +		breq.startino = lastino;
> > > > +		error = xfs_bulkstat_one(&breq, bs_one_func);
> > > > +		lastino = breq.startino;
> > > >  	} else if (cmd == XFS_IOC_FSBULKSTAT_32) {
> > > > -		error = xfs_bulkstat(mp, &inlast, &count,
> > > > -			bs_one_func, bs_one_size,
> > > > -			bulkreq.ubuffer, &done);
> > > > -	} else
> > > > +		breq.startino = lastino + 1;
> > > > +		error = xfs_bulkstat(&breq, bs_one_func);
> > > > +		lastino = breq.startino - 1;
> > > > +	} else {
> > > >  		error = -EINVAL;
> > > > +	}
> > > >  	if (error)
> > > >  		return error;
> > > >  
> > > > +	lastino = breq.startino - 1;
> > > 
> > > Should this be here?
> > 
> > Nope.
> > 
> > > >  	if (bulkreq.lastip != NULL &&
> > > > -	    copy_to_user(bulkreq.lastip, &inlast, sizeof(xfs_ino_t)))
> > > > +	    copy_to_user(bulkreq.lastip, &lastino, sizeof(xfs_ino_t)))
> > > >  		return -EFAULT;
> > > >  
> > > >  	if (bulkreq.ocount != NULL &&
> > > > -	    copy_to_user(bulkreq.ocount, &count, sizeof(count)))
> > > > +	    copy_to_user(bulkreq.ocount, &breq.ocount, sizeof(__s32)))
> > > >  		return -EFAULT;
> > > >  
> > > >  	return 0;
> > > > diff --git a/fs/xfs/xfs_itable.c b/fs/xfs/xfs_itable.c
> > > > index 3ca1c454afe6..87c597ea1df7 100644
> > > > --- a/fs/xfs/xfs_itable.c
> > > > +++ b/fs/xfs/xfs_itable.c
> > > > @@ -22,37 +22,63 @@
> > > >  #include "xfs_iwalk.h"
> > > >  
> > > >  /*
> > > > - * Return stat information for one inode.
> > > > - * Return 0 if ok, else errno.
> > > > + * Bulk Stat
> > > > + * =========
> > > > + *
> > > > + * Use the inode walking functions to fill out struct xfs_bstat for every
> > > > + * allocated inode, then pass the stat information to some externally provided
> > > > + * iteration function.
> > > >   */
> > > > -int
> > > > +
> > > > +struct xfs_bstat_chunk {
> > > > +	bulkstat_one_fmt_pf	formatter;
> > > > +	struct xfs_ibulk	*breq;
> > > > +};
> > > > +
> > > > +/*
> > > > + * Fill out the bulkstat info for a single inode and report it somewhere.
> > > > + *
> > > > + * bc->breq->lastino is effectively the inode cursor as we walk through the
> > > > + * filesystem.  Therefore, we update it any time we need to move the cursor
> > > > + * forward, regardless of whether or not we're sending any bstat information
> > > > + * back to userspace.  If the inode is internal metadata or, has been freed
> > > > + * out from under us, we just simply keep going.
> > > > + *
> > > > + * However, if any other type of error happens we want to stop right where we
> > > > + * are so that userspace will call back with exact number of the bad inode and
> > > > + * we can send back an error code.
> > > > + *
> > > > + * Note that if the formatter tells us there's no space left in the buffer we
> > > > + * move the cursor forward and abort the walk.
> > > > + */
> > > > +STATIC int
> > > >  xfs_bulkstat_one_int(
> > > > -	struct xfs_mount	*mp,		/* mount point for filesystem */
> > > > -	xfs_ino_t		ino,		/* inode to get data for */
> > > > -	void __user		*buffer,	/* buffer to place output in */
> > > > -	int			ubsize,		/* size of buffer */
> > > > -	bulkstat_one_fmt_pf	formatter,	/* formatter, copy to user */
> > > > -	int			*ubused,	/* bytes used by me */
> > > > -	int			*stat)		/* BULKSTAT_RV_... */
> > > > +	struct xfs_mount	*mp,
> > > > +	struct xfs_trans	*tp,
> > > > +	xfs_ino_t		ino,
> > > > +	void			*data)
> > > >  {
> > > 
> > > Any reason this function takes an 'ino' param considering it's sourced
> > > from breq->startino and we bump that value from within this function?
> > > The latter seems slightly misplaced to me since it doesn't appear to
> > > control the iteration.
> > >
> > > It also looks like we bump startino in almost all cases. Exceptions are
> > > memory allocation failure of the buffer and formatter error. Hmm.. could
> > > we lift the buffer to the bc and reuse it to avoid that error entirely
> > > (along with repeated allocs/frees)? From there, perhaps we could lift
> > > the ->startino update to the callers based on something like error !=
> > > -EFAULT..? (Alternatively, the caller could update it first and then
> > > walk it back if error == -EFAULT).
> > 
> > xfs_iwalk doesn't know anything about the xfs_bstat_chunk or the
> > xfs_ibulk that are being passed to the iterator function via the void
> > *data parameter.  iwalk is a generic iterator which shouldn't ever know
> > about the bulkreq interface.
> > 
> 
> Sure, I'm not suggesting to push anything into the iwalk code. That's
> obviously a separate layer. I'm referring to the direct callers of
> xfs_bulkstat_one_int(), which is still bulkstat code AFAICT.

OH!

Heh, I had wondered if maybe you were talking about xfs_bulkstat() and
xfs_bulkstat_one().  Maybe I should've drank more coffee this morning.

Uh, the buffer allocation thing definitely can move out to the
callers... 

> > I plan to reuse the xfs_iwalk code for other parts of online repair in
> > the future, so that's why bulkstat_one takes the inode number that
> > _iwalk gives it, and then uses that to mess around with bc and breq.
> > 
> 
> Ok. BTW, looking again I notice that xfs_bulkstat_one_int() passes bc as
> a void pointer. That could probably be fixed up to take the
> xfs_bstat_chunk from the current callers.
> 
> > I also sort of prefer to keep the startino update the way it is now
> > because I only want to push it forward for the ~4 cases where we do now
> > (internal, invalid number, past eofs, or successfully statted).
> > Any other runtime error leaves the cursor where it was.
> > 
> 
> Fair enough, though it could also be cleaned up a bit with an exit label
> IMO.

...and having an exit label to make it clearer why we do or don't push
startino forward sounds like a good idea.

--D

> 
> Brian
> 
> > > > +	struct xfs_bstat_chunk	*bc = data;
> > > >  	struct xfs_icdinode	*dic;		/* dinode core info pointer */
> > > >  	struct xfs_inode	*ip;		/* incore inode pointer */
> > > >  	struct inode		*inode;
> > > >  	struct xfs_bstat	*buf;		/* return buffer */
> > > >  	int			error = 0;	/* error value */
> > > >  
> > > > -	*stat = BULKSTAT_RV_NOTHING;
> > > > -
> > > > -	if (!buffer || xfs_internal_inum(mp, ino))
> > > > +	if (xfs_internal_inum(mp, ino)) {
> > > > +		bc->breq->startino = ino + 1;
> > > >  		return -EINVAL;
> > > > +	}
> > > >  
> > > >  	buf = kmem_zalloc(sizeof(*buf), KM_SLEEP | KM_MAYFAIL);
> > > >  	if (!buf)
> > > >  		return -ENOMEM;
> > > >  
> > > > -	error = xfs_iget(mp, NULL, ino,
> > > > +	error = xfs_iget(mp, tp, ino,
> > > >  			 (XFS_IGET_DONTCACHE | XFS_IGET_UNTRUSTED),
> > > >  			 XFS_ILOCK_SHARED, &ip);
> > > > +	if (error == -ENOENT || error == -EINVAL)
> > > > +		bc->breq->startino = ino + 1;
> > > >  	if (error)
> > > >  		goto out_free;
> > > >  
> > > > @@ -119,43 +145,45 @@ xfs_bulkstat_one_int(
> > > >  	xfs_iunlock(ip, XFS_ILOCK_SHARED);
> > > >  	xfs_irele(ip);
> > > >  
> > > > -	error = formatter(buffer, ubsize, ubused, buf);
> > > > -	if (!error)
> > > > -		*stat = BULKSTAT_RV_DIDONE;
> > > > -
> > > > - out_free:
> > > > +	error = bc->formatter(bc->breq, buf);
> > > > +	switch (error) {
> > > > +	case XFS_IBULK_BUFFER_FULL:
> > > > +		error = XFS_IWALK_ABORT;
> > > > +		/* fall through */
> > > > +	case 0:
> > > > +		bc->breq->startino = ino + 1;
> > > > +		break;
> > > > +	}
> > > > +out_free:
> > > >  	kmem_free(buf);
> > > >  	return error;
> > > >  }
> > > >  
> > > > -/* Return 0 on success or positive error */
> > > > -STATIC int
> > > > -xfs_bulkstat_one_fmt(
> > > > -	void			__user *ubuffer,
> > > > -	int			ubsize,
> > > > -	int			*ubused,
> > > > -	const xfs_bstat_t	*buffer)
> > > > -{
> > > > -	if (ubsize < sizeof(*buffer))
> > > > -		return -ENOMEM;
> > > > -	if (copy_to_user(ubuffer, buffer, sizeof(*buffer)))
> > > > -		return -EFAULT;
> > > > -	if (ubused)
> > > > -		*ubused = sizeof(*buffer);
> > > > -	return 0;
> > > > -}
> > > > -
> > > > +/* Bulkstat a single inode. */
> > > >  int
> > > >  xfs_bulkstat_one(
> > > > -	xfs_mount_t	*mp,		/* mount point for filesystem */
> > > > -	xfs_ino_t	ino,		/* inode number to get data for */
> > > > -	void		__user *buffer,	/* buffer to place output in */
> > > > -	int		ubsize,		/* size of buffer */
> > > > -	int		*ubused,	/* bytes used by me */
> > > > -	int		*stat)		/* BULKSTAT_RV_... */
> > > > +	struct xfs_ibulk	*breq,
> > > > +	bulkstat_one_fmt_pf	formatter)
> > > >  {
> > > > -	return xfs_bulkstat_one_int(mp, ino, buffer, ubsize,
> > > > -				    xfs_bulkstat_one_fmt, ubused, stat);
> > > > +	struct xfs_bstat_chunk	bc = {
> > > > +		.formatter	= formatter,
> > > > +		.breq		= breq,
> > > > +	};
> > > > +	int			error;
> > > > +
> > > > +	breq->icount = 1;
> > > > +	breq->ocount = 0;
> > > > +
> > > 
> > > So ->icount is already set by the caller based on user input. I'd guess
> > > this is set here to guarantee a single cycle in the event that the user
> > > provided value is >1, but that seems unnecessary since we're calling the
> > > internal helper to handle a single inode.
> > > 
> > > If we want to set ->icount for whatever reason, can we do it in the
> > > caller where it's more obvious? That also shows that the ->ocount init
> > > is unnecessary since the whole structure is initialized in the caller.
> > 
> > Ok.
> > 
> > > > +	error = xfs_bulkstat_one_int(breq->mp, NULL, breq->startino, &bc);
> > > > +
> > > > +	/*
> > > > +	 * If we reported one inode to userspace then we abort because we hit
> > > > +	 * the end of the buffer.  Don't leak that back to userspace.
> > > > +	 */
> > > > +	if (error == XFS_IWALK_ABORT)
> > > > +		error = 0;
> > > > +
> > > > +	return error;
> > > >  }
> > > >  
> > > >  /*
> > > ...
> > > > diff --git a/fs/xfs/xfs_itable.h b/fs/xfs/xfs_itable.h
> > > > index 369e3f159d4e..366d391eb11f 100644
> > > > --- a/fs/xfs/xfs_itable.h
> > > > +++ b/fs/xfs/xfs_itable.h
> > > > @@ -5,63 +5,46 @@
> > > >  #ifndef __XFS_ITABLE_H__
> > > >  #define	__XFS_ITABLE_H__
> > > >  
> > > > -/*
> > > > - * xfs_bulkstat() is used to fill in xfs_bstat structures as well as dm_stat
> > > > - * structures (by the dmi library). This is a pointer to a formatter function
> > > > - * that will iget the inode and fill in the appropriate structure.
> > > > - * see xfs_bulkstat_one() and xfs_dm_bulkstat_one() in dmapi_xfs.c
> > > > - */
> > > > -typedef int (*bulkstat_one_pf)(struct xfs_mount	*mp,
> > > > -			       xfs_ino_t	ino,
> > > > -			       void		__user *buffer,
> > > > -			       int		ubsize,
> > > > -			       int		*ubused,
> > > > -			       int		*stat);
> > > > +/* In-memory representation of a userspace request for batch inode data. */
> > > > +struct xfs_ibulk {
> > > > +	struct xfs_mount	*mp;
> > > > +	void __user		*ubuffer; /* user output buffer */
> > > > +	xfs_ino_t		startino; /* start with this inode */
> > > > +	unsigned int		icount;   /* number of elements in ubuffer */
> > > > +	unsigned int		ocount;   /* number of records returned */
> > > > +};
> > > > +
> > > > +/* Return value that means we want to abort the walk. */
> > > > +#define XFS_IBULK_ABORT		(XFS_IWALK_ABORT)
> > > > +
> > > > +/* Return value that means the formatting buffer is now full. */
> > > > +#define XFS_IBULK_BUFFER_FULL	(2)
> > > >  
> > > 
> > > It might be wise to define this such that it's guaranteed not to be the
> > > same as the abort value, since that is defined externally to this
> > > header. IBULK_ABORT + 1 perhaps?
> > 
> > FWIW I'm of half a mind to establish a generic "abort walk" error code
> > and key all of these iterator functions to use it instead of having all
> > these #define XFS_FOO_ABORT	1 things everywhere.
> > 
> > I'll do #define XFS_IBULK_BUFFER_FULL IBULK_ABORT+1 here.
> > 
> > > Brian
> > > 
> > > >  /*
> > > > - * Values for stat return value.
> > > > + * Advance the user buffer pointer by one record of the given size.  If the
> > > > + * buffer is now full, return the appropriate error code.
> > > >   */
> > > > -#define BULKSTAT_RV_NOTHING	0
> > > > -#define BULKSTAT_RV_DIDONE	1
> > > > -#define BULKSTAT_RV_GIVEUP	2
> > > > +static inline int
> > > > +xfs_ibulk_advance(
> > > > +	struct xfs_ibulk	*breq,
> > > > +	size_t			bytes)
> > > > +{
> > > > +	char __user		*b = breq->ubuffer;
> > > > +
> > > > +	breq->ubuffer = b + bytes;
> > > > +	breq->ocount++;
> > > > +	return breq->ocount == breq->icount ? XFS_IBULK_BUFFER_FULL : 0;
> > > > +}
> > > >  
> > > >  /*
> > > >   * Return stat information in bulk (by-inode) for the filesystem.
> > > >   */
> > > > -int					/* error status */
> > > > -xfs_bulkstat(
> > > > -	xfs_mount_t	*mp,		/* mount point for filesystem */
> > > > -	xfs_ino_t	*lastino,	/* last inode returned */
> > > > -	int		*count,		/* size of buffer/count returned */
> > > > -	bulkstat_one_pf formatter,	/* func that'd fill a single buf */
> > > > -	size_t		statstruct_size,/* sizeof struct that we're filling */
> > > > -	char		__user *ubuffer,/* buffer with inode stats */
> > > > -	int		*done);		/* 1 if there are more stats to get */
> > > >  
> > > > -typedef int (*bulkstat_one_fmt_pf)(  /* used size in bytes or negative error */
> > > > -	void			__user *ubuffer, /* buffer to write to */
> > > > -	int			ubsize,		 /* remaining user buffer sz */
> > > > -	int			*ubused,	 /* bytes used by formatter */
> > > > -	const xfs_bstat_t	*buffer);        /* buffer to read from */
> > > > +typedef int (*bulkstat_one_fmt_pf)(struct xfs_ibulk *breq,
> > > > +		const struct xfs_bstat *bstat);
> > > >  
> > > > -int
> > > > -xfs_bulkstat_one_int(
> > > > -	xfs_mount_t		*mp,
> > > > -	xfs_ino_t		ino,
> > > > -	void			__user *buffer,
> > > > -	int			ubsize,
> > > > -	bulkstat_one_fmt_pf	formatter,
> > > > -	int			*ubused,
> > > > -	int			*stat);
> > > > -
> > > > -int
> > > > -xfs_bulkstat_one(
> > > > -	xfs_mount_t		*mp,
> > > > -	xfs_ino_t		ino,
> > > > -	void			__user *buffer,
> > > > -	int			ubsize,
> > > > -	int			*ubused,
> > > > -	int			*stat);
> > > > +int xfs_bulkstat_one(struct xfs_ibulk *breq, bulkstat_one_fmt_pf formatter);
> > > > +int xfs_bulkstat(struct xfs_ibulk *breq, bulkstat_one_fmt_pf formatter);
> > > >  
> > > >  typedef int (*inumbers_fmt_pf)(
> > > >  	void			__user *ubuffer, /* buffer to write to */
> > > > 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 08/10] xfs: multithreaded iwalk implementation
  2019-06-10 19:40   ` Brian Foster
@ 2019-06-11  1:10     ` Darrick J. Wong
  2019-06-11 13:13       ` Brian Foster
  0 siblings, 1 reply; 46+ messages in thread
From: Darrick J. Wong @ 2019-06-11  1:10 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Mon, Jun 10, 2019 at 03:40:13PM -0400, Brian Foster wrote:
> On Tue, Jun 04, 2019 at 02:50:20PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Create a parallel iwalk implementation and switch quotacheck to use it.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> 
> Interesting.. is there any commonality here with the ktask mechanism
> that's been in progress? I've not followed the details, but I thought it
> was a similar idea. The last post I see for that is here:
> 
> https://marc.info/?l=linux-mm&m=154143701122927&w=2

Yes, xfs_pwork is... the result of ktask still not landing upstream
after a couple of years. :(

> That aside, this all looks mostly fine to me. A few random thoughts..
> 
> >  fs/xfs/Makefile      |    1 
> >  fs/xfs/xfs_globals.c |    3 +
> >  fs/xfs/xfs_iwalk.c   |   76 ++++++++++++++++++++++++++++++-
> >  fs/xfs/xfs_iwalk.h   |    2 +
> >  fs/xfs/xfs_pwork.c   |  122 ++++++++++++++++++++++++++++++++++++++++++++++++++
> >  fs/xfs/xfs_pwork.h   |   50 ++++++++++++++++++++
> >  fs/xfs/xfs_qm.c      |    2 -
> >  fs/xfs/xfs_sysctl.h  |    6 ++
> >  fs/xfs/xfs_sysfs.c   |   40 ++++++++++++++++
> >  fs/xfs/xfs_trace.h   |   18 +++++++
> >  10 files changed, 317 insertions(+), 3 deletions(-)
> >  create mode 100644 fs/xfs/xfs_pwork.c
> >  create mode 100644 fs/xfs/xfs_pwork.h
> > 
> > 
> > diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
> > index 74d30ef0dbce..48940a27d4aa 100644
> > --- a/fs/xfs/Makefile
> > +++ b/fs/xfs/Makefile
> > @@ -84,6 +84,7 @@ xfs-y				+= xfs_aops.o \
> >  				   xfs_message.o \
> >  				   xfs_mount.o \
> >  				   xfs_mru_cache.o \
> > +				   xfs_pwork.o \
> >  				   xfs_reflink.o \
> >  				   xfs_stats.o \
> >  				   xfs_super.o \
> > diff --git a/fs/xfs/xfs_globals.c b/fs/xfs/xfs_globals.c
> > index d0d377384120..4f93f2c4dc38 100644
> > --- a/fs/xfs/xfs_globals.c
> > +++ b/fs/xfs/xfs_globals.c
> > @@ -31,6 +31,9 @@ xfs_param_t xfs_params = {
> >  	.fstrm_timer	= {	1,		30*100,		3600*100},
> >  	.eofb_timer	= {	1,		300,		3600*24},
> >  	.cowb_timer	= {	1,		1800,		3600*24},
> > +#ifdef DEBUG
> > +	.pwork_threads	= {	0,		0,		NR_CPUS	},
> > +#endif
> >  };
> >  
> >  struct xfs_globals xfs_globals = {
> > diff --git a/fs/xfs/xfs_iwalk.c b/fs/xfs/xfs_iwalk.c
> > index 8595258b5001..71ee1628aa70 100644
> > --- a/fs/xfs/xfs_iwalk.c
> > +++ b/fs/xfs/xfs_iwalk.c
> > @@ -21,6 +21,7 @@
> >  #include "xfs_health.h"
> >  #include "xfs_trans.h"
> >  #include "xfs_iwalk.h"
> > +#include "xfs_pwork.h"
> >  
> >  /*
> >   * Walking Inodes in the Filesystem
> > @@ -46,6 +47,9 @@
> >   */
> >  
> >  struct xfs_iwalk_ag {
> > +	/* parallel work control data; will be null if single threaded */
> > +	struct xfs_pwork		pwork;
> > +
> >  	struct xfs_mount		*mp;
> >  	struct xfs_trans		*tp;
> >  
> > @@ -200,6 +204,9 @@ xfs_iwalk_ag_recs(
> >  		trace_xfs_iwalk_ag_rec(mp, agno, irec);
> >  
> >  		for (j = 0; j < XFS_INODES_PER_CHUNK; j++) {
> > +			if (xfs_pwork_want_abort(&iwag->pwork))
> > +				return 0;
> > +
> >  			/* Skip if this inode is free */
> >  			if (XFS_INOBT_MASK(j) & irec->ir_free)
> >  				continue;
> > @@ -360,7 +367,7 @@ xfs_iwalk_ag(
> >  	agino = XFS_INO_TO_AGINO(mp, iwag->startino);
> >  	error = xfs_iwalk_ag_start(iwag, agno, agino, &cur, &agi_bp, &has_more);
> >  
> > -	while (!error && has_more) {
> > +	while (!error && has_more && !xfs_pwork_want_abort(&iwag->pwork)) {
> >  		struct xfs_inobt_rec_incore	*irec;
> >  
> >  		cond_resched();
> > @@ -409,7 +416,7 @@ xfs_iwalk_ag(
> >  	xfs_iwalk_del_inobt(tp, &cur, &agi_bp, error);
> >  
> >  	/* Walk any records left behind in the cache. */
> > -	if (iwag->nr_recs == 0 || error)
> > +	if (iwag->nr_recs == 0 || error || xfs_pwork_want_abort(&iwag->pwork))
> >  		return error;
> >  
> >  	return xfs_iwalk_ag_recs(iwag);
> > @@ -465,6 +472,7 @@ xfs_iwalk(
> >  		.iwalk_fn	= iwalk_fn,
> >  		.data		= data,
> >  		.startino	= startino,
> > +		.pwork		= XFS_PWORK_SINGLE_THREADED,
> >  	};
> >  	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, startino);
> >  	int			error;
> > @@ -486,3 +494,67 @@ xfs_iwalk(
> >  	xfs_iwalk_free(&iwag);
> >  	return error;
> >  }
> > +
> > +/* Run per-thread iwalk work. */
> > +static int
> > +xfs_iwalk_ag_work(
> > +	struct xfs_mount	*mp,
> > +	struct xfs_pwork	*pwork)
> > +{
> > +	struct xfs_iwalk_ag	*iwag;
> > +	int			error;
> > +
> > +	iwag = container_of(pwork, struct xfs_iwalk_ag, pwork);
> > +	error = xfs_iwalk_alloc(iwag);
> > +	if (error)
> > +		goto out;
> 
> In most cases this will never fail, but the error path if it does looks
> slightly painful. I was thinking if we could move this up into
> xfs_iwalk_threaded() so we wouldn't continue to queue work jobs when
> failure is imminent...
> 
> > +
> > +	error = xfs_iwalk_ag(iwag);
> > +	xfs_iwalk_free(iwag);
> > +out:
> > +	kmem_free(iwag);
> > +	return error;
> > +}
> > +
> > +/*
> > + * Walk all the inodes in the filesystem using multiple threads to process each
> > + * AG.
> > + */
> > +int
> > +xfs_iwalk_threaded(
> > +	struct xfs_mount	*mp,
> > +	xfs_ino_t		startino,
> > +	xfs_iwalk_fn		iwalk_fn,
> > +	unsigned int		max_prefetch,
> > +	void			*data)
> > +{
> > +	struct xfs_pwork_ctl	pctl;
> > +	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, startino);
> > +	unsigned int		nr_threads;
> > +	int			error;
> > +
> > +	ASSERT(agno < mp->m_sb.sb_agcount);
> > +
> > +	nr_threads = xfs_pwork_guess_datadev_parallelism(mp);
> > +	error = xfs_pwork_init(mp, &pctl, xfs_iwalk_ag_work, "xfs_iwalk",
> > +			nr_threads);
> > +	if (error)
> > +		return error;
> > +
> > +	for (; agno < mp->m_sb.sb_agcount; agno++) {
> > +		struct xfs_iwalk_ag	*iwag;
> > +
> > +		iwag = kmem_alloc(sizeof(struct xfs_iwalk_ag), KM_SLEEP);
> > +		iwag->mp = mp;
> > +		iwag->tp = NULL;
> > +		iwag->iwalk_fn = iwalk_fn;
> > +		iwag->data = data;
> > +		iwag->startino = startino;
> > +		iwag->recs = NULL;
> > +		xfs_iwalk_set_prefetch(iwag, max_prefetch);
> > +		xfs_pwork_queue(&pctl, &iwag->pwork);
> > +		startino = XFS_AGINO_TO_INO(mp, agno + 1, 0);
> > +	}
> 
> ... but this is only bound by the number of AGs and so could result in a
> large number of allocations. FWIW, I wouldn't expect that to be a
> problem in the common case. I'm more thinking about the case of a
> specially crafted filesystem designed to cause problems on mount.

<nod> I thought about that, and decided that it wasn't a good idea to
for each of the queued (but not processing) work items to be sitting on
a bunch of memory because that memory can't be put to useful work.
That's why I put it in xfs_iwalk_ag_work.

Also that would necessitate iwalk feeding a destructor to pwork so that
it can deal with work items that were queued but never actually run.

> 
> > +
> > +	return xfs_pwork_destroy(&pctl);
> > +}
> > diff --git a/fs/xfs/xfs_iwalk.h b/fs/xfs/xfs_iwalk.h
> > index 45b1baabcd2d..40233a05a766 100644
> > --- a/fs/xfs/xfs_iwalk.h
> > +++ b/fs/xfs/xfs_iwalk.h
> > @@ -14,5 +14,7 @@ typedef int (*xfs_iwalk_fn)(struct xfs_mount *mp, struct xfs_trans *tp,
> >  
> >  int xfs_iwalk(struct xfs_mount *mp, struct xfs_trans *tp, xfs_ino_t startino,
> >  		xfs_iwalk_fn iwalk_fn, unsigned int max_prefetch, void *data);
> > +int xfs_iwalk_threaded(struct xfs_mount *mp, xfs_ino_t startino,
> > +		xfs_iwalk_fn iwalk_fn, unsigned int max_prefetch, void *data);
> >  
> >  #endif /* __XFS_IWALK_H__ */
> > diff --git a/fs/xfs/xfs_pwork.c b/fs/xfs/xfs_pwork.c
> > new file mode 100644
> > index 000000000000..19605a3a2482
> > --- /dev/null
> > +++ b/fs/xfs/xfs_pwork.c
> > @@ -0,0 +1,122 @@
> > +// SPDX-License-Identifier: GPL-2.0+
> > +/*
> > + * Copyright (C) 2019 Oracle.  All Rights Reserved.
> > + * Author: Darrick J. Wong <darrick.wong@oracle.com>
> > + */
> > +#include "xfs.h"
> > +#include "xfs_fs.h"
> > +#include "xfs_shared.h"
> > +#include "xfs_format.h"
> > +#include "xfs_log_format.h"
> > +#include "xfs_trans_resv.h"
> > +#include "xfs_mount.h"
> > +#include "xfs_trace.h"
> > +#include "xfs_sysctl.h"
> > +#include "xfs_pwork.h"
> > +
> > +/*
> > + * Parallel Work Queue
> > + * ===================
> > + *
> > + * Abstract away the details of running a large and "obviously" parallelizable
> > + * task across multiple CPUs.  Callers initialize the pwork control object with
> > + * a desired level of parallelization and a work function.  Next, they embed
> > + * struct xfs_pwork in whatever structure they use to pass work context to a
> > + * worker thread and queue that pwork.  The work function will be passed the
> > + * pwork item when it is run (from process context) and any returned error will
> > + * cause all threads to abort.
> > + * 
> 
> FYI minor whitespace damage (trailing space) on the line above.

Fixed.

> > + * This is the rough equivalent of the xfsprogs workqueue code, though we can't
> > + * reuse that name here.
> > + */
> > +
> > +/* Invoke our caller's function. */
> > +static void
> > +xfs_pwork_work(
> > +	struct work_struct	*work)
> > +{
> > +	struct xfs_pwork	*pwork;
> > +	struct xfs_pwork_ctl	*pctl;
> > +	int			error;
> > +
> > +	pwork = container_of(work, struct xfs_pwork, work);
> > +	pctl = pwork->pctl;
> > +	error = pctl->work_fn(pctl->mp, pwork);
> > +	if (error && !pctl->error)
> > +		pctl->error = error;
> > +}
> > +
> > +/*
> > + * Set up control data for parallel work.  @work_fn is the function that will
> > + * be called.  @tag will be written into the kernel threads.  @nr_threads is
> > + * the level of parallelism desired, or 0 for no limit.
> > + */
> > +int
> > +xfs_pwork_init(
> > +	struct xfs_mount	*mp,
> > +	struct xfs_pwork_ctl	*pctl,
> > +	xfs_pwork_work_fn	work_fn,
> > +	const char		*tag,
> > +	unsigned int		nr_threads)
> > +{
> > +#ifdef DEBUG
> > +	if (xfs_globals.pwork_threads > 0)
> > +		nr_threads = xfs_globals.pwork_threads;
> > +#endif
> 
> Hmm, it might be useful to have the ability to force the no limit case
> from the debug knob. Can we use -1 or something here for "disabled?"

Ok.

> > +	trace_xfs_pwork_init(mp, nr_threads, current->pid);
> > +
> > +	pctl->wq = alloc_workqueue("%s-%d", WQ_FREEZABLE, nr_threads, tag,
> > +			current->pid);
> > +	if (!pctl->wq)
> > +		return -ENOMEM;
> > +	pctl->work_fn = work_fn;
> > +	pctl->error = 0;
> > +	pctl->mp = mp;
> > +
> > +	return 0;
> > +}
> > +
> > +/* Queue some parallel work. */
> > +void
> > +xfs_pwork_queue(
> > +	struct xfs_pwork_ctl	*pctl,
> > +	struct xfs_pwork	*pwork)
> > +{
> > +	INIT_WORK(&pwork->work, xfs_pwork_work);
> > +	pwork->pctl = pctl;
> > +	queue_work(pctl->wq, &pwork->work);
> > +}
> > +
> > +/* Wait for the work to finish and tear down the control structure. */
> > +int
> > +xfs_pwork_destroy(
> > +	struct xfs_pwork_ctl	*pctl)
> > +{
> > +	destroy_workqueue(pctl->wq);
> > +	pctl->wq = NULL;
> > +	return pctl->error;
> > +}
> > +
> > +/*
> > + * Return the amount of parallelism that the data device can handle, or 0 for
> > + * no limit.
> > + */
> > +unsigned int
> > +xfs_pwork_guess_datadev_parallelism(
> > +	struct xfs_mount	*mp)
> > +{
> > +	struct xfs_buftarg	*btp = mp->m_ddev_targp;
> > +	int			iomin;
> > +	int			ioopt;
> > +
> > +	if (blk_queue_nonrot(btp->bt_bdev->bd_queue))
> > +		return num_online_cpus();
> > +	if (mp->m_sb.sb_width && mp->m_sb.sb_unit)
> > +		return mp->m_sb.sb_width / mp->m_sb.sb_unit;
> > +	iomin = bdev_io_min(btp->bt_bdev);
> > +	ioopt = bdev_io_opt(btp->bt_bdev);
> > +	if (iomin && ioopt)
> > +		return ioopt / iomin;
> > +
> > +	return 1;
> 
> Have you collected any performance data related to these heuristics?

Yeah, the quotacheck runtime reduces by 5-10% on my SSDs (~5% on a
single SSD, ~10% on a 4-way raid0).  That wasn't really all that
awesome, so I recorded a flame graph (see below) to find where the
remaining overhead is.  A lot of it was in xfs_iget, and I also noticed
that deferred inode inactivation sped it up a little more.

> I
> assume the feature is generally a win, but this also seems like we have
> a large window of variance here. E.g., an SSD on a server with hundreds
> of CPUs will enable as many threads as CPUs, but a single xTB spindle on
> the same box may run single threaded (a quick check of a few local
> devices all return an optimal I/O size of 0). Is there really no benefit
> parallelizing some of that work in the spinning rust case?

Single-spindle spinning rust got worse even with 2 threads because the
heads ping-pong between AGs.  It's not horrible with enough readahead,
but it turns into a disaster when I disabled readahead, unsurprisingly.

> What about in the other direction where we might have a ton of threads
> for inodes across AGs that all happen to be in the same project quota,
> for example?

I collected a flame graph... https://djwong.org/docs/quotacheck.svg

It turned out that the inode setup overhead in xfs_iget is so high that
locking the dquot has negligible overhead.  When I "fixed" quotacheck to
read the quota information straight from the inode cluster buffer if the
inode wasn't in memory, the runtime dropped by 60% but Dave warned me
not to take us back to inode buffer aliasing hell.  I also noted that
if booting with mem=512M the memory reclamation overhead totally fries
us regardless of parallelisation.

--D

> 
> Brian
> 
> > +}
> > diff --git a/fs/xfs/xfs_pwork.h b/fs/xfs/xfs_pwork.h
> > new file mode 100644
> > index 000000000000..e0c1354a2d8c
> > --- /dev/null
> > +++ b/fs/xfs/xfs_pwork.h
> > @@ -0,0 +1,50 @@
> > +// SPDX-License-Identifier: GPL-2.0+
> > +/*
> > + * Copyright (C) 2019 Oracle.  All Rights Reserved.
> > + * Author: Darrick J. Wong <darrick.wong@oracle.com>
> > + */
> > +#ifndef __XFS_PWORK_H__
> > +#define __XFS_PWORK_H__
> > +
> > +struct xfs_pwork;
> > +struct xfs_mount;
> > +
> > +typedef int (*xfs_pwork_work_fn)(struct xfs_mount *mp, struct xfs_pwork *pwork);
> > +
> > +/*
> > + * Parallel work coordination structure.
> > + */
> > +struct xfs_pwork_ctl {
> > +	struct workqueue_struct	*wq;
> > +	struct xfs_mount	*mp;
> > +	xfs_pwork_work_fn	work_fn;
> > +	int			error;
> > +};
> > +
> > +/*
> > + * Embed this parallel work control item inside your own work structure,
> > + * then queue work with it.
> > + */
> > +struct xfs_pwork {
> > +	struct work_struct	work;
> > +	struct xfs_pwork_ctl	*pctl;
> > +};
> > +
> > +#define XFS_PWORK_SINGLE_THREADED	{ .pctl = NULL }
> > +
> > +/* Have we been told to abort? */
> > +static inline bool
> > +xfs_pwork_want_abort(
> > +	struct xfs_pwork	*pwork)
> > +{
> > +	return pwork->pctl && pwork->pctl->error;
> > +}
> > +
> > +int xfs_pwork_init(struct xfs_mount *mp, struct xfs_pwork_ctl *pctl,
> > +		xfs_pwork_work_fn work_fn, const char *tag,
> > +		unsigned int nr_threads);
> > +void xfs_pwork_queue(struct xfs_pwork_ctl *pctl, struct xfs_pwork *pwork);
> > +int xfs_pwork_destroy(struct xfs_pwork_ctl *pctl);
> > +unsigned int xfs_pwork_guess_datadev_parallelism(struct xfs_mount *mp);
> > +
> > +#endif /* __XFS_PWORK_H__ */
> > diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
> > index a5b2260406a8..e4f3785f7a64 100644
> > --- a/fs/xfs/xfs_qm.c
> > +++ b/fs/xfs/xfs_qm.c
> > @@ -1305,7 +1305,7 @@ xfs_qm_quotacheck(
> >  		flags |= XFS_PQUOTA_CHKD;
> >  	}
> >  
> > -	error = xfs_iwalk(mp, NULL, 0, xfs_qm_dqusage_adjust, 0, NULL);
> > +	error = xfs_iwalk_threaded(mp, 0, xfs_qm_dqusage_adjust, 0, NULL);
> >  	if (error)
> >  		goto error_return;
> >  
> > diff --git a/fs/xfs/xfs_sysctl.h b/fs/xfs/xfs_sysctl.h
> > index ad7f9be13087..b555e045e2f4 100644
> > --- a/fs/xfs/xfs_sysctl.h
> > +++ b/fs/xfs/xfs_sysctl.h
> > @@ -37,6 +37,9 @@ typedef struct xfs_param {
> >  	xfs_sysctl_val_t fstrm_timer;	/* Filestream dir-AG assoc'n timeout. */
> >  	xfs_sysctl_val_t eofb_timer;	/* Interval between eofb scan wakeups */
> >  	xfs_sysctl_val_t cowb_timer;	/* Interval between cowb scan wakeups */
> > +#ifdef DEBUG
> > +	xfs_sysctl_val_t pwork_threads;	/* Parallel workqueue thread count */
> > +#endif
> >  } xfs_param_t;
> >  
> >  /*
> > @@ -82,6 +85,9 @@ enum {
> >  extern xfs_param_t	xfs_params;
> >  
> >  struct xfs_globals {
> > +#ifdef DEBUG
> > +	int	pwork_threads;		/* parallel workqueue threads */
> > +#endif
> >  	int	log_recovery_delay;	/* log recovery delay (secs) */
> >  	int	mount_delay;		/* mount setup delay (secs) */
> >  	bool	bug_on_assert;		/* BUG() the kernel on assert failure */
> > diff --git a/fs/xfs/xfs_sysfs.c b/fs/xfs/xfs_sysfs.c
> > index cabda13f3c64..910e6b9cb1a7 100644
> > --- a/fs/xfs/xfs_sysfs.c
> > +++ b/fs/xfs/xfs_sysfs.c
> > @@ -206,11 +206,51 @@ always_cow_show(
> >  }
> >  XFS_SYSFS_ATTR_RW(always_cow);
> >  
> > +#ifdef DEBUG
> > +/*
> > + * Override how many threads the parallel work queue is allowed to create.
> > + * This has to be a debug-only global (instead of an errortag) because one of
> > + * the main users of parallel workqueues is mount time quotacheck.
> > + */
> > +STATIC ssize_t
> > +pwork_threads_store(
> > +	struct kobject	*kobject,
> > +	const char	*buf,
> > +	size_t		count)
> > +{
> > +	int		ret;
> > +	int		val;
> > +
> > +	ret = kstrtoint(buf, 0, &val);
> > +	if (ret)
> > +		return ret;
> > +
> > +	if (val < 0 || val > NR_CPUS)
> > +		return -EINVAL;
> > +
> > +	xfs_globals.pwork_threads = val;
> > +
> > +	return count;
> > +}
> > +
> > +STATIC ssize_t
> > +pwork_threads_show(
> > +	struct kobject	*kobject,
> > +	char		*buf)
> > +{
> > +	return snprintf(buf, PAGE_SIZE, "%d\n", xfs_globals.pwork_threads);
> > +}
> > +XFS_SYSFS_ATTR_RW(pwork_threads);
> > +#endif /* DEBUG */
> > +
> >  static struct attribute *xfs_dbg_attrs[] = {
> >  	ATTR_LIST(bug_on_assert),
> >  	ATTR_LIST(log_recovery_delay),
> >  	ATTR_LIST(mount_delay),
> >  	ATTR_LIST(always_cow),
> > +#ifdef DEBUG
> > +	ATTR_LIST(pwork_threads),
> > +#endif
> >  	NULL,
> >  };
> >  
> > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > index f9bb1d50bc0e..658cbade1998 100644
> > --- a/fs/xfs/xfs_trace.h
> > +++ b/fs/xfs/xfs_trace.h
> > @@ -3556,6 +3556,24 @@ TRACE_EVENT(xfs_iwalk_ag_rec,
> >  		  __entry->startino, __entry->freemask)
> >  )
> >  
> > +TRACE_EVENT(xfs_pwork_init,
> > +	TP_PROTO(struct xfs_mount *mp, unsigned int nr_threads, pid_t pid),
> > +	TP_ARGS(mp, nr_threads, pid),
> > +	TP_STRUCT__entry(
> > +		__field(dev_t, dev)
> > +		__field(unsigned int, nr_threads)
> > +		__field(pid_t, pid)
> > +	),
> > +	TP_fast_assign(
> > +		__entry->dev = mp->m_super->s_dev;
> > +		__entry->nr_threads = nr_threads;
> > +		__entry->pid = pid;
> > +	),
> > +	TP_printk("dev %d:%d nr_threads %u pid %u",
> > +		  MAJOR(__entry->dev), MINOR(__entry->dev),
> > +		  __entry->nr_threads, __entry->pid)
> > +)
> > +
> >  #endif /* _TRACE_XFS_H */
> >  
> >  #undef TRACE_INCLUDE_PATH
> > 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 08/10] xfs: multithreaded iwalk implementation
  2019-06-11  1:10     ` Darrick J. Wong
@ 2019-06-11 13:13       ` Brian Foster
  2019-06-11 15:29         ` Darrick J. Wong
  0 siblings, 1 reply; 46+ messages in thread
From: Brian Foster @ 2019-06-11 13:13 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Mon, Jun 10, 2019 at 06:10:20PM -0700, Darrick J. Wong wrote:
> On Mon, Jun 10, 2019 at 03:40:13PM -0400, Brian Foster wrote:
> > On Tue, Jun 04, 2019 at 02:50:20PM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > Create a parallel iwalk implementation and switch quotacheck to use it.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > ---
> > 
> > Interesting.. is there any commonality here with the ktask mechanism
> > that's been in progress? I've not followed the details, but I thought it
> > was a similar idea. The last post I see for that is here:
> > 
> > https://marc.info/?l=linux-mm&m=154143701122927&w=2
> 
> Yes, xfs_pwork is... the result of ktask still not landing upstream
> after a couple of years. :(
> 

Heh, Ok. We could always port over to it if it ever does land.

> > That aside, this all looks mostly fine to me. A few random thoughts..
> > 
> > >  fs/xfs/Makefile      |    1 
> > >  fs/xfs/xfs_globals.c |    3 +
> > >  fs/xfs/xfs_iwalk.c   |   76 ++++++++++++++++++++++++++++++-
> > >  fs/xfs/xfs_iwalk.h   |    2 +
> > >  fs/xfs/xfs_pwork.c   |  122 ++++++++++++++++++++++++++++++++++++++++++++++++++
> > >  fs/xfs/xfs_pwork.h   |   50 ++++++++++++++++++++
> > >  fs/xfs/xfs_qm.c      |    2 -
> > >  fs/xfs/xfs_sysctl.h  |    6 ++
> > >  fs/xfs/xfs_sysfs.c   |   40 ++++++++++++++++
> > >  fs/xfs/xfs_trace.h   |   18 +++++++
> > >  10 files changed, 317 insertions(+), 3 deletions(-)
> > >  create mode 100644 fs/xfs/xfs_pwork.c
> > >  create mode 100644 fs/xfs/xfs_pwork.h
> > > 
> > > 
> > > diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
> > > index 74d30ef0dbce..48940a27d4aa 100644
> > > --- a/fs/xfs/Makefile
> > > +++ b/fs/xfs/Makefile
> > > @@ -84,6 +84,7 @@ xfs-y				+= xfs_aops.o \
> > >  				   xfs_message.o \
> > >  				   xfs_mount.o \
> > >  				   xfs_mru_cache.o \
> > > +				   xfs_pwork.o \
> > >  				   xfs_reflink.o \
> > >  				   xfs_stats.o \
> > >  				   xfs_super.o \
> > > diff --git a/fs/xfs/xfs_globals.c b/fs/xfs/xfs_globals.c
> > > index d0d377384120..4f93f2c4dc38 100644
> > > --- a/fs/xfs/xfs_globals.c
> > > +++ b/fs/xfs/xfs_globals.c
> > > @@ -31,6 +31,9 @@ xfs_param_t xfs_params = {
> > >  	.fstrm_timer	= {	1,		30*100,		3600*100},
> > >  	.eofb_timer	= {	1,		300,		3600*24},
> > >  	.cowb_timer	= {	1,		1800,		3600*24},
> > > +#ifdef DEBUG
> > > +	.pwork_threads	= {	0,		0,		NR_CPUS	},
> > > +#endif
> > >  };
> > >  
> > >  struct xfs_globals xfs_globals = {
> > > diff --git a/fs/xfs/xfs_iwalk.c b/fs/xfs/xfs_iwalk.c
> > > index 8595258b5001..71ee1628aa70 100644
> > > --- a/fs/xfs/xfs_iwalk.c
> > > +++ b/fs/xfs/xfs_iwalk.c
> > > @@ -21,6 +21,7 @@
> > >  #include "xfs_health.h"
> > >  #include "xfs_trans.h"
> > >  #include "xfs_iwalk.h"
> > > +#include "xfs_pwork.h"
> > >  
> > >  /*
> > >   * Walking Inodes in the Filesystem
> > > @@ -46,6 +47,9 @@
> > >   */
> > >  
> > >  struct xfs_iwalk_ag {
> > > +	/* parallel work control data; will be null if single threaded */
> > > +	struct xfs_pwork		pwork;
> > > +
> > >  	struct xfs_mount		*mp;
> > >  	struct xfs_trans		*tp;
> > >  
> > > @@ -200,6 +204,9 @@ xfs_iwalk_ag_recs(
> > >  		trace_xfs_iwalk_ag_rec(mp, agno, irec);
> > >  
> > >  		for (j = 0; j < XFS_INODES_PER_CHUNK; j++) {
> > > +			if (xfs_pwork_want_abort(&iwag->pwork))
> > > +				return 0;
> > > +
> > >  			/* Skip if this inode is free */
> > >  			if (XFS_INOBT_MASK(j) & irec->ir_free)
> > >  				continue;
> > > @@ -360,7 +367,7 @@ xfs_iwalk_ag(
> > >  	agino = XFS_INO_TO_AGINO(mp, iwag->startino);
> > >  	error = xfs_iwalk_ag_start(iwag, agno, agino, &cur, &agi_bp, &has_more);
> > >  
> > > -	while (!error && has_more) {
> > > +	while (!error && has_more && !xfs_pwork_want_abort(&iwag->pwork)) {
> > >  		struct xfs_inobt_rec_incore	*irec;
> > >  
> > >  		cond_resched();
> > > @@ -409,7 +416,7 @@ xfs_iwalk_ag(
> > >  	xfs_iwalk_del_inobt(tp, &cur, &agi_bp, error);
> > >  
> > >  	/* Walk any records left behind in the cache. */
> > > -	if (iwag->nr_recs == 0 || error)
> > > +	if (iwag->nr_recs == 0 || error || xfs_pwork_want_abort(&iwag->pwork))
> > >  		return error;
> > >  
> > >  	return xfs_iwalk_ag_recs(iwag);
> > > @@ -465,6 +472,7 @@ xfs_iwalk(
> > >  		.iwalk_fn	= iwalk_fn,
> > >  		.data		= data,
> > >  		.startino	= startino,
> > > +		.pwork		= XFS_PWORK_SINGLE_THREADED,
> > >  	};
> > >  	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, startino);
> > >  	int			error;
> > > @@ -486,3 +494,67 @@ xfs_iwalk(
> > >  	xfs_iwalk_free(&iwag);
> > >  	return error;
> > >  }
> > > +
> > > +/* Run per-thread iwalk work. */
> > > +static int
> > > +xfs_iwalk_ag_work(
> > > +	struct xfs_mount	*mp,
> > > +	struct xfs_pwork	*pwork)
> > > +{
> > > +	struct xfs_iwalk_ag	*iwag;
> > > +	int			error;
> > > +
> > > +	iwag = container_of(pwork, struct xfs_iwalk_ag, pwork);
> > > +	error = xfs_iwalk_alloc(iwag);
> > > +	if (error)
> > > +		goto out;
> > 
> > In most cases this will never fail, but the error path if it does looks
> > slightly painful. I was thinking if we could move this up into
> > xfs_iwalk_threaded() so we wouldn't continue to queue work jobs when
> > failure is imminent...
> > 
> > > +
> > > +	error = xfs_iwalk_ag(iwag);
> > > +	xfs_iwalk_free(iwag);
> > > +out:
> > > +	kmem_free(iwag);
> > > +	return error;
> > > +}
> > > +
> > > +/*
> > > + * Walk all the inodes in the filesystem using multiple threads to process each
> > > + * AG.
> > > + */
> > > +int
> > > +xfs_iwalk_threaded(
> > > +	struct xfs_mount	*mp,
> > > +	xfs_ino_t		startino,
> > > +	xfs_iwalk_fn		iwalk_fn,
> > > +	unsigned int		max_prefetch,
> > > +	void			*data)
> > > +{
> > > +	struct xfs_pwork_ctl	pctl;
> > > +	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, startino);
> > > +	unsigned int		nr_threads;
> > > +	int			error;
> > > +
> > > +	ASSERT(agno < mp->m_sb.sb_agcount);
> > > +
> > > +	nr_threads = xfs_pwork_guess_datadev_parallelism(mp);
> > > +	error = xfs_pwork_init(mp, &pctl, xfs_iwalk_ag_work, "xfs_iwalk",
> > > +			nr_threads);
> > > +	if (error)
> > > +		return error;
> > > +
> > > +	for (; agno < mp->m_sb.sb_agcount; agno++) {
> > > +		struct xfs_iwalk_ag	*iwag;
> > > +
> > > +		iwag = kmem_alloc(sizeof(struct xfs_iwalk_ag), KM_SLEEP);
> > > +		iwag->mp = mp;
> > > +		iwag->tp = NULL;
> > > +		iwag->iwalk_fn = iwalk_fn;
> > > +		iwag->data = data;
> > > +		iwag->startino = startino;
> > > +		iwag->recs = NULL;
> > > +		xfs_iwalk_set_prefetch(iwag, max_prefetch);
> > > +		xfs_pwork_queue(&pctl, &iwag->pwork);
> > > +		startino = XFS_AGINO_TO_INO(mp, agno + 1, 0);
> > > +	}
> > 
> > ... but this is only bound by the number of AGs and so could result in a
> > large number of allocations. FWIW, I wouldn't expect that to be a
> > problem in the common case. I'm more thinking about the case of a
> > specially crafted filesystem designed to cause problems on mount.
> 
> <nod> I thought about that, and decided that it wasn't a good idea to
> for each of the queued (but not processing) work items to be sitting on
> a bunch of memory because that memory can't be put to useful work.
> That's why I put it in xfs_iwalk_ag_work.
> 

Yep.

> Also that would necessitate iwalk feeding a destructor to pwork so that
> it can deal with work items that were queued but never actually run.
> 

What's the scenario for queued jobs that never run? We have to run the
work item to free the iwag, so I'd assume we could free the record
buffer in the same place if we wanted to (which we don't :P).

> > 
> > > +
> > > +	return xfs_pwork_destroy(&pctl);
> > > +}
...
> > > +/*
> > > + * Return the amount of parallelism that the data device can handle, or 0 for
> > > + * no limit.
> > > + */
> > > +unsigned int
> > > +xfs_pwork_guess_datadev_parallelism(
> > > +	struct xfs_mount	*mp)
> > > +{
> > > +	struct xfs_buftarg	*btp = mp->m_ddev_targp;
> > > +	int			iomin;
> > > +	int			ioopt;
> > > +
> > > +	if (blk_queue_nonrot(btp->bt_bdev->bd_queue))
> > > +		return num_online_cpus();
> > > +	if (mp->m_sb.sb_width && mp->m_sb.sb_unit)
> > > +		return mp->m_sb.sb_width / mp->m_sb.sb_unit;
> > > +	iomin = bdev_io_min(btp->bt_bdev);
> > > +	ioopt = bdev_io_opt(btp->bt_bdev);
> > > +	if (iomin && ioopt)
> > > +		return ioopt / iomin;
> > > +
> > > +	return 1;
> > 
> > Have you collected any performance data related to these heuristics?
> 
> Yeah, the quotacheck runtime reduces by 5-10% on my SSDs (~5% on a
> single SSD, ~10% on a 4-way raid0).  That wasn't really all that
> awesome, so I recorded a flame graph (see below) to find where the
> remaining overhead is.  A lot of it was in xfs_iget, and I also noticed
> that deferred inode inactivation sped it up a little more.
> 

So if I follow the graphic correctly, you have 4 walker threads running
the quotacheck. The majority of overhead is the inode memory allocation
followed by inode buffer lookup and then dquot lookup slightly behind
that. If this is an SSD, the heuristic presumably set the thread count
based on the CPU count, right?

Hmm, I can't tell from the image what happens down in xfs_buf_find(). Do
you have granular enough data to see whether these are buffer cache hits
or misses?

> > I
> > assume the feature is generally a win, but this also seems like we have
> > a large window of variance here. E.g., an SSD on a server with hundreds
> > of CPUs will enable as many threads as CPUs, but a single xTB spindle on
> > the same box may run single threaded (a quick check of a few local
> > devices all return an optimal I/O size of 0). Is there really no benefit
> > parallelizing some of that work in the spinning rust case?
> 
> Single-spindle spinning rust got worse even with 2 threads because the
> heads ping-pong between AGs.  It's not horrible with enough readahead,
> but it turns into a disaster when I disabled readahead, unsurprisingly.
> 

Ok. Is that tempered by multi-spindle devices? What about a raid5/6 like
device where we have a stripe unit/width set, but the device itself may
have concurrency characteristics more like a single spindle as opposed
to something like raid0?

> > What about in the other direction where we might have a ton of threads
> > for inodes across AGs that all happen to be in the same project quota,
> > for example?
> 
> I collected a flame graph... https://djwong.org/docs/quotacheck.svg
> 

I'm specifically thinking about a few random systems I've used recently
with hundreds of CPUs. I'm not sure those boxes actually have SSDs, but
if one did (and with enough AGs), I'm curious how this algorithm would
behave under those conditions.

I guess this all boils down to trying to understand if/what breakdown
conditions might exist given the different possibilies allowed by the
current heuristic. I'm wondering if we should either restrict this
heuristic to enable concurrency specifically in the environments we've
shown it to have benefits and/or with some kind of reasonable cap to
limit unknown boundary conditions. E.g., the case of hundreds of AGs and
hundreds of CPUs on SSD seems like it could go really well (a big flash
raid0) or potentially really bad (one big SSD with a poorly configured
fs). Thoughts?

> It turned out that the inode setup overhead in xfs_iget is so high that
> locking the dquot has negligible overhead.  When I "fixed" quotacheck to
> read the quota information straight from the inode cluster buffer if the
> inode wasn't in memory, the runtime dropped by 60% but Dave warned me
> not to take us back to inode buffer aliasing hell.  I also noted that
> if booting with mem=512M the memory reclamation overhead totally fries
> us regardless of parallelisation.
> 

I'm not familiar with the buffer aliasing problem.. I'm guessing this is
talking about risk of inconsistent in-core inodes with inode buffers..?
In any event, I agree that it's not worth risking integrity or
overcomplicating things for something like quotacheck.

Brian

> --D
> 
> > 
> > Brian
> > 
> > > +}
> > > diff --git a/fs/xfs/xfs_pwork.h b/fs/xfs/xfs_pwork.h
> > > new file mode 100644
> > > index 000000000000..e0c1354a2d8c
> > > --- /dev/null
> > > +++ b/fs/xfs/xfs_pwork.h
> > > @@ -0,0 +1,50 @@
> > > +// SPDX-License-Identifier: GPL-2.0+
> > > +/*
> > > + * Copyright (C) 2019 Oracle.  All Rights Reserved.
> > > + * Author: Darrick J. Wong <darrick.wong@oracle.com>
> > > + */
> > > +#ifndef __XFS_PWORK_H__
> > > +#define __XFS_PWORK_H__
> > > +
> > > +struct xfs_pwork;
> > > +struct xfs_mount;
> > > +
> > > +typedef int (*xfs_pwork_work_fn)(struct xfs_mount *mp, struct xfs_pwork *pwork);
> > > +
> > > +/*
> > > + * Parallel work coordination structure.
> > > + */
> > > +struct xfs_pwork_ctl {
> > > +	struct workqueue_struct	*wq;
> > > +	struct xfs_mount	*mp;
> > > +	xfs_pwork_work_fn	work_fn;
> > > +	int			error;
> > > +};
> > > +
> > > +/*
> > > + * Embed this parallel work control item inside your own work structure,
> > > + * then queue work with it.
> > > + */
> > > +struct xfs_pwork {
> > > +	struct work_struct	work;
> > > +	struct xfs_pwork_ctl	*pctl;
> > > +};
> > > +
> > > +#define XFS_PWORK_SINGLE_THREADED	{ .pctl = NULL }
> > > +
> > > +/* Have we been told to abort? */
> > > +static inline bool
> > > +xfs_pwork_want_abort(
> > > +	struct xfs_pwork	*pwork)
> > > +{
> > > +	return pwork->pctl && pwork->pctl->error;
> > > +}
> > > +
> > > +int xfs_pwork_init(struct xfs_mount *mp, struct xfs_pwork_ctl *pctl,
> > > +		xfs_pwork_work_fn work_fn, const char *tag,
> > > +		unsigned int nr_threads);
> > > +void xfs_pwork_queue(struct xfs_pwork_ctl *pctl, struct xfs_pwork *pwork);
> > > +int xfs_pwork_destroy(struct xfs_pwork_ctl *pctl);
> > > +unsigned int xfs_pwork_guess_datadev_parallelism(struct xfs_mount *mp);
> > > +
> > > +#endif /* __XFS_PWORK_H__ */
> > > diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
> > > index a5b2260406a8..e4f3785f7a64 100644
> > > --- a/fs/xfs/xfs_qm.c
> > > +++ b/fs/xfs/xfs_qm.c
> > > @@ -1305,7 +1305,7 @@ xfs_qm_quotacheck(
> > >  		flags |= XFS_PQUOTA_CHKD;
> > >  	}
> > >  
> > > -	error = xfs_iwalk(mp, NULL, 0, xfs_qm_dqusage_adjust, 0, NULL);
> > > +	error = xfs_iwalk_threaded(mp, 0, xfs_qm_dqusage_adjust, 0, NULL);
> > >  	if (error)
> > >  		goto error_return;
> > >  
> > > diff --git a/fs/xfs/xfs_sysctl.h b/fs/xfs/xfs_sysctl.h
> > > index ad7f9be13087..b555e045e2f4 100644
> > > --- a/fs/xfs/xfs_sysctl.h
> > > +++ b/fs/xfs/xfs_sysctl.h
> > > @@ -37,6 +37,9 @@ typedef struct xfs_param {
> > >  	xfs_sysctl_val_t fstrm_timer;	/* Filestream dir-AG assoc'n timeout. */
> > >  	xfs_sysctl_val_t eofb_timer;	/* Interval between eofb scan wakeups */
> > >  	xfs_sysctl_val_t cowb_timer;	/* Interval between cowb scan wakeups */
> > > +#ifdef DEBUG
> > > +	xfs_sysctl_val_t pwork_threads;	/* Parallel workqueue thread count */
> > > +#endif
> > >  } xfs_param_t;
> > >  
> > >  /*
> > > @@ -82,6 +85,9 @@ enum {
> > >  extern xfs_param_t	xfs_params;
> > >  
> > >  struct xfs_globals {
> > > +#ifdef DEBUG
> > > +	int	pwork_threads;		/* parallel workqueue threads */
> > > +#endif
> > >  	int	log_recovery_delay;	/* log recovery delay (secs) */
> > >  	int	mount_delay;		/* mount setup delay (secs) */
> > >  	bool	bug_on_assert;		/* BUG() the kernel on assert failure */
> > > diff --git a/fs/xfs/xfs_sysfs.c b/fs/xfs/xfs_sysfs.c
> > > index cabda13f3c64..910e6b9cb1a7 100644
> > > --- a/fs/xfs/xfs_sysfs.c
> > > +++ b/fs/xfs/xfs_sysfs.c
> > > @@ -206,11 +206,51 @@ always_cow_show(
> > >  }
> > >  XFS_SYSFS_ATTR_RW(always_cow);
> > >  
> > > +#ifdef DEBUG
> > > +/*
> > > + * Override how many threads the parallel work queue is allowed to create.
> > > + * This has to be a debug-only global (instead of an errortag) because one of
> > > + * the main users of parallel workqueues is mount time quotacheck.
> > > + */
> > > +STATIC ssize_t
> > > +pwork_threads_store(
> > > +	struct kobject	*kobject,
> > > +	const char	*buf,
> > > +	size_t		count)
> > > +{
> > > +	int		ret;
> > > +	int		val;
> > > +
> > > +	ret = kstrtoint(buf, 0, &val);
> > > +	if (ret)
> > > +		return ret;
> > > +
> > > +	if (val < 0 || val > NR_CPUS)
> > > +		return -EINVAL;
> > > +
> > > +	xfs_globals.pwork_threads = val;
> > > +
> > > +	return count;
> > > +}
> > > +
> > > +STATIC ssize_t
> > > +pwork_threads_show(
> > > +	struct kobject	*kobject,
> > > +	char		*buf)
> > > +{
> > > +	return snprintf(buf, PAGE_SIZE, "%d\n", xfs_globals.pwork_threads);
> > > +}
> > > +XFS_SYSFS_ATTR_RW(pwork_threads);
> > > +#endif /* DEBUG */
> > > +
> > >  static struct attribute *xfs_dbg_attrs[] = {
> > >  	ATTR_LIST(bug_on_assert),
> > >  	ATTR_LIST(log_recovery_delay),
> > >  	ATTR_LIST(mount_delay),
> > >  	ATTR_LIST(always_cow),
> > > +#ifdef DEBUG
> > > +	ATTR_LIST(pwork_threads),
> > > +#endif
> > >  	NULL,
> > >  };
> > >  
> > > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > > index f9bb1d50bc0e..658cbade1998 100644
> > > --- a/fs/xfs/xfs_trace.h
> > > +++ b/fs/xfs/xfs_trace.h
> > > @@ -3556,6 +3556,24 @@ TRACE_EVENT(xfs_iwalk_ag_rec,
> > >  		  __entry->startino, __entry->freemask)
> > >  )
> > >  
> > > +TRACE_EVENT(xfs_pwork_init,
> > > +	TP_PROTO(struct xfs_mount *mp, unsigned int nr_threads, pid_t pid),
> > > +	TP_ARGS(mp, nr_threads, pid),
> > > +	TP_STRUCT__entry(
> > > +		__field(dev_t, dev)
> > > +		__field(unsigned int, nr_threads)
> > > +		__field(pid_t, pid)
> > > +	),
> > > +	TP_fast_assign(
> > > +		__entry->dev = mp->m_super->s_dev;
> > > +		__entry->nr_threads = nr_threads;
> > > +		__entry->pid = pid;
> > > +	),
> > > +	TP_printk("dev %d:%d nr_threads %u pid %u",
> > > +		  MAJOR(__entry->dev), MINOR(__entry->dev),
> > > +		  __entry->nr_threads, __entry->pid)
> > > +)
> > > +
> > >  #endif /* _TRACE_XFS_H */
> > >  
> > >  #undef TRACE_INCLUDE_PATH
> > > 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 09/10] xfs: poll waiting for quotacheck
  2019-06-04 21:50 ` [PATCH 09/10] xfs: poll waiting for quotacheck Darrick J. Wong
@ 2019-06-11 15:07   ` Brian Foster
  2019-06-11 16:06     ` Darrick J. Wong
  0 siblings, 1 reply; 46+ messages in thread
From: Brian Foster @ 2019-06-11 15:07 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Jun 04, 2019 at 02:50:27PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Create a pwork destroy function that uses polling instead of
> uninterruptible sleep to wait for work items to finish so that we can
> touch the softlockup watchdog.  IOWs, gross hack.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---

Seems reasonable given the quirky circumstances of quotacheck. Just a
couple nits..

>  fs/xfs/xfs_iwalk.c |    3 +++
>  fs/xfs/xfs_iwalk.h |    3 ++-
>  fs/xfs/xfs_pwork.c |   21 +++++++++++++++++++++
>  fs/xfs/xfs_pwork.h |    2 ++
>  fs/xfs/xfs_qm.c    |    2 +-
>  5 files changed, 29 insertions(+), 2 deletions(-)
> 
> 
> diff --git a/fs/xfs/xfs_iwalk.c b/fs/xfs/xfs_iwalk.c
> index 71ee1628aa70..c4a9c4c246b7 100644
> --- a/fs/xfs/xfs_iwalk.c
> +++ b/fs/xfs/xfs_iwalk.c
> @@ -526,6 +526,7 @@ xfs_iwalk_threaded(
>  	xfs_ino_t		startino,
>  	xfs_iwalk_fn		iwalk_fn,
>  	unsigned int		max_prefetch,
> +	bool			polled,
>  	void			*data)
>  {
>  	struct xfs_pwork_ctl	pctl;
> @@ -556,5 +557,7 @@ xfs_iwalk_threaded(
>  		startino = XFS_AGINO_TO_INO(mp, agno + 1, 0);
>  	}
>  
> +	if (polled)
> +		return xfs_pwork_destroy_poll(&pctl);
>  	return xfs_pwork_destroy(&pctl);

Rather than have duplicate destruction paths, could we rework
xfs_pwork_destroy_poll() to something like xfs_pwork_poll() that just
does the polling and returns, then the caller can fall back into the
current xfs_pwork_destroy()? I.e., this ends up looking like:

	...
	/* poll to keep soft lockup watchdog quiet */
	if (polled)
		xfs_pwork_poll(&pctl);
	return xfs_pwork_destroy(&pctl);

>  }
...
> diff --git a/fs/xfs/xfs_pwork.c b/fs/xfs/xfs_pwork.c
> index 19605a3a2482..3b885e0b52ac 100644
> --- a/fs/xfs/xfs_pwork.c
> +++ b/fs/xfs/xfs_pwork.c
...
> @@ -97,6 +101,23 @@ xfs_pwork_destroy(
>  	return pctl->error;
>  }
>  
> +/*
> + * Wait for the work to finish and tear down the control structure.
> + * Continually poll completion status and touch the soft lockup watchdog.
> + * This is for things like mount that hold locks.
> + */
> +int
> +xfs_pwork_destroy_poll(
> +	struct xfs_pwork_ctl	*pctl)
> +{
> +	while (atomic_read(&pctl->nr_work) > 0) {
> +		msleep(1);
> +		touch_softlockup_watchdog();
> +	}

Any idea what the typical watchdog timeout is? I'm curious where the 1ms
comes from and whether we could get away with anything larger. I realize
that might introduce mount latency with the current sleep based
implementation, but we could also consider a waitqueue and using
something like wait_event_timeout() to schedule out for longer time
periods and still wake up immediately when the count drops to 0.

Brian

> +
> +	return xfs_pwork_destroy(pctl);
> +}
> +
>  /*
>   * Return the amount of parallelism that the data device can handle, or 0 for
>   * no limit.
> diff --git a/fs/xfs/xfs_pwork.h b/fs/xfs/xfs_pwork.h
> index e0c1354a2d8c..08da723a8dc9 100644
> --- a/fs/xfs/xfs_pwork.h
> +++ b/fs/xfs/xfs_pwork.h
> @@ -18,6 +18,7 @@ struct xfs_pwork_ctl {
>  	struct workqueue_struct	*wq;
>  	struct xfs_mount	*mp;
>  	xfs_pwork_work_fn	work_fn;
> +	atomic_t		nr_work;
>  	int			error;
>  };
>  
> @@ -45,6 +46,7 @@ int xfs_pwork_init(struct xfs_mount *mp, struct xfs_pwork_ctl *pctl,
>  		unsigned int nr_threads);
>  void xfs_pwork_queue(struct xfs_pwork_ctl *pctl, struct xfs_pwork *pwork);
>  int xfs_pwork_destroy(struct xfs_pwork_ctl *pctl);
> +int xfs_pwork_destroy_poll(struct xfs_pwork_ctl *pctl);
>  unsigned int xfs_pwork_guess_datadev_parallelism(struct xfs_mount *mp);
>  
>  #endif /* __XFS_PWORK_H__ */
> diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
> index e4f3785f7a64..de6a623ada02 100644
> --- a/fs/xfs/xfs_qm.c
> +++ b/fs/xfs/xfs_qm.c
> @@ -1305,7 +1305,7 @@ xfs_qm_quotacheck(
>  		flags |= XFS_PQUOTA_CHKD;
>  	}
>  
> -	error = xfs_iwalk_threaded(mp, 0, xfs_qm_dqusage_adjust, 0, NULL);
> +	error = xfs_iwalk_threaded(mp, 0, xfs_qm_dqusage_adjust, 0, true, NULL);
>  	if (error)
>  		goto error_return;
>  
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 10/10] xfs: refactor INUMBERS to use iwalk functions
  2019-06-04 21:50 ` [PATCH 10/10] xfs: refactor INUMBERS to use iwalk functions Darrick J. Wong
@ 2019-06-11 15:08   ` Brian Foster
  2019-06-11 16:21     ` Darrick J. Wong
  0 siblings, 1 reply; 46+ messages in thread
From: Brian Foster @ 2019-06-11 15:08 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Jun 04, 2019 at 02:50:33PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Now that we have generic functions to walk inode records, refactor the
> INUMBERS implementation to use it.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/xfs_ioctl.c   |   20 ++++--
>  fs/xfs/xfs_ioctl.h   |    2 +
>  fs/xfs/xfs_ioctl32.c |   35 ++++------
>  fs/xfs/xfs_itable.c  |  168 ++++++++++++++++++++------------------------------
>  fs/xfs/xfs_itable.h  |   22 +------
>  fs/xfs/xfs_iwalk.c   |  161 ++++++++++++++++++++++++++++++++++++++++++++++--
>  fs/xfs/xfs_iwalk.h   |   12 ++++

It looks like there's a decent amount of xfs_iwalk code changes in this
patch. That should probably be a separate patch, at minimum to have a
separate changelog to document the changes/updates required for
inumbers.

>  7 files changed, 262 insertions(+), 158 deletions(-)
> 
> 
...
> diff --git a/fs/xfs/xfs_itable.c b/fs/xfs/xfs_itable.c
> index 06abe5c9c0ee..bade54d6ac64 100644
> --- a/fs/xfs/xfs_itable.c
> +++ b/fs/xfs/xfs_itable.c
> @@ -259,121 +259,85 @@ xfs_bulkstat(
>  	return error;
>  }
>  
> -int
> -xfs_inumbers_fmt(
> -	void			__user *ubuffer, /* buffer to write to */
> -	const struct xfs_inogrp	*buffer,	/* buffer to read from */
> -	long			count,		/* # of elements to read */
> -	long			*written)	/* # of bytes written */
> +struct xfs_inumbers_chunk {
> +	inumbers_fmt_pf		formatter;
> +	struct xfs_ibulk	*breq;
> +};
> +
> +/*
> + * INUMBERS
> + * ========
> + * This is how we export inode btree records to userspace, so that XFS tools
> + * can figure out where inodes are allocated.
> + */
> +
> +/*
> + * Format the inode group structure and report it somewhere.
> + *
> + * Similar to xfs_bulkstat_one_int, lastino is the inode cursor as we walk
> + * through the filesystem so we move it forward unless there was a runtime
> + * error.  If the formatter tells us the buffer is now full we also move the
> + * cursor forward and abort the walk.
> + */
> +STATIC int
> +xfs_inumbers_walk(
> +	struct xfs_mount	*mp,
> +	struct xfs_trans	*tp,
> +	xfs_agnumber_t		agno,
> +	const struct xfs_inobt_rec_incore *irec,
> +	void			*data)
>  {
> -	if (copy_to_user(ubuffer, buffer, count * sizeof(*buffer)))
> -		return -EFAULT;
> -	*written = count * sizeof(*buffer);
> -	return 0;
> +	struct xfs_inogrp	inogrp = {
> +		.xi_startino	= XFS_AGINO_TO_INO(mp, agno, irec->ir_startino),
> +		.xi_alloccount	= irec->ir_count - irec->ir_freecount,
> +		.xi_allocmask	= ~irec->ir_free,
> +	};

Not related to this patch, but I'm wondering if we should be using
xfs_inobt_irec_to_allocmask() here to mask off holes from the resulting
allocation bitmap. Eh, I guess it's misleading either way..

> +	struct xfs_inumbers_chunk *ic = data;
> +	xfs_agino_t		agino;
> +	int			error;
> +
> +	error = ic->formatter(ic->breq, &inogrp);
> +	if (error && error != XFS_IBULK_BUFFER_FULL)
> +		return error;
> +	if (error == XFS_IBULK_BUFFER_FULL)
> +		error = XFS_INOBT_WALK_ABORT;
> +
> +	agino = irec->ir_startino + XFS_INODES_PER_CHUNK;
> +	ic->breq->startino = XFS_AGINO_TO_INO(mp, agno, agino);
> +	return error;
>  }
>  
...
> diff --git a/fs/xfs/xfs_iwalk.c b/fs/xfs/xfs_iwalk.c
> index c4a9c4c246b7..3a35d1cf7e14 100644
> --- a/fs/xfs/xfs_iwalk.c
> +++ b/fs/xfs/xfs_iwalk.c
...
> @@ -286,7 +298,7 @@ xfs_iwalk_ag_start(
>  	 * have to deal with tearing down the cursor to walk the records.
>  	 */
>  	error = xfs_iwalk_grab_ichunk(*curpp, agino, &icount,
> -			&iwag->recs[iwag->nr_recs]);
> +			&iwag->recs[iwag->nr_recs], trim);

Hmm, it looks like we could lift the lookup from xfs_iwalk_grab_ichunk()
up into xfs_iwalk_ag_start() and avoid needing to pass trim down
multiple levels. In fact, if we're not trimming the record we don't need
to grab the record at all in this path. We could do the lookup (setting
has_more) then bounce right up to the core walker algorithm, right? If
so, that seems a bit cleaner in terms of only using special cased code
when it's actually necessary.

>  	if (error)
>  		return error;
>  	if (icount)
...
> @@ -561,3 +574,135 @@ xfs_iwalk_threaded(
...
> +/*
> + * Walk all inode btree records in a single AG, from @iwag->startino to the end
> + * of the AG.
> + */
> +STATIC int
> +xfs_inobt_walk_ag(
> +	struct xfs_iwalk_ag		*iwag)
> +{
> +	struct xfs_mount		*mp = iwag->mp;
> +	struct xfs_trans		*tp = iwag->tp;
> +	struct xfs_buf			*agi_bp = NULL;
> +	struct xfs_btree_cur		*cur = NULL;
> +	xfs_agnumber_t			agno;
> +	xfs_agino_t			agino;
> +	int				has_more;
> +	int				error = 0;
> +
> +	/* Set up our cursor at the right place in the inode btree. */
> +	agno = XFS_INO_TO_AGNO(mp, iwag->startino);
> +	agino = XFS_INO_TO_AGINO(mp, iwag->startino);
> +	error = xfs_iwalk_ag_start(iwag, agno, agino, &cur, &agi_bp, &has_more,
> +			false);
> +
> +	while (!error && has_more && !xfs_pwork_want_abort(&iwag->pwork)) {
> +		struct xfs_inobt_rec_incore	*irec;
> +
> +		cond_resched();
> +
> +		/* Fetch the inobt record. */
> +		irec = &iwag->recs[iwag->nr_recs];
> +		error = xfs_inobt_get_rec(cur, irec, &has_more);
> +		if (error || !has_more)
> +			break;
> +
> +		/*
> +		 * If there's space in the buffer for more records, increment
> +		 * the btree cursor and grab more.
> +		 */
> +		if (++iwag->nr_recs < iwag->sz_recs) {
> +			error = xfs_btree_increment(cur, 0, &has_more);
> +			if (error || !has_more)
> +				break;
> +			continue;
> +		}
> +
> +		/*
> +		 * Otherwise, we need to save cursor state and run the callback
> +		 * function on the cached records.  The run_callbacks function
> +		 * is supposed to return a cursor pointing to the record where
> +		 * we would be if we had been able to increment like above.
> +		 */
> +		error = xfs_iwalk_run_callbacks(iwag, xfs_inobt_walk_ag_recs,
> +				agno, &cur, &agi_bp, &has_more);
> +	}
> +
> +	xfs_iwalk_del_inobt(tp, &cur, &agi_bp, error);
> +
> +	/* Walk any records left behind in the cache. */
> +	if (iwag->nr_recs == 0 || error || xfs_pwork_want_abort(&iwag->pwork))
> +		return error;
> +
> +	return xfs_inobt_walk_ag_recs(iwag);
> +}

Similar comments apply here as for the previous xfs_iwalk_ag() patch.
Though looking at it, the only differences here are the lack of free
inode check, readahead and the callback function (particularly once you
consider the partial completion refactoring we discussed earlier). I
think this could all be generalized with a single flag such that there's
no need for separate xfs_[inobt|iwalk]_ag() functions.

Hmmm.. perhaps we could use a flag or separate function pointers in
struct xfs_iwalk_ag to accomplish the same thing all the way up through
the record walker functions. IOW, xfs_iwalk_ag_recs() looks like it
could call either callback based on which is defined in the
xfs_iwalk_ag.

This could all be done as a separate patch of course, if that's easier.

> +
> +/*
> + * Walk all inode btree records in the filesystem starting from @startino.  The
> + * @inobt_walk_fn will be called for each btree record, being passed the incore
> + * record and @data.  @max_prefetch controls how many inobt records we try to
> + * cache ahead of time.
> + */
> +int
> +xfs_inobt_walk(
> +	struct xfs_mount	*mp,
> +	struct xfs_trans	*tp,
> +	xfs_ino_t		startino,
> +	xfs_inobt_walk_fn	inobt_walk_fn,
> +	unsigned int		max_prefetch,
> +	void			*data)
> +{
> +	struct xfs_iwalk_ag	iwag = {
> +		.mp		= mp,
> +		.tp		= tp,
> +		.inobt_walk_fn	= inobt_walk_fn,
> +		.data		= data,
> +		.startino	= startino,
> +		.pwork		= XFS_PWORK_SINGLE_THREADED,
> +	};
> +	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, startino);
> +	int			error;
> +
> +	ASSERT(agno < mp->m_sb.sb_agcount);
> +
> +	xfs_iwalk_set_prefetch(&iwag, max_prefetch * XFS_INODES_PER_CHUNK);

A brief comment above this line would be helpful. Something like:

	/* translate inumbers record count to inode count */

Brian

> +	error = xfs_iwalk_alloc(&iwag);
> +	if (error)
> +		return error;
> +
> +	for (; agno < mp->m_sb.sb_agcount; agno++) {
> +		error = xfs_inobt_walk_ag(&iwag);
> +		if (error)
> +			break;
> +		iwag.startino = XFS_AGINO_TO_INO(mp, agno + 1, 0);
> +	}
> +
> +	xfs_iwalk_free(&iwag);
> +	return error;
> +}
> diff --git a/fs/xfs/xfs_iwalk.h b/fs/xfs/xfs_iwalk.h
> index 76d8f87a39ef..20bee93d4676 100644
> --- a/fs/xfs/xfs_iwalk.h
> +++ b/fs/xfs/xfs_iwalk.h
> @@ -18,4 +18,16 @@ int xfs_iwalk_threaded(struct xfs_mount *mp, xfs_ino_t startino,
>  		xfs_iwalk_fn iwalk_fn, unsigned int max_prefetch, bool poll,
>  		void *data);
>  
> +/* Walk all inode btree records in the filesystem starting from @startino. */
> +typedef int (*xfs_inobt_walk_fn)(struct xfs_mount *mp, struct xfs_trans *tp,
> +				 xfs_agnumber_t agno,
> +				 const struct xfs_inobt_rec_incore *irec,
> +				 void *data);
> +/* Return value (for xfs_inobt_walk_fn) that aborts the walk immediately. */
> +#define XFS_INOBT_WALK_ABORT	(XFS_IWALK_ABORT)
> +
> +int xfs_inobt_walk(struct xfs_mount *mp, struct xfs_trans *tp,
> +		xfs_ino_t startino, xfs_inobt_walk_fn inobt_walk_fn,
> +		unsigned int max_prefetch, void *data);
> +
>  #endif /* __XFS_IWALK_H__ */
> 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 08/10] xfs: multithreaded iwalk implementation
  2019-06-11 13:13       ` Brian Foster
@ 2019-06-11 15:29         ` Darrick J. Wong
  2019-06-11 17:00           ` Brian Foster
  0 siblings, 1 reply; 46+ messages in thread
From: Darrick J. Wong @ 2019-06-11 15:29 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Tue, Jun 11, 2019 at 09:13:40AM -0400, Brian Foster wrote:
> On Mon, Jun 10, 2019 at 06:10:20PM -0700, Darrick J. Wong wrote:
> > On Mon, Jun 10, 2019 at 03:40:13PM -0400, Brian Foster wrote:
> > > On Tue, Jun 04, 2019 at 02:50:20PM -0700, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > 
> > > > Create a parallel iwalk implementation and switch quotacheck to use it.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > ---
> > > 
> > > Interesting.. is there any commonality here with the ktask mechanism
> > > that's been in progress? I've not followed the details, but I thought it
> > > was a similar idea. The last post I see for that is here:
> > > 
> > > https://marc.info/?l=linux-mm&m=154143701122927&w=2
> > 
> > Yes, xfs_pwork is... the result of ktask still not landing upstream
> > after a couple of years. :(
> > 
> 
> Heh, Ok. We could always port over to it if it ever does land.
> 
> > > That aside, this all looks mostly fine to me. A few random thoughts..
> > > 
> > > >  fs/xfs/Makefile      |    1 
> > > >  fs/xfs/xfs_globals.c |    3 +
> > > >  fs/xfs/xfs_iwalk.c   |   76 ++++++++++++++++++++++++++++++-
> > > >  fs/xfs/xfs_iwalk.h   |    2 +
> > > >  fs/xfs/xfs_pwork.c   |  122 ++++++++++++++++++++++++++++++++++++++++++++++++++
> > > >  fs/xfs/xfs_pwork.h   |   50 ++++++++++++++++++++
> > > >  fs/xfs/xfs_qm.c      |    2 -
> > > >  fs/xfs/xfs_sysctl.h  |    6 ++
> > > >  fs/xfs/xfs_sysfs.c   |   40 ++++++++++++++++
> > > >  fs/xfs/xfs_trace.h   |   18 +++++++
> > > >  10 files changed, 317 insertions(+), 3 deletions(-)
> > > >  create mode 100644 fs/xfs/xfs_pwork.c
> > > >  create mode 100644 fs/xfs/xfs_pwork.h
> > > > 
> > > > 
> > > > diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
> > > > index 74d30ef0dbce..48940a27d4aa 100644
> > > > --- a/fs/xfs/Makefile
> > > > +++ b/fs/xfs/Makefile
> > > > @@ -84,6 +84,7 @@ xfs-y				+= xfs_aops.o \
> > > >  				   xfs_message.o \
> > > >  				   xfs_mount.o \
> > > >  				   xfs_mru_cache.o \
> > > > +				   xfs_pwork.o \
> > > >  				   xfs_reflink.o \
> > > >  				   xfs_stats.o \
> > > >  				   xfs_super.o \
> > > > diff --git a/fs/xfs/xfs_globals.c b/fs/xfs/xfs_globals.c
> > > > index d0d377384120..4f93f2c4dc38 100644
> > > > --- a/fs/xfs/xfs_globals.c
> > > > +++ b/fs/xfs/xfs_globals.c
> > > > @@ -31,6 +31,9 @@ xfs_param_t xfs_params = {
> > > >  	.fstrm_timer	= {	1,		30*100,		3600*100},
> > > >  	.eofb_timer	= {	1,		300,		3600*24},
> > > >  	.cowb_timer	= {	1,		1800,		3600*24},
> > > > +#ifdef DEBUG
> > > > +	.pwork_threads	= {	0,		0,		NR_CPUS	},
> > > > +#endif
> > > >  };
> > > >  
> > > >  struct xfs_globals xfs_globals = {
> > > > diff --git a/fs/xfs/xfs_iwalk.c b/fs/xfs/xfs_iwalk.c
> > > > index 8595258b5001..71ee1628aa70 100644
> > > > --- a/fs/xfs/xfs_iwalk.c
> > > > +++ b/fs/xfs/xfs_iwalk.c
> > > > @@ -21,6 +21,7 @@
> > > >  #include "xfs_health.h"
> > > >  #include "xfs_trans.h"
> > > >  #include "xfs_iwalk.h"
> > > > +#include "xfs_pwork.h"
> > > >  
> > > >  /*
> > > >   * Walking Inodes in the Filesystem
> > > > @@ -46,6 +47,9 @@
> > > >   */
> > > >  
> > > >  struct xfs_iwalk_ag {
> > > > +	/* parallel work control data; will be null if single threaded */
> > > > +	struct xfs_pwork		pwork;
> > > > +
> > > >  	struct xfs_mount		*mp;
> > > >  	struct xfs_trans		*tp;
> > > >  
> > > > @@ -200,6 +204,9 @@ xfs_iwalk_ag_recs(
> > > >  		trace_xfs_iwalk_ag_rec(mp, agno, irec);
> > > >  
> > > >  		for (j = 0; j < XFS_INODES_PER_CHUNK; j++) {
> > > > +			if (xfs_pwork_want_abort(&iwag->pwork))
> > > > +				return 0;
> > > > +
> > > >  			/* Skip if this inode is free */
> > > >  			if (XFS_INOBT_MASK(j) & irec->ir_free)
> > > >  				continue;
> > > > @@ -360,7 +367,7 @@ xfs_iwalk_ag(
> > > >  	agino = XFS_INO_TO_AGINO(mp, iwag->startino);
> > > >  	error = xfs_iwalk_ag_start(iwag, agno, agino, &cur, &agi_bp, &has_more);
> > > >  
> > > > -	while (!error && has_more) {
> > > > +	while (!error && has_more && !xfs_pwork_want_abort(&iwag->pwork)) {
> > > >  		struct xfs_inobt_rec_incore	*irec;
> > > >  
> > > >  		cond_resched();
> > > > @@ -409,7 +416,7 @@ xfs_iwalk_ag(
> > > >  	xfs_iwalk_del_inobt(tp, &cur, &agi_bp, error);
> > > >  
> > > >  	/* Walk any records left behind in the cache. */
> > > > -	if (iwag->nr_recs == 0 || error)
> > > > +	if (iwag->nr_recs == 0 || error || xfs_pwork_want_abort(&iwag->pwork))
> > > >  		return error;
> > > >  
> > > >  	return xfs_iwalk_ag_recs(iwag);
> > > > @@ -465,6 +472,7 @@ xfs_iwalk(
> > > >  		.iwalk_fn	= iwalk_fn,
> > > >  		.data		= data,
> > > >  		.startino	= startino,
> > > > +		.pwork		= XFS_PWORK_SINGLE_THREADED,
> > > >  	};
> > > >  	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, startino);
> > > >  	int			error;
> > > > @@ -486,3 +494,67 @@ xfs_iwalk(
> > > >  	xfs_iwalk_free(&iwag);
> > > >  	return error;
> > > >  }
> > > > +
> > > > +/* Run per-thread iwalk work. */
> > > > +static int
> > > > +xfs_iwalk_ag_work(
> > > > +	struct xfs_mount	*mp,
> > > > +	struct xfs_pwork	*pwork)
> > > > +{
> > > > +	struct xfs_iwalk_ag	*iwag;
> > > > +	int			error;
> > > > +
> > > > +	iwag = container_of(pwork, struct xfs_iwalk_ag, pwork);
> > > > +	error = xfs_iwalk_alloc(iwag);
> > > > +	if (error)
> > > > +		goto out;
> > > 
> > > In most cases this will never fail, but the error path if it does looks
> > > slightly painful. I was thinking if we could move this up into
> > > xfs_iwalk_threaded() so we wouldn't continue to queue work jobs when
> > > failure is imminent...
> > > 
> > > > +
> > > > +	error = xfs_iwalk_ag(iwag);
> > > > +	xfs_iwalk_free(iwag);
> > > > +out:
> > > > +	kmem_free(iwag);
> > > > +	return error;
> > > > +}
> > > > +
> > > > +/*
> > > > + * Walk all the inodes in the filesystem using multiple threads to process each
> > > > + * AG.
> > > > + */
> > > > +int
> > > > +xfs_iwalk_threaded(
> > > > +	struct xfs_mount	*mp,
> > > > +	xfs_ino_t		startino,
> > > > +	xfs_iwalk_fn		iwalk_fn,
> > > > +	unsigned int		max_prefetch,
> > > > +	void			*data)
> > > > +{
> > > > +	struct xfs_pwork_ctl	pctl;
> > > > +	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, startino);
> > > > +	unsigned int		nr_threads;
> > > > +	int			error;
> > > > +
> > > > +	ASSERT(agno < mp->m_sb.sb_agcount);
> > > > +
> > > > +	nr_threads = xfs_pwork_guess_datadev_parallelism(mp);
> > > > +	error = xfs_pwork_init(mp, &pctl, xfs_iwalk_ag_work, "xfs_iwalk",
> > > > +			nr_threads);
> > > > +	if (error)
> > > > +		return error;
> > > > +
> > > > +	for (; agno < mp->m_sb.sb_agcount; agno++) {
> > > > +		struct xfs_iwalk_ag	*iwag;
> > > > +
> > > > +		iwag = kmem_alloc(sizeof(struct xfs_iwalk_ag), KM_SLEEP);
> > > > +		iwag->mp = mp;
> > > > +		iwag->tp = NULL;
> > > > +		iwag->iwalk_fn = iwalk_fn;
> > > > +		iwag->data = data;
> > > > +		iwag->startino = startino;
> > > > +		iwag->recs = NULL;
> > > > +		xfs_iwalk_set_prefetch(iwag, max_prefetch);
> > > > +		xfs_pwork_queue(&pctl, &iwag->pwork);
> > > > +		startino = XFS_AGINO_TO_INO(mp, agno + 1, 0);
> > > > +	}
> > > 
> > > ... but this is only bound by the number of AGs and so could result in a
> > > large number of allocations. FWIW, I wouldn't expect that to be a
> > > problem in the common case. I'm more thinking about the case of a
> > > specially crafted filesystem designed to cause problems on mount.
> > 
> > <nod> I thought about that, and decided that it wasn't a good idea to
> > for each of the queued (but not processing) work items to be sitting on
> > a bunch of memory because that memory can't be put to useful work.
> > That's why I put it in xfs_iwalk_ag_work.
> > 
> 
> Yep.
> 
> > Also that would necessitate iwalk feeding a destructor to pwork so that
> > it can deal with work items that were queued but never actually run.
> > 
> 
> What's the scenario for queued jobs that never run? We have to run the
> work item to free the iwag, so I'd assume we could free the record
> buffer in the same place if we wanted to (which we don't :P).

Doh!  That was a thinko on my part.  pwork always runs the work item,
even if pctl->error != 0; it's up to the individual work item to check
if we're aborted and run for the exits, which _iwalk_ag_work doesn't do.

Erugh, I'll fix the documentation and _iwalk_ag_work.

> > > 
> > > > +
> > > > +	return xfs_pwork_destroy(&pctl);
> > > > +}
> ...
> > > > +/*
> > > > + * Return the amount of parallelism that the data device can handle, or 0 for
> > > > + * no limit.
> > > > + */
> > > > +unsigned int
> > > > +xfs_pwork_guess_datadev_parallelism(
> > > > +	struct xfs_mount	*mp)
> > > > +{
> > > > +	struct xfs_buftarg	*btp = mp->m_ddev_targp;
> > > > +	int			iomin;
> > > > +	int			ioopt;
> > > > +
> > > > +	if (blk_queue_nonrot(btp->bt_bdev->bd_queue))
> > > > +		return num_online_cpus();
> > > > +	if (mp->m_sb.sb_width && mp->m_sb.sb_unit)
> > > > +		return mp->m_sb.sb_width / mp->m_sb.sb_unit;
> > > > +	iomin = bdev_io_min(btp->bt_bdev);
> > > > +	ioopt = bdev_io_opt(btp->bt_bdev);
> > > > +	if (iomin && ioopt)
> > > > +		return ioopt / iomin;
> > > > +
> > > > +	return 1;
> > > 
> > > Have you collected any performance data related to these heuristics?
> > 
> > Yeah, the quotacheck runtime reduces by 5-10% on my SSDs (~5% on a
> > single SSD, ~10% on a 4-way raid0).  That wasn't really all that
> > awesome, so I recorded a flame graph (see below) to find where the
> > remaining overhead is.  A lot of it was in xfs_iget, and I also noticed
> > that deferred inode inactivation sped it up a little more.
> > 
> 
> So if I follow the graphic correctly, you have 4 walker threads running
> the quotacheck. The majority of overhead is the inode memory allocation
> followed by inode buffer lookup and then dquot lookup slightly behind
> that. If this is an SSD, the heuristic presumably set the thread count
> based on the CPU count, right?

Right.

> Hmm, I can't tell from the image what happens down in xfs_buf_find(). Do
> you have granular enough data to see whether these are buffer cache hits
> or misses?

I /think/ they're mostly hits, though I admit it's hard to read the
24000 pixel wide output, even after clicking on xfs_buf_read_map to zoom
in a bit.

> > > I
> > > assume the feature is generally a win, but this also seems like we have
> > > a large window of variance here. E.g., an SSD on a server with hundreds
> > > of CPUs will enable as many threads as CPUs, but a single xTB spindle on
> > > the same box may run single threaded (a quick check of a few local
> > > devices all return an optimal I/O size of 0). Is there really no benefit
> > > parallelizing some of that work in the spinning rust case?
> > 
> > Single-spindle spinning rust got worse even with 2 threads because the
> > heads ping-pong between AGs.  It's not horrible with enough readahead,
> > but it turns into a disaster when I disabled readahead, unsurprisingly.
> > 
> 
> Ok. Is that tempered by multi-spindle devices? What about a raid5/6 like
> device where we have a stripe unit/width set, but the device itself may
> have concurrency characteristics more like a single spindle as opposed
> to something like raid0?

I'd have to study a wider variety of disk raid setups.  I'm under the
impression that the inode allocation code tries to allocate contiguous
runs of inode clusters which will fill up stripe units (su), and then
(if there's still space) it'll keep going into the next stripe?

Hmm, I wonder how well inodes get distributed across stripes...

> > > What about in the other direction where we might have a ton of threads
> > > for inodes across AGs that all happen to be in the same project quota,
> > > for example?
> > 
> > I collected a flame graph... https://djwong.org/docs/quotacheck.svg
> > 
> 
> I'm specifically thinking about a few random systems I've used recently
> with hundreds of CPUs. I'm not sure those boxes actually have SSDs, but
> if one did (and with enough AGs), I'm curious how this algorithm would
> behave under those conditions.
>
> I guess this all boils down to trying to understand if/what breakdown
> conditions might exist given the different possibilies allowed by the
> current heuristic. I'm wondering if we should either restrict this
> heuristic to enable concurrency specifically in the environments we've
> shown it to have benefits and/or with some kind of reasonable cap to
> limit unknown boundary conditions. E.g., the case of hundreds of AGs and
> hundreds of CPUs on SSD seems like it could go really well (a big flash
> raid0) or potentially really bad (one big SSD with a poorly configured
> fs). Thoughts?

<nod> I wish we had a better way for the block layer to give us even a
rough estimate of how many iops a device can handle... or even the
ability to help us train ourselves after the fact by measuring average
latency over time vs. the last (say) 2s so that we could back off if we
thought the device was getting congested.

[Hmm, Johannes merged some "pressure stall information" thing a couple
of releases back; I wonder if we can make use of that... I'll have a
look and report back.]

I suppose we could cap the number of cpus at some arbitrary high number
for now (128?) and see how long it takes to hit scalability problems
because of it, but we could also just let things scale and see who
complains about quotacheck getting slower on their 500 CPU system with a
nonrotational slate stone.  I suspect "one big SSD with a poorly
configured fs" is going to notice problems everywhere anyway.

Admittedly this is yet another one of those cases where I turn something
on and ... the bottleneck just shows up somewhere else now.  I hadn't
realized that iget overhead is so high.

> > It turned out that the inode setup overhead in xfs_iget is so high that
> > locking the dquot has negligible overhead.  When I "fixed" quotacheck to
> > read the quota information straight from the inode cluster buffer if the
> > inode wasn't in memory, the runtime dropped by 60% but Dave warned me
> > not to take us back to inode buffer aliasing hell.  I also noted that
> > if booting with mem=512M the memory reclamation overhead totally fries
> > us regardless of parallelisation.
> > 
> 
> I'm not familiar with the buffer aliasing problem.. I'm guessing this is
> talking about risk of inconsistent in-core inodes with inode buffers..?
> In any event, I agree that it's not worth risking integrity or
> overcomplicating things for something like quotacheck.

Correct.

--D

> Brian
> 
> > --D
> > 
> > > 
> > > Brian
> > > 
> > > > +}
> > > > diff --git a/fs/xfs/xfs_pwork.h b/fs/xfs/xfs_pwork.h
> > > > new file mode 100644
> > > > index 000000000000..e0c1354a2d8c
> > > > --- /dev/null
> > > > +++ b/fs/xfs/xfs_pwork.h
> > > > @@ -0,0 +1,50 @@
> > > > +// SPDX-License-Identifier: GPL-2.0+
> > > > +/*
> > > > + * Copyright (C) 2019 Oracle.  All Rights Reserved.
> > > > + * Author: Darrick J. Wong <darrick.wong@oracle.com>
> > > > + */
> > > > +#ifndef __XFS_PWORK_H__
> > > > +#define __XFS_PWORK_H__
> > > > +
> > > > +struct xfs_pwork;
> > > > +struct xfs_mount;
> > > > +
> > > > +typedef int (*xfs_pwork_work_fn)(struct xfs_mount *mp, struct xfs_pwork *pwork);
> > > > +
> > > > +/*
> > > > + * Parallel work coordination structure.
> > > > + */
> > > > +struct xfs_pwork_ctl {
> > > > +	struct workqueue_struct	*wq;
> > > > +	struct xfs_mount	*mp;
> > > > +	xfs_pwork_work_fn	work_fn;
> > > > +	int			error;
> > > > +};
> > > > +
> > > > +/*
> > > > + * Embed this parallel work control item inside your own work structure,
> > > > + * then queue work with it.
> > > > + */
> > > > +struct xfs_pwork {
> > > > +	struct work_struct	work;
> > > > +	struct xfs_pwork_ctl	*pctl;
> > > > +};
> > > > +
> > > > +#define XFS_PWORK_SINGLE_THREADED	{ .pctl = NULL }
> > > > +
> > > > +/* Have we been told to abort? */
> > > > +static inline bool
> > > > +xfs_pwork_want_abort(
> > > > +	struct xfs_pwork	*pwork)
> > > > +{
> > > > +	return pwork->pctl && pwork->pctl->error;
> > > > +}
> > > > +
> > > > +int xfs_pwork_init(struct xfs_mount *mp, struct xfs_pwork_ctl *pctl,
> > > > +		xfs_pwork_work_fn work_fn, const char *tag,
> > > > +		unsigned int nr_threads);
> > > > +void xfs_pwork_queue(struct xfs_pwork_ctl *pctl, struct xfs_pwork *pwork);
> > > > +int xfs_pwork_destroy(struct xfs_pwork_ctl *pctl);
> > > > +unsigned int xfs_pwork_guess_datadev_parallelism(struct xfs_mount *mp);
> > > > +
> > > > +#endif /* __XFS_PWORK_H__ */
> > > > diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
> > > > index a5b2260406a8..e4f3785f7a64 100644
> > > > --- a/fs/xfs/xfs_qm.c
> > > > +++ b/fs/xfs/xfs_qm.c
> > > > @@ -1305,7 +1305,7 @@ xfs_qm_quotacheck(
> > > >  		flags |= XFS_PQUOTA_CHKD;
> > > >  	}
> > > >  
> > > > -	error = xfs_iwalk(mp, NULL, 0, xfs_qm_dqusage_adjust, 0, NULL);
> > > > +	error = xfs_iwalk_threaded(mp, 0, xfs_qm_dqusage_adjust, 0, NULL);
> > > >  	if (error)
> > > >  		goto error_return;
> > > >  
> > > > diff --git a/fs/xfs/xfs_sysctl.h b/fs/xfs/xfs_sysctl.h
> > > > index ad7f9be13087..b555e045e2f4 100644
> > > > --- a/fs/xfs/xfs_sysctl.h
> > > > +++ b/fs/xfs/xfs_sysctl.h
> > > > @@ -37,6 +37,9 @@ typedef struct xfs_param {
> > > >  	xfs_sysctl_val_t fstrm_timer;	/* Filestream dir-AG assoc'n timeout. */
> > > >  	xfs_sysctl_val_t eofb_timer;	/* Interval between eofb scan wakeups */
> > > >  	xfs_sysctl_val_t cowb_timer;	/* Interval between cowb scan wakeups */
> > > > +#ifdef DEBUG
> > > > +	xfs_sysctl_val_t pwork_threads;	/* Parallel workqueue thread count */
> > > > +#endif
> > > >  } xfs_param_t;
> > > >  
> > > >  /*
> > > > @@ -82,6 +85,9 @@ enum {
> > > >  extern xfs_param_t	xfs_params;
> > > >  
> > > >  struct xfs_globals {
> > > > +#ifdef DEBUG
> > > > +	int	pwork_threads;		/* parallel workqueue threads */
> > > > +#endif
> > > >  	int	log_recovery_delay;	/* log recovery delay (secs) */
> > > >  	int	mount_delay;		/* mount setup delay (secs) */
> > > >  	bool	bug_on_assert;		/* BUG() the kernel on assert failure */
> > > > diff --git a/fs/xfs/xfs_sysfs.c b/fs/xfs/xfs_sysfs.c
> > > > index cabda13f3c64..910e6b9cb1a7 100644
> > > > --- a/fs/xfs/xfs_sysfs.c
> > > > +++ b/fs/xfs/xfs_sysfs.c
> > > > @@ -206,11 +206,51 @@ always_cow_show(
> > > >  }
> > > >  XFS_SYSFS_ATTR_RW(always_cow);
> > > >  
> > > > +#ifdef DEBUG
> > > > +/*
> > > > + * Override how many threads the parallel work queue is allowed to create.
> > > > + * This has to be a debug-only global (instead of an errortag) because one of
> > > > + * the main users of parallel workqueues is mount time quotacheck.
> > > > + */
> > > > +STATIC ssize_t
> > > > +pwork_threads_store(
> > > > +	struct kobject	*kobject,
> > > > +	const char	*buf,
> > > > +	size_t		count)
> > > > +{
> > > > +	int		ret;
> > > > +	int		val;
> > > > +
> > > > +	ret = kstrtoint(buf, 0, &val);
> > > > +	if (ret)
> > > > +		return ret;
> > > > +
> > > > +	if (val < 0 || val > NR_CPUS)
> > > > +		return -EINVAL;
> > > > +
> > > > +	xfs_globals.pwork_threads = val;
> > > > +
> > > > +	return count;
> > > > +}
> > > > +
> > > > +STATIC ssize_t
> > > > +pwork_threads_show(
> > > > +	struct kobject	*kobject,
> > > > +	char		*buf)
> > > > +{
> > > > +	return snprintf(buf, PAGE_SIZE, "%d\n", xfs_globals.pwork_threads);
> > > > +}
> > > > +XFS_SYSFS_ATTR_RW(pwork_threads);
> > > > +#endif /* DEBUG */
> > > > +
> > > >  static struct attribute *xfs_dbg_attrs[] = {
> > > >  	ATTR_LIST(bug_on_assert),
> > > >  	ATTR_LIST(log_recovery_delay),
> > > >  	ATTR_LIST(mount_delay),
> > > >  	ATTR_LIST(always_cow),
> > > > +#ifdef DEBUG
> > > > +	ATTR_LIST(pwork_threads),
> > > > +#endif
> > > >  	NULL,
> > > >  };
> > > >  
> > > > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > > > index f9bb1d50bc0e..658cbade1998 100644
> > > > --- a/fs/xfs/xfs_trace.h
> > > > +++ b/fs/xfs/xfs_trace.h
> > > > @@ -3556,6 +3556,24 @@ TRACE_EVENT(xfs_iwalk_ag_rec,
> > > >  		  __entry->startino, __entry->freemask)
> > > >  )
> > > >  
> > > > +TRACE_EVENT(xfs_pwork_init,
> > > > +	TP_PROTO(struct xfs_mount *mp, unsigned int nr_threads, pid_t pid),
> > > > +	TP_ARGS(mp, nr_threads, pid),
> > > > +	TP_STRUCT__entry(
> > > > +		__field(dev_t, dev)
> > > > +		__field(unsigned int, nr_threads)
> > > > +		__field(pid_t, pid)
> > > > +	),
> > > > +	TP_fast_assign(
> > > > +		__entry->dev = mp->m_super->s_dev;
> > > > +		__entry->nr_threads = nr_threads;
> > > > +		__entry->pid = pid;
> > > > +	),
> > > > +	TP_printk("dev %d:%d nr_threads %u pid %u",
> > > > +		  MAJOR(__entry->dev), MINOR(__entry->dev),
> > > > +		  __entry->nr_threads, __entry->pid)
> > > > +)
> > > > +
> > > >  #endif /* _TRACE_XFS_H */
> > > >  
> > > >  #undef TRACE_INCLUDE_PATH
> > > > 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 09/10] xfs: poll waiting for quotacheck
  2019-06-11 15:07   ` Brian Foster
@ 2019-06-11 16:06     ` Darrick J. Wong
  2019-06-11 17:01       ` Brian Foster
  0 siblings, 1 reply; 46+ messages in thread
From: Darrick J. Wong @ 2019-06-11 16:06 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Tue, Jun 11, 2019 at 11:07:12AM -0400, Brian Foster wrote:
> On Tue, Jun 04, 2019 at 02:50:27PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Create a pwork destroy function that uses polling instead of
> > uninterruptible sleep to wait for work items to finish so that we can
> > touch the softlockup watchdog.  IOWs, gross hack.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> 
> Seems reasonable given the quirky circumstances of quotacheck. Just a
> couple nits..
> 
> >  fs/xfs/xfs_iwalk.c |    3 +++
> >  fs/xfs/xfs_iwalk.h |    3 ++-
> >  fs/xfs/xfs_pwork.c |   21 +++++++++++++++++++++
> >  fs/xfs/xfs_pwork.h |    2 ++
> >  fs/xfs/xfs_qm.c    |    2 +-
> >  5 files changed, 29 insertions(+), 2 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/xfs_iwalk.c b/fs/xfs/xfs_iwalk.c
> > index 71ee1628aa70..c4a9c4c246b7 100644
> > --- a/fs/xfs/xfs_iwalk.c
> > +++ b/fs/xfs/xfs_iwalk.c
> > @@ -526,6 +526,7 @@ xfs_iwalk_threaded(
> >  	xfs_ino_t		startino,
> >  	xfs_iwalk_fn		iwalk_fn,
> >  	unsigned int		max_prefetch,
> > +	bool			polled,
> >  	void			*data)
> >  {
> >  	struct xfs_pwork_ctl	pctl;
> > @@ -556,5 +557,7 @@ xfs_iwalk_threaded(
> >  		startino = XFS_AGINO_TO_INO(mp, agno + 1, 0);
> >  	}
> >  
> > +	if (polled)
> > +		return xfs_pwork_destroy_poll(&pctl);
> >  	return xfs_pwork_destroy(&pctl);
> 
> Rather than have duplicate destruction paths, could we rework
> xfs_pwork_destroy_poll() to something like xfs_pwork_poll() that just
> does the polling and returns, then the caller can fall back into the
> current xfs_pwork_destroy()? I.e., this ends up looking like:
> 
> 	...
> 	/* poll to keep soft lockup watchdog quiet */
> 	if (polled)
> 		xfs_pwork_poll(&pctl);
> 	return xfs_pwork_destroy(&pctl);

Sounds good, will do!

> >  }
> ...
> > diff --git a/fs/xfs/xfs_pwork.c b/fs/xfs/xfs_pwork.c
> > index 19605a3a2482..3b885e0b52ac 100644
> > --- a/fs/xfs/xfs_pwork.c
> > +++ b/fs/xfs/xfs_pwork.c
> ...
> > @@ -97,6 +101,23 @@ xfs_pwork_destroy(
> >  	return pctl->error;
> >  }
> >  
> > +/*
> > + * Wait for the work to finish and tear down the control structure.
> > + * Continually poll completion status and touch the soft lockup watchdog.
> > + * This is for things like mount that hold locks.
> > + */
> > +int
> > +xfs_pwork_destroy_poll(
> > +	struct xfs_pwork_ctl	*pctl)
> > +{
> > +	while (atomic_read(&pctl->nr_work) > 0) {
> > +		msleep(1);
> > +		touch_softlockup_watchdog();
> > +	}
> 
> Any idea what the typical watchdog timeout is?

Usually 30s for the hangcheck timeout.

> I'm curious where the 1ms
> comes from and whether we could get away with anything larger. I realize
> that might introduce mount latency with the current sleep based
> implementation, but we could also consider a waitqueue and using
> something like wait_event_timeout() to schedule out for longer time
> periods and still wake up immediately when the count drops to 0.

That's a much better approach than this naïve one which waits
unnecessarily after the pwork finishes; I'll replace it with this.
The kernel doesn't export the hang check timeout variable, so I think
I'll just set it to 1s, which should be infrequent enough not to use a
lot of CPU and frequent enough that we don't spew warnings everywhere.

--D

> Brian
> 
> > +
> > +	return xfs_pwork_destroy(pctl);
> > +}
> > +
> >  /*
> >   * Return the amount of parallelism that the data device can handle, or 0 for
> >   * no limit.
> > diff --git a/fs/xfs/xfs_pwork.h b/fs/xfs/xfs_pwork.h
> > index e0c1354a2d8c..08da723a8dc9 100644
> > --- a/fs/xfs/xfs_pwork.h
> > +++ b/fs/xfs/xfs_pwork.h
> > @@ -18,6 +18,7 @@ struct xfs_pwork_ctl {
> >  	struct workqueue_struct	*wq;
> >  	struct xfs_mount	*mp;
> >  	xfs_pwork_work_fn	work_fn;
> > +	atomic_t		nr_work;
> >  	int			error;
> >  };
> >  
> > @@ -45,6 +46,7 @@ int xfs_pwork_init(struct xfs_mount *mp, struct xfs_pwork_ctl *pctl,
> >  		unsigned int nr_threads);
> >  void xfs_pwork_queue(struct xfs_pwork_ctl *pctl, struct xfs_pwork *pwork);
> >  int xfs_pwork_destroy(struct xfs_pwork_ctl *pctl);
> > +int xfs_pwork_destroy_poll(struct xfs_pwork_ctl *pctl);
> >  unsigned int xfs_pwork_guess_datadev_parallelism(struct xfs_mount *mp);
> >  
> >  #endif /* __XFS_PWORK_H__ */
> > diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
> > index e4f3785f7a64..de6a623ada02 100644
> > --- a/fs/xfs/xfs_qm.c
> > +++ b/fs/xfs/xfs_qm.c
> > @@ -1305,7 +1305,7 @@ xfs_qm_quotacheck(
> >  		flags |= XFS_PQUOTA_CHKD;
> >  	}
> >  
> > -	error = xfs_iwalk_threaded(mp, 0, xfs_qm_dqusage_adjust, 0, NULL);
> > +	error = xfs_iwalk_threaded(mp, 0, xfs_qm_dqusage_adjust, 0, true, NULL);
> >  	if (error)
> >  		goto error_return;
> >  
> > 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 10/10] xfs: refactor INUMBERS to use iwalk functions
  2019-06-11 15:08   ` Brian Foster
@ 2019-06-11 16:21     ` Darrick J. Wong
  2019-06-11 17:01       ` Brian Foster
  0 siblings, 1 reply; 46+ messages in thread
From: Darrick J. Wong @ 2019-06-11 16:21 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Tue, Jun 11, 2019 at 11:08:51AM -0400, Brian Foster wrote:
> On Tue, Jun 04, 2019 at 02:50:33PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Now that we have generic functions to walk inode records, refactor the
> > INUMBERS implementation to use it.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/xfs_ioctl.c   |   20 ++++--
> >  fs/xfs/xfs_ioctl.h   |    2 +
> >  fs/xfs/xfs_ioctl32.c |   35 ++++------
> >  fs/xfs/xfs_itable.c  |  168 ++++++++++++++++++++------------------------------
> >  fs/xfs/xfs_itable.h  |   22 +------
> >  fs/xfs/xfs_iwalk.c   |  161 ++++++++++++++++++++++++++++++++++++++++++++++--
> >  fs/xfs/xfs_iwalk.h   |   12 ++++
> 
> It looks like there's a decent amount of xfs_iwalk code changes in this
> patch. That should probably be a separate patch, at minimum to have a
> separate changelog to document the changes/updates required for
> inumbers.

<nod> I'll break out the iwalk changes into a separate patch so that
this one only has the changes needed to wire up the ioctl.

> >  7 files changed, 262 insertions(+), 158 deletions(-)
> > 
> > 
> ...
> > diff --git a/fs/xfs/xfs_itable.c b/fs/xfs/xfs_itable.c
> > index 06abe5c9c0ee..bade54d6ac64 100644
> > --- a/fs/xfs/xfs_itable.c
> > +++ b/fs/xfs/xfs_itable.c
> > @@ -259,121 +259,85 @@ xfs_bulkstat(
> >  	return error;
> >  }
> >  
> > -int
> > -xfs_inumbers_fmt(
> > -	void			__user *ubuffer, /* buffer to write to */
> > -	const struct xfs_inogrp	*buffer,	/* buffer to read from */
> > -	long			count,		/* # of elements to read */
> > -	long			*written)	/* # of bytes written */
> > +struct xfs_inumbers_chunk {
> > +	inumbers_fmt_pf		formatter;
> > +	struct xfs_ibulk	*breq;
> > +};
> > +
> > +/*
> > + * INUMBERS
> > + * ========
> > + * This is how we export inode btree records to userspace, so that XFS tools
> > + * can figure out where inodes are allocated.
> > + */
> > +
> > +/*
> > + * Format the inode group structure and report it somewhere.
> > + *
> > + * Similar to xfs_bulkstat_one_int, lastino is the inode cursor as we walk
> > + * through the filesystem so we move it forward unless there was a runtime
> > + * error.  If the formatter tells us the buffer is now full we also move the
> > + * cursor forward and abort the walk.
> > + */
> > +STATIC int
> > +xfs_inumbers_walk(
> > +	struct xfs_mount	*mp,
> > +	struct xfs_trans	*tp,
> > +	xfs_agnumber_t		agno,
> > +	const struct xfs_inobt_rec_incore *irec,
> > +	void			*data)
> >  {
> > -	if (copy_to_user(ubuffer, buffer, count * sizeof(*buffer)))
> > -		return -EFAULT;
> > -	*written = count * sizeof(*buffer);
> > -	return 0;
> > +	struct xfs_inogrp	inogrp = {
> > +		.xi_startino	= XFS_AGINO_TO_INO(mp, agno, irec->ir_startino),
> > +		.xi_alloccount	= irec->ir_count - irec->ir_freecount,
> > +		.xi_allocmask	= ~irec->ir_free,
> > +	};
> 
> Not related to this patch, but I'm wondering if we should be using
> xfs_inobt_irec_to_allocmask() here to mask off holes from the resulting
> allocation bitmap. Eh, I guess it's misleading either way..

Holes were supposed to be marked in ir_free also, right?

So (assuming the irec isn't corrupt) we should be protected against
reporting a hole as an "allocated" inode, right?

> > +	struct xfs_inumbers_chunk *ic = data;
> > +	xfs_agino_t		agino;
> > +	int			error;
> > +
> > +	error = ic->formatter(ic->breq, &inogrp);
> > +	if (error && error != XFS_IBULK_BUFFER_FULL)
> > +		return error;
> > +	if (error == XFS_IBULK_BUFFER_FULL)
> > +		error = XFS_INOBT_WALK_ABORT;
> > +
> > +	agino = irec->ir_startino + XFS_INODES_PER_CHUNK;
> > +	ic->breq->startino = XFS_AGINO_TO_INO(mp, agno, agino);
> > +	return error;
> >  }
> >  
> ...
> > diff --git a/fs/xfs/xfs_iwalk.c b/fs/xfs/xfs_iwalk.c
> > index c4a9c4c246b7..3a35d1cf7e14 100644
> > --- a/fs/xfs/xfs_iwalk.c
> > +++ b/fs/xfs/xfs_iwalk.c
> ...
> > @@ -286,7 +298,7 @@ xfs_iwalk_ag_start(
> >  	 * have to deal with tearing down the cursor to walk the records.
> >  	 */
> >  	error = xfs_iwalk_grab_ichunk(*curpp, agino, &icount,
> > -			&iwag->recs[iwag->nr_recs]);
> > +			&iwag->recs[iwag->nr_recs], trim);
> 
> Hmm, it looks like we could lift the lookup from xfs_iwalk_grab_ichunk()
> up into xfs_iwalk_ag_start() and avoid needing to pass trim down
> multiple levels. In fact, if we're not trimming the record we don't need
> to grab the record at all in this path. We could do the lookup (setting
> has_more) then bounce right up to the core walker algorithm, right? If
> so, that seems a bit cleaner in terms of only using special cased code
> when it's actually necessary.

Right.

> 
> >  	if (error)
> >  		return error;
> >  	if (icount)
> ...
> > @@ -561,3 +574,135 @@ xfs_iwalk_threaded(
> ...
> > +/*
> > + * Walk all inode btree records in a single AG, from @iwag->startino to the end
> > + * of the AG.
> > + */
> > +STATIC int
> > +xfs_inobt_walk_ag(
> > +	struct xfs_iwalk_ag		*iwag)
> > +{
> > +	struct xfs_mount		*mp = iwag->mp;
> > +	struct xfs_trans		*tp = iwag->tp;
> > +	struct xfs_buf			*agi_bp = NULL;
> > +	struct xfs_btree_cur		*cur = NULL;
> > +	xfs_agnumber_t			agno;
> > +	xfs_agino_t			agino;
> > +	int				has_more;
> > +	int				error = 0;
> > +
> > +	/* Set up our cursor at the right place in the inode btree. */
> > +	agno = XFS_INO_TO_AGNO(mp, iwag->startino);
> > +	agino = XFS_INO_TO_AGINO(mp, iwag->startino);
> > +	error = xfs_iwalk_ag_start(iwag, agno, agino, &cur, &agi_bp, &has_more,
> > +			false);
> > +
> > +	while (!error && has_more && !xfs_pwork_want_abort(&iwag->pwork)) {
> > +		struct xfs_inobt_rec_incore	*irec;
> > +
> > +		cond_resched();
> > +
> > +		/* Fetch the inobt record. */
> > +		irec = &iwag->recs[iwag->nr_recs];
> > +		error = xfs_inobt_get_rec(cur, irec, &has_more);
> > +		if (error || !has_more)
> > +			break;
> > +
> > +		/*
> > +		 * If there's space in the buffer for more records, increment
> > +		 * the btree cursor and grab more.
> > +		 */
> > +		if (++iwag->nr_recs < iwag->sz_recs) {
> > +			error = xfs_btree_increment(cur, 0, &has_more);
> > +			if (error || !has_more)
> > +				break;
> > +			continue;
> > +		}
> > +
> > +		/*
> > +		 * Otherwise, we need to save cursor state and run the callback
> > +		 * function on the cached records.  The run_callbacks function
> > +		 * is supposed to return a cursor pointing to the record where
> > +		 * we would be if we had been able to increment like above.
> > +		 */
> > +		error = xfs_iwalk_run_callbacks(iwag, xfs_inobt_walk_ag_recs,
> > +				agno, &cur, &agi_bp, &has_more);
> > +	}
> > +
> > +	xfs_iwalk_del_inobt(tp, &cur, &agi_bp, error);
> > +
> > +	/* Walk any records left behind in the cache. */
> > +	if (iwag->nr_recs == 0 || error || xfs_pwork_want_abort(&iwag->pwork))
> > +		return error;
> > +
> > +	return xfs_inobt_walk_ag_recs(iwag);
> > +}
> 
> Similar comments apply here as for the previous xfs_iwalk_ag() patch.
> Though looking at it, the only differences here are the lack of free
> inode check, readahead and the callback function (particularly once you
> consider the partial completion refactoring we discussed earlier). I
> think this could all be generalized with a single flag such that there's
> no need for separate xfs_[inobt|iwalk]_ag() functions.

Yep.  Already refactoring that. :)

> Hmmm.. perhaps we could use a flag or separate function pointers in
> struct xfs_iwalk_ag to accomplish the same thing all the way up through
> the record walker functions. IOW, xfs_iwalk_ag_recs() looks like it
> could call either callback based on which is defined in the
> xfs_iwalk_ag.

<nod>

> This could all be done as a separate patch of course, if that's easier.

I might just go back and remove the function pointer from run_callbacks
in the earlier patches...

> 
> > +
> > +/*
> > + * Walk all inode btree records in the filesystem starting from @startino.  The
> > + * @inobt_walk_fn will be called for each btree record, being passed the incore
> > + * record and @data.  @max_prefetch controls how many inobt records we try to
> > + * cache ahead of time.
> > + */
> > +int
> > +xfs_inobt_walk(
> > +	struct xfs_mount	*mp,
> > +	struct xfs_trans	*tp,
> > +	xfs_ino_t		startino,
> > +	xfs_inobt_walk_fn	inobt_walk_fn,
> > +	unsigned int		max_prefetch,
> > +	void			*data)
> > +{
> > +	struct xfs_iwalk_ag	iwag = {
> > +		.mp		= mp,
> > +		.tp		= tp,
> > +		.inobt_walk_fn	= inobt_walk_fn,
> > +		.data		= data,
> > +		.startino	= startino,
> > +		.pwork		= XFS_PWORK_SINGLE_THREADED,
> > +	};
> > +	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, startino);
> > +	int			error;
> > +
> > +	ASSERT(agno < mp->m_sb.sb_agcount);
> > +
> > +	xfs_iwalk_set_prefetch(&iwag, max_prefetch * XFS_INODES_PER_CHUNK);
> 
> A brief comment above this line would be helpful. Something like:
> 
> 	/* translate inumbers record count to inode count */

Done.  Thanks for the review!

--D

> Brian
> 
> > +	error = xfs_iwalk_alloc(&iwag);
> > +	if (error)
> > +		return error;
> > +
> > +	for (; agno < mp->m_sb.sb_agcount; agno++) {
> > +		error = xfs_inobt_walk_ag(&iwag);
> > +		if (error)
> > +			break;
> > +		iwag.startino = XFS_AGINO_TO_INO(mp, agno + 1, 0);
> > +	}
> > +
> > +	xfs_iwalk_free(&iwag);
> > +	return error;
> > +}
> > diff --git a/fs/xfs/xfs_iwalk.h b/fs/xfs/xfs_iwalk.h
> > index 76d8f87a39ef..20bee93d4676 100644
> > --- a/fs/xfs/xfs_iwalk.h
> > +++ b/fs/xfs/xfs_iwalk.h
> > @@ -18,4 +18,16 @@ int xfs_iwalk_threaded(struct xfs_mount *mp, xfs_ino_t startino,
> >  		xfs_iwalk_fn iwalk_fn, unsigned int max_prefetch, bool poll,
> >  		void *data);
> >  
> > +/* Walk all inode btree records in the filesystem starting from @startino. */
> > +typedef int (*xfs_inobt_walk_fn)(struct xfs_mount *mp, struct xfs_trans *tp,
> > +				 xfs_agnumber_t agno,
> > +				 const struct xfs_inobt_rec_incore *irec,
> > +				 void *data);
> > +/* Return value (for xfs_inobt_walk_fn) that aborts the walk immediately. */
> > +#define XFS_INOBT_WALK_ABORT	(XFS_IWALK_ABORT)
> > +
> > +int xfs_inobt_walk(struct xfs_mount *mp, struct xfs_trans *tp,
> > +		xfs_ino_t startino, xfs_inobt_walk_fn inobt_walk_fn,
> > +		unsigned int max_prefetch, void *data);
> > +
> >  #endif /* __XFS_IWALK_H__ */
> > 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 08/10] xfs: multithreaded iwalk implementation
  2019-06-11 15:29         ` Darrick J. Wong
@ 2019-06-11 17:00           ` Brian Foster
  0 siblings, 0 replies; 46+ messages in thread
From: Brian Foster @ 2019-06-11 17:00 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Jun 11, 2019 at 08:29:26AM -0700, Darrick J. Wong wrote:
> On Tue, Jun 11, 2019 at 09:13:40AM -0400, Brian Foster wrote:
> > On Mon, Jun 10, 2019 at 06:10:20PM -0700, Darrick J. Wong wrote:
> > > On Mon, Jun 10, 2019 at 03:40:13PM -0400, Brian Foster wrote:
> > > > On Tue, Jun 04, 2019 at 02:50:20PM -0700, Darrick J. Wong wrote:
> > > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > 
> > > > > Create a parallel iwalk implementation and switch quotacheck to use it.
> > > > > 
> > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > ---
> > > > 
> > > > Interesting.. is there any commonality here with the ktask mechanism
> > > > that's been in progress? I've not followed the details, but I thought it
> > > > was a similar idea. The last post I see for that is here:
> > > > 
> > > > https://marc.info/?l=linux-mm&m=154143701122927&w=2
> > > 
> > > Yes, xfs_pwork is... the result of ktask still not landing upstream
> > > after a couple of years. :(
> > > 
> > 
> > Heh, Ok. We could always port over to it if it ever does land.
> > 
> > > > That aside, this all looks mostly fine to me. A few random thoughts..
> > > > 
> > > > >  fs/xfs/Makefile      |    1 
> > > > >  fs/xfs/xfs_globals.c |    3 +
> > > > >  fs/xfs/xfs_iwalk.c   |   76 ++++++++++++++++++++++++++++++-
> > > > >  fs/xfs/xfs_iwalk.h   |    2 +
> > > > >  fs/xfs/xfs_pwork.c   |  122 ++++++++++++++++++++++++++++++++++++++++++++++++++
> > > > >  fs/xfs/xfs_pwork.h   |   50 ++++++++++++++++++++
> > > > >  fs/xfs/xfs_qm.c      |    2 -
> > > > >  fs/xfs/xfs_sysctl.h  |    6 ++
> > > > >  fs/xfs/xfs_sysfs.c   |   40 ++++++++++++++++
> > > > >  fs/xfs/xfs_trace.h   |   18 +++++++
> > > > >  10 files changed, 317 insertions(+), 3 deletions(-)
> > > > >  create mode 100644 fs/xfs/xfs_pwork.c
> > > > >  create mode 100644 fs/xfs/xfs_pwork.h
> > > > > 
> > > > > 
> > > > > diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
> > > > > index 74d30ef0dbce..48940a27d4aa 100644
> > > > > --- a/fs/xfs/Makefile
> > > > > +++ b/fs/xfs/Makefile
> > > > > @@ -84,6 +84,7 @@ xfs-y				+= xfs_aops.o \
> > > > >  				   xfs_message.o \
> > > > >  				   xfs_mount.o \
> > > > >  				   xfs_mru_cache.o \
> > > > > +				   xfs_pwork.o \
> > > > >  				   xfs_reflink.o \
> > > > >  				   xfs_stats.o \
> > > > >  				   xfs_super.o \
> > > > > diff --git a/fs/xfs/xfs_globals.c b/fs/xfs/xfs_globals.c
> > > > > index d0d377384120..4f93f2c4dc38 100644
> > > > > --- a/fs/xfs/xfs_globals.c
> > > > > +++ b/fs/xfs/xfs_globals.c
> > > > > @@ -31,6 +31,9 @@ xfs_param_t xfs_params = {
> > > > >  	.fstrm_timer	= {	1,		30*100,		3600*100},
> > > > >  	.eofb_timer	= {	1,		300,		3600*24},
> > > > >  	.cowb_timer	= {	1,		1800,		3600*24},
> > > > > +#ifdef DEBUG
> > > > > +	.pwork_threads	= {	0,		0,		NR_CPUS	},
> > > > > +#endif
> > > > >  };
> > > > >  
> > > > >  struct xfs_globals xfs_globals = {
> > > > > diff --git a/fs/xfs/xfs_iwalk.c b/fs/xfs/xfs_iwalk.c
> > > > > index 8595258b5001..71ee1628aa70 100644
> > > > > --- a/fs/xfs/xfs_iwalk.c
> > > > > +++ b/fs/xfs/xfs_iwalk.c
> > > > > @@ -21,6 +21,7 @@
> > > > >  #include "xfs_health.h"
> > > > >  #include "xfs_trans.h"
> > > > >  #include "xfs_iwalk.h"
> > > > > +#include "xfs_pwork.h"
> > > > >  
> > > > >  /*
> > > > >   * Walking Inodes in the Filesystem
> > > > > @@ -46,6 +47,9 @@
> > > > >   */
> > > > >  
> > > > >  struct xfs_iwalk_ag {
> > > > > +	/* parallel work control data; will be null if single threaded */
> > > > > +	struct xfs_pwork		pwork;
> > > > > +
> > > > >  	struct xfs_mount		*mp;
> > > > >  	struct xfs_trans		*tp;
> > > > >  
> > > > > @@ -200,6 +204,9 @@ xfs_iwalk_ag_recs(
> > > > >  		trace_xfs_iwalk_ag_rec(mp, agno, irec);
> > > > >  
> > > > >  		for (j = 0; j < XFS_INODES_PER_CHUNK; j++) {
> > > > > +			if (xfs_pwork_want_abort(&iwag->pwork))
> > > > > +				return 0;
> > > > > +
> > > > >  			/* Skip if this inode is free */
> > > > >  			if (XFS_INOBT_MASK(j) & irec->ir_free)
> > > > >  				continue;
> > > > > @@ -360,7 +367,7 @@ xfs_iwalk_ag(
> > > > >  	agino = XFS_INO_TO_AGINO(mp, iwag->startino);
> > > > >  	error = xfs_iwalk_ag_start(iwag, agno, agino, &cur, &agi_bp, &has_more);
> > > > >  
> > > > > -	while (!error && has_more) {
> > > > > +	while (!error && has_more && !xfs_pwork_want_abort(&iwag->pwork)) {
> > > > >  		struct xfs_inobt_rec_incore	*irec;
> > > > >  
> > > > >  		cond_resched();
> > > > > @@ -409,7 +416,7 @@ xfs_iwalk_ag(
> > > > >  	xfs_iwalk_del_inobt(tp, &cur, &agi_bp, error);
> > > > >  
> > > > >  	/* Walk any records left behind in the cache. */
> > > > > -	if (iwag->nr_recs == 0 || error)
> > > > > +	if (iwag->nr_recs == 0 || error || xfs_pwork_want_abort(&iwag->pwork))
> > > > >  		return error;
> > > > >  
> > > > >  	return xfs_iwalk_ag_recs(iwag);
> > > > > @@ -465,6 +472,7 @@ xfs_iwalk(
> > > > >  		.iwalk_fn	= iwalk_fn,
> > > > >  		.data		= data,
> > > > >  		.startino	= startino,
> > > > > +		.pwork		= XFS_PWORK_SINGLE_THREADED,
> > > > >  	};
> > > > >  	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, startino);
> > > > >  	int			error;
> > > > > @@ -486,3 +494,67 @@ xfs_iwalk(
> > > > >  	xfs_iwalk_free(&iwag);
> > > > >  	return error;
> > > > >  }
> > > > > +
> > > > > +/* Run per-thread iwalk work. */
> > > > > +static int
> > > > > +xfs_iwalk_ag_work(
> > > > > +	struct xfs_mount	*mp,
> > > > > +	struct xfs_pwork	*pwork)
> > > > > +{
> > > > > +	struct xfs_iwalk_ag	*iwag;
> > > > > +	int			error;
> > > > > +
> > > > > +	iwag = container_of(pwork, struct xfs_iwalk_ag, pwork);
> > > > > +	error = xfs_iwalk_alloc(iwag);
> > > > > +	if (error)
> > > > > +		goto out;
> > > > 
> > > > In most cases this will never fail, but the error path if it does looks
> > > > slightly painful. I was thinking if we could move this up into
> > > > xfs_iwalk_threaded() so we wouldn't continue to queue work jobs when
> > > > failure is imminent...
> > > > 
> > > > > +
> > > > > +	error = xfs_iwalk_ag(iwag);
> > > > > +	xfs_iwalk_free(iwag);
> > > > > +out:
> > > > > +	kmem_free(iwag);
> > > > > +	return error;
> > > > > +}
> > > > > +
> > > > > +/*
> > > > > + * Walk all the inodes in the filesystem using multiple threads to process each
> > > > > + * AG.
> > > > > + */
> > > > > +int
> > > > > +xfs_iwalk_threaded(
> > > > > +	struct xfs_mount	*mp,
> > > > > +	xfs_ino_t		startino,
> > > > > +	xfs_iwalk_fn		iwalk_fn,
> > > > > +	unsigned int		max_prefetch,
> > > > > +	void			*data)
> > > > > +{
> > > > > +	struct xfs_pwork_ctl	pctl;
> > > > > +	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, startino);
> > > > > +	unsigned int		nr_threads;
> > > > > +	int			error;
> > > > > +
> > > > > +	ASSERT(agno < mp->m_sb.sb_agcount);
> > > > > +
> > > > > +	nr_threads = xfs_pwork_guess_datadev_parallelism(mp);
> > > > > +	error = xfs_pwork_init(mp, &pctl, xfs_iwalk_ag_work, "xfs_iwalk",
> > > > > +			nr_threads);
> > > > > +	if (error)
> > > > > +		return error;
> > > > > +
> > > > > +	for (; agno < mp->m_sb.sb_agcount; agno++) {
> > > > > +		struct xfs_iwalk_ag	*iwag;
> > > > > +
> > > > > +		iwag = kmem_alloc(sizeof(struct xfs_iwalk_ag), KM_SLEEP);
> > > > > +		iwag->mp = mp;
> > > > > +		iwag->tp = NULL;
> > > > > +		iwag->iwalk_fn = iwalk_fn;
> > > > > +		iwag->data = data;
> > > > > +		iwag->startino = startino;
> > > > > +		iwag->recs = NULL;
> > > > > +		xfs_iwalk_set_prefetch(iwag, max_prefetch);
> > > > > +		xfs_pwork_queue(&pctl, &iwag->pwork);
> > > > > +		startino = XFS_AGINO_TO_INO(mp, agno + 1, 0);
> > > > > +	}
> > > > 
> > > > ... but this is only bound by the number of AGs and so could result in a
> > > > large number of allocations. FWIW, I wouldn't expect that to be a
> > > > problem in the common case. I'm more thinking about the case of a
> > > > specially crafted filesystem designed to cause problems on mount.
> > > 
> > > <nod> I thought about that, and decided that it wasn't a good idea to
> > > for each of the queued (but not processing) work items to be sitting on
> > > a bunch of memory because that memory can't be put to useful work.
> > > That's why I put it in xfs_iwalk_ag_work.
> > > 
> > 
> > Yep.
> > 
> > > Also that would necessitate iwalk feeding a destructor to pwork so that
> > > it can deal with work items that were queued but never actually run.
> > > 
> > 
> > What's the scenario for queued jobs that never run? We have to run the
> > work item to free the iwag, so I'd assume we could free the record
> > buffer in the same place if we wanted to (which we don't :P).
> 
> Doh!  That was a thinko on my part.  pwork always runs the work item,
> even if pctl->error != 0; it's up to the individual work item to check
> if we're aborted and run for the exits, which _iwalk_ag_work doesn't do.
> 
> Erugh, I'll fix the documentation and _iwalk_ag_work.
> 
> > > > 
> > > > > +
> > > > > +	return xfs_pwork_destroy(&pctl);
> > > > > +}
> > ...
> > > > > +/*
> > > > > + * Return the amount of parallelism that the data device can handle, or 0 for
> > > > > + * no limit.
> > > > > + */
> > > > > +unsigned int
> > > > > +xfs_pwork_guess_datadev_parallelism(
> > > > > +	struct xfs_mount	*mp)
> > > > > +{
> > > > > +	struct xfs_buftarg	*btp = mp->m_ddev_targp;
> > > > > +	int			iomin;
> > > > > +	int			ioopt;
> > > > > +
> > > > > +	if (blk_queue_nonrot(btp->bt_bdev->bd_queue))
> > > > > +		return num_online_cpus();
> > > > > +	if (mp->m_sb.sb_width && mp->m_sb.sb_unit)
> > > > > +		return mp->m_sb.sb_width / mp->m_sb.sb_unit;
> > > > > +	iomin = bdev_io_min(btp->bt_bdev);
> > > > > +	ioopt = bdev_io_opt(btp->bt_bdev);
> > > > > +	if (iomin && ioopt)
> > > > > +		return ioopt / iomin;
> > > > > +
> > > > > +	return 1;
> > > > 
> > > > Have you collected any performance data related to these heuristics?
> > > 
> > > Yeah, the quotacheck runtime reduces by 5-10% on my SSDs (~5% on a
> > > single SSD, ~10% on a 4-way raid0).  That wasn't really all that
> > > awesome, so I recorded a flame graph (see below) to find where the
> > > remaining overhead is.  A lot of it was in xfs_iget, and I also noticed
> > > that deferred inode inactivation sped it up a little more.
> > > 
> > 
> > So if I follow the graphic correctly, you have 4 walker threads running
> > the quotacheck. The majority of overhead is the inode memory allocation
> > followed by inode buffer lookup and then dquot lookup slightly behind
> > that. If this is an SSD, the heuristic presumably set the thread count
> > based on the CPU count, right?
> 
> Right.
> 
> > Hmm, I can't tell from the image what happens down in xfs_buf_find(). Do
> > you have granular enough data to see whether these are buffer cache hits
> > or misses?
> 
> I /think/ they're mostly hits, though I admit it's hard to read the
> 24000 pixel wide output, even after clicking on xfs_buf_read_map to zoom
> in a bit.
> 

Ok, I guess it would be nice to see the source of that overhead for
future reference. Maybe perf or something could show it better.

> > > > I
> > > > assume the feature is generally a win, but this also seems like we have
> > > > a large window of variance here. E.g., an SSD on a server with hundreds
> > > > of CPUs will enable as many threads as CPUs, but a single xTB spindle on
> > > > the same box may run single threaded (a quick check of a few local
> > > > devices all return an optimal I/O size of 0). Is there really no benefit
> > > > parallelizing some of that work in the spinning rust case?
> > > 
> > > Single-spindle spinning rust got worse even with 2 threads because the
> > > heads ping-pong between AGs.  It's not horrible with enough readahead,
> > > but it turns into a disaster when I disabled readahead, unsurprisingly.
> > > 
> > 
> > Ok. Is that tempered by multi-spindle devices? What about a raid5/6 like
> > device where we have a stripe unit/width set, but the device itself may
> > have concurrency characteristics more like a single spindle as opposed
> > to something like raid0?
> 
> I'd have to study a wider variety of disk raid setups.  I'm under the
> impression that the inode allocation code tries to allocate contiguous
> runs of inode clusters which will fill up stripe units (su), and then
> (if there's still space) it'll keep going into the next stripe?
> 

That sounds about right to me.

Note that I'm not suggesting we necessarily have to test every possible
storage config to land this mechanism. I think it would be reasonable to
to restrict the heuristic to conservatively enable cases we have tested
so far (i.e., nonrotational with a handful of threads) and expand it
from there over time as we have opportunity to test more configs.

> Hmm, I wonder how well inodes get distributed across stripes...
> 
> > > > What about in the other direction where we might have a ton of threads
> > > > for inodes across AGs that all happen to be in the same project quota,
> > > > for example?
> > > 
> > > I collected a flame graph... https://djwong.org/docs/quotacheck.svg
> > > 
> > 
> > I'm specifically thinking about a few random systems I've used recently
> > with hundreds of CPUs. I'm not sure those boxes actually have SSDs, but
> > if one did (and with enough AGs), I'm curious how this algorithm would
> > behave under those conditions.
> >
> > I guess this all boils down to trying to understand if/what breakdown
> > conditions might exist given the different possibilies allowed by the
> > current heuristic. I'm wondering if we should either restrict this
> > heuristic to enable concurrency specifically in the environments we've
> > shown it to have benefits and/or with some kind of reasonable cap to
> > limit unknown boundary conditions. E.g., the case of hundreds of AGs and
> > hundreds of CPUs on SSD seems like it could go really well (a big flash
> > raid0) or potentially really bad (one big SSD with a poorly configured
> > fs). Thoughts?
> 
> <nod> I wish we had a better way for the block layer to give us even a
> rough estimate of how many iops a device can handle... or even the
> ability to help us train ourselves after the fact by measuring average
> latency over time vs. the last (say) 2s so that we could back off if we
> thought the device was getting congested.
> 

Yeah, we're kind of limited by the lack of detailed information on the
storage setup.

> [Hmm, Johannes merged some "pressure stall information" thing a couple
> of releases back; I wonder if we can make use of that... I'll have a
> look and report back.]
> 
> I suppose we could cap the number of cpus at some arbitrary high number
> for now (128?) and see how long it takes to hit scalability problems
> because of it, but we could also just let things scale and see who
> complains about quotacheck getting slower on their 500 CPU system with a
> nonrotational slate stone.  I suspect "one big SSD with a poorly
> configured fs" is going to notice problems everywhere anyway.
> 

Indeed, we could do that. I was thinking about a similar approach but
from the opposite direction. I.e., rather than set a potential
performance landmine (assuming there's some config out there that this
would thrash), cap things at a few threads to start and expand out from
there as dictated by testing, etc. As it is, quotacheck has been
single threaded forever (AFAIA) anyways.

I suppose with either approach we could also consider enabling the sysfs
knob on production environments such that users have either the ability
to speed up a currently slow quotacheck or recover from one that is
thrashing due to too many workers (depending on which approach we take).
We'd probably have to think a little bit harder about the interface if
we wanted to do that, though. One obvious issue is that the thread count
for one filesystem might be irrelevant for another, and quotacheck is a
mount time operation. :/

I dunno.. I'm kind of curious to test this against some of the filesets
I'm generating for the near allocation mode rework stuff. I only have a
couple days before heading off for vacation though so realistically it
might be a couple weeks before I have a chance to do that. It sounds
like Dave has reviewed at least some of this work so far as well. Maybe
he or others have some thoughts on how to approach the concurrency
heuristic..?

> Admittedly this is yet another one of those cases where I turn something
> on and ... the bottleneck just shows up somewhere else now.  I hadn't
> realized that iget overhead is so high.
> 

Same.

Brian

> > > It turned out that the inode setup overhead in xfs_iget is so high that
> > > locking the dquot has negligible overhead.  When I "fixed" quotacheck to
> > > read the quota information straight from the inode cluster buffer if the
> > > inode wasn't in memory, the runtime dropped by 60% but Dave warned me
> > > not to take us back to inode buffer aliasing hell.  I also noted that
> > > if booting with mem=512M the memory reclamation overhead totally fries
> > > us regardless of parallelisation.
> > > 
> > 
> > I'm not familiar with the buffer aliasing problem.. I'm guessing this is
> > talking about risk of inconsistent in-core inodes with inode buffers..?
> > In any event, I agree that it's not worth risking integrity or
> > overcomplicating things for something like quotacheck.
> 
> Correct.
> 
> --D
> 
> > Brian
> > 
> > > --D
> > > 
> > > > 
> > > > Brian
> > > > 
> > > > > +}
> > > > > diff --git a/fs/xfs/xfs_pwork.h b/fs/xfs/xfs_pwork.h
> > > > > new file mode 100644
> > > > > index 000000000000..e0c1354a2d8c
> > > > > --- /dev/null
> > > > > +++ b/fs/xfs/xfs_pwork.h
> > > > > @@ -0,0 +1,50 @@
> > > > > +// SPDX-License-Identifier: GPL-2.0+
> > > > > +/*
> > > > > + * Copyright (C) 2019 Oracle.  All Rights Reserved.
> > > > > + * Author: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > + */
> > > > > +#ifndef __XFS_PWORK_H__
> > > > > +#define __XFS_PWORK_H__
> > > > > +
> > > > > +struct xfs_pwork;
> > > > > +struct xfs_mount;
> > > > > +
> > > > > +typedef int (*xfs_pwork_work_fn)(struct xfs_mount *mp, struct xfs_pwork *pwork);
> > > > > +
> > > > > +/*
> > > > > + * Parallel work coordination structure.
> > > > > + */
> > > > > +struct xfs_pwork_ctl {
> > > > > +	struct workqueue_struct	*wq;
> > > > > +	struct xfs_mount	*mp;
> > > > > +	xfs_pwork_work_fn	work_fn;
> > > > > +	int			error;
> > > > > +};
> > > > > +
> > > > > +/*
> > > > > + * Embed this parallel work control item inside your own work structure,
> > > > > + * then queue work with it.
> > > > > + */
> > > > > +struct xfs_pwork {
> > > > > +	struct work_struct	work;
> > > > > +	struct xfs_pwork_ctl	*pctl;
> > > > > +};
> > > > > +
> > > > > +#define XFS_PWORK_SINGLE_THREADED	{ .pctl = NULL }
> > > > > +
> > > > > +/* Have we been told to abort? */
> > > > > +static inline bool
> > > > > +xfs_pwork_want_abort(
> > > > > +	struct xfs_pwork	*pwork)
> > > > > +{
> > > > > +	return pwork->pctl && pwork->pctl->error;
> > > > > +}
> > > > > +
> > > > > +int xfs_pwork_init(struct xfs_mount *mp, struct xfs_pwork_ctl *pctl,
> > > > > +		xfs_pwork_work_fn work_fn, const char *tag,
> > > > > +		unsigned int nr_threads);
> > > > > +void xfs_pwork_queue(struct xfs_pwork_ctl *pctl, struct xfs_pwork *pwork);
> > > > > +int xfs_pwork_destroy(struct xfs_pwork_ctl *pctl);
> > > > > +unsigned int xfs_pwork_guess_datadev_parallelism(struct xfs_mount *mp);
> > > > > +
> > > > > +#endif /* __XFS_PWORK_H__ */
> > > > > diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
> > > > > index a5b2260406a8..e4f3785f7a64 100644
> > > > > --- a/fs/xfs/xfs_qm.c
> > > > > +++ b/fs/xfs/xfs_qm.c
> > > > > @@ -1305,7 +1305,7 @@ xfs_qm_quotacheck(
> > > > >  		flags |= XFS_PQUOTA_CHKD;
> > > > >  	}
> > > > >  
> > > > > -	error = xfs_iwalk(mp, NULL, 0, xfs_qm_dqusage_adjust, 0, NULL);
> > > > > +	error = xfs_iwalk_threaded(mp, 0, xfs_qm_dqusage_adjust, 0, NULL);
> > > > >  	if (error)
> > > > >  		goto error_return;
> > > > >  
> > > > > diff --git a/fs/xfs/xfs_sysctl.h b/fs/xfs/xfs_sysctl.h
> > > > > index ad7f9be13087..b555e045e2f4 100644
> > > > > --- a/fs/xfs/xfs_sysctl.h
> > > > > +++ b/fs/xfs/xfs_sysctl.h
> > > > > @@ -37,6 +37,9 @@ typedef struct xfs_param {
> > > > >  	xfs_sysctl_val_t fstrm_timer;	/* Filestream dir-AG assoc'n timeout. */
> > > > >  	xfs_sysctl_val_t eofb_timer;	/* Interval between eofb scan wakeups */
> > > > >  	xfs_sysctl_val_t cowb_timer;	/* Interval between cowb scan wakeups */
> > > > > +#ifdef DEBUG
> > > > > +	xfs_sysctl_val_t pwork_threads;	/* Parallel workqueue thread count */
> > > > > +#endif
> > > > >  } xfs_param_t;
> > > > >  
> > > > >  /*
> > > > > @@ -82,6 +85,9 @@ enum {
> > > > >  extern xfs_param_t	xfs_params;
> > > > >  
> > > > >  struct xfs_globals {
> > > > > +#ifdef DEBUG
> > > > > +	int	pwork_threads;		/* parallel workqueue threads */
> > > > > +#endif
> > > > >  	int	log_recovery_delay;	/* log recovery delay (secs) */
> > > > >  	int	mount_delay;		/* mount setup delay (secs) */
> > > > >  	bool	bug_on_assert;		/* BUG() the kernel on assert failure */
> > > > > diff --git a/fs/xfs/xfs_sysfs.c b/fs/xfs/xfs_sysfs.c
> > > > > index cabda13f3c64..910e6b9cb1a7 100644
> > > > > --- a/fs/xfs/xfs_sysfs.c
> > > > > +++ b/fs/xfs/xfs_sysfs.c
> > > > > @@ -206,11 +206,51 @@ always_cow_show(
> > > > >  }
> > > > >  XFS_SYSFS_ATTR_RW(always_cow);
> > > > >  
> > > > > +#ifdef DEBUG
> > > > > +/*
> > > > > + * Override how many threads the parallel work queue is allowed to create.
> > > > > + * This has to be a debug-only global (instead of an errortag) because one of
> > > > > + * the main users of parallel workqueues is mount time quotacheck.
> > > > > + */
> > > > > +STATIC ssize_t
> > > > > +pwork_threads_store(
> > > > > +	struct kobject	*kobject,
> > > > > +	const char	*buf,
> > > > > +	size_t		count)
> > > > > +{
> > > > > +	int		ret;
> > > > > +	int		val;
> > > > > +
> > > > > +	ret = kstrtoint(buf, 0, &val);
> > > > > +	if (ret)
> > > > > +		return ret;
> > > > > +
> > > > > +	if (val < 0 || val > NR_CPUS)
> > > > > +		return -EINVAL;
> > > > > +
> > > > > +	xfs_globals.pwork_threads = val;
> > > > > +
> > > > > +	return count;
> > > > > +}
> > > > > +
> > > > > +STATIC ssize_t
> > > > > +pwork_threads_show(
> > > > > +	struct kobject	*kobject,
> > > > > +	char		*buf)
> > > > > +{
> > > > > +	return snprintf(buf, PAGE_SIZE, "%d\n", xfs_globals.pwork_threads);
> > > > > +}
> > > > > +XFS_SYSFS_ATTR_RW(pwork_threads);
> > > > > +#endif /* DEBUG */
> > > > > +
> > > > >  static struct attribute *xfs_dbg_attrs[] = {
> > > > >  	ATTR_LIST(bug_on_assert),
> > > > >  	ATTR_LIST(log_recovery_delay),
> > > > >  	ATTR_LIST(mount_delay),
> > > > >  	ATTR_LIST(always_cow),
> > > > > +#ifdef DEBUG
> > > > > +	ATTR_LIST(pwork_threads),
> > > > > +#endif
> > > > >  	NULL,
> > > > >  };
> > > > >  
> > > > > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > > > > index f9bb1d50bc0e..658cbade1998 100644
> > > > > --- a/fs/xfs/xfs_trace.h
> > > > > +++ b/fs/xfs/xfs_trace.h
> > > > > @@ -3556,6 +3556,24 @@ TRACE_EVENT(xfs_iwalk_ag_rec,
> > > > >  		  __entry->startino, __entry->freemask)
> > > > >  )
> > > > >  
> > > > > +TRACE_EVENT(xfs_pwork_init,
> > > > > +	TP_PROTO(struct xfs_mount *mp, unsigned int nr_threads, pid_t pid),
> > > > > +	TP_ARGS(mp, nr_threads, pid),
> > > > > +	TP_STRUCT__entry(
> > > > > +		__field(dev_t, dev)
> > > > > +		__field(unsigned int, nr_threads)
> > > > > +		__field(pid_t, pid)
> > > > > +	),
> > > > > +	TP_fast_assign(
> > > > > +		__entry->dev = mp->m_super->s_dev;
> > > > > +		__entry->nr_threads = nr_threads;
> > > > > +		__entry->pid = pid;
> > > > > +	),
> > > > > +	TP_printk("dev %d:%d nr_threads %u pid %u",
> > > > > +		  MAJOR(__entry->dev), MINOR(__entry->dev),
> > > > > +		  __entry->nr_threads, __entry->pid)
> > > > > +)
> > > > > +
> > > > >  #endif /* _TRACE_XFS_H */
> > > > >  
> > > > >  #undef TRACE_INCLUDE_PATH
> > > > > 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 09/10] xfs: poll waiting for quotacheck
  2019-06-11 16:06     ` Darrick J. Wong
@ 2019-06-11 17:01       ` Brian Foster
  0 siblings, 0 replies; 46+ messages in thread
From: Brian Foster @ 2019-06-11 17:01 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Jun 11, 2019 at 09:06:27AM -0700, Darrick J. Wong wrote:
> On Tue, Jun 11, 2019 at 11:07:12AM -0400, Brian Foster wrote:
> > On Tue, Jun 04, 2019 at 02:50:27PM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > Create a pwork destroy function that uses polling instead of
> > > uninterruptible sleep to wait for work items to finish so that we can
> > > touch the softlockup watchdog.  IOWs, gross hack.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > ---
> > 
> > Seems reasonable given the quirky circumstances of quotacheck. Just a
> > couple nits..
> > 
> > >  fs/xfs/xfs_iwalk.c |    3 +++
> > >  fs/xfs/xfs_iwalk.h |    3 ++-
> > >  fs/xfs/xfs_pwork.c |   21 +++++++++++++++++++++
> > >  fs/xfs/xfs_pwork.h |    2 ++
> > >  fs/xfs/xfs_qm.c    |    2 +-
> > >  5 files changed, 29 insertions(+), 2 deletions(-)
> > > 
> > > 
> > > diff --git a/fs/xfs/xfs_iwalk.c b/fs/xfs/xfs_iwalk.c
> > > index 71ee1628aa70..c4a9c4c246b7 100644
> > > --- a/fs/xfs/xfs_iwalk.c
> > > +++ b/fs/xfs/xfs_iwalk.c
> > > @@ -526,6 +526,7 @@ xfs_iwalk_threaded(
> > >  	xfs_ino_t		startino,
> > >  	xfs_iwalk_fn		iwalk_fn,
> > >  	unsigned int		max_prefetch,
> > > +	bool			polled,
> > >  	void			*data)
> > >  {
> > >  	struct xfs_pwork_ctl	pctl;
> > > @@ -556,5 +557,7 @@ xfs_iwalk_threaded(
> > >  		startino = XFS_AGINO_TO_INO(mp, agno + 1, 0);
> > >  	}
> > >  
> > > +	if (polled)
> > > +		return xfs_pwork_destroy_poll(&pctl);
> > >  	return xfs_pwork_destroy(&pctl);
> > 
> > Rather than have duplicate destruction paths, could we rework
> > xfs_pwork_destroy_poll() to something like xfs_pwork_poll() that just
> > does the polling and returns, then the caller can fall back into the
> > current xfs_pwork_destroy()? I.e., this ends up looking like:
> > 
> > 	...
> > 	/* poll to keep soft lockup watchdog quiet */
> > 	if (polled)
> > 		xfs_pwork_poll(&pctl);
> > 	return xfs_pwork_destroy(&pctl);
> 
> Sounds good, will do!
> 
> > >  }
> > ...
> > > diff --git a/fs/xfs/xfs_pwork.c b/fs/xfs/xfs_pwork.c
> > > index 19605a3a2482..3b885e0b52ac 100644
> > > --- a/fs/xfs/xfs_pwork.c
> > > +++ b/fs/xfs/xfs_pwork.c
> > ...
> > > @@ -97,6 +101,23 @@ xfs_pwork_destroy(
> > >  	return pctl->error;
> > >  }
> > >  
> > > +/*
> > > + * Wait for the work to finish and tear down the control structure.
> > > + * Continually poll completion status and touch the soft lockup watchdog.
> > > + * This is for things like mount that hold locks.
> > > + */
> > > +int
> > > +xfs_pwork_destroy_poll(
> > > +	struct xfs_pwork_ctl	*pctl)
> > > +{
> > > +	while (atomic_read(&pctl->nr_work) > 0) {
> > > +		msleep(1);
> > > +		touch_softlockup_watchdog();
> > > +	}
> > 
> > Any idea what the typical watchdog timeout is?
> 
> Usually 30s for the hangcheck timeout.
> 
> > I'm curious where the 1ms
> > comes from and whether we could get away with anything larger. I realize
> > that might introduce mount latency with the current sleep based
> > implementation, but we could also consider a waitqueue and using
> > something like wait_event_timeout() to schedule out for longer time
> > periods and still wake up immediately when the count drops to 0.
> 
> That's a much better approach than this naïve one which waits
> unnecessarily after the pwork finishes; I'll replace it with this.
> The kernel doesn't export the hang check timeout variable, so I think
> I'll just set it to 1s, which should be infrequent enough not to use a
> lot of CPU and frequent enough that we don't spew warnings everywhere.
> 

Ack, that sounds reasonable to me if the timeout itself is 30s or so.

Brian

> --D
> 
> > Brian
> > 
> > > +
> > > +	return xfs_pwork_destroy(pctl);
> > > +}
> > > +
> > >  /*
> > >   * Return the amount of parallelism that the data device can handle, or 0 for
> > >   * no limit.
> > > diff --git a/fs/xfs/xfs_pwork.h b/fs/xfs/xfs_pwork.h
> > > index e0c1354a2d8c..08da723a8dc9 100644
> > > --- a/fs/xfs/xfs_pwork.h
> > > +++ b/fs/xfs/xfs_pwork.h
> > > @@ -18,6 +18,7 @@ struct xfs_pwork_ctl {
> > >  	struct workqueue_struct	*wq;
> > >  	struct xfs_mount	*mp;
> > >  	xfs_pwork_work_fn	work_fn;
> > > +	atomic_t		nr_work;
> > >  	int			error;
> > >  };
> > >  
> > > @@ -45,6 +46,7 @@ int xfs_pwork_init(struct xfs_mount *mp, struct xfs_pwork_ctl *pctl,
> > >  		unsigned int nr_threads);
> > >  void xfs_pwork_queue(struct xfs_pwork_ctl *pctl, struct xfs_pwork *pwork);
> > >  int xfs_pwork_destroy(struct xfs_pwork_ctl *pctl);
> > > +int xfs_pwork_destroy_poll(struct xfs_pwork_ctl *pctl);
> > >  unsigned int xfs_pwork_guess_datadev_parallelism(struct xfs_mount *mp);
> > >  
> > >  #endif /* __XFS_PWORK_H__ */
> > > diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
> > > index e4f3785f7a64..de6a623ada02 100644
> > > --- a/fs/xfs/xfs_qm.c
> > > +++ b/fs/xfs/xfs_qm.c
> > > @@ -1305,7 +1305,7 @@ xfs_qm_quotacheck(
> > >  		flags |= XFS_PQUOTA_CHKD;
> > >  	}
> > >  
> > > -	error = xfs_iwalk_threaded(mp, 0, xfs_qm_dqusage_adjust, 0, NULL);
> > > +	error = xfs_iwalk_threaded(mp, 0, xfs_qm_dqusage_adjust, 0, true, NULL);
> > >  	if (error)
> > >  		goto error_return;
> > >  
> > > 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 10/10] xfs: refactor INUMBERS to use iwalk functions
  2019-06-11 16:21     ` Darrick J. Wong
@ 2019-06-11 17:01       ` Brian Foster
  0 siblings, 0 replies; 46+ messages in thread
From: Brian Foster @ 2019-06-11 17:01 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Jun 11, 2019 at 09:21:28AM -0700, Darrick J. Wong wrote:
> On Tue, Jun 11, 2019 at 11:08:51AM -0400, Brian Foster wrote:
> > On Tue, Jun 04, 2019 at 02:50:33PM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > Now that we have generic functions to walk inode records, refactor the
> > > INUMBERS implementation to use it.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > ---
> > >  fs/xfs/xfs_ioctl.c   |   20 ++++--
> > >  fs/xfs/xfs_ioctl.h   |    2 +
> > >  fs/xfs/xfs_ioctl32.c |   35 ++++------
> > >  fs/xfs/xfs_itable.c  |  168 ++++++++++++++++++++------------------------------
> > >  fs/xfs/xfs_itable.h  |   22 +------
> > >  fs/xfs/xfs_iwalk.c   |  161 ++++++++++++++++++++++++++++++++++++++++++++++--
> > >  fs/xfs/xfs_iwalk.h   |   12 ++++
> > 
> > It looks like there's a decent amount of xfs_iwalk code changes in this
> > patch. That should probably be a separate patch, at minimum to have a
> > separate changelog to document the changes/updates required for
> > inumbers.
> 
> <nod> I'll break out the iwalk changes into a separate patch so that
> this one only has the changes needed to wire up the ioctl.
> 
> > >  7 files changed, 262 insertions(+), 158 deletions(-)
> > > 
> > > 
> > ...
> > > diff --git a/fs/xfs/xfs_itable.c b/fs/xfs/xfs_itable.c
> > > index 06abe5c9c0ee..bade54d6ac64 100644
> > > --- a/fs/xfs/xfs_itable.c
> > > +++ b/fs/xfs/xfs_itable.c
> > > @@ -259,121 +259,85 @@ xfs_bulkstat(
> > >  	return error;
> > >  }
> > >  
> > > -int
> > > -xfs_inumbers_fmt(
> > > -	void			__user *ubuffer, /* buffer to write to */
> > > -	const struct xfs_inogrp	*buffer,	/* buffer to read from */
> > > -	long			count,		/* # of elements to read */
> > > -	long			*written)	/* # of bytes written */
> > > +struct xfs_inumbers_chunk {
> > > +	inumbers_fmt_pf		formatter;
> > > +	struct xfs_ibulk	*breq;
> > > +};
> > > +
> > > +/*
> > > + * INUMBERS
> > > + * ========
> > > + * This is how we export inode btree records to userspace, so that XFS tools
> > > + * can figure out where inodes are allocated.
> > > + */
> > > +
> > > +/*
> > > + * Format the inode group structure and report it somewhere.
> > > + *
> > > + * Similar to xfs_bulkstat_one_int, lastino is the inode cursor as we walk
> > > + * through the filesystem so we move it forward unless there was a runtime
> > > + * error.  If the formatter tells us the buffer is now full we also move the
> > > + * cursor forward and abort the walk.
> > > + */
> > > +STATIC int
> > > +xfs_inumbers_walk(
> > > +	struct xfs_mount	*mp,
> > > +	struct xfs_trans	*tp,
> > > +	xfs_agnumber_t		agno,
> > > +	const struct xfs_inobt_rec_incore *irec,
> > > +	void			*data)
> > >  {
> > > -	if (copy_to_user(ubuffer, buffer, count * sizeof(*buffer)))
> > > -		return -EFAULT;
> > > -	*written = count * sizeof(*buffer);
> > > -	return 0;
> > > +	struct xfs_inogrp	inogrp = {
> > > +		.xi_startino	= XFS_AGINO_TO_INO(mp, agno, irec->ir_startino),
> > > +		.xi_alloccount	= irec->ir_count - irec->ir_freecount,
> > > +		.xi_allocmask	= ~irec->ir_free,
> > > +	};
> > 
> > Not related to this patch, but I'm wondering if we should be using
> > xfs_inobt_irec_to_allocmask() here to mask off holes from the resulting
> > allocation bitmap. Eh, I guess it's misleading either way..
> 
> Holes were supposed to be marked in ir_free also, right?
> 

Hm, yeah I think so. So right now holes would be reported as free via
inumbers, which is probably what we want. Must have been thinking about
this backwards... disregard!

Brian

> So (assuming the irec isn't corrupt) we should be protected against
> reporting a hole as an "allocated" inode, right?
> 
> > > +	struct xfs_inumbers_chunk *ic = data;
> > > +	xfs_agino_t		agino;
> > > +	int			error;
> > > +
> > > +	error = ic->formatter(ic->breq, &inogrp);
> > > +	if (error && error != XFS_IBULK_BUFFER_FULL)
> > > +		return error;
> > > +	if (error == XFS_IBULK_BUFFER_FULL)
> > > +		error = XFS_INOBT_WALK_ABORT;
> > > +
> > > +	agino = irec->ir_startino + XFS_INODES_PER_CHUNK;
> > > +	ic->breq->startino = XFS_AGINO_TO_INO(mp, agno, agino);
> > > +	return error;
> > >  }
> > >  
> > ...
> > > diff --git a/fs/xfs/xfs_iwalk.c b/fs/xfs/xfs_iwalk.c
> > > index c4a9c4c246b7..3a35d1cf7e14 100644
> > > --- a/fs/xfs/xfs_iwalk.c
> > > +++ b/fs/xfs/xfs_iwalk.c
> > ...
> > > @@ -286,7 +298,7 @@ xfs_iwalk_ag_start(
> > >  	 * have to deal with tearing down the cursor to walk the records.
> > >  	 */
> > >  	error = xfs_iwalk_grab_ichunk(*curpp, agino, &icount,
> > > -			&iwag->recs[iwag->nr_recs]);
> > > +			&iwag->recs[iwag->nr_recs], trim);
> > 
> > Hmm, it looks like we could lift the lookup from xfs_iwalk_grab_ichunk()
> > up into xfs_iwalk_ag_start() and avoid needing to pass trim down
> > multiple levels. In fact, if we're not trimming the record we don't need
> > to grab the record at all in this path. We could do the lookup (setting
> > has_more) then bounce right up to the core walker algorithm, right? If
> > so, that seems a bit cleaner in terms of only using special cased code
> > when it's actually necessary.
> 
> Right.
> 
> > 
> > >  	if (error)
> > >  		return error;
> > >  	if (icount)
> > ...
> > > @@ -561,3 +574,135 @@ xfs_iwalk_threaded(
> > ...
> > > +/*
> > > + * Walk all inode btree records in a single AG, from @iwag->startino to the end
> > > + * of the AG.
> > > + */
> > > +STATIC int
> > > +xfs_inobt_walk_ag(
> > > +	struct xfs_iwalk_ag		*iwag)
> > > +{
> > > +	struct xfs_mount		*mp = iwag->mp;
> > > +	struct xfs_trans		*tp = iwag->tp;
> > > +	struct xfs_buf			*agi_bp = NULL;
> > > +	struct xfs_btree_cur		*cur = NULL;
> > > +	xfs_agnumber_t			agno;
> > > +	xfs_agino_t			agino;
> > > +	int				has_more;
> > > +	int				error = 0;
> > > +
> > > +	/* Set up our cursor at the right place in the inode btree. */
> > > +	agno = XFS_INO_TO_AGNO(mp, iwag->startino);
> > > +	agino = XFS_INO_TO_AGINO(mp, iwag->startino);
> > > +	error = xfs_iwalk_ag_start(iwag, agno, agino, &cur, &agi_bp, &has_more,
> > > +			false);
> > > +
> > > +	while (!error && has_more && !xfs_pwork_want_abort(&iwag->pwork)) {
> > > +		struct xfs_inobt_rec_incore	*irec;
> > > +
> > > +		cond_resched();
> > > +
> > > +		/* Fetch the inobt record. */
> > > +		irec = &iwag->recs[iwag->nr_recs];
> > > +		error = xfs_inobt_get_rec(cur, irec, &has_more);
> > > +		if (error || !has_more)
> > > +			break;
> > > +
> > > +		/*
> > > +		 * If there's space in the buffer for more records, increment
> > > +		 * the btree cursor and grab more.
> > > +		 */
> > > +		if (++iwag->nr_recs < iwag->sz_recs) {
> > > +			error = xfs_btree_increment(cur, 0, &has_more);
> > > +			if (error || !has_more)
> > > +				break;
> > > +			continue;
> > > +		}
> > > +
> > > +		/*
> > > +		 * Otherwise, we need to save cursor state and run the callback
> > > +		 * function on the cached records.  The run_callbacks function
> > > +		 * is supposed to return a cursor pointing to the record where
> > > +		 * we would be if we had been able to increment like above.
> > > +		 */
> > > +		error = xfs_iwalk_run_callbacks(iwag, xfs_inobt_walk_ag_recs,
> > > +				agno, &cur, &agi_bp, &has_more);
> > > +	}
> > > +
> > > +	xfs_iwalk_del_inobt(tp, &cur, &agi_bp, error);
> > > +
> > > +	/* Walk any records left behind in the cache. */
> > > +	if (iwag->nr_recs == 0 || error || xfs_pwork_want_abort(&iwag->pwork))
> > > +		return error;
> > > +
> > > +	return xfs_inobt_walk_ag_recs(iwag);
> > > +}
> > 
> > Similar comments apply here as for the previous xfs_iwalk_ag() patch.
> > Though looking at it, the only differences here are the lack of free
> > inode check, readahead and the callback function (particularly once you
> > consider the partial completion refactoring we discussed earlier). I
> > think this could all be generalized with a single flag such that there's
> > no need for separate xfs_[inobt|iwalk]_ag() functions.
> 
> Yep.  Already refactoring that. :)
> 
> > Hmmm.. perhaps we could use a flag or separate function pointers in
> > struct xfs_iwalk_ag to accomplish the same thing all the way up through
> > the record walker functions. IOW, xfs_iwalk_ag_recs() looks like it
> > could call either callback based on which is defined in the
> > xfs_iwalk_ag.
> 
> <nod>
> 
> > This could all be done as a separate patch of course, if that's easier.
> 
> I might just go back and remove the function pointer from run_callbacks
> in the earlier patches...
> 
> > 
> > > +
> > > +/*
> > > + * Walk all inode btree records in the filesystem starting from @startino.  The
> > > + * @inobt_walk_fn will be called for each btree record, being passed the incore
> > > + * record and @data.  @max_prefetch controls how many inobt records we try to
> > > + * cache ahead of time.
> > > + */
> > > +int
> > > +xfs_inobt_walk(
> > > +	struct xfs_mount	*mp,
> > > +	struct xfs_trans	*tp,
> > > +	xfs_ino_t		startino,
> > > +	xfs_inobt_walk_fn	inobt_walk_fn,
> > > +	unsigned int		max_prefetch,
> > > +	void			*data)
> > > +{
> > > +	struct xfs_iwalk_ag	iwag = {
> > > +		.mp		= mp,
> > > +		.tp		= tp,
> > > +		.inobt_walk_fn	= inobt_walk_fn,
> > > +		.data		= data,
> > > +		.startino	= startino,
> > > +		.pwork		= XFS_PWORK_SINGLE_THREADED,
> > > +	};
> > > +	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, startino);
> > > +	int			error;
> > > +
> > > +	ASSERT(agno < mp->m_sb.sb_agcount);
> > > +
> > > +	xfs_iwalk_set_prefetch(&iwag, max_prefetch * XFS_INODES_PER_CHUNK);
> > 
> > A brief comment above this line would be helpful. Something like:
> > 
> > 	/* translate inumbers record count to inode count */
> 
> Done.  Thanks for the review!
> 
> --D
> 
> > Brian
> > 
> > > +	error = xfs_iwalk_alloc(&iwag);
> > > +	if (error)
> > > +		return error;
> > > +
> > > +	for (; agno < mp->m_sb.sb_agcount; agno++) {
> > > +		error = xfs_inobt_walk_ag(&iwag);
> > > +		if (error)
> > > +			break;
> > > +		iwag.startino = XFS_AGINO_TO_INO(mp, agno + 1, 0);
> > > +	}
> > > +
> > > +	xfs_iwalk_free(&iwag);
> > > +	return error;
> > > +}
> > > diff --git a/fs/xfs/xfs_iwalk.h b/fs/xfs/xfs_iwalk.h
> > > index 76d8f87a39ef..20bee93d4676 100644
> > > --- a/fs/xfs/xfs_iwalk.h
> > > +++ b/fs/xfs/xfs_iwalk.h
> > > @@ -18,4 +18,16 @@ int xfs_iwalk_threaded(struct xfs_mount *mp, xfs_ino_t startino,
> > >  		xfs_iwalk_fn iwalk_fn, unsigned int max_prefetch, bool poll,
> > >  		void *data);
> > >  
> > > +/* Walk all inode btree records in the filesystem starting from @startino. */
> > > +typedef int (*xfs_inobt_walk_fn)(struct xfs_mount *mp, struct xfs_trans *tp,
> > > +				 xfs_agnumber_t agno,
> > > +				 const struct xfs_inobt_rec_incore *irec,
> > > +				 void *data);
> > > +/* Return value (for xfs_inobt_walk_fn) that aborts the walk immediately. */
> > > +#define XFS_INOBT_WALK_ABORT	(XFS_IWALK_ABORT)
> > > +
> > > +int xfs_inobt_walk(struct xfs_mount *mp, struct xfs_trans *tp,
> > > +		xfs_ino_t startino, xfs_inobt_walk_fn inobt_walk_fn,
> > > +		unsigned int max_prefetch, void *data);
> > > +
> > >  #endif /* __XFS_IWALK_H__ */
> > > 

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 01/10] xfs: create simplified inode walk function
  2019-06-10 23:11         ` Darrick J. Wong
@ 2019-06-11 22:33           ` Dave Chinner
  2019-06-11 23:05             ` Darrick J. Wong
  0 siblings, 1 reply; 46+ messages in thread
From: Dave Chinner @ 2019-06-11 22:33 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Brian Foster, linux-xfs

On Mon, Jun 10, 2019 at 04:11:34PM -0700, Darrick J. Wong wrote:
> On Mon, Jun 10, 2019 at 01:55:10PM -0400, Brian Foster wrote:
> > > I could extend the comment to explain why we don't use PAGE_SIZE...
> > > 
> > 
> > Sounds good, though what I think would be better is to define a
> > IWALK_DEFAULT_RECS or some such somewhere and put the calculation
> > details with that.
> > 
> > Though now that you point out the readahead thing, aren't we at risk of
> > a similar problem for users who happen to pass a really large userspace
> > buffer? Should we cap the kernel allocation/readahead window in all
> > cases and not just the default case?
> 
> Hmm, that's right, we don't want to let userspace arbitrarily determine
> the size of the buffer, and I think the current implementation caps it
> the readahaead at ... oh, PAGE_SIZE / sizeof(xfs_inogrp_t).
> 
> Oh, right, and in the V1 patchset Dave said that we should constrain
> readahead even further.

Right, I should explain a bit further why, too - it's about
performance.  I've found that a user buffer size of ~1024 inodes is
generally enough to max out performance of bulkstat. i.e. somewhere
around 1000 inodes per syscall is enough to mostly amortise all of
the cost of syscall, setup, readahead, etc vs the CPU overhead of
copying all the inodes into the user buffer.

Once the user buffer goes over a few thousand inodes, performance
then starts to tail back off - we don't get any gains from trying to
bulkstat tens of thousands of inodes at a time, especially under
memory pressure because that can push us into readahead and buffer
cache thrashing.

> > > /*
> > >  * Note: We hardcode 4096 here (instead of, say, PAGE_SIZE) because we want to
> > >  * constrain the amount of inode readahead to 16k inodes regardless of CPU:
> > >  *
> > >  * 4096 bytes / 16 bytes per inobt record = 256 inobt records
> > >  * 256 inobt records * 64 inodes per record = 16384 inodes
> > >  * 16384 inodes * 512 bytes per inode(?) = 8MB of inode readahead
> > >  */

Hence I suspect that even this is overkill - it makes no sense to
have a huge readahead window when there has been no measurable
performance benefit to doing large inode count bulkstat syscalls.

And, FWIW, readahead probably should also be capped at what the user
buffer can hold - no point in reading 16k inodes when the output
buffer can only fit 1000 inodes...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 01/10] xfs: create simplified inode walk function
  2019-06-11 22:33           ` Dave Chinner
@ 2019-06-11 23:05             ` Darrick J. Wong
  2019-06-12 12:13               ` Brian Foster
  2019-06-12 17:54               ` Darrick J. Wong
  0 siblings, 2 replies; 46+ messages in thread
From: Darrick J. Wong @ 2019-06-11 23:05 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Brian Foster, linux-xfs

On Wed, Jun 12, 2019 at 08:33:41AM +1000, Dave Chinner wrote:
> On Mon, Jun 10, 2019 at 04:11:34PM -0700, Darrick J. Wong wrote:
> > On Mon, Jun 10, 2019 at 01:55:10PM -0400, Brian Foster wrote:
> > > > I could extend the comment to explain why we don't use PAGE_SIZE...
> > > > 
> > > 
> > > Sounds good, though what I think would be better is to define a
> > > IWALK_DEFAULT_RECS or some such somewhere and put the calculation
> > > details with that.
> > > 
> > > Though now that you point out the readahead thing, aren't we at risk of
> > > a similar problem for users who happen to pass a really large userspace
> > > buffer? Should we cap the kernel allocation/readahead window in all
> > > cases and not just the default case?
> > 
> > Hmm, that's right, we don't want to let userspace arbitrarily determine
> > the size of the buffer, and I think the current implementation caps it
> > the readahaead at ... oh, PAGE_SIZE / sizeof(xfs_inogrp_t).
> > 
> > Oh, right, and in the V1 patchset Dave said that we should constrain
> > readahead even further.
> 
> Right, I should explain a bit further why, too - it's about
> performance.  I've found that a user buffer size of ~1024 inodes is
> generally enough to max out performance of bulkstat. i.e. somewhere
> around 1000 inodes per syscall is enough to mostly amortise all of
> the cost of syscall, setup, readahead, etc vs the CPU overhead of
> copying all the inodes into the user buffer.
> 
> Once the user buffer goes over a few thousand inodes, performance
> then starts to tail back off - we don't get any gains from trying to
> bulkstat tens of thousands of inodes at a time, especially under
> memory pressure because that can push us into readahead and buffer
> cache thrashing.

<nod> I don't mind setting the max inobt record cache buffer size to a
smaller value (1024 bytes == 4096 inodes readahead?) so we can get a
little farther into future hardware scalability (or decreases in syscall
performance :P).

I guess the question here is how to relate the number of inodes the user
asked for to how many inobt records we have to read to find that many
allocated inodes?  Or in other words, what's the average ir_freecount
across all the inobt records?

Note that this is technically a decrease since the old code would
reserve 16K for this purpose...

> > > > /*
> > > >  * Note: We hardcode 4096 here (instead of, say, PAGE_SIZE) because we want to
> > > >  * constrain the amount of inode readahead to 16k inodes regardless of CPU:
> > > >  *
> > > >  * 4096 bytes / 16 bytes per inobt record = 256 inobt records
> > > >  * 256 inobt records * 64 inodes per record = 16384 inodes
> > > >  * 16384 inodes * 512 bytes per inode(?) = 8MB of inode readahead
> > > >  */
> 
> Hence I suspect that even this is overkill - it makes no sense to
> have a huge readahead window when there has been no measurable
> performance benefit to doing large inode count bulkstat syscalls.
> 
> And, FWIW, readahead probably should also be capped at what the user
> buffer can hold - no point in reading 16k inodes when the output
> buffer can only fit 1000 inodes...

It already is -- the icount parameter from userspace is (eventually) fed
to xfs_iwalk-set_prefetch.

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 02/10] xfs: convert quotacheck to use the new iwalk functions
  2019-06-10 13:58   ` Brian Foster
  2019-06-10 17:10     ` Darrick J. Wong
@ 2019-06-11 23:23     ` Dave Chinner
  2019-06-12  0:32       ` Darrick J. Wong
  1 sibling, 1 reply; 46+ messages in thread
From: Dave Chinner @ 2019-06-11 23:23 UTC (permalink / raw)
  To: Brian Foster; +Cc: Darrick J. Wong, linux-xfs, Dave Chinner

On Mon, Jun 10, 2019 at 09:58:52AM -0400, Brian Foster wrote:
> On Tue, Jun 04, 2019 at 02:49:40PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Convert quotacheck to use the new iwalk iterator to dig through the
> > inodes.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > Reviewed-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_qm.c |   62 ++++++++++++++++++-------------------------------------
> >  1 file changed, 20 insertions(+), 42 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
> > index aa6b6db3db0e..a5b2260406a8 100644
> > --- a/fs/xfs/xfs_qm.c
> > +++ b/fs/xfs/xfs_qm.c
> ...
> > @@ -1136,20 +1135,18 @@ xfs_qm_dqusage_adjust(
> >  	 * rootino must have its resources accounted for, not so with the quota
> >  	 * inodes.
> >  	 */
> > -	if (xfs_is_quota_inode(&mp->m_sb, ino)) {
> > -		*res = BULKSTAT_RV_NOTHING;
> > -		return -EINVAL;
> > -	}
> > +	if (xfs_is_quota_inode(&mp->m_sb, ino))
> > +		return 0;
> >  
> >  	/*
> >  	 * We don't _need_ to take the ilock EXCL here because quotacheck runs
> >  	 * at mount time and therefore nobody will be racing chown/chproj.
> >  	 */
> > -	error = xfs_iget(mp, NULL, ino, XFS_IGET_DONTCACHE, 0, &ip);
> > -	if (error) {
> > -		*res = BULKSTAT_RV_NOTHING;
> > +	error = xfs_iget(mp, tp, ino, XFS_IGET_DONTCACHE, 0, &ip);
> 
> I was wondering if we should start using IGET_UNTRUSTED here, but I
> guess we're 1.) protected by quotacheck context and 2.) have the same
> record validity semantics as the existing bulkstat walker. LGTM:

FWIW, I'd be wanting to go the other way with bulkstat. i.e. finding
ways of reducing IGET_UNTRUSTED in bulkstat because it adds
substantial CPU overhead during inode lookup because it has to look
up the inobt to validate the inode number. i.e. we are locking the
AGI and doing an inobt lookup on every inode we bulkstat because
there is some time between the initial inobt lookup and the
xfs_iget() call and that's when the inode chunk can get removed.

IOWs, we only need to validate that the inode buffer still contains
inodes before we start instantiating inodes from it, but because we
don't hold any locks across individual inode processing in bulkstat
we have to revalidate that buffer contains inodes for every
allocated inode in that buffer. If we had a way of passing a locked
cluster buffer into xfs_iget to avoid having to look it up and read
it, we could do a single inode cluster read after validating the
inobt record is still valid, we could cycle all the remaining inodes
through xfs_iget() without having to use IGET_UNTRUSTED to validate
the inode cluster still contains valid inodes on every inode....

We still need to cycle inodes through the cache (so bulkstat is
coherent with other inode operations), but this would substantially
reduce the per-inode bulkstat CPU overhead, I think....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 02/10] xfs: convert quotacheck to use the new iwalk functions
  2019-06-11 23:23     ` Dave Chinner
@ 2019-06-12  0:32       ` Darrick J. Wong
  2019-06-12 12:55         ` Brian Foster
  0 siblings, 1 reply; 46+ messages in thread
From: Darrick J. Wong @ 2019-06-12  0:32 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Brian Foster, linux-xfs, Dave Chinner

On Wed, Jun 12, 2019 at 09:23:47AM +1000, Dave Chinner wrote:
> On Mon, Jun 10, 2019 at 09:58:52AM -0400, Brian Foster wrote:
> > On Tue, Jun 04, 2019 at 02:49:40PM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > 
> > > Convert quotacheck to use the new iwalk iterator to dig through the
> > > inodes.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > Reviewed-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > >  fs/xfs/xfs_qm.c |   62 ++++++++++++++++++-------------------------------------
> > >  1 file changed, 20 insertions(+), 42 deletions(-)
> > > 
> > > 
> > > diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
> > > index aa6b6db3db0e..a5b2260406a8 100644
> > > --- a/fs/xfs/xfs_qm.c
> > > +++ b/fs/xfs/xfs_qm.c
> > ...
> > > @@ -1136,20 +1135,18 @@ xfs_qm_dqusage_adjust(
> > >  	 * rootino must have its resources accounted for, not so with the quota
> > >  	 * inodes.
> > >  	 */
> > > -	if (xfs_is_quota_inode(&mp->m_sb, ino)) {
> > > -		*res = BULKSTAT_RV_NOTHING;
> > > -		return -EINVAL;
> > > -	}
> > > +	if (xfs_is_quota_inode(&mp->m_sb, ino))
> > > +		return 0;
> > >  
> > >  	/*
> > >  	 * We don't _need_ to take the ilock EXCL here because quotacheck runs
> > >  	 * at mount time and therefore nobody will be racing chown/chproj.
> > >  	 */
> > > -	error = xfs_iget(mp, NULL, ino, XFS_IGET_DONTCACHE, 0, &ip);
> > > -	if (error) {
> > > -		*res = BULKSTAT_RV_NOTHING;
> > > +	error = xfs_iget(mp, tp, ino, XFS_IGET_DONTCACHE, 0, &ip);
> > 
> > I was wondering if we should start using IGET_UNTRUSTED here, but I
> > guess we're 1.) protected by quotacheck context and 2.) have the same
> > record validity semantics as the existing bulkstat walker. LGTM:
> 
> FWIW, I'd be wanting to go the other way with bulkstat. i.e. finding
> ways of reducing IGET_UNTRUSTED in bulkstat because it adds
> substantial CPU overhead during inode lookup because it has to look
> up the inobt to validate the inode number. i.e. we are locking the
> AGI and doing an inobt lookup on every inode we bulkstat because
> there is some time between the initial inobt lookup and the
> xfs_iget() call and that's when the inode chunk can get removed.
> 
> IOWs, we only need to validate that the inode buffer still contains
> inodes before we start instantiating inodes from it, but because we
> don't hold any locks across individual inode processing in bulkstat
> we have to revalidate that buffer contains inodes for every
> allocated inode in that buffer. If we had a way of passing a locked
> cluster buffer into xfs_iget to avoid having to look it up and read
> it, we could do a single inode cluster read after validating the
> inobt record is still valid, we could cycle all the remaining inodes
> through xfs_iget() without having to use IGET_UNTRUSTED to validate
> the inode cluster still contains valid inodes on every inode....
> 
> We still need to cycle inodes through the cache (so bulkstat is
> coherent with other inode operations), but this would substantially
> reduce the per-inode bulkstat CPU overhead, I think....

I'll think about this as an addendum to the series, because I suspect
that remodelling the existing users is going to be an entire series on
its own.  (IOWs, my brain is too tired for today)

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 01/10] xfs: create simplified inode walk function
  2019-06-11 23:05             ` Darrick J. Wong
@ 2019-06-12 12:13               ` Brian Foster
  2019-06-12 16:53                 ` Darrick J. Wong
  2019-06-12 17:54               ` Darrick J. Wong
  1 sibling, 1 reply; 46+ messages in thread
From: Brian Foster @ 2019-06-12 12:13 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs

On Tue, Jun 11, 2019 at 04:05:14PM -0700, Darrick J. Wong wrote:
> On Wed, Jun 12, 2019 at 08:33:41AM +1000, Dave Chinner wrote:
> > On Mon, Jun 10, 2019 at 04:11:34PM -0700, Darrick J. Wong wrote:
> > > On Mon, Jun 10, 2019 at 01:55:10PM -0400, Brian Foster wrote:
> > > > > I could extend the comment to explain why we don't use PAGE_SIZE...
> > > > > 
> > > > 
> > > > Sounds good, though what I think would be better is to define a
> > > > IWALK_DEFAULT_RECS or some such somewhere and put the calculation
> > > > details with that.
> > > > 
> > > > Though now that you point out the readahead thing, aren't we at risk of
> > > > a similar problem for users who happen to pass a really large userspace
> > > > buffer? Should we cap the kernel allocation/readahead window in all
> > > > cases and not just the default case?
> > > 
> > > Hmm, that's right, we don't want to let userspace arbitrarily determine
> > > the size of the buffer, and I think the current implementation caps it
> > > the readahaead at ... oh, PAGE_SIZE / sizeof(xfs_inogrp_t).
> > > 
> > > Oh, right, and in the V1 patchset Dave said that we should constrain
> > > readahead even further.
> > 
> > Right, I should explain a bit further why, too - it's about
> > performance.  I've found that a user buffer size of ~1024 inodes is
> > generally enough to max out performance of bulkstat. i.e. somewhere
> > around 1000 inodes per syscall is enough to mostly amortise all of
> > the cost of syscall, setup, readahead, etc vs the CPU overhead of
> > copying all the inodes into the user buffer.
> > 
> > Once the user buffer goes over a few thousand inodes, performance
> > then starts to tail back off - we don't get any gains from trying to
> > bulkstat tens of thousands of inodes at a time, especially under
> > memory pressure because that can push us into readahead and buffer
> > cache thrashing.
> 
> <nod> I don't mind setting the max inobt record cache buffer size to a
> smaller value (1024 bytes == 4096 inodes readahead?) so we can get a
> little farther into future hardware scalability (or decreases in syscall
> performance :P).
> 

The 1k baseline presumably applies to the current code. Taking a closer
look at the current code, we unconditionally allocate a 4 page record
buffer and start to fill it. For every record we grab, we issue
readahead on the underlying clusters.

Hmm, that seems like generally what this patchset is doing aside from
the more variable record buffer allocation. I'm fine with changing
things like record buffer allocation, readahead semantics, etc. given
Dave's practical analysis above, but TBH I don't think that should all
be part of the same patch. IMO, this rework patch should maintain as
close as possible to current behavior and a subsequent patches in the
series can tweak record buffer size and whatnot to improve readahead
logic. That makes this all easier to review, discuss and maintain in the
event of regression.

> I guess the question here is how to relate the number of inodes the user
> asked for to how many inobt records we have to read to find that many
> allocated inodes?  Or in other words, what's the average ir_freecount
> across all the inobt records?
> 

The current code basically looks like it allocates an oversized buffer
and hopes for the best with regard to readahead. If we took a similar
approach in terms of overestimating the buffer size (assuming not all
inode records are fully allocated), I suppose we could also track the
number of cluster readaheads issued and govern the collect/drain
sequences of the record buffer based on that..? But again, I think we
should have this as a separate "xfs: make iwalk readahead smarter ..."
patch that documents Dave's analysis above, perhaps includes some
numbers, etc..

> Note that this is technically a decrease since the old code would
> reserve 16K for this purpose...
> 

Indeed.

Brian

> > > > > /*
> > > > >  * Note: We hardcode 4096 here (instead of, say, PAGE_SIZE) because we want to
> > > > >  * constrain the amount of inode readahead to 16k inodes regardless of CPU:
> > > > >  *
> > > > >  * 4096 bytes / 16 bytes per inobt record = 256 inobt records
> > > > >  * 256 inobt records * 64 inodes per record = 16384 inodes
> > > > >  * 16384 inodes * 512 bytes per inode(?) = 8MB of inode readahead
> > > > >  */
> > 
> > Hence I suspect that even this is overkill - it makes no sense to
> > have a huge readahead window when there has been no measurable
> > performance benefit to doing large inode count bulkstat syscalls.
> > 
> > And, FWIW, readahead probably should also be capped at what the user
> > buffer can hold - no point in reading 16k inodes when the output
> > buffer can only fit 1000 inodes...
> 
> It already is -- the icount parameter from userspace is (eventually) fed
> to xfs_iwalk-set_prefetch.
> 
> --D
> 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 02/10] xfs: convert quotacheck to use the new iwalk functions
  2019-06-12  0:32       ` Darrick J. Wong
@ 2019-06-12 12:55         ` Brian Foster
  2019-06-12 23:33           ` Dave Chinner
  0 siblings, 1 reply; 46+ messages in thread
From: Brian Foster @ 2019-06-12 12:55 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs, Dave Chinner

On Tue, Jun 11, 2019 at 05:32:19PM -0700, Darrick J. Wong wrote:
> On Wed, Jun 12, 2019 at 09:23:47AM +1000, Dave Chinner wrote:
> > On Mon, Jun 10, 2019 at 09:58:52AM -0400, Brian Foster wrote:
> > > On Tue, Jun 04, 2019 at 02:49:40PM -0700, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > 
> > > > Convert quotacheck to use the new iwalk iterator to dig through the
> > > > inodes.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > Reviewed-by: Dave Chinner <dchinner@redhat.com>
> > > > ---
> > > >  fs/xfs/xfs_qm.c |   62 ++++++++++++++++++-------------------------------------
> > > >  1 file changed, 20 insertions(+), 42 deletions(-)
> > > > 
> > > > 
> > > > diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
> > > > index aa6b6db3db0e..a5b2260406a8 100644
> > > > --- a/fs/xfs/xfs_qm.c
> > > > +++ b/fs/xfs/xfs_qm.c
> > > ...
> > > > @@ -1136,20 +1135,18 @@ xfs_qm_dqusage_adjust(
> > > >  	 * rootino must have its resources accounted for, not so with the quota
> > > >  	 * inodes.
> > > >  	 */
> > > > -	if (xfs_is_quota_inode(&mp->m_sb, ino)) {
> > > > -		*res = BULKSTAT_RV_NOTHING;
> > > > -		return -EINVAL;
> > > > -	}
> > > > +	if (xfs_is_quota_inode(&mp->m_sb, ino))
> > > > +		return 0;
> > > >  
> > > >  	/*
> > > >  	 * We don't _need_ to take the ilock EXCL here because quotacheck runs
> > > >  	 * at mount time and therefore nobody will be racing chown/chproj.
> > > >  	 */
> > > > -	error = xfs_iget(mp, NULL, ino, XFS_IGET_DONTCACHE, 0, &ip);
> > > > -	if (error) {
> > > > -		*res = BULKSTAT_RV_NOTHING;
> > > > +	error = xfs_iget(mp, tp, ino, XFS_IGET_DONTCACHE, 0, &ip);
> > > 
> > > I was wondering if we should start using IGET_UNTRUSTED here, but I
> > > guess we're 1.) protected by quotacheck context and 2.) have the same
> > > record validity semantics as the existing bulkstat walker. LGTM:
> > 
> > FWIW, I'd be wanting to go the other way with bulkstat. i.e. finding
> > ways of reducing IGET_UNTRUSTED in bulkstat because it adds
> > substantial CPU overhead during inode lookup because it has to look
> > up the inobt to validate the inode number. i.e. we are locking the
> > AGI and doing an inobt lookup on every inode we bulkstat because
> > there is some time between the initial inobt lookup and the
> > xfs_iget() call and that's when the inode chunk can get removed.
> > 

Right.. note that I was just surmising whether IGET_UNTRUSTED was
required for correctness here due to the semantics of the iwalk.

> > IOWs, we only need to validate that the inode buffer still contains
> > inodes before we start instantiating inodes from it, but because we
> > don't hold any locks across individual inode processing in bulkstat
> > we have to revalidate that buffer contains inodes for every
> > allocated inode in that buffer. If we had a way of passing a locked
> > cluster buffer into xfs_iget to avoid having to look it up and read
> > it, we could do a single inode cluster read after validating the
> > inobt record is still valid, we could cycle all the remaining inodes
> > through xfs_iget() without having to use IGET_UNTRUSTED to validate
> > the inode cluster still contains valid inodes on every inode....
> > 

Yep, that sounds like a nice idea to me. Perhaps we could cache the
current cluster buffer somewhere in the iteration structure and use that
to validate incoming inode numbers until we move out of the cluster.

> > We still need to cycle inodes through the cache (so bulkstat is
> > coherent with other inode operations), but this would substantially
> > reduce the per-inode bulkstat CPU overhead, I think....
> 

Since we're already discussing tweaks to readahead, another approach to
this problem could be to try and readahead all the way into the inode
cache. For example, consider a mechanism where a cluster buffer
readahead sets a flag on the buffer that effectively triggers an iget of
each allocated inode in the buffer. Darrick has already shown that the
inode memory allocation and iget itself has considerable overhead even
when the cluster buffer is already cached. We know that's not due to
btree lookups because quotacheck isn't using IGET_UNTRUSTED, so perhaps
we could amortize more of this cost via readahead.

The caveats are that would probably be more involved than something that
just caches the current cluster buffer and passes it into the iget path.
We'd have to rectify readahead in-core inodes against DONTCACHE inodes
used by bulkstat, for example, though I don't think that would be too
difficult to address via a new inode readahead flag or some such
preserve existing DONTCACHE behavior.

It's also likely that passing the buffer into iget would already address
most of the overhead associated with the buffer lookup, so there might
not be enough tangible benefit at that point. The positive is that it's
probably an incremental step on top of an "iget from an existing cluster
buffer" mechanism and so could be easily prototyped by hacking in a read
side b_iodone handler or something.

Anyways, just thinking out loud a bit..

> I'll think about this as an addendum to the series, because I suspect
> that remodelling the existing users is going to be an entire series on
> its own.  (IOWs, my brain is too tired for today)
> 

Heh, definitely something for a separate patch (series). :)

Brian

> --D
> 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 01/10] xfs: create simplified inode walk function
  2019-06-12 12:13               ` Brian Foster
@ 2019-06-12 16:53                 ` Darrick J. Wong
  0 siblings, 0 replies; 46+ messages in thread
From: Darrick J. Wong @ 2019-06-12 16:53 UTC (permalink / raw)
  To: Brian Foster; +Cc: Dave Chinner, linux-xfs

On Wed, Jun 12, 2019 at 08:13:10AM -0400, Brian Foster wrote:
> On Tue, Jun 11, 2019 at 04:05:14PM -0700, Darrick J. Wong wrote:
> > On Wed, Jun 12, 2019 at 08:33:41AM +1000, Dave Chinner wrote:
> > > On Mon, Jun 10, 2019 at 04:11:34PM -0700, Darrick J. Wong wrote:
> > > > On Mon, Jun 10, 2019 at 01:55:10PM -0400, Brian Foster wrote:
> > > > > > I could extend the comment to explain why we don't use PAGE_SIZE...
> > > > > > 
> > > > > 
> > > > > Sounds good, though what I think would be better is to define a
> > > > > IWALK_DEFAULT_RECS or some such somewhere and put the calculation
> > > > > details with that.
> > > > > 
> > > > > Though now that you point out the readahead thing, aren't we at risk of
> > > > > a similar problem for users who happen to pass a really large userspace
> > > > > buffer? Should we cap the kernel allocation/readahead window in all
> > > > > cases and not just the default case?
> > > > 
> > > > Hmm, that's right, we don't want to let userspace arbitrarily determine
> > > > the size of the buffer, and I think the current implementation caps it
> > > > the readahaead at ... oh, PAGE_SIZE / sizeof(xfs_inogrp_t).
> > > > 
> > > > Oh, right, and in the V1 patchset Dave said that we should constrain
> > > > readahead even further.
> > > 
> > > Right, I should explain a bit further why, too - it's about
> > > performance.  I've found that a user buffer size of ~1024 inodes is
> > > generally enough to max out performance of bulkstat. i.e. somewhere
> > > around 1000 inodes per syscall is enough to mostly amortise all of
> > > the cost of syscall, setup, readahead, etc vs the CPU overhead of
> > > copying all the inodes into the user buffer.
> > > 
> > > Once the user buffer goes over a few thousand inodes, performance
> > > then starts to tail back off - we don't get any gains from trying to
> > > bulkstat tens of thousands of inodes at a time, especially under
> > > memory pressure because that can push us into readahead and buffer
> > > cache thrashing.
> > 
> > <nod> I don't mind setting the max inobt record cache buffer size to a
> > smaller value (1024 bytes == 4096 inodes readahead?) so we can get a
> > little farther into future hardware scalability (or decreases in syscall
> > performance :P).
> > 
> 
> The 1k baseline presumably applies to the current code. Taking a closer
> look at the current code, we unconditionally allocate a 4 page record
> buffer and start to fill it. For every record we grab, we issue
> readahead on the underlying clusters.
> 
> Hmm, that seems like generally what this patchset is doing aside from
> the more variable record buffer allocation. I'm fine with changing
> things like record buffer allocation, readahead semantics, etc. given
> Dave's practical analysis above, but TBH I don't think that should all
> be part of the same patch. IMO, this rework patch should maintain as
> close as possible to current behavior and a subsequent patches in the
> series can tweak record buffer size and whatnot to improve readahead
> logic. That makes this all easier to review, discuss and maintain in the
> event of regression.

This also got me thinking that I should look up what INUMBERS does,
which is that it uses the icount to determine the cache size, up to a
maximum of PAGE_SIZE.  Since INUMBERS doesn't issue readahead I think
it's fine to preserve that behavior too... and probably with a separate
xfs_inobt_walk_set_prefetch function.

> > I guess the question here is how to relate the number of inodes the user
> > asked for to how many inobt records we have to read to find that many
> > allocated inodes?  Or in other words, what's the average ir_freecount
> > across all the inobt records?
> > 
> 
> The current code basically looks like it allocates an oversized buffer
> and hopes for the best with regard to readahead. If we took a similar
> approach in terms of overestimating the buffer size (assuming not all
> inode records are fully allocated), I suppose we could also track the
> number of cluster readaheads issued and govern the collect/drain
> sequences of the record buffer based on that..? But again, I think we
> should have this as a separate "xfs: make iwalk readahead smarter ..."
> patch that documents Dave's analysis above, perhaps includes some
> numbers, etc..

Ok.

--D

> > Note that this is technically a decrease since the old code would
> > reserve 16K for this purpose...
> > 
> 
> Indeed.
> 
> Brian
> 
> > > > > > /*
> > > > > >  * Note: We hardcode 4096 here (instead of, say, PAGE_SIZE) because we want to
> > > > > >  * constrain the amount of inode readahead to 16k inodes regardless of CPU:
> > > > > >  *
> > > > > >  * 4096 bytes / 16 bytes per inobt record = 256 inobt records
> > > > > >  * 256 inobt records * 64 inodes per record = 16384 inodes
> > > > > >  * 16384 inodes * 512 bytes per inode(?) = 8MB of inode readahead
> > > > > >  */
> > > 
> > > Hence I suspect that even this is overkill - it makes no sense to
> > > have a huge readahead window when there has been no measurable
> > > performance benefit to doing large inode count bulkstat syscalls.
> > > 
> > > And, FWIW, readahead probably should also be capped at what the user
> > > buffer can hold - no point in reading 16k inodes when the output
> > > buffer can only fit 1000 inodes...
> > 
> > It already is -- the icount parameter from userspace is (eventually) fed
> > to xfs_iwalk-set_prefetch.
> > 
> > --D
> > 
> > > Cheers,
> > > 
> > > Dave.
> > > -- 
> > > Dave Chinner
> > > david@fromorbit.com

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 01/10] xfs: create simplified inode walk function
  2019-06-11 23:05             ` Darrick J. Wong
  2019-06-12 12:13               ` Brian Foster
@ 2019-06-12 17:54               ` Darrick J. Wong
  1 sibling, 0 replies; 46+ messages in thread
From: Darrick J. Wong @ 2019-06-12 17:54 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Brian Foster, linux-xfs

[-- Attachment #1: Type: text/plain, Size: 4596 bytes --]

On Tue, Jun 11, 2019 at 04:05:14PM -0700, Darrick J. Wong wrote:
> On Wed, Jun 12, 2019 at 08:33:41AM +1000, Dave Chinner wrote:
> > On Mon, Jun 10, 2019 at 04:11:34PM -0700, Darrick J. Wong wrote:
> > > On Mon, Jun 10, 2019 at 01:55:10PM -0400, Brian Foster wrote:
> > > > > I could extend the comment to explain why we don't use PAGE_SIZE...
> > > > > 
> > > > 
> > > > Sounds good, though what I think would be better is to define a
> > > > IWALK_DEFAULT_RECS or some such somewhere and put the calculation
> > > > details with that.
> > > > 
> > > > Though now that you point out the readahead thing, aren't we at risk of
> > > > a similar problem for users who happen to pass a really large userspace
> > > > buffer? Should we cap the kernel allocation/readahead window in all
> > > > cases and not just the default case?
> > > 
> > > Hmm, that's right, we don't want to let userspace arbitrarily determine
> > > the size of the buffer, and I think the current implementation caps it
> > > the readahaead at ... oh, PAGE_SIZE / sizeof(xfs_inogrp_t).
> > > 
> > > Oh, right, and in the V1 patchset Dave said that we should constrain
> > > readahead even further.
> > 
> > Right, I should explain a bit further why, too - it's about
> > performance.  I've found that a user buffer size of ~1024 inodes is
> > generally enough to max out performance of bulkstat. i.e. somewhere
> > around 1000 inodes per syscall is enough to mostly amortise all of
> > the cost of syscall, setup, readahead, etc vs the CPU overhead of
> > copying all the inodes into the user buffer.
> > 
> > Once the user buffer goes over a few thousand inodes, performance
> > then starts to tail back off - we don't get any gains from trying to
> > bulkstat tens of thousands of inodes at a time, especially under
> > memory pressure because that can push us into readahead and buffer
> > cache thrashing.
> 
> <nod> I don't mind setting the max inobt record cache buffer size to a
> smaller value (1024 bytes == 4096 inodes readahead?) so we can get a
> little farther into future hardware scalability (or decreases in syscall
> performance :P).
> 
> I guess the question here is how to relate the number of inodes the user
> asked for to how many inobt records we have to read to find that many
> allocated inodes?  Or in other words, what's the average ir_freecount
> across all the inobt records?

Partial results computed by using xfs_db to dump all the inobt records
in the fs to look at averge freecounts, and the number of inobt records
with zero freecount:

Development workstation / and /home:
4678 total, zerofree 72.49%, avg freecount 4.24
107940 total, zerofree 94.84%, avg freecount 0.78

This laptop /, /home, and /boot:
4613 total, zerofree 80.73%, avg freecount 3.11
49660 total, zerofree 99.54%, avg freecount 0.04
10 total, zerofree 20.00%, avg freecount 27.40

Backup server / and /home:
3897 total, zerofree 22.99%, avg freecount 27.08
55995 total, zerofree 99.87%, avg freecount 0.01

(Note that the root fs is nearly out of space now, thanks journald...)

xfstests host / and $TEST_DIR:
1834 total, zerofree 76.28%, avg freecount 3.31
20097 total, zerofree 83.41%, avg freecount 3.62

The script I used to generate these reports is attached.  From this
admittedly cursory output I "conclude" that bulkstat could get away with
prefetching (icount / 48) inobt records up to a max of 1000 inodes.
A more conservative estimate would be (icount / 32) inobt records.

--D

> 
> Note that this is technically a decrease since the old code would
> reserve 16K for this purpose...
> 
> > > > > /*
> > > > >  * Note: We hardcode 4096 here (instead of, say, PAGE_SIZE) because we want to
> > > > >  * constrain the amount of inode readahead to 16k inodes regardless of CPU:
> > > > >  *
> > > > >  * 4096 bytes / 16 bytes per inobt record = 256 inobt records
> > > > >  * 256 inobt records * 64 inodes per record = 16384 inodes
> > > > >  * 16384 inodes * 512 bytes per inode(?) = 8MB of inode readahead
> > > > >  */
> > 
> > Hence I suspect that even this is overkill - it makes no sense to
> > have a huge readahead window when there has been no measurable
> > performance benefit to doing large inode count bulkstat syscalls.
> > 
> > And, FWIW, readahead probably should also be capped at what the user
> > buffer can hold - no point in reading 16k inodes when the output
> > buffer can only fit 1000 inodes...
> 
> It already is -- the icount parameter from userspace is (eventually) fed
> to xfs_iwalk-set_prefetch.
> 
> --D
> 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com

[-- Attachment #2: xfsinoload.sh --]
[-- Type: application/x-sh, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 02/10] xfs: convert quotacheck to use the new iwalk functions
  2019-06-12 12:55         ` Brian Foster
@ 2019-06-12 23:33           ` Dave Chinner
  2019-06-13 18:34             ` Brian Foster
  0 siblings, 1 reply; 46+ messages in thread
From: Dave Chinner @ 2019-06-12 23:33 UTC (permalink / raw)
  To: Brian Foster; +Cc: Darrick J. Wong, linux-xfs, Dave Chinner

On Wed, Jun 12, 2019 at 08:55:06AM -0400, Brian Foster wrote:
> On Tue, Jun 11, 2019 at 05:32:19PM -0700, Darrick J. Wong wrote:
> > On Wed, Jun 12, 2019 at 09:23:47AM +1000, Dave Chinner wrote:
> Since we're already discussing tweaks to readahead, another approach to
> this problem could be to try and readahead all the way into the inode
> cache. For example, consider a mechanism where a cluster buffer
> readahead sets a flag on the buffer that effectively triggers an iget of
> each allocated inode in the buffer. Darrick has already shown that the
> inode memory allocation and iget itself has considerable overhead even
> when the cluster buffer is already cached. We know that's not due to
> btree lookups because quotacheck isn't using IGET_UNTRUSTED, so perhaps
> we could amortize more of this cost via readahead.

The DONTCACHE inode caching semantics of bulkstat tend to conflict
with "readahead all the way to the inode cache".

> The caveats are that would probably be more involved than something that
> just caches the current cluster buffer and passes it into the iget path.
> We'd have to rectify readahead in-core inodes against DONTCACHE inodes
> used by bulkstat, for example, though I don't think that would be too
> difficult to address via a new inode readahead flag or some such
> preserve existing DONTCACHE behavior.

I did try that once, the cache thrashing was .... difficult to
contain under memory pressure. bulkstat pushes hundreds of thousands
of inodes a second through the inode cache, and under memory
pressure it will still cause working set perturbation with DONTCACHE
being set. Holding DONTCACHE inodes for some time in the cache kinda
defeats the simple purpose it has, and relying on cache hits to
convert "readahead" to "dont cache" becomes really nasty when we
try to use inode readahead for other things (like speeding up
directory traversals by having xfs_readdir() issue inode readahead).

The largest delay in bulkstat is the inode cluster IO latency.
Getting rid of that is where the biggest win is (hence cluster
read-ahead). The second largest overhead is the CPU burnt doing
inode lookups, and on filesystems with lots of inodes, a significant
amount of that is in the IGET_UNTRUSTED inobt lookup. IOWs, avoiding
GET_UNTRUSTED is relatively low hanging fruit.

The next limitation for bulkstat is the superblock inode list lock
contention. Getting rid of the IGET_UNTRUSTED overhead is likely to
push the lock contention into the severe range (the lock is already
the largest CPU consumer at 16 threads bulkstating 600,000 inodes/s
on a 16p machine) so until we get rid of that lock contention, there
isn't much point in doing major rework to the bulkstat algorithm as
it doesn't address the limitations that the current algorithm has.

> It's also likely that passing the buffer into iget would already address
> most of the overhead associated with the buffer lookup, so there might
> not be enough tangible benefit at that point. The positive is that it's
> probably an incremental step on top of an "iget from an existing cluster
> buffer" mechanism and so could be easily prototyped by hacking in a read
> side b_iodone handler or something.

We don't want to put inode cache insertion into a IO completion
routine. Tried it, caused horrible problems with metadata read IO
latency and substantially increased inode cache lock contention by
bouncing the radix trees around both submission and completion CPU
contexts...

/me has spent many, many years trying lots of different ways to make
the inode cache in XFS go faster and has failed most of the time....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 02/10] xfs: convert quotacheck to use the new iwalk functions
  2019-06-12 23:33           ` Dave Chinner
@ 2019-06-13 18:34             ` Brian Foster
  0 siblings, 0 replies; 46+ messages in thread
From: Brian Foster @ 2019-06-13 18:34 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs, Dave Chinner

On Thu, Jun 13, 2019 at 09:33:02AM +1000, Dave Chinner wrote:
> On Wed, Jun 12, 2019 at 08:55:06AM -0400, Brian Foster wrote:
> > On Tue, Jun 11, 2019 at 05:32:19PM -0700, Darrick J. Wong wrote:
> > > On Wed, Jun 12, 2019 at 09:23:47AM +1000, Dave Chinner wrote:
> > Since we're already discussing tweaks to readahead, another approach to
> > this problem could be to try and readahead all the way into the inode
> > cache. For example, consider a mechanism where a cluster buffer
> > readahead sets a flag on the buffer that effectively triggers an iget of
> > each allocated inode in the buffer. Darrick has already shown that the
> > inode memory allocation and iget itself has considerable overhead even
> > when the cluster buffer is already cached. We know that's not due to
> > btree lookups because quotacheck isn't using IGET_UNTRUSTED, so perhaps
> > we could amortize more of this cost via readahead.
> 
> The DONTCACHE inode caching semantics of bulkstat tend to conflict
> with "readahead all the way to the inode cache".
> 

Yep..

> > The caveats are that would probably be more involved than something that
> > just caches the current cluster buffer and passes it into the iget path.
> > We'd have to rectify readahead in-core inodes against DONTCACHE inodes
> > used by bulkstat, for example, though I don't think that would be too
> > difficult to address via a new inode readahead flag or some such
> > preserve existing DONTCACHE behavior.
> 
> I did try that once, the cache thrashing was .... difficult to
> contain under memory pressure. bulkstat pushes hundreds of thousands
> of inodes a second through the inode cache, and under memory
> pressure it will still cause working set perturbation with DONTCACHE
> being set. Holding DONTCACHE inodes for some time in the cache kinda
> defeats the simple purpose it has, and relying on cache hits to
> convert "readahead" to "dont cache" becomes really nasty when we
> try to use inode readahead for other things (like speeding up
> directory traversals by having xfs_readdir() issue inode readahead).
> 

Yeah, though you're taking things a bit further than I was originally
thinking by applying this to bulkstat in general. I conflated bulkstat
with quotacheck above (because the latter uses the former), but I
probably should have just referred to quotacheck since the bulkstat
callback is where we actually grab the inode anyways.

The experiment I was thinking about was really just intended for
quotacheck because it's already running in a fairly isolated context.
There's no other working set, no contention to worry about with other
tasks, etc., so it's a simple environment to evaluate this kind of
(hacky) experiment. FWIW the motivation is related, since when qc is
required the user basically has to sit there and wait for it to complete
before the fs is usable (as opposed to bulkstat, which I just chalk up
to being a long running fs operation requested by the user).

Of course if there's simply no benefit in the quotacheck context, then
there's clearly not much reason to consider similar behavior for
bulkstat in general. But even if there was a benefit to qc, I agree with
your point that things get a whole lot more complex when we have to
consider working set and workload of an operational fs. Either way, I'm
still curious if it helps in qc context. :)

> The largest delay in bulkstat is the inode cluster IO latency.
> Getting rid of that is where the biggest win is (hence cluster
> read-ahead). The second largest overhead is the CPU burnt doing
> inode lookups, and on filesystems with lots of inodes, a significant
> amount of that is in the IGET_UNTRUSTED inobt lookup. IOWs, avoiding
> GET_UNTRUSTED is relatively low hanging fruit.
> 

Agreed, but this again is more applicable to general bulkstat. Have you
looked at Darrick's quotacheck flame graph? It shows the majority of
quotacheck overhead in the inode memory allocation and grabbing the
(presumably already in-core, though that is not visible in the graphic)
cluster buffer. IGET_UNTRUSTED is not a factor here because the special
quotacheck context allows us to elide it.

BTW, a related point here is that perhaps anything that speeds up
xfs_buf_find() might also speed this up (hmm.. _less_ read-ahead
perhaps?) without resorting to special case inode preallocation hacks.
It's probably worth collecting more detailed perf data on that qc
codepath before getting too far into the weeds...

> The next limitation for bulkstat is the superblock inode list lock
> contention. Getting rid of the IGET_UNTRUSTED overhead is likely to
> push the lock contention into the severe range (the lock is already
> the largest CPU consumer at 16 threads bulkstating 600,000 inodes/s
> on a 16p machine) so until we get rid of that lock contention, there
> isn't much point in doing major rework to the bulkstat algorithm as
> it doesn't address the limitations that the current algorithm has.
> 
> > It's also likely that passing the buffer into iget would already address
> > most of the overhead associated with the buffer lookup, so there might
> > not be enough tangible benefit at that point. The positive is that it's
> > probably an incremental step on top of an "iget from an existing cluster
> > buffer" mechanism and so could be easily prototyped by hacking in a read
> > side b_iodone handler or something.
> 
> We don't want to put inode cache insertion into a IO completion
> routine. Tried it, caused horrible problems with metadata read IO
> latency and substantially increased inode cache lock contention by
> bouncing the radix trees around both submission and completion CPU
> contexts...
> 

Agreed, that certainly makes sense. It would be kind of crazy to have
single inode lookups waiting on inode cache population of entire cluster
buffers worth of inodes.

> /me has spent many, many years trying lots of different ways to make
> the inode cache in XFS go faster and has failed most of the time....
> 

Heh. :)

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 46+ messages in thread

end of thread, other threads:[~2019-06-13 18:34 UTC | newest]

Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-06-04 21:49 [PATCH v2 00/10] xfs: refactor and improve inode iteration Darrick J. Wong
2019-06-04 21:49 ` [PATCH 01/10] xfs: create simplified inode walk function Darrick J. Wong
2019-06-10 13:58   ` Brian Foster
2019-06-10 16:59     ` Darrick J. Wong
2019-06-10 17:55       ` Brian Foster
2019-06-10 23:11         ` Darrick J. Wong
2019-06-11 22:33           ` Dave Chinner
2019-06-11 23:05             ` Darrick J. Wong
2019-06-12 12:13               ` Brian Foster
2019-06-12 16:53                 ` Darrick J. Wong
2019-06-12 17:54               ` Darrick J. Wong
2019-06-04 21:49 ` [PATCH 02/10] xfs: convert quotacheck to use the new iwalk functions Darrick J. Wong
2019-06-10 13:58   ` Brian Foster
2019-06-10 17:10     ` Darrick J. Wong
2019-06-11 23:23     ` Dave Chinner
2019-06-12  0:32       ` Darrick J. Wong
2019-06-12 12:55         ` Brian Foster
2019-06-12 23:33           ` Dave Chinner
2019-06-13 18:34             ` Brian Foster
2019-06-04 21:49 ` [PATCH 03/10] xfs: bulkstat should copy lastip whenever userspace supplies one Darrick J. Wong
2019-06-10 13:59   ` Brian Foster
2019-06-04 21:49 ` [PATCH 04/10] xfs: convert bulkstat to new iwalk infrastructure Darrick J. Wong
2019-06-10 14:02   ` Brian Foster
2019-06-10 17:38     ` Darrick J. Wong
2019-06-10 18:29       ` Brian Foster
2019-06-10 23:42         ` Darrick J. Wong
2019-06-04 21:49 ` [PATCH 05/10] xfs: move bulkstat ichunk helpers to iwalk code Darrick J. Wong
2019-06-10 14:02   ` Brian Foster
2019-06-04 21:50 ` [PATCH 06/10] xfs: change xfs_iwalk_grab_ichunk to use startino, not lastino Darrick J. Wong
2019-06-10 19:32   ` Brian Foster
2019-06-04 21:50 ` [PATCH 07/10] xfs: clean up long conditionals in xfs_iwalk_ichunk_ra Darrick J. Wong
2019-06-10 19:32   ` Brian Foster
2019-06-04 21:50 ` [PATCH 08/10] xfs: multithreaded iwalk implementation Darrick J. Wong
2019-06-10 19:40   ` Brian Foster
2019-06-11  1:10     ` Darrick J. Wong
2019-06-11 13:13       ` Brian Foster
2019-06-11 15:29         ` Darrick J. Wong
2019-06-11 17:00           ` Brian Foster
2019-06-04 21:50 ` [PATCH 09/10] xfs: poll waiting for quotacheck Darrick J. Wong
2019-06-11 15:07   ` Brian Foster
2019-06-11 16:06     ` Darrick J. Wong
2019-06-11 17:01       ` Brian Foster
2019-06-04 21:50 ` [PATCH 10/10] xfs: refactor INUMBERS to use iwalk functions Darrick J. Wong
2019-06-11 15:08   ` Brian Foster
2019-06-11 16:21     ` Darrick J. Wong
2019-06-11 17:01       ` Brian Foster

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.