* [PATCH 00/11] xfs: refactor and improve inode iteration
@ 2019-05-29 22:26 Darrick J. Wong
  2019-05-29 22:26 ` [PATCH 01/11] xfs: separate inode geometry Darrick J. Wong
                   ` (10 more replies)
  0 siblings, 11 replies; 19+ messages in thread
From: Darrick J. Wong @ 2019-05-29 22:26 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

Hi all,

This series refactors all the inode walking code in XFS into a single
set of helper functions.  The goal is to separate the mechanics of
iterating over a subset of the inodes in the filesystem from bulkstat.

Next we introduce a parallel inode walk feature to speed up quotacheck
on large filesystems.  Finally, we port the existing bulkstat and
inumbers ioctls to the new walk infrastructure in preparation for the
next series.
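
To give a flavor of the interface this series builds up, these are the
entry points that patch 2 introduces in xfs_iwalk.h; callers supply a
walk function and an opaque data pointer, and the walk visits every
allocated inode starting at @startino:

	/* Called once for each allocated inode. */
	typedef int (*xfs_iwalk_fn)(struct xfs_mount *mp,
				    struct xfs_trans *tp,
				    xfs_ino_t ino, void *data);

	/* Return this from the walk function to stop the walk. */
	#define XFS_IWALK_ABORT	(1)

	int xfs_iwalk(struct xfs_mount *mp, struct xfs_trans *tp,
		      xfs_ino_t startino, xfs_iwalk_fn iwalk_fn,
		      unsigned int max_prefetch, void *data);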

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=parallel-iwalk

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=parallel-iwalk

* [PATCH 01/11] xfs: separate inode geometry
  2019-05-29 22:26 [PATCH 00/11] xfs: refactor and improve inode iteration Darrick J. Wong
@ 2019-05-29 22:26 ` Darrick J. Wong
  2019-05-30  1:18   ` Dave Chinner
  2019-05-29 22:26 ` [PATCH 02/11] xfs: create simplified inode walk function Darrick J. Wong
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 19+ messages in thread
From: Darrick J. Wong @ 2019-05-29 22:26 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Separate the inode geometry information into a distinct structure.
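
The change is mechanical at each call site: a per-mount geometry field
becomes the corresponding field of the embedded struct
xfs_ino_geometry.  For example (lifted from the xfs_ialloc.c hunk
below):

	/* before: geometry fields scattered across struct xfs_mount */
	nbufs = length / mp->m_blocks_per_cluster;

	/* after: the same value lives in the geometry structure */
	nbufs = length / mp->m_ino_geo.ig_blocks_per_cluster;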

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_format.h       |   33 +++++++++++-
 fs/xfs/libxfs/xfs_ialloc.c       |  109 ++++++++++++++++++++------------------
 fs/xfs/libxfs/xfs_ialloc.h       |    6 +-
 fs/xfs/libxfs/xfs_ialloc_btree.c |   15 +++--
 fs/xfs/libxfs/xfs_inode_buf.c    |    2 -
 fs/xfs/libxfs/xfs_sb.c           |   24 +++++---
 fs/xfs/libxfs/xfs_trans_resv.c   |   17 +++---
 fs/xfs/libxfs/xfs_trans_space.h  |    7 +-
 fs/xfs/libxfs/xfs_types.c        |    4 +
 fs/xfs/scrub/ialloc.c            |   22 ++++----
 fs/xfs/scrub/quota.c             |    2 -
 fs/xfs/xfs_fsops.c               |    4 +
 fs/xfs/xfs_inode.c               |   17 +++---
 fs/xfs/xfs_itable.c              |   11 ++--
 fs/xfs/xfs_log_recover.c         |   23 ++++----
 fs/xfs/xfs_mount.c               |   49 +++++++++--------
 fs/xfs/xfs_mount.h               |   17 ------
 fs/xfs/xfs_super.c               |    6 +-
 18 files changed, 205 insertions(+), 163 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 9bb3c48843ec..66f527b1c461 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -1071,7 +1071,7 @@ static inline void xfs_dinode_put_rdev(struct xfs_dinode *dip, xfs_dev_t rdev)
 #define	XFS_INO_MASK(k)			(uint32_t)((1ULL << (k)) - 1)
 #define	XFS_INO_OFFSET_BITS(mp)		(mp)->m_sb.sb_inopblog
 #define	XFS_INO_AGBNO_BITS(mp)		(mp)->m_sb.sb_agblklog
-#define	XFS_INO_AGINO_BITS(mp)		(mp)->m_agino_log
+#define	XFS_INO_AGINO_BITS(mp)		((mp)->m_ino_geo.ig_agino_log)
 #define	XFS_INO_AGNO_BITS(mp)		(mp)->m_agno_log
 #define	XFS_INO_BITS(mp)		\
 	XFS_INO_AGNO_BITS(mp) + XFS_INO_AGINO_BITS(mp)
@@ -1694,4 +1694,35 @@ struct xfs_acl {
 #define SGI_ACL_FILE_SIZE	(sizeof(SGI_ACL_FILE)-1)
 #define SGI_ACL_DEFAULT_SIZE	(sizeof(SGI_ACL_DEFAULT)-1)
 
+struct xfs_ino_geometry {
+	/* Maximum inode count in this filesystem. */
+	uint64_t	ig_maxicount;
+
+	/* Minimum inode buffer size, in bytes. */
+	unsigned int	ig_min_cluster_size;
+
+	/* Inode cluster sizes, adjusted to be at least 1 fsb. */
+	unsigned int	ig_inodes_per_cluster;
+	unsigned int	ig_blocks_per_cluster;
+
+	/* Inode cluster alignment. */
+	unsigned int	ig_cluster_align;
+	unsigned int	ig_cluster_align_inodes;
+
+	unsigned int	ig_inobt_mxr[2]; /* max inobt btree records */
+	unsigned int	ig_inobt_mnr[2]; /* min inobt btree records */
+	unsigned int	ig_in_maxlevels; /* max inobt btree levels. */
+
+	/* Minimum inode allocation size */
+	unsigned int	ig_ialloc_inos;
+	unsigned int	ig_ialloc_blks;
+
+	/* Minimum inode blocks for a sparse allocation. */
+	unsigned int	ig_ialloc_min_blks;
+
+	unsigned int	ig_inoalign_mask;/* mask sb_inoalignmt if used */
+	unsigned int	ig_agino_log;	/* #bits for agino in inum */
+	unsigned int	ig_sinoalign;	/* stripe unit inode alignment */
+};
+
 #endif /* __XFS_FORMAT_H__ */
diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index fe9898875097..c881e0521331 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -299,7 +299,7 @@ xfs_ialloc_inode_init(
 	 * sizes, manipulate the inodes in buffers  which are multiples of the
 	 * blocks size.
 	 */
-	nbufs = length / mp->m_blocks_per_cluster;
+	nbufs = length / mp->m_ino_geo.ig_blocks_per_cluster;
 
 	/*
 	 * Figure out what version number to use in the inodes we create.  If
@@ -343,9 +343,10 @@ xfs_ialloc_inode_init(
 		 * Get the block.
 		 */
 		d = XFS_AGB_TO_DADDR(mp, agno, agbno +
-				(j * mp->m_blocks_per_cluster));
+				(j * mp->m_ino_geo.ig_blocks_per_cluster));
 		fbuf = xfs_trans_get_buf(tp, mp->m_ddev_targp, d,
-					 mp->m_bsize * mp->m_blocks_per_cluster,
+					 mp->m_bsize *
+					 mp->m_ino_geo.ig_blocks_per_cluster,
 					 XBF_UNMAPPED);
 		if (!fbuf)
 			return -ENOMEM;
@@ -353,7 +354,7 @@ xfs_ialloc_inode_init(
 		/* Initialize the inode buffers and log them appropriately. */
 		fbuf->b_ops = &xfs_inode_buf_ops;
 		xfs_buf_zero(fbuf, 0, BBTOB(fbuf->b_length));
-		for (i = 0; i < mp->m_inodes_per_cluster; i++) {
+		for (i = 0; i < mp->m_ino_geo.ig_inodes_per_cluster; i++) {
 			int	ioffset = i << mp->m_sb.sb_inodelog;
 			uint	isize = xfs_dinode_size(version);
 
@@ -616,24 +617,26 @@ xfs_inobt_insert_sprec(
  * Allocate new inodes in the allocation group specified by agbp.
  * Return 0 for success, else error code.
  */
-STATIC int				/* error code or 0 */
+STATIC int
 xfs_ialloc_ag_alloc(
-	xfs_trans_t	*tp,		/* transaction pointer */
-	xfs_buf_t	*agbp,		/* alloc group buffer */
-	int		*alloc)
+	struct xfs_trans	*tp,
+	struct xfs_buf		*agbp,
+	int			*alloc)
 {
-	xfs_agi_t	*agi;		/* allocation group header */
-	xfs_alloc_arg_t	args;		/* allocation argument structure */
-	xfs_agnumber_t	agno;
-	int		error;
-	xfs_agino_t	newino;		/* new first inode's number */
-	xfs_agino_t	newlen;		/* new number of inodes */
-	int		isaligned = 0;	/* inode allocation at stripe unit */
-					/* boundary */
-	uint16_t	allocmask = (uint16_t) -1; /* init. to full chunk */
+	struct xfs_agi		*agi;
+	struct xfs_alloc_arg	args;
+	xfs_agnumber_t		agno;
+	int			error;
+	xfs_agino_t		newino;		/* new first inode's number */
+	xfs_agino_t		newlen;		/* new number of inodes */
+	int			isaligned = 0;	/* inode allocation at stripe */
+						/* unit boundary */
+	/* init. to full chunk */
+	uint16_t		allocmask = (uint16_t) -1;
 	struct xfs_inobt_rec_incore rec;
-	struct xfs_perag *pag;
-	int		do_sparse = 0;
+	struct xfs_perag	*pag;
+	struct xfs_ino_geometry	*igeo = &tp->t_mountp->m_ino_geo;
+	int			do_sparse = 0;
 
 	memset(&args, 0, sizeof(args));
 	args.tp = tp;
@@ -644,7 +647,7 @@ xfs_ialloc_ag_alloc(
 #ifdef DEBUG
 	/* randomly do sparse inode allocations */
 	if (xfs_sb_version_hassparseinodes(&tp->t_mountp->m_sb) &&
-	    args.mp->m_ialloc_min_blks < args.mp->m_ialloc_blks)
+	    igeo->ig_ialloc_min_blks < igeo->ig_ialloc_blks)
 		do_sparse = prandom_u32() & 1;
 #endif
 
@@ -652,12 +655,12 @@ xfs_ialloc_ag_alloc(
 	 * Locking will ensure that we don't have two callers in here
 	 * at one time.
 	 */
-	newlen = args.mp->m_ialloc_inos;
-	if (args.mp->m_maxicount &&
+	newlen = igeo->ig_ialloc_inos;
+	if (igeo->ig_maxicount &&
 	    percpu_counter_read_positive(&args.mp->m_icount) + newlen >
-							args.mp->m_maxicount)
+							igeo->ig_maxicount)
 		return -ENOSPC;
-	args.minlen = args.maxlen = args.mp->m_ialloc_blks;
+	args.minlen = args.maxlen = igeo->ig_ialloc_blks;
 	/*
 	 * First try to allocate inodes contiguous with the last-allocated
 	 * chunk of inodes.  If the filesystem is striped, this will fill
@@ -667,7 +670,7 @@ xfs_ialloc_ag_alloc(
 	newino = be32_to_cpu(agi->agi_newino);
 	agno = be32_to_cpu(agi->agi_seqno);
 	args.agbno = XFS_AGINO_TO_AGBNO(args.mp, newino) +
-		     args.mp->m_ialloc_blks;
+		     igeo->ig_ialloc_blks;
 	if (do_sparse)
 		goto sparse_alloc;
 	if (likely(newino != NULLAGINO &&
@@ -690,10 +693,10 @@ xfs_ialloc_ag_alloc(
 		 * but not to use them in the actual exact allocation.
 		 */
 		args.alignment = 1;
-		args.minalignslop = args.mp->m_cluster_align - 1;
+		args.minalignslop = args.mp->m_ino_geo.ig_cluster_align - 1;
 
 		/* Allow space for the inode btree to split. */
-		args.minleft = args.mp->m_in_maxlevels - 1;
+		args.minleft = igeo->ig_in_maxlevels - 1;
 		if ((error = xfs_alloc_vextent(&args)))
 			return error;
 
@@ -720,12 +723,12 @@ xfs_ialloc_ag_alloc(
 		 * pieces, so don't need alignment anyway.
 		 */
 		isaligned = 0;
-		if (args.mp->m_sinoalign) {
+		if (igeo->ig_sinoalign) {
 			ASSERT(!(args.mp->m_flags & XFS_MOUNT_NOALIGN));
 			args.alignment = args.mp->m_dalign;
 			isaligned = 1;
 		} else
-			args.alignment = args.mp->m_cluster_align;
+			args.alignment = args.mp->m_ino_geo.ig_cluster_align;
 		/*
 		 * Need to figure out where to allocate the inode blocks.
 		 * Ideally they should be spaced out through the a.g.
@@ -741,7 +744,7 @@ xfs_ialloc_ag_alloc(
 		/*
 		 * Allow space for the inode btree to split.
 		 */
-		args.minleft = args.mp->m_in_maxlevels - 1;
+		args.minleft = igeo->ig_in_maxlevels - 1;
 		if ((error = xfs_alloc_vextent(&args)))
 			return error;
 	}
@@ -754,7 +757,7 @@ xfs_ialloc_ag_alloc(
 		args.type = XFS_ALLOCTYPE_NEAR_BNO;
 		args.agbno = be32_to_cpu(agi->agi_root);
 		args.fsbno = XFS_AGB_TO_FSB(args.mp, agno, args.agbno);
-		args.alignment = args.mp->m_cluster_align;
+		args.alignment = args.mp->m_ino_geo.ig_cluster_align;
 		if ((error = xfs_alloc_vextent(&args)))
 			return error;
 	}
@@ -764,7 +767,7 @@ xfs_ialloc_ag_alloc(
 	 * the sparse allocation length is smaller than a full chunk.
 	 */
 	if (xfs_sb_version_hassparseinodes(&args.mp->m_sb) &&
-	    args.mp->m_ialloc_min_blks < args.mp->m_ialloc_blks &&
+	    igeo->ig_ialloc_min_blks < igeo->ig_ialloc_blks &&
 	    args.fsbno == NULLFSBLOCK) {
 sparse_alloc:
 		args.type = XFS_ALLOCTYPE_NEAR_BNO;
@@ -773,7 +776,7 @@ xfs_ialloc_ag_alloc(
 		args.alignment = args.mp->m_sb.sb_spino_align;
 		args.prod = 1;
 
-		args.minlen = args.mp->m_ialloc_min_blks;
+		args.minlen = igeo->ig_ialloc_min_blks;
 		args.maxlen = args.minlen;
 
 		/*
@@ -789,7 +792,7 @@ xfs_ialloc_ag_alloc(
 		args.min_agbno = args.mp->m_sb.sb_inoalignmt;
 		args.max_agbno = round_down(args.mp->m_sb.sb_agblocks,
 					    args.mp->m_sb.sb_inoalignmt) -
-				 args.mp->m_ialloc_blks;
+				 igeo->ig_ialloc_blks;
 
 		error = xfs_alloc_vextent(&args);
 		if (error)
@@ -1006,8 +1009,8 @@ xfs_ialloc_ag_select(
 		 * space needed for alignment of inode chunks when checking the
 		 * longest contiguous free space in the AG - this prevents us
 		 * from getting ENOSPC because we have free space larger than
-		 * m_ialloc_blks but alignment constraints prevent us from using
-		 * it.
+		 * ig_ialloc_blks but alignment constraints prevent us from
+		 * using it.
 		 *
 		 * If we can't find an AG with space for full alignment slack to
 		 * be taken into account, we must be near ENOSPC in all AGs.
@@ -1015,9 +1018,9 @@ xfs_ialloc_ag_select(
 		 * if we fail allocation due to alignment issues then it is most
 		 * likely a real ENOSPC condition.
 		 */
-		ineed = mp->m_ialloc_min_blks;
+		ineed = mp->m_ino_geo.ig_ialloc_min_blks;
 		if (flags && ineed > 1)
-			ineed += mp->m_cluster_align;
+			ineed += mp->m_ino_geo.ig_cluster_align;
 		longest = pag->pagf_longest;
 		if (!longest)
 			longest = pag->pagf_flcount > 0;
@@ -1703,6 +1706,7 @@ xfs_dialloc(
 	int			noroom = 0;
 	xfs_agnumber_t		start_agno;
 	struct xfs_perag	*pag;
+	struct xfs_ino_geometry	*igeo = &mp->m_ino_geo;
 	int			okalloc = 1;
 
 	if (*IO_agbp) {
@@ -1733,9 +1737,9 @@ xfs_dialloc(
 	 * Read rough value of mp->m_icount by percpu_counter_read_positive,
 	 * which will sacrifice the preciseness but improve the performance.
 	 */
-	if (mp->m_maxicount &&
-	    percpu_counter_read_positive(&mp->m_icount) + mp->m_ialloc_inos
-							> mp->m_maxicount) {
+	if (mp->m_ino_geo.ig_maxicount &&
+	    percpu_counter_read_positive(&mp->m_icount) + igeo->ig_ialloc_inos
+							> igeo->ig_maxicount) {
 		noroom = 1;
 		okalloc = 0;
 	}
@@ -1852,7 +1856,8 @@ xfs_difree_inode_chunk(
 	if (!xfs_inobt_issparse(rec->ir_holemask)) {
 		/* not sparse, calculate extent info directly */
 		xfs_bmap_add_free(tp, XFS_AGB_TO_FSB(mp, agno, sagbno),
-				  mp->m_ialloc_blks, &XFS_RMAP_OINFO_INODES);
+				  mp->m_ino_geo.ig_ialloc_blks,
+				  &XFS_RMAP_OINFO_INODES);
 		return;
 	}
 
@@ -2261,7 +2266,7 @@ xfs_imap_lookup(
 
 	/* check that the returned record contains the required inode */
 	if (rec.ir_startino > agino ||
-	    rec.ir_startino + mp->m_ialloc_inos <= agino)
+	    rec.ir_startino + mp->m_ino_geo.ig_ialloc_inos <= agino)
 		return -EINVAL;
 
 	/* for untrusted inodes check it is allocated first */
@@ -2352,7 +2357,7 @@ xfs_imap(
 	 * If the inode cluster size is the same as the blocksize or
 	 * smaller we get to the buffer by simple arithmetics.
 	 */
-	if (mp->m_blocks_per_cluster == 1) {
+	if (mp->m_ino_geo.ig_blocks_per_cluster == 1) {
 		offset = XFS_INO_TO_OFFSET(mp, ino);
 		ASSERT(offset < mp->m_sb.sb_inopblock);
 
@@ -2368,8 +2373,8 @@ xfs_imap(
 	 * find the location. Otherwise we have to do a btree
 	 * lookup to find the location.
 	 */
-	if (mp->m_inoalign_mask) {
-		offset_agbno = agbno & mp->m_inoalign_mask;
+	if (mp->m_ino_geo.ig_inoalign_mask) {
+		offset_agbno = agbno & mp->m_ino_geo.ig_inoalign_mask;
 		chunk_agbno = agbno - offset_agbno;
 	} else {
 		error = xfs_imap_lookup(mp, tp, agno, agino, agbno,
@@ -2381,13 +2386,13 @@ xfs_imap(
 out_map:
 	ASSERT(agbno >= chunk_agbno);
 	cluster_agbno = chunk_agbno +
-		((offset_agbno / mp->m_blocks_per_cluster) *
-		 mp->m_blocks_per_cluster);
+		((offset_agbno / mp->m_ino_geo.ig_blocks_per_cluster) *
+		 mp->m_ino_geo.ig_blocks_per_cluster);
 	offset = ((agbno - cluster_agbno) * mp->m_sb.sb_inopblock) +
 		XFS_INO_TO_OFFSET(mp, ino);
 
 	imap->im_blkno = XFS_AGB_TO_DADDR(mp, agno, cluster_agbno);
-	imap->im_len = XFS_FSB_TO_BB(mp, mp->m_blocks_per_cluster);
+	imap->im_len = XFS_FSB_TO_BB(mp, mp->m_ino_geo.ig_blocks_per_cluster);
 	imap->im_boffset = (unsigned short)(offset << mp->m_sb.sb_inodelog);
 
 	/*
@@ -2409,7 +2414,7 @@ xfs_imap(
 }
 
 /*
- * Compute and fill in value of m_in_maxlevels.
+ * Compute and fill in value of m_ino_geo.ig_in_maxlevels.
  */
 void
 xfs_ialloc_compute_maxlevels(
@@ -2418,8 +2423,8 @@ xfs_ialloc_compute_maxlevels(
 	uint		inodes;
 
 	inodes = (1LL << XFS_INO_AGINO_BITS(mp)) >> XFS_INODES_PER_CHUNK_LOG;
-	mp->m_in_maxlevels = xfs_btree_compute_maxlevels(mp->m_inobt_mnr,
-							 inodes);
+	mp->m_ino_geo.ig_in_maxlevels = xfs_btree_compute_maxlevels(
+			mp->m_ino_geo.ig_inobt_mnr, inodes);
 }
 
 /*
diff --git a/fs/xfs/libxfs/xfs_ialloc.h b/fs/xfs/libxfs/xfs_ialloc.h
index e936b7cc9389..b74fa2addd51 100644
--- a/fs/xfs/libxfs/xfs_ialloc.h
+++ b/fs/xfs/libxfs/xfs_ialloc.h
@@ -28,9 +28,9 @@ static inline int
 xfs_icluster_size_fsb(
 	struct xfs_mount	*mp)
 {
-	if (mp->m_sb.sb_blocksize >= mp->m_inode_cluster_size)
+	if (mp->m_sb.sb_blocksize >= mp->m_ino_geo.ig_min_cluster_size)
 		return 1;
-	return mp->m_inode_cluster_size >> mp->m_sb.sb_blocklog;
+	return mp->m_ino_geo.ig_min_cluster_size >> mp->m_sb.sb_blocklog;
 }
 
 /*
@@ -96,7 +96,7 @@ xfs_imap(
 	uint		flags);		/* flags for inode btree lookup */
 
 /*
- * Compute and fill in value of m_in_maxlevels.
+ * Compute and fill in value of m_ino_geo.ig_in_maxlevels.
  */
 void
 xfs_ialloc_compute_maxlevels(
diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
index bc2dfacd2f4a..79cc5cf21e1b 100644
--- a/fs/xfs/libxfs/xfs_ialloc_btree.c
+++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
@@ -28,7 +28,7 @@ xfs_inobt_get_minrecs(
 	struct xfs_btree_cur	*cur,
 	int			level)
 {
-	return cur->bc_mp->m_inobt_mnr[level != 0];
+	return cur->bc_mp->m_ino_geo.ig_inobt_mnr[level != 0];
 }
 
 STATIC struct xfs_btree_cur *
@@ -164,7 +164,7 @@ xfs_inobt_get_maxrecs(
 	struct xfs_btree_cur	*cur,
 	int			level)
 {
-	return cur->bc_mp->m_inobt_mxr[level != 0];
+	return cur->bc_mp->m_ino_geo.ig_inobt_mxr[level != 0];
 }
 
 STATIC void
@@ -281,10 +281,11 @@ xfs_inobt_verify(
 
 	/* level verification */
 	level = be16_to_cpu(block->bb_level);
-	if (level >= mp->m_in_maxlevels)
+	if (level >= mp->m_ino_geo.ig_in_maxlevels)
 		return __this_address;
 
-	return xfs_btree_sblock_verify(bp, mp->m_inobt_mxr[level != 0]);
+	return xfs_btree_sblock_verify(bp,
+			mp->m_ino_geo.ig_inobt_mxr[level != 0]);
 }
 
 static void
@@ -546,7 +547,7 @@ xfs_inobt_max_size(
 	xfs_agblock_t		agblocks = xfs_ag_block_count(mp, agno);
 
 	/* Bail out if we're uninitialized, which can happen in mkfs. */
-	if (mp->m_inobt_mxr[0] == 0)
+	if (mp->m_ino_geo.ig_inobt_mxr[0] == 0)
 		return 0;
 
 	/*
@@ -558,7 +559,7 @@ xfs_inobt_max_size(
 	    XFS_FSB_TO_AGNO(mp, mp->m_sb.sb_logstart) == agno)
 		agblocks -= mp->m_sb.sb_logblocks;
 
-	return xfs_btree_calc_size(mp->m_inobt_mnr,
+	return xfs_btree_calc_size(mp->m_ino_geo.ig_inobt_mnr,
 				(uint64_t)agblocks * mp->m_sb.sb_inopblock /
 					XFS_INODES_PER_CHUNK);
 }
@@ -619,5 +620,5 @@ xfs_iallocbt_calc_size(
 	struct xfs_mount	*mp,
 	unsigned long long	len)
 {
-	return xfs_btree_calc_size(mp->m_inobt_mnr, len);
+	return xfs_btree_calc_size(mp->m_ino_geo.ig_inobt_mnr, len);
 }
diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index e021d5133ccb..641aa1c2f1ae 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -36,7 +36,7 @@ xfs_inobp_check(
 	int		j;
 	xfs_dinode_t	*dip;
 
-	j = mp->m_inode_cluster_size >> mp->m_sb.sb_inodelog;
+	j = mp->m_ino_geo.ig_min_cluster_size >> mp->m_sb.sb_inodelog;
 
 	for (i = 0; i < j; i++) {
 		dip = xfs_buf_offset(bp, i * mp->m_sb.sb_inodesize);
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index e76a3e5d28d7..9416fc741788 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -804,16 +804,18 @@ const struct xfs_buf_ops xfs_sb_quiet_buf_ops = {
  */
 void
 xfs_sb_mount_common(
-	struct xfs_mount *mp,
-	struct xfs_sb	*sbp)
+	struct xfs_mount	*mp,
+	struct xfs_sb		*sbp)
 {
+	struct xfs_ino_geometry	*igeo = &mp->m_ino_geo;
+
 	mp->m_agfrotor = mp->m_agirotor = 0;
 	mp->m_maxagi = mp->m_sb.sb_agcount;
 	mp->m_blkbit_log = sbp->sb_blocklog + XFS_NBBYLOG;
 	mp->m_blkbb_log = sbp->sb_blocklog - BBSHIFT;
 	mp->m_sectbb_log = sbp->sb_sectlog - BBSHIFT;
 	mp->m_agno_log = xfs_highbit32(sbp->sb_agcount - 1) + 1;
-	mp->m_agino_log = sbp->sb_inopblog + sbp->sb_agblklog;
+	mp->m_ino_geo.ig_agino_log = sbp->sb_inopblog + sbp->sb_agblklog;
 	mp->m_blockmask = sbp->sb_blocksize - 1;
 	mp->m_blockwsize = sbp->sb_blocksize >> XFS_WORDLOG;
 	mp->m_blockwmask = mp->m_blockwsize - 1;
@@ -823,10 +825,10 @@ xfs_sb_mount_common(
 	mp->m_alloc_mnr[0] = mp->m_alloc_mxr[0] / 2;
 	mp->m_alloc_mnr[1] = mp->m_alloc_mxr[1] / 2;
 
-	mp->m_inobt_mxr[0] = xfs_inobt_maxrecs(mp, sbp->sb_blocksize, 1);
-	mp->m_inobt_mxr[1] = xfs_inobt_maxrecs(mp, sbp->sb_blocksize, 0);
-	mp->m_inobt_mnr[0] = mp->m_inobt_mxr[0] / 2;
-	mp->m_inobt_mnr[1] = mp->m_inobt_mxr[1] / 2;
+	igeo->ig_inobt_mxr[0] = xfs_inobt_maxrecs(mp, sbp->sb_blocksize, 1);
+	igeo->ig_inobt_mxr[1] = xfs_inobt_maxrecs(mp, sbp->sb_blocksize, 0);
+	igeo->ig_inobt_mnr[0] = igeo->ig_inobt_mxr[0] / 2;
+	igeo->ig_inobt_mnr[1] = igeo->ig_inobt_mxr[1] / 2;
 
 	mp->m_bmap_dmxr[0] = xfs_bmbt_maxrecs(mp, sbp->sb_blocksize, 1);
 	mp->m_bmap_dmxr[1] = xfs_bmbt_maxrecs(mp, sbp->sb_blocksize, 0);
@@ -844,14 +846,14 @@ xfs_sb_mount_common(
 	mp->m_refc_mnr[1] = mp->m_refc_mxr[1] / 2;
 
 	mp->m_bsize = XFS_FSB_TO_BB(mp, 1);
-	mp->m_ialloc_inos = max_t(uint16_t, XFS_INODES_PER_CHUNK,
+	igeo->ig_ialloc_inos = max_t(uint16_t, XFS_INODES_PER_CHUNK,
 					sbp->sb_inopblock);
-	mp->m_ialloc_blks = mp->m_ialloc_inos >> sbp->sb_inopblog;
+	igeo->ig_ialloc_blks = igeo->ig_ialloc_inos >> sbp->sb_inopblog;
 
 	if (sbp->sb_spino_align)
-		mp->m_ialloc_min_blks = sbp->sb_spino_align;
+		igeo->ig_ialloc_min_blks = sbp->sb_spino_align;
 	else
-		mp->m_ialloc_min_blks = mp->m_ialloc_blks;
+		igeo->ig_ialloc_min_blks = igeo->ig_ialloc_blks;
 	mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
 	mp->m_ag_max_usable = xfs_alloc_ag_max_usable(mp);
 }
diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
index 83f4ee2afc49..0d0d24729606 100644
--- a/fs/xfs/libxfs/xfs_trans_resv.c
+++ b/fs/xfs/libxfs/xfs_trans_resv.c
@@ -136,9 +136,10 @@ STATIC uint
 xfs_calc_inobt_res(
 	struct xfs_mount	*mp)
 {
-	return xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
-		xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
-				 XFS_FSB_TO_B(mp, 1));
+	return xfs_calc_buf_res(mp->m_ino_geo.ig_in_maxlevels,
+			XFS_FSB_TO_B(mp, 1)) +
+				xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
+			XFS_FSB_TO_B(mp, 1));
 }
 
 /*
@@ -167,7 +168,7 @@ xfs_calc_finobt_res(
  * includes:
  *
  * the allocation btrees: 2 trees * (max depth - 1) * block size
- * the inode chunk: m_ialloc_blks * N
+ * the inode chunk: m_ino_geo.ig_ialloc_blks * N
  *
  * The size N of the inode chunk reservation depends on whether it is for
  * allocation or free and which type of create transaction is in use. An inode
@@ -193,7 +194,7 @@ xfs_calc_inode_chunk_res(
 		size = XFS_FSB_TO_B(mp, 1);
 	}
 
-	res += xfs_calc_buf_res(mp->m_ialloc_blks, size);
+	res += xfs_calc_buf_res(mp->m_ino_geo.ig_ialloc_blks, size);
 	return res;
 }
 
@@ -307,7 +308,8 @@ xfs_calc_iunlink_remove_reservation(
 	struct xfs_mount        *mp)
 {
 	return xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) +
-	       2 * max_t(uint, XFS_FSB_TO_B(mp, 1), mp->m_inode_cluster_size);
+	       2 * max_t(uint, XFS_FSB_TO_B(mp, 1),
+			 mp->m_ino_geo.ig_min_cluster_size);
 }
 
 /*
@@ -345,7 +347,8 @@ STATIC uint
 xfs_calc_iunlink_add_reservation(xfs_mount_t *mp)
 {
 	return xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) +
-		max_t(uint, XFS_FSB_TO_B(mp, 1), mp->m_inode_cluster_size);
+			max_t(uint, XFS_FSB_TO_B(mp, 1),
+			      mp->m_ino_geo.ig_min_cluster_size);
 }
 
 /*
diff --git a/fs/xfs/libxfs/xfs_trans_space.h b/fs/xfs/libxfs/xfs_trans_space.h
index a62fb950bef1..65ecc600ef44 100644
--- a/fs/xfs/libxfs/xfs_trans_space.h
+++ b/fs/xfs/libxfs/xfs_trans_space.h
@@ -56,9 +56,9 @@
 #define	XFS_DIRREMOVE_SPACE_RES(mp)	\
 	XFS_DAREMOVE_SPACE_RES(mp, XFS_DATA_FORK)
 #define	XFS_IALLOC_SPACE_RES(mp)	\
-	((mp)->m_ialloc_blks + \
+	((mp)->m_ino_geo.ig_ialloc_blks + \
 	 (xfs_sb_version_hasfinobt(&mp->m_sb) ? 2 : 1 * \
-	  ((mp)->m_in_maxlevels - 1)))
+	  ((mp)->m_ino_geo.ig_in_maxlevels - 1)))
 
 /*
  * Space reservation values for various transactions.
@@ -94,7 +94,8 @@
 #define	XFS_SYMLINK_SPACE_RES(mp,nl,b)	\
 	(XFS_IALLOC_SPACE_RES(mp) + XFS_DIRENTER_SPACE_RES(mp,nl) + (b))
 #define XFS_IFREE_SPACE_RES(mp)		\
-	(xfs_sb_version_hasfinobt(&mp->m_sb) ? (mp)->m_in_maxlevels : 0)
+	(xfs_sb_version_hasfinobt(&mp->m_sb) ? \
+			(mp)->m_ino_geo.ig_in_maxlevels : 0)
 
 
 #endif	/* __XFS_TRANS_SPACE_H__ */
diff --git a/fs/xfs/libxfs/xfs_types.c b/fs/xfs/libxfs/xfs_types.c
index d51acc95bc00..cfa3f407eacd 100644
--- a/fs/xfs/libxfs/xfs_types.c
+++ b/fs/xfs/libxfs/xfs_types.c
@@ -87,14 +87,14 @@ xfs_agino_range(
 	 * Calculate the first inode, which will be in the first
 	 * cluster-aligned block after the AGFL.
 	 */
-	bno = round_up(XFS_AGFL_BLOCK(mp) + 1, mp->m_cluster_align);
+	bno = round_up(XFS_AGFL_BLOCK(mp) + 1, mp->m_ino_geo.ig_cluster_align);
 	*first = XFS_AGB_TO_AGINO(mp, bno);
 
 	/*
 	 * Calculate the last inode, which will be at the end of the
 	 * last (aligned) cluster that can be allocated in the AG.
 	 */
-	bno = round_down(eoag, mp->m_cluster_align);
+	bno = round_down(eoag, mp->m_ino_geo.ig_cluster_align);
 	*last = XFS_AGB_TO_AGINO(mp, bno) - 1;
 }
 
diff --git a/fs/xfs/scrub/ialloc.c b/fs/xfs/scrub/ialloc.c
index 9b47117180cb..fa7386bf76e9 100644
--- a/fs/xfs/scrub/ialloc.c
+++ b/fs/xfs/scrub/ialloc.c
@@ -230,7 +230,7 @@ xchk_iallocbt_check_cluster(
 	int				error = 0;
 
 	nr_inodes = min_t(unsigned int, XFS_INODES_PER_CHUNK,
-			mp->m_inodes_per_cluster);
+			mp->m_ino_geo.ig_inodes_per_cluster);
 
 	/* Map this inode cluster */
 	agbno = XFS_AGINO_TO_AGBNO(mp, irec->ir_startino + cluster_base);
@@ -251,7 +251,7 @@ xchk_iallocbt_check_cluster(
 	 */
 	ir_holemask = (irec->ir_holemask & cluster_mask);
 	imap.im_blkno = XFS_AGB_TO_DADDR(mp, agno, agbno);
-	imap.im_len = XFS_FSB_TO_BB(mp, mp->m_blocks_per_cluster);
+	imap.im_len = XFS_FSB_TO_BB(mp, mp->m_ino_geo.ig_blocks_per_cluster);
 	imap.im_boffset = XFS_INO_TO_OFFSET(mp, irec->ir_startino) <<
 			mp->m_sb.sb_inodelog;
 
@@ -276,12 +276,13 @@ xchk_iallocbt_check_cluster(
 	/* If any part of this is a hole, skip it. */
 	if (ir_holemask) {
 		xchk_xref_is_not_owned_by(bs->sc, agbno,
-				mp->m_blocks_per_cluster,
+				mp->m_ino_geo.ig_blocks_per_cluster,
 				&XFS_RMAP_OINFO_INODES);
 		return 0;
 	}
 
-	xchk_xref_is_owned_by(bs->sc, agbno, mp->m_blocks_per_cluster,
+	xchk_xref_is_owned_by(bs->sc, agbno,
+			mp->m_ino_geo.ig_blocks_per_cluster,
 			&XFS_RMAP_OINFO_INODES);
 
 	/* Grab the inode cluster buffer. */
@@ -333,7 +334,7 @@ xchk_iallocbt_check_clusters(
 	 */
 	for (cluster_base = 0;
 	     cluster_base < XFS_INODES_PER_CHUNK;
-	     cluster_base += bs->sc->mp->m_inodes_per_cluster) {
+	     cluster_base += bs->sc->mp->m_ino_geo.ig_inodes_per_cluster) {
 		error = xchk_iallocbt_check_cluster(bs, irec, cluster_base);
 		if (error)
 			break;
@@ -355,6 +356,7 @@ xchk_iallocbt_rec_alignment(
 {
 	struct xfs_mount		*mp = bs->sc->mp;
 	struct xchk_iallocbt		*iabt = bs->private;
+	struct xfs_ino_geometry		*ig = &mp->m_ino_geo;
 
 	/*
 	 * finobt records have different positioning requirements than inobt
@@ -372,7 +374,7 @@ xchk_iallocbt_rec_alignment(
 		unsigned int	imask;
 
 		imask = min_t(unsigned int, XFS_INODES_PER_CHUNK,
-				mp->m_cluster_align_inodes) - 1;
+				ig->ig_cluster_align_inodes) - 1;
 		if (irec->ir_startino & imask)
 			xchk_btree_set_corrupt(bs->sc, bs->cur, 0);
 		return;
@@ -400,17 +402,17 @@ xchk_iallocbt_rec_alignment(
 	}
 
 	/* inobt records must be aligned to cluster and inoalignmnt size. */
-	if (irec->ir_startino & (mp->m_cluster_align_inodes - 1)) {
+	if (irec->ir_startino & (ig->ig_cluster_align_inodes - 1)) {
 		xchk_btree_set_corrupt(bs->sc, bs->cur, 0);
 		return;
 	}
 
-	if (irec->ir_startino & (mp->m_inodes_per_cluster - 1)) {
+	if (irec->ir_startino & (ig->ig_inodes_per_cluster - 1)) {
 		xchk_btree_set_corrupt(bs->sc, bs->cur, 0);
 		return;
 	}
 
-	if (mp->m_inodes_per_cluster <= XFS_INODES_PER_CHUNK)
+	if (ig->ig_inodes_per_cluster <= XFS_INODES_PER_CHUNK)
 		return;
 
 	/*
@@ -419,7 +421,7 @@ xchk_iallocbt_rec_alignment(
 	 * after this one.
 	 */
 	iabt->next_startino = irec->ir_startino + XFS_INODES_PER_CHUNK;
-	iabt->next_cluster_ino = irec->ir_startino + mp->m_inodes_per_cluster;
+	iabt->next_cluster_ino = irec->ir_startino + ig->ig_inodes_per_cluster;
 }
 
 /* Scrub an inobt/finobt record. */
diff --git a/fs/xfs/scrub/quota.c b/fs/xfs/scrub/quota.c
index 5dfe2b5924db..bf4c4630e1df 100644
--- a/fs/xfs/scrub/quota.c
+++ b/fs/xfs/scrub/quota.c
@@ -144,7 +144,7 @@ xchk_quota_item(
 	if (bsoft > bhard)
 		xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, offset);
 
-	if (ihard > mp->m_maxicount)
+	if (ihard > mp->m_ino_geo.ig_maxicount)
 		xchk_fblock_set_warning(sc, XFS_DATA_FORK, offset);
 	if (isoft > ihard)
 		xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, offset);
diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index 3d0e0570e3aa..564bb64d51d3 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -251,9 +251,9 @@ xfs_growfs_data(
 	if (mp->m_sb.sb_imax_pct) {
 		uint64_t icount = mp->m_sb.sb_dblocks * mp->m_sb.sb_imax_pct;
 		do_div(icount, 100);
-		mp->m_maxicount = XFS_FSB_TO_INO(mp, icount);
+		mp->m_ino_geo.ig_maxicount = XFS_FSB_TO_INO(mp, icount);
 	} else
-		mp->m_maxicount = 0;
+		mp->m_ino_geo.ig_maxicount = 0;
 
 	/* Update secondary superblocks now the physical grow has completed */
 	error = xfs_update_secondary_sbs(mp);
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 71d216cf6f87..28ad467607cf 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2537,13 +2537,14 @@ xfs_ifree_cluster(
 	xfs_inode_log_item_t	*iip;
 	struct xfs_log_item	*lip;
 	struct xfs_perag	*pag;
+	struct xfs_ino_geometry	*igeo = &mp->m_ino_geo;
 	xfs_ino_t		inum;
 
 	inum = xic->first_ino;
 	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, inum));
-	nbufs = mp->m_ialloc_blks / mp->m_blocks_per_cluster;
+	nbufs = igeo->ig_ialloc_blks / igeo->ig_blocks_per_cluster;
 
-	for (j = 0; j < nbufs; j++, inum += mp->m_inodes_per_cluster) {
+	for (j = 0; j < nbufs; j++, inum += igeo->ig_inodes_per_cluster) {
 		/*
 		 * The allocation bitmap tells us which inodes of the chunk were
 		 * physically allocated. Skip the cluster if an inode falls into
@@ -2551,7 +2552,7 @@ xfs_ifree_cluster(
 		 */
 		ioffset = inum - xic->first_ino;
 		if ((xic->alloc & XFS_INOBT_MASK(ioffset)) == 0) {
-			ASSERT(ioffset % mp->m_inodes_per_cluster == 0);
+			ASSERT(ioffset % igeo->ig_inodes_per_cluster == 0);
 			continue;
 		}
 
@@ -2567,7 +2568,8 @@ xfs_ifree_cluster(
 		 * to mark all the active inodes on the buffer stale.
 		 */
 		bp = xfs_trans_get_buf(tp, mp->m_ddev_targp, blkno,
-					mp->m_bsize * mp->m_blocks_per_cluster,
+					mp->m_bsize *
+						igeo->ig_blocks_per_cluster,
 					XBF_UNMAPPED);
 
 		if (!bp)
@@ -2614,7 +2616,7 @@ xfs_ifree_cluster(
 		 * transaction stale above, which means there is no point in
 		 * even trying to lock them.
 		 */
-		for (i = 0; i < mp->m_inodes_per_cluster; i++) {
+		for (i = 0; i < igeo->ig_inodes_per_cluster; i++) {
 retry:
 			rcu_read_lock();
 			ip = radix_tree_lookup(&pag->pag_ici_root,
@@ -3476,19 +3478,20 @@ xfs_iflush_cluster(
 	int			cilist_size;
 	struct xfs_inode	**cilist;
 	struct xfs_inode	*cip;
+	struct xfs_ino_geometry	*igeo = &mp->m_ino_geo;
 	int			nr_found;
 	int			clcount = 0;
 	int			i;
 
 	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
 
-	inodes_per_cluster = mp->m_inode_cluster_size >> mp->m_sb.sb_inodelog;
+	inodes_per_cluster = igeo->ig_min_cluster_size >> mp->m_sb.sb_inodelog;
 	cilist_size = inodes_per_cluster * sizeof(xfs_inode_t *);
 	cilist = kmem_alloc(cilist_size, KM_MAYFAIL|KM_NOFS);
 	if (!cilist)
 		goto out_put;
 
-	mask = ~(((mp->m_inode_cluster_size >> mp->m_sb.sb_inodelog)) - 1);
+	mask = ~(((igeo->ig_min_cluster_size >> mp->m_sb.sb_inodelog)) - 1);
 	first_index = XFS_INO_TO_AGINO(mp, ip->i_ino) & mask;
 	rcu_read_lock();
 	/* really need a gang lookup range call here */
diff --git a/fs/xfs/xfs_itable.c b/fs/xfs/xfs_itable.c
index 1e1a0af1dd34..cff28ee73deb 100644
--- a/fs/xfs/xfs_itable.c
+++ b/fs/xfs/xfs_itable.c
@@ -167,6 +167,7 @@ xfs_bulkstat_ichunk_ra(
 	xfs_agnumber_t			agno,
 	struct xfs_inobt_rec_incore	*irec)
 {
+	struct xfs_ino_geometry		*igeo = &mp->m_ino_geo;
 	xfs_agblock_t			agbno;
 	struct blk_plug			plug;
 	int				i;	/* inode chunk index */
@@ -174,12 +175,14 @@ xfs_bulkstat_ichunk_ra(
 	agbno = XFS_AGINO_TO_AGBNO(mp, irec->ir_startino);
 
 	blk_start_plug(&plug);
-	for (i = 0; i < XFS_INODES_PER_CHUNK;
-	     i += mp->m_inodes_per_cluster, agbno += mp->m_blocks_per_cluster) {
-		if (xfs_inobt_maskn(i, mp->m_inodes_per_cluster) &
+	for (i = 0;
+	     i < XFS_INODES_PER_CHUNK;
+	     i += igeo->ig_inodes_per_cluster,
+			agbno += igeo->ig_blocks_per_cluster) {
+		if (xfs_inobt_maskn(i, igeo->ig_inodes_per_cluster) &
 		    ~irec->ir_free) {
 			xfs_btree_reada_bufs(mp, agno, agbno,
-					mp->m_blocks_per_cluster,
+					igeo->ig_blocks_per_cluster,
 					&xfs_inode_buf_ops);
 		}
 	}
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 9329f5adbfbe..15118e531184 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -2882,19 +2882,19 @@ xlog_recover_buffer_pass2(
 	 *
 	 * Also make sure that only inode buffers with good sizes stay in
 	 * the buffer cache.  The kernel moves inodes in buffers of 1 block
-	 * or mp->m_inode_cluster_size bytes, whichever is bigger.  The inode
+	 * or ig_min_cluster_size bytes, whichever is bigger.  The inode
 	 * buffers in the log can be a different size if the log was generated
 	 * by an older kernel using unclustered inode buffers or a newer kernel
 	 * running with a different inode cluster size.  Regardless, if the
-	 * the inode buffer size isn't max(blocksize, mp->m_inode_cluster_size)
-	 * for *our* value of mp->m_inode_cluster_size, then we need to keep
+	 * the inode buffer size isn't max(blocksize, ig_min_cluster_size)
+	 * for *our* value of ig_min_cluster_size, then we need to keep
 	 * the buffer out of the buffer cache so that the buffer won't
 	 * overlap with future reads of those inodes.
 	 */
 	if (XFS_DINODE_MAGIC ==
 	    be16_to_cpu(*((__be16 *)xfs_buf_offset(bp, 0))) &&
 	    (BBTOB(bp->b_io_length) != max(log->l_mp->m_sb.sb_blocksize,
-			(uint32_t)log->l_mp->m_inode_cluster_size))) {
+			(uint32_t)log->l_mp->m_ino_geo.ig_min_cluster_size))) {
 		xfs_buf_stale(bp);
 		error = xfs_bwrite(bp);
 	} else {
@@ -3849,6 +3849,7 @@ xlog_recover_do_icreate_pass2(
 {
 	struct xfs_mount	*mp = log->l_mp;
 	struct xfs_icreate_log	*icl;
+	struct xfs_ino_geometry	*igeo = &mp->m_ino_geo;
 	xfs_agnumber_t		agno;
 	xfs_agblock_t		agbno;
 	unsigned int		count;
@@ -3898,10 +3899,10 @@ xlog_recover_do_icreate_pass2(
 
 	/*
 	 * The inode chunk is either full or sparse and we only support
-	 * m_ialloc_min_blks sized sparse allocations at this time.
+	 * m_ino_geo.ig_ialloc_min_blks sized sparse allocations at this time.
 	 */
-	if (length != mp->m_ialloc_blks &&
-	    length != mp->m_ialloc_min_blks) {
+	if (length != igeo->ig_ialloc_blks &&
+	    length != igeo->ig_ialloc_min_blks) {
 		xfs_warn(log->l_mp,
 			 "%s: unsupported chunk length", __FUNCTION__);
 		return -EINVAL;
@@ -3921,13 +3922,13 @@ xlog_recover_do_icreate_pass2(
 	 * buffers for cancellation so we don't overwrite anything written after
 	 * a cancellation.
 	 */
-	bb_per_cluster = XFS_FSB_TO_BB(mp, mp->m_blocks_per_cluster);
-	nbufs = length / mp->m_blocks_per_cluster;
+	bb_per_cluster = XFS_FSB_TO_BB(mp, igeo->ig_blocks_per_cluster);
+	nbufs = length / igeo->ig_blocks_per_cluster;
 	for (i = 0, cancel_count = 0; i < nbufs; i++) {
 		xfs_daddr_t	daddr;
 
-		daddr = XFS_AGB_TO_DADDR(mp, agno,
-					 agbno + i * mp->m_blocks_per_cluster);
+		daddr = XFS_AGB_TO_DADDR(mp, agno, agbno +
+				i * igeo->ig_blocks_per_cluster);
 		if (xlog_check_buffer_cancelled(log, daddr, bb_per_cluster, 0))
 			cancel_count++;
 	}
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 6b2bfe81dc51..17c47682609b 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -433,10 +433,12 @@ xfs_update_alignment(xfs_mount_t *mp)
  * Set the maximum inode count for this filesystem
  */
 STATIC void
-xfs_set_maxicount(xfs_mount_t *mp)
+xfs_set_maxicount(
+	struct xfs_mount	*mp)
 {
-	xfs_sb_t	*sbp = &(mp->m_sb);
-	uint64_t	icount;
+	struct xfs_sb		*sbp = &(mp->m_sb);
+	struct xfs_ino_geometry	*igeo = &mp->m_ino_geo;
+	uint64_t		icount;
 
 	if (sbp->sb_imax_pct) {
 		/*
@@ -445,11 +447,11 @@ xfs_set_maxicount(xfs_mount_t *mp)
 		 */
 		icount = sbp->sb_dblocks * sbp->sb_imax_pct;
 		do_div(icount, 100);
-		do_div(icount, mp->m_ialloc_blks);
-		mp->m_maxicount = (icount * mp->m_ialloc_blks)  <<
-				   sbp->sb_inopblog;
+		do_div(icount, igeo->ig_ialloc_blks);
+		igeo->ig_maxicount = XFS_FSB_TO_INO(mp,
+				icount * igeo->ig_ialloc_blks);
 	} else {
-		mp->m_maxicount = 0;
+		igeo->ig_maxicount = 0;
 	}
 }
 
@@ -518,18 +520,18 @@ xfs_set_inoalignment(xfs_mount_t *mp)
 {
 	if (xfs_sb_version_hasalign(&mp->m_sb) &&
 		mp->m_sb.sb_inoalignmt >= xfs_icluster_size_fsb(mp))
-		mp->m_inoalign_mask = mp->m_sb.sb_inoalignmt - 1;
+		mp->m_ino_geo.ig_inoalign_mask = mp->m_sb.sb_inoalignmt - 1;
 	else
-		mp->m_inoalign_mask = 0;
+		mp->m_ino_geo.ig_inoalign_mask = 0;
 	/*
 	 * If we are using stripe alignment, check whether
 	 * the stripe unit is a multiple of the inode alignment
 	 */
-	if (mp->m_dalign && mp->m_inoalign_mask &&
-	    !(mp->m_dalign & mp->m_inoalign_mask))
-		mp->m_sinoalign = mp->m_dalign;
+	if (mp->m_dalign && mp->m_ino_geo.ig_inoalign_mask &&
+	    !(mp->m_dalign & mp->m_ino_geo.ig_inoalign_mask))
+		mp->m_ino_geo.ig_sinoalign = mp->m_dalign;
 	else
-		mp->m_sinoalign = 0;
+		mp->m_ino_geo.ig_sinoalign = 0;
 }
 
 /*
@@ -683,6 +685,7 @@ xfs_mountfs(
 {
 	struct xfs_sb		*sbp = &(mp->m_sb);
 	struct xfs_inode	*rip;
+	struct xfs_ino_geometry	*igeo = &mp->m_ino_geo;
 	uint64_t		resblks;
 	uint			quotamount = 0;
 	uint			quotaflags = 0;
@@ -797,18 +800,20 @@ xfs_mountfs(
 	 * has set the inode alignment value appropriately for larger cluster
 	 * sizes.
 	 */
-	mp->m_inode_cluster_size = XFS_INODE_BIG_CLUSTER_SIZE;
+	igeo->ig_min_cluster_size = XFS_INODE_BIG_CLUSTER_SIZE;
 	if (xfs_sb_version_hascrc(&mp->m_sb)) {
-		int	new_size = mp->m_inode_cluster_size;
+		int	new_size = igeo->ig_min_cluster_size;
 
 		new_size *= mp->m_sb.sb_inodesize / XFS_DINODE_MIN_SIZE;
 		if (mp->m_sb.sb_inoalignmt >= XFS_B_TO_FSBT(mp, new_size))
-			mp->m_inode_cluster_size = new_size;
+			igeo->ig_min_cluster_size = new_size;
 	}
-	mp->m_blocks_per_cluster = xfs_icluster_size_fsb(mp);
-	mp->m_inodes_per_cluster = XFS_FSB_TO_INO(mp, mp->m_blocks_per_cluster);
-	mp->m_cluster_align = xfs_ialloc_cluster_alignment(mp);
-	mp->m_cluster_align_inodes = XFS_FSB_TO_INO(mp, mp->m_cluster_align);
+	igeo->ig_blocks_per_cluster = xfs_icluster_size_fsb(mp);
+	igeo->ig_inodes_per_cluster = XFS_FSB_TO_INO(mp,
+			igeo->ig_blocks_per_cluster);
+	igeo->ig_cluster_align = xfs_ialloc_cluster_alignment(mp);
+	igeo->ig_cluster_align_inodes = XFS_FSB_TO_INO(mp,
+			igeo->ig_cluster_align);
 
 	/*
 	 * If enabled, sparse inode chunk alignment is expected to match the
@@ -817,11 +822,11 @@ xfs_mountfs(
 	 */
 	if (xfs_sb_version_hassparseinodes(&mp->m_sb) &&
 	    mp->m_sb.sb_spino_align !=
-			XFS_B_TO_FSBT(mp, mp->m_inode_cluster_size)) {
+			XFS_B_TO_FSBT(mp, igeo->ig_min_cluster_size)) {
 		xfs_warn(mp,
 	"Sparse inode block alignment (%u) must match cluster size (%llu).",
 			 mp->m_sb.sb_spino_align,
-			 XFS_B_TO_FSBT(mp, mp->m_inode_cluster_size));
+			 XFS_B_TO_FSBT(mp, igeo->ig_min_cluster_size));
 		error = -EINVAL;
 		goto out_remove_uuid;
 	}
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index c81a5cd7c228..89cbb1268b63 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -105,6 +105,7 @@ typedef struct xfs_mount {
 	struct xfs_da_geometry	*m_dir_geo;	/* directory block geometry */
 	struct xfs_da_geometry	*m_attr_geo;	/* attribute block geometry */
 	struct xlog		*m_log;		/* log specific stuff */
+	struct xfs_ino_geometry	m_ino_geo;	/* inode geometry */
 	int			m_logbufs;	/* number of log buffers */
 	int			m_logbsize;	/* size of each log buffer */
 	uint			m_rsumlevels;	/* rt summary levels */
@@ -126,12 +127,6 @@ typedef struct xfs_mount {
 	uint8_t			m_blkbit_log;	/* blocklog + NBBY */
 	uint8_t			m_blkbb_log;	/* blocklog - BBSHIFT */
 	uint8_t			m_agno_log;	/* log #ag's */
-	uint8_t			m_agino_log;	/* #bits for agino in inum */
-	uint			m_inode_cluster_size;/* min inode buf size */
-	unsigned int		m_inodes_per_cluster;
-	unsigned int		m_blocks_per_cluster;
-	unsigned int		m_cluster_align;
-	unsigned int		m_cluster_align_inodes;
 	uint			m_blockmask;	/* sb_blocksize-1 */
 	uint			m_blockwsize;	/* sb_blocksize in words */
 	uint			m_blockwmask;	/* blockwsize-1 */
@@ -139,15 +134,12 @@ typedef struct xfs_mount {
 	uint			m_alloc_mnr[2];	/* min alloc btree records */
 	uint			m_bmap_dmxr[2];	/* max bmap btree records */
 	uint			m_bmap_dmnr[2];	/* min bmap btree records */
-	uint			m_inobt_mxr[2];	/* max inobt btree records */
-	uint			m_inobt_mnr[2];	/* min inobt btree records */
 	uint			m_rmap_mxr[2];	/* max rmap btree records */
 	uint			m_rmap_mnr[2];	/* min rmap btree records */
 	uint			m_refc_mxr[2];	/* max refc btree records */
 	uint			m_refc_mnr[2];	/* min refc btree records */
 	uint			m_ag_maxlevels;	/* XFS_AG_MAXLEVELS */
 	uint			m_bm_maxlevels[2]; /* XFS_BM_MAXLEVELS */
-	uint			m_in_maxlevels;	/* max inobt btree levels. */
 	uint			m_rmap_maxlevels; /* max rmap btree levels */
 	uint			m_refc_maxlevels; /* max refcount btree level */
 	xfs_extlen_t		m_ag_prealloc_blocks; /* reserved ag blocks */
@@ -159,20 +151,13 @@ typedef struct xfs_mount {
 	int			m_fixedfsid[2];	/* unchanged for life of FS */
 	uint64_t		m_flags;	/* global mount flags */
 	bool			m_finobt_nores; /* no per-AG finobt resv. */
-	int			m_ialloc_inos;	/* inodes in inode allocation */
-	int			m_ialloc_blks;	/* blocks in inode allocation */
-	int			m_ialloc_min_blks;/* min blocks in sparse inode
-						   * allocation */
-	int			m_inoalign_mask;/* mask sb_inoalignmt if used */
 	uint			m_qflags;	/* quota status flags */
 	struct xfs_trans_resv	m_resv;		/* precomputed res values */
-	uint64_t		m_maxicount;	/* maximum inode count */
 	uint64_t		m_resblks;	/* total reserved blocks */
 	uint64_t		m_resblks_avail;/* available reserved blocks */
 	uint64_t		m_resblks_save;	/* reserved blks @ remount,ro */
 	int			m_dalign;	/* stripe unit */
 	int			m_swidth;	/* stripe width */
-	int			m_sinoalign;	/* stripe unit inode alignment */
 	uint8_t			m_sectbb_log;	/* sectlog - BBSHIFT */
 	const struct xfs_nameops *m_dirnameops;	/* vector of dir name ops */
 	const struct xfs_dir_ops *m_dir_inode_ops; /* vector of dir inode ops */
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index a14d11d78bd8..4b44c55e3022 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -582,7 +582,7 @@ xfs_set_inode_alloc(
 	 * Calculate how much should be reserved for inodes to meet
 	 * the max inode percentage.  Used only for inode32.
 	 */
-	if (mp->m_maxicount) {
+	if (mp->m_ino_geo.ig_maxicount) {
 		uint64_t	icount;
 
 		icount = sbp->sb_dblocks * sbp->sb_imax_pct;
@@ -1131,10 +1131,10 @@ xfs_fs_statfs(
 
 	fakeinos = XFS_FSB_TO_INO(mp, statp->f_bfree);
 	statp->f_files = min(icount + fakeinos, (uint64_t)XFS_MAXINUMBER);
-	if (mp->m_maxicount)
+	if (mp->m_ino_geo.ig_maxicount)
 		statp->f_files = min_t(typeof(statp->f_files),
 					statp->f_files,
-					mp->m_maxicount);
+					mp->m_ino_geo.ig_maxicount);
 
 	/* If sb_icount overshot maxicount, report actual allocation */
 	statp->f_files = max_t(typeof(statp->f_files),

* [PATCH 02/11] xfs: create simplified inode walk function
  2019-05-29 22:26 [PATCH 00/11] xfs: refactor and improve inode iteration Darrick J. Wong
  2019-05-29 22:26 ` [PATCH 01/11] xfs: separate inode geometry Darrick J. Wong
@ 2019-05-29 22:26 ` Darrick J. Wong
  2019-06-04  7:41   ` Dave Chinner
  2019-05-29 22:26 ` [PATCH 03/11] xfs: convert quotacheck to use the new iwalk functions Darrick J. Wong
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 19+ messages in thread
From: Darrick J. Wong @ 2019-05-29 22:26 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
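A minimal usage sketch (the counting callback below is hypothetical,
not part of this patch):

	/* Hypothetical walk function: tally allocated inodes. */
	static int
	xfs_count_one_inode(
		struct xfs_mount	*mp,
		struct xfs_trans	*tp,
		xfs_ino_t		ino,
		void			*data)
	{
		uint64_t		*count = data;

		(*count)++;
		return 0;	/* or XFS_IWALK_ABORT to stop early */
	}

	/* Walk every allocated inode with the default prefetch. */
	uint64_t	count = 0;
	int		error;

	error = xfs_iwalk(mp, NULL, 0, xfs_count_one_inode, 0, &count);
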
 fs/xfs/Makefile     |    1 
 fs/xfs/xfs_itable.c |    5 -
 fs/xfs/xfs_itable.h |    8 +
 fs/xfs/xfs_iwalk.c  |  402 +++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_iwalk.h  |   18 ++
 fs/xfs/xfs_trace.h  |   40 +++++
 6 files changed, 472 insertions(+), 2 deletions(-)
 create mode 100644 fs/xfs/xfs_iwalk.c
 create mode 100644 fs/xfs/xfs_iwalk.h


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 91831975363b..74d30ef0dbce 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -80,6 +80,7 @@ xfs-y				+= xfs_aops.o \
 				   xfs_iops.o \
 				   xfs_inode.o \
 				   xfs_itable.o \
+				   xfs_iwalk.o \
 				   xfs_message.o \
 				   xfs_mount.o \
 				   xfs_mru_cache.o \
diff --git a/fs/xfs/xfs_itable.c b/fs/xfs/xfs_itable.c
index cff28ee73deb..96590d9f917c 100644
--- a/fs/xfs/xfs_itable.c
+++ b/fs/xfs/xfs_itable.c
@@ -19,6 +19,7 @@
 #include "xfs_trace.h"
 #include "xfs_icache.h"
 #include "xfs_health.h"
+#include "xfs_iwalk.h"
 
 /*
  * Return stat information for one inode.
@@ -161,7 +162,7 @@ xfs_bulkstat_one(
  * Loop over all clusters in a chunk for a given incore inode allocation btree
  * record.  Do a readahead if there are any allocated inodes in that cluster.
  */
-STATIC void
+void
 xfs_bulkstat_ichunk_ra(
 	struct xfs_mount		*mp,
 	xfs_agnumber_t			agno,
@@ -195,7 +196,7 @@ xfs_bulkstat_ichunk_ra(
  * are some left allocated, update the data for the pointed-to record as well as
  * return the count of grabbed inodes.
  */
-STATIC int
+int
 xfs_bulkstat_grab_ichunk(
 	struct xfs_btree_cur		*cur,	/* btree cursor */
 	xfs_agino_t			agino,	/* starting inode of chunk */
diff --git a/fs/xfs/xfs_itable.h b/fs/xfs/xfs_itable.h
index 8a822285b671..369e3f159d4e 100644
--- a/fs/xfs/xfs_itable.h
+++ b/fs/xfs/xfs_itable.h
@@ -84,4 +84,12 @@ xfs_inumbers(
 	void			__user *buffer, /* buffer with inode info */
 	inumbers_fmt_pf		formatter);
 
+/* Temporarily needed while we refactor functions. */
+struct xfs_btree_cur;
+struct xfs_inobt_rec_incore;
+void xfs_bulkstat_ichunk_ra(struct xfs_mount *mp, xfs_agnumber_t agno,
+		struct xfs_inobt_rec_incore *irec);
+int xfs_bulkstat_grab_ichunk(struct xfs_btree_cur *cur, xfs_agino_t agino,
+		int *icount, struct xfs_inobt_rec_incore *irec);
+
 #endif	/* __XFS_ITABLE_H__ */
diff --git a/fs/xfs/xfs_iwalk.c b/fs/xfs/xfs_iwalk.c
new file mode 100644
index 000000000000..0ce3baa159ba
--- /dev/null
+++ b/fs/xfs/xfs_iwalk.c
@@ -0,0 +1,402 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Copyright (C) 2019 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_inode.h"
+#include "xfs_btree.h"
+#include "xfs_ialloc.h"
+#include "xfs_ialloc_btree.h"
+#include "xfs_itable.h"
+#include "xfs_error.h"
+#include "xfs_trace.h"
+#include "xfs_icache.h"
+#include "xfs_health.h"
+#include "xfs_trans.h"
+#include "xfs_iwalk.h"
+
+/*
+ * Walking All the Inodes in the Filesystem
+ * ========================================
+ * Starting at some @startino, call a walk function on every allocated inode in
+ * the system.  The walk function is called with the relevant inode number and
+ * a pointer to caller-provided data.  The walk function can return the usual
+ * negative error code, 0, or XFS_IWALK_ABORT to stop the iteration.  This
+ * return value is returned to the caller.
+ *
+ * Internally, we allow the walk function to do anything, which means that we
+ * cannot maintain the inobt cursor or our lock on the AGI buffer.  We
+ * therefore build up a batch of inobt records in kernel memory and only call
+ * the walk function when our memory buffer is full.
+ */
+
+struct xfs_iwalk_ag {
+	struct xfs_mount		*mp;
+	struct xfs_trans		*tp;
+
+	/* Where do we start the traversal? */
+	xfs_ino_t			startino;
+
+	/* Array of inobt records we cache. */
+	struct xfs_inobt_rec_incore	*recs;
+	unsigned int			sz_recs;
+	unsigned int			nr_recs;
+
+	/* Inode walk function and data pointer. */
+	xfs_iwalk_fn			iwalk_fn;
+	void				*data;
+};
+
+/* Allocate memory for a walk. */
+STATIC int
+xfs_iwalk_allocbuf(
+	struct xfs_iwalk_ag	*iwag)
+{
+	size_t			size;
+
+	ASSERT(iwag->recs == NULL);
+	iwag->nr_recs = 0;
+
+	/* Allocate a prefetch buffer for inobt records. */
+	size = iwag->sz_recs * sizeof(struct xfs_inobt_rec_incore);
+	iwag->recs = kmem_alloc(size, KM_SLEEP);
+	if (iwag->recs == NULL)
+		return -ENOMEM;
+
+	return 0;
+}
+
+/* Free memory we allocated for a walk. */
+STATIC void
+xfs_iwalk_freebuf(
+	struct xfs_iwalk_ag	*iwag)
+{
+	ASSERT(iwag->recs != NULL);
+	kmem_free(iwag->recs);
+}
+
+/* For each inuse inode in each cached inobt record, call our function. */
+STATIC int
+xfs_iwalk_ag_recs(
+	struct xfs_iwalk_ag		*iwag)
+{
+	struct xfs_mount		*mp = iwag->mp;
+	struct xfs_trans		*tp = iwag->tp;
+	struct xfs_inobt_rec_incore	*irec;
+	xfs_ino_t			ino;
+	unsigned int			i, j;
+	xfs_agnumber_t			agno;
+	int				error;
+
+	agno = XFS_INO_TO_AGNO(mp, iwag->startino);
+	for (i = 0, irec = iwag->recs; i < iwag->nr_recs; i++, irec++) {
+		trace_xfs_iwalk_ag_rec(mp, agno, irec->ir_startino,
+				irec->ir_free);
+		for (j = 0; j < XFS_INODES_PER_CHUNK; j++) {
+			/* Skip if this inode is free */
+			if (XFS_INOBT_MASK(j) & irec->ir_free)
+				continue;
+
+			/* Otherwise call our function. */
+			ino = XFS_AGINO_TO_INO(mp, agno, irec->ir_startino + j);
+			error = iwag->iwalk_fn(mp, tp, ino, iwag->data);
+			if (error)
+				return error;
+		}
+	}
+
+	iwag->nr_recs = 0;
+	return 0;
+}
+
+/* Read AGI and create inobt cursor. */
+static inline int
+xfs_iwalk_inobt_cur(
+	struct xfs_mount	*mp,
+	struct xfs_trans	*tp,
+	xfs_agnumber_t		agno,
+	struct xfs_btree_cur	**curpp,
+	struct xfs_buf		**agi_bpp)
+{
+	struct xfs_btree_cur	*cur;
+	int			error;
+
+	ASSERT(*agi_bpp == NULL);
+
+	error = xfs_ialloc_read_agi(mp, tp, agno, agi_bpp);
+	if (error)
+		return error;
+
+	cur = xfs_inobt_init_cursor(mp, tp, *agi_bpp, agno, XFS_BTNUM_INO);
+	if (!cur)
+		return -ENOMEM;
+	*curpp = cur;
+	return 0;
+}
+
+/* Delete cursor and let go of AGI. */
+static inline void
+xfs_iwalk_del_inobt(
+	struct xfs_trans	*tp,
+	struct xfs_btree_cur	**curpp,
+	struct xfs_buf		**agi_bpp,
+	int			error)
+{
+	if (*curpp) {
+		xfs_btree_del_cursor(*curpp, error);
+		*curpp = NULL;
+	}
+	if (*agi_bpp) {
+		xfs_trans_brelse(tp, *agi_bpp);
+		*agi_bpp = NULL;
+	}
+}
+
+/*
+ * Set ourselves up for walking inobt records starting from a given point in
+ * the filesystem.
+ *
+ * If caller passed in a nonzero start inode number, load the record from the
+ * inobt and make the record look like all the inodes before agino are free so
+ * that we skip them, and then move the cursor to the next inobt record.  This
+ * is how we support starting an iwalk in the middle of an inode chunk.
+ *
+ * If the caller passed in a start number of zero, move the cursor to the first
+ * inobt record.
+ *
+ * The caller is responsible for cleaning up the cursor and buffer pointer
+ * regardless of the error status.
+ */
+STATIC int
+xfs_iwalk_ag_start(
+	struct xfs_iwalk_ag	*iwag,
+	xfs_agnumber_t		agno,
+	xfs_agino_t		agino,
+	struct xfs_btree_cur	**curpp,
+	struct xfs_buf		**agi_bpp,
+	int			*has_more)
+{
+	struct xfs_mount	*mp = iwag->mp;
+	struct xfs_trans	*tp = iwag->tp;
+	int			icount;
+	int			error;
+
+	/* Set up a fresh cursor and empty the inobt cache. */
+	iwag->nr_recs = 0;
+	error = xfs_iwalk_inobt_cur(mp, tp, agno, curpp, agi_bpp);
+	if (error)
+		return error;
+
+	/* Starting at the beginning of the AG?  That's easy! */
+	if (agino == 0)
+		return xfs_inobt_lookup(*curpp, 0, XFS_LOOKUP_GE, has_more);
+
+	/*
+	 * Otherwise, we have to grab the inobt record where we left off, stuff
+	 * the record into our cache, and then see if there are more records.
+	 * We require a lookup cache of at least two elements so that we don't
+	 * have to deal with tearing down the cursor to walk the records.
+	 */
+	error = xfs_bulkstat_grab_ichunk(*curpp, agino - 1, &icount,
+			&iwag->recs[iwag->nr_recs]);
+	if (error)
+		return error;
+	if (icount)
+		iwag->nr_recs++;
+
+	ASSERT(iwag->nr_recs < iwag->sz_recs);
+	return xfs_btree_increment(*curpp, 0, has_more);
+}
+
+typedef int (*xfs_iwalk_ag_recs_fn)(struct xfs_iwalk_ag *iwag);
+
+/*
+ * Acknowledge that we added an inobt record to the cache.  Flush the inobt
+ * record cache if the buffer is full, and position the cursor wherever it
+ * needs to be so that we can keep going.
+ */
+STATIC int
+xfs_iwalk_ag_increment(
+	struct xfs_iwalk_ag		*iwag,
+	xfs_iwalk_ag_recs_fn		walk_ag_recs_fn,
+	xfs_agnumber_t			agno,
+	struct xfs_btree_cur		**curpp,
+	struct xfs_buf			**agi_bpp,
+	int				*has_more)
+{
+	struct xfs_mount		*mp = iwag->mp;
+	struct xfs_trans		*tp = iwag->tp;
+	struct xfs_inobt_rec_incore	*irec;
+	xfs_agino_t			restart;
+	int				error;
+
+	iwag->nr_recs++;
+
+	/* If there's space, just increment and look for more records. */
+	if (iwag->nr_recs < iwag->sz_recs)
+		return xfs_btree_increment(*curpp, 0, has_more);
+
+	/*
+	 * Otherwise the record cache is full; delete the cursor and walk the
+	 * records...
+	 */
+	xfs_iwalk_del_inobt(tp, curpp, agi_bpp, 0);
+	irec = &iwag->recs[iwag->nr_recs - 1];
+	restart = irec->ir_startino + XFS_INODES_PER_CHUNK - 1;
+
+	error = walk_ag_recs_fn(iwag);
+	if (error)
+		return error;
+
+	/* ...and recreate cursor where we left off. */
+	error = xfs_iwalk_inobt_cur(mp, tp, agno, curpp, agi_bpp);
+	if (error)
+		return error;
+
+	return xfs_inobt_lookup(*curpp, restart, XFS_LOOKUP_GE, has_more);
+}
+
+/* Walk all inodes in a single AG, from @iwag->startino to the end of the AG. */
+STATIC int
+xfs_iwalk_ag(
+	struct xfs_iwalk_ag		*iwag)
+{
+	struct xfs_mount		*mp = iwag->mp;
+	struct xfs_trans		*tp = iwag->tp;
+	struct xfs_buf			*agi_bp = NULL;
+	struct xfs_btree_cur		*cur = NULL;
+	xfs_agnumber_t			agno;
+	xfs_agino_t			agino;
+	int				has_more;
+	int				error = 0;
+
+	/* Set up our cursor at the right place in the inode btree. */
+	agno = XFS_INO_TO_AGNO(mp, iwag->startino);
+	agino = XFS_INO_TO_AGINO(mp, iwag->startino);
+	error = xfs_iwalk_ag_start(iwag, agno, agino, &cur, &agi_bp, &has_more);
+	if (error)
+		goto out_cur;
+
+	while (has_more) {
+		struct xfs_inobt_rec_incore	*irec;
+
+		/* Fetch the inobt record. */
+		irec = &iwag->recs[iwag->nr_recs];
+		error = xfs_inobt_get_rec(cur, irec, &has_more);
+		if (error)
+			goto out_cur;
+		if (!has_more)
+			break;
+
+		/* No allocated inodes in this chunk; skip it. */
+		if (irec->ir_freecount == irec->ir_count) {
+			error = xfs_btree_increment(cur, 0, &has_more);
+			goto next_loop;
+		}
+
+		/*
+		 * Start readahead for this inode chunk in anticipation of
+		 * walking the inodes.
+		 */
+		xfs_bulkstat_ichunk_ra(mp, agno, irec);
+
+		/*
+		 * Add this inobt record to our cache, flush the cache if
+		 * needed, and move on to the next record.
+		 */
+		error = xfs_iwalk_ag_increment(iwag, xfs_iwalk_ag_recs, agno,
+				&cur, &agi_bp, &has_more);
+next_loop:
+		if (error)
+			goto out_cur;
+		cond_resched();
+	}
+
+	/* Walk any records left behind in the cache. */
+	if (iwag->nr_recs) {
+		xfs_iwalk_del_inobt(tp, &cur, &agi_bp, error);
+		return xfs_iwalk_ag_recs(iwag);
+	}
+
+out_cur:
+	xfs_iwalk_del_inobt(tp, &cur, &agi_bp, error);
+	return error;
+}
+
+/*
+ * Given the number of inodes to prefetch, set the number of inobt records that
+ * we cache in memory, which controls the number of inodes we try to read
+ * ahead.
+ *
+ * If no max prefetch was given, default to one page's worth of inobt records;
+ * this should be plenty of inodes to read ahead.
+ */
+static inline void
+xfs_iwalk_set_prefetch(
+	struct xfs_iwalk_ag	*iwag,
+	unsigned int		max_prefetch)
+{
+	if (max_prefetch)
+		iwag->sz_recs = round_up(max_prefetch, XFS_INODES_PER_CHUNK) /
+					XFS_INODES_PER_CHUNK;
+	else
+		iwag->sz_recs = PAGE_SIZE / sizeof(struct xfs_inobt_rec_incore);
+
+	/*
+	 * Allocate enough space to prefetch at least two records so that we
+	 * can cache both the inobt record where the iwalk started and the next
+	 * record.  This simplifies the AG inode walk loop setup code.
+	 */
+	if (iwag->sz_recs < 2)
+		iwag->sz_recs = 2;
+}
+
+/*
+ * Walk all inodes in the filesystem starting from @startino.  The @iwalk_fn
+ * will be called for each allocated inode, being passed the inode's number and
+ * @data.  @max_prefetch controls how many inobt records' worth of inodes we
+ * try to readahead.
+ */
+int
+xfs_iwalk(
+	struct xfs_mount	*mp,
+	struct xfs_trans	*tp,
+	xfs_ino_t		startino,
+	xfs_iwalk_fn		iwalk_fn,
+	unsigned int		max_prefetch,
+	void			*data)
+{
+	struct xfs_iwalk_ag	iwag = {
+		.mp		= mp,
+		.tp		= tp,
+		.iwalk_fn	= iwalk_fn,
+		.data		= data,
+		.startino	= startino,
+	};
+	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, startino);
+	int			error;
+
+	ASSERT(agno < mp->m_sb.sb_agcount);
+
+	xfs_iwalk_set_prefetch(&iwag, max_prefetch);
+	error = xfs_iwalk_allocbuf(&iwag);
+	if (error)
+		return error;
+
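+	/* Walk each AG in order, restarting at the next AG's first inode. */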
+	for (; agno < mp->m_sb.sb_agcount; agno++) {
+		error = xfs_iwalk_ag(&iwag);
+		if (error)
+			break;
+		iwag.startino = XFS_AGINO_TO_INO(mp, agno + 1, 0);
+	}
+
+	xfs_iwalk_freebuf(&iwag);
+	return error;
+}
diff --git a/fs/xfs/xfs_iwalk.h b/fs/xfs/xfs_iwalk.h
new file mode 100644
index 000000000000..45b1baabcd2d
--- /dev/null
+++ b/fs/xfs/xfs_iwalk.h
@@ -0,0 +1,18 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Copyright (C) 2019 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ */
+#ifndef __XFS_IWALK_H__
+#define __XFS_IWALK_H__
+
+/* Walk all inodes in the filesystem starting from @startino. */
+typedef int (*xfs_iwalk_fn)(struct xfs_mount *mp, struct xfs_trans *tp,
+			    xfs_ino_t ino, void *data);
+/* Return value (for xfs_iwalk_fn) that aborts the walk immediately. */
+#define XFS_IWALK_ABORT	(1)
+
+int xfs_iwalk(struct xfs_mount *mp, struct xfs_trans *tp, xfs_ino_t startino,
+		xfs_iwalk_fn iwalk_fn, unsigned int max_prefetch, void *data);
+
+#endif /* __XFS_IWALK_H__ */
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 2464ea351f83..a2881659f776 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -3516,6 +3516,46 @@ DEFINE_EVENT(xfs_inode_corrupt_class, name,	\
 DEFINE_INODE_CORRUPT_EVENT(xfs_inode_mark_sick);
 DEFINE_INODE_CORRUPT_EVENT(xfs_inode_mark_healthy);
 
+TRACE_EVENT(xfs_iwalk_ag,
+	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
+		 xfs_agino_t startino),
+	TP_ARGS(mp, agno, startino),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_agnumber_t, agno)
+		__field(xfs_agino_t, startino)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->agno = agno;
+		__entry->startino = startino;
+	),
+	TP_printk("dev %d:%d agno %d startino %u",
+		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->agno,
+		  __entry->startino)
+)
+
+TRACE_EVENT(xfs_iwalk_ag_rec,
+	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno,
+		 xfs_agino_t startino, uint64_t freemask),
+	TP_ARGS(mp, agno, startino, freemask),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_agnumber_t, agno)
+		__field(xfs_agino_t, startino)
+		__field(uint64_t, freemask)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->agno = agno;
+		__entry->startino = startino;
+		__entry->freemask = freemask;
+	),
+	TP_printk("dev %d:%d agno %d startino %u freemask 0x%llx",
+		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->agno,
+		  __entry->startino, __entry->freemask)
+)
+
 #endif /* _TRACE_XFS_H */
 
 #undef TRACE_INCLUDE_PATH

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 03/11] xfs: convert quotacheck to use the new iwalk functions
  2019-05-29 22:26 [PATCH 00/11] xfs: refactor and improve inode iteration Darrick J. Wong
  2019-05-29 22:26 ` [PATCH 01/11] xfs: separate inode geometry Darrick J. Wong
  2019-05-29 22:26 ` [PATCH 02/11] xfs: create simplified inode walk function Darrick J. Wong
@ 2019-05-29 22:26 ` Darrick J. Wong
  2019-06-04  7:52   ` Dave Chinner
  2019-05-29 22:26 ` [PATCH 04/11] xfs: bulkstat should copy lastip whenever userspace supplies one Darrick J. Wong
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 19+ messages in thread
From: Darrick J. Wong @ 2019-05-29 22:26 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Convert quotacheck to use the new iwalk iterator to dig through the
inodes.
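
With the new interface, the walk callback receives (mp, tp, ino, data)
and returns 0 to keep going, or a nonzero value (a negative error code,
or XFS_IWALK_ABORT) to stop the walk.  A sketch of the shape, for
illustration only:

	STATIC int
	example_walk_fn(
		struct xfs_mount	*mp,
		struct xfs_trans	*tp,
		xfs_ino_t		ino,
		void			*data)
	{
		/* grab the inode, adjust counters, release the inode */
		return 0;
	}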

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_qm.c |   62 ++++++++++++++++++-------------------------------------
 1 file changed, 20 insertions(+), 42 deletions(-)


diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
index aa6b6db3db0e..a5b2260406a8 100644
--- a/fs/xfs/xfs_qm.c
+++ b/fs/xfs/xfs_qm.c
@@ -26,6 +26,7 @@
 #include "xfs_trace.h"
 #include "xfs_icache.h"
 #include "xfs_cksum.h"
+#include "xfs_iwalk.h"
 
 /*
  * The global quota manager. There is only one of these for the entire
@@ -1118,17 +1119,15 @@ xfs_qm_quotacheck_dqadjust(
 /* ARGSUSED */
 STATIC int
 xfs_qm_dqusage_adjust(
-	xfs_mount_t	*mp,		/* mount point for filesystem */
-	xfs_ino_t	ino,		/* inode number to get data for */
-	void		__user *buffer,	/* not used */
-	int		ubsize,		/* not used */
-	int		*ubused,	/* not used */
-	int		*res)		/* result code value */
+	struct xfs_mount	*mp,
+	struct xfs_trans	*tp,
+	xfs_ino_t		ino,
+	void			*data)
 {
-	xfs_inode_t	*ip;
-	xfs_qcnt_t	nblks;
-	xfs_filblks_t	rtblks = 0;	/* total rt blks */
-	int		error;
+	struct xfs_inode	*ip;
+	xfs_qcnt_t		nblks;
+	xfs_filblks_t		rtblks = 0;	/* total rt blks */
+	int			error;
 
 	ASSERT(XFS_IS_QUOTA_RUNNING(mp));
 
@@ -1136,20 +1135,18 @@ xfs_qm_dqusage_adjust(
 	 * rootino must have its resources accounted for, not so with the quota
 	 * inodes.
 	 */
-	if (xfs_is_quota_inode(&mp->m_sb, ino)) {
-		*res = BULKSTAT_RV_NOTHING;
-		return -EINVAL;
-	}
+	if (xfs_is_quota_inode(&mp->m_sb, ino))
+		return 0;
 
 	/*
 	 * We don't _need_ to take the ilock EXCL here because quotacheck runs
 	 * at mount time and therefore nobody will be racing chown/chproj.
 	 */
-	error = xfs_iget(mp, NULL, ino, XFS_IGET_DONTCACHE, 0, &ip);
-	if (error) {
-		*res = BULKSTAT_RV_NOTHING;
+	error = xfs_iget(mp, tp, ino, XFS_IGET_DONTCACHE, 0, &ip);
+	if (error == -EINVAL || error == -ENOENT)
+		return 0;
+	if (error)
 		return error;
-	}
 
 	ASSERT(ip->i_delayed_blks == 0);
 
@@ -1157,7 +1154,7 @@ xfs_qm_dqusage_adjust(
 		struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, XFS_DATA_FORK);
 
 		if (!(ifp->if_flags & XFS_IFEXTENTS)) {
-			error = xfs_iread_extents(NULL, ip, XFS_DATA_FORK);
+			error = xfs_iread_extents(tp, ip, XFS_DATA_FORK);
 			if (error)
 				goto error0;
 		}
@@ -1200,13 +1197,8 @@ xfs_qm_dqusage_adjust(
 			goto error0;
 	}
 
-	xfs_irele(ip);
-	*res = BULKSTAT_RV_DIDONE;
-	return 0;
-
 error0:
 	xfs_irele(ip);
-	*res = BULKSTAT_RV_GIVEUP;
 	return error;
 }
 
@@ -1270,18 +1262,13 @@ STATIC int
 xfs_qm_quotacheck(
 	xfs_mount_t	*mp)
 {
-	int			done, count, error, error2;
-	xfs_ino_t		lastino;
-	size_t			structsz;
+	int			error, error2;
 	uint			flags;
 	LIST_HEAD		(buffer_list);
 	struct xfs_inode	*uip = mp->m_quotainfo->qi_uquotaip;
 	struct xfs_inode	*gip = mp->m_quotainfo->qi_gquotaip;
 	struct xfs_inode	*pip = mp->m_quotainfo->qi_pquotaip;
 
-	count = INT_MAX;
-	structsz = 1;
-	lastino = 0;
 	flags = 0;
 
 	ASSERT(uip || gip || pip);
@@ -1318,18 +1305,9 @@ xfs_qm_quotacheck(
 		flags |= XFS_PQUOTA_CHKD;
 	}
 
-	do {
-		/*
-		 * Iterate thru all the inodes in the file system,
-		 * adjusting the corresponding dquot counters in core.
-		 */
-		error = xfs_bulkstat(mp, &lastino, &count,
-				     xfs_qm_dqusage_adjust,
-				     structsz, NULL, &done);
-		if (error)
-			break;
-
-	} while (!done);
+	error = xfs_iwalk(mp, NULL, 0, xfs_qm_dqusage_adjust, 0, NULL);
+	if (error)
+		goto error_return;
 
 	/*
 	 * We've made all the changes that we need to make incore.  Flush them

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 04/11] xfs: bulkstat should copy lastip whenever userspace supplies one
  2019-05-29 22:26 [PATCH 00/11] xfs: refactor and improve inode iteration Darrick J. Wong
                   ` (2 preceding siblings ...)
  2019-05-29 22:26 ` [PATCH 03/11] xfs: convert quotacheck to use the new iwalk functions Darrick J. Wong
@ 2019-05-29 22:26 ` Darrick J. Wong
  2019-06-04  7:54   ` Dave Chinner
  2019-05-29 22:26 ` [PATCH 05/11] xfs: convert bulkstat to new iwalk infrastructure Darrick J. Wong
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 19+ messages in thread
From: Darrick J. Wong @ 2019-05-29 22:26 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

When userspace passes in a @lastip pointer we should copy the results
back, even if the @ocount pointer is NULL.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_ioctl.c   |   13 ++++++-------
 fs/xfs/xfs_ioctl32.c |   13 ++++++-------
 2 files changed, 12 insertions(+), 14 deletions(-)


diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index d7dfc13f30f5..5ffbdcff3dba 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -768,14 +768,13 @@ xfs_ioc_bulkstat(
 	if (error)
 		return error;
 
-	if (bulkreq.ocount != NULL) {
-		if (copy_to_user(bulkreq.lastip, &inlast,
-						sizeof(xfs_ino_t)))
-			return -EFAULT;
+	if (bulkreq.lastip != NULL &&
+	    copy_to_user(bulkreq.lastip, &inlast, sizeof(xfs_ino_t)))
+		return -EFAULT;
 
-		if (copy_to_user(bulkreq.ocount, &count, sizeof(count)))
-			return -EFAULT;
-	}
+	if (bulkreq.ocount != NULL &&
+	    copy_to_user(bulkreq.ocount, &count, sizeof(count)))
+		return -EFAULT;
 
 	return 0;
 }
diff --git a/fs/xfs/xfs_ioctl32.c b/fs/xfs/xfs_ioctl32.c
index 614fc6886d24..814ffe6fbab7 100644
--- a/fs/xfs/xfs_ioctl32.c
+++ b/fs/xfs/xfs_ioctl32.c
@@ -310,14 +310,13 @@ xfs_compat_ioc_bulkstat(
 	if (error)
 		return error;
 
-	if (bulkreq.ocount != NULL) {
-		if (copy_to_user(bulkreq.lastip, &inlast,
-						sizeof(xfs_ino_t)))
-			return -EFAULT;
+	if (bulkreq.lastip != NULL &&
+	    copy_to_user(bulkreq.lastip, &inlast, sizeof(xfs_ino_t)))
+		return -EFAULT;
 
-		if (copy_to_user(bulkreq.ocount, &count, sizeof(count)))
-			return -EFAULT;
-	}
+	if (bulkreq.ocount != NULL &&
+	    copy_to_user(bulkreq.ocount, &count, sizeof(count)))
+		return -EFAULT;
 
 	return 0;
 }

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 05/11] xfs: convert bulkstat to new iwalk infrastructure
  2019-05-29 22:26 [PATCH 00/11] xfs: refactor and improve inode iteration Darrick J. Wong
                   ` (3 preceding siblings ...)
  2019-05-29 22:26 ` [PATCH 04/11] xfs: bulkstat should copy lastip whenever userspace supplies one Darrick J. Wong
@ 2019-05-29 22:26 ` Darrick J. Wong
  2019-05-29 22:26 ` [PATCH 06/11] xfs: move bulkstat ichunk helpers to iwalk code Darrick J. Wong
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 19+ messages in thread
From: Darrick J. Wong @ 2019-05-29 22:26 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Create a new ibulk structure incore to help us deal with bulk inode stat
state tracking and then convert the bulkstat code to use the new iwalk
iterator.  This disentangles inode walking from bulk stat control for
simpler code and enables us to isolate the formatter functions to the
ioctl handling code.
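
Under the new scheme, a formatter copies a single record into the user
buffer and then advances the request cursor.  A sketch of such a
formatter (example_bstat_fmt is a hypothetical name; the real thing is
xfs_bulkstat_one_fmt in the diff below):

	STATIC int
	example_bstat_fmt(
		struct xfs_ibulk	*breq,
		const struct xfs_bstat	*bstat)
	{
		if (copy_to_user(breq->ubuffer, bstat, sizeof(*bstat)))
			return -EFAULT;
		return xfs_ibulk_advance(breq, sizeof(*bstat));
	}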

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_ioctl.c   |   65 ++++++--
 fs/xfs/xfs_ioctl.h   |    5 +
 fs/xfs/xfs_ioctl32.c |   88 +++++------
 fs/xfs/xfs_itable.c  |  407 ++++++++++++++------------------------------------
 fs/xfs/xfs_itable.h  |   79 ++++------
 5 files changed, 245 insertions(+), 399 deletions(-)


diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 5ffbdcff3dba..43734901aeb9 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -721,16 +721,28 @@ xfs_ioc_space(
 	return error;
 }
 
+/* Return 0 on success, XFS_IBULK_BUFFER_FULL when full, or a negative error */
+int
+xfs_bulkstat_one_fmt(
+	struct xfs_ibulk	*breq,
+	const struct xfs_bstat	*bstat)
+{
+	if (copy_to_user(breq->ubuffer, bstat, sizeof(*bstat)))
+		return -EFAULT;
+	return xfs_ibulk_advance(breq, sizeof(struct xfs_bstat));
+}
+
 STATIC int
 xfs_ioc_bulkstat(
 	xfs_mount_t		*mp,
 	unsigned int		cmd,
 	void			__user *arg)
 {
-	xfs_fsop_bulkreq_t	bulkreq;
-	int			count;	/* # of records returned */
-	xfs_ino_t		inlast;	/* last inode number */
-	int			done;
+	struct xfs_fsop_bulkreq	bulkreq;
+	struct xfs_ibulk	breq = {
+		.mp		= mp,
+	};
+	xfs_ino_t		lastino;
 	int			error;
 
 	/* done = 1 if there are more stats to get and if bulkstat */
@@ -745,35 +757,54 @@ xfs_ioc_bulkstat(
 	if (copy_from_user(&bulkreq, arg, sizeof(xfs_fsop_bulkreq_t)))
 		return -EFAULT;
 
-	if (copy_from_user(&inlast, bulkreq.lastip, sizeof(__s64)))
+	if (copy_from_user(&lastino, bulkreq.lastip, sizeof(__s64)))
 		return -EFAULT;
 
-	if ((count = bulkreq.icount) <= 0)
+	if (bulkreq.icount <= 0)
 		return -EINVAL;
 
 	if (bulkreq.ubuffer == NULL)
 		return -EINVAL;
 
-	if (cmd == XFS_IOC_FSINUMBERS)
-		error = xfs_inumbers(mp, &inlast, &count,
+	breq.ubuffer = bulkreq.ubuffer;
+	breq.icount = bulkreq.icount;
+
+	/*
+	 * FSBULKSTAT_SINGLE expects that *lastip contains the inode number
+	 * that we want to stat.  However, FSINUMBERS and FSBULKSTAT expect
+	 * that *lastip contains either zero or the number of the last inode to
+	 * be examined by the previous call and return results starting with
+	 * the next inode after that.  The new bulk request functions take the
+	 * inode to start with, so we have to adjust the lastino/startino
+	 * parameter to maintain correct function.
+	 */
+	if (cmd == XFS_IOC_FSINUMBERS) {
+		int	count = breq.icount;
+
+		breq.startino = lastino;
+		error = xfs_inumbers(mp, &breq.startino, &count,
 					bulkreq.ubuffer, xfs_inumbers_fmt);
-	else if (cmd == XFS_IOC_FSBULKSTAT_SINGLE)
-		error = xfs_bulkstat_one(mp, inlast, bulkreq.ubuffer,
-					sizeof(xfs_bstat_t), NULL, &done);
-	else	/* XFS_IOC_FSBULKSTAT */
-		error = xfs_bulkstat(mp, &inlast, &count, xfs_bulkstat_one,
-				     sizeof(xfs_bstat_t), bulkreq.ubuffer,
-				     &done);
+		breq.ocount = count;
+		lastino = breq.startino;
+	} else if (cmd == XFS_IOC_FSBULKSTAT_SINGLE) {
+		breq.startino = lastino;
+		error = xfs_bulkstat_one(&breq, xfs_bulkstat_one_fmt);
+		lastino = breq.startino;
+	} else {	/* XFS_IOC_FSBULKSTAT */
+		breq.startino = lastino + 1;
+		error = xfs_bulkstat(&breq, xfs_bulkstat_one_fmt);
+		lastino = breq.startino - 1;
+	}
 
 	if (error)
 		return error;
 
 	if (bulkreq.lastip != NULL &&
-	    copy_to_user(bulkreq.lastip, &inlast, sizeof(xfs_ino_t)))
+	    copy_to_user(bulkreq.lastip, &lastino, sizeof(xfs_ino_t)))
 		return -EFAULT;
 
 	if (bulkreq.ocount != NULL &&
-	    copy_to_user(bulkreq.ocount, &count, sizeof(count)))
+	    copy_to_user(bulkreq.ocount, &breq.ocount, sizeof(__s32)))
 		return -EFAULT;
 
 	return 0;
diff --git a/fs/xfs/xfs_ioctl.h b/fs/xfs/xfs_ioctl.h
index 4b17f67c888a..f32c8aadfeba 100644
--- a/fs/xfs/xfs_ioctl.h
+++ b/fs/xfs/xfs_ioctl.h
@@ -77,4 +77,9 @@ xfs_set_dmattrs(
 	uint			evmask,
 	uint16_t		state);
 
+struct xfs_ibulk;
+struct xfs_bstat;
+
+int xfs_bulkstat_one_fmt(struct xfs_ibulk *breq, const struct xfs_bstat *bstat);
+
 #endif
diff --git a/fs/xfs/xfs_ioctl32.c b/fs/xfs/xfs_ioctl32.c
index 814ffe6fbab7..add15819daf3 100644
--- a/fs/xfs/xfs_ioctl32.c
+++ b/fs/xfs/xfs_ioctl32.c
@@ -172,15 +172,10 @@ xfs_bstime_store_compat(
 /* Return 0 on success or positive error (to xfs_bulkstat()) */
 STATIC int
 xfs_bulkstat_one_fmt_compat(
-	void			__user *ubuffer,
-	int			ubsize,
-	int			*ubused,
-	const xfs_bstat_t	*buffer)
+	struct xfs_ibulk	*breq,
+	const struct xfs_bstat	*buffer)
 {
-	compat_xfs_bstat_t	__user *p32 = ubuffer;
-
-	if (ubsize < sizeof(*p32))
-		return -ENOMEM;
+	struct compat_xfs_bstat	__user *p32 = breq->ubuffer;
 
 	if (put_user(buffer->bs_ino,	  &p32->bs_ino)		||
 	    put_user(buffer->bs_mode,	  &p32->bs_mode)	||
@@ -205,23 +200,8 @@ xfs_bulkstat_one_fmt_compat(
 	    put_user(buffer->bs_dmstate,  &p32->bs_dmstate)	||
 	    put_user(buffer->bs_aextents, &p32->bs_aextents))
 		return -EFAULT;
-	if (ubused)
-		*ubused = sizeof(*p32);
-	return 0;
-}
 
-STATIC int
-xfs_bulkstat_one_compat(
-	xfs_mount_t	*mp,		/* mount point for filesystem */
-	xfs_ino_t	ino,		/* inode number to get data for */
-	void		__user *buffer,	/* buffer to place output in */
-	int		ubsize,		/* size of buffer */
-	int		*ubused,	/* bytes used by me */
-	int		*stat)		/* BULKSTAT_RV_... */
-{
-	return xfs_bulkstat_one_int(mp, ino, buffer, ubsize,
-				    xfs_bulkstat_one_fmt_compat,
-				    ubused, stat);
+	return xfs_ibulk_advance(breq, sizeof(struct compat_xfs_bstat));
 }
 
 /* copied from xfs_ioctl.c */
@@ -232,10 +212,11 @@ xfs_compat_ioc_bulkstat(
 	compat_xfs_fsop_bulkreq_t __user *p32)
 {
 	u32			addr;
-	xfs_fsop_bulkreq_t	bulkreq;
-	int			count;	/* # of records returned */
-	xfs_ino_t		inlast;	/* last inode number */
-	int			done;
+	struct xfs_fsop_bulkreq	bulkreq;
+	struct xfs_ibulk	breq = {
+		.mp		= mp,
+	};
+	xfs_ino_t		lastino;
 	int			error;
 
 	/*
@@ -245,8 +226,7 @@ xfs_compat_ioc_bulkstat(
 	 * functions and structure size are the correct ones to use ...
 	 */
 	inumbers_fmt_pf inumbers_func = xfs_inumbers_fmt_compat;
-	bulkstat_one_pf	bs_one_func = xfs_bulkstat_one_compat;
-	size_t bs_one_size = sizeof(struct compat_xfs_bstat);
+	bulkstat_one_fmt_pf	bs_one_func = xfs_bulkstat_one_fmt_compat;
 
 #ifdef CONFIG_X86_X32
 	if (in_x32_syscall()) {
@@ -259,8 +239,7 @@ xfs_compat_ioc_bulkstat(
 		 * x32 userspace expects.
 		 */
 		inumbers_func = xfs_inumbers_fmt;
-		bs_one_func = xfs_bulkstat_one;
-		bs_one_size = sizeof(struct xfs_bstat);
+		bs_one_func = xfs_bulkstat_one_fmt;
 	}
 #endif
 
@@ -284,38 +263,57 @@ xfs_compat_ioc_bulkstat(
 		return -EFAULT;
 	bulkreq.ocount = compat_ptr(addr);
 
-	if (copy_from_user(&inlast, bulkreq.lastip, sizeof(__s64)))
+	if (copy_from_user(&lastino, bulkreq.lastip, sizeof(__s64)))
 		return -EFAULT;
 
-	if ((count = bulkreq.icount) <= 0)
+	if (bulkreq.icount <= 0)
 		return -EINVAL;
 
 	if (bulkreq.ubuffer == NULL)
 		return -EINVAL;
 
+	breq.ubuffer = bulkreq.ubuffer;
+	breq.icount = bulkreq.icount;
+
+	/*
+	 * FSBULKSTAT_SINGLE expects that *lastip contains the inode number
+	 * that we want to stat.  However, FSINUMBERS and FSBULKSTAT expect
+	 * that *lastip contains either zero or the number of the last inode to
+	 * be examined by the previous call and return results starting with
+	 * the next inode after that.  The new bulk request functions take the
+	 * inode to start with, so we have to adjust the lastino/startino
+	 * parameter to maintain correct function.
+	 */
 	if (cmd == XFS_IOC_FSINUMBERS_32) {
-		error = xfs_inumbers(mp, &inlast, &count,
+		int	count = breq.icount;
+
+		breq.startino = lastino;
+		error = xfs_inumbers(mp, &breq.startino, &count,
 				bulkreq.ubuffer, inumbers_func);
+		breq.ocount = count;
+		lastino = breq.startino;
 	} else if (cmd == XFS_IOC_FSBULKSTAT_SINGLE_32) {
-		int res;
-
-		error = bs_one_func(mp, inlast, bulkreq.ubuffer,
-				bs_one_size, NULL, &res);
+		breq.startino = lastino;
+		error = xfs_bulkstat_one(&breq, bs_one_func);
+		lastino = breq.startino;
 	} else if (cmd == XFS_IOC_FSBULKSTAT_32) {
-		error = xfs_bulkstat(mp, &inlast, &count,
-			bs_one_func, bs_one_size,
-			bulkreq.ubuffer, &done);
-	} else
+		breq.startino = lastino + 1;
+		error = xfs_bulkstat(&breq, bs_one_func);
+		lastino = breq.startino - 1;
+	} else {
 		error = -EINVAL;
+	}
 	if (error)
 		return error;
 
 	if (bulkreq.lastip != NULL &&
-	    copy_to_user(bulkreq.lastip, &inlast, sizeof(xfs_ino_t)))
+	    copy_to_user(bulkreq.lastip, &lastino, sizeof(xfs_ino_t)))
 		return -EFAULT;
 
 	if (bulkreq.ocount != NULL &&
-	    copy_to_user(bulkreq.ocount, &count, sizeof(count)))
+	    copy_to_user(bulkreq.ocount, &breq.ocount, sizeof(__s32)))
 		return -EFAULT;
 
 	return 0;
diff --git a/fs/xfs/xfs_itable.c b/fs/xfs/xfs_itable.c
index 96590d9f917c..454bc992bf93 100644
--- a/fs/xfs/xfs_itable.c
+++ b/fs/xfs/xfs_itable.c
@@ -22,37 +22,63 @@
 #include "xfs_iwalk.h"
 
 /*
- * Return stat information for one inode.
- * Return 0 if ok, else errno.
+ * Bulk Stat
+ * =========
+ *
+ * Use the inode walking functions to fill out struct xfs_bstat for every
+ * allocated inode, then pass the stat information to some externally provided
+ * iteration function.
  */
-int
+
+struct xfs_bstat_chunk {
+	bulkstat_one_fmt_pf	formatter;
+	struct xfs_ibulk	*breq;
+};
+
+/*
+ * Fill out the bulkstat info for a single inode and report it somewhere.
+ *
+ * bc->breq->startino is effectively the inode cursor as we walk through the
+ * filesystem.  Therefore, we update it any time we need to move the cursor
+ * forward, regardless of whether or not we're sending any bstat information
+ * back to userspace.  If the inode is internal metadata or has been freed
+ * out from under us, we simply keep going.
+ *
+ * However, if any other type of error happens we want to stop right where we
+ * are so that userspace will call back with the exact number of the bad inode
+ * and we can send back an error code.
+ *
+ * Note that if the formatter tells us there's no space left in the buffer we
+ * move the cursor forward and abort the walk.
+ */
+STATIC int
 xfs_bulkstat_one_int(
-	struct xfs_mount	*mp,		/* mount point for filesystem */
-	xfs_ino_t		ino,		/* inode to get data for */
-	void __user		*buffer,	/* buffer to place output in */
-	int			ubsize,		/* size of buffer */
-	bulkstat_one_fmt_pf	formatter,	/* formatter, copy to user */
-	int			*ubused,	/* bytes used by me */
-	int			*stat)		/* BULKSTAT_RV_... */
+	struct xfs_mount	*mp,
+	struct xfs_trans	*tp,
+	xfs_ino_t		ino,
+	void			*data)
 {
+	struct xfs_bstat_chunk	*bc = data;
 	struct xfs_icdinode	*dic;		/* dinode core info pointer */
 	struct xfs_inode	*ip;		/* incore inode pointer */
 	struct inode		*inode;
 	struct xfs_bstat	*buf;		/* return buffer */
 	int			error = 0;	/* error value */
 
-	*stat = BULKSTAT_RV_NOTHING;
-
-	if (!buffer || xfs_internal_inum(mp, ino))
+	if (xfs_internal_inum(mp, ino)) {
+		bc->breq->startino = ino + 1;
 		return -EINVAL;
+	}
 
 	buf = kmem_zalloc(sizeof(*buf), KM_SLEEP | KM_MAYFAIL);
 	if (!buf)
 		return -ENOMEM;
 
-	error = xfs_iget(mp, NULL, ino,
+	error = xfs_iget(mp, tp, ino,
 			 (XFS_IGET_DONTCACHE | XFS_IGET_UNTRUSTED),
 			 XFS_ILOCK_SHARED, &ip);
+	if (error == -ENOENT || error == -EINVAL)
+		bc->breq->startino = ino + 1;
 	if (error)
 		goto out_free;
 
@@ -119,43 +145,45 @@ xfs_bulkstat_one_int(
 	xfs_iunlock(ip, XFS_ILOCK_SHARED);
 	xfs_irele(ip);
 
-	error = formatter(buffer, ubsize, ubused, buf);
-	if (!error)
-		*stat = BULKSTAT_RV_DIDONE;
-
- out_free:
+	error = bc->formatter(bc->breq, buf);
+	switch (error) {
+	case XFS_IBULK_BUFFER_FULL:
+		error = XFS_IWALK_ABORT;
+		/* fall through */
+	case 0:
+		bc->breq->startino = ino + 1;
+		break;
+	}
+out_free:
 	kmem_free(buf);
 	return error;
 }
 
-/* Return 0 on success or positive error */
-STATIC int
-xfs_bulkstat_one_fmt(
-	void			__user *ubuffer,
-	int			ubsize,
-	int			*ubused,
-	const xfs_bstat_t	*buffer)
-{
-	if (ubsize < sizeof(*buffer))
-		return -ENOMEM;
-	if (copy_to_user(ubuffer, buffer, sizeof(*buffer)))
-		return -EFAULT;
-	if (ubused)
-		*ubused = sizeof(*buffer);
-	return 0;
-}
-
+/* Bulkstat a single inode. */
 int
 xfs_bulkstat_one(
-	xfs_mount_t	*mp,		/* mount point for filesystem */
-	xfs_ino_t	ino,		/* inode number to get data for */
-	void		__user *buffer,	/* buffer to place output in */
-	int		ubsize,		/* size of buffer */
-	int		*ubused,	/* bytes used by me */
-	int		*stat)		/* BULKSTAT_RV_... */
+	struct xfs_ibulk	*breq,
+	bulkstat_one_fmt_pf	formatter)
 {
-	return xfs_bulkstat_one_int(mp, ino, buffer, ubsize,
-				    xfs_bulkstat_one_fmt, ubused, stat);
+	struct xfs_bstat_chunk	bc = {
+		.formatter	= formatter,
+		.breq		= breq,
+	};
+	int			error;
+
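+	/* One-element buffer: the formatter flags it full after one record. */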
+	breq->icount = 1;
+	breq->ocount = 0;
+
+	error = xfs_bulkstat_one_int(breq->mp, NULL, breq->startino, &bc);
+
+	/*
+	 * If we reported the one inode to userspace then we aborted because
+	 * we hit the end of the single-element buffer.  Don't leak that
+	 * abort code back to userspace.
+	 */
+	if (error == XFS_IWALK_ABORT)
+		error = 0;
+
+	return error;
 }
 
 /*
@@ -251,256 +279,65 @@ xfs_bulkstat_grab_ichunk(
 
 #define XFS_BULKSTAT_UBLEFT(ubleft)	((ubleft) >= statstruct_size)
 
-struct xfs_bulkstat_agichunk {
-	char		__user **ac_ubuffer;/* pointer into user's buffer */
-	int		ac_ubleft;	/* bytes left in user's buffer */
-	int		ac_ubelem;	/* spaces used in user's buffer */
-};
-
-/*
- * Process inodes in chunk with a pointer to a formatter function
- * that will iget the inode and fill in the appropriate structure.
- */
 static int
-xfs_bulkstat_ag_ichunk(
-	struct xfs_mount		*mp,
-	xfs_agnumber_t			agno,
-	struct xfs_inobt_rec_incore	*irbp,
-	bulkstat_one_pf			formatter,
-	size_t				statstruct_size,
-	struct xfs_bulkstat_agichunk	*acp,
-	xfs_agino_t			*last_agino)
+xfs_bulkstat_iwalk(
+	struct xfs_mount	*mp,
+	struct xfs_trans	*tp,
+	xfs_ino_t		ino,
+	void			*data)
 {
-	char				__user **ubufp = acp->ac_ubuffer;
-	int				chunkidx;
-	int				error = 0;
-	xfs_agino_t			agino = irbp->ir_startino;
-
-	for (chunkidx = 0; chunkidx < XFS_INODES_PER_CHUNK;
-	     chunkidx++, agino++) {
-		int		fmterror;
-		int		ubused;
-
-		/* inode won't fit in buffer, we are done */
-		if (acp->ac_ubleft < statstruct_size)
-			break;
-
-		/* Skip if this inode is free */
-		if (XFS_INOBT_MASK(chunkidx) & irbp->ir_free)
-			continue;
-
-		/* Get the inode and fill in a single buffer */
-		ubused = statstruct_size;
-		error = formatter(mp, XFS_AGINO_TO_INO(mp, agno, agino),
-				  *ubufp, acp->ac_ubleft, &ubused, &fmterror);
-
-		if (fmterror == BULKSTAT_RV_GIVEUP ||
-		    (error && error != -ENOENT && error != -EINVAL)) {
-			acp->ac_ubleft = 0;
-			ASSERT(error);
-			break;
-		}
-
-		/* be careful not to leak error if at end of chunk */
-		if (fmterror == BULKSTAT_RV_NOTHING || error) {
-			error = 0;
-			continue;
-		}
-
-		*ubufp += ubused;
-		acp->ac_ubleft -= ubused;
-		acp->ac_ubelem++;
-	}
-
-	/*
-	 * Post-update *last_agino. At this point, agino will always point one
-	 * inode past the last inode we processed successfully. Hence we
-	 * substract that inode when setting the *last_agino cursor so that we
-	 * return the correct cookie to userspace. On the next bulkstat call,
-	 * the inode under the lastino cookie will be skipped as we have already
-	 * processed it here.
-	 */
-	*last_agino = agino - 1;
+	int			error;
 
+	error = xfs_bulkstat_one_int(mp, tp, ino, data);
+	/* bulkstat just skips over missing inodes */
+	if (error == -ENOENT || error == -EINVAL)
+		return 0;
 	return error;
 }
 
 /*
- * Return stat information in bulk (by-inode) for the filesystem.
+ * Check the incoming startino parameter.
+ *
+ * We allow any inode value that could map to physical space inside the
+ * filesystem because if there are no inodes there, bulkstat moves on to the
+ * next chunk.  In other words, the magic agino value of zero takes us to the
+ * first chunk in the AG, and an agino value past the end of the AG takes us to
+ * the first chunk in the next AG.
+ *
+ * Therefore we can end early if the requested inode is beyond the end of the
+ * filesystem or doesn't map properly.
  */
-int					/* error status */
-xfs_bulkstat(
-	xfs_mount_t		*mp,	/* mount point for filesystem */
-	xfs_ino_t		*lastinop, /* last inode returned */
-	int			*ubcountp, /* size of buffer/count returned */
-	bulkstat_one_pf		formatter, /* func that'd fill a single buf */
-	size_t			statstruct_size, /* sizeof struct filling */
-	char			__user *ubuffer, /* buffer with inode stats */
-	int			*done)	/* 1 if there are more stats to get */
+static inline bool
+xfs_bulkstat_already_done(
+	struct xfs_mount	*mp,
+	xfs_ino_t		startino)
 {
-	xfs_buf_t		*agbp;	/* agi header buffer */
-	xfs_agino_t		agino;	/* inode # in allocation group */
-	xfs_agnumber_t		agno;	/* allocation group number */
-	xfs_btree_cur_t		*cur;	/* btree cursor for ialloc btree */
-	xfs_inobt_rec_incore_t	*irbuf;	/* start of irec buffer */
-	int			nirbuf;	/* size of irbuf */
-	int			ubcount; /* size of user's buffer */
-	struct xfs_bulkstat_agichunk ac;
-	int			error = 0;
-
-	/*
-	 * Get the last inode value, see if there's nothing to do.
-	 */
-	agno = XFS_INO_TO_AGNO(mp, *lastinop);
-	agino = XFS_INO_TO_AGINO(mp, *lastinop);
-	if (agno >= mp->m_sb.sb_agcount ||
-	    *lastinop != XFS_AGINO_TO_INO(mp, agno, agino)) {
-		*done = 1;
-		*ubcountp = 0;
-		return 0;
-	}
-
-	ubcount = *ubcountp; /* statstruct's */
-	ac.ac_ubuffer = &ubuffer;
-	ac.ac_ubleft = ubcount * statstruct_size; /* bytes */;
-	ac.ac_ubelem = 0;
-
-	*ubcountp = 0;
-	*done = 0;
-
-	irbuf = kmem_zalloc_large(PAGE_SIZE * 4, KM_SLEEP);
-	if (!irbuf)
-		return -ENOMEM;
-	nirbuf = (PAGE_SIZE * 4) / sizeof(*irbuf);
-
-	/*
-	 * Loop over the allocation groups, starting from the last
-	 * inode returned; 0 means start of the allocation group.
-	 */
-	while (agno < mp->m_sb.sb_agcount) {
-		struct xfs_inobt_rec_incore	*irbp = irbuf;
-		struct xfs_inobt_rec_incore	*irbufend = irbuf + nirbuf;
-		bool				end_of_ag = false;
-		int				icount = 0;
-		int				stat;
-
-		error = xfs_ialloc_read_agi(mp, NULL, agno, &agbp);
-		if (error)
-			break;
-		/*
-		 * Allocate and initialize a btree cursor for ialloc btree.
-		 */
-		cur = xfs_inobt_init_cursor(mp, NULL, agbp, agno,
-					    XFS_BTNUM_INO);
-		if (agino > 0) {
-			/*
-			 * In the middle of an allocation group, we need to get
-			 * the remainder of the chunk we're in.
-			 */
-			struct xfs_inobt_rec_incore	r;
-
-			error = xfs_bulkstat_grab_ichunk(cur, agino, &icount, &r);
-			if (error)
-				goto del_cursor;
-			if (icount) {
-				irbp->ir_startino = r.ir_startino;
-				irbp->ir_holemask = r.ir_holemask;
-				irbp->ir_count = r.ir_count;
-				irbp->ir_freecount = r.ir_freecount;
-				irbp->ir_free = r.ir_free;
-				irbp++;
-			}
-			/* Increment to the next record */
-			error = xfs_btree_increment(cur, 0, &stat);
-		} else {
-			/* Start of ag.  Lookup the first inode chunk */
-			error = xfs_inobt_lookup(cur, 0, XFS_LOOKUP_GE, &stat);
-		}
-		if (error || stat == 0) {
-			end_of_ag = true;
-			goto del_cursor;
-		}
+	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, startino);
+	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, startino);
 
-		/*
-		 * Loop through inode btree records in this ag,
-		 * until we run out of inodes or space in the buffer.
-		 */
-		while (irbp < irbufend && icount < ubcount) {
-			struct xfs_inobt_rec_incore	r;
-
-			error = xfs_inobt_get_rec(cur, &r, &stat);
-			if (error || stat == 0) {
-				end_of_ag = true;
-				goto del_cursor;
-			}
-
-			/*
-			 * If this chunk has any allocated inodes, save it.
-			 * Also start read-ahead now for this chunk.
-			 */
-			if (r.ir_freecount < r.ir_count) {
-				xfs_bulkstat_ichunk_ra(mp, agno, &r);
-				irbp->ir_startino = r.ir_startino;
-				irbp->ir_holemask = r.ir_holemask;
-				irbp->ir_count = r.ir_count;
-				irbp->ir_freecount = r.ir_freecount;
-				irbp->ir_free = r.ir_free;
-				irbp++;
-				icount += r.ir_count - r.ir_freecount;
-			}
-			error = xfs_btree_increment(cur, 0, &stat);
-			if (error || stat == 0) {
-				end_of_ag = true;
-				goto del_cursor;
-			}
-			cond_resched();
-		}
+	return agno >= mp->m_sb.sb_agcount ||
+	       startino != XFS_AGINO_TO_INO(mp, agno, agino);
+}
 
-		/*
-		 * Drop the btree buffers and the agi buffer as we can't hold any
-		 * of the locks these represent when calling iget. If there is a
-		 * pending error, then we are done.
-		 */
-del_cursor:
-		xfs_btree_del_cursor(cur, error);
-		xfs_buf_relse(agbp);
-		if (error)
-			break;
-		/*
-		 * Now format all the good inodes into the user's buffer. The
-		 * call to xfs_bulkstat_ag_ichunk() sets up the agino pointer
-		 * for the next loop iteration.
-		 */
-		irbufend = irbp;
-		for (irbp = irbuf;
-		     irbp < irbufend && ac.ac_ubleft >= statstruct_size;
-		     irbp++) {
-			error = xfs_bulkstat_ag_ichunk(mp, agno, irbp,
-					formatter, statstruct_size, &ac,
-					&agino);
-			if (error)
-				break;
+/* Return stat information in bulk (by-inode) for the filesystem. */
+int
+xfs_bulkstat(
+	struct xfs_ibulk	*breq,
+	bulkstat_one_fmt_pf	formatter)
+{
+	struct xfs_bstat_chunk	bc = {
+		.formatter	= formatter,
+		.breq		= breq,
+	};
+	int			error;
 
-			cond_resched();
-		}
+	breq->ocount = 0;
 
-		/*
-		 * If we've run out of space or had a formatting error, we
-		 * are now done
-		 */
-		if (ac.ac_ubleft < statstruct_size || error)
-			break;
+	if (xfs_bulkstat_already_done(breq->mp, breq->startino))
+		return 0;
 
-		if (end_of_ag) {
-			agno++;
-			agino = 0;
-		}
-	}
-	/*
-	 * Done, we're either out of filesystem or space to put the data.
-	 */
-	kmem_free(irbuf);
-	*ubcountp = ac.ac_ubelem;
+	error = xfs_iwalk(breq->mp, NULL, breq->startino, xfs_bulkstat_iwalk,
+			breq->icount, &bc);
 
 	/*
 	 * We found some inodes, so clear the error status and return them.
@@ -509,17 +346,9 @@ xfs_bulkstat(
 	 * triggered again and propagated to userspace as there will be no
 	 * formatted inodes in the buffer.
 	 */
-	if (ac.ac_ubelem)
+	if (breq->ocount > 0)
 		error = 0;
 
-	/*
-	 * If we ran out of filesystem, lastino will point off the end of
-	 * the filesystem so the next call will return immediately.
-	 */
-	*lastinop = XFS_AGINO_TO_INO(mp, agno, agino);
-	if (agno >= mp->m_sb.sb_agcount)
-		*done = 1;
-
 	return error;
 }
 
diff --git a/fs/xfs/xfs_itable.h b/fs/xfs/xfs_itable.h
index 369e3f159d4e..366d391eb11f 100644
--- a/fs/xfs/xfs_itable.h
+++ b/fs/xfs/xfs_itable.h
@@ -5,63 +5,46 @@
 #ifndef __XFS_ITABLE_H__
 #define	__XFS_ITABLE_H__
 
-/*
- * xfs_bulkstat() is used to fill in xfs_bstat structures as well as dm_stat
- * structures (by the dmi library). This is a pointer to a formatter function
- * that will iget the inode and fill in the appropriate structure.
- * see xfs_bulkstat_one() and xfs_dm_bulkstat_one() in dmapi_xfs.c
- */
-typedef int (*bulkstat_one_pf)(struct xfs_mount	*mp,
-			       xfs_ino_t	ino,
-			       void		__user *buffer,
-			       int		ubsize,
-			       int		*ubused,
-			       int		*stat);
+/* In-memory representation of a userspace request for batch inode data. */
+struct xfs_ibulk {
+	struct xfs_mount	*mp;
+	void __user		*ubuffer; /* user output buffer */
+	xfs_ino_t		startino; /* start with this inode */
+	unsigned int		icount;   /* number of elements in ubuffer */
+	unsigned int		ocount;   /* number of records returned */
+};
+
+/* Return value that means we want to abort the walk. */
+#define XFS_IBULK_ABORT		(XFS_IWALK_ABORT)
+
+/* Return value that means the formatting buffer is now full. */
+#define XFS_IBULK_BUFFER_FULL	(2)
 
 /*
- * Values for stat return value.
+ * Advance the user buffer pointer by one record of the given size.  If the
+ * buffer is now full, return the appropriate error code.
  */
-#define BULKSTAT_RV_NOTHING	0
-#define BULKSTAT_RV_DIDONE	1
-#define BULKSTAT_RV_GIVEUP	2
+static inline int
+xfs_ibulk_advance(
+	struct xfs_ibulk	*breq,
+	size_t			bytes)
+{
+	char __user		*b = breq->ubuffer;
+
+	breq->ubuffer = b + bytes;
+	breq->ocount++;
+	return breq->ocount == breq->icount ? XFS_IBULK_BUFFER_FULL : 0;
+}
 
 /*
  * Return stat information in bulk (by-inode) for the filesystem.
  */
-int					/* error status */
-xfs_bulkstat(
-	xfs_mount_t	*mp,		/* mount point for filesystem */
-	xfs_ino_t	*lastino,	/* last inode returned */
-	int		*count,		/* size of buffer/count returned */
-	bulkstat_one_pf formatter,	/* func that'd fill a single buf */
-	size_t		statstruct_size,/* sizeof struct that we're filling */
-	char		__user *ubuffer,/* buffer with inode stats */
-	int		*done);		/* 1 if there are more stats to get */
 
-typedef int (*bulkstat_one_fmt_pf)(  /* used size in bytes or negative error */
-	void			__user *ubuffer, /* buffer to write to */
-	int			ubsize,		 /* remaining user buffer sz */
-	int			*ubused,	 /* bytes used by formatter */
-	const xfs_bstat_t	*buffer);        /* buffer to read from */
+typedef int (*bulkstat_one_fmt_pf)(struct xfs_ibulk *breq,
+		const struct xfs_bstat *bstat);
 
-int
-xfs_bulkstat_one_int(
-	xfs_mount_t		*mp,
-	xfs_ino_t		ino,
-	void			__user *buffer,
-	int			ubsize,
-	bulkstat_one_fmt_pf	formatter,
-	int			*ubused,
-	int			*stat);
-
-int
-xfs_bulkstat_one(
-	xfs_mount_t		*mp,
-	xfs_ino_t		ino,
-	void			__user *buffer,
-	int			ubsize,
-	int			*ubused,
-	int			*stat);
+int xfs_bulkstat_one(struct xfs_ibulk *breq, bulkstat_one_fmt_pf formatter);
+int xfs_bulkstat(struct xfs_ibulk *breq, bulkstat_one_fmt_pf formatter);
 
 typedef int (*inumbers_fmt_pf)(
 	void			__user *ubuffer, /* buffer to write to */

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 06/11] xfs: move bulkstat ichunk helpers to iwalk code
  2019-05-29 22:26 [PATCH 00/11] xfs: refactor and improve inode iteration Darrick J. Wong
                   ` (4 preceding siblings ...)
  2019-05-29 22:26 ` [PATCH 05/11] xfs: convert bulkstat to new iwalk infrastructure Darrick J. Wong
@ 2019-05-29 22:26 ` Darrick J. Wong
  2019-05-29 22:26 ` [PATCH 07/11] xfs: change xfs_iwalk_grab_ichunk to use startino, not lastino Darrick J. Wong
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 19+ messages in thread
From: Darrick J. Wong @ 2019-05-29 22:26 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Now that we've reworked the bulkstat code to use iwalk, we can move the
old bulkstat ichunk helpers to xfs_iwalk.c.  No functional changes here.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_itable.c |   93 --------------------------------------------------
 fs/xfs/xfs_itable.h |    8 ----
 fs/xfs/xfs_iwalk.c  |   95 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 93 insertions(+), 103 deletions(-)


diff --git a/fs/xfs/xfs_itable.c b/fs/xfs/xfs_itable.c
index 454bc992bf93..06abe5c9c0ee 100644
--- a/fs/xfs/xfs_itable.c
+++ b/fs/xfs/xfs_itable.c
@@ -186,99 +186,6 @@ xfs_bulkstat_one(
 	return error;
 }
 
-/*
- * Loop over all clusters in a chunk for a given incore inode allocation btree
- * record.  Do a readahead if there are any allocated inodes in that cluster.
- */
-void
-xfs_bulkstat_ichunk_ra(
-	struct xfs_mount		*mp,
-	xfs_agnumber_t			agno,
-	struct xfs_inobt_rec_incore	*irec)
-{
-	struct xfs_ino_geometry		*igeo = &mp->m_ino_geo;
-	xfs_agblock_t			agbno;
-	struct blk_plug			plug;
-	int				i;	/* inode chunk index */
-
-	agbno = XFS_AGINO_TO_AGBNO(mp, irec->ir_startino);
-
-	blk_start_plug(&plug);
-	for (i = 0;
-	     i < XFS_INODES_PER_CHUNK;
-	     i += igeo->ig_inodes_per_cluster,
-			agbno += igeo->ig_blocks_per_cluster) {
-		if (xfs_inobt_maskn(i, igeo->ig_inodes_per_cluster) &
-		    ~irec->ir_free) {
-			xfs_btree_reada_bufs(mp, agno, agbno,
-					igeo->ig_blocks_per_cluster,
-					&xfs_inode_buf_ops);
-		}
-	}
-	blk_finish_plug(&plug);
-}
-
-/*
- * Lookup the inode chunk that the given inode lives in and then get the record
- * if we found the chunk.  If the inode was not the last in the chunk and there
- * are some left allocated, update the data for the pointed-to record as well as
- * return the count of grabbed inodes.
- */
-int
-xfs_bulkstat_grab_ichunk(
-	struct xfs_btree_cur		*cur,	/* btree cursor */
-	xfs_agino_t			agino,	/* starting inode of chunk */
-	int				*icount,/* return # of inodes grabbed */
-	struct xfs_inobt_rec_incore	*irec)	/* btree record */
-{
-	int				idx;	/* index into inode chunk */
-	int				stat;
-	int				error = 0;
-
-	/* Lookup the inode chunk that this inode lives in */
-	error = xfs_inobt_lookup(cur, agino, XFS_LOOKUP_LE, &stat);
-	if (error)
-		return error;
-	if (!stat) {
-		*icount = 0;
-		return error;
-	}
-
-	/* Get the record, should always work */
-	error = xfs_inobt_get_rec(cur, irec, &stat);
-	if (error)
-		return error;
-	XFS_WANT_CORRUPTED_RETURN(cur->bc_mp, stat == 1);
-
-	/* Check if the record contains the inode in request */
-	if (irec->ir_startino + XFS_INODES_PER_CHUNK <= agino) {
-		*icount = 0;
-		return 0;
-	}
-
-	idx = agino - irec->ir_startino + 1;
-	if (idx < XFS_INODES_PER_CHUNK &&
-	    (xfs_inobt_maskn(idx, XFS_INODES_PER_CHUNK - idx) & ~irec->ir_free)) {
-		int	i;
-
-		/* We got a right chunk with some left inodes allocated at it.
-		 * Grab the chunk record.  Mark all the uninteresting inodes
-		 * free -- because they're before our start point.
-		 */
-		for (i = 0; i < idx; i++) {
-			if (XFS_INOBT_MASK(i) & ~irec->ir_free)
-				irec->ir_freecount++;
-		}
-
-		irec->ir_free |= xfs_inobt_maskn(0, idx);
-		*icount = irec->ir_count - irec->ir_freecount;
-	}
-
-	return 0;
-}
-
-#define XFS_BULKSTAT_UBLEFT(ubleft)	((ubleft) >= statstruct_size)
-
 static int
 xfs_bulkstat_iwalk(
 	struct xfs_mount	*mp,
diff --git a/fs/xfs/xfs_itable.h b/fs/xfs/xfs_itable.h
index 366d391eb11f..a2562fe8d282 100644
--- a/fs/xfs/xfs_itable.h
+++ b/fs/xfs/xfs_itable.h
@@ -67,12 +67,4 @@ xfs_inumbers(
 	void			__user *buffer, /* buffer with inode info */
 	inumbers_fmt_pf		formatter);
 
-/* Temporarily needed while we refactor functions. */
-struct xfs_btree_cur;
-struct xfs_inobt_rec_incore;
-void xfs_bulkstat_ichunk_ra(struct xfs_mount *mp, xfs_agnumber_t agno,
-		struct xfs_inobt_rec_incore *irec);
-int xfs_bulkstat_grab_ichunk(struct xfs_btree_cur *cur, xfs_agino_t agino,
-		int *icount, struct xfs_inobt_rec_incore *irec);
-
 #endif	/* __XFS_ITABLE_H__ */
diff --git a/fs/xfs/xfs_iwalk.c b/fs/xfs/xfs_iwalk.c
index 0ce3baa159ba..77320c297284 100644
--- a/fs/xfs/xfs_iwalk.c
+++ b/fs/xfs/xfs_iwalk.c
@@ -54,6 +54,97 @@ struct xfs_iwalk_ag {
 	void				*data;
 };
 
+/*
+ * Loop over all clusters in a chunk for a given incore inode allocation btree
+ * record.  Do a readahead if there are any allocated inodes in that cluster.
+ */
+STATIC void
+xfs_iwalk_ichunk_ra(
+	struct xfs_mount		*mp,
+	xfs_agnumber_t			agno,
+	struct xfs_inobt_rec_incore	*irec)
+{
+	struct xfs_ino_geometry		*igeo = &mp->m_ino_geo;
+	xfs_agblock_t			agbno;
+	struct blk_plug			plug;
+	int				i;	/* inode chunk index */
+
+	agbno = XFS_AGINO_TO_AGBNO(mp, irec->ir_startino);
+
+	blk_start_plug(&plug);
+	for (i = 0;
+	     i < XFS_INODES_PER_CHUNK;
+	     i += igeo->ig_inodes_per_cluster,
+			agbno += igeo->ig_blocks_per_cluster) {
+		if (xfs_inobt_maskn(i, igeo->ig_inodes_per_cluster) &
+		    ~irec->ir_free) {
+			xfs_btree_reada_bufs(mp, agno, agbno,
+					igeo->ig_blocks_per_cluster,
+					&xfs_inode_buf_ops);
+		}
+	}
+	blk_finish_plug(&plug);
+}
+
+/*
+ * Lookup the inode chunk that the given inode lives in and then get the record
+ * if we found the chunk.  If the inode was not the last in the chunk and there
+ * are some left allocated, update the data for the pointed-to record as well as
+ * return the count of grabbed inodes.
+ */
+STATIC int
+xfs_iwalk_grab_ichunk(
+	struct xfs_btree_cur		*cur,	/* btree cursor */
+	xfs_agino_t			agino,	/* starting inode of chunk */
+	int				*icount,/* return # of inodes grabbed */
+	struct xfs_inobt_rec_incore	*irec)	/* btree record */
+{
+	int				idx;	/* index into inode chunk */
+	int				stat;
+	int				error = 0;
+
+	/* Lookup the inode chunk that this inode lives in */
+	error = xfs_inobt_lookup(cur, agino, XFS_LOOKUP_LE, &stat);
+	if (error)
+		return error;
+	if (!stat) {
+		*icount = 0;
+		return error;
+	}
+
+	/* Get the record, should always work */
+	error = xfs_inobt_get_rec(cur, irec, &stat);
+	if (error)
+		return error;
+	XFS_WANT_CORRUPTED_RETURN(cur->bc_mp, stat == 1);
+
+	/* Check if the record contains the inode in request */
+	if (irec->ir_startino + XFS_INODES_PER_CHUNK <= agino) {
+		*icount = 0;
+		return 0;
+	}
+
+	idx = agino - irec->ir_startino + 1;
+	if (idx < XFS_INODES_PER_CHUNK &&
+	    (xfs_inobt_maskn(idx, XFS_INODES_PER_CHUNK - idx) & ~irec->ir_free)) {
+		int	i;
+
+		/* We got a right chunk with some left inodes allocated at it.
+		 * Grab the chunk record.  Mark all the uninteresting inodes
+		 * free -- because they're before our start point.
+		 */
+		for (i = 0; i < idx; i++) {
+			if (XFS_INOBT_MASK(i) & ~irec->ir_free)
+				irec->ir_freecount++;
+		}
+
+		irec->ir_free |= xfs_inobt_maskn(0, idx);
+		*icount = irec->ir_count - irec->ir_freecount;
+	}
+
+	return 0;
+}
+
 /* Allocate memory for a walk. */
 STATIC int
 xfs_iwalk_allocbuf(
@@ -204,7 +295,7 @@ xfs_iwalk_ag_start(
 	 * We require a lookup cache of at least two elements so that we don't
 	 * have to deal with tearing down the cursor to walk the records.
 	 */
-	error = xfs_bulkstat_grab_ichunk(*curpp, agino - 1, &icount,
+	error = xfs_iwalk_grab_ichunk(*curpp, agino - 1, &icount,
 			&iwag->recs[iwag->nr_recs]);
 	if (error)
 		return error;
@@ -305,7 +396,7 @@ xfs_iwalk_ag(
 		 * Start readahead for this inode chunk in anticipation of
 		 * walking the inodes.
 		 */
-		xfs_bulkstat_ichunk_ra(mp, agno, irec);
+		xfs_iwalk_ichunk_ra(mp, agno, irec);
 
 		/*
 		 * Add this inobt record to our cache, flush the cache if

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 07/11] xfs: change xfs_iwalk_grab_ichunk to use startino, not lastino
  2019-05-29 22:26 [PATCH 00/11] xfs: refactor and improve inode iteration Darrick J. Wong
                   ` (5 preceding siblings ...)
  2019-05-29 22:26 ` [PATCH 06/11] xfs: move bulkstat ichunk helpers to iwalk code Darrick J. Wong
@ 2019-05-29 22:26 ` Darrick J. Wong
  2019-05-29 22:27 ` [PATCH 08/11] xfs: clean up long conditionals in xfs_iwalk_ichunk_ra Darrick J. Wong
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 19+ messages in thread
From: Darrick J. Wong @ 2019-05-29 22:26 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Now that the inode chunk grabbing function is a static function in the
iwalk code, change its behavior so that @agino is the inode where we
want to /start/ the iteration.  This reduces cognitive friction with the
callers and simplifies the code.
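
For example (hypothetical numbers): if an inode chunk record starts at
agino 128 and the walk restarts at agino 133, then idx is 5, so bits
0-4 are set in the record's free mask and inodes 128-132 are skipped
when the walk resumes.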

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_iwalk.c |   37 +++++++++++++++++--------------------
 1 file changed, 17 insertions(+), 20 deletions(-)


diff --git a/fs/xfs/xfs_iwalk.c b/fs/xfs/xfs_iwalk.c
index 77320c297284..8e7881e95674 100644
--- a/fs/xfs/xfs_iwalk.c
+++ b/fs/xfs/xfs_iwalk.c
@@ -87,10 +87,10 @@ xfs_iwalk_ichunk_ra(
 }
 
 /*
- * Lookup the inode chunk that the given inode lives in and then get the record
- * if we found the chunk.  If the inode was not the last in the chunk and there
- * are some left allocated, update the data for the pointed-to record as well as
- * return the count of grabbed inodes.
+ * Look up the inode chunk that the given @agino lives in, then get the
+ * record if we found the chunk.  Set the bits in @irec's free mask that
+ * correspond to the inodes before @agino so that we skip them.  This is how we
+ * restart an inode walk that was interrupted in the middle of an inode record.
  */
 STATIC int
 xfs_iwalk_grab_ichunk(
@@ -101,6 +101,7 @@ xfs_iwalk_grab_ichunk(
 {
 	int				idx;	/* index into inode chunk */
 	int				stat;
+	int				i;
 	int				error = 0;
 
 	/* Lookup the inode chunk that this inode lives in */
@@ -124,24 +125,20 @@ xfs_iwalk_grab_ichunk(
 		return 0;
 	}
 
-	idx = agino - irec->ir_startino + 1;
-	if (idx < XFS_INODES_PER_CHUNK &&
-	    (xfs_inobt_maskn(idx, XFS_INODES_PER_CHUNK - idx) & ~irec->ir_free)) {
-		int	i;
+	idx = agino - irec->ir_startino;
 
-		/* We got a right chunk with some left inodes allocated at it.
-		 * Grab the chunk record.  Mark all the uninteresting inodes
-		 * free -- because they're before our start point.
-		 */
-		for (i = 0; i < idx; i++) {
-			if (XFS_INOBT_MASK(i) & ~irec->ir_free)
-				irec->ir_freecount++;
-		}
-
-		irec->ir_free |= xfs_inobt_maskn(0, idx);
-		*icount = irec->ir_count - irec->ir_freecount;
+	/*
+	 * The chunk record we found contains inodes allocated at or after
+	 * @agino.  Mark the uninteresting inodes before our start point free
+	 * so that the walk skips them.
+	 */
+	for (i = 0; i < idx; i++) {
+		if (XFS_INOBT_MASK(i) & ~irec->ir_free)
+			irec->ir_freecount++;
 	}
 
+	irec->ir_free |= xfs_inobt_maskn(0, idx);
+	*icount = irec->ir_count - irec->ir_freecount;
 	return 0;
 }
 
@@ -295,7 +292,7 @@ xfs_iwalk_ag_start(
 	 * We require a lookup cache of at least two elements so that we don't
 	 * have to deal with tearing down the cursor to walk the records.
 	 */
-	error = xfs_iwalk_grab_ichunk(*curpp, agino - 1, &icount,
+	error = xfs_iwalk_grab_ichunk(*curpp, agino, &icount,
 			&iwag->recs[iwag->nr_recs]);
 	if (error)
 		return error;

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 08/11] xfs: clean up long conditionals in xfs_iwalk_ichunk_ra
  2019-05-29 22:26 [PATCH 00/11] xfs: refactor and improve inode iteration Darrick J. Wong
                   ` (6 preceding siblings ...)
  2019-05-29 22:26 ` [PATCH 07/11] xfs: change xfs_iwalk_grab_ichunk to use startino, not lastino Darrick J. Wong
@ 2019-05-29 22:27 ` Darrick J. Wong
  2019-05-29 22:27 ` [PATCH 09/11] xfs: multithreaded iwalk implementation Darrick J. Wong
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 19+ messages in thread
From: Darrick J. Wong @ 2019-05-29 22:27 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Refactor xfs_iwalk_ichunk_ra to avoid long conditionals.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_iwalk.c |    6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)


diff --git a/fs/xfs/xfs_iwalk.c b/fs/xfs/xfs_iwalk.c
index 8e7881e95674..3c523afdcfa0 100644
--- a/fs/xfs/xfs_iwalk.c
+++ b/fs/xfs/xfs_iwalk.c
@@ -76,8 +76,10 @@ xfs_iwalk_ichunk_ra(
 	     i < XFS_INODES_PER_CHUNK;
 	     i += igeo->ig_inodes_per_cluster,
 			agbno += igeo->ig_blocks_per_cluster) {
-		if (xfs_inobt_maskn(i, igeo->ig_inodes_per_cluster) &
-		    ~irec->ir_free) {
+		xfs_inofree_t	imask;
+
+		imask = xfs_inobt_maskn(i, igeo->ig_inodes_per_cluster);
+		if (imask & ~irec->ir_free) {
 			xfs_btree_reada_bufs(mp, agno, agbno,
 					igeo->ig_blocks_per_cluster,
 					&xfs_inode_buf_ops);

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 09/11] xfs: multithreaded iwalk implementation
  2019-05-29 22:26 [PATCH 00/11] xfs: refactor and improve inode iteration Darrick J. Wong
                   ` (7 preceding siblings ...)
  2019-05-29 22:27 ` [PATCH 08/11] xfs: clean up long conditionals in xfs_iwalk_ichunk_ra Darrick J. Wong
@ 2019-05-29 22:27 ` Darrick J. Wong
  2019-05-29 22:27 ` [PATCH 10/11] xfs: poll waiting for quotacheck Darrick J. Wong
  2019-05-29 22:27 ` [PATCH 11/11] xfs: refactor INUMBERS to use iwalk functions Darrick J. Wong
  10 siblings, 0 replies; 19+ messages in thread
From: Darrick J. Wong @ 2019-05-29 22:27 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Create a parallel iwalk implementation and switch quotacheck to use it.
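
The pwork interface added here is used roughly like this (an
illustrative sketch; my_work_fn, my_tag, and item are placeholders, and
xfs_iwalk_threaded() below is the real caller):

	struct xfs_pwork_ctl	pctl;
	int			error;

	error = xfs_pwork_init(mp, &pctl, my_work_fn, "my_tag", nr_threads);
	if (error)
		return error;
	/* ...one xfs_pwork_queue() call per work item... */
	xfs_pwork_queue(&pctl, &item->pwork);
	return xfs_pwork_destroy(&pctl);	/* waits; returns first error */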

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/Makefile      |    1 
 fs/xfs/xfs_globals.c |    3 +
 fs/xfs/xfs_iwalk.c   |   76 +++++++++++++++++++++++++++++++-
 fs/xfs/xfs_iwalk.h   |    2 +
 fs/xfs/xfs_pwork.c   |  118 ++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_pwork.h   |   50 +++++++++++++++++++++
 fs/xfs/xfs_qm.c      |    2 -
 fs/xfs/xfs_sysctl.h  |    6 +++
 fs/xfs/xfs_sysfs.c   |   40 +++++++++++++++++
 fs/xfs/xfs_trace.h   |   18 ++++++++
 10 files changed, 313 insertions(+), 3 deletions(-)
 create mode 100644 fs/xfs/xfs_pwork.c
 create mode 100644 fs/xfs/xfs_pwork.h


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 74d30ef0dbce..48940a27d4aa 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -84,6 +84,7 @@ xfs-y				+= xfs_aops.o \
 				   xfs_message.o \
 				   xfs_mount.o \
 				   xfs_mru_cache.o \
+				   xfs_pwork.o \
 				   xfs_reflink.o \
 				   xfs_stats.o \
 				   xfs_super.o \
diff --git a/fs/xfs/xfs_globals.c b/fs/xfs/xfs_globals.c
index d0d377384120..4f93f2c4dc38 100644
--- a/fs/xfs/xfs_globals.c
+++ b/fs/xfs/xfs_globals.c
@@ -31,6 +31,9 @@ xfs_param_t xfs_params = {
 	.fstrm_timer	= {	1,		30*100,		3600*100},
 	.eofb_timer	= {	1,		300,		3600*24},
 	.cowb_timer	= {	1,		1800,		3600*24},
+#ifdef DEBUG
+	.pwork_threads	= {	0,		0,		NR_CPUS	},
+#endif
 };
 
 struct xfs_globals xfs_globals = {
diff --git a/fs/xfs/xfs_iwalk.c b/fs/xfs/xfs_iwalk.c
index 3c523afdcfa0..453790bae194 100644
--- a/fs/xfs/xfs_iwalk.c
+++ b/fs/xfs/xfs_iwalk.c
@@ -21,6 +21,7 @@
 #include "xfs_health.h"
 #include "xfs_trans.h"
 #include "xfs_iwalk.h"
+#include "xfs_pwork.h"
 
 /*
  * Walking All the Inodes in the Filesystem
@@ -38,6 +39,9 @@
  */
 
 struct xfs_iwalk_ag {
+	/* parallel work control data; will be null if single threaded */
+	struct xfs_pwork		pwork;
+
 	struct xfs_mount		*mp;
 	struct xfs_trans		*tp;
 
@@ -190,6 +194,9 @@ xfs_iwalk_ag_recs(
 		trace_xfs_iwalk_ag_rec(mp, agno, irec->ir_startino,
 				irec->ir_free);
 		for (j = 0; j < XFS_INODES_PER_CHUNK; j++) {
+			if (xfs_pwork_want_abort(&iwag->pwork))
+				return 0;
+
 			/* Skip if this inode is free */
 			if (XFS_INOBT_MASK(j) & irec->ir_free)
 				continue;
@@ -374,7 +381,7 @@ xfs_iwalk_ag(
 	if (error)
 		goto out_cur;
 
-	while (has_more) {
+	while (has_more && !xfs_pwork_want_abort(&iwag->pwork)) {
 		struct xfs_inobt_rec_incore	*irec;
 
 		/* Fetch the inobt record. */
@@ -410,7 +417,7 @@ xfs_iwalk_ag(
 	}
 
 	/* Walk any records left behind in the cache. */
-	if (iwag->nr_recs) {
+	if (iwag->nr_recs && !xfs_pwork_want_abort(&iwag->pwork)) {
 		xfs_iwalk_del_inobt(tp, &cur, &agi_bp, error);
 		return xfs_iwalk_ag_recs(iwag);
 	}
@@ -469,6 +476,7 @@ xfs_iwalk(
 		.iwalk_fn	= iwalk_fn,
 		.data		= data,
 		.startino	= startino,
+		.pwork		= XFS_PWORK_SINGLE_THREADED,
 	};
 	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, startino);
 	int			error;
@@ -490,3 +498,67 @@ xfs_iwalk(
 	xfs_iwalk_freebuf(&iwag);
 	return error;
 }
+
+/* Run per-thread iwalk work. */
+static int
+xfs_iwalk_ag_work(
+	struct xfs_mount	*mp,
+	struct xfs_pwork	*pwork)
+{
+	struct xfs_iwalk_ag	*iwag;
+	int			error;
+
+	iwag = container_of(pwork, struct xfs_iwalk_ag, pwork);
+	error = xfs_iwalk_allocbuf(iwag);
+	if (error)
+		goto out;
+
+	error = xfs_iwalk_ag(iwag);
+	xfs_iwalk_freebuf(iwag);
+out:
+	kmem_free(iwag);
+	return error;
+}
+
+/*
+ * Walk all the inodes in the filesystem using multiple threads to process each
+ * AG.
+ */
+int
+xfs_iwalk_threaded(
+	struct xfs_mount	*mp,
+	xfs_ino_t		startino,
+	xfs_iwalk_fn		iwalk_fn,
+	unsigned int		max_prefetch,
+	void			*data)
+{
+	struct xfs_pwork_ctl	pctl;
+	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, startino);
+	unsigned int		nr_threads;
+	int			error;
+
+	ASSERT(agno < mp->m_sb.sb_agcount);
+
+	nr_threads = xfs_pwork_guess_datadev_parallelism(mp);
+	error = xfs_pwork_init(mp, &pctl, xfs_iwalk_ag_work, "xfs_iwalk",
+			nr_threads);
+	if (error)
+		return error;
+
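+	/* Queue one work item per AG; each worker thread walks a single AG. */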
+	for (; agno < mp->m_sb.sb_agcount; agno++) {
+		struct xfs_iwalk_ag	*iwag;
+
+		iwag = kmem_alloc(sizeof(struct xfs_iwalk_ag), KM_SLEEP);
+		iwag->mp = mp;
+		iwag->tp = NULL;
+		iwag->iwalk_fn = iwalk_fn;
+		iwag->data = data;
+		iwag->startino = startino;
+		iwag->recs = NULL;
+		xfs_iwalk_set_prefetch(iwag, max_prefetch);
+		xfs_pwork_queue(&pctl, &iwag->pwork);
+		startino = XFS_AGINO_TO_INO(mp, agno + 1, 0);
+	}
+
+	return xfs_pwork_destroy(&pctl);
+}
diff --git a/fs/xfs/xfs_iwalk.h b/fs/xfs/xfs_iwalk.h
index 45b1baabcd2d..40233a05a766 100644
--- a/fs/xfs/xfs_iwalk.h
+++ b/fs/xfs/xfs_iwalk.h
@@ -14,5 +14,7 @@ typedef int (*xfs_iwalk_fn)(struct xfs_mount *mp, struct xfs_trans *tp,
 
 int xfs_iwalk(struct xfs_mount *mp, struct xfs_trans *tp, xfs_ino_t startino,
 		xfs_iwalk_fn iwalk_fn, unsigned int max_prefetch, void *data);
+int xfs_iwalk_threaded(struct xfs_mount *mp, xfs_ino_t startino,
+		xfs_iwalk_fn iwalk_fn, unsigned int max_prefetch, void *data);
 
 #endif /* __XFS_IWALK_H__ */
diff --git a/fs/xfs/xfs_pwork.c b/fs/xfs/xfs_pwork.c
new file mode 100644
index 000000000000..c1419558b089
--- /dev/null
+++ b/fs/xfs/xfs_pwork.c
@@ -0,0 +1,118 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Copyright (C) 2019 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_trace.h"
+#include "xfs_sysctl.h"
+#include "xfs_pwork.h"
+
+/*
+ * Parallel Work Queue
+ * ===================
+ * Abstract away the details of running a large and "obviously" parallelizable
+ * task across multiple CPUs.  Callers initialize the pwork control object with
+ * a desired level of parallelization and a work function.  Next, they embed
+ * struct xfs_pwork in whatever structure they use to pass work context to a
+ * worker thread and queue that pwork.  The work function will be passed the
+ * pwork item when it is run (from process context) and any returned error will
+ * cause all threads to abort.
+ */
+
+/* Invoke our caller's function. */
+static void
+xfs_pwork_work(
+	struct work_struct	*work)
+{
+	struct xfs_pwork	*pwork;
+	struct xfs_pwork_ctl	*pctl;
+	int			error;
+
+	pwork = container_of(work, struct xfs_pwork, work);
+	pctl = pwork->pctl;
+	error = pctl->work_fn(pctl->mp, pwork);
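+	/* Record the first failure so that the other workers can abort. */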
+	if (error && !pctl->error)
+		pctl->error = error;
+}
+
+/*
+ * Set up control data for parallel work.  @work_fn is the function that will
+ * be called.  @tag will be written into the kernel threads.  @nr_threads is
+ * the level of parallelism desired, or 0 for no limit.
+ */
+int
+xfs_pwork_init(
+	struct xfs_mount	*mp,
+	struct xfs_pwork_ctl	*pctl,
+	xfs_pwork_work_fn	work_fn,
+	const char		*tag,
+	unsigned int		nr_threads)
+{
+#ifdef DEBUG
+	if (xfs_globals.pwork_threads > 0)
+		nr_threads = xfs_globals.pwork_threads;
+#endif
+	trace_xfs_pwork_init(mp, nr_threads, current->pid);
+
+	pctl->wq = alloc_workqueue("%s-%d", WQ_FREEZABLE, nr_threads, tag,
+			current->pid);
+	if (!pctl->wq)
+		return -ENOMEM;
+	pctl->work_fn = work_fn;
+	pctl->error = 0;
+	pctl->mp = mp;
+
+	return 0;
+}
+
+/* Queue some parallel work. */
+void
+xfs_pwork_queue(
+	struct xfs_pwork_ctl	*pctl,
+	struct xfs_pwork	*pwork)
+{
+	INIT_WORK(&pwork->work, xfs_pwork_work);
+	pwork->pctl = pctl;
+	queue_work(pctl->wq, &pwork->work);
+}
+
+/* Wait for the work to finish and tear down the control structure. */
+int
+xfs_pwork_destroy(
+	struct xfs_pwork_ctl	*pctl)
+{
+	destroy_workqueue(pctl->wq);
+	pctl->wq = NULL;
+	return pctl->error;
+}
+
+/*
+ * Return the amount of parallelism that the data device can handle, or 0 for
+ * no limit.
+ */
+unsigned int
+xfs_pwork_guess_datadev_parallelism(
+	struct xfs_mount	*mp)
+{
+	struct xfs_buftarg	*btp = mp->m_ddev_targp;
+	int			iomin;
+	int			ioopt;
+
+	if (blk_queue_nonrot(btp->bt_bdev->bd_queue))
+		return num_online_cpus();
+	if (mp->m_sb.sb_width && mp->m_sb.sb_unit)
+		return mp->m_sb.sb_width / mp->m_sb.sb_unit;
+	iomin = bdev_io_min(btp->bt_bdev);
+	ioopt = bdev_io_opt(btp->bt_bdev);
+	if (iomin && ioopt)
+		return ioopt / iomin;
+
+	return 1;
+}
diff --git a/fs/xfs/xfs_pwork.h b/fs/xfs/xfs_pwork.h
new file mode 100644
index 000000000000..e0c1354a2d8c
--- /dev/null
+++ b/fs/xfs/xfs_pwork.h
@@ -0,0 +1,50 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * Copyright (C) 2019 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ */
+#ifndef __XFS_PWORK_H__
+#define __XFS_PWORK_H__
+
+struct xfs_pwork;
+struct xfs_mount;
+
+typedef int (*xfs_pwork_work_fn)(struct xfs_mount *mp, struct xfs_pwork *pwork);
+
+/*
+ * Parallel work coordination structure.
+ */
+struct xfs_pwork_ctl {
+	struct workqueue_struct	*wq;
+	struct xfs_mount	*mp;
+	xfs_pwork_work_fn	work_fn;
+	int			error;
+};
+
+/*
+ * Embed this parallel work control item inside your own work structure,
+ * then queue work with it.
+ */
+struct xfs_pwork {
+	struct work_struct	work;
+	struct xfs_pwork_ctl	*pctl;
+};
+
+#define XFS_PWORK_SINGLE_THREADED	{ .pctl = NULL }
+
+/* Have we been told to abort? */
+static inline bool
+xfs_pwork_want_abort(
+	struct xfs_pwork	*pwork)
+{
+	return pwork->pctl && pwork->pctl->error;
+}
+
+int xfs_pwork_init(struct xfs_mount *mp, struct xfs_pwork_ctl *pctl,
+		xfs_pwork_work_fn work_fn, const char *tag,
+		unsigned int nr_threads);
+void xfs_pwork_queue(struct xfs_pwork_ctl *pctl, struct xfs_pwork *pwork);
+int xfs_pwork_destroy(struct xfs_pwork_ctl *pctl);
+unsigned int xfs_pwork_guess_datadev_parallelism(struct xfs_mount *mp);
+
+#endif /* __XFS_PWORK_H__ */
diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
index a5b2260406a8..e4f3785f7a64 100644
--- a/fs/xfs/xfs_qm.c
+++ b/fs/xfs/xfs_qm.c
@@ -1305,7 +1305,7 @@ xfs_qm_quotacheck(
 		flags |= XFS_PQUOTA_CHKD;
 	}
 
-	error = xfs_iwalk(mp, NULL, 0, xfs_qm_dqusage_adjust, 0, NULL);
+	error = xfs_iwalk_threaded(mp, 0, xfs_qm_dqusage_adjust, 0, NULL);
 	if (error)
 		goto error_return;
 
diff --git a/fs/xfs/xfs_sysctl.h b/fs/xfs/xfs_sysctl.h
index ad7f9be13087..b555e045e2f4 100644
--- a/fs/xfs/xfs_sysctl.h
+++ b/fs/xfs/xfs_sysctl.h
@@ -37,6 +37,9 @@ typedef struct xfs_param {
 	xfs_sysctl_val_t fstrm_timer;	/* Filestream dir-AG assoc'n timeout. */
 	xfs_sysctl_val_t eofb_timer;	/* Interval between eofb scan wakeups */
 	xfs_sysctl_val_t cowb_timer;	/* Interval between cowb scan wakeups */
+#ifdef DEBUG
+	xfs_sysctl_val_t pwork_threads;	/* Parallel workqueue thread count */
+#endif
 } xfs_param_t;
 
 /*
@@ -82,6 +85,9 @@ enum {
 extern xfs_param_t	xfs_params;
 
 struct xfs_globals {
+#ifdef DEBUG
+	int	pwork_threads;		/* parallel workqueue threads */
+#endif
 	int	log_recovery_delay;	/* log recovery delay (secs) */
 	int	mount_delay;		/* mount setup delay (secs) */
 	bool	bug_on_assert;		/* BUG() the kernel on assert failure */
diff --git a/fs/xfs/xfs_sysfs.c b/fs/xfs/xfs_sysfs.c
index cabda13f3c64..910e6b9cb1a7 100644
--- a/fs/xfs/xfs_sysfs.c
+++ b/fs/xfs/xfs_sysfs.c
@@ -206,11 +206,51 @@ always_cow_show(
 }
 XFS_SYSFS_ATTR_RW(always_cow);
 
+#ifdef DEBUG
+/*
+ * Override how many threads the parallel work queue is allowed to create.
+ * This has to be a debug-only global (instead of an errortag) because one of
+ * the main users of parallel workqueues is mount time quotacheck.
+ */
+STATIC ssize_t
+pwork_threads_store(
+	struct kobject	*kobject,
+	const char	*buf,
+	size_t		count)
+{
+	int		ret;
+	int		val;
+
+	ret = kstrtoint(buf, 0, &val);
+	if (ret)
+		return ret;
+
+	if (val < 0 || val > NR_CPUS)
+		return -EINVAL;
+
+	xfs_globals.pwork_threads = val;
+
+	return count;
+}
+
+STATIC ssize_t
+pwork_threads_show(
+	struct kobject	*kobject,
+	char		*buf)
+{
+	return snprintf(buf, PAGE_SIZE, "%d\n", xfs_globals.pwork_threads);
+}
+XFS_SYSFS_ATTR_RW(pwork_threads);
+#endif /* DEBUG */
+
 static struct attribute *xfs_dbg_attrs[] = {
 	ATTR_LIST(bug_on_assert),
 	ATTR_LIST(log_recovery_delay),
 	ATTR_LIST(mount_delay),
 	ATTR_LIST(always_cow),
+#ifdef DEBUG
+	ATTR_LIST(pwork_threads),
+#endif
 	NULL,
 };
 
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index a2881659f776..5c0eea7bfcb0 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -3556,6 +3556,24 @@ TRACE_EVENT(xfs_iwalk_ag_rec,
 		  __entry->startino, __entry->freemask)
 )
 
+TRACE_EVENT(xfs_pwork_init,
+	TP_PROTO(struct xfs_mount *mp, unsigned int nr_threads, pid_t pid),
+	TP_ARGS(mp, nr_threads, pid),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(unsigned int, nr_threads)
+		__field(pid_t, pid)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->nr_threads = nr_threads;
+		__entry->pid = pid;
+	),
+	TP_printk("dev %d:%d nr_threads %u pid %u",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->nr_threads, __entry->pid)
+)
+
 #endif /* _TRACE_XFS_H */
 
 #undef TRACE_INCLUDE_PATH

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 10/11] xfs: poll waiting for quotacheck
  2019-05-29 22:26 [PATCH 00/11] xfs: refactor and improve inode iteration Darrick J. Wong
                   ` (8 preceding siblings ...)
  2019-05-29 22:27 ` [PATCH 09/11] xfs: multithreaded iwalk implementation Darrick J. Wong
@ 2019-05-29 22:27 ` Darrick J. Wong
  2019-05-29 22:27 ` [PATCH 11/11] xfs: refactor INUMBERS to use iwalk functions Darrick J. Wong
  10 siblings, 0 replies; 19+ messages in thread
From: Darrick J. Wong @ 2019-05-29 22:27 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Create a pwork destroy function that uses polling instead of
uninterruptible sleep to wait for work items to finish so that we can
touch the softlockup watchdog.  IOWs, gross hack.
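
For reference, the caller pattern this enables looks roughly like the
following sketch (my_work_fn, my_tag, and item are placeholder names,
and the error handling is elided):

	struct xfs_pwork_ctl	pctl;

	error = xfs_pwork_init(mp, &pctl, my_work_fn, "my_tag", nr_threads);
	...
	/* each work item embeds a struct xfs_pwork */
	xfs_pwork_queue(&pctl, &item->pwork);
	...
	/* mount-time callers hold locks, so poll instead of sleeping */
	error = xfs_pwork_destroy_poll(&pctl);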

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_iwalk.c |    3 +++
 fs/xfs/xfs_iwalk.h |    3 ++-
 fs/xfs/xfs_pwork.c |   21 +++++++++++++++++++++
 fs/xfs/xfs_pwork.h |    2 ++
 fs/xfs/xfs_qm.c    |    2 +-
 5 files changed, 29 insertions(+), 2 deletions(-)


diff --git a/fs/xfs/xfs_iwalk.c b/fs/xfs/xfs_iwalk.c
index 453790bae194..7f40d0633651 100644
--- a/fs/xfs/xfs_iwalk.c
+++ b/fs/xfs/xfs_iwalk.c
@@ -530,6 +530,7 @@ xfs_iwalk_threaded(
 	xfs_ino_t		startino,
 	xfs_iwalk_fn		iwalk_fn,
 	unsigned int		max_prefetch,
+	bool			polled,
 	void			*data)
 {
 	struct xfs_pwork_ctl	pctl;
@@ -560,5 +561,7 @@ xfs_iwalk_threaded(
 		startino = XFS_AGINO_TO_INO(mp, agno + 1, 0);
 	}
 
+	if (polled)
+		return xfs_pwork_destroy_poll(&pctl);
 	return xfs_pwork_destroy(&pctl);
 }
diff --git a/fs/xfs/xfs_iwalk.h b/fs/xfs/xfs_iwalk.h
index 40233a05a766..76d8f87a39ef 100644
--- a/fs/xfs/xfs_iwalk.h
+++ b/fs/xfs/xfs_iwalk.h
@@ -15,6 +15,7 @@ typedef int (*xfs_iwalk_fn)(struct xfs_mount *mp, struct xfs_trans *tp,
 int xfs_iwalk(struct xfs_mount *mp, struct xfs_trans *tp, xfs_ino_t startino,
 		xfs_iwalk_fn iwalk_fn, unsigned int max_prefetch, void *data);
 int xfs_iwalk_threaded(struct xfs_mount *mp, xfs_ino_t startino,
-		xfs_iwalk_fn iwalk_fn, unsigned int max_prefetch, void *data);
+		xfs_iwalk_fn iwalk_fn, unsigned int max_prefetch, bool poll,
+		void *data);
 
 #endif /* __XFS_IWALK_H__ */
diff --git a/fs/xfs/xfs_pwork.c b/fs/xfs/xfs_pwork.c
index c1419558b089..a9c39615233a 100644
--- a/fs/xfs/xfs_pwork.c
+++ b/fs/xfs/xfs_pwork.c
@@ -13,6 +13,7 @@
 #include "xfs_trace.h"
 #include "xfs_sysctl.h"
 #include "xfs_pwork.h"
+#include <linux/nmi.h>
 
 /*
  * Parallel Work Queue
@@ -40,6 +41,7 @@ xfs_pwork_work(
 	error = pctl->work_fn(pctl->mp, pwork);
 	if (error && !pctl->error)
 		pctl->error = error;
+	atomic_dec(&pctl->nr_work);
 }
 
 /*
@@ -68,6 +70,7 @@ xfs_pwork_init(
 	pctl->work_fn = work_fn;
 	pctl->error = 0;
 	pctl->mp = mp;
+	atomic_set(&pctl->nr_work, 0);
 
 	return 0;
 }
@@ -80,6 +83,7 @@ xfs_pwork_queue(
 {
 	INIT_WORK(&pwork->work, xfs_pwork_work);
 	pwork->pctl = pctl;
+	atomic_inc(&pctl->nr_work);
 	queue_work(pctl->wq, &pwork->work);
 }
 
@@ -93,6 +97,23 @@ xfs_pwork_destroy(
 	return pctl->error;
 }
 
+/*
+ * Wait for the work to finish and tear down the control structure.
+ * Continually poll completion status and touch the soft lockup watchdog.
+ * This is for things like mount that hold locks.
+ */
+int
+xfs_pwork_destroy_poll(
+	struct xfs_pwork_ctl	*pctl)
+{
+	while (atomic_read(&pctl->nr_work) > 0) {
+		msleep(1);
+		touch_softlockup_watchdog();
+	}
+
+	return xfs_pwork_destroy(pctl);
+}
+
 /*
  * Return the amount of parallelism that the data device can handle, or 0 for
  * no limit.
diff --git a/fs/xfs/xfs_pwork.h b/fs/xfs/xfs_pwork.h
index e0c1354a2d8c..08da723a8dc9 100644
--- a/fs/xfs/xfs_pwork.h
+++ b/fs/xfs/xfs_pwork.h
@@ -18,6 +18,7 @@ struct xfs_pwork_ctl {
 	struct workqueue_struct	*wq;
 	struct xfs_mount	*mp;
 	xfs_pwork_work_fn	work_fn;
+	atomic_t		nr_work;
 	int			error;
 };
 
@@ -45,6 +46,7 @@ int xfs_pwork_init(struct xfs_mount *mp, struct xfs_pwork_ctl *pctl,
 		unsigned int nr_threads);
 void xfs_pwork_queue(struct xfs_pwork_ctl *pctl, struct xfs_pwork *pwork);
 int xfs_pwork_destroy(struct xfs_pwork_ctl *pctl);
+int xfs_pwork_destroy_poll(struct xfs_pwork_ctl *pctl);
 unsigned int xfs_pwork_guess_datadev_parallelism(struct xfs_mount *mp);
 
 #endif /* __XFS_PWORK_H__ */
diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
index e4f3785f7a64..de6a623ada02 100644
--- a/fs/xfs/xfs_qm.c
+++ b/fs/xfs/xfs_qm.c
@@ -1305,7 +1305,7 @@ xfs_qm_quotacheck(
 		flags |= XFS_PQUOTA_CHKD;
 	}
 
-	error = xfs_iwalk_threaded(mp, 0, xfs_qm_dqusage_adjust, 0, NULL);
+	error = xfs_iwalk_threaded(mp, 0, xfs_qm_dqusage_adjust, 0, true, NULL);
 	if (error)
 		goto error_return;
 

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 11/11] xfs: refactor INUMBERS to use iwalk functions
  2019-05-29 22:26 [PATCH 00/11] xfs: refactor and improve inode iteration Darrick J. Wong
                   ` (9 preceding siblings ...)
  2019-05-29 22:27 ` [PATCH 10/11] xfs: poll waiting for quotacheck Darrick J. Wong
@ 2019-05-29 22:27 ` Darrick J. Wong
  10 siblings, 0 replies; 19+ messages in thread
From: Darrick J. Wong @ 2019-05-29 22:27 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs

From: Darrick J. Wong <darrick.wong@oracle.com>

Now that we have generic functions to walk inode records, refactor the
INUMBERS implementation to use it.
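
With this, an INUMBERS-style consumer reduces to an xfs_inobt_walk_fn
callback.  As an illustrative sketch (count_chunks is a made-up
example, not part of this patch):

	STATIC int
	count_chunks(
		struct xfs_mount	*mp,
		struct xfs_trans	*tp,
		xfs_agnumber_t		agno,
		const struct xfs_inobt_rec_incore *irec,
		void			*data)
	{
		unsigned int		*nr = data;

		(*nr)++;	/* called once per inobt record */
		return 0;	/* or XFS_INOBT_WALK_ABORT to stop early */
	}

	error = xfs_inobt_walk(mp, NULL, 0, count_chunks, 0, &nr);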

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_ioctl.c   |   20 ++++--
 fs/xfs/xfs_ioctl.h   |    2 +
 fs/xfs/xfs_ioctl32.c |   35 ++++------
 fs/xfs/xfs_itable.c  |  168 ++++++++++++++++++++------------------------------
 fs/xfs/xfs_itable.h  |   22 +------
 fs/xfs/xfs_iwalk.c   |  155 ++++++++++++++++++++++++++++++++++++++++++++--
 fs/xfs/xfs_iwalk.h   |   12 ++++
 7 files changed, 256 insertions(+), 158 deletions(-)


diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 43734901aeb9..4fa9a2c8b029 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -732,6 +732,16 @@ xfs_bulkstat_one_fmt(
 	return xfs_ibulk_advance(breq, sizeof(struct xfs_bstat));
 }
 
+int
+xfs_inumbers_fmt(
+	struct xfs_ibulk	*breq,
+	const struct xfs_inogrp	*igrp)
+{
+	if (copy_to_user(breq->ubuffer, igrp, sizeof(*igrp)))
+		return -EFAULT;
+	return xfs_ibulk_advance(breq, sizeof(struct xfs_inogrp));
+}
+
 STATIC int
 xfs_ioc_bulkstat(
 	xfs_mount_t		*mp,
@@ -779,13 +789,9 @@ xfs_ioc_bulkstat(
 	 * parameter to maintain correct function.
 	 */
 	if (cmd == XFS_IOC_FSINUMBERS) {
-		int	count = breq.icount;
-
-		breq.startino = lastino;
-		error = xfs_inumbers(mp, &breq.startino, &count,
-					bulkreq.ubuffer, xfs_inumbers_fmt);
-		breq.ocount = count;
-		lastino = breq.startino;
+		breq.startino = lastino + 1;
+		error = xfs_inumbers(&breq, xfs_inumbers_fmt);
+		lastino = breq.startino - 1;
 	} else if (cmd == XFS_IOC_FSBULKSTAT_SINGLE) {
 		breq.startino = lastino;
 		error = xfs_bulkstat_one(&breq, xfs_bulkstat_one_fmt);
diff --git a/fs/xfs/xfs_ioctl.h b/fs/xfs/xfs_ioctl.h
index f32c8aadfeba..fb303eaa8863 100644
--- a/fs/xfs/xfs_ioctl.h
+++ b/fs/xfs/xfs_ioctl.h
@@ -79,7 +79,9 @@ xfs_set_dmattrs(
 
 struct xfs_ibulk;
 struct xfs_bstat;
+struct xfs_inogrp;
 
 int xfs_bulkstat_one_fmt(struct xfs_ibulk *breq, const struct xfs_bstat *bstat);
+int xfs_inumbers_fmt(struct xfs_ibulk *breq, const struct xfs_inogrp *igrp);
 
 #endif
diff --git a/fs/xfs/xfs_ioctl32.c b/fs/xfs/xfs_ioctl32.c
index add15819daf3..dd53a9692e68 100644
--- a/fs/xfs/xfs_ioctl32.c
+++ b/fs/xfs/xfs_ioctl32.c
@@ -85,22 +85,17 @@ xfs_compat_growfs_rt_copyin(
 
 STATIC int
 xfs_inumbers_fmt_compat(
-	void			__user *ubuffer,
-	const struct xfs_inogrp	*buffer,
-	long			count,
-	long			*written)
+	struct xfs_ibulk	*breq,
+	const struct xfs_inogrp	*igrp)
 {
-	compat_xfs_inogrp_t	__user *p32 = ubuffer;
-	long			i;
+	struct compat_xfs_inogrp __user *p32 = breq->ubuffer;
 
-	for (i = 0; i < count; i++) {
-		if (put_user(buffer[i].xi_startino,   &p32[i].xi_startino) ||
-		    put_user(buffer[i].xi_alloccount, &p32[i].xi_alloccount) ||
-		    put_user(buffer[i].xi_allocmask,  &p32[i].xi_allocmask))
-			return -EFAULT;
-	}
-	*written = count * sizeof(*p32);
-	return 0;
+	if (put_user(igrp->xi_startino,   &p32->xi_startino) ||
+	    put_user(igrp->xi_alloccount, &p32->xi_alloccount) ||
+	    put_user(igrp->xi_allocmask,  &p32->xi_allocmask))
+		return -EFAULT;
+
+	return xfs_ibulk_advance(breq, sizeof(struct compat_xfs_inogrp));
 }
 
 #else
@@ -225,7 +220,7 @@ xfs_compat_ioc_bulkstat(
 	 * to userpace memory via bulkreq.ubuffer.  Normally the compat
 	 * functions and structure size are the correct ones to use ...
 	 */
-	inumbers_fmt_pf inumbers_func = xfs_inumbers_fmt_compat;
+	inumbers_fmt_pf		inumbers_func = xfs_inumbers_fmt_compat;
 	bulkstat_one_fmt_pf	bs_one_func = xfs_bulkstat_one_fmt_compat;
 
 #ifdef CONFIG_X86_X32
@@ -286,13 +281,9 @@ xfs_compat_ioc_bulkstat(
 	 * parameter to maintain correct function.
 	 */
 	if (cmd == XFS_IOC_FSINUMBERS_32) {
-		int	count = breq.icount;
-
-		breq.startino = lastino;
-		error = xfs_inumbers(mp, &breq.startino, &count,
-				bulkreq.ubuffer, inumbers_func);
-		breq.ocount = count;
-		lastino = breq.startino;
+		breq.startino = lastino + 1;
+		error = xfs_inumbers(&breq, inumbers_func);
+		lastino = breq.startino - 1;
 	} else if (cmd == XFS_IOC_FSBULKSTAT_SINGLE_32) {
 		breq.startino = lastino;
 		error = xfs_bulkstat_one(&breq, bs_one_func);
diff --git a/fs/xfs/xfs_itable.c b/fs/xfs/xfs_itable.c
index 06abe5c9c0ee..bade54d6ac64 100644
--- a/fs/xfs/xfs_itable.c
+++ b/fs/xfs/xfs_itable.c
@@ -259,121 +259,85 @@ xfs_bulkstat(
 	return error;
 }
 
-int
-xfs_inumbers_fmt(
-	void			__user *ubuffer, /* buffer to write to */
-	const struct xfs_inogrp	*buffer,	/* buffer to read from */
-	long			count,		/* # of elements to read */
-	long			*written)	/* # of bytes written */
+struct xfs_inumbers_chunk {
+	inumbers_fmt_pf		formatter;
+	struct xfs_ibulk	*breq;
+};
+
+/*
+ * INUMBERS
+ * ========
+ * This is how we export inode btree records to userspace, so that XFS tools
+ * can figure out where inodes are allocated.
+ */
+
+/*
+ * Format the inode group structure and report it somewhere.
+ *
+ * Similar to xfs_bulkstat_one_int, lastino is the inode cursor as we walk
+ * through the filesystem so we move it forward unless there was a runtime
+ * error.  If the formatter tells us the buffer is now full we also move the
+ * cursor forward and abort the walk.
+ */
+STATIC int
+xfs_inumbers_walk(
+	struct xfs_mount	*mp,
+	struct xfs_trans	*tp,
+	xfs_agnumber_t		agno,
+	const struct xfs_inobt_rec_incore *irec,
+	void			*data)
 {
-	if (copy_to_user(ubuffer, buffer, count * sizeof(*buffer)))
-		return -EFAULT;
-	*written = count * sizeof(*buffer);
-	return 0;
+	struct xfs_inogrp	inogrp = {
+		.xi_startino	= XFS_AGINO_TO_INO(mp, agno, irec->ir_startino),
+		.xi_alloccount	= irec->ir_count - irec->ir_freecount,
+		.xi_allocmask	= ~irec->ir_free,
+	};
+	struct xfs_inumbers_chunk *ic = data;
+	xfs_agino_t		agino;
+	int			error;
+
+	error = ic->formatter(ic->breq, &inogrp);
+	if (error && error != XFS_IBULK_BUFFER_FULL)
+		return error;
+	if (error == XFS_IBULK_BUFFER_FULL)
+		error = XFS_INOBT_WALK_ABORT;
+
+	agino = irec->ir_startino + XFS_INODES_PER_CHUNK;
+	ic->breq->startino = XFS_AGINO_TO_INO(mp, agno, agino);
+	return error;
 }
 
 /*
  * Return inode number table for the filesystem.
  */
-int					/* error status */
+int
 xfs_inumbers(
-	struct xfs_mount	*mp,/* mount point for filesystem */
-	xfs_ino_t		*lastino,/* last inode returned */
-	int			*count,/* size of buffer/count returned */
-	void			__user *ubuffer,/* buffer with inode descriptions */
+	struct xfs_ibulk	*breq,
 	inumbers_fmt_pf		formatter)
 {
-	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, *lastino);
-	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, *lastino);
-	struct xfs_btree_cur	*cur = NULL;
-	struct xfs_buf		*agbp = NULL;
-	struct xfs_inogrp	*buffer;
-	int			bcount;
-	int			left = *count;
-	int			bufidx = 0;
+	struct xfs_inumbers_chunk ic = {
+		.formatter	= formatter,
+		.breq		= breq,
+	};
 	int			error = 0;
 
-	*count = 0;
-	if (agno >= mp->m_sb.sb_agcount ||
-	    *lastino != XFS_AGINO_TO_INO(mp, agno, agino))
-		return error;
+	breq->ocount = 0;
 
-	bcount = min(left, (int)(PAGE_SIZE / sizeof(*buffer)));
-	buffer = kmem_zalloc(bcount * sizeof(*buffer), KM_SLEEP);
-	do {
-		struct xfs_inobt_rec_incore	r;
-		int				stat;
-
-		if (!agbp) {
-			error = xfs_ialloc_read_agi(mp, NULL, agno, &agbp);
-			if (error)
-				break;
-
-			cur = xfs_inobt_init_cursor(mp, NULL, agbp, agno,
-						    XFS_BTNUM_INO);
-			error = xfs_inobt_lookup(cur, agino, XFS_LOOKUP_GE,
-						 &stat);
-			if (error)
-				break;
-			if (!stat)
-				goto next_ag;
-		}
-
-		error = xfs_inobt_get_rec(cur, &r, &stat);
-		if (error)
-			break;
-		if (!stat)
-			goto next_ag;
-
-		agino = r.ir_startino + XFS_INODES_PER_CHUNK - 1;
-		buffer[bufidx].xi_startino =
-			XFS_AGINO_TO_INO(mp, agno, r.ir_startino);
-		buffer[bufidx].xi_alloccount = r.ir_count - r.ir_freecount;
-		buffer[bufidx].xi_allocmask = ~r.ir_free;
-		if (++bufidx == bcount) {
-			long	written;
-
-			error = formatter(ubuffer, buffer, bufidx, &written);
-			if (error)
-				break;
-			ubuffer += written;
-			*count += bufidx;
-			bufidx = 0;
-		}
-		if (!--left)
-			break;
-
-		error = xfs_btree_increment(cur, 0, &stat);
-		if (error)
-			break;
-		if (stat)
-			continue;
-
-next_ag:
-		xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
-		cur = NULL;
-		xfs_buf_relse(agbp);
-		agbp = NULL;
-		agino = 0;
-		agno++;
-	} while (agno < mp->m_sb.sb_agcount);
-
-	if (!error) {
-		if (bufidx) {
-			long	written;
-
-			error = formatter(ubuffer, buffer, bufidx, &written);
-			if (!error)
-				*count += bufidx;
-		}
-		*lastino = XFS_AGINO_TO_INO(mp, agno, agino);
-	}
+	if (xfs_bulkstat_already_done(breq->mp, breq->startino))
+		return 0;
+
+	error = xfs_inobt_walk(breq->mp, NULL, breq->startino,
+			xfs_inumbers_walk, breq->icount, &ic);
 
-	kmem_free(buffer);
-	if (cur)
-		xfs_btree_del_cursor(cur, error);
-	if (agbp)
-		xfs_buf_relse(agbp);
+	/*
+	 * We found some inode groups, so clear the error status and return
+	 * them.  The lastino pointer will point directly at the inode that
+	 * triggered any error that occurred, so on the next call the error
+	 * will be triggered again and propagated to userspace as there will be
+	 * no formatted inode groups in the buffer.
+	 */
+	if (breq->ocount > 0)
+		error = 0;
 
 	return error;
 }
diff --git a/fs/xfs/xfs_itable.h b/fs/xfs/xfs_itable.h
index a2562fe8d282..b4c89454e27a 100644
--- a/fs/xfs/xfs_itable.h
+++ b/fs/xfs/xfs_itable.h
@@ -46,25 +46,9 @@ typedef int (*bulkstat_one_fmt_pf)(struct xfs_ibulk *breq,
 int xfs_bulkstat_one(struct xfs_ibulk *breq, bulkstat_one_fmt_pf formatter);
 int xfs_bulkstat(struct xfs_ibulk *breq, bulkstat_one_fmt_pf formatter);
 
-typedef int (*inumbers_fmt_pf)(
-	void			__user *ubuffer, /* buffer to write to */
-	const xfs_inogrp_t	*buffer,	/* buffer to read from */
-	long			count,		/* # of elements to read */
-	long			*written);	/* # of bytes written */
+typedef int (*inumbers_fmt_pf)(struct xfs_ibulk *breq,
+		const struct xfs_inogrp *igrp);
 
-int
-xfs_inumbers_fmt(
-	void			__user *ubuffer, /* buffer to write to */
-	const xfs_inogrp_t	*buffer,	/* buffer to read from */
-	long			count,		/* # of elements to read */
-	long			*written);	/* # of bytes written */
-
-int					/* error status */
-xfs_inumbers(
-	xfs_mount_t		*mp,	/* mount point for filesystem */
-	xfs_ino_t		*last,	/* last inode returned */
-	int			*count,	/* size of buffer/count returned */
-	void			__user *buffer, /* buffer with inode info */
-	inumbers_fmt_pf		formatter);
+int xfs_inumbers(struct xfs_ibulk *breq, inumbers_fmt_pf formatter);
 
 #endif	/* __XFS_ITABLE_H__ */
diff --git a/fs/xfs/xfs_iwalk.c b/fs/xfs/xfs_iwalk.c
index 7f40d0633651..afa4b22ffb3d 100644
--- a/fs/xfs/xfs_iwalk.c
+++ b/fs/xfs/xfs_iwalk.c
@@ -54,7 +54,10 @@ struct xfs_iwalk_ag {
 	unsigned int			nr_recs;
 
 	/* Inode walk function and data pointer. */
-	xfs_iwalk_fn			iwalk_fn;
+	union {
+		xfs_iwalk_fn		iwalk_fn;
+		xfs_inobt_walk_fn	inobt_walk_fn;
+	};
 	void				*data;
 };
 
@@ -94,16 +97,18 @@ xfs_iwalk_ichunk_ra(
 
 /*
  * Lookup the inode chunk that the given @agino lives in and then get the
- * record if we found the chunk.  Set the bits in @irec's free mask that
- * correspond to the inodes before @agino so that we skip them.  This is how we
- * restart an inode walk that was interrupted in the middle of an inode record.
+ * record if we found the chunk.  If @trim is set, set the bits in @irec's free
+ * mask that correspond to the inodes before @agino so that we skip them.
+ * This is how we restart an inode walk that was interrupted in the middle of
+ * an inode record.
  */
 STATIC int
 xfs_iwalk_grab_ichunk(
 	struct xfs_btree_cur		*cur,	/* btree cursor */
 	xfs_agino_t			agino,	/* starting inode of chunk */
 	int				*icount,/* return # of inodes grabbed */
-	struct xfs_inobt_rec_incore	*irec)	/* btree record */
+	struct xfs_inobt_rec_incore	*irec,	/* btree record */
+	bool				trim)
 {
 	int				idx;	/* index into inode chunk */
 	int				stat;
@@ -131,6 +136,12 @@ xfs_iwalk_grab_ichunk(
 		return 0;
 	}
 
+	/* Return the entire record if the caller wants the whole thing. */
+	if (!trim) {
+		*icount = irec->ir_count;
+		return 0;
+	}
+
 	idx = agino - irec->ir_startino;
 
 	/*
@@ -278,7 +289,8 @@ xfs_iwalk_ag_start(
 	xfs_agino_t		agino,
 	struct xfs_btree_cur	**curpp,
 	struct xfs_buf		**agi_bpp,
-	int			*has_more)
+	int			*has_more,
+	bool			trim)
 {
 	struct xfs_mount	*mp = iwag->mp;
 	struct xfs_trans	*tp = iwag->tp;
@@ -302,7 +314,7 @@ xfs_iwalk_ag_start(
 	 * have to deal with tearing down the cursor to walk the records.
 	 */
 	error = xfs_iwalk_grab_ichunk(*curpp, agino, &icount,
-			&iwag->recs[iwag->nr_recs]);
+			&iwag->recs[iwag->nr_recs], trim);
 	if (error)
 		return error;
 	if (icount)
@@ -377,7 +389,8 @@ xfs_iwalk_ag(
 	/* Set up our cursor at the right place in the inode btree. */
 	agno = XFS_INO_TO_AGNO(mp, iwag->startino);
 	agino = XFS_INO_TO_AGINO(mp, iwag->startino);
-	error = xfs_iwalk_ag_start(iwag, agno, agino, &cur, &agi_bp, &has_more);
+	error = xfs_iwalk_ag_start(iwag, agno, agino, &cur, &agi_bp, &has_more,
+			true);
 	if (error)
 		goto out_cur;
 
@@ -565,3 +578,129 @@ xfs_iwalk_threaded(
 		return xfs_pwork_destroy_poll(&pctl);
 	return xfs_pwork_destroy(&pctl);
 }
+
+/* For each inuse inode in each cached inobt record, call our function. */
+STATIC int
+xfs_inobt_walk_ag_recs(
+	struct xfs_iwalk_ag		*iwag)
+{
+	struct xfs_mount		*mp = iwag->mp;
+	struct xfs_trans		*tp = iwag->tp;
+	struct xfs_inobt_rec_incore	*irec;
+	unsigned int			i;
+	xfs_agnumber_t			agno;
+	int				error;
+
+	agno = XFS_INO_TO_AGNO(mp, iwag->startino);
+	for (i = 0, irec = iwag->recs; i < iwag->nr_recs; i++, irec++) {
+		trace_xfs_iwalk_ag_rec(mp, agno, irec->ir_startino,
+				irec->ir_free);
+		error = iwag->inobt_walk_fn(mp, tp, agno, irec, iwag->data);
+		if (error)
+			return error;
+	}
+
+	iwag->nr_recs = 0;
+	return 0;
+}
+
+/*
+ * Walk all inode btree records in a single AG, from @iwag->startino to the end
+ * of the AG.
+ */
+STATIC int
+xfs_inobt_walk_ag(
+	struct xfs_iwalk_ag		*iwag)
+{
+	struct xfs_mount		*mp = iwag->mp;
+	struct xfs_buf			*agi_bp = NULL;
+	struct xfs_btree_cur		*cur = NULL;
+	xfs_agnumber_t			agno;
+	xfs_agino_t			agino;
+	int				has_more;
+	int				error = 0;
+
+	/* Set up our cursor at the right place in the inode btree. */
+	agno = XFS_INO_TO_AGNO(mp, iwag->startino);
+	agino = XFS_INO_TO_AGINO(mp, iwag->startino);
+	error = xfs_iwalk_ag_start(iwag, agno, agino, &cur, &agi_bp, &has_more,
+			false);
+	if (error)
+		goto out_cur;
+
+	while (has_more && !xfs_pwork_want_abort(&iwag->pwork)) {
+		struct xfs_inobt_rec_incore	*irec;
+
+		/* Fetch the inobt record. */
+		irec = &iwag->recs[iwag->nr_recs];
+		error = xfs_inobt_get_rec(cur, irec, &has_more);
+		if (error)
+			goto out_cur;
+		if (!has_more)
+			break;
+
+		/*
+		 * Add this inobt record to our cache, flush the cache if
+		 * needed, and move on to the next record.
+		 */
+		error = xfs_iwalk_ag_increment(iwag, xfs_inobt_walk_ag_recs,
+				agno, &cur, &agi_bp, &has_more);
+		if (error)
+			goto out_cur;
+		cond_resched();
+	}
+
+	/* Walk any records left behind in the cache. */
+	if (iwag->nr_recs && !xfs_pwork_want_abort(&iwag->pwork)) {
+		xfs_iwalk_del_inobt(iwag->tp, &cur, &agi_bp, error);
+		return xfs_inobt_walk_ag_recs(iwag);
+	}
+
+out_cur:
+	xfs_iwalk_del_inobt(iwag->tp, &cur, &agi_bp, error);
+	return error;
+}
+
+/*
+ * Walk all inode btree records in the filesystem starting from @startino.  The
+ * @inobt_walk_fn will be called for each btree record, being passed the incore
+ * record and @data.  @max_prefetch controls how many inobt records we try to
+ * cache ahead of time.
+ */
+int
+xfs_inobt_walk(
+	struct xfs_mount	*mp,
+	struct xfs_trans	*tp,
+	xfs_ino_t		startino,
+	xfs_inobt_walk_fn	inobt_walk_fn,
+	unsigned int		max_prefetch,
+	void			*data)
+{
+	struct xfs_iwalk_ag	iwag = {
+		.mp		= mp,
+		.tp		= tp,
+		.inobt_walk_fn	= inobt_walk_fn,
+		.data		= data,
+		.startino	= startino,
+		.pwork		= XFS_PWORK_SINGLE_THREADED,
+	};
+	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, startino);
+	int			error;
+
+	ASSERT(agno < mp->m_sb.sb_agcount);
+
+	xfs_iwalk_set_prefetch(&iwag, max_prefetch * XFS_INODES_PER_CHUNK);
+	error = xfs_iwalk_allocbuf(&iwag);
+	if (error)
+		return error;
+
+	for (; agno < mp->m_sb.sb_agcount; agno++) {
+		error = xfs_inobt_walk_ag(&iwag);
+		if (error)
+			break;
+		iwag.startino = XFS_AGINO_TO_INO(mp, agno + 1, 0);
+	}
+
+	xfs_iwalk_freebuf(&iwag);
+	return error;
+}
diff --git a/fs/xfs/xfs_iwalk.h b/fs/xfs/xfs_iwalk.h
index 76d8f87a39ef..20bee93d4676 100644
--- a/fs/xfs/xfs_iwalk.h
+++ b/fs/xfs/xfs_iwalk.h
@@ -18,4 +18,16 @@ int xfs_iwalk_threaded(struct xfs_mount *mp, xfs_ino_t startino,
 		xfs_iwalk_fn iwalk_fn, unsigned int max_prefetch, bool poll,
 		void *data);
 
+/* Walk all inode btree records in the filesystem starting from @startino. */
+typedef int (*xfs_inobt_walk_fn)(struct xfs_mount *mp, struct xfs_trans *tp,
+				 xfs_agnumber_t agno,
+				 const struct xfs_inobt_rec_incore *irec,
+				 void *data);
+/* Return value (for xfs_inobt_walk_fn) that aborts the walk immediately. */
+#define XFS_INOBT_WALK_ABORT	(XFS_IWALK_ABORT)
+
+int xfs_inobt_walk(struct xfs_mount *mp, struct xfs_trans *tp,
+		xfs_ino_t startino, xfs_inobt_walk_fn inobt_walk_fn,
+		unsigned int max_prefetch, void *data);
+
 #endif /* __XFS_IWALK_H__ */

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 01/11] xfs: separate inode geometry
  2019-05-29 22:26 ` [PATCH 01/11] xfs: separate inode geometry Darrick J. Wong
@ 2019-05-30  1:18   ` Dave Chinner
  2019-05-30 22:33     ` Darrick J. Wong
  0 siblings, 1 reply; 19+ messages in thread
From: Dave Chinner @ 2019-05-30  1:18 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, May 29, 2019 at 03:26:20PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Separate the inode geometry information into a distinct structure.

I like the idea, but have lots of comments on the patch....

> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/libxfs/xfs_format.h       |   33 +++++++++++-
>  fs/xfs/libxfs/xfs_ialloc.c       |  109 ++++++++++++++++++++------------------
>  fs/xfs/libxfs/xfs_ialloc.h       |    6 +-
>  fs/xfs/libxfs/xfs_ialloc_btree.c |   15 +++--
>  fs/xfs/libxfs/xfs_inode_buf.c    |    2 -
>  fs/xfs/libxfs/xfs_sb.c           |   24 +++++---
>  fs/xfs/libxfs/xfs_trans_resv.c   |   17 +++---
>  fs/xfs/libxfs/xfs_trans_space.h  |    7 +-
>  fs/xfs/libxfs/xfs_types.c        |    4 +
>  fs/xfs/scrub/ialloc.c            |   22 ++++----
>  fs/xfs/scrub/quota.c             |    2 -
>  fs/xfs/xfs_fsops.c               |    4 +
>  fs/xfs/xfs_inode.c               |   17 +++---
>  fs/xfs/xfs_itable.c              |   11 ++--
>  fs/xfs/xfs_log_recover.c         |   23 ++++----
>  fs/xfs/xfs_mount.c               |   49 +++++++++--------
>  fs/xfs/xfs_mount.h               |   17 ------
>  fs/xfs/xfs_super.c               |    6 +-
>  18 files changed, 205 insertions(+), 163 deletions(-)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> index 9bb3c48843ec..66f527b1c461 100644
> --- a/fs/xfs/libxfs/xfs_format.h
> +++ b/fs/xfs/libxfs/xfs_format.h
> @@ -1071,7 +1071,7 @@ static inline void xfs_dinode_put_rdev(struct xfs_dinode *dip, xfs_dev_t rdev)
>  #define	XFS_INO_MASK(k)			(uint32_t)((1ULL << (k)) - 1)
>  #define	XFS_INO_OFFSET_BITS(mp)		(mp)->m_sb.sb_inopblog
>  #define	XFS_INO_AGBNO_BITS(mp)		(mp)->m_sb.sb_agblklog
> -#define	XFS_INO_AGINO_BITS(mp)		(mp)->m_agino_log
> +#define	XFS_INO_AGINO_BITS(mp)		((mp)->m_ino_geo.ig_agino_log)
>  #define	XFS_INO_AGNO_BITS(mp)		(mp)->m_agno_log
>  #define	XFS_INO_BITS(mp)		\
>  	XFS_INO_AGNO_BITS(mp) + XFS_INO_AGINO_BITS(mp)
> @@ -1694,4 +1694,35 @@ struct xfs_acl {
>  #define SGI_ACL_FILE_SIZE	(sizeof(SGI_ACL_FILE)-1)
>  #define SGI_ACL_DEFAULT_SIZE	(sizeof(SGI_ACL_DEFAULT)-1)
>  
> +struct xfs_ino_geometry {
> +	/* Maximum inode count in this filesystem. */
> +	uint64_t	ig_maxicount;

Naming is hard. What's the point of adding an "ig_" prefix when the
variables mostly already have an "i" somewhere in them that means
"inode"?  And when they are referenced as igeo->ig_i...., then we've
got so much redundant namespace in the code.....

This is one of the reasons why the struct xfs_da_geometry is not
namespaced - it's clear from the code context that it's
directory/attribute geometry info, so it doesn't need lots of extra
namespace gunk.

> +
> +	/* Minimum inode buffer size, in bytes. */
> +	unsigned int	ig_min_cluster_size;

What does the "minimum" in this variable mean? cluster size is fixed
for a filesystem, there's no minimum or maximum size....

> +
> +	/* Inode cluster sizes, adjusted to be at least 1 fsb. */
> +	unsigned int	ig_inodes_per_cluster;
> +	unsigned int	ig_blocks_per_cluster;
> +
> +	/* Inode cluster alignment. */
> +	unsigned int	ig_cluster_align;
> +	unsigned int	ig_cluster_align_inodes;
> +
> +	unsigned int	ig_inobt_mxr[2]; /* max inobt btree records */
> +	unsigned int	ig_inobt_mnr[2]; /* min inobt btree records */
> +	unsigned int	ig_in_maxlevels; /* max inobt btree levels. */

inobt_maxlevels?

> +
> +	/* Minimum inode allocation size */
> +	unsigned int	ig_ialloc_inos;
> +	unsigned int	ig_ialloc_blks;

What's "minimum" about these values?

> +	/* Minimum inode blocks for a sparse allocation. */
> +	unsigned int	ig_ialloc_min_blks;
> +
> +	unsigned int	ig_inoalign_mask;/* mask sb_inoalignmt if used */

This goes with the cluster alignment variables; it's always set by
mkfs and used to convert inode numbers to cluster agbnos...

> +	unsigned int	ig_agino_log;	/* #bits for agino in inum */
> +	unsigned int	ig_sinoalign;	/* stripe unit inode alignment */

And this one should be renamed ialloc_align and moved up with the
other ialloc variables....


> +};
> +
>  #endif /* __XFS_FORMAT_H__ */
> diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
> index fe9898875097..c881e0521331 100644
> --- a/fs/xfs/libxfs/xfs_ialloc.c
> +++ b/fs/xfs/libxfs/xfs_ialloc.c
> @@ -299,7 +299,7 @@ xfs_ialloc_inode_init(
>  	 * sizes, manipulate the inodes in buffers  which are multiples of the
>  	 * blocks size.
>  	 */
> -	nbufs = length / mp->m_blocks_per_cluster;
> +	nbufs = length / mp->m_ino_geo.ig_blocks_per_cluster;
>  
>  	/*
>  	 * Figure out what version number to use in the inodes we create.  If
> @@ -343,9 +343,10 @@ xfs_ialloc_inode_init(
>  		 * Get the block.
>  		 */
>  		d = XFS_AGB_TO_DADDR(mp, agno, agbno +
> -				(j * mp->m_blocks_per_cluster));
> +				(j * mp->m_ino_geo.ig_blocks_per_cluster));
>  		fbuf = xfs_trans_get_buf(tp, mp->m_ddev_targp, d,
> -					 mp->m_bsize * mp->m_blocks_per_cluster,
> +					 mp->m_bsize *
> +					 mp->m_ino_geo.ig_blocks_per_cluster,
>  					 XBF_UNMAPPED);

This doesn't improve readability of the code. Please use a local
igeom variable rather than repeatedly using mp->m_ino_geo.ig_....
in the function.


> @@ -690,10 +693,10 @@ xfs_ialloc_ag_alloc(
>  		 * but not to use them in the actual exact allocation.
>  		 */
>  		args.alignment = 1;
> -		args.minalignslop = args.mp->m_cluster_align - 1;
> +		args.minalignslop = args.mp->m_ino_geo.ig_cluster_align - 1;

Ummm, why not igeo->... , like:

>  
>  		/* Allow space for the inode btree to split. */
> -		args.minleft = args.mp->m_in_maxlevels - 1;
> +		args.minleft = igeo->ig_in_maxlevels - 1;

3 lines down?

>  		if ((error = xfs_alloc_vextent(&args)))
>  			return error;
>  
> @@ -720,12 +723,12 @@ xfs_ialloc_ag_alloc(
>  		 * pieces, so don't need alignment anyway.
>  		 */
>  		isaligned = 0;
> -		if (args.mp->m_sinoalign) {
> +		if (igeo->ig_sinoalign) {
>  			ASSERT(!(args.mp->m_flags & XFS_MOUNT_NOALIGN));
>  			args.alignment = args.mp->m_dalign;
>  			isaligned = 1;
>  		} else
> -			args.alignment = args.mp->m_cluster_align;
> +			args.alignment = args.mp->m_ino_geo.ig_cluster_align;

Ditto (and others).

>  	int			noroom = 0;
>  	xfs_agnumber_t		start_agno;
>  	struct xfs_perag	*pag;
> +	struct xfs_ino_geometry	*igeo = &mp->m_ino_geo;
>  	int			okalloc = 1;
>  
>  	if (*IO_agbp) {
> @@ -1733,9 +1737,9 @@ xfs_dialloc(
>  	 * Read rough value of mp->m_icount by percpu_counter_read_positive,
>  	 * which will sacrifice the preciseness but improve the performance.
>  	 */
> -	if (mp->m_maxicount &&
> -	    percpu_counter_read_positive(&mp->m_icount) + mp->m_ialloc_inos
> -							> mp->m_maxicount) {
> +	if (mp->m_ino_geo.ig_maxicount &&

igeo?

> +	    percpu_counter_read_positive(&mp->m_icount) + igeo->ig_ialloc_inos
> +							> igeo->ig_maxicount) {
>  		noroom = 1;
>  		okalloc = 0;
>  	}
> @@ -1852,7 +1856,8 @@ xfs_difree_inode_chunk(
>  	if (!xfs_inobt_issparse(rec->ir_holemask)) {
>  		/* not sparse, calculate extent info directly */
>  		xfs_bmap_add_free(tp, XFS_AGB_TO_FSB(mp, agno, sagbno),
> -				  mp->m_ialloc_blks, &XFS_RMAP_OINFO_INODES);
> +				  mp->m_ino_geo.ig_ialloc_blks,
> +				  &XFS_RMAP_OINFO_INODES);
>  		return;
>  	}
>  
> @@ -2261,7 +2266,7 @@ xfs_imap_lookup(
>  
>  	/* check that the returned record contains the required inode */
>  	if (rec.ir_startino > agino ||
> -	    rec.ir_startino + mp->m_ialloc_inos <= agino)
> +	    rec.ir_startino + mp->m_ino_geo.ig_ialloc_inos <= agino)
>  		return -EINVAL;
>  
>  	/* for untrusted inodes check it is allocated first */
> @@ -2352,7 +2357,7 @@ xfs_imap(
>  	 * If the inode cluster size is the same as the blocksize or
>  	 * smaller we get to the buffer by simple arithmetics.
>  	 */
> -	if (mp->m_blocks_per_cluster == 1) {
> +	if (mp->m_ino_geo.ig_blocks_per_cluster == 1) {

igeo...

>  		offset = XFS_INO_TO_OFFSET(mp, ino);
>  		ASSERT(offset < mp->m_sb.sb_inopblock);
>  
> @@ -2368,8 +2373,8 @@ xfs_imap(
>  	 * find the location. Otherwise we have to do a btree
>  	 * lookup to find the location.
>  	 */
> -	if (mp->m_inoalign_mask) {
> -		offset_agbno = agbno & mp->m_inoalign_mask;
> +	if (mp->m_ino_geo.ig_inoalign_mask) {
> +		offset_agbno = agbno & mp->m_ino_geo.ig_inoalign_mask;

and here too.

>  		chunk_agbno = agbno - offset_agbno;
>  	} else {
>  		error = xfs_imap_lookup(mp, tp, agno, agino, agbno,
> @@ -2381,13 +2386,13 @@ xfs_imap(
>  out_map:
>  	ASSERT(agbno >= chunk_agbno);
>  	cluster_agbno = chunk_agbno +
> -		((offset_agbno / mp->m_blocks_per_cluster) *
> -		 mp->m_blocks_per_cluster);
> +		((offset_agbno / mp->m_ino_geo.ig_blocks_per_cluster) *
> +		 mp->m_ino_geo.ig_blocks_per_cluster);

And here.

>  	offset = ((agbno - cluster_agbno) * mp->m_sb.sb_inopblock) +
>  		XFS_INO_TO_OFFSET(mp, ino);
>  
>  	imap->im_blkno = XFS_AGB_TO_DADDR(mp, agno, cluster_agbno);
> -	imap->im_len = XFS_FSB_TO_BB(mp, mp->m_blocks_per_cluster);
> +	imap->im_len = XFS_FSB_TO_BB(mp, mp->m_ino_geo.ig_blocks_per_cluster);

and here...

>  	imap->im_boffset = (unsigned short)(offset << mp->m_sb.sb_inodelog);
>  
>  	/*
> @@ -2409,7 +2414,7 @@ xfs_imap(
>  }
>  
>  /*
> - * Compute and fill in value of m_in_maxlevels.
> + * Compute and fill in value of m_ino_geo.ig_in_maxlevels.
>   */
>  void
>  xfs_ialloc_compute_maxlevels(
> @@ -2418,8 +2423,8 @@ xfs_ialloc_compute_maxlevels(
>  	uint		inodes;
>  
>  	inodes = (1LL << XFS_INO_AGINO_BITS(mp)) >> XFS_INODES_PER_CHUNK_LOG;
> -	mp->m_in_maxlevels = xfs_btree_compute_maxlevels(mp->m_inobt_mnr,
> -							 inodes);
> +	mp->m_ino_geo.ig_in_maxlevels = xfs_btree_compute_maxlevels(
> +			mp->m_ino_geo.ig_inobt_mnr, inodes);


Once we take away the macro:

	inodes = (1LL << igeo->agino_log) >> XFS_INODES_PER_CHUNK_LOG;
	igeo->inobt_maxlevels = xfs_btree_compute_maxlevels(igeo->inobt_mnr,
							inodes);

So, shouldn't we just pass igeo into this function now?

>  }
>  
>  /*
> diff --git a/fs/xfs/libxfs/xfs_ialloc.h b/fs/xfs/libxfs/xfs_ialloc.h
> index e936b7cc9389..b74fa2addd51 100644
> --- a/fs/xfs/libxfs/xfs_ialloc.h
> +++ b/fs/xfs/libxfs/xfs_ialloc.h
> @@ -28,9 +28,9 @@ static inline int
>  xfs_icluster_size_fsb(
>  	struct xfs_mount	*mp)
>  {
> -	if (mp->m_sb.sb_blocksize >= mp->m_inode_cluster_size)
> +	if (mp->m_sb.sb_blocksize >= mp->m_ino_geo.ig_min_cluster_size)
>  		return 1;
> -	return mp->m_inode_cluster_size >> mp->m_sb.sb_blocklog;
> +	return mp->m_ino_geo.ig_min_cluster_size >> mp->m_sb.sb_blocklog;
>  }

The return value of this is placed in the mp->m_ino_geo structure.
This should pass in the igeo structure the result is written into.
Its other caller should be using the value in the igeo structure,
not calling this function.
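
i.e. something like this (sketch):

	static inline void
	xfs_icluster_size_fsb(
		struct xfs_mount	*mp,
		struct xfs_ino_geometry	*igeo)
	{
		if (mp->m_sb.sb_blocksize >= igeo->ig_min_cluster_size)
			igeo->ig_blocks_per_cluster = 1;
		else
			igeo->ig_blocks_per_cluster =
				igeo->ig_min_cluster_size >>
					mp->m_sb.sb_blocklog;
	}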

>  
> index bc2dfacd2f4a..79cc5cf21e1b 100644
> --- a/fs/xfs/libxfs/xfs_ialloc_btree.c
> +++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
> @@ -28,7 +28,7 @@ xfs_inobt_get_minrecs(
>  	struct xfs_btree_cur	*cur,
>  	int			level)
>  {
> -	return cur->bc_mp->m_inobt_mnr[level != 0];
> +	return cur->bc_mp->m_ino_geo.ig_inobt_mnr[level != 0];
>  }

Put an igeo pointer in the inobt union of the btree cursor?

	return cur->bc_private.a.igeo->inobt_mnr[level != 0];

>  
>  STATIC struct xfs_btree_cur *
> @@ -164,7 +164,7 @@ xfs_inobt_get_maxrecs(
>  	struct xfs_btree_cur	*cur,
>  	int			level)
>  {
> -	return cur->bc_mp->m_inobt_mxr[level != 0];
> +	return cur->bc_mp->m_ino_geo.ig_inobt_mxr[level != 0];
>  }
>  
>  STATIC void
> @@ -281,10 +281,11 @@ xfs_inobt_verify(
>  
>  	/* level verification */
>  	level = be16_to_cpu(block->bb_level);
> -	if (level >= mp->m_in_maxlevels)
> +	if (level >= mp->m_ino_geo.ig_in_maxlevels)
>  		return __this_address;
>  
> -	return xfs_btree_sblock_verify(bp, mp->m_inobt_mxr[level != 0]);
> +	return xfs_btree_sblock_verify(bp,
> +			mp->m_ino_geo.ig_inobt_mxr[level != 0]);
>  }
>  
>  static void
> @@ -546,7 +547,7 @@ xfs_inobt_max_size(
>  	xfs_agblock_t		agblocks = xfs_ag_block_count(mp, agno);
>  
>  	/* Bail out if we're uninitialized, which can happen in mkfs. */
> -	if (mp->m_inobt_mxr[0] == 0)
> +	if (mp->m_ino_geo.ig_inobt_mxr[0] == 0)
>  		return 0;
>  
>  	/*
> @@ -558,7 +559,7 @@ xfs_inobt_max_size(
>  	    XFS_FSB_TO_AGNO(mp, mp->m_sb.sb_logstart) == agno)
>  		agblocks -= mp->m_sb.sb_logblocks;
>  
> -	return xfs_btree_calc_size(mp->m_inobt_mnr,
> +	return xfs_btree_calc_size(mp->m_ino_geo.ig_inobt_mnr,
>  				(uint64_t)agblocks * mp->m_sb.sb_inopblock /
>  					XFS_INODES_PER_CHUNK);
>  }
> @@ -619,5 +620,5 @@ xfs_iallocbt_calc_size(
>  	struct xfs_mount	*mp,
>  	unsigned long long	len)
>  {
> -	return xfs_btree_calc_size(mp->m_inobt_mnr, len);
> +	return xfs_btree_calc_size(mp->m_ino_geo.ig_inobt_mnr, len);

Should pass igeo into this function now, not xfs_mount.

>  }
> diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> index e021d5133ccb..641aa1c2f1ae 100644
> --- a/fs/xfs/libxfs/xfs_inode_buf.c
> +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> @@ -36,7 +36,7 @@ xfs_inobp_check(
>  	int		j;
>  	xfs_dinode_t	*dip;
>  
> -	j = mp->m_inode_cluster_size >> mp->m_sb.sb_inodelog;
> +	j = mp->m_ino_geo.ig_min_cluster_size >> mp->m_sb.sb_inodelog;

isn't that "inodes per cluster"?

>  
>  	for (i = 0; i < j; i++) {
>  		dip = xfs_buf_offset(bp, i * mp->m_sb.sb_inodesize);
> diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
> index e76a3e5d28d7..9416fc741788 100644
> --- a/fs/xfs/libxfs/xfs_sb.c
> +++ b/fs/xfs/libxfs/xfs_sb.c
> @@ -804,16 +804,18 @@ const struct xfs_buf_ops xfs_sb_quiet_buf_ops = {
>   */
>  void
>  xfs_sb_mount_common(
> -	struct xfs_mount *mp,
> -	struct xfs_sb	*sbp)
> +	struct xfs_mount	*mp,
> +	struct xfs_sb		*sbp)
>  {
> +	struct xfs_ino_geometry	*igeo = &mp->m_ino_geo;
> +
>  	mp->m_agfrotor = mp->m_agirotor = 0;
>  	mp->m_maxagi = mp->m_sb.sb_agcount;
>  	mp->m_blkbit_log = sbp->sb_blocklog + XFS_NBBYLOG;
>  	mp->m_blkbb_log = sbp->sb_blocklog - BBSHIFT;
>  	mp->m_sectbb_log = sbp->sb_sectlog - BBSHIFT;
>  	mp->m_agno_log = xfs_highbit32(sbp->sb_agcount - 1) + 1;
> -	mp->m_agino_log = sbp->sb_inopblog + sbp->sb_agblklog;
> +	mp->m_ino_geo.ig_agino_log = sbp->sb_inopblog + sbp->sb_agblklog;

igeo.

>  
> @@ -307,7 +308,8 @@ xfs_calc_iunlink_remove_reservation(
>  	struct xfs_mount        *mp)
>  {
>  	return xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) +
> -	       2 * max_t(uint, XFS_FSB_TO_B(mp, 1), mp->m_inode_cluster_size);
> +	       2 * max_t(uint, XFS_FSB_TO_B(mp, 1),
> +			 mp->m_ino_geo.ig_min_cluster_size);
>  }

I'm starting to think an M_IGEO(mp)-like macro is in order here....
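
Something like (sketch):

	#define M_IGEO(mp)	(&(mp)->m_ino_geo)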

>  
> diff --git a/fs/xfs/scrub/ialloc.c b/fs/xfs/scrub/ialloc.c
> index 9b47117180cb..fa7386bf76e9 100644
> --- a/fs/xfs/scrub/ialloc.c
> +++ b/fs/xfs/scrub/ialloc.c
> @@ -230,7 +230,7 @@ xchk_iallocbt_check_cluster(
>  	int				error = 0;
>  
>  	nr_inodes = min_t(unsigned int, XFS_INODES_PER_CHUNK,
> -			mp->m_inodes_per_cluster);
> +			mp->m_ino_geo.ig_inodes_per_cluster);

igeo.... (many uses in this function)

> @@ -355,6 +356,7 @@ xchk_iallocbt_rec_alignment(
>  {
>  	struct xfs_mount		*mp = bs->sc->mp;
>  	struct xchk_iallocbt		*iabt = bs->private;
> +	struct xfs_ino_geometry		*ig = &mp->m_ino_geo;

igeo, for consistency with the rest of the code.

> @@ -2567,7 +2568,8 @@ xfs_ifree_cluster(
>  		 * to mark all the active inodes on the buffer stale.
>  		 */
>  		bp = xfs_trans_get_buf(tp, mp->m_ddev_targp, blkno,
> -					mp->m_bsize * mp->m_blocks_per_cluster,
> +					mp->m_bsize *
> +						igeo->ig_blocks_per_cluster,
>  					XBF_UNMAPPED);

Back off the indent, don't use another line :)

> @@ -3476,19 +3478,20 @@ xfs_iflush_cluster(
>  	int			cilist_size;
>  	struct xfs_inode	**cilist;
>  	struct xfs_inode	*cip;
> +	struct xfs_ino_geometry	*igeo = &mp->m_ino_geo;
>  	int			nr_found;
>  	int			clcount = 0;
>  	int			i;
>  
>  	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
>  
> -	inodes_per_cluster = mp->m_inode_cluster_size >> mp->m_sb.sb_inodelog;
> +	inodes_per_cluster = igeo->ig_min_cluster_size >> mp->m_sb.sb_inodelog;

that's igeo->inodes_per_cluster again, right?

>  	cilist_size = inodes_per_cluster * sizeof(xfs_inode_t *);
>  	cilist = kmem_alloc(cilist_size, KM_MAYFAIL|KM_NOFS);
>  	if (!cilist)
>  		goto out_put;
>  
> -	mask = ~(((mp->m_inode_cluster_size >> mp->m_sb.sb_inodelog)) - 1);
> +	mask = ~(((igeo->ig_min_cluster_size >> mp->m_sb.sb_inodelog)) - 1);

Isn't that:

	mask = ~(inodes_per_cluster - 1);

>  	first_index = XFS_INO_TO_AGINO(mp, ip->i_ino) & mask;
>  	rcu_read_lock();
>  	/* really need a gang lookup range call here */
> diff --git a/fs/xfs/xfs_itable.c b/fs/xfs/xfs_itable.c
> index 1e1a0af1dd34..cff28ee73deb 100644
> --- a/fs/xfs/xfs_itable.c
> +++ b/fs/xfs/xfs_itable.c
> @@ -167,6 +167,7 @@ xfs_bulkstat_ichunk_ra(
>  	xfs_agnumber_t			agno,
>  	struct xfs_inobt_rec_incore	*irec)
>  {
> +	struct xfs_ino_geometry		*igeo = &mp->m_ino_geo;
>  	xfs_agblock_t			agbno;
>  	struct blk_plug			plug;
>  	int				i;	/* inode chunk index */
> @@ -174,12 +175,14 @@ xfs_bulkstat_ichunk_ra(
>  	agbno = XFS_AGINO_TO_AGBNO(mp, irec->ir_startino);
>  
>  	blk_start_plug(&plug);
> -	for (i = 0; i < XFS_INODES_PER_CHUNK;
> -	     i += mp->m_inodes_per_cluster, agbno += mp->m_blocks_per_cluster) {
> -		if (xfs_inobt_maskn(i, mp->m_inodes_per_cluster) &
> +	for (i = 0;
> +	     i < XFS_INODES_PER_CHUNK;
> +	     i += igeo->ig_inodes_per_cluster,
> +			agbno += igeo->ig_blocks_per_cluster) {
> +		if (xfs_inobt_maskn(i, igeo->ig_inodes_per_cluster) &
>  		    ~irec->ir_free) {
>  			xfs_btree_reada_bufs(mp, agno, agbno,
> -					mp->m_blocks_per_cluster,
> +					igeo->ig_blocks_per_cluster,
>  					&xfs_inode_buf_ops);
>  		}

That's a mess :(

	for (i = 0; i < XFS_INODES_PER_CHUNK; i += igeo->inodes_per_cluster) {
		if (xfs_inobt_maskn(i, igeo->inodes_per_cluster) &
							~irec->ir_free) {
			xfs_btree_reada_bufs(mp, agno, agbno,
					igeo->ig_blocks_per_cluster,
					&xfs_inode_buf_ops);
		}
		agbno += igeo->blocks_per_cluster;
	}

> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> index 9329f5adbfbe..15118e531184 100644
> --- a/fs/xfs/xfs_log_recover.c
> +++ b/fs/xfs/xfs_log_recover.c
> @@ -2882,19 +2882,19 @@ xlog_recover_buffer_pass2(
>  	 *
>  	 * Also make sure that only inode buffers with good sizes stay in
>  	 * the buffer cache.  The kernel moves inodes in buffers of 1 block
> -	 * or mp->m_inode_cluster_size bytes, whichever is bigger.  The inode
> +	 * or ig_min_cluster_size bytes, whichever is bigger.  The inode
>  	 * buffers in the log can be a different size if the log was generated
>  	 * by an older kernel using unclustered inode buffers or a newer kernel
>  	 * running with a different inode cluster size.  Regardless, if the
> -	 * the inode buffer size isn't max(blocksize, mp->m_inode_cluster_size)
> -	 * for *our* value of mp->m_inode_cluster_size, then we need to keep
> +	 * the inode buffer size isn't max(blocksize, ig_min_cluster_size)
> +	 * for *our* value of ig_min_cluster_size, then we need to keep
>  	 * the buffer out of the buffer cache so that the buffer won't
>  	 * overlap with future reads of those inodes.
>  	 */
>  	if (XFS_DINODE_MAGIC ==
>  	    be16_to_cpu(*((__be16 *)xfs_buf_offset(bp, 0))) &&
>  	    (BBTOB(bp->b_io_length) != max(log->l_mp->m_sb.sb_blocksize,
> -			(uint32_t)log->l_mp->m_inode_cluster_size))) {
> +			(uint32_t)log->l_mp->m_ino_geo.ig_min_cluster_size))) {

cluster size is already an unsigned int so the cast can go.

>  		xfs_buf_stale(bp);
>  		error = xfs_bwrite(bp);
>  	} else {
> @@ -3849,6 +3849,7 @@ xlog_recover_do_icreate_pass2(
>  {
>  	struct xfs_mount	*mp = log->l_mp;
>  	struct xfs_icreate_log	*icl;
> +	struct xfs_ino_geometry	*igeo = &mp->m_ino_geo;
>  	xfs_agnumber_t		agno;
>  	xfs_agblock_t		agbno;
>  	unsigned int		count;
> @@ -3898,10 +3899,10 @@ xlog_recover_do_icreate_pass2(
>  
>  	/*
>  	 * The inode chunk is either full or sparse and we only support
> -	 * m_ialloc_min_blks sized sparse allocations at this time.
> +	 * m_ino_geo.ig_ialloc_min_blks sized sparse allocations at this time.
>  	 */
> -	if (length != mp->m_ialloc_blks &&
> -	    length != mp->m_ialloc_min_blks) {
> +	if (length != igeo->ig_ialloc_blks &&
> +	    length != igeo->ig_ialloc_min_blks) {
>  		xfs_warn(log->l_mp,
>  			 "%s: unsupported chunk length", __FUNCTION__);
>  		return -EINVAL;
> @@ -3921,13 +3922,13 @@ xlog_recover_do_icreate_pass2(
>  	 * buffers for cancellation so we don't overwrite anything written after
>  	 * a cancellation.
>  	 */
> -	bb_per_cluster = XFS_FSB_TO_BB(mp, mp->m_blocks_per_cluster);
> -	nbufs = length / mp->m_blocks_per_cluster;
> +	bb_per_cluster = XFS_FSB_TO_BB(mp, igeo->ig_blocks_per_cluster);
> +	nbufs = length / igeo->ig_blocks_per_cluster;
>  	for (i = 0, cancel_count = 0; i < nbufs; i++) {
>  		xfs_daddr_t	daddr;
>  
> -		daddr = XFS_AGB_TO_DADDR(mp, agno,
> -					 agbno + i * mp->m_blocks_per_cluster);
> +		daddr = XFS_AGB_TO_DADDR(mp, agno, agbno +
> +				i * igeo->ig_blocks_per_cluster);

makes no sense to change the line break location.

		daddr = XFS_AGB_TO_DADDR(mp, agno,
				agbno + i * igeo->ig_blocks_per_cluster);


>   */
>  STATIC void
> -xfs_set_maxicount(xfs_mount_t *mp)
> +xfs_set_maxicount(
> +	struct xfs_mount	*mp)
>  {
> -	xfs_sb_t	*sbp = &(mp->m_sb);
> -	uint64_t	icount;
> +	struct xfs_sb		*sbp = &(mp->m_sb);

kill the ().

> +	struct xfs_ino_geometry	*igeo = &mp->m_ino_geo;
> +	uint64_t		icount;
>  
>  	if (sbp->sb_imax_pct) {
>  		/*
> @@ -445,11 +447,11 @@ xfs_set_maxicount(xfs_mount_t *mp)
>  		 */
>  		icount = sbp->sb_dblocks * sbp->sb_imax_pct;
>  		do_div(icount, 100);
> -		do_div(icount, mp->m_ialloc_blks);
> -		mp->m_maxicount = (icount * mp->m_ialloc_blks)  <<
> -				   sbp->sb_inopblog;
> +		do_div(icount, igeo->ig_ialloc_blks);
> +		igeo->ig_maxicount = XFS_FSB_TO_INO(mp,
> +				icount * igeo->ig_ialloc_blks);
>  	} else {
> -		mp->m_maxicount = 0;
> +		igeo->ig_maxicount = 0;
>  	}
>  }
>  
> @@ -518,18 +520,18 @@ xfs_set_inoalignment(xfs_mount_t *mp)
>  {
>  	if (xfs_sb_version_hasalign(&mp->m_sb) &&
>  		mp->m_sb.sb_inoalignmt >= xfs_icluster_size_fsb(mp))
> -		mp->m_inoalign_mask = mp->m_sb.sb_inoalignmt - 1;
> +		mp->m_ino_geo.ig_inoalign_mask = mp->m_sb.sb_inoalignmt - 1;
>  	else
> -		mp->m_inoalign_mask = 0;
> +		mp->m_ino_geo.ig_inoalign_mask = 0;
>  	/*
>  	 * If we are using stripe alignment, check whether
>  	 * the stripe unit is a multiple of the inode alignment
>  	 */
> -	if (mp->m_dalign && mp->m_inoalign_mask &&
> -	    !(mp->m_dalign & mp->m_inoalign_mask))
> -		mp->m_sinoalign = mp->m_dalign;
> +	if (mp->m_dalign && mp->m_ino_geo.ig_inoalign_mask &&
> +	    !(mp->m_dalign & mp->m_ino_geo.ig_inoalign_mask))
> +		mp->m_ino_geo.ig_sinoalign = mp->m_dalign;
>  	else
> -		mp->m_sinoalign = 0;
> +		mp->m_ino_geo.ig_sinoalign = 0;

should pass in igeo to this function....

>  }
>  
>  /*
> @@ -683,6 +685,7 @@ xfs_mountfs(
>  {
>  	struct xfs_sb		*sbp = &(mp->m_sb);
>  	struct xfs_inode	*rip;
> +	struct xfs_ino_geometry	*igeo = &mp->m_ino_geo;
>  	uint64_t		resblks;
>  	uint			quotamount = 0;
>  	uint			quotaflags = 0;
> @@ -797,18 +800,20 @@ xfs_mountfs(
>  	 * has set the inode alignment value appropriately for larger cluster
>  	 * sizes.
>  	 */
> -	mp->m_inode_cluster_size = XFS_INODE_BIG_CLUSTER_SIZE;
> +	igeo->ig_min_cluster_size = XFS_INODE_BIG_CLUSTER_SIZE;
>  	if (xfs_sb_version_hascrc(&mp->m_sb)) {
> -		int	new_size = mp->m_inode_cluster_size;
> +		int	new_size = igeo->ig_min_cluster_size;
>  
>  		new_size *= mp->m_sb.sb_inodesize / XFS_DINODE_MIN_SIZE;
>  		if (mp->m_sb.sb_inoalignmt >= XFS_B_TO_FSBT(mp, new_size))
> -			mp->m_inode_cluster_size = new_size;
> +			igeo->ig_min_cluster_size = new_size;
>  	}
> -	mp->m_blocks_per_cluster = xfs_icluster_size_fsb(mp);
> -	mp->m_inodes_per_cluster = XFS_FSB_TO_INO(mp, mp->m_blocks_per_cluster);
> -	mp->m_cluster_align = xfs_ialloc_cluster_alignment(mp);
> -	mp->m_cluster_align_inodes = XFS_FSB_TO_INO(mp, mp->m_cluster_align);
> +	igeo->ig_blocks_per_cluster = xfs_icluster_size_fsb(mp);
> +	igeo->ig_inodes_per_cluster = XFS_FSB_TO_INO(mp,
> +			igeo->ig_blocks_per_cluster);
> +	igeo->ig_cluster_align = xfs_ialloc_cluster_alignment(mp);
> +	igeo->ig_cluster_align_inodes = XFS_FSB_TO_INO(mp,
> +			igeo->ig_cluster_align);

Can we separate out all the igeo initialisation into a single init
function rather than being spread out over several functions?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 01/11] xfs: separate inode geometry
  2019-05-30  1:18   ` Dave Chinner
@ 2019-05-30 22:33     ` Darrick J. Wong
  0 siblings, 0 replies; 19+ messages in thread
From: Darrick J. Wong @ 2019-05-30 22:33 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, May 30, 2019 at 11:18:33AM +1000, Dave Chinner wrote:
> On Wed, May 29, 2019 at 03:26:20PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Separate the inode geometry information into a distinct structure.
> 
> I like the idea, but have lots of comments on the patch....
> 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/libxfs/xfs_format.h       |   33 +++++++++++-
> >  fs/xfs/libxfs/xfs_ialloc.c       |  109 ++++++++++++++++++++------------------
> >  fs/xfs/libxfs/xfs_ialloc.h       |    6 +-
> >  fs/xfs/libxfs/xfs_ialloc_btree.c |   15 +++--
> >  fs/xfs/libxfs/xfs_inode_buf.c    |    2 -
> >  fs/xfs/libxfs/xfs_sb.c           |   24 +++++---
> >  fs/xfs/libxfs/xfs_trans_resv.c   |   17 +++---
> >  fs/xfs/libxfs/xfs_trans_space.h  |    7 +-
> >  fs/xfs/libxfs/xfs_types.c        |    4 +
> >  fs/xfs/scrub/ialloc.c            |   22 ++++----
> >  fs/xfs/scrub/quota.c             |    2 -
> >  fs/xfs/xfs_fsops.c               |    4 +
> >  fs/xfs/xfs_inode.c               |   17 +++---
> >  fs/xfs/xfs_itable.c              |   11 ++--
> >  fs/xfs/xfs_log_recover.c         |   23 ++++----
> >  fs/xfs/xfs_mount.c               |   49 +++++++++--------
> >  fs/xfs/xfs_mount.h               |   17 ------
> >  fs/xfs/xfs_super.c               |    6 +-
> >  18 files changed, 205 insertions(+), 163 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> > index 9bb3c48843ec..66f527b1c461 100644
> > --- a/fs/xfs/libxfs/xfs_format.h
> > +++ b/fs/xfs/libxfs/xfs_format.h
> > @@ -1071,7 +1071,7 @@ static inline void xfs_dinode_put_rdev(struct xfs_dinode *dip, xfs_dev_t rdev)
> >  #define	XFS_INO_MASK(k)			(uint32_t)((1ULL << (k)) - 1)
> >  #define	XFS_INO_OFFSET_BITS(mp)		(mp)->m_sb.sb_inopblog
> >  #define	XFS_INO_AGBNO_BITS(mp)		(mp)->m_sb.sb_agblklog
> > -#define	XFS_INO_AGINO_BITS(mp)		(mp)->m_agino_log
> > +#define	XFS_INO_AGINO_BITS(mp)		((mp)->m_ino_geo.ig_agino_log)
> >  #define	XFS_INO_AGNO_BITS(mp)		(mp)->m_agno_log
> >  #define	XFS_INO_BITS(mp)		\
> >  	XFS_INO_AGNO_BITS(mp) + XFS_INO_AGINO_BITS(mp)
> > @@ -1694,4 +1694,35 @@ struct xfs_acl {
> >  #define SGI_ACL_FILE_SIZE	(sizeof(SGI_ACL_FILE)-1)
> >  #define SGI_ACL_DEFAULT_SIZE	(sizeof(SGI_ACL_DEFAULT)-1)
> >  
> > +struct xfs_ino_geometry {
> > +	/* Maximum inode count in this filesystem. */
> > +	uint64_t	ig_maxicount;
> 
> Naming is hard. What's the point of adding an "ig_" prefix when the
> variables mostly already have an "i" somewhere in them that means
> "inode"?  And when they are referenced as igeo->ig_i...., then we've
> got so much redundant namespace in the code.....
> 
> This is one of the reasons why the struct xfs_da_geometry is not
> namespaced - it's clear from the code context that it's
> directory/attribute geometry info, so it doesn't need lots of extra
> namespace gunk.

Ok, no more namespacing gunk.  Whoopee!! :)

> > +
> > +	/* Minimum inode buffer size, in bytes. */
> > +	unsigned int	ig_min_cluster_size;
> 
> What does the "minimum" in this variable mean? Cluster size is fixed
> for a filesystem, there's no minimum or maximum size....

The comment came from xfs_mount.h ("min inode buf size"), which didn't
help a lot.  This variable and the two after have caused me quite a lot
of confusion over the years. :)

AFAICT, this value is some sort of "desired" cluster buffer size, which
for V4 filesystems is fixed at 8K and for V5 filesystems is calculated
as 8K * (inode size / 256)...
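
To make that concrete, here's a sketch of the calculation (it's quoted
from xfs_mountfs further down; the numbers assume 512-byte inodes):

	/* V5: scale the 8K base by (inode size / 256): */
	new_size = 8192 * (512 / 256);	/* = 16384, as in the dumps below */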

> > +	/* Inode cluster sizes, adjusted to be at least 1 fsb. */
> > +	unsigned int	ig_inodes_per_cluster;
> > +	unsigned int	ig_blocks_per_cluster;

...however, it's still possible for this "desired" cluster buffer size
to be less than a single fs block.  We don't support partial-block
buffers, so /most/ of the code uses ig_{inodes,blocks}_per_cluster or
xfs_icluster_size_fsb to walk through all the inodes in an actual inode
cluster buffer.

I'm not sure what to call ig_min_cluster_size (or its former name,
inode_cluster_size) since it's mostly aspirational.

In particular, if I fire up mkfs.xfs -m crc=1 -b size=64k -i size=512, I
see the following inode geometry:

(gdb) p mp->m_ino_geo
$1 = {
  maxicount = 1611776, 
  inode_cluster_size = 16384, 

Desired cluster size of 8192 * (512 / 256), or 16k.  However, this
filesystem has 64k blocks and we don't support sub-block buffers, so...

  inodes_per_cluster = 128, 

...however, 65536/512 = 128, so this makes sense...

  blocks_per_cluster = 1, 

...1 block per cluster, as expected.

  cluster_align = 1, 
  cluster_align_inodes = 128, 
  inobt_mxr = {4092, 8185}, 
  inobt_mnr = {2046, 4092}, 
  inobt_maxlevels = 2, 
  ialloc_inos = 128, 
  ialloc_blks = 1, 
  ialloc_min_blks = 1, 
  inoalign_mask = 4294967295, 
  agino_log = 21, 
  sinoalign = 0
}

So we can't just use inodes_per_cluster as a stand-in for
inode_cluster_size >> inodelog, because they're not totally equivalent.

For comparison, this is what you get with a 4k block filesystem:

(gdb) p mp->m_ino_geo
$1 = {
  maxicount = 1611776, 
  inode_cluster_size = 16384, 
  inodes_per_cluster = 32, 
  blocks_per_cluster = 4, 
  cluster_align = 8, 
  cluster_align_inodes = 64, 
  inobt_mxr = {252, 505}, 
  inobt_mnr = {126, 252}, 
  inobt_maxlevels = 3, 
  ialloc_inos = 64, 
  ialloc_blks = 8, 
  ialloc_min_blks = 4, 
  inoalign_mask = 7, 
  agino_log = 21, 
  sinoalign = 0
}

In theory there shouldn't be /any/ users of inode_cluster_size except
for xfs_icluster_size_fsb() since it makes no sense to deal with a
partial inode cluster buffer, right?  With two exceptions, all users of
inode_cluster_size open-code rounding it up to at least 1FSB.  Those
cases can be converted to inodes_per_cluster.

However, there are those two cases that don't do that -- xfs_inobp_check
and xfs_iflush_cluster.  I don't see how either of these is correct for
64k-block filesystems?  xfs_inobp_check is a debugging function so maybe
it's less noticeable.

It seems to me that xfs_iflush_cluster only flushes the first part of an
inode cluster buffer when blocksizes are large?  On that 64k block
filesystem above, xfs_iflush will use xfs_iflush_cluster to see if there
are any other inodes that can be flushed out with the write, but since
inode_cluster_size = 16384, it'll only look at the first 32 inodes in a
128-inode cluster buffer.  We probably never see any ill effects because
reclaim will eventually flush the other 96 inodes.
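
Spelling that out (a sketch of the math, using the 64k-block geometry
above, where sb_inodelog = 9 for 512-byte inodes):

	inodes_per_cluster = inode_cluster_size >> inodelog;
			/* = 16384 >> 9 = 32 inodes examined... */

...but a single 64k cluster buffer actually holds 65536 / 512 = 128
inodes, so the flush never even looks at the other 96.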

So I think a lot of these (inode_cluster_size >> inodelog) clauses can
be cleaned up, and I think there's a bug in xfs_iflush_cluster.  But
all that makes me hesitant to "just clean it all up", at least not
without someone else looking at this. :)

> > +
> > +	/* Inode cluster alignment. */
> > +	unsigned int	ig_cluster_align;
> > +	unsigned int	ig_cluster_align_inodes;
> > +
> > +	unsigned int	ig_inobt_mxr[2]; /* max inobt btree records */
> > +	unsigned int	ig_inobt_mnr[2]; /* min inobt btree records */
> > +	unsigned int	ig_in_maxlevels; /* max inobt btree levels. */
> 
> inobt_maxlevels?

Ok.

> > +
> > +	/* Minimum inode allocation size */
> > +	unsigned int	ig_ialloc_inos;
> > +	unsigned int	ig_ialloc_blks;
> 
> What's "minimum" about these values?

Hmm, nothing. :)

/* Size of inode allocations under normal operation */

> > +	/* Minimum inode blocks for a sparse allocation. */
> > +	unsigned int	ig_ialloc_min_blks;
> > +
> > +	unsigned int	ig_inoalign_mask;/* mask sb_inoalignmt if used */
> 
> This goes with the cluster alignment variables, it's always set by
> mkfs and used to convert inode numbers to cluster agbnos...

Ok.

> > +	unsigned int	ig_agino_log;	/* #bits for agino in inum */
> > +	unsigned int	ig_sinoalign;	/* stripe unit inode alignment */
> 
> And this one should be renamed ialloc_align and moved up with the
> the other ialloc variables....

Ok.

> > +};
> > +
> >  #endif /* __XFS_FORMAT_H__ */
> > diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
> > index fe9898875097..c881e0521331 100644
> > --- a/fs/xfs/libxfs/xfs_ialloc.c
> > +++ b/fs/xfs/libxfs/xfs_ialloc.c
> > @@ -299,7 +299,7 @@ xfs_ialloc_inode_init(
> >  	 * sizes, manipulate the inodes in buffers  which are multiples of the
> >  	 * blocks size.
> >  	 */
> > -	nbufs = length / mp->m_blocks_per_cluster;
> > +	nbufs = length / mp->m_ino_geo.ig_blocks_per_cluster;
> >  
> >  	/*
> >  	 * Figure out what version number to use in the inodes we create.  If
> > @@ -343,9 +343,10 @@ xfs_ialloc_inode_init(
> >  		 * Get the block.
> >  		 */
> >  		d = XFS_AGB_TO_DADDR(mp, agno, agbno +
> > -				(j * mp->m_blocks_per_cluster));
> > +				(j * mp->m_ino_geo.ig_blocks_per_cluster));
> >  		fbuf = xfs_trans_get_buf(tp, mp->m_ddev_targp, d,
> > -					 mp->m_bsize * mp->m_blocks_per_cluster,
> > +					 mp->m_bsize *
> > +					 mp->m_ino_geo.ig_blocks_per_cluster,
> >  					 XBF_UNMAPPED);
> 
> This doesn't improve readability of the code. Please use a local
> igeom variable rather than repeatedly using mp->m_ino_geo.ig_....
> in the function.

Can I create an M_IGEO macro and replace mp->m_blocks_per_cluster with
M_IGEO(mp)->blocks_per_cluster instead?
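
Something like this, say (a sketch; assuming it lives next to the other
mount macros):

	#define M_IGEO(mp)	(&(mp)->m_ino_geo)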
> 
> 
> > @@ -690,10 +693,10 @@ xfs_ialloc_ag_alloc(
> >  		 * but not to use them in the actual exact allocation.
> >  		 */
> >  		args.alignment = 1;
> > -		args.minalignslop = args.mp->m_cluster_align - 1;
> > +		args.minalignslop = args.mp->m_ino_geo.ig_cluster_align - 1;
> 
> Ummm, why not igeo->... , like:
> 
> >  
> >  		/* Allow space for the inode btree to split. */
> > -		args.minleft = args.mp->m_in_maxlevels - 1;
> > +		args.minleft = igeo->ig_in_maxlevels - 1;
> 
> 3 lines down?

djwong drain bamage. :(

> >  		if ((error = xfs_alloc_vextent(&args)))
> >  			return error;
> >  
> > @@ -720,12 +723,12 @@ xfs_ialloc_ag_alloc(
> >  		 * pieces, so don't need alignment anyway.
> >  		 */
> >  		isaligned = 0;
> > -		if (args.mp->m_sinoalign) {
> > +		if (igeo->ig_sinoalign) {
> >  			ASSERT(!(args.mp->m_flags & XFS_MOUNT_NOALIGN));
> >  			args.alignment = args.mp->m_dalign;
> >  			isaligned = 1;
> >  		} else
> > -			args.alignment = args.mp->m_cluster_align;
> > +			args.alignment = args.mp->m_ino_geo.ig_cluster_align;
> 
> Ditto (and others).

Will fix...

> >  	int			noroom = 0;
> >  	xfs_agnumber_t		start_agno;
> >  	struct xfs_perag	*pag;
> > +	struct xfs_ino_geometry	*igeo = &mp->m_ino_geo;
> >  	int			okalloc = 1;
> >  
> >  	if (*IO_agbp) {
> > @@ -1733,9 +1737,9 @@ xfs_dialloc(
> >  	 * Read rough value of mp->m_icount by percpu_counter_read_positive,
> >  	 * which will sacrifice the preciseness but improve the performance.
> >  	 */
> > -	if (mp->m_maxicount &&
> > -	    percpu_counter_read_positive(&mp->m_icount) + mp->m_ialloc_inos
> > -							> mp->m_maxicount) {
> > +	if (mp->m_ino_geo.ig_maxicount &&
> 
> igeo?

All of...

> > +	    percpu_counter_read_positive(&mp->m_icount) + igeo->ig_ialloc_inos
> > +							> igeo->ig_maxicount) {
> >  		noroom = 1;
> >  		okalloc = 0;
> >  	}
> > @@ -1852,7 +1856,8 @@ xfs_difree_inode_chunk(
> >  	if (!xfs_inobt_issparse(rec->ir_holemask)) {
> >  		/* not sparse, calculate extent info directly */
> >  		xfs_bmap_add_free(tp, XFS_AGB_TO_FSB(mp, agno, sagbno),
> > -				  mp->m_ialloc_blks, &XFS_RMAP_OINFO_INODES);
> > +				  mp->m_ino_geo.ig_ialloc_blks,
> > +				  &XFS_RMAP_OINFO_INODES);
> >  		return;
> >  	}
> >  
> > @@ -2261,7 +2266,7 @@ xfs_imap_lookup(
> >  
> >  	/* check that the returned record contains the required inode */
> >  	if (rec.ir_startino > agino ||
> > -	    rec.ir_startino + mp->m_ialloc_inos <= agino)
> > +	    rec.ir_startino + mp->m_ino_geo.ig_ialloc_inos <= agino)
> >  		return -EINVAL;
> >  
> >  	/* for untrusted inodes check it is allocated first */
> > @@ -2352,7 +2357,7 @@ xfs_imap(
> >  	 * If the inode cluster size is the same as the blocksize or
> >  	 * smaller we get to the buffer by simple arithmetics.
> >  	 */
> > -	if (mp->m_blocks_per_cluster == 1) {
> > +	if (mp->m_ino_geo.ig_blocks_per_cluster == 1) {
> 
> igeo...

These dorky...

> >  		offset = XFS_INO_TO_OFFSET(mp, ino);
> >  		ASSERT(offset < mp->m_sb.sb_inopblock);
> >  
> > @@ -2368,8 +2373,8 @@ xfs_imap(
> >  	 * find the location. Otherwise we have to do a btree
> >  	 * lookup to find the location.
> >  	 */
> > -	if (mp->m_inoalign_mask) {
> > -		offset_agbno = agbno & mp->m_inoalign_mask;
> > +	if (mp->m_ino_geo.ig_inoalign_mask) {
> > +		offset_agbno = agbno & mp->m_ino_geo.ig_inoalign_mask;
> 
> and here too.

little...

> >  		chunk_agbno = agbno - offset_agbno;
> >  	} else {
> >  		error = xfs_imap_lookup(mp, tp, agno, agino, agbno,
> > @@ -2381,13 +2386,13 @@ xfs_imap(
> >  out_map:
> >  	ASSERT(agbno >= chunk_agbno);
> >  	cluster_agbno = chunk_agbno +
> > -		((offset_agbno / mp->m_blocks_per_cluster) *
> > -		 mp->m_blocks_per_cluster);
> > +		((offset_agbno / mp->m_ino_geo.ig_blocks_per_cluster) *
> > +		 mp->m_ino_geo.ig_blocks_per_cluster);
> 
> And here.

omissions and...

> >  	offset = ((agbno - cluster_agbno) * mp->m_sb.sb_inopblock) +
> >  		XFS_INO_TO_OFFSET(mp, ino);
> >  
> >  	imap->im_blkno = XFS_AGB_TO_DADDR(mp, agno, cluster_agbno);
> > -	imap->im_len = XFS_FSB_TO_BB(mp, mp->m_blocks_per_cluster);
> > +	imap->im_len = XFS_FSB_TO_BB(mp, mp->m_ino_geo.ig_blocks_per_cluster);
> 
> and here...

...untidy bits.

> >  	imap->im_boffset = (unsigned short)(offset << mp->m_sb.sb_inodelog);
> >  
> >  	/*
> > @@ -2409,7 +2414,7 @@ xfs_imap(
> >  }
> >  
> >  /*
> > - * Compute and fill in value of m_in_maxlevels.
> > + * Compute and fill in value of m_ino_geo.ig_in_maxlevels.
> >   */
> >  void
> >  xfs_ialloc_compute_maxlevels(
> > @@ -2418,8 +2423,8 @@ xfs_ialloc_compute_maxlevels(
> >  	uint		inodes;
> >  
> >  	inodes = (1LL << XFS_INO_AGINO_BITS(mp)) >> XFS_INODES_PER_CHUNK_LOG;
> > -	mp->m_in_maxlevels = xfs_btree_compute_maxlevels(mp->m_inobt_mnr,
> > -							 inodes);
> > +	mp->m_ino_geo.ig_in_maxlevels = xfs_btree_compute_maxlevels(
> > +			mp->m_ino_geo.ig_inobt_mnr, inodes);
> 
> 
> Once we take away the macro:
> 
> 	inode = (1LL << igeo->agino_log) >> XFS_INODES_PER_CHUNK_LOG
> 	igeo->inobt_maxlevels = xfs_btree_compute_maxlevels(igeo->inobt_mnr,
> 							inodes);
> 
> So, shouldn't we just pass igeo into this function now?

Yeah.

> >  }
> >  
> >  /*
> > diff --git a/fs/xfs/libxfs/xfs_ialloc.h b/fs/xfs/libxfs/xfs_ialloc.h
> > index e936b7cc9389..b74fa2addd51 100644
> > --- a/fs/xfs/libxfs/xfs_ialloc.h
> > +++ b/fs/xfs/libxfs/xfs_ialloc.h
> > @@ -28,9 +28,9 @@ static inline int
> >  xfs_icluster_size_fsb(
> >  	struct xfs_mount	*mp)
> >  {
> > -	if (mp->m_sb.sb_blocksize >= mp->m_inode_cluster_size)
> > +	if (mp->m_sb.sb_blocksize >= mp->m_ino_geo.ig_min_cluster_size)
> >  		return 1;
> > -	return mp->m_inode_cluster_size >> mp->m_sb.sb_blocklog;
> > +	return mp->m_ino_geo.ig_min_cluster_size >> mp->m_sb.sb_blocklog;
> >  }
> 
> The return value of this is placed in the mp->m_ino_geo structure.
> This should pass in the igeo structure the result is written into.
> It's other caller should be using the value in the igeo structure,
> not calling this function.

Ok.

> > index bc2dfacd2f4a..79cc5cf21e1b 100644
> > --- a/fs/xfs/libxfs/xfs_ialloc_btree.c
> > +++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
> > @@ -28,7 +28,7 @@ xfs_inobt_get_minrecs(
> >  	struct xfs_btree_cur	*cur,
> >  	int			level)
> >  {
> > -	return cur->bc_mp->m_inobt_mnr[level != 0];
> > +	return cur->bc_mp->m_ino_geo.ig_inobt_mnr[level != 0];
> >  }
> 
> Put a igeo pointer in the inobt union of the btree cursor?
> 
> 	return cur->bc_private.a.igeo->inobt_mnr[level != 0];

Or just M_IGEO(cur->bc_mp)->inobt_mnr[level != 0]; ?

> >  
> >  STATIC struct xfs_btree_cur *
> > @@ -164,7 +164,7 @@ xfs_inobt_get_maxrecs(
> >  	struct xfs_btree_cur	*cur,
> >  	int			level)
> >  {
> > -	return cur->bc_mp->m_inobt_mxr[level != 0];
> > +	return cur->bc_mp->m_ino_geo.ig_inobt_mxr[level != 0];
> >  }
> >  
> >  STATIC void
> > @@ -281,10 +281,11 @@ xfs_inobt_verify(
> >  
> >  	/* level verification */
> >  	level = be16_to_cpu(block->bb_level);
> > -	if (level >= mp->m_in_maxlevels)
> > +	if (level >= mp->m_ino_geo.ig_in_maxlevels)
> >  		return __this_address;
> >  
> > -	return xfs_btree_sblock_verify(bp, mp->m_inobt_mxr[level != 0]);
> > +	return xfs_btree_sblock_verify(bp,
> > +			mp->m_ino_geo.ig_inobt_mxr[level != 0]);
> >  }
> >  
> >  static void
> > @@ -546,7 +547,7 @@ xfs_inobt_max_size(
> >  	xfs_agblock_t		agblocks = xfs_ag_block_count(mp, agno);
> >  
> >  	/* Bail out if we're uninitialized, which can happen in mkfs. */
> > -	if (mp->m_inobt_mxr[0] == 0)
> > +	if (mp->m_ino_geo.ig_inobt_mxr[0] == 0)
> >  		return 0;
> >  
> >  	/*
> > @@ -558,7 +559,7 @@ xfs_inobt_max_size(
> >  	    XFS_FSB_TO_AGNO(mp, mp->m_sb.sb_logstart) == agno)
> >  		agblocks -= mp->m_sb.sb_logblocks;
> >  
> > -	return xfs_btree_calc_size(mp->m_inobt_mnr,
> > +	return xfs_btree_calc_size(mp->m_ino_geo.ig_inobt_mnr,
> >  				(uint64_t)agblocks * mp->m_sb.sb_inopblock /
> >  					XFS_INODES_PER_CHUNK);
> >  }
> > @@ -619,5 +620,5 @@ xfs_iallocbt_calc_size(
> >  	struct xfs_mount	*mp,
> >  	unsigned long long	len)
> >  {
> > -	return xfs_btree_calc_size(mp->m_inobt_mnr, len);
> > +	return xfs_btree_calc_size(mp->m_ino_geo.ig_inobt_mnr, len);
> 
> Should pass igeo into this function now, not xfs_mount.

Ok.

> >  }
> > diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> > index e021d5133ccb..641aa1c2f1ae 100644
> > --- a/fs/xfs/libxfs/xfs_inode_buf.c
> > +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> > @@ -36,7 +36,7 @@ xfs_inobp_check(
> >  	int		j;
> >  	xfs_dinode_t	*dip;
> >  
> > -	j = mp->m_inode_cluster_size >> mp->m_sb.sb_inodelog;
> > +	j = mp->m_ino_geo.ig_min_cluster_size >> mp->m_sb.sb_inodelog;
> 
> isn't that "inodes per cluster"?

Yes.  However, I think I want to keep the "move everything to the
structure" change in one patch and make a new one for "convert all the
open-coded igeo bits".

> >  
> >  	for (i = 0; i < j; i++) {
> >  		dip = xfs_buf_offset(bp, i * mp->m_sb.sb_inodesize);
> > diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
> > index e76a3e5d28d7..9416fc741788 100644
> > --- a/fs/xfs/libxfs/xfs_sb.c
> > +++ b/fs/xfs/libxfs/xfs_sb.c
> > @@ -804,16 +804,18 @@ const struct xfs_buf_ops xfs_sb_quiet_buf_ops = {
> >   */
> >  void
> >  xfs_sb_mount_common(
> > -	struct xfs_mount *mp,
> > -	struct xfs_sb	*sbp)
> > +	struct xfs_mount	*mp,
> > +	struct xfs_sb		*sbp)
> >  {
> > +	struct xfs_ino_geometry	*igeo = &mp->m_ino_geo;
> > +
> >  	mp->m_agfrotor = mp->m_agirotor = 0;
> >  	mp->m_maxagi = mp->m_sb.sb_agcount;
> >  	mp->m_blkbit_log = sbp->sb_blocklog + XFS_NBBYLOG;
> >  	mp->m_blkbb_log = sbp->sb_blocklog - BBSHIFT;
> >  	mp->m_sectbb_log = sbp->sb_sectlog - BBSHIFT;
> >  	mp->m_agno_log = xfs_highbit32(sbp->sb_agcount - 1) + 1;
> > -	mp->m_agino_log = sbp->sb_inopblog + sbp->sb_agblklog;
> > +	mp->m_ino_geo.ig_agino_log = sbp->sb_inopblog + sbp->sb_agblklog;
> 
> igeo.
> 
> >  
> > @@ -307,7 +308,8 @@ xfs_calc_iunlink_remove_reservation(
> >  	struct xfs_mount        *mp)
> >  {
> >  	return xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) +
> > -	       2 * max_t(uint, XFS_FSB_TO_B(mp, 1), mp->m_inode_cluster_size);
> > +	       2 * max_t(uint, XFS_FSB_TO_B(mp, 1),
> > +			 mp->m_ino_geo.ig_min_cluster_size);
> >  }
> 
> I'm starting to think a M_IGEO(mp)-like macro is in order here....

Already done.

> >  
> > diff --git a/fs/xfs/scrub/ialloc.c b/fs/xfs/scrub/ialloc.c
> > index 9b47117180cb..fa7386bf76e9 100644
> > --- a/fs/xfs/scrub/ialloc.c
> > +++ b/fs/xfs/scrub/ialloc.c
> > @@ -230,7 +230,7 @@ xchk_iallocbt_check_cluster(
> >  	int				error = 0;
> >  
> >  	nr_inodes = min_t(unsigned int, XFS_INODES_PER_CHUNK,
> > -			mp->m_inodes_per_cluster);
> > +			mp->m_ino_geo.ig_inodes_per_cluster);
> 
> igeo.... (many uses in this function)
> 
> > @@ -355,6 +356,7 @@ xchk_iallocbt_rec_alignment(
> >  {
> >  	struct xfs_mount		*mp = bs->sc->mp;
> >  	struct xchk_iallocbt		*iabt = bs->private;
> > +	struct xfs_ino_geometry		*ig = &mp->m_ino_geo;
> 
> igeo, for consistency with the rest of the code.
> 
> > @@ -2567,7 +2568,8 @@ xfs_ifree_cluster(
> >  		 * to mark all the active inodes on the buffer stale.
> >  		 */
> >  		bp = xfs_trans_get_buf(tp, mp->m_ddev_targp, blkno,
> > -					mp->m_bsize * mp->m_blocks_per_cluster,
> > +					mp->m_bsize *
> > +						igeo->ig_blocks_per_cluster,
> >  					XBF_UNMAPPED);
> 
> Back off the indent, don't use another line :)

Ok.

> > @@ -3476,19 +3478,20 @@ xfs_iflush_cluster(
> >  	int			cilist_size;
> >  	struct xfs_inode	**cilist;
> >  	struct xfs_inode	*cip;
> > +	struct xfs_ino_geometry	*igeo = &mp->m_ino_geo;
> >  	int			nr_found;
> >  	int			clcount = 0;
> >  	int			i;
> >  
> >  	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
> >  
> > -	inodes_per_cluster = mp->m_inode_cluster_size >> mp->m_sb.sb_inodelog;
> > +	inodes_per_cluster = igeo->ig_min_cluster_size >> mp->m_sb.sb_inodelog;
> 
> that's igeo->inodes_per_cluster again, right?

Yep.  I think the fact that we don't adjust m_inode_cluster_size to
match what xfs_icluster_size_fsb spits out means that the usage here is
incorrect, but we can fix in a subsequent patch.

> >  	cilist_size = inodes_per_cluster * sizeof(xfs_inode_t *);
> >  	cilist = kmem_alloc(cilist_size, KM_MAYFAIL|KM_NOFS);
> >  	if (!cilist)
> >  		goto out_put;
> >  
> > -	mask = ~(((mp->m_inode_cluster_size >> mp->m_sb.sb_inodelog)) - 1);
> > +	mask = ~(((igeo->ig_min_cluster_size >> mp->m_sb.sb_inodelog)) - 1);
> 
> Isn't that:
> 
> 	mask = ~(inodes_per_cluster - 1);

Yep.
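
Putting that together with the inodes_per_cluster fix above, the
corrected lines would look something like this (sketch only, for that
follow-up patch):

	inodes_per_cluster = igeo->ig_inodes_per_cluster;
	...
	mask = ~(inodes_per_cluster - 1);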

> >  	first_index = XFS_INO_TO_AGINO(mp, ip->i_ino) & mask;
> >  	rcu_read_lock();
> >  	/* really need a gang lookup range call here */
> > diff --git a/fs/xfs/xfs_itable.c b/fs/xfs/xfs_itable.c
> > index 1e1a0af1dd34..cff28ee73deb 100644
> > --- a/fs/xfs/xfs_itable.c
> > +++ b/fs/xfs/xfs_itable.c
> > @@ -167,6 +167,7 @@ xfs_bulkstat_ichunk_ra(
> >  	xfs_agnumber_t			agno,
> >  	struct xfs_inobt_rec_incore	*irec)
> >  {
> > +	struct xfs_ino_geometry		*igeo = &mp->m_ino_geo;
> >  	xfs_agblock_t			agbno;
> >  	struct blk_plug			plug;
> >  	int				i;	/* inode chunk index */
> > @@ -174,12 +175,14 @@ xfs_bulkstat_ichunk_ra(
> >  	agbno = XFS_AGINO_TO_AGBNO(mp, irec->ir_startino);
> >  
> >  	blk_start_plug(&plug);
> > -	for (i = 0; i < XFS_INODES_PER_CHUNK;
> > -	     i += mp->m_inodes_per_cluster, agbno += mp->m_blocks_per_cluster) {
> > -		if (xfs_inobt_maskn(i, mp->m_inodes_per_cluster) &
> > +	for (i = 0;
> > +	     i < XFS_INODES_PER_CHUNK;
> > +	     i += igeo->ig_inodes_per_cluster,
> > +			agbno += igeo->ig_blocks_per_cluster) {
> > +		if (xfs_inobt_maskn(i, igeo->ig_inodes_per_cluster) &
> >  		    ~irec->ir_free) {
> >  			xfs_btree_reada_bufs(mp, agno, agbno,
> > -					mp->m_blocks_per_cluster,
> > +					igeo->ig_blocks_per_cluster,
> >  					&xfs_inode_buf_ops);
> >  		}
> 
> That's a mess :(
> 
> 	for (i = 0; i < XFS_INODES_PER_CHUNK; i += igeo->inodes_per_cluster) {
> 		if (xfs_inobt_maskn(i, igeo->inodes_per_cluster) &
> 							~irec->ir_free) {
> 			xfs_btree_reada_bufs(mp, agno, agbno,
> 					igeo->ig_blocks_per_cluster,
> 					&xfs_inode_buf_ops);
> 		}
> 		agbno += igeo->blocks_per_cluster;

<nod> I'll clean it up some more.

> 	}
> 
> > diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> > index 9329f5adbfbe..15118e531184 100644
> > --- a/fs/xfs/xfs_log_recover.c
> > +++ b/fs/xfs/xfs_log_recover.c
> > @@ -2882,19 +2882,19 @@ xlog_recover_buffer_pass2(
> >  	 *
> >  	 * Also make sure that only inode buffers with good sizes stay in
> >  	 * the buffer cache.  The kernel moves inodes in buffers of 1 block
> > -	 * or mp->m_inode_cluster_size bytes, whichever is bigger.  The inode
> > +	 * or ig_min_cluster_size bytes, whichever is bigger.  The inode
> >  	 * buffers in the log can be a different size if the log was generated
> >  	 * by an older kernel using unclustered inode buffers or a newer kernel
> >  	 * running with a different inode cluster size.  Regardless, if the
> > -	 * the inode buffer size isn't max(blocksize, mp->m_inode_cluster_size)
> > -	 * for *our* value of mp->m_inode_cluster_size, then we need to keep
> > +	 * the inode buffer size isn't max(blocksize, ig_min_cluster_size)
> > +	 * for *our* value of ig_min_cluster_size, then we need to keep
> >  	 * the buffer out of the buffer cache so that the buffer won't
> >  	 * overlap with future reads of those inodes.
> >  	 */
> >  	if (XFS_DINODE_MAGIC ==
> >  	    be16_to_cpu(*((__be16 *)xfs_buf_offset(bp, 0))) &&
> >  	    (BBTOB(bp->b_io_length) != max(log->l_mp->m_sb.sb_blocksize,
> > -			(uint32_t)log->l_mp->m_inode_cluster_size))) {
> > +			(uint32_t)log->l_mp->m_ino_geo.ig_min_cluster_size))) {
> 
> cluster size is already an unsigned int so the cast ca go.

Ok.

> >  		xfs_buf_stale(bp);
> >  		error = xfs_bwrite(bp);
> >  	} else {
> > @@ -3849,6 +3849,7 @@ xlog_recover_do_icreate_pass2(
> >  {
> >  	struct xfs_mount	*mp = log->l_mp;
> >  	struct xfs_icreate_log	*icl;
> > +	struct xfs_ino_geometry	*igeo = &mp->m_ino_geo;
> >  	xfs_agnumber_t		agno;
> >  	xfs_agblock_t		agbno;
> >  	unsigned int		count;
> > @@ -3898,10 +3899,10 @@ xlog_recover_do_icreate_pass2(
> >  
> >  	/*
> >  	 * The inode chunk is either full or sparse and we only support
> > -	 * m_ialloc_min_blks sized sparse allocations at this time.
> > +	 * m_ino_geo.ig_ialloc_min_blks sized sparse allocations at this time.
> >  	 */
> > -	if (length != mp->m_ialloc_blks &&
> > -	    length != mp->m_ialloc_min_blks) {
> > +	if (length != igeo->ig_ialloc_blks &&
> > +	    length != igeo->ig_ialloc_min_blks) {
> >  		xfs_warn(log->l_mp,
> >  			 "%s: unsupported chunk length", __FUNCTION__);
> >  		return -EINVAL;
> > @@ -3921,13 +3922,13 @@ xlog_recover_do_icreate_pass2(
> >  	 * buffers for cancellation so we don't overwrite anything written after
> >  	 * a cancellation.
> >  	 */
> > -	bb_per_cluster = XFS_FSB_TO_BB(mp, mp->m_blocks_per_cluster);
> > -	nbufs = length / mp->m_blocks_per_cluster;
> > +	bb_per_cluster = XFS_FSB_TO_BB(mp, igeo->ig_blocks_per_cluster);
> > +	nbufs = length / igeo->ig_blocks_per_cluster;
> >  	for (i = 0, cancel_count = 0; i < nbufs; i++) {
> >  		xfs_daddr_t	daddr;
> >  
> > -		daddr = XFS_AGB_TO_DADDR(mp, agno,
> > -					 agbno + i * mp->m_blocks_per_cluster);
> > +		daddr = XFS_AGB_TO_DADDR(mp, agno, agbno +
> > +				i * igeo->ig_blocks_per_cluster);
> 
> makes no sense to change the line break location.
> 
> 		daddr = XFS_AGB_TO_DADDR(mp, agno,
> 				agbno + i * igeo->ig_blocks_per_cluster);

Ok.

> 
> 
> >   */
> >  STATIC void
> > -xfs_set_maxicount(xfs_mount_t *mp)
> > +xfs_set_maxicount(
> > +	struct xfs_mount	*mp)
> >  {
> > -	xfs_sb_t	*sbp = &(mp->m_sb);
> > -	uint64_t	icount;
> > +	struct xfs_sb		*sbp = &(mp->m_sb);
> 
> kill the ().
> 
> > +	struct xfs_ino_geometry	*igeo = &mp->m_ino_geo;
> > +	uint64_t		icount;
> >  
> >  	if (sbp->sb_imax_pct) {
> >  		/*
> > @@ -445,11 +447,11 @@ xfs_set_maxicount(xfs_mount_t *mp)
> >  		 */
> >  		icount = sbp->sb_dblocks * sbp->sb_imax_pct;
> >  		do_div(icount, 100);
> > -		do_div(icount, mp->m_ialloc_blks);
> > -		mp->m_maxicount = (icount * mp->m_ialloc_blks)  <<
> > -				   sbp->sb_inopblog;
> > +		do_div(icount, igeo->ig_ialloc_blks);
> > +		igeo->ig_maxicount = XFS_FSB_TO_INO(mp,
> > +				icount * igeo->ig_ialloc_blks);
> >  	} else {
> > -		mp->m_maxicount = 0;
> > +		igeo->ig_maxicount = 0;
> >  	}
> >  }
> >  
> > @@ -518,18 +520,18 @@ xfs_set_inoalignment(xfs_mount_t *mp)
> >  {
> >  	if (xfs_sb_version_hasalign(&mp->m_sb) &&
> >  		mp->m_sb.sb_inoalignmt >= xfs_icluster_size_fsb(mp))
> > -		mp->m_inoalign_mask = mp->m_sb.sb_inoalignmt - 1;
> > +		mp->m_ino_geo.ig_inoalign_mask = mp->m_sb.sb_inoalignmt - 1;
> >  	else
> > -		mp->m_inoalign_mask = 0;
> > +		mp->m_ino_geo.ig_inoalign_mask = 0;
> >  	/*
> >  	 * If we are using stripe alignment, check whether
> >  	 * the stripe unit is a multiple of the inode alignment
> >  	 */
> > -	if (mp->m_dalign && mp->m_inoalign_mask &&
> > -	    !(mp->m_dalign & mp->m_inoalign_mask))
> > -		mp->m_sinoalign = mp->m_dalign;
> > +	if (mp->m_dalign && mp->m_ino_geo.ig_inoalign_mask &&
> > +	    !(mp->m_dalign & mp->m_ino_geo.ig_inoalign_mask))
> > +		mp->m_ino_geo.ig_sinoalign = mp->m_dalign;
> >  	else
> > -		mp->m_sinoalign = 0;
> > +		mp->m_ino_geo.ig_sinoalign = 0;
> 
> should pass in igeo to this function....

Ok.

> >  }
> >  
> >  /*
> > @@ -683,6 +685,7 @@ xfs_mountfs(
> >  {
> >  	struct xfs_sb		*sbp = &(mp->m_sb);
> >  	struct xfs_inode	*rip;
> > +	struct xfs_ino_geometry	*igeo = &mp->m_ino_geo;
> >  	uint64_t		resblks;
> >  	uint			quotamount = 0;
> >  	uint			quotaflags = 0;
> > @@ -797,18 +800,20 @@ xfs_mountfs(
> >  	 * has set the inode alignment value appropriately for larger cluster
> >  	 * sizes.
> >  	 */
> > -	mp->m_inode_cluster_size = XFS_INODE_BIG_CLUSTER_SIZE;
> > +	igeo->ig_min_cluster_size = XFS_INODE_BIG_CLUSTER_SIZE;
> >  	if (xfs_sb_version_hascrc(&mp->m_sb)) {
> > -		int	new_size = mp->m_inode_cluster_size;
> > +		int	new_size = igeo->ig_min_cluster_size;
> >  
> >  		new_size *= mp->m_sb.sb_inodesize / XFS_DINODE_MIN_SIZE;
> >  		if (mp->m_sb.sb_inoalignmt >= XFS_B_TO_FSBT(mp, new_size))
> > -			mp->m_inode_cluster_size = new_size;
> > +			igeo->ig_min_cluster_size = new_size;
> >  	}
> > -	mp->m_blocks_per_cluster = xfs_icluster_size_fsb(mp);
> > -	mp->m_inodes_per_cluster = XFS_FSB_TO_INO(mp, mp->m_blocks_per_cluster);
> > -	mp->m_cluster_align = xfs_ialloc_cluster_alignment(mp);
> > -	mp->m_cluster_align_inodes = XFS_FSB_TO_INO(mp, mp->m_cluster_align);
> > +	igeo->ig_blocks_per_cluster = xfs_icluster_size_fsb(mp);
> > +	igeo->ig_inodes_per_cluster = XFS_FSB_TO_INO(mp,
> > +			igeo->ig_blocks_per_cluster);
> > +	igeo->ig_cluster_align = xfs_ialloc_cluster_alignment(mp);
> > +	igeo->ig_cluster_align_inodes = XFS_FSB_TO_INO(mp,
> > +			igeo->ig_cluster_align);
> 
> Can we separate out all the igeo initialisation into a single init
> function rather than being spread out over several functions?

I'll try it out.  The current spaghetti is pretty gross.

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 02/11] xfs: create simplified inode walk function
  2019-05-29 22:26 ` [PATCH 02/11] xfs: create simplified inode walk function Darrick J. Wong
@ 2019-06-04  7:41   ` Dave Chinner
  2019-06-04 16:39     ` Darrick J. Wong
  0 siblings, 1 reply; 19+ messages in thread
From: Dave Chinner @ 2019-06-04  7:41 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, May 29, 2019 at 03:26:27PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

Description? :)

> +/*
> + * Walking All the Inodes in the Filesystem
> + * ========================================
> + * Starting at some @startino, call a walk function on every allocated inode in
> + * the system.  The walk function is called with the relevant inode number and
> + * a pointer to caller-provided data.  The walk function can return the usual
> + * negative error code, 0, or XFS_IWALK_ABORT to stop the iteration.  This
> + * return value is returned to the caller.

The walker iterates inodes in what order? What does it do with
inodes before @startino?

> + * Internally, we allow the walk function to do anything, which means that we
> + * cannot maintain the inobt cursor or our lock on the AGI buffer.  We
> + * therefore build up a batch of inobt records in kernel memory and only call
> + * the walk function when our memory buffer is full.
> + */

"It is the responsibility of the walk function to ensure it accesses
allocated inodes, as the inobt records may be stale by the time they are
acted upon."

> +struct xfs_iwalk_ag {
> +	struct xfs_mount		*mp;
> +	struct xfs_trans		*tp;
> +
> +	/* Where do we start the traversal? */
> +	xfs_ino_t			startino;
> +
> +	/* Array of inobt records we cache. */
> +	struct xfs_inobt_rec_incore	*recs;
> +	unsigned int			sz_recs;
> +	unsigned int			nr_recs;

sz is the size of the allocated array, nr is the number of entries
used?

> +	/* Inode walk function and data pointer. */
> +	xfs_iwalk_fn			iwalk_fn;
> +	void				*data;
> +};
> +
> +/* Allocate memory for a walk. */
> +STATIC int
> +xfs_iwalk_allocbuf(
> +	struct xfs_iwalk_ag	*iwag)
> +{
> +	size_t			size;
> +
> +	ASSERT(iwag->recs == NULL);
> +	iwag->nr_recs = 0;
> +
> +	/* Allocate a prefetch buffer for inobt records. */
> +	size = iwag->sz_recs * sizeof(struct xfs_inobt_rec_incore);
> +	iwag->recs = kmem_alloc(size, KM_SLEEP);
> +	if (iwag->recs == NULL)
> +		return -ENOMEM;

KM_SLEEP will never fail. You mean to use KM_MAYFAIL here?
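
i.e., presumably:

	iwag->recs = kmem_alloc(size, KM_MAYFAIL);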

> +
> +	return 0;
> +}
> +
> +/* Free memory we allocated for a walk. */
> +STATIC void
> +xfs_iwalk_freebuf(
> +	struct xfs_iwalk_ag	*iwag)
> +{
> +	ASSERT(iwag->recs != NULL);
> +	kmem_free(iwag->recs);
> +}

No need for the assert here - kmem_free() handles null pointers just
fine.

> +/* For each inuse inode in each cached inobt record, call our function. */
> +STATIC int
> +xfs_iwalk_ag_recs(
> +	struct xfs_iwalk_ag		*iwag)
> +{
> +	struct xfs_mount		*mp = iwag->mp;
> +	struct xfs_trans		*tp = iwag->tp;
> +	struct xfs_inobt_rec_incore	*irec;
> +	xfs_ino_t			ino;
> +	unsigned int			i, j;
> +	xfs_agnumber_t			agno;
> +	int				error;
> +
> +	agno = XFS_INO_TO_AGNO(mp, iwag->startino);
> +	for (i = 0, irec = iwag->recs; i < iwag->nr_recs; i++, irec++) {

I kinda prefer single iterator loops for array walking like this:

	for (i = 0; i < iwag->nr_recs; i++) {
		irec = &iwag->recs[i];

It's much easier to read and understand what is going on...

> +		trace_xfs_iwalk_ag_rec(mp, agno, irec->ir_startino,
> +				irec->ir_free);

Could just pass irec to the trace function and extract startino/free
within the tracepoint macro....

> +		for (j = 0; j < XFS_INODES_PER_CHUNK; j++) {
> +			/* Skip if this inode is free */
> +			if (XFS_INOBT_MASK(j) & irec->ir_free)
> +				continue;
> +
> +			/* Otherwise call our function. */
> +			ino = XFS_AGINO_TO_INO(mp, agno, irec->ir_startino + j);
> +			error = iwag->iwalk_fn(mp, tp, ino, iwag->data);
> +			if (error)
> +				return error;
> +		}
> +	}
> +
> +	iwag->nr_recs = 0;

Why is this zeroed here?

> +	return 0;
> +}
> +
> +/* Read AGI and create inobt cursor. */
> +static inline int
> +xfs_iwalk_inobt_cur(
> +	struct xfs_mount	*mp,
> +	struct xfs_trans	*tp,
> +	xfs_agnumber_t		agno,
> +	struct xfs_btree_cur	**curpp,
> +	struct xfs_buf		**agi_bpp)
> +{
> +	struct xfs_btree_cur	*cur;
> +	int			error;
> +
> +	ASSERT(*agi_bpp == NULL);
> +
> +	error = xfs_ialloc_read_agi(mp, tp, agno, agi_bpp);
> +	if (error)
> +		return error;
> +
> +	cur = xfs_inobt_init_cursor(mp, tp, *agi_bpp, agno, XFS_BTNUM_INO);
> +	if (!cur)
> +		return -ENOMEM;
> +	*curpp = cur;
> +	return 0;
> +}

This is a common pattern. Used in xfs_imap_lookup(), xfs_bulkstat(),
xfs_inumbers and xfs_inobt_count_blocks. Perhaps should be a common
inobt function?
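
Something like this, perhaps (sketch only, with a hypothetical name and
home -- it's the body of xfs_iwalk_inobt_cur above, hoisted somewhere
common such as xfs_ialloc_btree.c):

	int
	xfs_inobt_cur(
		struct xfs_mount	*mp,
		struct xfs_trans	*tp,
		xfs_agnumber_t		agno,
		struct xfs_btree_cur	**curpp,
		struct xfs_buf		**agi_bpp)
	{
		int			error;

		error = xfs_ialloc_read_agi(mp, tp, agno, agi_bpp);
		if (error)
			return error;

		/* on failure the caller releases *agi_bpp, as in iwalk */
		*curpp = xfs_inobt_init_cursor(mp, tp, *agi_bpp, agno,
				XFS_BTNUM_INO);
		if (!*curpp)
			return -ENOMEM;
		return 0;
	}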

> +
> +/* Delete cursor and let go of AGI. */
> +static inline void
> +xfs_iwalk_del_inobt(
> +	struct xfs_trans	*tp,
> +	struct xfs_btree_cur	**curpp,
> +	struct xfs_buf		**agi_bpp,
> +	int			error)
> +{
> +	if (*curpp) {
> +		xfs_btree_del_cursor(*curpp, error);
> +		*curpp = NULL;
> +	}
> +	if (*agi_bpp) {
> +		xfs_trans_brelse(tp, *agi_bpp);
> +		*agi_bpp = NULL;
> +	}
> +}
> +
> +/*
> + * Set ourselves up for walking inobt records starting from a given point in
> + * the filesystem.
> + *
> + * If caller passed in a nonzero start inode number, load the record from the
> + * inobt and make the record look like all the inodes before agino are free so
> + * that we skip them, and then move the cursor to the next inobt record.  This
> + * is how we support starting an iwalk in the middle of an inode chunk.
> + *
> + * If the caller passed in a start number of zero, move the cursor to the first
> + * inobt record.
> + *
> + * The caller is responsible for cleaning up the cursor and buffer pointer
> + * regardless of the error status.
> + */
> +STATIC int
> +xfs_iwalk_ag_start(
> +	struct xfs_iwalk_ag	*iwag,
> +	xfs_agnumber_t		agno,
> +	xfs_agino_t		agino,
> +	struct xfs_btree_cur	**curpp,
> +	struct xfs_buf		**agi_bpp,
> +	int			*has_more)
> +{
> +	struct xfs_mount	*mp = iwag->mp;
> +	struct xfs_trans	*tp = iwag->tp;
> +	int			icount;
> +	int			error;
> +
> +	/* Set up a fresh cursor and empty the inobt cache. */
> +	iwag->nr_recs = 0;
> +	error = xfs_iwalk_inobt_cur(mp, tp, agno, curpp, agi_bpp);
> +	if (error)
> +		return error;
> +
> +	/* Starting at the beginning of the AG?  That's easy! */
> +	if (agino == 0)
> +		return xfs_inobt_lookup(*curpp, 0, XFS_LOOKUP_GE, has_more);
> +
> +	/*
> +	 * Otherwise, we have to grab the inobt record where we left off, stuff
> +	 * the record into our cache, and then see if there are more records.
> +	 * We require a lookup cache of at least two elements so that we don't
> +	 * have to deal with tearing down the cursor to walk the records.
> +	 */
> +	error = xfs_bulkstat_grab_ichunk(*curpp, agino - 1, &icount,
> +			&iwag->recs[iwag->nr_recs]);
> +	if (error)
> +		return error;
> +	if (icount)
> +		iwag->nr_recs++;
> +
> +	ASSERT(iwag->nr_recs < iwag->sz_recs);

Why this code does what it does with nr_recs is a bit of a mystery
to me...

> +	return xfs_btree_increment(*curpp, 0, has_more);
> +}
> +
> +typedef int (*xfs_iwalk_ag_recs_fn)(struct xfs_iwalk_ag *iwag);
> +
> +/*
> + * Acknowledge that we added an inobt record to the cache.  Flush the inobt
> + * record cache if the buffer is full, and position the cursor wherever it
> + * needs to be so that we can keep going.
> + */
> +STATIC int
> +xfs_iwalk_ag_increment(
> +	struct xfs_iwalk_ag		*iwag,
> +	xfs_iwalk_ag_recs_fn		walk_ag_recs_fn,
> +	xfs_agnumber_t			agno,
> +	struct xfs_btree_cur		**curpp,
> +	struct xfs_buf			**agi_bpp,
> +	int				*has_more)
> +{
> +	struct xfs_mount		*mp = iwag->mp;
> +	struct xfs_trans		*tp = iwag->tp;
> +	struct xfs_inobt_rec_incore	*irec;
> +	xfs_agino_t			restart;
> +	int				error;
> +
> +	iwag->nr_recs++;
> +
> +	/* If there's space, just increment and look for more records. */
> +	if (iwag->nr_recs < iwag->sz_recs)
> +		return xfs_btree_increment(*curpp, 0, has_more);

Incrementing before explaining why we're incrementing seems a bit
fack-to-bront....

> +	/*
> +	 * Otherwise the record cache is full; delete the cursor and walk the
> +	 * records...
> +	 */
> +	xfs_iwalk_del_inobt(tp, curpp, agi_bpp, 0);
> +	irec = &iwag->recs[iwag->nr_recs - 1];
> +	restart = irec->ir_startino + XFS_INODES_PER_CHUNK - 1;
> +
> +	error = walk_ag_recs_fn(iwag);
> +	if (error)
> +		return error;

Urk, so an "increment" function actually run all the object callbacks?
But only if it fails to increment?

> +
> +	/* ...and recreate cursor where we left off. */
> +	error = xfs_iwalk_inobt_cur(mp, tp, agno, curpp, agi_bpp);
> +	if (error)
> +		return error;
> +
> +	return xfs_inobt_lookup(*curpp, restart, XFS_LOOKUP_GE, has_more);

And then it goes an increments anyway?

That's all a bit .... non-obvious. Especially as it has a single
caller - this should really be something like
xfs_iwalk_run_callbacks(). Bit more context below...

> +}
> +
> +/* Walk all inodes in a single AG, from @iwag->startino to the end of the AG. */
> +STATIC int
> +xfs_iwalk_ag(
> +	struct xfs_iwalk_ag		*iwag)
> +{
> +	struct xfs_mount		*mp = iwag->mp;
> +	struct xfs_trans		*tp = iwag->tp;
> +	struct xfs_buf			*agi_bp = NULL;
> +	struct xfs_btree_cur		*cur = NULL;
> +	xfs_agnumber_t			agno;
> +	xfs_agino_t			agino;
> +	int				has_more;
> +	int				error = 0;
> +
> +	/* Set up our cursor at the right place in the inode btree. */
> +	agno = XFS_INO_TO_AGNO(mp, iwag->startino);
> +	agino = XFS_INO_TO_AGINO(mp, iwag->startino);
> +	error = xfs_iwalk_ag_start(iwag, agno, agino, &cur, &agi_bp, &has_more);
> +	if (error)
> +		goto out_cur;
> +
> +	while (has_more) {
> +		struct xfs_inobt_rec_incore	*irec;
> +
> +		/* Fetch the inobt record. */
> +		irec = &iwag->recs[iwag->nr_recs];
> +		error = xfs_inobt_get_rec(cur, irec, &has_more);
> +		if (error)
> +			goto out_cur;
> +		if (!has_more)
> +			break;
> +
> +		/* No allocated inodes in this chunk; skip it. */
> +		if (irec->ir_freecount == irec->ir_count) {
> +			error = xfs_btree_increment(cur, 0, &has_more);
> +			goto next_loop;
> +		}
> +
> +		/*
> +		 * Start readahead for this inode chunk in anticipation of
> +		 * walking the inodes.
> +		 */
> +		xfs_bulkstat_ichunk_ra(mp, agno, irec);
> +
> +		/*
> +		 * Add this inobt record to our cache, flush the cache if
> +		 * needed, and move on to the next record.
> +		 */
> +		error = xfs_iwalk_ag_increment(iwag, xfs_iwalk_ag_recs, agno,
> +				&cur, &agi_bp, &has_more);

Ok, so given this loop already has an increment case in it, it seems
like it would be better to pull some of this function into the loop
somewhat like:

	while (has_more) {
		struct xfs_inobt_rec_incore	*irec;

		cond_resched();

		/* Fetch the inobt record. */
		irec = &iwag->recs[iwag->nr_recs];
		error = xfs_inobt_get_rec(cur, irec, &has_more);
		if (error || !has_more)
			break;

		/* No allocated inodes in this chunk; skip it. */
		if (irec->ir_freecount == irec->ir_count) {
			error = xfs_btree_increment(cur, 0, &has_more);
			if (error)
				break;
			continue;
		}

		/*
		 * Start readahead for this inode chunk in anticipation of
		 * walking the inodes.
		 */
		xfs_bulkstat_ichunk_ra(mp, agno, irec);

		/* If there's space in the buffer, just grab more records. */
		if (++iwag->nr_recs < iwag->sz_recs) {
			error = xfs_btree_increment(cur, 0, &has_more);
			if (error)
				break;
			continue;
		}

		error = xfs_iwalk_run_callbacks(iwag, ...);
	}

	xfs_iwalk_del_inobt(tp, &cur, &agi_bp, error);
	if (!iwag->nr_recs || error)
		return error;
	return xfs_iwalk_ag_recs(iwag);
}


> +	/* Walk any records left behind in the cache. */
> +	if (iwag->nr_recs) {
> +		xfs_iwalk_del_inobt(tp, &cur, &agi_bp, error);
> +		return xfs_iwalk_ag_recs(iwag);
> +	}
> +
> +out_cur:
> +	xfs_iwalk_del_inobt(tp, &cur, &agi_bp, error);
> +	return error;
> +}
> +
> +/*
> + * Given the number of inodes to prefetch, set the number of inobt records that
> + * we cache in memory, which controls the number of inodes we try to read
> + * ahead.
> + *
> + * If no max prefetch was given, default to one page's worth of inobt records;
> + * this should be plenty of inodes to read ahead.

That's a lot of inodes on a 64k page size machine. I think it would
> be better capped at a number that doesn't change with processor
architecture...

> + */
> +static inline void
> +xfs_iwalk_set_prefetch(
> +	struct xfs_iwalk_ag	*iwag,
> +	unsigned int		max_prefetch)
> +{
> +	if (max_prefetch)
> +		iwag->sz_recs = round_up(max_prefetch, XFS_INODES_PER_CHUNK) /
> +					XFS_INODES_PER_CHUNK;
> +	else
> +		iwag->sz_recs = PAGE_SIZE / sizeof(struct xfs_inobt_rec_incore);
> +
> +	/*
> +	 * Allocate enough space to prefetch at least two records so that we
> +	 * can cache both the inobt record where the iwalk started and the next
> +	 * record.  This simplifies the AG inode walk loop setup code.
> +	 */
> +	if (iwag->sz_recs < 2)
> +		iwag->sz_recs = 2;

	iwag->sz_recs = max(iwag->sz_recs, 2);

....
> +	xfs_iwalk_set_prefetch(&iwag, max_prefetch);
> +	error = xfs_iwalk_allocbuf(&iwag);
....
> +	xfs_iwalk_freebuf(&iwag);

I'd drop the "buf" from the names of those two functions...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 03/11] xfs: convert quotacheck to use the new iwalk functions
  2019-05-29 22:26 ` [PATCH 03/11] xfs: convert quotacheck to use the new iwalk functions Darrick J. Wong
@ 2019-06-04  7:52   ` Dave Chinner
  0 siblings, 0 replies; 19+ messages in thread
From: Dave Chinner @ 2019-06-04  7:52 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, May 29, 2019 at 03:26:34PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Convert quotacheck to use the new iwalk iterator to dig through the
> inodes.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

Looks good, much cleaner.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 04/11] xfs: bulkstat should copy lastip whenever userspace supplies one
  2019-05-29 22:26 ` [PATCH 04/11] xfs: bulkstat should copy lastip whenever userspace supplies one Darrick J. Wong
@ 2019-06-04  7:54   ` Dave Chinner
  2019-06-04 14:24     ` Darrick J. Wong
  0 siblings, 1 reply; 19+ messages in thread
From: Dave Chinner @ 2019-06-04  7:54 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, May 29, 2019 at 03:26:40PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> When userspace passes in a @lastip pointer we should copy the results
> back, even if the @ocount pointer is NULL.

Makes sense and the code change is simple enough, but this changes
what we return to userspace, right?  Does any of xfsprogs or fstests
test code actually exercise this case? If not, how have you
determined it isn't going to break anything?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 04/11] xfs: bulkstat should copy lastip whenever userspace supplies one
  2019-06-04  7:54   ` Dave Chinner
@ 2019-06-04 14:24     ` Darrick J. Wong
  0 siblings, 0 replies; 19+ messages in thread
From: Darrick J. Wong @ 2019-06-04 14:24 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Jun 04, 2019 at 05:54:42PM +1000, Dave Chinner wrote:
> On Wed, May 29, 2019 at 03:26:40PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > When userspace passes in a @lastip pointer we should copy the results
> > back, even if the @ocount pointer is NULL.
> 
> Makes sense and the code change is simple enough, but this changes
> what we return to userspace, right?  Does any of xfsprogs or fstests
> test code actually exercise this case? If not, how have you
> determined it isn't going to break anything?

Coming in a future xfstests submission along with other basic
functionality checks. :)

(Future, as in "later today"...)

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 02/11] xfs: create simplified inode walk function
  2019-06-04  7:41   ` Dave Chinner
@ 2019-06-04 16:39     ` Darrick J. Wong
  0 siblings, 0 replies; 19+ messages in thread
From: Darrick J. Wong @ 2019-06-04 16:39 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Jun 04, 2019 at 05:41:42PM +1000, Dave Chinner wrote:
> On Wed, May 29, 2019 at 03:26:27PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Description? :)
> 
> > +/*
> > + * Walking All the Inodes in the Filesystem
> > + * ========================================
> > + * Starting at some @startino, call a walk function on every allocated inode in
> > + * the system.  The walk function is called with the relevant inode number and
> > + * a pointer to caller-provided data.  The walk function can return the usual
> > + * negative error code, 0, or XFS_IWALK_ABORT to stop the iteration.  This
> > + * return value is returned to the caller.
> 
> The walker iterates inodes in what order? What does it do with
> inodes before @startino?

They're walked in increasing order, and it ignores the ones before @startino.

How about:

/*
 * This iterator function walks a subset of filesystem inodes in increasing
 * order from @startino until there are no more inodes.  For each allocated
 * inode it finds, it calls a walk function with the relevant inode number and
 * a pointer to caller-provided data.  The walk function can return the usual
 * negative error code to stop the iteration; 0 to continue the iteration; or
 * XFS_IWALK_ABORT to stop the iteration.  This return value is returned to the
 * caller.
 */
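
For illustration, a walk function obeying that contract might look like
this (a hypothetical example, not part of the patch):

	/* Count allocated inodes; bail out after the first hundred. */
	STATIC int
	xfs_example_iwalk_fn(
		struct xfs_mount	*mp,
		struct xfs_trans	*tp,
		xfs_ino_t		ino,
		void			*data)
	{
		unsigned int		*count = data;

		if (++(*count) >= 100)
			return XFS_IWALK_ABORT;	/* stop; passed to caller */
		return 0;			/* keep walking */
	}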

> > + * Internally, we allow the walk function to do anything, which means that we
> > + * cannot maintain the inobt cursor or our lock on the AGI buffer.  We
> > + * therefore build up a batch of inobt records in kernel memory and only call
> > + * the walk function when our memory buffer is full.
> > + */
> 
> "It is the responsibility of the walk function to ensure it accesses
> allocated inodes, as the inobt records may be stale by the time they are
> acted upon."

Added.

> 
> > +struct xfs_iwalk_ag {
> > +	struct xfs_mount		*mp;
> > +	struct xfs_trans		*tp;
> > +
> > +	/* Where do we start the traversal? */
> > +	xfs_ino_t			startino;
> > +
> > +	/* Array of inobt records we cache. */
> > +	struct xfs_inobt_rec_incore	*recs;
> > +	unsigned int			sz_recs;
> > +	unsigned int			nr_recs;
> 
> sz is the size of the allocated array, nr is the number of entries
> used?

Yes.  I'll clarify that:

	/* Number of entries allocated for the @recs array. */
	unsigned int			sz_recs;

	/* Number of entries in the @recs array that are in use. */
	unsigned int			nr_recs;


> > +	/* Inode walk function and data pointer. */
> > +	xfs_iwalk_fn			iwalk_fn;
> > +	void				*data;
> > +};
> > +
> > +/* Allocate memory for a walk. */
> > +STATIC int
> > +xfs_iwalk_allocbuf(
> > +	struct xfs_iwalk_ag	*iwag)
> > +{
> > +	size_t			size;
> > +
> > +	ASSERT(iwag->recs == NULL);
> > +	iwag->nr_recs = 0;
> > +
> > +	/* Allocate a prefetch buffer for inobt records. */
> > +	size = iwag->sz_recs * sizeof(struct xfs_inobt_rec_incore);
> > +	iwag->recs = kmem_alloc(size, KM_SLEEP);
> > +	if (iwag->recs == NULL)
> > +		return -ENOMEM;
> 
> KM_SLEEP will never fail. You mean to use KM_MAYFAIL here?
> 
> > +
> > +	return 0;
> > +}
> > +
> > +/* Free memory we allocated for a walk. */
> > +STATIC void
> > +xfs_iwalk_freebuf(
> > +	struct xfs_iwalk_ag	*iwag)
> > +{
> > +	ASSERT(iwag->recs != NULL);
> > +	kmem_free(iwag->recs);
> > +}
> 
> No need for the assert here - kmem_free() handles null pointers just
> fine.
> 
> > +/* For each inuse inode in each cached inobt record, call our function. */
> > +STATIC int
> > +xfs_iwalk_ag_recs(
> > +	struct xfs_iwalk_ag		*iwag)
> > +{
> > +	struct xfs_mount		*mp = iwag->mp;
> > +	struct xfs_trans		*tp = iwag->tp;
> > +	struct xfs_inobt_rec_incore	*irec;
> > +	xfs_ino_t			ino;
> > +	unsigned int			i, j;
> > +	xfs_agnumber_t			agno;
> > +	int				error;
> > +
> > +	agno = XFS_INO_TO_AGNO(mp, iwag->startino);
> > +	for (i = 0, irec = iwag->recs; i < iwag->nr_recs; i++, irec++) {
> 
> I kinda prefer single iterator loops for array walking like this:
> 
> 	for (i = 0; i < iwag->nr_recs; i++) {
> 		irec = &iwag->recs[i];
> 
> It's much easier to read and understand what is going on...

Ok, I'll shorten the variable scope while I'm at it.

> > +		trace_xfs_iwalk_ag_rec(mp, agno, irec->ir_startino,
> > +				irec->ir_free);
> 
> Could just pass irec to the trace function and extract startino/free
> within the tracepoint macro....

<nod>

> > +		for (j = 0; j < XFS_INODES_PER_CHUNK; j++) {
> > +			/* Skip if this inode is free */
> > +			if (XFS_INOBT_MASK(j) & irec->ir_free)
> > +				continue;
> > +
> > +			/* Otherwise call our function. */
> > +			ino = XFS_AGINO_TO_INO(mp, agno, irec->ir_startino + j);
> > +			error = iwag->iwalk_fn(mp, tp, ino, iwag->data);
> > +			if (error)
> > +				return error;
> > +		}
> > +	}
> > +
> > +	iwag->nr_recs = 0;
> 
> Why is this zeroed here?

Hmm, that should be pushed to the caller, especially given the name...

> > +	return 0;
> > +}
> > +
> > +/* Read AGI and create inobt cursor. */
> > +static inline int
> > +xfs_iwalk_inobt_cur(
> > +	struct xfs_mount	*mp,
> > +	struct xfs_trans	*tp,
> > +	xfs_agnumber_t		agno,
> > +	struct xfs_btree_cur	**curpp,
> > +	struct xfs_buf		**agi_bpp)
> > +{
> > +	struct xfs_btree_cur	*cur;
> > +	int			error;
> > +
> > +	ASSERT(*agi_bpp == NULL);
> > +
> > +	error = xfs_ialloc_read_agi(mp, tp, agno, agi_bpp);
> > +	if (error)
> > +		return error;
> > +
> > +	cur = xfs_inobt_init_cursor(mp, tp, *agi_bpp, agno, XFS_BTNUM_INO);
> > +	if (!cur)
> > +		return -ENOMEM;
> > +	*curpp = cur;
> > +	return 0;
> > +}
> 
> This is a common pattern. Used in xfs_imap_lookup(), xfs_bulkstat(),
> xfs_inumbers and xfs_inobt_count_blocks. Perhaps should be a common
> inobt function?

We're about to zap the middle two callers, but yes, these two could be
common functions.  I wasn't sure if it was worth it to save a few lines.

> > +
> > +/* Delete cursor and let go of AGI. */
> > +static inline void
> > +xfs_iwalk_del_inobt(
> > +	struct xfs_trans	*tp,
> > +	struct xfs_btree_cur	**curpp,
> > +	struct xfs_buf		**agi_bpp,
> > +	int			error)
> > +{
> > +	if (*curpp) {
> > +		xfs_btree_del_cursor(*curpp, error);
> > +		*curpp = NULL;
> > +	}
> > +	if (*agi_bpp) {
> > +		xfs_trans_brelse(tp, *agi_bpp);
> > +		*agi_bpp = NULL;
> > +	}
> > +}
> > +
> > +/*
> > + * Set ourselves up for walking inobt records starting from a given point in
> > + * the filesystem.
> > + *
> > + * If caller passed in a nonzero start inode number, load the record from the
> > + * inobt and make the record look like all the inodes before agino are free so
> > + * that we skip them, and then move the cursor to the next inobt record.  This
> > + * is how we support starting an iwalk in the middle of an inode chunk.
> > + *
> > + * If the caller passed in a start number of zero, move the cursor to the first
> > + * inobt record.
> > + *
> > + * The caller is responsible for cleaning up the cursor and buffer pointer
> > + * regardless of the error status.
> > + */
> > +STATIC int
> > +xfs_iwalk_ag_start(
> > +	struct xfs_iwalk_ag	*iwag,
> > +	xfs_agnumber_t		agno,
> > +	xfs_agino_t		agino,
> > +	struct xfs_btree_cur	**curpp,
> > +	struct xfs_buf		**agi_bpp,
> > +	int			*has_more)
> > +{
> > +	struct xfs_mount	*mp = iwag->mp;
> > +	struct xfs_trans	*tp = iwag->tp;
> > +	int			icount;
> > +	int			error;
> > +
> > +	/* Set up a fresh cursor and empty the inobt cache. */
> > +	iwag->nr_recs = 0;
> > +	error = xfs_iwalk_inobt_cur(mp, tp, agno, curpp, agi_bpp);
> > +	if (error)
> > +		return error;
> > +
> > +	/* Starting at the beginning of the AG?  That's easy! */
> > +	if (agino == 0)
> > +		return xfs_inobt_lookup(*curpp, 0, XFS_LOOKUP_GE, has_more);
> > +
> > +	/*
> > +	 * Otherwise, we have to grab the inobt record where we left off, stuff
> > +	 * the record into our cache, and then see if there are more records.
> > +	 * We require a lookup cache of at least two elements so that we don't
> > +	 * have to deal with tearing down the cursor to walk the records.
> > +	 */
> > +	error = xfs_bulkstat_grab_ichunk(*curpp, agino - 1, &icount,
> > +			&iwag->recs[iwag->nr_recs]);
> > +	if (error)
> > +		return error;
> > +	if (icount)
> > +		iwag->nr_recs++;
> > +
> > +	ASSERT(iwag->nr_recs < iwag->sz_recs);
> 
> Why this code does what it does with nr_recs is a bit of a mystery
> to me...

sz_recs is the number of records we can store in the inobt record cache,
and nr_recs is the number of records that are actually cached.
Therefore, nr_recs should start at zero and increase until nr == sz at
which point we have to run_callbacks().

I'll add the following to the assert if that was a point of confusion:

	/*
	 * set_prefetch is supposed to give us a large enough inobt
	 * record cache that grab_ichunk can stage a partial first
	 * record and the loop body can cache a record without having to
	 * check for cache space until after it reads an inobt record.
	 */

> > +	return xfs_btree_increment(*curpp, 0, has_more);
> > +}
> > +
> > +typedef int (*xfs_iwalk_ag_recs_fn)(struct xfs_iwalk_ag *iwag);
> > +
> > +/*
> > + * Acknowledge that we added an inobt record to the cache.  Flush the inobt
> > + * record cache if the buffer is full, and position the cursor wherever it
> > + * needs to be so that we can keep going.
> > + */
> > +STATIC int
> > +xfs_iwalk_ag_increment(
> > +	struct xfs_iwalk_ag		*iwag,
> > +	xfs_iwalk_ag_recs_fn		walk_ag_recs_fn,
> > +	xfs_agnumber_t			agno,
> > +	struct xfs_btree_cur		**curpp,
> > +	struct xfs_buf			**agi_bpp,
> > +	int				*has_more)
> > +{
> > +	struct xfs_mount		*mp = iwag->mp;
> > +	struct xfs_trans		*tp = iwag->tp;
> > +	struct xfs_inobt_rec_incore	*irec;
> > +	xfs_agino_t			restart;
> > +	int				error;
> > +
> > +	iwag->nr_recs++;
> > +
> > +	/* If there's space, just increment and look for more records. */
> > +	if (iwag->nr_recs < iwag->sz_recs)
> > +		return xfs_btree_increment(*curpp, 0, has_more);
> 
> Incrementing before explaining why we're incrementing seems a bit
> fack-to-bront....
> 
> > +	/*
> > +	 * Otherwise the record cache is full; delete the cursor and walk the
> > +	 * records...
> > +	 */
> > +	xfs_iwalk_del_inobt(tp, curpp, agi_bpp, 0);
> > +	irec = &iwag->recs[iwag->nr_recs - 1];
> > +	restart = irec->ir_startino + XFS_INODES_PER_CHUNK - 1;
> > +
> > +	error = walk_ag_recs_fn(iwag);
> > +	if (error)
> > +		return error;
> 
> Urk, so an "increment" function actually run all the object callbacks?
> But only if it fails to increment?
> 
> > +
> > +	/* ...and recreate cursor where we left off. */
> > +	error = xfs_iwalk_inobt_cur(mp, tp, agno, curpp, agi_bpp);
> > +	if (error)
> > +		return error;
> > +
> > +	return xfs_inobt_lookup(*curpp, restart, XFS_LOOKUP_GE, has_more);
> 
> And then it goes an increments anyway?
> 
> That's all a bit .... non-obvious. Especially as it has a single
> caller - this should really be something like
> xfs_iwalk_run_callbacks(). Bit more context below...

(I'll just skip to the big code blob below...)

> > +}
> > +
> > +/* Walk all inodes in a single AG, from @iwag->startino to the end of the AG. */
> > +STATIC int
> > +xfs_iwalk_ag(
> > +	struct xfs_iwalk_ag		*iwag)
> > +{
> > +	struct xfs_mount		*mp = iwag->mp;
> > +	struct xfs_trans		*tp = iwag->tp;
> > +	struct xfs_buf			*agi_bp = NULL;
> > +	struct xfs_btree_cur		*cur = NULL;
> > +	xfs_agnumber_t			agno;
> > +	xfs_agino_t			agino;
> > +	int				has_more;
> > +	int				error = 0;
> > +
> > +	/* Set up our cursor at the right place in the inode btree. */
> > +	agno = XFS_INO_TO_AGNO(mp, iwag->startino);
> > +	agino = XFS_INO_TO_AGINO(mp, iwag->startino);
> > +	error = xfs_iwalk_ag_start(iwag, agno, agino, &cur, &agi_bp, &has_more);
> > +	if (error)
> > +		goto out_cur;
> > +
> > +	while (has_more) {
> > +		struct xfs_inobt_rec_incore	*irec;
> > +
> > +		/* Fetch the inobt record. */
> > +		irec = &iwag->recs[iwag->nr_recs];
> > +		error = xfs_inobt_get_rec(cur, irec, &has_more);
> > +		if (error)
> > +			goto out_cur;
> > +		if (!has_more)
> > +			break;
> > +
> > +		/* No allocated inodes in this chunk; skip it. */
> > +		if (irec->ir_freecount == irec->ir_count) {
> > +			error = xfs_btree_increment(cur, 0, &has_more);
> > +			goto next_loop;
> > +		}
> > +
> > +		/*
> > +		 * Start readahead for this inode chunk in anticipation of
> > +		 * walking the inodes.
> > +		 */
> > +		xfs_bulkstat_ichunk_ra(mp, agno, irec);
> > +
> > +		/*
> > +		 * Add this inobt record to our cache, flush the cache if
> > +		 * needed, and move on to the next record.
> > +		 */
> > +		error = xfs_iwalk_ag_increment(iwag, xfs_iwalk_ag_recs, agno,
> > +				&cur, &agi_bp, &has_more);
> 
> Ok, so given this loop already has an increment case in it, it seems
> like it would be better to pull some of this function into the loop
> somewhat like:
> 
> 	while (has_more) {
> 		struct xfs_inobt_rec_incore	*irec;
> 
> 		cond_resched();
> 
> 		/* Fetch the inobt record. */
> 		irec = &iwag->recs[iwag->nr_recs];
> 		error = xfs_inobt_get_rec(cur, irec, &has_more);
> 		if (error || !has_more)
> 			break;
> 
> 		/* No allocated inodes in this chunk; skip it. */
> 		if (irec->ir_freecount == irec->ir_count) {
> 			error = xfs_btree_increment(cur, 0, &has_more);
> 			if (error)
> 				break;
> 			continue;
> 		}
> 
> 		/*
> 		 * Start readahead for this inode chunk in anticipation of
> 		 * walking the inodes.
> 		 */
> 		xfs_bulkstat_ichunk_ra(mp, agno, irec);
> 
> 		/* If there's space in the buffer, just grab more records. */
> 		if (++iwag->nr_recs < iwag->sz_recs) {
> 			error = xfs_btree_increment(cur, 0, &has_more);
> 			if (error)
> 				break;
> 			continue;
> 		}
> 
> 		error = xfs_iwalk_run_callbacks(iwag, ...);
> 		if (error)
> 			break;
> 	}
> 
> 	xfs_iwalk_del_inobt(tp, &cur, &agi_bp, error);
> 	if (!iwag->nr_recs || error)
> 		return error;
> 	return xfs_iwalk_ag_recs(iwag);
> }

Yeah, that is cleaner. :)
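
For concreteness, here's a rough cut of the extracted helper, reusing
the teardown/walk/rebuild sequence from the increment function above.
Sketch only -- the exact parameter list and where nr_recs gets reset
are my assumptions about how the refactoring shakes out:

/*
 * Drain the cached inobt records through the walk function.  Tear down
 * the cursor and AGI buffer first so the callbacks can touch inodes
 * without deadlocking against the AGI, then rebuild the cursor just
 * past the last record we walked.
 */
STATIC int
xfs_iwalk_run_callbacks(
	struct xfs_iwalk_ag		*iwag,
	xfs_agnumber_t			agno,
	struct xfs_btree_cur		**curpp,
	struct xfs_buf			**agi_bpp,
	int				*has_more)
{
	struct xfs_inobt_rec_incore	*irec;
	xfs_agino_t			restart;
	int				error;

	/* Delete the cursor but remember the last record we cached... */
	xfs_iwalk_del_inobt(iwag->tp, curpp, agi_bpp, 0);
	irec = &iwag->recs[iwag->nr_recs - 1];
	restart = irec->ir_startino + XFS_INODES_PER_CHUNK - 1;

	/* ...run the callbacks on everything we've collected... */
	error = xfs_iwalk_ag_recs(iwag);
	if (error)
		return error;

	/* ...empty the cache (assumption: the reset lives here)... */
	iwag->nr_recs = 0;

	/* ...and recreate the cursor where we left off. */
	error = xfs_iwalk_inobt_cur(iwag->mp, iwag->tp, agno, curpp,
			agi_bpp);
	if (error)
		return error;

	return xfs_inobt_lookup(*curpp, restart, XFS_LOOKUP_GE, has_more);
}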

> > +	/* Walk any records left behind in the cache. */
> > +	if (iwag->nr_recs) {
> > +		xfs_iwalk_del_inobt(tp, &cur, &agi_bp, error);
> > +		return xfs_iwalk_ag_recs(iwag);
> > +	}
> > +
> > +out_cur:
> > +	xfs_iwalk_del_inobt(tp, &cur, &agi_bp, error);
> > +	return error;
> > +}
> > +
> > +/*
> > + * Given the number of inodes to prefetch, set the number of inobt records that
> > + * we cache in memory, which controls the number of inodes we try to read
> > + * ahead.
> > + *
> > + * If no max prefetch was given, default to one page's worth of inobt records;
> > + * this should be plenty of inodes to read ahead.
> 
> That's a lot of inodes on a 64k page size machine. I think it would
> be better capped at a number that doesn't change with processor
> architecture...

4096 / sizeof(...); then?

since that's a single x86 page, which means we're unlikely to fail the
memory allocation? :)
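
i.e. keep the explicit-prefetch branch as is and just swap the
PAGE_SIZE default for a fixed byte count -- sketch only:

	if (max_prefetch)
		iwag->sz_recs = round_up(max_prefetch, XFS_INODES_PER_CHUNK) /
					XFS_INODES_PER_CHUNK;
	else
		/* Fixed 4096 bytes of records, independent of PAGE_SIZE. */
		iwag->sz_recs = 4096 / sizeof(struct xfs_inobt_rec_incore);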

> > + */
> > +static inline void
> > +xfs_iwalk_set_prefetch(
> > +	struct xfs_iwalk_ag	*iwag,
> > +	unsigned int		max_prefetch)
> > +{
> > +	if (max_prefetch)
> > +		iwag->sz_recs = round_up(max_prefetch, XFS_INODES_PER_CHUNK) /
> > +					XFS_INODES_PER_CHUNK;
> > +	else
> > +		iwag->sz_recs = PAGE_SIZE / sizeof(struct xfs_inobt_rec_incore);
> > +
> > +	/*
> > +	 * Allocate enough space to prefetch at least two records so that we
> > +	 * can cache both the inobt record where the iwalk started and the next
> > +	 * record.  This simplifies the AG inode walk loop setup code.
> > +	 */
> > +	if (iwag->sz_recs < 2)
> > +		iwag->sz_recs = 2;
> 
> 	iwag->sz_recs = max(iwag->sz_recs, 2);
> 
> ....
> > +	xfs_iwalk_set_prefetch(&iwag, max_prefetch);
> > +	error = xfs_iwalk_allocbuf(&iwag);
> ....
> > +	xfs_iwalk_freebuf(&iwag);
> 
> I'd drop the "buf" from the names of those two functions...

<nod>
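
Something like this, then?  (Sketch of the rename only -- the
kmem_alloc/KM_MAYFAIL details are my assumptions about the helper
bodies, since they aren't quoted in this hunk.)

STATIC int
xfs_iwalk_alloc(
	struct xfs_iwalk_ag	*iwag)
{
	size_t			size;

	ASSERT(iwag->recs == NULL);
	iwag->nr_recs = 0;

	/* Allocate a prefetch buffer for inobt records. */
	size = iwag->sz_recs * sizeof(struct xfs_inobt_rec_incore);
	iwag->recs = kmem_alloc(size, KM_MAYFAIL);
	if (iwag->recs == NULL)
		return -ENOMEM;

	return 0;
}

STATIC void
xfs_iwalk_free(
	struct xfs_iwalk_ag	*iwag)
{
	kmem_free(iwag->recs);
	iwag->recs = NULL;
}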

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread

Thread overview: 19+ messages
2019-05-29 22:26 [PATCH 00/11] xfs: refactor and improve inode iteration Darrick J. Wong
2019-05-29 22:26 ` [PATCH 01/11] xfs: separate inode geometry Darrick J. Wong
2019-05-30  1:18   ` Dave Chinner
2019-05-30 22:33     ` Darrick J. Wong
2019-05-29 22:26 ` [PATCH 02/11] xfs: create simplified inode walk function Darrick J. Wong
2019-06-04  7:41   ` Dave Chinner
2019-06-04 16:39     ` Darrick J. Wong
2019-05-29 22:26 ` [PATCH 03/11] xfs: convert quotacheck to use the new iwalk functions Darrick J. Wong
2019-06-04  7:52   ` Dave Chinner
2019-05-29 22:26 ` [PATCH 04/11] xfs: bulkstat should copy lastip whenever userspace supplies one Darrick J. Wong
2019-06-04  7:54   ` Dave Chinner
2019-06-04 14:24     ` Darrick J. Wong
2019-05-29 22:26 ` [PATCH 05/11] xfs: convert bulkstat to new iwalk infrastructure Darrick J. Wong
2019-05-29 22:26 ` [PATCH 06/11] xfs: move bulkstat ichunk helpers to iwalk code Darrick J. Wong
2019-05-29 22:26 ` [PATCH 07/11] xfs: change xfs_iwalk_grab_ichunk to use startino, not lastino Darrick J. Wong
2019-05-29 22:27 ` [PATCH 08/11] xfs: clean up long conditionals in xfs_iwalk_ichunk_ra Darrick J. Wong
2019-05-29 22:27 ` [PATCH 09/11] xfs: multithreaded iwalk implementation Darrick J. Wong
2019-05-29 22:27 ` [PATCH 10/11] xfs: poll waiting for quotacheck Darrick J. Wong
2019-05-29 22:27 ` [PATCH 11/11] xfs: refactor INUMBERS to use iwalk functions Darrick J. Wong
