linux-xfs.vger.kernel.org archive mirror
* [PATCHSET RFC chandan 00/14] xfs: support dynamic btree cursor height
@ 2021-09-18  1:29 Darrick J. Wong
  2021-09-18  1:29 ` [PATCH 01/14] xfs: remove xfs_btree_cur_t typedef Darrick J. Wong
                   ` (13 more replies)
  0 siblings, 14 replies; 48+ messages in thread
From: Darrick J. Wong @ 2021-09-18  1:29 UTC (permalink / raw)
  To: djwong, chandan.babu, chandanrlinux; +Cc: linux-xfs

Hi all,

Chandan Babu pointed out that his large extent counters series depends
on the ability to have btree cursors of arbitrary heights, so I've
ported this series to 5.15-rc1 to keep his patchsets from depending on
djwong-dev for submission.

In this series, we rearrange the incore btree cursor so that we can
support btrees of any height.  This will become necessary for realtime
rmap and reflink since we'd like to handle tall trees without bloating
the AG btree cursors.
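
The core idea above -- sizing per-level cursor state to the actual tree
height rather than a compile-time maximum -- can be sketched in
userspace C.  This is purely illustrative; the names (struct
dyn_cursor, cursor_alloc) are invented for the sketch and are not the
kernel's, and calloc() stands in for the kernel allocator.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical per-level cursor state, loosely modeled on the series. */
struct cur_level {
	void	*bp;	/* buffer for the block at this level */
	int	ptr;	/* index within that block */
};

struct dyn_cursor {
	unsigned int		nlevels;	/* current tree height */
	struct cur_level	levels[];	/* one slot per level */
};

/* Allocate a cursor sized exactly for the tree height we need. */
static struct dyn_cursor *
cursor_alloc(unsigned int nlevels)
{
	struct dyn_cursor	*cur;

	cur = calloc(1, sizeof(*cur) + nlevels * sizeof(struct cur_level));
	if (!cur)
		return NULL;
	cur->nlevels = nlevels;
	return cur;
}
```

A short cursor for an AG btree and a tall one for a future realtime
btree then cost only as much memory as their heights require.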

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=btree-dynamic-depth-5.16
---
 fs/xfs/libxfs/xfs_ag_resv.c        |    4 -
 fs/xfs/libxfs/xfs_alloc.c          |   18 +-
 fs/xfs/libxfs/xfs_alloc_btree.c    |    7 -
 fs/xfs/libxfs/xfs_bmap.c           |   24 ++-
 fs/xfs/libxfs/xfs_bmap_btree.c     |    7 -
 fs/xfs/libxfs/xfs_btree.c          |  266 ++++++++++++++++++++++++------------
 fs/xfs/libxfs/xfs_btree.h          |   52 +++++--
 fs/xfs/libxfs/xfs_btree_staging.c  |    8 +
 fs/xfs/libxfs/xfs_ialloc_btree.c   |    7 -
 fs/xfs/libxfs/xfs_refcount_btree.c |    6 -
 fs/xfs/libxfs/xfs_rmap_btree.c     |   46 +++---
 fs/xfs/libxfs/xfs_rmap_btree.h     |    2 
 fs/xfs/libxfs/xfs_trans_resv.c     |   12 ++
 fs/xfs/libxfs/xfs_trans_space.h    |    7 +
 fs/xfs/scrub/agheader.c            |   13 +-
 fs/xfs/scrub/agheader_repair.c     |    8 +
 fs/xfs/scrub/bitmap.c              |   16 +-
 fs/xfs/scrub/bmap.c                |    2 
 fs/xfs/scrub/btree.c               |  118 ++++++++--------
 fs/xfs/scrub/btree.h               |   17 ++
 fs/xfs/scrub/dabtree.c             |   62 ++++----
 fs/xfs/scrub/repair.h              |    3 
 fs/xfs/scrub/scrub.c               |   60 ++++----
 fs/xfs/scrub/trace.c               |    7 +
 fs/xfs/scrub/trace.h               |   10 +
 fs/xfs/xfs_mount.c                 |    2 
 fs/xfs/xfs_super.c                 |   11 -
 fs/xfs/xfs_trace.h                 |    2 
 28 files changed, 466 insertions(+), 331 deletions(-)



* [PATCH 01/14] xfs: remove xfs_btree_cur_t typedef
  2021-09-18  1:29 [PATCHSET RFC chandan 00/14] xfs: support dynamic btree cursor height Darrick J. Wong
@ 2021-09-18  1:29 ` Darrick J. Wong
  2021-09-20  9:53   ` Chandan Babu R
  2021-09-21  8:36   ` Christoph Hellwig
  2021-09-18  1:29 ` [PATCH 02/14] xfs: don't allocate scrub contexts on the stack Darrick J. Wong
                   ` (12 subsequent siblings)
  13 siblings, 2 replies; 48+ messages in thread
From: Darrick J. Wong @ 2021-09-18  1:29 UTC (permalink / raw)
  To: djwong, chandan.babu, chandanrlinux; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_alloc.c |   12 ++++++------
 fs/xfs/libxfs/xfs_bmap.c  |   12 ++++++------
 fs/xfs/libxfs/xfs_btree.c |   12 ++++++------
 fs/xfs/libxfs/xfs_btree.h |   12 ++++++------
 4 files changed, 24 insertions(+), 24 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 95157f5a5a6c..35fb1dd3be95 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -426,8 +426,8 @@ xfs_alloc_fix_len(
  */
 STATIC int				/* error code */
 xfs_alloc_fixup_trees(
-	xfs_btree_cur_t	*cnt_cur,	/* cursor for by-size btree */
-	xfs_btree_cur_t	*bno_cur,	/* cursor for by-block btree */
+	struct xfs_btree_cur *cnt_cur,	/* cursor for by-size btree */
+	struct xfs_btree_cur *bno_cur,	/* cursor for by-block btree */
 	xfs_agblock_t	fbno,		/* starting block of free extent */
 	xfs_extlen_t	flen,		/* length of free extent */
 	xfs_agblock_t	rbno,		/* starting block of returned extent */
@@ -1200,8 +1200,8 @@ xfs_alloc_ag_vextent_exact(
 	xfs_alloc_arg_t	*args)	/* allocation argument structure */
 {
 	struct xfs_agf __maybe_unused *agf = args->agbp->b_addr;
-	xfs_btree_cur_t	*bno_cur;/* by block-number btree cursor */
-	xfs_btree_cur_t	*cnt_cur;/* by count btree cursor */
+	struct xfs_btree_cur *bno_cur;/* by block-number btree cursor */
+	struct xfs_btree_cur *cnt_cur;/* by count btree cursor */
 	int		error;
 	xfs_agblock_t	fbno;	/* start block of found extent */
 	xfs_extlen_t	flen;	/* length of found extent */
@@ -1658,8 +1658,8 @@ xfs_alloc_ag_vextent_size(
 	xfs_alloc_arg_t	*args)		/* allocation argument structure */
 {
 	struct xfs_agf	*agf = args->agbp->b_addr;
-	xfs_btree_cur_t	*bno_cur;	/* cursor for bno btree */
-	xfs_btree_cur_t	*cnt_cur;	/* cursor for cnt btree */
+	struct xfs_btree_cur *bno_cur;	/* cursor for bno btree */
+	struct xfs_btree_cur *cnt_cur;	/* cursor for cnt btree */
 	int		error;		/* error result */
 	xfs_agblock_t	fbno;		/* start of found freespace */
 	xfs_extlen_t	flen;		/* length of found freespace */
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index b48230f1a361..499c977cbf56 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -316,7 +316,7 @@ xfs_check_block(
  */
 STATIC void
 xfs_bmap_check_leaf_extents(
-	xfs_btree_cur_t		*cur,	/* btree cursor or null */
+	struct xfs_btree_cur	*cur,	/* btree cursor or null */
 	xfs_inode_t		*ip,		/* incore inode pointer */
 	int			whichfork)	/* data or attr fork */
 {
@@ -925,7 +925,7 @@ xfs_bmap_add_attrfork_btree(
 	int			*flags)		/* inode logging flags */
 {
 	struct xfs_btree_block	*block = ip->i_df.if_broot;
-	xfs_btree_cur_t		*cur;		/* btree cursor */
+	struct xfs_btree_cur	*cur;		/* btree cursor */
 	int			error;		/* error return value */
 	xfs_mount_t		*mp;		/* file system mount struct */
 	int			stat;		/* newroot status */
@@ -968,7 +968,7 @@ xfs_bmap_add_attrfork_extents(
 	struct xfs_inode	*ip,		/* incore inode pointer */
 	int			*flags)		/* inode logging flags */
 {
-	xfs_btree_cur_t		*cur;		/* bmap btree cursor */
+	struct xfs_btree_cur	*cur;		/* bmap btree cursor */
 	int			error;		/* error return value */
 
 	if (ip->i_df.if_nextents * sizeof(struct xfs_bmbt_rec) <=
@@ -1988,11 +1988,11 @@ xfs_bmap_add_extent_unwritten_real(
 	xfs_inode_t		*ip,	/* incore inode pointer */
 	int			whichfork,
 	struct xfs_iext_cursor	*icur,
-	xfs_btree_cur_t		**curp,	/* if *curp is null, not a btree */
+	struct xfs_btree_cur	**curp,	/* if *curp is null, not a btree */
 	xfs_bmbt_irec_t		*new,	/* new data to add to file extents */
 	int			*logflagsp) /* inode logging flags */
 {
-	xfs_btree_cur_t		*cur;	/* btree cursor */
+	struct xfs_btree_cur	*cur;	/* btree cursor */
 	int			error;	/* error return value */
 	int			i;	/* temp state */
 	struct xfs_ifork	*ifp;	/* inode fork pointer */
@@ -5045,7 +5045,7 @@ xfs_bmap_del_extent_real(
 	xfs_inode_t		*ip,	/* incore inode pointer */
 	xfs_trans_t		*tp,	/* current transaction pointer */
 	struct xfs_iext_cursor	*icur,
-	xfs_btree_cur_t		*cur,	/* if null, not a btree */
+	struct xfs_btree_cur	*cur,	/* if null, not a btree */
 	xfs_bmbt_irec_t		*del,	/* data to remove from extents */
 	int			*logflagsp, /* inode logging flags */
 	int			whichfork, /* data or attr fork */
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index 298395481713..b0cce0932f02 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -388,14 +388,14 @@ xfs_btree_del_cursor(
  */
 int					/* error */
 xfs_btree_dup_cursor(
-	xfs_btree_cur_t	*cur,		/* input cursor */
-	xfs_btree_cur_t	**ncur)		/* output cursor */
+	struct xfs_btree_cur *cur,		/* input cursor */
+	struct xfs_btree_cur **ncur)		/* output cursor */
 {
 	struct xfs_buf	*bp;		/* btree block's buffer pointer */
 	int		error;		/* error return value */
 	int		i;		/* level number of btree block */
 	xfs_mount_t	*mp;		/* mount structure for filesystem */
-	xfs_btree_cur_t	*new;		/* new cursor value */
+	struct xfs_btree_cur *new;		/* new cursor value */
 	xfs_trans_t	*tp;		/* transaction pointer, can be NULL */
 
 	tp = cur->bc_tp;
@@ -691,7 +691,7 @@ xfs_btree_get_block(
  */
 STATIC int				/* success=1, failure=0 */
 xfs_btree_firstrec(
-	xfs_btree_cur_t		*cur,	/* btree cursor */
+	struct xfs_btree_cur	*cur,	/* btree cursor */
 	int			level)	/* level to change */
 {
 	struct xfs_btree_block	*block;	/* generic btree block pointer */
@@ -721,7 +721,7 @@ xfs_btree_firstrec(
  */
 STATIC int				/* success=1, failure=0 */
 xfs_btree_lastrec(
-	xfs_btree_cur_t		*cur,	/* btree cursor */
+	struct xfs_btree_cur	*cur,	/* btree cursor */
 	int			level)	/* level to change */
 {
 	struct xfs_btree_block	*block;	/* generic btree block pointer */
@@ -985,7 +985,7 @@ xfs_btree_readahead_ptr(
  */
 STATIC void
 xfs_btree_setbuf(
-	xfs_btree_cur_t		*cur,	/* btree cursor */
+	struct xfs_btree_cur	*cur,	/* btree cursor */
 	int			lev,	/* level in btree */
 	struct xfs_buf		*bp)	/* new buffer to set */
 {
diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index 4eaf8517f850..513ade4a89f8 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -216,7 +216,7 @@ struct xfs_btree_cur_ino {
  * Btree cursor structure.
  * This collects all information needed by the btree code in one place.
  */
-typedef struct xfs_btree_cur
+struct xfs_btree_cur
 {
 	struct xfs_trans	*bc_tp;	/* transaction we're in, if any */
 	struct xfs_mount	*bc_mp;	/* file system mount struct */
@@ -243,7 +243,7 @@ typedef struct xfs_btree_cur
 		struct xfs_btree_cur_ag	bc_ag;
 		struct xfs_btree_cur_ino bc_ino;
 	};
-} xfs_btree_cur_t;
+};
 
 /* cursor flags */
 #define XFS_BTREE_LONG_PTRS		(1<<0)	/* pointers are 64bits long */
@@ -309,7 +309,7 @@ xfs_btree_check_sptr(
  */
 void
 xfs_btree_del_cursor(
-	xfs_btree_cur_t		*cur,	/* btree cursor */
+	struct xfs_btree_cur	*cur,	/* btree cursor */
 	int			error);	/* del because of error */
 
 /*
@@ -318,8 +318,8 @@ xfs_btree_del_cursor(
  */
 int					/* error */
 xfs_btree_dup_cursor(
-	xfs_btree_cur_t		*cur,	/* input cursor */
-	xfs_btree_cur_t		**ncur);/* output cursor */
+	struct xfs_btree_cur		*cur,	/* input cursor */
+	struct xfs_btree_cur		**ncur);/* output cursor */
 
 /*
  * Compute first and last byte offsets for the fields given.
@@ -527,7 +527,7 @@ struct xfs_ifork *xfs_btree_ifork_ptr(struct xfs_btree_cur *cur);
 /* Does this cursor point to the last block in the given level? */
 static inline bool
 xfs_btree_islastblock(
-	xfs_btree_cur_t		*cur,
+	struct xfs_btree_cur	*cur,
 	int			level)
 {
 	struct xfs_btree_block	*block;



* [PATCH 02/14] xfs: don't allocate scrub contexts on the stack
  2021-09-18  1:29 [PATCHSET RFC chandan 00/14] xfs: support dynamic btree cursor height Darrick J. Wong
  2021-09-18  1:29 ` [PATCH 01/14] xfs: remove xfs_btree_cur_t typedef Darrick J. Wong
@ 2021-09-18  1:29 ` Darrick J. Wong
  2021-09-20  9:53   ` Chandan Babu R
  2021-09-21  8:39   ` Christoph Hellwig
  2021-09-18  1:29 ` [PATCH 03/14] xfs: dynamically allocate btree scrub context structure Darrick J. Wong
                   ` (11 subsequent siblings)
  13 siblings, 2 replies; 48+ messages in thread
From: Darrick J. Wong @ 2021-09-18  1:29 UTC (permalink / raw)
  To: djwong, chandan.babu, chandanrlinux; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Convert the on-stack scrub context, btree scrub context, and da btree
scrub context into heap allocations so that we reduce stack usage and
gain the ability to handle tall btrees without issue.

Specifically, this saves us ~208 bytes for the dabtree scrub, ~464 bytes
for the btree scrub, and ~200 bytes for the main scrub context.
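
The stack-to-heap conversion follows a simple pattern, sketched below
in userspace C.  The struct and function names are invented for
illustration, and calloc()/free() stand in for kmem_zalloc()/kmem_free();
the one behavioral change is that allocation can now fail, so the
caller must handle -ENOMEM.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Illustrative context struct; the real scrub contexts are larger. */
struct scrub_ctx {
	int	flags;
	char	big_buffer[512];	/* the part that bloats the stack */
};

/*
 * Heap-allocating pattern: allocate, check for failure, initialize
 * fields by hand.  (The on-stack version could use a designated
 * initializer instead, but paid for big_buffer in stack space.)
 */
static int
run_scrub(int flags)
{
	struct scrub_ctx	*ctx;
	int			error = 0;

	ctx = calloc(1, sizeof(*ctx));	/* stands in for kmem_zalloc() */
	if (!ctx)
		return -ENOMEM;
	ctx->flags = flags;

	/* ... do the scrub work with ctx ... */

	free(ctx);			/* stands in for kmem_free() */
	return error;
}
```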

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/btree.c   |   54 ++++++++++++++++++++++++------------------
 fs/xfs/scrub/btree.h   |    1 +
 fs/xfs/scrub/dabtree.c |   62 ++++++++++++++++++++++++++----------------------
 fs/xfs/scrub/scrub.c   |   60 ++++++++++++++++++++++++++--------------------
 4 files changed, 98 insertions(+), 79 deletions(-)


diff --git a/fs/xfs/scrub/btree.c b/fs/xfs/scrub/btree.c
index eccb855dc904..26dcb4691e31 100644
--- a/fs/xfs/scrub/btree.c
+++ b/fs/xfs/scrub/btree.c
@@ -627,15 +627,8 @@ xchk_btree(
 	const struct xfs_owner_info	*oinfo,
 	void				*private)
 {
-	struct xchk_btree		bs = {
-		.cur			= cur,
-		.scrub_rec		= scrub_fn,
-		.oinfo			= oinfo,
-		.firstrec		= true,
-		.private		= private,
-		.sc			= sc,
-	};
 	union xfs_btree_ptr		ptr;
+	struct xchk_btree		*bs;
 	union xfs_btree_ptr		*pp;
 	union xfs_btree_rec		*recp;
 	struct xfs_btree_block		*block;
@@ -646,10 +639,24 @@ xchk_btree(
 	int				i;
 	int				error = 0;
 
+	/*
+	 * Allocate the btree scrub context from the heap, because this
+	 * structure can get rather large.
+	 */
+	bs = kmem_zalloc(sizeof(struct xchk_btree), KM_NOFS | KM_MAYFAIL);
+	if (!bs)
+		return -ENOMEM;
+	bs->cur = cur;
+	bs->scrub_rec = scrub_fn;
+	bs->oinfo = oinfo;
+	bs->firstrec = true;
+	bs->private = private;
+	bs->sc = sc;
+
 	/* Initialize scrub state */
 	for (i = 0; i < XFS_BTREE_MAXLEVELS; i++)
-		bs.firstkey[i] = true;
-	INIT_LIST_HEAD(&bs.to_check);
+		bs->firstkey[i] = true;
+	INIT_LIST_HEAD(&bs->to_check);
 
 	/* Don't try to check a tree with a height we can't handle. */
 	if (cur->bc_nlevels > XFS_BTREE_MAXLEVELS) {
@@ -663,9 +670,9 @@ xchk_btree(
 	 */
 	level = cur->bc_nlevels - 1;
 	cur->bc_ops->init_ptr_from_cur(cur, &ptr);
-	if (!xchk_btree_ptr_ok(&bs, cur->bc_nlevels, &ptr))
+	if (!xchk_btree_ptr_ok(bs, cur->bc_nlevels, &ptr))
 		goto out;
-	error = xchk_btree_get_block(&bs, level, &ptr, &block, &bp);
+	error = xchk_btree_get_block(bs, level, &ptr, &block, &bp);
 	if (error || !block)
 		goto out;
 
@@ -678,7 +685,7 @@ xchk_btree(
 			/* End of leaf, pop back towards the root. */
 			if (cur->bc_ptrs[level] >
 			    be16_to_cpu(block->bb_numrecs)) {
-				xchk_btree_block_keys(&bs, level, block);
+				xchk_btree_block_keys(bs, level, block);
 				if (level < cur->bc_nlevels - 1)
 					cur->bc_ptrs[level + 1]++;
 				level++;
@@ -686,11 +693,11 @@ xchk_btree(
 			}
 
 			/* Records in order for scrub? */
-			xchk_btree_rec(&bs);
+			xchk_btree_rec(bs);
 
 			/* Call out to the record checker. */
 			recp = xfs_btree_rec_addr(cur, cur->bc_ptrs[0], block);
-			error = bs.scrub_rec(&bs, recp);
+			error = bs->scrub_rec(bs, recp);
 			if (error)
 				break;
 			if (xchk_should_terminate(sc, &error) ||
@@ -703,7 +710,7 @@ xchk_btree(
 
 		/* End of node, pop back towards the root. */
 		if (cur->bc_ptrs[level] > be16_to_cpu(block->bb_numrecs)) {
-			xchk_btree_block_keys(&bs, level, block);
+			xchk_btree_block_keys(bs, level, block);
 			if (level < cur->bc_nlevels - 1)
 				cur->bc_ptrs[level + 1]++;
 			level++;
@@ -711,16 +718,16 @@ xchk_btree(
 		}
 
 		/* Keys in order for scrub? */
-		xchk_btree_key(&bs, level);
+		xchk_btree_key(bs, level);
 
 		/* Drill another level deeper. */
 		pp = xfs_btree_ptr_addr(cur, cur->bc_ptrs[level], block);
-		if (!xchk_btree_ptr_ok(&bs, level, pp)) {
+		if (!xchk_btree_ptr_ok(bs, level, pp)) {
 			cur->bc_ptrs[level]++;
 			continue;
 		}
 		level--;
-		error = xchk_btree_get_block(&bs, level, pp, &block, &bp);
+		error = xchk_btree_get_block(bs, level, pp, &block, &bp);
 		if (error || !block)
 			goto out;
 
@@ -729,13 +736,14 @@ xchk_btree(
 
 out:
 	/* Process deferred owner checks on btree blocks. */
-	list_for_each_entry_safe(co, n, &bs.to_check, list) {
-		if (!error && bs.cur)
-			error = xchk_btree_check_block_owner(&bs,
-					co->level, co->daddr);
+	list_for_each_entry_safe(co, n, &bs->to_check, list) {
+		if (!error && bs->cur)
+			error = xchk_btree_check_block_owner(bs, co->level,
+					co->daddr);
 		list_del(&co->list);
 		kmem_free(co);
 	}
+	kmem_free(bs);
 
 	return error;
 }
diff --git a/fs/xfs/scrub/btree.h b/fs/xfs/scrub/btree.h
index b7d2fc01fbf9..d5c0b0cbc505 100644
--- a/fs/xfs/scrub/btree.h
+++ b/fs/xfs/scrub/btree.h
@@ -44,6 +44,7 @@ struct xchk_btree {
 	bool				firstkey[XFS_BTREE_MAXLEVELS];
 	struct list_head		to_check;
 };
+
 int xchk_btree(struct xfs_scrub *sc, struct xfs_btree_cur *cur,
 		xchk_btree_rec_fn scrub_fn, const struct xfs_owner_info *oinfo,
 		void *private);
diff --git a/fs/xfs/scrub/dabtree.c b/fs/xfs/scrub/dabtree.c
index 8a52514bc1ff..b962cfbbd92b 100644
--- a/fs/xfs/scrub/dabtree.c
+++ b/fs/xfs/scrub/dabtree.c
@@ -473,7 +473,7 @@ xchk_da_btree(
 	xchk_da_btree_rec_fn		scrub_fn,
 	void				*private)
 {
-	struct xchk_da_btree		ds = {};
+	struct xchk_da_btree		*ds;
 	struct xfs_mount		*mp = sc->mp;
 	struct xfs_da_state_blk		*blks;
 	struct xfs_da_node_entry	*key;
@@ -486,32 +486,35 @@ xchk_da_btree(
 		return 0;
 
 	/* Set up initial da state. */
-	ds.dargs.dp = sc->ip;
-	ds.dargs.whichfork = whichfork;
-	ds.dargs.trans = sc->tp;
-	ds.dargs.op_flags = XFS_DA_OP_OKNOENT;
-	ds.state = xfs_da_state_alloc(&ds.dargs);
-	ds.sc = sc;
-	ds.private = private;
+	ds = kmem_zalloc(sizeof(struct xchk_da_btree), KM_NOFS | KM_MAYFAIL);
+	if (!ds)
+		return -ENOMEM;
+	ds->dargs.dp = sc->ip;
+	ds->dargs.whichfork = whichfork;
+	ds->dargs.trans = sc->tp;
+	ds->dargs.op_flags = XFS_DA_OP_OKNOENT;
+	ds->state = xfs_da_state_alloc(&ds->dargs);
+	ds->sc = sc;
+	ds->private = private;
 	if (whichfork == XFS_ATTR_FORK) {
-		ds.dargs.geo = mp->m_attr_geo;
-		ds.lowest = 0;
-		ds.highest = 0;
+		ds->dargs.geo = mp->m_attr_geo;
+		ds->lowest = 0;
+		ds->highest = 0;
 	} else {
-		ds.dargs.geo = mp->m_dir_geo;
-		ds.lowest = ds.dargs.geo->leafblk;
-		ds.highest = ds.dargs.geo->freeblk;
+		ds->dargs.geo = mp->m_dir_geo;
+		ds->lowest = ds->dargs.geo->leafblk;
+		ds->highest = ds->dargs.geo->freeblk;
 	}
-	blkno = ds.lowest;
+	blkno = ds->lowest;
 	level = 0;
 
 	/* Find the root of the da tree, if present. */
-	blks = ds.state->path.blk;
-	error = xchk_da_btree_block(&ds, level, blkno);
+	blks = ds->state->path.blk;
+	error = xchk_da_btree_block(ds, level, blkno);
 	if (error)
 		goto out_state;
 	/*
-	 * We didn't find a block at ds.lowest, which means that there's
+	 * We didn't find a block at ds->lowest, which means that there's
 	 * no LEAF1/LEAFN tree (at least not where it's supposed to be),
 	 * so jump out now.
 	 */
@@ -523,16 +526,16 @@ xchk_da_btree(
 		/* Handle leaf block. */
 		if (blks[level].magic != XFS_DA_NODE_MAGIC) {
 			/* End of leaf, pop back towards the root. */
-			if (blks[level].index >= ds.maxrecs[level]) {
+			if (blks[level].index >= ds->maxrecs[level]) {
 				if (level > 0)
 					blks[level - 1].index++;
-				ds.tree_level++;
+				ds->tree_level++;
 				level--;
 				continue;
 			}
 
 			/* Dispatch record scrubbing. */
-			error = scrub_fn(&ds, level);
+			error = scrub_fn(ds, level);
 			if (error)
 				break;
 			if (xchk_should_terminate(sc, &error) ||
@@ -545,17 +548,17 @@ xchk_da_btree(
 
 
 		/* End of node, pop back towards the root. */
-		if (blks[level].index >= ds.maxrecs[level]) {
+		if (blks[level].index >= ds->maxrecs[level]) {
 			if (level > 0)
 				blks[level - 1].index++;
-			ds.tree_level++;
+			ds->tree_level++;
 			level--;
 			continue;
 		}
 
 		/* Hashes in order for scrub? */
-		key = xchk_da_btree_node_entry(&ds, level);
-		error = xchk_da_btree_hash(&ds, level, &key->hashval);
+		key = xchk_da_btree_node_entry(ds, level);
+		error = xchk_da_btree_hash(ds, level, &key->hashval);
 		if (error)
 			goto out;
 
@@ -564,11 +567,11 @@ xchk_da_btree(
 		level++;
 		if (level >= XFS_DA_NODE_MAXDEPTH) {
 			/* Too deep! */
-			xchk_da_set_corrupt(&ds, level - 1);
+			xchk_da_set_corrupt(ds, level - 1);
 			break;
 		}
-		ds.tree_level--;
-		error = xchk_da_btree_block(&ds, level, blkno);
+		ds->tree_level--;
+		error = xchk_da_btree_block(ds, level, blkno);
 		if (error)
 			goto out;
 		if (blks[level].bp == NULL)
@@ -587,6 +590,7 @@ xchk_da_btree(
 	}
 
 out_state:
-	xfs_da_state_free(ds.state);
+	xfs_da_state_free(ds->state);
+	kmem_free(ds);
 	return error;
 }
diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
index 51e4c61916d2..0569b15526ea 100644
--- a/fs/xfs/scrub/scrub.c
+++ b/fs/xfs/scrub/scrub.c
@@ -461,15 +461,10 @@ xfs_scrub_metadata(
 	struct file			*file,
 	struct xfs_scrub_metadata	*sm)
 {
-	struct xfs_scrub		sc = {
-		.file			= file,
-		.sm			= sm,
-	};
+	struct xfs_scrub		*sc;
 	struct xfs_mount		*mp = XFS_I(file_inode(file))->i_mount;
 	int				error = 0;
 
-	sc.mp = mp;
-
 	BUILD_BUG_ON(sizeof(meta_scrub_ops) !=
 		(sizeof(struct xchk_meta_ops) * XFS_SCRUB_TYPE_NR));
 
@@ -489,59 +484,68 @@ xfs_scrub_metadata(
 
 	xchk_experimental_warning(mp);
 
-	sc.ops = &meta_scrub_ops[sm->sm_type];
-	sc.sick_mask = xchk_health_mask_for_scrub_type(sm->sm_type);
+	sc = kmem_zalloc(sizeof(struct xfs_scrub), KM_NOFS | KM_MAYFAIL);
+	if (!sc) {
+		error = -ENOMEM;
+		goto out;
+	}
+
+	sc->mp = mp;
+	sc->file = file;
+	sc->sm = sm;
+	sc->ops = &meta_scrub_ops[sm->sm_type];
+	sc->sick_mask = xchk_health_mask_for_scrub_type(sm->sm_type);
 retry_op:
 	/*
 	 * When repairs are allowed, prevent freezing or readonly remount while
 	 * scrub is running with a real transaction.
 	 */
 	if (sm->sm_flags & XFS_SCRUB_IFLAG_REPAIR) {
-		error = mnt_want_write_file(sc.file);
+		error = mnt_want_write_file(sc->file);
 		if (error)
 			goto out;
 	}
 
 	/* Set up for the operation. */
-	error = sc.ops->setup(&sc);
+	error = sc->ops->setup(sc);
 	if (error)
 		goto out_teardown;
 
 	/* Scrub for errors. */
-	error = sc.ops->scrub(&sc);
-	if (!(sc.flags & XCHK_TRY_HARDER) && error == -EDEADLOCK) {
+	error = sc->ops->scrub(sc);
+	if (!(sc->flags & XCHK_TRY_HARDER) && error == -EDEADLOCK) {
 		/*
 		 * Scrubbers return -EDEADLOCK to mean 'try harder'.
 		 * Tear down everything we hold, then set up again with
 		 * preparation for worst-case scenarios.
 		 */
-		error = xchk_teardown(&sc, 0);
+		error = xchk_teardown(sc, 0);
 		if (error)
 			goto out;
-		sc.flags |= XCHK_TRY_HARDER;
+		sc->flags |= XCHK_TRY_HARDER;
 		goto retry_op;
 	} else if (error || (sm->sm_flags & XFS_SCRUB_OFLAG_INCOMPLETE))
 		goto out_teardown;
 
-	xchk_update_health(&sc);
+	xchk_update_health(sc);
 
-	if ((sc.sm->sm_flags & XFS_SCRUB_IFLAG_REPAIR) &&
-	    !(sc.flags & XREP_ALREADY_FIXED)) {
+	if ((sc->sm->sm_flags & XFS_SCRUB_IFLAG_REPAIR) &&
+	    !(sc->flags & XREP_ALREADY_FIXED)) {
 		bool needs_fix;
 
 		/* Let debug users force us into the repair routines. */
 		if (XFS_TEST_ERROR(false, mp, XFS_ERRTAG_FORCE_SCRUB_REPAIR))
-			sc.sm->sm_flags |= XFS_SCRUB_OFLAG_CORRUPT;
+			sc->sm->sm_flags |= XFS_SCRUB_OFLAG_CORRUPT;
 
-		needs_fix = (sc.sm->sm_flags & (XFS_SCRUB_OFLAG_CORRUPT |
-						XFS_SCRUB_OFLAG_XCORRUPT |
-						XFS_SCRUB_OFLAG_PREEN));
+		needs_fix = (sc->sm->sm_flags & (XFS_SCRUB_OFLAG_CORRUPT |
+						 XFS_SCRUB_OFLAG_XCORRUPT |
+						 XFS_SCRUB_OFLAG_PREEN));
 		/*
 		 * If userspace asked for a repair but it wasn't necessary,
 		 * report that back to userspace.
 		 */
 		if (!needs_fix) {
-			sc.sm->sm_flags |= XFS_SCRUB_OFLAG_NO_REPAIR_NEEDED;
+			sc->sm->sm_flags |= XFS_SCRUB_OFLAG_NO_REPAIR_NEEDED;
 			goto out_nofix;
 		}
 
@@ -549,26 +553,28 @@ xfs_scrub_metadata(
 		 * If it's broken, userspace wants us to fix it, and we haven't
 		 * already tried to fix it, then attempt a repair.
 		 */
-		error = xrep_attempt(&sc);
+		error = xrep_attempt(sc);
 		if (error == -EAGAIN) {
 			/*
 			 * Either the repair function succeeded or it couldn't
 			 * get all the resources it needs; either way, we go
 			 * back to the beginning and call the scrub function.
 			 */
-			error = xchk_teardown(&sc, 0);
+			error = xchk_teardown(sc, 0);
 			if (error) {
 				xrep_failure(mp);
-				goto out;
+				goto out_sc;
 			}
 			goto retry_op;
 		}
 	}
 
 out_nofix:
-	xchk_postmortem(&sc);
+	xchk_postmortem(sc);
 out_teardown:
-	error = xchk_teardown(&sc, error);
+	error = xchk_teardown(sc, error);
+out_sc:
+	kmem_free(sc);
 out:
 	trace_xchk_done(XFS_I(file_inode(file)), sm, error);
 	if (error == -EFSCORRUPTED || error == -EFSBADCRC) {



* [PATCH 03/14] xfs: dynamically allocate btree scrub context structure
  2021-09-18  1:29 [PATCHSET RFC chandan 00/14] xfs: support dynamic btree cursor height Darrick J. Wong
  2021-09-18  1:29 ` [PATCH 01/14] xfs: remove xfs_btree_cur_t typedef Darrick J. Wong
  2021-09-18  1:29 ` [PATCH 02/14] xfs: don't allocate scrub contexts on the stack Darrick J. Wong
@ 2021-09-18  1:29 ` Darrick J. Wong
  2021-09-20  9:53   ` Chandan Babu R
  2021-09-21  8:43   ` Christoph Hellwig
  2021-09-18  1:29 ` [PATCH 04/14] xfs: stricter btree height checking when looking for errors Darrick J. Wong
                   ` (10 subsequent siblings)
  13 siblings, 2 replies; 48+ messages in thread
From: Darrick J. Wong @ 2021-09-18  1:29 UTC (permalink / raw)
  To: djwong, chandan.babu, chandanrlinux; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Reorganize struct xchk_btree so that we can dynamically size the context
structure to fit the type of btree cursor that we have.  This will
enable us to use memory more efficiently once we start adding very tall
btree types.
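
The reorganized structure uses a C99 flexible array member so that the
allocation size tracks the cursor height, as the patch's
xchk_btree_sizeof() helper does.  Here is a minimal userspace sketch of
that sizing arithmetic; the struct names and the 64-byte key size are
stand-ins, not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Mirrors the shape of the patch's xchk_btree_levels/xchk_btree. */
struct level_state {
	char	lastkey[64];	/* placeholder for union xfs_btree_key */
	bool	has_lastkey;
};

struct btree_scrub {
	void			*cur;
	struct level_state	levels[];	/* sized per tree height */
};

/* Same arithmetic as the patch's xchk_btree_sizeof() helper. */
static size_t
btree_scrub_sizeof(unsigned int levels)
{
	return sizeof(struct btree_scrub) +
			levels * sizeof(struct level_state);
}
```

A scrubber for a two-level btree then allocates two level slots instead
of XFS_BTREE_MAXLEVELS of them.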

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/btree.c |   38 +++++++++++++++++---------------------
 fs/xfs/scrub/btree.h |   16 +++++++++++++---
 2 files changed, 30 insertions(+), 24 deletions(-)


diff --git a/fs/xfs/scrub/btree.c b/fs/xfs/scrub/btree.c
index 26dcb4691e31..7b7762ae22e5 100644
--- a/fs/xfs/scrub/btree.c
+++ b/fs/xfs/scrub/btree.c
@@ -141,9 +141,10 @@ xchk_btree_rec(
 	trace_xchk_btree_rec(bs->sc, cur, 0);
 
 	/* If this isn't the first record, are they in order? */
-	if (!bs->firstrec && !cur->bc_ops->recs_inorder(cur, &bs->lastrec, rec))
+	if (bs->levels[0].has_lastkey &&
+	    !cur->bc_ops->recs_inorder(cur, &bs->lastrec, rec))
 		xchk_btree_set_corrupt(bs->sc, cur, 0);
-	bs->firstrec = false;
+	bs->levels[0].has_lastkey = true;
 	memcpy(&bs->lastrec, rec, cur->bc_ops->rec_len);
 
 	if (cur->bc_nlevels == 1)
@@ -188,11 +189,11 @@ xchk_btree_key(
 	trace_xchk_btree_key(bs->sc, cur, level);
 
 	/* If this isn't the first key, are they in order? */
-	if (!bs->firstkey[level] &&
-	    !cur->bc_ops->keys_inorder(cur, &bs->lastkey[level], key))
+	if (bs->levels[level].has_lastkey &&
+	    !cur->bc_ops->keys_inorder(cur, &bs->levels[level].lastkey, key))
 		xchk_btree_set_corrupt(bs->sc, cur, level);
-	bs->firstkey[level] = false;
-	memcpy(&bs->lastkey[level], key, cur->bc_ops->key_len);
+	bs->levels[level].has_lastkey = true;
+	memcpy(&bs->levels[level].lastkey, key, cur->bc_ops->key_len);
 
 	if (level + 1 >= cur->bc_nlevels)
 		return;
@@ -632,38 +633,33 @@ xchk_btree(
 	union xfs_btree_ptr		*pp;
 	union xfs_btree_rec		*recp;
 	struct xfs_btree_block		*block;
-	int				level;
 	struct xfs_buf			*bp;
 	struct check_owner		*co;
 	struct check_owner		*n;
-	int				i;
+	size_t				cur_sz;
+	int				level;
 	int				error = 0;
 
 	/*
 	 * Allocate the btree scrub context from the heap, because this
-	 * structure can get rather large.
+	 * structure can get rather large.  Don't let a caller feed us a
+	 * totally absurd size.
 	 */
-	bs = kmem_zalloc(sizeof(struct xchk_btree), KM_NOFS | KM_MAYFAIL);
+	cur_sz = xchk_btree_sizeof(cur->bc_nlevels);
+	if (cur_sz > PAGE_SIZE) {
+		xchk_btree_set_corrupt(sc, cur, 0);
+		return 0;
+	}
+	bs = kmem_zalloc(cur_sz, KM_NOFS | KM_MAYFAIL);
 	if (!bs)
 		return -ENOMEM;
 	bs->cur = cur;
 	bs->scrub_rec = scrub_fn;
 	bs->oinfo = oinfo;
-	bs->firstrec = true;
 	bs->private = private;
 	bs->sc = sc;
-
-	/* Initialize scrub state */
-	for (i = 0; i < XFS_BTREE_MAXLEVELS; i++)
-		bs->firstkey[i] = true;
 	INIT_LIST_HEAD(&bs->to_check);
 
-	/* Don't try to check a tree with a height we can't handle. */
-	if (cur->bc_nlevels > XFS_BTREE_MAXLEVELS) {
-		xchk_btree_set_corrupt(sc, cur, 0);
-		goto out;
-	}
-
 	/*
 	 * Load the root of the btree.  The helper function absorbs
 	 * error codes for us.
diff --git a/fs/xfs/scrub/btree.h b/fs/xfs/scrub/btree.h
index d5c0b0cbc505..7f8c54d8020e 100644
--- a/fs/xfs/scrub/btree.h
+++ b/fs/xfs/scrub/btree.h
@@ -29,6 +29,11 @@ typedef int (*xchk_btree_rec_fn)(
 	struct xchk_btree		*bs,
 	const union xfs_btree_rec	*rec);
 
+struct xchk_btree_levels {
+	union xfs_btree_key		lastkey;
+	bool				has_lastkey;
+};
+
 struct xchk_btree {
 	/* caller-provided scrub state */
 	struct xfs_scrub		*sc;
@@ -39,12 +44,17 @@ struct xchk_btree {
 
 	/* internal scrub state */
 	union xfs_btree_rec		lastrec;
-	bool				firstrec;
-	union xfs_btree_key		lastkey[XFS_BTREE_MAXLEVELS];
-	bool				firstkey[XFS_BTREE_MAXLEVELS];
 	struct list_head		to_check;
+	struct xchk_btree_levels	levels[];
 };
 
+static inline size_t
+xchk_btree_sizeof(unsigned int levels)
+{
+	return sizeof(struct xchk_btree) +
+				(levels * sizeof(struct xchk_btree_levels));
+}
+
 int xchk_btree(struct xfs_scrub *sc, struct xfs_btree_cur *cur,
 		xchk_btree_rec_fn scrub_fn, const struct xfs_owner_info *oinfo,
 		void *private);



* [PATCH 04/14] xfs: stricter btree height checking when looking for errors
  2021-09-18  1:29 [PATCHSET RFC chandan 00/14] xfs: support dynamic btree cursor height Darrick J. Wong
                   ` (2 preceding siblings ...)
  2021-09-18  1:29 ` [PATCH 03/14] xfs: dynamically allocate btree scrub context structure Darrick J. Wong
@ 2021-09-18  1:29 ` Darrick J. Wong
  2021-09-20  9:54   ` Chandan Babu R
  2021-09-18  1:29 ` [PATCH 05/14] xfs: stricter btree height checking when scanning for btree roots Darrick J. Wong
                   ` (9 subsequent siblings)
  13 siblings, 1 reply; 48+ messages in thread
From: Darrick J. Wong @ 2021-09-18  1:29 UTC (permalink / raw)
  To: djwong, chandan.babu, chandanrlinux; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Now that each btree type has its own precomputed maxlevels variable,
use those instead of the generic XFS_BTREE_MAXLEVELS to check the
level of each per-AG btree.
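
The check being tightened is a simple range test on a level field read
from disk: it must lie in [1, maxlevels] for that specific btree type.
A hedged userspace sketch, with an invented function name:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * An on-disk btree level is sane only if it is positive and does not
 * exceed the precomputed maximum for that btree type (e.g.
 * mp->m_ag_maxlevels for the bnobt/cntbt in the patch).
 */
static bool
btree_level_ok(int level, int maxlevels)
{
	return level > 0 && level <= maxlevels;
}
```

Using the per-type maximum rather than XFS_BTREE_MAXLEVELS rejects
heights that are representable in general but impossible for that
particular tree.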

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/agheader.c |   13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)


diff --git a/fs/xfs/scrub/agheader.c b/fs/xfs/scrub/agheader.c
index ae3c9f6e2c69..a2c3af77b6c2 100644
--- a/fs/xfs/scrub/agheader.c
+++ b/fs/xfs/scrub/agheader.c
@@ -555,11 +555,11 @@ xchk_agf(
 		xchk_block_set_corrupt(sc, sc->sa.agf_bp);
 
 	level = be32_to_cpu(agf->agf_levels[XFS_BTNUM_BNO]);
-	if (level <= 0 || level > XFS_BTREE_MAXLEVELS)
+	if (level <= 0 || level > mp->m_ag_maxlevels)
 		xchk_block_set_corrupt(sc, sc->sa.agf_bp);
 
 	level = be32_to_cpu(agf->agf_levels[XFS_BTNUM_CNT]);
-	if (level <= 0 || level > XFS_BTREE_MAXLEVELS)
+	if (level <= 0 || level > mp->m_ag_maxlevels)
 		xchk_block_set_corrupt(sc, sc->sa.agf_bp);
 
 	if (xfs_has_rmapbt(mp)) {
@@ -568,7 +568,7 @@ xchk_agf(
 			xchk_block_set_corrupt(sc, sc->sa.agf_bp);
 
 		level = be32_to_cpu(agf->agf_levels[XFS_BTNUM_RMAP]);
-		if (level <= 0 || level > XFS_BTREE_MAXLEVELS)
+		if (level <= 0 || level > mp->m_rmap_maxlevels)
 			xchk_block_set_corrupt(sc, sc->sa.agf_bp);
 	}
 
@@ -578,7 +578,7 @@ xchk_agf(
 			xchk_block_set_corrupt(sc, sc->sa.agf_bp);
 
 		level = be32_to_cpu(agf->agf_refcount_level);
-		if (level <= 0 || level > XFS_BTREE_MAXLEVELS)
+		if (level <= 0 || level > mp->m_refc_maxlevels)
 			xchk_block_set_corrupt(sc, sc->sa.agf_bp);
 	}
 
@@ -850,6 +850,7 @@ xchk_agi(
 	struct xfs_mount	*mp = sc->mp;
 	struct xfs_agi		*agi;
 	struct xfs_perag	*pag;
+	struct xfs_ino_geometry	*igeo = M_IGEO(sc->mp);
 	xfs_agnumber_t		agno = sc->sm->sm_agno;
 	xfs_agblock_t		agbno;
 	xfs_agblock_t		eoag;
@@ -880,7 +881,7 @@ xchk_agi(
 		xchk_block_set_corrupt(sc, sc->sa.agi_bp);
 
 	level = be32_to_cpu(agi->agi_level);
-	if (level <= 0 || level > XFS_BTREE_MAXLEVELS)
+	if (level <= 0 || level > igeo->inobt_maxlevels)
 		xchk_block_set_corrupt(sc, sc->sa.agi_bp);
 
 	if (xfs_has_finobt(mp)) {
@@ -889,7 +890,7 @@ xchk_agi(
 			xchk_block_set_corrupt(sc, sc->sa.agi_bp);
 
 		level = be32_to_cpu(agi->agi_free_level);
-		if (level <= 0 || level > XFS_BTREE_MAXLEVELS)
+		if (level <= 0 || level > igeo->inobt_maxlevels)
 			xchk_block_set_corrupt(sc, sc->sa.agi_bp);
 	}
 


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH 05/14] xfs: stricter btree height checking when scanning for btree roots
  2021-09-18  1:29 [PATCHSET RFC chandan 00/14] xfs: support dynamic btree cursor height Darrick J. Wong
                   ` (3 preceding siblings ...)
  2021-09-18  1:29 ` [PATCH 04/14] xfs: stricter btree height checking when looking for errors Darrick J. Wong
@ 2021-09-18  1:29 ` Darrick J. Wong
  2021-09-20  9:54   ` Chandan Babu R
  2021-09-18  1:29 ` [PATCH 06/14] xfs: check that bc_nlevels never overflows Darrick J. Wong
                   ` (8 subsequent siblings)
  13 siblings, 1 reply; 48+ messages in thread
From: Darrick J. Wong @ 2021-09-18  1:29 UTC (permalink / raw)
  To: djwong, chandan.babu, chandanrlinux; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

When we're scanning for btree roots to rebuild the AG headers, make sure
that the proposed tree does not exceed the maximum height for that btree
type (and not just XFS_BTREE_MAXLEVELS).

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/agheader_repair.c |    8 +++++++-
 fs/xfs/scrub/repair.h          |    3 +++
 2 files changed, 10 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/scrub/agheader_repair.c b/fs/xfs/scrub/agheader_repair.c
index 0f8deee66f15..05c27149b65d 100644
--- a/fs/xfs/scrub/agheader_repair.c
+++ b/fs/xfs/scrub/agheader_repair.c
@@ -122,7 +122,7 @@ xrep_check_btree_root(
 	xfs_agnumber_t			agno = sc->sm->sm_agno;
 
 	return xfs_verify_agbno(mp, agno, fab->root) &&
-	       fab->height <= XFS_BTREE_MAXLEVELS;
+	       fab->height <= fab->maxlevels;
 }
 
 /*
@@ -339,18 +339,22 @@ xrep_agf(
 		[XREP_AGF_BNOBT] = {
 			.rmap_owner = XFS_RMAP_OWN_AG,
 			.buf_ops = &xfs_bnobt_buf_ops,
+			.maxlevels = sc->mp->m_ag_maxlevels,
 		},
 		[XREP_AGF_CNTBT] = {
 			.rmap_owner = XFS_RMAP_OWN_AG,
 			.buf_ops = &xfs_cntbt_buf_ops,
+			.maxlevels = sc->mp->m_ag_maxlevels,
 		},
 		[XREP_AGF_RMAPBT] = {
 			.rmap_owner = XFS_RMAP_OWN_AG,
 			.buf_ops = &xfs_rmapbt_buf_ops,
+			.maxlevels = sc->mp->m_rmap_maxlevels,
 		},
 		[XREP_AGF_REFCOUNTBT] = {
 			.rmap_owner = XFS_RMAP_OWN_REFC,
 			.buf_ops = &xfs_refcountbt_buf_ops,
+			.maxlevels = sc->mp->m_refc_maxlevels,
 		},
 		[XREP_AGF_END] = {
 			.buf_ops = NULL,
@@ -881,10 +885,12 @@ xrep_agi(
 		[XREP_AGI_INOBT] = {
 			.rmap_owner = XFS_RMAP_OWN_INOBT,
 			.buf_ops = &xfs_inobt_buf_ops,
+			.maxlevels = M_IGEO(sc->mp)->inobt_maxlevels,
 		},
 		[XREP_AGI_FINOBT] = {
 			.rmap_owner = XFS_RMAP_OWN_INOBT,
 			.buf_ops = &xfs_finobt_buf_ops,
+			.maxlevels = M_IGEO(sc->mp)->inobt_maxlevels,
 		},
 		[XREP_AGI_END] = {
 			.buf_ops = NULL
diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
index 3bb152d52a07..840f74ec431c 100644
--- a/fs/xfs/scrub/repair.h
+++ b/fs/xfs/scrub/repair.h
@@ -44,6 +44,9 @@ struct xrep_find_ag_btree {
 	/* in: buffer ops */
 	const struct xfs_buf_ops	*buf_ops;
 
+	/* in: maximum btree height */
+	unsigned int			maxlevels;
+
 	/* out: the highest btree block found and the tree height */
 	xfs_agblock_t			root;
 	unsigned int			height;


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH 06/14] xfs: check that bc_nlevels never overflows
  2021-09-18  1:29 [PATCHSET RFC chandan 00/14] xfs: support dynamic btree cursor height Darrick J. Wong
                   ` (4 preceding siblings ...)
  2021-09-18  1:29 ` [PATCH 05/14] xfs: stricter btree height checking when scanning for btree roots Darrick J. Wong
@ 2021-09-18  1:29 ` Darrick J. Wong
  2021-09-20  9:54   ` Chandan Babu R
  2021-09-21  8:44   ` Christoph Hellwig
  2021-09-18  1:29 ` [PATCH 07/14] xfs: support dynamic btree cursor heights Darrick J. Wong
                   ` (7 subsequent siblings)
  13 siblings, 2 replies; 48+ messages in thread
From: Darrick J. Wong @ 2021-09-18  1:29 UTC (permalink / raw)
  To: djwong, chandan.babu, chandanrlinux; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Warn if we ever bump nlevels higher than the allowed maximum cursor
height.
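The guarded increment can be sketched standalone as follows (MAXLEVELS
and the helper are illustrative stand-ins; the patch open-codes the
ASSERT immediately after each bc_nlevels++):

```c
#include <assert.h>
#include <stdint.h>

#define MAXLEVELS	9	/* stand-in for XFS_BTREE_MAXLEVELS */

/*
 * Bump the cursor height, tripping an assertion on debug builds if the
 * height ever exceeds what the cursor was sized for.
 */
static uint8_t bump_nlevels(uint8_t nlevels)
{
	nlevels++;
	assert(nlevels <= MAXLEVELS);
	return nlevels;
}
```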

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_btree.c         |    2 ++
 fs/xfs/libxfs/xfs_btree_staging.c |    2 ++
 2 files changed, 4 insertions(+)


diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index b0cce0932f02..bc4e49f0456a 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -2933,6 +2933,7 @@ xfs_btree_new_iroot(
 	be16_add_cpu(&block->bb_level, 1);
 	xfs_btree_set_numrecs(block, 1);
 	cur->bc_nlevels++;
+	ASSERT(cur->bc_nlevels <= XFS_BTREE_MAXLEVELS);
 	cur->bc_ptrs[level + 1] = 1;
 
 	kp = xfs_btree_key_addr(cur, 1, block);
@@ -3096,6 +3097,7 @@ xfs_btree_new_root(
 	xfs_btree_setbuf(cur, cur->bc_nlevels, nbp);
 	cur->bc_ptrs[cur->bc_nlevels] = nptr;
 	cur->bc_nlevels++;
+	ASSERT(cur->bc_nlevels <= XFS_BTREE_MAXLEVELS);
 	*stat = 1;
 	return 0;
 error0:
diff --git a/fs/xfs/libxfs/xfs_btree_staging.c b/fs/xfs/libxfs/xfs_btree_staging.c
index ac9e80152b5c..26143297bb7b 100644
--- a/fs/xfs/libxfs/xfs_btree_staging.c
+++ b/fs/xfs/libxfs/xfs_btree_staging.c
@@ -703,6 +703,7 @@ xfs_btree_bload_compute_geometry(
 			 * block-based btree level.
 			 */
 			cur->bc_nlevels++;
+			ASSERT(cur->bc_nlevels <= XFS_BTREE_MAXLEVELS);
 			xfs_btree_bload_level_geometry(cur, bbl, level,
 					nr_this_level, &avg_per_block,
 					&level_blocks, &dontcare64);
@@ -718,6 +719,7 @@ xfs_btree_bload_compute_geometry(
 
 			/* Otherwise, we need another level of btree. */
 			cur->bc_nlevels++;
+			ASSERT(cur->bc_nlevels <= XFS_BTREE_MAXLEVELS);
 		}
 
 		nr_blocks += level_blocks;


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH 07/14] xfs: support dynamic btree cursor heights
  2021-09-18  1:29 [PATCHSET RFC chandan 00/14] xfs: support dynamic btree cursor height Darrick J. Wong
                   ` (5 preceding siblings ...)
  2021-09-18  1:29 ` [PATCH 06/14] xfs: check that bc_nlevels never overflows Darrick J. Wong
@ 2021-09-18  1:29 ` Darrick J. Wong
  2021-09-20  9:55   ` Chandan Babu R
  2021-09-21  8:49   ` Christoph Hellwig
  2021-09-18  1:29 ` [PATCH 08/14] xfs: refactor btree cursor allocation function Darrick J. Wong
                   ` (6 subsequent siblings)
  13 siblings, 2 replies; 48+ messages in thread
From: Darrick J. Wong @ 2021-09-18  1:29 UTC (permalink / raw)
  To: djwong, chandan.babu, chandanrlinux; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Split out the btree level information into a separate struct and put it
at the end of the cursor structure as a flexible array member.  The
realtime rmap btree (which is rooted in an inode) will require the
ability to support many more levels than a per-AG btree cursor, which
means that we're going to create two btree cursor caches to conserve
memory for the more common case.
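The layout change can be sketched in userspace C as follows; the struct
and function names here are simplified stand-ins for xfs_btree_level,
xfs_btree_cur, and xfs_btree_cur_sizeof(), and the allocator is only an
illustration of sizing a cursor for a given height:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Per-level state, formerly three parallel fixed-size arrays. */
struct level {
	void		*bp;	/* buffer pointer */
	unsigned int	ptr;	/* key/record number */
	uint8_t		ra;	/* readahead flags */
};

struct cursor {
	uint8_t		nlevels;	/* number of levels in the tree */
	/* Must be at the end of the struct! */
	struct level	levels[];
};

/* Bytes needed for a cursor that can address nlevels levels. */
static size_t cursor_sizeof(unsigned int nlevels)
{
	return sizeof(struct cursor) + sizeof(struct level) * nlevels;
}

static struct cursor *cursor_alloc(unsigned int nlevels)
{
	struct cursor *cur = calloc(1, cursor_sizeof(nlevels));

	if (cur)
		cur->nlevels = nlevels;
	return cur;
}
```

Because the array is sized at allocation time, a short per-AG cursor no
longer pays for the maximum height a tall inode-rooted tree might need.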

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_alloc.c |    6 +-
 fs/xfs/libxfs/xfs_bmap.c  |   10 +--
 fs/xfs/libxfs/xfs_btree.c |  154 +++++++++++++++++++++++----------------------
 fs/xfs/libxfs/xfs_btree.h |   28 ++++++--
 fs/xfs/scrub/bitmap.c     |   16 ++---
 fs/xfs/scrub/bmap.c       |    2 -
 fs/xfs/scrub/btree.c      |   40 ++++++------
 fs/xfs/scrub/trace.c      |    7 +-
 fs/xfs/scrub/trace.h      |   10 +--
 fs/xfs/xfs_super.c        |    2 -
 fs/xfs/xfs_trace.h        |    2 -
 11 files changed, 147 insertions(+), 130 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 35fb1dd3be95..55c5adc9b54e 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -488,8 +488,8 @@ xfs_alloc_fixup_trees(
 		struct xfs_btree_block	*bnoblock;
 		struct xfs_btree_block	*cntblock;
 
-		bnoblock = XFS_BUF_TO_BLOCK(bno_cur->bc_bufs[0]);
-		cntblock = XFS_BUF_TO_BLOCK(cnt_cur->bc_bufs[0]);
+		bnoblock = XFS_BUF_TO_BLOCK(bno_cur->bc_levels[0].bp);
+		cntblock = XFS_BUF_TO_BLOCK(cnt_cur->bc_levels[0].bp);
 
 		if (XFS_IS_CORRUPT(mp,
 				   bnoblock->bb_numrecs !=
@@ -1512,7 +1512,7 @@ xfs_alloc_ag_vextent_lastblock(
 	 * than minlen.
 	 */
 	if (*len || args->alignment > 1) {
-		acur->cnt->bc_ptrs[0] = 1;
+		acur->cnt->bc_levels[0].ptr = 1;
 		do {
 			error = xfs_alloc_get_rec(acur->cnt, bno, len, &i);
 			if (error)
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 499c977cbf56..644b956301b6 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -240,10 +240,10 @@ xfs_bmap_get_bp(
 		return NULL;
 
 	for (i = 0; i < XFS_BTREE_MAXLEVELS; i++) {
-		if (!cur->bc_bufs[i])
+		if (!cur->bc_levels[i].bp)
 			break;
-		if (xfs_buf_daddr(cur->bc_bufs[i]) == bno)
-			return cur->bc_bufs[i];
+		if (xfs_buf_daddr(cur->bc_levels[i].bp) == bno)
+			return cur->bc_levels[i].bp;
 	}
 
 	/* Chase down all the log items to see if the bp is there */
@@ -629,8 +629,8 @@ xfs_bmap_btree_to_extents(
 	ip->i_nblocks--;
 	xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_BCOUNT, -1L);
 	xfs_trans_binval(tp, cbp);
-	if (cur->bc_bufs[0] == cbp)
-		cur->bc_bufs[0] = NULL;
+	if (cur->bc_levels[0].bp == cbp)
+		cur->bc_levels[0].bp = NULL;
 	xfs_iroot_realloc(ip, -1, whichfork);
 	ASSERT(ifp->if_broot == NULL);
 	ifp->if_format = XFS_DINODE_FMT_EXTENTS;
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index bc4e49f0456a..93fb50516bc2 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -367,8 +367,8 @@ xfs_btree_del_cursor(
 	 * way we won't have initialized all the entries down to 0.
 	 */
 	for (i = 0; i < cur->bc_nlevels; i++) {
-		if (cur->bc_bufs[i])
-			xfs_trans_brelse(cur->bc_tp, cur->bc_bufs[i]);
+		if (cur->bc_levels[i].bp)
+			xfs_trans_brelse(cur->bc_tp, cur->bc_levels[i].bp);
 		else if (!error)
 			break;
 	}
@@ -415,9 +415,9 @@ xfs_btree_dup_cursor(
 	 * For each level current, re-get the buffer and copy the ptr value.
 	 */
 	for (i = 0; i < new->bc_nlevels; i++) {
-		new->bc_ptrs[i] = cur->bc_ptrs[i];
-		new->bc_ra[i] = cur->bc_ra[i];
-		bp = cur->bc_bufs[i];
+		new->bc_levels[i].ptr = cur->bc_levels[i].ptr;
+		new->bc_levels[i].ra = cur->bc_levels[i].ra;
+		bp = cur->bc_levels[i].bp;
 		if (bp) {
 			error = xfs_trans_read_buf(mp, tp, mp->m_ddev_targp,
 						   xfs_buf_daddr(bp), mp->m_bsize,
@@ -429,7 +429,7 @@ xfs_btree_dup_cursor(
 				return error;
 			}
 		}
-		new->bc_bufs[i] = bp;
+		new->bc_levels[i].bp = bp;
 	}
 	*ncur = new;
 	return 0;
@@ -681,7 +681,7 @@ xfs_btree_get_block(
 		return xfs_btree_get_iroot(cur);
 	}
 
-	*bpp = cur->bc_bufs[level];
+	*bpp = cur->bc_levels[level].bp;
 	return XFS_BUF_TO_BLOCK(*bpp);
 }
 
@@ -711,7 +711,7 @@ xfs_btree_firstrec(
 	/*
 	 * Set the ptr value to 1, that's the first record/key.
 	 */
-	cur->bc_ptrs[level] = 1;
+	cur->bc_levels[level].ptr = 1;
 	return 1;
 }
 
@@ -741,7 +741,7 @@ xfs_btree_lastrec(
 	/*
 	 * Set the ptr value to numrecs, that's the last record/key.
 	 */
-	cur->bc_ptrs[level] = be16_to_cpu(block->bb_numrecs);
+	cur->bc_levels[level].ptr = be16_to_cpu(block->bb_numrecs);
 	return 1;
 }
 
@@ -922,11 +922,11 @@ xfs_btree_readahead(
 	    (lev == cur->bc_nlevels - 1))
 		return 0;
 
-	if ((cur->bc_ra[lev] | lr) == cur->bc_ra[lev])
+	if ((cur->bc_levels[lev].ra | lr) == cur->bc_levels[lev].ra)
 		return 0;
 
-	cur->bc_ra[lev] |= lr;
-	block = XFS_BUF_TO_BLOCK(cur->bc_bufs[lev]);
+	cur->bc_levels[lev].ra |= lr;
+	block = XFS_BUF_TO_BLOCK(cur->bc_levels[lev].bp);
 
 	if (cur->bc_flags & XFS_BTREE_LONG_PTRS)
 		return xfs_btree_readahead_lblock(cur, lr, block);
@@ -991,22 +991,22 @@ xfs_btree_setbuf(
 {
 	struct xfs_btree_block	*b;	/* btree block */
 
-	if (cur->bc_bufs[lev])
-		xfs_trans_brelse(cur->bc_tp, cur->bc_bufs[lev]);
-	cur->bc_bufs[lev] = bp;
-	cur->bc_ra[lev] = 0;
+	if (cur->bc_levels[lev].bp)
+		xfs_trans_brelse(cur->bc_tp, cur->bc_levels[lev].bp);
+	cur->bc_levels[lev].bp = bp;
+	cur->bc_levels[lev].ra = 0;
 
 	b = XFS_BUF_TO_BLOCK(bp);
 	if (cur->bc_flags & XFS_BTREE_LONG_PTRS) {
 		if (b->bb_u.l.bb_leftsib == cpu_to_be64(NULLFSBLOCK))
-			cur->bc_ra[lev] |= XFS_BTCUR_LEFTRA;
+			cur->bc_levels[lev].ra |= XFS_BTCUR_LEFTRA;
 		if (b->bb_u.l.bb_rightsib == cpu_to_be64(NULLFSBLOCK))
-			cur->bc_ra[lev] |= XFS_BTCUR_RIGHTRA;
+			cur->bc_levels[lev].ra |= XFS_BTCUR_RIGHTRA;
 	} else {
 		if (b->bb_u.s.bb_leftsib == cpu_to_be32(NULLAGBLOCK))
-			cur->bc_ra[lev] |= XFS_BTCUR_LEFTRA;
+			cur->bc_levels[lev].ra |= XFS_BTCUR_LEFTRA;
 		if (b->bb_u.s.bb_rightsib == cpu_to_be32(NULLAGBLOCK))
-			cur->bc_ra[lev] |= XFS_BTCUR_RIGHTRA;
+			cur->bc_levels[lev].ra |= XFS_BTCUR_RIGHTRA;
 	}
 }
 
@@ -1548,7 +1548,7 @@ xfs_btree_increment(
 #endif
 
 	/* We're done if we remain in the block after the increment. */
-	if (++cur->bc_ptrs[level] <= xfs_btree_get_numrecs(block))
+	if (++cur->bc_levels[level].ptr <= xfs_btree_get_numrecs(block))
 		goto out1;
 
 	/* Fail if we just went off the right edge of the tree. */
@@ -1571,7 +1571,7 @@ xfs_btree_increment(
 			goto error0;
 #endif
 
-		if (++cur->bc_ptrs[lev] <= xfs_btree_get_numrecs(block))
+		if (++cur->bc_levels[lev].ptr <= xfs_btree_get_numrecs(block))
 			break;
 
 		/* Read-ahead the right block for the next loop. */
@@ -1598,14 +1598,14 @@ xfs_btree_increment(
 	for (block = xfs_btree_get_block(cur, lev, &bp); lev > level; ) {
 		union xfs_btree_ptr	*ptrp;
 
-		ptrp = xfs_btree_ptr_addr(cur, cur->bc_ptrs[lev], block);
+		ptrp = xfs_btree_ptr_addr(cur, cur->bc_levels[lev].ptr, block);
 		--lev;
 		error = xfs_btree_read_buf_block(cur, ptrp, 0, &block, &bp);
 		if (error)
 			goto error0;
 
 		xfs_btree_setbuf(cur, lev, bp);
-		cur->bc_ptrs[lev] = 1;
+		cur->bc_levels[lev].ptr = 1;
 	}
 out1:
 	*stat = 1;
@@ -1641,7 +1641,7 @@ xfs_btree_decrement(
 	xfs_btree_readahead(cur, level, XFS_BTCUR_LEFTRA);
 
 	/* We're done if we remain in the block after the decrement. */
-	if (--cur->bc_ptrs[level] > 0)
+	if (--cur->bc_levels[level].ptr > 0)
 		goto out1;
 
 	/* Get a pointer to the btree block. */
@@ -1665,7 +1665,7 @@ xfs_btree_decrement(
 	 * Stop when we don't go off the left edge of a block.
 	 */
 	for (lev = level + 1; lev < cur->bc_nlevels; lev++) {
-		if (--cur->bc_ptrs[lev] > 0)
+		if (--cur->bc_levels[lev].ptr > 0)
 			break;
 		/* Read-ahead the left block for the next loop. */
 		xfs_btree_readahead(cur, lev, XFS_BTCUR_LEFTRA);
@@ -1691,13 +1691,13 @@ xfs_btree_decrement(
 	for (block = xfs_btree_get_block(cur, lev, &bp); lev > level; ) {
 		union xfs_btree_ptr	*ptrp;
 
-		ptrp = xfs_btree_ptr_addr(cur, cur->bc_ptrs[lev], block);
+		ptrp = xfs_btree_ptr_addr(cur, cur->bc_levels[lev].ptr, block);
 		--lev;
 		error = xfs_btree_read_buf_block(cur, ptrp, 0, &block, &bp);
 		if (error)
 			goto error0;
 		xfs_btree_setbuf(cur, lev, bp);
-		cur->bc_ptrs[lev] = xfs_btree_get_numrecs(block);
+		cur->bc_levels[lev].ptr = xfs_btree_get_numrecs(block);
 	}
 out1:
 	*stat = 1;
@@ -1735,7 +1735,7 @@ xfs_btree_lookup_get_block(
 	 *
 	 * Otherwise throw it away and get a new one.
 	 */
-	bp = cur->bc_bufs[level];
+	bp = cur->bc_levels[level].bp;
 	error = xfs_btree_ptr_to_daddr(cur, pp, &daddr);
 	if (error)
 		return error;
@@ -1864,7 +1864,7 @@ xfs_btree_lookup(
 					return -EFSCORRUPTED;
 				}
 
-				cur->bc_ptrs[0] = dir != XFS_LOOKUP_LE;
+				cur->bc_levels[0].ptr = dir != XFS_LOOKUP_LE;
 				*stat = 0;
 				return 0;
 			}
@@ -1916,7 +1916,7 @@ xfs_btree_lookup(
 			if (error)
 				goto error0;
 
-			cur->bc_ptrs[level] = keyno;
+			cur->bc_levels[level].ptr = keyno;
 		}
 	}
 
@@ -1933,7 +1933,7 @@ xfs_btree_lookup(
 		    !xfs_btree_ptr_is_null(cur, &ptr)) {
 			int	i;
 
-			cur->bc_ptrs[0] = keyno;
+			cur->bc_levels[0].ptr = keyno;
 			error = xfs_btree_increment(cur, 0, &i);
 			if (error)
 				goto error0;
@@ -1944,7 +1944,7 @@ xfs_btree_lookup(
 		}
 	} else if (dir == XFS_LOOKUP_LE && diff > 0)
 		keyno--;
-	cur->bc_ptrs[0] = keyno;
+	cur->bc_levels[0].ptr = keyno;
 
 	/* Return if we succeeded or not. */
 	if (keyno == 0 || keyno > xfs_btree_get_numrecs(block))
@@ -2104,7 +2104,7 @@ __xfs_btree_updkeys(
 		if (error)
 			return error;
 #endif
-		ptr = cur->bc_ptrs[level];
+		ptr = cur->bc_levels[level].ptr;
 		nlkey = xfs_btree_key_addr(cur, ptr, block);
 		nhkey = xfs_btree_high_key_addr(cur, ptr, block);
 		if (!force_all &&
@@ -2171,7 +2171,7 @@ xfs_btree_update_keys(
 		if (error)
 			return error;
 #endif
-		ptr = cur->bc_ptrs[level];
+		ptr = cur->bc_levels[level].ptr;
 		kp = xfs_btree_key_addr(cur, ptr, block);
 		xfs_btree_copy_keys(cur, kp, &key, 1);
 		xfs_btree_log_keys(cur, bp, ptr, ptr);
@@ -2205,7 +2205,7 @@ xfs_btree_update(
 		goto error0;
 #endif
 	/* Get the address of the rec to be updated. */
-	ptr = cur->bc_ptrs[0];
+	ptr = cur->bc_levels[0].ptr;
 	rp = xfs_btree_rec_addr(cur, ptr, block);
 
 	/* Fill in the new contents and log them. */
@@ -2280,7 +2280,7 @@ xfs_btree_lshift(
 	 * If the cursor entry is the one that would be moved, don't
 	 * do it... it's too complicated.
 	 */
-	if (cur->bc_ptrs[level] <= 1)
+	if (cur->bc_levels[level].ptr <= 1)
 		goto out0;
 
 	/* Set up the left neighbor as "left". */
@@ -2414,7 +2414,7 @@ xfs_btree_lshift(
 		goto error0;
 
 	/* Slide the cursor value left one. */
-	cur->bc_ptrs[level]--;
+	cur->bc_levels[level].ptr--;
 
 	*stat = 1;
 	return 0;
@@ -2476,7 +2476,7 @@ xfs_btree_rshift(
 	 * do it... it's too complicated.
 	 */
 	lrecs = xfs_btree_get_numrecs(left);
-	if (cur->bc_ptrs[level] >= lrecs)
+	if (cur->bc_levels[level].ptr >= lrecs)
 		goto out0;
 
 	/* Set up the right neighbor as "right". */
@@ -2664,7 +2664,7 @@ __xfs_btree_split(
 	 */
 	lrecs = xfs_btree_get_numrecs(left);
 	rrecs = lrecs / 2;
-	if ((lrecs & 1) && cur->bc_ptrs[level] <= rrecs + 1)
+	if ((lrecs & 1) && cur->bc_levels[level].ptr <= rrecs + 1)
 		rrecs++;
 	src_index = (lrecs - rrecs + 1);
 
@@ -2760,9 +2760,9 @@ __xfs_btree_split(
 	 * If it's just pointing past the last entry in left, then we'll
 	 * insert there, so don't change anything in that case.
 	 */
-	if (cur->bc_ptrs[level] > lrecs + 1) {
+	if (cur->bc_levels[level].ptr > lrecs + 1) {
 		xfs_btree_setbuf(cur, level, rbp);
-		cur->bc_ptrs[level] -= lrecs;
+		cur->bc_levels[level].ptr -= lrecs;
 	}
 	/*
 	 * If there are more levels, we'll need another cursor which refers
@@ -2772,7 +2772,7 @@ __xfs_btree_split(
 		error = xfs_btree_dup_cursor(cur, curp);
 		if (error)
 			goto error0;
-		(*curp)->bc_ptrs[level + 1]++;
+		(*curp)->bc_levels[level + 1].ptr++;
 	}
 	*ptrp = rptr;
 	*stat = 1;
@@ -2934,7 +2934,7 @@ xfs_btree_new_iroot(
 	xfs_btree_set_numrecs(block, 1);
 	cur->bc_nlevels++;
 	ASSERT(cur->bc_nlevels <= XFS_BTREE_MAXLEVELS);
-	cur->bc_ptrs[level + 1] = 1;
+	cur->bc_levels[level + 1].ptr = 1;
 
 	kp = xfs_btree_key_addr(cur, 1, block);
 	ckp = xfs_btree_key_addr(cur, 1, cblock);
@@ -3095,7 +3095,7 @@ xfs_btree_new_root(
 
 	/* Fix up the cursor. */
 	xfs_btree_setbuf(cur, cur->bc_nlevels, nbp);
-	cur->bc_ptrs[cur->bc_nlevels] = nptr;
+	cur->bc_levels[cur->bc_nlevels].ptr = nptr;
 	cur->bc_nlevels++;
 	ASSERT(cur->bc_nlevels <= XFS_BTREE_MAXLEVELS);
 	*stat = 1;
@@ -3154,7 +3154,7 @@ xfs_btree_make_block_unfull(
 		return error;
 
 	if (*stat) {
-		*oindex = *index = cur->bc_ptrs[level];
+		*oindex = *index = cur->bc_levels[level].ptr;
 		return 0;
 	}
 
@@ -3169,7 +3169,7 @@ xfs_btree_make_block_unfull(
 		return error;
 
 
-	*index = cur->bc_ptrs[level];
+	*index = cur->bc_levels[level].ptr;
 	return 0;
 }
 
@@ -3216,7 +3216,7 @@ xfs_btree_insrec(
 	}
 
 	/* If we're off the left edge, return failure. */
-	ptr = cur->bc_ptrs[level];
+	ptr = cur->bc_levels[level].ptr;
 	if (ptr == 0) {
 		*stat = 0;
 		return 0;
@@ -3559,7 +3559,7 @@ xfs_btree_kill_iroot(
 	if (error)
 		return error;
 
-	cur->bc_bufs[level - 1] = NULL;
+	cur->bc_levels[level - 1].bp = NULL;
 	be16_add_cpu(&block->bb_level, -1);
 	xfs_trans_log_inode(cur->bc_tp, ip,
 		XFS_ILOG_CORE | xfs_ilog_fbroot(cur->bc_ino.whichfork));
@@ -3592,8 +3592,8 @@ xfs_btree_kill_root(
 	if (error)
 		return error;
 
-	cur->bc_bufs[level] = NULL;
-	cur->bc_ra[level] = 0;
+	cur->bc_levels[level].bp = NULL;
+	cur->bc_levels[level].ra = 0;
 	cur->bc_nlevels--;
 
 	return 0;
@@ -3652,7 +3652,7 @@ xfs_btree_delrec(
 	tcur = NULL;
 
 	/* Get the index of the entry being deleted, check for nothing there. */
-	ptr = cur->bc_ptrs[level];
+	ptr = cur->bc_levels[level].ptr;
 	if (ptr == 0) {
 		*stat = 0;
 		return 0;
@@ -3962,7 +3962,7 @@ xfs_btree_delrec(
 				xfs_btree_del_cursor(tcur, XFS_BTREE_NOERROR);
 				tcur = NULL;
 				if (level == 0)
-					cur->bc_ptrs[0]++;
+					cur->bc_levels[0].ptr++;
 
 				*stat = 1;
 				return 0;
@@ -4099,9 +4099,9 @@ xfs_btree_delrec(
 	 * cursor to the left block, and fix up the index.
 	 */
 	if (bp != lbp) {
-		cur->bc_bufs[level] = lbp;
-		cur->bc_ptrs[level] += lrecs;
-		cur->bc_ra[level] = 0;
+		cur->bc_levels[level].bp = lbp;
+		cur->bc_levels[level].ptr += lrecs;
+		cur->bc_levels[level].ra = 0;
 	}
 	/*
 	 * If we joined with the right neighbor and there's a level above
@@ -4121,11 +4121,11 @@ xfs_btree_delrec(
 	 * We can't use decrement because it would change the next level up.
 	 */
 	if (level > 0)
-		cur->bc_ptrs[level]--;
+		cur->bc_levels[level].ptr--;
 
 	/*
 	 * We combined blocks, so we have to update the parent keys if the
-	 * btree supports overlapped intervals.  However, bc_ptrs[level + 1]
+	 * btree supports overlapped intervals.  However, bc_levels[level + 1].ptr
 	 * points to the old block so that the caller knows which record to
 	 * delete.  Therefore, the caller must be savvy enough to call updkeys
 	 * for us if we return stat == 2.  The other exit points from this
@@ -4184,7 +4184,7 @@ xfs_btree_delete(
 
 	if (i == 0) {
 		for (level = 1; level < cur->bc_nlevels; level++) {
-			if (cur->bc_ptrs[level] == 0) {
+			if (cur->bc_levels[level].ptr == 0) {
 				error = xfs_btree_decrement(cur, level, &i);
 				if (error)
 					goto error0;
@@ -4215,7 +4215,7 @@ xfs_btree_get_rec(
 	int			error;	/* error return value */
 #endif
 
-	ptr = cur->bc_ptrs[0];
+	ptr = cur->bc_levels[0].ptr;
 	block = xfs_btree_get_block(cur, 0, &bp);
 
 #ifdef DEBUG
@@ -4663,23 +4663,23 @@ xfs_btree_overlapped_query_range(
 	if (error)
 		goto out;
 #endif
-	cur->bc_ptrs[level] = 1;
+	cur->bc_levels[level].ptr = 1;
 
 	while (level < cur->bc_nlevels) {
 		block = xfs_btree_get_block(cur, level, &bp);
 
 		/* End of node, pop back towards the root. */
-		if (cur->bc_ptrs[level] > be16_to_cpu(block->bb_numrecs)) {
+		if (cur->bc_levels[level].ptr > be16_to_cpu(block->bb_numrecs)) {
 pop_up:
 			if (level < cur->bc_nlevels - 1)
-				cur->bc_ptrs[level + 1]++;
+				cur->bc_levels[level + 1].ptr++;
 			level++;
 			continue;
 		}
 
 		if (level == 0) {
 			/* Handle a leaf node. */
-			recp = xfs_btree_rec_addr(cur, cur->bc_ptrs[0], block);
+			recp = xfs_btree_rec_addr(cur, cur->bc_levels[0].ptr, block);
 
 			cur->bc_ops->init_high_key_from_rec(&rec_hkey, recp);
 			ldiff = cur->bc_ops->diff_two_keys(cur, &rec_hkey,
@@ -4702,14 +4702,14 @@ xfs_btree_overlapped_query_range(
 				/* Record is larger than high key; pop. */
 				goto pop_up;
 			}
-			cur->bc_ptrs[level]++;
+			cur->bc_levels[level].ptr++;
 			continue;
 		}
 
 		/* Handle an internal node. */
-		lkp = xfs_btree_key_addr(cur, cur->bc_ptrs[level], block);
-		hkp = xfs_btree_high_key_addr(cur, cur->bc_ptrs[level], block);
-		pp = xfs_btree_ptr_addr(cur, cur->bc_ptrs[level], block);
+		lkp = xfs_btree_key_addr(cur, cur->bc_levels[level].ptr, block);
+		hkp = xfs_btree_high_key_addr(cur, cur->bc_levels[level].ptr, block);
+		pp = xfs_btree_ptr_addr(cur, cur->bc_levels[level].ptr, block);
 
 		ldiff = cur->bc_ops->diff_two_keys(cur, hkp, low_key);
 		hdiff = cur->bc_ops->diff_two_keys(cur, high_key, lkp);
@@ -4732,13 +4732,13 @@ xfs_btree_overlapped_query_range(
 			if (error)
 				goto out;
 #endif
-			cur->bc_ptrs[level] = 1;
+			cur->bc_levels[level].ptr = 1;
 			continue;
 		} else if (hdiff < 0) {
 			/* The low key is larger than the upper range; pop. */
 			goto pop_up;
 		}
-		cur->bc_ptrs[level]++;
+		cur->bc_levels[level].ptr++;
 	}
 
 out:
@@ -4749,13 +4749,13 @@ xfs_btree_overlapped_query_range(
 	 * with a zero-results range query, so release the buffers if we
 	 * failed to return any results.
 	 */
-	if (cur->bc_bufs[0] == NULL) {
+	if (cur->bc_levels[0].bp == NULL) {
 		for (i = 0; i < cur->bc_nlevels; i++) {
-			if (cur->bc_bufs[i]) {
-				xfs_trans_brelse(cur->bc_tp, cur->bc_bufs[i]);
-				cur->bc_bufs[i] = NULL;
-				cur->bc_ptrs[i] = 0;
-				cur->bc_ra[i] = 0;
+			if (cur->bc_levels[i].bp) {
+				xfs_trans_brelse(cur->bc_tp, cur->bc_levels[i].bp);
+				cur->bc_levels[i].bp = NULL;
+				cur->bc_levels[i].ptr = 0;
+				cur->bc_levels[i].ra = 0;
 			}
 		}
 	}
@@ -4917,7 +4917,7 @@ xfs_btree_has_more_records(
 	block = xfs_btree_get_block(cur, 0, &bp);
 
 	/* There are still records in this block. */
-	if (cur->bc_ptrs[0] < xfs_btree_get_numrecs(block))
+	if (cur->bc_levels[0].ptr < xfs_btree_get_numrecs(block))
 		return true;
 
 	/* There are more record blocks. */
diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index 513ade4a89f8..827c44bf24dc 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -212,6 +212,19 @@ struct xfs_btree_cur_ino {
 #define	XFS_BTCUR_BMBT_INVALID_OWNER	(1 << 1)
 };
 
+struct xfs_btree_level {
+	/* buffer pointer */
+	struct xfs_buf	*bp;
+
+	/* key/record number */
+	unsigned int	ptr;
+
+	/* readahead info */
+#define	XFS_BTCUR_LEFTRA	1	/* left sibling has been read-ahead */
+#define	XFS_BTCUR_RIGHTRA	2	/* right sibling has been read-ahead */
+	uint8_t		ra;
+};
+
 /*
  * Btree cursor structure.
  * This collects all information needed by the btree code in one place.
@@ -223,11 +236,6 @@ struct xfs_btree_cur
 	const struct xfs_btree_ops *bc_ops;
 	uint			bc_flags; /* btree features - below */
 	union xfs_btree_irec	bc_rec;	/* current insert/search record value */
-	struct xfs_buf	*bc_bufs[XFS_BTREE_MAXLEVELS];	/* buf ptr per level */
-	int		bc_ptrs[XFS_BTREE_MAXLEVELS];	/* key/record # */
-	uint8_t		bc_ra[XFS_BTREE_MAXLEVELS];	/* readahead bits */
-#define	XFS_BTCUR_LEFTRA	1	/* left sibling has been read-ahead */
-#define	XFS_BTCUR_RIGHTRA	2	/* right sibling has been read-ahead */
 	uint8_t		bc_nlevels;	/* number of levels in the tree */
 	uint8_t		bc_blocklog;	/* log2(blocksize) of btree blocks */
 	xfs_btnum_t	bc_btnum;	/* identifies which btree type */
@@ -243,8 +251,17 @@ struct xfs_btree_cur
 		struct xfs_btree_cur_ag	bc_ag;
 		struct xfs_btree_cur_ino bc_ino;
 	};
+
+	/* Must be at the end of the struct! */
+	struct xfs_btree_level	bc_levels[];
 };
 
+static inline size_t xfs_btree_cur_sizeof(unsigned int nlevels)
+{
+	return sizeof(struct xfs_btree_cur) +
+	       sizeof(struct xfs_btree_level) * (nlevels);
+}
+
 /* cursor flags */
 #define XFS_BTREE_LONG_PTRS		(1<<0)	/* pointers are 64bits long */
 #define XFS_BTREE_ROOT_IN_INODE		(1<<1)	/* root may be variable size */
@@ -258,7 +275,6 @@ struct xfs_btree_cur
  */
 #define XFS_BTREE_STAGING		(1<<5)
 
-
 #define	XFS_BTREE_NOERROR	0
 #define	XFS_BTREE_ERROR		1
 
diff --git a/fs/xfs/scrub/bitmap.c b/fs/xfs/scrub/bitmap.c
index d6d24c866bc4..b8b8e871e3b7 100644
--- a/fs/xfs/scrub/bitmap.c
+++ b/fs/xfs/scrub/bitmap.c
@@ -222,20 +222,20 @@ xbitmap_disunion(
  * 1  2  3
  *
  * Pretend for this example that each leaf block has 100 btree records.  For
- * the first btree record, we'll observe that bc_ptrs[0] == 1, so we record
- * that we saw block 1.  Then we observe that bc_ptrs[1] == 1, so we record
+ * the first btree record, we'll observe that bc_levels[0].ptr == 1, so we record
+ * that we saw block 1.  Then we observe that bc_levels[1].ptr == 1, so we record
  * block 4.  The list is [1, 4].
  *
- * For the second btree record, we see that bc_ptrs[0] == 2, so we exit the
+ * For the second btree record, we see that bc_levels[0].ptr == 2, so we exit the
  * loop.  The list remains [1, 4].
  *
  * For the 101st btree record, we've moved onto leaf block 2.  Now
- * bc_ptrs[0] == 1 again, so we record that we saw block 2.  We see that
- * bc_ptrs[1] == 2, so we exit the loop.  The list is now [1, 4, 2].
+ * bc_levels[0].ptr == 1 again, so we record that we saw block 2.  We see that
+ * bc_levels[1].ptr == 2, so we exit the loop.  The list is now [1, 4, 2].
  *
- * For the 102nd record, bc_ptrs[0] == 2, so we continue.
+ * For the 102nd record, bc_levels[0].ptr == 2, so we continue.
  *
- * For the 201st record, we've moved on to leaf block 3.  bc_ptrs[0] == 1, so
+ * For the 201st record, we've moved on to leaf block 3.  bc_levels[0].ptr == 1, so
  * we add 3 to the list.  Now it is [1, 4, 2, 3].
  *
  * For the 300th record we just exit, with the list being [1, 4, 2, 3].
@@ -256,7 +256,7 @@ xbitmap_set_btcur_path(
 	int			i;
 	int			error;
 
-	for (i = 0; i < cur->bc_nlevels && cur->bc_ptrs[i] == 1; i++) {
+	for (i = 0; i < cur->bc_nlevels && cur->bc_levels[i].ptr == 1; i++) {
 		xfs_btree_get_block(cur, i, &bp);
 		if (!bp)
 			continue;
diff --git a/fs/xfs/scrub/bmap.c b/fs/xfs/scrub/bmap.c
index 017da9ceaee9..a4cbbc346f60 100644
--- a/fs/xfs/scrub/bmap.c
+++ b/fs/xfs/scrub/bmap.c
@@ -402,7 +402,7 @@ xchk_bmapbt_rec(
 	 * the root since the verifiers don't do that.
 	 */
 	if (xfs_has_crc(bs->cur->bc_mp) &&
-	    bs->cur->bc_ptrs[0] == 1) {
+	    bs->cur->bc_levels[0].ptr == 1) {
 		for (i = 0; i < bs->cur->bc_nlevels - 1; i++) {
 			block = xfs_btree_get_block(bs->cur, i, &bp);
 			owner = be64_to_cpu(block->bb_u.l.bb_owner);
diff --git a/fs/xfs/scrub/btree.c b/fs/xfs/scrub/btree.c
index 7b7762ae22e5..5a453ce151ed 100644
--- a/fs/xfs/scrub/btree.c
+++ b/fs/xfs/scrub/btree.c
@@ -136,7 +136,7 @@ xchk_btree_rec(
 	struct xfs_buf		*bp;
 
 	block = xfs_btree_get_block(cur, 0, &bp);
-	rec = xfs_btree_rec_addr(cur, cur->bc_ptrs[0], block);
+	rec = xfs_btree_rec_addr(cur, cur->bc_levels[0].ptr, block);
 
 	trace_xchk_btree_rec(bs->sc, cur, 0);
 
@@ -153,7 +153,7 @@ xchk_btree_rec(
 	/* Is this at least as large as the parent low key? */
 	cur->bc_ops->init_key_from_rec(&key, rec);
 	keyblock = xfs_btree_get_block(cur, 1, &bp);
-	keyp = xfs_btree_key_addr(cur, cur->bc_ptrs[1], keyblock);
+	keyp = xfs_btree_key_addr(cur, cur->bc_levels[1].ptr, keyblock);
 	if (cur->bc_ops->diff_two_keys(cur, &key, keyp) < 0)
 		xchk_btree_set_corrupt(bs->sc, cur, 1);
 
@@ -162,7 +162,7 @@ xchk_btree_rec(
 
 	/* Is this no larger than the parent high key? */
 	cur->bc_ops->init_high_key_from_rec(&hkey, rec);
-	keyp = xfs_btree_high_key_addr(cur, cur->bc_ptrs[1], keyblock);
+	keyp = xfs_btree_high_key_addr(cur, cur->bc_levels[1].ptr, keyblock);
 	if (cur->bc_ops->diff_two_keys(cur, keyp, &hkey) < 0)
 		xchk_btree_set_corrupt(bs->sc, cur, 1);
 }
@@ -184,7 +184,7 @@ xchk_btree_key(
 	struct xfs_buf		*bp;
 
 	block = xfs_btree_get_block(cur, level, &bp);
-	key = xfs_btree_key_addr(cur, cur->bc_ptrs[level], block);
+	key = xfs_btree_key_addr(cur, cur->bc_levels[level].ptr, block);
 
 	trace_xchk_btree_key(bs->sc, cur, level);
 
@@ -200,7 +200,7 @@ xchk_btree_key(
 
 	/* Is this at least as large as the parent low key? */
 	keyblock = xfs_btree_get_block(cur, level + 1, &bp);
-	keyp = xfs_btree_key_addr(cur, cur->bc_ptrs[level + 1], keyblock);
+	keyp = xfs_btree_key_addr(cur, cur->bc_levels[level + 1].ptr, keyblock);
 	if (cur->bc_ops->diff_two_keys(cur, key, keyp) < 0)
 		xchk_btree_set_corrupt(bs->sc, cur, level);
 
@@ -208,8 +208,8 @@ xchk_btree_key(
 		return;
 
 	/* Is this no larger than the parent high key? */
-	key = xfs_btree_high_key_addr(cur, cur->bc_ptrs[level], block);
-	keyp = xfs_btree_high_key_addr(cur, cur->bc_ptrs[level + 1], keyblock);
+	key = xfs_btree_high_key_addr(cur, cur->bc_levels[level].ptr, block);
+	keyp = xfs_btree_high_key_addr(cur, cur->bc_levels[level + 1].ptr, keyblock);
 	if (cur->bc_ops->diff_two_keys(cur, keyp, key) < 0)
 		xchk_btree_set_corrupt(bs->sc, cur, level);
 }
@@ -292,7 +292,7 @@ xchk_btree_block_check_sibling(
 
 	/* Compare upper level pointer to sibling pointer. */
 	pblock = xfs_btree_get_block(ncur, level + 1, &pbp);
-	pp = xfs_btree_ptr_addr(ncur, ncur->bc_ptrs[level + 1], pblock);
+	pp = xfs_btree_ptr_addr(ncur, ncur->bc_levels[level + 1].ptr, pblock);
 	if (!xchk_btree_ptr_ok(bs, level + 1, pp))
 		goto out;
 	if (pbp)
@@ -597,7 +597,7 @@ xchk_btree_block_keys(
 
 	/* Obtain the parent's copy of the keys for this block. */
 	parent_block = xfs_btree_get_block(cur, level + 1, &bp);
-	parent_keys = xfs_btree_key_addr(cur, cur->bc_ptrs[level + 1],
+	parent_keys = xfs_btree_key_addr(cur, cur->bc_levels[level + 1].ptr,
 			parent_block);
 
 	if (cur->bc_ops->diff_two_keys(cur, &block_keys, parent_keys) != 0)
@@ -608,7 +608,7 @@ xchk_btree_block_keys(
 
 	/* Get high keys */
 	high_bk = xfs_btree_high_key_from_key(cur, &block_keys);
-	high_pk = xfs_btree_high_key_addr(cur, cur->bc_ptrs[level + 1],
+	high_pk = xfs_btree_high_key_addr(cur, cur->bc_levels[level + 1].ptr,
 			parent_block);
 
 	if (cur->bc_ops->diff_two_keys(cur, high_bk, high_pk) != 0)
@@ -672,18 +672,18 @@ xchk_btree(
 	if (error || !block)
 		goto out;
 
-	cur->bc_ptrs[level] = 1;
+	cur->bc_levels[level].ptr = 1;
 
 	while (level < cur->bc_nlevels) {
 		block = xfs_btree_get_block(cur, level, &bp);
 
 		if (level == 0) {
 			/* End of leaf, pop back towards the root. */
-			if (cur->bc_ptrs[level] >
+			if (cur->bc_levels[level].ptr >
 			    be16_to_cpu(block->bb_numrecs)) {
 				xchk_btree_block_keys(bs, level, block);
 				if (level < cur->bc_nlevels - 1)
-					cur->bc_ptrs[level + 1]++;
+					cur->bc_levels[level + 1].ptr++;
 				level++;
 				continue;
 			}
@@ -692,7 +692,7 @@ xchk_btree(
 			xchk_btree_rec(bs);
 
 			/* Call out to the record checker. */
-			recp = xfs_btree_rec_addr(cur, cur->bc_ptrs[0], block);
+			recp = xfs_btree_rec_addr(cur, cur->bc_levels[0].ptr, block);
 			error = bs->scrub_rec(bs, recp);
 			if (error)
 				break;
@@ -700,15 +700,15 @@ xchk_btree(
 			    (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT))
 				break;
 
-			cur->bc_ptrs[level]++;
+			cur->bc_levels[level].ptr++;
 			continue;
 		}
 
 		/* End of node, pop back towards the root. */
-		if (cur->bc_ptrs[level] > be16_to_cpu(block->bb_numrecs)) {
+		if (cur->bc_levels[level].ptr > be16_to_cpu(block->bb_numrecs)) {
 			xchk_btree_block_keys(bs, level, block);
 			if (level < cur->bc_nlevels - 1)
-				cur->bc_ptrs[level + 1]++;
+				cur->bc_levels[level + 1].ptr++;
 			level++;
 			continue;
 		}
@@ -717,9 +717,9 @@ xchk_btree(
 		xchk_btree_key(bs, level);
 
 		/* Drill another level deeper. */
-		pp = xfs_btree_ptr_addr(cur, cur->bc_ptrs[level], block);
+		pp = xfs_btree_ptr_addr(cur, cur->bc_levels[level].ptr, block);
 		if (!xchk_btree_ptr_ok(bs, level, pp)) {
-			cur->bc_ptrs[level]++;
+			cur->bc_levels[level].ptr++;
 			continue;
 		}
 		level--;
@@ -727,7 +727,7 @@ xchk_btree(
 		if (error || !block)
 			goto out;
 
-		cur->bc_ptrs[level] = 1;
+		cur->bc_levels[level].ptr = 1;
 	}
 
 out:
diff --git a/fs/xfs/scrub/trace.c b/fs/xfs/scrub/trace.c
index c0ef53fe6611..816dfc8e5a80 100644
--- a/fs/xfs/scrub/trace.c
+++ b/fs/xfs/scrub/trace.c
@@ -21,10 +21,11 @@ xchk_btree_cur_fsbno(
 	struct xfs_btree_cur	*cur,
 	int			level)
 {
-	if (level < cur->bc_nlevels && cur->bc_bufs[level])
+	if (level < cur->bc_nlevels && cur->bc_levels[level].bp)
 		return XFS_DADDR_TO_FSB(cur->bc_mp,
-				xfs_buf_daddr(cur->bc_bufs[level]));
-	if (level == cur->bc_nlevels - 1 && cur->bc_flags & XFS_BTREE_LONG_PTRS)
+				xfs_buf_daddr(cur->bc_levels[level].bp));
+	else if (level == cur->bc_nlevels - 1 &&
+		 cur->bc_flags & XFS_BTREE_LONG_PTRS)
 		return XFS_INO_TO_FSB(cur->bc_mp, cur->bc_ino.ip->i_ino);
 	if (!(cur->bc_flags & XFS_BTREE_LONG_PTRS))
 		return XFS_AGB_TO_FSB(cur->bc_mp, cur->bc_ag.pag->pag_agno, 0);
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index a7bbb84f91a7..93ece6df02e3 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -348,7 +348,7 @@ TRACE_EVENT(xchk_btree_op_error,
 		__entry->level = level;
 		__entry->agno = XFS_FSB_TO_AGNO(cur->bc_mp, fsbno);
 		__entry->bno = XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno);
-		__entry->ptr = cur->bc_ptrs[level];
+		__entry->ptr = cur->bc_levels[level].ptr;
 		__entry->error = error;
 		__entry->ret_ip = ret_ip;
 	),
@@ -389,7 +389,7 @@ TRACE_EVENT(xchk_ifork_btree_op_error,
 		__entry->type = sc->sm->sm_type;
 		__entry->btnum = cur->bc_btnum;
 		__entry->level = level;
-		__entry->ptr = cur->bc_ptrs[level];
+		__entry->ptr = cur->bc_levels[level].ptr;
 		__entry->agno = XFS_FSB_TO_AGNO(cur->bc_mp, fsbno);
 		__entry->bno = XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno);
 		__entry->error = error;
@@ -431,7 +431,7 @@ TRACE_EVENT(xchk_btree_error,
 		__entry->level = level;
 		__entry->agno = XFS_FSB_TO_AGNO(cur->bc_mp, fsbno);
 		__entry->bno = XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno);
-		__entry->ptr = cur->bc_ptrs[level];
+		__entry->ptr = cur->bc_levels[level].ptr;
 		__entry->ret_ip = ret_ip;
 	),
 	TP_printk("dev %d:%d type %s btree %s level %d ptr %d agno 0x%x agbno 0x%x ret_ip %pS",
@@ -471,7 +471,7 @@ TRACE_EVENT(xchk_ifork_btree_error,
 		__entry->level = level;
 		__entry->agno = XFS_FSB_TO_AGNO(cur->bc_mp, fsbno);
 		__entry->bno = XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno);
-		__entry->ptr = cur->bc_ptrs[level];
+		__entry->ptr = cur->bc_levels[level].ptr;
 		__entry->ret_ip = ret_ip;
 	),
 	TP_printk("dev %d:%d ino 0x%llx fork %s type %s btree %s level %d ptr %d agno 0x%x agbno 0x%x ret_ip %pS",
@@ -511,7 +511,7 @@ DECLARE_EVENT_CLASS(xchk_sbtree_class,
 		__entry->bno = XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno);
 		__entry->level = level;
 		__entry->nlevels = cur->bc_nlevels;
-		__entry->ptr = cur->bc_ptrs[level];
+		__entry->ptr = cur->bc_levels[level].ptr;
 	),
 	TP_printk("dev %d:%d type %s btree %s agno 0x%x agbno 0x%x level %d nlevels %d ptr %d",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index c4e0cd1c1c8c..30bae0657343 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1966,7 +1966,7 @@ xfs_init_zones(void)
 		goto out_destroy_log_ticket_zone;
 
 	xfs_btree_cur_zone = kmem_cache_create("xfs_btree_cur",
-					       sizeof(struct xfs_btree_cur),
+				xfs_btree_cur_sizeof(XFS_BTREE_MAXLEVELS),
 					       0, 0, NULL);
 	if (!xfs_btree_cur_zone)
 		goto out_destroy_bmap_free_item_zone;
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 1033a95fbf8e..4a8076ef8cb4 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -2476,7 +2476,7 @@ DECLARE_EVENT_CLASS(xfs_btree_cur_class,
 		__entry->btnum = cur->bc_btnum;
 		__entry->level = level;
 		__entry->nlevels = cur->bc_nlevels;
-		__entry->ptr = cur->bc_ptrs[level];
+		__entry->ptr = cur->bc_levels[level].ptr;
 		__entry->daddr = bp ? xfs_buf_daddr(bp) : -1;
 	),
 	TP_printk("dev %d:%d btree %s level %d/%d ptr %d daddr 0x%llx",


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH 08/14] xfs: refactor btree cursor allocation function
  2021-09-18  1:29 [PATCHSET RFC chandan 00/14] xfs: support dynamic btree cursor height Darrick J. Wong
                   ` (6 preceding siblings ...)
  2021-09-18  1:29 ` [PATCH 07/14] xfs: support dynamic btree cursor heights Darrick J. Wong
@ 2021-09-18  1:29 ` Darrick J. Wong
  2021-09-20  9:55   ` Chandan Babu R
  2021-09-21  8:53   ` Christoph Hellwig
  2021-09-18  1:29 ` [PATCH 09/14] xfs: fix maxlevels comparisons in the btree staging code Darrick J. Wong
                   ` (5 subsequent siblings)
  13 siblings, 2 replies; 48+ messages in thread
From: Darrick J. Wong @ 2021-09-18  1:29 UTC (permalink / raw)
  To: djwong, chandan.babu, chandanrlinux; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Refactor btree cursor allocation into a common helper.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_alloc_btree.c    |    7 +------
 fs/xfs/libxfs/xfs_bmap_btree.c     |    7 +------
 fs/xfs/libxfs/xfs_btree.c          |   18 ++++++++++++++++++
 fs/xfs/libxfs/xfs_btree.h          |    2 ++
 fs/xfs/libxfs/xfs_ialloc_btree.c   |    7 +------
 fs/xfs/libxfs/xfs_refcount_btree.c |    6 +-----
 fs/xfs/libxfs/xfs_rmap_btree.c     |    6 +-----
 7 files changed, 25 insertions(+), 28 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_alloc_btree.c b/fs/xfs/libxfs/xfs_alloc_btree.c
index 6746fd735550..c644b11132f6 100644
--- a/fs/xfs/libxfs/xfs_alloc_btree.c
+++ b/fs/xfs/libxfs/xfs_alloc_btree.c
@@ -477,12 +477,7 @@ xfs_allocbt_init_common(
 
 	ASSERT(btnum == XFS_BTNUM_BNO || btnum == XFS_BTNUM_CNT);
 
-	cur = kmem_cache_zalloc(xfs_btree_cur_zone, GFP_NOFS | __GFP_NOFAIL);
-
-	cur->bc_tp = tp;
-	cur->bc_mp = mp;
-	cur->bc_btnum = btnum;
-	cur->bc_blocklog = mp->m_sb.sb_blocklog;
+	cur = xfs_btree_alloc_cursor(mp, tp, btnum);
 	cur->bc_ag.abt.active = false;
 
 	if (btnum == XFS_BTNUM_CNT) {
diff --git a/fs/xfs/libxfs/xfs_bmap_btree.c b/fs/xfs/libxfs/xfs_bmap_btree.c
index 72444b8b38a6..a06987e36db5 100644
--- a/fs/xfs/libxfs/xfs_bmap_btree.c
+++ b/fs/xfs/libxfs/xfs_bmap_btree.c
@@ -552,13 +552,8 @@ xfs_bmbt_init_cursor(
 	struct xfs_btree_cur	*cur;
 	ASSERT(whichfork != XFS_COW_FORK);
 
-	cur = kmem_cache_zalloc(xfs_btree_cur_zone, GFP_NOFS | __GFP_NOFAIL);
-
-	cur->bc_tp = tp;
-	cur->bc_mp = mp;
+	cur = xfs_btree_alloc_cursor(mp, tp, XFS_BTNUM_BMAP);
 	cur->bc_nlevels = be16_to_cpu(ifp->if_broot->bb_level) + 1;
-	cur->bc_btnum = XFS_BTNUM_BMAP;
-	cur->bc_blocklog = mp->m_sb.sb_blocklog;
 	cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_bmbt_2);
 
 	cur->bc_ops = &xfs_bmbt_ops;
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index 93fb50516bc2..70785004414e 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -4926,3 +4926,21 @@ xfs_btree_has_more_records(
 	else
 		return block->bb_u.s.bb_rightsib != cpu_to_be32(NULLAGBLOCK);
 }
+
+/* Allocate a new btree cursor of the appropriate size. */
+struct xfs_btree_cur *
+xfs_btree_alloc_cursor(
+	struct xfs_mount	*mp,
+	struct xfs_trans	*tp,
+	xfs_btnum_t		btnum)
+{
+	struct xfs_btree_cur	*cur;
+
+	cur = kmem_cache_zalloc(xfs_btree_cur_zone, GFP_NOFS | __GFP_NOFAIL);
+	cur->bc_tp = tp;
+	cur->bc_mp = mp;
+	cur->bc_btnum = btnum;
+	cur->bc_blocklog = mp->m_sb.sb_blocklog;
+
+	return cur;
+}
diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index 827c44bf24dc..6540c4957c36 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -573,5 +573,7 @@ void xfs_btree_copy_ptrs(struct xfs_btree_cur *cur,
 void xfs_btree_copy_keys(struct xfs_btree_cur *cur,
 		union xfs_btree_key *dst_key,
 		const union xfs_btree_key *src_key, int numkeys);
+struct xfs_btree_cur *xfs_btree_alloc_cursor(struct xfs_mount *mp,
+		struct xfs_trans *tp, xfs_btnum_t btnum);
 
 #endif	/* __XFS_BTREE_H__ */
diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
index 27190840c5d8..c8fea6a464d5 100644
--- a/fs/xfs/libxfs/xfs_ialloc_btree.c
+++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
@@ -432,10 +432,7 @@ xfs_inobt_init_common(
 {
 	struct xfs_btree_cur	*cur;
 
-	cur = kmem_cache_zalloc(xfs_btree_cur_zone, GFP_NOFS | __GFP_NOFAIL);
-	cur->bc_tp = tp;
-	cur->bc_mp = mp;
-	cur->bc_btnum = btnum;
+	cur = xfs_btree_alloc_cursor(mp, tp, btnum);
 	if (btnum == XFS_BTNUM_INO) {
 		cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_ibt_2);
 		cur->bc_ops = &xfs_inobt_ops;
@@ -444,8 +441,6 @@ xfs_inobt_init_common(
 		cur->bc_ops = &xfs_finobt_ops;
 	}
 
-	cur->bc_blocklog = mp->m_sb.sb_blocklog;
-
 	if (xfs_has_crc(mp))
 		cur->bc_flags |= XFS_BTREE_CRC_BLOCKS;
 
diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c
index 1ef9b99962ab..48c45e31d897 100644
--- a/fs/xfs/libxfs/xfs_refcount_btree.c
+++ b/fs/xfs/libxfs/xfs_refcount_btree.c
@@ -322,11 +322,7 @@ xfs_refcountbt_init_common(
 
 	ASSERT(pag->pag_agno < mp->m_sb.sb_agcount);
 
-	cur = kmem_cache_zalloc(xfs_btree_cur_zone, GFP_NOFS | __GFP_NOFAIL);
-	cur->bc_tp = tp;
-	cur->bc_mp = mp;
-	cur->bc_btnum = XFS_BTNUM_REFC;
-	cur->bc_blocklog = mp->m_sb.sb_blocklog;
+	cur = xfs_btree_alloc_cursor(mp, tp, XFS_BTNUM_REFC);
 	cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_refcbt_2);
 
 	cur->bc_flags |= XFS_BTREE_CRC_BLOCKS;
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
index b7dbbfb3aeed..f3c4d0965cc9 100644
--- a/fs/xfs/libxfs/xfs_rmap_btree.c
+++ b/fs/xfs/libxfs/xfs_rmap_btree.c
@@ -451,13 +451,9 @@ xfs_rmapbt_init_common(
 {
 	struct xfs_btree_cur	*cur;
 
-	cur = kmem_cache_zalloc(xfs_btree_cur_zone, GFP_NOFS | __GFP_NOFAIL);
-	cur->bc_tp = tp;
-	cur->bc_mp = mp;
 	/* Overlapping btree; 2 keys per pointer. */
-	cur->bc_btnum = XFS_BTNUM_RMAP;
+	cur = xfs_btree_alloc_cursor(mp, tp, XFS_BTNUM_RMAP);
 	cur->bc_flags = XFS_BTREE_CRC_BLOCKS | XFS_BTREE_OVERLAPPING;
-	cur->bc_blocklog = mp->m_sb.sb_blocklog;
 	cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_rmap_2);
 	cur->bc_ops = &xfs_rmapbt_ops;
 



* [PATCH 09/14] xfs: fix maxlevels comparisons in the btree staging code
  2021-09-18  1:29 [PATCHSET RFC chandan 00/14] xfs: support dynamic btree cursor height Darrick J. Wong
                   ` (7 preceding siblings ...)
  2021-09-18  1:29 ` [PATCH 08/14] xfs: refactor btree cursor allocation function Darrick J. Wong
@ 2021-09-18  1:29 ` Darrick J. Wong
  2021-09-20  9:55   ` Chandan Babu R
  2021-09-21  8:56   ` Christoph Hellwig
  2021-09-18  1:30 ` [PATCH 10/14] xfs: encode the max btree height in the cursor Darrick J. Wong
                   ` (4 subsequent siblings)
  13 siblings, 2 replies; 48+ messages in thread
From: Darrick J. Wong @ 2021-09-18  1:29 UTC (permalink / raw)
  To: djwong, chandan.babu, chandanrlinux; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

The btree geometry computation function has an off-by-one error in that
it does not allow maximally tall btrees (nlevels == XFS_BTREE_MAXLEVELS).
This can result in repairs failing unnecessarily on very fragmented
filesystems.  Subsequent patches to remove MAXLEVELS usage in favor of
the per-btree type computations will make this a much more likely
occurrence.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_btree_staging.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_btree_staging.c b/fs/xfs/libxfs/xfs_btree_staging.c
index 26143297bb7b..cc56efc2b90a 100644
--- a/fs/xfs/libxfs/xfs_btree_staging.c
+++ b/fs/xfs/libxfs/xfs_btree_staging.c
@@ -662,7 +662,7 @@ xfs_btree_bload_compute_geometry(
 	xfs_btree_bload_ensure_slack(cur, &bbl->node_slack, 1);
 
 	bbl->nr_records = nr_this_level = nr_records;
-	for (cur->bc_nlevels = 1; cur->bc_nlevels < XFS_BTREE_MAXLEVELS;) {
+	for (cur->bc_nlevels = 1; cur->bc_nlevels <= XFS_BTREE_MAXLEVELS;) {
 		uint64_t	level_blocks;
 		uint64_t	dontcare64;
 		unsigned int	level = cur->bc_nlevels - 1;
@@ -726,7 +726,7 @@ xfs_btree_bload_compute_geometry(
 		nr_this_level = level_blocks;
 	}
 
-	if (cur->bc_nlevels == XFS_BTREE_MAXLEVELS)
+	if (cur->bc_nlevels > XFS_BTREE_MAXLEVELS)
 		return -EOVERFLOW;
 
 	bbl->btree_height = cur->bc_nlevels;



* [PATCH 10/14] xfs: encode the max btree height in the cursor
  2021-09-18  1:29 [PATCHSET RFC chandan 00/14] xfs: support dynamic btree cursor height Darrick J. Wong
                   ` (8 preceding siblings ...)
  2021-09-18  1:29 ` [PATCH 09/14] xfs: fix maxlevels comparisons in the btree staging code Darrick J. Wong
@ 2021-09-18  1:30 ` Darrick J. Wong
  2021-09-20  9:55   ` Chandan Babu R
  2021-09-21  8:57   ` Christoph Hellwig
  2021-09-18  1:30 ` [PATCH 11/14] xfs: dynamically allocate cursors based on maxlevels Darrick J. Wong
                   ` (3 subsequent siblings)
  13 siblings, 2 replies; 48+ messages in thread
From: Darrick J. Wong @ 2021-09-18  1:30 UTC (permalink / raw)
  To: djwong, chandan.babu, chandanrlinux; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Encode the maximum btree height in the cursor, since we're soon going to
allow smaller cursors for AG btrees and larger cursors for file btrees.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_bmap.c          |    2 +-
 fs/xfs/libxfs/xfs_btree.c         |    5 +++--
 fs/xfs/libxfs/xfs_btree.h         |    3 ++-
 fs/xfs/libxfs/xfs_btree_staging.c |   10 +++++-----
 4 files changed, 11 insertions(+), 9 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 644b956301b6..2ae5bf9a74e7 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -239,7 +239,7 @@ xfs_bmap_get_bp(
 	if (!cur)
 		return NULL;
 
-	for (i = 0; i < XFS_BTREE_MAXLEVELS; i++) {
+	for (i = 0; i < cur->bc_maxlevels; i++) {
 		if (!cur->bc_levels[i].bp)
 			break;
 		if (xfs_buf_daddr(cur->bc_levels[i].bp) == bno)
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index 70785004414e..2486ba22c01d 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -2933,7 +2933,7 @@ xfs_btree_new_iroot(
 	be16_add_cpu(&block->bb_level, 1);
 	xfs_btree_set_numrecs(block, 1);
 	cur->bc_nlevels++;
-	ASSERT(cur->bc_nlevels <= XFS_BTREE_MAXLEVELS);
+	ASSERT(cur->bc_nlevels <= cur->bc_maxlevels);
 	cur->bc_levels[level + 1].ptr = 1;
 
 	kp = xfs_btree_key_addr(cur, 1, block);
@@ -3097,7 +3097,7 @@ xfs_btree_new_root(
 	xfs_btree_setbuf(cur, cur->bc_nlevels, nbp);
 	cur->bc_levels[cur->bc_nlevels].ptr = nptr;
 	cur->bc_nlevels++;
-	ASSERT(cur->bc_nlevels <= XFS_BTREE_MAXLEVELS);
+	ASSERT(cur->bc_nlevels <= cur->bc_maxlevels);
 	*stat = 1;
 	return 0;
 error0:
@@ -4941,6 +4941,7 @@ xfs_btree_alloc_cursor(
 	cur->bc_mp = mp;
 	cur->bc_btnum = btnum;
 	cur->bc_blocklog = mp->m_sb.sb_blocklog;
+	cur->bc_maxlevels = XFS_BTREE_MAXLEVELS;
 
 	return cur;
 }
diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index 6540c4957c36..6075918efa0c 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -235,9 +235,10 @@ struct xfs_btree_cur
 	struct xfs_mount	*bc_mp;	/* file system mount struct */
 	const struct xfs_btree_ops *bc_ops;
 	uint			bc_flags; /* btree features - below */
-	union xfs_btree_irec	bc_rec;	/* current insert/search record value */
+	uint8_t		bc_maxlevels;	/* maximum levels for this btree type */
 	uint8_t		bc_nlevels;	/* number of levels in the tree */
 	uint8_t		bc_blocklog;	/* log2(blocksize) of btree blocks */
+	union xfs_btree_irec	bc_rec;	/* current insert/search record value */
 	xfs_btnum_t	bc_btnum;	/* identifies which btree type */
 	int		bc_statoff;	/* offset of btre stats array */
 
diff --git a/fs/xfs/libxfs/xfs_btree_staging.c b/fs/xfs/libxfs/xfs_btree_staging.c
index cc56efc2b90a..dd75e208b543 100644
--- a/fs/xfs/libxfs/xfs_btree_staging.c
+++ b/fs/xfs/libxfs/xfs_btree_staging.c
@@ -657,12 +657,12 @@ xfs_btree_bload_compute_geometry(
 	 * checking levels 0 and 1 here, so set bc_nlevels such that the btree
 	 * code doesn't interpret either as the root level.
 	 */
-	cur->bc_nlevels = XFS_BTREE_MAXLEVELS - 1;
+	cur->bc_nlevels = cur->bc_maxlevels - 1;
 	xfs_btree_bload_ensure_slack(cur, &bbl->leaf_slack, 0);
 	xfs_btree_bload_ensure_slack(cur, &bbl->node_slack, 1);
 
 	bbl->nr_records = nr_this_level = nr_records;
-	for (cur->bc_nlevels = 1; cur->bc_nlevels <= XFS_BTREE_MAXLEVELS;) {
+	for (cur->bc_nlevels = 1; cur->bc_nlevels <= cur->bc_maxlevels;) {
 		uint64_t	level_blocks;
 		uint64_t	dontcare64;
 		unsigned int	level = cur->bc_nlevels - 1;
@@ -703,7 +703,7 @@ xfs_btree_bload_compute_geometry(
 			 * block-based btree level.
 			 */
 			cur->bc_nlevels++;
-			ASSERT(cur->bc_nlevels <= XFS_BTREE_MAXLEVELS);
+			ASSERT(cur->bc_nlevels <= cur->bc_maxlevels);
 			xfs_btree_bload_level_geometry(cur, bbl, level,
 					nr_this_level, &avg_per_block,
 					&level_blocks, &dontcare64);
@@ -719,14 +719,14 @@ xfs_btree_bload_compute_geometry(
 
 			/* Otherwise, we need another level of btree. */
 			cur->bc_nlevels++;
-			ASSERT(cur->bc_nlevels <= XFS_BTREE_MAXLEVELS);
+			ASSERT(cur->bc_nlevels <= cur->bc_maxlevels);
 		}
 
 		nr_blocks += level_blocks;
 		nr_this_level = level_blocks;
 	}
 
-	if (cur->bc_nlevels > XFS_BTREE_MAXLEVELS)
+	if (cur->bc_nlevels > cur->bc_maxlevels)
 		return -EOVERFLOW;
 
 	bbl->btree_height = cur->bc_nlevels;



* [PATCH 11/14] xfs: dynamically allocate cursors based on maxlevels
  2021-09-18  1:29 [PATCHSET RFC chandan 00/14] xfs: support dynamic btree cursor height Darrick J. Wong
                   ` (9 preceding siblings ...)
  2021-09-18  1:30 ` [PATCH 10/14] xfs: encode the max btree height in the cursor Darrick J. Wong
@ 2021-09-18  1:30 ` Darrick J. Wong
  2021-09-20  9:56   ` Chandan Babu R
  2021-09-20 23:06   ` Dave Chinner
  2021-09-18  1:30 ` [PATCH 12/14] xfs: compute actual maximum btree height for critical reservation calculation Darrick J. Wong
                   ` (2 subsequent siblings)
  13 siblings, 2 replies; 48+ messages in thread
From: Darrick J. Wong @ 2021-09-18  1:30 UTC (permalink / raw)
  To: djwong, chandan.babu, chandanrlinux; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Replace the statically-sized btree cursor zone with dynamically sized
allocations so that we can reduce the memory overhead for per-AG btree
cursors while handling very tall btrees for rt metadata.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_btree.c |   40 ++++++++++++++++++++++++++++++++--------
 fs/xfs/libxfs/xfs_btree.h |    2 --
 fs/xfs/xfs_super.c        |   11 +----------
 3 files changed, 33 insertions(+), 20 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index 2486ba22c01d..f9516828a847 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -23,11 +23,6 @@
 #include "xfs_btree_staging.h"
 #include "xfs_ag.h"
 
-/*
- * Cursor allocation zone.
- */
-kmem_zone_t	*xfs_btree_cur_zone;
-
 /*
  * Btree magic numbers.
  */
@@ -379,7 +374,7 @@ xfs_btree_del_cursor(
 		kmem_free(cur->bc_ops);
 	if (!(cur->bc_flags & XFS_BTREE_LONG_PTRS) && cur->bc_ag.pag)
 		xfs_perag_put(cur->bc_ag.pag);
-	kmem_cache_free(xfs_btree_cur_zone, cur);
+	kmem_free(cur);
 }
 
 /*
@@ -4927,6 +4922,32 @@ xfs_btree_has_more_records(
 		return block->bb_u.s.bb_rightsib != cpu_to_be32(NULLAGBLOCK);
 }
 
+/* Compute the maximum allowed height for a given btree type. */
+static unsigned int
+xfs_btree_maxlevels(
+	struct xfs_mount	*mp,
+	xfs_btnum_t		btnum)
+{
+	switch (btnum) {
+	case XFS_BTNUM_BNO:
+	case XFS_BTNUM_CNT:
+		return mp->m_ag_maxlevels;
+	case XFS_BTNUM_BMAP:
+		return max(mp->m_bm_maxlevels[XFS_DATA_FORK],
+			   mp->m_bm_maxlevels[XFS_ATTR_FORK]);
+	case XFS_BTNUM_INO:
+	case XFS_BTNUM_FINO:
+		return M_IGEO(mp)->inobt_maxlevels;
+	case XFS_BTNUM_RMAP:
+		return mp->m_rmap_maxlevels;
+	case XFS_BTNUM_REFC:
+		return mp->m_refc_maxlevels;
+	default:
+		ASSERT(0);
+		return XFS_BTREE_MAXLEVELS;
+	}
+}
+
 /* Allocate a new btree cursor of the appropriate size. */
 struct xfs_btree_cur *
 xfs_btree_alloc_cursor(
@@ -4935,13 +4956,16 @@ xfs_btree_alloc_cursor(
 	xfs_btnum_t		btnum)
 {
 	struct xfs_btree_cur	*cur;
+	unsigned int		maxlevels = xfs_btree_maxlevels(mp, btnum);
 
-	cur = kmem_cache_zalloc(xfs_btree_cur_zone, GFP_NOFS | __GFP_NOFAIL);
+	ASSERT(maxlevels <= XFS_BTREE_MAXLEVELS);
+
+	cur = kmem_zalloc(xfs_btree_cur_sizeof(maxlevels), KM_NOFS);
 	cur->bc_tp = tp;
 	cur->bc_mp = mp;
 	cur->bc_btnum = btnum;
 	cur->bc_blocklog = mp->m_sb.sb_blocklog;
-	cur->bc_maxlevels = XFS_BTREE_MAXLEVELS;
+	cur->bc_maxlevels = maxlevels;
 
 	return cur;
 }
diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index 6075918efa0c..ae83fbf58c18 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -13,8 +13,6 @@ struct xfs_trans;
 struct xfs_ifork;
 struct xfs_perag;
 
-extern kmem_zone_t	*xfs_btree_cur_zone;
-
 /*
  * Generic key, ptr and record wrapper structures.
  *
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 30bae0657343..25a548bbb0b2 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1965,17 +1965,11 @@ xfs_init_zones(void)
 	if (!xfs_bmap_free_item_zone)
 		goto out_destroy_log_ticket_zone;
 
-	xfs_btree_cur_zone = kmem_cache_create("xfs_btree_cur",
-				xfs_btree_cur_sizeof(XFS_BTREE_MAXLEVELS),
-					       0, 0, NULL);
-	if (!xfs_btree_cur_zone)
-		goto out_destroy_bmap_free_item_zone;
-
 	xfs_da_state_zone = kmem_cache_create("xfs_da_state",
 					      sizeof(struct xfs_da_state),
 					      0, 0, NULL);
 	if (!xfs_da_state_zone)
-		goto out_destroy_btree_cur_zone;
+		goto out_destroy_bmap_free_item_zone;
 
 	xfs_ifork_zone = kmem_cache_create("xfs_ifork",
 					   sizeof(struct xfs_ifork),
@@ -2105,8 +2099,6 @@ xfs_init_zones(void)
 	kmem_cache_destroy(xfs_ifork_zone);
  out_destroy_da_state_zone:
 	kmem_cache_destroy(xfs_da_state_zone);
- out_destroy_btree_cur_zone:
-	kmem_cache_destroy(xfs_btree_cur_zone);
  out_destroy_bmap_free_item_zone:
 	kmem_cache_destroy(xfs_bmap_free_item_zone);
  out_destroy_log_ticket_zone:
@@ -2138,7 +2130,6 @@ xfs_destroy_zones(void)
 	kmem_cache_destroy(xfs_trans_zone);
 	kmem_cache_destroy(xfs_ifork_zone);
 	kmem_cache_destroy(xfs_da_state_zone);
-	kmem_cache_destroy(xfs_btree_cur_zone);
 	kmem_cache_destroy(xfs_bmap_free_item_zone);
 	kmem_cache_destroy(xfs_log_ticket_zone);
 }



* [PATCH 12/14] xfs: compute actual maximum btree height for critical reservation calculation
  2021-09-18  1:29 [PATCHSET RFC chandan 00/14] xfs: support dynamic btree cursor height Darrick J. Wong
                   ` (10 preceding siblings ...)
  2021-09-18  1:30 ` [PATCH 11/14] xfs: dynamically allocate cursors based on maxlevels Darrick J. Wong
@ 2021-09-18  1:30 ` Darrick J. Wong
  2021-09-20  9:56   ` Chandan Babu R
  2021-09-18  1:30 ` [PATCH 13/14] xfs: compute the maximum height of the rmap btree when reflink enabled Darrick J. Wong
  2021-09-18  1:30 ` [PATCH 14/14] xfs: kill XFS_BTREE_MAXLEVELS Darrick J. Wong
  13 siblings, 1 reply; 48+ messages in thread
From: Darrick J. Wong @ 2021-09-18  1:30 UTC (permalink / raw)
  To: djwong, chandan.babu, chandanrlinux; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Compute the actual maximum btree height when deciding if per-AG block
reservation is critically low.  This only affects the sanity check
condition, since we /generally/ will trigger on the 10% threshold.
This is a long-winded way of saying that we're removing one more
usage of XFS_BTREE_MAXLEVELS.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_ag_resv.c |    4 +++-
 fs/xfs/libxfs/xfs_btree.c   |   19 +++++++++++++++----
 fs/xfs/libxfs/xfs_btree.h   |    1 +
 3 files changed, 19 insertions(+), 5 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_ag_resv.c b/fs/xfs/libxfs/xfs_ag_resv.c
index 2aa2b3484c28..931481fbdd72 100644
--- a/fs/xfs/libxfs/xfs_ag_resv.c
+++ b/fs/xfs/libxfs/xfs_ag_resv.c
@@ -72,6 +72,7 @@ xfs_ag_resv_critical(
 {
 	xfs_extlen_t			avail;
 	xfs_extlen_t			orig;
+	xfs_extlen_t			btree_maxlevels;
 
 	switch (type) {
 	case XFS_AG_RESV_METADATA:
@@ -91,7 +92,8 @@ xfs_ag_resv_critical(
 	trace_xfs_ag_resv_critical(pag, type, avail);
 
 	/* Critically low if less than 10% or max btree height remains. */
-	return XFS_TEST_ERROR(avail < orig / 10 || avail < XFS_BTREE_MAXLEVELS,
+	btree_maxlevels = xfs_btree_maxlevels(pag->pag_mount, XFS_BTNUM_MAX);
+	return XFS_TEST_ERROR(avail < orig / 10 || avail < btree_maxlevels,
 			pag->pag_mount, XFS_ERRTAG_AG_RESV_CRITICAL);
 }
 
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index f9516828a847..6cf49f7e1299 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -4922,12 +4922,17 @@ xfs_btree_has_more_records(
 		return block->bb_u.s.bb_rightsib != cpu_to_be32(NULLAGBLOCK);
 }
 
-/* Compute the maximum allowed height for a given btree type. */
-static unsigned int
+/*
+ * Compute the maximum allowed height for a given btree type.  If XFS_BTNUM_MAX
+ * is passed in, the maximum allowed height for all btree types is returned.
+ */
+unsigned int
 xfs_btree_maxlevels(
 	struct xfs_mount	*mp,
 	xfs_btnum_t		btnum)
 {
+	unsigned int		ret;
+
 	switch (btnum) {
 	case XFS_BTNUM_BNO:
 	case XFS_BTNUM_CNT:
@@ -4943,9 +4948,15 @@ xfs_btree_maxlevels(
 	case XFS_BTNUM_REFC:
 		return mp->m_refc_maxlevels;
 	default:
-		ASSERT(0);
-		return XFS_BTREE_MAXLEVELS;
+		break;
 	}
+
+	ret = mp->m_ag_maxlevels;
+	ret = max(ret, mp->m_bm_maxlevels[XFS_DATA_FORK]);
+	ret = max(ret, mp->m_bm_maxlevels[XFS_ATTR_FORK]);
+	ret = max(ret, M_IGEO(mp)->inobt_maxlevels);
+	ret = max(ret, mp->m_rmap_maxlevels);
+	return max(ret, mp->m_refc_maxlevels);
 }
 
 /* Allocate a new btree cursor of the appropriate size. */
diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index ae83fbf58c18..106760c540c7 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -574,5 +574,6 @@ void xfs_btree_copy_keys(struct xfs_btree_cur *cur,
 		const union xfs_btree_key *src_key, int numkeys);
 struct xfs_btree_cur *xfs_btree_alloc_cursor(struct xfs_mount *mp,
 		struct xfs_trans *tp, xfs_btnum_t btnum);
+unsigned int xfs_btree_maxlevels(struct xfs_mount *mp, xfs_btnum_t btnum);
 
 #endif	/* __XFS_BTREE_H__ */



* [PATCH 13/14] xfs: compute the maximum height of the rmap btree when reflink enabled
  2021-09-18  1:29 [PATCHSET RFC chandan 00/14] xfs: support dynamic btree cursor height Darrick J. Wong
                   ` (11 preceding siblings ...)
  2021-09-18  1:30 ` [PATCH 12/14] xfs: compute actual maximum btree height for critical reservation calculation Darrick J. Wong
@ 2021-09-18  1:30 ` Darrick J. Wong
  2021-09-20  9:56   ` Chandan Babu R
  2021-09-18  1:30 ` [PATCH 14/14] xfs: kill XFS_BTREE_MAXLEVELS Darrick J. Wong
  13 siblings, 1 reply; 48+ messages in thread
From: Darrick J. Wong @ 2021-09-18  1:30 UTC (permalink / raw)
  To: djwong, chandan.babu, chandanrlinux; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Instead of assuming that the hardcoded XFS_BTREE_MAXLEVELS value is big
enough to handle the maximally tall rmap btree when all blocks are in
use and maximally shared, let's compute the maximum height assuming the
rmapbt consumes as many blocks as possible.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_btree.c       |   34 +++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_btree.h       |    2 ++
 fs/xfs/libxfs/xfs_rmap_btree.c  |   40 ++++++++++++++++++++-------------------
 fs/xfs/libxfs/xfs_rmap_btree.h  |    2 +-
 fs/xfs/libxfs/xfs_trans_resv.c  |   12 ++++++++++++
 fs/xfs/libxfs/xfs_trans_space.h |    7 +++++++
 fs/xfs/xfs_mount.c              |    2 +-
 7 files changed, 78 insertions(+), 21 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index 6cf49f7e1299..005bc42cf0bd 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -4526,6 +4526,40 @@ xfs_btree_compute_maxlevels(
 	return level;
 }
 
+/*
+ * Compute the maximum height of a btree that is allowed to consume up to the
+ * given number of blocks.
+ */
+unsigned int
+xfs_btree_compute_maxlevels_size(
+	unsigned long long	max_btblocks,
+	unsigned int		leaf_mnr)
+{
+	unsigned long long	leaf_blocks = leaf_mnr;
+	unsigned long long	blocks_left;
+	unsigned int		maxlevels;
+
+	if (max_btblocks < 1)
+		return 0;
+
+	/*
+	 * The loop increments maxlevels as long as there would be enough
+	 * blocks left in the reservation to handle each node block at the
+	 * current level pointing to the minimum possible number of leaf blocks
+	 * at the next level down.  We start the loop assuming a single-level
+	 * btree consuming one block.
+	 */
+	maxlevels = 1;
+	blocks_left = max_btblocks - 1;
+	while (leaf_blocks < blocks_left) {
+		maxlevels++;
+		blocks_left -= leaf_blocks;
+		leaf_blocks *= leaf_mnr;
+	}
+
+	return maxlevels;
+}
+
 /*
  * Query a regular btree for all records overlapping a given interval.
  * Start with a LE lookup of the key of low_rec and return all records
diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index 106760c540c7..d256d869f0af 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -476,6 +476,8 @@ xfs_failaddr_t xfs_btree_lblock_verify(struct xfs_buf *bp,
 		unsigned int max_recs);
 
 uint xfs_btree_compute_maxlevels(uint *limits, unsigned long len);
+unsigned int xfs_btree_compute_maxlevels_size(unsigned long long max_btblocks,
+		unsigned int leaf_mnr);
 unsigned long long xfs_btree_calc_size(uint *limits, unsigned long long len);
 
 /*
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
index f3c4d0965cc9..85caeb14e4db 100644
--- a/fs/xfs/libxfs/xfs_rmap_btree.c
+++ b/fs/xfs/libxfs/xfs_rmap_btree.c
@@ -535,30 +535,32 @@ xfs_rmapbt_maxrecs(
 }
 
 /* Compute the maximum height of an rmap btree. */
-void
+unsigned int
 xfs_rmapbt_compute_maxlevels(
-	struct xfs_mount		*mp)
+	struct xfs_mount	*mp)
 {
+	if (!xfs_has_reflink(mp)) {
+		/*
+		 * If there's no block sharing, compute the maximum rmapbt
+		 * height assuming one rmap record per AG block.
+		 */
+		return xfs_btree_compute_maxlevels(mp->m_rmap_mnr,
+				mp->m_sb.sb_agblocks);
+	}
+
 	/*
-	 * On a non-reflink filesystem, the maximum number of rmap
-	 * records is the number of blocks in the AG, hence the max
-	 * rmapbt height is log_$maxrecs($agblocks).  However, with
-	 * reflink each AG block can have up to 2^32 (per the refcount
-	 * record format) owners, which means that theoretically we
-	 * could face up to 2^64 rmap records.
+	 * Compute the asymptotic maxlevels for an rmapbt on a reflink fs.
 	 *
-	 * That effectively means that the max rmapbt height must be
-	 * XFS_BTREE_MAXLEVELS.  "Fortunately" we'll run out of AG
-	 * blocks to feed the rmapbt long before the rmapbt reaches
-	 * maximum height.  The reflink code uses ag_resv_critical to
-	 * disallow reflinking when less than 10% of the per-AG metadata
-	 * block reservation since the fallback is a regular file copy.
+	 * On a reflink filesystem, each AG block can have up to 2^32 (per the
+	 * refcount record format) owners, which means that theoretically we
+	 * could face up to 2^64 rmap records.  However, we're likely to run
+	 * out of blocks in the AG long before that happens, which means that
+	 * we must compute the max height based on what the btree will look
+	 * like if it consumes almost all the blocks in the AG due to maximal
+	 * sharing factor.
 	 */
-	if (xfs_has_reflink(mp))
-		mp->m_rmap_maxlevels = XFS_BTREE_MAXLEVELS;
-	else
-		mp->m_rmap_maxlevels = xfs_btree_compute_maxlevels(
-				mp->m_rmap_mnr, mp->m_sb.sb_agblocks);
+	return xfs_btree_compute_maxlevels_size(mp->m_sb.sb_agblocks,
+			mp->m_rmap_mnr[1]);
 }
 
 /* Calculate the refcount btree size for some records. */
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.h b/fs/xfs/libxfs/xfs_rmap_btree.h
index f2eee6572af4..5aaecf755abd 100644
--- a/fs/xfs/libxfs/xfs_rmap_btree.h
+++ b/fs/xfs/libxfs/xfs_rmap_btree.h
@@ -49,7 +49,7 @@ struct xfs_btree_cur *xfs_rmapbt_stage_cursor(struct xfs_mount *mp,
 void xfs_rmapbt_commit_staged_btree(struct xfs_btree_cur *cur,
 		struct xfs_trans *tp, struct xfs_buf *agbp);
 int xfs_rmapbt_maxrecs(int blocklen, int leaf);
-extern void xfs_rmapbt_compute_maxlevels(struct xfs_mount *mp);
+unsigned int xfs_rmapbt_compute_maxlevels(struct xfs_mount *mp);
 
 extern xfs_extlen_t xfs_rmapbt_calc_size(struct xfs_mount *mp,
 		unsigned long long len);
diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
index 5e300daa2559..679f10e08f31 100644
--- a/fs/xfs/libxfs/xfs_trans_resv.c
+++ b/fs/xfs/libxfs/xfs_trans_resv.c
@@ -814,6 +814,15 @@ xfs_trans_resv_calc(
 	struct xfs_mount	*mp,
 	struct xfs_trans_resv	*resp)
 {
+	unsigned int		rmap_maxlevels = mp->m_rmap_maxlevels;
+
+	/*
+	 * In the early days of rmap+reflink, we hardcoded the rmap maxlevels
+	 * to 9 even if the AG size was smaller.
+	 */
+	if (xfs_has_rmapbt(mp) && xfs_has_reflink(mp))
+		mp->m_rmap_maxlevels = XFS_OLD_REFLINK_RMAP_MAXLEVELS;
+
 	/*
 	 * The following transactions are logged in physical format and
 	 * require a permanent reservation on space.
@@ -916,4 +925,7 @@ xfs_trans_resv_calc(
 	resp->tr_clearagi.tr_logres = xfs_calc_clear_agi_bucket_reservation(mp);
 	resp->tr_growrtzero.tr_logres = xfs_calc_growrtzero_reservation(mp);
 	resp->tr_growrtfree.tr_logres = xfs_calc_growrtfree_reservation(mp);
+
+	/* Put everything back the way it was.  This goes at the end. */
+	mp->m_rmap_maxlevels = rmap_maxlevels;
 }
diff --git a/fs/xfs/libxfs/xfs_trans_space.h b/fs/xfs/libxfs/xfs_trans_space.h
index 50332be34388..440c9c390b86 100644
--- a/fs/xfs/libxfs/xfs_trans_space.h
+++ b/fs/xfs/libxfs/xfs_trans_space.h
@@ -17,6 +17,13 @@
 /* Adding one rmap could split every level up to the top of the tree. */
 #define XFS_RMAPADD_SPACE_RES(mp) ((mp)->m_rmap_maxlevels)
 
+/*
+ * Note that we historically set m_rmap_maxlevels to 9 when reflink was
+ * enabled, so we must preserve this behavior to avoid changing the transaction
+ * space reservations.
+ */
+#define XFS_OLD_REFLINK_RMAP_MAXLEVELS	(9)
+
 /* Blocks we might need to add "b" rmaps to a tree. */
 #define XFS_NRMAPADD_SPACE_RES(mp, b)\
 	(((b + XFS_MAX_CONTIG_RMAPS_PER_BLOCK(mp) - 1) / \
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 06dac09eddbd..e600a0b781c8 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -635,7 +635,7 @@ xfs_mountfs(
 	xfs_bmap_compute_maxlevels(mp, XFS_DATA_FORK);
 	xfs_bmap_compute_maxlevels(mp, XFS_ATTR_FORK);
 	xfs_mount_setup_inode_geom(mp);
-	xfs_rmapbt_compute_maxlevels(mp);
+	mp->m_rmap_maxlevels = xfs_rmapbt_compute_maxlevels(mp);
 	xfs_refcountbt_compute_maxlevels(mp);
 
 	/*


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH 14/14] xfs: kill XFS_BTREE_MAXLEVELS
  2021-09-18  1:29 [PATCHSET RFC chandan 00/14] xfs: support dynamic btree cursor height Darrick J. Wong
                   ` (12 preceding siblings ...)
  2021-09-18  1:30 ` [PATCH 13/14] xfs: compute the maximum height of the rmap btree when reflink enabled Darrick J. Wong
@ 2021-09-18  1:30 ` Darrick J. Wong
  2021-09-20  9:57   ` Chandan Babu R
  13 siblings, 1 reply; 48+ messages in thread
From: Darrick J. Wong @ 2021-09-18  1:30 UTC (permalink / raw)
  To: djwong, chandan.babu, chandanrlinux; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Nobody uses this symbol anymore, so kill it.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_btree.c |    2 --
 fs/xfs/libxfs/xfs_btree.h |    2 --
 2 files changed, 4 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index 005bc42cf0bd..a7c866332911 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -5003,8 +5003,6 @@ xfs_btree_alloc_cursor(
 	struct xfs_btree_cur	*cur;
 	unsigned int		maxlevels = xfs_btree_maxlevels(mp, btnum);
 
-	ASSERT(maxlevels <= XFS_BTREE_MAXLEVELS);
-
 	cur = kmem_zalloc(xfs_btree_cur_sizeof(maxlevels), KM_NOFS);
 	cur->bc_tp = tp;
 	cur->bc_mp = mp;
diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index d256d869f0af..91154dd63472 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -90,8 +90,6 @@ uint32_t xfs_btree_magic(int crc, xfs_btnum_t btnum);
 #define XFS_BTREE_STATS_ADD(cur, stat, val)	\
 	XFS_STATS_ADD_OFF((cur)->bc_mp, (cur)->bc_statoff + __XBTS_ ## stat, val)
 
-#define	XFS_BTREE_MAXLEVELS	9	/* max of all btrees */
-
 struct xfs_btree_ops {
 	/* size of the key and record structures */
 	size_t	key_len;



* Re: [PATCH 01/14] xfs: remove xfs_btree_cur_t typedef
  2021-09-18  1:29 ` [PATCH 01/14] xfs: remove xfs_btree_cur_t typedef Darrick J. Wong
@ 2021-09-20  9:53   ` Chandan Babu R
  2021-09-21  8:36   ` Christoph Hellwig
  1 sibling, 0 replies; 48+ messages in thread
From: Chandan Babu R @ 2021-09-20  9:53 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: chandanrlinux, linux-xfs

On 18 Sep 2021 at 06:59, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>

The changes are straightforward replacements.

Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>

> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/libxfs/xfs_alloc.c |   12 ++++++------
>  fs/xfs/libxfs/xfs_bmap.c  |   12 ++++++------
>  fs/xfs/libxfs/xfs_btree.c |   12 ++++++------
>  fs/xfs/libxfs/xfs_btree.h |   12 ++++++------
>  4 files changed, 24 insertions(+), 24 deletions(-)
>
>
> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> index 95157f5a5a6c..35fb1dd3be95 100644
> --- a/fs/xfs/libxfs/xfs_alloc.c
> +++ b/fs/xfs/libxfs/xfs_alloc.c
> @@ -426,8 +426,8 @@ xfs_alloc_fix_len(
>   */
>  STATIC int				/* error code */
>  xfs_alloc_fixup_trees(
> -	xfs_btree_cur_t	*cnt_cur,	/* cursor for by-size btree */
> -	xfs_btree_cur_t	*bno_cur,	/* cursor for by-block btree */
> +	struct xfs_btree_cur *cnt_cur,	/* cursor for by-size btree */
> +	struct xfs_btree_cur *bno_cur,	/* cursor for by-block btree */
>  	xfs_agblock_t	fbno,		/* starting block of free extent */
>  	xfs_extlen_t	flen,		/* length of free extent */
>  	xfs_agblock_t	rbno,		/* starting block of returned extent */
> @@ -1200,8 +1200,8 @@ xfs_alloc_ag_vextent_exact(
>  	xfs_alloc_arg_t	*args)	/* allocation argument structure */
>  {
>  	struct xfs_agf __maybe_unused *agf = args->agbp->b_addr;
> -	xfs_btree_cur_t	*bno_cur;/* by block-number btree cursor */
> -	xfs_btree_cur_t	*cnt_cur;/* by count btree cursor */
> +	struct xfs_btree_cur *bno_cur;/* by block-number btree cursor */
> +	struct xfs_btree_cur *cnt_cur;/* by count btree cursor */
>  	int		error;
>  	xfs_agblock_t	fbno;	/* start block of found extent */
>  	xfs_extlen_t	flen;	/* length of found extent */
> @@ -1658,8 +1658,8 @@ xfs_alloc_ag_vextent_size(
>  	xfs_alloc_arg_t	*args)		/* allocation argument structure */
>  {
>  	struct xfs_agf	*agf = args->agbp->b_addr;
> -	xfs_btree_cur_t	*bno_cur;	/* cursor for bno btree */
> -	xfs_btree_cur_t	*cnt_cur;	/* cursor for cnt btree */
> +	struct xfs_btree_cur *bno_cur;	/* cursor for bno btree */
> +	struct xfs_btree_cur *cnt_cur;	/* cursor for cnt btree */
>  	int		error;		/* error result */
>  	xfs_agblock_t	fbno;		/* start of found freespace */
>  	xfs_extlen_t	flen;		/* length of found freespace */
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index b48230f1a361..499c977cbf56 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -316,7 +316,7 @@ xfs_check_block(
>   */
>  STATIC void
>  xfs_bmap_check_leaf_extents(
> -	xfs_btree_cur_t		*cur,	/* btree cursor or null */
> +	struct xfs_btree_cur	*cur,	/* btree cursor or null */
>  	xfs_inode_t		*ip,		/* incore inode pointer */
>  	int			whichfork)	/* data or attr fork */
>  {
> @@ -925,7 +925,7 @@ xfs_bmap_add_attrfork_btree(
>  	int			*flags)		/* inode logging flags */
>  {
>  	struct xfs_btree_block	*block = ip->i_df.if_broot;
> -	xfs_btree_cur_t		*cur;		/* btree cursor */
> +	struct xfs_btree_cur	*cur;		/* btree cursor */
>  	int			error;		/* error return value */
>  	xfs_mount_t		*mp;		/* file system mount struct */
>  	int			stat;		/* newroot status */
> @@ -968,7 +968,7 @@ xfs_bmap_add_attrfork_extents(
>  	struct xfs_inode	*ip,		/* incore inode pointer */
>  	int			*flags)		/* inode logging flags */
>  {
> -	xfs_btree_cur_t		*cur;		/* bmap btree cursor */
> +	struct xfs_btree_cur	*cur;		/* bmap btree cursor */
>  	int			error;		/* error return value */
>  
>  	if (ip->i_df.if_nextents * sizeof(struct xfs_bmbt_rec) <=
> @@ -1988,11 +1988,11 @@ xfs_bmap_add_extent_unwritten_real(
>  	xfs_inode_t		*ip,	/* incore inode pointer */
>  	int			whichfork,
>  	struct xfs_iext_cursor	*icur,
> -	xfs_btree_cur_t		**curp,	/* if *curp is null, not a btree */
> +	struct xfs_btree_cur	**curp,	/* if *curp is null, not a btree */
>  	xfs_bmbt_irec_t		*new,	/* new data to add to file extents */
>  	int			*logflagsp) /* inode logging flags */
>  {
> -	xfs_btree_cur_t		*cur;	/* btree cursor */
> +	struct xfs_btree_cur	*cur;	/* btree cursor */
>  	int			error;	/* error return value */
>  	int			i;	/* temp state */
>  	struct xfs_ifork	*ifp;	/* inode fork pointer */
> @@ -5045,7 +5045,7 @@ xfs_bmap_del_extent_real(
>  	xfs_inode_t		*ip,	/* incore inode pointer */
>  	xfs_trans_t		*tp,	/* current transaction pointer */
>  	struct xfs_iext_cursor	*icur,
> -	xfs_btree_cur_t		*cur,	/* if null, not a btree */
> +	struct xfs_btree_cur	*cur,	/* if null, not a btree */
>  	xfs_bmbt_irec_t		*del,	/* data to remove from extents */
>  	int			*logflagsp, /* inode logging flags */
>  	int			whichfork, /* data or attr fork */
> diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
> index 298395481713..b0cce0932f02 100644
> --- a/fs/xfs/libxfs/xfs_btree.c
> +++ b/fs/xfs/libxfs/xfs_btree.c
> @@ -388,14 +388,14 @@ xfs_btree_del_cursor(
>   */
>  int					/* error */
>  xfs_btree_dup_cursor(
> -	xfs_btree_cur_t	*cur,		/* input cursor */
> -	xfs_btree_cur_t	**ncur)		/* output cursor */
> +	struct xfs_btree_cur *cur,		/* input cursor */
> +	struct xfs_btree_cur **ncur)		/* output cursor */
>  {
>  	struct xfs_buf	*bp;		/* btree block's buffer pointer */
>  	int		error;		/* error return value */
>  	int		i;		/* level number of btree block */
>  	xfs_mount_t	*mp;		/* mount structure for filesystem */
> -	xfs_btree_cur_t	*new;		/* new cursor value */
> +	struct xfs_btree_cur *new;		/* new cursor value */
>  	xfs_trans_t	*tp;		/* transaction pointer, can be NULL */
>  
>  	tp = cur->bc_tp;
> @@ -691,7 +691,7 @@ xfs_btree_get_block(
>   */
>  STATIC int				/* success=1, failure=0 */
>  xfs_btree_firstrec(
> -	xfs_btree_cur_t		*cur,	/* btree cursor */
> +	struct xfs_btree_cur	*cur,	/* btree cursor */
>  	int			level)	/* level to change */
>  {
>  	struct xfs_btree_block	*block;	/* generic btree block pointer */
> @@ -721,7 +721,7 @@ xfs_btree_firstrec(
>   */
>  STATIC int				/* success=1, failure=0 */
>  xfs_btree_lastrec(
> -	xfs_btree_cur_t		*cur,	/* btree cursor */
> +	struct xfs_btree_cur	*cur,	/* btree cursor */
>  	int			level)	/* level to change */
>  {
>  	struct xfs_btree_block	*block;	/* generic btree block pointer */
> @@ -985,7 +985,7 @@ xfs_btree_readahead_ptr(
>   */
>  STATIC void
>  xfs_btree_setbuf(
> -	xfs_btree_cur_t		*cur,	/* btree cursor */
> +	struct xfs_btree_cur	*cur,	/* btree cursor */
>  	int			lev,	/* level in btree */
>  	struct xfs_buf		*bp)	/* new buffer to set */
>  {
> diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
> index 4eaf8517f850..513ade4a89f8 100644
> --- a/fs/xfs/libxfs/xfs_btree.h
> +++ b/fs/xfs/libxfs/xfs_btree.h
> @@ -216,7 +216,7 @@ struct xfs_btree_cur_ino {
>   * Btree cursor structure.
>   * This collects all information needed by the btree code in one place.
>   */
> -typedef struct xfs_btree_cur
> +struct xfs_btree_cur
>  {
>  	struct xfs_trans	*bc_tp;	/* transaction we're in, if any */
>  	struct xfs_mount	*bc_mp;	/* file system mount struct */
> @@ -243,7 +243,7 @@ typedef struct xfs_btree_cur
>  		struct xfs_btree_cur_ag	bc_ag;
>  		struct xfs_btree_cur_ino bc_ino;
>  	};
> -} xfs_btree_cur_t;
> +};
>  
>  /* cursor flags */
>  #define XFS_BTREE_LONG_PTRS		(1<<0)	/* pointers are 64bits long */
> @@ -309,7 +309,7 @@ xfs_btree_check_sptr(
>   */
>  void
>  xfs_btree_del_cursor(
> -	xfs_btree_cur_t		*cur,	/* btree cursor */
> +	struct xfs_btree_cur	*cur,	/* btree cursor */
>  	int			error);	/* del because of error */
>  
>  /*
> @@ -318,8 +318,8 @@ xfs_btree_del_cursor(
>   */
>  int					/* error */
>  xfs_btree_dup_cursor(
> -	xfs_btree_cur_t		*cur,	/* input cursor */
> -	xfs_btree_cur_t		**ncur);/* output cursor */
> +	struct xfs_btree_cur		*cur,	/* input cursor */
> +	struct xfs_btree_cur		**ncur);/* output cursor */
>  
>  /*
>   * Compute first and last byte offsets for the fields given.
> @@ -527,7 +527,7 @@ struct xfs_ifork *xfs_btree_ifork_ptr(struct xfs_btree_cur *cur);
>  /* Does this cursor point to the last block in the given level? */
>  static inline bool
>  xfs_btree_islastblock(
> -	xfs_btree_cur_t		*cur,
> +	struct xfs_btree_cur	*cur,
>  	int			level)
>  {
>  	struct xfs_btree_block	*block;


-- 
chandan


* Re: [PATCH 02/14] xfs: don't allocate scrub contexts on the stack
  2021-09-18  1:29 ` [PATCH 02/14] xfs: don't allocate scrub contexts on the stack Darrick J. Wong
@ 2021-09-20  9:53   ` Chandan Babu R
  2021-09-20 17:39     ` Darrick J. Wong
  2021-09-21  8:39   ` Christoph Hellwig
  1 sibling, 1 reply; 48+ messages in thread
From: Chandan Babu R @ 2021-09-20  9:53 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: chandanrlinux, linux-xfs

On 18 Sep 2021 at 06:59, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> Convert the on-stack scrub context, btree scrub context, and da btree
> scrub context into a heap allocation so that we reduce stack usage and
> gain the ability to handle tall btrees without issue.
>
> Specifically, this saves us ~208 bytes for the dabtree scrub, ~464 bytes
> for the btree scrub, and ~200 bytes for the main scrub context.
>

Apart from the nits pointed out below, the remaining changes look good to me.

Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>


> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/scrub/btree.c   |   54 ++++++++++++++++++++++++------------------
>  fs/xfs/scrub/btree.h   |    1 +
>  fs/xfs/scrub/dabtree.c |   62 ++++++++++++++++++++++++++----------------------
>  fs/xfs/scrub/scrub.c   |   60 ++++++++++++++++++++++++++--------------------
>  4 files changed, 98 insertions(+), 79 deletions(-)
>
>
> diff --git a/fs/xfs/scrub/btree.c b/fs/xfs/scrub/btree.c
> index eccb855dc904..26dcb4691e31 100644
> --- a/fs/xfs/scrub/btree.c
> +++ b/fs/xfs/scrub/btree.c
> @@ -627,15 +627,8 @@ xchk_btree(
>  	const struct xfs_owner_info	*oinfo,
>  	void				*private)
>  {
> -	struct xchk_btree		bs = {
> -		.cur			= cur,
> -		.scrub_rec		= scrub_fn,
> -		.oinfo			= oinfo,
> -		.firstrec		= true,
> -		.private		= private,
> -		.sc			= sc,
> -	};
>  	union xfs_btree_ptr		ptr;
> +	struct xchk_btree		*bs;
>  	union xfs_btree_ptr		*pp;
>  	union xfs_btree_rec		*recp;
>  	struct xfs_btree_block		*block;
> @@ -646,10 +639,24 @@ xchk_btree(
>  	int				i;
>  	int				error = 0;
>  
> +	/*
> +	 * Allocate the btree scrub context from the heap, because this
> +	 * structure can get rather large.
> +	 */
> +	bs = kmem_zalloc(sizeof(struct xchk_btree), KM_NOFS | KM_MAYFAIL);
> +	if (!bs)
> +		return -ENOMEM;
> +	bs->cur = cur;
> +	bs->scrub_rec = scrub_fn;
> +	bs->oinfo = oinfo;
> +	bs->firstrec = true;
> +	bs->private = private;
> +	bs->sc = sc;
> +
>  	/* Initialize scrub state */
>  	for (i = 0; i < XFS_BTREE_MAXLEVELS; i++)
> -		bs.firstkey[i] = true;
> -	INIT_LIST_HEAD(&bs.to_check);
> +		bs->firstkey[i] = true;
> +	INIT_LIST_HEAD(&bs->to_check);
>  
>  	/* Don't try to check a tree with a height we can't handle. */
>  	if (cur->bc_nlevels > XFS_BTREE_MAXLEVELS) {
> @@ -663,9 +670,9 @@ xchk_btree(
>  	 */
>  	level = cur->bc_nlevels - 1;
>  	cur->bc_ops->init_ptr_from_cur(cur, &ptr);
> -	if (!xchk_btree_ptr_ok(&bs, cur->bc_nlevels, &ptr))
> +	if (!xchk_btree_ptr_ok(bs, cur->bc_nlevels, &ptr))
>  		goto out;
> -	error = xchk_btree_get_block(&bs, level, &ptr, &block, &bp);
> +	error = xchk_btree_get_block(bs, level, &ptr, &block, &bp);
>  	if (error || !block)
>  		goto out;
>  
> @@ -678,7 +685,7 @@ xchk_btree(
>  			/* End of leaf, pop back towards the root. */
>  			if (cur->bc_ptrs[level] >
>  			    be16_to_cpu(block->bb_numrecs)) {
> -				xchk_btree_block_keys(&bs, level, block);
> +				xchk_btree_block_keys(bs, level, block);
>  				if (level < cur->bc_nlevels - 1)
>  					cur->bc_ptrs[level + 1]++;
>  				level++;
> @@ -686,11 +693,11 @@ xchk_btree(
>  			}
>  
>  			/* Records in order for scrub? */
> -			xchk_btree_rec(&bs);
> +			xchk_btree_rec(bs);
>  
>  			/* Call out to the record checker. */
>  			recp = xfs_btree_rec_addr(cur, cur->bc_ptrs[0], block);
> -			error = bs.scrub_rec(&bs, recp);
> +			error = bs->scrub_rec(bs, recp);
>  			if (error)
>  				break;
>  			if (xchk_should_terminate(sc, &error) ||
> @@ -703,7 +710,7 @@ xchk_btree(
>  
>  		/* End of node, pop back towards the root. */
>  		if (cur->bc_ptrs[level] > be16_to_cpu(block->bb_numrecs)) {
> -			xchk_btree_block_keys(&bs, level, block);
> +			xchk_btree_block_keys(bs, level, block);
>  			if (level < cur->bc_nlevels - 1)
>  				cur->bc_ptrs[level + 1]++;
>  			level++;
> @@ -711,16 +718,16 @@ xchk_btree(
>  		}
>  
>  		/* Keys in order for scrub? */
> -		xchk_btree_key(&bs, level);
> +		xchk_btree_key(bs, level);
>  
>  		/* Drill another level deeper. */
>  		pp = xfs_btree_ptr_addr(cur, cur->bc_ptrs[level], block);
> -		if (!xchk_btree_ptr_ok(&bs, level, pp)) {
> +		if (!xchk_btree_ptr_ok(bs, level, pp)) {
>  			cur->bc_ptrs[level]++;
>  			continue;
>  		}
>  		level--;
> -		error = xchk_btree_get_block(&bs, level, pp, &block, &bp);
> +		error = xchk_btree_get_block(bs, level, pp, &block, &bp);
>  		if (error || !block)
>  			goto out;
>  
> @@ -729,13 +736,14 @@ xchk_btree(
>  
>  out:
>  	/* Process deferred owner checks on btree blocks. */
> -	list_for_each_entry_safe(co, n, &bs.to_check, list) {
> -		if (!error && bs.cur)
> -			error = xchk_btree_check_block_owner(&bs,
> -					co->level, co->daddr);
> +	list_for_each_entry_safe(co, n, &bs->to_check, list) {
> +		if (!error && bs->cur)
> +			error = xchk_btree_check_block_owner(bs, co->level,
> +					co->daddr);
>  		list_del(&co->list);
>  		kmem_free(co);
>  	}
> +	kmem_free(bs);
>  
>  	return error;
>  }
> diff --git a/fs/xfs/scrub/btree.h b/fs/xfs/scrub/btree.h
> index b7d2fc01fbf9..d5c0b0cbc505 100644
> --- a/fs/xfs/scrub/btree.h
> +++ b/fs/xfs/scrub/btree.h
> @@ -44,6 +44,7 @@ struct xchk_btree {
>  	bool				firstkey[XFS_BTREE_MAXLEVELS];
>  	struct list_head		to_check;
>  };
> +
>  int xchk_btree(struct xfs_scrub *sc, struct xfs_btree_cur *cur,
>  		xchk_btree_rec_fn scrub_fn, const struct xfs_owner_info *oinfo,
>  		void *private);
> diff --git a/fs/xfs/scrub/dabtree.c b/fs/xfs/scrub/dabtree.c
> index 8a52514bc1ff..b962cfbbd92b 100644
> --- a/fs/xfs/scrub/dabtree.c
> +++ b/fs/xfs/scrub/dabtree.c
> @@ -473,7 +473,7 @@ xchk_da_btree(
>  	xchk_da_btree_rec_fn		scrub_fn,
>  	void				*private)
>  {
> -	struct xchk_da_btree		ds = {};
> +	struct xchk_da_btree		*ds;
>  	struct xfs_mount		*mp = sc->mp;
>  	struct xfs_da_state_blk		*blks;
>  	struct xfs_da_node_entry	*key;
> @@ -486,32 +486,35 @@ xchk_da_btree(
>  		return 0;
>  
>  	/* Set up initial da state. */
> -	ds.dargs.dp = sc->ip;
> -	ds.dargs.whichfork = whichfork;
> -	ds.dargs.trans = sc->tp;
> -	ds.dargs.op_flags = XFS_DA_OP_OKNOENT;
> -	ds.state = xfs_da_state_alloc(&ds.dargs);
> -	ds.sc = sc;
> -	ds.private = private;
> +	ds = kmem_zalloc(sizeof(struct xchk_da_btree), KM_NOFS | KM_MAYFAIL);
> +	if (!ds)
> +		return -ENOMEM;
> +	ds->dargs.dp = sc->ip;
> +	ds->dargs.whichfork = whichfork;
> +	ds->dargs.trans = sc->tp;
> +	ds->dargs.op_flags = XFS_DA_OP_OKNOENT;
> +	ds->state = xfs_da_state_alloc(&ds->dargs);
> +	ds->sc = sc;
> +	ds->private = private;
>  	if (whichfork == XFS_ATTR_FORK) {
> -		ds.dargs.geo = mp->m_attr_geo;
> -		ds.lowest = 0;
> -		ds.highest = 0;
> +		ds->dargs.geo = mp->m_attr_geo;
> +		ds->lowest = 0;
> +		ds->highest = 0;
>  	} else {
> -		ds.dargs.geo = mp->m_dir_geo;
> -		ds.lowest = ds.dargs.geo->leafblk;
> -		ds.highest = ds.dargs.geo->freeblk;
> +		ds->dargs.geo = mp->m_dir_geo;
> +		ds->lowest = ds->dargs.geo->leafblk;
> +		ds->highest = ds->dargs.geo->freeblk;
>  	}
> -	blkno = ds.lowest;
> +	blkno = ds->lowest;
>  	level = 0;
>  
>  	/* Find the root of the da tree, if present. */
> -	blks = ds.state->path.blk;
> -	error = xchk_da_btree_block(&ds, level, blkno);
> +	blks = ds->state->path.blk;
> +	error = xchk_da_btree_block(ds, level, blkno);
>  	if (error)
>  		goto out_state;
>  	/*
> -	 * We didn't find a block at ds.lowest, which means that there's
> +	 * We didn't find a block at ds->lowest, which means that there's
>  	 * no LEAF1/LEAFN tree (at least not where it's supposed to be),
>  	 * so jump out now.
>  	 */
> @@ -523,16 +526,16 @@ xchk_da_btree(
>  		/* Handle leaf block. */
>  		if (blks[level].magic != XFS_DA_NODE_MAGIC) {
>  			/* End of leaf, pop back towards the root. */
> -			if (blks[level].index >= ds.maxrecs[level]) {
> +			if (blks[level].index >= ds->maxrecs[level]) {
>  				if (level > 0)
>  					blks[level - 1].index++;
> -				ds.tree_level++;
> +				ds->tree_level++;
>  				level--;
>  				continue;
>  			}
>  
>  			/* Dispatch record scrubbing. */
> -			error = scrub_fn(&ds, level);
> +			error = scrub_fn(ds, level);
>  			if (error)
>  				break;
>  			if (xchk_should_terminate(sc, &error) ||
> @@ -545,17 +548,17 @@ xchk_da_btree(
>  
>  
>  		/* End of node, pop back towards the root. */
> -		if (blks[level].index >= ds.maxrecs[level]) {
> +		if (blks[level].index >= ds->maxrecs[level]) {
>  			if (level > 0)
>  				blks[level - 1].index++;
> -			ds.tree_level++;
> +			ds->tree_level++;
>  			level--;
>  			continue;
>  		}
>  
>  		/* Hashes in order for scrub? */
> -		key = xchk_da_btree_node_entry(&ds, level);
> -		error = xchk_da_btree_hash(&ds, level, &key->hashval);
> +		key = xchk_da_btree_node_entry(ds, level);
> +		error = xchk_da_btree_hash(ds, level, &key->hashval);
>  		if (error)
>  			goto out;
>  
> @@ -564,11 +567,11 @@ xchk_da_btree(
>  		level++;
>  		if (level >= XFS_DA_NODE_MAXDEPTH) {
>  			/* Too deep! */
> -			xchk_da_set_corrupt(&ds, level - 1);
> +			xchk_da_set_corrupt(ds, level - 1);
>  			break;
>  		}
> -		ds.tree_level--;
> -		error = xchk_da_btree_block(&ds, level, blkno);
> +		ds->tree_level--;
> +		error = xchk_da_btree_block(ds, level, blkno);
>  		if (error)
>  			goto out;
>  		if (blks[level].bp == NULL)
> @@ -587,6 +590,7 @@ xchk_da_btree(
>  	}
>  
>  out_state:
> -	xfs_da_state_free(ds.state);
> +	xfs_da_state_free(ds->state);
> +	kmem_free(ds);
>  	return error;
>  }
> diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
> index 51e4c61916d2..0569b15526ea 100644
> --- a/fs/xfs/scrub/scrub.c
> +++ b/fs/xfs/scrub/scrub.c
> @@ -461,15 +461,10 @@ xfs_scrub_metadata(
>  	struct file			*file,
>  	struct xfs_scrub_metadata	*sm)
>  {
> -	struct xfs_scrub		sc = {
> -		.file			= file,
> -		.sm			= sm,
> -	};
> +	struct xfs_scrub		*sc;
>  	struct xfs_mount		*mp = XFS_I(file_inode(file))->i_mount;
>  	int				error = 0;
>  
> -	sc.mp = mp;
> -
>  	BUILD_BUG_ON(sizeof(meta_scrub_ops) !=
>  		(sizeof(struct xchk_meta_ops) * XFS_SCRUB_TYPE_NR));
>  
> @@ -489,59 +484,68 @@ xfs_scrub_metadata(
>  
>  	xchk_experimental_warning(mp);
>  
> -	sc.ops = &meta_scrub_ops[sm->sm_type];
> -	sc.sick_mask = xchk_health_mask_for_scrub_type(sm->sm_type);
> +	sc = kmem_zalloc(sizeof(struct xfs_scrub), KM_NOFS | KM_MAYFAIL);
> +	if (!sc) {
> +		error = -ENOMEM;
> +		goto out;
> +	}
> +
> +	sc->mp = mp;
> +	sc->file = file;
> +	sc->sm = sm;
> +	sc->ops = &meta_scrub_ops[sm->sm_type];
> +	sc->sick_mask = xchk_health_mask_for_scrub_type(sm->sm_type);
>  retry_op:
>  	/*
>  	 * When repairs are allowed, prevent freezing or readonly remount while
>  	 * scrub is running with a real transaction.
>  	 */
>  	if (sm->sm_flags & XFS_SCRUB_IFLAG_REPAIR) {
> -		error = mnt_want_write_file(sc.file);
> +		error = mnt_want_write_file(sc->file);
>  		if (error)
>  			goto out;

The above should be "goto out_sc" ...

>  	}
>  
>  	/* Set up for the operation. */
> -	error = sc.ops->setup(&sc);
> +	error = sc->ops->setup(sc);
>  	if (error)
>  		goto out_teardown;
>  
>  	/* Scrub for errors. */
> -	error = sc.ops->scrub(&sc);
> -	if (!(sc.flags & XCHK_TRY_HARDER) && error == -EDEADLOCK) {
> +	error = sc->ops->scrub(sc);
> +	if (!(sc->flags & XCHK_TRY_HARDER) && error == -EDEADLOCK) {
>  		/*
>  		 * Scrubbers return -EDEADLOCK to mean 'try harder'.
>  		 * Tear down everything we hold, then set up again with
>  		 * preparation for worst-case scenarios.
>  		 */
> -		error = xchk_teardown(&sc, 0);
> +		error = xchk_teardown(sc, 0);
>  		if (error)
>  			goto out;

... also, the one above.

> -		sc.flags |= XCHK_TRY_HARDER;
> +		sc->flags |= XCHK_TRY_HARDER;
>  		goto retry_op;
>  	} else if (error || (sm->sm_flags & XFS_SCRUB_OFLAG_INCOMPLETE))
>  		goto out_teardown;
>  
> -	xchk_update_health(&sc);
> +	xchk_update_health(sc);
>  
> -	if ((sc.sm->sm_flags & XFS_SCRUB_IFLAG_REPAIR) &&
> -	    !(sc.flags & XREP_ALREADY_FIXED)) {
> +	if ((sc->sm->sm_flags & XFS_SCRUB_IFLAG_REPAIR) &&
> +	    !(sc->flags & XREP_ALREADY_FIXED)) {
>  		bool needs_fix;
>  
>  		/* Let debug users force us into the repair routines. */
>  		if (XFS_TEST_ERROR(false, mp, XFS_ERRTAG_FORCE_SCRUB_REPAIR))
> -			sc.sm->sm_flags |= XFS_SCRUB_OFLAG_CORRUPT;
> +			sc->sm->sm_flags |= XFS_SCRUB_OFLAG_CORRUPT;
>  
> -		needs_fix = (sc.sm->sm_flags & (XFS_SCRUB_OFLAG_CORRUPT |
> -						XFS_SCRUB_OFLAG_XCORRUPT |
> -						XFS_SCRUB_OFLAG_PREEN));
> +		needs_fix = (sc->sm->sm_flags & (XFS_SCRUB_OFLAG_CORRUPT |
> +						 XFS_SCRUB_OFLAG_XCORRUPT |
> +						 XFS_SCRUB_OFLAG_PREEN));
>  		/*
>  		 * If userspace asked for a repair but it wasn't necessary,
>  		 * report that back to userspace.
>  		 */
>  		if (!needs_fix) {
> -			sc.sm->sm_flags |= XFS_SCRUB_OFLAG_NO_REPAIR_NEEDED;
> +			sc->sm->sm_flags |= XFS_SCRUB_OFLAG_NO_REPAIR_NEEDED;
>  			goto out_nofix;
>  		}
>  
> @@ -549,26 +553,28 @@ xfs_scrub_metadata(
>  		 * If it's broken, userspace wants us to fix it, and we haven't
>  		 * already tried to fix it, then attempt a repair.
>  		 */
> -		error = xrep_attempt(&sc);
> +		error = xrep_attempt(sc);
>  		if (error == -EAGAIN) {
>  			/*
>  			 * Either the repair function succeeded or it couldn't
>  			 * get all the resources it needs; either way, we go
>  			 * back to the beginning and call the scrub function.
>  			 */
> -			error = xchk_teardown(&sc, 0);
> +			error = xchk_teardown(sc, 0);
>  			if (error) {
>  				xrep_failure(mp);
> -				goto out;
> +				goto out_sc;
>  			}
>  			goto retry_op;
>  		}
>  	}
>  
>  out_nofix:
> -	xchk_postmortem(&sc);
> +	xchk_postmortem(sc);
>  out_teardown:
> -	error = xchk_teardown(&sc, error);
> +	error = xchk_teardown(sc, error);
> +out_sc:
> +	kmem_free(sc);
>  out:
>  	trace_xchk_done(XFS_I(file_inode(file)), sm, error);
>  	if (error == -EFSCORRUPTED || error == -EFSBADCRC) {


-- 
chandan


* Re: [PATCH 03/14] xfs: dynamically allocate btree scrub context structure
  2021-09-18  1:29 ` [PATCH 03/14] xfs: dynamically allocate btree scrub context structure Darrick J. Wong
@ 2021-09-20  9:53   ` Chandan Babu R
  2021-09-21  8:43   ` Christoph Hellwig
  1 sibling, 0 replies; 48+ messages in thread
From: Chandan Babu R @ 2021-09-20  9:53 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: chandanrlinux, linux-xfs

On 18 Sep 2021 at 06:59, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> Reorganize struct xchk_btree so that we can dynamically size the context
> structure to fit the type of btree cursor that we have.  This will
> enable us to use memory more efficiently once we start adding very tall
> btree types.

The changes look good to me from the perspective of functional correctness.

Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>

>
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/scrub/btree.c |   38 +++++++++++++++++---------------------
>  fs/xfs/scrub/btree.h |   16 +++++++++++++---
>  2 files changed, 30 insertions(+), 24 deletions(-)
>
>
> diff --git a/fs/xfs/scrub/btree.c b/fs/xfs/scrub/btree.c
> index 26dcb4691e31..7b7762ae22e5 100644
> --- a/fs/xfs/scrub/btree.c
> +++ b/fs/xfs/scrub/btree.c
> @@ -141,9 +141,10 @@ xchk_btree_rec(
>  	trace_xchk_btree_rec(bs->sc, cur, 0);
>  
>  	/* If this isn't the first record, are they in order? */
> -	if (!bs->firstrec && !cur->bc_ops->recs_inorder(cur, &bs->lastrec, rec))
> +	if (bs->levels[0].has_lastkey &&
> +	    !cur->bc_ops->recs_inorder(cur, &bs->lastrec, rec))
>  		xchk_btree_set_corrupt(bs->sc, cur, 0);
> -	bs->firstrec = false;
> +	bs->levels[0].has_lastkey = true;
>  	memcpy(&bs->lastrec, rec, cur->bc_ops->rec_len);
>  
>  	if (cur->bc_nlevels == 1)
> @@ -188,11 +189,11 @@ xchk_btree_key(
>  	trace_xchk_btree_key(bs->sc, cur, level);
>  
>  	/* If this isn't the first key, are they in order? */
> -	if (!bs->firstkey[level] &&
> -	    !cur->bc_ops->keys_inorder(cur, &bs->lastkey[level], key))
> +	if (bs->levels[level].has_lastkey &&
> +	    !cur->bc_ops->keys_inorder(cur, &bs->levels[level].lastkey, key))
>  		xchk_btree_set_corrupt(bs->sc, cur, level);
> -	bs->firstkey[level] = false;
> -	memcpy(&bs->lastkey[level], key, cur->bc_ops->key_len);
> +	bs->levels[level].has_lastkey = true;
> +	memcpy(&bs->levels[level].lastkey, key, cur->bc_ops->key_len);
>  
>  	if (level + 1 >= cur->bc_nlevels)
>  		return;
> @@ -632,38 +633,33 @@ xchk_btree(
>  	union xfs_btree_ptr		*pp;
>  	union xfs_btree_rec		*recp;
>  	struct xfs_btree_block		*block;
> -	int				level;
>  	struct xfs_buf			*bp;
>  	struct check_owner		*co;
>  	struct check_owner		*n;
> -	int				i;
> +	size_t				cur_sz;
> +	int				level;
>  	int				error = 0;
>  
>  	/*
>  	 * Allocate the btree scrub context from the heap, because this
> -	 * structure can get rather large.
> +	 * structure can get rather large.  Don't let a caller feed us a
> +	 * totally absurd size.
>  	 */
> -	bs = kmem_zalloc(sizeof(struct xchk_btree), KM_NOFS | KM_MAYFAIL);
> +	cur_sz = xchk_btree_sizeof(cur->bc_nlevels);
> +	if (cur_sz > PAGE_SIZE) {
> +		xchk_btree_set_corrupt(sc, cur, 0);
> +		return 0;
> +	}
> +	bs = kmem_zalloc(cur_sz, KM_NOFS | KM_MAYFAIL);
>  	if (!bs)
>  		return -ENOMEM;
>  	bs->cur = cur;
>  	bs->scrub_rec = scrub_fn;
>  	bs->oinfo = oinfo;
> -	bs->firstrec = true;
>  	bs->private = private;
>  	bs->sc = sc;
> -
> -	/* Initialize scrub state */
> -	for (i = 0; i < XFS_BTREE_MAXLEVELS; i++)
> -		bs->firstkey[i] = true;
>  	INIT_LIST_HEAD(&bs->to_check);
>  
> -	/* Don't try to check a tree with a height we can't handle. */
> -	if (cur->bc_nlevels > XFS_BTREE_MAXLEVELS) {
> -		xchk_btree_set_corrupt(sc, cur, 0);
> -		goto out;
> -	}
> -
>  	/*
>  	 * Load the root of the btree.  The helper function absorbs
>  	 * error codes for us.
> diff --git a/fs/xfs/scrub/btree.h b/fs/xfs/scrub/btree.h
> index d5c0b0cbc505..7f8c54d8020e 100644
> --- a/fs/xfs/scrub/btree.h
> +++ b/fs/xfs/scrub/btree.h
> @@ -29,6 +29,11 @@ typedef int (*xchk_btree_rec_fn)(
>  	struct xchk_btree		*bs,
>  	const union xfs_btree_rec	*rec);
>  
> +struct xchk_btree_levels {
> +	union xfs_btree_key		lastkey;
> +	bool				has_lastkey;
> +};
> +
>  struct xchk_btree {
>  	/* caller-provided scrub state */
>  	struct xfs_scrub		*sc;
> @@ -39,12 +44,17 @@ struct xchk_btree {
>  
>  	/* internal scrub state */
>  	union xfs_btree_rec		lastrec;
> -	bool				firstrec;
> -	union xfs_btree_key		lastkey[XFS_BTREE_MAXLEVELS];
> -	bool				firstkey[XFS_BTREE_MAXLEVELS];
>  	struct list_head		to_check;
> +	struct xchk_btree_levels	levels[];
>  };
>  
> +static inline size_t
> +xchk_btree_sizeof(unsigned int levels)
> +{
> +	return sizeof(struct xchk_btree) +
> +				(levels * sizeof(struct xchk_btree_levels));
> +}
> +
>  int xchk_btree(struct xfs_scrub *sc, struct xfs_btree_cur *cur,
>  		xchk_btree_rec_fn scrub_fn, const struct xfs_owner_info *oinfo,
>  		void *private);


-- 
chandan


* Re: [PATCH 04/14] xfs: stricter btree height checking when looking for errors
  2021-09-18  1:29 ` [PATCH 04/14] xfs: stricter btree height checking when looking for errors Darrick J. Wong
@ 2021-09-20  9:54   ` Chandan Babu R
  0 siblings, 0 replies; 48+ messages in thread
From: Chandan Babu R @ 2021-09-20  9:54 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: chandanrlinux, linux-xfs

On 18 Sep 2021 at 06:59, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> Now that each btree type has its own precomputed maxlevels variable,
> use those limits instead of the generic XFS_BTREE_MAXLEVELS to check
> the level of each per-AG btree.

Looks good.

Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>

>
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/scrub/agheader.c |   13 +++++++------
>  1 file changed, 7 insertions(+), 6 deletions(-)
>
>
> diff --git a/fs/xfs/scrub/agheader.c b/fs/xfs/scrub/agheader.c
> index ae3c9f6e2c69..a2c3af77b6c2 100644
> --- a/fs/xfs/scrub/agheader.c
> +++ b/fs/xfs/scrub/agheader.c
> @@ -555,11 +555,11 @@ xchk_agf(
>  		xchk_block_set_corrupt(sc, sc->sa.agf_bp);
>  
>  	level = be32_to_cpu(agf->agf_levels[XFS_BTNUM_BNO]);
> -	if (level <= 0 || level > XFS_BTREE_MAXLEVELS)
> +	if (level <= 0 || level > mp->m_ag_maxlevels)
>  		xchk_block_set_corrupt(sc, sc->sa.agf_bp);
>  
>  	level = be32_to_cpu(agf->agf_levels[XFS_BTNUM_CNT]);
> -	if (level <= 0 || level > XFS_BTREE_MAXLEVELS)
> +	if (level <= 0 || level > mp->m_ag_maxlevels)
>  		xchk_block_set_corrupt(sc, sc->sa.agf_bp);
>  
>  	if (xfs_has_rmapbt(mp)) {
> @@ -568,7 +568,7 @@ xchk_agf(
>  			xchk_block_set_corrupt(sc, sc->sa.agf_bp);
>  
>  		level = be32_to_cpu(agf->agf_levels[XFS_BTNUM_RMAP]);
> -		if (level <= 0 || level > XFS_BTREE_MAXLEVELS)
> +		if (level <= 0 || level > mp->m_rmap_maxlevels)
>  			xchk_block_set_corrupt(sc, sc->sa.agf_bp);
>  	}
>  
> @@ -578,7 +578,7 @@ xchk_agf(
>  			xchk_block_set_corrupt(sc, sc->sa.agf_bp);
>  
>  		level = be32_to_cpu(agf->agf_refcount_level);
> -		if (level <= 0 || level > XFS_BTREE_MAXLEVELS)
> +		if (level <= 0 || level > mp->m_refc_maxlevels)
>  			xchk_block_set_corrupt(sc, sc->sa.agf_bp);
>  	}
>  
> @@ -850,6 +850,7 @@ xchk_agi(
>  	struct xfs_mount	*mp = sc->mp;
>  	struct xfs_agi		*agi;
>  	struct xfs_perag	*pag;
> +	struct xfs_ino_geometry	*igeo = M_IGEO(sc->mp);
>  	xfs_agnumber_t		agno = sc->sm->sm_agno;
>  	xfs_agblock_t		agbno;
>  	xfs_agblock_t		eoag;
> @@ -880,7 +881,7 @@ xchk_agi(
>  		xchk_block_set_corrupt(sc, sc->sa.agi_bp);
>  
>  	level = be32_to_cpu(agi->agi_level);
> -	if (level <= 0 || level > XFS_BTREE_MAXLEVELS)
> +	if (level <= 0 || level > igeo->inobt_maxlevels)
>  		xchk_block_set_corrupt(sc, sc->sa.agi_bp);
>  
>  	if (xfs_has_finobt(mp)) {
> @@ -889,7 +890,7 @@ xchk_agi(
>  			xchk_block_set_corrupt(sc, sc->sa.agi_bp);
>  
>  		level = be32_to_cpu(agi->agi_free_level);
> -		if (level <= 0 || level > XFS_BTREE_MAXLEVELS)
> +		if (level <= 0 || level > igeo->inobt_maxlevels)
>  			xchk_block_set_corrupt(sc, sc->sa.agi_bp);
>  	}
>  


-- 
chandan


* Re: [PATCH 05/14] xfs: stricter btree height checking when scanning for btree roots
  2021-09-18  1:29 ` [PATCH 05/14] xfs: stricter btree height checking when scanning for btree roots Darrick J. Wong
@ 2021-09-20  9:54   ` Chandan Babu R
  0 siblings, 0 replies; 48+ messages in thread
From: Chandan Babu R @ 2021-09-20  9:54 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: chandanrlinux, linux-xfs

On 18 Sep 2021 at 06:59, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> When we're scanning for btree roots to rebuild the AG headers, make sure
> that the proposed tree does not exceed the maximum height for that btree
> type (and not just XFS_BTREE_MAXLEVELS).
>

Looks good.

Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>

> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/scrub/agheader_repair.c |    8 +++++++-
>  fs/xfs/scrub/repair.h          |    3 +++
>  2 files changed, 10 insertions(+), 1 deletion(-)
>
>
> diff --git a/fs/xfs/scrub/agheader_repair.c b/fs/xfs/scrub/agheader_repair.c
> index 0f8deee66f15..05c27149b65d 100644
> --- a/fs/xfs/scrub/agheader_repair.c
> +++ b/fs/xfs/scrub/agheader_repair.c
> @@ -122,7 +122,7 @@ xrep_check_btree_root(
>  	xfs_agnumber_t			agno = sc->sm->sm_agno;
>  
>  	return xfs_verify_agbno(mp, agno, fab->root) &&
> -	       fab->height <= XFS_BTREE_MAXLEVELS;
> +	       fab->height <= fab->maxlevels;
>  }
>  
>  /*
> @@ -339,18 +339,22 @@ xrep_agf(
>  		[XREP_AGF_BNOBT] = {
>  			.rmap_owner = XFS_RMAP_OWN_AG,
>  			.buf_ops = &xfs_bnobt_buf_ops,
> +			.maxlevels = sc->mp->m_ag_maxlevels,
>  		},
>  		[XREP_AGF_CNTBT] = {
>  			.rmap_owner = XFS_RMAP_OWN_AG,
>  			.buf_ops = &xfs_cntbt_buf_ops,
> +			.maxlevels = sc->mp->m_ag_maxlevels,
>  		},
>  		[XREP_AGF_RMAPBT] = {
>  			.rmap_owner = XFS_RMAP_OWN_AG,
>  			.buf_ops = &xfs_rmapbt_buf_ops,
> +			.maxlevels = sc->mp->m_rmap_maxlevels,
>  		},
>  		[XREP_AGF_REFCOUNTBT] = {
>  			.rmap_owner = XFS_RMAP_OWN_REFC,
>  			.buf_ops = &xfs_refcountbt_buf_ops,
> +			.maxlevels = sc->mp->m_refc_maxlevels,
>  		},
>  		[XREP_AGF_END] = {
>  			.buf_ops = NULL,
> @@ -881,10 +885,12 @@ xrep_agi(
>  		[XREP_AGI_INOBT] = {
>  			.rmap_owner = XFS_RMAP_OWN_INOBT,
>  			.buf_ops = &xfs_inobt_buf_ops,
> +			.maxlevels = M_IGEO(sc->mp)->inobt_maxlevels,
>  		},
>  		[XREP_AGI_FINOBT] = {
>  			.rmap_owner = XFS_RMAP_OWN_INOBT,
>  			.buf_ops = &xfs_finobt_buf_ops,
> +			.maxlevels = M_IGEO(sc->mp)->inobt_maxlevels,
>  		},
>  		[XREP_AGI_END] = {
>  			.buf_ops = NULL
> diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h
> index 3bb152d52a07..840f74ec431c 100644
> --- a/fs/xfs/scrub/repair.h
> +++ b/fs/xfs/scrub/repair.h
> @@ -44,6 +44,9 @@ struct xrep_find_ag_btree {
>  	/* in: buffer ops */
>  	const struct xfs_buf_ops	*buf_ops;
>  
> +	/* in: maximum btree height */
> +	unsigned int			maxlevels;
> +
>  	/* out: the highest btree block found and the tree height */
>  	xfs_agblock_t			root;
>  	unsigned int			height;


-- 
chandan


* Re: [PATCH 06/14] xfs: check that bc_nlevels never overflows
  2021-09-18  1:29 ` [PATCH 06/14] xfs: check that bc_nlevels never overflows Darrick J. Wong
@ 2021-09-20  9:54   ` Chandan Babu R
  2021-09-21  8:44   ` Christoph Hellwig
  1 sibling, 0 replies; 48+ messages in thread
From: Chandan Babu R @ 2021-09-20  9:54 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: chandanrlinux, linux-xfs

On 18 Sep 2021 at 06:59, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> Warn if we ever bump nlevels higher than the allowed maximum cursor
> height.

Looks good.

Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>

>
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/libxfs/xfs_btree.c         |    2 ++
>  fs/xfs/libxfs/xfs_btree_staging.c |    2 ++
>  2 files changed, 4 insertions(+)
>
>
> diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
> index b0cce0932f02..bc4e49f0456a 100644
> --- a/fs/xfs/libxfs/xfs_btree.c
> +++ b/fs/xfs/libxfs/xfs_btree.c
> @@ -2933,6 +2933,7 @@ xfs_btree_new_iroot(
>  	be16_add_cpu(&block->bb_level, 1);
>  	xfs_btree_set_numrecs(block, 1);
>  	cur->bc_nlevels++;
> +	ASSERT(cur->bc_nlevels <= XFS_BTREE_MAXLEVELS);
>  	cur->bc_ptrs[level + 1] = 1;
>  
>  	kp = xfs_btree_key_addr(cur, 1, block);
> @@ -3096,6 +3097,7 @@ xfs_btree_new_root(
>  	xfs_btree_setbuf(cur, cur->bc_nlevels, nbp);
>  	cur->bc_ptrs[cur->bc_nlevels] = nptr;
>  	cur->bc_nlevels++;
> +	ASSERT(cur->bc_nlevels <= XFS_BTREE_MAXLEVELS);
>  	*stat = 1;
>  	return 0;
>  error0:
> diff --git a/fs/xfs/libxfs/xfs_btree_staging.c b/fs/xfs/libxfs/xfs_btree_staging.c
> index ac9e80152b5c..26143297bb7b 100644
> --- a/fs/xfs/libxfs/xfs_btree_staging.c
> +++ b/fs/xfs/libxfs/xfs_btree_staging.c
> @@ -703,6 +703,7 @@ xfs_btree_bload_compute_geometry(
>  			 * block-based btree level.
>  			 */
>  			cur->bc_nlevels++;
> +			ASSERT(cur->bc_nlevels <= XFS_BTREE_MAXLEVELS);
>  			xfs_btree_bload_level_geometry(cur, bbl, level,
>  					nr_this_level, &avg_per_block,
>  					&level_blocks, &dontcare64);
> @@ -718,6 +719,7 @@ xfs_btree_bload_compute_geometry(
>  
>  			/* Otherwise, we need another level of btree. */
>  			cur->bc_nlevels++;
> +			ASSERT(cur->bc_nlevels <= XFS_BTREE_MAXLEVELS);
>  		}
>  
>  		nr_blocks += level_blocks;


-- 
chandan


* Re: [PATCH 07/14] xfs: support dynamic btree cursor heights
  2021-09-18  1:29 ` [PATCH 07/14] xfs: support dynamic btree cursor heights Darrick J. Wong
@ 2021-09-20  9:55   ` Chandan Babu R
  2021-09-21  8:49   ` Christoph Hellwig
  1 sibling, 0 replies; 48+ messages in thread
From: Chandan Babu R @ 2021-09-20  9:55 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: chandanrlinux, linux-xfs

On 18 Sep 2021 at 06:59, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> Split out the btree level information into a separate struct and put it
> at the end of the cursor structure as a flexible array member.  The
> realtime rmap btree
> (which is rooted in an inode) will require the ability to support many
> more levels than a per-AG btree cursor, which means that we're going to
> create two btree cursor caches to conserve memory for the more common
> case.
>

Looks good.

Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>

> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/libxfs/xfs_alloc.c |    6 +-
>  fs/xfs/libxfs/xfs_bmap.c  |   10 +--
>  fs/xfs/libxfs/xfs_btree.c |  154 +++++++++++++++++++++++----------------------
>  fs/xfs/libxfs/xfs_btree.h |   28 ++++++--
>  fs/xfs/scrub/bitmap.c     |   16 ++---
>  fs/xfs/scrub/bmap.c       |    2 -
>  fs/xfs/scrub/btree.c      |   40 ++++++------
>  fs/xfs/scrub/trace.c      |    7 +-
>  fs/xfs/scrub/trace.h      |   10 +--
>  fs/xfs/xfs_super.c        |    2 -
>  fs/xfs/xfs_trace.h        |    2 -
>  11 files changed, 147 insertions(+), 130 deletions(-)
>
>
> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> index 35fb1dd3be95..55c5adc9b54e 100644
> --- a/fs/xfs/libxfs/xfs_alloc.c
> +++ b/fs/xfs/libxfs/xfs_alloc.c
> @@ -488,8 +488,8 @@ xfs_alloc_fixup_trees(
>  		struct xfs_btree_block	*bnoblock;
>  		struct xfs_btree_block	*cntblock;
>  
> -		bnoblock = XFS_BUF_TO_BLOCK(bno_cur->bc_bufs[0]);
> -		cntblock = XFS_BUF_TO_BLOCK(cnt_cur->bc_bufs[0]);
> +		bnoblock = XFS_BUF_TO_BLOCK(bno_cur->bc_levels[0].bp);
> +		cntblock = XFS_BUF_TO_BLOCK(cnt_cur->bc_levels[0].bp);
>  
>  		if (XFS_IS_CORRUPT(mp,
>  				   bnoblock->bb_numrecs !=
> @@ -1512,7 +1512,7 @@ xfs_alloc_ag_vextent_lastblock(
>  	 * than minlen.
>  	 */
>  	if (*len || args->alignment > 1) {
> -		acur->cnt->bc_ptrs[0] = 1;
> +		acur->cnt->bc_levels[0].ptr = 1;
>  		do {
>  			error = xfs_alloc_get_rec(acur->cnt, bno, len, &i);
>  			if (error)
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index 499c977cbf56..644b956301b6 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -240,10 +240,10 @@ xfs_bmap_get_bp(
>  		return NULL;
>  
>  	for (i = 0; i < XFS_BTREE_MAXLEVELS; i++) {
> -		if (!cur->bc_bufs[i])
> +		if (!cur->bc_levels[i].bp)
>  			break;
> -		if (xfs_buf_daddr(cur->bc_bufs[i]) == bno)
> -			return cur->bc_bufs[i];
> +		if (xfs_buf_daddr(cur->bc_levels[i].bp) == bno)
> +			return cur->bc_levels[i].bp;
>  	}
>  
>  	/* Chase down all the log items to see if the bp is there */
> @@ -629,8 +629,8 @@ xfs_bmap_btree_to_extents(
>  	ip->i_nblocks--;
>  	xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_BCOUNT, -1L);
>  	xfs_trans_binval(tp, cbp);
> -	if (cur->bc_bufs[0] == cbp)
> -		cur->bc_bufs[0] = NULL;
> +	if (cur->bc_levels[0].bp == cbp)
> +		cur->bc_levels[0].bp = NULL;
>  	xfs_iroot_realloc(ip, -1, whichfork);
>  	ASSERT(ifp->if_broot == NULL);
>  	ifp->if_format = XFS_DINODE_FMT_EXTENTS;
> diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
> index bc4e49f0456a..93fb50516bc2 100644
> --- a/fs/xfs/libxfs/xfs_btree.c
> +++ b/fs/xfs/libxfs/xfs_btree.c
> @@ -367,8 +367,8 @@ xfs_btree_del_cursor(
>  	 * way we won't have initialized all the entries down to 0.
>  	 */
>  	for (i = 0; i < cur->bc_nlevels; i++) {
> -		if (cur->bc_bufs[i])
> -			xfs_trans_brelse(cur->bc_tp, cur->bc_bufs[i]);
> +		if (cur->bc_levels[i].bp)
> +			xfs_trans_brelse(cur->bc_tp, cur->bc_levels[i].bp);
>  		else if (!error)
>  			break;
>  	}
> @@ -415,9 +415,9 @@ xfs_btree_dup_cursor(
>  	 * For each level current, re-get the buffer and copy the ptr value.
>  	 */
>  	for (i = 0; i < new->bc_nlevels; i++) {
> -		new->bc_ptrs[i] = cur->bc_ptrs[i];
> -		new->bc_ra[i] = cur->bc_ra[i];
> -		bp = cur->bc_bufs[i];
> +		new->bc_levels[i].ptr = cur->bc_levels[i].ptr;
> +		new->bc_levels[i].ra = cur->bc_levels[i].ra;
> +		bp = cur->bc_levels[i].bp;
>  		if (bp) {
>  			error = xfs_trans_read_buf(mp, tp, mp->m_ddev_targp,
>  						   xfs_buf_daddr(bp), mp->m_bsize,
> @@ -429,7 +429,7 @@ xfs_btree_dup_cursor(
>  				return error;
>  			}
>  		}
> -		new->bc_bufs[i] = bp;
> +		new->bc_levels[i].bp = bp;
>  	}
>  	*ncur = new;
>  	return 0;
> @@ -681,7 +681,7 @@ xfs_btree_get_block(
>  		return xfs_btree_get_iroot(cur);
>  	}
>  
> -	*bpp = cur->bc_bufs[level];
> +	*bpp = cur->bc_levels[level].bp;
>  	return XFS_BUF_TO_BLOCK(*bpp);
>  }
>  
> @@ -711,7 +711,7 @@ xfs_btree_firstrec(
>  	/*
>  	 * Set the ptr value to 1, that's the first record/key.
>  	 */
> -	cur->bc_ptrs[level] = 1;
> +	cur->bc_levels[level].ptr = 1;
>  	return 1;
>  }
>  
> @@ -741,7 +741,7 @@ xfs_btree_lastrec(
>  	/*
>  	 * Set the ptr value to numrecs, that's the last record/key.
>  	 */
> -	cur->bc_ptrs[level] = be16_to_cpu(block->bb_numrecs);
> +	cur->bc_levels[level].ptr = be16_to_cpu(block->bb_numrecs);
>  	return 1;
>  }
>  
> @@ -922,11 +922,11 @@ xfs_btree_readahead(
>  	    (lev == cur->bc_nlevels - 1))
>  		return 0;
>  
> -	if ((cur->bc_ra[lev] | lr) == cur->bc_ra[lev])
> +	if ((cur->bc_levels[lev].ra | lr) == cur->bc_levels[lev].ra)
>  		return 0;
>  
> -	cur->bc_ra[lev] |= lr;
> -	block = XFS_BUF_TO_BLOCK(cur->bc_bufs[lev]);
> +	cur->bc_levels[lev].ra |= lr;
> +	block = XFS_BUF_TO_BLOCK(cur->bc_levels[lev].bp);
>  
>  	if (cur->bc_flags & XFS_BTREE_LONG_PTRS)
>  		return xfs_btree_readahead_lblock(cur, lr, block);
> @@ -991,22 +991,22 @@ xfs_btree_setbuf(
>  {
>  	struct xfs_btree_block	*b;	/* btree block */
>  
> -	if (cur->bc_bufs[lev])
> -		xfs_trans_brelse(cur->bc_tp, cur->bc_bufs[lev]);
> -	cur->bc_bufs[lev] = bp;
> -	cur->bc_ra[lev] = 0;
> +	if (cur->bc_levels[lev].bp)
> +		xfs_trans_brelse(cur->bc_tp, cur->bc_levels[lev].bp);
> +	cur->bc_levels[lev].bp = bp;
> +	cur->bc_levels[lev].ra = 0;
>  
>  	b = XFS_BUF_TO_BLOCK(bp);
>  	if (cur->bc_flags & XFS_BTREE_LONG_PTRS) {
>  		if (b->bb_u.l.bb_leftsib == cpu_to_be64(NULLFSBLOCK))
> -			cur->bc_ra[lev] |= XFS_BTCUR_LEFTRA;
> +			cur->bc_levels[lev].ra |= XFS_BTCUR_LEFTRA;
>  		if (b->bb_u.l.bb_rightsib == cpu_to_be64(NULLFSBLOCK))
> -			cur->bc_ra[lev] |= XFS_BTCUR_RIGHTRA;
> +			cur->bc_levels[lev].ra |= XFS_BTCUR_RIGHTRA;
>  	} else {
>  		if (b->bb_u.s.bb_leftsib == cpu_to_be32(NULLAGBLOCK))
> -			cur->bc_ra[lev] |= XFS_BTCUR_LEFTRA;
> +			cur->bc_levels[lev].ra |= XFS_BTCUR_LEFTRA;
>  		if (b->bb_u.s.bb_rightsib == cpu_to_be32(NULLAGBLOCK))
> -			cur->bc_ra[lev] |= XFS_BTCUR_RIGHTRA;
> +			cur->bc_levels[lev].ra |= XFS_BTCUR_RIGHTRA;
>  	}
>  }
>  
> @@ -1548,7 +1548,7 @@ xfs_btree_increment(
>  #endif
>  
>  	/* We're done if we remain in the block after the increment. */
> -	if (++cur->bc_ptrs[level] <= xfs_btree_get_numrecs(block))
> +	if (++cur->bc_levels[level].ptr <= xfs_btree_get_numrecs(block))
>  		goto out1;
>  
>  	/* Fail if we just went off the right edge of the tree. */
> @@ -1571,7 +1571,7 @@ xfs_btree_increment(
>  			goto error0;
>  #endif
>  
> -		if (++cur->bc_ptrs[lev] <= xfs_btree_get_numrecs(block))
> +		if (++cur->bc_levels[lev].ptr <= xfs_btree_get_numrecs(block))
>  			break;
>  
>  		/* Read-ahead the right block for the next loop. */
> @@ -1598,14 +1598,14 @@ xfs_btree_increment(
>  	for (block = xfs_btree_get_block(cur, lev, &bp); lev > level; ) {
>  		union xfs_btree_ptr	*ptrp;
>  
> -		ptrp = xfs_btree_ptr_addr(cur, cur->bc_ptrs[lev], block);
> +		ptrp = xfs_btree_ptr_addr(cur, cur->bc_levels[lev].ptr, block);
>  		--lev;
>  		error = xfs_btree_read_buf_block(cur, ptrp, 0, &block, &bp);
>  		if (error)
>  			goto error0;
>  
>  		xfs_btree_setbuf(cur, lev, bp);
> -		cur->bc_ptrs[lev] = 1;
> +		cur->bc_levels[lev].ptr = 1;
>  	}
>  out1:
>  	*stat = 1;
> @@ -1641,7 +1641,7 @@ xfs_btree_decrement(
>  	xfs_btree_readahead(cur, level, XFS_BTCUR_LEFTRA);
>  
>  	/* We're done if we remain in the block after the decrement. */
> -	if (--cur->bc_ptrs[level] > 0)
> +	if (--cur->bc_levels[level].ptr > 0)
>  		goto out1;
>  
>  	/* Get a pointer to the btree block. */
> @@ -1665,7 +1665,7 @@ xfs_btree_decrement(
>  	 * Stop when we don't go off the left edge of a block.
>  	 */
>  	for (lev = level + 1; lev < cur->bc_nlevels; lev++) {
> -		if (--cur->bc_ptrs[lev] > 0)
> +		if (--cur->bc_levels[lev].ptr > 0)
>  			break;
>  		/* Read-ahead the left block for the next loop. */
>  		xfs_btree_readahead(cur, lev, XFS_BTCUR_LEFTRA);
> @@ -1691,13 +1691,13 @@ xfs_btree_decrement(
>  	for (block = xfs_btree_get_block(cur, lev, &bp); lev > level; ) {
>  		union xfs_btree_ptr	*ptrp;
>  
> -		ptrp = xfs_btree_ptr_addr(cur, cur->bc_ptrs[lev], block);
> +		ptrp = xfs_btree_ptr_addr(cur, cur->bc_levels[lev].ptr, block);
>  		--lev;
>  		error = xfs_btree_read_buf_block(cur, ptrp, 0, &block, &bp);
>  		if (error)
>  			goto error0;
>  		xfs_btree_setbuf(cur, lev, bp);
> -		cur->bc_ptrs[lev] = xfs_btree_get_numrecs(block);
> +		cur->bc_levels[lev].ptr = xfs_btree_get_numrecs(block);
>  	}
>  out1:
>  	*stat = 1;
> @@ -1735,7 +1735,7 @@ xfs_btree_lookup_get_block(
>  	 *
>  	 * Otherwise throw it away and get a new one.
>  	 */
> -	bp = cur->bc_bufs[level];
> +	bp = cur->bc_levels[level].bp;
>  	error = xfs_btree_ptr_to_daddr(cur, pp, &daddr);
>  	if (error)
>  		return error;
> @@ -1864,7 +1864,7 @@ xfs_btree_lookup(
>  					return -EFSCORRUPTED;
>  				}
>  
> -				cur->bc_ptrs[0] = dir != XFS_LOOKUP_LE;
> +				cur->bc_levels[0].ptr = dir != XFS_LOOKUP_LE;
>  				*stat = 0;
>  				return 0;
>  			}
> @@ -1916,7 +1916,7 @@ xfs_btree_lookup(
>  			if (error)
>  				goto error0;
>  
> -			cur->bc_ptrs[level] = keyno;
> +			cur->bc_levels[level].ptr = keyno;
>  		}
>  	}
>  
> @@ -1933,7 +1933,7 @@ xfs_btree_lookup(
>  		    !xfs_btree_ptr_is_null(cur, &ptr)) {
>  			int	i;
>  
> -			cur->bc_ptrs[0] = keyno;
> +			cur->bc_levels[0].ptr = keyno;
>  			error = xfs_btree_increment(cur, 0, &i);
>  			if (error)
>  				goto error0;
> @@ -1944,7 +1944,7 @@ xfs_btree_lookup(
>  		}
>  	} else if (dir == XFS_LOOKUP_LE && diff > 0)
>  		keyno--;
> -	cur->bc_ptrs[0] = keyno;
> +	cur->bc_levels[0].ptr = keyno;
>  
>  	/* Return if we succeeded or not. */
>  	if (keyno == 0 || keyno > xfs_btree_get_numrecs(block))
> @@ -2104,7 +2104,7 @@ __xfs_btree_updkeys(
>  		if (error)
>  			return error;
>  #endif
> -		ptr = cur->bc_ptrs[level];
> +		ptr = cur->bc_levels[level].ptr;
>  		nlkey = xfs_btree_key_addr(cur, ptr, block);
>  		nhkey = xfs_btree_high_key_addr(cur, ptr, block);
>  		if (!force_all &&
> @@ -2171,7 +2171,7 @@ xfs_btree_update_keys(
>  		if (error)
>  			return error;
>  #endif
> -		ptr = cur->bc_ptrs[level];
> +		ptr = cur->bc_levels[level].ptr;
>  		kp = xfs_btree_key_addr(cur, ptr, block);
>  		xfs_btree_copy_keys(cur, kp, &key, 1);
>  		xfs_btree_log_keys(cur, bp, ptr, ptr);
> @@ -2205,7 +2205,7 @@ xfs_btree_update(
>  		goto error0;
>  #endif
>  	/* Get the address of the rec to be updated. */
> -	ptr = cur->bc_ptrs[0];
> +	ptr = cur->bc_levels[0].ptr;
>  	rp = xfs_btree_rec_addr(cur, ptr, block);
>  
>  	/* Fill in the new contents and log them. */
> @@ -2280,7 +2280,7 @@ xfs_btree_lshift(
>  	 * If the cursor entry is the one that would be moved, don't
>  	 * do it... it's too complicated.
>  	 */
> -	if (cur->bc_ptrs[level] <= 1)
> +	if (cur->bc_levels[level].ptr <= 1)
>  		goto out0;
>  
>  	/* Set up the left neighbor as "left". */
> @@ -2414,7 +2414,7 @@ xfs_btree_lshift(
>  		goto error0;
>  
>  	/* Slide the cursor value left one. */
> -	cur->bc_ptrs[level]--;
> +	cur->bc_levels[level].ptr--;
>  
>  	*stat = 1;
>  	return 0;
> @@ -2476,7 +2476,7 @@ xfs_btree_rshift(
>  	 * do it... it's too complicated.
>  	 */
>  	lrecs = xfs_btree_get_numrecs(left);
> -	if (cur->bc_ptrs[level] >= lrecs)
> +	if (cur->bc_levels[level].ptr >= lrecs)
>  		goto out0;
>  
>  	/* Set up the right neighbor as "right". */
> @@ -2664,7 +2664,7 @@ __xfs_btree_split(
>  	 */
>  	lrecs = xfs_btree_get_numrecs(left);
>  	rrecs = lrecs / 2;
> -	if ((lrecs & 1) && cur->bc_ptrs[level] <= rrecs + 1)
> +	if ((lrecs & 1) && cur->bc_levels[level].ptr <= rrecs + 1)
>  		rrecs++;
>  	src_index = (lrecs - rrecs + 1);
>  
> @@ -2760,9 +2760,9 @@ __xfs_btree_split(
>  	 * If it's just pointing past the last entry in left, then we'll
>  	 * insert there, so don't change anything in that case.
>  	 */
> -	if (cur->bc_ptrs[level] > lrecs + 1) {
> +	if (cur->bc_levels[level].ptr > lrecs + 1) {
>  		xfs_btree_setbuf(cur, level, rbp);
> -		cur->bc_ptrs[level] -= lrecs;
> +		cur->bc_levels[level].ptr -= lrecs;
>  	}
>  	/*
>  	 * If there are more levels, we'll need another cursor which refers
> @@ -2772,7 +2772,7 @@ __xfs_btree_split(
>  		error = xfs_btree_dup_cursor(cur, curp);
>  		if (error)
>  			goto error0;
> -		(*curp)->bc_ptrs[level + 1]++;
> +		(*curp)->bc_levels[level + 1].ptr++;
>  	}
>  	*ptrp = rptr;
>  	*stat = 1;
> @@ -2934,7 +2934,7 @@ xfs_btree_new_iroot(
>  	xfs_btree_set_numrecs(block, 1);
>  	cur->bc_nlevels++;
>  	ASSERT(cur->bc_nlevels <= XFS_BTREE_MAXLEVELS);
> -	cur->bc_ptrs[level + 1] = 1;
> +	cur->bc_levels[level + 1].ptr = 1;
>  
>  	kp = xfs_btree_key_addr(cur, 1, block);
>  	ckp = xfs_btree_key_addr(cur, 1, cblock);
> @@ -3095,7 +3095,7 @@ xfs_btree_new_root(
>  
>  	/* Fix up the cursor. */
>  	xfs_btree_setbuf(cur, cur->bc_nlevels, nbp);
> -	cur->bc_ptrs[cur->bc_nlevels] = nptr;
> +	cur->bc_levels[cur->bc_nlevels].ptr = nptr;
>  	cur->bc_nlevels++;
>  	ASSERT(cur->bc_nlevels <= XFS_BTREE_MAXLEVELS);
>  	*stat = 1;
> @@ -3154,7 +3154,7 @@ xfs_btree_make_block_unfull(
>  		return error;
>  
>  	if (*stat) {
> -		*oindex = *index = cur->bc_ptrs[level];
> +		*oindex = *index = cur->bc_levels[level].ptr;
>  		return 0;
>  	}
>  
> @@ -3169,7 +3169,7 @@ xfs_btree_make_block_unfull(
>  		return error;
>  
>  
> -	*index = cur->bc_ptrs[level];
> +	*index = cur->bc_levels[level].ptr;
>  	return 0;
>  }
>  
> @@ -3216,7 +3216,7 @@ xfs_btree_insrec(
>  	}
>  
>  	/* If we're off the left edge, return failure. */
> -	ptr = cur->bc_ptrs[level];
> +	ptr = cur->bc_levels[level].ptr;
>  	if (ptr == 0) {
>  		*stat = 0;
>  		return 0;
> @@ -3559,7 +3559,7 @@ xfs_btree_kill_iroot(
>  	if (error)
>  		return error;
>  
> -	cur->bc_bufs[level - 1] = NULL;
> +	cur->bc_levels[level - 1].bp = NULL;
>  	be16_add_cpu(&block->bb_level, -1);
>  	xfs_trans_log_inode(cur->bc_tp, ip,
>  		XFS_ILOG_CORE | xfs_ilog_fbroot(cur->bc_ino.whichfork));
> @@ -3592,8 +3592,8 @@ xfs_btree_kill_root(
>  	if (error)
>  		return error;
>  
> -	cur->bc_bufs[level] = NULL;
> -	cur->bc_ra[level] = 0;
> +	cur->bc_levels[level].bp = NULL;
> +	cur->bc_levels[level].ra = 0;
>  	cur->bc_nlevels--;
>  
>  	return 0;
> @@ -3652,7 +3652,7 @@ xfs_btree_delrec(
>  	tcur = NULL;
>  
>  	/* Get the index of the entry being deleted, check for nothing there. */
> -	ptr = cur->bc_ptrs[level];
> +	ptr = cur->bc_levels[level].ptr;
>  	if (ptr == 0) {
>  		*stat = 0;
>  		return 0;
> @@ -3962,7 +3962,7 @@ xfs_btree_delrec(
>  				xfs_btree_del_cursor(tcur, XFS_BTREE_NOERROR);
>  				tcur = NULL;
>  				if (level == 0)
> -					cur->bc_ptrs[0]++;
> +					cur->bc_levels[0].ptr++;
>  
>  				*stat = 1;
>  				return 0;
> @@ -4099,9 +4099,9 @@ xfs_btree_delrec(
>  	 * cursor to the left block, and fix up the index.
>  	 */
>  	if (bp != lbp) {
> -		cur->bc_bufs[level] = lbp;
> -		cur->bc_ptrs[level] += lrecs;
> -		cur->bc_ra[level] = 0;
> +		cur->bc_levels[level].bp = lbp;
> +		cur->bc_levels[level].ptr += lrecs;
> +		cur->bc_levels[level].ra = 0;
>  	}
>  	/*
>  	 * If we joined with the right neighbor and there's a level above
> @@ -4121,11 +4121,11 @@ xfs_btree_delrec(
>  	 * We can't use decrement because it would change the next level up.
>  	 */
>  	if (level > 0)
> -		cur->bc_ptrs[level]--;
> +		cur->bc_levels[level].ptr--;
>  
>  	/*
>  	 * We combined blocks, so we have to update the parent keys if the
> -	 * btree supports overlapped intervals.  However, bc_ptrs[level + 1]
> +	 * btree supports overlapped intervals.  However, bc_levels[level + 1].ptr
>  	 * points to the old block so that the caller knows which record to
>  	 * delete.  Therefore, the caller must be savvy enough to call updkeys
>  	 * for us if we return stat == 2.  The other exit points from this
> @@ -4184,7 +4184,7 @@ xfs_btree_delete(
>  
>  	if (i == 0) {
>  		for (level = 1; level < cur->bc_nlevels; level++) {
> -			if (cur->bc_ptrs[level] == 0) {
> +			if (cur->bc_levels[level].ptr == 0) {
>  				error = xfs_btree_decrement(cur, level, &i);
>  				if (error)
>  					goto error0;
> @@ -4215,7 +4215,7 @@ xfs_btree_get_rec(
>  	int			error;	/* error return value */
>  #endif
>  
> -	ptr = cur->bc_ptrs[0];
> +	ptr = cur->bc_levels[0].ptr;
>  	block = xfs_btree_get_block(cur, 0, &bp);
>  
>  #ifdef DEBUG
> @@ -4663,23 +4663,23 @@ xfs_btree_overlapped_query_range(
>  	if (error)
>  		goto out;
>  #endif
> -	cur->bc_ptrs[level] = 1;
> +	cur->bc_levels[level].ptr = 1;
>  
>  	while (level < cur->bc_nlevels) {
>  		block = xfs_btree_get_block(cur, level, &bp);
>  
>  		/* End of node, pop back towards the root. */
> -		if (cur->bc_ptrs[level] > be16_to_cpu(block->bb_numrecs)) {
> +		if (cur->bc_levels[level].ptr > be16_to_cpu(block->bb_numrecs)) {
>  pop_up:
>  			if (level < cur->bc_nlevels - 1)
> -				cur->bc_ptrs[level + 1]++;
> +				cur->bc_levels[level + 1].ptr++;
>  			level++;
>  			continue;
>  		}
>  
>  		if (level == 0) {
>  			/* Handle a leaf node. */
> -			recp = xfs_btree_rec_addr(cur, cur->bc_ptrs[0], block);
> +			recp = xfs_btree_rec_addr(cur, cur->bc_levels[0].ptr, block);
>  
>  			cur->bc_ops->init_high_key_from_rec(&rec_hkey, recp);
>  			ldiff = cur->bc_ops->diff_two_keys(cur, &rec_hkey,
> @@ -4702,14 +4702,14 @@ xfs_btree_overlapped_query_range(
>  				/* Record is larger than high key; pop. */
>  				goto pop_up;
>  			}
> -			cur->bc_ptrs[level]++;
> +			cur->bc_levels[level].ptr++;
>  			continue;
>  		}
>  
>  		/* Handle an internal node. */
> -		lkp = xfs_btree_key_addr(cur, cur->bc_ptrs[level], block);
> -		hkp = xfs_btree_high_key_addr(cur, cur->bc_ptrs[level], block);
> -		pp = xfs_btree_ptr_addr(cur, cur->bc_ptrs[level], block);
> +		lkp = xfs_btree_key_addr(cur, cur->bc_levels[level].ptr, block);
> +		hkp = xfs_btree_high_key_addr(cur, cur->bc_levels[level].ptr, block);
> +		pp = xfs_btree_ptr_addr(cur, cur->bc_levels[level].ptr, block);
>  
>  		ldiff = cur->bc_ops->diff_two_keys(cur, hkp, low_key);
>  		hdiff = cur->bc_ops->diff_two_keys(cur, high_key, lkp);
> @@ -4732,13 +4732,13 @@ xfs_btree_overlapped_query_range(
>  			if (error)
>  				goto out;
>  #endif
> -			cur->bc_ptrs[level] = 1;
> +			cur->bc_levels[level].ptr = 1;
>  			continue;
>  		} else if (hdiff < 0) {
>  			/* The low key is larger than the upper range; pop. */
>  			goto pop_up;
>  		}
> -		cur->bc_ptrs[level]++;
> +		cur->bc_levels[level].ptr++;
>  	}
>  
>  out:
> @@ -4749,13 +4749,13 @@ xfs_btree_overlapped_query_range(
>  	 * with a zero-results range query, so release the buffers if we
>  	 * failed to return any results.
>  	 */
> -	if (cur->bc_bufs[0] == NULL) {
> +	if (cur->bc_levels[0].bp == NULL) {
>  		for (i = 0; i < cur->bc_nlevels; i++) {
> -			if (cur->bc_bufs[i]) {
> -				xfs_trans_brelse(cur->bc_tp, cur->bc_bufs[i]);
> -				cur->bc_bufs[i] = NULL;
> -				cur->bc_ptrs[i] = 0;
> -				cur->bc_ra[i] = 0;
> +			if (cur->bc_levels[i].bp) {
> +				xfs_trans_brelse(cur->bc_tp, cur->bc_levels[i].bp);
> +				cur->bc_levels[i].bp = NULL;
> +				cur->bc_levels[i].ptr = 0;
> +				cur->bc_levels[i].ra = 0;
>  			}
>  		}
>  	}
> @@ -4917,7 +4917,7 @@ xfs_btree_has_more_records(
>  	block = xfs_btree_get_block(cur, 0, &bp);
>  
>  	/* There are still records in this block. */
> -	if (cur->bc_ptrs[0] < xfs_btree_get_numrecs(block))
> +	if (cur->bc_levels[0].ptr < xfs_btree_get_numrecs(block))
>  		return true;
>  
>  	/* There are more record blocks. */
> diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
> index 513ade4a89f8..827c44bf24dc 100644
> --- a/fs/xfs/libxfs/xfs_btree.h
> +++ b/fs/xfs/libxfs/xfs_btree.h
> @@ -212,6 +212,19 @@ struct xfs_btree_cur_ino {
>  #define	XFS_BTCUR_BMBT_INVALID_OWNER	(1 << 1)
>  };
>  
> +struct xfs_btree_level {
> +	/* buffer pointer */
> +	struct xfs_buf	*bp;
> +
> +	/* key/record number */
> +	unsigned int	ptr;
> +
> +	/* readahead info */
> +#define	XFS_BTCUR_LEFTRA	1	/* left sibling has been read-ahead */
> +#define	XFS_BTCUR_RIGHTRA	2	/* right sibling has been read-ahead */
> +	uint8_t		ra;
> +};
> +
>  /*
>   * Btree cursor structure.
>   * This collects all information needed by the btree code in one place.
> @@ -223,11 +236,6 @@ struct xfs_btree_cur
>  	const struct xfs_btree_ops *bc_ops;
>  	uint			bc_flags; /* btree features - below */
>  	union xfs_btree_irec	bc_rec;	/* current insert/search record value */
> -	struct xfs_buf	*bc_bufs[XFS_BTREE_MAXLEVELS];	/* buf ptr per level */
> -	int		bc_ptrs[XFS_BTREE_MAXLEVELS];	/* key/record # */
> -	uint8_t		bc_ra[XFS_BTREE_MAXLEVELS];	/* readahead bits */
> -#define	XFS_BTCUR_LEFTRA	1	/* left sibling has been read-ahead */
> -#define	XFS_BTCUR_RIGHTRA	2	/* right sibling has been read-ahead */
>  	uint8_t		bc_nlevels;	/* number of levels in the tree */
>  	uint8_t		bc_blocklog;	/* log2(blocksize) of btree blocks */
>  	xfs_btnum_t	bc_btnum;	/* identifies which btree type */
> @@ -243,8 +251,17 @@ struct xfs_btree_cur
>  		struct xfs_btree_cur_ag	bc_ag;
>  		struct xfs_btree_cur_ino bc_ino;
>  	};
> +
> +	/* Must be at the end of the struct! */
> +	struct xfs_btree_level	bc_levels[];
>  };
>  
> +static inline size_t xfs_btree_cur_sizeof(unsigned int nlevels)
> +{
> +	return sizeof(struct xfs_btree_cur) +
> +	       sizeof(struct xfs_btree_level) * (nlevels);
> +}
> +
>  /* cursor flags */
>  #define XFS_BTREE_LONG_PTRS		(1<<0)	/* pointers are 64bits long */
>  #define XFS_BTREE_ROOT_IN_INODE		(1<<1)	/* root may be variable size */
> @@ -258,7 +275,6 @@ struct xfs_btree_cur
>   */
>  #define XFS_BTREE_STAGING		(1<<5)
>  
> -
>  #define	XFS_BTREE_NOERROR	0
>  #define	XFS_BTREE_ERROR		1
>  
> diff --git a/fs/xfs/scrub/bitmap.c b/fs/xfs/scrub/bitmap.c
> index d6d24c866bc4..b8b8e871e3b7 100644
> --- a/fs/xfs/scrub/bitmap.c
> +++ b/fs/xfs/scrub/bitmap.c
> @@ -222,20 +222,20 @@ xbitmap_disunion(
>   * 1  2  3
>   *
>   * Pretend for this example that each leaf block has 100 btree records.  For
> - * the first btree record, we'll observe that bc_ptrs[0] == 1, so we record
> - * that we saw block 1.  Then we observe that bc_ptrs[1] == 1, so we record
> + * the first btree record, we'll observe that bc_levels[0].ptr == 1, so we record
> + * that we saw block 1.  Then we observe that bc_levels[1].ptr == 1, so we record
>   * block 4.  The list is [1, 4].
>   *
> - * For the second btree record, we see that bc_ptrs[0] == 2, so we exit the
> + * For the second btree record, we see that bc_levels[0].ptr == 2, so we exit the
>   * loop.  The list remains [1, 4].
>   *
>   * For the 101st btree record, we've moved onto leaf block 2.  Now
> - * bc_ptrs[0] == 1 again, so we record that we saw block 2.  We see that
> - * bc_ptrs[1] == 2, so we exit the loop.  The list is now [1, 4, 2].
> + * bc_levels[0].ptr == 1 again, so we record that we saw block 2.  We see that
> + * bc_levels[1].ptr == 2, so we exit the loop.  The list is now [1, 4, 2].
>   *
> - * For the 102nd record, bc_ptrs[0] == 2, so we continue.
> + * For the 102nd record, bc_levels[0].ptr == 2, so we continue.
>   *
> - * For the 201st record, we've moved on to leaf block 3.  bc_ptrs[0] == 1, so
> + * For the 201st record, we've moved on to leaf block 3.  bc_levels[0].ptr == 1, so
>   * we add 3 to the list.  Now it is [1, 4, 2, 3].
>   *
>   * For the 300th record we just exit, with the list being [1, 4, 2, 3].
> @@ -256,7 +256,7 @@ xbitmap_set_btcur_path(
>  	int			i;
>  	int			error;
>  
> -	for (i = 0; i < cur->bc_nlevels && cur->bc_ptrs[i] == 1; i++) {
> +	for (i = 0; i < cur->bc_nlevels && cur->bc_levels[i].ptr == 1; i++) {
>  		xfs_btree_get_block(cur, i, &bp);
>  		if (!bp)
>  			continue;
> diff --git a/fs/xfs/scrub/bmap.c b/fs/xfs/scrub/bmap.c
> index 017da9ceaee9..a4cbbc346f60 100644
> --- a/fs/xfs/scrub/bmap.c
> +++ b/fs/xfs/scrub/bmap.c
> @@ -402,7 +402,7 @@ xchk_bmapbt_rec(
>  	 * the root since the verifiers don't do that.
>  	 */
>  	if (xfs_has_crc(bs->cur->bc_mp) &&
> -	    bs->cur->bc_ptrs[0] == 1) {
> +	    bs->cur->bc_levels[0].ptr == 1) {
>  		for (i = 0; i < bs->cur->bc_nlevels - 1; i++) {
>  			block = xfs_btree_get_block(bs->cur, i, &bp);
>  			owner = be64_to_cpu(block->bb_u.l.bb_owner);
> diff --git a/fs/xfs/scrub/btree.c b/fs/xfs/scrub/btree.c
> index 7b7762ae22e5..5a453ce151ed 100644
> --- a/fs/xfs/scrub/btree.c
> +++ b/fs/xfs/scrub/btree.c
> @@ -136,7 +136,7 @@ xchk_btree_rec(
>  	struct xfs_buf		*bp;
>  
>  	block = xfs_btree_get_block(cur, 0, &bp);
> -	rec = xfs_btree_rec_addr(cur, cur->bc_ptrs[0], block);
> +	rec = xfs_btree_rec_addr(cur, cur->bc_levels[0].ptr, block);
>  
>  	trace_xchk_btree_rec(bs->sc, cur, 0);
>  
> @@ -153,7 +153,7 @@ xchk_btree_rec(
>  	/* Is this at least as large as the parent low key? */
>  	cur->bc_ops->init_key_from_rec(&key, rec);
>  	keyblock = xfs_btree_get_block(cur, 1, &bp);
> -	keyp = xfs_btree_key_addr(cur, cur->bc_ptrs[1], keyblock);
> +	keyp = xfs_btree_key_addr(cur, cur->bc_levels[1].ptr, keyblock);
>  	if (cur->bc_ops->diff_two_keys(cur, &key, keyp) < 0)
>  		xchk_btree_set_corrupt(bs->sc, cur, 1);
>  
> @@ -162,7 +162,7 @@ xchk_btree_rec(
>  
>  	/* Is this no larger than the parent high key? */
>  	cur->bc_ops->init_high_key_from_rec(&hkey, rec);
> -	keyp = xfs_btree_high_key_addr(cur, cur->bc_ptrs[1], keyblock);
> +	keyp = xfs_btree_high_key_addr(cur, cur->bc_levels[1].ptr, keyblock);
>  	if (cur->bc_ops->diff_two_keys(cur, keyp, &hkey) < 0)
>  		xchk_btree_set_corrupt(bs->sc, cur, 1);
>  }
> @@ -184,7 +184,7 @@ xchk_btree_key(
>  	struct xfs_buf		*bp;
>  
>  	block = xfs_btree_get_block(cur, level, &bp);
> -	key = xfs_btree_key_addr(cur, cur->bc_ptrs[level], block);
> +	key = xfs_btree_key_addr(cur, cur->bc_levels[level].ptr, block);
>  
>  	trace_xchk_btree_key(bs->sc, cur, level);
>  
> @@ -200,7 +200,7 @@ xchk_btree_key(
>  
>  	/* Is this at least as large as the parent low key? */
>  	keyblock = xfs_btree_get_block(cur, level + 1, &bp);
> -	keyp = xfs_btree_key_addr(cur, cur->bc_ptrs[level + 1], keyblock);
> +	keyp = xfs_btree_key_addr(cur, cur->bc_levels[level + 1].ptr, keyblock);
>  	if (cur->bc_ops->diff_two_keys(cur, key, keyp) < 0)
>  		xchk_btree_set_corrupt(bs->sc, cur, level);
>  
> @@ -208,8 +208,8 @@ xchk_btree_key(
>  		return;
>  
>  	/* Is this no larger than the parent high key? */
> -	key = xfs_btree_high_key_addr(cur, cur->bc_ptrs[level], block);
> -	keyp = xfs_btree_high_key_addr(cur, cur->bc_ptrs[level + 1], keyblock);
> +	key = xfs_btree_high_key_addr(cur, cur->bc_levels[level].ptr, block);
> +	keyp = xfs_btree_high_key_addr(cur, cur->bc_levels[level + 1].ptr, keyblock);
>  	if (cur->bc_ops->diff_two_keys(cur, keyp, key) < 0)
>  		xchk_btree_set_corrupt(bs->sc, cur, level);
>  }
> @@ -292,7 +292,7 @@ xchk_btree_block_check_sibling(
>  
>  	/* Compare upper level pointer to sibling pointer. */
>  	pblock = xfs_btree_get_block(ncur, level + 1, &pbp);
> -	pp = xfs_btree_ptr_addr(ncur, ncur->bc_ptrs[level + 1], pblock);
> +	pp = xfs_btree_ptr_addr(ncur, ncur->bc_levels[level + 1].ptr, pblock);
>  	if (!xchk_btree_ptr_ok(bs, level + 1, pp))
>  		goto out;
>  	if (pbp)
> @@ -597,7 +597,7 @@ xchk_btree_block_keys(
>  
>  	/* Obtain the parent's copy of the keys for this block. */
>  	parent_block = xfs_btree_get_block(cur, level + 1, &bp);
> -	parent_keys = xfs_btree_key_addr(cur, cur->bc_ptrs[level + 1],
> +	parent_keys = xfs_btree_key_addr(cur, cur->bc_levels[level + 1].ptr,
>  			parent_block);
>  
>  	if (cur->bc_ops->diff_two_keys(cur, &block_keys, parent_keys) != 0)
> @@ -608,7 +608,7 @@ xchk_btree_block_keys(
>  
>  	/* Get high keys */
>  	high_bk = xfs_btree_high_key_from_key(cur, &block_keys);
> -	high_pk = xfs_btree_high_key_addr(cur, cur->bc_ptrs[level + 1],
> +	high_pk = xfs_btree_high_key_addr(cur, cur->bc_levels[level + 1].ptr,
>  			parent_block);
>  
>  	if (cur->bc_ops->diff_two_keys(cur, high_bk, high_pk) != 0)
> @@ -672,18 +672,18 @@ xchk_btree(
>  	if (error || !block)
>  		goto out;
>  
> -	cur->bc_ptrs[level] = 1;
> +	cur->bc_levels[level].ptr = 1;
>  
>  	while (level < cur->bc_nlevels) {
>  		block = xfs_btree_get_block(cur, level, &bp);
>  
>  		if (level == 0) {
>  			/* End of leaf, pop back towards the root. */
> -			if (cur->bc_ptrs[level] >
> +			if (cur->bc_levels[level].ptr >
>  			    be16_to_cpu(block->bb_numrecs)) {
>  				xchk_btree_block_keys(bs, level, block);
>  				if (level < cur->bc_nlevels - 1)
> -					cur->bc_ptrs[level + 1]++;
> +					cur->bc_levels[level + 1].ptr++;
>  				level++;
>  				continue;
>  			}
> @@ -692,7 +692,7 @@ xchk_btree(
>  			xchk_btree_rec(bs);
>  
>  			/* Call out to the record checker. */
> -			recp = xfs_btree_rec_addr(cur, cur->bc_ptrs[0], block);
> +			recp = xfs_btree_rec_addr(cur, cur->bc_levels[0].ptr, block);
>  			error = bs->scrub_rec(bs, recp);
>  			if (error)
>  				break;
> @@ -700,15 +700,15 @@ xchk_btree(
>  			    (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT))
>  				break;
>  
> -			cur->bc_ptrs[level]++;
> +			cur->bc_levels[level].ptr++;
>  			continue;
>  		}
>  
>  		/* End of node, pop back towards the root. */
> -		if (cur->bc_ptrs[level] > be16_to_cpu(block->bb_numrecs)) {
> +		if (cur->bc_levels[level].ptr > be16_to_cpu(block->bb_numrecs)) {
>  			xchk_btree_block_keys(bs, level, block);
>  			if (level < cur->bc_nlevels - 1)
> -				cur->bc_ptrs[level + 1]++;
> +				cur->bc_levels[level + 1].ptr++;
>  			level++;
>  			continue;
>  		}
> @@ -717,9 +717,9 @@ xchk_btree(
>  		xchk_btree_key(bs, level);
>  
>  		/* Drill another level deeper. */
> -		pp = xfs_btree_ptr_addr(cur, cur->bc_ptrs[level], block);
> +		pp = xfs_btree_ptr_addr(cur, cur->bc_levels[level].ptr, block);
>  		if (!xchk_btree_ptr_ok(bs, level, pp)) {
> -			cur->bc_ptrs[level]++;
> +			cur->bc_levels[level].ptr++;
>  			continue;
>  		}
>  		level--;
> @@ -727,7 +727,7 @@ xchk_btree(
>  		if (error || !block)
>  			goto out;
>  
> -		cur->bc_ptrs[level] = 1;
> +		cur->bc_levels[level].ptr = 1;
>  	}
>  
>  out:
> diff --git a/fs/xfs/scrub/trace.c b/fs/xfs/scrub/trace.c
> index c0ef53fe6611..816dfc8e5a80 100644
> --- a/fs/xfs/scrub/trace.c
> +++ b/fs/xfs/scrub/trace.c
> @@ -21,10 +21,11 @@ xchk_btree_cur_fsbno(
>  	struct xfs_btree_cur	*cur,
>  	int			level)
>  {
> -	if (level < cur->bc_nlevels && cur->bc_bufs[level])
> +	if (level < cur->bc_nlevels && cur->bc_levels[level].bp)
>  		return XFS_DADDR_TO_FSB(cur->bc_mp,
> -				xfs_buf_daddr(cur->bc_bufs[level]));
> -	if (level == cur->bc_nlevels - 1 && cur->bc_flags & XFS_BTREE_LONG_PTRS)
> +				xfs_buf_daddr(cur->bc_levels[level].bp));
> +	else if (level == cur->bc_nlevels - 1 &&
> +		 cur->bc_flags & XFS_BTREE_LONG_PTRS)
>  		return XFS_INO_TO_FSB(cur->bc_mp, cur->bc_ino.ip->i_ino);
>  	if (!(cur->bc_flags & XFS_BTREE_LONG_PTRS))
>  		return XFS_AGB_TO_FSB(cur->bc_mp, cur->bc_ag.pag->pag_agno, 0);
> diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
> index a7bbb84f91a7..93ece6df02e3 100644
> --- a/fs/xfs/scrub/trace.h
> +++ b/fs/xfs/scrub/trace.h
> @@ -348,7 +348,7 @@ TRACE_EVENT(xchk_btree_op_error,
>  		__entry->level = level;
>  		__entry->agno = XFS_FSB_TO_AGNO(cur->bc_mp, fsbno);
>  		__entry->bno = XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno);
> -		__entry->ptr = cur->bc_ptrs[level];
> +		__entry->ptr = cur->bc_levels[level].ptr;
>  		__entry->error = error;
>  		__entry->ret_ip = ret_ip;
>  	),
> @@ -389,7 +389,7 @@ TRACE_EVENT(xchk_ifork_btree_op_error,
>  		__entry->type = sc->sm->sm_type;
>  		__entry->btnum = cur->bc_btnum;
>  		__entry->level = level;
> -		__entry->ptr = cur->bc_ptrs[level];
> +		__entry->ptr = cur->bc_levels[level].ptr;
>  		__entry->agno = XFS_FSB_TO_AGNO(cur->bc_mp, fsbno);
>  		__entry->bno = XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno);
>  		__entry->error = error;
> @@ -431,7 +431,7 @@ TRACE_EVENT(xchk_btree_error,
>  		__entry->level = level;
>  		__entry->agno = XFS_FSB_TO_AGNO(cur->bc_mp, fsbno);
>  		__entry->bno = XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno);
> -		__entry->ptr = cur->bc_ptrs[level];
> +		__entry->ptr = cur->bc_levels[level].ptr;
>  		__entry->ret_ip = ret_ip;
>  	),
>  	TP_printk("dev %d:%d type %s btree %s level %d ptr %d agno 0x%x agbno 0x%x ret_ip %pS",
> @@ -471,7 +471,7 @@ TRACE_EVENT(xchk_ifork_btree_error,
>  		__entry->level = level;
>  		__entry->agno = XFS_FSB_TO_AGNO(cur->bc_mp, fsbno);
>  		__entry->bno = XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno);
> -		__entry->ptr = cur->bc_ptrs[level];
> +		__entry->ptr = cur->bc_levels[level].ptr;
>  		__entry->ret_ip = ret_ip;
>  	),
>  	TP_printk("dev %d:%d ino 0x%llx fork %s type %s btree %s level %d ptr %d agno 0x%x agbno 0x%x ret_ip %pS",
> @@ -511,7 +511,7 @@ DECLARE_EVENT_CLASS(xchk_sbtree_class,
>  		__entry->bno = XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno);
>  		__entry->level = level;
>  		__entry->nlevels = cur->bc_nlevels;
> -		__entry->ptr = cur->bc_ptrs[level];
> +		__entry->ptr = cur->bc_levels[level].ptr;
>  	),
>  	TP_printk("dev %d:%d type %s btree %s agno 0x%x agbno 0x%x level %d nlevels %d ptr %d",
>  		  MAJOR(__entry->dev), MINOR(__entry->dev),
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index c4e0cd1c1c8c..30bae0657343 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1966,7 +1966,7 @@ xfs_init_zones(void)
>  		goto out_destroy_log_ticket_zone;
>  
>  	xfs_btree_cur_zone = kmem_cache_create("xfs_btree_cur",
> -					       sizeof(struct xfs_btree_cur),
> +				xfs_btree_cur_sizeof(XFS_BTREE_MAXLEVELS),
>  					       0, 0, NULL);
>  	if (!xfs_btree_cur_zone)
>  		goto out_destroy_bmap_free_item_zone;
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index 1033a95fbf8e..4a8076ef8cb4 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -2476,7 +2476,7 @@ DECLARE_EVENT_CLASS(xfs_btree_cur_class,
>  		__entry->btnum = cur->bc_btnum;
>  		__entry->level = level;
>  		__entry->nlevels = cur->bc_nlevels;
> -		__entry->ptr = cur->bc_ptrs[level];
> +		__entry->ptr = cur->bc_levels[level].ptr;
>  		__entry->daddr = bp ? xfs_buf_daddr(bp) : -1;
>  	),
>  	TP_printk("dev %d:%d btree %s level %d/%d ptr %d daddr 0x%llx",


-- 
chandan

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 08/14] xfs: refactor btree cursor allocation function
  2021-09-18  1:29 ` [PATCH 08/14] xfs: refactor btree cursor allocation function Darrick J. Wong
@ 2021-09-20  9:55   ` Chandan Babu R
  2021-09-21  8:53   ` Christoph Hellwig
  1 sibling, 0 replies; 48+ messages in thread
From: Chandan Babu R @ 2021-09-20  9:55 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: chandanrlinux, linux-xfs

On 18 Sep 2021 at 06:59, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> Refactor btree allocation to a common helper.
>

Looks good.

Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>

> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/libxfs/xfs_alloc_btree.c    |    7 +------
>  fs/xfs/libxfs/xfs_bmap_btree.c     |    7 +------
>  fs/xfs/libxfs/xfs_btree.c          |   18 ++++++++++++++++++
>  fs/xfs/libxfs/xfs_btree.h          |    2 ++
>  fs/xfs/libxfs/xfs_ialloc_btree.c   |    7 +------
>  fs/xfs/libxfs/xfs_refcount_btree.c |    6 +-----
>  fs/xfs/libxfs/xfs_rmap_btree.c     |    6 +-----
>  7 files changed, 25 insertions(+), 28 deletions(-)
>
>
> diff --git a/fs/xfs/libxfs/xfs_alloc_btree.c b/fs/xfs/libxfs/xfs_alloc_btree.c
> index 6746fd735550..c644b11132f6 100644
> --- a/fs/xfs/libxfs/xfs_alloc_btree.c
> +++ b/fs/xfs/libxfs/xfs_alloc_btree.c
> @@ -477,12 +477,7 @@ xfs_allocbt_init_common(
>  
>  	ASSERT(btnum == XFS_BTNUM_BNO || btnum == XFS_BTNUM_CNT);
>  
> -	cur = kmem_cache_zalloc(xfs_btree_cur_zone, GFP_NOFS | __GFP_NOFAIL);
> -
> -	cur->bc_tp = tp;
> -	cur->bc_mp = mp;
> -	cur->bc_btnum = btnum;
> -	cur->bc_blocklog = mp->m_sb.sb_blocklog;
> +	cur = xfs_btree_alloc_cursor(mp, tp, btnum);
>  	cur->bc_ag.abt.active = false;
>  
>  	if (btnum == XFS_BTNUM_CNT) {
> diff --git a/fs/xfs/libxfs/xfs_bmap_btree.c b/fs/xfs/libxfs/xfs_bmap_btree.c
> index 72444b8b38a6..a06987e36db5 100644
> --- a/fs/xfs/libxfs/xfs_bmap_btree.c
> +++ b/fs/xfs/libxfs/xfs_bmap_btree.c
> @@ -552,13 +552,8 @@ xfs_bmbt_init_cursor(
>  	struct xfs_btree_cur	*cur;
>  	ASSERT(whichfork != XFS_COW_FORK);
>  
> -	cur = kmem_cache_zalloc(xfs_btree_cur_zone, GFP_NOFS | __GFP_NOFAIL);
> -
> -	cur->bc_tp = tp;
> -	cur->bc_mp = mp;
> +	cur = xfs_btree_alloc_cursor(mp, tp, XFS_BTNUM_BMAP);
>  	cur->bc_nlevels = be16_to_cpu(ifp->if_broot->bb_level) + 1;
> -	cur->bc_btnum = XFS_BTNUM_BMAP;
> -	cur->bc_blocklog = mp->m_sb.sb_blocklog;
>  	cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_bmbt_2);
>  
>  	cur->bc_ops = &xfs_bmbt_ops;
> diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
> index 93fb50516bc2..70785004414e 100644
> --- a/fs/xfs/libxfs/xfs_btree.c
> +++ b/fs/xfs/libxfs/xfs_btree.c
> @@ -4926,3 +4926,21 @@ xfs_btree_has_more_records(
>  	else
>  		return block->bb_u.s.bb_rightsib != cpu_to_be32(NULLAGBLOCK);
>  }
> +
> +/* Allocate a new btree cursor of the appropriate size. */
> +struct xfs_btree_cur *
> +xfs_btree_alloc_cursor(
> +	struct xfs_mount	*mp,
> +	struct xfs_trans	*tp,
> +	xfs_btnum_t		btnum)
> +{
> +	struct xfs_btree_cur	*cur;
> +
> +	cur = kmem_cache_zalloc(xfs_btree_cur_zone, GFP_NOFS | __GFP_NOFAIL);
> +	cur->bc_tp = tp;
> +	cur->bc_mp = mp;
> +	cur->bc_btnum = btnum;
> +	cur->bc_blocklog = mp->m_sb.sb_blocklog;
> +
> +	return cur;
> +}
> diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
> index 827c44bf24dc..6540c4957c36 100644
> --- a/fs/xfs/libxfs/xfs_btree.h
> +++ b/fs/xfs/libxfs/xfs_btree.h
> @@ -573,5 +573,7 @@ void xfs_btree_copy_ptrs(struct xfs_btree_cur *cur,
>  void xfs_btree_copy_keys(struct xfs_btree_cur *cur,
>  		union xfs_btree_key *dst_key,
>  		const union xfs_btree_key *src_key, int numkeys);
> +struct xfs_btree_cur *xfs_btree_alloc_cursor(struct xfs_mount *mp,
> +		struct xfs_trans *tp, xfs_btnum_t btnum);
>  
>  #endif	/* __XFS_BTREE_H__ */
> diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
> index 27190840c5d8..c8fea6a464d5 100644
> --- a/fs/xfs/libxfs/xfs_ialloc_btree.c
> +++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
> @@ -432,10 +432,7 @@ xfs_inobt_init_common(
>  {
>  	struct xfs_btree_cur	*cur;
>  
> -	cur = kmem_cache_zalloc(xfs_btree_cur_zone, GFP_NOFS | __GFP_NOFAIL);
> -	cur->bc_tp = tp;
> -	cur->bc_mp = mp;
> -	cur->bc_btnum = btnum;
> +	cur = xfs_btree_alloc_cursor(mp, tp, btnum);
>  	if (btnum == XFS_BTNUM_INO) {
>  		cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_ibt_2);
>  		cur->bc_ops = &xfs_inobt_ops;
> @@ -444,8 +441,6 @@ xfs_inobt_init_common(
>  		cur->bc_ops = &xfs_finobt_ops;
>  	}
>  
> -	cur->bc_blocklog = mp->m_sb.sb_blocklog;
> -
>  	if (xfs_has_crc(mp))
>  		cur->bc_flags |= XFS_BTREE_CRC_BLOCKS;
>  
> diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c
> index 1ef9b99962ab..48c45e31d897 100644
> --- a/fs/xfs/libxfs/xfs_refcount_btree.c
> +++ b/fs/xfs/libxfs/xfs_refcount_btree.c
> @@ -322,11 +322,7 @@ xfs_refcountbt_init_common(
>  
>  	ASSERT(pag->pag_agno < mp->m_sb.sb_agcount);
>  
> -	cur = kmem_cache_zalloc(xfs_btree_cur_zone, GFP_NOFS | __GFP_NOFAIL);
> -	cur->bc_tp = tp;
> -	cur->bc_mp = mp;
> -	cur->bc_btnum = XFS_BTNUM_REFC;
> -	cur->bc_blocklog = mp->m_sb.sb_blocklog;
> +	cur = xfs_btree_alloc_cursor(mp, tp, XFS_BTNUM_REFC);
>  	cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_refcbt_2);
>  
>  	cur->bc_flags |= XFS_BTREE_CRC_BLOCKS;
> diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
> index b7dbbfb3aeed..f3c4d0965cc9 100644
> --- a/fs/xfs/libxfs/xfs_rmap_btree.c
> +++ b/fs/xfs/libxfs/xfs_rmap_btree.c
> @@ -451,13 +451,9 @@ xfs_rmapbt_init_common(
>  {
>  	struct xfs_btree_cur	*cur;
>  
> -	cur = kmem_cache_zalloc(xfs_btree_cur_zone, GFP_NOFS | __GFP_NOFAIL);
> -	cur->bc_tp = tp;
> -	cur->bc_mp = mp;
>  	/* Overlapping btree; 2 keys per pointer. */
> -	cur->bc_btnum = XFS_BTNUM_RMAP;
> +	cur = xfs_btree_alloc_cursor(mp, tp, XFS_BTNUM_RMAP);
>  	cur->bc_flags = XFS_BTREE_CRC_BLOCKS | XFS_BTREE_OVERLAPPING;
> -	cur->bc_blocklog = mp->m_sb.sb_blocklog;
>  	cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_rmap_2);
>  	cur->bc_ops = &xfs_rmapbt_ops;
>  


-- 
chandan


* Re: [PATCH 09/14] xfs: fix maxlevels comparisons in the btree staging code
  2021-09-18  1:29 ` [PATCH 09/14] xfs: fix maxlevels comparisons in the btree staging code Darrick J. Wong
@ 2021-09-20  9:55   ` Chandan Babu R
  2021-09-21  8:56   ` Christoph Hellwig
  1 sibling, 0 replies; 48+ messages in thread
From: Chandan Babu R @ 2021-09-20  9:55 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: chandanrlinux, linux-xfs

On 18 Sep 2021 at 06:59, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> The btree geometry computation function has an off-by-one error in that
> it does not allow maximally tall btrees (nlevels == XFS_BTREE_MAXLEVELS).
> This can result in repairs failing unnecessarily on very fragmented
> filesystems.  Subsequent patches to remove MAXLEVELS usage in favor of
> the per-btree type computations will make this a much more likely
> occurrence.

Looks good.

Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>

>
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/libxfs/xfs_btree_staging.c |    4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
>
> diff --git a/fs/xfs/libxfs/xfs_btree_staging.c b/fs/xfs/libxfs/xfs_btree_staging.c
> index 26143297bb7b..cc56efc2b90a 100644
> --- a/fs/xfs/libxfs/xfs_btree_staging.c
> +++ b/fs/xfs/libxfs/xfs_btree_staging.c
> @@ -662,7 +662,7 @@ xfs_btree_bload_compute_geometry(
>  	xfs_btree_bload_ensure_slack(cur, &bbl->node_slack, 1);
>  
>  	bbl->nr_records = nr_this_level = nr_records;
> -	for (cur->bc_nlevels = 1; cur->bc_nlevels < XFS_BTREE_MAXLEVELS;) {
> +	for (cur->bc_nlevels = 1; cur->bc_nlevels <= XFS_BTREE_MAXLEVELS;) {
>  		uint64_t	level_blocks;
>  		uint64_t	dontcare64;
>  		unsigned int	level = cur->bc_nlevels - 1;
> @@ -726,7 +726,7 @@ xfs_btree_bload_compute_geometry(
>  		nr_this_level = level_blocks;
>  	}
>  
> -	if (cur->bc_nlevels == XFS_BTREE_MAXLEVELS)
> +	if (cur->bc_nlevels > XFS_BTREE_MAXLEVELS)
>  		return -EOVERFLOW;
>  
>  	bbl->btree_height = cur->bc_nlevels;


-- 
chandan

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 10/14] xfs: encode the max btree height in the cursor
  2021-09-18  1:30 ` [PATCH 10/14] xfs: encode the max btree height in the cursor Darrick J. Wong
@ 2021-09-20  9:55   ` Chandan Babu R
  2021-09-21  8:57   ` Christoph Hellwig
  1 sibling, 0 replies; 48+ messages in thread
From: Chandan Babu R @ 2021-09-20  9:55 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: chandanrlinux, linux-xfs

On 18 Sep 2021 at 07:00, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> Encode the maximum btree height in the cursor, since we're soon going to
> allow smaller cursors for AG btrees and larger cursors for file btrees.
>

Looks good.

Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>

> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/libxfs/xfs_bmap.c          |    2 +-
>  fs/xfs/libxfs/xfs_btree.c         |    5 +++--
>  fs/xfs/libxfs/xfs_btree.h         |    3 ++-
>  fs/xfs/libxfs/xfs_btree_staging.c |   10 +++++-----
>  4 files changed, 11 insertions(+), 9 deletions(-)
>
>
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index 644b956301b6..2ae5bf9a74e7 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -239,7 +239,7 @@ xfs_bmap_get_bp(
>  	if (!cur)
>  		return NULL;
>  
> -	for (i = 0; i < XFS_BTREE_MAXLEVELS; i++) {
> +	for (i = 0; i < cur->bc_maxlevels; i++) {
>  		if (!cur->bc_levels[i].bp)
>  			break;
>  		if (xfs_buf_daddr(cur->bc_levels[i].bp) == bno)
> diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
> index 70785004414e..2486ba22c01d 100644
> --- a/fs/xfs/libxfs/xfs_btree.c
> +++ b/fs/xfs/libxfs/xfs_btree.c
> @@ -2933,7 +2933,7 @@ xfs_btree_new_iroot(
>  	be16_add_cpu(&block->bb_level, 1);
>  	xfs_btree_set_numrecs(block, 1);
>  	cur->bc_nlevels++;
> -	ASSERT(cur->bc_nlevels <= XFS_BTREE_MAXLEVELS);
> +	ASSERT(cur->bc_nlevels <= cur->bc_maxlevels);
>  	cur->bc_levels[level + 1].ptr = 1;
>  
>  	kp = xfs_btree_key_addr(cur, 1, block);
> @@ -3097,7 +3097,7 @@ xfs_btree_new_root(
>  	xfs_btree_setbuf(cur, cur->bc_nlevels, nbp);
>  	cur->bc_levels[cur->bc_nlevels].ptr = nptr;
>  	cur->bc_nlevels++;
> -	ASSERT(cur->bc_nlevels <= XFS_BTREE_MAXLEVELS);
> +	ASSERT(cur->bc_nlevels <= cur->bc_maxlevels);
>  	*stat = 1;
>  	return 0;
>  error0:
> @@ -4941,6 +4941,7 @@ xfs_btree_alloc_cursor(
>  	cur->bc_mp = mp;
>  	cur->bc_btnum = btnum;
>  	cur->bc_blocklog = mp->m_sb.sb_blocklog;
> +	cur->bc_maxlevels = XFS_BTREE_MAXLEVELS;
>  
>  	return cur;
>  }
> diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
> index 6540c4957c36..6075918efa0c 100644
> --- a/fs/xfs/libxfs/xfs_btree.h
> +++ b/fs/xfs/libxfs/xfs_btree.h
> @@ -235,9 +235,10 @@ struct xfs_btree_cur
>  	struct xfs_mount	*bc_mp;	/* file system mount struct */
>  	const struct xfs_btree_ops *bc_ops;
>  	uint			bc_flags; /* btree features - below */
> -	union xfs_btree_irec	bc_rec;	/* current insert/search record value */
> +	uint8_t		bc_maxlevels;	/* maximum levels for this btree type */
>  	uint8_t		bc_nlevels;	/* number of levels in the tree */
>  	uint8_t		bc_blocklog;	/* log2(blocksize) of btree blocks */
> +	union xfs_btree_irec	bc_rec;	/* current insert/search record value */
>  	xfs_btnum_t	bc_btnum;	/* identifies which btree type */
>  	int		bc_statoff;	/* offset of btre stats array */
>  
> diff --git a/fs/xfs/libxfs/xfs_btree_staging.c b/fs/xfs/libxfs/xfs_btree_staging.c
> index cc56efc2b90a..dd75e208b543 100644
> --- a/fs/xfs/libxfs/xfs_btree_staging.c
> +++ b/fs/xfs/libxfs/xfs_btree_staging.c
> @@ -657,12 +657,12 @@ xfs_btree_bload_compute_geometry(
>  	 * checking levels 0 and 1 here, so set bc_nlevels such that the btree
>  	 * code doesn't interpret either as the root level.
>  	 */
> -	cur->bc_nlevels = XFS_BTREE_MAXLEVELS - 1;
> +	cur->bc_nlevels = cur->bc_maxlevels - 1;
>  	xfs_btree_bload_ensure_slack(cur, &bbl->leaf_slack, 0);
>  	xfs_btree_bload_ensure_slack(cur, &bbl->node_slack, 1);
>  
>  	bbl->nr_records = nr_this_level = nr_records;
> -	for (cur->bc_nlevels = 1; cur->bc_nlevels <= XFS_BTREE_MAXLEVELS;) {
> +	for (cur->bc_nlevels = 1; cur->bc_nlevels <= cur->bc_maxlevels;) {
>  		uint64_t	level_blocks;
>  		uint64_t	dontcare64;
>  		unsigned int	level = cur->bc_nlevels - 1;
> @@ -703,7 +703,7 @@ xfs_btree_bload_compute_geometry(
>  			 * block-based btree level.
>  			 */
>  			cur->bc_nlevels++;
> -			ASSERT(cur->bc_nlevels <= XFS_BTREE_MAXLEVELS);
> +			ASSERT(cur->bc_nlevels <= cur->bc_maxlevels);
>  			xfs_btree_bload_level_geometry(cur, bbl, level,
>  					nr_this_level, &avg_per_block,
>  					&level_blocks, &dontcare64);
> @@ -719,14 +719,14 @@ xfs_btree_bload_compute_geometry(
>  
>  			/* Otherwise, we need another level of btree. */
>  			cur->bc_nlevels++;
> -			ASSERT(cur->bc_nlevels <= XFS_BTREE_MAXLEVELS);
> +			ASSERT(cur->bc_nlevels <= cur->bc_maxlevels);
>  		}
>  
>  		nr_blocks += level_blocks;
>  		nr_this_level = level_blocks;
>  	}
>  
> -	if (cur->bc_nlevels > XFS_BTREE_MAXLEVELS)
> +	if (cur->bc_nlevels > cur->bc_maxlevels)
>  		return -EOVERFLOW;
>  
>  	bbl->btree_height = cur->bc_nlevels;


-- 
chandan


* Re: [PATCH 11/14] xfs: dynamically allocate cursors based on maxlevels
  2021-09-18  1:30 ` [PATCH 11/14] xfs: dynamically allocate cursors based on maxlevels Darrick J. Wong
@ 2021-09-20  9:56   ` Chandan Babu R
  2021-09-20 23:06   ` Dave Chinner
  1 sibling, 0 replies; 48+ messages in thread
From: Chandan Babu R @ 2021-09-20  9:56 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: chandanrlinux, linux-xfs

On 18 Sep 2021 at 07:00, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> Replace the statically-sized btree cursor zone with dynamically sized
> allocations so that we can reduce the memory overhead for per-AG bt
> cursors while handling very tall btrees for rt metadata.
>

Looks good.

Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>

> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/libxfs/xfs_btree.c |   40 ++++++++++++++++++++++++++++++++--------
>  fs/xfs/libxfs/xfs_btree.h |    2 --
>  fs/xfs/xfs_super.c        |   11 +----------
>  3 files changed, 33 insertions(+), 20 deletions(-)
>
>
> diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
> index 2486ba22c01d..f9516828a847 100644
> --- a/fs/xfs/libxfs/xfs_btree.c
> +++ b/fs/xfs/libxfs/xfs_btree.c
> @@ -23,11 +23,6 @@
>  #include "xfs_btree_staging.h"
>  #include "xfs_ag.h"
>  
> -/*
> - * Cursor allocation zone.
> - */
> -kmem_zone_t	*xfs_btree_cur_zone;
> -
>  /*
>   * Btree magic numbers.
>   */
> @@ -379,7 +374,7 @@ xfs_btree_del_cursor(
>  		kmem_free(cur->bc_ops);
>  	if (!(cur->bc_flags & XFS_BTREE_LONG_PTRS) && cur->bc_ag.pag)
>  		xfs_perag_put(cur->bc_ag.pag);
> -	kmem_cache_free(xfs_btree_cur_zone, cur);
> +	kmem_free(cur);
>  }
>  
>  /*
> @@ -4927,6 +4922,32 @@ xfs_btree_has_more_records(
>  		return block->bb_u.s.bb_rightsib != cpu_to_be32(NULLAGBLOCK);
>  }
>  
> +/* Compute the maximum allowed height for a given btree type. */
> +static unsigned int
> +xfs_btree_maxlevels(
> +	struct xfs_mount	*mp,
> +	xfs_btnum_t		btnum)
> +{
> +	switch (btnum) {
> +	case XFS_BTNUM_BNO:
> +	case XFS_BTNUM_CNT:
> +		return mp->m_ag_maxlevels;
> +	case XFS_BTNUM_BMAP:
> +		return max(mp->m_bm_maxlevels[XFS_DATA_FORK],
> +			   mp->m_bm_maxlevels[XFS_ATTR_FORK]);
> +	case XFS_BTNUM_INO:
> +	case XFS_BTNUM_FINO:
> +		return M_IGEO(mp)->inobt_maxlevels;
> +	case XFS_BTNUM_RMAP:
> +		return mp->m_rmap_maxlevels;
> +	case XFS_BTNUM_REFC:
> +		return mp->m_refc_maxlevels;
> +	default:
> +		ASSERT(0);
> +		return XFS_BTREE_MAXLEVELS;
> +	}
> +}
> +
>  /* Allocate a new btree cursor of the appropriate size. */
>  struct xfs_btree_cur *
>  xfs_btree_alloc_cursor(
> @@ -4935,13 +4956,16 @@ xfs_btree_alloc_cursor(
>  	xfs_btnum_t		btnum)
>  {
>  	struct xfs_btree_cur	*cur;
> +	unsigned int		maxlevels = xfs_btree_maxlevels(mp, btnum);
>  
> -	cur = kmem_cache_zalloc(xfs_btree_cur_zone, GFP_NOFS | __GFP_NOFAIL);
> +	ASSERT(maxlevels <= XFS_BTREE_MAXLEVELS);
> +
> +	cur = kmem_zalloc(xfs_btree_cur_sizeof(maxlevels), KM_NOFS);
>  	cur->bc_tp = tp;
>  	cur->bc_mp = mp;
>  	cur->bc_btnum = btnum;
>  	cur->bc_blocklog = mp->m_sb.sb_blocklog;
> -	cur->bc_maxlevels = XFS_BTREE_MAXLEVELS;
> +	cur->bc_maxlevels = maxlevels;
>  
>  	return cur;
>  }
> diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
> index 6075918efa0c..ae83fbf58c18 100644
> --- a/fs/xfs/libxfs/xfs_btree.h
> +++ b/fs/xfs/libxfs/xfs_btree.h
> @@ -13,8 +13,6 @@ struct xfs_trans;
>  struct xfs_ifork;
>  struct xfs_perag;
>  
> -extern kmem_zone_t	*xfs_btree_cur_zone;
> -
>  /*
>   * Generic key, ptr and record wrapper structures.
>   *
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 30bae0657343..25a548bbb0b2 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1965,17 +1965,11 @@ xfs_init_zones(void)
>  	if (!xfs_bmap_free_item_zone)
>  		goto out_destroy_log_ticket_zone;
>  
> -	xfs_btree_cur_zone = kmem_cache_create("xfs_btree_cur",
> -				xfs_btree_cur_sizeof(XFS_BTREE_MAXLEVELS),
> -					       0, 0, NULL);
> -	if (!xfs_btree_cur_zone)
> -		goto out_destroy_bmap_free_item_zone;
> -
>  	xfs_da_state_zone = kmem_cache_create("xfs_da_state",
>  					      sizeof(struct xfs_da_state),
>  					      0, 0, NULL);
>  	if (!xfs_da_state_zone)
> -		goto out_destroy_btree_cur_zone;
> +		goto out_destroy_bmap_free_item_zone;
>  
>  	xfs_ifork_zone = kmem_cache_create("xfs_ifork",
>  					   sizeof(struct xfs_ifork),
> @@ -2105,8 +2099,6 @@ xfs_init_zones(void)
>  	kmem_cache_destroy(xfs_ifork_zone);
>   out_destroy_da_state_zone:
>  	kmem_cache_destroy(xfs_da_state_zone);
> - out_destroy_btree_cur_zone:
> -	kmem_cache_destroy(xfs_btree_cur_zone);
>   out_destroy_bmap_free_item_zone:
>  	kmem_cache_destroy(xfs_bmap_free_item_zone);
>   out_destroy_log_ticket_zone:
> @@ -2138,7 +2130,6 @@ xfs_destroy_zones(void)
>  	kmem_cache_destroy(xfs_trans_zone);
>  	kmem_cache_destroy(xfs_ifork_zone);
>  	kmem_cache_destroy(xfs_da_state_zone);
> -	kmem_cache_destroy(xfs_btree_cur_zone);
>  	kmem_cache_destroy(xfs_bmap_free_item_zone);
>  	kmem_cache_destroy(xfs_log_ticket_zone);
>  }


-- 
chandan

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 12/14] xfs: compute actual maximum btree height for critical reservation calculation
  2021-09-18  1:30 ` [PATCH 12/14] xfs: compute actual maximum btree height for critical reservation calculation Darrick J. Wong
@ 2021-09-20  9:56   ` Chandan Babu R
  0 siblings, 0 replies; 48+ messages in thread
From: Chandan Babu R @ 2021-09-20  9:56 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: chandanrlinux, linux-xfs

On 18 Sep 2021 at 07:00, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> Compute the actual maximum btree height when deciding if per-AG block
> reservation is critically low.  This only affects the sanity check
> condition, since we /generally/ will trigger on the 10% threshold.
> This is a long-winded way of saying that we're removing one more
> usage of XFS_BTREE_MAXLEVELS.
>

Looks good.

Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>

> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/libxfs/xfs_ag_resv.c |    4 +++-
>  fs/xfs/libxfs/xfs_btree.c   |   19 +++++++++++++++----
>  fs/xfs/libxfs/xfs_btree.h   |    1 +
>  3 files changed, 19 insertions(+), 5 deletions(-)
>
>
> diff --git a/fs/xfs/libxfs/xfs_ag_resv.c b/fs/xfs/libxfs/xfs_ag_resv.c
> index 2aa2b3484c28..931481fbdd72 100644
> --- a/fs/xfs/libxfs/xfs_ag_resv.c
> +++ b/fs/xfs/libxfs/xfs_ag_resv.c
> @@ -72,6 +72,7 @@ xfs_ag_resv_critical(
>  {
>  	xfs_extlen_t			avail;
>  	xfs_extlen_t			orig;
> +	xfs_extlen_t			btree_maxlevels;
>  
>  	switch (type) {
>  	case XFS_AG_RESV_METADATA:
> @@ -91,7 +92,8 @@ xfs_ag_resv_critical(
>  	trace_xfs_ag_resv_critical(pag, type, avail);
>  
>  	/* Critically low if less than 10% or max btree height remains. */
> -	return XFS_TEST_ERROR(avail < orig / 10 || avail < XFS_BTREE_MAXLEVELS,
> +	btree_maxlevels = xfs_btree_maxlevels(pag->pag_mount, XFS_BTNUM_MAX);
> +	return XFS_TEST_ERROR(avail < orig / 10 || avail < btree_maxlevels,
>  			pag->pag_mount, XFS_ERRTAG_AG_RESV_CRITICAL);
>  }
>  
> diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
> index f9516828a847..6cf49f7e1299 100644
> --- a/fs/xfs/libxfs/xfs_btree.c
> +++ b/fs/xfs/libxfs/xfs_btree.c
> @@ -4922,12 +4922,17 @@ xfs_btree_has_more_records(
>  		return block->bb_u.s.bb_rightsib != cpu_to_be32(NULLAGBLOCK);
>  }
>  
> -/* Compute the maximum allowed height for a given btree type. */
> -static unsigned int
> +/*
> + * Compute the maximum allowed height for a given btree type.  If XFS_BTNUM_MAX
> + * is passed in, the maximum allowed height for all btree types is returned.
> + */
> +unsigned int
>  xfs_btree_maxlevels(
>  	struct xfs_mount	*mp,
>  	xfs_btnum_t		btnum)
>  {
> +	unsigned int		ret;
> +
>  	switch (btnum) {
>  	case XFS_BTNUM_BNO:
>  	case XFS_BTNUM_CNT:
> @@ -4943,9 +4948,15 @@ xfs_btree_maxlevels(
>  	case XFS_BTNUM_REFC:
>  		return mp->m_refc_maxlevels;
>  	default:
> -		ASSERT(0);
> -		return XFS_BTREE_MAXLEVELS;
> +		break;
>  	}
> +
> +	ret = mp->m_ag_maxlevels;
> +	ret = max(ret, mp->m_bm_maxlevels[XFS_DATA_FORK]);
> +	ret = max(ret, mp->m_bm_maxlevels[XFS_ATTR_FORK]);
> +	ret = max(ret, M_IGEO(mp)->inobt_maxlevels);
> +	ret = max(ret, mp->m_rmap_maxlevels);
> +	return max(ret, mp->m_refc_maxlevels);
>  }
>  
>  /* Allocate a new btree cursor of the appropriate size. */
> diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
> index ae83fbf58c18..106760c540c7 100644
> --- a/fs/xfs/libxfs/xfs_btree.h
> +++ b/fs/xfs/libxfs/xfs_btree.h
> @@ -574,5 +574,6 @@ void xfs_btree_copy_keys(struct xfs_btree_cur *cur,
>  		const union xfs_btree_key *src_key, int numkeys);
>  struct xfs_btree_cur *xfs_btree_alloc_cursor(struct xfs_mount *mp,
>  		struct xfs_trans *tp, xfs_btnum_t btnum);
> +unsigned int xfs_btree_maxlevels(struct xfs_mount *mp, xfs_btnum_t btnum);
>  
>  #endif	/* __XFS_BTREE_H__ */


-- 
chandan


* Re: [PATCH 13/14] xfs: compute the maximum height of the rmap btree when reflink enabled
  2021-09-18  1:30 ` [PATCH 13/14] xfs: compute the maximum height of the rmap btree when reflink enabled Darrick J. Wong
@ 2021-09-20  9:56   ` Chandan Babu R
  0 siblings, 0 replies; 48+ messages in thread
From: Chandan Babu R @ 2021-09-20  9:56 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: chandanrlinux, linux-xfs

On 18 Sep 2021 at 07:00, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> Instead of assuming that the hardcoded XFS_BTREE_MAXLEVELS value is big
> enough to handle the maximally tall rmap btree when all blocks are in
> use and maximally shared, let's compute the maximum height assuming the
> rmapbt consumes as many blocks as possible.

Maximum rmap btree height calculations look good to me.

Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>

>
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/libxfs/xfs_btree.c       |   34 +++++++++++++++++++++++++++++++++
>  fs/xfs/libxfs/xfs_btree.h       |    2 ++
>  fs/xfs/libxfs/xfs_rmap_btree.c  |   40 ++++++++++++++++++++-------------------
>  fs/xfs/libxfs/xfs_rmap_btree.h  |    2 +-
>  fs/xfs/libxfs/xfs_trans_resv.c  |   12 ++++++++++++
>  fs/xfs/libxfs/xfs_trans_space.h |    7 +++++++
>  fs/xfs/xfs_mount.c              |    2 +-
>  7 files changed, 78 insertions(+), 21 deletions(-)
>
>
> diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
> index 6cf49f7e1299..005bc42cf0bd 100644
> --- a/fs/xfs/libxfs/xfs_btree.c
> +++ b/fs/xfs/libxfs/xfs_btree.c
> @@ -4526,6 +4526,40 @@ xfs_btree_compute_maxlevels(
>  	return level;
>  }
>  
> +/*
> + * Compute the maximum height of a btree that is allowed to consume up to the
> + * given number of blocks.
> + */
> +unsigned int
> +xfs_btree_compute_maxlevels_size(
> +	unsigned long long	max_btblocks,
> +	unsigned int		leaf_mnr)
> +{
> +	unsigned long long	leaf_blocks = leaf_mnr;
> +	unsigned long long	blocks_left;
> +	unsigned int		maxlevels;
> +
> +	if (max_btblocks < 1)
> +		return 0;
> +
> +	/*
> +	 * The loop increments maxlevels as long as there would be enough
> +	 * blocks left in the reservation to handle each node block at the
> +	 * current level pointing to the minimum possible number of leaf blocks
> +	 * at the next level down.  We start the loop assuming a single-level
> +	 * btree consuming one block.
> +	 */
> +	maxlevels = 1;
> +	blocks_left = max_btblocks - 1;
> +	while (leaf_blocks < blocks_left) {
> +		maxlevels++;
> +		blocks_left -= leaf_blocks;
> +		leaf_blocks *= leaf_mnr;
> +	}
> +
> +	return maxlevels;
> +}
> +
>  /*
>   * Query a regular btree for all records overlapping a given interval.
>   * Start with a LE lookup of the key of low_rec and return all records
> diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
> index 106760c540c7..d256d869f0af 100644
> --- a/fs/xfs/libxfs/xfs_btree.h
> +++ b/fs/xfs/libxfs/xfs_btree.h
> @@ -476,6 +476,8 @@ xfs_failaddr_t xfs_btree_lblock_verify(struct xfs_buf *bp,
>  		unsigned int max_recs);
>  
>  uint xfs_btree_compute_maxlevels(uint *limits, unsigned long len);
> +unsigned int xfs_btree_compute_maxlevels_size(unsigned long long max_btblocks,
> +		unsigned int leaf_mnr);
>  unsigned long long xfs_btree_calc_size(uint *limits, unsigned long long len);
>  
>  /*
> diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
> index f3c4d0965cc9..85caeb14e4db 100644
> --- a/fs/xfs/libxfs/xfs_rmap_btree.c
> +++ b/fs/xfs/libxfs/xfs_rmap_btree.c
> @@ -535,30 +535,32 @@ xfs_rmapbt_maxrecs(
>  }
>  
>  /* Compute the maximum height of an rmap btree. */
> -void
> +unsigned int
>  xfs_rmapbt_compute_maxlevels(
> -	struct xfs_mount		*mp)
> +	struct xfs_mount	*mp)
>  {
> +	if (!xfs_has_reflink(mp)) {
> +		/*
> +		 * If there's no block sharing, compute the maximum rmapbt
> +		 * height assuming one rmap record per AG block.
> +		 */
> +		return xfs_btree_compute_maxlevels(mp->m_rmap_mnr,
> +				mp->m_sb.sb_agblocks);
> +	}
> +
>  	/*
> -	 * On a non-reflink filesystem, the maximum number of rmap
> -	 * records is the number of blocks in the AG, hence the max
> -	 * rmapbt height is log_$maxrecs($agblocks).  However, with
> -	 * reflink each AG block can have up to 2^32 (per the refcount
> -	 * record format) owners, which means that theoretically we
> -	 * could face up to 2^64 rmap records.
> +	 * Compute the asymptotic maxlevels for an rmapbt on a reflink fs.
>  	 *
> -	 * That effectively means that the max rmapbt height must be
> -	 * XFS_BTREE_MAXLEVELS.  "Fortunately" we'll run out of AG
> -	 * blocks to feed the rmapbt long before the rmapbt reaches
> -	 * maximum height.  The reflink code uses ag_resv_critical to
> -	 * disallow reflinking when less than 10% of the per-AG metadata
> -	 * block reservation since the fallback is a regular file copy.
> +	 * On a reflink filesystem, each AG block can have up to 2^32 (per the
> +	 * refcount record format) owners, which means that theoretically we
> +	 * could face up to 2^64 rmap records.  However, we're likely to run
> +	 * out of blocks in the AG long before that happens, which means that
> +	 * we must compute the max height based on what the btree will look
> +	 * like if it consumes almost all the blocks in the AG due to maximal
> +	 * sharing factor.
>  	 */
> -	if (xfs_has_reflink(mp))
> -		mp->m_rmap_maxlevels = XFS_BTREE_MAXLEVELS;
> -	else
> -		mp->m_rmap_maxlevels = xfs_btree_compute_maxlevels(
> -				mp->m_rmap_mnr, mp->m_sb.sb_agblocks);
> +	return xfs_btree_compute_maxlevels_size(mp->m_sb.sb_agblocks,
> +			mp->m_rmap_mnr[1]);
>  }
>  
>  /* Calculate the refcount btree size for some records. */
> diff --git a/fs/xfs/libxfs/xfs_rmap_btree.h b/fs/xfs/libxfs/xfs_rmap_btree.h
> index f2eee6572af4..5aaecf755abd 100644
> --- a/fs/xfs/libxfs/xfs_rmap_btree.h
> +++ b/fs/xfs/libxfs/xfs_rmap_btree.h
> @@ -49,7 +49,7 @@ struct xfs_btree_cur *xfs_rmapbt_stage_cursor(struct xfs_mount *mp,
>  void xfs_rmapbt_commit_staged_btree(struct xfs_btree_cur *cur,
>  		struct xfs_trans *tp, struct xfs_buf *agbp);
>  int xfs_rmapbt_maxrecs(int blocklen, int leaf);
> -extern void xfs_rmapbt_compute_maxlevels(struct xfs_mount *mp);
> +unsigned int xfs_rmapbt_compute_maxlevels(struct xfs_mount *mp);
>  
>  extern xfs_extlen_t xfs_rmapbt_calc_size(struct xfs_mount *mp,
>  		unsigned long long len);
> diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
> index 5e300daa2559..679f10e08f31 100644
> --- a/fs/xfs/libxfs/xfs_trans_resv.c
> +++ b/fs/xfs/libxfs/xfs_trans_resv.c
> @@ -814,6 +814,15 @@ xfs_trans_resv_calc(
>  	struct xfs_mount	*mp,
>  	struct xfs_trans_resv	*resp)
>  {
> +	unsigned int		rmap_maxlevels = mp->m_rmap_maxlevels;
> +
> +	/*
> +	 * In the early days of rmap+reflink, we hardcoded the rmap maxlevels
> +	 * to 9 even if the AG size was smaller.
> +	 */
> +	if (xfs_has_rmapbt(mp) && xfs_has_reflink(mp))
> +		mp->m_rmap_maxlevels = XFS_OLD_REFLINK_RMAP_MAXLEVELS;
> +
>  	/*
>  	 * The following transactions are logged in physical format and
>  	 * require a permanent reservation on space.
> @@ -916,4 +925,7 @@ xfs_trans_resv_calc(
>  	resp->tr_clearagi.tr_logres = xfs_calc_clear_agi_bucket_reservation(mp);
>  	resp->tr_growrtzero.tr_logres = xfs_calc_growrtzero_reservation(mp);
>  	resp->tr_growrtfree.tr_logres = xfs_calc_growrtfree_reservation(mp);
> +
> +	/* Put everything back the way it was.  This goes at the end. */
> +	mp->m_rmap_maxlevels = rmap_maxlevels;
>  }
> diff --git a/fs/xfs/libxfs/xfs_trans_space.h b/fs/xfs/libxfs/xfs_trans_space.h
> index 50332be34388..440c9c390b86 100644
> --- a/fs/xfs/libxfs/xfs_trans_space.h
> +++ b/fs/xfs/libxfs/xfs_trans_space.h
> @@ -17,6 +17,13 @@
>  /* Adding one rmap could split every level up to the top of the tree. */
>  #define XFS_RMAPADD_SPACE_RES(mp) ((mp)->m_rmap_maxlevels)
>  
> +/*
> + * Note that we historically set m_rmap_maxlevels to 9 when reflink was
> + * enabled, so we must preserve this behavior to avoid changing the transaction
> + * space reservations.
> + */
> +#define XFS_OLD_REFLINK_RMAP_MAXLEVELS	(9)
> +
>  /* Blocks we might need to add "b" rmaps to a tree. */
>  #define XFS_NRMAPADD_SPACE_RES(mp, b)\
>  	(((b + XFS_MAX_CONTIG_RMAPS_PER_BLOCK(mp) - 1) / \
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index 06dac09eddbd..e600a0b781c8 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -635,7 +635,7 @@ xfs_mountfs(
>  	xfs_bmap_compute_maxlevels(mp, XFS_DATA_FORK);
>  	xfs_bmap_compute_maxlevels(mp, XFS_ATTR_FORK);
>  	xfs_mount_setup_inode_geom(mp);
> -	xfs_rmapbt_compute_maxlevels(mp);
> +	mp->m_rmap_maxlevels = xfs_rmapbt_compute_maxlevels(mp);
>  	xfs_refcountbt_compute_maxlevels(mp);
>  
>  	/*


-- 
chandan


* Re: [PATCH 14/14] xfs: kill XFS_BTREE_MAXLEVELS
  2021-09-18  1:30 ` [PATCH 14/14] xfs: kill XFS_BTREE_MAXLEVELS Darrick J. Wong
@ 2021-09-20  9:57   ` Chandan Babu R
  0 siblings, 0 replies; 48+ messages in thread
From: Chandan Babu R @ 2021-09-20  9:57 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: chandanrlinux, linux-xfs

On 18 Sep 2021 at 07:00, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> Nobody uses this symbol anymore, so kill it.
>

Looks good.

Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>

> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/libxfs/xfs_btree.c |    2 --
>  fs/xfs/libxfs/xfs_btree.h |    2 --
>  2 files changed, 4 deletions(-)
>
>
> diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
> index 005bc42cf0bd..a7c866332911 100644
> --- a/fs/xfs/libxfs/xfs_btree.c
> +++ b/fs/xfs/libxfs/xfs_btree.c
> @@ -5003,8 +5003,6 @@ xfs_btree_alloc_cursor(
>  	struct xfs_btree_cur	*cur;
>  	unsigned int		maxlevels = xfs_btree_maxlevels(mp, btnum);
>  
> -	ASSERT(maxlevels <= XFS_BTREE_MAXLEVELS);
> -
>  	cur = kmem_zalloc(xfs_btree_cur_sizeof(maxlevels), KM_NOFS);
>  	cur->bc_tp = tp;
>  	cur->bc_mp = mp;
> diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
> index d256d869f0af..91154dd63472 100644
> --- a/fs/xfs/libxfs/xfs_btree.h
> +++ b/fs/xfs/libxfs/xfs_btree.h
> @@ -90,8 +90,6 @@ uint32_t xfs_btree_magic(int crc, xfs_btnum_t btnum);
>  #define XFS_BTREE_STATS_ADD(cur, stat, val)	\
>  	XFS_STATS_ADD_OFF((cur)->bc_mp, (cur)->bc_statoff + __XBTS_ ## stat, val)
>  
> -#define	XFS_BTREE_MAXLEVELS	9	/* max of all btrees */
> -
>  struct xfs_btree_ops {
>  	/* size of the key and record structures */
>  	size_t	key_len;


-- 
chandan


* Re: [PATCH 02/14] xfs: don't allocate scrub contexts on the stack
  2021-09-20  9:53   ` Chandan Babu R
@ 2021-09-20 17:39     ` Darrick J. Wong
  0 siblings, 0 replies; 48+ messages in thread
From: Darrick J. Wong @ 2021-09-20 17:39 UTC (permalink / raw)
  To: Chandan Babu R; +Cc: chandanrlinux, linux-xfs

On Mon, Sep 20, 2021 at 03:23:34PM +0530, Chandan Babu R wrote:
> On 18 Sep 2021 at 06:59, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > Convert the on-stack scrub context, btree scrub context, and da btree
> > scrub context into a heap allocation so that we reduce stack usage and
> > gain the ability to handle tall btrees without issue.
> >
> > Specifically, this saves us ~208 bytes for the dabtree scrub, ~464 bytes
> > for the btree scrub, and ~200 bytes for the main scrub context.
> >
> 
> Apart from the nits pointed below, the remaining changes look good to me.
> 
> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
> 
> 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  fs/xfs/scrub/btree.c   |   54 ++++++++++++++++++++++++------------------
> >  fs/xfs/scrub/btree.h   |    1 +
> >  fs/xfs/scrub/dabtree.c |   62 ++++++++++++++++++++++++++----------------------
> >  fs/xfs/scrub/scrub.c   |   60 ++++++++++++++++++++++++++--------------------
> >  4 files changed, 98 insertions(+), 79 deletions(-)
> >
> >
> > diff --git a/fs/xfs/scrub/btree.c b/fs/xfs/scrub/btree.c
> > index eccb855dc904..26dcb4691e31 100644
> > --- a/fs/xfs/scrub/btree.c
> > +++ b/fs/xfs/scrub/btree.c
> > @@ -627,15 +627,8 @@ xchk_btree(
> >  	const struct xfs_owner_info	*oinfo,
> >  	void				*private)
> >  {
> > -	struct xchk_btree		bs = {
> > -		.cur			= cur,
> > -		.scrub_rec		= scrub_fn,
> > -		.oinfo			= oinfo,
> > -		.firstrec		= true,
> > -		.private		= private,
> > -		.sc			= sc,
> > -	};
> >  	union xfs_btree_ptr		ptr;
> > +	struct xchk_btree		*bs;
> >  	union xfs_btree_ptr		*pp;
> >  	union xfs_btree_rec		*recp;
> >  	struct xfs_btree_block		*block;
> > @@ -646,10 +639,24 @@ xchk_btree(
> >  	int				i;
> >  	int				error = 0;
> >  
> > +	/*
> > +	 * Allocate the btree scrub context from the heap, because this
> > +	 * structure can get rather large.
> > +	 */
> > +	bs = kmem_zalloc(sizeof(struct xchk_btree), KM_NOFS | KM_MAYFAIL);
> > +	if (!bs)
> > +		return -ENOMEM;
> > +	bs->cur = cur;
> > +	bs->scrub_rec = scrub_fn;
> > +	bs->oinfo = oinfo;
> > +	bs->firstrec = true;
> > +	bs->private = private;
> > +	bs->sc = sc;
> > +
> >  	/* Initialize scrub state */
> >  	for (i = 0; i < XFS_BTREE_MAXLEVELS; i++)
> > -		bs.firstkey[i] = true;
> > -	INIT_LIST_HEAD(&bs.to_check);
> > +		bs->firstkey[i] = true;
> > +	INIT_LIST_HEAD(&bs->to_check);
> >  
> >  	/* Don't try to check a tree with a height we can't handle. */
> >  	if (cur->bc_nlevels > XFS_BTREE_MAXLEVELS) {
> > @@ -663,9 +670,9 @@ xchk_btree(
> >  	 */
> >  	level = cur->bc_nlevels - 1;
> >  	cur->bc_ops->init_ptr_from_cur(cur, &ptr);
> > -	if (!xchk_btree_ptr_ok(&bs, cur->bc_nlevels, &ptr))
> > +	if (!xchk_btree_ptr_ok(bs, cur->bc_nlevels, &ptr))
> >  		goto out;
> > -	error = xchk_btree_get_block(&bs, level, &ptr, &block, &bp);
> > +	error = xchk_btree_get_block(bs, level, &ptr, &block, &bp);
> >  	if (error || !block)
> >  		goto out;
> >  
> > @@ -678,7 +685,7 @@ xchk_btree(
> >  			/* End of leaf, pop back towards the root. */
> >  			if (cur->bc_ptrs[level] >
> >  			    be16_to_cpu(block->bb_numrecs)) {
> > -				xchk_btree_block_keys(&bs, level, block);
> > +				xchk_btree_block_keys(bs, level, block);
> >  				if (level < cur->bc_nlevels - 1)
> >  					cur->bc_ptrs[level + 1]++;
> >  				level++;
> > @@ -686,11 +693,11 @@ xchk_btree(
> >  			}
> >  
> >  			/* Records in order for scrub? */
> > -			xchk_btree_rec(&bs);
> > +			xchk_btree_rec(bs);
> >  
> >  			/* Call out to the record checker. */
> >  			recp = xfs_btree_rec_addr(cur, cur->bc_ptrs[0], block);
> > -			error = bs.scrub_rec(&bs, recp);
> > +			error = bs->scrub_rec(bs, recp);
> >  			if (error)
> >  				break;
> >  			if (xchk_should_terminate(sc, &error) ||
> > @@ -703,7 +710,7 @@ xchk_btree(
> >  
> >  		/* End of node, pop back towards the root. */
> >  		if (cur->bc_ptrs[level] > be16_to_cpu(block->bb_numrecs)) {
> > -			xchk_btree_block_keys(&bs, level, block);
> > +			xchk_btree_block_keys(bs, level, block);
> >  			if (level < cur->bc_nlevels - 1)
> >  				cur->bc_ptrs[level + 1]++;
> >  			level++;
> > @@ -711,16 +718,16 @@ xchk_btree(
> >  		}
> >  
> >  		/* Keys in order for scrub? */
> > -		xchk_btree_key(&bs, level);
> > +		xchk_btree_key(bs, level);
> >  
> >  		/* Drill another level deeper. */
> >  		pp = xfs_btree_ptr_addr(cur, cur->bc_ptrs[level], block);
> > -		if (!xchk_btree_ptr_ok(&bs, level, pp)) {
> > +		if (!xchk_btree_ptr_ok(bs, level, pp)) {
> >  			cur->bc_ptrs[level]++;
> >  			continue;
> >  		}
> >  		level--;
> > -		error = xchk_btree_get_block(&bs, level, pp, &block, &bp);
> > +		error = xchk_btree_get_block(bs, level, pp, &block, &bp);
> >  		if (error || !block)
> >  			goto out;
> >  
> > @@ -729,13 +736,14 @@ xchk_btree(
> >  
> >  out:
> >  	/* Process deferred owner checks on btree blocks. */
> > -	list_for_each_entry_safe(co, n, &bs.to_check, list) {
> > -		if (!error && bs.cur)
> > -			error = xchk_btree_check_block_owner(&bs,
> > -					co->level, co->daddr);
> > +	list_for_each_entry_safe(co, n, &bs->to_check, list) {
> > +		if (!error && bs->cur)
> > +			error = xchk_btree_check_block_owner(bs, co->level,
> > +					co->daddr);
> >  		list_del(&co->list);
> >  		kmem_free(co);
> >  	}
> > +	kmem_free(bs);
> >  
> >  	return error;
> >  }
> > diff --git a/fs/xfs/scrub/btree.h b/fs/xfs/scrub/btree.h
> > index b7d2fc01fbf9..d5c0b0cbc505 100644
> > --- a/fs/xfs/scrub/btree.h
> > +++ b/fs/xfs/scrub/btree.h
> > @@ -44,6 +44,7 @@ struct xchk_btree {
> >  	bool				firstkey[XFS_BTREE_MAXLEVELS];
> >  	struct list_head		to_check;
> >  };
> > +
> >  int xchk_btree(struct xfs_scrub *sc, struct xfs_btree_cur *cur,
> >  		xchk_btree_rec_fn scrub_fn, const struct xfs_owner_info *oinfo,
> >  		void *private);
> > diff --git a/fs/xfs/scrub/dabtree.c b/fs/xfs/scrub/dabtree.c
> > index 8a52514bc1ff..b962cfbbd92b 100644
> > --- a/fs/xfs/scrub/dabtree.c
> > +++ b/fs/xfs/scrub/dabtree.c
> > @@ -473,7 +473,7 @@ xchk_da_btree(
> >  	xchk_da_btree_rec_fn		scrub_fn,
> >  	void				*private)
> >  {
> > -	struct xchk_da_btree		ds = {};
> > +	struct xchk_da_btree		*ds;
> >  	struct xfs_mount		*mp = sc->mp;
> >  	struct xfs_da_state_blk		*blks;
> >  	struct xfs_da_node_entry	*key;
> > @@ -486,32 +486,35 @@ xchk_da_btree(
> >  		return 0;
> >  
> >  	/* Set up initial da state. */
> > -	ds.dargs.dp = sc->ip;
> > -	ds.dargs.whichfork = whichfork;
> > -	ds.dargs.trans = sc->tp;
> > -	ds.dargs.op_flags = XFS_DA_OP_OKNOENT;
> > -	ds.state = xfs_da_state_alloc(&ds.dargs);
> > -	ds.sc = sc;
> > -	ds.private = private;
> > +	ds = kmem_zalloc(sizeof(struct xchk_da_btree), KM_NOFS | KM_MAYFAIL);
> > +	if (!ds)
> > +		return -ENOMEM;
> > +	ds->dargs.dp = sc->ip;
> > +	ds->dargs.whichfork = whichfork;
> > +	ds->dargs.trans = sc->tp;
> > +	ds->dargs.op_flags = XFS_DA_OP_OKNOENT;
> > +	ds->state = xfs_da_state_alloc(&ds->dargs);
> > +	ds->sc = sc;
> > +	ds->private = private;
> >  	if (whichfork == XFS_ATTR_FORK) {
> > -		ds.dargs.geo = mp->m_attr_geo;
> > -		ds.lowest = 0;
> > -		ds.highest = 0;
> > +		ds->dargs.geo = mp->m_attr_geo;
> > +		ds->lowest = 0;
> > +		ds->highest = 0;
> >  	} else {
> > -		ds.dargs.geo = mp->m_dir_geo;
> > -		ds.lowest = ds.dargs.geo->leafblk;
> > -		ds.highest = ds.dargs.geo->freeblk;
> > +		ds->dargs.geo = mp->m_dir_geo;
> > +		ds->lowest = ds->dargs.geo->leafblk;
> > +		ds->highest = ds->dargs.geo->freeblk;
> >  	}
> > -	blkno = ds.lowest;
> > +	blkno = ds->lowest;
> >  	level = 0;
> >  
> >  	/* Find the root of the da tree, if present. */
> > -	blks = ds.state->path.blk;
> > -	error = xchk_da_btree_block(&ds, level, blkno);
> > +	blks = ds->state->path.blk;
> > +	error = xchk_da_btree_block(ds, level, blkno);
> >  	if (error)
> >  		goto out_state;
> >  	/*
> > -	 * We didn't find a block at ds.lowest, which means that there's
> > +	 * We didn't find a block at ds->lowest, which means that there's
> >  	 * no LEAF1/LEAFN tree (at least not where it's supposed to be),
> >  	 * so jump out now.
> >  	 */
> > @@ -523,16 +526,16 @@ xchk_da_btree(
> >  		/* Handle leaf block. */
> >  		if (blks[level].magic != XFS_DA_NODE_MAGIC) {
> >  			/* End of leaf, pop back towards the root. */
> > -			if (blks[level].index >= ds.maxrecs[level]) {
> > +			if (blks[level].index >= ds->maxrecs[level]) {
> >  				if (level > 0)
> >  					blks[level - 1].index++;
> > -				ds.tree_level++;
> > +				ds->tree_level++;
> >  				level--;
> >  				continue;
> >  			}
> >  
> >  			/* Dispatch record scrubbing. */
> > -			error = scrub_fn(&ds, level);
> > +			error = scrub_fn(ds, level);
> >  			if (error)
> >  				break;
> >  			if (xchk_should_terminate(sc, &error) ||
> > @@ -545,17 +548,17 @@ xchk_da_btree(
> >  
> >  
> >  		/* End of node, pop back towards the root. */
> > -		if (blks[level].index >= ds.maxrecs[level]) {
> > +		if (blks[level].index >= ds->maxrecs[level]) {
> >  			if (level > 0)
> >  				blks[level - 1].index++;
> > -			ds.tree_level++;
> > +			ds->tree_level++;
> >  			level--;
> >  			continue;
> >  		}
> >  
> >  		/* Hashes in order for scrub? */
> > -		key = xchk_da_btree_node_entry(&ds, level);
> > -		error = xchk_da_btree_hash(&ds, level, &key->hashval);
> > +		key = xchk_da_btree_node_entry(ds, level);
> > +		error = xchk_da_btree_hash(ds, level, &key->hashval);
> >  		if (error)
> >  			goto out;
> >  
> > @@ -564,11 +567,11 @@ xchk_da_btree(
> >  		level++;
> >  		if (level >= XFS_DA_NODE_MAXDEPTH) {
> >  			/* Too deep! */
> > -			xchk_da_set_corrupt(&ds, level - 1);
> > +			xchk_da_set_corrupt(ds, level - 1);
> >  			break;
> >  		}
> > -		ds.tree_level--;
> > -		error = xchk_da_btree_block(&ds, level, blkno);
> > +		ds->tree_level--;
> > +		error = xchk_da_btree_block(ds, level, blkno);
> >  		if (error)
> >  			goto out;
> >  		if (blks[level].bp == NULL)
> > @@ -587,6 +590,7 @@ xchk_da_btree(
> >  	}
> >  
> >  out_state:
> > -	xfs_da_state_free(ds.state);
> > +	xfs_da_state_free(ds->state);
> > +	kmem_free(ds);
> >  	return error;
> >  }
> > diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
> > index 51e4c61916d2..0569b15526ea 100644
> > --- a/fs/xfs/scrub/scrub.c
> > +++ b/fs/xfs/scrub/scrub.c
> > @@ -461,15 +461,10 @@ xfs_scrub_metadata(
> >  	struct file			*file,
> >  	struct xfs_scrub_metadata	*sm)
> >  {
> > -	struct xfs_scrub		sc = {
> > -		.file			= file,
> > -		.sm			= sm,
> > -	};
> > +	struct xfs_scrub		*sc;
> >  	struct xfs_mount		*mp = XFS_I(file_inode(file))->i_mount;
> >  	int				error = 0;
> >  
> > -	sc.mp = mp;
> > -
> >  	BUILD_BUG_ON(sizeof(meta_scrub_ops) !=
> >  		(sizeof(struct xchk_meta_ops) * XFS_SCRUB_TYPE_NR));
> >  
> > @@ -489,59 +484,68 @@ xfs_scrub_metadata(
> >  
> >  	xchk_experimental_warning(mp);
> >  
> > -	sc.ops = &meta_scrub_ops[sm->sm_type];
> > -	sc.sick_mask = xchk_health_mask_for_scrub_type(sm->sm_type);
> > +	sc = kmem_zalloc(sizeof(struct xfs_scrub), KM_NOFS | KM_MAYFAIL);
> > +	if (!sc) {
> > +		error = -ENOMEM;
> > +		goto out;
> > +	}
> > +
> > +	sc->mp = mp;
> > +	sc->file = file;
> > +	sc->sm = sm;
> > +	sc->ops = &meta_scrub_ops[sm->sm_type];
> > +	sc->sick_mask = xchk_health_mask_for_scrub_type(sm->sm_type);
> >  retry_op:
> >  	/*
> >  	 * When repairs are allowed, prevent freezing or readonly remount while
> >  	 * scrub is running with a real transaction.
> >  	 */
> >  	if (sm->sm_flags & XFS_SCRUB_IFLAG_REPAIR) {
> > -		error = mnt_want_write_file(sc.file);
> > +		error = mnt_want_write_file(sc->file);
> >  		if (error)
> >  			goto out;
> 
> The above should be "goto out_sc" ...
> 
> >  	}
> >  
> >  	/* Set up for the operation. */
> > -	error = sc.ops->setup(&sc);
> > +	error = sc->ops->setup(sc);
> >  	if (error)
> >  		goto out_teardown;
> >  
> >  	/* Scrub for errors. */
> > -	error = sc.ops->scrub(&sc);
> > -	if (!(sc.flags & XCHK_TRY_HARDER) && error == -EDEADLOCK) {
> > +	error = sc->ops->scrub(sc);
> > +	if (!(sc->flags & XCHK_TRY_HARDER) && error == -EDEADLOCK) {
> >  		/*
> >  		 * Scrubbers return -EDEADLOCK to mean 'try harder'.
> >  		 * Tear down everything we hold, then set up again with
> >  		 * preparation for worst-case scenarios.
> >  		 */
> > -		error = xchk_teardown(&sc, 0);
> > +		error = xchk_teardown(sc, 0);
> >  		if (error)
> >  			goto out;
> 
> ... also, the one above.

Ugh, that must have been a porting error.  Fixed.

--D

> > -		sc.flags |= XCHK_TRY_HARDER;
> > +		sc->flags |= XCHK_TRY_HARDER;
> >  		goto retry_op;
> >  	} else if (error || (sm->sm_flags & XFS_SCRUB_OFLAG_INCOMPLETE))
> >  		goto out_teardown;
> >  
> > -	xchk_update_health(&sc);
> > +	xchk_update_health(sc);
> >  
> > -	if ((sc.sm->sm_flags & XFS_SCRUB_IFLAG_REPAIR) &&
> > -	    !(sc.flags & XREP_ALREADY_FIXED)) {
> > +	if ((sc->sm->sm_flags & XFS_SCRUB_IFLAG_REPAIR) &&
> > +	    !(sc->flags & XREP_ALREADY_FIXED)) {
> >  		bool needs_fix;
> >  
> >  		/* Let debug users force us into the repair routines. */
> >  		if (XFS_TEST_ERROR(false, mp, XFS_ERRTAG_FORCE_SCRUB_REPAIR))
> > -			sc.sm->sm_flags |= XFS_SCRUB_OFLAG_CORRUPT;
> > +			sc->sm->sm_flags |= XFS_SCRUB_OFLAG_CORRUPT;
> >  
> > -		needs_fix = (sc.sm->sm_flags & (XFS_SCRUB_OFLAG_CORRUPT |
> > -						XFS_SCRUB_OFLAG_XCORRUPT |
> > -						XFS_SCRUB_OFLAG_PREEN));
> > +		needs_fix = (sc->sm->sm_flags & (XFS_SCRUB_OFLAG_CORRUPT |
> > +						 XFS_SCRUB_OFLAG_XCORRUPT |
> > +						 XFS_SCRUB_OFLAG_PREEN));
> >  		/*
> >  		 * If userspace asked for a repair but it wasn't necessary,
> >  		 * report that back to userspace.
> >  		 */
> >  		if (!needs_fix) {
> > -			sc.sm->sm_flags |= XFS_SCRUB_OFLAG_NO_REPAIR_NEEDED;
> > +			sc->sm->sm_flags |= XFS_SCRUB_OFLAG_NO_REPAIR_NEEDED;
> >  			goto out_nofix;
> >  		}
> >  
> > @@ -549,26 +553,28 @@ xfs_scrub_metadata(
> >  		 * If it's broken, userspace wants us to fix it, and we haven't
> >  		 * already tried to fix it, then attempt a repair.
> >  		 */
> > -		error = xrep_attempt(&sc);
> > +		error = xrep_attempt(sc);
> >  		if (error == -EAGAIN) {
> >  			/*
> >  			 * Either the repair function succeeded or it couldn't
> >  			 * get all the resources it needs; either way, we go
> >  			 * back to the beginning and call the scrub function.
> >  			 */
> > -			error = xchk_teardown(&sc, 0);
> > +			error = xchk_teardown(sc, 0);
> >  			if (error) {
> >  				xrep_failure(mp);
> > -				goto out;
> > +				goto out_sc;
> >  			}
> >  			goto retry_op;
> >  		}
> >  	}
> >  
> >  out_nofix:
> > -	xchk_postmortem(&sc);
> > +	xchk_postmortem(sc);
> >  out_teardown:
> > -	error = xchk_teardown(&sc, error);
> > +	error = xchk_teardown(sc, error);
> > +out_sc:
> > +	kmem_free(sc);
> >  out:
> >  	trace_xchk_done(XFS_I(file_inode(file)), sm, error);
> >  	if (error == -EFSCORRUPTED || error == -EFSBADCRC) {
> 
> 
> -- 
> chandan


* Re: [PATCH 11/14] xfs: dynamically allocate cursors based on maxlevels
  2021-09-18  1:30 ` [PATCH 11/14] xfs: dynamically allocate cursors based on maxlevels Darrick J. Wong
  2021-09-20  9:56   ` Chandan Babu R
@ 2021-09-20 23:06   ` Dave Chinner
  2021-09-20 23:36     ` Dave Chinner
                       ` (2 more replies)
  1 sibling, 3 replies; 48+ messages in thread
From: Dave Chinner @ 2021-09-20 23:06 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: chandan.babu, chandanrlinux, linux-xfs

On Fri, Sep 17, 2021 at 06:30:10PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Replace the statically-sized btree cursor zone with dynamically sized
> > allocations so that we can reduce the memory overhead for per-AG
> > btree cursors while handling very tall btrees for rt metadata.

Hmmmmm. We do a *lot* of btree cursor allocation and freeing under
load. Keeping that in a single slab rather than using heap memory is
a good idea for stuff like this for many reasons...

I mean, if we are creating a million inodes a second, a rough
back-of-the-envelope calculation says we are doing 3-4 million btree
cursor instantiations a second. That's a lot of short term churn on
the heap that we don't really need to subject it to. And even a few
extra instructions in a path called millions of times a second adds
up to a lot of extra runtime overhead.

So....

> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/libxfs/xfs_btree.c |   40 ++++++++++++++++++++++++++++++++--------
>  fs/xfs/libxfs/xfs_btree.h |    2 --
>  fs/xfs/xfs_super.c        |   11 +----------
>  3 files changed, 33 insertions(+), 20 deletions(-)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
> index 2486ba22c01d..f9516828a847 100644
> --- a/fs/xfs/libxfs/xfs_btree.c
> +++ b/fs/xfs/libxfs/xfs_btree.c
> @@ -23,11 +23,6 @@
>  #include "xfs_btree_staging.h"
>  #include "xfs_ag.h"
>  
> -/*
> - * Cursor allocation zone.
> - */
> -kmem_zone_t	*xfs_btree_cur_zone;
> -
>  /*
>   * Btree magic numbers.
>   */
> @@ -379,7 +374,7 @@ xfs_btree_del_cursor(
>  		kmem_free(cur->bc_ops);
>  	if (!(cur->bc_flags & XFS_BTREE_LONG_PTRS) && cur->bc_ag.pag)
>  		xfs_perag_put(cur->bc_ag.pag);
> -	kmem_cache_free(xfs_btree_cur_zone, cur);
> +	kmem_free(cur);
>  }
>  
>  /*
> @@ -4927,6 +4922,32 @@ xfs_btree_has_more_records(
>  		return block->bb_u.s.bb_rightsib != cpu_to_be32(NULLAGBLOCK);
>  }
>  
> +/* Compute the maximum allowed height for a given btree type. */
> +static unsigned int
> +xfs_btree_maxlevels(
> +	struct xfs_mount	*mp,
> +	xfs_btnum_t		btnum)
> +{
> +	switch (btnum) {
> +	case XFS_BTNUM_BNO:
> +	case XFS_BTNUM_CNT:
> +		return mp->m_ag_maxlevels;
> +	case XFS_BTNUM_BMAP:
> +		return max(mp->m_bm_maxlevels[XFS_DATA_FORK],
> +			   mp->m_bm_maxlevels[XFS_ATTR_FORK]);
> +	case XFS_BTNUM_INO:
> +	case XFS_BTNUM_FINO:
> +		return M_IGEO(mp)->inobt_maxlevels;
> +	case XFS_BTNUM_RMAP:
> +		return mp->m_rmap_maxlevels;
> +	case XFS_BTNUM_REFC:
> +		return mp->m_refc_maxlevels;
> +	default:
> +		ASSERT(0);
> +		return XFS_BTREE_MAXLEVELS;
> +	}
> +}
> +
>  /* Allocate a new btree cursor of the appropriate size. */
>  struct xfs_btree_cur *
>  xfs_btree_alloc_cursor(
> @@ -4935,13 +4956,16 @@ xfs_btree_alloc_cursor(
>  	xfs_btnum_t		btnum)
>  {
>  	struct xfs_btree_cur	*cur;
> +	unsigned int		maxlevels = xfs_btree_maxlevels(mp, btnum);
>  
> -	cur = kmem_cache_zalloc(xfs_btree_cur_zone, GFP_NOFS | __GFP_NOFAIL);
> +	ASSERT(maxlevels <= XFS_BTREE_MAXLEVELS);
> +
> +	cur = kmem_zalloc(xfs_btree_cur_sizeof(maxlevels), KM_NOFS);

Instead of multiple dynamic runtime calculations to determine the
size to allocate from the heap, which then has to select a slab
based on size, why don't we just pre-calculate the max size of
the cursor at XFS module init and use that for the btree cursor slab
size?

The memory overhead of the cursor isn't an issue because we've been
maximally sizing it since forever, and the whole point of a slab
cache is to minimise allocation overhead of frequently allocated
objects. It seems to me that we really want to retain these
properties of the cursor allocator, not give them up just as we're
in the process of making other modifications that will hit the path
more frequently than it's ever been hit before...

I like all the dynamic sized guards that this series places in the
cursor, but I don't think we want to change the way we allocate the
cursors just to support that.

FWIW, an example of avoidable runtime calculation overhead of
constants is xlog_calc_unit_res(). These values are actually
constant for a given transaction reservation, but at 1.6 million
transactions a second it shows up at #20 on the flat profile of
functions using the most CPU:

0.71%  [kernel]  [k] xlog_calc_unit_res

0.71% of 32 CPUs for 1.6 million calculations a second of the same
constants is a non-trivial amount of CPU time to spend doing
unnecessary repeated calculations.

Even though the btree cursor constant calculations are simpler than
the log res calculations, they are more frequent. Hence on general
principles of efficiency, I don't think we want to be replacing high
frequency, low overhead slab/zone based allocations with heap
allocations that require repeated constant calculations and
size->slab redirection....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 11/14] xfs: dynamically allocate cursors based on maxlevels
  2021-09-20 23:06   ` Dave Chinner
@ 2021-09-20 23:36     ` Dave Chinner
  2021-09-21  9:03     ` Christoph Hellwig
  2021-09-22 17:38     ` Darrick J. Wong
  2 siblings, 0 replies; 48+ messages in thread
From: Dave Chinner @ 2021-09-20 23:36 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: chandan.babu, chandanrlinux, linux-xfs

On Tue, Sep 21, 2021 at 09:06:35AM +1000, Dave Chinner wrote:
> FWIW, an example of avoidable runtime calculation overhead of
> constants is xlog_calc_unit_res(). These values are actually
> constant for a given transaction reservation, but at 1.6 million
> transactions a second it shows up at #20 on the flat profile of
> functions using the most CPU:
> 
> 0.71%  [kernel]  [k] xlog_calc_unit_res
> 
> 0.71% of 32 CPUs for 1.6 million calculations a second of the same
> constants is a non-trivial amount of CPU time to spend doing
> unnecessary repeated calculations.
> 
> Even though the btree cursor constant calculations are simpler than
> the log res calculations, they are more frequent. Hence on general
> principles of efficiency, I don't think we want to be replacing high
> frequency, low overhead slab/zone based allocations with heap
> allocations that require repeated constant calculations and
> size->slab redirection....

FWIW, I have another example that I don't have profiles for right now
because I didn't record them in the patch series that ends up
pre-calculating the AIL push target: xlog_grant_push_threshold().

This threshold is largely a fixed value ahead of the current log
tail (push at >75% of the physical log space consumed). We
do that calculation more often than we call xlog_calc_unit_res().
Because xlog_grant_push_threshold() accesses contended atomic
variables, it ends up consuming 1-2% of total CPU time when
transaction rates reach the million/s ballpark.

I've currently replaced it with a fixed push threshold calculated at
mount time and let the AIL calculate the LSN of the push target
itself when it needs it.  The result is a substantial reduction in
the CPU usage of the hot xfs_log_reserve() path, which also happens
to be the same hot path xlog_calc_unit_res() is called from...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 01/14] xfs: remove xfs_btree_cur_t typedef
  2021-09-18  1:29 ` [PATCH 01/14] xfs: remove xfs_btree_cur_t typedef Darrick J. Wong
  2021-09-20  9:53   ` Chandan Babu R
@ 2021-09-21  8:36   ` Christoph Hellwig
  1 sibling, 0 replies; 48+ messages in thread
From: Christoph Hellwig @ 2021-09-21  8:36 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: chandan.babu, chandanrlinux, linux-xfs

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 02/14] xfs: don't allocate scrub contexts on the stack
  2021-09-18  1:29 ` [PATCH 02/14] xfs: don't allocate scrub contexts on the stack Darrick J. Wong
  2021-09-20  9:53   ` Chandan Babu R
@ 2021-09-21  8:39   ` Christoph Hellwig
  1 sibling, 0 replies; 48+ messages in thread
From: Christoph Hellwig @ 2021-09-21  8:39 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: chandan.babu, chandanrlinux, linux-xfs

With the goto label fixes:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 03/14] xfs: dynamically allocate btree scrub context structure
  2021-09-18  1:29 ` [PATCH 03/14] xfs: dynamically allocate btree scrub context structure Darrick J. Wong
  2021-09-20  9:53   ` Chandan Babu R
@ 2021-09-21  8:43   ` Christoph Hellwig
  2021-09-22 16:17     ` Darrick J. Wong
  1 sibling, 1 reply; 48+ messages in thread
From: Christoph Hellwig @ 2021-09-21  8:43 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: chandan.babu, chandanrlinux, linux-xfs

On Fri, Sep 17, 2021 at 06:29:26PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Reorganize struct xchk_btree so that we can dynamically size the context
> structure to fit the type of btree cursor that we have.  This will
> enable us to use memory more efficiently once we start adding very tall
> btree types.

So bs->levels[0].has_lastkey replaces bs->firstkey?  Can you explain
a bit more how this works for someone not too familiar with the scrub
code.

> +static inline size_t
> +xchk_btree_sizeof(unsigned int levels)
> +{
> +	return sizeof(struct xchk_btree) +
> +				(levels * sizeof(struct xchk_btree_levels));

This should probably use struct_size().

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 06/14] xfs: check that bc_nlevels never overflows
  2021-09-18  1:29 ` [PATCH 06/14] xfs: check that bc_nlevels never overflows Darrick J. Wong
  2021-09-20  9:54   ` Chandan Babu R
@ 2021-09-21  8:44   ` Christoph Hellwig
  1 sibling, 0 replies; 48+ messages in thread
From: Christoph Hellwig @ 2021-09-21  8:44 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: chandan.babu, chandanrlinux, linux-xfs

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 07/14] xfs: support dynamic btree cursor heights
  2021-09-18  1:29 ` [PATCH 07/14] xfs: support dynamic btree cursor heights Darrick J. Wong
  2021-09-20  9:55   ` Chandan Babu R
@ 2021-09-21  8:49   ` Christoph Hellwig
  1 sibling, 0 replies; 48+ messages in thread
From: Christoph Hellwig @ 2021-09-21  8:49 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: chandan.babu, chandanrlinux, linux-xfs

On Fri, Sep 17, 2021 at 06:29:48PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Split out the btree level information into a separate struct and put it
> at the end of the cursor structure as a VLA.  The realtime rmap btree
> (which is rooted in an inode) will require the ability to support many
> more levels than a per-AG btree cursor, which means that we're going to
> create two btree cursor caches to conserve memory for the more common
> case.

This adds a whole bunch of > 80 char lines, and xfs_btree_cur_sizeof
should use struct_size().

Otherwise looks fine:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 08/14] xfs: refactor btree cursor allocation function
  2021-09-18  1:29 ` [PATCH 08/14] xfs: refactor btree cursor allocation function Darrick J. Wong
  2021-09-20  9:55   ` Chandan Babu R
@ 2021-09-21  8:53   ` Christoph Hellwig
  1 sibling, 0 replies; 48+ messages in thread
From: Christoph Hellwig @ 2021-09-21  8:53 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: chandan.babu, chandanrlinux, linux-xfs

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 09/14] xfs: fix maxlevels comparisons in the btree staging code
  2021-09-18  1:29 ` [PATCH 09/14] xfs: fix maxlevels comparisons in the btree staging code Darrick J. Wong
  2021-09-20  9:55   ` Chandan Babu R
@ 2021-09-21  8:56   ` Christoph Hellwig
  2021-09-22 15:59     ` Darrick J. Wong
  1 sibling, 1 reply; 48+ messages in thread
From: Christoph Hellwig @ 2021-09-21  8:56 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: chandan.babu, chandanrlinux, linux-xfs

On Fri, Sep 17, 2021 at 06:29:59PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> The btree geometry computation function has an off-by-one error in that
> it does not allow maximally tall btrees (nlevels == XFS_BTREE_MAXLEVELS).
> This can result in repairs failing unnecessarily on very fragmented
> filesystems.  Subsequent patches to remove MAXLEVELS usage in favor of
> the per-btree type computations will make this a much more likely
> occurrence.

Shouldn't this go in first as a fix?

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 10/14] xfs: encode the max btree height in the cursor
  2021-09-18  1:30 ` [PATCH 10/14] xfs: encode the max btree height in the cursor Darrick J. Wong
  2021-09-20  9:55   ` Chandan Babu R
@ 2021-09-21  8:57   ` Christoph Hellwig
  1 sibling, 0 replies; 48+ messages in thread
From: Christoph Hellwig @ 2021-09-21  8:57 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: chandan.babu, chandanrlinux, linux-xfs

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 11/14] xfs: dynamically allocate cursors based on maxlevels
  2021-09-20 23:06   ` Dave Chinner
  2021-09-20 23:36     ` Dave Chinner
@ 2021-09-21  9:03     ` Christoph Hellwig
  2021-09-22 18:55       ` Darrick J. Wong
  2021-09-22 17:38     ` Darrick J. Wong
  2 siblings, 1 reply; 48+ messages in thread
From: Christoph Hellwig @ 2021-09-21  9:03 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Darrick J. Wong, chandan.babu, chandanrlinux, linux-xfs

On Tue, Sep 21, 2021 at 09:06:35AM +1000, Dave Chinner wrote:
> On Fri, Sep 17, 2021 at 06:30:10PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Replace the statically-sized btree cursor zone with dynamically sized
> > > allocations so that we can reduce the memory overhead for per-AG
> > > btree cursors while handling very tall btrees for rt metadata.
> 
> Hmmmmm. We do a *lot* of btree cursor allocation and freeing under
> load. Keeping that in a single slab rather than using heap memory is
> a good idea for stuff like this for many reasons...

Or rather a few slabs for the different kinds of cursors.  But otherwise
agreed.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 09/14] xfs: fix maxlevels comparisons in the btree staging code
  2021-09-21  8:56   ` Christoph Hellwig
@ 2021-09-22 15:59     ` Darrick J. Wong
  0 siblings, 0 replies; 48+ messages in thread
From: Darrick J. Wong @ 2021-09-22 15:59 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: chandan.babu, chandanrlinux, linux-xfs

On Tue, Sep 21, 2021 at 09:56:02AM +0100, Christoph Hellwig wrote:
> On Fri, Sep 17, 2021 at 06:29:59PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > The btree geometry computation function has an off-by-one error in that
> > it does not allow maximally tall btrees (nlevels == XFS_BTREE_MAXLEVELS).
> > This can result in repairs failing unnecessarily on very fragmented
> > filesystems.  Subsequent patches to remove MAXLEVELS usage in favor of
> > the per-btree type computations will make this a much more likely
> > occurrence.
> 
> Shouldn't this go in first as a fix?

It probably should, though I haven't seen any bug reports about this
fault.  I'll move it to the front of the patchset.

--D

> Looks good:
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 03/14] xfs: dynamically allocate btree scrub context structure
  2021-09-21  8:43   ` Christoph Hellwig
@ 2021-09-22 16:17     ` Darrick J. Wong
  0 siblings, 0 replies; 48+ messages in thread
From: Darrick J. Wong @ 2021-09-22 16:17 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: chandan.babu, chandanrlinux, linux-xfs

On Tue, Sep 21, 2021 at 09:43:18AM +0100, Christoph Hellwig wrote:
> On Fri, Sep 17, 2021 at 06:29:26PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Reorganize struct xchk_btree so that we can dynamically size the context
> > structure to fit the type of btree cursor that we have.  This will
> > enable us to use memory more efficiently once we start adding very tall
> > btree types.
> 
> So bs->levels[0].has_lastkey replaces bs->firstkey?  Can you explain
> a bit more how this works for someone not too familiar with the scrub
> code.

For each record and key that the btree scrubber encounters, it needs to
know if it should call ->{recs,keys}_inorder to check the ordering of
each item in the btree block.

Hmm.  Come to think of it, we could use "cur->bc_ptrs[level] > 0"
instead of tracking it separately.  Ok, that'll become a separate
cleanup patch to reduce memory further.  Good question!

> > +static inline size_t
> > +xchk_btree_sizeof(unsigned int levels)
> > +{
> > +	return sizeof(struct xchk_btree) +
> > +				(levels * sizeof(struct xchk_btree_levels));
> 
> This should probably use struct_size().

Assuming it's ok with sending a typed null pointer into a macro:

	return struct_size((struct xchk_btree *)NULL, levels, nr_levels);

Then ok.

--D

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 11/14] xfs: dynamically allocate cursors based on maxlevels
  2021-09-20 23:06   ` Dave Chinner
  2021-09-20 23:36     ` Dave Chinner
  2021-09-21  9:03     ` Christoph Hellwig
@ 2021-09-22 17:38     ` Darrick J. Wong
  2021-09-22 23:10       ` Dave Chinner
  2 siblings, 1 reply; 48+ messages in thread
From: Darrick J. Wong @ 2021-09-22 17:38 UTC (permalink / raw)
  To: Dave Chinner; +Cc: chandan.babu, chandanrlinux, linux-xfs

On Tue, Sep 21, 2021 at 09:06:35AM +1000, Dave Chinner wrote:
> On Fri, Sep 17, 2021 at 06:30:10PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Replace the statically-sized btree cursor zone with dynamically sized
> > > allocations so that we can reduce the memory overhead for per-AG
> > > btree cursors while handling very tall btrees for rt metadata.
> 
> Hmmmmm. We do a *lot* of btree cursor allocation and freeing under
> load. Keeping that in a single slab rather than using heap memory is
> a good idea for stuff like this for many reasons...
> 
> I mean, if we are creating a million inodes a second, a rough
> back-of-the-envelope calculation says we are doing 3-4 million btree
> cursor instantiations a second. That's a lot of short term churn on
> the heap that we don't really need to subject it to. And even a few
> extra instructions in a path called millions of times a second adds
> up to a lot of extra runtime overhead.
> 
> So....
> 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  fs/xfs/libxfs/xfs_btree.c |   40 ++++++++++++++++++++++++++++++++--------
> >  fs/xfs/libxfs/xfs_btree.h |    2 --
> >  fs/xfs/xfs_super.c        |   11 +----------
> >  3 files changed, 33 insertions(+), 20 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
> > index 2486ba22c01d..f9516828a847 100644
> > --- a/fs/xfs/libxfs/xfs_btree.c
> > +++ b/fs/xfs/libxfs/xfs_btree.c
> > @@ -23,11 +23,6 @@
> >  #include "xfs_btree_staging.h"
> >  #include "xfs_ag.h"
> >  
> > -/*
> > - * Cursor allocation zone.
> > - */
> > -kmem_zone_t	*xfs_btree_cur_zone;
> > -
> >  /*
> >   * Btree magic numbers.
> >   */
> > @@ -379,7 +374,7 @@ xfs_btree_del_cursor(
> >  		kmem_free(cur->bc_ops);
> >  	if (!(cur->bc_flags & XFS_BTREE_LONG_PTRS) && cur->bc_ag.pag)
> >  		xfs_perag_put(cur->bc_ag.pag);
> > -	kmem_cache_free(xfs_btree_cur_zone, cur);
> > +	kmem_free(cur);
> >  }
> >  
> >  /*
> > @@ -4927,6 +4922,32 @@ xfs_btree_has_more_records(
> >  		return block->bb_u.s.bb_rightsib != cpu_to_be32(NULLAGBLOCK);
> >  }
> >  
> > +/* Compute the maximum allowed height for a given btree type. */
> > +static unsigned int
> > +xfs_btree_maxlevels(
> > +	struct xfs_mount	*mp,
> > +	xfs_btnum_t		btnum)
> > +{
> > +	switch (btnum) {
> > +	case XFS_BTNUM_BNO:
> > +	case XFS_BTNUM_CNT:
> > +		return mp->m_ag_maxlevels;
> > +	case XFS_BTNUM_BMAP:
> > +		return max(mp->m_bm_maxlevels[XFS_DATA_FORK],
> > +			   mp->m_bm_maxlevels[XFS_ATTR_FORK]);
> > +	case XFS_BTNUM_INO:
> > +	case XFS_BTNUM_FINO:
> > +		return M_IGEO(mp)->inobt_maxlevels;
> > +	case XFS_BTNUM_RMAP:
> > +		return mp->m_rmap_maxlevels;
> > +	case XFS_BTNUM_REFC:
> > +		return mp->m_refc_maxlevels;
> > +	default:
> > +		ASSERT(0);
> > +		return XFS_BTREE_MAXLEVELS;
> > +	}
> > +}
> > +
> >  /* Allocate a new btree cursor of the appropriate size. */
> >  struct xfs_btree_cur *
> >  xfs_btree_alloc_cursor(
> > @@ -4935,13 +4956,16 @@ xfs_btree_alloc_cursor(
> >  	xfs_btnum_t		btnum)
> >  {
> >  	struct xfs_btree_cur	*cur;
> > +	unsigned int		maxlevels = xfs_btree_maxlevels(mp, btnum);
> >  
> > -	cur = kmem_cache_zalloc(xfs_btree_cur_zone, GFP_NOFS | __GFP_NOFAIL);
> > +	ASSERT(maxlevels <= XFS_BTREE_MAXLEVELS);
> > +
> > +	cur = kmem_zalloc(xfs_btree_cur_sizeof(maxlevels), KM_NOFS);
> 
> Instead of multiple dynamic runtime calculations to determine the
> size to allocate from the heap, which then has to select a slab
> based on size, why don't we just pre-calculate the max size of
> the cursor at XFS module init and use that for the btree cursor slab
> size?

As part of developing the realtime rmapbt and reflink btrees, I computed
the maximum theoretical btree height for a maximally sized realtime
volume.  For a realtime volume with 2^52 blocks and a 1k block size, I
estimate that you'd need a 11-level rtrefcount btree cursor.  The rtrmap
btree cursor would have to be 28 levels high.  Using 4k blocks instead
of 1k blocks, it's not so bad -- 8 for rtrefcount and 17 for rtrmap.

I don't recall exactly what Chandan said the maximum bmbt height would
need to be to support really large data fork mapping structures, but
based on my worst case estimate of 2^54 single-block mappings and a 1k
blocksize, you'd need a 12-level bmbt cursor.  For 4k blocks, you'd need
only 8 levels.

The current XFS_BTREE_MAXLEVELS is 9, which just so happens to fit in
248 bytes.  I will rework this patch to make xfs_btree_cur_zone supply
256-byte cursors, and the btree code will continue using the zone if 256
bytes is enough space for the cursor.

If we decide later on that we need a zone for larger cursors, I think
the next logical size up (512 bytes) will fit 25 levels, but let's wait
to get there first.

--D

> The memory overhead of the cursor isn't an issue because we've been
> maximally sizing it since forever, and the whole point of a slab
> cache is to minimise allocation overhead of frequently allocated
> objects. It seems to me that we really want to retain these
> properties of the cursor allocator, not give them up just as we're
> in the process of making other modifications that will hit the path
> more frequently than it's ever been hit before...
> 
> I like all the dynamic sized guards that this series places in the
> cursor, but I don't think we want to change the way we allocate the
> cursors just to support that.
> 
> FWIW, an example of avoidable runtime calculation overhead of
> constants is xlog_calc_unit_res(). These values are actually
> constant for a given transaction reservation, but at 1.6 million
> transactions a second it shows up at #20 on the flat profile of
> functions using the most CPU:
> 
> 0.71%  [kernel]  [k] xlog_calc_unit_res
> 
> 0.71% of 32 CPUs for 1.6 million calculations a second of the same
> constants is a non-trivial amount of CPU time to spend doing
> unnecessary repeated calculations.
> 
> Even though the btree cursor constant calculations are simpler than
> the log res calculations, they are more frequent. Hence on general
> principles of efficiency, I don't think we want to be replacing high
> frequency, low overhead slab/zone based allocations with heap
> allocations that require repeated constant calculations and
> size->slab redirection....
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 11/14] xfs: dynamically allocate cursors based on maxlevels
  2021-09-21  9:03     ` Christoph Hellwig
@ 2021-09-22 18:55       ` Darrick J. Wong
  0 siblings, 0 replies; 48+ messages in thread
From: Darrick J. Wong @ 2021-09-22 18:55 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Dave Chinner, chandan.babu, chandanrlinux, linux-xfs

On Tue, Sep 21, 2021 at 10:03:49AM +0100, Christoph Hellwig wrote:
> On Tue, Sep 21, 2021 at 09:06:35AM +1000, Dave Chinner wrote:
> > On Fri, Sep 17, 2021 at 06:30:10PM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > > 
> > > Replace the statically-sized btree cursor zone with dynamically sized
> > > allocations so that we can reduce the memory overhead for per-AG
> > > btree cursors while handling very tall btrees for rt metadata.
> > 
> > Hmmmmm. We do a *lot* of btree cursor allocation and freeing under
> > load. Keeping that in a single slab rather than using heap memory is
> > a good idea for stuff like this for many reasons...
> 
> Or rather a few slabs for the different kinds of cursors.  But otherwise
> agreed.

I think I prefer to let Chandan decide if there are going to be enough
heavily fragmented files to warrant a second slab for maxlevels>9 files.
We should probably be selective about which cursor maxheight we want to
use depending on whether or not the file really needs it.

--D

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 11/14] xfs: dynamically allocate cursors based on maxlevels
  2021-09-22 17:38     ` Darrick J. Wong
@ 2021-09-22 23:10       ` Dave Chinner
  2021-09-23  1:58         ` Darrick J. Wong
  0 siblings, 1 reply; 48+ messages in thread
From: Dave Chinner @ 2021-09-22 23:10 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: chandan.babu, chandanrlinux, linux-xfs

On Wed, Sep 22, 2021 at 10:38:21AM -0700, Darrick J. Wong wrote:
> On Tue, Sep 21, 2021 at 09:06:35AM +1000, Dave Chinner wrote:
> > On Fri, Sep 17, 2021 at 06:30:10PM -0700, Darrick J. Wong wrote:
> > >  /* Allocate a new btree cursor of the appropriate size. */
> > >  struct xfs_btree_cur *
> > >  xfs_btree_alloc_cursor(
> > > @@ -4935,13 +4956,16 @@ xfs_btree_alloc_cursor(
> > >  	xfs_btnum_t		btnum)
> > >  {
> > >  	struct xfs_btree_cur	*cur;
> > > +	unsigned int		maxlevels = xfs_btree_maxlevels(mp, btnum);
> > >  
> > > -	cur = kmem_cache_zalloc(xfs_btree_cur_zone, GFP_NOFS | __GFP_NOFAIL);
> > > +	ASSERT(maxlevels <= XFS_BTREE_MAXLEVELS);
> > > +
> > > +	cur = kmem_zalloc(xfs_btree_cur_sizeof(maxlevels), KM_NOFS);
> > 
> > Instead of multiple dynamic runtime calculations to determine the
> > size to allocate from the heap, which then has to select a slab
> > based on size, why don't we just pre-calculate the max size of
> > the cursor at XFS module init and use that for the btree cursor slab
> > size?
> 
> As part of developing the realtime rmapbt and reflink btrees, I computed
> the maximum theoretical btree height for a maximally sized realtime
> volume.  For a realtime volume with 2^52 blocks and a 1k block size, I
> estimate that you'd need a 11-level rtrefcount btree cursor.  The rtrmap
> btree cursor would have to be 28 levels high.  Using 4k blocks instead
> of 1k blocks, it's not so bad -- 8 for rtrefcount and 17 for rtrmap.

I'm going to state straight out that 1k block sizes for the rt
device are insane. That's not what that device was intended to
support, ever. It was intended for workloads with -large-,
consistent extent sizes in large contiguous runs, not tiny, small
random allocations of individual blocks.

So if we are going to be talking about the overhead RT block
management for new functionality, we need to start by putting
reasonable limits on the block sizes that the RT device will support
such features for. Because while a btree might scale to 2^52 x 1kB
blocks, the RT allocation bitmap sure as hell doesn't. It probably
doesn't even scale at all well above a few million blocks for
general usage.

Hence I don't think it's worth optimising for these cases when we
think about maximum btree sizes for the cursors - those btrees can
provide their own cursor slab to allocate from if it comes to it.

Really, if we want to scale RT devices to insane sizes, we need to
move to an AG based structure for it which breaks up the bitmaps and
summary files into regions to keep the overhead and max sizes under
control.

> I don't recall exactly what Chandan said the maximum bmbt height would
> need to be to support really large data fork mapping structures, but
> based on my worst case estimate of 2^54 single-block mappings and a 1k
> blocksize, you'd need a 12-level bmbt cursor.  For 4k blocks, you'd need
> only 8 levels.

Yup, it's not significantly different to what we have now.

> The current XFS_BTREE_MAXLEVELS is 9, which just so happens to fit in
> 248 bytes.  I will rework this patch to make xfs_btree_cur_zone supply
> 256-byte cursors, and the btree code will continue using the zone if 256
> bytes is enough space for the cursor.
>
> If we decide later on that we need a zone for larger cursors, I think
> the next logical size up (512 bytes) will fit 25 levels, but let's wait
> to get there first.

I suspect you may misunderstand how SLUB caches work. SLUB packs
non-power of two sized slabs tightly to natural alignment (8 bytes).
e.g.:

$ sudo grep xfs_btree_cur /proc/slabinfo
xfs_btree_cur       1152   1152    224   36    2 : tunables    0 0    0 : slabdata     32     32      0

SLUB is using an order-1 base page (2 pages), with 36 cursor objects
in it. 36 * 224 = 8064 bytes, which means it is packed as tightly as
possible. It is not using 256 byte objects for these btree cursors.

If we allocate these 224 byte objects _from the heap_, however, then
the 256 byte heap slab will be selected, which means the object is
then padded to 256 bytes -by the heap-. The SLUB allocator does not
pad the objects, it's the heap granularity that adds padding to the
objects.

This implicit padding of heap objects is another reason we don't
want to use the heap for anything we frequently allocate or allocate
in large amounts. It can result in substantial amounts of wasted
memory.

IOWs, we don't actually care about object size granularity for slab
cache allocated objects.

However, if we really want to look at memory usage of struct
xfs_btree_cur, pahole tells me:

	/* size: 224, cachelines: 4, members: 13 */

Where are the extra 24 bytes coming from on your kernel?

It also tells me that a bunch of space that can be taken out of it:

- 4 byte hole that bc_btnum can be moved into.
- bc_blocklog is set but not used, so it can go, too.
- bc_ag.refc.nr_ops doesn't need to be an unsigned long
- optimising bc_ra state. That just tracks whether
  the current cursor has already done sibling readahead - it's two
  bits per level, held in an int8_t per level. Could be a pair of
  int16_t bitmasks if maxlevel is 12, which would save another 8
  bytes. If maxlevel == 28 as per the rt case above, then a pair of
  int32_t bitmasks saves 4 bytes for 12 levels and 20 bytes
  for 28 levels...

Hence if we're concerned about space usage of the btree cursor,
these seem like low hanging fruit.

Maybe the best thing here, as Christoph mentioned, is to have a set
of btree cursor zones for the different size limits. All the per-ag
btrees have the same (small) size limits, while the BMBT is bigger.
And the RT btrees when they arrive will be bigger again. Given that
we already allocate the cursors based on the type of btree they are
going to walk, this seems like it would be pretty easy to do,
something like the patch below, perhaps?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

xfs: per-btree cursor slab caches
---
 fs/xfs/libxfs/xfs_alloc_btree.c    |  3 ++-
 fs/xfs/libxfs/xfs_bmap_btree.c     |  4 +++-
 fs/xfs/libxfs/xfs_btree.c          | 28 +++++++++++++++++++++++-----
 fs/xfs/libxfs/xfs_btree.h          |  6 +++++-
 fs/xfs/libxfs/xfs_ialloc_btree.c   |  4 +++-
 fs/xfs/libxfs/xfs_refcount_btree.c |  4 +++-
 fs/xfs/libxfs/xfs_rmap_btree.c     |  4 +++-
 fs/xfs/xfs_super.c                 | 30 ++++++++++++++++++++++++++----
 8 files changed, 68 insertions(+), 15 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_alloc_btree.c b/fs/xfs/libxfs/xfs_alloc_btree.c
index 6746fd735550..53ead7b98238 100644
--- a/fs/xfs/libxfs/xfs_alloc_btree.c
+++ b/fs/xfs/libxfs/xfs_alloc_btree.c
@@ -20,6 +20,7 @@
 #include "xfs_trans.h"
 #include "xfs_ag.h"
 
+struct kmem_cache	*xfs_allocbt_cur_zone;
 
 STATIC struct xfs_btree_cur *
 xfs_allocbt_dup_cursor(
@@ -477,7 +478,7 @@ xfs_allocbt_init_common(
 
 	ASSERT(btnum == XFS_BTNUM_BNO || btnum == XFS_BTNUM_CNT);
 
-	cur = kmem_cache_zalloc(xfs_btree_cur_zone, GFP_NOFS | __GFP_NOFAIL);
+	cur = kmem_cache_zalloc(xfs_allocbt_cur_zone, GFP_NOFS | __GFP_NOFAIL);
 
 	cur->bc_tp = tp;
 	cur->bc_mp = mp;
diff --git a/fs/xfs/libxfs/xfs_bmap_btree.c b/fs/xfs/libxfs/xfs_bmap_btree.c
index 72444b8b38a6..e3f7107ce2e2 100644
--- a/fs/xfs/libxfs/xfs_bmap_btree.c
+++ b/fs/xfs/libxfs/xfs_bmap_btree.c
@@ -22,6 +22,8 @@
 #include "xfs_trace.h"
 #include "xfs_rmap.h"
 
+struct kmem_cache	*xfs_bmbt_cur_zone;
+
 /*
  * Convert on-disk form of btree root to in-memory form.
  */
@@ -552,7 +554,7 @@ xfs_bmbt_init_cursor(
 	struct xfs_btree_cur	*cur;
 	ASSERT(whichfork != XFS_COW_FORK);
 
-	cur = kmem_cache_zalloc(xfs_btree_cur_zone, GFP_NOFS | __GFP_NOFAIL);
+	cur = kmem_cache_zalloc(xfs_bmbt_cur_zone, GFP_NOFS | __GFP_NOFAIL);
 
 	cur->bc_tp = tp;
 	cur->bc_mp = mp;
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index 298395481713..7ef19f365e33 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -23,10 +23,6 @@
 #include "xfs_btree_staging.h"
 #include "xfs_ag.h"
 
-/*
- * Cursor allocation zone.
- */
-kmem_zone_t	*xfs_btree_cur_zone;
 
 /*
  * Btree magic numbers.
@@ -379,7 +375,29 @@ xfs_btree_del_cursor(
 		kmem_free(cur->bc_ops);
 	if (!(cur->bc_flags & XFS_BTREE_LONG_PTRS) && cur->bc_ag.pag)
 		xfs_perag_put(cur->bc_ag.pag);
-	kmem_cache_free(xfs_btree_cur_zone, cur);
+
+	switch (cur->bc_btnum) {
+	case XFS_BTNUM_BMAP:
+		kmem_cache_free(xfs_bmbt_cur_zone, cur);
+		break;
+	case XFS_BTNUM_BNO:
+	case XFS_BTNUM_CNT:
+		kmem_cache_free(xfs_allocbt_cur_zone, cur);
+		break;
+	case XFS_BTNUM_INOBT:
+	case XFS_BTNUM_FINOBT:
+		kmem_cache_free(xfs_inobt_cur_zone, cur);
+		break;
+	case XFS_BTNUM_RMAP:
+		kmem_cache_free(xfs_rmapbt_cur_zone, cur);
+		break;
+	case XFS_BTNUM_REFCNT:
+		kmem_cache_free(xfs_refcntbt_cur_zone, cur);
+		break;
+	default:
+		ASSERT(0);
+		break;
+	}
 }
 
 /*
diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index 4eaf8517f850..acdf087c853a 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -13,7 +13,11 @@ struct xfs_trans;
 struct xfs_ifork;
 struct xfs_perag;
 
-extern kmem_zone_t	*xfs_btree_cur_zone;
+extern struct kmem_cache	*xfs_allocbt_cur_zone;
+extern struct kmem_cache	*xfs_inobt_cur_zone;
+extern struct kmem_cache	*xfs_bmbt_cur_zone;
+extern struct kmem_cache	*xfs_rmapbt_cur_zone;
+extern struct kmem_cache	*xfs_refcntbt_cur_zone;
 
 /*
  * Generic key, ptr and record wrapper structures.
diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
index 27190840c5d8..5258696f153e 100644
--- a/fs/xfs/libxfs/xfs_ialloc_btree.c
+++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
@@ -22,6 +22,8 @@
 #include "xfs_rmap.h"
 #include "xfs_ag.h"
 
+struct kmem_cache	*xfs_inobt_cur_zone;
+
 STATIC int
 xfs_inobt_get_minrecs(
 	struct xfs_btree_cur	*cur,
@@ -432,7 +434,7 @@ xfs_inobt_init_common(
 {
 	struct xfs_btree_cur	*cur;
 
-	cur = kmem_cache_zalloc(xfs_btree_cur_zone, GFP_NOFS | __GFP_NOFAIL);
+	cur = kmem_cache_zalloc(xfs_inobt_cur_zone, GFP_NOFS | __GFP_NOFAIL);
 	cur->bc_tp = tp;
 	cur->bc_mp = mp;
 	cur->bc_btnum = btnum;
diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c
index 1ef9b99962ab..20667f173040 100644
--- a/fs/xfs/libxfs/xfs_refcount_btree.c
+++ b/fs/xfs/libxfs/xfs_refcount_btree.c
@@ -21,6 +21,8 @@
 #include "xfs_rmap.h"
 #include "xfs_ag.h"
 
+struct kmem_cache	*xfs_refcntbt_cur_zone;
+
 static struct xfs_btree_cur *
 xfs_refcountbt_dup_cursor(
 	struct xfs_btree_cur	*cur)
@@ -322,7 +324,7 @@ xfs_refcountbt_init_common(
 
 	ASSERT(pag->pag_agno < mp->m_sb.sb_agcount);
 
-	cur = kmem_cache_zalloc(xfs_btree_cur_zone, GFP_NOFS | __GFP_NOFAIL);
+	cur = kmem_cache_zalloc(xfs_refcntbt_cur_zone, GFP_NOFS | __GFP_NOFAIL);
 	cur->bc_tp = tp;
 	cur->bc_mp = mp;
 	cur->bc_btnum = XFS_BTNUM_REFC;
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
index b7dbbfb3aeed..cb6e64f6d8f9 100644
--- a/fs/xfs/libxfs/xfs_rmap_btree.c
+++ b/fs/xfs/libxfs/xfs_rmap_btree.c
@@ -22,6 +22,8 @@
 #include "xfs_ag.h"
 #include "xfs_ag_resv.h"
 
+struct kmem_cache	*xfs_rmapbt_cur_zone;
+
 /*
  * Reverse map btree.
  *
@@ -451,7 +453,7 @@ xfs_rmapbt_init_common(
 {
 	struct xfs_btree_cur	*cur;
 
-	cur = kmem_cache_zalloc(xfs_btree_cur_zone, GFP_NOFS | __GFP_NOFAIL);
+	cur = kmem_cache_zalloc(xfs_rmapbt_cur_zone, GFP_NOFS | __GFP_NOFAIL);
 	cur->bc_tp = tp;
 	cur->bc_mp = mp;
 	/* Overlapping btree; 2 keys per pointer. */
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 90716b9d6e5f..3f97dc1b41e0 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1965,10 +1965,24 @@ xfs_init_zones(void)
 	if (!xfs_bmap_free_item_zone)
 		goto out_destroy_log_ticket_zone;
 
-	xfs_btree_cur_zone = kmem_cache_create("xfs_btree_cur",
+	xfs_allocbt_cur_zone = kmem_cache_create("xfs_allocbt_cur",
 					       sizeof(struct xfs_btree_cur),
 					       0, 0, NULL);
-	if (!xfs_btree_cur_zone)
+	xfs_inobt_cur_zone = kmem_cache_create("xfs_inobt_cur",
+					       sizeof(struct xfs_btree_cur),
+					       0, 0, NULL);
+	xfs_bmbt_cur_zone = kmem_cache_create("xfs_bmbt_cur",
+					       sizeof(struct xfs_btree_cur),
+					       0, 0, NULL);
+	xfs_rmapbt_cur_zone = kmem_cache_create("xfs_rmapbt_cur",
+					       sizeof(struct xfs_btree_cur),
+					       0, 0, NULL);
+	xfs_refcntbt_cur_zone = kmem_cache_create("xfs_refcnt_cur",
+					       sizeof(struct xfs_btree_cur),
+					       0, 0, NULL);
+	if (!xfs_allocbt_cur_zone || !xfs_inobt_cur_zone ||
+	    !xfs_bmbt_cur_zone || !xfs_rmapbt_cur_zone ||
+	    !xfs_refcntbt_cur_zone)
 		goto out_destroy_bmap_free_item_zone;
 
 	xfs_da_state_zone = kmem_cache_create("xfs_da_state",
@@ -2106,7 +2120,11 @@ xfs_init_zones(void)
  out_destroy_da_state_zone:
 	kmem_cache_destroy(xfs_da_state_zone);
  out_destroy_btree_cur_zone:
-	kmem_cache_destroy(xfs_btree_cur_zone);
+	kmem_cache_destroy(xfs_allocbt_cur_zone);
+	kmem_cache_destroy(xfs_inobt_cur_zone);
+	kmem_cache_destroy(xfs_bmbt_cur_zone);
+	kmem_cache_destroy(xfs_rmapbt_cur_zone);
+	kmem_cache_destroy(xfs_refcntbt_cur_zone);
  out_destroy_bmap_free_item_zone:
 	kmem_cache_destroy(xfs_bmap_free_item_zone);
  out_destroy_log_ticket_zone:
@@ -2138,7 +2156,11 @@ xfs_destroy_zones(void)
 	kmem_cache_destroy(xfs_trans_zone);
 	kmem_cache_destroy(xfs_ifork_zone);
 	kmem_cache_destroy(xfs_da_state_zone);
-	kmem_cache_destroy(xfs_btree_cur_zone);
+	kmem_cache_destroy(xfs_allocbt_cur_zone);
+	kmem_cache_destroy(xfs_inobt_cur_zone);
+	kmem_cache_destroy(xfs_bmbt_cur_zone);
+	kmem_cache_destroy(xfs_rmapbt_cur_zone);
+	kmem_cache_destroy(xfs_refcntbt_cur_zone);
 	kmem_cache_destroy(xfs_bmap_free_item_zone);
 	kmem_cache_destroy(xfs_log_ticket_zone);
 }

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 11/14] xfs: dynamically allocate cursors based on maxlevels
  2021-09-22 23:10       ` Dave Chinner
@ 2021-09-23  1:58         ` Darrick J. Wong
  2021-09-23  5:56           ` Chandan Babu R
  0 siblings, 1 reply; 48+ messages in thread
From: Darrick J. Wong @ 2021-09-23  1:58 UTC (permalink / raw)
  To: Dave Chinner; +Cc: chandan.babu, chandanrlinux, linux-xfs

On Thu, Sep 23, 2021 at 09:10:15AM +1000, Dave Chinner wrote:
> On Wed, Sep 22, 2021 at 10:38:21AM -0700, Darrick J. Wong wrote:
> > On Tue, Sep 21, 2021 at 09:06:35AM +1000, Dave Chinner wrote:
> > > On Fri, Sep 17, 2021 at 06:30:10PM -0700, Darrick J. Wong wrote:
> > > >  /* Allocate a new btree cursor of the appropriate size. */
> > > >  struct xfs_btree_cur *
> > > >  xfs_btree_alloc_cursor(
> > > > @@ -4935,13 +4956,16 @@ xfs_btree_alloc_cursor(
> > > >  	xfs_btnum_t		btnum)
> > > >  {
> > > >  	struct xfs_btree_cur	*cur;
> > > > +	unsigned int		maxlevels = xfs_btree_maxlevels(mp, btnum);
> > > >  
> > > > -	cur = kmem_cache_zalloc(xfs_btree_cur_zone, GFP_NOFS | __GFP_NOFAIL);
> > > > +	ASSERT(maxlevels <= XFS_BTREE_MAXLEVELS);
> > > > +
> > > > +	cur = kmem_zalloc(xfs_btree_cur_sizeof(maxlevels), KM_NOFS);
> > > 
> > > Instead of multiple dynamic runtime calculations to determine the
> > > size to allocate from the heap, which then has to select a slab
> > > based on size, why don't we just pre-calculate the max size of
> > > the cursor at XFS module init and use that for the btree cursor slab
> > > size?
> > 
> > As part of developing the realtime rmapbt and reflink btrees, I computed
> > the maximum theoretical btree height for a maximally sized realtime
> > volume.  For a realtime volume with 2^52 blocks and a 1k block size, I
> > estimate that you'd need a 11-level rtrefcount btree cursor.  The rtrmap
> > btree cursor would have to be 28 levels high.  Using 4k blocks instead
> > of 1k blocks, it's not so bad -- 8 for rtrefcount and 17 for rtrmap.
> 
> I'm going to state straight out that 1k block sizes for the rt
> device are insane. That's not what that device was intended to
> support, ever. It was intended for workloads with -large-,
> consistent extent sizes in large contiguous runs, not tiny, small
> random allocations of individual blocks.
> 
> So if we are going to be talking about the overhead RT block
> management for new functionality, we need to start by putting
> reasonable limits on the block sizes that the RT device will support
> such features for. Because while a btree might scale to 2^52 x 1kB
> blocks, the RT allocation bitmap sure as hell doesn't. It probably
> doesn't even scale at all well above a few million blocks for
> general usage.
> 
> Hence I don't think it's worth optimising for these cases when we
> think about maximum btree sizes for the cursors - those btrees can
> provide their own cursor slab to allocate from if it comes to it.
> 
> Really, if we want to scale RT devices to insane sizes, we need to
> move to an AG based structure for it which breaks up the bitmaps and
> summary files into regions to keep the overhead and max sizes under
> control.

Heh.  That just sounds like more work that I get to do...

> > I don't recall exactly what Chandan said the maximum bmbt height would
> > need to be to support really large data fork mapping structures, but
> > based on my worst case estimate of 2^54 single-block mappings and a 1k
> > blocksize, you'd need a 12-level bmbt cursor.  For 4k blocks, you'd need
> > only 8 levels.
> 
> Yup, it's not significantly different to what we have now.
> 
> > The current XFS_BTREE_MAXLEVELS is 9, which just so happens to fit in
> > 248 bytes.  I will rework this patch to make xfs_btree_cur_zone supply
> > 256-byte cursors, and the btree code will continue using the zone if 256
> > bytes is enough space for the cursor.
> >
> > If we decide later on that we need a zone for larger cursors, I think
> > the next logical size up (512 bytes) will fit 25 levels, but let's wait
> > to get there first.
> 
> I suspect you may misunderstand how SLUB caches work. SLUB packs
> non-power of two sized slabs tightly to natural alignment (8 bytes).
> e.g.:
> 
> $ sudo grep xfs_btree_cur /proc/slabinfo
> xfs_btree_cur       1152   1152    224   36    2 : tunables    0 0    0 : slabdata     32     32      0
> 
> SLUB is using an order-1 base page (2 pages), with 36 cursor objects
> in it. 36 * 224 = 8064 bytes, which means it is packed as tightly as
> possible. It is not using 256 byte objects for these btree cursors.

Ahah, I didn't realize that.  Yes, taking that into mind, the 256-byte
thing is unnecessary.

> If we allocate these 224 byte objects _from the heap_, however, then
> the 256 byte heap slab will be selected, which means the object is
> then padded to 256 bytes -by the heap-. The SLUB allocator does not
> pad the objects, it's the heap granularity that adds padding to the
> objects.
> 
> This implicit padding of heap objects is another reason we don't
> want to use the heap for anything we frequently allocate or allocate
> in large amounts. It can result in substantial amounts of wasted
> memory.
> 
> IOWs, we don't actually care about object size granularity for slab
> cache allocated objects.
> 
> However, if we really want to look at memory usage of struct
> xfs_btree_cur, pahole tells me:
> 
> 	/* size: 224, cachelines: 4, members: 13 */
> 
> Where are the extra 24 bytes coming from on your kernel?

Not sure.  Can you post your pahole output?

> It also tells me that a bunch of space can be taken out of it:
> 
> - 4 byte hole that bc_btnum can be moved into.
> - bc_blocklog is set but not used, so it can go, too.
> - bc_ag.refc.nr_ops doesn't need to be an unsigned long

I'll look into those tomorrow.

> - optimising bc_ra state. That just tracks if
>   the current cursor has already done sibling readahead - it's two
>   bits per level, held in an int8_t per level. Could be a pair of
>   int16_t bitmasks if maxlevel is 12, that would save another 8
>   bytes. If maxlevel == 28 as per the rt case above, then a pair of
>   int32_t bitmasks saves 4 bytes for 12 levels and 20 bytes
>   for 28 levels...

I don't think that optimizing bc_ra buys us much.  struct
xfs_btree_level will be 16 bytes anyway due to alignment of the xfs_buf
pointer, so we might as well use the extra bytes.

> Hence if we're concerned about space usage of the btree cursor,
> these seem like low hanging fruit.
> 
> Maybe the best thing here, as Christoph mentioned, is to have a set
> of btree cursor zones for the different size limits. All the per-ag
> btrees have the same (small) size limits, while the BMBT is bigger.
> And the RT btrees when they arrive will be bigger again. Given that
> we already allocate the cursors based on the type of btree they are
> going to walk, this seems like it would be pretty easy to do,
> something like the patch below, perhaps?

Um... the bmbt cache looks like it has the same size as the rest?

It's not so hard to make there be separate zones though.

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> xfs: per-btree cursor slab caches
> ---
>  fs/xfs/libxfs/xfs_alloc_btree.c    |  3 ++-
>  fs/xfs/libxfs/xfs_bmap_btree.c     |  4 +++-
>  fs/xfs/libxfs/xfs_btree.c          | 28 +++++++++++++++++++++++-----
>  fs/xfs/libxfs/xfs_btree.h          |  6 +++++-
>  fs/xfs/libxfs/xfs_ialloc_btree.c   |  4 +++-
>  fs/xfs/libxfs/xfs_refcount_btree.c |  4 +++-
>  fs/xfs/libxfs/xfs_rmap_btree.c     |  4 +++-
>  fs/xfs/xfs_super.c                 | 30 ++++++++++++++++++++++++++----
>  8 files changed, 68 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_alloc_btree.c b/fs/xfs/libxfs/xfs_alloc_btree.c
> index 6746fd735550..53ead7b98238 100644
> --- a/fs/xfs/libxfs/xfs_alloc_btree.c
> +++ b/fs/xfs/libxfs/xfs_alloc_btree.c
> @@ -20,6 +20,7 @@
>  #include "xfs_trans.h"
>  #include "xfs_ag.h"
>  
> +struct kmem_cache	*xfs_allocbt_cur_zone;
>  
>  STATIC struct xfs_btree_cur *
>  xfs_allocbt_dup_cursor(
> @@ -477,7 +478,7 @@ xfs_allocbt_init_common(
>  
>  	ASSERT(btnum == XFS_BTNUM_BNO || btnum == XFS_BTNUM_CNT);
>  
> -	cur = kmem_cache_zalloc(xfs_btree_cur_zone, GFP_NOFS | __GFP_NOFAIL);
> +	cur = kmem_cache_zalloc(xfs_allocbt_cur_zone, GFP_NOFS | __GFP_NOFAIL);
>  
>  	cur->bc_tp = tp;
>  	cur->bc_mp = mp;
> diff --git a/fs/xfs/libxfs/xfs_bmap_btree.c b/fs/xfs/libxfs/xfs_bmap_btree.c
> index 72444b8b38a6..e3f7107ce2e2 100644
> --- a/fs/xfs/libxfs/xfs_bmap_btree.c
> +++ b/fs/xfs/libxfs/xfs_bmap_btree.c
> @@ -22,6 +22,8 @@
>  #include "xfs_trace.h"
>  #include "xfs_rmap.h"
>  
> +struct kmem_cache	*xfs_bmbt_cur_zone;
> +
>  /*
>   * Convert on-disk form of btree root to in-memory form.
>   */
> @@ -552,7 +554,7 @@ xfs_bmbt_init_cursor(
>  	struct xfs_btree_cur	*cur;
>  	ASSERT(whichfork != XFS_COW_FORK);
>  
> -	cur = kmem_cache_zalloc(xfs_btree_cur_zone, GFP_NOFS | __GFP_NOFAIL);
> +	cur = kmem_cache_zalloc(xfs_bmbt_cur_zone, GFP_NOFS | __GFP_NOFAIL);
>  
>  	cur->bc_tp = tp;
>  	cur->bc_mp = mp;
> diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
> index 298395481713..7ef19f365e33 100644
> --- a/fs/xfs/libxfs/xfs_btree.c
> +++ b/fs/xfs/libxfs/xfs_btree.c
> @@ -23,10 +23,6 @@
>  #include "xfs_btree_staging.h"
>  #include "xfs_ag.h"
>  
> -/*
> - * Cursor allocation zone.
> - */
> -kmem_zone_t	*xfs_btree_cur_zone;
>  
>  /*
>   * Btree magic numbers.
> @@ -379,7 +375,29 @@ xfs_btree_del_cursor(
>  		kmem_free(cur->bc_ops);
>  	if (!(cur->bc_flags & XFS_BTREE_LONG_PTRS) && cur->bc_ag.pag)
>  		xfs_perag_put(cur->bc_ag.pag);
> -	kmem_cache_free(xfs_btree_cur_zone, cur);
> +
> +	switch (cur->bc_btnum) {
> +	case XFS_BTNUM_BMAP:
> +		kmem_cache_free(xfs_bmbt_cur_zone, cur);
> +		break;
> +	case XFS_BTNUM_BNO:
> +	case XFS_BTNUM_CNT:
> +		kmem_cache_free(xfs_allocbt_cur_zone, cur);
> +		break;
> +	case XFS_BTNUM_INOBT:
> +	case XFS_BTNUM_FINOBT:
> +		kmem_cache_free(xfs_inobt_cur_zone, cur);
> +		break;
> +	case XFS_BTNUM_RMAP:
> +		kmem_cache_free(xfs_rmapbt_cur_zone, cur);
> +		break;
> +	case XFS_BTNUM_REFCNT:
> +		kmem_cache_free(xfs_refcntbt_cur_zone, cur);
> +		break;
> +	default:
> +		ASSERT(0);
> +		break;
> +	}
>  }
>  
>  /*
> diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
> index 4eaf8517f850..acdf087c853a 100644
> --- a/fs/xfs/libxfs/xfs_btree.h
> +++ b/fs/xfs/libxfs/xfs_btree.h
> @@ -13,7 +13,11 @@ struct xfs_trans;
>  struct xfs_ifork;
>  struct xfs_perag;
>  
> -extern kmem_zone_t	*xfs_btree_cur_zone;
> +extern struct kmem_cache	*xfs_allocbt_cur_zone;
> +extern struct kmem_cache	*xfs_inobt_cur_zone;
> +extern struct kmem_cache	*xfs_bmbt_cur_zone;
> +extern struct kmem_cache	*xfs_rmapbt_cur_zone;
> +extern struct kmem_cache	*xfs_refcntbt_cur_zone;
>  
>  /*
>   * Generic key, ptr and record wrapper structures.
> diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
> index 27190840c5d8..5258696f153e 100644
> --- a/fs/xfs/libxfs/xfs_ialloc_btree.c
> +++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
> @@ -22,6 +22,8 @@
>  #include "xfs_rmap.h"
>  #include "xfs_ag.h"
>  
> +struct kmem_cache	*xfs_inobt_cur_zone;
> +
>  STATIC int
>  xfs_inobt_get_minrecs(
>  	struct xfs_btree_cur	*cur,
> @@ -432,7 +434,7 @@ xfs_inobt_init_common(
>  {
>  	struct xfs_btree_cur	*cur;
>  
> -	cur = kmem_cache_zalloc(xfs_btree_cur_zone, GFP_NOFS | __GFP_NOFAIL);
> +	cur = kmem_cache_zalloc(xfs_inobt_cur_zone, GFP_NOFS | __GFP_NOFAIL);
>  	cur->bc_tp = tp;
>  	cur->bc_mp = mp;
>  	cur->bc_btnum = btnum;
> diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c
> index 1ef9b99962ab..20667f173040 100644
> --- a/fs/xfs/libxfs/xfs_refcount_btree.c
> +++ b/fs/xfs/libxfs/xfs_refcount_btree.c
> @@ -21,6 +21,8 @@
>  #include "xfs_rmap.h"
>  #include "xfs_ag.h"
>  
> +struct kmem_cache	*xfs_refcntbt_cur_zone;
> +
>  static struct xfs_btree_cur *
>  xfs_refcountbt_dup_cursor(
>  	struct xfs_btree_cur	*cur)
> @@ -322,7 +324,7 @@ xfs_refcountbt_init_common(
>  
>  	ASSERT(pag->pag_agno < mp->m_sb.sb_agcount);
>  
> -	cur = kmem_cache_zalloc(xfs_btree_cur_zone, GFP_NOFS | __GFP_NOFAIL);
> +	cur = kmem_cache_zalloc(xfs_refcntbt_cur_zone, GFP_NOFS | __GFP_NOFAIL);
>  	cur->bc_tp = tp;
>  	cur->bc_mp = mp;
>  	cur->bc_btnum = XFS_BTNUM_REFC;
> diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
> index b7dbbfb3aeed..cb6e64f6d8f9 100644
> --- a/fs/xfs/libxfs/xfs_rmap_btree.c
> +++ b/fs/xfs/libxfs/xfs_rmap_btree.c
> @@ -22,6 +22,8 @@
>  #include "xfs_ag.h"
>  #include "xfs_ag_resv.h"
>  
> +struct kmem_cache	*xfs_rmapbt_cur_zone;
> +
>  /*
>   * Reverse map btree.
>   *
> @@ -451,7 +453,7 @@ xfs_rmapbt_init_common(
>  {
>  	struct xfs_btree_cur	*cur;
>  
> -	cur = kmem_cache_zalloc(xfs_btree_cur_zone, GFP_NOFS | __GFP_NOFAIL);
> +	cur = kmem_cache_zalloc(xfs_rmapbt_cur_zone, GFP_NOFS | __GFP_NOFAIL);
>  	cur->bc_tp = tp;
>  	cur->bc_mp = mp;
>  	/* Overlapping btree; 2 keys per pointer. */
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 90716b9d6e5f..3f97dc1b41e0 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1965,10 +1965,24 @@ xfs_init_zones(void)
>  	if (!xfs_bmap_free_item_zone)
>  		goto out_destroy_log_ticket_zone;
>  
> -	xfs_btree_cur_zone = kmem_cache_create("xfs_btree_cur",
> +	xfs_allocbt_cur_zone = kmem_cache_create("xfs_allocbt_cur",
>  					       sizeof(struct xfs_btree_cur),
>  					       0, 0, NULL);
> -	if (!xfs_btree_cur_zone)
> +	xfs_inobt_cur_zone = kmem_cache_create("xfs_inobt_cur",
> +					       sizeof(struct xfs_btree_cur),
> +					       0, 0, NULL);
> +	xfs_bmbt_cur_zone = kmem_cache_create("xfs_bmbt_cur",
> +					       sizeof(struct xfs_btree_cur),
> +					       0, 0, NULL);
> +	xfs_rmapbt_cur_zone = kmem_cache_create("xfs_rmapbt_cur",
> +					       sizeof(struct xfs_btree_cur),
> +					       0, 0, NULL);
> +	xfs_refcntbt_cur_zone = kmem_cache_create("xfs_refcnt_cur",
> +					       sizeof(struct xfs_btree_cur),
> +					       0, 0, NULL);
> +	if (!xfs_allocbt_cur_zone || !xfs_inobt_cur_zone ||
> +	    !xfs_bmbt_cur_zone || !xfs_rmapbt_cur_zone ||
> +	    !xfs_refcntbt_cur_zone)
>  		goto out_destroy_bmap_free_item_zone;
>  
>  	xfs_da_state_zone = kmem_cache_create("xfs_da_state",
> @@ -2106,7 +2120,11 @@ xfs_init_zones(void)
>   out_destroy_da_state_zone:
>  	kmem_cache_destroy(xfs_da_state_zone);
>   out_destroy_btree_cur_zone:
> -	kmem_cache_destroy(xfs_btree_cur_zone);
> +	kmem_cache_destroy(xfs_allocbt_cur_zone);
> +	kmem_cache_destroy(xfs_inobt_cur_zone);
> +	kmem_cache_destroy(xfs_bmbt_cur_zone);
> +	kmem_cache_destroy(xfs_rmapbt_cur_zone);
> +	kmem_cache_destroy(xfs_refcntbt_cur_zone);
>   out_destroy_bmap_free_item_zone:
>  	kmem_cache_destroy(xfs_bmap_free_item_zone);
>   out_destroy_log_ticket_zone:
> @@ -2138,7 +2156,11 @@ xfs_destroy_zones(void)
>  	kmem_cache_destroy(xfs_trans_zone);
>  	kmem_cache_destroy(xfs_ifork_zone);
>  	kmem_cache_destroy(xfs_da_state_zone);
> -	kmem_cache_destroy(xfs_btree_cur_zone);
> +	kmem_cache_destroy(xfs_allocbt_cur_zone);
> +	kmem_cache_destroy(xfs_inobt_cur_zone);
> +	kmem_cache_destroy(xfs_bmbt_cur_zone);
> +	kmem_cache_destroy(xfs_rmapbt_cur_zone);
> +	kmem_cache_destroy(xfs_refcntbt_cur_zone);
>  	kmem_cache_destroy(xfs_bmap_free_item_zone);
>  	kmem_cache_destroy(xfs_log_ticket_zone);
>  }


* Re: [PATCH 11/14] xfs: dynamically allocate cursors based on maxlevels
  2021-09-23  1:58         ` Darrick J. Wong
@ 2021-09-23  5:56           ` Chandan Babu R
  0 siblings, 0 replies; 48+ messages in thread
From: Chandan Babu R @ 2021-09-23  5:56 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Dave Chinner, chandanrlinux, linux-xfs


On 23 Sep 2021 at 07:28, Darrick J. Wong wrote:
> On Thu, Sep 23, 2021 at 09:10:15AM +1000, Dave Chinner wrote:
>> On Wed, Sep 22, 2021 at 10:38:21AM -0700, Darrick J. Wong wrote:
>> > On Tue, Sep 21, 2021 at 09:06:35AM +1000, Dave Chinner wrote:
>> > > On Fri, Sep 17, 2021 at 06:30:10PM -0700, Darrick J. Wong wrote:
>> > > >  /* Allocate a new btree cursor of the appropriate size. */
>> > > >  struct xfs_btree_cur *
>> > > >  xfs_btree_alloc_cursor(
>> > > > @@ -4935,13 +4956,16 @@ xfs_btree_alloc_cursor(
>> > > >  	xfs_btnum_t		btnum)
>> > > >  {
>> > > >  	struct xfs_btree_cur	*cur;
>> > > > +	unsigned int		maxlevels = xfs_btree_maxlevels(mp, btnum);
>> > > >  
>> > > > -	cur = kmem_cache_zalloc(xfs_btree_cur_zone, GFP_NOFS | __GFP_NOFAIL);
>> > > > +	ASSERT(maxlevels <= XFS_BTREE_MAXLEVELS);
>> > > > +
>> > > > +	cur = kmem_zalloc(xfs_btree_cur_sizeof(maxlevels), KM_NOFS);
>> > > 
>> > > Instead of multiple dynamic runtime calculations to determine the
>> > > size to allocate from the heap, which then has to select a slab
>> > > based on size, why don't we just pre-calculate the max size of
>> > > the cursor at XFS module init and use that for the btree cursor slab
>> > > size?
>> > 
>> > As part of developing the realtime rmapbt and reflink btrees, I computed
>> > the maximum theoretical btree height for a maximally sized realtime
>> > volume.  For a realtime volume with 2^52 blocks and a 1k block size, I
>> > estimate that you'd need a 11-level rtrefcount btree cursor.  The rtrmap
>> > btree cursor would have to be 28 levels high.  Using 4k blocks instead
>> > of 1k blocks, it's not so bad -- 8 for rtrefcount and 17 for rtrmap.
>> 
>> I'm going to state straight out that 1k block sizes for the rt
>> device are insane. That's not what that device was intended to
>> support, ever. It was intended for workloads with -large-,
>> consistent extent sizes in large contiguous runs, not tiny, small
>> random allocations of individual blocks.
>> 
>> So if we are going to be talking about the overhead RT block
>> management for new functionality, we need to start by putting
>> reasonable limits on the block sizes that the RT device will support
>> such features for. Because while a btree might scale to 2^52 x 1kB
>> blocks, the RT allocation bitmap sure as hell doesn't. It probably
>> doesn't even scale at all well above a few million blocks for
>> general usage.
>> 
>> Hence I don't think it's worth optimising for these cases when we
>> think about maximum btree sizes for the cursors - those btrees can
>> provide their own cursor slab to allocate from if it comes to it.
>> 
>> Really, if we want to scale RT devices to insane sizes, we need to
>> move to an AG based structure for it which breaks up the bitmaps and
>> summary files into regions to keep the overhead and max sizes under
>> control.
>
> Heh.  That just sounds like more work that I get to do...
>
>> > I don't recall exactly what Chandan said the maximum bmbt height would
>> > need to be to support really large data fork mapping structures, but
>> > based on my worst case estimate of 2^54 single-block mappings and a 1k
>> > blocksize, you'd need a 12-level bmbt cursor.  For 4k blocks, you'd need
>> > only 8 levels.

With 2^48 = 280e12 as the maximum extent count,
- 1k block size
  - Minimum number of records in leaf = 29
  - Minimum number of records in node = 29
  - Maximum height of BMBT = 10 (i.e. 1 more than the current value of
    XFS_BTREE_MAXLEVELS)
    |-------+------------+-----------------------------|
    | Level | nr_records | Nr blocks = nr_records / 29 |
    |-------+------------+-----------------------------|
    |     1 |     280e12 |                      9.7e12 |
    |     2 |     9.7e12 |                       330e9 |
    |     3 |      330e9 |                        11e9 |
    |     4 |       11e9 |                       380e6 |
    |     5 |      380e6 |                        13e6 |
    |     6 |       13e6 |                       450e3 |
    |     7 |      450e3 |                        16e3 |
    |     8 |       16e3 |                       550e0 |
    |     9 |      550e0 |                        19e0 |
    |    10 |       19e0 |                           1 |
    |-------+------------+-----------------------------|
- 4k block size
  - Minimum number of records in leaf = 125
  - Minimum number of records in node = 125
  - Maximum height of BMBT = 7
    |-------+------------+------------------------------|
    | Level | nr_records | Nr blocks = nr_records / 125 |
    |-------+------------+------------------------------|
    |     1 |     280e12 |                       2.2e12 |
    |     2 |     2.2e12 |                         18e9 |
    |     3 |       18e9 |                        140e6 |
    |     4 |      140e6 |                        1.1e6 |
    |     5 |      1.1e6 |                        8.8e3 |
    |     6 |      8.8e3 |                         70e0 |
    |     7 |       70e0 |                            1 |
    |-------+------------+------------------------------|

Hence if we are creating different btree cursor zones, then size of a BMBT
cursor object should be calculated based on the tree having a maximum height
of 10.

>> 
>> Yup, it's not significantly different to what we have now.
>> 
>> > The current XFS_BTREE_MAXLEVELS is 9, which just so happens to fit in
>> > 248 bytes.  I will rework this patch to make xfs_btree_cur_zone supply
>> > 256-byte cursors, and the btree code will continue using the zone if 256
>> > bytes is enough space for the cursor.
>> >
>> > If we decide later on that we need a zone for larger cursors, I think
>> > the next logical size up (512 bytes) will fit 25 levels, but let's wait
>> > to get there first.
>> 
>> I suspect you may misunderstand how SLUB caches work. SLUB packs
>> non-power of two sized slabs tightly to natural alignment (8 bytes).
>> e.g.:
>> 
>> $ sudo grep xfs_btree_cur /proc/slabinfo
>> xfs_btree_cur       1152   1152    224   36    2 : tunables    0 0    0 : slabdata     32     32      0
>> 
>> SLUB is using an order-1 base page (2 pages), with 36 cursor objects
>> in it. 36 * 224 = 8064 bytes, which means it is packed as tightly as
>> possible. It is not using 256 byte objects for these btree cursors.
>
> Ahah, I didn't realize that.  Yes, taking that into mind, the 256-byte
> thing is unnecessary.
>
>> If we allocate these 224 byte objects _from the heap_, however, then
>> the 256 byte heap slab will be selected, which means the object is
>> then padded to 256 bytes -by the heap-. The SLUB allocator does not
>> pad the objects, it's the heap granularity that adds padding to the
>> objects.
>> 
>> This implicit padding of heap objects is another reason we don't
>> want to use the heap for anything we frequently allocate or allocate
>> in large amounts. It can result in substantial amounts of wasted
>> memory.
>> 
>> IOWs, we don't actually care about object size granularity for slab
>> cache allocated objects.
>> 
>> However, if we really want to look at memory usage of struct
>> xfs_btree_cur, pahole tells me:
>> 
>> 	/* size: 224, cachelines: 4, members: 13 */
>> 
>> Where are the extra 24 bytes coming from on your kernel?
>
> Not sure.  Can you post your pahole output?
>
>> It also tells me that a bunch of space can be taken out of it:
>> 
>> - 4 byte hole that bc_btnum can be moved into.
>> - bc_blocklog is set but not used, so it can go, too.
>> - bc_ag.refc.nr_ops doesn't need to be an unsigned long
>
> I'll look into those tomorrow.
>
>> - optimising bc_ra state. That just tracks if
>>   the current cursor has already done sibling readahead - it's two
>>   bits per level, held in an int8_t per level. Could be a pair of
>>   int16_t bitmasks if maxlevel is 12, that would save another 8
>>   bytes. If maxlevel == 28 as per the rt case above, then a pair of
>>   int32_t bitmasks saves 4 bytes for 12 levels and 20 bytes
>>   for 28 levels...
>
> I don't think that optimizing bc_ra buys us much.  struct
> xfs_btree_level will be 16 bytes anyway due to alignment of the xfs_buf
> pointer, so we might as well use the extra bytes.
>
>> Hence if we're concerned about space usage of the btree cursor,
>> these seem like low hanging fruit.
>> 
>> Maybe the best thing here, as Christoph mentioned, is to have a set
>> of btree cursor zones for the different size limits. All the per-ag
>> btrees have the same (small) size limits, while the BMBT is bigger.
>> And the RT btrees when they arrive will be bigger again. Given that
>> we already allocate the cursors based on the type of btree they are
>> going to walk, this seems like it would be pretty easy to do,
>> something like the patch below, perhaps?
>
> Um... the bmbt cache looks like it has the same size as the rest?
>
> It's not so hard to make there be separate zones though.
>
> --D
>
>> Cheers,
>> 
>> Dave.
>> -- 
>> Dave Chinner
>> david@fromorbit.com
>> 
>> xfs: per-btree cursor slab caches
>> ---
>>  fs/xfs/libxfs/xfs_alloc_btree.c    |  3 ++-
>>  fs/xfs/libxfs/xfs_bmap_btree.c     |  4 +++-
>>  fs/xfs/libxfs/xfs_btree.c          | 28 +++++++++++++++++++++++-----
>>  fs/xfs/libxfs/xfs_btree.h          |  6 +++++-
>>  fs/xfs/libxfs/xfs_ialloc_btree.c   |  4 +++-
>>  fs/xfs/libxfs/xfs_refcount_btree.c |  4 +++-
>>  fs/xfs/libxfs/xfs_rmap_btree.c     |  4 +++-
>>  fs/xfs/xfs_super.c                 | 30 ++++++++++++++++++++++++++----
>>  8 files changed, 68 insertions(+), 15 deletions(-)
>> 
>> diff --git a/fs/xfs/libxfs/xfs_alloc_btree.c b/fs/xfs/libxfs/xfs_alloc_btree.c
>> index 6746fd735550..53ead7b98238 100644
>> --- a/fs/xfs/libxfs/xfs_alloc_btree.c
>> +++ b/fs/xfs/libxfs/xfs_alloc_btree.c
>> @@ -20,6 +20,7 @@
>>  #include "xfs_trans.h"
>>  #include "xfs_ag.h"
>>  
>> +struct kmem_cache	*xfs_allocbt_cur_zone;
>>  
>>  STATIC struct xfs_btree_cur *
>>  xfs_allocbt_dup_cursor(
>> @@ -477,7 +478,7 @@ xfs_allocbt_init_common(
>>  
>>  	ASSERT(btnum == XFS_BTNUM_BNO || btnum == XFS_BTNUM_CNT);
>>  
>> -	cur = kmem_cache_zalloc(xfs_btree_cur_zone, GFP_NOFS | __GFP_NOFAIL);
>> +	cur = kmem_cache_zalloc(xfs_allocbt_cur_zone, GFP_NOFS | __GFP_NOFAIL);
>>  
>>  	cur->bc_tp = tp;
>>  	cur->bc_mp = mp;
>> diff --git a/fs/xfs/libxfs/xfs_bmap_btree.c b/fs/xfs/libxfs/xfs_bmap_btree.c
>> index 72444b8b38a6..e3f7107ce2e2 100644
>> --- a/fs/xfs/libxfs/xfs_bmap_btree.c
>> +++ b/fs/xfs/libxfs/xfs_bmap_btree.c
>> @@ -22,6 +22,8 @@
>>  #include "xfs_trace.h"
>>  #include "xfs_rmap.h"
>>  
>> +struct kmem_cache	*xfs_bmbt_cur_zone;
>> +
>>  /*
>>   * Convert on-disk form of btree root to in-memory form.
>>   */
>> @@ -552,7 +554,7 @@ xfs_bmbt_init_cursor(
>>  	struct xfs_btree_cur	*cur;
>>  	ASSERT(whichfork != XFS_COW_FORK);
>>  
>> -	cur = kmem_cache_zalloc(xfs_btree_cur_zone, GFP_NOFS | __GFP_NOFAIL);
>> +	cur = kmem_cache_zalloc(xfs_bmbt_cur_zone, GFP_NOFS | __GFP_NOFAIL);
>>  
>>  	cur->bc_tp = tp;
>>  	cur->bc_mp = mp;
>> diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
>> index 298395481713..7ef19f365e33 100644
>> --- a/fs/xfs/libxfs/xfs_btree.c
>> +++ b/fs/xfs/libxfs/xfs_btree.c
>> @@ -23,10 +23,6 @@
>>  #include "xfs_btree_staging.h"
>>  #include "xfs_ag.h"
>>  
>> -/*
>> - * Cursor allocation zone.
>> - */
>> -kmem_zone_t	*xfs_btree_cur_zone;
>>  
>>  /*
>>   * Btree magic numbers.
>> @@ -379,7 +375,29 @@ xfs_btree_del_cursor(
>>  		kmem_free(cur->bc_ops);
>>  	if (!(cur->bc_flags & XFS_BTREE_LONG_PTRS) && cur->bc_ag.pag)
>>  		xfs_perag_put(cur->bc_ag.pag);
>> -	kmem_cache_free(xfs_btree_cur_zone, cur);
>> +
>> +	switch (cur->bc_btnum) {
>> +	case XFS_BTNUM_BMAP:
>> +		kmem_cache_free(xfs_bmbt_cur_zone, cur);
>> +		break;
>> +	case XFS_BTNUM_BNO:
>> +	case XFS_BTNUM_CNT:
>> +		kmem_cache_free(xfs_allocbt_cur_zone, cur);
>> +		break;
>> +	case XFS_BTNUM_INOBT:
>> +	case XFS_BTNUM_FINOBT:
>> +		kmem_cache_free(xfs_inobt_cur_zone, cur);
>> +		break;
>> +	case XFS_BTNUM_RMAP:
>> +		kmem_cache_free(xfs_rmapbt_cur_zone, cur);
>> +		break;
>> +	case XFS_BTNUM_REFCNT:
>> +		kmem_cache_free(xfs_refcntbt_cur_zone, cur);
>> +		break;
>> +	default:
>> +		ASSERT(0);
>> +		break;
>> +	}
>>  }
>>  
>>  /*
>> diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
>> index 4eaf8517f850..acdf087c853a 100644
>> --- a/fs/xfs/libxfs/xfs_btree.h
>> +++ b/fs/xfs/libxfs/xfs_btree.h
>> @@ -13,7 +13,11 @@ struct xfs_trans;
>>  struct xfs_ifork;
>>  struct xfs_perag;
>>  
>> -extern kmem_zone_t	*xfs_btree_cur_zone;
>> +extern struct kmem_cache	*xfs_allocbt_cur_zone;
>> +extern struct kmem_cache	*xfs_inobt_cur_zone;
>> +extern struct kmem_cache	*xfs_bmbt_cur_zone;
>> +extern struct kmem_cache	*xfs_rmapbt_cur_zone;
>> +extern struct kmem_cache	*xfs_refcntbt_cur_zone;
>>  
>>  /*
>>   * Generic key, ptr and record wrapper structures.
>> diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
>> index 27190840c5d8..5258696f153e 100644
>> --- a/fs/xfs/libxfs/xfs_ialloc_btree.c
>> +++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
>> @@ -22,6 +22,8 @@
>>  #include "xfs_rmap.h"
>>  #include "xfs_ag.h"
>>  
>> +struct kmem_cache	*xfs_inobt_cur_zone;
>> +
>>  STATIC int
>>  xfs_inobt_get_minrecs(
>>  	struct xfs_btree_cur	*cur,
>> @@ -432,7 +434,7 @@ xfs_inobt_init_common(
>>  {
>>  	struct xfs_btree_cur	*cur;
>>  
>> -	cur = kmem_cache_zalloc(xfs_btree_cur_zone, GFP_NOFS | __GFP_NOFAIL);
>> +	cur = kmem_cache_zalloc(xfs_inobt_cur_zone, GFP_NOFS | __GFP_NOFAIL);
>>  	cur->bc_tp = tp;
>>  	cur->bc_mp = mp;
>>  	cur->bc_btnum = btnum;
>> diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c
>> index 1ef9b99962ab..20667f173040 100644
>> --- a/fs/xfs/libxfs/xfs_refcount_btree.c
>> +++ b/fs/xfs/libxfs/xfs_refcount_btree.c
>> @@ -21,6 +21,8 @@
>>  #include "xfs_rmap.h"
>>  #include "xfs_ag.h"
>>  
>> +struct kmem_cache	*xfs_refcntbt_cur_zone;
>> +
>>  static struct xfs_btree_cur *
>>  xfs_refcountbt_dup_cursor(
>>  	struct xfs_btree_cur	*cur)
>> @@ -322,7 +324,7 @@ xfs_refcountbt_init_common(
>>  
>>  	ASSERT(pag->pag_agno < mp->m_sb.sb_agcount);
>>  
>> -	cur = kmem_cache_zalloc(xfs_btree_cur_zone, GFP_NOFS | __GFP_NOFAIL);
>> +	cur = kmem_cache_zalloc(xfs_refcntbt_cur_zone, GFP_NOFS | __GFP_NOFAIL);
>>  	cur->bc_tp = tp;
>>  	cur->bc_mp = mp;
>>  	cur->bc_btnum = XFS_BTNUM_REFC;
>> diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
>> index b7dbbfb3aeed..cb6e64f6d8f9 100644
>> --- a/fs/xfs/libxfs/xfs_rmap_btree.c
>> +++ b/fs/xfs/libxfs/xfs_rmap_btree.c
>> @@ -22,6 +22,8 @@
>>  #include "xfs_ag.h"
>>  #include "xfs_ag_resv.h"
>>  
>> +struct kmem_cache	*xfs_rmapbt_cur_zone;
>> +
>>  /*
>>   * Reverse map btree.
>>   *
>> @@ -451,7 +453,7 @@ xfs_rmapbt_init_common(
>>  {
>>  	struct xfs_btree_cur	*cur;
>>  
>> -	cur = kmem_cache_zalloc(xfs_btree_cur_zone, GFP_NOFS | __GFP_NOFAIL);
>> +	cur = kmem_cache_zalloc(xfs_rmapbt_cur_zone, GFP_NOFS | __GFP_NOFAIL);
>>  	cur->bc_tp = tp;
>>  	cur->bc_mp = mp;
>>  	/* Overlapping btree; 2 keys per pointer. */
>> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
>> index 90716b9d6e5f..3f97dc1b41e0 100644
>> --- a/fs/xfs/xfs_super.c
>> +++ b/fs/xfs/xfs_super.c
>> @@ -1965,10 +1965,24 @@ xfs_init_zones(void)
>>  	if (!xfs_bmap_free_item_zone)
>>  		goto out_destroy_log_ticket_zone;
>>  
>> -	xfs_btree_cur_zone = kmem_cache_create("xfs_btree_cur",
>> +	xfs_allocbt_cur_zone = kmem_cache_create("xfs_allocbt_cur",
>>  					       sizeof(struct xfs_btree_cur),
>>  					       0, 0, NULL);
>> -	if (!xfs_btree_cur_zone)
>> +	xfs_inobt_cur_zone = kmem_cache_create("xfs_inobt_cur",
>> +					       sizeof(struct xfs_btree_cur),
>> +					       0, 0, NULL);
>> +	xfs_bmbt_cur_zone = kmem_cache_create("xfs_bmbt_cur",
>> +					       sizeof(struct xfs_btree_cur),
>> +					       0, 0, NULL);
>> +	xfs_rmapbt_cur_zone = kmem_cache_create("xfs_rmapbt_cur",
>> +					       sizeof(struct xfs_btree_cur),
>> +					       0, 0, NULL);
>> +	xfs_refcntbt_cur_zone = kmem_cache_create("xfs_refcnt_cur",
>> +					       sizeof(struct xfs_btree_cur),
>> +					       0, 0, NULL);
>> +	if (!xfs_allocbt_cur_zone || !xfs_inobt_cur_zone ||
>> +	    !xfs_bmbt_cur_zone || !xfs_rmapbt_cur_zone ||
>> +	    !xfs_refcntbt_cur_zone)
>>  		goto out_destroy_bmap_free_item_zone;
>>  
>>  	xfs_da_state_zone = kmem_cache_create("xfs_da_state",
>> @@ -2106,7 +2120,11 @@ xfs_init_zones(void)
>>   out_destroy_da_state_zone:
>>  	kmem_cache_destroy(xfs_da_state_zone);
>>   out_destroy_btree_cur_zone:
>> -	kmem_cache_destroy(xfs_btree_cur_zone);
>> +	kmem_cache_destroy(xfs_allocbt_cur_zone);
>> +	kmem_cache_destroy(xfs_inobt_cur_zone);
>> +	kmem_cache_destroy(xfs_bmbt_cur_zone);
>> +	kmem_cache_destroy(xfs_rmapbt_cur_zone);
>> +	kmem_cache_destroy(xfs_refcntbt_cur_zone);
>>   out_destroy_bmap_free_item_zone:
>>  	kmem_cache_destroy(xfs_bmap_free_item_zone);
>>   out_destroy_log_ticket_zone:
>> @@ -2138,7 +2156,11 @@ xfs_destroy_zones(void)
>>  	kmem_cache_destroy(xfs_trans_zone);
>>  	kmem_cache_destroy(xfs_ifork_zone);
>>  	kmem_cache_destroy(xfs_da_state_zone);
>> -	kmem_cache_destroy(xfs_btree_cur_zone);
>> +	kmem_cache_destroy(xfs_allocbt_cur_zone);
>> +	kmem_cache_destroy(xfs_inobt_cur_zone);
>> +	kmem_cache_destroy(xfs_bmbt_cur_zone);
>> +	kmem_cache_destroy(xfs_rmapbt_cur_zone);
>> +	kmem_cache_destroy(xfs_refcntbt_cur_zone);
>>  	kmem_cache_destroy(xfs_bmap_free_item_zone);
>>  	kmem_cache_destroy(xfs_log_ticket_zone);
>>  }


-- 
chandan


end of thread, other threads:[~2021-09-23  5:56 UTC | newest]

Thread overview: 48+ messages
-- links below jump to the message on this page --
2021-09-18  1:29 [PATCHSET RFC chandan 00/14] xfs: support dynamic btree cursor height Darrick J. Wong
2021-09-18  1:29 ` [PATCH 01/14] xfs: remove xfs_btree_cur_t typedef Darrick J. Wong
2021-09-20  9:53   ` Chandan Babu R
2021-09-21  8:36   ` Christoph Hellwig
2021-09-18  1:29 ` [PATCH 02/14] xfs: don't allocate scrub contexts on the stack Darrick J. Wong
2021-09-20  9:53   ` Chandan Babu R
2021-09-20 17:39     ` Darrick J. Wong
2021-09-21  8:39   ` Christoph Hellwig
2021-09-18  1:29 ` [PATCH 03/14] xfs: dynamically allocate btree scrub context structure Darrick J. Wong
2021-09-20  9:53   ` Chandan Babu R
2021-09-21  8:43   ` Christoph Hellwig
2021-09-22 16:17     ` Darrick J. Wong
2021-09-18  1:29 ` [PATCH 04/14] xfs: stricter btree height checking when looking for errors Darrick J. Wong
2021-09-20  9:54   ` Chandan Babu R
2021-09-18  1:29 ` [PATCH 05/14] xfs: stricter btree height checking when scanning for btree roots Darrick J. Wong
2021-09-20  9:54   ` Chandan Babu R
2021-09-18  1:29 ` [PATCH 06/14] xfs: check that bc_nlevels never overflows Darrick J. Wong
2021-09-20  9:54   ` Chandan Babu R
2021-09-21  8:44   ` Christoph Hellwig
2021-09-18  1:29 ` [PATCH 07/14] xfs: support dynamic btree cursor heights Darrick J. Wong
2021-09-20  9:55   ` Chandan Babu R
2021-09-21  8:49   ` Christoph Hellwig
2021-09-18  1:29 ` [PATCH 08/14] xfs: refactor btree cursor allocation function Darrick J. Wong
2021-09-20  9:55   ` Chandan Babu R
2021-09-21  8:53   ` Christoph Hellwig
2021-09-18  1:29 ` [PATCH 09/14] xfs: fix maxlevels comparisons in the btree staging code Darrick J. Wong
2021-09-20  9:55   ` Chandan Babu R
2021-09-21  8:56   ` Christoph Hellwig
2021-09-22 15:59     ` Darrick J. Wong
2021-09-18  1:30 ` [PATCH 10/14] xfs: encode the max btree height in the cursor Darrick J. Wong
2021-09-20  9:55   ` Chandan Babu R
2021-09-21  8:57   ` Christoph Hellwig
2021-09-18  1:30 ` [PATCH 11/14] xfs: dynamically allocate cursors based on maxlevels Darrick J. Wong
2021-09-20  9:56   ` Chandan Babu R
2021-09-20 23:06   ` Dave Chinner
2021-09-20 23:36     ` Dave Chinner
2021-09-21  9:03     ` Christoph Hellwig
2021-09-22 18:55       ` Darrick J. Wong
2021-09-22 17:38     ` Darrick J. Wong
2021-09-22 23:10       ` Dave Chinner
2021-09-23  1:58         ` Darrick J. Wong
2021-09-23  5:56           ` Chandan Babu R
2021-09-18  1:30 ` [PATCH 12/14] xfs: compute actual maximum btree height for critical reservation calculation Darrick J. Wong
2021-09-20  9:56   ` Chandan Babu R
2021-09-18  1:30 ` [PATCH 13/14] xfs: compute the maximum height of the rmap btree when reflink enabled Darrick J. Wong
2021-09-20  9:56   ` Chandan Babu R
2021-09-18  1:30 ` [PATCH 14/14] xfs: kill XFS_BTREE_MAXLEVELS Darrick J. Wong
2021-09-20  9:57   ` Chandan Babu R
