All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCHSET v4 00/17] xfs: support dynamic btree cursor height
@ 2021-10-14 20:16 Darrick J. Wong
  2021-10-14 20:17 ` [PATCH 01/17] xfs: fix incorrect decoding in xchk_btree_cur_fsbno Darrick J. Wong
                   ` (16 more replies)
  0 siblings, 17 replies; 35+ messages in thread
From: Darrick J. Wong @ 2021-10-14 20:16 UTC (permalink / raw)
  To: djwong, david
  Cc: Chandan Babu R, Christoph Hellwig, Dave Chinner, linux-xfs,
	chandan.babu, hch

Hi all,

In what's left of this series, we rearrange the incore btree cursor so
that we can support btrees of any height.  This will become necessary
for realtime rmap and reflink since we'd like to handle tall trees
without bloating the AG btree cursors.

Chandan Babu pointed out that his large extent counters series depends
on the ability to have btree cursors of arbitrary heights, so I've
ported this to 5.15-rc4 so his patchsets won't have to depend on
djwong-dev for submission.

Following the review discussions about the dynamic btree cursor height
patches, I've throw together another series to reduce the size of the
btree cursor, compute the absolute maximum possible btree heights for
each btree type, and now each btree cursor has its own slab cache:

$ grep xfs.*cur /proc/slabinfo
xfs_refcbt_cur 0 0 200 20 1 : tunables 0 0 0 : slabdata 4 4 0
xfs_rmapbt_cur 0 0 248 16 1 : tunables 0 0 0 : slabdata 4 4 0
xfs_bmbt_cur   0 0 248 16 1 : tunables 0 0 0 : slabdata 4 4 0
xfs_inobt_cur  0 0 216 18 1 : tunables 0 0 0 : slabdata 4 4 0
xfs_bnobt_cur  0 0 216 18 1 : tunables 0 0 0 : slabdata 4 4 0

I've also rigged up the debugger to make it easier to extract the actual
height information:

$ xfs_db /dev/sda -c 'btheight -w absmax all'
bnobt: 7
cntbt: 7
inobt: 7
finobt: 7
bmapbt: 9
refcountbt: 6
rmapbt: 9

As you can see from the slabinfo output, this no longer means that we're
allocating 224-byte cursors for all five btree types.  Even with the
extra overhead of supporting dynamic cursor sizes and per-btree caches,
we still come out ahead in terms of cursor size for three of the five
btree types.

This series now also includes a couple of patches to reduce holes and
unnecessary fields in the btree cursor.

Patches 1, 5, 11, 12, 13, 16, and 17 still need review.

v2: reduce scrub btree checker memory footprint even more, put the one
    fixpatch first, use struct_size, fix 80col problems, move all the
    btree cache work to a separate series
v3: rebase to 5.15-rc4, fold in the per-btree cursor cache patches,
    remove all the references to "zones" since they're called "caches"
    in Linux
v4: improve documentation, clean up some geometry functions, simplify
    the max possible height computation code

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=btree-dynamic-depth-5.16
---
 fs/xfs/libxfs/xfs_ag_resv.c        |    3 
 fs/xfs/libxfs/xfs_alloc.c          |   26 ++-
 fs/xfs/libxfs/xfs_alloc.h          |    2 
 fs/xfs/libxfs/xfs_alloc_btree.c    |   63 ++++++-
 fs/xfs/libxfs/xfs_alloc_btree.h    |    5 +
 fs/xfs/libxfs/xfs_bmap.c           |   13 +
 fs/xfs/libxfs/xfs_bmap_btree.c     |   60 ++++++-
 fs/xfs/libxfs/xfs_bmap_btree.h     |    5 +
 fs/xfs/libxfs/xfs_btree.c          |  320 ++++++++++++++++++++++--------------
 fs/xfs/libxfs/xfs_btree.h          |   87 +++++++---
 fs/xfs/libxfs/xfs_btree_staging.c  |   10 +
 fs/xfs/libxfs/xfs_fs.h             |    2 
 fs/xfs/libxfs/xfs_ialloc.c         |    1 
 fs/xfs/libxfs/xfs_ialloc_btree.c   |   90 +++++++++-
 fs/xfs/libxfs/xfs_ialloc_btree.h   |    5 +
 fs/xfs/libxfs/xfs_refcount_btree.c |   73 +++++++-
 fs/xfs/libxfs/xfs_refcount_btree.h |    5 +
 fs/xfs/libxfs/xfs_rmap_btree.c     |  116 ++++++++++---
 fs/xfs/libxfs/xfs_rmap_btree.h     |    5 +
 fs/xfs/libxfs/xfs_trans_resv.c     |   18 ++
 fs/xfs/libxfs/xfs_trans_space.h    |    9 +
 fs/xfs/scrub/agheader.c            |    4 
 fs/xfs/scrub/agheader_repair.c     |    4 
 fs/xfs/scrub/bitmap.c              |   22 +-
 fs/xfs/scrub/bmap.c                |    2 
 fs/xfs/scrub/btree.c               |   77 ++++-----
 fs/xfs/scrub/btree.h               |   17 ++
 fs/xfs/scrub/trace.c               |   11 +
 fs/xfs/scrub/trace.h               |   10 +
 fs/xfs/xfs_mount.c                 |   14 ++
 fs/xfs/xfs_mount.h                 |    5 -
 fs/xfs/xfs_super.c                 |   13 +
 fs/xfs/xfs_trace.h                 |    2 
 33 files changed, 783 insertions(+), 316 deletions(-)


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 01/17] xfs: fix incorrect decoding in xchk_btree_cur_fsbno
  2021-10-14 20:16 [PATCHSET v4 00/17] xfs: support dynamic btree cursor height Darrick J. Wong
@ 2021-10-14 20:17 ` Darrick J. Wong
  2021-10-14 22:48   ` Dave Chinner
  2021-10-15  0:42   ` [PATCH v4.1 " Darrick J. Wong
  2021-10-14 20:17 ` [PATCH 02/17] xfs: remove xfs_btree_cur.bc_blocklog Darrick J. Wong
                   ` (15 subsequent siblings)
  16 siblings, 2 replies; 35+ messages in thread
From: Darrick J. Wong @ 2021-10-14 20:17 UTC (permalink / raw)
  To: djwong, david; +Cc: linux-xfs, chandan.babu, hch

From: Darrick J. Wong <djwong@kernel.org>

During review of subsequent patches, Dave and I noticed that this
function doesn't work quite right -- accessing cur->bc_ino depends on
the ROOT_IN_INODE flag, not LONG_PTRS.  Fix that and the parentheses
isssue.  While we're at it, remove the piece that accesses cur->bc_ag,
because block 0 of an AG is never part of a btree.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/trace.c |    7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)


diff --git a/fs/xfs/scrub/trace.c b/fs/xfs/scrub/trace.c
index c0ef53fe6611..93c13763c15e 100644
--- a/fs/xfs/scrub/trace.c
+++ b/fs/xfs/scrub/trace.c
@@ -24,10 +24,11 @@ xchk_btree_cur_fsbno(
 	if (level < cur->bc_nlevels && cur->bc_bufs[level])
 		return XFS_DADDR_TO_FSB(cur->bc_mp,
 				xfs_buf_daddr(cur->bc_bufs[level]));
-	if (level == cur->bc_nlevels - 1 && cur->bc_flags & XFS_BTREE_LONG_PTRS)
+
+	if (level == cur->bc_nlevels - 1 &&
+	    (cur->bc_flags & XFS_BTREE_ROOT_IN_INODE))
 		return XFS_INO_TO_FSB(cur->bc_mp, cur->bc_ino.ip->i_ino);
-	if (!(cur->bc_flags & XFS_BTREE_LONG_PTRS))
-		return XFS_AGB_TO_FSB(cur->bc_mp, cur->bc_ag.pag->pag_agno, 0);
+
 	return NULLFSBLOCK;
 }
 


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 02/17] xfs: remove xfs_btree_cur.bc_blocklog
  2021-10-14 20:16 [PATCHSET v4 00/17] xfs: support dynamic btree cursor height Darrick J. Wong
  2021-10-14 20:17 ` [PATCH 01/17] xfs: fix incorrect decoding in xchk_btree_cur_fsbno Darrick J. Wong
@ 2021-10-14 20:17 ` Darrick J. Wong
  2021-10-14 20:17 ` [PATCH 03/17] xfs: reduce the size of nr_ops for refcount btree cursors Darrick J. Wong
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 35+ messages in thread
From: Darrick J. Wong @ 2021-10-14 20:17 UTC (permalink / raw)
  To: djwong, david; +Cc: Dave Chinner, linux-xfs, chandan.babu, hch

From: Darrick J. Wong <djwong@kernel.org>

This field isn't used by anyone, so get rid of it.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_alloc_btree.c    |    1 -
 fs/xfs/libxfs/xfs_bmap_btree.c     |    1 -
 fs/xfs/libxfs/xfs_btree.h          |    1 -
 fs/xfs/libxfs/xfs_ialloc_btree.c   |    2 --
 fs/xfs/libxfs/xfs_refcount_btree.c |    1 -
 fs/xfs/libxfs/xfs_rmap_btree.c     |    1 -
 6 files changed, 7 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_alloc_btree.c b/fs/xfs/libxfs/xfs_alloc_btree.c
index 6746fd735550..152ed2a202f4 100644
--- a/fs/xfs/libxfs/xfs_alloc_btree.c
+++ b/fs/xfs/libxfs/xfs_alloc_btree.c
@@ -482,7 +482,6 @@ xfs_allocbt_init_common(
 	cur->bc_tp = tp;
 	cur->bc_mp = mp;
 	cur->bc_btnum = btnum;
-	cur->bc_blocklog = mp->m_sb.sb_blocklog;
 	cur->bc_ag.abt.active = false;
 
 	if (btnum == XFS_BTNUM_CNT) {
diff --git a/fs/xfs/libxfs/xfs_bmap_btree.c b/fs/xfs/libxfs/xfs_bmap_btree.c
index 72444b8b38a6..a43dea8d6a65 100644
--- a/fs/xfs/libxfs/xfs_bmap_btree.c
+++ b/fs/xfs/libxfs/xfs_bmap_btree.c
@@ -558,7 +558,6 @@ xfs_bmbt_init_cursor(
 	cur->bc_mp = mp;
 	cur->bc_nlevels = be16_to_cpu(ifp->if_broot->bb_level) + 1;
 	cur->bc_btnum = XFS_BTNUM_BMAP;
-	cur->bc_blocklog = mp->m_sb.sb_blocklog;
 	cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_bmbt_2);
 
 	cur->bc_ops = &xfs_bmbt_ops;
diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index 513ade4a89f8..49ecc496238f 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -229,7 +229,6 @@ struct xfs_btree_cur
 #define	XFS_BTCUR_LEFTRA	1	/* left sibling has been read-ahead */
 #define	XFS_BTCUR_RIGHTRA	2	/* right sibling has been read-ahead */
 	uint8_t		bc_nlevels;	/* number of levels in the tree */
-	uint8_t		bc_blocklog;	/* log2(blocksize) of btree blocks */
 	xfs_btnum_t	bc_btnum;	/* identifies which btree type */
 	int		bc_statoff;	/* offset of btre stats array */
 
diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
index 27190840c5d8..10736b89b679 100644
--- a/fs/xfs/libxfs/xfs_ialloc_btree.c
+++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
@@ -444,8 +444,6 @@ xfs_inobt_init_common(
 		cur->bc_ops = &xfs_finobt_ops;
 	}
 
-	cur->bc_blocklog = mp->m_sb.sb_blocklog;
-
 	if (xfs_has_crc(mp))
 		cur->bc_flags |= XFS_BTREE_CRC_BLOCKS;
 
diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c
index 1ef9b99962ab..3ea589f15b14 100644
--- a/fs/xfs/libxfs/xfs_refcount_btree.c
+++ b/fs/xfs/libxfs/xfs_refcount_btree.c
@@ -326,7 +326,6 @@ xfs_refcountbt_init_common(
 	cur->bc_tp = tp;
 	cur->bc_mp = mp;
 	cur->bc_btnum = XFS_BTNUM_REFC;
-	cur->bc_blocklog = mp->m_sb.sb_blocklog;
 	cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_refcbt_2);
 
 	cur->bc_flags |= XFS_BTREE_CRC_BLOCKS;
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
index b7dbbfb3aeed..d65bf3c6f25e 100644
--- a/fs/xfs/libxfs/xfs_rmap_btree.c
+++ b/fs/xfs/libxfs/xfs_rmap_btree.c
@@ -457,7 +457,6 @@ xfs_rmapbt_init_common(
 	/* Overlapping btree; 2 keys per pointer. */
 	cur->bc_btnum = XFS_BTNUM_RMAP;
 	cur->bc_flags = XFS_BTREE_CRC_BLOCKS | XFS_BTREE_OVERLAPPING;
-	cur->bc_blocklog = mp->m_sb.sb_blocklog;
 	cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_rmap_2);
 	cur->bc_ops = &xfs_rmapbt_ops;
 


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 03/17] xfs: reduce the size of nr_ops for refcount btree cursors
  2021-10-14 20:16 [PATCHSET v4 00/17] xfs: support dynamic btree cursor height Darrick J. Wong
  2021-10-14 20:17 ` [PATCH 01/17] xfs: fix incorrect decoding in xchk_btree_cur_fsbno Darrick J. Wong
  2021-10-14 20:17 ` [PATCH 02/17] xfs: remove xfs_btree_cur.bc_blocklog Darrick J. Wong
@ 2021-10-14 20:17 ` Darrick J. Wong
  2021-10-14 20:17 ` [PATCH 04/17] xfs: don't track firstrec/firstkey separately in xchk_btree Darrick J. Wong
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 35+ messages in thread
From: Darrick J. Wong @ 2021-10-14 20:17 UTC (permalink / raw)
  To: djwong, david; +Cc: Dave Chinner, linux-xfs, chandan.babu, hch

From: Darrick J. Wong <djwong@kernel.org>

We're never going to run more than 4 billion btree operations on a
refcount cursor, so shrink the field to an unsigned int to reduce the
structure size.  Fix whitespace alignment too.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_btree.h |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index 49ecc496238f..1018bcc43d66 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -181,18 +181,18 @@ union xfs_btree_irec {
 
 /* Per-AG btree information. */
 struct xfs_btree_cur_ag {
-	struct xfs_perag	*pag;
+	struct xfs_perag		*pag;
 	union {
 		struct xfs_buf		*agbp;
 		struct xbtree_afakeroot	*afake;	/* for staging cursor */
 	};
 	union {
 		struct {
-			unsigned long nr_ops;	/* # record updates */
-			int	shape_changes;	/* # of extent splits */
+			unsigned int	nr_ops;	/* # record updates */
+			unsigned int	shape_changes;	/* # of extent splits */
 		} refc;
 		struct {
-			bool	active;		/* allocation cursor state */
+			bool		active;	/* allocation cursor state */
 		} abt;
 	};
 };


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 04/17] xfs: don't track firstrec/firstkey separately in xchk_btree
  2021-10-14 20:16 [PATCHSET v4 00/17] xfs: support dynamic btree cursor height Darrick J. Wong
                   ` (2 preceding siblings ...)
  2021-10-14 20:17 ` [PATCH 03/17] xfs: reduce the size of nr_ops for refcount btree cursors Darrick J. Wong
@ 2021-10-14 20:17 ` Darrick J. Wong
  2021-10-14 20:17 ` [PATCH 05/17] xfs: dynamically allocate btree scrub context structure Darrick J. Wong
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 35+ messages in thread
From: Darrick J. Wong @ 2021-10-14 20:17 UTC (permalink / raw)
  To: djwong, david; +Cc: Dave Chinner, linux-xfs, chandan.babu, hch

From: Darrick J. Wong <djwong@kernel.org>

The btree scrubbing code checks that the records (or keys) that it finds
in a btree block are all in order by calling the btree cursor's
->recs_inorder function.  This of course makes no sense for the first
item in the block, so we switch that off with a separate variable in
struct xchk_btree.

Christoph helped me figure out that the variable is unnecessary, since
we just accessed bc_ptrs[level] and can compare that against zero.  Use
that, and save ourselves some memory space.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/scrub/btree.c |   11 +++--------
 fs/xfs/scrub/btree.h |    2 --
 2 files changed, 3 insertions(+), 10 deletions(-)


diff --git a/fs/xfs/scrub/btree.c b/fs/xfs/scrub/btree.c
index 26dcb4691e31..d5e1ca521fc4 100644
--- a/fs/xfs/scrub/btree.c
+++ b/fs/xfs/scrub/btree.c
@@ -141,9 +141,9 @@ xchk_btree_rec(
 	trace_xchk_btree_rec(bs->sc, cur, 0);
 
 	/* If this isn't the first record, are they in order? */
-	if (!bs->firstrec && !cur->bc_ops->recs_inorder(cur, &bs->lastrec, rec))
+	if (cur->bc_ptrs[0] > 1 &&
+	    !cur->bc_ops->recs_inorder(cur, &bs->lastrec, rec))
 		xchk_btree_set_corrupt(bs->sc, cur, 0);
-	bs->firstrec = false;
 	memcpy(&bs->lastrec, rec, cur->bc_ops->rec_len);
 
 	if (cur->bc_nlevels == 1)
@@ -188,10 +188,9 @@ xchk_btree_key(
 	trace_xchk_btree_key(bs->sc, cur, level);
 
 	/* If this isn't the first key, are they in order? */
-	if (!bs->firstkey[level] &&
+	if (cur->bc_ptrs[level] > 1 &&
 	    !cur->bc_ops->keys_inorder(cur, &bs->lastkey[level], key))
 		xchk_btree_set_corrupt(bs->sc, cur, level);
-	bs->firstkey[level] = false;
 	memcpy(&bs->lastkey[level], key, cur->bc_ops->key_len);
 
 	if (level + 1 >= cur->bc_nlevels)
@@ -636,7 +635,6 @@ xchk_btree(
 	struct xfs_buf			*bp;
 	struct check_owner		*co;
 	struct check_owner		*n;
-	int				i;
 	int				error = 0;
 
 	/*
@@ -649,13 +647,10 @@ xchk_btree(
 	bs->cur = cur;
 	bs->scrub_rec = scrub_fn;
 	bs->oinfo = oinfo;
-	bs->firstrec = true;
 	bs->private = private;
 	bs->sc = sc;
 
 	/* Initialize scrub state */
-	for (i = 0; i < XFS_BTREE_MAXLEVELS; i++)
-		bs->firstkey[i] = true;
 	INIT_LIST_HEAD(&bs->to_check);
 
 	/* Don't try to check a tree with a height we can't handle. */
diff --git a/fs/xfs/scrub/btree.h b/fs/xfs/scrub/btree.h
index b7d2fc01fbf9..7671108f9f85 100644
--- a/fs/xfs/scrub/btree.h
+++ b/fs/xfs/scrub/btree.h
@@ -39,9 +39,7 @@ struct xchk_btree {
 
 	/* internal scrub state */
 	union xfs_btree_rec		lastrec;
-	bool				firstrec;
 	union xfs_btree_key		lastkey[XFS_BTREE_MAXLEVELS];
-	bool				firstkey[XFS_BTREE_MAXLEVELS];
 	struct list_head		to_check;
 };
 int xchk_btree(struct xfs_scrub *sc, struct xfs_btree_cur *cur,


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 05/17] xfs: dynamically allocate btree scrub context structure
  2021-10-14 20:16 [PATCHSET v4 00/17] xfs: support dynamic btree cursor height Darrick J. Wong
                   ` (3 preceding siblings ...)
  2021-10-14 20:17 ` [PATCH 04/17] xfs: don't track firstrec/firstkey separately in xchk_btree Darrick J. Wong
@ 2021-10-14 20:17 ` Darrick J. Wong
  2021-10-14 22:48   ` Dave Chinner
  2021-10-14 20:17 ` [PATCH 06/17] xfs: prepare xfs_btree_cur for dynamic cursor heights Darrick J. Wong
                   ` (11 subsequent siblings)
  16 siblings, 1 reply; 35+ messages in thread
From: Darrick J. Wong @ 2021-10-14 20:17 UTC (permalink / raw)
  To: djwong, david; +Cc: linux-xfs, chandan.babu, hch

From: Darrick J. Wong <djwong@kernel.org>

Reorganize struct xchk_btree so that we can dynamically size the context
structure to fit the type of btree cursor that we have.  This will
enable us to use memory more efficiently once we start adding very tall
btree types.  Right-size the lastkey array to match the number of *node*
levels in the tree so that we stop wasting space.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/btree.c |   23 ++++++++++++-----------
 fs/xfs/scrub/btree.h |   15 ++++++++++++++-
 2 files changed, 26 insertions(+), 12 deletions(-)


diff --git a/fs/xfs/scrub/btree.c b/fs/xfs/scrub/btree.c
index d5e1ca521fc4..6d4eba85ef77 100644
--- a/fs/xfs/scrub/btree.c
+++ b/fs/xfs/scrub/btree.c
@@ -189,9 +189,9 @@ xchk_btree_key(
 
 	/* If this isn't the first key, are they in order? */
 	if (cur->bc_ptrs[level] > 1 &&
-	    !cur->bc_ops->keys_inorder(cur, &bs->lastkey[level], key))
+	    !cur->bc_ops->keys_inorder(cur, &bs->lastkey[level - 1], key))
 		xchk_btree_set_corrupt(bs->sc, cur, level);
-	memcpy(&bs->lastkey[level], key, cur->bc_ops->key_len);
+	memcpy(&bs->lastkey[level - 1], key, cur->bc_ops->key_len);
 
 	if (level + 1 >= cur->bc_nlevels)
 		return;
@@ -631,17 +631,24 @@ xchk_btree(
 	union xfs_btree_ptr		*pp;
 	union xfs_btree_rec		*recp;
 	struct xfs_btree_block		*block;
-	int				level;
 	struct xfs_buf			*bp;
 	struct check_owner		*co;
 	struct check_owner		*n;
+	size_t				cur_sz;
+	int				level;
 	int				error = 0;
 
 	/*
 	 * Allocate the btree scrub context from the heap, because this
-	 * structure can get rather large.
+	 * structure can get rather large.  Don't let a caller feed us a
+	 * totally absurd size.
 	 */
-	bs = kmem_zalloc(sizeof(struct xchk_btree), KM_NOFS | KM_MAYFAIL);
+	cur_sz = xchk_btree_sizeof(cur->bc_nlevels);
+	if (cur_sz > PAGE_SIZE) {
+		xchk_btree_set_corrupt(sc, cur, 0);
+		return 0;
+	}
+	bs = kmem_zalloc(cur_sz, KM_NOFS | KM_MAYFAIL);
 	if (!bs)
 		return -ENOMEM;
 	bs->cur = cur;
@@ -653,12 +660,6 @@ xchk_btree(
 	/* Initialize scrub state */
 	INIT_LIST_HEAD(&bs->to_check);
 
-	/* Don't try to check a tree with a height we can't handle. */
-	if (cur->bc_nlevels > XFS_BTREE_MAXLEVELS) {
-		xchk_btree_set_corrupt(sc, cur, 0);
-		goto out;
-	}
-
 	/*
 	 * Load the root of the btree.  The helper function absorbs
 	 * error codes for us.
diff --git a/fs/xfs/scrub/btree.h b/fs/xfs/scrub/btree.h
index 7671108f9f85..da61a53a0b61 100644
--- a/fs/xfs/scrub/btree.h
+++ b/fs/xfs/scrub/btree.h
@@ -39,9 +39,22 @@ struct xchk_btree {
 
 	/* internal scrub state */
 	union xfs_btree_rec		lastrec;
-	union xfs_btree_key		lastkey[XFS_BTREE_MAXLEVELS];
 	struct list_head		to_check;
+
+	/* this element must come last! */
+	union xfs_btree_key		lastkey[];
 };
+
+/*
+ * Calculate the size of a xchk_btree structure.  There are nlevels-1 slots for
+ * keys because we track leaf records separately in lastrec.
+ */
+static inline size_t
+xchk_btree_sizeof(unsigned int nlevels)
+{
+	return struct_size((struct xchk_btree *)NULL, lastkey, nlevels - 1);
+}
+
 int xchk_btree(struct xfs_scrub *sc, struct xfs_btree_cur *cur,
 		xchk_btree_rec_fn scrub_fn, const struct xfs_owner_info *oinfo,
 		void *private);


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 06/17] xfs: prepare xfs_btree_cur for dynamic cursor heights
  2021-10-14 20:16 [PATCHSET v4 00/17] xfs: support dynamic btree cursor height Darrick J. Wong
                   ` (4 preceding siblings ...)
  2021-10-14 20:17 ` [PATCH 05/17] xfs: dynamically allocate btree scrub context structure Darrick J. Wong
@ 2021-10-14 20:17 ` Darrick J. Wong
  2021-10-14 22:49   ` Dave Chinner
  2021-10-14 20:17 ` [PATCH 07/17] xfs: rearrange xfs_btree_cur fields for better packing Darrick J. Wong
                   ` (10 subsequent siblings)
  16 siblings, 1 reply; 35+ messages in thread
From: Darrick J. Wong @ 2021-10-14 20:17 UTC (permalink / raw)
  To: djwong, david
  Cc: Chandan Babu R, Christoph Hellwig, linux-xfs, chandan.babu, hch

From: Darrick J. Wong <djwong@kernel.org>

Split out the btree level information into a separate struct and put it
at the end of the cursor structure as a VLA.  Files with huge data forks
(and in the future, the realtime rmap btree) will require the ability to
support many more levels than a per-AG btree cursor, which means that
we're going to create per-btree type cursor caches to conserve memory
for the more common case.

Note that a subsequent patch actually introduces dynamic cursor heights.
This one merely rearranges the structure to prepare for that.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/libxfs/xfs_alloc.c |    6 +-
 fs/xfs/libxfs/xfs_bmap.c  |   10 +--
 fs/xfs/libxfs/xfs_btree.c |  168 +++++++++++++++++++++++----------------------
 fs/xfs/libxfs/xfs_btree.h |   33 +++++++--
 fs/xfs/scrub/bitmap.c     |   22 +++---
 fs/xfs/scrub/bmap.c       |    2 -
 fs/xfs/scrub/btree.c      |   47 +++++++------
 fs/xfs/scrub/trace.c      |    4 +
 fs/xfs/scrub/trace.h      |   10 +--
 fs/xfs/xfs_super.c        |    2 -
 fs/xfs/xfs_trace.h        |    2 -
 11 files changed, 167 insertions(+), 139 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 35fb1dd3be95..55c5adc9b54e 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -488,8 +488,8 @@ xfs_alloc_fixup_trees(
 		struct xfs_btree_block	*bnoblock;
 		struct xfs_btree_block	*cntblock;
 
-		bnoblock = XFS_BUF_TO_BLOCK(bno_cur->bc_bufs[0]);
-		cntblock = XFS_BUF_TO_BLOCK(cnt_cur->bc_bufs[0]);
+		bnoblock = XFS_BUF_TO_BLOCK(bno_cur->bc_levels[0].bp);
+		cntblock = XFS_BUF_TO_BLOCK(cnt_cur->bc_levels[0].bp);
 
 		if (XFS_IS_CORRUPT(mp,
 				   bnoblock->bb_numrecs !=
@@ -1512,7 +1512,7 @@ xfs_alloc_ag_vextent_lastblock(
 	 * than minlen.
 	 */
 	if (*len || args->alignment > 1) {
-		acur->cnt->bc_ptrs[0] = 1;
+		acur->cnt->bc_levels[0].ptr = 1;
 		do {
 			error = xfs_alloc_get_rec(acur->cnt, bno, len, &i);
 			if (error)
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 499c977cbf56..644b956301b6 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -240,10 +240,10 @@ xfs_bmap_get_bp(
 		return NULL;
 
 	for (i = 0; i < XFS_BTREE_MAXLEVELS; i++) {
-		if (!cur->bc_bufs[i])
+		if (!cur->bc_levels[i].bp)
 			break;
-		if (xfs_buf_daddr(cur->bc_bufs[i]) == bno)
-			return cur->bc_bufs[i];
+		if (xfs_buf_daddr(cur->bc_levels[i].bp) == bno)
+			return cur->bc_levels[i].bp;
 	}
 
 	/* Chase down all the log items to see if the bp is there */
@@ -629,8 +629,8 @@ xfs_bmap_btree_to_extents(
 	ip->i_nblocks--;
 	xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_BCOUNT, -1L);
 	xfs_trans_binval(tp, cbp);
-	if (cur->bc_bufs[0] == cbp)
-		cur->bc_bufs[0] = NULL;
+	if (cur->bc_levels[0].bp == cbp)
+		cur->bc_levels[0].bp = NULL;
 	xfs_iroot_realloc(ip, -1, whichfork);
 	ASSERT(ifp->if_broot == NULL);
 	ifp->if_format = XFS_DINODE_FMT_EXTENTS;
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index bc4e49f0456a..25dfab81025f 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -367,8 +367,8 @@ xfs_btree_del_cursor(
 	 * way we won't have initialized all the entries down to 0.
 	 */
 	for (i = 0; i < cur->bc_nlevels; i++) {
-		if (cur->bc_bufs[i])
-			xfs_trans_brelse(cur->bc_tp, cur->bc_bufs[i]);
+		if (cur->bc_levels[i].bp)
+			xfs_trans_brelse(cur->bc_tp, cur->bc_levels[i].bp);
 		else if (!error)
 			break;
 	}
@@ -415,9 +415,9 @@ xfs_btree_dup_cursor(
 	 * For each level current, re-get the buffer and copy the ptr value.
 	 */
 	for (i = 0; i < new->bc_nlevels; i++) {
-		new->bc_ptrs[i] = cur->bc_ptrs[i];
-		new->bc_ra[i] = cur->bc_ra[i];
-		bp = cur->bc_bufs[i];
+		new->bc_levels[i].ptr = cur->bc_levels[i].ptr;
+		new->bc_levels[i].ra = cur->bc_levels[i].ra;
+		bp = cur->bc_levels[i].bp;
 		if (bp) {
 			error = xfs_trans_read_buf(mp, tp, mp->m_ddev_targp,
 						   xfs_buf_daddr(bp), mp->m_bsize,
@@ -429,7 +429,7 @@ xfs_btree_dup_cursor(
 				return error;
 			}
 		}
-		new->bc_bufs[i] = bp;
+		new->bc_levels[i].bp = bp;
 	}
 	*ncur = new;
 	return 0;
@@ -681,7 +681,7 @@ xfs_btree_get_block(
 		return xfs_btree_get_iroot(cur);
 	}
 
-	*bpp = cur->bc_bufs[level];
+	*bpp = cur->bc_levels[level].bp;
 	return XFS_BUF_TO_BLOCK(*bpp);
 }
 
@@ -711,7 +711,7 @@ xfs_btree_firstrec(
 	/*
 	 * Set the ptr value to 1, that's the first record/key.
 	 */
-	cur->bc_ptrs[level] = 1;
+	cur->bc_levels[level].ptr = 1;
 	return 1;
 }
 
@@ -741,7 +741,7 @@ xfs_btree_lastrec(
 	/*
 	 * Set the ptr value to numrecs, that's the last record/key.
 	 */
-	cur->bc_ptrs[level] = be16_to_cpu(block->bb_numrecs);
+	cur->bc_levels[level].ptr = be16_to_cpu(block->bb_numrecs);
 	return 1;
 }
 
@@ -922,11 +922,11 @@ xfs_btree_readahead(
 	    (lev == cur->bc_nlevels - 1))
 		return 0;
 
-	if ((cur->bc_ra[lev] | lr) == cur->bc_ra[lev])
+	if ((cur->bc_levels[lev].ra | lr) == cur->bc_levels[lev].ra)
 		return 0;
 
-	cur->bc_ra[lev] |= lr;
-	block = XFS_BUF_TO_BLOCK(cur->bc_bufs[lev]);
+	cur->bc_levels[lev].ra |= lr;
+	block = XFS_BUF_TO_BLOCK(cur->bc_levels[lev].bp);
 
 	if (cur->bc_flags & XFS_BTREE_LONG_PTRS)
 		return xfs_btree_readahead_lblock(cur, lr, block);
@@ -991,22 +991,22 @@ xfs_btree_setbuf(
 {
 	struct xfs_btree_block	*b;	/* btree block */
 
-	if (cur->bc_bufs[lev])
-		xfs_trans_brelse(cur->bc_tp, cur->bc_bufs[lev]);
-	cur->bc_bufs[lev] = bp;
-	cur->bc_ra[lev] = 0;
+	if (cur->bc_levels[lev].bp)
+		xfs_trans_brelse(cur->bc_tp, cur->bc_levels[lev].bp);
+	cur->bc_levels[lev].bp = bp;
+	cur->bc_levels[lev].ra = 0;
 
 	b = XFS_BUF_TO_BLOCK(bp);
 	if (cur->bc_flags & XFS_BTREE_LONG_PTRS) {
 		if (b->bb_u.l.bb_leftsib == cpu_to_be64(NULLFSBLOCK))
-			cur->bc_ra[lev] |= XFS_BTCUR_LEFTRA;
+			cur->bc_levels[lev].ra |= XFS_BTCUR_LEFTRA;
 		if (b->bb_u.l.bb_rightsib == cpu_to_be64(NULLFSBLOCK))
-			cur->bc_ra[lev] |= XFS_BTCUR_RIGHTRA;
+			cur->bc_levels[lev].ra |= XFS_BTCUR_RIGHTRA;
 	} else {
 		if (b->bb_u.s.bb_leftsib == cpu_to_be32(NULLAGBLOCK))
-			cur->bc_ra[lev] |= XFS_BTCUR_LEFTRA;
+			cur->bc_levels[lev].ra |= XFS_BTCUR_LEFTRA;
 		if (b->bb_u.s.bb_rightsib == cpu_to_be32(NULLAGBLOCK))
-			cur->bc_ra[lev] |= XFS_BTCUR_RIGHTRA;
+			cur->bc_levels[lev].ra |= XFS_BTCUR_RIGHTRA;
 	}
 }
 
@@ -1548,7 +1548,7 @@ xfs_btree_increment(
 #endif
 
 	/* We're done if we remain in the block after the increment. */
-	if (++cur->bc_ptrs[level] <= xfs_btree_get_numrecs(block))
+	if (++cur->bc_levels[level].ptr <= xfs_btree_get_numrecs(block))
 		goto out1;
 
 	/* Fail if we just went off the right edge of the tree. */
@@ -1571,7 +1571,7 @@ xfs_btree_increment(
 			goto error0;
 #endif
 
-		if (++cur->bc_ptrs[lev] <= xfs_btree_get_numrecs(block))
+		if (++cur->bc_levels[lev].ptr <= xfs_btree_get_numrecs(block))
 			break;
 
 		/* Read-ahead the right block for the next loop. */
@@ -1598,14 +1598,14 @@ xfs_btree_increment(
 	for (block = xfs_btree_get_block(cur, lev, &bp); lev > level; ) {
 		union xfs_btree_ptr	*ptrp;
 
-		ptrp = xfs_btree_ptr_addr(cur, cur->bc_ptrs[lev], block);
+		ptrp = xfs_btree_ptr_addr(cur, cur->bc_levels[lev].ptr, block);
 		--lev;
 		error = xfs_btree_read_buf_block(cur, ptrp, 0, &block, &bp);
 		if (error)
 			goto error0;
 
 		xfs_btree_setbuf(cur, lev, bp);
-		cur->bc_ptrs[lev] = 1;
+		cur->bc_levels[lev].ptr = 1;
 	}
 out1:
 	*stat = 1;
@@ -1641,7 +1641,7 @@ xfs_btree_decrement(
 	xfs_btree_readahead(cur, level, XFS_BTCUR_LEFTRA);
 
 	/* We're done if we remain in the block after the decrement. */
-	if (--cur->bc_ptrs[level] > 0)
+	if (--cur->bc_levels[level].ptr > 0)
 		goto out1;
 
 	/* Get a pointer to the btree block. */
@@ -1665,7 +1665,7 @@ xfs_btree_decrement(
 	 * Stop when we don't go off the left edge of a block.
 	 */
 	for (lev = level + 1; lev < cur->bc_nlevels; lev++) {
-		if (--cur->bc_ptrs[lev] > 0)
+		if (--cur->bc_levels[lev].ptr > 0)
 			break;
 		/* Read-ahead the left block for the next loop. */
 		xfs_btree_readahead(cur, lev, XFS_BTCUR_LEFTRA);
@@ -1691,13 +1691,13 @@ xfs_btree_decrement(
 	for (block = xfs_btree_get_block(cur, lev, &bp); lev > level; ) {
 		union xfs_btree_ptr	*ptrp;
 
-		ptrp = xfs_btree_ptr_addr(cur, cur->bc_ptrs[lev], block);
+		ptrp = xfs_btree_ptr_addr(cur, cur->bc_levels[lev].ptr, block);
 		--lev;
 		error = xfs_btree_read_buf_block(cur, ptrp, 0, &block, &bp);
 		if (error)
 			goto error0;
 		xfs_btree_setbuf(cur, lev, bp);
-		cur->bc_ptrs[lev] = xfs_btree_get_numrecs(block);
+		cur->bc_levels[lev].ptr = xfs_btree_get_numrecs(block);
 	}
 out1:
 	*stat = 1;
@@ -1735,7 +1735,7 @@ xfs_btree_lookup_get_block(
 	 *
 	 * Otherwise throw it away and get a new one.
 	 */
-	bp = cur->bc_bufs[level];
+	bp = cur->bc_levels[level].bp;
 	error = xfs_btree_ptr_to_daddr(cur, pp, &daddr);
 	if (error)
 		return error;
@@ -1864,7 +1864,7 @@ xfs_btree_lookup(
 					return -EFSCORRUPTED;
 				}
 
-				cur->bc_ptrs[0] = dir != XFS_LOOKUP_LE;
+				cur->bc_levels[0].ptr = dir != XFS_LOOKUP_LE;
 				*stat = 0;
 				return 0;
 			}
@@ -1916,7 +1916,7 @@ xfs_btree_lookup(
 			if (error)
 				goto error0;
 
-			cur->bc_ptrs[level] = keyno;
+			cur->bc_levels[level].ptr = keyno;
 		}
 	}
 
@@ -1933,7 +1933,7 @@ xfs_btree_lookup(
 		    !xfs_btree_ptr_is_null(cur, &ptr)) {
 			int	i;
 
-			cur->bc_ptrs[0] = keyno;
+			cur->bc_levels[0].ptr = keyno;
 			error = xfs_btree_increment(cur, 0, &i);
 			if (error)
 				goto error0;
@@ -1944,7 +1944,7 @@ xfs_btree_lookup(
 		}
 	} else if (dir == XFS_LOOKUP_LE && diff > 0)
 		keyno--;
-	cur->bc_ptrs[0] = keyno;
+	cur->bc_levels[0].ptr = keyno;
 
 	/* Return if we succeeded or not. */
 	if (keyno == 0 || keyno > xfs_btree_get_numrecs(block))
@@ -2104,7 +2104,7 @@ __xfs_btree_updkeys(
 		if (error)
 			return error;
 #endif
-		ptr = cur->bc_ptrs[level];
+		ptr = cur->bc_levels[level].ptr;
 		nlkey = xfs_btree_key_addr(cur, ptr, block);
 		nhkey = xfs_btree_high_key_addr(cur, ptr, block);
 		if (!force_all &&
@@ -2171,7 +2171,7 @@ xfs_btree_update_keys(
 		if (error)
 			return error;
 #endif
-		ptr = cur->bc_ptrs[level];
+		ptr = cur->bc_levels[level].ptr;
 		kp = xfs_btree_key_addr(cur, ptr, block);
 		xfs_btree_copy_keys(cur, kp, &key, 1);
 		xfs_btree_log_keys(cur, bp, ptr, ptr);
@@ -2205,7 +2205,7 @@ xfs_btree_update(
 		goto error0;
 #endif
 	/* Get the address of the rec to be updated. */
-	ptr = cur->bc_ptrs[0];
+	ptr = cur->bc_levels[0].ptr;
 	rp = xfs_btree_rec_addr(cur, ptr, block);
 
 	/* Fill in the new contents and log them. */
@@ -2280,7 +2280,7 @@ xfs_btree_lshift(
 	 * If the cursor entry is the one that would be moved, don't
 	 * do it... it's too complicated.
 	 */
-	if (cur->bc_ptrs[level] <= 1)
+	if (cur->bc_levels[level].ptr <= 1)
 		goto out0;
 
 	/* Set up the left neighbor as "left". */
@@ -2414,7 +2414,7 @@ xfs_btree_lshift(
 		goto error0;
 
 	/* Slide the cursor value left one. */
-	cur->bc_ptrs[level]--;
+	cur->bc_levels[level].ptr--;
 
 	*stat = 1;
 	return 0;
@@ -2476,7 +2476,7 @@ xfs_btree_rshift(
 	 * do it... it's too complicated.
 	 */
 	lrecs = xfs_btree_get_numrecs(left);
-	if (cur->bc_ptrs[level] >= lrecs)
+	if (cur->bc_levels[level].ptr >= lrecs)
 		goto out0;
 
 	/* Set up the right neighbor as "right". */
@@ -2664,7 +2664,7 @@ __xfs_btree_split(
 	 */
 	lrecs = xfs_btree_get_numrecs(left);
 	rrecs = lrecs / 2;
-	if ((lrecs & 1) && cur->bc_ptrs[level] <= rrecs + 1)
+	if ((lrecs & 1) && cur->bc_levels[level].ptr <= rrecs + 1)
 		rrecs++;
 	src_index = (lrecs - rrecs + 1);
 
@@ -2760,9 +2760,9 @@ __xfs_btree_split(
 	 * If it's just pointing past the last entry in left, then we'll
 	 * insert there, so don't change anything in that case.
 	 */
-	if (cur->bc_ptrs[level] > lrecs + 1) {
+	if (cur->bc_levels[level].ptr > lrecs + 1) {
 		xfs_btree_setbuf(cur, level, rbp);
-		cur->bc_ptrs[level] -= lrecs;
+		cur->bc_levels[level].ptr -= lrecs;
 	}
 	/*
 	 * If there are more levels, we'll need another cursor which refers
@@ -2772,7 +2772,7 @@ __xfs_btree_split(
 		error = xfs_btree_dup_cursor(cur, curp);
 		if (error)
 			goto error0;
-		(*curp)->bc_ptrs[level + 1]++;
+		(*curp)->bc_levels[level + 1].ptr++;
 	}
 	*ptrp = rptr;
 	*stat = 1;
@@ -2934,7 +2934,7 @@ xfs_btree_new_iroot(
 	xfs_btree_set_numrecs(block, 1);
 	cur->bc_nlevels++;
 	ASSERT(cur->bc_nlevels <= XFS_BTREE_MAXLEVELS);
-	cur->bc_ptrs[level + 1] = 1;
+	cur->bc_levels[level + 1].ptr = 1;
 
 	kp = xfs_btree_key_addr(cur, 1, block);
 	ckp = xfs_btree_key_addr(cur, 1, cblock);
@@ -3095,7 +3095,7 @@ xfs_btree_new_root(
 
 	/* Fix up the cursor. */
 	xfs_btree_setbuf(cur, cur->bc_nlevels, nbp);
-	cur->bc_ptrs[cur->bc_nlevels] = nptr;
+	cur->bc_levels[cur->bc_nlevels].ptr = nptr;
 	cur->bc_nlevels++;
 	ASSERT(cur->bc_nlevels <= XFS_BTREE_MAXLEVELS);
 	*stat = 1;
@@ -3154,7 +3154,7 @@ xfs_btree_make_block_unfull(
 		return error;
 
 	if (*stat) {
-		*oindex = *index = cur->bc_ptrs[level];
+		*oindex = *index = cur->bc_levels[level].ptr;
 		return 0;
 	}
 
@@ -3169,7 +3169,7 @@ xfs_btree_make_block_unfull(
 		return error;
 
 
-	*index = cur->bc_ptrs[level];
+	*index = cur->bc_levels[level].ptr;
 	return 0;
 }
 
@@ -3216,7 +3216,7 @@ xfs_btree_insrec(
 	}
 
 	/* If we're off the left edge, return failure. */
-	ptr = cur->bc_ptrs[level];
+	ptr = cur->bc_levels[level].ptr;
 	if (ptr == 0) {
 		*stat = 0;
 		return 0;
@@ -3559,7 +3559,7 @@ xfs_btree_kill_iroot(
 	if (error)
 		return error;
 
-	cur->bc_bufs[level - 1] = NULL;
+	cur->bc_levels[level - 1].bp = NULL;
 	be16_add_cpu(&block->bb_level, -1);
 	xfs_trans_log_inode(cur->bc_tp, ip,
 		XFS_ILOG_CORE | xfs_ilog_fbroot(cur->bc_ino.whichfork));
@@ -3592,8 +3592,8 @@ xfs_btree_kill_root(
 	if (error)
 		return error;
 
-	cur->bc_bufs[level] = NULL;
-	cur->bc_ra[level] = 0;
+	cur->bc_levels[level].bp = NULL;
+	cur->bc_levels[level].ra = 0;
 	cur->bc_nlevels--;
 
 	return 0;
@@ -3652,7 +3652,7 @@ xfs_btree_delrec(
 	tcur = NULL;
 
 	/* Get the index of the entry being deleted, check for nothing there. */
-	ptr = cur->bc_ptrs[level];
+	ptr = cur->bc_levels[level].ptr;
 	if (ptr == 0) {
 		*stat = 0;
 		return 0;
@@ -3962,7 +3962,7 @@ xfs_btree_delrec(
 				xfs_btree_del_cursor(tcur, XFS_BTREE_NOERROR);
 				tcur = NULL;
 				if (level == 0)
-					cur->bc_ptrs[0]++;
+					cur->bc_levels[0].ptr++;
 
 				*stat = 1;
 				return 0;
@@ -4099,9 +4099,9 @@ xfs_btree_delrec(
 	 * cursor to the left block, and fix up the index.
 	 */
 	if (bp != lbp) {
-		cur->bc_bufs[level] = lbp;
-		cur->bc_ptrs[level] += lrecs;
-		cur->bc_ra[level] = 0;
+		cur->bc_levels[level].bp = lbp;
+		cur->bc_levels[level].ptr += lrecs;
+		cur->bc_levels[level].ra = 0;
 	}
 	/*
 	 * If we joined with the right neighbor and there's a level above
@@ -4121,16 +4121,16 @@ xfs_btree_delrec(
 	 * We can't use decrement because it would change the next level up.
 	 */
 	if (level > 0)
-		cur->bc_ptrs[level]--;
+		cur->bc_levels[level].ptr--;
 
 	/*
 	 * We combined blocks, so we have to update the parent keys if the
-	 * btree supports overlapped intervals.  However, bc_ptrs[level + 1]
-	 * points to the old block so that the caller knows which record to
-	 * delete.  Therefore, the caller must be savvy enough to call updkeys
-	 * for us if we return stat == 2.  The other exit points from this
-	 * function don't require deletions further up the tree, so they can
-	 * call updkeys directly.
+	 * btree supports overlapped intervals.  However,
+	 * bc_levels[level + 1].ptr points to the old block so that the caller
+	 * knows which record to delete.  Therefore, the caller must be savvy
+	 * enough to call updkeys for us if we return stat == 2.  The other
+	 * exit points from this function don't require deletions further up
+	 * the tree, so they can call updkeys directly.
 	 */
 
 	/* Return value means the next level up has something to do. */
@@ -4184,7 +4184,7 @@ xfs_btree_delete(
 
 	if (i == 0) {
 		for (level = 1; level < cur->bc_nlevels; level++) {
-			if (cur->bc_ptrs[level] == 0) {
+			if (cur->bc_levels[level].ptr == 0) {
 				error = xfs_btree_decrement(cur, level, &i);
 				if (error)
 					goto error0;
@@ -4215,7 +4215,7 @@ xfs_btree_get_rec(
 	int			error;	/* error return value */
 #endif
 
-	ptr = cur->bc_ptrs[0];
+	ptr = cur->bc_levels[0].ptr;
 	block = xfs_btree_get_block(cur, 0, &bp);
 
 #ifdef DEBUG
@@ -4663,23 +4663,25 @@ xfs_btree_overlapped_query_range(
 	if (error)
 		goto out;
 #endif
-	cur->bc_ptrs[level] = 1;
+	cur->bc_levels[level].ptr = 1;
 
 	while (level < cur->bc_nlevels) {
 		block = xfs_btree_get_block(cur, level, &bp);
 
 		/* End of node, pop back towards the root. */
-		if (cur->bc_ptrs[level] > be16_to_cpu(block->bb_numrecs)) {
+		if (cur->bc_levels[level].ptr >
+					be16_to_cpu(block->bb_numrecs)) {
 pop_up:
 			if (level < cur->bc_nlevels - 1)
-				cur->bc_ptrs[level + 1]++;
+				cur->bc_levels[level + 1].ptr++;
 			level++;
 			continue;
 		}
 
 		if (level == 0) {
 			/* Handle a leaf node. */
-			recp = xfs_btree_rec_addr(cur, cur->bc_ptrs[0], block);
+			recp = xfs_btree_rec_addr(cur, cur->bc_levels[0].ptr,
+					block);
 
 			cur->bc_ops->init_high_key_from_rec(&rec_hkey, recp);
 			ldiff = cur->bc_ops->diff_two_keys(cur, &rec_hkey,
@@ -4702,14 +4704,15 @@ xfs_btree_overlapped_query_range(
 				/* Record is larger than high key; pop. */
 				goto pop_up;
 			}
-			cur->bc_ptrs[level]++;
+			cur->bc_levels[level].ptr++;
 			continue;
 		}
 
 		/* Handle an internal node. */
-		lkp = xfs_btree_key_addr(cur, cur->bc_ptrs[level], block);
-		hkp = xfs_btree_high_key_addr(cur, cur->bc_ptrs[level], block);
-		pp = xfs_btree_ptr_addr(cur, cur->bc_ptrs[level], block);
+		lkp = xfs_btree_key_addr(cur, cur->bc_levels[level].ptr, block);
+		hkp = xfs_btree_high_key_addr(cur, cur->bc_levels[level].ptr,
+				block);
+		pp = xfs_btree_ptr_addr(cur, cur->bc_levels[level].ptr, block);
 
 		ldiff = cur->bc_ops->diff_two_keys(cur, hkp, low_key);
 		hdiff = cur->bc_ops->diff_two_keys(cur, high_key, lkp);
@@ -4732,13 +4735,13 @@ xfs_btree_overlapped_query_range(
 			if (error)
 				goto out;
 #endif
-			cur->bc_ptrs[level] = 1;
+			cur->bc_levels[level].ptr = 1;
 			continue;
 		} else if (hdiff < 0) {
 			/* The low key is larger than the upper range; pop. */
 			goto pop_up;
 		}
-		cur->bc_ptrs[level]++;
+		cur->bc_levels[level].ptr++;
 	}
 
 out:
@@ -4749,13 +4752,14 @@ xfs_btree_overlapped_query_range(
 	 * with a zero-results range query, so release the buffers if we
 	 * failed to return any results.
 	 */
-	if (cur->bc_bufs[0] == NULL) {
+	if (cur->bc_levels[0].bp == NULL) {
 		for (i = 0; i < cur->bc_nlevels; i++) {
-			if (cur->bc_bufs[i]) {
-				xfs_trans_brelse(cur->bc_tp, cur->bc_bufs[i]);
-				cur->bc_bufs[i] = NULL;
-				cur->bc_ptrs[i] = 0;
-				cur->bc_ra[i] = 0;
+			if (cur->bc_levels[i].bp) {
+				xfs_trans_brelse(cur->bc_tp,
+						cur->bc_levels[i].bp);
+				cur->bc_levels[i].bp = NULL;
+				cur->bc_levels[i].ptr = 0;
+				cur->bc_levels[i].ra = 0;
 			}
 		}
 	}
@@ -4917,7 +4921,7 @@ xfs_btree_has_more_records(
 	block = xfs_btree_get_block(cur, 0, &bp);
 
 	/* There are still records in this block. */
-	if (cur->bc_ptrs[0] < xfs_btree_get_numrecs(block))
+	if (cur->bc_levels[0].ptr < xfs_btree_get_numrecs(block))
 		return true;
 
 	/* There are more record blocks. */
diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index 1018bcc43d66..0181fc98bc12 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -212,6 +212,19 @@ struct xfs_btree_cur_ino {
 #define	XFS_BTCUR_BMBT_INVALID_OWNER	(1 << 1)
 };
 
+struct xfs_btree_level {
+	/* buffer pointer */
+	struct xfs_buf		*bp;
+
+	/* key/record number */
+	uint16_t		ptr;
+
+	/* readahead info */
+#define XFS_BTCUR_LEFTRA	(1 << 0) /* left sibling has been read-ahead */
+#define XFS_BTCUR_RIGHTRA	(1 << 1) /* right sibling has been read-ahead */
+	uint16_t		ra;
+};
+
 /*
  * Btree cursor structure.
  * This collects all information needed by the btree code in one place.
@@ -223,11 +236,6 @@ struct xfs_btree_cur
 	const struct xfs_btree_ops *bc_ops;
 	uint			bc_flags; /* btree features - below */
 	union xfs_btree_irec	bc_rec;	/* current insert/search record value */
-	struct xfs_buf	*bc_bufs[XFS_BTREE_MAXLEVELS];	/* buf ptr per level */
-	int		bc_ptrs[XFS_BTREE_MAXLEVELS];	/* key/record # */
-	uint8_t		bc_ra[XFS_BTREE_MAXLEVELS];	/* readahead bits */
-#define	XFS_BTCUR_LEFTRA	1	/* left sibling has been read-ahead */
-#define	XFS_BTCUR_RIGHTRA	2	/* right sibling has been read-ahead */
 	uint8_t		bc_nlevels;	/* number of levels in the tree */
 	xfs_btnum_t	bc_btnum;	/* identifies which btree type */
 	int		bc_statoff;	/* offset of btre stats array */
@@ -242,8 +250,22 @@ struct xfs_btree_cur
 		struct xfs_btree_cur_ag	bc_ag;
 		struct xfs_btree_cur_ino bc_ino;
 	};
+
+	/* Must be at the end of the struct! */
+	struct xfs_btree_level	bc_levels[];
 };
 
+/*
+ * Compute the size of a btree cursor that can handle a btree of a given
+ * height.  The bc_levels array handles node and leaf blocks, so its size
+ * is exactly nlevels.
+ */
+static inline size_t
+xfs_btree_cur_sizeof(unsigned int nlevels)
+{
+	return struct_size((struct xfs_btree_cur *)NULL, bc_levels, nlevels);
+}
+
 /* cursor flags */
 #define XFS_BTREE_LONG_PTRS		(1<<0)	/* pointers are 64bits long */
 #define XFS_BTREE_ROOT_IN_INODE		(1<<1)	/* root may be variable size */
@@ -257,7 +279,6 @@ struct xfs_btree_cur
  */
 #define XFS_BTREE_STAGING		(1<<5)
 
-
 #define	XFS_BTREE_NOERROR	0
 #define	XFS_BTREE_ERROR		1
 
diff --git a/fs/xfs/scrub/bitmap.c b/fs/xfs/scrub/bitmap.c
index d6d24c866bc4..b89bf9de9b1c 100644
--- a/fs/xfs/scrub/bitmap.c
+++ b/fs/xfs/scrub/bitmap.c
@@ -222,21 +222,21 @@ xbitmap_disunion(
  * 1  2  3
  *
  * Pretend for this example that each leaf block has 100 btree records.  For
- * the first btree record, we'll observe that bc_ptrs[0] == 1, so we record
- * that we saw block 1.  Then we observe that bc_ptrs[1] == 1, so we record
- * block 4.  The list is [1, 4].
+ * the first btree record, we'll observe that bc_levels[0].ptr == 1, so we
+ * record that we saw block 1.  Then we observe that bc_levels[1].ptr == 1, so
+ * we record block 4.  The list is [1, 4].
  *
- * For the second btree record, we see that bc_ptrs[0] == 2, so we exit the
- * loop.  The list remains [1, 4].
+ * For the second btree record, we see that bc_levels[0].ptr == 2, so we exit
+ * the loop.  The list remains [1, 4].
  *
  * For the 101st btree record, we've moved onto leaf block 2.  Now
- * bc_ptrs[0] == 1 again, so we record that we saw block 2.  We see that
- * bc_ptrs[1] == 2, so we exit the loop.  The list is now [1, 4, 2].
+ * bc_levels[0].ptr == 1 again, so we record that we saw block 2.  We see that
+ * bc_levels[1].ptr == 2, so we exit the loop.  The list is now [1, 4, 2].
  *
- * For the 102nd record, bc_ptrs[0] == 2, so we continue.
+ * For the 102nd record, bc_levels[0].ptr == 2, so we continue.
  *
- * For the 201st record, we've moved on to leaf block 3.  bc_ptrs[0] == 1, so
- * we add 3 to the list.  Now it is [1, 4, 2, 3].
+ * For the 201st record, we've moved on to leaf block 3.
+ * bc_levels[0].ptr == 1, so we add 3 to the list.  Now it is [1, 4, 2, 3].
  *
  * For the 300th record we just exit, with the list being [1, 4, 2, 3].
  */
@@ -256,7 +256,7 @@ xbitmap_set_btcur_path(
 	int			i;
 	int			error;
 
-	for (i = 0; i < cur->bc_nlevels && cur->bc_ptrs[i] == 1; i++) {
+	for (i = 0; i < cur->bc_nlevels && cur->bc_levels[i].ptr == 1; i++) {
 		xfs_btree_get_block(cur, i, &bp);
 		if (!bp)
 			continue;
diff --git a/fs/xfs/scrub/bmap.c b/fs/xfs/scrub/bmap.c
index 017da9ceaee9..a4cbbc346f60 100644
--- a/fs/xfs/scrub/bmap.c
+++ b/fs/xfs/scrub/bmap.c
@@ -402,7 +402,7 @@ xchk_bmapbt_rec(
 	 * the root since the verifiers don't do that.
 	 */
 	if (xfs_has_crc(bs->cur->bc_mp) &&
-	    bs->cur->bc_ptrs[0] == 1) {
+	    bs->cur->bc_levels[0].ptr == 1) {
 		for (i = 0; i < bs->cur->bc_nlevels - 1; i++) {
 			block = xfs_btree_get_block(bs->cur, i, &bp);
 			owner = be64_to_cpu(block->bb_u.l.bb_owner);
diff --git a/fs/xfs/scrub/btree.c b/fs/xfs/scrub/btree.c
index 6d4eba85ef77..39dd46f038fe 100644
--- a/fs/xfs/scrub/btree.c
+++ b/fs/xfs/scrub/btree.c
@@ -136,12 +136,12 @@ xchk_btree_rec(
 	struct xfs_buf		*bp;
 
 	block = xfs_btree_get_block(cur, 0, &bp);
-	rec = xfs_btree_rec_addr(cur, cur->bc_ptrs[0], block);
+	rec = xfs_btree_rec_addr(cur, cur->bc_levels[0].ptr, block);
 
 	trace_xchk_btree_rec(bs->sc, cur, 0);
 
 	/* If this isn't the first record, are they in order? */
-	if (cur->bc_ptrs[0] > 1 &&
+	if (cur->bc_levels[0].ptr > 1 &&
 	    !cur->bc_ops->recs_inorder(cur, &bs->lastrec, rec))
 		xchk_btree_set_corrupt(bs->sc, cur, 0);
 	memcpy(&bs->lastrec, rec, cur->bc_ops->rec_len);
@@ -152,7 +152,7 @@ xchk_btree_rec(
 	/* Is this at least as large as the parent low key? */
 	cur->bc_ops->init_key_from_rec(&key, rec);
 	keyblock = xfs_btree_get_block(cur, 1, &bp);
-	keyp = xfs_btree_key_addr(cur, cur->bc_ptrs[1], keyblock);
+	keyp = xfs_btree_key_addr(cur, cur->bc_levels[1].ptr, keyblock);
 	if (cur->bc_ops->diff_two_keys(cur, &key, keyp) < 0)
 		xchk_btree_set_corrupt(bs->sc, cur, 1);
 
@@ -161,7 +161,7 @@ xchk_btree_rec(
 
 	/* Is this no larger than the parent high key? */
 	cur->bc_ops->init_high_key_from_rec(&hkey, rec);
-	keyp = xfs_btree_high_key_addr(cur, cur->bc_ptrs[1], keyblock);
+	keyp = xfs_btree_high_key_addr(cur, cur->bc_levels[1].ptr, keyblock);
 	if (cur->bc_ops->diff_two_keys(cur, keyp, &hkey) < 0)
 		xchk_btree_set_corrupt(bs->sc, cur, 1);
 }
@@ -183,12 +183,12 @@ xchk_btree_key(
 	struct xfs_buf		*bp;
 
 	block = xfs_btree_get_block(cur, level, &bp);
-	key = xfs_btree_key_addr(cur, cur->bc_ptrs[level], block);
+	key = xfs_btree_key_addr(cur, cur->bc_levels[level].ptr, block);
 
 	trace_xchk_btree_key(bs->sc, cur, level);
 
 	/* If this isn't the first key, are they in order? */
-	if (cur->bc_ptrs[level] > 1 &&
+	if (cur->bc_levels[level].ptr > 1 &&
 	    !cur->bc_ops->keys_inorder(cur, &bs->lastkey[level - 1], key))
 		xchk_btree_set_corrupt(bs->sc, cur, level);
 	memcpy(&bs->lastkey[level - 1], key, cur->bc_ops->key_len);
@@ -198,7 +198,7 @@ xchk_btree_key(
 
 	/* Is this at least as large as the parent low key? */
 	keyblock = xfs_btree_get_block(cur, level + 1, &bp);
-	keyp = xfs_btree_key_addr(cur, cur->bc_ptrs[level + 1], keyblock);
+	keyp = xfs_btree_key_addr(cur, cur->bc_levels[level + 1].ptr, keyblock);
 	if (cur->bc_ops->diff_two_keys(cur, key, keyp) < 0)
 		xchk_btree_set_corrupt(bs->sc, cur, level);
 
@@ -206,8 +206,9 @@ xchk_btree_key(
 		return;
 
 	/* Is this no larger than the parent high key? */
-	key = xfs_btree_high_key_addr(cur, cur->bc_ptrs[level], block);
-	keyp = xfs_btree_high_key_addr(cur, cur->bc_ptrs[level + 1], keyblock);
+	key = xfs_btree_high_key_addr(cur, cur->bc_levels[level].ptr, block);
+	keyp = xfs_btree_high_key_addr(cur, cur->bc_levels[level + 1].ptr,
+			keyblock);
 	if (cur->bc_ops->diff_two_keys(cur, keyp, key) < 0)
 		xchk_btree_set_corrupt(bs->sc, cur, level);
 }
@@ -290,7 +291,7 @@ xchk_btree_block_check_sibling(
 
 	/* Compare upper level pointer to sibling pointer. */
 	pblock = xfs_btree_get_block(ncur, level + 1, &pbp);
-	pp = xfs_btree_ptr_addr(ncur, ncur->bc_ptrs[level + 1], pblock);
+	pp = xfs_btree_ptr_addr(ncur, ncur->bc_levels[level + 1].ptr, pblock);
 	if (!xchk_btree_ptr_ok(bs, level + 1, pp))
 		goto out;
 	if (pbp)
@@ -595,7 +596,7 @@ xchk_btree_block_keys(
 
 	/* Obtain the parent's copy of the keys for this block. */
 	parent_block = xfs_btree_get_block(cur, level + 1, &bp);
-	parent_keys = xfs_btree_key_addr(cur, cur->bc_ptrs[level + 1],
+	parent_keys = xfs_btree_key_addr(cur, cur->bc_levels[level + 1].ptr,
 			parent_block);
 
 	if (cur->bc_ops->diff_two_keys(cur, &block_keys, parent_keys) != 0)
@@ -606,7 +607,7 @@ xchk_btree_block_keys(
 
 	/* Get high keys */
 	high_bk = xfs_btree_high_key_from_key(cur, &block_keys);
-	high_pk = xfs_btree_high_key_addr(cur, cur->bc_ptrs[level + 1],
+	high_pk = xfs_btree_high_key_addr(cur, cur->bc_levels[level + 1].ptr,
 			parent_block);
 
 	if (cur->bc_ops->diff_two_keys(cur, high_bk, high_pk) != 0)
@@ -672,18 +673,18 @@ xchk_btree(
 	if (error || !block)
 		goto out;
 
-	cur->bc_ptrs[level] = 1;
+	cur->bc_levels[level].ptr = 1;
 
 	while (level < cur->bc_nlevels) {
 		block = xfs_btree_get_block(cur, level, &bp);
 
 		if (level == 0) {
 			/* End of leaf, pop back towards the root. */
-			if (cur->bc_ptrs[level] >
+			if (cur->bc_levels[level].ptr >
 			    be16_to_cpu(block->bb_numrecs)) {
 				xchk_btree_block_keys(bs, level, block);
 				if (level < cur->bc_nlevels - 1)
-					cur->bc_ptrs[level + 1]++;
+					cur->bc_levels[level + 1].ptr++;
 				level++;
 				continue;
 			}
@@ -692,7 +693,8 @@ xchk_btree(
 			xchk_btree_rec(bs);
 
 			/* Call out to the record checker. */
-			recp = xfs_btree_rec_addr(cur, cur->bc_ptrs[0], block);
+			recp = xfs_btree_rec_addr(cur, cur->bc_levels[0].ptr,
+					block);
 			error = bs->scrub_rec(bs, recp);
 			if (error)
 				break;
@@ -700,15 +702,16 @@ xchk_btree(
 			    (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT))
 				break;
 
-			cur->bc_ptrs[level]++;
+			cur->bc_levels[level].ptr++;
 			continue;
 		}
 
 		/* End of node, pop back towards the root. */
-		if (cur->bc_ptrs[level] > be16_to_cpu(block->bb_numrecs)) {
+		if (cur->bc_levels[level].ptr >
+					be16_to_cpu(block->bb_numrecs)) {
 			xchk_btree_block_keys(bs, level, block);
 			if (level < cur->bc_nlevels - 1)
-				cur->bc_ptrs[level + 1]++;
+				cur->bc_levels[level + 1].ptr++;
 			level++;
 			continue;
 		}
@@ -717,9 +720,9 @@ xchk_btree(
 		xchk_btree_key(bs, level);
 
 		/* Drill another level deeper. */
-		pp = xfs_btree_ptr_addr(cur, cur->bc_ptrs[level], block);
+		pp = xfs_btree_ptr_addr(cur, cur->bc_levels[level].ptr, block);
 		if (!xchk_btree_ptr_ok(bs, level, pp)) {
-			cur->bc_ptrs[level]++;
+			cur->bc_levels[level].ptr++;
 			continue;
 		}
 		level--;
@@ -727,7 +730,7 @@ xchk_btree(
 		if (error || !block)
 			goto out;
 
-		cur->bc_ptrs[level] = 1;
+		cur->bc_levels[level].ptr = 1;
 	}
 
 out:
diff --git a/fs/xfs/scrub/trace.c b/fs/xfs/scrub/trace.c
index 93c13763c15e..b5f94676c37c 100644
--- a/fs/xfs/scrub/trace.c
+++ b/fs/xfs/scrub/trace.c
@@ -21,9 +21,9 @@ xchk_btree_cur_fsbno(
 	struct xfs_btree_cur	*cur,
 	int			level)
 {
-	if (level < cur->bc_nlevels && cur->bc_bufs[level])
+	if (level < cur->bc_nlevels && cur->bc_levels[level].bp)
 		return XFS_DADDR_TO_FSB(cur->bc_mp,
-				xfs_buf_daddr(cur->bc_bufs[level]));
+				xfs_buf_daddr(cur->bc_levels[level].bp));
 
 	if (level == cur->bc_nlevels - 1 &&
 	    (cur->bc_flags & XFS_BTREE_ROOT_IN_INODE))
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index a7bbb84f91a7..93ece6df02e3 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -348,7 +348,7 @@ TRACE_EVENT(xchk_btree_op_error,
 		__entry->level = level;
 		__entry->agno = XFS_FSB_TO_AGNO(cur->bc_mp, fsbno);
 		__entry->bno = XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno);
-		__entry->ptr = cur->bc_ptrs[level];
+		__entry->ptr = cur->bc_levels[level].ptr;
 		__entry->error = error;
 		__entry->ret_ip = ret_ip;
 	),
@@ -389,7 +389,7 @@ TRACE_EVENT(xchk_ifork_btree_op_error,
 		__entry->type = sc->sm->sm_type;
 		__entry->btnum = cur->bc_btnum;
 		__entry->level = level;
-		__entry->ptr = cur->bc_ptrs[level];
+		__entry->ptr = cur->bc_levels[level].ptr;
 		__entry->agno = XFS_FSB_TO_AGNO(cur->bc_mp, fsbno);
 		__entry->bno = XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno);
 		__entry->error = error;
@@ -431,7 +431,7 @@ TRACE_EVENT(xchk_btree_error,
 		__entry->level = level;
 		__entry->agno = XFS_FSB_TO_AGNO(cur->bc_mp, fsbno);
 		__entry->bno = XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno);
-		__entry->ptr = cur->bc_ptrs[level];
+		__entry->ptr = cur->bc_levels[level].ptr;
 		__entry->ret_ip = ret_ip;
 	),
 	TP_printk("dev %d:%d type %s btree %s level %d ptr %d agno 0x%x agbno 0x%x ret_ip %pS",
@@ -471,7 +471,7 @@ TRACE_EVENT(xchk_ifork_btree_error,
 		__entry->level = level;
 		__entry->agno = XFS_FSB_TO_AGNO(cur->bc_mp, fsbno);
 		__entry->bno = XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno);
-		__entry->ptr = cur->bc_ptrs[level];
+		__entry->ptr = cur->bc_levels[level].ptr;
 		__entry->ret_ip = ret_ip;
 	),
 	TP_printk("dev %d:%d ino 0x%llx fork %s type %s btree %s level %d ptr %d agno 0x%x agbno 0x%x ret_ip %pS",
@@ -511,7 +511,7 @@ DECLARE_EVENT_CLASS(xchk_sbtree_class,
 		__entry->bno = XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno);
 		__entry->level = level;
 		__entry->nlevels = cur->bc_nlevels;
-		__entry->ptr = cur->bc_ptrs[level];
+		__entry->ptr = cur->bc_levels[level].ptr;
 	),
 	TP_printk("dev %d:%d type %s btree %s agno 0x%x agbno 0x%x level %d nlevels %d ptr %d",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index c4e0cd1c1c8c..30bae0657343 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1966,7 +1966,7 @@ xfs_init_zones(void)
 		goto out_destroy_log_ticket_zone;
 
 	xfs_btree_cur_zone = kmem_cache_create("xfs_btree_cur",
-					       sizeof(struct xfs_btree_cur),
+				xfs_btree_cur_sizeof(XFS_BTREE_MAXLEVELS),
 					       0, 0, NULL);
 	if (!xfs_btree_cur_zone)
 		goto out_destroy_bmap_free_item_zone;
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 1033a95fbf8e..4a8076ef8cb4 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -2476,7 +2476,7 @@ DECLARE_EVENT_CLASS(xfs_btree_cur_class,
 		__entry->btnum = cur->bc_btnum;
 		__entry->level = level;
 		__entry->nlevels = cur->bc_nlevels;
-		__entry->ptr = cur->bc_ptrs[level];
+		__entry->ptr = cur->bc_levels[level].ptr;
 		__entry->daddr = bp ? xfs_buf_daddr(bp) : -1;
 	),
 	TP_printk("dev %d:%d btree %s level %d/%d ptr %d daddr 0x%llx",


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 07/17] xfs: rearrange xfs_btree_cur fields for better packing
  2021-10-14 20:16 [PATCHSET v4 00/17] xfs: support dynamic btree cursor height Darrick J. Wong
                   ` (5 preceding siblings ...)
  2021-10-14 20:17 ` [PATCH 06/17] xfs: prepare xfs_btree_cur for dynamic cursor heights Darrick J. Wong
@ 2021-10-14 20:17 ` Darrick J. Wong
  2021-10-14 20:17 ` [PATCH 08/17] xfs: refactor btree cursor allocation function Darrick J. Wong
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 35+ messages in thread
From: Darrick J. Wong @ 2021-10-14 20:17 UTC (permalink / raw)
  To: djwong, david; +Cc: Dave Chinner, linux-xfs, chandan.babu, hch

From: Darrick J. Wong <djwong@kernel.org>

Reduce the size of the btree cursor structure some more by rearranging
fields to eliminate unused space.  While we're at it, fix the ragged
indentation and a spelling error.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_btree.h |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index 0181fc98bc12..eaffd8223ce6 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -234,11 +234,11 @@ struct xfs_btree_cur
 	struct xfs_trans	*bc_tp;	/* transaction we're in, if any */
 	struct xfs_mount	*bc_mp;	/* file system mount struct */
 	const struct xfs_btree_ops *bc_ops;
-	uint			bc_flags; /* btree features - below */
+	unsigned int		bc_flags; /* btree features - below */
+	xfs_btnum_t		bc_btnum; /* identifies which btree type */
 	union xfs_btree_irec	bc_rec;	/* current insert/search record value */
-	uint8_t		bc_nlevels;	/* number of levels in the tree */
-	xfs_btnum_t	bc_btnum;	/* identifies which btree type */
-	int		bc_statoff;	/* offset of btre stats array */
+	uint8_t			bc_nlevels; /* number of levels in the tree */
+	int			bc_statoff; /* offset of btree stats array */
 
 	/*
 	 * Short btree pointers need an agno to be able to turn the pointers


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 08/17] xfs: refactor btree cursor allocation function
  2021-10-14 20:16 [PATCHSET v4 00/17] xfs: support dynamic btree cursor height Darrick J. Wong
                   ` (6 preceding siblings ...)
  2021-10-14 20:17 ` [PATCH 07/17] xfs: rearrange xfs_btree_cur fields for better packing Darrick J. Wong
@ 2021-10-14 20:17 ` Darrick J. Wong
  2021-10-14 20:17 ` [PATCH 09/17] xfs: encode the max btree height in the cursor Darrick J. Wong
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 35+ messages in thread
From: Darrick J. Wong @ 2021-10-14 20:17 UTC (permalink / raw)
  To: djwong, david
  Cc: Chandan Babu R, Christoph Hellwig, Dave Chinner, linux-xfs,
	chandan.babu, hch

From: Darrick J. Wong <djwong@kernel.org>

Refactor btree allocation to a common helper.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_alloc_btree.c    |    6 +-----
 fs/xfs/libxfs/xfs_bmap_btree.c     |    6 +-----
 fs/xfs/libxfs/xfs_btree.h          |   16 ++++++++++++++++
 fs/xfs/libxfs/xfs_ialloc_btree.c   |    5 +----
 fs/xfs/libxfs/xfs_refcount_btree.c |    5 +----
 fs/xfs/libxfs/xfs_rmap_btree.c     |    5 +----
 6 files changed, 21 insertions(+), 22 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_alloc_btree.c b/fs/xfs/libxfs/xfs_alloc_btree.c
index 152ed2a202f4..c644b11132f6 100644
--- a/fs/xfs/libxfs/xfs_alloc_btree.c
+++ b/fs/xfs/libxfs/xfs_alloc_btree.c
@@ -477,11 +477,7 @@ xfs_allocbt_init_common(
 
 	ASSERT(btnum == XFS_BTNUM_BNO || btnum == XFS_BTNUM_CNT);
 
-	cur = kmem_cache_zalloc(xfs_btree_cur_zone, GFP_NOFS | __GFP_NOFAIL);
-
-	cur->bc_tp = tp;
-	cur->bc_mp = mp;
-	cur->bc_btnum = btnum;
+	cur = xfs_btree_alloc_cursor(mp, tp, btnum);
 	cur->bc_ag.abt.active = false;
 
 	if (btnum == XFS_BTNUM_CNT) {
diff --git a/fs/xfs/libxfs/xfs_bmap_btree.c b/fs/xfs/libxfs/xfs_bmap_btree.c
index a43dea8d6a65..a06987e36db5 100644
--- a/fs/xfs/libxfs/xfs_bmap_btree.c
+++ b/fs/xfs/libxfs/xfs_bmap_btree.c
@@ -552,12 +552,8 @@ xfs_bmbt_init_cursor(
 	struct xfs_btree_cur	*cur;
 	ASSERT(whichfork != XFS_COW_FORK);
 
-	cur = kmem_cache_zalloc(xfs_btree_cur_zone, GFP_NOFS | __GFP_NOFAIL);
-
-	cur->bc_tp = tp;
-	cur->bc_mp = mp;
+	cur = xfs_btree_alloc_cursor(mp, tp, XFS_BTNUM_BMAP);
 	cur->bc_nlevels = be16_to_cpu(ifp->if_broot->bb_level) + 1;
-	cur->bc_btnum = XFS_BTNUM_BMAP;
 	cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_bmbt_2);
 
 	cur->bc_ops = &xfs_bmbt_ops;
diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index eaffd8223ce6..8e78ede87b16 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -578,4 +578,20 @@ void xfs_btree_copy_keys(struct xfs_btree_cur *cur,
 		union xfs_btree_key *dst_key,
 		const union xfs_btree_key *src_key, int numkeys);
 
+static inline struct xfs_btree_cur *
+xfs_btree_alloc_cursor(
+	struct xfs_mount	*mp,
+	struct xfs_trans	*tp,
+	xfs_btnum_t		btnum)
+{
+	struct xfs_btree_cur	*cur;
+
+	cur = kmem_cache_zalloc(xfs_btree_cur_zone, GFP_NOFS | __GFP_NOFAIL);
+	cur->bc_tp = tp;
+	cur->bc_mp = mp;
+	cur->bc_btnum = btnum;
+
+	return cur;
+}
+
 #endif	/* __XFS_BTREE_H__ */
diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
index 10736b89b679..c8fea6a464d5 100644
--- a/fs/xfs/libxfs/xfs_ialloc_btree.c
+++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
@@ -432,10 +432,7 @@ xfs_inobt_init_common(
 {
 	struct xfs_btree_cur	*cur;
 
-	cur = kmem_cache_zalloc(xfs_btree_cur_zone, GFP_NOFS | __GFP_NOFAIL);
-	cur->bc_tp = tp;
-	cur->bc_mp = mp;
-	cur->bc_btnum = btnum;
+	cur = xfs_btree_alloc_cursor(mp, tp, btnum);
 	if (btnum == XFS_BTNUM_INO) {
 		cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_ibt_2);
 		cur->bc_ops = &xfs_inobt_ops;
diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c
index 3ea589f15b14..48c45e31d897 100644
--- a/fs/xfs/libxfs/xfs_refcount_btree.c
+++ b/fs/xfs/libxfs/xfs_refcount_btree.c
@@ -322,10 +322,7 @@ xfs_refcountbt_init_common(
 
 	ASSERT(pag->pag_agno < mp->m_sb.sb_agcount);
 
-	cur = kmem_cache_zalloc(xfs_btree_cur_zone, GFP_NOFS | __GFP_NOFAIL);
-	cur->bc_tp = tp;
-	cur->bc_mp = mp;
-	cur->bc_btnum = XFS_BTNUM_REFC;
+	cur = xfs_btree_alloc_cursor(mp, tp, XFS_BTNUM_REFC);
 	cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_refcbt_2);
 
 	cur->bc_flags |= XFS_BTREE_CRC_BLOCKS;
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
index d65bf3c6f25e..f3c4d0965cc9 100644
--- a/fs/xfs/libxfs/xfs_rmap_btree.c
+++ b/fs/xfs/libxfs/xfs_rmap_btree.c
@@ -451,11 +451,8 @@ xfs_rmapbt_init_common(
 {
 	struct xfs_btree_cur	*cur;
 
-	cur = kmem_cache_zalloc(xfs_btree_cur_zone, GFP_NOFS | __GFP_NOFAIL);
-	cur->bc_tp = tp;
-	cur->bc_mp = mp;
 	/* Overlapping btree; 2 keys per pointer. */
-	cur->bc_btnum = XFS_BTNUM_RMAP;
+	cur = xfs_btree_alloc_cursor(mp, tp, XFS_BTNUM_RMAP);
 	cur->bc_flags = XFS_BTREE_CRC_BLOCKS | XFS_BTREE_OVERLAPPING;
 	cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_rmap_2);
 	cur->bc_ops = &xfs_rmapbt_ops;


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 09/17] xfs: encode the max btree height in the cursor
  2021-10-14 20:16 [PATCHSET v4 00/17] xfs: support dynamic btree cursor height Darrick J. Wong
                   ` (7 preceding siblings ...)
  2021-10-14 20:17 ` [PATCH 08/17] xfs: refactor btree cursor allocation function Darrick J. Wong
@ 2021-10-14 20:17 ` Darrick J. Wong
  2021-10-14 20:17 ` [PATCH 10/17] xfs: dynamically allocate cursors based on maxlevels Darrick J. Wong
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 35+ messages in thread
From: Darrick J. Wong @ 2021-10-14 20:17 UTC (permalink / raw)
  To: djwong, david; +Cc: Dave Chinner, linux-xfs, chandan.babu, hch

From: Darrick J. Wong <djwong@kernel.org>

Encode the maximum btree height in the cursor, since we're soon going to
allow smaller cursors for AG btrees and larger cursors for file btrees.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_bmap.c          |    2 +-
 fs/xfs/libxfs/xfs_btree.c         |    4 ++--
 fs/xfs/libxfs/xfs_btree.h         |    2 ++
 fs/xfs/libxfs/xfs_btree_staging.c |   10 +++++-----
 4 files changed, 10 insertions(+), 8 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 644b956301b6..2ae5bf9a74e7 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -239,7 +239,7 @@ xfs_bmap_get_bp(
 	if (!cur)
 		return NULL;
 
-	for (i = 0; i < XFS_BTREE_MAXLEVELS; i++) {
+	for (i = 0; i < cur->bc_maxlevels; i++) {
 		if (!cur->bc_levels[i].bp)
 			break;
 		if (xfs_buf_daddr(cur->bc_levels[i].bp) == bno)
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index 25dfab81025f..6ced8f028d47 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -2933,7 +2933,7 @@ xfs_btree_new_iroot(
 	be16_add_cpu(&block->bb_level, 1);
 	xfs_btree_set_numrecs(block, 1);
 	cur->bc_nlevels++;
-	ASSERT(cur->bc_nlevels <= XFS_BTREE_MAXLEVELS);
+	ASSERT(cur->bc_nlevels <= cur->bc_maxlevels);
 	cur->bc_levels[level + 1].ptr = 1;
 
 	kp = xfs_btree_key_addr(cur, 1, block);
@@ -3097,7 +3097,7 @@ xfs_btree_new_root(
 	xfs_btree_setbuf(cur, cur->bc_nlevels, nbp);
 	cur->bc_levels[cur->bc_nlevels].ptr = nptr;
 	cur->bc_nlevels++;
-	ASSERT(cur->bc_nlevels <= XFS_BTREE_MAXLEVELS);
+	ASSERT(cur->bc_nlevels <= cur->bc_maxlevels);
 	*stat = 1;
 	return 0;
 error0:
diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index 8e78ede87b16..ed0b7d5ab3a3 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -238,6 +238,7 @@ struct xfs_btree_cur
 	xfs_btnum_t		bc_btnum; /* identifies which btree type */
 	union xfs_btree_irec	bc_rec;	/* current insert/search record value */
 	uint8_t			bc_nlevels; /* number of levels in the tree */
+	uint8_t			bc_maxlevels; /* maximum levels for this btree type */
 	int			bc_statoff; /* offset of btree stats array */
 
 	/*
@@ -590,6 +591,7 @@ xfs_btree_alloc_cursor(
 	cur->bc_tp = tp;
 	cur->bc_mp = mp;
 	cur->bc_btnum = btnum;
+	cur->bc_maxlevels = XFS_BTREE_MAXLEVELS;
 
 	return cur;
 }
diff --git a/fs/xfs/libxfs/xfs_btree_staging.c b/fs/xfs/libxfs/xfs_btree_staging.c
index cc56efc2b90a..dd75e208b543 100644
--- a/fs/xfs/libxfs/xfs_btree_staging.c
+++ b/fs/xfs/libxfs/xfs_btree_staging.c
@@ -657,12 +657,12 @@ xfs_btree_bload_compute_geometry(
 	 * checking levels 0 and 1 here, so set bc_nlevels such that the btree
 	 * code doesn't interpret either as the root level.
 	 */
-	cur->bc_nlevels = XFS_BTREE_MAXLEVELS - 1;
+	cur->bc_nlevels = cur->bc_maxlevels - 1;
 	xfs_btree_bload_ensure_slack(cur, &bbl->leaf_slack, 0);
 	xfs_btree_bload_ensure_slack(cur, &bbl->node_slack, 1);
 
 	bbl->nr_records = nr_this_level = nr_records;
-	for (cur->bc_nlevels = 1; cur->bc_nlevels <= XFS_BTREE_MAXLEVELS;) {
+	for (cur->bc_nlevels = 1; cur->bc_nlevels <= cur->bc_maxlevels;) {
 		uint64_t	level_blocks;
 		uint64_t	dontcare64;
 		unsigned int	level = cur->bc_nlevels - 1;
@@ -703,7 +703,7 @@ xfs_btree_bload_compute_geometry(
 			 * block-based btree level.
 			 */
 			cur->bc_nlevels++;
-			ASSERT(cur->bc_nlevels <= XFS_BTREE_MAXLEVELS);
+			ASSERT(cur->bc_nlevels <= cur->bc_maxlevels);
 			xfs_btree_bload_level_geometry(cur, bbl, level,
 					nr_this_level, &avg_per_block,
 					&level_blocks, &dontcare64);
@@ -719,14 +719,14 @@ xfs_btree_bload_compute_geometry(
 
 			/* Otherwise, we need another level of btree. */
 			cur->bc_nlevels++;
-			ASSERT(cur->bc_nlevels <= XFS_BTREE_MAXLEVELS);
+			ASSERT(cur->bc_nlevels <= cur->bc_maxlevels);
 		}
 
 		nr_blocks += level_blocks;
 		nr_this_level = level_blocks;
 	}
 
-	if (cur->bc_nlevels > XFS_BTREE_MAXLEVELS)
+	if (cur->bc_nlevels > cur->bc_maxlevels)
 		return -EOVERFLOW;
 
 	bbl->btree_height = cur->bc_nlevels;


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 10/17] xfs: dynamically allocate cursors based on maxlevels
  2021-10-14 20:16 [PATCHSET v4 00/17] xfs: support dynamic btree cursor height Darrick J. Wong
                   ` (8 preceding siblings ...)
  2021-10-14 20:17 ` [PATCH 09/17] xfs: encode the max btree height in the cursor Darrick J. Wong
@ 2021-10-14 20:17 ` Darrick J. Wong
  2021-10-14 20:17 ` [PATCH 11/17] xfs: rename m_ag_maxlevels to m_allocbt_maxlevels Darrick J. Wong
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 35+ messages in thread
From: Darrick J. Wong @ 2021-10-14 20:17 UTC (permalink / raw)
  To: djwong, david; +Cc: Dave Chinner, linux-xfs, chandan.babu, hch

From: Darrick J. Wong <djwong@kernel.org>

To support future btree code, we need to be able to size btree cursors
dynamically for very large btrees.  Switch the maxlevels computation to
use the precomputed values in the superblock, and create cursors that
can handle a certain height.  For now, we retain the btree cursor cache
that can handle up to 9-level btrees, though a subsequent patch
introduces separate caches for each btree type, where each cache's
objects will be exactly tall enough to handle the specific btree type.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_alloc_btree.c    |    2 +-
 fs/xfs/libxfs/xfs_bmap_btree.c     |    3 ++-
 fs/xfs/libxfs/xfs_btree.h          |   13 +++++++++++--
 fs/xfs/libxfs/xfs_ialloc_btree.c   |    3 ++-
 fs/xfs/libxfs/xfs_refcount_btree.c |    3 ++-
 fs/xfs/libxfs/xfs_rmap_btree.c     |    3 ++-
 fs/xfs/xfs_super.c                 |    4 ++--
 7 files changed, 22 insertions(+), 9 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_alloc_btree.c b/fs/xfs/libxfs/xfs_alloc_btree.c
index c644b11132f6..f14bad21503f 100644
--- a/fs/xfs/libxfs/xfs_alloc_btree.c
+++ b/fs/xfs/libxfs/xfs_alloc_btree.c
@@ -477,7 +477,7 @@ xfs_allocbt_init_common(
 
 	ASSERT(btnum == XFS_BTNUM_BNO || btnum == XFS_BTNUM_CNT);
 
-	cur = xfs_btree_alloc_cursor(mp, tp, btnum);
+	cur = xfs_btree_alloc_cursor(mp, tp, btnum, mp->m_ag_maxlevels);
 	cur->bc_ag.abt.active = false;
 
 	if (btnum == XFS_BTNUM_CNT) {
diff --git a/fs/xfs/libxfs/xfs_bmap_btree.c b/fs/xfs/libxfs/xfs_bmap_btree.c
index a06987e36db5..b90122de0df0 100644
--- a/fs/xfs/libxfs/xfs_bmap_btree.c
+++ b/fs/xfs/libxfs/xfs_bmap_btree.c
@@ -552,7 +552,8 @@ xfs_bmbt_init_cursor(
 	struct xfs_btree_cur	*cur;
 	ASSERT(whichfork != XFS_COW_FORK);
 
-	cur = xfs_btree_alloc_cursor(mp, tp, XFS_BTNUM_BMAP);
+	cur = xfs_btree_alloc_cursor(mp, tp, XFS_BTNUM_BMAP,
+			mp->m_bm_maxlevels[whichfork]);
 	cur->bc_nlevels = be16_to_cpu(ifp->if_broot->bb_level) + 1;
 	cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_bmbt_2);
 
diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index ed0b7d5ab3a3..b46cd98309fa 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -94,6 +94,12 @@ uint32_t xfs_btree_magic(int crc, xfs_btnum_t btnum);
 
 #define	XFS_BTREE_MAXLEVELS	9	/* max of all btrees */
 
+/*
+ * The btree cursor zone hands out cursors that can handle up to this many
+ * levels.  This is the known maximum for all btree types.
+ */
+#define XFS_BTREE_CUR_CACHE_MAXLEVELS	(9)
+
 struct xfs_btree_ops {
 	/* size of the key and record structures */
 	size_t	key_len;
@@ -583,15 +589,18 @@ static inline struct xfs_btree_cur *
 xfs_btree_alloc_cursor(
 	struct xfs_mount	*mp,
 	struct xfs_trans	*tp,
-	xfs_btnum_t		btnum)
+	xfs_btnum_t		btnum,
+	uint8_t			maxlevels)
 {
 	struct xfs_btree_cur	*cur;
 
+	ASSERT(maxlevels <= XFS_BTREE_CUR_CACHE_MAXLEVELS);
+
 	cur = kmem_cache_zalloc(xfs_btree_cur_zone, GFP_NOFS | __GFP_NOFAIL);
 	cur->bc_tp = tp;
 	cur->bc_mp = mp;
 	cur->bc_btnum = btnum;
-	cur->bc_maxlevels = XFS_BTREE_MAXLEVELS;
+	cur->bc_maxlevels = maxlevels;
 
 	return cur;
 }
diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
index c8fea6a464d5..3a5a24648b87 100644
--- a/fs/xfs/libxfs/xfs_ialloc_btree.c
+++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
@@ -432,7 +432,8 @@ xfs_inobt_init_common(
 {
 	struct xfs_btree_cur	*cur;
 
-	cur = xfs_btree_alloc_cursor(mp, tp, btnum);
+	cur = xfs_btree_alloc_cursor(mp, tp, btnum,
+			M_IGEO(mp)->inobt_maxlevels);
 	if (btnum == XFS_BTNUM_INO) {
 		cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_ibt_2);
 		cur->bc_ops = &xfs_inobt_ops;
diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c
index 48c45e31d897..995b0d86ddc0 100644
--- a/fs/xfs/libxfs/xfs_refcount_btree.c
+++ b/fs/xfs/libxfs/xfs_refcount_btree.c
@@ -322,7 +322,8 @@ xfs_refcountbt_init_common(
 
 	ASSERT(pag->pag_agno < mp->m_sb.sb_agcount);
 
-	cur = xfs_btree_alloc_cursor(mp, tp, XFS_BTNUM_REFC);
+	cur = xfs_btree_alloc_cursor(mp, tp, XFS_BTNUM_REFC,
+			mp->m_refc_maxlevels);
 	cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_refcbt_2);
 
 	cur->bc_flags |= XFS_BTREE_CRC_BLOCKS;
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
index f3c4d0965cc9..1b48b7b3ee30 100644
--- a/fs/xfs/libxfs/xfs_rmap_btree.c
+++ b/fs/xfs/libxfs/xfs_rmap_btree.c
@@ -452,7 +452,8 @@ xfs_rmapbt_init_common(
 	struct xfs_btree_cur	*cur;
 
 	/* Overlapping btree; 2 keys per pointer. */
-	cur = xfs_btree_alloc_cursor(mp, tp, XFS_BTNUM_RMAP);
+	cur = xfs_btree_alloc_cursor(mp, tp, XFS_BTNUM_RMAP,
+			mp->m_rmap_maxlevels);
 	cur->bc_flags = XFS_BTREE_CRC_BLOCKS | XFS_BTREE_OVERLAPPING;
 	cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_rmap_2);
 	cur->bc_ops = &xfs_rmapbt_ops;
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 30bae0657343..2a535a8bc3c0 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1966,8 +1966,8 @@ xfs_init_zones(void)
 		goto out_destroy_log_ticket_zone;
 
 	xfs_btree_cur_zone = kmem_cache_create("xfs_btree_cur",
-				xfs_btree_cur_sizeof(XFS_BTREE_MAXLEVELS),
-					       0, 0, NULL);
+			xfs_btree_cur_sizeof(XFS_BTREE_CUR_CACHE_MAXLEVELS),
+			0, 0, NULL);
 	if (!xfs_btree_cur_zone)
 		goto out_destroy_bmap_free_item_zone;
 


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 11/17] xfs: rename m_ag_maxlevels to m_allocbt_maxlevels
  2021-10-14 20:16 [PATCHSET v4 00/17] xfs: support dynamic btree cursor height Darrick J. Wong
                   ` (9 preceding siblings ...)
  2021-10-14 20:17 ` [PATCH 10/17] xfs: dynamically allocate cursors based on maxlevels Darrick J. Wong
@ 2021-10-14 20:17 ` Darrick J. Wong
  2021-10-14 22:52   ` Dave Chinner
  2021-10-14 20:18 ` [PATCH 12/17] xfs: compute maximum AG btree height for critical reservation calculation Darrick J. Wong
                   ` (5 subsequent siblings)
  16 siblings, 1 reply; 35+ messages in thread
From: Darrick J. Wong @ 2021-10-14 20:17 UTC (permalink / raw)
  To: djwong, david; +Cc: linux-xfs, chandan.babu, hch

From: Darrick J. Wong <djwong@kernel.org>

Years ago when XFS was thought to be much more simple, we introduced
m_ag_maxlevels to specify the maximum btree height of per-AG btrees for
a given filesystem mount.  Then we observed that inode btrees don't
actually have the same height and split that off; and now we have rmap
and refcount btrees with much different geometries and separate
maxlevels variables.

The 'ag' part of the name doesn't make much sense anymore, so rename
this to m_allocbt_maxlevels to reinforce that this is the maximum height
of the *free space* btrees.  This sets us up for the next patch, which
will add a variable to track the maximum height of all AG btrees.

(Also take the opportunity to improve adjacent comments and fix minor
style problems.)

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_alloc.c       |   19 +++++++++++--------
 fs/xfs/libxfs/xfs_alloc.h       |    2 +-
 fs/xfs/libxfs/xfs_alloc_btree.c |    4 ++--
 fs/xfs/libxfs/xfs_trans_resv.c  |    2 +-
 fs/xfs/libxfs/xfs_trans_space.h |    2 +-
 fs/xfs/scrub/agheader.c         |    4 ++--
 fs/xfs/scrub/agheader_repair.c  |    4 ++--
 fs/xfs/xfs_mount.h              |    4 ++--
 8 files changed, 22 insertions(+), 19 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 55c5adc9b54e..2efb4f4539f1 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -2190,13 +2190,13 @@ xfs_free_ag_extent(
  */
 
 /*
- * Compute and fill in value of m_ag_maxlevels.
+ * Compute and fill in value of m_alloc_maxlevels.
  */
 void
 xfs_alloc_compute_maxlevels(
 	xfs_mount_t	*mp)	/* file system mount structure */
 {
-	mp->m_ag_maxlevels = xfs_btree_compute_maxlevels(mp->m_alloc_mnr,
+	mp->m_alloc_maxlevels = xfs_btree_compute_maxlevels(mp->m_alloc_mnr,
 			(mp->m_sb.sb_agblocks + 1) / 2);
 }
 
@@ -2255,14 +2255,14 @@ xfs_alloc_min_freelist(
 	const uint8_t		*levels = pag ? pag->pagf_levels : fake_levels;
 	unsigned int		min_free;
 
-	ASSERT(mp->m_ag_maxlevels > 0);
+	ASSERT(mp->m_alloc_maxlevels > 0);
 
 	/* space needed by-bno freespace btree */
 	min_free = min_t(unsigned int, levels[XFS_BTNUM_BNOi] + 1,
-				       mp->m_ag_maxlevels);
+				       mp->m_alloc_maxlevels);
 	/* space needed by-size freespace btree */
 	min_free += min_t(unsigned int, levels[XFS_BTNUM_CNTi] + 1,
-				       mp->m_ag_maxlevels);
+				       mp->m_alloc_maxlevels);
 	/* space needed reverse mapping used space btree */
 	if (xfs_has_rmapbt(mp))
 		min_free += min_t(unsigned int, levels[XFS_BTNUM_RMAPi] + 1,
@@ -2903,13 +2903,16 @@ xfs_agf_verify(
 
 	if (be32_to_cpu(agf->agf_levels[XFS_BTNUM_BNO]) < 1 ||
 	    be32_to_cpu(agf->agf_levels[XFS_BTNUM_CNT]) < 1 ||
-	    be32_to_cpu(agf->agf_levels[XFS_BTNUM_BNO]) > mp->m_ag_maxlevels ||
-	    be32_to_cpu(agf->agf_levels[XFS_BTNUM_CNT]) > mp->m_ag_maxlevels)
+	    be32_to_cpu(agf->agf_levels[XFS_BTNUM_BNO]) >
+						mp->m_alloc_maxlevels ||
+	    be32_to_cpu(agf->agf_levels[XFS_BTNUM_CNT]) >
+						mp->m_alloc_maxlevels)
 		return __this_address;
 
 	if (xfs_has_rmapbt(mp) &&
 	    (be32_to_cpu(agf->agf_levels[XFS_BTNUM_RMAP]) < 1 ||
-	     be32_to_cpu(agf->agf_levels[XFS_BTNUM_RMAP]) > mp->m_rmap_maxlevels))
+	     be32_to_cpu(agf->agf_levels[XFS_BTNUM_RMAP]) >
+						mp->m_rmap_maxlevels))
 		return __this_address;
 
 	if (xfs_has_rmapbt(mp) &&
diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
index df4aefaf0046..2f3f8c2e0860 100644
--- a/fs/xfs/libxfs/xfs_alloc.h
+++ b/fs/xfs/libxfs/xfs_alloc.h
@@ -98,7 +98,7 @@ unsigned int xfs_alloc_min_freelist(struct xfs_mount *mp,
 		struct xfs_perag *pag);
 
 /*
- * Compute and fill in value of m_ag_maxlevels.
+ * Compute and fill in value of m_alloc_maxlevels.
  */
 void
 xfs_alloc_compute_maxlevels(
diff --git a/fs/xfs/libxfs/xfs_alloc_btree.c b/fs/xfs/libxfs/xfs_alloc_btree.c
index f14bad21503f..c3e262290f6f 100644
--- a/fs/xfs/libxfs/xfs_alloc_btree.c
+++ b/fs/xfs/libxfs/xfs_alloc_btree.c
@@ -316,7 +316,7 @@ xfs_allocbt_verify(
 	if (pag && pag->pagf_init) {
 		if (level >= pag->pagf_levels[btnum])
 			return __this_address;
-	} else if (level >= mp->m_ag_maxlevels)
+	} else if (level >= mp->m_alloc_maxlevels)
 		return __this_address;
 
 	return xfs_btree_sblock_verify(bp, mp->m_alloc_mxr[level != 0]);
@@ -477,7 +477,7 @@ xfs_allocbt_init_common(
 
 	ASSERT(btnum == XFS_BTNUM_BNO || btnum == XFS_BTNUM_CNT);
 
-	cur = xfs_btree_alloc_cursor(mp, tp, btnum, mp->m_ag_maxlevels);
+	cur = xfs_btree_alloc_cursor(mp, tp, btnum, mp->m_alloc_maxlevels);
 	cur->bc_ag.abt.active = false;
 
 	if (btnum == XFS_BTNUM_CNT) {
diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
index 5e300daa2559..c879e7754ee6 100644
--- a/fs/xfs/libxfs/xfs_trans_resv.c
+++ b/fs/xfs/libxfs/xfs_trans_resv.c
@@ -70,7 +70,7 @@ xfs_allocfree_log_count(
 {
 	uint		blocks;
 
-	blocks = num_ops * 2 * (2 * mp->m_ag_maxlevels - 1);
+	blocks = num_ops * 2 * (2 * mp->m_alloc_maxlevels - 1);
 	if (xfs_has_rmapbt(mp))
 		blocks += num_ops * (2 * mp->m_rmap_maxlevels - 1);
 	if (xfs_has_reflink(mp))
diff --git a/fs/xfs/libxfs/xfs_trans_space.h b/fs/xfs/libxfs/xfs_trans_space.h
index 50332be34388..bd04cb836419 100644
--- a/fs/xfs/libxfs/xfs_trans_space.h
+++ b/fs/xfs/libxfs/xfs_trans_space.h
@@ -74,7 +74,7 @@
 #define	XFS_DIOSTRAT_SPACE_RES(mp, v)	\
 	(XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK) + (v))
 #define	XFS_GROWFS_SPACE_RES(mp)	\
-	(2 * (mp)->m_ag_maxlevels)
+	(2 * (mp)->m_alloc_maxlevels)
 #define	XFS_GROWFSRT_SPACE_RES(mp,b)	\
 	((b) + XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK))
 #define	XFS_LINK_SPACE_RES(mp,nl)	\
diff --git a/fs/xfs/scrub/agheader.c b/fs/xfs/scrub/agheader.c
index a2c3af77b6c2..bed798792226 100644
--- a/fs/xfs/scrub/agheader.c
+++ b/fs/xfs/scrub/agheader.c
@@ -555,11 +555,11 @@ xchk_agf(
 		xchk_block_set_corrupt(sc, sc->sa.agf_bp);
 
 	level = be32_to_cpu(agf->agf_levels[XFS_BTNUM_BNO]);
-	if (level <= 0 || level > mp->m_ag_maxlevels)
+	if (level <= 0 || level > mp->m_alloc_maxlevels)
 		xchk_block_set_corrupt(sc, sc->sa.agf_bp);
 
 	level = be32_to_cpu(agf->agf_levels[XFS_BTNUM_CNT]);
-	if (level <= 0 || level > mp->m_ag_maxlevels)
+	if (level <= 0 || level > mp->m_alloc_maxlevels)
 		xchk_block_set_corrupt(sc, sc->sa.agf_bp);
 
 	if (xfs_has_rmapbt(mp)) {
diff --git a/fs/xfs/scrub/agheader_repair.c b/fs/xfs/scrub/agheader_repair.c
index 05c27149b65d..d7bfed52f4cd 100644
--- a/fs/xfs/scrub/agheader_repair.c
+++ b/fs/xfs/scrub/agheader_repair.c
@@ -339,12 +339,12 @@ xrep_agf(
 		[XREP_AGF_BNOBT] = {
 			.rmap_owner = XFS_RMAP_OWN_AG,
 			.buf_ops = &xfs_bnobt_buf_ops,
-			.maxlevels = sc->mp->m_ag_maxlevels,
+			.maxlevels = sc->mp->m_alloc_maxlevels,
 		},
 		[XREP_AGF_CNTBT] = {
 			.rmap_owner = XFS_RMAP_OWN_AG,
 			.buf_ops = &xfs_cntbt_buf_ops,
-			.maxlevels = sc->mp->m_ag_maxlevels,
+			.maxlevels = sc->mp->m_alloc_maxlevels,
 		},
 		[XREP_AGF_RMAPBT] = {
 			.rmap_owner = XFS_RMAP_OWN_AG,
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index e091f3b3fa15..e4b7a8eb0d06 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -128,8 +128,8 @@ typedef struct xfs_mount {
 	uint			m_rmap_mnr[2];	/* min rmap btree records */
 	uint			m_refc_mxr[2];	/* max refc btree records */
 	uint			m_refc_mnr[2];	/* min refc btree records */
-	uint			m_ag_maxlevels;	/* XFS_AG_MAXLEVELS */
-	uint			m_bm_maxlevels[2]; /* XFS_BM_MAXLEVELS */
+	uint			m_alloc_maxlevels; /* max alloc btree levels */
+	uint			m_bm_maxlevels[2]; /* max bmap btree levels */
 	uint			m_rmap_maxlevels; /* max rmap btree levels */
 	uint			m_refc_maxlevels; /* max refcount btree level */
 	xfs_extlen_t		m_ag_prealloc_blocks; /* reserved ag blocks */


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 12/17] xfs: compute maximum AG btree height for critical reservation calculation
  2021-10-14 20:16 [PATCHSET v4 00/17] xfs: support dynamic btree cursor height Darrick J. Wong
                   ` (10 preceding siblings ...)
  2021-10-14 20:17 ` [PATCH 11/17] xfs: rename m_ag_maxlevels to m_allocbt_maxlevels Darrick J. Wong
@ 2021-10-14 20:18 ` Darrick J. Wong
  2021-10-14 22:54   ` Dave Chinner
  2021-10-14 20:18 ` [PATCH 13/17] xfs: clean up xfs_btree_{calc_size,compute_maxlevels} Darrick J. Wong
                   ` (4 subsequent siblings)
  16 siblings, 1 reply; 35+ messages in thread
From: Darrick J. Wong @ 2021-10-14 20:18 UTC (permalink / raw)
  To: djwong, david; +Cc: linux-xfs, chandan.babu, hch

From: Darrick J. Wong <djwong@kernel.org>

Compute the actual maximum AG btree height for deciding if a per-AG
block reservation is critically low.  This only affects the sanity check
condition, since we /generally/ will trigger on the 10% threshold.  This
is a long-winded way of saying that we're removing one more usage of
XFS_BTREE_MAXLEVELS.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_ag_resv.c |    3 ++-
 fs/xfs/xfs_mount.c          |   14 ++++++++++++++
 fs/xfs/xfs_mount.h          |    1 +
 3 files changed, 17 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/libxfs/xfs_ag_resv.c b/fs/xfs/libxfs/xfs_ag_resv.c
index 2aa2b3484c28..fe94058d4e9e 100644
--- a/fs/xfs/libxfs/xfs_ag_resv.c
+++ b/fs/xfs/libxfs/xfs_ag_resv.c
@@ -91,7 +91,8 @@ xfs_ag_resv_critical(
 	trace_xfs_ag_resv_critical(pag, type, avail);
 
 	/* Critically low if less than 10% or max btree height remains. */
-	return XFS_TEST_ERROR(avail < orig / 10 || avail < XFS_BTREE_MAXLEVELS,
+	return XFS_TEST_ERROR(avail < orig / 10 ||
+			      avail < pag->pag_mount->m_agbtree_maxlevels,
 			pag->pag_mount, XFS_ERRTAG_AG_RESV_CRITICAL);
 }
 
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 06dac09eddbd..5be1dd63fac5 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -567,6 +567,18 @@ xfs_mount_setup_inode_geom(
 	xfs_ialloc_setup_geometry(mp);
 }
 
+/* Compute maximum possible height for per-AG btree types for this fs. */
+static inline void
+xfs_agbtree_compute_maxlevels(
+	struct xfs_mount	*mp)
+{
+	unsigned int		ret;
+
+	ret = max(mp->m_alloc_maxlevels, M_IGEO(mp)->inobt_maxlevels);
+	ret = max(ret, mp->m_rmap_maxlevels);
+	mp->m_agbtree_maxlevels = max(ret, mp->m_refc_maxlevels);
+}
+
 /*
  * This function does the following on an initial mount of a file system:
  *	- reads the superblock from disk and init the mount struct
@@ -638,6 +650,8 @@ xfs_mountfs(
 	xfs_rmapbt_compute_maxlevels(mp);
 	xfs_refcountbt_compute_maxlevels(mp);
 
+	xfs_agbtree_compute_maxlevels(mp);
+
 	/*
 	 * Check if sb_agblocks is aligned at stripe boundary.  If sb_agblocks
 	 * is NOT aligned turn off m_dalign since allocator alignment is within
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index e4b7a8eb0d06..00720a02e761 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -132,6 +132,7 @@ typedef struct xfs_mount {
 	uint			m_bm_maxlevels[2]; /* max bmap btree levels */
 	uint			m_rmap_maxlevels; /* max rmap btree levels */
 	uint			m_refc_maxlevels; /* max refcount btree level */
+	unsigned int		m_agbtree_maxlevels; /* max level of all AG btrees */
 	xfs_extlen_t		m_ag_prealloc_blocks; /* reserved ag blocks */
 	uint			m_alloc_set_aside; /* space we can't use */
 	uint			m_ag_max_usable; /* max space per AG */


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 13/17] xfs: clean up xfs_btree_{calc_size,compute_maxlevels}
  2021-10-14 20:16 [PATCHSET v4 00/17] xfs: support dynamic btree cursor height Darrick J. Wong
                   ` (11 preceding siblings ...)
  2021-10-14 20:18 ` [PATCH 12/17] xfs: compute maximum AG btree height for critical reservation calculation Darrick J. Wong
@ 2021-10-14 20:18 ` Darrick J. Wong
  2021-10-14 22:58   ` Dave Chinner
  2021-10-14 20:18 ` [PATCH 14/17] xfs: compute the maximum height of the rmap btree when reflink enabled Darrick J. Wong
                   ` (3 subsequent siblings)
  16 siblings, 1 reply; 35+ messages in thread
From: Darrick J. Wong @ 2021-10-14 20:18 UTC (permalink / raw)
  To: djwong, david; +Cc: linux-xfs, chandan.babu, hch

From: Darrick J. Wong <djwong@kernel.org>

During review of the next patch, Dave remarked that he found these two
btree geometry calculation functions lacking in documentation and that
they performed more work than was really necessary.

These functions take the same parameters and have nearly the same logic;
the only real difference is in the return values.  Reword the function
comment to make it clearer what each function does, and move them to be
adjacent to reinforce their relation.

Clean up both of them to stop opencoding the howmany functions, stop
using the uint typedefs, and make them both support computations for
more than 2^32 leaf records, since we're going to need all of the above
for files with large data forks and large rmap btrees.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_btree.c |   67 ++++++++++++++++++++++-----------------------
 fs/xfs/libxfs/xfs_btree.h |    6 +++-
 2 files changed, 37 insertions(+), 36 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index 6ced8f028d47..4d95a3bb05cd 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -4514,21 +4514,43 @@ xfs_btree_sblock_verify(
 }
 
 /*
- * Calculate the number of btree levels needed to store a given number of
- * records in a short-format btree.
+ * For the given limits on leaf and keyptr records per block, calculate the
+ * height of the tree needed to index the number of leaf records.
  */
-uint
+unsigned int
 xfs_btree_compute_maxlevels(
-	uint			*limits,
-	unsigned long		len)
+	const unsigned int	*limits,
+	unsigned long long	records)
 {
-	uint			level;
-	unsigned long		maxblocks;
+	unsigned long long	level_blocks = howmany_64(records, limits[0]);
+	unsigned int		height = 1;
 
-	maxblocks = (len + limits[0] - 1) / limits[0];
-	for (level = 1; maxblocks > 1; level++)
-		maxblocks = (maxblocks + limits[1] - 1) / limits[1];
-	return level;
+	while (level_blocks > 1) {
+		level_blocks = howmany_64(level_blocks, limits[1]);
+		height++;
+	}
+
+	return height;
+}
+
+/*
+ * For the given limits on leaf and keyptr records per block, calculate the
+ * number of blocks needed to index the given number of leaf records.
+ */
+unsigned long long
+xfs_btree_calc_size(
+	const unsigned int	*limits,
+	unsigned long long	records)
+{
+	unsigned long long	level_blocks = howmany_64(records, limits[0]);
+	unsigned long long	blocks = level_blocks;
+
+	while (level_blocks > 1) {
+		level_blocks = howmany_64(level_blocks, limits[1]);
+		blocks += level_blocks;
+	}
+
+	return blocks;
 }
 
 /*
@@ -4822,29 +4844,6 @@ xfs_btree_query_all(
 	return xfs_btree_simple_query_range(cur, &low_key, &high_key, fn, priv);
 }
 
-/*
- * Calculate the number of blocks needed to store a given number of records
- * in a short-format (per-AG metadata) btree.
- */
-unsigned long long
-xfs_btree_calc_size(
-	uint			*limits,
-	unsigned long long	len)
-{
-	int			level;
-	int			maxrecs;
-	unsigned long long	rval;
-
-	maxrecs = limits[0];
-	for (level = 0, rval = 0; len > 1; level++) {
-		len += maxrecs - 1;
-		do_div(len, maxrecs);
-		maxrecs = limits[1];
-		rval += len;
-	}
-	return rval;
-}
-
 static int
 xfs_btree_count_blocks_helper(
 	struct xfs_btree_cur	*cur,
diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index b46cd98309fa..3bd69fe425a7 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -487,8 +487,10 @@ xfs_failaddr_t xfs_btree_lblock_v5hdr_verify(struct xfs_buf *bp,
 xfs_failaddr_t xfs_btree_lblock_verify(struct xfs_buf *bp,
 		unsigned int max_recs);
 
-uint xfs_btree_compute_maxlevels(uint *limits, unsigned long len);
-unsigned long long xfs_btree_calc_size(uint *limits, unsigned long long len);
+unsigned int xfs_btree_compute_maxlevels(const unsigned int *limits,
+		unsigned long long records);
+unsigned long long xfs_btree_calc_size(const unsigned int *limits,
+		unsigned long long records);
 
 /*
  * Return codes for the query range iterator function are 0 to continue


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 14/17] xfs: compute the maximum height of the rmap btree when reflink enabled
  2021-10-14 20:16 [PATCHSET v4 00/17] xfs: support dynamic btree cursor height Darrick J. Wong
                   ` (12 preceding siblings ...)
  2021-10-14 20:18 ` [PATCH 13/17] xfs: clean up xfs_btree_{calc_size,compute_maxlevels} Darrick J. Wong
@ 2021-10-14 20:18 ` Darrick J. Wong
  2021-10-14 23:03   ` Dave Chinner
  2021-10-14 20:18 ` [PATCH 15/17] xfs: kill XFS_BTREE_MAXLEVELS Darrick J. Wong
                   ` (2 subsequent siblings)
  16 siblings, 1 reply; 35+ messages in thread
From: Darrick J. Wong @ 2021-10-14 20:18 UTC (permalink / raw)
  To: djwong, david; +Cc: Chandan Babu R, linux-xfs, chandan.babu, hch

From: Darrick J. Wong <djwong@kernel.org>

Instead of assuming that the hardcoded XFS_BTREE_MAXLEVELS value is big
enough to handle the maximally tall rmap btree when all blocks are in
use and maximally shared, let's compute the maximum height assuming the
rmapbt consumes as many blocks as possible.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
---
 fs/xfs/libxfs/xfs_btree.c       |   33 +++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_btree.h       |    2 ++
 fs/xfs/libxfs/xfs_rmap_btree.c  |   45 +++++++++++++++++++++++----------------
 fs/xfs/libxfs/xfs_trans_resv.c  |   16 ++++++++++++++
 fs/xfs/libxfs/xfs_trans_space.h |    7 ++++++
 5 files changed, 85 insertions(+), 18 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index 4d95a3bb05cd..43e646f3956c 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -4553,6 +4553,39 @@ xfs_btree_calc_size(
 	return blocks;
 }
 
+/*
+ * Given a number of available blocks for the btree to consume with records and
+ * pointers, calculate the height of the tree needed to index all the records
+ * that space can hold based on the number of pointers each interior node
+ * holds.
+ *
+ * We start by assuming a single level tree consumes a single block, then track
+ * the number of blocks each node level consumes until we no longer have space
+ * to store the next node level. At this point, we are indexing all the leaf
+ * blocks in the space, and there's no more free space to split the tree any
+ * further. That's our maximum btree height.
+ */
+unsigned int
+xfs_btree_space_to_height(
+	const unsigned int	*limits,
+	unsigned long long	leaf_blocks)
+{
+	unsigned long long	node_blocks = limits[1];
+	unsigned long long	blocks_left = leaf_blocks - 1;
+	unsigned int		height = 1;
+
+	if (leaf_blocks < 1)
+		return 0;
+
+	while (node_blocks < blocks_left) {
+		blocks_left -= node_blocks;
+		node_blocks *= limits[1];
+		height++;
+	}
+
+	return height;
+}
+
 /*
  * Query a regular btree for all records overlapping a given interval.
  * Start with a LE lookup of the key of low_rec and return all records
diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index 3bd69fe425a7..e488bfcc1fc0 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -491,6 +491,8 @@ unsigned int xfs_btree_compute_maxlevels(const unsigned int *limits,
 		unsigned long long records);
 unsigned long long xfs_btree_calc_size(const unsigned int *limits,
 		unsigned long long records);
+unsigned int xfs_btree_space_to_height(const unsigned int *limits,
+		unsigned long long blocks);
 
 /*
  * Return codes for the query range iterator function are 0 to continue
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
index 1b48b7b3ee30..7388201e48d2 100644
--- a/fs/xfs/libxfs/xfs_rmap_btree.c
+++ b/fs/xfs/libxfs/xfs_rmap_btree.c
@@ -540,26 +540,35 @@ void
 xfs_rmapbt_compute_maxlevels(
 	struct xfs_mount		*mp)
 {
-	/*
-	 * On a non-reflink filesystem, the maximum number of rmap
-	 * records is the number of blocks in the AG, hence the max
-	 * rmapbt height is log_$maxrecs($agblocks).  However, with
-	 * reflink each AG block can have up to 2^32 (per the refcount
-	 * record format) owners, which means that theoretically we
-	 * could face up to 2^64 rmap records.
-	 *
-	 * That effectively means that the max rmapbt height must be
-	 * XFS_BTREE_MAXLEVELS.  "Fortunately" we'll run out of AG
-	 * blocks to feed the rmapbt long before the rmapbt reaches
-	 * maximum height.  The reflink code uses ag_resv_critical to
-	 * disallow reflinking when less than 10% of the per-AG metadata
-	 * block reservation since the fallback is a regular file copy.
-	 */
-	if (xfs_has_reflink(mp))
-		mp->m_rmap_maxlevels = XFS_BTREE_MAXLEVELS;
-	else
+	if (!xfs_has_rmapbt(mp)) {
+		mp->m_rmap_maxlevels = 0;
+		return;
+	}
+
+	if (xfs_has_reflink(mp)) {
+		/*
+		 * Compute the asymptotic maxlevels for an rmap btree on a
+		 * filesystem that supports reflink.
+		 *
+		 * On a reflink filesystem, each AG block can have up to 2^32
+		 * (per the refcount record format) owners, which means that
+		 * theoretically we could face up to 2^64 rmap records.
+		 * However, we're likely to run out of blocks in the AG long
+		 * before that happens, which means that we must compute the
+		 * max height based on what the btree will look like if it
+		 * consumes almost all the blocks in the AG due to maximal
+		 * sharing factor.
+		 */
+		mp->m_rmap_maxlevels = xfs_btree_space_to_height(mp->m_rmap_mnr,
+				mp->m_sb.sb_agblocks);
+	} else {
+		/*
+		 * If there's no block sharing, compute the maximum rmapbt
+		 * height assuming one rmap record per AG block.
+		 */
 		mp->m_rmap_maxlevels = xfs_btree_compute_maxlevels(
 				mp->m_rmap_mnr, mp->m_sb.sb_agblocks);
+	}
 }
 
 /* Calculate the refcount btree size for some records. */
diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
index c879e7754ee6..6f83d9b306ee 100644
--- a/fs/xfs/libxfs/xfs_trans_resv.c
+++ b/fs/xfs/libxfs/xfs_trans_resv.c
@@ -814,6 +814,19 @@ xfs_trans_resv_calc(
 	struct xfs_mount	*mp,
 	struct xfs_trans_resv	*resp)
 {
+	unsigned int		rmap_maxlevels = mp->m_rmap_maxlevels;
+
+	/*
+	 * In the early days of rmap+reflink, we always set the rmap maxlevels
+	 * to 9 even if the AG was small enough that it would never grow to
+	 * that height.  Transaction reservation sizes influence the minimum
+	 * log size calculation, which influences the size of the log that mkfs
+	 * creates.  Use the old value here to ensure that newly formatted
+	 * small filesystems will mount on older kernels.
+	 */
+	if (xfs_has_rmapbt(mp) && xfs_has_reflink(mp))
+		mp->m_rmap_maxlevels = XFS_OLD_REFLINK_RMAP_MAXLEVELS;
+
 	/*
 	 * The following transactions are logged in physical format and
 	 * require a permanent reservation on space.
@@ -916,4 +929,7 @@ xfs_trans_resv_calc(
 	resp->tr_clearagi.tr_logres = xfs_calc_clear_agi_bucket_reservation(mp);
 	resp->tr_growrtzero.tr_logres = xfs_calc_growrtzero_reservation(mp);
 	resp->tr_growrtfree.tr_logres = xfs_calc_growrtfree_reservation(mp);
+
+	/* Put everything back the way it was.  This goes at the end. */
+	mp->m_rmap_maxlevels = rmap_maxlevels;
 }
diff --git a/fs/xfs/libxfs/xfs_trans_space.h b/fs/xfs/libxfs/xfs_trans_space.h
index bd04cb836419..87b31c69a773 100644
--- a/fs/xfs/libxfs/xfs_trans_space.h
+++ b/fs/xfs/libxfs/xfs_trans_space.h
@@ -17,6 +17,13 @@
 /* Adding one rmap could split every level up to the top of the tree. */
 #define XFS_RMAPADD_SPACE_RES(mp) ((mp)->m_rmap_maxlevels)
 
+/*
+ * Note that we historically set m_rmap_maxlevels to 9 when reflink is enabled,
+ * so we must preserve this behavior to avoid changing the transaction space
+ * reservations and minimum log size calculations for existing filesystems.
+ */
+#define XFS_OLD_REFLINK_RMAP_MAXLEVELS		9
+
 /* Blocks we might need to add "b" rmaps to a tree. */
 #define XFS_NRMAPADD_SPACE_RES(mp, b)\
 	(((b + XFS_MAX_CONTIG_RMAPS_PER_BLOCK(mp) - 1) / \


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 15/17] xfs: kill XFS_BTREE_MAXLEVELS
  2021-10-14 20:16 [PATCHSET v4 00/17] xfs: support dynamic btree cursor height Darrick J. Wong
                   ` (13 preceding siblings ...)
  2021-10-14 20:18 ` [PATCH 14/17] xfs: compute the maximum height of the rmap btree when reflink enabled Darrick J. Wong
@ 2021-10-14 20:18 ` Darrick J. Wong
  2021-10-14 20:18 ` [PATCH 16/17] xfs: compute absolute maximum nlevels for each btree type Darrick J. Wong
  2021-10-14 20:18 ` [PATCH 17/17] xfs: use separate btree cursor cache " Darrick J. Wong
  16 siblings, 0 replies; 35+ messages in thread
From: Darrick J. Wong @ 2021-10-14 20:18 UTC (permalink / raw)
  To: djwong, david; +Cc: Chandan Babu R, Dave Chinner, linux-xfs, chandan.babu, hch

From: Darrick J. Wong <djwong@kernel.org>

Nobody uses this symbol anymore, so kill it.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_btree.h |    2 --
 1 file changed, 2 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index e488bfcc1fc0..fdf7090c74f4 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -92,8 +92,6 @@ uint32_t xfs_btree_magic(int crc, xfs_btnum_t btnum);
 #define XFS_BTREE_STATS_ADD(cur, stat, val)	\
 	XFS_STATS_ADD_OFF((cur)->bc_mp, (cur)->bc_statoff + __XBTS_ ## stat, val)
 
-#define	XFS_BTREE_MAXLEVELS	9	/* max of all btrees */
-
 /*
  * The btree cursor zone hands out cursors that can handle up to this many
  * levels.  This is the known maximum for all btree types.


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 16/17] xfs: compute absolute maximum nlevels for each btree type
  2021-10-14 20:16 [PATCHSET v4 00/17] xfs: support dynamic btree cursor height Darrick J. Wong
                   ` (14 preceding siblings ...)
  2021-10-14 20:18 ` [PATCH 15/17] xfs: kill XFS_BTREE_MAXLEVELS Darrick J. Wong
@ 2021-10-14 20:18 ` Darrick J. Wong
  2021-10-14 23:08   ` Dave Chinner
  2021-10-14 20:18 ` [PATCH 17/17] xfs: use separate btree cursor cache " Darrick J. Wong
  16 siblings, 1 reply; 35+ messages in thread
From: Darrick J. Wong @ 2021-10-14 20:18 UTC (permalink / raw)
  To: djwong, david; +Cc: linux-xfs, chandan.babu, hch

From: Darrick J. Wong <djwong@kernel.org>

Add code for all five btree types so that we can compute the absolute
maximum possible btree height for each btree type.  This is a setup for
the next patch, which makes every btree type have its own cursor cache.

The functions are exported so that we can have xfs_db report the
absolute maximum btree heights for each btree type, rather than making
everyone run their own ad-hoc computations.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_alloc.c          |    1 +
 fs/xfs/libxfs/xfs_alloc_btree.c    |   33 ++++++++++++++++++-
 fs/xfs/libxfs/xfs_alloc_btree.h    |    2 +
 fs/xfs/libxfs/xfs_bmap.c           |    1 +
 fs/xfs/libxfs/xfs_bmap_btree.c     |   31 +++++++++++++++++-
 fs/xfs/libxfs/xfs_bmap_btree.h     |    2 +
 fs/xfs/libxfs/xfs_fs.h             |    2 +
 fs/xfs/libxfs/xfs_ialloc.c         |    1 +
 fs/xfs/libxfs/xfs_ialloc_btree.c   |   61 ++++++++++++++++++++++++++++++++++--
 fs/xfs/libxfs/xfs_ialloc_btree.h   |    2 +
 fs/xfs/libxfs/xfs_refcount_btree.c |   45 ++++++++++++++++++++++-----
 fs/xfs/libxfs/xfs_refcount_btree.h |    2 +
 fs/xfs/libxfs/xfs_rmap_btree.c     |   43 +++++++++++++++++++++++--
 fs/xfs/libxfs/xfs_rmap_btree.h     |    2 +
 14 files changed, 207 insertions(+), 21 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 2efb4f4539f1..1a5684af8430 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -2198,6 +2198,7 @@ xfs_alloc_compute_maxlevels(
 {
 	mp->m_alloc_maxlevels = xfs_btree_compute_maxlevels(mp->m_alloc_mnr,
 			(mp->m_sb.sb_agblocks + 1) / 2);
+	ASSERT(mp->m_alloc_maxlevels <= xfs_allocbt_maxlevels_ondisk());
 }
 
 /*
diff --git a/fs/xfs/libxfs/xfs_alloc_btree.c b/fs/xfs/libxfs/xfs_alloc_btree.c
index c3e262290f6f..d0a7aa4b52a8 100644
--- a/fs/xfs/libxfs/xfs_alloc_btree.c
+++ b/fs/xfs/libxfs/xfs_alloc_btree.c
@@ -566,6 +566,17 @@ xfs_allocbt_commit_staged_btree(
 	}
 }
 
+/* Calculate number of records in an alloc btree block. */
+static inline unsigned int
+xfs_allocbt_block_maxrecs(
+	unsigned int		blocklen,
+	bool			leaf)
+{
+	if (leaf)
+		return blocklen / sizeof(xfs_alloc_rec_t);
+	return blocklen / (sizeof(xfs_alloc_key_t) + sizeof(xfs_alloc_ptr_t));
+}
+
 /*
  * Calculate number of records in an alloc btree block.
  */
@@ -576,10 +587,26 @@ xfs_allocbt_maxrecs(
 	int			leaf)
 {
 	blocklen -= XFS_ALLOC_BLOCK_LEN(mp);
+	return xfs_allocbt_block_maxrecs(blocklen, leaf);
+}
 
-	if (leaf)
-		return blocklen / sizeof(xfs_alloc_rec_t);
-	return blocklen / (sizeof(xfs_alloc_key_t) + sizeof(xfs_alloc_ptr_t));
+/* Free space btrees are at their largest when every other block is free. */
+#define XFS_MAX_FREESP_RECORDS	((XFS_MAX_AG_BLOCKS + 1) / 2)
+
+/* Compute the max possible height for free space btrees. */
+unsigned int
+xfs_allocbt_maxlevels_ondisk(void)
+{
+	unsigned int		minrecs[2];
+	unsigned int		blocklen;
+
+	blocklen = min(XFS_MIN_BLOCKSIZE - XFS_BTREE_SBLOCK_LEN,
+		       XFS_MIN_CRC_BLOCKSIZE - XFS_BTREE_SBLOCK_CRC_LEN);
+
+	minrecs[0] = xfs_allocbt_block_maxrecs(blocklen, true) / 2;
+	minrecs[1] = xfs_allocbt_block_maxrecs(blocklen, false) / 2;
+
+	return xfs_btree_compute_maxlevels(minrecs, XFS_MAX_FREESP_RECORDS);
 }
 
 /* Calculate the freespace btree size for some records. */
diff --git a/fs/xfs/libxfs/xfs_alloc_btree.h b/fs/xfs/libxfs/xfs_alloc_btree.h
index 2f6b816aaf9f..c715bee5ae90 100644
--- a/fs/xfs/libxfs/xfs_alloc_btree.h
+++ b/fs/xfs/libxfs/xfs_alloc_btree.h
@@ -60,4 +60,6 @@ extern xfs_extlen_t xfs_allocbt_calc_size(struct xfs_mount *mp,
 void xfs_allocbt_commit_staged_btree(struct xfs_btree_cur *cur,
 		struct xfs_trans *tp, struct xfs_buf *agbp);
 
+unsigned int xfs_allocbt_maxlevels_ondisk(void);
+
 #endif	/* __XFS_ALLOC_BTREE_H__ */
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 2ae5bf9a74e7..321617e837ef 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -93,6 +93,7 @@ xfs_bmap_compute_maxlevels(
 			maxblocks = (maxblocks + minnoderecs - 1) / minnoderecs;
 	}
 	mp->m_bm_maxlevels[whichfork] = level;
+	ASSERT(mp->m_bm_maxlevels[whichfork] <= xfs_bmbt_maxlevels_ondisk());
 }
 
 unsigned int
diff --git a/fs/xfs/libxfs/xfs_bmap_btree.c b/fs/xfs/libxfs/xfs_bmap_btree.c
index b90122de0df0..59d146696a62 100644
--- a/fs/xfs/libxfs/xfs_bmap_btree.c
+++ b/fs/xfs/libxfs/xfs_bmap_btree.c
@@ -571,6 +571,17 @@ xfs_bmbt_init_cursor(
 	return cur;
 }
 
+/* Calculate number of records in a block mapping btree block. */
+static inline unsigned int
+xfs_bmbt_block_maxrecs(
+	unsigned int		blocklen,
+	bool			leaf)
+{
+	if (leaf)
+		return blocklen / sizeof(xfs_bmbt_rec_t);
+	return blocklen / (sizeof(xfs_bmbt_key_t) + sizeof(xfs_bmbt_ptr_t));
+}
+
 /*
  * Calculate number of records in a bmap btree block.
  */
@@ -581,10 +592,24 @@ xfs_bmbt_maxrecs(
 	int			leaf)
 {
 	blocklen -= XFS_BMBT_BLOCK_LEN(mp);
+	return xfs_bmbt_block_maxrecs(blocklen, leaf);
+}
 
-	if (leaf)
-		return blocklen / sizeof(xfs_bmbt_rec_t);
-	return blocklen / (sizeof(xfs_bmbt_key_t) + sizeof(xfs_bmbt_ptr_t));
+/* Compute the max possible height for block mapping btrees. */
+unsigned int
+xfs_bmbt_maxlevels_ondisk(void)
+{
+	unsigned int		minrecs[2];
+	unsigned int		blocklen;
+
+	blocklen = min(XFS_MIN_BLOCKSIZE - XFS_BTREE_SBLOCK_LEN,
+		       XFS_MIN_CRC_BLOCKSIZE - XFS_BTREE_SBLOCK_CRC_LEN);
+
+	minrecs[0] = xfs_bmbt_block_maxrecs(blocklen, true) / 2;
+	minrecs[1] = xfs_bmbt_block_maxrecs(blocklen, false) / 2;
+
+	/* One extra level for the inode root. */
+	return xfs_btree_compute_maxlevels(minrecs, MAXEXTNUM) + 1;
 }
 
 /*
diff --git a/fs/xfs/libxfs/xfs_bmap_btree.h b/fs/xfs/libxfs/xfs_bmap_btree.h
index 729e3bc569be..2a1c9e607b52 100644
--- a/fs/xfs/libxfs/xfs_bmap_btree.h
+++ b/fs/xfs/libxfs/xfs_bmap_btree.h
@@ -110,4 +110,6 @@ extern struct xfs_btree_cur *xfs_bmbt_init_cursor(struct xfs_mount *,
 extern unsigned long long xfs_bmbt_calc_size(struct xfs_mount *mp,
 		unsigned long long len);
 
+unsigned int xfs_bmbt_maxlevels_ondisk(void);
+
 #endif	/* __XFS_BMAP_BTREE_H__ */
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index bde2b4c64dbe..c43877c8a279 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -268,6 +268,8 @@ typedef struct xfs_fsop_resblks {
  */
 #define XFS_MIN_AG_BYTES	(1ULL << 24)	/* 16 MB */
 #define XFS_MAX_AG_BYTES	(1ULL << 40)	/* 1 TB */
+#define XFS_MAX_AG_BLOCKS	(XFS_MAX_AG_BYTES / XFS_MIN_BLOCKSIZE)
+#define XFS_MAX_CRC_AG_BLOCKS	(XFS_MAX_AG_BYTES / XFS_MIN_CRC_BLOCKSIZE)
 
 /* keep the maximum size under 2^31 by a small amount */
 #define XFS_MAX_LOG_BYTES \
diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 994ad783d407..f78a600ca73f 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -2793,6 +2793,7 @@ xfs_ialloc_setup_geometry(
 	inodes = (1LL << XFS_INO_AGINO_BITS(mp)) >> XFS_INODES_PER_CHUNK_LOG;
 	igeo->inobt_maxlevels = xfs_btree_compute_maxlevels(igeo->inobt_mnr,
 			inodes);
+	ASSERT(igeo->inobt_maxlevels <= xfs_iallocbt_maxlevels_ondisk());
 
 	/*
 	 * Set the maximum inode count for this filesystem, being careful not
diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
index 3a5a24648b87..74681e881164 100644
--- a/fs/xfs/libxfs/xfs_ialloc_btree.c
+++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
@@ -526,6 +526,17 @@ xfs_inobt_commit_staged_btree(
 	}
 }
 
+/* Calculate number of records in an inode btree block. */
+static inline unsigned int
+xfs_inobt_block_maxrecs(
+	unsigned int		blocklen,
+	bool			leaf)
+{
+	if (leaf)
+		return blocklen / sizeof(xfs_inobt_rec_t);
+	return blocklen / (sizeof(xfs_inobt_key_t) + sizeof(xfs_inobt_ptr_t));
+}
+
 /*
  * Calculate number of records in an inobt btree block.
  */
@@ -536,10 +547,54 @@ xfs_inobt_maxrecs(
 	int			leaf)
 {
 	blocklen -= XFS_INOBT_BLOCK_LEN(mp);
+	return xfs_inobt_block_maxrecs(blocklen, leaf);
+}
 
-	if (leaf)
-		return blocklen / sizeof(xfs_inobt_rec_t);
-	return blocklen / (sizeof(xfs_inobt_key_t) + sizeof(xfs_inobt_ptr_t));
+/*
+ * Maximum number of inode btree records per AG.  Pretend that we can fill an
+ * entire AG completely full of inodes except for the AG headers.
+ */
+#define XFS_MAX_INODE_RECORDS \
+	((XFS_MAX_AG_BYTES - (4 * BBSIZE)) / XFS_DINODE_MIN_SIZE) / \
+			XFS_INODES_PER_CHUNK
+
+/* Compute the max possible height for the inode btree. */
+static inline unsigned int
+xfs_inobt_maxlevels_ondisk(void)
+{
+	unsigned int		minrecs[2];
+	unsigned int		blocklen;
+
+	blocklen = min(XFS_MIN_BLOCKSIZE - XFS_BTREE_SBLOCK_LEN,
+		       XFS_MIN_CRC_BLOCKSIZE - XFS_BTREE_SBLOCK_CRC_LEN);
+
+	minrecs[0] = xfs_inobt_block_maxrecs(blocklen, true) / 2;
+	minrecs[1] = xfs_inobt_block_maxrecs(blocklen, false) / 2;
+
+	return xfs_btree_compute_maxlevels(minrecs, XFS_MAX_INODE_RECORDS);
+}
+
+/* Compute the max possible height for the free inode btree. */
+static inline unsigned int
+xfs_finobt_maxlevels_ondisk(void)
+{
+	unsigned int		minrecs[2];
+	unsigned int		blocklen;
+
+	blocklen = XFS_MIN_CRC_BLOCKSIZE - XFS_BTREE_SBLOCK_CRC_LEN;
+
+	minrecs[0] = xfs_inobt_block_maxrecs(blocklen, true) / 2;
+	minrecs[1] = xfs_inobt_block_maxrecs(blocklen, false) / 2;
+
+	return xfs_btree_compute_maxlevels(minrecs, XFS_MAX_INODE_RECORDS);
+}
+
+/* Compute the max possible height for either inode btree. */
+unsigned int
+xfs_iallocbt_maxlevels_ondisk(void)
+{
+	return max(xfs_inobt_maxlevels_ondisk(),
+		   xfs_finobt_maxlevels_ondisk());
 }
 
 /*
diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.h b/fs/xfs/libxfs/xfs_ialloc_btree.h
index 8a322d402e61..6d3e4a3316d7 100644
--- a/fs/xfs/libxfs/xfs_ialloc_btree.h
+++ b/fs/xfs/libxfs/xfs_ialloc_btree.h
@@ -75,4 +75,6 @@ int xfs_inobt_cur(struct xfs_mount *mp, struct xfs_trans *tp,
 void xfs_inobt_commit_staged_btree(struct xfs_btree_cur *cur,
 		struct xfs_trans *tp, struct xfs_buf *agbp);
 
+unsigned int xfs_iallocbt_maxlevels_ondisk(void);
+
 #endif	/* __XFS_IALLOC_BTREE_H__ */
diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c
index 995b0d86ddc0..3bf802fc33bb 100644
--- a/fs/xfs/libxfs/xfs_refcount_btree.c
+++ b/fs/xfs/libxfs/xfs_refcount_btree.c
@@ -393,29 +393,58 @@ xfs_refcountbt_commit_staged_btree(
 	xfs_btree_commit_afakeroot(cur, tp, agbp, &xfs_refcountbt_ops);
 }
 
-/*
- * Calculate the number of records in a refcount btree block.
- */
-int
-xfs_refcountbt_maxrecs(
-	int			blocklen,
+/* Calculate number of records in a refcount btree block. */
+static inline unsigned int
+xfs_refcountbt_block_maxrecs(
+	unsigned int		blocklen,
 	bool			leaf)
 {
-	blocklen -= XFS_REFCOUNT_BLOCK_LEN;
-
 	if (leaf)
 		return blocklen / sizeof(struct xfs_refcount_rec);
 	return blocklen / (sizeof(struct xfs_refcount_key) +
 			   sizeof(xfs_refcount_ptr_t));
 }
 
+/*
+ * Calculate the number of records in a refcount btree block.
+ */
+int
+xfs_refcountbt_maxrecs(
+	int			blocklen,
+	bool			leaf)
+{
+	blocklen -= XFS_REFCOUNT_BLOCK_LEN;
+	return xfs_refcountbt_block_maxrecs(blocklen, leaf);
+}
+
+/* Compute the max possible height of the maximally sized refcount btree. */
+unsigned int
+xfs_refcountbt_maxlevels_ondisk(void)
+{
+	unsigned int		minrecs[2];
+	unsigned int		blocklen;
+
+	blocklen = XFS_MIN_CRC_BLOCKSIZE - XFS_BTREE_SBLOCK_CRC_LEN;
+
+	minrecs[0] = xfs_refcountbt_block_maxrecs(blocklen, true) / 2;
+	minrecs[1] = xfs_refcountbt_block_maxrecs(blocklen, false) / 2;
+
+	return xfs_btree_compute_maxlevels(minrecs, XFS_MAX_CRC_AG_BLOCKS);
+}
+
 /* Compute the maximum height of a refcount btree. */
 void
 xfs_refcountbt_compute_maxlevels(
 	struct xfs_mount		*mp)
 {
+	if (!xfs_has_reflink(mp)) {
+		mp->m_refc_maxlevels = 0;
+		return;
+	}
+
 	mp->m_refc_maxlevels = xfs_btree_compute_maxlevels(
 			mp->m_refc_mnr, mp->m_sb.sb_agblocks);
+	ASSERT(mp->m_refc_maxlevels <= xfs_refcountbt_maxlevels_ondisk());
 }
 
 /* Calculate the refcount btree size for some records. */
diff --git a/fs/xfs/libxfs/xfs_refcount_btree.h b/fs/xfs/libxfs/xfs_refcount_btree.h
index bd9ed9e1e41f..d7f7c89cbf35 100644
--- a/fs/xfs/libxfs/xfs_refcount_btree.h
+++ b/fs/xfs/libxfs/xfs_refcount_btree.h
@@ -65,4 +65,6 @@ extern int xfs_refcountbt_calc_reserves(struct xfs_mount *mp,
 void xfs_refcountbt_commit_staged_btree(struct xfs_btree_cur *cur,
 		struct xfs_trans *tp, struct xfs_buf *agbp);
 
+unsigned int xfs_refcountbt_maxlevels_ondisk(void);
+
 #endif	/* __XFS_REFCOUNT_BTREE_H__ */
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
index 7388201e48d2..0c96e26daca9 100644
--- a/fs/xfs/libxfs/xfs_rmap_btree.c
+++ b/fs/xfs/libxfs/xfs_rmap_btree.c
@@ -519,6 +519,18 @@ xfs_rmapbt_commit_staged_btree(
 	xfs_btree_commit_afakeroot(cur, tp, agbp, &xfs_rmapbt_ops);
 }
 
+/* Calculate number of records in a reverse mapping btree block. */
+static inline unsigned int
+xfs_rmapbt_block_maxrecs(
+	unsigned int		blocklen,
+	bool			leaf)
+{
+	if (leaf)
+		return blocklen / sizeof(struct xfs_rmap_rec);
+	return blocklen /
+		(2 * sizeof(struct xfs_rmap_key) + sizeof(xfs_rmap_ptr_t));
+}
+
 /*
  * Calculate number of records in an rmap btree block.
  */
@@ -528,11 +540,33 @@ xfs_rmapbt_maxrecs(
 	int			leaf)
 {
 	blocklen -= XFS_RMAP_BLOCK_LEN;
+	return xfs_rmapbt_block_maxrecs(blocklen, leaf);
+}
 
-	if (leaf)
-		return blocklen / sizeof(struct xfs_rmap_rec);
-	return blocklen /
-		(2 * sizeof(struct xfs_rmap_key) + sizeof(xfs_rmap_ptr_t));
+/* Compute the max possible height for reverse mapping btrees. */
+unsigned int
+xfs_rmapbt_maxlevels_ondisk(void)
+{
+	unsigned int		minrecs[2];
+	unsigned int		blocklen;
+
+	blocklen = XFS_MIN_CRC_BLOCKSIZE - XFS_BTREE_SBLOCK_CRC_LEN;
+
+	minrecs[0] = xfs_rmapbt_block_maxrecs(blocklen, true) / 2;
+	minrecs[1] = xfs_rmapbt_block_maxrecs(blocklen, false) / 2;
+
+	/*
+	 * Compute the asymptotic maxlevels for an rmapbt on any reflink fs.
+	 *
+	 * On a reflink filesystem, each AG block can have up to 2^32 (per the
+	 * refcount record format) owners, which means that theoretically we
+	 * could face up to 2^64 rmap records.  However, we're likely to run
+	 * out of blocks in the AG long before that happens, which means that
+	 * we must compute the max height based on what the btree will look
+	 * like if it consumes almost all the blocks in the AG due to maximal
+	 * sharing factor.
+	 */
+	return xfs_btree_space_to_height(minrecs, XFS_MAX_CRC_AG_BLOCKS);
 }
 
 /* Compute the maximum height of an rmap btree. */
@@ -569,6 +603,7 @@ xfs_rmapbt_compute_maxlevels(
 		mp->m_rmap_maxlevels = xfs_btree_compute_maxlevels(
 				mp->m_rmap_mnr, mp->m_sb.sb_agblocks);
 	}
+	ASSERT(mp->m_rmap_maxlevels <= xfs_rmapbt_maxlevels_ondisk());
 }
 
 /* Calculate the refcount btree size for some records. */
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.h b/fs/xfs/libxfs/xfs_rmap_btree.h
index f2eee6572af4..e9778b62ad55 100644
--- a/fs/xfs/libxfs/xfs_rmap_btree.h
+++ b/fs/xfs/libxfs/xfs_rmap_btree.h
@@ -59,4 +59,6 @@ extern xfs_extlen_t xfs_rmapbt_max_size(struct xfs_mount *mp,
 extern int xfs_rmapbt_calc_reserves(struct xfs_mount *mp, struct xfs_trans *tp,
 		struct xfs_perag *pag, xfs_extlen_t *ask, xfs_extlen_t *used);
 
+unsigned int xfs_rmapbt_maxlevels_ondisk(void);
+
 #endif /* __XFS_RMAP_BTREE_H__ */


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 17/17] xfs: use separate btree cursor cache for each btree type
  2021-10-14 20:16 [PATCHSET v4 00/17] xfs: support dynamic btree cursor height Darrick J. Wong
                   ` (15 preceding siblings ...)
  2021-10-14 20:18 ` [PATCH 16/17] xfs: compute absolute maximum nlevels for each btree type Darrick J. Wong
@ 2021-10-14 20:18 ` Darrick J. Wong
  2021-10-14 23:11   ` Dave Chinner
  16 siblings, 1 reply; 35+ messages in thread
From: Darrick J. Wong @ 2021-10-14 20:18 UTC (permalink / raw)
  To: djwong, david; +Cc: linux-xfs, chandan.babu, hch

From: Darrick J. Wong <djwong@kernel.org>

Now that we have the infrastructure to track the max possible height of
each btree type, we can create a separate slab cache for cursors of each
type of btree.  For smaller indices like the free space btrees, this
means that we can pack more cursors into a slab page, improving slab
utilization.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_alloc_btree.c    |   23 +++++++++++++++++
 fs/xfs/libxfs/xfs_alloc_btree.h    |    3 ++
 fs/xfs/libxfs/xfs_bmap_btree.c     |   23 +++++++++++++++++
 fs/xfs/libxfs/xfs_bmap_btree.h     |    3 ++
 fs/xfs/libxfs/xfs_btree.c          |   48 ++++++++++++++++++++++++++++++++----
 fs/xfs/libxfs/xfs_btree.h          |   20 ++++++---------
 fs/xfs/libxfs/xfs_ialloc_btree.c   |   23 +++++++++++++++++
 fs/xfs/libxfs/xfs_ialloc_btree.h   |    3 ++
 fs/xfs/libxfs/xfs_refcount_btree.c |   23 +++++++++++++++++
 fs/xfs/libxfs/xfs_refcount_btree.h |    3 ++
 fs/xfs/libxfs/xfs_rmap_btree.c     |   23 +++++++++++++++++
 fs/xfs/libxfs/xfs_rmap_btree.h     |    3 ++
 fs/xfs/xfs_super.c                 |   13 +++++-----
 13 files changed, 182 insertions(+), 29 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_alloc_btree.c b/fs/xfs/libxfs/xfs_alloc_btree.c
index d0a7aa4b52a8..609d349e7bd4 100644
--- a/fs/xfs/libxfs/xfs_alloc_btree.c
+++ b/fs/xfs/libxfs/xfs_alloc_btree.c
@@ -20,6 +20,7 @@
 #include "xfs_trans.h"
 #include "xfs_ag.h"
 
+static kmem_zone_t	*xfs_allocbt_cur_cache;
 
 STATIC struct xfs_btree_cur *
 xfs_allocbt_dup_cursor(
@@ -477,7 +478,8 @@ xfs_allocbt_init_common(
 
 	ASSERT(btnum == XFS_BTNUM_BNO || btnum == XFS_BTNUM_CNT);
 
-	cur = xfs_btree_alloc_cursor(mp, tp, btnum, mp->m_alloc_maxlevels);
+	cur = xfs_btree_alloc_cursor(mp, tp, btnum, mp->m_alloc_maxlevels,
+			xfs_allocbt_cur_cache);
 	cur->bc_ag.abt.active = false;
 
 	if (btnum == XFS_BTNUM_CNT) {
@@ -617,3 +619,22 @@ xfs_allocbt_calc_size(
 {
 	return xfs_btree_calc_size(mp->m_alloc_mnr, len);
 }
+
+int __init
+xfs_allocbt_init_cur_cache(void)
+{
+	xfs_allocbt_cur_cache = kmem_cache_create("xfs_bnobt_cur",
+			xfs_btree_cur_sizeof(xfs_allocbt_maxlevels_ondisk()),
+			0, 0, NULL);
+
+	if (!xfs_allocbt_cur_cache)
+		return -ENOMEM;
+	return 0;
+}
+
+void
+xfs_allocbt_destroy_cur_cache(void)
+{
+	kmem_cache_destroy(xfs_allocbt_cur_cache);
+	xfs_allocbt_cur_cache = NULL;
+}
diff --git a/fs/xfs/libxfs/xfs_alloc_btree.h b/fs/xfs/libxfs/xfs_alloc_btree.h
index c715bee5ae90..45df893ef6bb 100644
--- a/fs/xfs/libxfs/xfs_alloc_btree.h
+++ b/fs/xfs/libxfs/xfs_alloc_btree.h
@@ -62,4 +62,7 @@ void xfs_allocbt_commit_staged_btree(struct xfs_btree_cur *cur,
 
 unsigned int xfs_allocbt_maxlevels_ondisk(void);
 
+int __init xfs_allocbt_init_cur_cache(void);
+void xfs_allocbt_destroy_cur_cache(void);
+
 #endif	/* __XFS_ALLOC_BTREE_H__ */
diff --git a/fs/xfs/libxfs/xfs_bmap_btree.c b/fs/xfs/libxfs/xfs_bmap_btree.c
index 59d146696a62..107ac1d127bf 100644
--- a/fs/xfs/libxfs/xfs_bmap_btree.c
+++ b/fs/xfs/libxfs/xfs_bmap_btree.c
@@ -22,6 +22,8 @@
 #include "xfs_trace.h"
 #include "xfs_rmap.h"
 
+static kmem_zone_t	*xfs_bmbt_cur_cache;
+
 /*
  * Convert on-disk form of btree root to in-memory form.
  */
@@ -553,7 +555,7 @@ xfs_bmbt_init_cursor(
 	ASSERT(whichfork != XFS_COW_FORK);
 
 	cur = xfs_btree_alloc_cursor(mp, tp, XFS_BTNUM_BMAP,
-			mp->m_bm_maxlevels[whichfork]);
+			mp->m_bm_maxlevels[whichfork], xfs_bmbt_cur_cache);
 	cur->bc_nlevels = be16_to_cpu(ifp->if_broot->bb_level) + 1;
 	cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_bmbt_2);
 
@@ -675,3 +677,22 @@ xfs_bmbt_calc_size(
 {
 	return xfs_btree_calc_size(mp->m_bmap_dmnr, len);
 }
+
+int __init
+xfs_bmbt_init_cur_cache(void)
+{
+	xfs_bmbt_cur_cache = kmem_cache_create("xfs_bmbt_cur",
+			xfs_btree_cur_sizeof(xfs_bmbt_maxlevels_ondisk()),
+			0, 0, NULL);
+
+	if (!xfs_bmbt_cur_cache)
+		return -ENOMEM;
+	return 0;
+}
+
+void
+xfs_bmbt_destroy_cur_cache(void)
+{
+	kmem_cache_destroy(xfs_bmbt_cur_cache);
+	xfs_bmbt_cur_cache = NULL;
+}
diff --git a/fs/xfs/libxfs/xfs_bmap_btree.h b/fs/xfs/libxfs/xfs_bmap_btree.h
index 2a1c9e607b52..3e7a40a83835 100644
--- a/fs/xfs/libxfs/xfs_bmap_btree.h
+++ b/fs/xfs/libxfs/xfs_bmap_btree.h
@@ -112,4 +112,7 @@ extern unsigned long long xfs_bmbt_calc_size(struct xfs_mount *mp,
 
 unsigned int xfs_bmbt_maxlevels_ondisk(void);
 
+int __init xfs_bmbt_init_cur_cache(void);
+void xfs_bmbt_destroy_cur_cache(void);
+
 #endif	/* __XFS_BMAP_BTREE_H__ */
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index 43e646f3956c..80723dac519f 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -22,11 +22,11 @@
 #include "xfs_log.h"
 #include "xfs_btree_staging.h"
 #include "xfs_ag.h"
-
-/*
- * Cursor allocation zone.
- */
-kmem_zone_t	*xfs_btree_cur_zone;
+#include "xfs_alloc_btree.h"
+#include "xfs_ialloc_btree.h"
+#include "xfs_bmap_btree.h"
+#include "xfs_rmap_btree.h"
+#include "xfs_refcount_btree.h"
 
 /*
  * Btree magic numbers.
@@ -379,7 +379,7 @@ xfs_btree_del_cursor(
 		kmem_free(cur->bc_ops);
 	if (!(cur->bc_flags & XFS_BTREE_LONG_PTRS) && cur->bc_ag.pag)
 		xfs_perag_put(cur->bc_ag.pag);
-	kmem_cache_free(xfs_btree_cur_zone, cur);
+	kmem_cache_free(cur->bc_cache, cur);
 }
 
 /*
@@ -4962,3 +4962,39 @@ xfs_btree_has_more_records(
 	else
 		return block->bb_u.s.bb_rightsib != cpu_to_be32(NULLAGBLOCK);
 }
+
+/* Set up all the btree cursor caches. */
+int __init
+xfs_btree_init_cur_caches(void)
+{
+	int		error;
+
+	error = xfs_allocbt_init_cur_cache();
+	if (error)
+		return error;
+	error = xfs_inobt_init_cur_cache();
+	if (error)
+		return error;
+	error = xfs_bmbt_init_cur_cache();
+	if (error)
+		return error;
+	error = xfs_rmapbt_init_cur_cache();
+	if (error)
+		return error;
+	error = xfs_refcountbt_init_cur_cache();
+	if (error)
+		return error;
+
+	return 0;
+}
+
+/* Destroy all the btree cursor caches, if they've been allocated. */
+void
+xfs_btree_destroy_cur_caches(void)
+{
+	xfs_allocbt_destroy_cur_cache();
+	xfs_inobt_destroy_cur_cache();
+	xfs_bmbt_destroy_cur_cache();
+	xfs_rmapbt_destroy_cur_cache();
+	xfs_refcountbt_destroy_cur_cache();
+}
diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index fdf7090c74f4..7bc5a3796052 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -13,8 +13,6 @@ struct xfs_trans;
 struct xfs_ifork;
 struct xfs_perag;
 
-extern kmem_zone_t	*xfs_btree_cur_zone;
-
 /*
  * Generic key, ptr and record wrapper structures.
  *
@@ -92,12 +90,6 @@ uint32_t xfs_btree_magic(int crc, xfs_btnum_t btnum);
 #define XFS_BTREE_STATS_ADD(cur, stat, val)	\
 	XFS_STATS_ADD_OFF((cur)->bc_mp, (cur)->bc_statoff + __XBTS_ ## stat, val)
 
-/*
- * The btree cursor zone hands out cursors that can handle up to this many
- * levels.  This is the known maximum for all btree types.
- */
-#define XFS_BTREE_CUR_CACHE_MAXLEVELS	(9)
-
 struct xfs_btree_ops {
 	/* size of the key and record structures */
 	size_t	key_len;
@@ -238,6 +230,7 @@ struct xfs_btree_cur
 	struct xfs_trans	*bc_tp;	/* transaction we're in, if any */
 	struct xfs_mount	*bc_mp;	/* file system mount struct */
 	const struct xfs_btree_ops *bc_ops;
+	kmem_zone_t		*bc_cache; /* cursor cache */
 	unsigned int		bc_flags; /* btree features - below */
 	xfs_btnum_t		bc_btnum; /* identifies which btree type */
 	union xfs_btree_irec	bc_rec;	/* current insert/search record value */
@@ -592,19 +585,22 @@ xfs_btree_alloc_cursor(
 	struct xfs_mount	*mp,
 	struct xfs_trans	*tp,
 	xfs_btnum_t		btnum,
-	uint8_t			maxlevels)
+	uint8_t			maxlevels,
+	kmem_zone_t		*cache)
 {
 	struct xfs_btree_cur	*cur;
 
-	ASSERT(maxlevels <= XFS_BTREE_CUR_CACHE_MAXLEVELS);
-
-	cur = kmem_cache_zalloc(xfs_btree_cur_zone, GFP_NOFS | __GFP_NOFAIL);
+	cur = kmem_cache_zalloc(cache, GFP_NOFS | __GFP_NOFAIL);
 	cur->bc_tp = tp;
 	cur->bc_mp = mp;
 	cur->bc_btnum = btnum;
 	cur->bc_maxlevels = maxlevels;
+	cur->bc_cache = cache;
 
 	return cur;
 }
 
+int __init xfs_btree_init_cur_caches(void);
+void xfs_btree_destroy_cur_caches(void);
+
 #endif	/* __XFS_BTREE_H__ */
diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
index 74681e881164..4a11024408e0 100644
--- a/fs/xfs/libxfs/xfs_ialloc_btree.c
+++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
@@ -22,6 +22,8 @@
 #include "xfs_rmap.h"
 #include "xfs_ag.h"
 
+static kmem_zone_t	*xfs_inobt_cur_cache;
+
 STATIC int
 xfs_inobt_get_minrecs(
 	struct xfs_btree_cur	*cur,
@@ -433,7 +435,7 @@ xfs_inobt_init_common(
 	struct xfs_btree_cur	*cur;
 
 	cur = xfs_btree_alloc_cursor(mp, tp, btnum,
-			M_IGEO(mp)->inobt_maxlevels);
+			M_IGEO(mp)->inobt_maxlevels, xfs_inobt_cur_cache);
 	if (btnum == XFS_BTNUM_INO) {
 		cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_ibt_2);
 		cur->bc_ops = &xfs_inobt_ops;
@@ -812,3 +814,22 @@ xfs_iallocbt_calc_size(
 {
 	return xfs_btree_calc_size(M_IGEO(mp)->inobt_mnr, len);
 }
+
+int __init
+xfs_inobt_init_cur_cache(void)
+{
+	xfs_inobt_cur_cache = kmem_cache_create("xfs_inobt_cur",
+			xfs_btree_cur_sizeof(xfs_inobt_maxlevels_ondisk()),
+			0, 0, NULL);
+
+	if (!xfs_inobt_cur_cache)
+		return -ENOMEM;
+	return 0;
+}
+
+void
+xfs_inobt_destroy_cur_cache(void)
+{
+	kmem_cache_destroy(xfs_inobt_cur_cache);
+	xfs_inobt_cur_cache = NULL;
+}
diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.h b/fs/xfs/libxfs/xfs_ialloc_btree.h
index 6d3e4a3316d7..26451cb76b98 100644
--- a/fs/xfs/libxfs/xfs_ialloc_btree.h
+++ b/fs/xfs/libxfs/xfs_ialloc_btree.h
@@ -77,4 +77,7 @@ void xfs_inobt_commit_staged_btree(struct xfs_btree_cur *cur,
 
 unsigned int xfs_iallocbt_maxlevels_ondisk(void);
 
+int __init xfs_inobt_init_cur_cache(void);
+void xfs_inobt_destroy_cur_cache(void);
+
 #endif	/* __XFS_IALLOC_BTREE_H__ */
diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c
index 3bf802fc33bb..6c4deb436c07 100644
--- a/fs/xfs/libxfs/xfs_refcount_btree.c
+++ b/fs/xfs/libxfs/xfs_refcount_btree.c
@@ -21,6 +21,8 @@
 #include "xfs_rmap.h"
 #include "xfs_ag.h"
 
+static kmem_zone_t	*xfs_refcountbt_cur_cache;
+
 static struct xfs_btree_cur *
 xfs_refcountbt_dup_cursor(
 	struct xfs_btree_cur	*cur)
@@ -323,7 +325,7 @@ xfs_refcountbt_init_common(
 	ASSERT(pag->pag_agno < mp->m_sb.sb_agcount);
 
 	cur = xfs_btree_alloc_cursor(mp, tp, XFS_BTNUM_REFC,
-			mp->m_refc_maxlevels);
+			mp->m_refc_maxlevels, xfs_refcountbt_cur_cache);
 	cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_refcbt_2);
 
 	cur->bc_flags |= XFS_BTREE_CRC_BLOCKS;
@@ -514,3 +516,22 @@ xfs_refcountbt_calc_reserves(
 
 	return error;
 }
+
+int __init
+xfs_refcountbt_init_cur_cache(void)
+{
+	xfs_refcountbt_cur_cache = kmem_cache_create("xfs_refcbt_cur",
+			xfs_btree_cur_sizeof(xfs_refcountbt_maxlevels_ondisk()),
+			0, 0, NULL);
+
+	if (!xfs_refcountbt_cur_cache)
+		return -ENOMEM;
+	return 0;
+}
+
+void
+xfs_refcountbt_destroy_cur_cache(void)
+{
+	kmem_cache_destroy(xfs_refcountbt_cur_cache);
+	xfs_refcountbt_cur_cache = NULL;
+}
diff --git a/fs/xfs/libxfs/xfs_refcount_btree.h b/fs/xfs/libxfs/xfs_refcount_btree.h
index d7f7c89cbf35..d66b37259bed 100644
--- a/fs/xfs/libxfs/xfs_refcount_btree.h
+++ b/fs/xfs/libxfs/xfs_refcount_btree.h
@@ -67,4 +67,7 @@ void xfs_refcountbt_commit_staged_btree(struct xfs_btree_cur *cur,
 
 unsigned int xfs_refcountbt_maxlevels_ondisk(void);
 
+int __init xfs_refcountbt_init_cur_cache(void);
+void xfs_refcountbt_destroy_cur_cache(void);
+
 #endif	/* __XFS_REFCOUNT_BTREE_H__ */
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
index 0c96e26daca9..3d4134eab8cf 100644
--- a/fs/xfs/libxfs/xfs_rmap_btree.c
+++ b/fs/xfs/libxfs/xfs_rmap_btree.c
@@ -22,6 +22,8 @@
 #include "xfs_ag.h"
 #include "xfs_ag_resv.h"
 
+static kmem_zone_t	*xfs_rmapbt_cur_cache;
+
 /*
  * Reverse map btree.
  *
@@ -453,7 +455,7 @@ xfs_rmapbt_init_common(
 
 	/* Overlapping btree; 2 keys per pointer. */
 	cur = xfs_btree_alloc_cursor(mp, tp, XFS_BTNUM_RMAP,
-			mp->m_rmap_maxlevels);
+			mp->m_rmap_maxlevels, xfs_rmapbt_cur_cache);
 	cur->bc_flags = XFS_BTREE_CRC_BLOCKS | XFS_BTREE_OVERLAPPING;
 	cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_rmap_2);
 	cur->bc_ops = &xfs_rmapbt_ops;
@@ -674,3 +676,22 @@ xfs_rmapbt_calc_reserves(
 
 	return error;
 }
+
+int __init
+xfs_rmapbt_init_cur_cache(void)
+{
+	xfs_rmapbt_cur_cache = kmem_cache_create("xfs_rmapbt_cur",
+			xfs_btree_cur_sizeof(xfs_rmapbt_maxlevels_ondisk()),
+			0, 0, NULL);
+
+	if (!xfs_rmapbt_cur_cache)
+		return -ENOMEM;
+	return 0;
+}
+
+void
+xfs_rmapbt_destroy_cur_cache(void)
+{
+	kmem_cache_destroy(xfs_rmapbt_cur_cache);
+	xfs_rmapbt_cur_cache = NULL;
+}
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.h b/fs/xfs/libxfs/xfs_rmap_btree.h
index e9778b62ad55..3244715dd111 100644
--- a/fs/xfs/libxfs/xfs_rmap_btree.h
+++ b/fs/xfs/libxfs/xfs_rmap_btree.h
@@ -61,4 +61,7 @@ extern int xfs_rmapbt_calc_reserves(struct xfs_mount *mp, struct xfs_trans *tp,
 
 unsigned int xfs_rmapbt_maxlevels_ondisk(void);
 
+int __init xfs_rmapbt_init_cur_cache(void);
+void xfs_rmapbt_destroy_cur_cache(void);
+
 #endif /* __XFS_RMAP_BTREE_H__ */
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 2a535a8bc3c0..6fcafc43b823 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -37,6 +37,7 @@
 #include "xfs_reflink.h"
 #include "xfs_pwork.h"
 #include "xfs_ag.h"
+#include "xfs_btree.h"
 
 #include <linux/magic.h>
 #include <linux/fs_context.h>
@@ -1953,6 +1954,8 @@ MODULE_ALIAS_FS("xfs");
 STATIC int __init
 xfs_init_zones(void)
 {
+	int		error;
+
 	xfs_log_ticket_zone = kmem_cache_create("xfs_log_ticket",
 						sizeof(struct xlog_ticket),
 						0, 0, NULL);
@@ -1965,10 +1968,8 @@ xfs_init_zones(void)
 	if (!xfs_bmap_free_item_zone)
 		goto out_destroy_log_ticket_zone;
 
-	xfs_btree_cur_zone = kmem_cache_create("xfs_btree_cur",
-			xfs_btree_cur_sizeof(XFS_BTREE_CUR_CACHE_MAXLEVELS),
-			0, 0, NULL);
-	if (!xfs_btree_cur_zone)
+	error = xfs_btree_init_cur_caches();
+	if (error)
 		goto out_destroy_bmap_free_item_zone;
 
 	xfs_da_state_zone = kmem_cache_create("xfs_da_state",
@@ -2106,7 +2107,7 @@ xfs_init_zones(void)
  out_destroy_da_state_zone:
 	kmem_cache_destroy(xfs_da_state_zone);
  out_destroy_btree_cur_zone:
-	kmem_cache_destroy(xfs_btree_cur_zone);
+	xfs_btree_destroy_cur_caches();
  out_destroy_bmap_free_item_zone:
 	kmem_cache_destroy(xfs_bmap_free_item_zone);
  out_destroy_log_ticket_zone:
@@ -2138,7 +2139,7 @@ xfs_destroy_zones(void)
 	kmem_cache_destroy(xfs_trans_zone);
 	kmem_cache_destroy(xfs_ifork_zone);
 	kmem_cache_destroy(xfs_da_state_zone);
-	kmem_cache_destroy(xfs_btree_cur_zone);
+	xfs_btree_destroy_cur_caches();
 	kmem_cache_destroy(xfs_bmap_free_item_zone);
 	kmem_cache_destroy(xfs_log_ticket_zone);
 }


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH 01/17] xfs: fix incorrect decoding in xchk_btree_cur_fsbno
  2021-10-14 20:17 ` [PATCH 01/17] xfs: fix incorrect decoding in xchk_btree_cur_fsbno Darrick J. Wong
@ 2021-10-14 22:48   ` Dave Chinner
  2021-10-14 23:05     ` Darrick J. Wong
  2021-10-15  0:42   ` [PATCH v4.1 " Darrick J. Wong
  1 sibling, 1 reply; 35+ messages in thread
From: Dave Chinner @ 2021-10-14 22:48 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, chandan.babu, hch

On Thu, Oct 14, 2021 at 01:17:00PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> During review of subsequent patches, Dave and I noticed that this
> function doesn't work quite right -- accessing cur->bc_ino depends on
> the ROOT_IN_INODE flag, not LONG_PTRS.  Fix that and the parentheses
> isssue.  While we're at it, remove the piece that accesses cur->bc_ag,
> because block 0 of an AG is never part of a btree.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/scrub/trace.c |    7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> 
> diff --git a/fs/xfs/scrub/trace.c b/fs/xfs/scrub/trace.c
> index c0ef53fe6611..93c13763c15e 100644
> --- a/fs/xfs/scrub/trace.c
> +++ b/fs/xfs/scrub/trace.c
> @@ -24,10 +24,11 @@ xchk_btree_cur_fsbno(
>  	if (level < cur->bc_nlevels && cur->bc_bufs[level])
>  		return XFS_DADDR_TO_FSB(cur->bc_mp,
>  				xfs_buf_daddr(cur->bc_bufs[level]));
> -	if (level == cur->bc_nlevels - 1 && cur->bc_flags & XFS_BTREE_LONG_PTRS)
> +
> +	if (level == cur->bc_nlevels - 1 &&
> +	    (cur->bc_flags & XFS_BTREE_ROOT_IN_INODE))
>  		return XFS_INO_TO_FSB(cur->bc_mp, cur->bc_ino.ip->i_ino);

Ok.

> -	if (!(cur->bc_flags & XFS_BTREE_LONG_PTRS))
> -		return XFS_AGB_TO_FSB(cur->bc_mp, cur->bc_ag.pag->pag_agno, 0);
> +

But doesn't this break the tracing code on short pointers as the
tracing code does:

	TP_fast_assign(
		xfs_fsblock_t fsbno = xchk_btree_cur_fsbno(cur, level);
		...
		__entry->agno = XFS_FSB_TO_AGNO(cur->bc_mp, fsbno);


i.e. the tracing will no longer give the correct agno for per-ag
cursors that don't have any buffers attached to them at the current
level?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 05/17] xfs: dynamically allocate btree scrub context structure
  2021-10-14 20:17 ` [PATCH 05/17] xfs: dynamically allocate btree scrub context structure Darrick J. Wong
@ 2021-10-14 22:48   ` Dave Chinner
  0 siblings, 0 replies; 35+ messages in thread
From: Dave Chinner @ 2021-10-14 22:48 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, chandan.babu, hch

On Thu, Oct 14, 2021 at 01:17:22PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Reorganize struct xchk_btree so that we can dynamically size the context
> structure to fit the type of btree cursor that we have.  This will
> enable us to use memory more efficiently once we start adding very tall
> btree types.  Right-size the lastkey array to match the number of *node*
> levels in the tree so that we stop wasting space.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/scrub/btree.c |   23 ++++++++++++-----------
>  fs/xfs/scrub/btree.h |   15 ++++++++++++++-
>  2 files changed, 26 insertions(+), 12 deletions(-)

Looks good now.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 06/17] xfs: prepare xfs_btree_cur for dynamic cursor heights
  2021-10-14 20:17 ` [PATCH 06/17] xfs: prepare xfs_btree_cur for dynamic cursor heights Darrick J. Wong
@ 2021-10-14 22:49   ` Dave Chinner
  0 siblings, 0 replies; 35+ messages in thread
From: Dave Chinner @ 2021-10-14 22:49 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Chandan Babu R, Christoph Hellwig, linux-xfs

On Thu, Oct 14, 2021 at 01:17:28PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Split out the btree level information into a separate struct and put it
> at the end of the cursor structure as a VLA.  Files with huge data forks
> (and in the future, the realtime rmap btree) will require the ability to
> support many more levels than a per-AG btree cursor, which means that
> we're going to create per-btree type cursor caches to conserve memory
> for the more common case.
> 
> Note that a subsequent patch actually introduces dynamic cursor heights.
> This one merely rearranges the structure to prepare for that.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/xfs/libxfs/xfs_alloc.c |    6 +-
>  fs/xfs/libxfs/xfs_bmap.c  |   10 +--
>  fs/xfs/libxfs/xfs_btree.c |  168 +++++++++++++++++++++++----------------------
>  fs/xfs/libxfs/xfs_btree.h |   33 +++++++--
>  fs/xfs/scrub/bitmap.c     |   22 +++---
>  fs/xfs/scrub/bmap.c       |    2 -
>  fs/xfs/scrub/btree.c      |   47 +++++++------
>  fs/xfs/scrub/trace.c      |    4 +
>  fs/xfs/scrub/trace.h      |   10 +--
>  fs/xfs/xfs_super.c        |    2 -
>  fs/xfs/xfs_trace.h        |    2 -
>  11 files changed, 167 insertions(+), 139 deletions(-)

LGTM.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 11/17] xfs: rename m_ag_maxlevels to m_allocbt_maxlevels
  2021-10-14 20:17 ` [PATCH 11/17] xfs: rename m_ag_maxlevels to m_allocbt_maxlevels Darrick J. Wong
@ 2021-10-14 22:52   ` Dave Chinner
  2021-10-15  0:44     ` Darrick J. Wong
  0 siblings, 1 reply; 35+ messages in thread
From: Dave Chinner @ 2021-10-14 22:52 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, chandan.babu, hch

On Thu, Oct 14, 2021 at 01:17:55PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Years ago when XFS was thought to be much more simple, we introduced
> m_ag_maxlevels to specify the maximum btree height of per-AG btrees for
> a given filesystem mount.  Then we observed that inode btrees don't
> actually have the same height and split that off; and now we have rmap
> and refcount btrees with much different geometries and separate
> maxlevels variables.
> 
> The 'ag' part of the name doesn't make much sense anymore, so rename
> this to m_allocbt_maxlevels to reinforce that this is the maximum height
                 ^^
You named it m_alloc_maxlevels....

> of the *free space* btrees.  This sets us up for the next patch, which
> will add a variable to track the maximum height of all AG btrees.
> 
> (Also take the opportunity to improve adjacent comments and fix minor
> style problems.)
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/libxfs/xfs_alloc.c       |   19 +++++++++++--------
>  fs/xfs/libxfs/xfs_alloc.h       |    2 +-
>  fs/xfs/libxfs/xfs_alloc_btree.c |    4 ++--
>  fs/xfs/libxfs/xfs_trans_resv.c  |    2 +-
>  fs/xfs/libxfs/xfs_trans_space.h |    2 +-
>  fs/xfs/scrub/agheader.c         |    4 ++--
>  fs/xfs/scrub/agheader_repair.c  |    4 ++--
>  fs/xfs/xfs_mount.h              |    4 ++--
>  8 files changed, 22 insertions(+), 19 deletions(-)

Other than that minor nit, the change looks good.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 12/17] xfs: compute maximum AG btree height for critical reservation calculation
  2021-10-14 20:18 ` [PATCH 12/17] xfs: compute maximum AG btree height for critical reservation calculation Darrick J. Wong
@ 2021-10-14 22:54   ` Dave Chinner
  2021-10-15  0:44     ` Darrick J. Wong
  0 siblings, 1 reply; 35+ messages in thread
From: Dave Chinner @ 2021-10-14 22:54 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, chandan.babu, hch

On Thu, Oct 14, 2021 at 01:18:00PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Compute the actual maximum AG btree height for deciding if a per-AG
> block reservation is critically low.  This only affects the sanity check
> condition, since we /generally/ will trigger on the 10% threshold.  This
> is a long-winded way of saying that we're removing one more usage of
> XFS_BTREE_MAXLEVELS.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/libxfs/xfs_ag_resv.c |    3 ++-
>  fs/xfs/xfs_mount.c          |   14 ++++++++++++++
>  fs/xfs/xfs_mount.h          |    1 +
>  3 files changed, 17 insertions(+), 1 deletion(-)

One minor nit below, otherwise it looks good.

Reviewed-by: Dave Chinner <dchinner@redhat.com>

> 
> 
> diff --git a/fs/xfs/libxfs/xfs_ag_resv.c b/fs/xfs/libxfs/xfs_ag_resv.c
> index 2aa2b3484c28..fe94058d4e9e 100644
> --- a/fs/xfs/libxfs/xfs_ag_resv.c
> +++ b/fs/xfs/libxfs/xfs_ag_resv.c
> @@ -91,7 +91,8 @@ xfs_ag_resv_critical(
>  	trace_xfs_ag_resv_critical(pag, type, avail);
>  
>  	/* Critically low if less than 10% or max btree height remains. */
> -	return XFS_TEST_ERROR(avail < orig / 10 || avail < XFS_BTREE_MAXLEVELS,
> +	return XFS_TEST_ERROR(avail < orig / 10 ||
> +			      avail < pag->pag_mount->m_agbtree_maxlevels,
>  			pag->pag_mount, XFS_ERRTAG_AG_RESV_CRITICAL);
>  }
>  
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index 06dac09eddbd..5be1dd63fac5 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -567,6 +567,18 @@ xfs_mount_setup_inode_geom(
>  	xfs_ialloc_setup_geometry(mp);
>  }
>  
> +/* Compute maximum possible height for per-AG btree types for this fs. */
> +static inline void
> +xfs_agbtree_compute_maxlevels(
> +	struct xfs_mount	*mp)
> +{
> +	unsigned int		ret;
> +
> +	ret = max(mp->m_alloc_maxlevels, M_IGEO(mp)->inobt_maxlevels);
> +	ret = max(ret, mp->m_rmap_maxlevels);
> +	mp->m_agbtree_maxlevels = max(ret, mp->m_refc_maxlevels);
> +}

"ret" should really be named "levels" here because it's not a return
value anymore...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 13/17] xfs: clean up xfs_btree_{calc_size,compute_maxlevels}
  2021-10-14 20:18 ` [PATCH 13/17] xfs: clean up xfs_btree_{calc_size,compute_maxlevels} Darrick J. Wong
@ 2021-10-14 22:58   ` Dave Chinner
  0 siblings, 0 replies; 35+ messages in thread
From: Dave Chinner @ 2021-10-14 22:58 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, chandan.babu, hch

On Thu, Oct 14, 2021 at 01:18:06PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> During review of the next patch, Dave remarked that he found these two
> btree geometry calculation functions lacking in documentation and that
> they performed more work than was really necessary.
> 
> These functions take the same parameters and have nearly the same logic;
> the only real difference is in the return values.  Reword the function
> comment to make it clearer what each function does, and move them to be
> adjacent to reinforce their relation.
> 
> Clean up both of them to stop opencoding the howmany functions, stop
> using the uint typedefs, and make them both support computations for
> more than 2^32 leaf records, since we're going to need all of the above
> for files with large data forks and large rmap btrees.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/libxfs/xfs_btree.c |   67 ++++++++++++++++++++++-----------------------
>  fs/xfs/libxfs/xfs_btree.h |    6 +++-
>  2 files changed, 37 insertions(+), 36 deletions(-)

Nice! that's soooo much easier for my simple brain to understand. :)

Reviewed-by: Dave Chinner <dchinner@redhat.com>

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 14/17] xfs: compute the maximum height of the rmap btree when reflink enabled
  2021-10-14 20:18 ` [PATCH 14/17] xfs: compute the maximum height of the rmap btree when reflink enabled Darrick J. Wong
@ 2021-10-14 23:03   ` Dave Chinner
  2021-10-15  0:48     ` Darrick J. Wong
  0 siblings, 1 reply; 35+ messages in thread
From: Dave Chinner @ 2021-10-14 23:03 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Chandan Babu R, linux-xfs, hch

On Thu, Oct 14, 2021 at 01:18:11PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Instead of assuming that the hardcoded XFS_BTREE_MAXLEVELS value is big
> enough to handle the maximally tall rmap btree when all blocks are in
> use and maximally shared, let's compute the maximum height assuming the
> rmapbt consumes as many blocks as possible.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
> ---
>  fs/xfs/libxfs/xfs_btree.c       |   33 +++++++++++++++++++++++++++++
>  fs/xfs/libxfs/xfs_btree.h       |    2 ++
>  fs/xfs/libxfs/xfs_rmap_btree.c  |   45 +++++++++++++++++++++++----------------
>  fs/xfs/libxfs/xfs_trans_resv.c  |   16 ++++++++++++++
>  fs/xfs/libxfs/xfs_trans_space.h |    7 ++++++
>  5 files changed, 85 insertions(+), 18 deletions(-)

Looks good.

Reviewed-by: Dave Chinner <dchinner@redhat.com>

>  /* Calculate the refcount btree size for some records. */
> diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
> index c879e7754ee6..6f83d9b306ee 100644
> --- a/fs/xfs/libxfs/xfs_trans_resv.c
> +++ b/fs/xfs/libxfs/xfs_trans_resv.c
> @@ -814,6 +814,19 @@ xfs_trans_resv_calc(
>  	struct xfs_mount	*mp,
>  	struct xfs_trans_resv	*resp)
>  {
> +	unsigned int		rmap_maxlevels = mp->m_rmap_maxlevels;
> +
> +	/*
> +	 * In the early days of rmap+reflink, we always set the rmap maxlevels
> +	 * to 9 even if the AG was small enough that it would never grow to
> +	 * that height.  Transaction reservation sizes influence the minimum
> +	 * log size calculation, which influences the size of the log that mkfs
> +	 * creates.  Use the old value here to ensure that newly formatted
> +	 * small filesystems will mount on older kernels.
> +	 */
> +	if (xfs_has_rmapbt(mp) && xfs_has_reflink(mp))
> +		mp->m_rmap_maxlevels = XFS_OLD_REFLINK_RMAP_MAXLEVELS;
> +

As an aside, what are your plans to get your "legacy minimum log
size reservations" calculation patch moved upstream so we can stop
having to care about this in future?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 01/17] xfs: fix incorrect decoding in xchk_btree_cur_fsbno
  2021-10-14 22:48   ` Dave Chinner
@ 2021-10-14 23:05     ` Darrick J. Wong
  2021-10-14 23:17       ` Dave Chinner
  0 siblings, 1 reply; 35+ messages in thread
From: Darrick J. Wong @ 2021-10-14 23:05 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, chandan.babu, hch

On Fri, Oct 15, 2021 at 09:48:13AM +1100, Dave Chinner wrote:
> On Thu, Oct 14, 2021 at 01:17:00PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > During review of subsequent patches, Dave and I noticed that this
> > function doesn't work quite right -- accessing cur->bc_ino depends on
> > the ROOT_IN_INODE flag, not LONG_PTRS.  Fix that and the parentheses
> > isssue.  While we're at it, remove the piece that accesses cur->bc_ag,
> > because block 0 of an AG is never part of a btree.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  fs/xfs/scrub/trace.c |    7 ++++---
> >  1 file changed, 4 insertions(+), 3 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/scrub/trace.c b/fs/xfs/scrub/trace.c
> > index c0ef53fe6611..93c13763c15e 100644
> > --- a/fs/xfs/scrub/trace.c
> > +++ b/fs/xfs/scrub/trace.c
> > @@ -24,10 +24,11 @@ xchk_btree_cur_fsbno(
> >  	if (level < cur->bc_nlevels && cur->bc_bufs[level])
> >  		return XFS_DADDR_TO_FSB(cur->bc_mp,
> >  				xfs_buf_daddr(cur->bc_bufs[level]));
> > -	if (level == cur->bc_nlevels - 1 && cur->bc_flags & XFS_BTREE_LONG_PTRS)
> > +
> > +	if (level == cur->bc_nlevels - 1 &&
> > +	    (cur->bc_flags & XFS_BTREE_ROOT_IN_INODE))
> >  		return XFS_INO_TO_FSB(cur->bc_mp, cur->bc_ino.ip->i_ino);
> 
> Ok.
> 
> > -	if (!(cur->bc_flags & XFS_BTREE_LONG_PTRS))
> > -		return XFS_AGB_TO_FSB(cur->bc_mp, cur->bc_ag.pag->pag_agno, 0);
> > +
> 
> But doesn't this break the tracing code on short pointers as the
> tracing code does:
> 
> 	TP_fast_assign(
> 		xfs_fsblock_t fsbno = xchk_btree_cur_fsbno(cur, level);
> 		...
> 		__entry->agno = XFS_FSB_TO_AGNO(cur->bc_mp, fsbno);
> 
> 
> i.e. the tracing will no longer give the correct agno for per-ag
> cursors that don't have any buffers attached to them at the current
> level?

Hmmm.  By that logic, maybe we should get rid of the (level ==
cur->bc_nlevels - 1) check entirely so that any cursor without a buffer
attached at that level will always provide *some* kind of breadcrumb to
the tracepoints?

I almost did that instead, except for the consideration that if you're
tracing the online fsck code, you should /probably/ have at least one of
xchk_stop/xrep_attempt/xchk_done included in the filter list so you can
be certain of what the kernel is checking.

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 16/17] xfs: compute absolute maximum nlevels for each btree type
  2021-10-14 20:18 ` [PATCH 16/17] xfs: compute absolute maximum nlevels for each btree type Darrick J. Wong
@ 2021-10-14 23:08   ` Dave Chinner
  0 siblings, 0 replies; 35+ messages in thread
From: Dave Chinner @ 2021-10-14 23:08 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, chandan.babu, hch

On Thu, Oct 14, 2021 at 01:18:22PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Add code for all five btree types so that we can compute the absolute
> maximum possible btree height for each btree type.  This is a setup for
> the next patch, which makes every btree type have its own cursor cache.
> 
> The functions are exported so that we can have xfs_db report the
> absolute maximum btree heights for each btree type, rather than making
> everyone run their own ad-hoc computations.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/libxfs/xfs_alloc.c          |    1 +
>  fs/xfs/libxfs/xfs_alloc_btree.c    |   33 ++++++++++++++++++-
>  fs/xfs/libxfs/xfs_alloc_btree.h    |    2 +
>  fs/xfs/libxfs/xfs_bmap.c           |    1 +
>  fs/xfs/libxfs/xfs_bmap_btree.c     |   31 +++++++++++++++++-
>  fs/xfs/libxfs/xfs_bmap_btree.h     |    2 +
>  fs/xfs/libxfs/xfs_fs.h             |    2 +
>  fs/xfs/libxfs/xfs_ialloc.c         |    1 +
>  fs/xfs/libxfs/xfs_ialloc_btree.c   |   61 ++++++++++++++++++++++++++++++++++--
>  fs/xfs/libxfs/xfs_ialloc_btree.h   |    2 +
>  fs/xfs/libxfs/xfs_refcount_btree.c |   45 ++++++++++++++++++++++-----
>  fs/xfs/libxfs/xfs_refcount_btree.h |    2 +
>  fs/xfs/libxfs/xfs_rmap_btree.c     |   43 +++++++++++++++++++++++--
>  fs/xfs/libxfs/xfs_rmap_btree.h     |    2 +
>  14 files changed, 207 insertions(+), 21 deletions(-)

It's a little more verbose than the previous version, but the clear
explanations of how we are calculating the maximum levels is a big
improvement. Nice!

Reviewed-by: Dave Chinner <dchinner@redhat.com>

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 17/17] xfs: use separate btree cursor cache for each btree type
  2021-10-14 20:18 ` [PATCH 17/17] xfs: use separate btree cursor cache " Darrick J. Wong
@ 2021-10-14 23:11   ` Dave Chinner
  0 siblings, 0 replies; 35+ messages in thread
From: Dave Chinner @ 2021-10-14 23:11 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, chandan.babu, hch

On Thu, Oct 14, 2021 at 01:18:28PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Now that we have the infrastructure to track the max possible height of
> each btree type, we can create a separate slab cache for cursors of each
> type of btree.  For smaller indices like the free space btrees, this
> means that we can pack more cursors into a slab page, improving slab
> utilization.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/libxfs/xfs_alloc_btree.c    |   23 +++++++++++++++++
>  fs/xfs/libxfs/xfs_alloc_btree.h    |    3 ++
>  fs/xfs/libxfs/xfs_bmap_btree.c     |   23 +++++++++++++++++
>  fs/xfs/libxfs/xfs_bmap_btree.h     |    3 ++
>  fs/xfs/libxfs/xfs_btree.c          |   48 ++++++++++++++++++++++++++++++++----
>  fs/xfs/libxfs/xfs_btree.h          |   20 ++++++---------
>  fs/xfs/libxfs/xfs_ialloc_btree.c   |   23 +++++++++++++++++
>  fs/xfs/libxfs/xfs_ialloc_btree.h   |    3 ++
>  fs/xfs/libxfs/xfs_refcount_btree.c |   23 +++++++++++++++++
>  fs/xfs/libxfs/xfs_refcount_btree.h |    3 ++
>  fs/xfs/libxfs/xfs_rmap_btree.c     |   23 +++++++++++++++++
>  fs/xfs/libxfs/xfs_rmap_btree.h     |    3 ++
>  fs/xfs/xfs_super.c                 |   13 +++++-----
>  13 files changed, 182 insertions(+), 29 deletions(-)

Looks good.

Reviewed-by: Dave Chinner <dchinner@redhat.com>

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 01/17] xfs: fix incorrect decoding in xchk_btree_cur_fsbno
  2021-10-14 23:05     ` Darrick J. Wong
@ 2021-10-14 23:17       ` Dave Chinner
  2021-10-15  0:37         ` Darrick J. Wong
  0 siblings, 1 reply; 35+ messages in thread
From: Dave Chinner @ 2021-10-14 23:17 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, chandan.babu, hch

On Thu, Oct 14, 2021 at 04:05:49PM -0700, Darrick J. Wong wrote:
> On Fri, Oct 15, 2021 at 09:48:13AM +1100, Dave Chinner wrote:
> > On Thu, Oct 14, 2021 at 01:17:00PM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > > 
> > > During review of subsequent patches, Dave and I noticed that this
> > > function doesn't work quite right -- accessing cur->bc_ino depends on
> > > the ROOT_IN_INODE flag, not LONG_PTRS.  Fix that and the parentheses
> > > isssue.  While we're at it, remove the piece that accesses cur->bc_ag,
> > > because block 0 of an AG is never part of a btree.
> > > 
> > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > ---
> > >  fs/xfs/scrub/trace.c |    7 ++++---
> > >  1 file changed, 4 insertions(+), 3 deletions(-)
> > > 
> > > 
> > > diff --git a/fs/xfs/scrub/trace.c b/fs/xfs/scrub/trace.c
> > > index c0ef53fe6611..93c13763c15e 100644
> > > --- a/fs/xfs/scrub/trace.c
> > > +++ b/fs/xfs/scrub/trace.c
> > > @@ -24,10 +24,11 @@ xchk_btree_cur_fsbno(
> > >  	if (level < cur->bc_nlevels && cur->bc_bufs[level])
> > >  		return XFS_DADDR_TO_FSB(cur->bc_mp,
> > >  				xfs_buf_daddr(cur->bc_bufs[level]));
> > > -	if (level == cur->bc_nlevels - 1 && cur->bc_flags & XFS_BTREE_LONG_PTRS)
> > > +
> > > +	if (level == cur->bc_nlevels - 1 &&
> > > +	    (cur->bc_flags & XFS_BTREE_ROOT_IN_INODE))
> > >  		return XFS_INO_TO_FSB(cur->bc_mp, cur->bc_ino.ip->i_ino);
> > 
> > Ok.
> > 
> > > -	if (!(cur->bc_flags & XFS_BTREE_LONG_PTRS))
> > > -		return XFS_AGB_TO_FSB(cur->bc_mp, cur->bc_ag.pag->pag_agno, 0);
> > > +
> > 
> > But doesn't this break the tracing code on short pointers as the
> > tracing code does:
> > 
> > 	TP_fast_assign(
> > 		xfs_fsblock_t fsbno = xchk_btree_cur_fsbno(cur, level);
> > 		...
> > 		__entry->agno = XFS_FSB_TO_AGNO(cur->bc_mp, fsbno);
> > 
> > 
> > i.e. the tracing will no longer give the correct agno for per-ag
> > cursors that don't have any buffers attached to them at the current
> > level?
> 
> Hmmm.  By that logic, maybe we should get rid of the (level ==
> cur->bc_nlevels - 1) check entirely so that any cursor without a buffer
> attached at that level will always provide *some* kind of breadcrumb to
> the tracepoints?

Perhaps so.

> I almost did that instead, except for the consideration that if you're
> tracing the online fsck code, you should /probably/ have at least one of
> xchk_stop/xrep_attempt/xchk_done included in the filter list so you can
> be certain of what the kernel is checking.

It doesn't worry me how we resolve this - I just pointed it out
because it's not mentioned in the commit message and I wanted to
make sure it wasn't an oversight. If there are better ways to get
this information from tracepoints and XFS_FSB_TO_AGNO() does
something safe and obvious with NULLFSBLOCK (e.g. ends up as -1)
then as long as there's mention of this change in the commit message
I'm fine with it as it is.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 01/17] xfs: fix incorrect decoding in xchk_btree_cur_fsbno
  2021-10-14 23:17       ` Dave Chinner
@ 2021-10-15  0:37         ` Darrick J. Wong
  0 siblings, 0 replies; 35+ messages in thread
From: Darrick J. Wong @ 2021-10-15  0:37 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, chandan.babu, hch

On Fri, Oct 15, 2021 at 10:17:02AM +1100, Dave Chinner wrote:
> On Thu, Oct 14, 2021 at 04:05:49PM -0700, Darrick J. Wong wrote:
> > On Fri, Oct 15, 2021 at 09:48:13AM +1100, Dave Chinner wrote:
> > > On Thu, Oct 14, 2021 at 01:17:00PM -0700, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <djwong@kernel.org>
> > > > 
> > > > During review of subsequent patches, Dave and I noticed that this
> > > > function doesn't work quite right -- accessing cur->bc_ino depends on
> > > > the ROOT_IN_INODE flag, not LONG_PTRS.  Fix that and the parentheses
> > > > isssue.  While we're at it, remove the piece that accesses cur->bc_ag,
> > > > because block 0 of an AG is never part of a btree.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > > ---
> > > >  fs/xfs/scrub/trace.c |    7 ++++---
> > > >  1 file changed, 4 insertions(+), 3 deletions(-)
> > > > 
> > > > 
> > > > diff --git a/fs/xfs/scrub/trace.c b/fs/xfs/scrub/trace.c
> > > > index c0ef53fe6611..93c13763c15e 100644
> > > > --- a/fs/xfs/scrub/trace.c
> > > > +++ b/fs/xfs/scrub/trace.c
> > > > @@ -24,10 +24,11 @@ xchk_btree_cur_fsbno(
> > > >  	if (level < cur->bc_nlevels && cur->bc_bufs[level])
> > > >  		return XFS_DADDR_TO_FSB(cur->bc_mp,
> > > >  				xfs_buf_daddr(cur->bc_bufs[level]));
> > > > -	if (level == cur->bc_nlevels - 1 && cur->bc_flags & XFS_BTREE_LONG_PTRS)
> > > > +
> > > > +	if (level == cur->bc_nlevels - 1 &&
> > > > +	    (cur->bc_flags & XFS_BTREE_ROOT_IN_INODE))
> > > >  		return XFS_INO_TO_FSB(cur->bc_mp, cur->bc_ino.ip->i_ino);
> > > 
> > > Ok.
> > > 
> > > > -	if (!(cur->bc_flags & XFS_BTREE_LONG_PTRS))
> > > > -		return XFS_AGB_TO_FSB(cur->bc_mp, cur->bc_ag.pag->pag_agno, 0);
> > > > +
> > > 
> > > But doesn't this break the tracing code on short pointers as the
> > > tracing code does:
> > > 
> > > 	TP_fast_assign(
> > > 		xfs_fsblock_t fsbno = xchk_btree_cur_fsbno(cur, level);
> > > 		...
> > > 		__entry->agno = XFS_FSB_TO_AGNO(cur->bc_mp, fsbno);
> > > 
> > > 
> > > i.e. the tracing will no longer give the correct agno for per-ag
> > > cursors that don't have any buffers attached to them at the current
> > > level?
> > 
> > Hmmm.  By that logic, maybe we should get rid of the (level ==
> > cur->bc_nlevels - 1) check entirely so that any cursor without a buffer
> > attached at that level will always provide *some* kind of breadcrumb to
> > the tracepoints?
> 
> Perhaps so.
> 
> > I almost did that instead, except for the consideration that if you're
> > tracing the online fsck code, you should /probably/ have at least one of
> > xchk_stop/xrep_attempt/xchk_done included in the filter list so you can
> > be certain of what the kernel is checking.
> 
> It doesn't worry me how we resolve this - I just pointed it out
> because it's not mentioned in the commit message and I wanted to
> make sure it wasn't an oversight. If there are better ways to get
> this information from tracepoints and XFS_FSB_TO_AGNO() does
> something safe and obvious with NULLFSBLOCK (e.g. ends up as -1)
> then as long as there's mention of this change in the commit message
> I'm fine with it as it is.

Ok.  I'll add the following to the commit message:

"Note: This changes the btree scrubber tracepoints behavior -- if the
cursor has no buffer for a certain level, it will always report
NULLFSBLOCK.  It is assumed that anyone tracing the online fsck code
will also be tracing xchk_start/xchk_done or otherwise be aware of what
exactly is being scrubbed."

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH v4.1 01/17] xfs: fix incorrect decoding in xchk_btree_cur_fsbno
  2021-10-14 20:17 ` [PATCH 01/17] xfs: fix incorrect decoding in xchk_btree_cur_fsbno Darrick J. Wong
  2021-10-14 22:48   ` Dave Chinner
@ 2021-10-15  0:42   ` Darrick J. Wong
  2021-10-18 22:26     ` Dave Chinner
  1 sibling, 1 reply; 35+ messages in thread
From: Darrick J. Wong @ 2021-10-15  0:42 UTC (permalink / raw)
  To: david; +Cc: linux-xfs, chandan.babu, hch

From: Darrick J. Wong <djwong@kernel.org>

During review of subsequent patches, Dave and I noticed that this
function doesn't work quite right -- accessing cur->bc_ino depends on
the ROOT_IN_INODE flag, not LONG_PTRS.  Fix that and the parentheses
isssue.  While we're at it, remove the piece that accesses cur->bc_ag,
because block 0 of an AG is never part of a btree.

Note: This changes the btree scrubber tracepoints behavior -- if the
cursor has no buffer for a certain level, it will always report
NULLFSBLOCK.  It is assumed that anyone tracing the online fsck code
will also be tracing xchk_start/xchk_done or otherwise be aware of what
exactly is being scrubbed.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
v4.1: note the new behavior in the commit message
---
 fs/xfs/scrub/trace.c |    7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/scrub/trace.c b/fs/xfs/scrub/trace.c
index c0ef53fe6611..93c13763c15e 100644
--- a/fs/xfs/scrub/trace.c
+++ b/fs/xfs/scrub/trace.c
@@ -24,10 +24,11 @@ xchk_btree_cur_fsbno(
 	if (level < cur->bc_nlevels && cur->bc_bufs[level])
 		return XFS_DADDR_TO_FSB(cur->bc_mp,
 				xfs_buf_daddr(cur->bc_bufs[level]));
-	if (level == cur->bc_nlevels - 1 && cur->bc_flags & XFS_BTREE_LONG_PTRS)
+
+	if (level == cur->bc_nlevels - 1 &&
+	    (cur->bc_flags & XFS_BTREE_ROOT_IN_INODE))
 		return XFS_INO_TO_FSB(cur->bc_mp, cur->bc_ino.ip->i_ino);
-	if (!(cur->bc_flags & XFS_BTREE_LONG_PTRS))
-		return XFS_AGB_TO_FSB(cur->bc_mp, cur->bc_ag.pag->pag_agno, 0);
+
 	return NULLFSBLOCK;
 }
 

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH 11/17] xfs: rename m_ag_maxlevels to m_allocbt_maxlevels
  2021-10-14 22:52   ` Dave Chinner
@ 2021-10-15  0:44     ` Darrick J. Wong
  0 siblings, 0 replies; 35+ messages in thread
From: Darrick J. Wong @ 2021-10-15  0:44 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, chandan.babu, hch

On Fri, Oct 15, 2021 at 09:52:10AM +1100, Dave Chinner wrote:
> On Thu, Oct 14, 2021 at 01:17:55PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Years ago when XFS was thought to be much more simple, we introduced
> > m_ag_maxlevels to specify the maximum btree height of per-AG btrees for
> > a given filesystem mount.  Then we observed that inode btrees don't
> > actually have the same height and split that off; and now we have rmap
> > and refcount btrees with much different geometries and separate
> > maxlevels variables.
> > 
> > The 'ag' part of the name doesn't make much sense anymore, so rename
> > this to m_allocbt_maxlevels to reinforce that this is the maximum height
>                  ^^
> You named it m_alloc_maxlevels....

Oops, fixed.

--D

> > of the *free space* btrees.  This sets us up for the next patch, which
> > will add a variable to track the maximum height of all AG btrees.
> > 
> > (Also take the opportunity to improve adjacent comments and fix minor
> > style problems.)
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  fs/xfs/libxfs/xfs_alloc.c       |   19 +++++++++++--------
> >  fs/xfs/libxfs/xfs_alloc.h       |    2 +-
> >  fs/xfs/libxfs/xfs_alloc_btree.c |    4 ++--
> >  fs/xfs/libxfs/xfs_trans_resv.c  |    2 +-
> >  fs/xfs/libxfs/xfs_trans_space.h |    2 +-
> >  fs/xfs/scrub/agheader.c         |    4 ++--
> >  fs/xfs/scrub/agheader_repair.c  |    4 ++--
> >  fs/xfs/xfs_mount.h              |    4 ++--
> >  8 files changed, 22 insertions(+), 19 deletions(-)
> 
> Other than that minor nit, the change looks good.
> 
> Reviewed-by: Dave Chinner <dchinner@redhat.com>
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 12/17] xfs: compute maximum AG btree height for critical reservation calculation
  2021-10-14 22:54   ` Dave Chinner
@ 2021-10-15  0:44     ` Darrick J. Wong
  0 siblings, 0 replies; 35+ messages in thread
From: Darrick J. Wong @ 2021-10-15  0:44 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, chandan.babu, hch

On Fri, Oct 15, 2021 at 09:54:47AM +1100, Dave Chinner wrote:
> On Thu, Oct 14, 2021 at 01:18:00PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Compute the actual maximum AG btree height for deciding if a per-AG
> > block reservation is critically low.  This only affects the sanity check
> > condition, since we /generally/ will trigger on the 10% threshold.  This
> > is a long-winded way of saying that we're removing one more usage of
> > XFS_BTREE_MAXLEVELS.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  fs/xfs/libxfs/xfs_ag_resv.c |    3 ++-
> >  fs/xfs/xfs_mount.c          |   14 ++++++++++++++
> >  fs/xfs/xfs_mount.h          |    1 +
> >  3 files changed, 17 insertions(+), 1 deletion(-)
> 
> One minor nit below, otherwise it looks good.
> 
> Reviewed-by: Dave Chinner <dchinner@redhat.com>
> 
> > 
> > 
> > diff --git a/fs/xfs/libxfs/xfs_ag_resv.c b/fs/xfs/libxfs/xfs_ag_resv.c
> > index 2aa2b3484c28..fe94058d4e9e 100644
> > --- a/fs/xfs/libxfs/xfs_ag_resv.c
> > +++ b/fs/xfs/libxfs/xfs_ag_resv.c
> > @@ -91,7 +91,8 @@ xfs_ag_resv_critical(
> >  	trace_xfs_ag_resv_critical(pag, type, avail);
> >  
> >  	/* Critically low if less than 10% or max btree height remains. */
> > -	return XFS_TEST_ERROR(avail < orig / 10 || avail < XFS_BTREE_MAXLEVELS,
> > +	return XFS_TEST_ERROR(avail < orig / 10 ||
> > +			      avail < pag->pag_mount->m_agbtree_maxlevels,
> >  			pag->pag_mount, XFS_ERRTAG_AG_RESV_CRITICAL);
> >  }
> >  
> > diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> > index 06dac09eddbd..5be1dd63fac5 100644
> > --- a/fs/xfs/xfs_mount.c
> > +++ b/fs/xfs/xfs_mount.c
> > @@ -567,6 +567,18 @@ xfs_mount_setup_inode_geom(
> >  	xfs_ialloc_setup_geometry(mp);
> >  }
> >  
> > +/* Compute maximum possible height for per-AG btree types for this fs. */
> > +static inline void
> > +xfs_agbtree_compute_maxlevels(
> > +	struct xfs_mount	*mp)
> > +{
> > +	unsigned int		ret;
> > +
> > +	ret = max(mp->m_alloc_maxlevels, M_IGEO(mp)->inobt_maxlevels);
> > +	ret = max(ret, mp->m_rmap_maxlevels);
> > +	mp->m_agbtree_maxlevels = max(ret, mp->m_refc_maxlevels);
> > +}
> 
> "ret" should really be named "levels" here because it's not a return
> value anymore...

Fixed.

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 14/17] xfs: compute the maximum height of the rmap btree when reflink enabled
  2021-10-14 23:03   ` Dave Chinner
@ 2021-10-15  0:48     ` Darrick J. Wong
  0 siblings, 0 replies; 35+ messages in thread
From: Darrick J. Wong @ 2021-10-15  0:48 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Chandan Babu R, linux-xfs, hch

On Fri, Oct 15, 2021 at 10:03:33AM +1100, Dave Chinner wrote:
> On Thu, Oct 14, 2021 at 01:18:11PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Instead of assuming that the hardcoded XFS_BTREE_MAXLEVELS value is big
> > enough to handle the maximally tall rmap btree when all blocks are in
> > use and maximally shared, let's compute the maximum height assuming the
> > rmapbt consumes as many blocks as possible.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
> > ---
> >  fs/xfs/libxfs/xfs_btree.c       |   33 +++++++++++++++++++++++++++++
> >  fs/xfs/libxfs/xfs_btree.h       |    2 ++
> >  fs/xfs/libxfs/xfs_rmap_btree.c  |   45 +++++++++++++++++++++++----------------
> >  fs/xfs/libxfs/xfs_trans_resv.c  |   16 ++++++++++++++
> >  fs/xfs/libxfs/xfs_trans_space.h |    7 ++++++
> >  5 files changed, 85 insertions(+), 18 deletions(-)
> 
> Looks good.
> 
> Reviewed-by: Dave Chinner <dchinner@redhat.com>
> 
> >  /* Calculate the refcount btree size for some records. */
> > diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
> > index c879e7754ee6..6f83d9b306ee 100644
> > --- a/fs/xfs/libxfs/xfs_trans_resv.c
> > +++ b/fs/xfs/libxfs/xfs_trans_resv.c
> > @@ -814,6 +814,19 @@ xfs_trans_resv_calc(
> >  	struct xfs_mount	*mp,
> >  	struct xfs_trans_resv	*resp)
> >  {
> > +	unsigned int		rmap_maxlevels = mp->m_rmap_maxlevels;
> > +
> > +	/*
> > +	 * In the early days of rmap+reflink, we always set the rmap maxlevels
> > +	 * to 9 even if the AG was small enough that it would never grow to
> > +	 * that height.  Transaction reservation sizes influence the minimum
> > +	 * log size calculation, which influences the size of the log that mkfs
> > +	 * creates.  Use the old value here to ensure that newly formatted
> > +	 * small filesystems will mount on older kernels.
> > +	 */
> > +	if (xfs_has_rmapbt(mp) && xfs_has_reflink(mp))
> > +		mp->m_rmap_maxlevels = XFS_OLD_REFLINK_RMAP_MAXLEVELS;
> > +
> 
> As an aside, what are your plans to get your "legacy minimum log
> size reservations" calculation patch moved upstream so we can stop
> having to care about this in future?

Hmm.  I think I'm just going to bail on those until 5.17.

At this point I have the kmem_zone renaming series (already on the list)
and another patchset to shrink the defer items and convert them to use
slab caches that I might push after the kmem_zone thing if we get that
far.

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v4.1 01/17] xfs: fix incorrect decoding in xchk_btree_cur_fsbno
  2021-10-15  0:42   ` [PATCH v4.1 " Darrick J. Wong
@ 2021-10-18 22:26     ` Dave Chinner
  0 siblings, 0 replies; 35+ messages in thread
From: Dave Chinner @ 2021-10-18 22:26 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, chandan.babu, hch

On Thu, Oct 14, 2021 at 05:42:03PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> During review of subsequent patches, Dave and I noticed that this
> function doesn't work quite right -- accessing cur->bc_ino depends on
> the ROOT_IN_INODE flag, not LONG_PTRS.  Fix that and the parentheses
> isssue.  While we're at it, remove the piece that accesses cur->bc_ag,
> because block 0 of an AG is never part of a btree.
> 
> Note: This changes the btree scrubber tracepoints behavior -- if the
> cursor has no buffer for a certain level, it will always report
> NULLFSBLOCK.  It is assumed that anyone tracing the online fsck code
> will also be tracing xchk_start/xchk_done or otherwise be aware of what
> exactly is being scrubbed.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
> v4.1: note the new behavior in the commit message
> ---
>  fs/xfs/scrub/trace.c |    7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/xfs/scrub/trace.c b/fs/xfs/scrub/trace.c
> index c0ef53fe6611..93c13763c15e 100644
> --- a/fs/xfs/scrub/trace.c
> +++ b/fs/xfs/scrub/trace.c
> @@ -24,10 +24,11 @@ xchk_btree_cur_fsbno(
>  	if (level < cur->bc_nlevels && cur->bc_bufs[level])
>  		return XFS_DADDR_TO_FSB(cur->bc_mp,
>  				xfs_buf_daddr(cur->bc_bufs[level]));
> -	if (level == cur->bc_nlevels - 1 && cur->bc_flags & XFS_BTREE_LONG_PTRS)
> +
> +	if (level == cur->bc_nlevels - 1 &&
> +	    (cur->bc_flags & XFS_BTREE_ROOT_IN_INODE))
>  		return XFS_INO_TO_FSB(cur->bc_mp, cur->bc_ino.ip->i_ino);
> -	if (!(cur->bc_flags & XFS_BTREE_LONG_PTRS))
> -		return XFS_AGB_TO_FSB(cur->bc_mp, cur->bc_ag.pag->pag_agno, 0);
> +
>  	return NULLFSBLOCK;
>  }

Looks good with the update to the commit message.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2021-10-18 22:26 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-10-14 20:16 [PATCHSET v4 00/17] xfs: support dynamic btree cursor height Darrick J. Wong
2021-10-14 20:17 ` [PATCH 01/17] xfs: fix incorrect decoding in xchk_btree_cur_fsbno Darrick J. Wong
2021-10-14 22:48   ` Dave Chinner
2021-10-14 23:05     ` Darrick J. Wong
2021-10-14 23:17       ` Dave Chinner
2021-10-15  0:37         ` Darrick J. Wong
2021-10-15  0:42   ` [PATCH v4.1 " Darrick J. Wong
2021-10-18 22:26     ` Dave Chinner
2021-10-14 20:17 ` [PATCH 02/17] xfs: remove xfs_btree_cur.bc_blocklog Darrick J. Wong
2021-10-14 20:17 ` [PATCH 03/17] xfs: reduce the size of nr_ops for refcount btree cursors Darrick J. Wong
2021-10-14 20:17 ` [PATCH 04/17] xfs: don't track firstrec/firstkey separately in xchk_btree Darrick J. Wong
2021-10-14 20:17 ` [PATCH 05/17] xfs: dynamically allocate btree scrub context structure Darrick J. Wong
2021-10-14 22:48   ` Dave Chinner
2021-10-14 20:17 ` [PATCH 06/17] xfs: prepare xfs_btree_cur for dynamic cursor heights Darrick J. Wong
2021-10-14 22:49   ` Dave Chinner
2021-10-14 20:17 ` [PATCH 07/17] xfs: rearrange xfs_btree_cur fields for better packing Darrick J. Wong
2021-10-14 20:17 ` [PATCH 08/17] xfs: refactor btree cursor allocation function Darrick J. Wong
2021-10-14 20:17 ` [PATCH 09/17] xfs: encode the max btree height in the cursor Darrick J. Wong
2021-10-14 20:17 ` [PATCH 10/17] xfs: dynamically allocate cursors based on maxlevels Darrick J. Wong
2021-10-14 20:17 ` [PATCH 11/17] xfs: rename m_ag_maxlevels to m_allocbt_maxlevels Darrick J. Wong
2021-10-14 22:52   ` Dave Chinner
2021-10-15  0:44     ` Darrick J. Wong
2021-10-14 20:18 ` [PATCH 12/17] xfs: compute maximum AG btree height for critical reservation calculation Darrick J. Wong
2021-10-14 22:54   ` Dave Chinner
2021-10-15  0:44     ` Darrick J. Wong
2021-10-14 20:18 ` [PATCH 13/17] xfs: clean up xfs_btree_{calc_size,compute_maxlevels} Darrick J. Wong
2021-10-14 22:58   ` Dave Chinner
2021-10-14 20:18 ` [PATCH 14/17] xfs: compute the maximum height of the rmap btree when reflink enabled Darrick J. Wong
2021-10-14 23:03   ` Dave Chinner
2021-10-15  0:48     ` Darrick J. Wong
2021-10-14 20:18 ` [PATCH 15/17] xfs: kill XFS_BTREE_MAXLEVELS Darrick J. Wong
2021-10-14 20:18 ` [PATCH 16/17] xfs: compute absolute maximum nlevels for each btree type Darrick J. Wong
2021-10-14 23:08   ` Dave Chinner
2021-10-14 20:18 ` [PATCH 17/17] xfs: use separate btree cursor cache " Darrick J. Wong
2021-10-14 23:11   ` Dave Chinner

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.