* [PATCHSET v3 00/15] xfs: support dynamic btree cursor height
@ 2021-10-12 23:32 Darrick J. Wong
  2021-10-12 23:32 ` [PATCH 01/15] xfs: remove xfs_btree_cur.bc_blocklog Darrick J. Wong
                   ` (14 more replies)
  0 siblings, 15 replies; 41+ messages in thread
From: Darrick J. Wong @ 2021-10-12 23:32 UTC (permalink / raw)
  To: djwong, david
  Cc: Chandan Babu R, Christoph Hellwig, linux-xfs, chandan.babu, hch

Hi all,

In what's left of this series, we rearrange the incore btree cursor so
that we can support btrees of any height.  This will become necessary
for realtime rmap and reflink since we'd like to handle tall trees
without bloating the AG btree cursors.

Chandan Babu pointed out that his large extent counters series depends
on the ability to have btree cursors of arbitrary heights, so I've
ported this to 5.15-rc4 so his patchsets won't have to depend on
djwong-dev for submission.

Following the review discussions about the dynamic btree cursor height
patches, I've thrown together another series to reduce the size of the
btree cursor, compute the absolute maximum possible btree height for
each btree type, and give each btree cursor type its own slab cache:

$ grep xfs.*cur /proc/slabinfo
xfs_refcbt_cur 0 0 200 20 1 : tunables 0 0 0 : slabdata 4 4 0
xfs_rmapbt_cur 0 0 248 16 1 : tunables 0 0 0 : slabdata 4 4 0
xfs_bmbt_cur   0 0 248 16 1 : tunables 0 0 0 : slabdata 4 4 0
xfs_inobt_cur  0 0 216 18 1 : tunables 0 0 0 : slabdata 4 4 0
xfs_bnobt_cur  0 0 216 18 1 : tunables 0 0 0 : slabdata 4 4 0

I've also rigged up the debugger to make it easier to extract the actual
height information:

$ xfs_db /dev/sda -c 'btheight -w absmax all'
bnobt: 7
cntbt: 7
inobt: 7
finobt: 7
bmapbt: 9
refcountbt: 6
rmapbt: 9

As you can see from the slabinfo output, we no longer allocate 224-byte
cursors for all five btree types.  Even with the extra overhead of
supporting dynamic cursor sizes and per-btree caches, we still come out
ahead on cursor size for three of the five btree types.
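
To spell out the arithmetic (the object size is the third numeric
column in the slabinfo output above, and 224 bytes was the old
one-size-fits-all cursor):

  refcbt: 200 bytes (24 smaller than before)
  inobt:  216 bytes (8 smaller)
  bnobt:  216 bytes (8 smaller)
  bmbt:   248 bytes (24 larger, since it has to reach 9 levels)
  rmapbt: 248 bytes (24 larger, since it has to reach 9 levels)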

This series now also includes a couple of patches to reduce holes and
unnecessary fields in the btree cursor.

v2: reduce scrub btree checker memory footprint even more, put the lone
    fix patch first, use struct_size, fix 80-column problems, move all
    the btree cache work to a separate series
v3: rebase to 5.15-rc4, fold in the per-btree cursor cache patches,
    remove all the references to "zones" since they're called "caches"
    in Linux

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=btree-dynamic-depth-5.16
---
 fs/xfs/libxfs/xfs_ag_resv.c        |   18 ++
 fs/xfs/libxfs/xfs_alloc.c          |    7 +
 fs/xfs/libxfs/xfs_alloc_btree.c    |   39 ++++-
 fs/xfs/libxfs/xfs_alloc_btree.h    |    5 +
 fs/xfs/libxfs/xfs_bmap.c           |   13 +-
 fs/xfs/libxfs/xfs_bmap_btree.c     |   41 +++++
 fs/xfs/libxfs/xfs_bmap_btree.h     |    5 +
 fs/xfs/libxfs/xfs_btree.c          |  270 +++++++++++++++++++++++-------------
 fs/xfs/libxfs/xfs_btree.h          |   79 ++++++++---
 fs/xfs/libxfs/xfs_btree_staging.c  |   10 +
 fs/xfs/libxfs/xfs_fs.h             |    2 
 fs/xfs/libxfs/xfs_ialloc.c         |    1 
 fs/xfs/libxfs/xfs_ialloc_btree.c   |   46 +++++-
 fs/xfs/libxfs/xfs_ialloc_btree.h   |    5 +
 fs/xfs/libxfs/xfs_refcount_btree.c |   46 +++++-
 fs/xfs/libxfs/xfs_refcount_btree.h |    5 +
 fs/xfs/libxfs/xfs_rmap_btree.c     |  108 +++++++++++---
 fs/xfs/libxfs/xfs_rmap_btree.h     |    5 +
 fs/xfs/libxfs/xfs_trans_resv.c     |   13 ++
 fs/xfs/libxfs/xfs_trans_space.h    |    7 +
 fs/xfs/scrub/bitmap.c              |   22 +--
 fs/xfs/scrub/bmap.c                |    2 
 fs/xfs/scrub/btree.c               |   77 +++++-----
 fs/xfs/scrub/btree.h               |   13 +-
 fs/xfs/scrub/trace.c               |    7 +
 fs/xfs/scrub/trace.h               |   10 +
 fs/xfs/xfs_super.c                 |   53 ++++++-
 fs/xfs/xfs_trace.h                 |    2 
 28 files changed, 660 insertions(+), 251 deletions(-)


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 01/15] xfs: remove xfs_btree_cur.bc_blocklog
  2021-10-12 23:32 [PATCHSET v3 00/15] xfs: support dynamic btree cursor height Darrick J. Wong
@ 2021-10-12 23:32 ` Darrick J. Wong
  2021-10-13  0:56   ` Dave Chinner
  2021-10-12 23:32 ` [PATCH 02/15] xfs: reduce the size of nr_ops for refcount btree cursors Darrick J. Wong
                   ` (13 subsequent siblings)
  14 siblings, 1 reply; 41+ messages in thread
From: Darrick J. Wong @ 2021-10-12 23:32 UTC (permalink / raw)
  To: djwong, david; +Cc: linux-xfs, chandan.babu, hch

From: Darrick J. Wong <djwong@kernel.org>

This field isn't used by anyone, so get rid of it.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_alloc_btree.c    |    1 -
 fs/xfs/libxfs/xfs_bmap_btree.c     |    1 -
 fs/xfs/libxfs/xfs_btree.h          |    1 -
 fs/xfs/libxfs/xfs_ialloc_btree.c   |    2 --
 fs/xfs/libxfs/xfs_refcount_btree.c |    1 -
 fs/xfs/libxfs/xfs_rmap_btree.c     |    1 -
 6 files changed, 7 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_alloc_btree.c b/fs/xfs/libxfs/xfs_alloc_btree.c
index 6746fd735550..152ed2a202f4 100644
--- a/fs/xfs/libxfs/xfs_alloc_btree.c
+++ b/fs/xfs/libxfs/xfs_alloc_btree.c
@@ -482,7 +482,6 @@ xfs_allocbt_init_common(
 	cur->bc_tp = tp;
 	cur->bc_mp = mp;
 	cur->bc_btnum = btnum;
-	cur->bc_blocklog = mp->m_sb.sb_blocklog;
 	cur->bc_ag.abt.active = false;
 
 	if (btnum == XFS_BTNUM_CNT) {
diff --git a/fs/xfs/libxfs/xfs_bmap_btree.c b/fs/xfs/libxfs/xfs_bmap_btree.c
index 72444b8b38a6..a43dea8d6a65 100644
--- a/fs/xfs/libxfs/xfs_bmap_btree.c
+++ b/fs/xfs/libxfs/xfs_bmap_btree.c
@@ -558,7 +558,6 @@ xfs_bmbt_init_cursor(
 	cur->bc_mp = mp;
 	cur->bc_nlevels = be16_to_cpu(ifp->if_broot->bb_level) + 1;
 	cur->bc_btnum = XFS_BTNUM_BMAP;
-	cur->bc_blocklog = mp->m_sb.sb_blocklog;
 	cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_bmbt_2);
 
 	cur->bc_ops = &xfs_bmbt_ops;
diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index 513ade4a89f8..49ecc496238f 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -229,7 +229,6 @@ struct xfs_btree_cur
 #define	XFS_BTCUR_LEFTRA	1	/* left sibling has been read-ahead */
 #define	XFS_BTCUR_RIGHTRA	2	/* right sibling has been read-ahead */
 	uint8_t		bc_nlevels;	/* number of levels in the tree */
-	uint8_t		bc_blocklog;	/* log2(blocksize) of btree blocks */
 	xfs_btnum_t	bc_btnum;	/* identifies which btree type */
 	int		bc_statoff;	/* offset of btre stats array */
 
diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
index 27190840c5d8..10736b89b679 100644
--- a/fs/xfs/libxfs/xfs_ialloc_btree.c
+++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
@@ -444,8 +444,6 @@ xfs_inobt_init_common(
 		cur->bc_ops = &xfs_finobt_ops;
 	}
 
-	cur->bc_blocklog = mp->m_sb.sb_blocklog;
-
 	if (xfs_has_crc(mp))
 		cur->bc_flags |= XFS_BTREE_CRC_BLOCKS;
 
diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c
index 1ef9b99962ab..3ea589f15b14 100644
--- a/fs/xfs/libxfs/xfs_refcount_btree.c
+++ b/fs/xfs/libxfs/xfs_refcount_btree.c
@@ -326,7 +326,6 @@ xfs_refcountbt_init_common(
 	cur->bc_tp = tp;
 	cur->bc_mp = mp;
 	cur->bc_btnum = XFS_BTNUM_REFC;
-	cur->bc_blocklog = mp->m_sb.sb_blocklog;
 	cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_refcbt_2);
 
 	cur->bc_flags |= XFS_BTREE_CRC_BLOCKS;
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
index b7dbbfb3aeed..d65bf3c6f25e 100644
--- a/fs/xfs/libxfs/xfs_rmap_btree.c
+++ b/fs/xfs/libxfs/xfs_rmap_btree.c
@@ -457,7 +457,6 @@ xfs_rmapbt_init_common(
 	/* Overlapping btree; 2 keys per pointer. */
 	cur->bc_btnum = XFS_BTNUM_RMAP;
 	cur->bc_flags = XFS_BTREE_CRC_BLOCKS | XFS_BTREE_OVERLAPPING;
-	cur->bc_blocklog = mp->m_sb.sb_blocklog;
 	cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_rmap_2);
 	cur->bc_ops = &xfs_rmapbt_ops;
 


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 02/15] xfs: reduce the size of nr_ops for refcount btree cursors
  2021-10-12 23:32 [PATCHSET v3 00/15] xfs: support dynamic btree cursor height Darrick J. Wong
  2021-10-12 23:32 ` [PATCH 01/15] xfs: remove xfs_btree_cur.bc_blocklog Darrick J. Wong
@ 2021-10-12 23:32 ` Darrick J. Wong
  2021-10-13  0:57   ` Dave Chinner
  2021-10-12 23:32 ` [PATCH 03/15] xfs: don't track firstrec/firstkey separately in xchk_btree Darrick J. Wong
                   ` (12 subsequent siblings)
  14 siblings, 1 reply; 41+ messages in thread
From: Darrick J. Wong @ 2021-10-12 23:32 UTC (permalink / raw)
  To: djwong, david; +Cc: linux-xfs, chandan.babu, hch

From: Darrick J. Wong <djwong@kernel.org>

We're never going to run more than 4 billion btree operations on a
refcount cursor, so shrink the field to an unsigned int to reduce the
structure size.  Fix whitespace alignment too.
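
If you want to see the effect of the shrink in isolation, here's a tiny
standalone illustration (a userspace sketch, not kernel code; the sizes
assume a typical LP64 build and the struct names are made up):

#include <stdio.h>

/* Before: an 8-byte nr_ops plus a 4-byte int leaves 4 bytes of tail
 * padding, 16 bytes total.  After: two 4-byte unsigned ints pack into
 * 8 bytes, and the containing union shrinks accordingly. */
struct refc_old { unsigned long nr_ops; int shape_changes; };
struct refc_new { unsigned int nr_ops; unsigned int shape_changes; };

int main(void)
{
        printf("old=%zu new=%zu\n",
               sizeof(struct refc_old), sizeof(struct refc_new));
        return 0;
}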

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_btree.h |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index 49ecc496238f..1018bcc43d66 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -181,18 +181,18 @@ union xfs_btree_irec {
 
 /* Per-AG btree information. */
 struct xfs_btree_cur_ag {
-	struct xfs_perag	*pag;
+	struct xfs_perag		*pag;
 	union {
 		struct xfs_buf		*agbp;
 		struct xbtree_afakeroot	*afake;	/* for staging cursor */
 	};
 	union {
 		struct {
-			unsigned long nr_ops;	/* # record updates */
-			int	shape_changes;	/* # of extent splits */
+			unsigned int	nr_ops;	/* # record updates */
+			unsigned int	shape_changes;	/* # of extent splits */
 		} refc;
 		struct {
-			bool	active;		/* allocation cursor state */
+			bool		active;	/* allocation cursor state */
 		} abt;
 	};
 };


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 03/15] xfs: don't track firstrec/firstkey separately in xchk_btree
  2021-10-12 23:32 [PATCHSET v3 00/15] xfs: support dynamic btree cursor height Darrick J. Wong
  2021-10-12 23:32 ` [PATCH 01/15] xfs: remove xfs_btree_cur.bc_blocklog Darrick J. Wong
  2021-10-12 23:32 ` [PATCH 02/15] xfs: reduce the size of nr_ops for refcount btree cursors Darrick J. Wong
@ 2021-10-12 23:32 ` Darrick J. Wong
  2021-10-13  1:02   ` Dave Chinner
  2021-10-12 23:32 ` [PATCH 04/15] xfs: dynamically allocate btree scrub context structure Darrick J. Wong
                   ` (11 subsequent siblings)
  14 siblings, 1 reply; 41+ messages in thread
From: Darrick J. Wong @ 2021-10-12 23:32 UTC (permalink / raw)
  To: djwong, david; +Cc: linux-xfs, chandan.babu, hch

From: Darrick J. Wong <djwong@kernel.org>

The btree scrubbing code checks that the records (or keys) that it finds
in a btree block are all in order by calling the btree cursor's
->recs_inorder (or ->keys_inorder) function.  This of course makes no
sense for the first item in the block, so we switch that off with a
separate variable in struct xchk_btree.

Christoph helped me figure out that the variables are unnecessary: the
per-level pointers in bc_ptrs are 1-based, so checking whether
bc_ptrs[level] is greater than 1 tells us there's a previous item to
compare against.  Use that, and save ourselves some memory.
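
In other words (this is the level-0 check from the hunk below, with an
extra comment spelling out the 1-based indexing):

/*
 * bc_ptrs[0] == 1 means we're looking at the first record in this leaf
 * block, so there's no previous record to compare against; anything
 * greater than 1 has a predecessor.
 */
if (cur->bc_ptrs[0] > 1 &&
    !cur->bc_ops->recs_inorder(cur, &bs->lastrec, rec))
        xchk_btree_set_corrupt(bs->sc, cur, 0);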

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/btree.c |   11 +++--------
 fs/xfs/scrub/btree.h |    2 --
 2 files changed, 3 insertions(+), 10 deletions(-)


diff --git a/fs/xfs/scrub/btree.c b/fs/xfs/scrub/btree.c
index 26dcb4691e31..d5e1ca521fc4 100644
--- a/fs/xfs/scrub/btree.c
+++ b/fs/xfs/scrub/btree.c
@@ -141,9 +141,9 @@ xchk_btree_rec(
 	trace_xchk_btree_rec(bs->sc, cur, 0);
 
 	/* If this isn't the first record, are they in order? */
-	if (!bs->firstrec && !cur->bc_ops->recs_inorder(cur, &bs->lastrec, rec))
+	if (cur->bc_ptrs[0] > 1 &&
+	    !cur->bc_ops->recs_inorder(cur, &bs->lastrec, rec))
 		xchk_btree_set_corrupt(bs->sc, cur, 0);
-	bs->firstrec = false;
 	memcpy(&bs->lastrec, rec, cur->bc_ops->rec_len);
 
 	if (cur->bc_nlevels == 1)
@@ -188,10 +188,9 @@ xchk_btree_key(
 	trace_xchk_btree_key(bs->sc, cur, level);
 
 	/* If this isn't the first key, are they in order? */
-	if (!bs->firstkey[level] &&
+	if (cur->bc_ptrs[level] > 1 &&
 	    !cur->bc_ops->keys_inorder(cur, &bs->lastkey[level], key))
 		xchk_btree_set_corrupt(bs->sc, cur, level);
-	bs->firstkey[level] = false;
 	memcpy(&bs->lastkey[level], key, cur->bc_ops->key_len);
 
 	if (level + 1 >= cur->bc_nlevels)
@@ -636,7 +635,6 @@ xchk_btree(
 	struct xfs_buf			*bp;
 	struct check_owner		*co;
 	struct check_owner		*n;
-	int				i;
 	int				error = 0;
 
 	/*
@@ -649,13 +647,10 @@ xchk_btree(
 	bs->cur = cur;
 	bs->scrub_rec = scrub_fn;
 	bs->oinfo = oinfo;
-	bs->firstrec = true;
 	bs->private = private;
 	bs->sc = sc;
 
 	/* Initialize scrub state */
-	for (i = 0; i < XFS_BTREE_MAXLEVELS; i++)
-		bs->firstkey[i] = true;
 	INIT_LIST_HEAD(&bs->to_check);
 
 	/* Don't try to check a tree with a height we can't handle. */
diff --git a/fs/xfs/scrub/btree.h b/fs/xfs/scrub/btree.h
index b7d2fc01fbf9..7671108f9f85 100644
--- a/fs/xfs/scrub/btree.h
+++ b/fs/xfs/scrub/btree.h
@@ -39,9 +39,7 @@ struct xchk_btree {
 
 	/* internal scrub state */
 	union xfs_btree_rec		lastrec;
-	bool				firstrec;
 	union xfs_btree_key		lastkey[XFS_BTREE_MAXLEVELS];
-	bool				firstkey[XFS_BTREE_MAXLEVELS];
 	struct list_head		to_check;
 };
 int xchk_btree(struct xfs_scrub *sc, struct xfs_btree_cur *cur,


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 04/15] xfs: dynamically allocate btree scrub context structure
  2021-10-12 23:32 [PATCHSET v3 00/15] xfs: support dynamic btree cursor height Darrick J. Wong
                   ` (2 preceding siblings ...)
  2021-10-12 23:32 ` [PATCH 03/15] xfs: don't track firstrec/firstkey separately in xchk_btree Darrick J. Wong
@ 2021-10-12 23:32 ` Darrick J. Wong
  2021-10-13  4:57   ` Dave Chinner
  2021-10-12 23:33 ` [PATCH 05/15] xfs: support dynamic btree cursor heights Darrick J. Wong
                   ` (10 subsequent siblings)
  14 siblings, 1 reply; 41+ messages in thread
From: Darrick J. Wong @ 2021-10-12 23:32 UTC (permalink / raw)
  To: djwong, david; +Cc: linux-xfs, chandan.babu, hch

From: Darrick J. Wong <djwong@kernel.org>

Reorganize struct xchk_btree so that we can dynamically size the context
structure to fit the type of btree cursor that we have.  This will
enable us to use memory more efficiently once we start adding very tall
btree types.  Right-size the lastkey array so that we stop wasting the
first array element.
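
For anyone unfamiliar with sizing a structure that ends in a flexible
array, here's a standalone sketch of the arithmetic (a userspace demo
with made-up type names; the kernel's struct_size() helper from
linux/overflow.h does the same computation with overflow checking):

#include <stddef.h>
#include <stdio.h>

struct demo_ctx {
        int     state;          /* fixed-size header */
        long    lastkey[];      /* one slot per level above the leaf */
};

/* The leaf level never needs a lastkey slot, hence nlevels - 1. */
static size_t demo_sizeof(unsigned int nlevels)
{
        return sizeof(struct demo_ctx) + (nlevels - 1) * sizeof(long);
}

int main(void)
{
        printf("2-level tree -> %zu bytes\n", demo_sizeof(2));
        return 0;
}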

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/scrub/btree.c |   23 ++++++++++++-----------
 fs/xfs/scrub/btree.h |   11 ++++++++++-
 2 files changed, 22 insertions(+), 12 deletions(-)


diff --git a/fs/xfs/scrub/btree.c b/fs/xfs/scrub/btree.c
index d5e1ca521fc4..6d4eba85ef77 100644
--- a/fs/xfs/scrub/btree.c
+++ b/fs/xfs/scrub/btree.c
@@ -189,9 +189,9 @@ xchk_btree_key(
 
 	/* If this isn't the first key, are they in order? */
 	if (cur->bc_ptrs[level] > 1 &&
-	    !cur->bc_ops->keys_inorder(cur, &bs->lastkey[level], key))
+	    !cur->bc_ops->keys_inorder(cur, &bs->lastkey[level - 1], key))
 		xchk_btree_set_corrupt(bs->sc, cur, level);
-	memcpy(&bs->lastkey[level], key, cur->bc_ops->key_len);
+	memcpy(&bs->lastkey[level - 1], key, cur->bc_ops->key_len);
 
 	if (level + 1 >= cur->bc_nlevels)
 		return;
@@ -631,17 +631,24 @@ xchk_btree(
 	union xfs_btree_ptr		*pp;
 	union xfs_btree_rec		*recp;
 	struct xfs_btree_block		*block;
-	int				level;
 	struct xfs_buf			*bp;
 	struct check_owner		*co;
 	struct check_owner		*n;
+	size_t				cur_sz;
+	int				level;
 	int				error = 0;
 
 	/*
 	 * Allocate the btree scrub context from the heap, because this
-	 * structure can get rather large.
+	 * structure can get rather large.  Don't let a caller feed us a
+	 * totally absurd size.
 	 */
-	bs = kmem_zalloc(sizeof(struct xchk_btree), KM_NOFS | KM_MAYFAIL);
+	cur_sz = xchk_btree_sizeof(cur->bc_nlevels);
+	if (cur_sz > PAGE_SIZE) {
+		xchk_btree_set_corrupt(sc, cur, 0);
+		return 0;
+	}
+	bs = kmem_zalloc(cur_sz, KM_NOFS | KM_MAYFAIL);
 	if (!bs)
 		return -ENOMEM;
 	bs->cur = cur;
@@ -653,12 +660,6 @@ xchk_btree(
 	/* Initialize scrub state */
 	INIT_LIST_HEAD(&bs->to_check);
 
-	/* Don't try to check a tree with a height we can't handle. */
-	if (cur->bc_nlevels > XFS_BTREE_MAXLEVELS) {
-		xchk_btree_set_corrupt(sc, cur, 0);
-		goto out;
-	}
-
 	/*
 	 * Load the root of the btree.  The helper function absorbs
 	 * error codes for us.
diff --git a/fs/xfs/scrub/btree.h b/fs/xfs/scrub/btree.h
index 7671108f9f85..62c3091ef20f 100644
--- a/fs/xfs/scrub/btree.h
+++ b/fs/xfs/scrub/btree.h
@@ -39,9 +39,18 @@ struct xchk_btree {
 
 	/* internal scrub state */
 	union xfs_btree_rec		lastrec;
-	union xfs_btree_key		lastkey[XFS_BTREE_MAXLEVELS];
 	struct list_head		to_check;
+
+	/* this element must come last! */
+	union xfs_btree_key		lastkey[];
 };
+
+static inline size_t
+xchk_btree_sizeof(unsigned int nlevels)
+{
+	return struct_size((struct xchk_btree *)NULL, lastkey, nlevels - 1);
+}
+
 int xchk_btree(struct xfs_scrub *sc, struct xfs_btree_cur *cur,
 		xchk_btree_rec_fn scrub_fn, const struct xfs_owner_info *oinfo,
 		void *private);


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 05/15] xfs: support dynamic btree cursor heights
  2021-10-12 23:32 [PATCHSET v3 00/15] xfs: support dynamic btree cursor height Darrick J. Wong
                   ` (3 preceding siblings ...)
  2021-10-12 23:32 ` [PATCH 04/15] xfs: dynamically allocate btree scrub context structure Darrick J. Wong
@ 2021-10-12 23:33 ` Darrick J. Wong
  2021-10-13  5:31   ` Dave Chinner
  2021-10-12 23:33 ` [PATCH 06/15] xfs: rearrange xfs_btree_cur fields for better packing Darrick J. Wong
                   ` (9 subsequent siblings)
  14 siblings, 1 reply; 41+ messages in thread
From: Darrick J. Wong @ 2021-10-12 23:33 UTC (permalink / raw)
  To: djwong, david
  Cc: Chandan Babu R, Christoph Hellwig, linux-xfs, chandan.babu, hch

From: Darrick J. Wong <djwong@kernel.org>

Split out the btree level information into a separate struct and put it
at the end of the cursor structure as a flexible array.  The realtime
rmap btree (which is rooted in an inode) will require the ability to
support many more levels than a per-AG btree cursor, which means that
we're going to create two btree cursor caches to conserve memory for
the more common case.
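
To illustrate why the level array has to be the last member, here's a
rough sketch of how a right-sized cursor could be carved out once later
patches stop using one fixed-size cache (illustrative only; this patch
still sizes the single cursor cache for XFS_BTREE_MAXLEVELS, and a
plain kmem_zalloc stands in for the real slab allocation):

/*
 * Sketch: a btree type whose trees never exceed 'maxlevels' only needs
 * xfs_btree_cur_sizeof(maxlevels) bytes, which gives the trailing
 * bc_levels[] array exactly that many entries.
 */
cur = kmem_zalloc(xfs_btree_cur_sizeof(maxlevels), KM_NOFS | KM_MAYFAIL);
if (!cur)
        return NULL;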

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/libxfs/xfs_alloc.c |    6 +-
 fs/xfs/libxfs/xfs_bmap.c  |   10 +--
 fs/xfs/libxfs/xfs_btree.c |  168 +++++++++++++++++++++++----------------------
 fs/xfs/libxfs/xfs_btree.h |   28 ++++++--
 fs/xfs/scrub/bitmap.c     |   22 +++---
 fs/xfs/scrub/bmap.c       |    2 -
 fs/xfs/scrub/btree.c      |   47 +++++++------
 fs/xfs/scrub/trace.c      |    7 +-
 fs/xfs/scrub/trace.h      |   10 +--
 fs/xfs/xfs_super.c        |    2 -
 fs/xfs/xfs_trace.h        |    2 -
 11 files changed, 164 insertions(+), 140 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 35fb1dd3be95..55c5adc9b54e 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -488,8 +488,8 @@ xfs_alloc_fixup_trees(
 		struct xfs_btree_block	*bnoblock;
 		struct xfs_btree_block	*cntblock;
 
-		bnoblock = XFS_BUF_TO_BLOCK(bno_cur->bc_bufs[0]);
-		cntblock = XFS_BUF_TO_BLOCK(cnt_cur->bc_bufs[0]);
+		bnoblock = XFS_BUF_TO_BLOCK(bno_cur->bc_levels[0].bp);
+		cntblock = XFS_BUF_TO_BLOCK(cnt_cur->bc_levels[0].bp);
 
 		if (XFS_IS_CORRUPT(mp,
 				   bnoblock->bb_numrecs !=
@@ -1512,7 +1512,7 @@ xfs_alloc_ag_vextent_lastblock(
 	 * than minlen.
 	 */
 	if (*len || args->alignment > 1) {
-		acur->cnt->bc_ptrs[0] = 1;
+		acur->cnt->bc_levels[0].ptr = 1;
 		do {
 			error = xfs_alloc_get_rec(acur->cnt, bno, len, &i);
 			if (error)
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 499c977cbf56..644b956301b6 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -240,10 +240,10 @@ xfs_bmap_get_bp(
 		return NULL;
 
 	for (i = 0; i < XFS_BTREE_MAXLEVELS; i++) {
-		if (!cur->bc_bufs[i])
+		if (!cur->bc_levels[i].bp)
 			break;
-		if (xfs_buf_daddr(cur->bc_bufs[i]) == bno)
-			return cur->bc_bufs[i];
+		if (xfs_buf_daddr(cur->bc_levels[i].bp) == bno)
+			return cur->bc_levels[i].bp;
 	}
 
 	/* Chase down all the log items to see if the bp is there */
@@ -629,8 +629,8 @@ xfs_bmap_btree_to_extents(
 	ip->i_nblocks--;
 	xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_BCOUNT, -1L);
 	xfs_trans_binval(tp, cbp);
-	if (cur->bc_bufs[0] == cbp)
-		cur->bc_bufs[0] = NULL;
+	if (cur->bc_levels[0].bp == cbp)
+		cur->bc_levels[0].bp = NULL;
 	xfs_iroot_realloc(ip, -1, whichfork);
 	ASSERT(ifp->if_broot == NULL);
 	ifp->if_format = XFS_DINODE_FMT_EXTENTS;
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index bc4e49f0456a..25dfab81025f 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -367,8 +367,8 @@ xfs_btree_del_cursor(
 	 * way we won't have initialized all the entries down to 0.
 	 */
 	for (i = 0; i < cur->bc_nlevels; i++) {
-		if (cur->bc_bufs[i])
-			xfs_trans_brelse(cur->bc_tp, cur->bc_bufs[i]);
+		if (cur->bc_levels[i].bp)
+			xfs_trans_brelse(cur->bc_tp, cur->bc_levels[i].bp);
 		else if (!error)
 			break;
 	}
@@ -415,9 +415,9 @@ xfs_btree_dup_cursor(
 	 * For each level current, re-get the buffer and copy the ptr value.
 	 */
 	for (i = 0; i < new->bc_nlevels; i++) {
-		new->bc_ptrs[i] = cur->bc_ptrs[i];
-		new->bc_ra[i] = cur->bc_ra[i];
-		bp = cur->bc_bufs[i];
+		new->bc_levels[i].ptr = cur->bc_levels[i].ptr;
+		new->bc_levels[i].ra = cur->bc_levels[i].ra;
+		bp = cur->bc_levels[i].bp;
 		if (bp) {
 			error = xfs_trans_read_buf(mp, tp, mp->m_ddev_targp,
 						   xfs_buf_daddr(bp), mp->m_bsize,
@@ -429,7 +429,7 @@ xfs_btree_dup_cursor(
 				return error;
 			}
 		}
-		new->bc_bufs[i] = bp;
+		new->bc_levels[i].bp = bp;
 	}
 	*ncur = new;
 	return 0;
@@ -681,7 +681,7 @@ xfs_btree_get_block(
 		return xfs_btree_get_iroot(cur);
 	}
 
-	*bpp = cur->bc_bufs[level];
+	*bpp = cur->bc_levels[level].bp;
 	return XFS_BUF_TO_BLOCK(*bpp);
 }
 
@@ -711,7 +711,7 @@ xfs_btree_firstrec(
 	/*
 	 * Set the ptr value to 1, that's the first record/key.
 	 */
-	cur->bc_ptrs[level] = 1;
+	cur->bc_levels[level].ptr = 1;
 	return 1;
 }
 
@@ -741,7 +741,7 @@ xfs_btree_lastrec(
 	/*
 	 * Set the ptr value to numrecs, that's the last record/key.
 	 */
-	cur->bc_ptrs[level] = be16_to_cpu(block->bb_numrecs);
+	cur->bc_levels[level].ptr = be16_to_cpu(block->bb_numrecs);
 	return 1;
 }
 
@@ -922,11 +922,11 @@ xfs_btree_readahead(
 	    (lev == cur->bc_nlevels - 1))
 		return 0;
 
-	if ((cur->bc_ra[lev] | lr) == cur->bc_ra[lev])
+	if ((cur->bc_levels[lev].ra | lr) == cur->bc_levels[lev].ra)
 		return 0;
 
-	cur->bc_ra[lev] |= lr;
-	block = XFS_BUF_TO_BLOCK(cur->bc_bufs[lev]);
+	cur->bc_levels[lev].ra |= lr;
+	block = XFS_BUF_TO_BLOCK(cur->bc_levels[lev].bp);
 
 	if (cur->bc_flags & XFS_BTREE_LONG_PTRS)
 		return xfs_btree_readahead_lblock(cur, lr, block);
@@ -991,22 +991,22 @@ xfs_btree_setbuf(
 {
 	struct xfs_btree_block	*b;	/* btree block */
 
-	if (cur->bc_bufs[lev])
-		xfs_trans_brelse(cur->bc_tp, cur->bc_bufs[lev]);
-	cur->bc_bufs[lev] = bp;
-	cur->bc_ra[lev] = 0;
+	if (cur->bc_levels[lev].bp)
+		xfs_trans_brelse(cur->bc_tp, cur->bc_levels[lev].bp);
+	cur->bc_levels[lev].bp = bp;
+	cur->bc_levels[lev].ra = 0;
 
 	b = XFS_BUF_TO_BLOCK(bp);
 	if (cur->bc_flags & XFS_BTREE_LONG_PTRS) {
 		if (b->bb_u.l.bb_leftsib == cpu_to_be64(NULLFSBLOCK))
-			cur->bc_ra[lev] |= XFS_BTCUR_LEFTRA;
+			cur->bc_levels[lev].ra |= XFS_BTCUR_LEFTRA;
 		if (b->bb_u.l.bb_rightsib == cpu_to_be64(NULLFSBLOCK))
-			cur->bc_ra[lev] |= XFS_BTCUR_RIGHTRA;
+			cur->bc_levels[lev].ra |= XFS_BTCUR_RIGHTRA;
 	} else {
 		if (b->bb_u.s.bb_leftsib == cpu_to_be32(NULLAGBLOCK))
-			cur->bc_ra[lev] |= XFS_BTCUR_LEFTRA;
+			cur->bc_levels[lev].ra |= XFS_BTCUR_LEFTRA;
 		if (b->bb_u.s.bb_rightsib == cpu_to_be32(NULLAGBLOCK))
-			cur->bc_ra[lev] |= XFS_BTCUR_RIGHTRA;
+			cur->bc_levels[lev].ra |= XFS_BTCUR_RIGHTRA;
 	}
 }
 
@@ -1548,7 +1548,7 @@ xfs_btree_increment(
 #endif
 
 	/* We're done if we remain in the block after the increment. */
-	if (++cur->bc_ptrs[level] <= xfs_btree_get_numrecs(block))
+	if (++cur->bc_levels[level].ptr <= xfs_btree_get_numrecs(block))
 		goto out1;
 
 	/* Fail if we just went off the right edge of the tree. */
@@ -1571,7 +1571,7 @@ xfs_btree_increment(
 			goto error0;
 #endif
 
-		if (++cur->bc_ptrs[lev] <= xfs_btree_get_numrecs(block))
+		if (++cur->bc_levels[lev].ptr <= xfs_btree_get_numrecs(block))
 			break;
 
 		/* Read-ahead the right block for the next loop. */
@@ -1598,14 +1598,14 @@ xfs_btree_increment(
 	for (block = xfs_btree_get_block(cur, lev, &bp); lev > level; ) {
 		union xfs_btree_ptr	*ptrp;
 
-		ptrp = xfs_btree_ptr_addr(cur, cur->bc_ptrs[lev], block);
+		ptrp = xfs_btree_ptr_addr(cur, cur->bc_levels[lev].ptr, block);
 		--lev;
 		error = xfs_btree_read_buf_block(cur, ptrp, 0, &block, &bp);
 		if (error)
 			goto error0;
 
 		xfs_btree_setbuf(cur, lev, bp);
-		cur->bc_ptrs[lev] = 1;
+		cur->bc_levels[lev].ptr = 1;
 	}
 out1:
 	*stat = 1;
@@ -1641,7 +1641,7 @@ xfs_btree_decrement(
 	xfs_btree_readahead(cur, level, XFS_BTCUR_LEFTRA);
 
 	/* We're done if we remain in the block after the decrement. */
-	if (--cur->bc_ptrs[level] > 0)
+	if (--cur->bc_levels[level].ptr > 0)
 		goto out1;
 
 	/* Get a pointer to the btree block. */
@@ -1665,7 +1665,7 @@ xfs_btree_decrement(
 	 * Stop when we don't go off the left edge of a block.
 	 */
 	for (lev = level + 1; lev < cur->bc_nlevels; lev++) {
-		if (--cur->bc_ptrs[lev] > 0)
+		if (--cur->bc_levels[lev].ptr > 0)
 			break;
 		/* Read-ahead the left block for the next loop. */
 		xfs_btree_readahead(cur, lev, XFS_BTCUR_LEFTRA);
@@ -1691,13 +1691,13 @@ xfs_btree_decrement(
 	for (block = xfs_btree_get_block(cur, lev, &bp); lev > level; ) {
 		union xfs_btree_ptr	*ptrp;
 
-		ptrp = xfs_btree_ptr_addr(cur, cur->bc_ptrs[lev], block);
+		ptrp = xfs_btree_ptr_addr(cur, cur->bc_levels[lev].ptr, block);
 		--lev;
 		error = xfs_btree_read_buf_block(cur, ptrp, 0, &block, &bp);
 		if (error)
 			goto error0;
 		xfs_btree_setbuf(cur, lev, bp);
-		cur->bc_ptrs[lev] = xfs_btree_get_numrecs(block);
+		cur->bc_levels[lev].ptr = xfs_btree_get_numrecs(block);
 	}
 out1:
 	*stat = 1;
@@ -1735,7 +1735,7 @@ xfs_btree_lookup_get_block(
 	 *
 	 * Otherwise throw it away and get a new one.
 	 */
-	bp = cur->bc_bufs[level];
+	bp = cur->bc_levels[level].bp;
 	error = xfs_btree_ptr_to_daddr(cur, pp, &daddr);
 	if (error)
 		return error;
@@ -1864,7 +1864,7 @@ xfs_btree_lookup(
 					return -EFSCORRUPTED;
 				}
 
-				cur->bc_ptrs[0] = dir != XFS_LOOKUP_LE;
+				cur->bc_levels[0].ptr = dir != XFS_LOOKUP_LE;
 				*stat = 0;
 				return 0;
 			}
@@ -1916,7 +1916,7 @@ xfs_btree_lookup(
 			if (error)
 				goto error0;
 
-			cur->bc_ptrs[level] = keyno;
+			cur->bc_levels[level].ptr = keyno;
 		}
 	}
 
@@ -1933,7 +1933,7 @@ xfs_btree_lookup(
 		    !xfs_btree_ptr_is_null(cur, &ptr)) {
 			int	i;
 
-			cur->bc_ptrs[0] = keyno;
+			cur->bc_levels[0].ptr = keyno;
 			error = xfs_btree_increment(cur, 0, &i);
 			if (error)
 				goto error0;
@@ -1944,7 +1944,7 @@ xfs_btree_lookup(
 		}
 	} else if (dir == XFS_LOOKUP_LE && diff > 0)
 		keyno--;
-	cur->bc_ptrs[0] = keyno;
+	cur->bc_levels[0].ptr = keyno;
 
 	/* Return if we succeeded or not. */
 	if (keyno == 0 || keyno > xfs_btree_get_numrecs(block))
@@ -2104,7 +2104,7 @@ __xfs_btree_updkeys(
 		if (error)
 			return error;
 #endif
-		ptr = cur->bc_ptrs[level];
+		ptr = cur->bc_levels[level].ptr;
 		nlkey = xfs_btree_key_addr(cur, ptr, block);
 		nhkey = xfs_btree_high_key_addr(cur, ptr, block);
 		if (!force_all &&
@@ -2171,7 +2171,7 @@ xfs_btree_update_keys(
 		if (error)
 			return error;
 #endif
-		ptr = cur->bc_ptrs[level];
+		ptr = cur->bc_levels[level].ptr;
 		kp = xfs_btree_key_addr(cur, ptr, block);
 		xfs_btree_copy_keys(cur, kp, &key, 1);
 		xfs_btree_log_keys(cur, bp, ptr, ptr);
@@ -2205,7 +2205,7 @@ xfs_btree_update(
 		goto error0;
 #endif
 	/* Get the address of the rec to be updated. */
-	ptr = cur->bc_ptrs[0];
+	ptr = cur->bc_levels[0].ptr;
 	rp = xfs_btree_rec_addr(cur, ptr, block);
 
 	/* Fill in the new contents and log them. */
@@ -2280,7 +2280,7 @@ xfs_btree_lshift(
 	 * If the cursor entry is the one that would be moved, don't
 	 * do it... it's too complicated.
 	 */
-	if (cur->bc_ptrs[level] <= 1)
+	if (cur->bc_levels[level].ptr <= 1)
 		goto out0;
 
 	/* Set up the left neighbor as "left". */
@@ -2414,7 +2414,7 @@ xfs_btree_lshift(
 		goto error0;
 
 	/* Slide the cursor value left one. */
-	cur->bc_ptrs[level]--;
+	cur->bc_levels[level].ptr--;
 
 	*stat = 1;
 	return 0;
@@ -2476,7 +2476,7 @@ xfs_btree_rshift(
 	 * do it... it's too complicated.
 	 */
 	lrecs = xfs_btree_get_numrecs(left);
-	if (cur->bc_ptrs[level] >= lrecs)
+	if (cur->bc_levels[level].ptr >= lrecs)
 		goto out0;
 
 	/* Set up the right neighbor as "right". */
@@ -2664,7 +2664,7 @@ __xfs_btree_split(
 	 */
 	lrecs = xfs_btree_get_numrecs(left);
 	rrecs = lrecs / 2;
-	if ((lrecs & 1) && cur->bc_ptrs[level] <= rrecs + 1)
+	if ((lrecs & 1) && cur->bc_levels[level].ptr <= rrecs + 1)
 		rrecs++;
 	src_index = (lrecs - rrecs + 1);
 
@@ -2760,9 +2760,9 @@ __xfs_btree_split(
 	 * If it's just pointing past the last entry in left, then we'll
 	 * insert there, so don't change anything in that case.
 	 */
-	if (cur->bc_ptrs[level] > lrecs + 1) {
+	if (cur->bc_levels[level].ptr > lrecs + 1) {
 		xfs_btree_setbuf(cur, level, rbp);
-		cur->bc_ptrs[level] -= lrecs;
+		cur->bc_levels[level].ptr -= lrecs;
 	}
 	/*
 	 * If there are more levels, we'll need another cursor which refers
@@ -2772,7 +2772,7 @@ __xfs_btree_split(
 		error = xfs_btree_dup_cursor(cur, curp);
 		if (error)
 			goto error0;
-		(*curp)->bc_ptrs[level + 1]++;
+		(*curp)->bc_levels[level + 1].ptr++;
 	}
 	*ptrp = rptr;
 	*stat = 1;
@@ -2934,7 +2934,7 @@ xfs_btree_new_iroot(
 	xfs_btree_set_numrecs(block, 1);
 	cur->bc_nlevels++;
 	ASSERT(cur->bc_nlevels <= XFS_BTREE_MAXLEVELS);
-	cur->bc_ptrs[level + 1] = 1;
+	cur->bc_levels[level + 1].ptr = 1;
 
 	kp = xfs_btree_key_addr(cur, 1, block);
 	ckp = xfs_btree_key_addr(cur, 1, cblock);
@@ -3095,7 +3095,7 @@ xfs_btree_new_root(
 
 	/* Fix up the cursor. */
 	xfs_btree_setbuf(cur, cur->bc_nlevels, nbp);
-	cur->bc_ptrs[cur->bc_nlevels] = nptr;
+	cur->bc_levels[cur->bc_nlevels].ptr = nptr;
 	cur->bc_nlevels++;
 	ASSERT(cur->bc_nlevels <= XFS_BTREE_MAXLEVELS);
 	*stat = 1;
@@ -3154,7 +3154,7 @@ xfs_btree_make_block_unfull(
 		return error;
 
 	if (*stat) {
-		*oindex = *index = cur->bc_ptrs[level];
+		*oindex = *index = cur->bc_levels[level].ptr;
 		return 0;
 	}
 
@@ -3169,7 +3169,7 @@ xfs_btree_make_block_unfull(
 		return error;
 
 
-	*index = cur->bc_ptrs[level];
+	*index = cur->bc_levels[level].ptr;
 	return 0;
 }
 
@@ -3216,7 +3216,7 @@ xfs_btree_insrec(
 	}
 
 	/* If we're off the left edge, return failure. */
-	ptr = cur->bc_ptrs[level];
+	ptr = cur->bc_levels[level].ptr;
 	if (ptr == 0) {
 		*stat = 0;
 		return 0;
@@ -3559,7 +3559,7 @@ xfs_btree_kill_iroot(
 	if (error)
 		return error;
 
-	cur->bc_bufs[level - 1] = NULL;
+	cur->bc_levels[level - 1].bp = NULL;
 	be16_add_cpu(&block->bb_level, -1);
 	xfs_trans_log_inode(cur->bc_tp, ip,
 		XFS_ILOG_CORE | xfs_ilog_fbroot(cur->bc_ino.whichfork));
@@ -3592,8 +3592,8 @@ xfs_btree_kill_root(
 	if (error)
 		return error;
 
-	cur->bc_bufs[level] = NULL;
-	cur->bc_ra[level] = 0;
+	cur->bc_levels[level].bp = NULL;
+	cur->bc_levels[level].ra = 0;
 	cur->bc_nlevels--;
 
 	return 0;
@@ -3652,7 +3652,7 @@ xfs_btree_delrec(
 	tcur = NULL;
 
 	/* Get the index of the entry being deleted, check for nothing there. */
-	ptr = cur->bc_ptrs[level];
+	ptr = cur->bc_levels[level].ptr;
 	if (ptr == 0) {
 		*stat = 0;
 		return 0;
@@ -3962,7 +3962,7 @@ xfs_btree_delrec(
 				xfs_btree_del_cursor(tcur, XFS_BTREE_NOERROR);
 				tcur = NULL;
 				if (level == 0)
-					cur->bc_ptrs[0]++;
+					cur->bc_levels[0].ptr++;
 
 				*stat = 1;
 				return 0;
@@ -4099,9 +4099,9 @@ xfs_btree_delrec(
 	 * cursor to the left block, and fix up the index.
 	 */
 	if (bp != lbp) {
-		cur->bc_bufs[level] = lbp;
-		cur->bc_ptrs[level] += lrecs;
-		cur->bc_ra[level] = 0;
+		cur->bc_levels[level].bp = lbp;
+		cur->bc_levels[level].ptr += lrecs;
+		cur->bc_levels[level].ra = 0;
 	}
 	/*
 	 * If we joined with the right neighbor and there's a level above
@@ -4121,16 +4121,16 @@ xfs_btree_delrec(
 	 * We can't use decrement because it would change the next level up.
 	 */
 	if (level > 0)
-		cur->bc_ptrs[level]--;
+		cur->bc_levels[level].ptr--;
 
 	/*
 	 * We combined blocks, so we have to update the parent keys if the
-	 * btree supports overlapped intervals.  However, bc_ptrs[level + 1]
-	 * points to the old block so that the caller knows which record to
-	 * delete.  Therefore, the caller must be savvy enough to call updkeys
-	 * for us if we return stat == 2.  The other exit points from this
-	 * function don't require deletions further up the tree, so they can
-	 * call updkeys directly.
+	 * btree supports overlapped intervals.  However,
+	 * bc_levels[level + 1].ptr points to the old block so that the caller
+	 * knows which record to delete.  Therefore, the caller must be savvy
+	 * enough to call updkeys for us if we return stat == 2.  The other
+	 * exit points from this function don't require deletions further up
+	 * the tree, so they can call updkeys directly.
 	 */
 
 	/* Return value means the next level up has something to do. */
@@ -4184,7 +4184,7 @@ xfs_btree_delete(
 
 	if (i == 0) {
 		for (level = 1; level < cur->bc_nlevels; level++) {
-			if (cur->bc_ptrs[level] == 0) {
+			if (cur->bc_levels[level].ptr == 0) {
 				error = xfs_btree_decrement(cur, level, &i);
 				if (error)
 					goto error0;
@@ -4215,7 +4215,7 @@ xfs_btree_get_rec(
 	int			error;	/* error return value */
 #endif
 
-	ptr = cur->bc_ptrs[0];
+	ptr = cur->bc_levels[0].ptr;
 	block = xfs_btree_get_block(cur, 0, &bp);
 
 #ifdef DEBUG
@@ -4663,23 +4663,25 @@ xfs_btree_overlapped_query_range(
 	if (error)
 		goto out;
 #endif
-	cur->bc_ptrs[level] = 1;
+	cur->bc_levels[level].ptr = 1;
 
 	while (level < cur->bc_nlevels) {
 		block = xfs_btree_get_block(cur, level, &bp);
 
 		/* End of node, pop back towards the root. */
-		if (cur->bc_ptrs[level] > be16_to_cpu(block->bb_numrecs)) {
+		if (cur->bc_levels[level].ptr >
+					be16_to_cpu(block->bb_numrecs)) {
 pop_up:
 			if (level < cur->bc_nlevels - 1)
-				cur->bc_ptrs[level + 1]++;
+				cur->bc_levels[level + 1].ptr++;
 			level++;
 			continue;
 		}
 
 		if (level == 0) {
 			/* Handle a leaf node. */
-			recp = xfs_btree_rec_addr(cur, cur->bc_ptrs[0], block);
+			recp = xfs_btree_rec_addr(cur, cur->bc_levels[0].ptr,
+					block);
 
 			cur->bc_ops->init_high_key_from_rec(&rec_hkey, recp);
 			ldiff = cur->bc_ops->diff_two_keys(cur, &rec_hkey,
@@ -4702,14 +4704,15 @@ xfs_btree_overlapped_query_range(
 				/* Record is larger than high key; pop. */
 				goto pop_up;
 			}
-			cur->bc_ptrs[level]++;
+			cur->bc_levels[level].ptr++;
 			continue;
 		}
 
 		/* Handle an internal node. */
-		lkp = xfs_btree_key_addr(cur, cur->bc_ptrs[level], block);
-		hkp = xfs_btree_high_key_addr(cur, cur->bc_ptrs[level], block);
-		pp = xfs_btree_ptr_addr(cur, cur->bc_ptrs[level], block);
+		lkp = xfs_btree_key_addr(cur, cur->bc_levels[level].ptr, block);
+		hkp = xfs_btree_high_key_addr(cur, cur->bc_levels[level].ptr,
+				block);
+		pp = xfs_btree_ptr_addr(cur, cur->bc_levels[level].ptr, block);
 
 		ldiff = cur->bc_ops->diff_two_keys(cur, hkp, low_key);
 		hdiff = cur->bc_ops->diff_two_keys(cur, high_key, lkp);
@@ -4732,13 +4735,13 @@ xfs_btree_overlapped_query_range(
 			if (error)
 				goto out;
 #endif
-			cur->bc_ptrs[level] = 1;
+			cur->bc_levels[level].ptr = 1;
 			continue;
 		} else if (hdiff < 0) {
 			/* The low key is larger than the upper range; pop. */
 			goto pop_up;
 		}
-		cur->bc_ptrs[level]++;
+		cur->bc_levels[level].ptr++;
 	}
 
 out:
@@ -4749,13 +4752,14 @@ xfs_btree_overlapped_query_range(
 	 * with a zero-results range query, so release the buffers if we
 	 * failed to return any results.
 	 */
-	if (cur->bc_bufs[0] == NULL) {
+	if (cur->bc_levels[0].bp == NULL) {
 		for (i = 0; i < cur->bc_nlevels; i++) {
-			if (cur->bc_bufs[i]) {
-				xfs_trans_brelse(cur->bc_tp, cur->bc_bufs[i]);
-				cur->bc_bufs[i] = NULL;
-				cur->bc_ptrs[i] = 0;
-				cur->bc_ra[i] = 0;
+			if (cur->bc_levels[i].bp) {
+				xfs_trans_brelse(cur->bc_tp,
+						cur->bc_levels[i].bp);
+				cur->bc_levels[i].bp = NULL;
+				cur->bc_levels[i].ptr = 0;
+				cur->bc_levels[i].ra = 0;
 			}
 		}
 	}
@@ -4917,7 +4921,7 @@ xfs_btree_has_more_records(
 	block = xfs_btree_get_block(cur, 0, &bp);
 
 	/* There are still records in this block. */
-	if (cur->bc_ptrs[0] < xfs_btree_get_numrecs(block))
+	if (cur->bc_levels[0].ptr < xfs_btree_get_numrecs(block))
 		return true;
 
 	/* There are more record blocks. */
diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index 1018bcc43d66..f31f057bec9d 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -212,6 +212,19 @@ struct xfs_btree_cur_ino {
 #define	XFS_BTCUR_BMBT_INVALID_OWNER	(1 << 1)
 };
 
+struct xfs_btree_level {
+	/* buffer pointer */
+	struct xfs_buf		*bp;
+
+	/* key/record number */
+	uint16_t		ptr;
+
+	/* readahead info */
+#define XFS_BTCUR_LEFTRA	1	/* left sibling has been read-ahead */
+#define XFS_BTCUR_RIGHTRA	2	/* right sibling has been read-ahead */
+	uint16_t		ra;
+};
+
 /*
  * Btree cursor structure.
  * This collects all information needed by the btree code in one place.
@@ -223,11 +236,6 @@ struct xfs_btree_cur
 	const struct xfs_btree_ops *bc_ops;
 	uint			bc_flags; /* btree features - below */
 	union xfs_btree_irec	bc_rec;	/* current insert/search record value */
-	struct xfs_buf	*bc_bufs[XFS_BTREE_MAXLEVELS];	/* buf ptr per level */
-	int		bc_ptrs[XFS_BTREE_MAXLEVELS];	/* key/record # */
-	uint8_t		bc_ra[XFS_BTREE_MAXLEVELS];	/* readahead bits */
-#define	XFS_BTCUR_LEFTRA	1	/* left sibling has been read-ahead */
-#define	XFS_BTCUR_RIGHTRA	2	/* right sibling has been read-ahead */
 	uint8_t		bc_nlevels;	/* number of levels in the tree */
 	xfs_btnum_t	bc_btnum;	/* identifies which btree type */
 	int		bc_statoff;	/* offset of btre stats array */
@@ -242,8 +250,17 @@ struct xfs_btree_cur
 		struct xfs_btree_cur_ag	bc_ag;
 		struct xfs_btree_cur_ino bc_ino;
 	};
+
+	/* Must be at the end of the struct! */
+	struct xfs_btree_level	bc_levels[];
 };
 
+static inline size_t
+xfs_btree_cur_sizeof(unsigned int nlevels)
+{
+	return struct_size((struct xfs_btree_cur *)NULL, bc_levels, nlevels);
+}
+
 /* cursor flags */
 #define XFS_BTREE_LONG_PTRS		(1<<0)	/* pointers are 64bits long */
 #define XFS_BTREE_ROOT_IN_INODE		(1<<1)	/* root may be variable size */
@@ -257,7 +274,6 @@ struct xfs_btree_cur
  */
 #define XFS_BTREE_STAGING		(1<<5)
 
-
 #define	XFS_BTREE_NOERROR	0
 #define	XFS_BTREE_ERROR		1
 
diff --git a/fs/xfs/scrub/bitmap.c b/fs/xfs/scrub/bitmap.c
index d6d24c866bc4..b89bf9de9b1c 100644
--- a/fs/xfs/scrub/bitmap.c
+++ b/fs/xfs/scrub/bitmap.c
@@ -222,21 +222,21 @@ xbitmap_disunion(
  * 1  2  3
  *
  * Pretend for this example that each leaf block has 100 btree records.  For
- * the first btree record, we'll observe that bc_ptrs[0] == 1, so we record
- * that we saw block 1.  Then we observe that bc_ptrs[1] == 1, so we record
- * block 4.  The list is [1, 4].
+ * the first btree record, we'll observe that bc_levels[0].ptr == 1, so we
+ * record that we saw block 1.  Then we observe that bc_levels[1].ptr == 1, so
+ * we record block 4.  The list is [1, 4].
  *
- * For the second btree record, we see that bc_ptrs[0] == 2, so we exit the
- * loop.  The list remains [1, 4].
+ * For the second btree record, we see that bc_levels[0].ptr == 2, so we exit
+ * the loop.  The list remains [1, 4].
  *
  * For the 101st btree record, we've moved onto leaf block 2.  Now
- * bc_ptrs[0] == 1 again, so we record that we saw block 2.  We see that
- * bc_ptrs[1] == 2, so we exit the loop.  The list is now [1, 4, 2].
+ * bc_levels[0].ptr == 1 again, so we record that we saw block 2.  We see that
+ * bc_levels[1].ptr == 2, so we exit the loop.  The list is now [1, 4, 2].
  *
- * For the 102nd record, bc_ptrs[0] == 2, so we continue.
+ * For the 102nd record, bc_levels[0].ptr == 2, so we continue.
  *
- * For the 201st record, we've moved on to leaf block 3.  bc_ptrs[0] == 1, so
- * we add 3 to the list.  Now it is [1, 4, 2, 3].
+ * For the 201st record, we've moved on to leaf block 3.
+ * bc_levels[0].ptr == 1, so we add 3 to the list.  Now it is [1, 4, 2, 3].
  *
  * For the 300th record we just exit, with the list being [1, 4, 2, 3].
  */
@@ -256,7 +256,7 @@ xbitmap_set_btcur_path(
 	int			i;
 	int			error;
 
-	for (i = 0; i < cur->bc_nlevels && cur->bc_ptrs[i] == 1; i++) {
+	for (i = 0; i < cur->bc_nlevels && cur->bc_levels[i].ptr == 1; i++) {
 		xfs_btree_get_block(cur, i, &bp);
 		if (!bp)
 			continue;
diff --git a/fs/xfs/scrub/bmap.c b/fs/xfs/scrub/bmap.c
index 017da9ceaee9..a4cbbc346f60 100644
--- a/fs/xfs/scrub/bmap.c
+++ b/fs/xfs/scrub/bmap.c
@@ -402,7 +402,7 @@ xchk_bmapbt_rec(
 	 * the root since the verifiers don't do that.
 	 */
 	if (xfs_has_crc(bs->cur->bc_mp) &&
-	    bs->cur->bc_ptrs[0] == 1) {
+	    bs->cur->bc_levels[0].ptr == 1) {
 		for (i = 0; i < bs->cur->bc_nlevels - 1; i++) {
 			block = xfs_btree_get_block(bs->cur, i, &bp);
 			owner = be64_to_cpu(block->bb_u.l.bb_owner);
diff --git a/fs/xfs/scrub/btree.c b/fs/xfs/scrub/btree.c
index 6d4eba85ef77..39dd46f038fe 100644
--- a/fs/xfs/scrub/btree.c
+++ b/fs/xfs/scrub/btree.c
@@ -136,12 +136,12 @@ xchk_btree_rec(
 	struct xfs_buf		*bp;
 
 	block = xfs_btree_get_block(cur, 0, &bp);
-	rec = xfs_btree_rec_addr(cur, cur->bc_ptrs[0], block);
+	rec = xfs_btree_rec_addr(cur, cur->bc_levels[0].ptr, block);
 
 	trace_xchk_btree_rec(bs->sc, cur, 0);
 
 	/* If this isn't the first record, are they in order? */
-	if (cur->bc_ptrs[0] > 1 &&
+	if (cur->bc_levels[0].ptr > 1 &&
 	    !cur->bc_ops->recs_inorder(cur, &bs->lastrec, rec))
 		xchk_btree_set_corrupt(bs->sc, cur, 0);
 	memcpy(&bs->lastrec, rec, cur->bc_ops->rec_len);
@@ -152,7 +152,7 @@ xchk_btree_rec(
 	/* Is this at least as large as the parent low key? */
 	cur->bc_ops->init_key_from_rec(&key, rec);
 	keyblock = xfs_btree_get_block(cur, 1, &bp);
-	keyp = xfs_btree_key_addr(cur, cur->bc_ptrs[1], keyblock);
+	keyp = xfs_btree_key_addr(cur, cur->bc_levels[1].ptr, keyblock);
 	if (cur->bc_ops->diff_two_keys(cur, &key, keyp) < 0)
 		xchk_btree_set_corrupt(bs->sc, cur, 1);
 
@@ -161,7 +161,7 @@ xchk_btree_rec(
 
 	/* Is this no larger than the parent high key? */
 	cur->bc_ops->init_high_key_from_rec(&hkey, rec);
-	keyp = xfs_btree_high_key_addr(cur, cur->bc_ptrs[1], keyblock);
+	keyp = xfs_btree_high_key_addr(cur, cur->bc_levels[1].ptr, keyblock);
 	if (cur->bc_ops->diff_two_keys(cur, keyp, &hkey) < 0)
 		xchk_btree_set_corrupt(bs->sc, cur, 1);
 }
@@ -183,12 +183,12 @@ xchk_btree_key(
 	struct xfs_buf		*bp;
 
 	block = xfs_btree_get_block(cur, level, &bp);
-	key = xfs_btree_key_addr(cur, cur->bc_ptrs[level], block);
+	key = xfs_btree_key_addr(cur, cur->bc_levels[level].ptr, block);
 
 	trace_xchk_btree_key(bs->sc, cur, level);
 
 	/* If this isn't the first key, are they in order? */
-	if (cur->bc_ptrs[level] > 1 &&
+	if (cur->bc_levels[level].ptr > 1 &&
 	    !cur->bc_ops->keys_inorder(cur, &bs->lastkey[level - 1], key))
 		xchk_btree_set_corrupt(bs->sc, cur, level);
 	memcpy(&bs->lastkey[level - 1], key, cur->bc_ops->key_len);
@@ -198,7 +198,7 @@ xchk_btree_key(
 
 	/* Is this at least as large as the parent low key? */
 	keyblock = xfs_btree_get_block(cur, level + 1, &bp);
-	keyp = xfs_btree_key_addr(cur, cur->bc_ptrs[level + 1], keyblock);
+	keyp = xfs_btree_key_addr(cur, cur->bc_levels[level + 1].ptr, keyblock);
 	if (cur->bc_ops->diff_two_keys(cur, key, keyp) < 0)
 		xchk_btree_set_corrupt(bs->sc, cur, level);
 
@@ -206,8 +206,9 @@ xchk_btree_key(
 		return;
 
 	/* Is this no larger than the parent high key? */
-	key = xfs_btree_high_key_addr(cur, cur->bc_ptrs[level], block);
-	keyp = xfs_btree_high_key_addr(cur, cur->bc_ptrs[level + 1], keyblock);
+	key = xfs_btree_high_key_addr(cur, cur->bc_levels[level].ptr, block);
+	keyp = xfs_btree_high_key_addr(cur, cur->bc_levels[level + 1].ptr,
+			keyblock);
 	if (cur->bc_ops->diff_two_keys(cur, keyp, key) < 0)
 		xchk_btree_set_corrupt(bs->sc, cur, level);
 }
@@ -290,7 +291,7 @@ xchk_btree_block_check_sibling(
 
 	/* Compare upper level pointer to sibling pointer. */
 	pblock = xfs_btree_get_block(ncur, level + 1, &pbp);
-	pp = xfs_btree_ptr_addr(ncur, ncur->bc_ptrs[level + 1], pblock);
+	pp = xfs_btree_ptr_addr(ncur, ncur->bc_levels[level + 1].ptr, pblock);
 	if (!xchk_btree_ptr_ok(bs, level + 1, pp))
 		goto out;
 	if (pbp)
@@ -595,7 +596,7 @@ xchk_btree_block_keys(
 
 	/* Obtain the parent's copy of the keys for this block. */
 	parent_block = xfs_btree_get_block(cur, level + 1, &bp);
-	parent_keys = xfs_btree_key_addr(cur, cur->bc_ptrs[level + 1],
+	parent_keys = xfs_btree_key_addr(cur, cur->bc_levels[level + 1].ptr,
 			parent_block);
 
 	if (cur->bc_ops->diff_two_keys(cur, &block_keys, parent_keys) != 0)
@@ -606,7 +607,7 @@ xchk_btree_block_keys(
 
 	/* Get high keys */
 	high_bk = xfs_btree_high_key_from_key(cur, &block_keys);
-	high_pk = xfs_btree_high_key_addr(cur, cur->bc_ptrs[level + 1],
+	high_pk = xfs_btree_high_key_addr(cur, cur->bc_levels[level + 1].ptr,
 			parent_block);
 
 	if (cur->bc_ops->diff_two_keys(cur, high_bk, high_pk) != 0)
@@ -672,18 +673,18 @@ xchk_btree(
 	if (error || !block)
 		goto out;
 
-	cur->bc_ptrs[level] = 1;
+	cur->bc_levels[level].ptr = 1;
 
 	while (level < cur->bc_nlevels) {
 		block = xfs_btree_get_block(cur, level, &bp);
 
 		if (level == 0) {
 			/* End of leaf, pop back towards the root. */
-			if (cur->bc_ptrs[level] >
+			if (cur->bc_levels[level].ptr >
 			    be16_to_cpu(block->bb_numrecs)) {
 				xchk_btree_block_keys(bs, level, block);
 				if (level < cur->bc_nlevels - 1)
-					cur->bc_ptrs[level + 1]++;
+					cur->bc_levels[level + 1].ptr++;
 				level++;
 				continue;
 			}
@@ -692,7 +693,8 @@ xchk_btree(
 			xchk_btree_rec(bs);
 
 			/* Call out to the record checker. */
-			recp = xfs_btree_rec_addr(cur, cur->bc_ptrs[0], block);
+			recp = xfs_btree_rec_addr(cur, cur->bc_levels[0].ptr,
+					block);
 			error = bs->scrub_rec(bs, recp);
 			if (error)
 				break;
@@ -700,15 +702,16 @@ xchk_btree(
 			    (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT))
 				break;
 
-			cur->bc_ptrs[level]++;
+			cur->bc_levels[level].ptr++;
 			continue;
 		}
 
 		/* End of node, pop back towards the root. */
-		if (cur->bc_ptrs[level] > be16_to_cpu(block->bb_numrecs)) {
+		if (cur->bc_levels[level].ptr >
+					be16_to_cpu(block->bb_numrecs)) {
 			xchk_btree_block_keys(bs, level, block);
 			if (level < cur->bc_nlevels - 1)
-				cur->bc_ptrs[level + 1]++;
+				cur->bc_levels[level + 1].ptr++;
 			level++;
 			continue;
 		}
@@ -717,9 +720,9 @@ xchk_btree(
 		xchk_btree_key(bs, level);
 
 		/* Drill another level deeper. */
-		pp = xfs_btree_ptr_addr(cur, cur->bc_ptrs[level], block);
+		pp = xfs_btree_ptr_addr(cur, cur->bc_levels[level].ptr, block);
 		if (!xchk_btree_ptr_ok(bs, level, pp)) {
-			cur->bc_ptrs[level]++;
+			cur->bc_levels[level].ptr++;
 			continue;
 		}
 		level--;
@@ -727,7 +730,7 @@ xchk_btree(
 		if (error || !block)
 			goto out;
 
-		cur->bc_ptrs[level] = 1;
+		cur->bc_levels[level].ptr = 1;
 	}
 
 out:
diff --git a/fs/xfs/scrub/trace.c b/fs/xfs/scrub/trace.c
index c0ef53fe6611..816dfc8e5a80 100644
--- a/fs/xfs/scrub/trace.c
+++ b/fs/xfs/scrub/trace.c
@@ -21,10 +21,11 @@ xchk_btree_cur_fsbno(
 	struct xfs_btree_cur	*cur,
 	int			level)
 {
-	if (level < cur->bc_nlevels && cur->bc_bufs[level])
+	if (level < cur->bc_nlevels && cur->bc_levels[level].bp)
 		return XFS_DADDR_TO_FSB(cur->bc_mp,
-				xfs_buf_daddr(cur->bc_bufs[level]));
-	if (level == cur->bc_nlevels - 1 && cur->bc_flags & XFS_BTREE_LONG_PTRS)
+				xfs_buf_daddr(cur->bc_levels[level].bp));
+	else if (level == cur->bc_nlevels - 1 &&
+		 cur->bc_flags & XFS_BTREE_LONG_PTRS)
 		return XFS_INO_TO_FSB(cur->bc_mp, cur->bc_ino.ip->i_ino);
 	if (!(cur->bc_flags & XFS_BTREE_LONG_PTRS))
 		return XFS_AGB_TO_FSB(cur->bc_mp, cur->bc_ag.pag->pag_agno, 0);
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index a7bbb84f91a7..93ece6df02e3 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -348,7 +348,7 @@ TRACE_EVENT(xchk_btree_op_error,
 		__entry->level = level;
 		__entry->agno = XFS_FSB_TO_AGNO(cur->bc_mp, fsbno);
 		__entry->bno = XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno);
-		__entry->ptr = cur->bc_ptrs[level];
+		__entry->ptr = cur->bc_levels[level].ptr;
 		__entry->error = error;
 		__entry->ret_ip = ret_ip;
 	),
@@ -389,7 +389,7 @@ TRACE_EVENT(xchk_ifork_btree_op_error,
 		__entry->type = sc->sm->sm_type;
 		__entry->btnum = cur->bc_btnum;
 		__entry->level = level;
-		__entry->ptr = cur->bc_ptrs[level];
+		__entry->ptr = cur->bc_levels[level].ptr;
 		__entry->agno = XFS_FSB_TO_AGNO(cur->bc_mp, fsbno);
 		__entry->bno = XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno);
 		__entry->error = error;
@@ -431,7 +431,7 @@ TRACE_EVENT(xchk_btree_error,
 		__entry->level = level;
 		__entry->agno = XFS_FSB_TO_AGNO(cur->bc_mp, fsbno);
 		__entry->bno = XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno);
-		__entry->ptr = cur->bc_ptrs[level];
+		__entry->ptr = cur->bc_levels[level].ptr;
 		__entry->ret_ip = ret_ip;
 	),
 	TP_printk("dev %d:%d type %s btree %s level %d ptr %d agno 0x%x agbno 0x%x ret_ip %pS",
@@ -471,7 +471,7 @@ TRACE_EVENT(xchk_ifork_btree_error,
 		__entry->level = level;
 		__entry->agno = XFS_FSB_TO_AGNO(cur->bc_mp, fsbno);
 		__entry->bno = XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno);
-		__entry->ptr = cur->bc_ptrs[level];
+		__entry->ptr = cur->bc_levels[level].ptr;
 		__entry->ret_ip = ret_ip;
 	),
 	TP_printk("dev %d:%d ino 0x%llx fork %s type %s btree %s level %d ptr %d agno 0x%x agbno 0x%x ret_ip %pS",
@@ -511,7 +511,7 @@ DECLARE_EVENT_CLASS(xchk_sbtree_class,
 		__entry->bno = XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno);
 		__entry->level = level;
 		__entry->nlevels = cur->bc_nlevels;
-		__entry->ptr = cur->bc_ptrs[level];
+		__entry->ptr = cur->bc_levels[level].ptr;
 	),
 	TP_printk("dev %d:%d type %s btree %s agno 0x%x agbno 0x%x level %d nlevels %d ptr %d",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index c4e0cd1c1c8c..30bae0657343 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1966,7 +1966,7 @@ xfs_init_zones(void)
 		goto out_destroy_log_ticket_zone;
 
 	xfs_btree_cur_zone = kmem_cache_create("xfs_btree_cur",
-					       sizeof(struct xfs_btree_cur),
+				xfs_btree_cur_sizeof(XFS_BTREE_MAXLEVELS),
 					       0, 0, NULL);
 	if (!xfs_btree_cur_zone)
 		goto out_destroy_bmap_free_item_zone;
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 1033a95fbf8e..4a8076ef8cb4 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -2476,7 +2476,7 @@ DECLARE_EVENT_CLASS(xfs_btree_cur_class,
 		__entry->btnum = cur->bc_btnum;
 		__entry->level = level;
 		__entry->nlevels = cur->bc_nlevels;
-		__entry->ptr = cur->bc_ptrs[level];
+		__entry->ptr = cur->bc_levels[level].ptr;
 		__entry->daddr = bp ? xfs_buf_daddr(bp) : -1;
 	),
 	TP_printk("dev %d:%d btree %s level %d/%d ptr %d daddr 0x%llx",


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 06/15] xfs: rearrange xfs_btree_cur fields for better packing
  2021-10-12 23:32 [PATCHSET v3 00/15] xfs: support dynamic btree cursor height Darrick J. Wong
                   ` (4 preceding siblings ...)
  2021-10-12 23:33 ` [PATCH 05/15] xfs: support dynamic btree cursor heights Darrick J. Wong
@ 2021-10-12 23:33 ` Darrick J. Wong
  2021-10-13  5:34   ` Dave Chinner
  2021-10-12 23:33 ` [PATCH 07/15] xfs: refactor btree cursor allocation function Darrick J. Wong
                   ` (8 subsequent siblings)
  14 siblings, 1 reply; 41+ messages in thread
From: Darrick J. Wong @ 2021-10-12 23:33 UTC (permalink / raw)
  To: djwong, david; +Cc: linux-xfs, chandan.babu, hch

From: Darrick J. Wong <djwong@kernel.org>

Reduce the size of the btree cursor structure some more by rearranging
fields to eliminate unused space.  While we're at it, fix the ragged
indentation and a spelling error.
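
If you're curious where the savings come from, here's a generic
standalone illustration of the padding holes that field ordering can
create (not the actual xfs_btree_cur layout; exact sizes depend on the
ABI):

#include <stdio.h>

/* A 1-byte field wedged before a 4-byte field forces 3 bytes of
 * padding, and the struct tail gets padded out to its alignment too. */
struct ragged {
        unsigned char   a;      /* 1 byte + 3 padding */
        unsigned int    b;      /* 4 bytes */
        unsigned char   c;      /* 1 byte + 3 tail padding */
};

/* Grouping like-sized fields packs them tightly. */
struct packed {
        unsigned int    b;      /* 4 bytes */
        unsigned char   a;      /* 1 byte */
        unsigned char   c;      /* 1 byte + 2 tail padding */
};

int main(void)
{
        printf("ragged=%zu packed=%zu\n",
               sizeof(struct ragged), sizeof(struct packed));
        return 0;
}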

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_btree.h |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index f31f057bec9d..613f7a303cc6 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -234,11 +234,11 @@ struct xfs_btree_cur
 	struct xfs_trans	*bc_tp;	/* transaction we're in, if any */
 	struct xfs_mount	*bc_mp;	/* file system mount struct */
 	const struct xfs_btree_ops *bc_ops;
-	uint			bc_flags; /* btree features - below */
+	unsigned int		bc_flags; /* btree features - below */
+	xfs_btnum_t		bc_btnum; /* identifies which btree type */
 	union xfs_btree_irec	bc_rec;	/* current insert/search record value */
-	uint8_t		bc_nlevels;	/* number of levels in the tree */
-	xfs_btnum_t	bc_btnum;	/* identifies which btree type */
-	int		bc_statoff;	/* offset of btre stats array */
+	uint8_t			bc_nlevels; /* number of levels in the tree */
+	int			bc_statoff; /* offset of btree stats array */
 
 	/*
 	 * Short btree pointers need an agno to be able to turn the pointers


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 07/15] xfs: refactor btree cursor allocation function
  2021-10-12 23:32 [PATCHSET v3 00/15] xfs: support dynamic btree cursor height Darrick J. Wong
                   ` (5 preceding siblings ...)
  2021-10-12 23:33 ` [PATCH 06/15] xfs: rearrange xfs_btree_cur fields for better packing Darrick J. Wong
@ 2021-10-12 23:33 ` Darrick J. Wong
  2021-10-13  5:34   ` Dave Chinner
  2021-10-12 23:33 ` [PATCH 08/15] xfs: encode the max btree height in the cursor Darrick J. Wong
                   ` (7 subsequent siblings)
  14 siblings, 1 reply; 41+ messages in thread
From: Darrick J. Wong @ 2021-10-12 23:33 UTC (permalink / raw)
  To: djwong, david
  Cc: Chandan Babu R, Christoph Hellwig, linux-xfs, chandan.babu, hch

From: Darrick J. Wong <djwong@kernel.org>

Refactor btree cursor allocation into a common helper.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/libxfs/xfs_alloc_btree.c    |    6 +-----
 fs/xfs/libxfs/xfs_bmap_btree.c     |    6 +-----
 fs/xfs/libxfs/xfs_btree.h          |   16 ++++++++++++++++
 fs/xfs/libxfs/xfs_ialloc_btree.c   |    5 +----
 fs/xfs/libxfs/xfs_refcount_btree.c |    5 +----
 fs/xfs/libxfs/xfs_rmap_btree.c     |    5 +----
 6 files changed, 21 insertions(+), 22 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_alloc_btree.c b/fs/xfs/libxfs/xfs_alloc_btree.c
index 152ed2a202f4..c644b11132f6 100644
--- a/fs/xfs/libxfs/xfs_alloc_btree.c
+++ b/fs/xfs/libxfs/xfs_alloc_btree.c
@@ -477,11 +477,7 @@ xfs_allocbt_init_common(
 
 	ASSERT(btnum == XFS_BTNUM_BNO || btnum == XFS_BTNUM_CNT);
 
-	cur = kmem_cache_zalloc(xfs_btree_cur_zone, GFP_NOFS | __GFP_NOFAIL);
-
-	cur->bc_tp = tp;
-	cur->bc_mp = mp;
-	cur->bc_btnum = btnum;
+	cur = xfs_btree_alloc_cursor(mp, tp, btnum);
 	cur->bc_ag.abt.active = false;
 
 	if (btnum == XFS_BTNUM_CNT) {
diff --git a/fs/xfs/libxfs/xfs_bmap_btree.c b/fs/xfs/libxfs/xfs_bmap_btree.c
index a43dea8d6a65..a06987e36db5 100644
--- a/fs/xfs/libxfs/xfs_bmap_btree.c
+++ b/fs/xfs/libxfs/xfs_bmap_btree.c
@@ -552,12 +552,8 @@ xfs_bmbt_init_cursor(
 	struct xfs_btree_cur	*cur;
 	ASSERT(whichfork != XFS_COW_FORK);
 
-	cur = kmem_cache_zalloc(xfs_btree_cur_zone, GFP_NOFS | __GFP_NOFAIL);
-
-	cur->bc_tp = tp;
-	cur->bc_mp = mp;
+	cur = xfs_btree_alloc_cursor(mp, tp, XFS_BTNUM_BMAP);
 	cur->bc_nlevels = be16_to_cpu(ifp->if_broot->bb_level) + 1;
-	cur->bc_btnum = XFS_BTNUM_BMAP;
 	cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_bmbt_2);
 
 	cur->bc_ops = &xfs_bmbt_ops;
diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index 613f7a303cc6..76509e819d60 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -573,4 +573,20 @@ void xfs_btree_copy_keys(struct xfs_btree_cur *cur,
 		union xfs_btree_key *dst_key,
 		const union xfs_btree_key *src_key, int numkeys);
 
+static inline struct xfs_btree_cur *
+xfs_btree_alloc_cursor(
+	struct xfs_mount	*mp,
+	struct xfs_trans	*tp,
+	xfs_btnum_t		btnum)
+{
+	struct xfs_btree_cur	*cur;
+
+	cur = kmem_cache_zalloc(xfs_btree_cur_zone, GFP_NOFS | __GFP_NOFAIL);
+	cur->bc_tp = tp;
+	cur->bc_mp = mp;
+	cur->bc_btnum = btnum;
+
+	return cur;
+}
+
 #endif	/* __XFS_BTREE_H__ */
diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
index 10736b89b679..c8fea6a464d5 100644
--- a/fs/xfs/libxfs/xfs_ialloc_btree.c
+++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
@@ -432,10 +432,7 @@ xfs_inobt_init_common(
 {
 	struct xfs_btree_cur	*cur;
 
-	cur = kmem_cache_zalloc(xfs_btree_cur_zone, GFP_NOFS | __GFP_NOFAIL);
-	cur->bc_tp = tp;
-	cur->bc_mp = mp;
-	cur->bc_btnum = btnum;
+	cur = xfs_btree_alloc_cursor(mp, tp, btnum);
 	if (btnum == XFS_BTNUM_INO) {
 		cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_ibt_2);
 		cur->bc_ops = &xfs_inobt_ops;
diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c
index 3ea589f15b14..48c45e31d897 100644
--- a/fs/xfs/libxfs/xfs_refcount_btree.c
+++ b/fs/xfs/libxfs/xfs_refcount_btree.c
@@ -322,10 +322,7 @@ xfs_refcountbt_init_common(
 
 	ASSERT(pag->pag_agno < mp->m_sb.sb_agcount);
 
-	cur = kmem_cache_zalloc(xfs_btree_cur_zone, GFP_NOFS | __GFP_NOFAIL);
-	cur->bc_tp = tp;
-	cur->bc_mp = mp;
-	cur->bc_btnum = XFS_BTNUM_REFC;
+	cur = xfs_btree_alloc_cursor(mp, tp, XFS_BTNUM_REFC);
 	cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_refcbt_2);
 
 	cur->bc_flags |= XFS_BTREE_CRC_BLOCKS;
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
index d65bf3c6f25e..f3c4d0965cc9 100644
--- a/fs/xfs/libxfs/xfs_rmap_btree.c
+++ b/fs/xfs/libxfs/xfs_rmap_btree.c
@@ -451,11 +451,8 @@ xfs_rmapbt_init_common(
 {
 	struct xfs_btree_cur	*cur;
 
-	cur = kmem_cache_zalloc(xfs_btree_cur_zone, GFP_NOFS | __GFP_NOFAIL);
-	cur->bc_tp = tp;
-	cur->bc_mp = mp;
 	/* Overlapping btree; 2 keys per pointer. */
-	cur->bc_btnum = XFS_BTNUM_RMAP;
+	cur = xfs_btree_alloc_cursor(mp, tp, XFS_BTNUM_RMAP);
 	cur->bc_flags = XFS_BTREE_CRC_BLOCKS | XFS_BTREE_OVERLAPPING;
 	cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_rmap_2);
 	cur->bc_ops = &xfs_rmapbt_ops;


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 08/15] xfs: encode the max btree height in the cursor
  2021-10-12 23:32 [PATCHSET v3 00/15] xfs: support dynamic btree cursor height Darrick J. Wong
                   ` (6 preceding siblings ...)
  2021-10-12 23:33 ` [PATCH 07/15] xfs: refactor btree cursor allocation function Darrick J. Wong
@ 2021-10-12 23:33 ` Darrick J. Wong
  2021-10-13  5:38   ` Dave Chinner
  2021-10-12 23:33 ` [PATCH 09/15] xfs: dynamically allocate cursors based on maxlevels Darrick J. Wong
                   ` (6 subsequent siblings)
  14 siblings, 1 reply; 41+ messages in thread
From: Darrick J. Wong @ 2021-10-12 23:33 UTC (permalink / raw)
  To: djwong, david; +Cc: linux-xfs, chandan.babu, hch

From: Darrick J. Wong <djwong@kernel.org>

Encode the maximum btree height in the cursor, since we're soon going to
allow smaller cursors for AG btrees and larger cursors for file btrees.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_bmap.c          |    2 +-
 fs/xfs/libxfs/xfs_btree.c         |    4 ++--
 fs/xfs/libxfs/xfs_btree.h         |    2 ++
 fs/xfs/libxfs/xfs_btree_staging.c |   10 +++++-----
 4 files changed, 10 insertions(+), 8 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 644b956301b6..2ae5bf9a74e7 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -239,7 +239,7 @@ xfs_bmap_get_bp(
 	if (!cur)
 		return NULL;
 
-	for (i = 0; i < XFS_BTREE_MAXLEVELS; i++) {
+	for (i = 0; i < cur->bc_maxlevels; i++) {
 		if (!cur->bc_levels[i].bp)
 			break;
 		if (xfs_buf_daddr(cur->bc_levels[i].bp) == bno)
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index 25dfab81025f..6ced8f028d47 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -2933,7 +2933,7 @@ xfs_btree_new_iroot(
 	be16_add_cpu(&block->bb_level, 1);
 	xfs_btree_set_numrecs(block, 1);
 	cur->bc_nlevels++;
-	ASSERT(cur->bc_nlevels <= XFS_BTREE_MAXLEVELS);
+	ASSERT(cur->bc_nlevels <= cur->bc_maxlevels);
 	cur->bc_levels[level + 1].ptr = 1;
 
 	kp = xfs_btree_key_addr(cur, 1, block);
@@ -3097,7 +3097,7 @@ xfs_btree_new_root(
 	xfs_btree_setbuf(cur, cur->bc_nlevels, nbp);
 	cur->bc_levels[cur->bc_nlevels].ptr = nptr;
 	cur->bc_nlevels++;
-	ASSERT(cur->bc_nlevels <= XFS_BTREE_MAXLEVELS);
+	ASSERT(cur->bc_nlevels <= cur->bc_maxlevels);
 	*stat = 1;
 	return 0;
 error0:
diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index 76509e819d60..43766e5b680f 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -238,6 +238,7 @@ struct xfs_btree_cur
 	xfs_btnum_t		bc_btnum; /* identifies which btree type */
 	union xfs_btree_irec	bc_rec;	/* current insert/search record value */
 	uint8_t			bc_nlevels; /* number of levels in the tree */
+	uint8_t			bc_maxlevels; /* maximum levels for this btree type */
 	int			bc_statoff; /* offset of btree stats array */
 
 	/*
@@ -585,6 +586,7 @@ xfs_btree_alloc_cursor(
 	cur->bc_tp = tp;
 	cur->bc_mp = mp;
 	cur->bc_btnum = btnum;
+	cur->bc_maxlevels = XFS_BTREE_MAXLEVELS;
 
 	return cur;
 }
diff --git a/fs/xfs/libxfs/xfs_btree_staging.c b/fs/xfs/libxfs/xfs_btree_staging.c
index cc56efc2b90a..dd75e208b543 100644
--- a/fs/xfs/libxfs/xfs_btree_staging.c
+++ b/fs/xfs/libxfs/xfs_btree_staging.c
@@ -657,12 +657,12 @@ xfs_btree_bload_compute_geometry(
 	 * checking levels 0 and 1 here, so set bc_nlevels such that the btree
 	 * code doesn't interpret either as the root level.
 	 */
-	cur->bc_nlevels = XFS_BTREE_MAXLEVELS - 1;
+	cur->bc_nlevels = cur->bc_maxlevels - 1;
 	xfs_btree_bload_ensure_slack(cur, &bbl->leaf_slack, 0);
 	xfs_btree_bload_ensure_slack(cur, &bbl->node_slack, 1);
 
 	bbl->nr_records = nr_this_level = nr_records;
-	for (cur->bc_nlevels = 1; cur->bc_nlevels <= XFS_BTREE_MAXLEVELS;) {
+	for (cur->bc_nlevels = 1; cur->bc_nlevels <= cur->bc_maxlevels;) {
 		uint64_t	level_blocks;
 		uint64_t	dontcare64;
 		unsigned int	level = cur->bc_nlevels - 1;
@@ -703,7 +703,7 @@ xfs_btree_bload_compute_geometry(
 			 * block-based btree level.
 			 */
 			cur->bc_nlevels++;
-			ASSERT(cur->bc_nlevels <= XFS_BTREE_MAXLEVELS);
+			ASSERT(cur->bc_nlevels <= cur->bc_maxlevels);
 			xfs_btree_bload_level_geometry(cur, bbl, level,
 					nr_this_level, &avg_per_block,
 					&level_blocks, &dontcare64);
@@ -719,14 +719,14 @@ xfs_btree_bload_compute_geometry(
 
 			/* Otherwise, we need another level of btree. */
 			cur->bc_nlevels++;
-			ASSERT(cur->bc_nlevels <= XFS_BTREE_MAXLEVELS);
+			ASSERT(cur->bc_nlevels <= cur->bc_maxlevels);
 		}
 
 		nr_blocks += level_blocks;
 		nr_this_level = level_blocks;
 	}
 
-	if (cur->bc_nlevels > XFS_BTREE_MAXLEVELS)
+	if (cur->bc_nlevels > cur->bc_maxlevels)
 		return -EOVERFLOW;
 
 	bbl->btree_height = cur->bc_nlevels;


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 09/15] xfs: dynamically allocate cursors based on maxlevels
  2021-10-12 23:32 [PATCHSET v3 00/15] xfs: support dynamic btree cursor height Darrick J. Wong
                   ` (7 preceding siblings ...)
  2021-10-12 23:33 ` [PATCH 08/15] xfs: encode the max btree height in the cursor Darrick J. Wong
@ 2021-10-12 23:33 ` Darrick J. Wong
  2021-10-13  5:40   ` Dave Chinner
  2021-10-12 23:33 ` [PATCH 10/15] xfs: compute actual maximum btree height for critical reservation calculation Darrick J. Wong
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 41+ messages in thread
From: Darrick J. Wong @ 2021-10-12 23:33 UTC (permalink / raw)
  To: djwong, david; +Cc: linux-xfs, chandan.babu, hch

From: Darrick J. Wong <djwong@kernel.org>

To support future btree code, we need to be able to size btree cursors
dynamically for very large btrees.  Switch cursor allocation to use the
maxlevels values that we already precompute at mount time, and size each
cursor for the height that its btree type actually requires.  For now,
we retain the btree cursor zone that can handle up to 9-level btrees,
and create larger cursors (which shouldn't happen currently) from the
heap as a failsafe.
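
For readers following the series, here is a minimal userspace sketch of
where this is heading: the cursor records its own maximum height and the
allocation size is derived from that height, so a short AG btree cursor
costs less memory than a tall bmap btree cursor.  Only the
bc_nlevels/bc_maxlevels fields and the xfs_btree_cur_sizeof()-style size
computation mirror the kernel patches; every other name below is a
simplified stand-in, not a kernel definition.

/*
 * Sketch: a cursor carries its own maximum height, and its allocation
 * size is derived from that height.  Illustrative only.
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

struct cur_level {
	void		*bp;	/* buffer pinned at this level */
	uint16_t	ptr;	/* slot within that block */
};

struct btree_cur {
	uint8_t			bc_nlevels;	/* current tree height */
	uint8_t			bc_maxlevels;	/* most levels this cursor can hold */
	struct cur_level	bc_levels[];	/* one slot per level */
};

/* analogous to xfs_btree_cur_sizeof(): header plus one slot per level */
static size_t cur_sizeof(unsigned int nlevels)
{
	return sizeof(struct btree_cur) + nlevels * sizeof(struct cur_level);
}

static struct btree_cur *alloc_cursor(uint8_t maxlevels)
{
	struct btree_cur *cur = calloc(1, cur_sizeof(maxlevels));

	if (cur)
		cur->bc_maxlevels = maxlevels;
	return cur;
}

int main(void)
{
	struct btree_cur *agcur = alloc_cursor(5);	/* small AG btree */
	struct btree_cur *bmcur = alloc_cursor(9);	/* tall bmap btree */

	if (!agcur || !bmcur)
		return 1;
	printf("5-level cursor: %zu bytes\n", cur_sizeof(agcur->bc_maxlevels));
	printf("9-level cursor: %zu bytes\n", cur_sizeof(bmcur->bc_maxlevels));
	free(agcur);
	free(bmcur);
	return 0;
}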

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_alloc_btree.c    |    2 +-
 fs/xfs/libxfs/xfs_bmap_btree.c     |    3 ++-
 fs/xfs/libxfs/xfs_btree.h          |   13 +++++++++++--
 fs/xfs/libxfs/xfs_ialloc_btree.c   |    3 ++-
 fs/xfs/libxfs/xfs_refcount_btree.c |    3 ++-
 fs/xfs/libxfs/xfs_rmap_btree.c     |    3 ++-
 fs/xfs/xfs_super.c                 |    4 ++--
 7 files changed, 22 insertions(+), 9 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_alloc_btree.c b/fs/xfs/libxfs/xfs_alloc_btree.c
index c644b11132f6..f14bad21503f 100644
--- a/fs/xfs/libxfs/xfs_alloc_btree.c
+++ b/fs/xfs/libxfs/xfs_alloc_btree.c
@@ -477,7 +477,7 @@ xfs_allocbt_init_common(
 
 	ASSERT(btnum == XFS_BTNUM_BNO || btnum == XFS_BTNUM_CNT);
 
-	cur = xfs_btree_alloc_cursor(mp, tp, btnum);
+	cur = xfs_btree_alloc_cursor(mp, tp, btnum, mp->m_ag_maxlevels);
 	cur->bc_ag.abt.active = false;
 
 	if (btnum == XFS_BTNUM_CNT) {
diff --git a/fs/xfs/libxfs/xfs_bmap_btree.c b/fs/xfs/libxfs/xfs_bmap_btree.c
index a06987e36db5..b90122de0df0 100644
--- a/fs/xfs/libxfs/xfs_bmap_btree.c
+++ b/fs/xfs/libxfs/xfs_bmap_btree.c
@@ -552,7 +552,8 @@ xfs_bmbt_init_cursor(
 	struct xfs_btree_cur	*cur;
 	ASSERT(whichfork != XFS_COW_FORK);
 
-	cur = xfs_btree_alloc_cursor(mp, tp, XFS_BTNUM_BMAP);
+	cur = xfs_btree_alloc_cursor(mp, tp, XFS_BTNUM_BMAP,
+			mp->m_bm_maxlevels[whichfork]);
 	cur->bc_nlevels = be16_to_cpu(ifp->if_broot->bb_level) + 1;
 	cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_bmbt_2);
 
diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index 43766e5b680f..b8761a2fc24b 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -94,6 +94,12 @@ uint32_t xfs_btree_magic(int crc, xfs_btnum_t btnum);
 
 #define	XFS_BTREE_MAXLEVELS	9	/* max of all btrees */
 
+/*
+ * The btree cursor zone hands out cursors that can handle up to this many
+ * levels.  This is the known maximum for all btree types.
+ */
+#define XFS_BTREE_CUR_ZONE_MAXLEVELS	(9)
+
 struct xfs_btree_ops {
 	/* size of the key and record structures */
 	size_t	key_len;
@@ -578,15 +584,18 @@ static inline struct xfs_btree_cur *
 xfs_btree_alloc_cursor(
 	struct xfs_mount	*mp,
 	struct xfs_trans	*tp,
-	xfs_btnum_t		btnum)
+	xfs_btnum_t		btnum,
+	uint8_t			maxlevels)
 {
 	struct xfs_btree_cur	*cur;
 
+	ASSERT(maxlevels <= XFS_BTREE_CUR_ZONE_MAXLEVELS);
+
 	cur = kmem_cache_zalloc(xfs_btree_cur_zone, GFP_NOFS | __GFP_NOFAIL);
 	cur->bc_tp = tp;
 	cur->bc_mp = mp;
 	cur->bc_btnum = btnum;
-	cur->bc_maxlevels = XFS_BTREE_MAXLEVELS;
+	cur->bc_maxlevels = maxlevels;
 
 	return cur;
 }
diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
index c8fea6a464d5..3a5a24648b87 100644
--- a/fs/xfs/libxfs/xfs_ialloc_btree.c
+++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
@@ -432,7 +432,8 @@ xfs_inobt_init_common(
 {
 	struct xfs_btree_cur	*cur;
 
-	cur = xfs_btree_alloc_cursor(mp, tp, btnum);
+	cur = xfs_btree_alloc_cursor(mp, tp, btnum,
+			M_IGEO(mp)->inobt_maxlevels);
 	if (btnum == XFS_BTNUM_INO) {
 		cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_ibt_2);
 		cur->bc_ops = &xfs_inobt_ops;
diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c
index 48c45e31d897..995b0d86ddc0 100644
--- a/fs/xfs/libxfs/xfs_refcount_btree.c
+++ b/fs/xfs/libxfs/xfs_refcount_btree.c
@@ -322,7 +322,8 @@ xfs_refcountbt_init_common(
 
 	ASSERT(pag->pag_agno < mp->m_sb.sb_agcount);
 
-	cur = xfs_btree_alloc_cursor(mp, tp, XFS_BTNUM_REFC);
+	cur = xfs_btree_alloc_cursor(mp, tp, XFS_BTNUM_REFC,
+			mp->m_refc_maxlevels);
 	cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_refcbt_2);
 
 	cur->bc_flags |= XFS_BTREE_CRC_BLOCKS;
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
index f3c4d0965cc9..1b48b7b3ee30 100644
--- a/fs/xfs/libxfs/xfs_rmap_btree.c
+++ b/fs/xfs/libxfs/xfs_rmap_btree.c
@@ -452,7 +452,8 @@ xfs_rmapbt_init_common(
 	struct xfs_btree_cur	*cur;
 
 	/* Overlapping btree; 2 keys per pointer. */
-	cur = xfs_btree_alloc_cursor(mp, tp, XFS_BTNUM_RMAP);
+	cur = xfs_btree_alloc_cursor(mp, tp, XFS_BTNUM_RMAP,
+			mp->m_rmap_maxlevels);
 	cur->bc_flags = XFS_BTREE_CRC_BLOCKS | XFS_BTREE_OVERLAPPING;
 	cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_rmap_2);
 	cur->bc_ops = &xfs_rmapbt_ops;
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 30bae0657343..90c92a6a49e0 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1966,8 +1966,8 @@ xfs_init_zones(void)
 		goto out_destroy_log_ticket_zone;
 
 	xfs_btree_cur_zone = kmem_cache_create("xfs_btree_cur",
-				xfs_btree_cur_sizeof(XFS_BTREE_MAXLEVELS),
-					       0, 0, NULL);
+			xfs_btree_cur_sizeof(XFS_BTREE_CUR_ZONE_MAXLEVELS),
+			0, 0, NULL);
 	if (!xfs_btree_cur_zone)
 		goto out_destroy_bmap_free_item_zone;
 


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 10/15] xfs: compute actual maximum btree height for critical reservation calculation
  2021-10-12 23:32 [PATCHSET v3 00/15] xfs: support dynamic btree cursor height Darrick J. Wong
                   ` (8 preceding siblings ...)
  2021-10-12 23:33 ` [PATCH 09/15] xfs: dynamically allocate cursors based on maxlevels Darrick J. Wong
@ 2021-10-12 23:33 ` Darrick J. Wong
  2021-10-13  5:49   ` Dave Chinner
  2021-10-12 23:33 ` [PATCH 11/15] xfs: compute the maximum height of the rmap btree when reflink enabled Darrick J. Wong
                   ` (4 subsequent siblings)
  14 siblings, 1 reply; 41+ messages in thread
From: Darrick J. Wong @ 2021-10-12 23:33 UTC (permalink / raw)
  To: djwong, david; +Cc: linux-xfs, chandan.babu, hch

From: Darrick J. Wong <djwong@kernel.org>

Compute the actual maximum btree height when deciding whether the
per-AG block reservation is critically low.  This only affects the
sanity check condition, since we /generally/ will trigger on the 10%
threshold.
This is a long-winded way of saying that we're removing one more
usage of XFS_BTREE_MAXLEVELS.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_ag_resv.c |   18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/libxfs/xfs_ag_resv.c b/fs/xfs/libxfs/xfs_ag_resv.c
index 2aa2b3484c28..d34d4614f175 100644
--- a/fs/xfs/libxfs/xfs_ag_resv.c
+++ b/fs/xfs/libxfs/xfs_ag_resv.c
@@ -60,6 +60,20 @@
  * to use the reservation system should update ask/used in xfs_ag_resv_init.
  */
 
+/* Compute maximum possible height for per-AG btree types for this fs. */
+static unsigned int
+xfs_ag_btree_maxlevels(
+	struct xfs_mount	*mp)
+{
+	unsigned int		ret = mp->m_ag_maxlevels;
+
+	ret = max(ret, mp->m_bm_maxlevels[XFS_DATA_FORK]);
+	ret = max(ret, mp->m_bm_maxlevels[XFS_ATTR_FORK]);
+	ret = max(ret, M_IGEO(mp)->inobt_maxlevels);
+	ret = max(ret, mp->m_rmap_maxlevels);
+	return max(ret, mp->m_refc_maxlevels);
+}
+
 /*
  * Are we critically low on blocks?  For now we'll define that as the number
  * of blocks we can get our hands on being less than 10% of what we reserved
@@ -72,6 +86,7 @@ xfs_ag_resv_critical(
 {
 	xfs_extlen_t			avail;
 	xfs_extlen_t			orig;
+	xfs_extlen_t			btree_maxlevels;
 
 	switch (type) {
 	case XFS_AG_RESV_METADATA:
@@ -91,7 +106,8 @@ xfs_ag_resv_critical(
 	trace_xfs_ag_resv_critical(pag, type, avail);
 
 	/* Critically low if less than 10% or max btree height remains. */
-	return XFS_TEST_ERROR(avail < orig / 10 || avail < XFS_BTREE_MAXLEVELS,
+	btree_maxlevels = xfs_ag_btree_maxlevels(pag->pag_mount);
+	return XFS_TEST_ERROR(avail < orig / 10 || avail < btree_maxlevels,
 			pag->pag_mount, XFS_ERRTAG_AG_RESV_CRITICAL);
 }
 


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 11/15] xfs: compute the maximum height of the rmap btree when reflink enabled
  2021-10-12 23:32 [PATCHSET v3 00/15] xfs: support dynamic btree cursor height Darrick J. Wong
                   ` (9 preceding siblings ...)
  2021-10-12 23:33 ` [PATCH 10/15] xfs: compute actual maximum btree height for critical reservation calculation Darrick J. Wong
@ 2021-10-12 23:33 ` Darrick J. Wong
  2021-10-13  7:25   ` Dave Chinner
  2021-10-12 23:33 ` [PATCH 12/15] xfs: kill XFS_BTREE_MAXLEVELS Darrick J. Wong
                   ` (3 subsequent siblings)
  14 siblings, 1 reply; 41+ messages in thread
From: Darrick J. Wong @ 2021-10-12 23:33 UTC (permalink / raw)
  To: djwong, david; +Cc: Chandan Babu R, linux-xfs, chandan.babu, hch

From: Darrick J. Wong <djwong@kernel.org>

Instead of assuming that the hardcoded XFS_BTREE_MAXLEVELS value is big
enough to handle the maximally tall rmap btree when all blocks are in
use and maximally shared, let's compute the maximum height assuming the
rmapbt consumes as many blocks as possible.
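
As a rough standalone illustration of the geometry involved, the
function below mirrors the xfs_btree_compute_maxlevels_size() helper
added in this patch; the block count and minrecs value in main() are
made-up examples rather than anything a real filesystem would compute.

#include <stdio.h>

/* maximum height of a btree allowed to consume up to max_btblocks blocks */
static unsigned int maxlevels_for_blocks(unsigned long long max_btblocks,
					 unsigned int node_minrecs)
{
	unsigned long long this_level_blocks = node_minrecs;
	unsigned long long blocks_left;
	unsigned int levels = 1;

	if (max_btblocks < 1)
		return 0;

	blocks_left = max_btblocks - 1;		/* one block for the root */
	while (this_level_blocks < blocks_left) {
		levels++;
		blocks_left -= this_level_blocks;
		this_level_blocks *= node_minrecs;
	}
	return levels;
}

int main(void)
{
	/* pretend a 2^20-block AG whose half-full nodes hold 84 pointers */
	printf("max height: %u\n", maxlevels_for_blocks(1ULL << 20, 84));
	return 0;
}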

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
---
 fs/xfs/libxfs/xfs_btree.c       |   34 ++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_btree.h       |    2 +
 fs/xfs/libxfs/xfs_rmap_btree.c  |   55 ++++++++++++++++++++++++---------------
 fs/xfs/libxfs/xfs_trans_resv.c  |   13 +++++++++
 fs/xfs/libxfs/xfs_trans_space.h |    7 +++++
 5 files changed, 90 insertions(+), 21 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index 6ced8f028d47..201b81d54622 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -4531,6 +4531,40 @@ xfs_btree_compute_maxlevels(
 	return level;
 }
 
+/*
+ * Compute the maximum height of a btree that is allowed to consume up to the
+ * given number of blocks.
+ */
+unsigned int
+xfs_btree_compute_maxlevels_size(
+	unsigned long long	max_btblocks,
+	unsigned int		leaf_mnr)
+{
+	unsigned long long	leaf_blocks = leaf_mnr;
+	unsigned long long	blocks_left;
+	unsigned int		maxlevels;
+
+	if (max_btblocks < 1)
+		return 0;
+
+	/*
+	 * The loop increments maxlevels as long as there would be enough
+	 * blocks left in the reservation to handle each node block at the
+	 * current level pointing to the minimum possible number of leaf blocks
+	 * at the next level down.  We start the loop assuming a single-level
+	 * btree consuming one block.
+	 */
+	maxlevels = 1;
+	blocks_left = max_btblocks - 1;
+	while (leaf_blocks < blocks_left) {
+		maxlevels++;
+		blocks_left -= leaf_blocks;
+		leaf_blocks *= leaf_mnr;
+	}
+
+	return maxlevels;
+}
+
 /*
  * Query a regular btree for all records overlapping a given interval.
  * Start with a LE lookup of the key of low_rec and return all records
diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index b8761a2fc24b..fccb374a8399 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -483,6 +483,8 @@ xfs_failaddr_t xfs_btree_lblock_verify(struct xfs_buf *bp,
 		unsigned int max_recs);
 
 uint xfs_btree_compute_maxlevels(uint *limits, unsigned long len);
+unsigned int xfs_btree_compute_maxlevels_size(unsigned long long max_btblocks,
+		unsigned int leaf_mnr);
 unsigned long long xfs_btree_calc_size(uint *limits, unsigned long long len);
 
 /*
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
index 1b48b7b3ee30..b1b55a6e7d25 100644
--- a/fs/xfs/libxfs/xfs_rmap_btree.c
+++ b/fs/xfs/libxfs/xfs_rmap_btree.c
@@ -538,28 +538,41 @@ xfs_rmapbt_maxrecs(
 /* Compute the maximum height of an rmap btree. */
 void
 xfs_rmapbt_compute_maxlevels(
-	struct xfs_mount		*mp)
+	struct xfs_mount	*mp)
 {
-	/*
-	 * On a non-reflink filesystem, the maximum number of rmap
-	 * records is the number of blocks in the AG, hence the max
-	 * rmapbt height is log_$maxrecs($agblocks).  However, with
-	 * reflink each AG block can have up to 2^32 (per the refcount
-	 * record format) owners, which means that theoretically we
-	 * could face up to 2^64 rmap records.
-	 *
-	 * That effectively means that the max rmapbt height must be
-	 * XFS_BTREE_MAXLEVELS.  "Fortunately" we'll run out of AG
-	 * blocks to feed the rmapbt long before the rmapbt reaches
-	 * maximum height.  The reflink code uses ag_resv_critical to
-	 * disallow reflinking when less than 10% of the per-AG metadata
-	 * block reservation since the fallback is a regular file copy.
-	 */
-	if (xfs_has_reflink(mp))
-		mp->m_rmap_maxlevels = XFS_BTREE_MAXLEVELS;
-	else
-		mp->m_rmap_maxlevels = xfs_btree_compute_maxlevels(
-				mp->m_rmap_mnr, mp->m_sb.sb_agblocks);
+	unsigned int		val;
+
+	if (!xfs_has_rmapbt(mp)) {
+		mp->m_rmap_maxlevels = 0;
+		return;
+	}
+
+	if (xfs_has_reflink(mp)) {
+		/*
+		 * Compute the asymptotic maxlevels for an rmap btree on a
+		 * filesystem that supports reflink.
+		 *
+		 * On a reflink filesystem, each AG block can have up to 2^32
+		 * (per the refcount record format) owners, which means that
+		 * theoretically we could face up to 2^64 rmap records.
+		 * However, we're likely to run out of blocks in the AG long
+		 * before that happens, which means that we must compute the
+		 * max height based on what the btree will look like if it
+		 * consumes almost all the blocks in the AG due to maximal
+		 * sharing factor.
+		 */
+		val = xfs_btree_compute_maxlevels_size(mp->m_sb.sb_agblocks,
+				mp->m_rmap_mnr[1]);
+	} else {
+		/*
+		 * If there's no block sharing, compute the maximum rmapbt
+		 * height assuming one rmap record per AG block.
+		 */
+		val = xfs_btree_compute_maxlevels(mp->m_rmap_mnr,
+				mp->m_sb.sb_agblocks);
+	}
+
+	mp->m_rmap_maxlevels = val;
 }
 
 /* Calculate the refcount btree size for some records. */
diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
index 5e300daa2559..97bd17d84a23 100644
--- a/fs/xfs/libxfs/xfs_trans_resv.c
+++ b/fs/xfs/libxfs/xfs_trans_resv.c
@@ -814,6 +814,16 @@ xfs_trans_resv_calc(
 	struct xfs_mount	*mp,
 	struct xfs_trans_resv	*resp)
 {
+	unsigned int		rmap_maxlevels = mp->m_rmap_maxlevels;
+
+	/*
+	 * In the early days of rmap+reflink, we always set the rmap maxlevels
+	 * to 9 even if the AG was small enough that it would never grow to
+	 * that height.
+	 */
+	if (xfs_has_rmapbt(mp) && xfs_has_reflink(mp))
+		mp->m_rmap_maxlevels = XFS_OLD_REFLINK_RMAP_MAXLEVELS;
+
 	/*
 	 * The following transactions are logged in physical format and
 	 * require a permanent reservation on space.
@@ -916,4 +926,7 @@ xfs_trans_resv_calc(
 	resp->tr_clearagi.tr_logres = xfs_calc_clear_agi_bucket_reservation(mp);
 	resp->tr_growrtzero.tr_logres = xfs_calc_growrtzero_reservation(mp);
 	resp->tr_growrtfree.tr_logres = xfs_calc_growrtfree_reservation(mp);
+
+	/* Put everything back the way it was.  This goes at the end. */
+	mp->m_rmap_maxlevels = rmap_maxlevels;
 }
diff --git a/fs/xfs/libxfs/xfs_trans_space.h b/fs/xfs/libxfs/xfs_trans_space.h
index 50332be34388..440c9c390b86 100644
--- a/fs/xfs/libxfs/xfs_trans_space.h
+++ b/fs/xfs/libxfs/xfs_trans_space.h
@@ -17,6 +17,13 @@
 /* Adding one rmap could split every level up to the top of the tree. */
 #define XFS_RMAPADD_SPACE_RES(mp) ((mp)->m_rmap_maxlevels)
 
+/*
+ * Note that we historically set m_rmap_maxlevels to 9 when reflink was
+ * enabled, so we must preserve this behavior to avoid changing the transaction
+ * space reservations.
+ */
+#define XFS_OLD_REFLINK_RMAP_MAXLEVELS	(9)
+
 /* Blocks we might need to add "b" rmaps to a tree. */
 #define XFS_NRMAPADD_SPACE_RES(mp, b)\
 	(((b + XFS_MAX_CONTIG_RMAPS_PER_BLOCK(mp) - 1) / \


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 12/15] xfs: kill XFS_BTREE_MAXLEVELS
  2021-10-12 23:32 [PATCHSET v3 00/15] xfs: support dynamic btree cursor height Darrick J. Wong
                   ` (10 preceding siblings ...)
  2021-10-12 23:33 ` [PATCH 11/15] xfs: compute the maximum height of the rmap btree when reflink enabled Darrick J. Wong
@ 2021-10-12 23:33 ` Darrick J. Wong
  2021-10-13  7:25   ` Dave Chinner
  2021-10-12 23:33 ` [PATCH 13/15] xfs: widen btree maxlevels computation to handle 64-bit record counts Darrick J. Wong
                   ` (2 subsequent siblings)
  14 siblings, 1 reply; 41+ messages in thread
From: Darrick J. Wong @ 2021-10-12 23:33 UTC (permalink / raw)
  To: djwong, david; +Cc: Chandan Babu R, linux-xfs, chandan.babu, hch

From: Darrick J. Wong <djwong@kernel.org>

Nobody uses this symbol anymore, so kill it.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
---
 fs/xfs/libxfs/xfs_btree.h |    2 --
 1 file changed, 2 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index fccb374a8399..d5f03550cec9 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -92,8 +92,6 @@ uint32_t xfs_btree_magic(int crc, xfs_btnum_t btnum);
 #define XFS_BTREE_STATS_ADD(cur, stat, val)	\
 	XFS_STATS_ADD_OFF((cur)->bc_mp, (cur)->bc_statoff + __XBTS_ ## stat, val)
 
-#define	XFS_BTREE_MAXLEVELS	9	/* max of all btrees */
-
 /*
  * The btree cursor zone hands out cursors that can handle up to this many
  * levels.  This is the known maximum for all btree types.


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 13/15] xfs: widen btree maxlevels computation to handle 64-bit record counts
  2021-10-12 23:32 [PATCHSET v3 00/15] xfs: support dynamic btree cursor height Darrick J. Wong
                   ` (11 preceding siblings ...)
  2021-10-12 23:33 ` [PATCH 12/15] xfs: kill XFS_BTREE_MAXLEVELS Darrick J. Wong
@ 2021-10-12 23:33 ` Darrick J. Wong
  2021-10-13  7:28   ` Dave Chinner
  2021-10-12 23:33 ` [PATCH 14/15] xfs: compute absolute maximum nlevels for each btree type Darrick J. Wong
  2021-10-12 23:33 ` [PATCH 15/15] xfs: use separate btree cursor cache " Darrick J. Wong
  14 siblings, 1 reply; 41+ messages in thread
From: Darrick J. Wong @ 2021-10-12 23:33 UTC (permalink / raw)
  To: djwong, david; +Cc: linux-xfs, chandan.babu, hch

From: Darrick J. Wong <djwong@kernel.org>

Rework xfs_btree_compute_maxlevels to handle larger record counts, since
we're about to add support for very large data forks.  Eventually the
realtime reverse mapping btree will need this too.
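
A small userspace sketch of the widened computation follows; howmany()
here stands in for the kernel's howmany_64(), and the minrecs limits
and record counts in main() are arbitrary example values.

#include <stdio.h>

#define howmany(x, y)	(((x) + ((y) - 1)) / (y))

static unsigned int compute_maxlevels(const unsigned int *limits,
				      unsigned long long len)
{
	unsigned long long	maxblocks;
	unsigned int		level;

	maxblocks = howmany(len, limits[0]);		   /* leaf blocks needed */
	for (level = 1; maxblocks > 1; level++)
		maxblocks = howmany(maxblocks, limits[1]); /* node blocks above */
	return level;
}

int main(void)
{
	unsigned int limits[2] = { 42, 84 };	/* leaf/node minrecs, made up */

	printf("2^31 records: %u levels\n", compute_maxlevels(limits, 1ULL << 31));
	printf("2^48 records: %u levels\n", compute_maxlevels(limits, 1ULL << 48));
	return 0;
}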

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_btree.c |   16 ++++++++--------
 fs/xfs/libxfs/xfs_btree.h |    3 ++-
 2 files changed, 10 insertions(+), 9 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index 201b81d54622..b95c817ad90d 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -4515,19 +4515,19 @@ xfs_btree_sblock_verify(
 
 /*
  * Calculate the number of btree levels needed to store a given number of
- * records in a short-format btree.
+ * records in btree blocks.  This does not include the inode root level.
  */
-uint
+unsigned int
 xfs_btree_compute_maxlevels(
-	uint			*limits,
-	unsigned long		len)
+	unsigned int		*limits,
+	unsigned long long	len)
 {
-	uint			level;
-	unsigned long		maxblocks;
+	unsigned int		level;
+	unsigned long long	maxblocks;
 
-	maxblocks = (len + limits[0] - 1) / limits[0];
+	maxblocks = howmany_64(len, limits[0]);
 	for (level = 1; maxblocks > 1; level++)
-		maxblocks = (maxblocks + limits[1] - 1) / limits[1];
+		maxblocks = howmany_64(maxblocks, limits[1]);
 	return level;
 }
 
diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index d5f03550cec9..20a2828c11ef 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -480,7 +480,8 @@ xfs_failaddr_t xfs_btree_lblock_v5hdr_verify(struct xfs_buf *bp,
 xfs_failaddr_t xfs_btree_lblock_verify(struct xfs_buf *bp,
 		unsigned int max_recs);
 
-uint xfs_btree_compute_maxlevels(uint *limits, unsigned long len);
+unsigned int xfs_btree_compute_maxlevels(unsigned int *limits,
+		unsigned long long len);
 unsigned int xfs_btree_compute_maxlevels_size(unsigned long long max_btblocks,
 		unsigned int leaf_mnr);
 unsigned long long xfs_btree_calc_size(uint *limits, unsigned long long len);


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 14/15] xfs: compute absolute maximum nlevels for each btree type
  2021-10-12 23:32 [PATCHSET v3 00/15] xfs: support dynamic btree cursor height Darrick J. Wong
                   ` (12 preceding siblings ...)
  2021-10-12 23:33 ` [PATCH 13/15] xfs: widen btree maxlevels computation to handle 64-bit record counts Darrick J. Wong
@ 2021-10-12 23:33 ` Darrick J. Wong
  2021-10-13  7:57   ` Dave Chinner
  2021-10-12 23:33 ` [PATCH 15/15] xfs: use separate btree cursor cache " Darrick J. Wong
  14 siblings, 1 reply; 41+ messages in thread
From: Darrick J. Wong @ 2021-10-12 23:33 UTC (permalink / raw)
  To: djwong, david; +Cc: linux-xfs, chandan.babu, hch

From: Darrick J. Wong <djwong@kernel.org>

Add code for all five btree types so that we can compute the absolute
maximum possible height of each one.  This sets up the next patch,
which gives every btree type its own cursor cache.

The functions are exported so that we can have xfs_db report the
absolute maximum btree heights for each btree type, rather than making
everyone run their own ad-hoc computations.
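
To give a feel for the arithmetic, here is a standalone sketch of the
worst-case computation for one btree type.  The 512-byte minimum block
size, the 16-byte short-format block header, and the bnobt
record/key/pointer sizes are assumptions used only for illustration;
the kernel code takes them from the on-disk format definitions.

#include <stdio.h>

/* minrecs is half of what fits in a block after the header */
static void absolute_minrecs(unsigned int *minrecs, unsigned int block_bytes,
			     unsigned int hdr_bytes, unsigned int leaf_recbytes,
			     unsigned int node_recbytes)
{
	unsigned int payload = block_bytes - hdr_bytes;

	minrecs[0] = payload / (2 * leaf_recbytes);
	minrecs[1] = payload / (2 * node_recbytes);
}

static unsigned int compute_maxlevels(const unsigned int *limits,
				      unsigned long long len)
{
	unsigned long long maxblocks = (len + limits[0] - 1) / limits[0];
	unsigned int level;

	for (level = 1; maxblocks > 1; level++)
		maxblocks = (maxblocks + limits[1] - 1) / limits[1];
	return level;
}

int main(void)
{
	unsigned long long max_ag_blocks = 1ULL << 31;	/* 1 TiB AG, 512b blocks */
	unsigned int minrecs[2];

	/* bnobt: 8-byte records, 8-byte keys plus 4-byte pointers (assumed) */
	absolute_minrecs(minrecs, 512, 16, 8, 12);
	printf("bnobt absolute max height: %u\n",
			compute_maxlevels(minrecs, (max_ag_blocks + 1) / 2));
	return 0;
}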

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_alloc.c          |    1 +
 fs/xfs/libxfs/xfs_alloc_btree.c    |   13 +++++++++++
 fs/xfs/libxfs/xfs_alloc_btree.h    |    2 ++
 fs/xfs/libxfs/xfs_bmap.c           |    1 +
 fs/xfs/libxfs/xfs_bmap_btree.c     |   14 ++++++++++++
 fs/xfs/libxfs/xfs_bmap_btree.h     |    2 ++
 fs/xfs/libxfs/xfs_btree.c          |   41 ++++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_btree.h          |    3 +++
 fs/xfs/libxfs/xfs_fs.h             |    2 ++
 fs/xfs/libxfs/xfs_ialloc.c         |    1 +
 fs/xfs/libxfs/xfs_ialloc_btree.c   |   19 +++++++++++++++++
 fs/xfs/libxfs/xfs_ialloc_btree.h   |    2 ++
 fs/xfs/libxfs/xfs_refcount_btree.c |   20 ++++++++++++++++++
 fs/xfs/libxfs/xfs_refcount_btree.h |    2 ++
 fs/xfs/libxfs/xfs_rmap_btree.c     |   27 ++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_rmap_btree.h     |    2 ++
 16 files changed, 152 insertions(+)


diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 55c5adc9b54e..7145416a230c 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -2198,6 +2198,7 @@ xfs_alloc_compute_maxlevels(
 {
 	mp->m_ag_maxlevels = xfs_btree_compute_maxlevels(mp->m_alloc_mnr,
 			(mp->m_sb.sb_agblocks + 1) / 2);
+	ASSERT(mp->m_ag_maxlevels <= xfs_allocbt_absolute_maxlevels());
 }
 
 /*
diff --git a/fs/xfs/libxfs/xfs_alloc_btree.c b/fs/xfs/libxfs/xfs_alloc_btree.c
index f14bad21503f..61f6d266b822 100644
--- a/fs/xfs/libxfs/xfs_alloc_btree.c
+++ b/fs/xfs/libxfs/xfs_alloc_btree.c
@@ -582,6 +582,19 @@ xfs_allocbt_maxrecs(
 	return blocklen / (sizeof(xfs_alloc_key_t) + sizeof(xfs_alloc_ptr_t));
 }
 
+/* Compute the max possible height of the maximally sized free space btree. */
+unsigned int
+xfs_allocbt_absolute_maxlevels(void)
+{
+	unsigned int		minrecs[2];
+
+	xfs_btree_absolute_minrecs(minrecs, 0, sizeof(xfs_alloc_rec_t),
+			sizeof(xfs_alloc_key_t) + sizeof(xfs_alloc_ptr_t));
+
+	return xfs_btree_compute_maxlevels(minrecs,
+			(XFS_MAX_AG_BLOCKS + 1) / 2);
+}
+
 /* Calculate the freespace btree size for some records. */
 xfs_extlen_t
 xfs_allocbt_calc_size(
diff --git a/fs/xfs/libxfs/xfs_alloc_btree.h b/fs/xfs/libxfs/xfs_alloc_btree.h
index 2f6b816aaf9f..c47d0e285435 100644
--- a/fs/xfs/libxfs/xfs_alloc_btree.h
+++ b/fs/xfs/libxfs/xfs_alloc_btree.h
@@ -60,4 +60,6 @@ extern xfs_extlen_t xfs_allocbt_calc_size(struct xfs_mount *mp,
 void xfs_allocbt_commit_staged_btree(struct xfs_btree_cur *cur,
 		struct xfs_trans *tp, struct xfs_buf *agbp);
 
+unsigned int xfs_allocbt_absolute_maxlevels(void);
+
 #endif	/* __XFS_ALLOC_BTREE_H__ */
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 2ae5bf9a74e7..7e70df8d1a9b 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -93,6 +93,7 @@ xfs_bmap_compute_maxlevels(
 			maxblocks = (maxblocks + minnoderecs - 1) / minnoderecs;
 	}
 	mp->m_bm_maxlevels[whichfork] = level;
+	ASSERT(mp->m_bm_maxlevels[whichfork] <= xfs_bmbt_absolute_maxlevels());
 }
 
 unsigned int
diff --git a/fs/xfs/libxfs/xfs_bmap_btree.c b/fs/xfs/libxfs/xfs_bmap_btree.c
index b90122de0df0..7001aff639d2 100644
--- a/fs/xfs/libxfs/xfs_bmap_btree.c
+++ b/fs/xfs/libxfs/xfs_bmap_btree.c
@@ -587,6 +587,20 @@ xfs_bmbt_maxrecs(
 	return blocklen / (sizeof(xfs_bmbt_key_t) + sizeof(xfs_bmbt_ptr_t));
 }
 
+/* Compute the max possible height of the maximally sized bmap btree. */
+unsigned int
+xfs_bmbt_absolute_maxlevels(void)
+{
+	unsigned int		minrecs[2];
+
+	xfs_btree_absolute_minrecs(minrecs, XFS_BTREE_LONG_PTRS,
+			sizeof(struct xfs_bmbt_rec),
+			sizeof(struct xfs_bmbt_key) +
+				sizeof(xfs_bmbt_ptr_t));
+
+	return xfs_btree_compute_maxlevels(minrecs, MAXEXTNUM) + 1;
+}
+
 /*
  * Calculate number of records in a bmap btree inode root.
  */
diff --git a/fs/xfs/libxfs/xfs_bmap_btree.h b/fs/xfs/libxfs/xfs_bmap_btree.h
index 729e3bc569be..e9218e92526b 100644
--- a/fs/xfs/libxfs/xfs_bmap_btree.h
+++ b/fs/xfs/libxfs/xfs_bmap_btree.h
@@ -110,4 +110,6 @@ extern struct xfs_btree_cur *xfs_bmbt_init_cursor(struct xfs_mount *,
 extern unsigned long long xfs_bmbt_calc_size(struct xfs_mount *mp,
 		unsigned long long len);
 
+unsigned int xfs_bmbt_absolute_maxlevels(void);
+
 #endif	/* __XFS_BMAP_BTREE_H__ */
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index b95c817ad90d..bea1bdf9b8b9 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -4964,3 +4964,44 @@ xfs_btree_has_more_records(
 	else
 		return block->bb_u.s.bb_rightsib != cpu_to_be32(NULLAGBLOCK);
 }
+
+/*
+ * Compute absolute minrecs for leaf and node btree blocks.  Callers should set
+ * BTREE_LONG_PTRS and BTREE_OVERLAPPING as they would for regular cursors.
+ * Set BTREE_CRC_BLOCKS if the btree type is supported /only/ on V5 or newer
+ * filesystems.
+ */
+void
+xfs_btree_absolute_minrecs(
+	unsigned int		*minrecs,
+	unsigned int		bc_flags,
+	unsigned int		leaf_recbytes,
+	unsigned int		node_recbytes)
+{
+	unsigned int		min_recbytes;
+
+	/*
+	 * If this btree type is supported on V4, we use the smaller V4 min
+	 * block size along with the V4 header size.  If the btree type is only
+	 * supported on V5, use the (twice as large) V5 min block size along
+	 * with the V5 header size.
+	 */
+	if (!(bc_flags & XFS_BTREE_CRC_BLOCKS)) {
+		if (bc_flags & XFS_BTREE_LONG_PTRS)
+			min_recbytes = XFS_MIN_BLOCKSIZE -
+							XFS_BTREE_LBLOCK_LEN;
+		else
+			min_recbytes = XFS_MIN_BLOCKSIZE -
+							XFS_BTREE_SBLOCK_LEN;
+	} else if (bc_flags & XFS_BTREE_LONG_PTRS) {
+		min_recbytes = XFS_MIN_CRC_BLOCKSIZE - XFS_BTREE_LBLOCK_CRC_LEN;
+	} else {
+		min_recbytes = XFS_MIN_CRC_BLOCKSIZE - XFS_BTREE_SBLOCK_CRC_LEN;
+	}
+
+	if (bc_flags & XFS_BTREE_OVERLAPPING)
+		node_recbytes <<= 1;
+
+	minrecs[0] = min_recbytes / (2 * leaf_recbytes);
+	minrecs[1] = min_recbytes / (2 * node_recbytes);
+}
diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index 20a2828c11ef..acb202839afd 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -601,4 +601,7 @@ xfs_btree_alloc_cursor(
 	return cur;
 }
 
+void xfs_btree_absolute_minrecs(unsigned int *minrecs, unsigned int bc_flags,
+		unsigned int leaf_recbytes, unsigned int node_recbytes);
+
 #endif	/* __XFS_BTREE_H__ */
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index bde2b4c64dbe..c43877c8a279 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -268,6 +268,8 @@ typedef struct xfs_fsop_resblks {
  */
 #define XFS_MIN_AG_BYTES	(1ULL << 24)	/* 16 MB */
 #define XFS_MAX_AG_BYTES	(1ULL << 40)	/* 1 TB */
+#define XFS_MAX_AG_BLOCKS	(XFS_MAX_AG_BYTES / XFS_MIN_BLOCKSIZE)
+#define XFS_MAX_CRC_AG_BLOCKS	(XFS_MAX_AG_BYTES / XFS_MIN_CRC_BLOCKSIZE)
 
 /* keep the maximum size under 2^31 by a small amount */
 #define XFS_MAX_LOG_BYTES \
diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 994ad783d407..017aebdda42f 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -2793,6 +2793,7 @@ xfs_ialloc_setup_geometry(
 	inodes = (1LL << XFS_INO_AGINO_BITS(mp)) >> XFS_INODES_PER_CHUNK_LOG;
 	igeo->inobt_maxlevels = xfs_btree_compute_maxlevels(igeo->inobt_mnr,
 			inodes);
+	ASSERT(igeo->inobt_maxlevels <= xfs_inobt_absolute_maxlevels());
 
 	/*
 	 * Set the maximum inode count for this filesystem, being careful not
diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
index 3a5a24648b87..2e3dd1d798bd 100644
--- a/fs/xfs/libxfs/xfs_ialloc_btree.c
+++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
@@ -542,6 +542,25 @@ xfs_inobt_maxrecs(
 	return blocklen / (sizeof(xfs_inobt_key_t) + sizeof(xfs_inobt_ptr_t));
 }
 
+/* Compute the max possible height of the maximally sized inode btree. */
+unsigned int
+xfs_inobt_absolute_maxlevels(void)
+{
+	unsigned int		minrecs[2];
+	unsigned long long	max_ag_inodes;
+
+	/*
+	 * For the absolute maximum, pretend that we can fill an entire AG
+	 * completely full of inodes except for the AG headers.
+	 */
+	max_ag_inodes = (XFS_MAX_AG_BYTES - (4 * BBSIZE)) / XFS_DINODE_MIN_SIZE;
+
+	xfs_btree_absolute_minrecs(minrecs, 0, sizeof(xfs_inobt_rec_t),
+			sizeof(xfs_inobt_key_t) + sizeof(xfs_inobt_ptr_t));
+
+	return xfs_btree_compute_maxlevels(minrecs, max_ag_inodes);
+}
+
 /*
  * Convert the inode record holemask to an inode allocation bitmap. The inode
  * allocation bitmap is inode granularity and specifies whether an inode is
diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.h b/fs/xfs/libxfs/xfs_ialloc_btree.h
index 8a322d402e61..1f09530bf856 100644
--- a/fs/xfs/libxfs/xfs_ialloc_btree.h
+++ b/fs/xfs/libxfs/xfs_ialloc_btree.h
@@ -75,4 +75,6 @@ int xfs_inobt_cur(struct xfs_mount *mp, struct xfs_trans *tp,
 void xfs_inobt_commit_staged_btree(struct xfs_btree_cur *cur,
 		struct xfs_trans *tp, struct xfs_buf *agbp);
 
+unsigned int xfs_inobt_absolute_maxlevels(void);
+
 #endif	/* __XFS_IALLOC_BTREE_H__ */
diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c
index 995b0d86ddc0..bacd1b442b09 100644
--- a/fs/xfs/libxfs/xfs_refcount_btree.c
+++ b/fs/xfs/libxfs/xfs_refcount_btree.c
@@ -409,13 +409,33 @@ xfs_refcountbt_maxrecs(
 			   sizeof(xfs_refcount_ptr_t));
 }
 
+/* Compute the max possible height of the maximally sized refcount btree. */
+unsigned int
+xfs_refcountbt_absolute_maxlevels(void)
+{
+	unsigned int		minrecs[2];
+
+	xfs_btree_absolute_minrecs(minrecs, XFS_BTREE_CRC_BLOCKS,
+			sizeof(struct xfs_refcount_rec),
+			sizeof(struct xfs_refcount_key) +
+						sizeof(xfs_refcount_ptr_t));
+
+	return xfs_btree_compute_maxlevels(minrecs, XFS_MAX_CRC_AG_BLOCKS);
+}
+
 /* Compute the maximum height of a refcount btree. */
 void
 xfs_refcountbt_compute_maxlevels(
 	struct xfs_mount		*mp)
 {
+	if (!xfs_has_reflink(mp)) {
+		mp->m_refc_maxlevels = 0;
+		return;
+	}
+
 	mp->m_refc_maxlevels = xfs_btree_compute_maxlevels(
 			mp->m_refc_mnr, mp->m_sb.sb_agblocks);
+	ASSERT(mp->m_refc_maxlevels <= xfs_refcountbt_absolute_maxlevels());
 }
 
 /* Calculate the refcount btree size for some records. */
diff --git a/fs/xfs/libxfs/xfs_refcount_btree.h b/fs/xfs/libxfs/xfs_refcount_btree.h
index bd9ed9e1e41f..2625b08f50a8 100644
--- a/fs/xfs/libxfs/xfs_refcount_btree.h
+++ b/fs/xfs/libxfs/xfs_refcount_btree.h
@@ -65,4 +65,6 @@ extern int xfs_refcountbt_calc_reserves(struct xfs_mount *mp,
 void xfs_refcountbt_commit_staged_btree(struct xfs_btree_cur *cur,
 		struct xfs_trans *tp, struct xfs_buf *agbp);
 
+unsigned int xfs_refcountbt_absolute_maxlevels(void);
+
 #endif	/* __XFS_REFCOUNT_BTREE_H__ */
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
index b1b55a6e7d25..860627b5ec08 100644
--- a/fs/xfs/libxfs/xfs_rmap_btree.c
+++ b/fs/xfs/libxfs/xfs_rmap_btree.c
@@ -535,6 +535,32 @@ xfs_rmapbt_maxrecs(
 		(2 * sizeof(struct xfs_rmap_key) + sizeof(xfs_rmap_ptr_t));
 }
 
+/* Compute the max possible height of the maximally sized rmap btree. */
+unsigned int
+xfs_rmapbt_absolute_maxlevels(void)
+{
+	unsigned int		minrecs[2];
+
+	xfs_btree_absolute_minrecs(minrecs,
+			XFS_BTREE_CRC_BLOCKS | XFS_BTREE_OVERLAPPING,
+			sizeof(struct xfs_rmap_rec),
+			sizeof(struct xfs_rmap_key) + sizeof(xfs_rmap_ptr_t));
+
+	/*
+	 * Compute the asymptotic maxlevels for an rmapbt on any reflink fs.
+	 *
+	 * On a reflink filesystem, each AG block can have up to 2^32 (per the
+	 * refcount record format) owners, which means that theoretically we
+	 * could face up to 2^64 rmap records.  However, we're likely to run
+	 * out of blocks in the AG long before that happens, which means that
+	 * we must compute the max height based on what the btree will look
+	 * like if it consumes almost all the blocks in the AG due to maximal
+	 * sharing factor.
+	 */
+	return xfs_btree_compute_maxlevels_size(XFS_MAX_CRC_AG_BLOCKS,
+			minrecs[1]);
+}
+
 /* Compute the maximum height of an rmap btree. */
 void
 xfs_rmapbt_compute_maxlevels(
@@ -573,6 +599,7 @@ xfs_rmapbt_compute_maxlevels(
 	}
 
 	mp->m_rmap_maxlevels = val;
+	ASSERT(mp->m_rmap_maxlevels <= xfs_rmapbt_absolute_maxlevels());
 }
 
 /* Calculate the refcount btree size for some records. */
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.h b/fs/xfs/libxfs/xfs_rmap_btree.h
index f2eee6572af4..84fe74de923f 100644
--- a/fs/xfs/libxfs/xfs_rmap_btree.h
+++ b/fs/xfs/libxfs/xfs_rmap_btree.h
@@ -59,4 +59,6 @@ extern xfs_extlen_t xfs_rmapbt_max_size(struct xfs_mount *mp,
 extern int xfs_rmapbt_calc_reserves(struct xfs_mount *mp, struct xfs_trans *tp,
 		struct xfs_perag *pag, xfs_extlen_t *ask, xfs_extlen_t *used);
 
+unsigned int xfs_rmapbt_absolute_maxlevels(void);
+
 #endif /* __XFS_RMAP_BTREE_H__ */


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 15/15] xfs: use separate btree cursor cache for each btree type
  2021-10-12 23:32 [PATCHSET v3 00/15] xfs: support dynamic btree cursor height Darrick J. Wong
                   ` (13 preceding siblings ...)
  2021-10-12 23:33 ` [PATCH 14/15] xfs: compute absolute maximum nlevels for each btree type Darrick J. Wong
@ 2021-10-12 23:33 ` Darrick J. Wong
  2021-10-13  8:01   ` Dave Chinner
  14 siblings, 1 reply; 41+ messages in thread
From: Darrick J. Wong @ 2021-10-12 23:33 UTC (permalink / raw)
  To: djwong, david; +Cc: linux-xfs, chandan.babu, hch

From: Darrick J. Wong <djwong@kernel.org>

Now that we have the infrastructure to track the max possible height of
each btree type, we can create a separate slab cache for cursors of each
type of btree.  For smaller indices like the free space btrees, this
means that we can pack more cursors into a slab page, improving slab
utilization.
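
A rough back-of-the-envelope check of that claim is below.  The cursor
sizes are examples only, and real slab packing also depends on object
alignment and debugging options.

#include <stdio.h>

int main(void)
{
	const unsigned int page = 4096;
	const unsigned int sizes[] = { 200, 216, 248 };	/* example cursor sizes */
	unsigned int i;

	for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
		printf("%3u-byte cursors: %u per 4KiB slab page\n",
				sizes[i], page / sizes[i]);
	return 0;
}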

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_alloc_btree.c    |   21 ++++++++++++++
 fs/xfs/libxfs/xfs_alloc_btree.h    |    3 ++
 fs/xfs/libxfs/xfs_bmap_btree.c     |   21 ++++++++++++++
 fs/xfs/libxfs/xfs_bmap_btree.h     |    3 ++
 fs/xfs/libxfs/xfs_btree.c          |    7 +----
 fs/xfs/libxfs/xfs_btree.h          |   17 +++---------
 fs/xfs/libxfs/xfs_ialloc_btree.c   |   21 ++++++++++++++
 fs/xfs/libxfs/xfs_ialloc_btree.h   |    3 ++
 fs/xfs/libxfs/xfs_refcount_btree.c |   21 ++++++++++++++
 fs/xfs/libxfs/xfs_refcount_btree.h |    3 ++
 fs/xfs/libxfs/xfs_rmap_btree.c     |   21 ++++++++++++++
 fs/xfs/libxfs/xfs_rmap_btree.h     |    3 ++
 fs/xfs/xfs_super.c                 |   53 ++++++++++++++++++++++++++++++++----
 13 files changed, 168 insertions(+), 29 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_alloc_btree.c b/fs/xfs/libxfs/xfs_alloc_btree.c
index 61f6d266b822..4c5942146b05 100644
--- a/fs/xfs/libxfs/xfs_alloc_btree.c
+++ b/fs/xfs/libxfs/xfs_alloc_btree.c
@@ -20,6 +20,7 @@
 #include "xfs_trans.h"
 #include "xfs_ag.h"
 
+static kmem_zone_t	*xfs_allocbt_cur_cache;
 
 STATIC struct xfs_btree_cur *
 xfs_allocbt_dup_cursor(
@@ -477,7 +478,8 @@ xfs_allocbt_init_common(
 
 	ASSERT(btnum == XFS_BTNUM_BNO || btnum == XFS_BTNUM_CNT);
 
-	cur = xfs_btree_alloc_cursor(mp, tp, btnum, mp->m_ag_maxlevels);
+	cur = xfs_btree_alloc_cursor(mp, tp, btnum, mp->m_ag_maxlevels,
+			xfs_allocbt_cur_cache);
 	cur->bc_ag.abt.active = false;
 
 	if (btnum == XFS_BTNUM_CNT) {
@@ -603,3 +605,20 @@ xfs_allocbt_calc_size(
 {
 	return xfs_btree_calc_size(mp->m_alloc_mnr, len);
 }
+
+int __init
+xfs_allocbt_init_cur_cache(void)
+{
+	xfs_allocbt_cur_cache = kmem_cache_create("xfs_bnobt_cur",
+			xfs_btree_cur_sizeof(xfs_allocbt_absolute_maxlevels()),
+			0, 0, NULL);
+
+	return xfs_allocbt_cur_cache != NULL ? 0 : -ENOMEM;
+}
+
+void
+xfs_allocbt_destroy_cur_cache(void)
+{
+	kmem_cache_destroy(xfs_allocbt_cur_cache);
+	xfs_allocbt_cur_cache = NULL;
+}
diff --git a/fs/xfs/libxfs/xfs_alloc_btree.h b/fs/xfs/libxfs/xfs_alloc_btree.h
index c47d0e285435..82a9b3201f91 100644
--- a/fs/xfs/libxfs/xfs_alloc_btree.h
+++ b/fs/xfs/libxfs/xfs_alloc_btree.h
@@ -62,4 +62,7 @@ void xfs_allocbt_commit_staged_btree(struct xfs_btree_cur *cur,
 
 unsigned int xfs_allocbt_absolute_maxlevels(void);
 
+int __init xfs_allocbt_init_cur_cache(void);
+void xfs_allocbt_destroy_cur_cache(void);
+
 #endif	/* __XFS_ALLOC_BTREE_H__ */
diff --git a/fs/xfs/libxfs/xfs_bmap_btree.c b/fs/xfs/libxfs/xfs_bmap_btree.c
index 7001aff639d2..99261d51d2c3 100644
--- a/fs/xfs/libxfs/xfs_bmap_btree.c
+++ b/fs/xfs/libxfs/xfs_bmap_btree.c
@@ -22,6 +22,8 @@
 #include "xfs_trace.h"
 #include "xfs_rmap.h"
 
+static kmem_zone_t	*xfs_bmbt_cur_cache;
+
 /*
  * Convert on-disk form of btree root to in-memory form.
  */
@@ -553,7 +555,7 @@ xfs_bmbt_init_cursor(
 	ASSERT(whichfork != XFS_COW_FORK);
 
 	cur = xfs_btree_alloc_cursor(mp, tp, XFS_BTNUM_BMAP,
-			mp->m_bm_maxlevels[whichfork]);
+			mp->m_bm_maxlevels[whichfork], xfs_bmbt_cur_cache);
 	cur->bc_nlevels = be16_to_cpu(ifp->if_broot->bb_level) + 1;
 	cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_bmbt_2);
 
@@ -664,3 +666,20 @@ xfs_bmbt_calc_size(
 {
 	return xfs_btree_calc_size(mp->m_bmap_dmnr, len);
 }
+
+int __init
+xfs_bmbt_init_cur_cache(void)
+{
+	xfs_bmbt_cur_cache = kmem_cache_create("xfs_bmbt_cur",
+			xfs_btree_cur_sizeof(xfs_bmbt_absolute_maxlevels()),
+			0, 0, NULL);
+
+	return xfs_bmbt_cur_cache != NULL ? 0 : -ENOMEM;
+}
+
+void
+xfs_bmbt_destroy_cur_cache(void)
+{
+	kmem_cache_destroy(xfs_bmbt_cur_cache);
+	xfs_bmbt_cur_cache = NULL;
+}
diff --git a/fs/xfs/libxfs/xfs_bmap_btree.h b/fs/xfs/libxfs/xfs_bmap_btree.h
index e9218e92526b..4c752f7341df 100644
--- a/fs/xfs/libxfs/xfs_bmap_btree.h
+++ b/fs/xfs/libxfs/xfs_bmap_btree.h
@@ -112,4 +112,7 @@ extern unsigned long long xfs_bmbt_calc_size(struct xfs_mount *mp,
 
 unsigned int xfs_bmbt_absolute_maxlevels(void);
 
+int __init xfs_bmbt_init_cur_cache(void);
+void xfs_bmbt_destroy_cur_cache(void);
+
 #endif	/* __XFS_BMAP_BTREE_H__ */
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index bea1bdf9b8b9..11ff814996a1 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -23,11 +23,6 @@
 #include "xfs_btree_staging.h"
 #include "xfs_ag.h"
 
-/*
- * Cursor allocation zone.
- */
-kmem_zone_t	*xfs_btree_cur_zone;
-
 /*
  * Btree magic numbers.
  */
@@ -379,7 +374,7 @@ xfs_btree_del_cursor(
 		kmem_free(cur->bc_ops);
 	if (!(cur->bc_flags & XFS_BTREE_LONG_PTRS) && cur->bc_ag.pag)
 		xfs_perag_put(cur->bc_ag.pag);
-	kmem_cache_free(xfs_btree_cur_zone, cur);
+	kmem_cache_free(cur->bc_cache, cur);
 }
 
 /*
diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index acb202839afd..6d61ce1559e2 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -13,8 +13,6 @@ struct xfs_trans;
 struct xfs_ifork;
 struct xfs_perag;
 
-extern kmem_zone_t	*xfs_btree_cur_zone;
-
 /*
  * Generic key, ptr and record wrapper structures.
  *
@@ -92,12 +90,6 @@ uint32_t xfs_btree_magic(int crc, xfs_btnum_t btnum);
 #define XFS_BTREE_STATS_ADD(cur, stat, val)	\
 	XFS_STATS_ADD_OFF((cur)->bc_mp, (cur)->bc_statoff + __XBTS_ ## stat, val)
 
-/*
- * The btree cursor zone hands out cursors that can handle up to this many
- * levels.  This is the known maximum for all btree types.
- */
-#define XFS_BTREE_CUR_ZONE_MAXLEVELS	(9)
-
 struct xfs_btree_ops {
 	/* size of the key and record structures */
 	size_t	key_len;
@@ -238,6 +230,7 @@ struct xfs_btree_cur
 	struct xfs_trans	*bc_tp;	/* transaction we're in, if any */
 	struct xfs_mount	*bc_mp;	/* file system mount struct */
 	const struct xfs_btree_ops *bc_ops;
+	kmem_zone_t		*bc_cache; /* cursor cache */
 	unsigned int		bc_flags; /* btree features - below */
 	xfs_btnum_t		bc_btnum; /* identifies which btree type */
 	union xfs_btree_irec	bc_rec;	/* current insert/search record value */
@@ -586,17 +579,17 @@ xfs_btree_alloc_cursor(
 	struct xfs_mount	*mp,
 	struct xfs_trans	*tp,
 	xfs_btnum_t		btnum,
-	uint8_t			maxlevels)
+	uint8_t			maxlevels,
+	kmem_zone_t		*cache)
 {
 	struct xfs_btree_cur	*cur;
 
-	ASSERT(maxlevels <= XFS_BTREE_CUR_ZONE_MAXLEVELS);
-
-	cur = kmem_cache_zalloc(xfs_btree_cur_zone, GFP_NOFS | __GFP_NOFAIL);
+	cur = kmem_cache_zalloc(cache, GFP_NOFS | __GFP_NOFAIL);
 	cur->bc_tp = tp;
 	cur->bc_mp = mp;
 	cur->bc_btnum = btnum;
 	cur->bc_maxlevels = maxlevels;
+	cur->bc_cache = cache;
 
 	return cur;
 }
diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
index 2e3dd1d798bd..2502085d476c 100644
--- a/fs/xfs/libxfs/xfs_ialloc_btree.c
+++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
@@ -22,6 +22,8 @@
 #include "xfs_rmap.h"
 #include "xfs_ag.h"
 
+static kmem_zone_t	*xfs_inobt_cur_cache;
+
 STATIC int
 xfs_inobt_get_minrecs(
 	struct xfs_btree_cur	*cur,
@@ -433,7 +435,7 @@ xfs_inobt_init_common(
 	struct xfs_btree_cur	*cur;
 
 	cur = xfs_btree_alloc_cursor(mp, tp, btnum,
-			M_IGEO(mp)->inobt_maxlevels);
+			M_IGEO(mp)->inobt_maxlevels, xfs_inobt_cur_cache);
 	if (btnum == XFS_BTNUM_INO) {
 		cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_ibt_2);
 		cur->bc_ops = &xfs_inobt_ops;
@@ -776,3 +778,20 @@ xfs_iallocbt_calc_size(
 {
 	return xfs_btree_calc_size(M_IGEO(mp)->inobt_mnr, len);
 }
+
+int __init
+xfs_inobt_init_cur_cache(void)
+{
+	xfs_inobt_cur_cache = kmem_cache_create("xfs_inobt_cur",
+			xfs_btree_cur_sizeof(xfs_inobt_absolute_maxlevels()),
+			0, 0, NULL);
+
+	return xfs_inobt_cur_cache != NULL ? 0 : -ENOMEM;
+}
+
+void
+xfs_inobt_destroy_cur_cache(void)
+{
+	kmem_cache_destroy(xfs_inobt_cur_cache);
+	xfs_inobt_cur_cache = NULL;
+}
diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.h b/fs/xfs/libxfs/xfs_ialloc_btree.h
index 1f09530bf856..b384733d5e0f 100644
--- a/fs/xfs/libxfs/xfs_ialloc_btree.h
+++ b/fs/xfs/libxfs/xfs_ialloc_btree.h
@@ -77,4 +77,7 @@ void xfs_inobt_commit_staged_btree(struct xfs_btree_cur *cur,
 
 unsigned int xfs_inobt_absolute_maxlevels(void);
 
+int __init xfs_inobt_init_cur_cache(void);
+void xfs_inobt_destroy_cur_cache(void);
+
 #endif	/* __XFS_IALLOC_BTREE_H__ */
diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c
index bacd1b442b09..ba27a3ea2ce2 100644
--- a/fs/xfs/libxfs/xfs_refcount_btree.c
+++ b/fs/xfs/libxfs/xfs_refcount_btree.c
@@ -21,6 +21,8 @@
 #include "xfs_rmap.h"
 #include "xfs_ag.h"
 
+static kmem_zone_t	*xfs_refcountbt_cur_cache;
+
 static struct xfs_btree_cur *
 xfs_refcountbt_dup_cursor(
 	struct xfs_btree_cur	*cur)
@@ -323,7 +325,7 @@ xfs_refcountbt_init_common(
 	ASSERT(pag->pag_agno < mp->m_sb.sb_agcount);
 
 	cur = xfs_btree_alloc_cursor(mp, tp, XFS_BTNUM_REFC,
-			mp->m_refc_maxlevels);
+			mp->m_refc_maxlevels, xfs_refcountbt_cur_cache);
 	cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_refcbt_2);
 
 	cur->bc_flags |= XFS_BTREE_CRC_BLOCKS;
@@ -505,3 +507,20 @@ xfs_refcountbt_calc_reserves(
 
 	return error;
 }
+
+int __init
+xfs_refcountbt_init_cur_cache(void)
+{
+	xfs_refcountbt_cur_cache = kmem_cache_create("xfs_refcbt_cur",
+			xfs_btree_cur_sizeof(xfs_refcountbt_absolute_maxlevels()),
+			0, 0, NULL);
+
+	return xfs_refcountbt_cur_cache != NULL ? 0 : -ENOMEM;
+}
+
+void
+xfs_refcountbt_destroy_cur_cache(void)
+{
+	kmem_cache_destroy(xfs_refcountbt_cur_cache);
+	xfs_refcountbt_cur_cache = NULL;
+}
diff --git a/fs/xfs/libxfs/xfs_refcount_btree.h b/fs/xfs/libxfs/xfs_refcount_btree.h
index 2625b08f50a8..a1437d0a5717 100644
--- a/fs/xfs/libxfs/xfs_refcount_btree.h
+++ b/fs/xfs/libxfs/xfs_refcount_btree.h
@@ -67,4 +67,7 @@ void xfs_refcountbt_commit_staged_btree(struct xfs_btree_cur *cur,
 
 unsigned int xfs_refcountbt_absolute_maxlevels(void);
 
+int __init xfs_refcountbt_init_cur_cache(void);
+void xfs_refcountbt_destroy_cur_cache(void);
+
 #endif	/* __XFS_REFCOUNT_BTREE_H__ */
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
index 860627b5ec08..0a9bc37c01d0 100644
--- a/fs/xfs/libxfs/xfs_rmap_btree.c
+++ b/fs/xfs/libxfs/xfs_rmap_btree.c
@@ -22,6 +22,8 @@
 #include "xfs_ag.h"
 #include "xfs_ag_resv.h"
 
+static kmem_zone_t	*xfs_rmapbt_cur_cache;
+
 /*
  * Reverse map btree.
  *
@@ -453,7 +455,7 @@ xfs_rmapbt_init_common(
 
 	/* Overlapping btree; 2 keys per pointer. */
 	cur = xfs_btree_alloc_cursor(mp, tp, XFS_BTNUM_RMAP,
-			mp->m_rmap_maxlevels);
+			mp->m_rmap_maxlevels, xfs_rmapbt_cur_cache);
 	cur->bc_flags = XFS_BTREE_CRC_BLOCKS | XFS_BTREE_OVERLAPPING;
 	cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_rmap_2);
 	cur->bc_ops = &xfs_rmapbt_ops;
@@ -670,3 +672,20 @@ xfs_rmapbt_calc_reserves(
 
 	return error;
 }
+
+int __init
+xfs_rmapbt_init_cur_cache(void)
+{
+	xfs_rmapbt_cur_cache = kmem_cache_create("xfs_rmapbt_cur",
+			xfs_btree_cur_sizeof(xfs_rmapbt_absolute_maxlevels()),
+			0, 0, NULL);
+
+	return xfs_rmapbt_cur_cache != NULL ? 0 : -ENOMEM;
+}
+
+void
+xfs_rmapbt_destroy_cur_cache(void)
+{
+	kmem_cache_destroy(xfs_rmapbt_cur_cache);
+	xfs_rmapbt_cur_cache = NULL;
+}
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.h b/fs/xfs/libxfs/xfs_rmap_btree.h
index 84fe74de923f..dd5422850656 100644
--- a/fs/xfs/libxfs/xfs_rmap_btree.h
+++ b/fs/xfs/libxfs/xfs_rmap_btree.h
@@ -61,4 +61,7 @@ extern int xfs_rmapbt_calc_reserves(struct xfs_mount *mp, struct xfs_trans *tp,
 
 unsigned int xfs_rmapbt_absolute_maxlevels(void);
 
+int __init xfs_rmapbt_init_cur_cache(void);
+void xfs_rmapbt_destroy_cur_cache(void);
+
 #endif /* __XFS_RMAP_BTREE_H__ */
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 90c92a6a49e0..399d7cfc7d4b 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -37,6 +37,13 @@
 #include "xfs_reflink.h"
 #include "xfs_pwork.h"
 #include "xfs_ag.h"
+#include "xfs_btree.h"
+#include "xfs_alloc_btree.h"
+#include "xfs_ialloc_btree.h"
+#include "xfs_bmap_btree.h"
+#include "xfs_rmap_btree.h"
+#include "xfs_refcount_btree.h"
+
 
 #include <linux/magic.h>
 #include <linux/fs_context.h>
@@ -1950,9 +1957,45 @@ static struct file_system_type xfs_fs_type = {
 };
 MODULE_ALIAS_FS("xfs");
 
+STATIC int __init
+xfs_init_btree_cur_caches(void)
+{
+	int				error;
+
+	error = xfs_allocbt_init_cur_cache();
+	if (error)
+		return error;
+	error = xfs_inobt_init_cur_cache();
+	if (error)
+		return error;
+	error = xfs_bmbt_init_cur_cache();
+	if (error)
+		return error;
+	error = xfs_rmapbt_init_cur_cache();
+	if (error)
+		return error;
+	error = xfs_refcountbt_init_cur_cache();
+	if (error)
+		return error;
+
+	return 0;
+}
+
+STATIC void
+xfs_destroy_btree_cur_caches(void)
+{
+	xfs_allocbt_destroy_cur_cache();
+	xfs_inobt_destroy_cur_cache();
+	xfs_bmbt_destroy_cur_cache();
+	xfs_rmapbt_destroy_cur_cache();
+	xfs_refcountbt_destroy_cur_cache();
+}
+
 STATIC int __init
 xfs_init_zones(void)
 {
+	int		error;
+
 	xfs_log_ticket_zone = kmem_cache_create("xfs_log_ticket",
 						sizeof(struct xlog_ticket),
 						0, 0, NULL);
@@ -1965,10 +2008,8 @@ xfs_init_zones(void)
 	if (!xfs_bmap_free_item_zone)
 		goto out_destroy_log_ticket_zone;
 
-	xfs_btree_cur_zone = kmem_cache_create("xfs_btree_cur",
-			xfs_btree_cur_sizeof(XFS_BTREE_CUR_ZONE_MAXLEVELS),
-			0, 0, NULL);
-	if (!xfs_btree_cur_zone)
+	error = xfs_init_btree_cur_caches();
+	if (error)
 		goto out_destroy_bmap_free_item_zone;
 
 	xfs_da_state_zone = kmem_cache_create("xfs_da_state",
@@ -2106,7 +2147,7 @@ xfs_init_zones(void)
  out_destroy_da_state_zone:
 	kmem_cache_destroy(xfs_da_state_zone);
  out_destroy_btree_cur_zone:
-	kmem_cache_destroy(xfs_btree_cur_zone);
+	xfs_destroy_btree_cur_caches();
  out_destroy_bmap_free_item_zone:
 	kmem_cache_destroy(xfs_bmap_free_item_zone);
  out_destroy_log_ticket_zone:
@@ -2138,7 +2179,7 @@ xfs_destroy_zones(void)
 	kmem_cache_destroy(xfs_trans_zone);
 	kmem_cache_destroy(xfs_ifork_zone);
 	kmem_cache_destroy(xfs_da_state_zone);
-	kmem_cache_destroy(xfs_btree_cur_zone);
+	xfs_destroy_btree_cur_caches();
 	kmem_cache_destroy(xfs_bmap_free_item_zone);
 	kmem_cache_destroy(xfs_log_ticket_zone);
 }


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH 01/15] xfs: remove xfs_btree_cur.bc_blocklog
  2021-10-12 23:32 ` [PATCH 01/15] xfs: remove xfs_btree_cur.bc_blocklog Darrick J. Wong
@ 2021-10-13  0:56   ` Dave Chinner
  0 siblings, 0 replies; 41+ messages in thread
From: Dave Chinner @ 2021-10-13  0:56 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, chandan.babu, hch

On Tue, Oct 12, 2021 at 04:32:39PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> This field isn't used by anyone, so get rid of it.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/libxfs/xfs_alloc_btree.c    |    1 -
>  fs/xfs/libxfs/xfs_bmap_btree.c     |    1 -
>  fs/xfs/libxfs/xfs_btree.h          |    1 -
>  fs/xfs/libxfs/xfs_ialloc_btree.c   |    2 --
>  fs/xfs/libxfs/xfs_refcount_btree.c |    1 -
>  fs/xfs/libxfs/xfs_rmap_btree.c     |    1 -
>  6 files changed, 7 deletions(-)

LGTM.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 02/15] xfs: reduce the size of nr_ops for refcount btree cursors
  2021-10-12 23:32 ` [PATCH 02/15] xfs: reduce the size of nr_ops for refcount btree cursors Darrick J. Wong
@ 2021-10-13  0:57   ` Dave Chinner
  0 siblings, 0 replies; 41+ messages in thread
From: Dave Chinner @ 2021-10-13  0:57 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, chandan.babu, hch

On Tue, Oct 12, 2021 at 04:32:44PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> We're never going to run more than 4 billion btree operations on a
> refcount cursor, so shrink the field to an unsigned int to reduce the
> structure size.  Fix whitespace alignment too.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/libxfs/xfs_btree.h |    8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
> index 49ecc496238f..1018bcc43d66 100644
> --- a/fs/xfs/libxfs/xfs_btree.h
> +++ b/fs/xfs/libxfs/xfs_btree.h
> @@ -181,18 +181,18 @@ union xfs_btree_irec {
>  
>  /* Per-AG btree information. */
>  struct xfs_btree_cur_ag {
> -	struct xfs_perag	*pag;
> +	struct xfs_perag		*pag;
>  	union {
>  		struct xfs_buf		*agbp;
>  		struct xbtree_afakeroot	*afake;	/* for staging cursor */
>  	};
>  	union {
>  		struct {
> -			unsigned long nr_ops;	/* # record updates */
> -			int	shape_changes;	/* # of extent splits */
> +			unsigned int	nr_ops;	/* # record updates */
> +			unsigned int	shape_changes;	/* # of extent splits */
>  		} refc;
>  		struct {
> -			bool	active;		/* allocation cursor state */
> +			bool		active;	/* allocation cursor state */
>  		} abt;
>  	};
>  };

Much nicer.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 03/15] xfs: don't track firstrec/firstkey separately in xchk_btree
  2021-10-12 23:32 ` [PATCH 03/15] xfs: don't track firstrec/firstkey separately in xchk_btree Darrick J. Wong
@ 2021-10-13  1:02   ` Dave Chinner
  0 siblings, 0 replies; 41+ messages in thread
From: Dave Chinner @ 2021-10-13  1:02 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, chandan.babu, hch

On Tue, Oct 12, 2021 at 04:32:50PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> The btree scrubbing code checks that the records (or keys) that it finds
> in a btree block are all in order by calling the btree cursor's
> ->recs_inorder function.  This of course makes no sense for the first
> item in the block, so we switch that off with a separate variable in
> struct xchk_btree.
> 
> Christoph helped me figure out that the variable is unnecessary, since
> we just accessed bc_ptrs[level] and can compare that against zero.  Use
> that, and save ourselves some memory space.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Yup, took a little bit of reading to work it out, but it looks
correct.

Reviewed-by: Dave Chinner <dchinner@redhat.com>

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 04/15] xfs: dynamically allocate btree scrub context structure
  2021-10-12 23:32 ` [PATCH 04/15] xfs: dynamically allocate btree scrub context structure Darrick J. Wong
@ 2021-10-13  4:57   ` Dave Chinner
  2021-10-13 16:29     ` Darrick J. Wong
  0 siblings, 1 reply; 41+ messages in thread
From: Dave Chinner @ 2021-10-13  4:57 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, chandan.babu, hch

On Tue, Oct 12, 2021 at 04:32:55PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Reorganize struct xchk_btree so that we can dynamically size the context
> structure to fit the type of btree cursor that we have.  This will
> enable us to use memory more efficiently once we start adding very tall
> btree types.  Right-size the lastkey array so that we stop wasting the
> first array element.

"right size"?

I'm assuming this is the "nlevels - 1" bit?

> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/scrub/btree.c |   23 ++++++++++++-----------
>  fs/xfs/scrub/btree.h |   11 ++++++++++-
>  2 files changed, 22 insertions(+), 12 deletions(-)
> 
> 
> diff --git a/fs/xfs/scrub/btree.c b/fs/xfs/scrub/btree.c
> index d5e1ca521fc4..6d4eba85ef77 100644
> --- a/fs/xfs/scrub/btree.c
> +++ b/fs/xfs/scrub/btree.c
> @@ -189,9 +189,9 @@ xchk_btree_key(
>  
>  	/* If this isn't the first key, are they in order? */
>  	if (cur->bc_ptrs[level] > 1 &&
> -	    !cur->bc_ops->keys_inorder(cur, &bs->lastkey[level], key))
> +	    !cur->bc_ops->keys_inorder(cur, &bs->lastkey[level - 1], key))
>  		xchk_btree_set_corrupt(bs->sc, cur, level);
> -	memcpy(&bs->lastkey[level], key, cur->bc_ops->key_len);
> +	memcpy(&bs->lastkey[level - 1], key, cur->bc_ops->key_len);
>  
>  	if (level + 1 >= cur->bc_nlevels)
>  		return;
> @@ -631,17 +631,24 @@ xchk_btree(
>  	union xfs_btree_ptr		*pp;
>  	union xfs_btree_rec		*recp;
>  	struct xfs_btree_block		*block;
> -	int				level;
>  	struct xfs_buf			*bp;
>  	struct check_owner		*co;
>  	struct check_owner		*n;
> +	size_t				cur_sz;
> +	int				level;
>  	int				error = 0;
>  
>  	/*
>  	 * Allocate the btree scrub context from the heap, because this
> -	 * structure can get rather large.
> +	 * structure can get rather large.  Don't let a caller feed us a
> +	 * totally absurd size.
>  	 */
> -	bs = kmem_zalloc(sizeof(struct xchk_btree), KM_NOFS | KM_MAYFAIL);
> +	cur_sz = xchk_btree_sizeof(cur->bc_nlevels);
> +	if (cur_sz > PAGE_SIZE) {
> +		xchk_btree_set_corrupt(sc, cur, 0);
> +		return 0;
> +	}
> +	bs = kmem_zalloc(cur_sz, KM_NOFS | KM_MAYFAIL);
>  	if (!bs)
>  		return -ENOMEM;
>  	bs->cur = cur;
> @@ -653,12 +660,6 @@ xchk_btree(
>  	/* Initialize scrub state */
>  	INIT_LIST_HEAD(&bs->to_check);
>  
> -	/* Don't try to check a tree with a height we can't handle. */
> -	if (cur->bc_nlevels > XFS_BTREE_MAXLEVELS) {
> -		xchk_btree_set_corrupt(sc, cur, 0);
> -		goto out;
> -	}
> -
>  	/*
>  	 * Load the root of the btree.  The helper function absorbs
>  	 * error codes for us.
> diff --git a/fs/xfs/scrub/btree.h b/fs/xfs/scrub/btree.h
> index 7671108f9f85..62c3091ef20f 100644
> --- a/fs/xfs/scrub/btree.h
> +++ b/fs/xfs/scrub/btree.h
> @@ -39,9 +39,18 @@ struct xchk_btree {
>  
>  	/* internal scrub state */
>  	union xfs_btree_rec		lastrec;
> -	union xfs_btree_key		lastkey[XFS_BTREE_MAXLEVELS];
>  	struct list_head		to_check;
> +
> +	/* this element must come last! */
> +	union xfs_btree_key		lastkey[];
>  };
> +
> +static inline size_t
> +xchk_btree_sizeof(unsigned int nlevels)
> +{
> +	return struct_size((struct xchk_btree *)NULL, lastkey, nlevels - 1);
> +}

I'd like a comment here indicating that the max number of keys is
"nlevels - 1" because the last level of the tree is records and
that's held in a separate lastrec field...

That way there's a reminder of why there's a "- 1" here without
having to work it out from first principles every time we look at this
code...
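
Something like this, maybe (comment wording is only a sketch):

/*
 * Scrub tracks the last key seen at each node level in lastkey[], but
 * level 0 of the btree holds records, which go in the separate lastrec
 * field, so we only need nlevels - 1 key slots here.
 */
static inline size_t
xchk_btree_sizeof(unsigned int nlevels)
{
	return struct_size((struct xchk_btree *)NULL, lastkey, nlevels - 1);
}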

Otherwise it seems reasonable.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 05/15] xfs: support dynamic btree cursor heights
  2021-10-12 23:33 ` [PATCH 05/15] xfs: support dynamic btree cursor heights Darrick J. Wong
@ 2021-10-13  5:31   ` Dave Chinner
  2021-10-13 16:52     ` Darrick J. Wong
  0 siblings, 1 reply; 41+ messages in thread
From: Dave Chinner @ 2021-10-13  5:31 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Chandan Babu R, Christoph Hellwig, linux-xfs

On Tue, Oct 12, 2021 at 04:33:01PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Split out the btree level information into a separate struct and put it
> at the end of the cursor structure as a VLA.  The realtime rmap btree
> (which is rooted in an inode) will require the ability to support many
> more levels than a per-AG btree cursor, which means that we're going to
> create two btree cursor caches to conserve memory for the more common
> case.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/xfs/libxfs/xfs_alloc.c |    6 +-
>  fs/xfs/libxfs/xfs_bmap.c  |   10 +--
>  fs/xfs/libxfs/xfs_btree.c |  168 +++++++++++++++++++++++----------------------
>  fs/xfs/libxfs/xfs_btree.h |   28 ++++++--
>  fs/xfs/scrub/bitmap.c     |   22 +++---
>  fs/xfs/scrub/bmap.c       |    2 -
>  fs/xfs/scrub/btree.c      |   47 +++++++------
>  fs/xfs/scrub/trace.c      |    7 +-
>  fs/xfs/scrub/trace.h      |   10 +--
>  fs/xfs/xfs_super.c        |    2 -
>  fs/xfs/xfs_trace.h        |    2 -
>  11 files changed, 164 insertions(+), 140 deletions(-)

Hmmm - subject of the patch doesn't really match the changes being
made - there's nothing here that makes the btree cursor heights
dynamic. It's just a structure layout change...

> @@ -415,9 +415,9 @@ xfs_btree_dup_cursor(
>  	 * For each level current, re-get the buffer and copy the ptr value.
>  	 */
>  	for (i = 0; i < new->bc_nlevels; i++) {
> -		new->bc_ptrs[i] = cur->bc_ptrs[i];
> -		new->bc_ra[i] = cur->bc_ra[i];
> -		bp = cur->bc_bufs[i];
> +		new->bc_levels[i].ptr = cur->bc_levels[i].ptr;
> +		new->bc_levels[i].ra = cur->bc_levels[i].ra;
> +		bp = cur->bc_levels[i].bp;
>  		if (bp) {
>  			error = xfs_trans_read_buf(mp, tp, mp->m_ddev_targp,
>  						   xfs_buf_daddr(bp), mp->m_bsize,
> @@ -429,7 +429,7 @@ xfs_btree_dup_cursor(
>  				return error;
>  			}
>  		}
> -		new->bc_bufs[i] = bp;
> +		new->bc_levels[i].bp = bp;
>  	}
>  	*ncur = new;
>  	return 0;

ObHuh: that dup_cursor code seems like a really obtuse way of doing:

	bip = cur->bc_levels[i].bp->b_log_item;
	bip->bli_recur++;
	new->bc_levels[i] = cur->bc_levels[i];

But that's not a problem this patch needs to solve. Just something
that made me go hmmmm...

> @@ -922,11 +922,11 @@ xfs_btree_readahead(
>  	    (lev == cur->bc_nlevels - 1))
>  		return 0;
>  
> -	if ((cur->bc_ra[lev] | lr) == cur->bc_ra[lev])
> +	if ((cur->bc_levels[lev].ra | lr) == cur->bc_levels[lev].ra)
>  		return 0;

That's whacky logic. Surely that's just:

	if (cur->bc_levels[lev].ra & lr)
		return 0;

> diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
> index 1018bcc43d66..f31f057bec9d 100644
> --- a/fs/xfs/libxfs/xfs_btree.h
> +++ b/fs/xfs/libxfs/xfs_btree.h
> @@ -212,6 +212,19 @@ struct xfs_btree_cur_ino {
>  #define	XFS_BTCUR_BMBT_INVALID_OWNER	(1 << 1)
>  };
>  
> +struct xfs_btree_level {
> +	/* buffer pointer */
> +	struct xfs_buf		*bp;
> +
> +	/* key/record number */
> +	uint16_t		ptr;
> +
> +	/* readahead info */
> +#define XFS_BTCUR_LEFTRA	1	/* left sibling has been read-ahead */
> +#define XFS_BTCUR_RIGHTRA	2	/* right sibling has been read-ahead */
> +	uint16_t		ra;
> +};

The ra variable is a bit field. Can we define the values obviously
as bit fields with (1 << 0) and (1 << 1) instead of 1 and 2?
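
i.e. something like:

#define XFS_BTCUR_LEFTRA	(1 << 0)	/* left sibling has been read-ahead */
#define XFS_BTCUR_RIGHTRA	(1 << 1)	/* right sibling has been read-ahead */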

> @@ -242,8 +250,17 @@ struct xfs_btree_cur
>  		struct xfs_btree_cur_ag	bc_ag;
>  		struct xfs_btree_cur_ino bc_ino;
>  	};
> +
> +	/* Must be at the end of the struct! */
> +	struct xfs_btree_level	bc_levels[];
>  };
>  
> +static inline size_t
> +xfs_btree_cur_sizeof(unsigned int nlevels)
> +{
> +	return struct_size((struct xfs_btree_cur *)NULL, bc_levels, nlevels);
> +}

Ooooh, yeah, we really need comments explaining how many btree
levels these VLAs are tracking, because this one doesn't have a
"- 1" in it like the previous one I commented on....

> diff --git a/fs/xfs/scrub/trace.c b/fs/xfs/scrub/trace.c
> index c0ef53fe6611..816dfc8e5a80 100644
> --- a/fs/xfs/scrub/trace.c
> +++ b/fs/xfs/scrub/trace.c
> @@ -21,10 +21,11 @@ xchk_btree_cur_fsbno(
>  	struct xfs_btree_cur	*cur,
>  	int			level)
>  {
> -	if (level < cur->bc_nlevels && cur->bc_bufs[level])
> +	if (level < cur->bc_nlevels && cur->bc_levels[level].bp)
>  		return XFS_DADDR_TO_FSB(cur->bc_mp,
> -				xfs_buf_daddr(cur->bc_bufs[level]));
> -	if (level == cur->bc_nlevels - 1 && cur->bc_flags & XFS_BTREE_LONG_PTRS)
> +				xfs_buf_daddr(cur->bc_levels[level].bp));
> +	else if (level == cur->bc_nlevels - 1 &&
> +		 cur->bc_flags & XFS_BTREE_LONG_PTRS)

No need for an else there as the first if () clause returns.
Also, needs more () around that "a & b" second line.
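
i.e.:

	if (level == cur->bc_nlevels - 1 &&
	    (cur->bc_flags & XFS_BTREE_LONG_PTRS))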

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 06/15] xfs: rearrange xfs_btree_cur fields for better packing
  2021-10-12 23:33 ` [PATCH 06/15] xfs: rearrange xfs_btree_cur fields for better packing Darrick J. Wong
@ 2021-10-13  5:34   ` Dave Chinner
  0 siblings, 0 replies; 41+ messages in thread
From: Dave Chinner @ 2021-10-13  5:34 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, chandan.babu, hch

On Tue, Oct 12, 2021 at 04:33:06PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Reduce the size of the btree cursor structure some more by rearranging
> fields to eliminate unused space.  While we're at it, fix the ragged
> indentation and a spelling error.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/libxfs/xfs_btree.h |    8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)

looks fine.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 07/15] xfs: refactor btree cursor allocation function
  2021-10-12 23:33 ` [PATCH 07/15] xfs: refactor btree cursor allocation function Darrick J. Wong
@ 2021-10-13  5:34   ` Dave Chinner
  0 siblings, 0 replies; 41+ messages in thread
From: Dave Chinner @ 2021-10-13  5:34 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Chandan Babu R, Christoph Hellwig, linux-xfs

On Tue, Oct 12, 2021 at 04:33:12PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Refactor btree allocation to a common helper.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/xfs/libxfs/xfs_alloc_btree.c    |    6 +-----
>  fs/xfs/libxfs/xfs_bmap_btree.c     |    6 +-----
>  fs/xfs/libxfs/xfs_btree.h          |   16 ++++++++++++++++
>  fs/xfs/libxfs/xfs_ialloc_btree.c   |    5 +----
>  fs/xfs/libxfs/xfs_refcount_btree.c |    5 +----
>  fs/xfs/libxfs/xfs_rmap_btree.c     |    5 +----
>  6 files changed, 21 insertions(+), 22 deletions(-)

LGTM

Reviewed-by: Dave Chinner <dchinner@redhat.com>
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 08/15] xfs: encode the max btree height in the cursor
  2021-10-12 23:33 ` [PATCH 08/15] xfs: encode the max btree height in the cursor Darrick J. Wong
@ 2021-10-13  5:38   ` Dave Chinner
  0 siblings, 0 replies; 41+ messages in thread
From: Dave Chinner @ 2021-10-13  5:38 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, chandan.babu, hch

On Tue, Oct 12, 2021 at 04:33:17PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Encode the maximum btree height in the cursor, since we're soon going to
> allow smaller cursors for AG btrees and larger cursors for file btrees.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/libxfs/xfs_bmap.c          |    2 +-
>  fs/xfs/libxfs/xfs_btree.c         |    4 ++--
>  fs/xfs/libxfs/xfs_btree.h         |    2 ++
>  fs/xfs/libxfs/xfs_btree_staging.c |   10 +++++-----
>  4 files changed, 10 insertions(+), 8 deletions(-)

Looks ok, and fills part of the hole in the btree cursor structure
so doesn't grow it again.

Reviewed-by: Dave Chinner <dchinner@redhat.com>

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 09/15] xfs: dynamically allocate cursors based on maxlevels
  2021-10-12 23:33 ` [PATCH 09/15] xfs: dynamically allocate cursors based on maxlevels Darrick J. Wong
@ 2021-10-13  5:40   ` Dave Chinner
  2021-10-13 16:55     ` Darrick J. Wong
  0 siblings, 1 reply; 41+ messages in thread
From: Dave Chinner @ 2021-10-13  5:40 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, chandan.babu, hch

On Tue, Oct 12, 2021 at 04:33:23PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> To support future btree code, we need to be able to size btree cursors
> dynamically for very large btrees.  Switch the maxlevels computation to
> use the precomputed values in the superblock, and create cursors that
> can handle a certain height.  For now, we retain the btree cursor zone
> that can handle up to 9-level btrees, and create larger cursors (which
> shouldn't happen currently) from the heap as a failsafe.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/libxfs/xfs_alloc_btree.c    |    2 +-
>  fs/xfs/libxfs/xfs_bmap_btree.c     |    3 ++-
>  fs/xfs/libxfs/xfs_btree.h          |   13 +++++++++++--
>  fs/xfs/libxfs/xfs_ialloc_btree.c   |    3 ++-
>  fs/xfs/libxfs/xfs_refcount_btree.c |    3 ++-
>  fs/xfs/libxfs/xfs_rmap_btree.c     |    3 ++-
>  fs/xfs/xfs_super.c                 |    4 ++--
>  7 files changed, 22 insertions(+), 9 deletions(-)

minor nit:

> diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
> index 43766e5b680f..b8761a2fc24b 100644
> --- a/fs/xfs/libxfs/xfs_btree.h
> +++ b/fs/xfs/libxfs/xfs_btree.h
> @@ -94,6 +94,12 @@ uint32_t xfs_btree_magic(int crc, xfs_btnum_t btnum);
>  
>  #define	XFS_BTREE_MAXLEVELS	9	/* max of all btrees */
>  
> +/*
> + * The btree cursor zone hands out cursors that can handle up to this many
> + * levels.  This is the known maximum for all btree types.
> + */
> +#define XFS_BTREE_CUR_ZONE_MAXLEVELS	(9)

XFS_BTREE_CUR_CACHE_MAXLEVELS	9

Otherwise looks OK.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 10/15] xfs: compute actual maximum btree height for critical reservation calculation
  2021-10-12 23:33 ` [PATCH 10/15] xfs: compute actual maximum btree height for critical reservation calculation Darrick J. Wong
@ 2021-10-13  5:49   ` Dave Chinner
  2021-10-13 17:07     ` Darrick J. Wong
  0 siblings, 1 reply; 41+ messages in thread
From: Dave Chinner @ 2021-10-13  5:49 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, chandan.babu, hch

On Tue, Oct 12, 2021 at 04:33:28PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Compute the actual maximum btree height when deciding if per-AG block
> reservation is critically low.  This only affects the sanity check
> condition, since we /generally/ will trigger on the 10% threshold.
> This is a long-winded way of saying that we're removing one more
> usage of XFS_BTREE_MAXLEVELS.

And replacing it with a branchy dynamic calculation that has a
static, unchanging result. :(

> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/libxfs/xfs_ag_resv.c |   18 +++++++++++++++++-
>  1 file changed, 17 insertions(+), 1 deletion(-)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_ag_resv.c b/fs/xfs/libxfs/xfs_ag_resv.c
> index 2aa2b3484c28..d34d4614f175 100644
> --- a/fs/xfs/libxfs/xfs_ag_resv.c
> +++ b/fs/xfs/libxfs/xfs_ag_resv.c
> @@ -60,6 +60,20 @@
>   * to use the reservation system should update ask/used in xfs_ag_resv_init.
>   */
>  
> +/* Compute maximum possible height for per-AG btree types for this fs. */
> +static unsigned int
> +xfs_ag_btree_maxlevels(
> +	struct xfs_mount	*mp)
> +{
> +	unsigned int		ret = mp->m_ag_maxlevels;
> +
> +	ret = max(ret, mp->m_bm_maxlevels[XFS_DATA_FORK]);
> +	ret = max(ret, mp->m_bm_maxlevels[XFS_ATTR_FORK]);
> +	ret = max(ret, M_IGEO(mp)->inobt_maxlevels);
> +	ret = max(ret, mp->m_rmap_maxlevels);
> +	return max(ret, mp->m_refc_maxlevels);
> +}

Hmmmm. perhaps mp->m_ag_maxlevels should be renamed to
mp->m_agbno_maxlevels and we pre-calculate mp->m_ag_maxlevels from
the above function and just use the variable in the
xfs_ag_resv_critical() check?
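
i.e. something like (names are just a sketch, and the helper above
would then start from mp->m_agbno_maxlevels):

	/* set at mount time by xfs_alloc_compute_maxlevels() */
	mp->m_agbno_maxlevels = xfs_btree_compute_maxlevels(mp->m_alloc_mnr,
			(mp->m_sb.sb_agblocks + 1) / 2);

	/* once all the per-AG btree geometry is known */
	mp->m_ag_maxlevels = xfs_ag_btree_maxlevels(mp);

and xfs_ag_resv_critical() just reads mp->m_ag_maxlevels.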

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 11/15] xfs: compute the maximum height of the rmap btree when reflink enabled
  2021-10-12 23:33 ` [PATCH 11/15] xfs: compute the maximum height of the rmap btree when reflink enabled Darrick J. Wong
@ 2021-10-13  7:25   ` Dave Chinner
  2021-10-13 17:47     ` Darrick J. Wong
  0 siblings, 1 reply; 41+ messages in thread
From: Dave Chinner @ 2021-10-13  7:25 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Chandan Babu R, linux-xfs, hch

On Tue, Oct 12, 2021 at 04:33:34PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Instead of assuming that the hardcoded XFS_BTREE_MAXLEVELS value is big
> enough to handle the maximally tall rmap btree when all blocks are in
> use and maximally shared, let's compute the maximum height assuming the
> rmapbt consumes as many blocks as possible.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
> ---
>  fs/xfs/libxfs/xfs_btree.c       |   34 ++++++++++++++++++++++++
>  fs/xfs/libxfs/xfs_btree.h       |    2 +
>  fs/xfs/libxfs/xfs_rmap_btree.c  |   55 ++++++++++++++++++++++++---------------
>  fs/xfs/libxfs/xfs_trans_resv.c  |   13 +++++++++
>  fs/xfs/libxfs/xfs_trans_space.h |    7 +++++
>  5 files changed, 90 insertions(+), 21 deletions(-)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
> index 6ced8f028d47..201b81d54622 100644
> --- a/fs/xfs/libxfs/xfs_btree.c
> +++ b/fs/xfs/libxfs/xfs_btree.c
> @@ -4531,6 +4531,40 @@ xfs_btree_compute_maxlevels(
>  	return level;
>  }
>  
> +/*
> + * Compute the maximum height of a btree that is allowed to consume up to the
> + * given number of blocks.
> + */
> +unsigned int
> +xfs_btree_compute_maxlevels_size(
> +	unsigned long long	max_btblocks,
> +	unsigned int		leaf_mnr)

So "leaf_mnr" is supposed to be the minimum number of records in
a leaf?

But this gets passed mp->m_rmap_mnr[1], which is the minimum number
of keys/ptrs in a node, not a leaf. I'm confused.

> +{
> +	unsigned long long	leaf_blocks = leaf_mnr;
> +	unsigned long long	blocks_left;
> +	unsigned int		maxlevels;
> +
> +	if (max_btblocks < 1)
> +		return 0;
> +
> +	/*
> +	 * The loop increments maxlevels as long as there would be enough
> +	 * blocks left in the reservation to handle each node block at the
> +	 * current level pointing to the minimum possible number of leaf blocks
> +	 * at the next level down.  We start the loop assuming a single-level
> +	 * btree consuming one block.
> +	 */
> +	maxlevels = 1;
> +	blocks_left = max_btblocks - 1;
> +	while (leaf_blocks < blocks_left) {
> +		maxlevels++;
> +		blocks_left -= leaf_blocks;
> +		leaf_blocks *= leaf_mnr;
> +	}
> +
> +	return maxlevels;

Yup, I'm definitely confused. We also have:

xfs_btree_calc_size(limits, len)
xfs_btree_compute_maxlevels(limits, len)

And they do something similar but subtly different. They aren't
clearly documented, either, so from reading the code:

xfs_btree_calc_size is calculating the btree block usage for a
discrete count of items based on the leaf and node population values
from mp->m_rmap_mnr, etc. It uses a division based algorithm

	recs = limits[0]	// min recs per block
	for (level = 0; len > 1; level++) {
		do_div(len, recs)
		recs = limits[1]	// min ptrs per node
		rval += len;
	}
	return rval

(why does this even calculate level?)

So it returns the number of blocks the btree will consume to
index a given number of discrete blocks.

xfs_btree_compute_maxlevels() is basically:

	len = len / limits[0]		// record blocks in level 0
	for (level = 1; len > 1; level++)
		len = len / limits[1]	// node blocks in level n
	return level

So it returns how many levels are required to index a specific
number of discrete blocks given a specific leaf/node population.

But what does xfs_btree_compute_maxlevels_size do? I'm really not
sure from the description, the calculation or the parameters passed
to it. Even a table doesn't tell me:

say 10000 records, leaf_mnr = 10

loop		blocks_left	leaf_blocks	max_levels
0 (at init)		9999		10		1
1			9989	       100		2
2			9889	      1000		3
3			8889	     10000		4
Breaks out on (leaf_blocks > blocks_left)

So, after much head scratching, I *think* what this function is
trying to do is take into account the case where we have a single
block shared by reflink N times, such that the entire AG is made up
of rmap records pointing to all the owners.  We're trying to
determine the height of the tree if we index enough leaf
records to consume all the free space in the AG?

Which then means we don't care what the number of records are in the
leaf nodes, we only need to know how many leaf blocks there are and
how many interior nodes we consume to index them?

IOWs, we're counting the number of leaf blocks we can index at each
level based on the _minimum number of pointers_ we can hold in a
_node_?

If so, then the naming leaves a lot to be desired here. The
variables all being named "leaf" even though they are being passed
node limits and are calculating node level indexing limits and not
leaf space consumption completely threw me in the wrong direction.
I just spent the best part of 90 minutes working all this out
from first principles because nothing is obvious about why this code
is correct. Everything screamed "wrong wrong wrong" at me until
I finally understood what was being calculated. And now that I know, it
still screams "wrong wrong wrong" at me.

So:

/*
 * Given a number of available blocks for the btree to consume with
 * records and pointers, calculate the height of the tree needed to
 * index all the records that space can hold based on the number of
 * pointers each interior node holds.
 *
 * We start by assuming a single level tree consumes a single block,
 * then track the number of blocks each node level consumes until we
 * no longer have space to store the next node level. At this point,
 * we are indexing all the leaf blocks in the space, and there's no
 * more free space to split the tree any further. That's our maximum
 * btree height.
 */
unsigned int
xfs_btree_space_to_height(
	unsigned int		*limits,
	unsigned long long	leaf_blocks)
{
	unsigned long long	node_blocks = limits[1];
	unsigned long long	blocks_left = leaf_blocks - 1;
	unsigned int		height = 1;

	if (leaf_blocks < 1)
		return 0;

	while (node_blocks < blocks_left) {
		height++;
		blocks_left -= node_blocks;
		node_blocks *= limits[1];
	}

	return height;
}

Oh, yeah, I made the parameters the same as the other btree
height/size functions, too, because....

> +	unsigned int		val;
> +
> +	if (!xfs_has_rmapbt(mp)) {
> +		mp->m_rmap_maxlevels = 0;
> +		return;
> +	}
> +
> +	if (xfs_has_reflink(mp)) {
> +		/*
> +		 * Compute the asymptotic maxlevels for an rmap btree on a
> +		 * filesystem that supports reflink.
> +		 *
> +		 * On a reflink filesystem, each AG block can have up to 2^32
> +		 * (per the refcount record format) owners, which means that
> +		 * theoretically we could face up to 2^64 rmap records.
> +		 * However, we're likely to run out of blocks in the AG long
> +		 * before that happens, which means that we must compute the
> +		 * max height based on what the btree will look like if it
> +		 * consumes almost all the blocks in the AG due to maximal
> +		 * sharing factor.
> +		 */
> +		val = xfs_btree_compute_maxlevels_size(mp->m_sb.sb_agblocks,
> +				mp->m_rmap_mnr[1]);
> +	} else {
> +		/*
> +		 * If there's no block sharing, compute the maximum rmapbt
> +		 * height assuming one rmap record per AG block.
> +		 */
> +		val = xfs_btree_compute_maxlevels(mp->m_rmap_mnr,
> +				mp->m_sb.sb_agblocks);

This just looks weird with the same parameters in reverse order to
these two functions...

> +	}
> +
> +	mp->m_rmap_maxlevels = val;
>  }

Also, this function becomes simpler if it just returns the maxlevels
value and the caller writes it into mp->m_rmap_maxlevels.

>  
>  /* Calculate the refcount btree size for some records. */
> diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
> index 5e300daa2559..97bd17d84a23 100644
> --- a/fs/xfs/libxfs/xfs_trans_resv.c
> +++ b/fs/xfs/libxfs/xfs_trans_resv.c
> @@ -814,6 +814,16 @@ xfs_trans_resv_calc(
>  	struct xfs_mount	*mp,
>  	struct xfs_trans_resv	*resp)
>  {
> +	unsigned int		rmap_maxlevels = mp->m_rmap_maxlevels;
> +
> +	/*
> +	 * In the early days of rmap+reflink, we always set the rmap maxlevels
> +	 * to 9 even if the AG was small enough that it would never grow to
> +	 * that height.
> +	 */
> +	if (xfs_has_rmapbt(mp) && xfs_has_reflink(mp))
> +		mp->m_rmap_maxlevels = XFS_OLD_REFLINK_RMAP_MAXLEVELS;
> +
>  	/*
>  	 * The following transactions are logged in physical format and
>  	 * require a permanent reservation on space.
> @@ -916,4 +926,7 @@ xfs_trans_resv_calc(
>  	resp->tr_clearagi.tr_logres = xfs_calc_clear_agi_bucket_reservation(mp);
>  	resp->tr_growrtzero.tr_logres = xfs_calc_growrtzero_reservation(mp);
>  	resp->tr_growrtfree.tr_logres = xfs_calc_growrtfree_reservation(mp);
> +
> +	/* Put everything back the way it was.  This goes at the end. */
> +	mp->m_rmap_maxlevels = rmap_maxlevels;
>  }

Why play games like this? We want the reservations to go down in
size if the btrees don't reach "XFS_OLD_REFLINK_RMAP_MAXLEVELS"
size. The reason isn't mentioned in the commit message...

> diff --git a/fs/xfs/libxfs/xfs_trans_space.h b/fs/xfs/libxfs/xfs_trans_space.h
> index 50332be34388..440c9c390b86 100644
> --- a/fs/xfs/libxfs/xfs_trans_space.h
> +++ b/fs/xfs/libxfs/xfs_trans_space.h
> @@ -17,6 +17,13 @@
>  /* Adding one rmap could split every level up to the top of the tree. */
>  #define XFS_RMAPADD_SPACE_RES(mp) ((mp)->m_rmap_maxlevels)
>  
> +/*
> + * Note that we historically set m_rmap_maxlevels to 9 when reflink was
> + * enabled, so we must preserve this behavior to avoid changing the transaction
> + * space reservations.
> + */
> +#define XFS_OLD_REFLINK_RMAP_MAXLEVELS	(9)

9.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 12/15] xfs: kill XFS_BTREE_MAXLEVELS
  2021-10-12 23:33 ` [PATCH 12/15] xfs: kill XFS_BTREE_MAXLEVELS Darrick J. Wong
@ 2021-10-13  7:25   ` Dave Chinner
  0 siblings, 0 replies; 41+ messages in thread
From: Dave Chinner @ 2021-10-13  7:25 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Chandan Babu R, linux-xfs, hch

On Tue, Oct 12, 2021 at 04:33:39PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Nobody uses this symbol anymore, so kill it.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>

LGTM

Reviewed-by: Dave Chinner <dchinner@redhat.com>

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 13/15] xfs: widen btree maxlevels computation to handle 64-bit record counts
  2021-10-12 23:33 ` [PATCH 13/15] xfs: widen btree maxlevels computation to handle 64-bit record counts Darrick J. Wong
@ 2021-10-13  7:28   ` Dave Chinner
  0 siblings, 0 replies; 41+ messages in thread
From: Dave Chinner @ 2021-10-13  7:28 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, chandan.babu, hch

On Tue, Oct 12, 2021 at 04:33:45PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Rework xfs_btree_compute_maxlevels to handle larger record counts, since
> we're about to add support for very large data forks.  Eventually the
> realtime reverse mapping btree will need this too.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/libxfs/xfs_btree.c |   16 ++++++++--------
>  fs/xfs/libxfs/xfs_btree.h |    3 ++-
>  2 files changed, 10 insertions(+), 9 deletions(-)

Looks good. howmany_64() uses do_div() properly so there shouldn't
be any issues with this on 32 bit platforms.

Reviewed-by: Dave Chinner <dchinner@redhat.com>

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 14/15] xfs: compute absolute maximum nlevels for each btree type
  2021-10-12 23:33 ` [PATCH 14/15] xfs: compute absolute maximum nlevels for each btree type Darrick J. Wong
@ 2021-10-13  7:57   ` Dave Chinner
  2021-10-13 21:36     ` Darrick J. Wong
  0 siblings, 1 reply; 41+ messages in thread
From: Dave Chinner @ 2021-10-13  7:57 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, chandan.babu, hch

On Tue, Oct 12, 2021 at 04:33:50PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Add code for all five btree types so that we can compute the absolute
> maximum possible btree height for each btree type.  This is a setup for
> the next patch, which makes every btree type have its own cursor cache.
> 
> The functions are exported so that we can have xfs_db report the
> absolute maximum btree heights for each btree type, rather than making
> everyone run their own ad-hoc computations.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/libxfs/xfs_alloc.c          |    1 +
>  fs/xfs/libxfs/xfs_alloc_btree.c    |   13 +++++++++++
>  fs/xfs/libxfs/xfs_alloc_btree.h    |    2 ++
>  fs/xfs/libxfs/xfs_bmap.c           |    1 +
>  fs/xfs/libxfs/xfs_bmap_btree.c     |   14 ++++++++++++
>  fs/xfs/libxfs/xfs_bmap_btree.h     |    2 ++
>  fs/xfs/libxfs/xfs_btree.c          |   41 ++++++++++++++++++++++++++++++++++++
>  fs/xfs/libxfs/xfs_btree.h          |    3 +++
>  fs/xfs/libxfs/xfs_fs.h             |    2 ++
>  fs/xfs/libxfs/xfs_ialloc.c         |    1 +
>  fs/xfs/libxfs/xfs_ialloc_btree.c   |   19 +++++++++++++++++
>  fs/xfs/libxfs/xfs_ialloc_btree.h   |    2 ++
>  fs/xfs/libxfs/xfs_refcount_btree.c |   20 ++++++++++++++++++
>  fs/xfs/libxfs/xfs_refcount_btree.h |    2 ++
>  fs/xfs/libxfs/xfs_rmap_btree.c     |   27 ++++++++++++++++++++++++
>  fs/xfs/libxfs/xfs_rmap_btree.h     |    2 ++
>  16 files changed, 152 insertions(+)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> index 55c5adc9b54e..7145416a230c 100644
> --- a/fs/xfs/libxfs/xfs_alloc.c
> +++ b/fs/xfs/libxfs/xfs_alloc.c
> @@ -2198,6 +2198,7 @@ xfs_alloc_compute_maxlevels(
>  {
>  	mp->m_ag_maxlevels = xfs_btree_compute_maxlevels(mp->m_alloc_mnr,
>  			(mp->m_sb.sb_agblocks + 1) / 2);
> +	ASSERT(mp->m_ag_maxlevels <= xfs_allocbt_absolute_maxlevels());
>  }
>  
>  /*
> diff --git a/fs/xfs/libxfs/xfs_alloc_btree.c b/fs/xfs/libxfs/xfs_alloc_btree.c
> index f14bad21503f..61f6d266b822 100644
> --- a/fs/xfs/libxfs/xfs_alloc_btree.c
> +++ b/fs/xfs/libxfs/xfs_alloc_btree.c
> @@ -582,6 +582,19 @@ xfs_allocbt_maxrecs(
>  	return blocklen / (sizeof(xfs_alloc_key_t) + sizeof(xfs_alloc_ptr_t));
>  }
>  
> +/* Compute the max possible height of the maximally sized free space btree. */
> +unsigned int
> +xfs_allocbt_absolute_maxlevels(void)
> +{
> +	unsigned int		minrecs[2];
> +
> +	xfs_btree_absolute_minrecs(minrecs, 0, sizeof(xfs_alloc_rec_t),
> +			sizeof(xfs_alloc_key_t) + sizeof(xfs_alloc_ptr_t));
> +
> +	return xfs_btree_compute_maxlevels(minrecs,
> +			(XFS_MAX_AG_BLOCKS + 1) / 2);
> +}

Hmmmm. This is kind of messy. I'd prefer we share code with the
xfs_allocbt_maxrecs() function that do this. Not sure "absolute" is
the right word, either. It's more a function of the on-disk format
maximum, not an "absolute" thing.

I mean, we know what the worst case is going to be for each btree
type - we don't need to pass in XFS_BTREE_CRC_BLOCKS or
XFS_BTREE_LONG_PTRS to generic code for it to branch multiple times
to be generic. Instead:

static inline int
xfs_allocbt_block_maxrecs(
        int                     blocklen,
        int                     leaf)
{
        if (leaf)
                return blocklen / sizeof(xfs_alloc_rec_t);
        return blocklen / (sizeof(xfs_alloc_key_t) + sizeof(xfs_alloc_ptr_t));
}

/*
 * Calculate number of records in an alloc btree block.
 */
int
xfs_allocbt_maxrecs(
        struct xfs_mount        *mp,
        int                     blocklen,
        int                     leaf)
{
        blocklen -= XFS_ALLOC_BLOCK_LEN(mp);
	return xfs_allocbt_block_maxrecs(blocklen, leaf);
}

unsigned int
xfs_allocbt_maxlevels_ondisk(void)
{
	unsigned int		minrecs[2];

	minrecs[0] = xfs_allocbt_block_maxrecs(
			XFS_MIN_BLOCKSIZE - XFS_BTREE_SBLOCK_LEN, true) / 2;
	minrecs[1] = xfs_allocbt_block_maxrecs(
			XFS_MIN_BLOCKSIZE - XFS_BTREE_SBLOCK_LEN, false) / 2;

	return xfs_btree_compute_maxlevels(minrecs,
			(XFS_MAX_AG_BLOCKS + 1) / 2);
}

All the other btree implementations factor this way, too, allowing
the minrec values to be calculated clearly and directly in the
specific btree function...

> +
>  /* Calculate the freespace btree size for some records. */
>  xfs_extlen_t
>  xfs_allocbt_calc_size(
> diff --git a/fs/xfs/libxfs/xfs_alloc_btree.h b/fs/xfs/libxfs/xfs_alloc_btree.h
> index 2f6b816aaf9f..c47d0e285435 100644
> --- a/fs/xfs/libxfs/xfs_alloc_btree.h
> +++ b/fs/xfs/libxfs/xfs_alloc_btree.h
> @@ -60,4 +60,6 @@ extern xfs_extlen_t xfs_allocbt_calc_size(struct xfs_mount *mp,
>  void xfs_allocbt_commit_staged_btree(struct xfs_btree_cur *cur,
>  		struct xfs_trans *tp, struct xfs_buf *agbp);
>  
> +unsigned int xfs_allocbt_absolute_maxlevels(void);
> +
>  #endif	/* __XFS_ALLOC_BTREE_H__ */
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index 2ae5bf9a74e7..7e70df8d1a9b 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -93,6 +93,7 @@ xfs_bmap_compute_maxlevels(
>  			maxblocks = (maxblocks + minnoderecs - 1) / minnoderecs;
>  	}
>  	mp->m_bm_maxlevels[whichfork] = level;
> +	ASSERT(mp->m_bm_maxlevels[whichfork] <= xfs_bmbt_absolute_maxlevels());
>  }
>  
>  unsigned int
> diff --git a/fs/xfs/libxfs/xfs_bmap_btree.c b/fs/xfs/libxfs/xfs_bmap_btree.c
> index b90122de0df0..7001aff639d2 100644
> --- a/fs/xfs/libxfs/xfs_bmap_btree.c
> +++ b/fs/xfs/libxfs/xfs_bmap_btree.c
> @@ -587,6 +587,20 @@ xfs_bmbt_maxrecs(
>  	return blocklen / (sizeof(xfs_bmbt_key_t) + sizeof(xfs_bmbt_ptr_t));
>  }
>  
> +/* Compute the max possible height of the maximally sized bmap btree. */
> +unsigned int
> +xfs_bmbt_absolute_maxlevels(void)
> +{
> +	unsigned int		minrecs[2];
> +
> +	xfs_btree_absolute_minrecs(minrecs, XFS_BTREE_LONG_PTRS,
> +			sizeof(struct xfs_bmbt_rec),
> +			sizeof(struct xfs_bmbt_key) +
> +				sizeof(xfs_bmbt_ptr_t));
> +
> +	return xfs_btree_compute_maxlevels(minrecs, MAXEXTNUM) + 1;
> +}

	minrecs[0] = xfs_bmbt_block_maxrecs(
			XFS_MIN_BLOCKSIZE - XFS_BTREE_LBLOCK_LEN, true) / 2;
	....

> +
>  /*
>   * Calculate number of records in a bmap btree inode root.
>   */
> diff --git a/fs/xfs/libxfs/xfs_bmap_btree.h b/fs/xfs/libxfs/xfs_bmap_btree.h
> index 729e3bc569be..e9218e92526b 100644
> --- a/fs/xfs/libxfs/xfs_bmap_btree.h
> +++ b/fs/xfs/libxfs/xfs_bmap_btree.h
> @@ -110,4 +110,6 @@ extern struct xfs_btree_cur *xfs_bmbt_init_cursor(struct xfs_mount *,
>  extern unsigned long long xfs_bmbt_calc_size(struct xfs_mount *mp,
>  		unsigned long long len);
>  
> +unsigned int xfs_bmbt_absolute_maxlevels(void);
> +
>  #endif	/* __XFS_BMAP_BTREE_H__ */
> diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
> index b95c817ad90d..bea1bdf9b8b9 100644
> --- a/fs/xfs/libxfs/xfs_btree.c
> +++ b/fs/xfs/libxfs/xfs_btree.c
> @@ -4964,3 +4964,44 @@ xfs_btree_has_more_records(
>  	else
>  		return block->bb_u.s.bb_rightsib != cpu_to_be32(NULLAGBLOCK);
>  }
> +
> +/*
> + * Compute absolute minrecs for leaf and node btree blocks.  Callers should set
> + * BTREE_LONG_PTRS and BTREE_OVERLAPPING as they would for regular cursors.
> + * Set BTREE_CRC_BLOCKS if the btree type is supported /only/ on V5 or newer
> + * filesystems.
> + */
> +void
> +xfs_btree_absolute_minrecs(
> +	unsigned int		*minrecs,
> +	unsigned int		bc_flags,
> +	unsigned int		leaf_recbytes,
> +	unsigned int		node_recbytes)
> +{
> +	unsigned int		min_recbytes;
> +
> +	/*
> +	 * If this btree type is supported on V4, we use the smaller V4 min
> +	 * block size along with the V4 header size.  If the btree type is only
> +	 * supported on V5, use the (twice as large) V5 min block size along
> +	 * with the V5 header size.
> +	 */
> +	if (!(bc_flags & XFS_BTREE_CRC_BLOCKS)) {
> +		if (bc_flags & XFS_BTREE_LONG_PTRS)
> +			min_recbytes = XFS_MIN_BLOCKSIZE -
> +							XFS_BTREE_LBLOCK_LEN;
> +		else
> +			min_recbytes = XFS_MIN_BLOCKSIZE -
> +							XFS_BTREE_SBLOCK_LEN;
> +	} else if (bc_flags & XFS_BTREE_LONG_PTRS) {
> +		min_recbytes = XFS_MIN_CRC_BLOCKSIZE - XFS_BTREE_LBLOCK_CRC_LEN;
> +	} else {
> +		min_recbytes = XFS_MIN_CRC_BLOCKSIZE - XFS_BTREE_SBLOCK_CRC_LEN;
> +	}
> +
> +	if (bc_flags & XFS_BTREE_OVERLAPPING)
> +		node_recbytes <<= 1;
> +
> +	minrecs[0] = min_recbytes / (2 * leaf_recbytes);
> +	minrecs[1] = min_recbytes / (2 * node_recbytes);
> +}

This can go away.

> diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
> index 20a2828c11ef..acb202839afd 100644
> --- a/fs/xfs/libxfs/xfs_btree.h
> +++ b/fs/xfs/libxfs/xfs_btree.h
> @@ -601,4 +601,7 @@ xfs_btree_alloc_cursor(
>  	return cur;
>  }
>  
> +void xfs_btree_absolute_minrecs(unsigned int *minrecs, unsigned int bc_flags,
> +		unsigned int leaf_recbytes, unsigned int node_recbytes);
> +
>  #endif	/* __XFS_BTREE_H__ */
> diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
> index bde2b4c64dbe..c43877c8a279 100644
> --- a/fs/xfs/libxfs/xfs_fs.h
> +++ b/fs/xfs/libxfs/xfs_fs.h
> @@ -268,6 +268,8 @@ typedef struct xfs_fsop_resblks {
>   */
>  #define XFS_MIN_AG_BYTES	(1ULL << 24)	/* 16 MB */
>  #define XFS_MAX_AG_BYTES	(1ULL << 40)	/* 1 TB */
> +#define XFS_MAX_AG_BLOCKS	(XFS_MAX_AG_BYTES / XFS_MIN_BLOCKSIZE)
> +#define XFS_MAX_CRC_AG_BLOCKS	(XFS_MAX_AG_BYTES / XFS_MIN_CRC_BLOCKSIZE)
>  
>  /* keep the maximum size under 2^31 by a small amount */
>  #define XFS_MAX_LOG_BYTES \
> diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
> index 994ad783d407..017aebdda42f 100644
> --- a/fs/xfs/libxfs/xfs_ialloc.c
> +++ b/fs/xfs/libxfs/xfs_ialloc.c
> @@ -2793,6 +2793,7 @@ xfs_ialloc_setup_geometry(
>  	inodes = (1LL << XFS_INO_AGINO_BITS(mp)) >> XFS_INODES_PER_CHUNK_LOG;
>  	igeo->inobt_maxlevels = xfs_btree_compute_maxlevels(igeo->inobt_mnr,
>  			inodes);
> +	ASSERT(igeo->inobt_maxlevels <= xfs_inobt_absolute_maxlevels());
>  
>  	/*
>  	 * Set the maximum inode count for this filesystem, being careful not
> diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
> index 3a5a24648b87..2e3dd1d798bd 100644
> --- a/fs/xfs/libxfs/xfs_ialloc_btree.c
> +++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
> @@ -542,6 +542,25 @@ xfs_inobt_maxrecs(
>  	return blocklen / (sizeof(xfs_inobt_key_t) + sizeof(xfs_inobt_ptr_t));
>  }
>  
> +/* Compute the max possible height of the maximally sized inode btree. */
> +unsigned int
> +xfs_inobt_absolute_maxlevels(void)
> +{
> +	unsigned int		minrecs[2];
> +	unsigned long long	max_ag_inodes;
> +
> +	/*
> +	 * For the absolute maximum, pretend that we can fill an entire AG
> +	 * completely full of inodes except for the AG headers.
> +	 */
> +	max_ag_inodes = (XFS_MAX_AG_BYTES - (4 * BBSIZE)) / XFS_DINODE_MIN_SIZE;
> +
> +	xfs_btree_absolute_minrecs(minrecs, 0, sizeof(xfs_inobt_rec_t),
> +			sizeof(xfs_inobt_key_t) + sizeof(xfs_inobt_ptr_t));
> +
> +	return xfs_btree_compute_maxlevels(minrecs, max_ag_inodes);
> +}

We've got two different inobt max levels on disk: the inobt has v4
limits, whilst the finobt has v5 limits...

> +/* Compute the max possible height of the maximally sized rmap btree. */
> +unsigned int
> +xfs_rmapbt_absolute_maxlevels(void)
> +{
> +	unsigned int		minrecs[2];
> +
> +	xfs_btree_absolute_minrecs(minrecs,
> +			XFS_BTREE_CRC_BLOCKS | XFS_BTREE_OVERLAPPING,
> +			sizeof(struct xfs_rmap_rec),
> +			sizeof(struct xfs_rmap_key) + sizeof(xfs_rmap_ptr_t));
> +
> +	/*
> +	 * Compute the asymptotic maxlevels for an rmapbt on any reflink fs.
> +	 *
> +	 * On a reflink filesystem, each AG block can have up to 2^32 (per the
> +	 * refcount record format) owners, which means that theoretically we
> +	 * could face up to 2^64 rmap records.  However, we're likely to run
> +	 * out of blocks in the AG long before that happens, which means that
> +	 * we must compute the max height based on what the btree will look
> +	 * like if it consumes almost all the blocks in the AG due to maximal
> +	 * sharing factor.
> +	 */
> +	return xfs_btree_compute_maxlevels_size(XFS_MAX_CRC_AG_BLOCKS,

Huh. I don't know where XFS_MAX_CRC_AG_BLOCKS is defined. I must
have missed it somewhere?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 15/15] xfs: use separate btree cursor cache for each btree type
  2021-10-12 23:33 ` [PATCH 15/15] xfs: use separate btree cursor cache " Darrick J. Wong
@ 2021-10-13  8:01   ` Dave Chinner
  2021-10-13 21:42     ` Darrick J. Wong
  0 siblings, 1 reply; 41+ messages in thread
From: Dave Chinner @ 2021-10-13  8:01 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, chandan.babu, hch

On Tue, Oct 12, 2021 at 04:33:56PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Now that we have the infrastructure to track the max possible height of
> each btree type, we can create a separate slab cache for cursors of each
> type of btree.  For smaller indices like the free space btrees, this
> means that we can pack more cursors into a slab page, improving slab
> utilization.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/libxfs/xfs_alloc_btree.c    |   21 ++++++++++++++
>  fs/xfs/libxfs/xfs_alloc_btree.h    |    3 ++
>  fs/xfs/libxfs/xfs_bmap_btree.c     |   21 ++++++++++++++
>  fs/xfs/libxfs/xfs_bmap_btree.h     |    3 ++
>  fs/xfs/libxfs/xfs_btree.c          |    7 +----
>  fs/xfs/libxfs/xfs_btree.h          |   17 +++---------
>  fs/xfs/libxfs/xfs_ialloc_btree.c   |   21 ++++++++++++++
>  fs/xfs/libxfs/xfs_ialloc_btree.h   |    3 ++
>  fs/xfs/libxfs/xfs_refcount_btree.c |   21 ++++++++++++++
>  fs/xfs/libxfs/xfs_refcount_btree.h |    3 ++
>  fs/xfs/libxfs/xfs_rmap_btree.c     |   21 ++++++++++++++
>  fs/xfs/libxfs/xfs_rmap_btree.h     |    3 ++
>  fs/xfs/xfs_super.c                 |   53 ++++++++++++++++++++++++++++++++----
>  13 files changed, 168 insertions(+), 29 deletions(-)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_alloc_btree.c b/fs/xfs/libxfs/xfs_alloc_btree.c
> index 61f6d266b822..4c5942146b05 100644
> --- a/fs/xfs/libxfs/xfs_alloc_btree.c
> +++ b/fs/xfs/libxfs/xfs_alloc_btree.c
> @@ -20,6 +20,7 @@
>  #include "xfs_trans.h"
>  #include "xfs_ag.h"
>  
> +static kmem_zone_t	*xfs_allocbt_cur_cache;
>  
>  STATIC struct xfs_btree_cur *
>  xfs_allocbt_dup_cursor(
> @@ -477,7 +478,8 @@ xfs_allocbt_init_common(
>  
>  	ASSERT(btnum == XFS_BTNUM_BNO || btnum == XFS_BTNUM_CNT);
>  
> -	cur = xfs_btree_alloc_cursor(mp, tp, btnum, mp->m_ag_maxlevels);
> +	cur = xfs_btree_alloc_cursor(mp, tp, btnum, mp->m_ag_maxlevels,
> +			xfs_allocbt_cur_cache);
>  	cur->bc_ag.abt.active = false;
>  
>  	if (btnum == XFS_BTNUM_CNT) {
> @@ -603,3 +605,20 @@ xfs_allocbt_calc_size(
>  {
>  	return xfs_btree_calc_size(mp->m_alloc_mnr, len);
>  }
> +
> +int __init
> +xfs_allocbt_init_cur_cache(void)
> +{
> +	xfs_allocbt_cur_cache = kmem_cache_create("xfs_bnobt_cur",
> +			xfs_btree_cur_sizeof(xfs_allocbt_absolute_maxlevels()),
> +			0, 0, NULL);
> +
> +	return xfs_allocbt_cur_cache != NULL ? 0 : -ENOMEM;

	if (!xfs_allocbt_cur_cache)
		return -ENOMEM;
	return 0;

(and the others :)

> +}
> +
> +void
> +xfs_allocbt_destroy_cur_cache(void)
> +{
> +	kmem_cache_destroy(xfs_allocbt_cur_cache);
> +	xfs_allocbt_cur_cache = NULL;
> +}
> diff --git a/fs/xfs/libxfs/xfs_alloc_btree.h b/fs/xfs/libxfs/xfs_alloc_btree.h
> index c47d0e285435..82a9b3201f91 100644
> --- a/fs/xfs/libxfs/xfs_alloc_btree.h
> +++ b/fs/xfs/libxfs/xfs_alloc_btree.h
> @@ -62,4 +62,7 @@ void xfs_allocbt_commit_staged_btree(struct xfs_btree_cur *cur,
>  
>  unsigned int xfs_allocbt_absolute_maxlevels(void);
>  
> +int __init xfs_allocbt_init_cur_cache(void);
> +void xfs_allocbt_destroy_cur_cache(void);
> +
>  #endif	/* __XFS_ALLOC_BTREE_H__ */
> diff --git a/fs/xfs/libxfs/xfs_bmap_btree.c b/fs/xfs/libxfs/xfs_bmap_btree.c
> index 7001aff639d2..99261d51d2c3 100644
> --- a/fs/xfs/libxfs/xfs_bmap_btree.c
> +++ b/fs/xfs/libxfs/xfs_bmap_btree.c
> @@ -22,6 +22,8 @@
>  #include "xfs_trace.h"
>  #include "xfs_rmap.h"
>  
> +static kmem_zone_t	*xfs_bmbt_cur_cache;
> +
>  /*
>   * Convert on-disk form of btree root to in-memory form.
>   */
> @@ -553,7 +555,7 @@ xfs_bmbt_init_cursor(
>  	ASSERT(whichfork != XFS_COW_FORK);
>  
>  	cur = xfs_btree_alloc_cursor(mp, tp, XFS_BTNUM_BMAP,
> -			mp->m_bm_maxlevels[whichfork]);
> +			mp->m_bm_maxlevels[whichfork], xfs_bmbt_cur_cache);
>  	cur->bc_nlevels = be16_to_cpu(ifp->if_broot->bb_level) + 1;
>  	cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_bmbt_2);
>  
> @@ -664,3 +666,20 @@ xfs_bmbt_calc_size(
>  {
>  	return xfs_btree_calc_size(mp->m_bmap_dmnr, len);
>  }
> +
> +int __init
> +xfs_bmbt_init_cur_cache(void)
> +{
> +	xfs_bmbt_cur_cache = kmem_cache_create("xfs_bmbt_cur",
> +			xfs_btree_cur_sizeof(xfs_bmbt_absolute_maxlevels()),
> +			0, 0, NULL);
> +
> +	return xfs_bmbt_cur_cache != NULL ? 0 : -ENOMEM;
> +}
> +
> +void
> +xfs_bmbt_destroy_cur_cache(void)
> +{
> +	kmem_cache_destroy(xfs_bmbt_cur_cache);
> +	xfs_bmbt_cur_cache = NULL;
> +}
> diff --git a/fs/xfs/libxfs/xfs_bmap_btree.h b/fs/xfs/libxfs/xfs_bmap_btree.h
> index e9218e92526b..4c752f7341df 100644
> --- a/fs/xfs/libxfs/xfs_bmap_btree.h
> +++ b/fs/xfs/libxfs/xfs_bmap_btree.h
> @@ -112,4 +112,7 @@ extern unsigned long long xfs_bmbt_calc_size(struct xfs_mount *mp,
>  
>  unsigned int xfs_bmbt_absolute_maxlevels(void);
>  
> +int __init xfs_bmbt_init_cur_cache(void);
> +void xfs_bmbt_destroy_cur_cache(void);
> +
>  #endif	/* __XFS_BMAP_BTREE_H__ */
> diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
> index bea1bdf9b8b9..11ff814996a1 100644
> --- a/fs/xfs/libxfs/xfs_btree.c
> +++ b/fs/xfs/libxfs/xfs_btree.c
> @@ -23,11 +23,6 @@
>  #include "xfs_btree_staging.h"
>  #include "xfs_ag.h"
>  
> -/*
> - * Cursor allocation zone.
> - */
> -kmem_zone_t	*xfs_btree_cur_zone;
> -
>  /*
>   * Btree magic numbers.
>   */
> @@ -379,7 +374,7 @@ xfs_btree_del_cursor(
>  		kmem_free(cur->bc_ops);
>  	if (!(cur->bc_flags & XFS_BTREE_LONG_PTRS) && cur->bc_ag.pag)
>  		xfs_perag_put(cur->bc_ag.pag);
> -	kmem_cache_free(xfs_btree_cur_zone, cur);
> +	kmem_cache_free(cur->bc_cache, cur);
>  }
>  
>  /*
> diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
> index acb202839afd..6d61ce1559e2 100644
> --- a/fs/xfs/libxfs/xfs_btree.h
> +++ b/fs/xfs/libxfs/xfs_btree.h
> @@ -13,8 +13,6 @@ struct xfs_trans;
>  struct xfs_ifork;
>  struct xfs_perag;
>  
> -extern kmem_zone_t	*xfs_btree_cur_zone;
> -
>  /*
>   * Generic key, ptr and record wrapper structures.
>   *
> @@ -92,12 +90,6 @@ uint32_t xfs_btree_magic(int crc, xfs_btnum_t btnum);
>  #define XFS_BTREE_STATS_ADD(cur, stat, val)	\
>  	XFS_STATS_ADD_OFF((cur)->bc_mp, (cur)->bc_statoff + __XBTS_ ## stat, val)
>  
> -/*
> - * The btree cursor zone hands out cursors that can handle up to this many
> - * levels.  This is the known maximum for all btree types.
> - */
> -#define XFS_BTREE_CUR_ZONE_MAXLEVELS	(9)
> -
>  struct xfs_btree_ops {
>  	/* size of the key and record structures */
>  	size_t	key_len;
> @@ -238,6 +230,7 @@ struct xfs_btree_cur
>  	struct xfs_trans	*bc_tp;	/* transaction we're in, if any */
>  	struct xfs_mount	*bc_mp;	/* file system mount struct */
>  	const struct xfs_btree_ops *bc_ops;
> +	kmem_zone_t		*bc_cache; /* cursor cache */
>  	unsigned int		bc_flags; /* btree features - below */
>  	xfs_btnum_t		bc_btnum; /* identifies which btree type */
>  	union xfs_btree_irec	bc_rec;	/* current insert/search record value */
> @@ -586,17 +579,17 @@ xfs_btree_alloc_cursor(
>  	struct xfs_mount	*mp,
>  	struct xfs_trans	*tp,
>  	xfs_btnum_t		btnum,
> -	uint8_t			maxlevels)
> +	uint8_t			maxlevels,
> +	kmem_zone_t		*cache)
>  {
>  	struct xfs_btree_cur	*cur;
>  
> -	ASSERT(maxlevels <= XFS_BTREE_CUR_ZONE_MAXLEVELS);
> -
> -	cur = kmem_cache_zalloc(xfs_btree_cur_zone, GFP_NOFS | __GFP_NOFAIL);
> +	cur = kmem_cache_zalloc(cache, GFP_NOFS | __GFP_NOFAIL);
>  	cur->bc_tp = tp;
>  	cur->bc_mp = mp;
>  	cur->bc_btnum = btnum;
>  	cur->bc_maxlevels = maxlevels;
> +	cur->bc_cache = cache;
>  
>  	return cur;
>  }
> diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
> index 2e3dd1d798bd..2502085d476c 100644
> --- a/fs/xfs/libxfs/xfs_ialloc_btree.c
> +++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
> @@ -22,6 +22,8 @@
>  #include "xfs_rmap.h"
>  #include "xfs_ag.h"
>  
> +static kmem_zone_t	*xfs_inobt_cur_cache;
> +
>  STATIC int
>  xfs_inobt_get_minrecs(
>  	struct xfs_btree_cur	*cur,
> @@ -433,7 +435,7 @@ xfs_inobt_init_common(
>  	struct xfs_btree_cur	*cur;
>  
>  	cur = xfs_btree_alloc_cursor(mp, tp, btnum,
> -			M_IGEO(mp)->inobt_maxlevels);
> +			M_IGEO(mp)->inobt_maxlevels, xfs_inobt_cur_cache);
>  	if (btnum == XFS_BTNUM_INO) {
>  		cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_ibt_2);
>  		cur->bc_ops = &xfs_inobt_ops;
> @@ -776,3 +778,20 @@ xfs_iallocbt_calc_size(
>  {
>  	return xfs_btree_calc_size(M_IGEO(mp)->inobt_mnr, len);
>  }
> +
> +int __init
> +xfs_inobt_init_cur_cache(void)
> +{
> +	xfs_inobt_cur_cache = kmem_cache_create("xfs_inobt_cur",
> +			xfs_btree_cur_sizeof(xfs_inobt_absolute_maxlevels()),
> +			0, 0, NULL);
> +
> +	return xfs_inobt_cur_cache != NULL ? 0 : -ENOMEM;
> +}
> +
> +void
> +xfs_inobt_destroy_cur_cache(void)
> +{
> +	kmem_cache_destroy(xfs_inobt_cur_cache);
> +	xfs_inobt_cur_cache = NULL;
> +}
> diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.h b/fs/xfs/libxfs/xfs_ialloc_btree.h
> index 1f09530bf856..b384733d5e0f 100644
> --- a/fs/xfs/libxfs/xfs_ialloc_btree.h
> +++ b/fs/xfs/libxfs/xfs_ialloc_btree.h
> @@ -77,4 +77,7 @@ void xfs_inobt_commit_staged_btree(struct xfs_btree_cur *cur,
>  
>  unsigned int xfs_inobt_absolute_maxlevels(void);
>  
> +int __init xfs_inobt_init_cur_cache(void);
> +void xfs_inobt_destroy_cur_cache(void);
> +
>  #endif	/* __XFS_IALLOC_BTREE_H__ */
> diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c
> index bacd1b442b09..ba27a3ea2ce2 100644
> --- a/fs/xfs/libxfs/xfs_refcount_btree.c
> +++ b/fs/xfs/libxfs/xfs_refcount_btree.c
> @@ -21,6 +21,8 @@
>  #include "xfs_rmap.h"
>  #include "xfs_ag.h"
>  
> +static kmem_zone_t	*xfs_refcountbt_cur_cache;
> +
>  static struct xfs_btree_cur *
>  xfs_refcountbt_dup_cursor(
>  	struct xfs_btree_cur	*cur)
> @@ -323,7 +325,7 @@ xfs_refcountbt_init_common(
>  	ASSERT(pag->pag_agno < mp->m_sb.sb_agcount);
>  
>  	cur = xfs_btree_alloc_cursor(mp, tp, XFS_BTNUM_REFC,
> -			mp->m_refc_maxlevels);
> +			mp->m_refc_maxlevels, xfs_refcountbt_cur_cache);
>  	cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_refcbt_2);
>  
>  	cur->bc_flags |= XFS_BTREE_CRC_BLOCKS;
> @@ -505,3 +507,20 @@ xfs_refcountbt_calc_reserves(
>  
>  	return error;
>  }
> +
> +int __init
> +xfs_refcountbt_init_cur_cache(void)
> +{
> +	xfs_refcountbt_cur_cache = kmem_cache_create("xfs_refcbt_cur",
> +			xfs_btree_cur_sizeof(xfs_refcountbt_absolute_maxlevels()),
> +			0, 0, NULL);
> +
> +	return xfs_refcountbt_cur_cache != NULL ? 0 : -ENOMEM;
> +}
> +
> +void
> +xfs_refcountbt_destroy_cur_cache(void)
> +{
> +	kmem_cache_destroy(xfs_refcountbt_cur_cache);
> +	xfs_refcountbt_cur_cache = NULL;
> +}
> diff --git a/fs/xfs/libxfs/xfs_refcount_btree.h b/fs/xfs/libxfs/xfs_refcount_btree.h
> index 2625b08f50a8..a1437d0a5717 100644
> --- a/fs/xfs/libxfs/xfs_refcount_btree.h
> +++ b/fs/xfs/libxfs/xfs_refcount_btree.h
> @@ -67,4 +67,7 @@ void xfs_refcountbt_commit_staged_btree(struct xfs_btree_cur *cur,
>  
>  unsigned int xfs_refcountbt_absolute_maxlevels(void);
>  
> +int __init xfs_refcountbt_init_cur_cache(void);
> +void xfs_refcountbt_destroy_cur_cache(void);
> +
>  #endif	/* __XFS_REFCOUNT_BTREE_H__ */
> diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
> index 860627b5ec08..0a9bc37c01d0 100644
> --- a/fs/xfs/libxfs/xfs_rmap_btree.c
> +++ b/fs/xfs/libxfs/xfs_rmap_btree.c
> @@ -22,6 +22,8 @@
>  #include "xfs_ag.h"
>  #include "xfs_ag_resv.h"
>  
> +static kmem_zone_t	*xfs_rmapbt_cur_cache;
> +
>  /*
>   * Reverse map btree.
>   *
> @@ -453,7 +455,7 @@ xfs_rmapbt_init_common(
>  
>  	/* Overlapping btree; 2 keys per pointer. */
>  	cur = xfs_btree_alloc_cursor(mp, tp, XFS_BTNUM_RMAP,
> -			mp->m_rmap_maxlevels);
> +			mp->m_rmap_maxlevels, xfs_rmapbt_cur_cache);
>  	cur->bc_flags = XFS_BTREE_CRC_BLOCKS | XFS_BTREE_OVERLAPPING;
>  	cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_rmap_2);
>  	cur->bc_ops = &xfs_rmapbt_ops;
> @@ -670,3 +672,20 @@ xfs_rmapbt_calc_reserves(
>  
>  	return error;
>  }
> +
> +int __init
> +xfs_rmapbt_init_cur_cache(void)
> +{
> +	xfs_rmapbt_cur_cache = kmem_cache_create("xfs_rmapbt_cur",
> +			xfs_btree_cur_sizeof(xfs_rmapbt_absolute_maxlevels()),
> +			0, 0, NULL);
> +
> +	return xfs_rmapbt_cur_cache != NULL ? 0 : -ENOMEM;
> +}
> +
> +void
> +xfs_rmapbt_destroy_cur_cache(void)
> +{
> +	kmem_cache_destroy(xfs_rmapbt_cur_cache);
> +	xfs_rmapbt_cur_cache = NULL;
> +}
> diff --git a/fs/xfs/libxfs/xfs_rmap_btree.h b/fs/xfs/libxfs/xfs_rmap_btree.h
> index 84fe74de923f..dd5422850656 100644
> --- a/fs/xfs/libxfs/xfs_rmap_btree.h
> +++ b/fs/xfs/libxfs/xfs_rmap_btree.h
> @@ -61,4 +61,7 @@ extern int xfs_rmapbt_calc_reserves(struct xfs_mount *mp, struct xfs_trans *tp,
>  
>  unsigned int xfs_rmapbt_absolute_maxlevels(void);
>  
> +int __init xfs_rmapbt_init_cur_cache(void);
> +void xfs_rmapbt_destroy_cur_cache(void);
> +
>  #endif /* __XFS_RMAP_BTREE_H__ */
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 90c92a6a49e0..399d7cfc7d4b 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -37,6 +37,13 @@
>  #include "xfs_reflink.h"
>  #include "xfs_pwork.h"
>  #include "xfs_ag.h"
> +#include "xfs_btree.h"
> +#include "xfs_alloc_btree.h"
> +#include "xfs_ialloc_btree.h"
> +#include "xfs_bmap_btree.h"
> +#include "xfs_rmap_btree.h"
> +#include "xfs_refcount_btree.h"
> +
>  
>  #include <linux/magic.h>
>  #include <linux/fs_context.h>
> @@ -1950,9 +1957,45 @@ static struct file_system_type xfs_fs_type = {
>  };
>  MODULE_ALIAS_FS("xfs");
>  
> +STATIC int __init
> +xfs_init_btree_cur_caches(void)
> +{
> +	int				error;
> +
> +	error = xfs_allocbt_init_cur_cache();
> +	if (error)
> +		return error;
> +	error = xfs_inobt_init_cur_cache();
> +	if (error)
> +		return error;
> +	error = xfs_bmbt_init_cur_cache();
> +	if (error)
> +		return error;
> +	error = xfs_rmapbt_init_cur_cache();
> +	if (error)
> +		return error;
> +	error = xfs_refcountbt_init_cur_cache();
> +	if (error)
> +		return error;
> +
> +	return 0;
> +}
> +
> +STATIC void
> +xfs_destroy_btree_cur_caches(void)
> +{
> +	xfs_allocbt_destroy_cur_cache();
> +	xfs_inobt_destroy_cur_cache();
> +	xfs_bmbt_destroy_cur_cache();
> +	xfs_rmapbt_destroy_cur_cache();
> +	xfs_refcountbt_destroy_cur_cache();
> +}

Move these to libxfs/xfs_btree.c and then it minimises the custom
init code for userspace. It also means you don't need to include
all the individual btree headers in xfs_super.c...

Otherwise it all looks ok.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 04/15] xfs: dynamically allocate btree scrub context structure
  2021-10-13  4:57   ` Dave Chinner
@ 2021-10-13 16:29     ` Darrick J. Wong
  0 siblings, 0 replies; 41+ messages in thread
From: Darrick J. Wong @ 2021-10-13 16:29 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, chandan.babu, hch

On Wed, Oct 13, 2021 at 03:57:38PM +1100, Dave Chinner wrote:
> On Tue, Oct 12, 2021 at 04:32:55PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Reorganize struct xchk_btree so that we can dynamically size the context
> > structure to fit the type of btree cursor that we have.  This will
> > enable us to use memory more efficiently once we start adding very tall
> > btree types.  Right-size the lastkey array so that we stop wasting the
> > first array element.
> 
> "right size"?
> 
> I'm assuming this is the "nlevels - 1" bit?

Yep.  I'll change the last sentence to:

"Right-size the lastkey array to match the number of node levels in the
btree so that we stop wasting space."

> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  fs/xfs/scrub/btree.c |   23 ++++++++++++-----------
> >  fs/xfs/scrub/btree.h |   11 ++++++++++-
> >  2 files changed, 22 insertions(+), 12 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/scrub/btree.c b/fs/xfs/scrub/btree.c
> > index d5e1ca521fc4..6d4eba85ef77 100644
> > --- a/fs/xfs/scrub/btree.c
> > +++ b/fs/xfs/scrub/btree.c
> > @@ -189,9 +189,9 @@ xchk_btree_key(
> >  
> >  	/* If this isn't the first key, are they in order? */
> >  	if (cur->bc_ptrs[level] > 1 &&
> > -	    !cur->bc_ops->keys_inorder(cur, &bs->lastkey[level], key))
> > +	    !cur->bc_ops->keys_inorder(cur, &bs->lastkey[level - 1], key))
> >  		xchk_btree_set_corrupt(bs->sc, cur, level);
> > -	memcpy(&bs->lastkey[level], key, cur->bc_ops->key_len);
> > +	memcpy(&bs->lastkey[level - 1], key, cur->bc_ops->key_len);
> >  
> >  	if (level + 1 >= cur->bc_nlevels)
> >  		return;
> > @@ -631,17 +631,24 @@ xchk_btree(
> >  	union xfs_btree_ptr		*pp;
> >  	union xfs_btree_rec		*recp;
> >  	struct xfs_btree_block		*block;
> > -	int				level;
> >  	struct xfs_buf			*bp;
> >  	struct check_owner		*co;
> >  	struct check_owner		*n;
> > +	size_t				cur_sz;
> > +	int				level;
> >  	int				error = 0;
> >  
> >  	/*
> >  	 * Allocate the btree scrub context from the heap, because this
> > -	 * structure can get rather large.
> > +	 * structure can get rather large.  Don't let a caller feed us a
> > +	 * totally absurd size.
> >  	 */
> > -	bs = kmem_zalloc(sizeof(struct xchk_btree), KM_NOFS | KM_MAYFAIL);
> > +	cur_sz = xchk_btree_sizeof(cur->bc_nlevels);
> > +	if (cur_sz > PAGE_SIZE) {
> > +		xchk_btree_set_corrupt(sc, cur, 0);
> > +		return 0;
> > +	}
> > +	bs = kmem_zalloc(cur_sz, KM_NOFS | KM_MAYFAIL);
> >  	if (!bs)
> >  		return -ENOMEM;
> >  	bs->cur = cur;
> > @@ -653,12 +660,6 @@ xchk_btree(
> >  	/* Initialize scrub state */
> >  	INIT_LIST_HEAD(&bs->to_check);
> >  
> > -	/* Don't try to check a tree with a height we can't handle. */
> > -	if (cur->bc_nlevels > XFS_BTREE_MAXLEVELS) {
> > -		xchk_btree_set_corrupt(sc, cur, 0);
> > -		goto out;
> > -	}
> > -
> >  	/*
> >  	 * Load the root of the btree.  The helper function absorbs
> >  	 * error codes for us.
> > diff --git a/fs/xfs/scrub/btree.h b/fs/xfs/scrub/btree.h
> > index 7671108f9f85..62c3091ef20f 100644
> > --- a/fs/xfs/scrub/btree.h
> > +++ b/fs/xfs/scrub/btree.h
> > @@ -39,9 +39,18 @@ struct xchk_btree {
> >  
> >  	/* internal scrub state */
> >  	union xfs_btree_rec		lastrec;
> > -	union xfs_btree_key		lastkey[XFS_BTREE_MAXLEVELS];
> >  	struct list_head		to_check;
> > +
> > +	/* this element must come last! */
> > +	union xfs_btree_key		lastkey[];
> >  };
> > +
> > +static inline size_t
> > +xchk_btree_sizeof(unsigned int nlevels)
> > +{
> > +	return struct_size((struct xchk_btree *)NULL, lastkey, nlevels - 1);
> > +}
> 
> I'd like a comment here indicating that the max number of keys is
> "nlevels - 1" because the last level of the tree is records and
> that's held in a separate lastrec field...
> 
> That way there's a reminder of why there's a "- 1" here without
> having to work it out from first principles every time we look at this
> code...

Ok; I've added the comment:

/*
 * Calculate the size of a xchk_btree structure.  There are nlevels-1
 * slots for keys because we track leaf records separately in lastrec.
 */

> Otherwise it seems reasonable.

<nod>

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: [PATCH 05/15] xfs: support dynamic btree cursor heights
  2021-10-13  5:31   ` Dave Chinner
@ 2021-10-13 16:52     ` Darrick J. Wong
  2021-10-13 21:14       ` Dave Chinner
  0 siblings, 1 reply; 41+ messages in thread
From: Darrick J. Wong @ 2021-10-13 16:52 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Chandan Babu R, Christoph Hellwig, linux-xfs

On Wed, Oct 13, 2021 at 04:31:22PM +1100, Dave Chinner wrote:
> On Tue, Oct 12, 2021 at 04:33:01PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Split out the btree level information into a separate struct and put it
> > at the end of the cursor structure as a VLA.  The realtime rmap btree
> > (which is rooted in an inode) will require the ability to support many
> > more levels than a per-AG btree cursor, which means that we're going to
> > create two btree cursor caches to conserve memory for the more common
> > case.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
> > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > ---
> >  fs/xfs/libxfs/xfs_alloc.c |    6 +-
> >  fs/xfs/libxfs/xfs_bmap.c  |   10 +--
> >  fs/xfs/libxfs/xfs_btree.c |  168 +++++++++++++++++++++++----------------------
> >  fs/xfs/libxfs/xfs_btree.h |   28 ++++++--
> >  fs/xfs/scrub/bitmap.c     |   22 +++---
> >  fs/xfs/scrub/bmap.c       |    2 -
> >  fs/xfs/scrub/btree.c      |   47 +++++++------
> >  fs/xfs/scrub/trace.c      |    7 +-
> >  fs/xfs/scrub/trace.h      |   10 +--
> >  fs/xfs/xfs_super.c        |    2 -
> >  fs/xfs/xfs_trace.h        |    2 -
> >  11 files changed, 164 insertions(+), 140 deletions(-)
> 
> Hmmm - subject of the patch doesn't really match the changes being
> made - there's nothing here that makes the btree cursor heights
> dynamic. It's just a structure layout change...

"xfs: prepare xfs_btree_cur for dynamic cursor heights" ?

> 
> > @@ -415,9 +415,9 @@ xfs_btree_dup_cursor(
> >  	 * For each level current, re-get the buffer and copy the ptr value.
> >  	 */
> >  	for (i = 0; i < new->bc_nlevels; i++) {
> > -		new->bc_ptrs[i] = cur->bc_ptrs[i];
> > -		new->bc_ra[i] = cur->bc_ra[i];
> > -		bp = cur->bc_bufs[i];
> > +		new->bc_levels[i].ptr = cur->bc_levels[i].ptr;
> > +		new->bc_levels[i].ra = cur->bc_levels[i].ra;
> > +		bp = cur->bc_levels[i].bp;
> >  		if (bp) {
> >  			error = xfs_trans_read_buf(mp, tp, mp->m_ddev_targp,
> >  						   xfs_buf_daddr(bp), mp->m_bsize,
> > @@ -429,7 +429,7 @@ xfs_btree_dup_cursor(
> >  				return error;
> >  			}
> >  		}
> > -		new->bc_bufs[i] = bp;
> > +		new->bc_levels[i].bp = bp;
> >  	}
> >  	*ncur = new;
> >  	return 0;
> 
> ObHuh: that dup_cursor code seems like a really obtuse way of doing:
> 
> 	bip = cur->bc_levels[i].bp->b_log_item;
> 	bip->bli_recur++;
> 	new->bc_levels[i] = cur->bc_levels[i];
> 
> But that's not a problem this patch needs to solve. Just something
> that made me go hmmmm...

Yeah, I noticed that too while I was checking the results of my sed
script.

> > @@ -922,11 +922,11 @@ xfs_btree_readahead(
> >  	    (lev == cur->bc_nlevels - 1))
> >  		return 0;
> >  
> > -	if ((cur->bc_ra[lev] | lr) == cur->bc_ra[lev])
> > +	if ((cur->bc_levels[lev].ra | lr) == cur->bc_levels[lev].ra)
> >  		return 0;
> 
> That's whacky logic. Surely that's just:
> 
> 	if (cur->bc_levels[lev].ra & lr)
> 		return 0;

This is an early-exit test, which means the careful check is necessary.

If (some day) someone calls this function with (LEFTRA|RIGHTRA) to
readahead both siblings on a btree level where one sibling has been ra'd
but not the other, we must avoid taking the branch.
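
If it helps to see it spelled out, here's a quick standalone sketch (plain
userspace C, not the kernel code; the flag values are just stand-ins for the
XFS_BTCUR_*RA bits) of why the superset test matters once both bits get
requested at once:

#include <assert.h>
#include <stdint.h>

#define LEFTRA		(1 << 0)	/* stand-in for XFS_BTCUR_LEFTRA */
#define RIGHTRA		(1 << 1)	/* stand-in for XFS_BTCUR_RIGHTRA */

/* early exit is only safe if every direction requested in @lr is already done */
static int ra_already_done(uint16_t ra, uint16_t lr)
{
	return (ra | lr) == ra;
}

int main(void)
{
	uint16_t ra = LEFTRA;			/* left sibling already read ahead */
	uint16_t lr = LEFTRA | RIGHTRA;		/* but both siblings are requested */

	assert(ra & lr);			/* a plain "& lr" test would bail out early */
	assert(!ra_already_done(ra, lr));	/* the superset test correctly keeps going */
	return 0;
}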

> > diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
> > index 1018bcc43d66..f31f057bec9d 100644
> > --- a/fs/xfs/libxfs/xfs_btree.h
> > +++ b/fs/xfs/libxfs/xfs_btree.h
> > @@ -212,6 +212,19 @@ struct xfs_btree_cur_ino {
> >  #define	XFS_BTCUR_BMBT_INVALID_OWNER	(1 << 1)
> >  };
> >  
> > +struct xfs_btree_level {
> > +	/* buffer pointer */
> > +	struct xfs_buf		*bp;
> > +
> > +	/* key/record number */
> > +	uint16_t		ptr;
> > +
> > +	/* readahead info */
> > +#define XFS_BTCUR_LEFTRA	1	/* left sibling has been read-ahead */
> > +#define XFS_BTCUR_RIGHTRA	2	/* right sibling has been read-ahead */
> > +	uint16_t		ra;
> > +};
> 
> The ra variable is a bit field. Can we define the values obviously
> as bit fields with (1 << 0) and (1 << 1) instead of 1 and 2?

Done.

> > @@ -242,8 +250,17 @@ struct xfs_btree_cur
> >  		struct xfs_btree_cur_ag	bc_ag;
> >  		struct xfs_btree_cur_ino bc_ino;
> >  	};
> > +
> > +	/* Must be at the end of the struct! */
> > +	struct xfs_btree_level	bc_levels[];
> >  };
> >  
> > +static inline size_t
> > +xfs_btree_cur_sizeof(unsigned int nlevels)
> > +{
> > +	return struct_size((struct xfs_btree_cur *)NULL, bc_levels, nlevels);
> > +}
> 
> Ooooh, yeah, we really need comments explaining how many btree
> levels these VLAs are tracking, because this one doesn't have a "-
> 1" in it like the previous one I commented on....

/*
 * Compute the size of a btree cursor that can handle a btree of a given
 * height.  The bc_levels array handles node and leaf blocks, so its
 * size is exactly nlevels.
 */
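
To make the nlevels vs. nlevels - 1 sizing concrete, here's a rough userspace
model (the struct layouts are trimmed stand-ins for the real ones, and
struct_size() is reduced to plain sizeof arithmetic, ignoring the overflow
saturation the real macro does):

#include <stdio.h>
#include <stddef.h>

struct level { void *bp; unsigned short ptr, ra; };	/* stand-in for xfs_btree_level */
union key { char pad[8]; };				/* stand-in for xfs_btree_key */

struct cur  { char fixed[96]; struct level levels[]; };	/* roughly xfs_btree_cur */
struct xchk { char fixed[64]; union key lastkey[]; };	/* roughly xchk_btree */

/* bc_levels needs a slot for every level, leaf included */
static size_t cur_sizeof(unsigned int nlevels)
{
	return sizeof(struct cur) + nlevels * sizeof(struct level);
}

/* lastkey only needs the node levels; leaf records live in lastrec */
static size_t xchk_sizeof(unsigned int nlevels)
{
	return sizeof(struct xchk) + (nlevels - 1) * sizeof(union key);
}

int main(void)
{
	unsigned int n;

	for (n = 1; n <= 9; n++)
		printf("nlevels %u: cursor %zu bytes, scrub ctx %zu bytes\n",
				n, cur_sizeof(n), xchk_sizeof(n));
	return 0;
}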


> > diff --git a/fs/xfs/scrub/trace.c b/fs/xfs/scrub/trace.c
> > index c0ef53fe6611..816dfc8e5a80 100644
> > --- a/fs/xfs/scrub/trace.c
> > +++ b/fs/xfs/scrub/trace.c
> > @@ -21,10 +21,11 @@ xchk_btree_cur_fsbno(
> >  	struct xfs_btree_cur	*cur,
> >  	int			level)
> >  {
> > -	if (level < cur->bc_nlevels && cur->bc_bufs[level])
> > +	if (level < cur->bc_nlevels && cur->bc_levels[level].bp)
> >  		return XFS_DADDR_TO_FSB(cur->bc_mp,
> > -				xfs_buf_daddr(cur->bc_bufs[level]));
> > -	if (level == cur->bc_nlevels - 1 && cur->bc_flags & XFS_BTREE_LONG_PTRS)
> > +				xfs_buf_daddr(cur->bc_levels[level].bp));
> > +	else if (level == cur->bc_nlevels - 1 &&
> > +		 cur->bc_flags & XFS_BTREE_LONG_PTRS)
> 
> No need for an else there as the first if () clause returns.
> Also, needs more () around that "a & b" second line.

TBH I think we check the wrong flag, and that last bit should be:

	if (level == cur->bc_nlevels - 1 &&
	    (cur->bc_flags & XFS_BTREE_ROOT_IN_INODE))
		return XFS_INO_TO_FSB(cur->bc_mp, cur->bc_ino.ip->i_ino);

	return NULLFSBLOCK;

But for now I'll stick to the straight replacement and tack on another
patch to fix that.

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: [PATCH 09/15] xfs: dynamically allocate cursors based on maxlevels
  2021-10-13  5:40   ` Dave Chinner
@ 2021-10-13 16:55     ` Darrick J. Wong
  0 siblings, 0 replies; 41+ messages in thread
From: Darrick J. Wong @ 2021-10-13 16:55 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, chandan.babu, hch

On Wed, Oct 13, 2021 at 04:40:41PM +1100, Dave Chinner wrote:
> On Tue, Oct 12, 2021 at 04:33:23PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > To support future btree code, we need to be able to size btree cursors
> > dynamically for very large btrees.  Switch the maxlevels computation to
> > use the precomputed values in the superblock, and create cursors that
> > can handle a certain height.  For now, we retain the btree cursor zone
> > that can handle up to 9-level btrees, and create larger cursors (which
> > shouldn't happen currently) from the heap as a failsafe.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  fs/xfs/libxfs/xfs_alloc_btree.c    |    2 +-
> >  fs/xfs/libxfs/xfs_bmap_btree.c     |    3 ++-
> >  fs/xfs/libxfs/xfs_btree.h          |   13 +++++++++++--
> >  fs/xfs/libxfs/xfs_ialloc_btree.c   |    3 ++-
> >  fs/xfs/libxfs/xfs_refcount_btree.c |    3 ++-
> >  fs/xfs/libxfs/xfs_rmap_btree.c     |    3 ++-
> >  fs/xfs/xfs_super.c                 |    4 ++--
> >  7 files changed, 22 insertions(+), 9 deletions(-)
> 
> minor nit:
> 
> > diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
> > index 43766e5b680f..b8761a2fc24b 100644
> > --- a/fs/xfs/libxfs/xfs_btree.h
> > +++ b/fs/xfs/libxfs/xfs_btree.h
> > @@ -94,6 +94,12 @@ uint32_t xfs_btree_magic(int crc, xfs_btnum_t btnum);
> >  
> >  #define	XFS_BTREE_MAXLEVELS	9	/* max of all btrees */
> >  
> > +/*
> > + * The btree cursor zone hands out cursors that can handle up to this many
> > + * levels.  This is the known maximum for all btree types.
> > + */
> > +#define XFS_BTREE_CUR_ZONE_MAXLEVELS	(9)
> 
> XFS_BTREE_CUR_CACHE_MAXLEVELS	9

Fixed.

--D

> Otherwise looks OK.
> 
> Reviewed-by: Dave Chinner <dchinner@redhat.com>
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: [PATCH 10/15] xfs: compute actual maximum btree height for critical reservation calculation
  2021-10-13  5:49   ` Dave Chinner
@ 2021-10-13 17:07     ` Darrick J. Wong
  2021-10-13 20:18       ` Dave Chinner
  0 siblings, 1 reply; 41+ messages in thread
From: Darrick J. Wong @ 2021-10-13 17:07 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, chandan.babu, hch

On Wed, Oct 13, 2021 at 04:49:39PM +1100, Dave Chinner wrote:
> On Tue, Oct 12, 2021 at 04:33:28PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Compute the actual maximum btree height when deciding if per-AG block
> > reservation is critically low.  This only affects the sanity check
> > condition, since we /generally/ will trigger on the 10% threshold.
> > This is a long-winded way of saying that we're removing one more
> > usage of XFS_BTREE_MAXLEVELS.
> 
> And replacing it with a branchy dynamic calculation that has a
> static, unchanging result. :(
> 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  fs/xfs/libxfs/xfs_ag_resv.c |   18 +++++++++++++++++-
> >  1 file changed, 17 insertions(+), 1 deletion(-)
> > 
> > 
> > diff --git a/fs/xfs/libxfs/xfs_ag_resv.c b/fs/xfs/libxfs/xfs_ag_resv.c
> > index 2aa2b3484c28..d34d4614f175 100644
> > --- a/fs/xfs/libxfs/xfs_ag_resv.c
> > +++ b/fs/xfs/libxfs/xfs_ag_resv.c
> > @@ -60,6 +60,20 @@
> >   * to use the reservation system should update ask/used in xfs_ag_resv_init.
> >   */
> >  
> > +/* Compute maximum possible height for per-AG btree types for this fs. */
> > +static unsigned int
> > +xfs_ag_btree_maxlevels(
> > +	struct xfs_mount	*mp)
> > +{
> > +	unsigned int		ret = mp->m_ag_maxlevels;
> > +
> > +	ret = max(ret, mp->m_bm_maxlevels[XFS_DATA_FORK]);
> > +	ret = max(ret, mp->m_bm_maxlevels[XFS_ATTR_FORK]);
> > +	ret = max(ret, M_IGEO(mp)->inobt_maxlevels);
> > +	ret = max(ret, mp->m_rmap_maxlevels);
> > +	return max(ret, mp->m_refc_maxlevels);
> > +}
> 
> Hmmmm. perhaps mp->m_ag_maxlevels should be renamed to
> mp->m_agbno_maxlevels and we pre-calculate mp->m_ag_maxlevels from

I prefer m_alloc_maxlevels for the first one, since "agbno" means "AG
block number" in my head.

As for the second, how about "m_agbtree_maxlevels" since we already use
'agbtree' to refer to per-AG btrees elsewhere?

Other than the naming, I agree with your suggestion.

--D

> the above function and just use the variable in the
> xfs_ag_resv_critical() check?
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: [PATCH 11/15] xfs: compute the maximum height of the rmap btree when reflink enabled
  2021-10-13  7:25   ` Dave Chinner
@ 2021-10-13 17:47     ` Darrick J. Wong
  0 siblings, 0 replies; 41+ messages in thread
From: Darrick J. Wong @ 2021-10-13 17:47 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Chandan Babu R, linux-xfs, hch

On Wed, Oct 13, 2021 at 06:25:21PM +1100, Dave Chinner wrote:
> On Tue, Oct 12, 2021 at 04:33:34PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Instead of assuming that the hardcoded XFS_BTREE_MAXLEVELS value is big
> > enough to handle the maximally tall rmap btree when all blocks are in
> > use and maximally shared, let's compute the maximum height assuming the
> > rmapbt consumes as many blocks as possible.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
> > ---
> >  fs/xfs/libxfs/xfs_btree.c       |   34 ++++++++++++++++++++++++
> >  fs/xfs/libxfs/xfs_btree.h       |    2 +
> >  fs/xfs/libxfs/xfs_rmap_btree.c  |   55 ++++++++++++++++++++++++---------------
> >  fs/xfs/libxfs/xfs_trans_resv.c  |   13 +++++++++
> >  fs/xfs/libxfs/xfs_trans_space.h |    7 +++++
> >  5 files changed, 90 insertions(+), 21 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
> > index 6ced8f028d47..201b81d54622 100644
> > --- a/fs/xfs/libxfs/xfs_btree.c
> > +++ b/fs/xfs/libxfs/xfs_btree.c
> > @@ -4531,6 +4531,40 @@ xfs_btree_compute_maxlevels(
> >  	return level;
> >  }
> >  
> > +/*
> > + * Compute the maximum height of a btree that is allowed to consume up to the
> > + * given number of blocks.
> > + */
> > +unsigned int
> > +xfs_btree_compute_maxlevels_size(
> > +	unsigned long long	max_btblocks,
> > +	unsigned int		leaf_mnr)
> 
> So "leaf_mnr" is supposed to be the minimum number of records in
> a leaf?
> 
> But this gets passed mp->m_rmap_mnr[1], which is the minimum number
> of keys/ptrs in a node, not a leaf. I'm confused.

That should have been "node_mnr".  Sorry. :(

> > +{
> > +	unsigned long long	leaf_blocks = leaf_mnr;
> > +	unsigned long long	blocks_left;
> > +	unsigned int		maxlevels;
> > +
> > +	if (max_btblocks < 1)
> > +		return 0;
> > +
> > +	/*
> > +	 * The loop increments maxlevels as long as there would be enough
> > +	 * blocks left in the reservation to handle each node block at the
> > +	 * current level pointing to the minimum possible number of leaf blocks
> > +	 * at the next level down.  We start the loop assuming a single-level
> > +	 * btree consuming one block.
> > +	 */
> > +	maxlevels = 1;
> > +	blocks_left = max_btblocks - 1;
> > +	while (leaf_blocks < blocks_left) {
> > +		maxlevels++;
> > +		blocks_left -= leaf_blocks;
> > +		leaf_blocks *= leaf_mnr;
> > +	}
> > +
> > +	return maxlevels;
> 
> Yup, I'm definitely confused. We also have:
> 
> xfs_btree_calc_size(limits, len)
> xfs_btree_compute_maxlevels(limits, len)
> 
> And they do something similar but subtly different. They aren't
> clearly documented, either, so from reading the code:
> 
> xfs_btree_calc_size is calculating the btree block usage for a
> discrete count of items based on the leaf and node population values
> from mp->m_rmap_mnr, etc. It uses a division based algorithm
> 
> 	recs = limits[0]	// min recs per block
> 	for (level = 0; len > 1; level++) {
> 		do_div(len, recs)
> 		recs = limits[1]	// min ptrs per node
> 		rval += len;
> 	}
> 	return rval
> 
> (why does this even calculate level?)
> 
> So it returns the number of blocks the btree will consume to
> index a given number of discrete blocks.
> 
> xfs_btree_compute_maxlevels() is basically:
> 
> 	len = len / limits[0]		// record blocks in level 0
> 	for (level = 1; len > 1; level++)
> 		len = len / limits[1]	// node blocks in level n
> 	return level
> 
> So it returns how many levels are required to index a specific
> number of discrete blocks given a specific leaf/node population.
> 
> But what does xfs_btree_compute_maxlevels_size do? I'm really not
> sure from the description, the calculation or the parameters passed
> to it. Even a table doesn't tell me:
> 
> say 10000 records, leaf_mnr = 10
> 
> loop		blocks_left	leaf_blocks	max_levels
> 0 (at init)		9999		10		1
> 1			9989	       100		2
> 2			9889	      1000		3
> 3			8889	     10000		4
> Breaks out on (leaf_blocks > blocks_left)
> 
> So, after much head scratching, I *think* what this function is
> trying to do is take into account the case where we have a single
> block shared by reflink N times, such that the entire AG is made up
> of rmap records pointing to all the owners.  We're trying to
> determine the height of the tree if we index enough leaf
> records to consume all the free space in the AG?
> 
> Which then means we don't care what the number of records are in the
> leaf nodes, we only need to know how many leaf blocks there are and
> how many interior nodes we consume to index them?
> 
> IOWs, we're counting the number of leaf blocks we can index at each
> level based on the _minimum number of pointers_ we can hold in a
> _node_?

Yes.

> If so, then the naming leaves a lot to be desired here. The
> variables all being named "leaf" even though they are being passed
> node limits and are calculating node level indexing limits and not
> leaf space consumption completely threw me in the wrong direction.
> I just spent the best part of 90 minutes working all this out
> from first principles because nothing is obvious about why this code
> is correct. Everything screamed "wrong wrong wrong" at me until
> I finally understood what was being calculated. And now I know, it
> still screams "wrong wrong wrong" at me.
> 
> So:
> 
> /*
>  * Given a number of available blocks for the btree to consume with
>  * records and pointers, calculate the height of the tree needed to
>  * index all the records that space can hold based on the number of
>  * pointers each interior node holds.
>  *
>  * We start by assuming a single level tree consumes a single block,
>  * then track the number of blocks each node level consumes until we
>  * no longer have space to store the next node level. At this point,
>  * we are indexing all the leaf blocks in the space, and there's no
>  * more free space to split the tree any further. That's our maximum
>  * btree height.

Ah, yes, that's a much better description and name than the ones I put
on the function.

>  */
> unsigned int
> xfs_btree_space_to_height(
> 	unsigned int		*limits,
> 	unsigned long long	leaf_blocks)
> {
> 	unsigned long long	node_blocks = limits[1];
> 	unsigned long long	blocks_left = leaf_blocks - 1;
> 	unsigned int		height = 1;
> 
> 	if (leaf_blocks < 1)
> 		return 0;
> 
> 	while (node_blocks < blocks_left) {
> 		height++;
> 		blocks_left -= node_blocks;
> 		node_blocks *= limits[1];
> 	}
> 
> 	return height;
> }
> 
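
FWIW, here's a quick check (a standalone userspace copy of that logic, with a
made-up minrecs array) that this version reproduces the worked table above:

#include <assert.h>

/* same shape as the proposed xfs_btree_space_to_height() */
static unsigned int space_to_height(unsigned int *limits, unsigned long long leaf_blocks)
{
	unsigned long long	node_blocks = limits[1];
	unsigned long long	blocks_left = leaf_blocks - 1;
	unsigned int		height = 1;

	if (leaf_blocks < 1)
		return 0;

	while (node_blocks < blocks_left) {
		height++;
		blocks_left -= node_blocks;
		node_blocks *= limits[1];
	}
	return height;
}

int main(void)
{
	unsigned int limits[2] = { 10, 10 };	/* only limits[1] matters here */

	/* 10000 blocks of space with 10 minrecs per node -> height 4, as in the table */
	assert(space_to_height(limits, 10000) == 4);
	return 0;
}
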
> Oh, yeah, I made the parameters the same as the other btree
> height/size functions, too, because....
> 
> > +	unsigned int		val;
> > +
> > +	if (!xfs_has_rmapbt(mp)) {
> > +		mp->m_rmap_maxlevels = 0;
> > +		return;
> > +	}
> > +
> > +	if (xfs_has_reflink(mp)) {
> > +		/*
> > +		 * Compute the asymptotic maxlevels for an rmap btree on a
> > +		 * filesystem that supports reflink.
> > +		 *
> > +		 * On a reflink filesystem, each AG block can have up to 2^32
> > +		 * (per the refcount record format) owners, which means that
> > +		 * theoretically we could face up to 2^64 rmap records.
> > +		 * However, we're likely to run out of blocks in the AG long
> > +		 * before that happens, which means that we must compute the
> > +		 * max height based on what the btree will look like if it
> > +		 * consumes almost all the blocks in the AG due to maximal
> > +		 * sharing factor.
> > +		 */
> > +		val = xfs_btree_compute_maxlevels_size(mp->m_sb.sb_agblocks,
> > +				mp->m_rmap_mnr[1]);
> > +	} else {
> > +		/*
> > +		 * If there's no block sharing, compute the maximum rmapbt
> > +		 * height assuming one rmap record per AG block.
> > +		 */
> > +		val = xfs_btree_compute_maxlevels(mp->m_rmap_mnr,
> > +				mp->m_sb.sb_agblocks);
> 
> This just looks weird with the same parameters in reverse order to
> these two functions...

TBH I intentionally reversed the order to make it obvious which was
which, so we wouldn't end up with...

uint xfs_btree_compute_maxlevels(uint *limits, unsigned int len);
uint xfs_btree_space_to_height(uint *limits, unsigned int blocks);
uint xfs_btree_calc_size(uint *limits, unsigned int len);

...three functions with the same type signatures.  Three years have
flown by since I wrote this patch, and now the signatures have diverged
enough to make it at least somewhat distinct.

IOWs, I'll adopt your version. :)

> > +	}
> > +
> > +	mp->m_rmap_maxlevels = val;
> >  }
> 
> Also, this function becomes simpler if it just returns the maxlevels
> value and the caller writes it into mp->m_rmap_maxlevels.

Done.

> >  
> >  /* Calculate the refcount btree size for some records. */
> > diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
> > index 5e300daa2559..97bd17d84a23 100644
> > --- a/fs/xfs/libxfs/xfs_trans_resv.c
> > +++ b/fs/xfs/libxfs/xfs_trans_resv.c
> > @@ -814,6 +814,16 @@ xfs_trans_resv_calc(
> >  	struct xfs_mount	*mp,
> >  	struct xfs_trans_resv	*resp)
> >  {
> > +	unsigned int		rmap_maxlevels = mp->m_rmap_maxlevels;
> > +
> > +	/*
> > +	 * In the early days of rmap+reflink, we always set the rmap maxlevels
> > +	 * to 9 even if the AG was small enough that it would never grow to
> > +	 * that height.
> > +	 */
> > +	if (xfs_has_rmapbt(mp) && xfs_has_reflink(mp))
> > +		mp->m_rmap_maxlevels = XFS_OLD_REFLINK_RMAP_MAXLEVELS;
> > +
> >  	/*
> >  	 * The following transactions are logged in physical format and
> >  	 * require a permanent reservation on space.
> > @@ -916,4 +926,7 @@ xfs_trans_resv_calc(
> >  	resp->tr_clearagi.tr_logres = xfs_calc_clear_agi_bucket_reservation(mp);
> >  	resp->tr_growrtzero.tr_logres = xfs_calc_growrtzero_reservation(mp);
> >  	resp->tr_growrtfree.tr_logres = xfs_calc_growrtfree_reservation(mp);
> > +
> > +	/* Put everything back the way it was.  This goes at the end. */
> > +	mp->m_rmap_maxlevels = rmap_maxlevels;
> >  }
> 
> Why play games like this? We want the reservations to go down in
> size if the btrees don't reach "XFS_OLD_REFLINK_RMAP_MAXLEVELS"
> size. The reason isn't mentioned in the commit message...

I think I'll record the reason why in the code itself.

	/*
	 * In the early days of rmap+reflink, we always set the rmap
	 * maxlevels to 9 even if the AG was small enough that it would
	 * never grow to that height.  Transaction reservation sizes
	 * influence the minimum log size calculation, which influences
	 * the size of the log that mkfs creates.  Use the old value
	 * here to ensure that newly formatted small filesystems will
	 * mount on older kernels.
	 */
	if (xfs_has_rmapbt(mp) && xfs_has_reflink(mp))
		mp->m_rmap_maxlevels = XFS_OLD_REFLINK_RMAP_MAXLEVELS;


> 
> > diff --git a/fs/xfs/libxfs/xfs_trans_space.h b/fs/xfs/libxfs/xfs_trans_space.h
> > index 50332be34388..440c9c390b86 100644
> > --- a/fs/xfs/libxfs/xfs_trans_space.h
> > +++ b/fs/xfs/libxfs/xfs_trans_space.h
> > @@ -17,6 +17,13 @@
> >  /* Adding one rmap could split every level up to the top of the tree. */
> >  #define XFS_RMAPADD_SPACE_RES(mp) ((mp)->m_rmap_maxlevels)
> >  
> > +/*
> > + * Note that we historically set m_rmap_maxlevels to 9 when reflink was
> > + * enabled, so we must preserve this behavior to avoid changing the transaction
> > + * space reservations.
> > + */
> > +#define XFS_OLD_REFLINK_RMAP_MAXLEVELS	(9)
> 
> 9.

Assuming you meant '9 without the parentheses' here, fixed.  Thanks for
slogging through all that blocks_to_height stuff. :)

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: [PATCH 10/15] xfs: compute actual maximum btree height for critical reservation calculation
  2021-10-13 17:07     ` Darrick J. Wong
@ 2021-10-13 20:18       ` Dave Chinner
  0 siblings, 0 replies; 41+ messages in thread
From: Dave Chinner @ 2021-10-13 20:18 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, chandan.babu, hch

On Wed, Oct 13, 2021 at 10:07:47AM -0700, Darrick J. Wong wrote:
> On Wed, Oct 13, 2021 at 04:49:39PM +1100, Dave Chinner wrote:
> > On Tue, Oct 12, 2021 at 04:33:28PM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > > 
> > > Compute the actual maximum btree height when deciding if per-AG block
> > > reservation is critically low.  This only affects the sanity check
> > > condition, since we /generally/ will trigger on the 10% threshold.
> > > This is a long-winded way of saying that we're removing one more
> > > usage of XFS_BTREE_MAXLEVELS.
> > 
> > And replacing it with a branchy dynamic calculation that has a
> > static, unchanging result. :(
> > 
> > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > ---
> > >  fs/xfs/libxfs/xfs_ag_resv.c |   18 +++++++++++++++++-
> > >  1 file changed, 17 insertions(+), 1 deletion(-)
> > > 
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_ag_resv.c b/fs/xfs/libxfs/xfs_ag_resv.c
> > > index 2aa2b3484c28..d34d4614f175 100644
> > > --- a/fs/xfs/libxfs/xfs_ag_resv.c
> > > +++ b/fs/xfs/libxfs/xfs_ag_resv.c
> > > @@ -60,6 +60,20 @@
> > >   * to use the reservation system should update ask/used in xfs_ag_resv_init.
> > >   */
> > >  
> > > +/* Compute maximum possible height for per-AG btree types for this fs. */
> > > +static unsigned int
> > > +xfs_ag_btree_maxlevels(
> > > +	struct xfs_mount	*mp)
> > > +{
> > > +	unsigned int		ret = mp->m_ag_maxlevels;
> > > +
> > > +	ret = max(ret, mp->m_bm_maxlevels[XFS_DATA_FORK]);
> > > +	ret = max(ret, mp->m_bm_maxlevels[XFS_ATTR_FORK]);
> > > +	ret = max(ret, M_IGEO(mp)->inobt_maxlevels);
> > > +	ret = max(ret, mp->m_rmap_maxlevels);
> > > +	return max(ret, mp->m_refc_maxlevels);
> > > +}
> > 
> > Hmmmm. perhaps mp->m_ag_maxlevels should be renamed to
> > mp->m_agbno_maxlevels and we pre-calculate mp->m_ag_maxlevels from
> 
> I prefer m_alloc_maxlevels for the first one, since "agbno" means "AG
> block number" in my head.
> 
> As for the second, how about "m_agbtree_maxlevels" since we already use
> 'agbtree' to refer to per-AG btrees elsewhere?

Much better than my suggestions :)

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 05/15] xfs: support dynamic btree cursor heights
  2021-10-13 16:52     ` Darrick J. Wong
@ 2021-10-13 21:14       ` Dave Chinner
  0 siblings, 0 replies; 41+ messages in thread
From: Dave Chinner @ 2021-10-13 21:14 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Chandan Babu R, Christoph Hellwig, linux-xfs

On Wed, Oct 13, 2021 at 09:52:18AM -0700, Darrick J. Wong wrote:
> On Wed, Oct 13, 2021 at 04:31:22PM +1100, Dave Chinner wrote:
> > On Tue, Oct 12, 2021 at 04:33:01PM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > > 
> > > Split out the btree level information into a separate struct and put it
> > > at the end of the cursor structure as a VLA.  The realtime rmap btree
> > > (which is rooted in an inode) will require the ability to support many
> > > more levels than a per-AG btree cursor, which means that we're going to
> > > create two btree cursor caches to conserve memory for the more common
> > > case.
> > > 
> > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
> > > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > > ---
> > >  fs/xfs/libxfs/xfs_alloc.c |    6 +-
> > >  fs/xfs/libxfs/xfs_bmap.c  |   10 +--
> > >  fs/xfs/libxfs/xfs_btree.c |  168 +++++++++++++++++++++++----------------------
> > >  fs/xfs/libxfs/xfs_btree.h |   28 ++++++--
> > >  fs/xfs/scrub/bitmap.c     |   22 +++---
> > >  fs/xfs/scrub/bmap.c       |    2 -
> > >  fs/xfs/scrub/btree.c      |   47 +++++++------
> > >  fs/xfs/scrub/trace.c      |    7 +-
> > >  fs/xfs/scrub/trace.h      |   10 +--
> > >  fs/xfs/xfs_super.c        |    2 -
> > >  fs/xfs/xfs_trace.h        |    2 -
> > >  11 files changed, 164 insertions(+), 140 deletions(-)
> > 
> > Hmmm - subject of the patch doesn't really match the changes being
> > made - there's nothing here that makes the btree cursor heights
> > dynamic. It's just a structure layout change...
> 
> "xfs: prepare xfs_btree_cur for dynamic cursor heights" ?

*nod*

> > > @@ -922,11 +922,11 @@ xfs_btree_readahead(
> > >  	    (lev == cur->bc_nlevels - 1))
> > >  		return 0;
> > >  
> > > -	if ((cur->bc_ra[lev] | lr) == cur->bc_ra[lev])
> > > +	if ((cur->bc_levels[lev].ra | lr) == cur->bc_levels[lev].ra)
> > >  		return 0;
> > 
> > That's whacky logic. Surely that's just:
> > 
> > 	if (cur->bc_levels[lev].ra & lr)
> > 		return 0;
> 
> This is an early-exit test, which means the careful check is necessary.
> 
> If (some day) someone calls this function with (LEFTRA|RIGHTRA) to
> readahead both siblings on a btree level where one sibling has been ra'd
> but not the other, we must avoid taking the branch.

Which I didn't see any callers do, so I ignored that possibility.
Regardless, it's the use of "|" to do an additive mask match that
makes it look weird, i.e. the normal way of writing a multi-bit
mask match is to apply the mask and check that the returned value
matches the mask, like so:

	if ((cur->bc_levels[lev].ra & lr) == lr)
		return 0;

Really, though, this was just another "ObHuh" comment, and you don't
need to "fix" it now...

> > > @@ -242,8 +250,17 @@ struct xfs_btree_cur
> > >  		struct xfs_btree_cur_ag	bc_ag;
> > >  		struct xfs_btree_cur_ino bc_ino;
> > >  	};
> > > +
> > > +	/* Must be at the end of the struct! */
> > > +	struct xfs_btree_level	bc_levels[];
> > >  };
> > >  
> > > +static inline size_t
> > > +xfs_btree_cur_sizeof(unsigned int nlevels)
> > > +{
> > > +	return struct_size((struct xfs_btree_cur *)NULL, bc_levels, nlevels);
> > > +}
> > 
> > Ooooh, yeah, we really need comments explaining how many btree
> > levels these VLAs are tracking, because this one doesn't have a "-
> > 1" in it like the previous one I commented on....
> 
> /*
>  * Compute the size of a btree cursor that can handle a btree of a given
>  * height.  The bc_levels array handles node and leaf blocks, so its
>  * size is exactly nlevels.
>  */

Nice. Thanks!

> > > diff --git a/fs/xfs/scrub/trace.c b/fs/xfs/scrub/trace.c
> > > index c0ef53fe6611..816dfc8e5a80 100644
> > > --- a/fs/xfs/scrub/trace.c
> > > +++ b/fs/xfs/scrub/trace.c
> > > @@ -21,10 +21,11 @@ xchk_btree_cur_fsbno(
> > >  	struct xfs_btree_cur	*cur,
> > >  	int			level)
> > >  {
> > > -	if (level < cur->bc_nlevels && cur->bc_bufs[level])
> > > +	if (level < cur->bc_nlevels && cur->bc_levels[level].bp)
> > >  		return XFS_DADDR_TO_FSB(cur->bc_mp,
> > > -				xfs_buf_daddr(cur->bc_bufs[level]));
> > > -	if (level == cur->bc_nlevels - 1 && cur->bc_flags & XFS_BTREE_LONG_PTRS)
> > > +				xfs_buf_daddr(cur->bc_levels[level].bp));
> > > +	else if (level == cur->bc_nlevels - 1 &&
> > > +		 cur->bc_flags & XFS_BTREE_LONG_PTRS)
> > 
> > No need for an else there as the first if () clause returns.
> > Also, needs more () around that "a & b" second line.
> 
> TBH I think we check the wrong flag, and that last bit should be:
> 
> 	if (level == cur->bc_nlevels - 1 &&
> 	    (cur->bc_flags & XFS_BTREE_ROOT_IN_INODE))
> 		return XFS_INO_TO_FSB(cur->bc_mp, cur->bc_ino.ip->i_ino);
> 
> 	return NULLFSBLOCK;

Yup, true, long ptrs and inodes are currently interchangeable so it
works, but that's a landmine waiting to pounce....

> But for now I'll stick to the straight replacement and tack on another
> patch to fix that.

*nod*.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 14/15] xfs: compute absolute maximum nlevels for each btree type
  2021-10-13  7:57   ` Dave Chinner
@ 2021-10-13 21:36     ` Darrick J. Wong
  2021-10-13 23:48       ` Dave Chinner
  0 siblings, 1 reply; 41+ messages in thread
From: Darrick J. Wong @ 2021-10-13 21:36 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, chandan.babu, hch

On Wed, Oct 13, 2021 at 06:57:43PM +1100, Dave Chinner wrote:
> On Tue, Oct 12, 2021 at 04:33:50PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Add code for all five btree types so that we can compute the absolute
> > maximum possible btree height for each btree type.  This is a setup for
> > the next patch, which makes every btree type have its own cursor cache.
> > 
> > The functions are exported so that we can have xfs_db report the
> > absolute maximum btree heights for each btree type, rather than making
> > everyone run their own ad-hoc computations.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  fs/xfs/libxfs/xfs_alloc.c          |    1 +
> >  fs/xfs/libxfs/xfs_alloc_btree.c    |   13 +++++++++++
> >  fs/xfs/libxfs/xfs_alloc_btree.h    |    2 ++
> >  fs/xfs/libxfs/xfs_bmap.c           |    1 +
> >  fs/xfs/libxfs/xfs_bmap_btree.c     |   14 ++++++++++++
> >  fs/xfs/libxfs/xfs_bmap_btree.h     |    2 ++
> >  fs/xfs/libxfs/xfs_btree.c          |   41 ++++++++++++++++++++++++++++++++++++
> >  fs/xfs/libxfs/xfs_btree.h          |    3 +++
> >  fs/xfs/libxfs/xfs_fs.h             |    2 ++
> >  fs/xfs/libxfs/xfs_ialloc.c         |    1 +
> >  fs/xfs/libxfs/xfs_ialloc_btree.c   |   19 +++++++++++++++++
> >  fs/xfs/libxfs/xfs_ialloc_btree.h   |    2 ++
> >  fs/xfs/libxfs/xfs_refcount_btree.c |   20 ++++++++++++++++++
> >  fs/xfs/libxfs/xfs_refcount_btree.h |    2 ++
> >  fs/xfs/libxfs/xfs_rmap_btree.c     |   27 ++++++++++++++++++++++++
> >  fs/xfs/libxfs/xfs_rmap_btree.h     |    2 ++
> >  16 files changed, 152 insertions(+)
> > 
> > 
> > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> > index 55c5adc9b54e..7145416a230c 100644
> > --- a/fs/xfs/libxfs/xfs_alloc.c
> > +++ b/fs/xfs/libxfs/xfs_alloc.c
> > @@ -2198,6 +2198,7 @@ xfs_alloc_compute_maxlevels(
> >  {
> >  	mp->m_ag_maxlevels = xfs_btree_compute_maxlevels(mp->m_alloc_mnr,
> >  			(mp->m_sb.sb_agblocks + 1) / 2);
> > +	ASSERT(mp->m_ag_maxlevels <= xfs_allocbt_absolute_maxlevels());
> >  }
> >  
> >  /*
> > diff --git a/fs/xfs/libxfs/xfs_alloc_btree.c b/fs/xfs/libxfs/xfs_alloc_btree.c
> > index f14bad21503f..61f6d266b822 100644
> > --- a/fs/xfs/libxfs/xfs_alloc_btree.c
> > +++ b/fs/xfs/libxfs/xfs_alloc_btree.c
> > @@ -582,6 +582,19 @@ xfs_allocbt_maxrecs(
> >  	return blocklen / (sizeof(xfs_alloc_key_t) + sizeof(xfs_alloc_ptr_t));
> >  }
> >  
> > +/* Compute the max possible height of the maximally sized free space btree. */
> > +unsigned int
> > +xfs_allocbt_absolute_maxlevels(void)
> > +{
> > +	unsigned int		minrecs[2];
> > +
> > +	xfs_btree_absolute_minrecs(minrecs, 0, sizeof(xfs_alloc_rec_t),
> > +			sizeof(xfs_alloc_key_t) + sizeof(xfs_alloc_ptr_t));
> > +
> > +	return xfs_btree_compute_maxlevels(minrecs,
> > +			(XFS_MAX_AG_BLOCKS + 1) / 2);
> > +}
> 
> Hmmmm. This is kinda messy. I'd prefer we share code with the
> xfs_allocbt_maxrecs() function that does this. Not sure "absolute" is
> the right word, either. It's more a function of the on-disk format
> maximum, not an "absolute" thing.

<nod> I'm not passionate about the name one way or the other.

> I mean, we know that the worst case is going to be for each btree
> type - we don't need to pass in XFS_BTREE_CRC_BLOCKS or
> XFS_BTREE_LONG_PTRS to generic code for it to branch multiple times
> to be generic.

Yeah, that function was a conditional mess.  I like...

> Instead:
> 
> static inline int
> xfs_allocbt_block_maxrecs(
>         int                     blocklen,
>         int                     leaf)
> {
>         if (leaf)
>                 return blocklen / sizeof(xfs_alloc_rec_t);
>         return blocklen / (sizeof(xfs_alloc_key_t) + sizeof(xfs_alloc_ptr_t));
> }
> 
> /*
>  * Calculate number of records in an alloc btree block.
>  */
> int
> xfs_allocbt_maxrecs(
>         struct xfs_mount        *mp,
>         int                     blocklen,
>         int                     leaf)
> {
>         blocklen -= XFS_ALLOC_BLOCK_LEN(mp);
> 	return xfs_allobt_block_maxrecs(blocklen, leaf);
> }
> 
> xfs_allocbt_maxlevels_ondisk()
> {
> 	unsigned int		minrecs[2];
> 
> 	minrecs[0] = xfs_allocbt_block_maxrecs(
> 			XFS_MIN_BLOCKSIZE - XFS_BTREE_SBLOCK_LEN, true) / 2;
> 	minrecs[1] = xfs_allocbt_block_maxrecs(
> 			XFS_MIN_BLOCKSIZE - XFS_BTREE_SBLOCK_LEN, false) / 2;

...this a lot better since one doesn't have to switch back and forth
between source files to figure out how the computation works.

However, I want to propose a possibly pedantic addition to the blocksize
computation for btrees.  We want to compute the maximum btree height
that we're ever going to see, which means that we are modeling a btree
with the minimum possible fanout factor.  That means the smallest btree
nodes possible, and half full.

min V5 blocksize: 1024 bytes
V5 btree short header: 56 bytes
min V5 btree record area: 968 bytes

min V4 blocksize: 512 bytes
V4 btree short header: 16 bytes
min V4 btree record area: 496 bytes

In other words, the bit above for the allocbt ought to be:

	blocklen = min(XFS_MIN_BLOCKSIZE - XFS_BTREE_SBLOCK_LEN,
		       XFS_MIN_CRC_BLOCKSIZE - XFS_BTREE_SBLOCK_CRC_LEN);

Which is very pedantic, since the whole expression /always/ evaluates
to 496.  IIRC the kernel has enough macro soup to resolve that into a
constant expression so it shouldn't cost us anything.
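
(Trivial standalone check of that constant, plugging in the raw numbers from
the little table above instead of the real macros:)

#include <assert.h>

#define MIN(a, b)	((a) < (b) ? (a) : (b))

int main(void)
{
	int v4_recarea = 512 - 16;	/* XFS_MIN_BLOCKSIZE - XFS_BTREE_SBLOCK_LEN */
	int v5_recarea = 1024 - 56;	/* XFS_MIN_CRC_BLOCKSIZE - XFS_BTREE_SBLOCK_CRC_LEN */

	assert(MIN(v4_recarea, v5_recarea) == 496);
	return 0;
}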

> 
> 	return xfs_btree_compute_maxlevels(minrecs,
> 			(XFS_MAX_AG_BLOCKS + 1) / 2);
> }
> 
> All the other btree implementations factor this way, too, allowing
> the minrec values to be calculated clearly and directly in the
> specific btree function...

Yeah, that's a lot clearer.  I'll migrate all the btree types towards
that.

> > +
> >  /* Calculate the freespace btree size for some records. */
> >  xfs_extlen_t
> >  xfs_allocbt_calc_size(
> > diff --git a/fs/xfs/libxfs/xfs_alloc_btree.h b/fs/xfs/libxfs/xfs_alloc_btree.h
> > index 2f6b816aaf9f..c47d0e285435 100644
> > --- a/fs/xfs/libxfs/xfs_alloc_btree.h
> > +++ b/fs/xfs/libxfs/xfs_alloc_btree.h
> > @@ -60,4 +60,6 @@ extern xfs_extlen_t xfs_allocbt_calc_size(struct xfs_mount *mp,
> >  void xfs_allocbt_commit_staged_btree(struct xfs_btree_cur *cur,
> >  		struct xfs_trans *tp, struct xfs_buf *agbp);
> >  
> > +unsigned int xfs_allocbt_absolute_maxlevels(void);
> > +
> >  #endif	/* __XFS_ALLOC_BTREE_H__ */
> > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> > index 2ae5bf9a74e7..7e70df8d1a9b 100644
> > --- a/fs/xfs/libxfs/xfs_bmap.c
> > +++ b/fs/xfs/libxfs/xfs_bmap.c
> > @@ -93,6 +93,7 @@ xfs_bmap_compute_maxlevels(
> >  			maxblocks = (maxblocks + minnoderecs - 1) / minnoderecs;
> >  	}
> >  	mp->m_bm_maxlevels[whichfork] = level;
> > +	ASSERT(mp->m_bm_maxlevels[whichfork] <= xfs_bmbt_absolute_maxlevels());
> >  }
> >  
> >  unsigned int
> > diff --git a/fs/xfs/libxfs/xfs_bmap_btree.c b/fs/xfs/libxfs/xfs_bmap_btree.c
> > index b90122de0df0..7001aff639d2 100644
> > --- a/fs/xfs/libxfs/xfs_bmap_btree.c
> > +++ b/fs/xfs/libxfs/xfs_bmap_btree.c
> > @@ -587,6 +587,20 @@ xfs_bmbt_maxrecs(
> >  	return blocklen / (sizeof(xfs_bmbt_key_t) + sizeof(xfs_bmbt_ptr_t));
> >  }
> >  
> > +/* Compute the max possible height of the maximally sized bmap btree. */
> > +unsigned int
> > +xfs_bmbt_absolute_maxlevels(void)
> > +{
> > +	unsigned int		minrecs[2];
> > +
> > +	xfs_btree_absolute_minrecs(minrecs, XFS_BTREE_LONG_PTRS,
> > +			sizeof(struct xfs_bmbt_rec),
> > +			sizeof(struct xfs_bmbt_key) +
> > +				sizeof(xfs_bmbt_ptr_t));
> > +
> > +	return xfs_btree_compute_maxlevels(minrecs, MAXEXTNUM) + 1;
> > +}
> 
> 	minrecs[0] = xfs_bmbt_block_maxrecs(
> 			XFS_MIN_BLOCKSIZE - XFS_BTREE_LBLOCK_LEN, true) / 2;
> 	....
> 
> > +
> >  /*
> >   * Calculate number of records in a bmap btree inode root.
> >   */
> > diff --git a/fs/xfs/libxfs/xfs_bmap_btree.h b/fs/xfs/libxfs/xfs_bmap_btree.h
> > index 729e3bc569be..e9218e92526b 100644
> > --- a/fs/xfs/libxfs/xfs_bmap_btree.h
> > +++ b/fs/xfs/libxfs/xfs_bmap_btree.h
> > @@ -110,4 +110,6 @@ extern struct xfs_btree_cur *xfs_bmbt_init_cursor(struct xfs_mount *,
> >  extern unsigned long long xfs_bmbt_calc_size(struct xfs_mount *mp,
> >  		unsigned long long len);
> >  
> > +unsigned int xfs_bmbt_absolute_maxlevels(void);
> > +
> >  #endif	/* __XFS_BMAP_BTREE_H__ */
> > diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
> > index b95c817ad90d..bea1bdf9b8b9 100644
> > --- a/fs/xfs/libxfs/xfs_btree.c
> > +++ b/fs/xfs/libxfs/xfs_btree.c
> > @@ -4964,3 +4964,44 @@ xfs_btree_has_more_records(
> >  	else
> >  		return block->bb_u.s.bb_rightsib != cpu_to_be32(NULLAGBLOCK);
> >  }
> > +
> > +/*
> > + * Compute absolute minrecs for leaf and node btree blocks.  Callers should set
> > + * BTREE_LONG_PTRS and BTREE_OVERLAPPING as they would for regular cursors.
> > + * Set BTREE_CRC_BLOCKS if the btree type is supported /only/ on V5 or newer
> > + * filesystems.
> > + */
> > +void
> > +xfs_btree_absolute_minrecs(
> > +	unsigned int		*minrecs,
> > +	unsigned int		bc_flags,
> > +	unsigned int		leaf_recbytes,
> > +	unsigned int		node_recbytes)
> > +{
> > +	unsigned int		min_recbytes;
> > +
> > +	/*
> > +	 * If this btree type is supported on V4, we use the smaller V4 min
> > +	 * block size along with the V4 header size.  If the btree type is only
> > +	 * supported on V5, use the (twice as large) V5 min block size along
> > +	 * with the V5 header size.
> > +	 */
> > +	if (!(bc_flags & XFS_BTREE_CRC_BLOCKS)) {
> > +		if (bc_flags & XFS_BTREE_LONG_PTRS)
> > +			min_recbytes = XFS_MIN_BLOCKSIZE -
> > +							XFS_BTREE_LBLOCK_LEN;
> > +		else
> > +			min_recbytes = XFS_MIN_BLOCKSIZE -
> > +							XFS_BTREE_SBLOCK_LEN;
> > +	} else if (bc_flags & XFS_BTREE_LONG_PTRS) {
> > +		min_recbytes = XFS_MIN_CRC_BLOCKSIZE - XFS_BTREE_LBLOCK_CRC_LEN;
> > +	} else {
> > +		min_recbytes = XFS_MIN_CRC_BLOCKSIZE - XFS_BTREE_SBLOCK_CRC_LEN;
> > +	}
> > +
> > +	if (bc_flags & XFS_BTREE_OVERLAPPING)
> > +		node_recbytes <<= 1;
> > +
> > +	minrecs[0] = min_recbytes / (2 * leaf_recbytes);
> > +	minrecs[1] = min_recbytes / (2 * node_recbytes);
> > +}
> 
> This can go away.

Done.

> > diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
> > index 20a2828c11ef..acb202839afd 100644
> > --- a/fs/xfs/libxfs/xfs_btree.h
> > +++ b/fs/xfs/libxfs/xfs_btree.h
> > @@ -601,4 +601,7 @@ xfs_btree_alloc_cursor(
> >  	return cur;
> >  }
> >  
> > +void xfs_btree_absolute_minrecs(unsigned int *minrecs, unsigned int bc_flags,
> > +		unsigned int leaf_recbytes, unsigned int node_recbytes);
> > +
> >  #endif	/* __XFS_BTREE_H__ */
> > diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
> > index bde2b4c64dbe..c43877c8a279 100644
> > --- a/fs/xfs/libxfs/xfs_fs.h
> > +++ b/fs/xfs/libxfs/xfs_fs.h
> > @@ -268,6 +268,8 @@ typedef struct xfs_fsop_resblks {
> >   */
> >  #define XFS_MIN_AG_BYTES	(1ULL << 24)	/* 16 MB */
> >  #define XFS_MAX_AG_BYTES	(1ULL << 40)	/* 1 TB */
> > +#define XFS_MAX_AG_BLOCKS	(XFS_MAX_AG_BYTES / XFS_MIN_BLOCKSIZE)
> > +#define XFS_MAX_CRC_AG_BLOCKS	(XFS_MAX_AG_BYTES / XFS_MIN_CRC_BLOCKSIZE)
> >  
> >  /* keep the maximum size under 2^31 by a small amount */
> >  #define XFS_MAX_LOG_BYTES \
> > diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
> > index 994ad783d407..017aebdda42f 100644
> > --- a/fs/xfs/libxfs/xfs_ialloc.c
> > +++ b/fs/xfs/libxfs/xfs_ialloc.c
> > @@ -2793,6 +2793,7 @@ xfs_ialloc_setup_geometry(
> >  	inodes = (1LL << XFS_INO_AGINO_BITS(mp)) >> XFS_INODES_PER_CHUNK_LOG;
> >  	igeo->inobt_maxlevels = xfs_btree_compute_maxlevels(igeo->inobt_mnr,
> >  			inodes);
> > +	ASSERT(igeo->inobt_maxlevels <= xfs_inobt_absolute_maxlevels());
> >  
> >  	/*
> >  	 * Set the maximum inode count for this filesystem, being careful not
> > diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
> > index 3a5a24648b87..2e3dd1d798bd 100644
> > --- a/fs/xfs/libxfs/xfs_ialloc_btree.c
> > +++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
> > @@ -542,6 +542,25 @@ xfs_inobt_maxrecs(
> >  	return blocklen / (sizeof(xfs_inobt_key_t) + sizeof(xfs_inobt_ptr_t));
> >  }
> >  
> > +/* Compute the max possible height of the maximally sized inode btree. */
> > +unsigned int
> > +xfs_inobt_absolute_maxlevels(void)
> > +{
> > +	unsigned int		minrecs[2];
> > +	unsigned long long	max_ag_inodes;
> > +
> > +	/*
> > +	 * For the absolute maximum, pretend that we can fill an entire AG
> > +	 * completely full of inodes except for the AG headers.
> > +	 */
> > +	max_ag_inodes = (XFS_MAX_AG_BYTES - (4 * BBSIZE)) / XFS_DINODE_MIN_SIZE;
> > +
> > +	xfs_btree_absolute_minrecs(minrecs, 0, sizeof(xfs_inobt_rec_t),
> > +			sizeof(xfs_inobt_key_t) + sizeof(xfs_inobt_ptr_t));
> > +
> > +	return xfs_btree_compute_maxlevels(minrecs, max_ag_inodes);
> > +}
> 
> We've got two different inobt max levels on disk. The inobt has v4
> limits, whilst the finobt has v5 limits...

<nod> I'll make it return the larger of the two heights, though the
inode btree is always going to win due to its smaller minimum block size.

> > +/* Compute the max possible height of the maximally sized rmap btree. */
> > +unsigned int
> > +xfs_rmapbt_absolute_maxlevels(void)
> > +{
> > +	unsigned int		minrecs[2];
> > +
> > +	xfs_btree_absolute_minrecs(minrecs,
> > +			XFS_BTREE_CRC_BLOCKS | XFS_BTREE_OVERLAPPING,
> > +			sizeof(struct xfs_rmap_rec),
> > +			sizeof(struct xfs_rmap_key) + sizeof(xfs_rmap_ptr_t));
> > +
> > +	/*
> > +	 * Compute the asymptotic maxlevels for an rmapbt on any reflink fs.
> > +	 *
> > +	 * On a reflink filesystem, each AG block can have up to 2^32 (per the
> > +	 * refcount record format) owners, which means that theoretically we
> > +	 * could face up to 2^64 rmap records.  However, we're likely to run
> > +	 * out of blocks in the AG long before that happens, which means that
> > +	 * we must compute the max height based on what the btree will look
> > +	 * like if it consumes almost all the blocks in the AG due to maximal
> > +	 * sharing factor.
> > +	 */
> > +	return xfs_btree_compute_maxlevels_size(XFS_MAX_CRC_AG_BLOCKS,
> 
> Huh. I don't know where XFS_MAX_CRC_AG_BLOCKS is defined. I must
> have missed it somewhere?

It was added elsewhere in this patch.

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: [PATCH 15/15] xfs: use separate btree cursor cache for each btree type
  2021-10-13  8:01   ` Dave Chinner
@ 2021-10-13 21:42     ` Darrick J. Wong
  0 siblings, 0 replies; 41+ messages in thread
From: Darrick J. Wong @ 2021-10-13 21:42 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, chandan.babu, hch

On Wed, Oct 13, 2021 at 07:01:14PM +1100, Dave Chinner wrote:
> On Tue, Oct 12, 2021 at 04:33:56PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Now that we have the infrastructure to track the max possible height of
> > each btree type, we can create a separate slab cache for cursors of each
> > type of btree.  For smaller indices like the free space btrees, this
> > means that we can pack more cursors into a slab page, improving slab
> > utilization.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  fs/xfs/libxfs/xfs_alloc_btree.c    |   21 ++++++++++++++
> >  fs/xfs/libxfs/xfs_alloc_btree.h    |    3 ++
> >  fs/xfs/libxfs/xfs_bmap_btree.c     |   21 ++++++++++++++
> >  fs/xfs/libxfs/xfs_bmap_btree.h     |    3 ++
> >  fs/xfs/libxfs/xfs_btree.c          |    7 +----
> >  fs/xfs/libxfs/xfs_btree.h          |   17 +++---------
> >  fs/xfs/libxfs/xfs_ialloc_btree.c   |   21 ++++++++++++++
> >  fs/xfs/libxfs/xfs_ialloc_btree.h   |    3 ++
> >  fs/xfs/libxfs/xfs_refcount_btree.c |   21 ++++++++++++++
> >  fs/xfs/libxfs/xfs_refcount_btree.h |    3 ++
> >  fs/xfs/libxfs/xfs_rmap_btree.c     |   21 ++++++++++++++
> >  fs/xfs/libxfs/xfs_rmap_btree.h     |    3 ++
> >  fs/xfs/xfs_super.c                 |   53 ++++++++++++++++++++++++++++++++----
> >  13 files changed, 168 insertions(+), 29 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/libxfs/xfs_alloc_btree.c b/fs/xfs/libxfs/xfs_alloc_btree.c
> > index 61f6d266b822..4c5942146b05 100644
> > --- a/fs/xfs/libxfs/xfs_alloc_btree.c
> > +++ b/fs/xfs/libxfs/xfs_alloc_btree.c
> > @@ -20,6 +20,7 @@
> >  #include "xfs_trans.h"
> >  #include "xfs_ag.h"
> >  
> > +static kmem_zone_t	*xfs_allocbt_cur_cache;
> >  
> >  STATIC struct xfs_btree_cur *
> >  xfs_allocbt_dup_cursor(
> > @@ -477,7 +478,8 @@ xfs_allocbt_init_common(
> >  
> >  	ASSERT(btnum == XFS_BTNUM_BNO || btnum == XFS_BTNUM_CNT);
> >  
> > -	cur = xfs_btree_alloc_cursor(mp, tp, btnum, mp->m_ag_maxlevels);
> > +	cur = xfs_btree_alloc_cursor(mp, tp, btnum, mp->m_ag_maxlevels,
> > +			xfs_allocbt_cur_cache);
> >  	cur->bc_ag.abt.active = false;
> >  
> >  	if (btnum == XFS_BTNUM_CNT) {
> > @@ -603,3 +605,20 @@ xfs_allocbt_calc_size(
> >  {
> >  	return xfs_btree_calc_size(mp->m_alloc_mnr, len);
> >  }
> > +
> > +int __init
> > +xfs_allocbt_init_cur_cache(void)
> > +{
> > +	xfs_allocbt_cur_cache = kmem_cache_create("xfs_bnobt_cur",
> > +			xfs_btree_cur_sizeof(xfs_allocbt_absolute_maxlevels()),
> > +			0, 0, NULL);
> > +
> > +	return xfs_allocbt_cur_cache != NULL ? 0 : -ENOMEM;
> 
> 	if (!xfs_allocbt_cur_cache)
> 		return -ENOMEM;
> 	return 0;
> 
> (and the others :)

Fixed.
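
i.e. each init helper now ends up looking something like this (just a
sketch of the fixed-up version; same shape for the other four caches,
and the maxlevels helper may still get renamed per the patch 14
discussion):

	int __init
	xfs_allocbt_init_cur_cache(void)
	{
		xfs_allocbt_cur_cache = kmem_cache_create("xfs_bnobt_cur",
				xfs_btree_cur_sizeof(xfs_allocbt_absolute_maxlevels()),
				0, 0, NULL);

		/* No fallback here; fail module init if the cache can't be made. */
		if (!xfs_allocbt_cur_cache)
			return -ENOMEM;
		return 0;
	}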

> > +}
> > +
> > +void
> > +xfs_allocbt_destroy_cur_cache(void)
> > +{
> > +	kmem_cache_destroy(xfs_allocbt_cur_cache);
> > +	xfs_allocbt_cur_cache = NULL;
> > +}
> > diff --git a/fs/xfs/libxfs/xfs_alloc_btree.h b/fs/xfs/libxfs/xfs_alloc_btree.h
> > index c47d0e285435..82a9b3201f91 100644
> > --- a/fs/xfs/libxfs/xfs_alloc_btree.h
> > +++ b/fs/xfs/libxfs/xfs_alloc_btree.h
> > @@ -62,4 +62,7 @@ void xfs_allocbt_commit_staged_btree(struct xfs_btree_cur *cur,
> >  
> >  unsigned int xfs_allocbt_absolute_maxlevels(void);
> >  
> > +int __init xfs_allocbt_init_cur_cache(void);
> > +void xfs_allocbt_destroy_cur_cache(void);
> > +
> >  #endif	/* __XFS_ALLOC_BTREE_H__ */
> > diff --git a/fs/xfs/libxfs/xfs_bmap_btree.c b/fs/xfs/libxfs/xfs_bmap_btree.c
> > index 7001aff639d2..99261d51d2c3 100644
> > --- a/fs/xfs/libxfs/xfs_bmap_btree.c
> > +++ b/fs/xfs/libxfs/xfs_bmap_btree.c
> > @@ -22,6 +22,8 @@
> >  #include "xfs_trace.h"
> >  #include "xfs_rmap.h"
> >  
> > +static kmem_zone_t	*xfs_bmbt_cur_cache;
> > +
> >  /*
> >   * Convert on-disk form of btree root to in-memory form.
> >   */
> > @@ -553,7 +555,7 @@ xfs_bmbt_init_cursor(
> >  	ASSERT(whichfork != XFS_COW_FORK);
> >  
> >  	cur = xfs_btree_alloc_cursor(mp, tp, XFS_BTNUM_BMAP,
> > -			mp->m_bm_maxlevels[whichfork]);
> > +			mp->m_bm_maxlevels[whichfork], xfs_bmbt_cur_cache);
> >  	cur->bc_nlevels = be16_to_cpu(ifp->if_broot->bb_level) + 1;
> >  	cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_bmbt_2);
> >  
> > @@ -664,3 +666,20 @@ xfs_bmbt_calc_size(
> >  {
> >  	return xfs_btree_calc_size(mp->m_bmap_dmnr, len);
> >  }
> > +
> > +int __init
> > +xfs_bmbt_init_cur_cache(void)
> > +{
> > +	xfs_bmbt_cur_cache = kmem_cache_create("xfs_bmbt_cur",
> > +			xfs_btree_cur_sizeof(xfs_bmbt_absolute_maxlevels()),
> > +			0, 0, NULL);
> > +
> > +	return xfs_bmbt_cur_cache != NULL ? 0 : -ENOMEM;
> > +}
> > +
> > +void
> > +xfs_bmbt_destroy_cur_cache(void)
> > +{
> > +	kmem_cache_destroy(xfs_bmbt_cur_cache);
> > +	xfs_bmbt_cur_cache = NULL;
> > +}
> > diff --git a/fs/xfs/libxfs/xfs_bmap_btree.h b/fs/xfs/libxfs/xfs_bmap_btree.h
> > index e9218e92526b..4c752f7341df 100644
> > --- a/fs/xfs/libxfs/xfs_bmap_btree.h
> > +++ b/fs/xfs/libxfs/xfs_bmap_btree.h
> > @@ -112,4 +112,7 @@ extern unsigned long long xfs_bmbt_calc_size(struct xfs_mount *mp,
> >  
> >  unsigned int xfs_bmbt_absolute_maxlevels(void);
> >  
> > +int __init xfs_bmbt_init_cur_cache(void);
> > +void xfs_bmbt_destroy_cur_cache(void);
> > +
> >  #endif	/* __XFS_BMAP_BTREE_H__ */
> > diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
> > index bea1bdf9b8b9..11ff814996a1 100644
> > --- a/fs/xfs/libxfs/xfs_btree.c
> > +++ b/fs/xfs/libxfs/xfs_btree.c
> > @@ -23,11 +23,6 @@
> >  #include "xfs_btree_staging.h"
> >  #include "xfs_ag.h"
> >  
> > -/*
> > - * Cursor allocation zone.
> > - */
> > -kmem_zone_t	*xfs_btree_cur_zone;
> > -
> >  /*
> >   * Btree magic numbers.
> >   */
> > @@ -379,7 +374,7 @@ xfs_btree_del_cursor(
> >  		kmem_free(cur->bc_ops);
> >  	if (!(cur->bc_flags & XFS_BTREE_LONG_PTRS) && cur->bc_ag.pag)
> >  		xfs_perag_put(cur->bc_ag.pag);
> > -	kmem_cache_free(xfs_btree_cur_zone, cur);
> > +	kmem_cache_free(cur->bc_cache, cur);
> >  }
> >  
> >  /*
> > diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
> > index acb202839afd..6d61ce1559e2 100644
> > --- a/fs/xfs/libxfs/xfs_btree.h
> > +++ b/fs/xfs/libxfs/xfs_btree.h
> > @@ -13,8 +13,6 @@ struct xfs_trans;
> >  struct xfs_ifork;
> >  struct xfs_perag;
> >  
> > -extern kmem_zone_t	*xfs_btree_cur_zone;
> > -
> >  /*
> >   * Generic key, ptr and record wrapper structures.
> >   *
> > @@ -92,12 +90,6 @@ uint32_t xfs_btree_magic(int crc, xfs_btnum_t btnum);
> >  #define XFS_BTREE_STATS_ADD(cur, stat, val)	\
> >  	XFS_STATS_ADD_OFF((cur)->bc_mp, (cur)->bc_statoff + __XBTS_ ## stat, val)
> >  
> > -/*
> > - * The btree cursor zone hands out cursors that can handle up to this many
> > - * levels.  This is the known maximum for all btree types.
> > - */
> > -#define XFS_BTREE_CUR_ZONE_MAXLEVELS	(9)
> > -
> >  struct xfs_btree_ops {
> >  	/* size of the key and record structures */
> >  	size_t	key_len;
> > @@ -238,6 +230,7 @@ struct xfs_btree_cur
> >  	struct xfs_trans	*bc_tp;	/* transaction we're in, if any */
> >  	struct xfs_mount	*bc_mp;	/* file system mount struct */
> >  	const struct xfs_btree_ops *bc_ops;
> > +	kmem_zone_t		*bc_cache; /* cursor cache */
> >  	unsigned int		bc_flags; /* btree features - below */
> >  	xfs_btnum_t		bc_btnum; /* identifies which btree type */
> >  	union xfs_btree_irec	bc_rec;	/* current insert/search record value */
> > @@ -586,17 +579,17 @@ xfs_btree_alloc_cursor(
> >  	struct xfs_mount	*mp,
> >  	struct xfs_trans	*tp,
> >  	xfs_btnum_t		btnum,
> > -	uint8_t			maxlevels)
> > +	uint8_t			maxlevels,
> > +	kmem_zone_t		*cache)
> >  {
> >  	struct xfs_btree_cur	*cur;
> >  
> > -	ASSERT(maxlevels <= XFS_BTREE_CUR_ZONE_MAXLEVELS);
> > -
> > -	cur = kmem_cache_zalloc(xfs_btree_cur_zone, GFP_NOFS | __GFP_NOFAIL);
> > +	cur = kmem_cache_zalloc(cache, GFP_NOFS | __GFP_NOFAIL);
> >  	cur->bc_tp = tp;
> >  	cur->bc_mp = mp;
> >  	cur->bc_btnum = btnum;
> >  	cur->bc_maxlevels = maxlevels;
> > +	cur->bc_cache = cache;
> >  
> >  	return cur;
> >  }
> > diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
> > index 2e3dd1d798bd..2502085d476c 100644
> > --- a/fs/xfs/libxfs/xfs_ialloc_btree.c
> > +++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
> > @@ -22,6 +22,8 @@
> >  #include "xfs_rmap.h"
> >  #include "xfs_ag.h"
> >  
> > +static kmem_zone_t	*xfs_inobt_cur_cache;
> > +
> >  STATIC int
> >  xfs_inobt_get_minrecs(
> >  	struct xfs_btree_cur	*cur,
> > @@ -433,7 +435,7 @@ xfs_inobt_init_common(
> >  	struct xfs_btree_cur	*cur;
> >  
> >  	cur = xfs_btree_alloc_cursor(mp, tp, btnum,
> > -			M_IGEO(mp)->inobt_maxlevels);
> > +			M_IGEO(mp)->inobt_maxlevels, xfs_inobt_cur_cache);
> >  	if (btnum == XFS_BTNUM_INO) {
> >  		cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_ibt_2);
> >  		cur->bc_ops = &xfs_inobt_ops;
> > @@ -776,3 +778,20 @@ xfs_iallocbt_calc_size(
> >  {
> >  	return xfs_btree_calc_size(M_IGEO(mp)->inobt_mnr, len);
> >  }
> > +
> > +int __init
> > +xfs_inobt_init_cur_cache(void)
> > +{
> > +	xfs_inobt_cur_cache = kmem_cache_create("xfs_inobt_cur",
> > +			xfs_btree_cur_sizeof(xfs_inobt_absolute_maxlevels()),
> > +			0, 0, NULL);
> > +
> > +	return xfs_inobt_cur_cache != NULL ? 0 : -ENOMEM;
> > +}
> > +
> > +void
> > +xfs_inobt_destroy_cur_cache(void)
> > +{
> > +	kmem_cache_destroy(xfs_inobt_cur_cache);
> > +	xfs_inobt_cur_cache = NULL;
> > +}
> > diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.h b/fs/xfs/libxfs/xfs_ialloc_btree.h
> > index 1f09530bf856..b384733d5e0f 100644
> > --- a/fs/xfs/libxfs/xfs_ialloc_btree.h
> > +++ b/fs/xfs/libxfs/xfs_ialloc_btree.h
> > @@ -77,4 +77,7 @@ void xfs_inobt_commit_staged_btree(struct xfs_btree_cur *cur,
> >  
> >  unsigned int xfs_inobt_absolute_maxlevels(void);
> >  
> > +int __init xfs_inobt_init_cur_cache(void);
> > +void xfs_inobt_destroy_cur_cache(void);
> > +
> >  #endif	/* __XFS_IALLOC_BTREE_H__ */
> > diff --git a/fs/xfs/libxfs/xfs_refcount_btree.c b/fs/xfs/libxfs/xfs_refcount_btree.c
> > index bacd1b442b09..ba27a3ea2ce2 100644
> > --- a/fs/xfs/libxfs/xfs_refcount_btree.c
> > +++ b/fs/xfs/libxfs/xfs_refcount_btree.c
> > @@ -21,6 +21,8 @@
> >  #include "xfs_rmap.h"
> >  #include "xfs_ag.h"
> >  
> > +static kmem_zone_t	*xfs_refcountbt_cur_cache;
> > +
> >  static struct xfs_btree_cur *
> >  xfs_refcountbt_dup_cursor(
> >  	struct xfs_btree_cur	*cur)
> > @@ -323,7 +325,7 @@ xfs_refcountbt_init_common(
> >  	ASSERT(pag->pag_agno < mp->m_sb.sb_agcount);
> >  
> >  	cur = xfs_btree_alloc_cursor(mp, tp, XFS_BTNUM_REFC,
> > -			mp->m_refc_maxlevels);
> > +			mp->m_refc_maxlevels, xfs_refcountbt_cur_cache);
> >  	cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_refcbt_2);
> >  
> >  	cur->bc_flags |= XFS_BTREE_CRC_BLOCKS;
> > @@ -505,3 +507,20 @@ xfs_refcountbt_calc_reserves(
> >  
> >  	return error;
> >  }
> > +
> > +int __init
> > +xfs_refcountbt_init_cur_cache(void)
> > +{
> > +	xfs_refcountbt_cur_cache = kmem_cache_create("xfs_refcbt_cur",
> > +			xfs_btree_cur_sizeof(xfs_refcountbt_absolute_maxlevels()),
> > +			0, 0, NULL);
> > +
> > +	return xfs_refcountbt_cur_cache != NULL ? 0 : -ENOMEM;
> > +}
> > +
> > +void
> > +xfs_refcountbt_destroy_cur_cache(void)
> > +{
> > +	kmem_cache_destroy(xfs_refcountbt_cur_cache);
> > +	xfs_refcountbt_cur_cache = NULL;
> > +}
> > diff --git a/fs/xfs/libxfs/xfs_refcount_btree.h b/fs/xfs/libxfs/xfs_refcount_btree.h
> > index 2625b08f50a8..a1437d0a5717 100644
> > --- a/fs/xfs/libxfs/xfs_refcount_btree.h
> > +++ b/fs/xfs/libxfs/xfs_refcount_btree.h
> > @@ -67,4 +67,7 @@ void xfs_refcountbt_commit_staged_btree(struct xfs_btree_cur *cur,
> >  
> >  unsigned int xfs_refcountbt_absolute_maxlevels(void);
> >  
> > +int __init xfs_refcountbt_init_cur_cache(void);
> > +void xfs_refcountbt_destroy_cur_cache(void);
> > +
> >  #endif	/* __XFS_REFCOUNT_BTREE_H__ */
> > diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
> > index 860627b5ec08..0a9bc37c01d0 100644
> > --- a/fs/xfs/libxfs/xfs_rmap_btree.c
> > +++ b/fs/xfs/libxfs/xfs_rmap_btree.c
> > @@ -22,6 +22,8 @@
> >  #include "xfs_ag.h"
> >  #include "xfs_ag_resv.h"
> >  
> > +static kmem_zone_t	*xfs_rmapbt_cur_cache;
> > +
> >  /*
> >   * Reverse map btree.
> >   *
> > @@ -453,7 +455,7 @@ xfs_rmapbt_init_common(
> >  
> >  	/* Overlapping btree; 2 keys per pointer. */
> >  	cur = xfs_btree_alloc_cursor(mp, tp, XFS_BTNUM_RMAP,
> > -			mp->m_rmap_maxlevels);
> > +			mp->m_rmap_maxlevels, xfs_rmapbt_cur_cache);
> >  	cur->bc_flags = XFS_BTREE_CRC_BLOCKS | XFS_BTREE_OVERLAPPING;
> >  	cur->bc_statoff = XFS_STATS_CALC_INDEX(xs_rmap_2);
> >  	cur->bc_ops = &xfs_rmapbt_ops;
> > @@ -670,3 +672,20 @@ xfs_rmapbt_calc_reserves(
> >  
> >  	return error;
> >  }
> > +
> > +int __init
> > +xfs_rmapbt_init_cur_cache(void)
> > +{
> > +	xfs_rmapbt_cur_cache = kmem_cache_create("xfs_rmapbt_cur",
> > +			xfs_btree_cur_sizeof(xfs_rmapbt_absolute_maxlevels()),
> > +			0, 0, NULL);
> > +
> > +	return xfs_rmapbt_cur_cache != NULL ? 0 : -ENOMEM;
> > +}
> > +
> > +void
> > +xfs_rmapbt_destroy_cur_cache(void)
> > +{
> > +	kmem_cache_destroy(xfs_rmapbt_cur_cache);
> > +	xfs_rmapbt_cur_cache = NULL;
> > +}
> > diff --git a/fs/xfs/libxfs/xfs_rmap_btree.h b/fs/xfs/libxfs/xfs_rmap_btree.h
> > index 84fe74de923f..dd5422850656 100644
> > --- a/fs/xfs/libxfs/xfs_rmap_btree.h
> > +++ b/fs/xfs/libxfs/xfs_rmap_btree.h
> > @@ -61,4 +61,7 @@ extern int xfs_rmapbt_calc_reserves(struct xfs_mount *mp, struct xfs_trans *tp,
> >  
> >  unsigned int xfs_rmapbt_absolute_maxlevels(void);
> >  
> > +int __init xfs_rmapbt_init_cur_cache(void);
> > +void xfs_rmapbt_destroy_cur_cache(void);
> > +
> >  #endif /* __XFS_RMAP_BTREE_H__ */
> > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > index 90c92a6a49e0..399d7cfc7d4b 100644
> > --- a/fs/xfs/xfs_super.c
> > +++ b/fs/xfs/xfs_super.c
> > @@ -37,6 +37,13 @@
> >  #include "xfs_reflink.h"
> >  #include "xfs_pwork.h"
> >  #include "xfs_ag.h"
> > +#include "xfs_btree.h"
> > +#include "xfs_alloc_btree.h"
> > +#include "xfs_ialloc_btree.h"
> > +#include "xfs_bmap_btree.h"
> > +#include "xfs_rmap_btree.h"
> > +#include "xfs_refcount_btree.h"
> > +
> >  
> >  #include <linux/magic.h>
> >  #include <linux/fs_context.h>
> > @@ -1950,9 +1957,45 @@ static struct file_system_type xfs_fs_type = {
> >  };
> >  MODULE_ALIAS_FS("xfs");
> >  
> > +STATIC int __init
> > +xfs_init_btree_cur_caches(void)
> > +{
> > +	int				error;
> > +
> > +	error = xfs_allocbt_init_cur_cache();
> > +	if (error)
> > +		return error;
> > +	error = xfs_inobt_init_cur_cache();
> > +	if (error)
> > +		return error;
> > +	error = xfs_bmbt_init_cur_cache();
> > +	if (error)
> > +		return error;
> > +	error = xfs_rmapbt_init_cur_cache();
> > +	if (error)
> > +		return error;
> > +	error = xfs_refcountbt_init_cur_cache();
> > +	if (error)
> > +		return error;
> > +
> > +	return 0;
> > +}
> > +
> > +STATIC void
> > +xfs_destroy_btree_cur_caches(void)
> > +{
> > +	xfs_allocbt_destroy_cur_cache();
> > +	xfs_inobt_destroy_cur_cache();
> > +	xfs_bmbt_destroy_cur_cache();
> > +	xfs_rmapbt_destroy_cur_cache();
> > +	xfs_refcountbt_destroy_cur_cache();
> > +}
> 
> Move these to libxfs/xfs_btree.c and then it minimises the custom
> init code for userspace. It also means you don't need to include
> all the individual btree headers in xfs_super.c...
> 
> Otherwise it all looks ok.

Done.
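
Roughly like so in libxfs/xfs_btree.c (sketch only; the unwinding via
the destroy path is an addition on top of what's quoted above, and it
assumes the destroy helpers are safe to call for caches that were never
created, which they are since kmem_cache_destroy() ignores a NULL
cache):

	int __init
	xfs_init_btree_cur_caches(void)
	{
		int		error;

		error = xfs_allocbt_init_cur_cache();
		if (error)
			goto err;
		error = xfs_inobt_init_cur_cache();
		if (error)
			goto err;
		error = xfs_bmbt_init_cur_cache();
		if (error)
			goto err;
		error = xfs_rmapbt_init_cur_cache();
		if (error)
			goto err;
		error = xfs_refcountbt_init_cur_cache();
		if (error)
			goto err;
		return 0;
	err:
		xfs_destroy_btree_cur_caches();
		return error;
	}

	void
	xfs_destroy_btree_cur_caches(void)
	{
		xfs_allocbt_destroy_cur_cache();
		xfs_inobt_destroy_cur_cache();
		xfs_bmbt_destroy_cur_cache();
		xfs_rmapbt_destroy_cur_cache();
		xfs_refcountbt_destroy_cur_cache();
	}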

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: [PATCH 14/15] xfs: compute absolute maximum nlevels for each btree type
  2021-10-13 21:36     ` Darrick J. Wong
@ 2021-10-13 23:48       ` Dave Chinner
  0 siblings, 0 replies; 41+ messages in thread
From: Dave Chinner @ 2021-10-13 23:48 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, chandan.babu, hch

On Wed, Oct 13, 2021 at 02:36:33PM -0700, Darrick J. Wong wrote:
> On Wed, Oct 13, 2021 at 06:57:43PM +1100, Dave Chinner wrote:
> > On Tue, Oct 12, 2021 at 04:33:50PM -0700, Darrick J. Wong wrote:
> > > --- a/fs/xfs/libxfs/xfs_alloc_btree.c
> > > +++ b/fs/xfs/libxfs/xfs_alloc_btree.c
> > > @@ -582,6 +582,19 @@ xfs_allocbt_maxrecs(
> > >  	return blocklen / (sizeof(xfs_alloc_key_t) + sizeof(xfs_alloc_ptr_t));
> > >  }
> > >  
> > > +/* Compute the max possible height of the maximally sized free space btree. */
> > > +unsigned int
> > > +xfs_allocbt_absolute_maxlevels(void)
> > > +{
> > > +	unsigned int		minrecs[2];
> > > +
> > > +	xfs_btree_absolute_minrecs(minrecs, 0, sizeof(xfs_alloc_rec_t),
> > > +			sizeof(xfs_alloc_key_t) + sizeof(xfs_alloc_ptr_t));
> > > +
> > > +	return xfs_btree_compute_maxlevels(minrecs,
> > > +			(XFS_MAX_AG_BLOCKS + 1) / 2);
> > > +}
> > 
> > Hmmmm. This is kinda messy. I'd prefer we share code with the
> > xfs_allocbt_maxrecs() function that does this. Not sure "absolute" is
> > the right word, either. It's more a function of the on-disk format
> > maximum, not an "absolute" thing.
> 
> <nod> I'm not passionate about the name one way or the other.
> 
> > I mean, we know what the worst case is going to be for each btree
> > type - we don't need to pass XFS_BTREE_CRC_BLOCKS or
> > XFS_BTREE_LONG_PTRS into generic code just so it can branch multiple
> > times to stay generic.
> 
> Yeah, that function was a conditional mess.  I like...
> 
> > Instead:
> > 
> > static inline int
> > xfs_allocbt_block_maxrecs(
> >         int                     blocklen,
> >         int                     leaf)
> > {
> >         if (leaf)
> >                 return blocklen / sizeof(xfs_alloc_rec_t);
> >         return blocklen / (sizeof(xfs_alloc_key_t) + sizeof(xfs_alloc_ptr_t));
> > }
> > 
> > /*
> >  * Calculate number of records in an alloc btree block.
> >  */
> > int
> > xfs_allocbt_maxrecs(
> >         struct xfs_mount        *mp,
> >         int                     blocklen,
> >         int                     leaf)
> > {
> >         blocklen -= XFS_ALLOC_BLOCK_LEN(mp);
> > 	return xfs_allobt_block_maxrecs(blocklen, leaf);
> > }
> > 
> > xfs_allocbt_maxlevels_ondisk()
> > {
> > 	unsigned int		minrecs[2];
> > 
> > 	minrecs[0] = xfs_allocbt_block_maxrecs(
> > 			XFS_MIN_BLOCKSIZE - XFS_BTREE_SBLOCK_LEN, true) / 2;
> > 	minrecs[1] = xfs_allocbt_block_maxrecs(
> > 			XFS_MIN_BLOCKSIZE - XFS_BTREE_SBLOCK_LEN, false) / 2;
> 
> ...this a lot better since one doesn't have to switch back and forth
> between source files to figure out how the computation works.
> 
> However, I want to propose a possibly pedantic addition to the blocksize
> computation for btrees.  We want to compute the maximum btree height
> that we're ever going to see, which means that we are modeling a btree
> with the minimum possible fanout factor.  That means the smallest btree
> nodes possible, and half full.
> 
> min V5 blocksize: 1024 bytes
> V5 btree short header: 56 bytes
> min V5 btree record area: 968 bytes
> 
> min V4 blocksize: 512 bytes
> V4 btree short header: 16 bytes
> min V4 btree record area: 496 bytes
> 
> In other words, the bit above for the allocbt ought to be:
> 
> 	blocklen = min(XFS_MIN_BLOCKSIZE - XFS_BTREE_SBLOCK_LEN,
> 		       XFS_MIN_CRC_BLOCKSIZE - XFS_BTREE_SBLOCK_CRC_LEN);
> 
> Which is very pedantic, since the whole expression /always/ evaluates
> to 496.  IIRC the kernel has enough macro soup to resolve that into a
> constant expression so it shouldn't cost us anything.

Yup, good idea, I'm happy with that - now the code documents the
on-disk format calculation exactly in a single location. :)
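
To spell the arithmetic out, here's a quick standalone sketch of the
bnobt worst case (the block and header sizes are the ones quoted above;
the record/key/pointer sizes and the ~2^30 record count are
illustrative assumptions, not the real macros):

	#include <stdio.h>

	#define MIN_V4_BLOCKSIZE	512u	/* smallest V4 block */
	#define MIN_V5_BLOCKSIZE	1024u	/* smallest V5 block */
	#define SBLOCK_LEN		16u	/* short-ptr btree header */
	#define SBLOCK_CRC_LEN		56u	/* short-ptr btree header, CRC */

	/* Worst-case fanout: smallest possible blocks, each only half full. */
	static unsigned int
	compute_maxlevels(const unsigned int minrecs[2], unsigned long long records)
	{
		unsigned long long	fits = minrecs[0];
		unsigned int		level = 1;

		while (fits < records) {
			fits *= minrecs[1];
			level++;
		}
		return level;
	}

	int main(void)
	{
		unsigned int	v4 = MIN_V4_BLOCKSIZE - SBLOCK_LEN;	/* 496 */
		unsigned int	v5 = MIN_V5_BLOCKSIZE - SBLOCK_CRC_LEN;	/* 968 */
		unsigned int	blocklen = v4 < v5 ? v4 : v5;		/* 496 */
		unsigned int	minrecs[2] = {
			blocklen / 8 / 2,	/* 8-byte records, half full */
			blocklen / 12 / 2,	/* 8-byte key + 4-byte ptr, half full */
		};

		/* Assume ~2^30 free space records in a maximally sized AG. */
		printf("blocklen %u, worst case height %u\n", blocklen,
				compute_maxlevels(minrecs, 1ULL << 30));
		return 0;
	}

With those assumptions the sketch prints a worst case height of 7 for a
496 byte record area; the point is simply that the minimum fanout
(half-full, minimum-size blocks) is what bounds the height.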

> > > --- a/fs/xfs/libxfs/xfs_ialloc.c
> > > +++ b/fs/xfs/libxfs/xfs_ialloc.c
> > > @@ -2793,6 +2793,7 @@ xfs_ialloc_setup_geometry(
> > >  	inodes = (1LL << XFS_INO_AGINO_BITS(mp)) >> XFS_INODES_PER_CHUNK_LOG;
> > >  	igeo->inobt_maxlevels = xfs_btree_compute_maxlevels(igeo->inobt_mnr,
> > >  			inodes);
> > > +	ASSERT(igeo->inobt_maxlevels <= xfs_inobt_absolute_maxlevels());
> > >  
> > >  	/*
> > >  	 * Set the maximum inode count for this filesystem, being careful not
> > > diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
> > > index 3a5a24648b87..2e3dd1d798bd 100644
> > > --- a/fs/xfs/libxfs/xfs_ialloc_btree.c
> > > +++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
> > > @@ -542,6 +542,25 @@ xfs_inobt_maxrecs(
> > >  	return blocklen / (sizeof(xfs_inobt_key_t) + sizeof(xfs_inobt_ptr_t));
> > >  }
> > >  
> > > +/* Compute the max possible height of the maximally sized inode btree. */
> > > +unsigned int
> > > +xfs_inobt_absolute_maxlevels(void)
> > > +{
> > > +	unsigned int		minrecs[2];
> > > +	unsigned long long	max_ag_inodes;
> > > +
> > > +	/*
> > > +	 * For the absolute maximum, pretend that we can fill an entire AG
> > > +	 * completely full of inodes except for the AG headers.
> > > +	 */
> > > +	max_ag_inodes = (XFS_MAX_AG_BYTES - (4 * BBSIZE)) / XFS_DINODE_MIN_SIZE;
> > > +
> > > +	xfs_btree_absolute_minrecs(minrecs, 0, sizeof(xfs_inobt_rec_t),
> > > +			sizeof(xfs_inobt_key_t) + sizeof(xfs_inobt_ptr_t));
> > > +
> > > +	return xfs_btree_compute_maxlevels(minrecs, max_ag_inodes);
> > > +}
> > 
> > We've got two different inobt max levels on disk. The inobt has v4
> > limits, whilst the finobt has v5 limits...
> 
> <nod> I'll make it return the larger of the two heights, though the
> inode btree is always going to win due to its smaller minimum block size.

Yup, I expect so, but it would be good to make it explicit :)
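
e.g. something as trivial as this would do (sketch only; the helper and
its arguments are made up for illustration, not real functions):

	/*
	 * Keep whichever on-disk variant yields the taller tree: the inobt
	 * can live on smaller V4 blocks, the finobt is V5-only.
	 */
	static inline unsigned int
	example_iallocbt_maxlevels_ondisk(unsigned int inobt_levels,
					  unsigned int finobt_levels)
	{
		return inobt_levels > finobt_levels ? inobt_levels : finobt_levels;
	}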

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

