* [PATCHSET v2 0/6] xfs: fix incorrect reserve pool calculations and reporting
@ 2022-03-17 21:20 Darrick J. Wong
  2022-03-17 21:21 ` [PATCH 1/6] xfs: document the XFS_ALLOC_AGFL_RESERVE constant Darrick J. Wong
                   ` (5 more replies)
  0 siblings, 6 replies; 20+ messages in thread
From: Darrick J. Wong @ 2022-03-17 21:20 UTC (permalink / raw)
  To: djwong; +Cc: Brian Foster, linux-xfs, bfoster, david

Hi all,

Last week, I was running xfs/306 (which formats a single-AG 20MB
filesystem) with an fstests configuration that specifies a 1k blocksize
and a specially crafted log size that will consume 7/8 of the space
(17920 blocks, specifically) in that AG.  This resulted in mount hanging in
the infinite loop in xfs_reserve_blocks because we overestimate the
number of blocks that xfs_mod_fdblocks will give us, and the code
retries forever.
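
For context, here's a rough sketch of the geometry involved (assuming the
lone AG spans the whole 20MB device; numbers are approximate):

    20 MB at 1024 bytes per block  = 20480 blocks in the AG
    log size = 7/8 of the AG       = 17920 blocks
    remaining free space          ~=  2560 blocks (minus AG headers, etc.)

With so little space left, any overestimate of what xfs_mod_fdblocks will
hand out makes the reservation loop spin forever.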

Initially, this was a single patch that fixed the calculation and
transformed the loop into a finite loop.  However, further discussion
revealed several more issues:

 * People had a hard time grokking that the "alloc_set_aside" is
   actually the amount of space we hide so that a bmbt split will always
   succeed;
 * The author didn't understand what XFS_ALLOC_AGFL_RESERVE actually
   meant or whether it was correctly calculated;
 * The "alloc_set_aside" underestimated the blocks needed to handle any
   bmap btree split;
 * Both statfs and XFS_IOC_FSCOUNTS forgot to exclude the amount of
   space used by the free space btrees on behalf of per-AG reservations,
   leading to overreporting of available space;
 * The loop control change really should have been separate.

Hence, this patchset has now grown to six patches to address each of
these issues separately.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=reserve-block-fixes-5.18
---
 fs/xfs/libxfs/xfs_alloc.c |   30 +++++++++++++++++++++++-------
 fs/xfs/libxfs/xfs_alloc.h |    3 +--
 fs/xfs/libxfs/xfs_sb.c    |    2 --
 fs/xfs/xfs_fsops.c        |   24 +++++++++++++++++-------
 fs/xfs/xfs_log_recover.c  |    2 +-
 fs/xfs/xfs_mount.c        |    9 ++++++++-
 fs/xfs/xfs_mount.h        |   45 ++++++++++++++++++++++++++++++++++++++++++++-
 fs/xfs/xfs_super.c        |    4 +---
 8 files changed, 95 insertions(+), 24 deletions(-)



* [PATCH 1/6] xfs: document the XFS_ALLOC_AGFL_RESERVE constant
  2022-03-17 21:20 [PATCHSET v2 0/6] xfs: fix incorrect reserve pool calculations and reporting Darrick J. Wong
@ 2022-03-17 21:21 ` Darrick J. Wong
  2022-03-18 12:17   ` Brian Foster
  2022-03-17 21:21 ` [PATCH 2/6] xfs: actually set aside enough space to handle a bmbt split Darrick J. Wong
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 20+ messages in thread
From: Darrick J. Wong @ 2022-03-17 21:21 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, bfoster, david

From: Darrick J. Wong <djwong@kernel.org>

Currently, we use this undocumented macro to encode the minimum number
of blocks needed to replenish a completely empty AGFL when an AG is
nearly full.  This has led to confusion on the part of the maintainers,
so let's document what the value actually means, and move it to
xfs_alloc.c since it's not used outside of that module.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_alloc.c |   23 ++++++++++++++++++-----
 fs/xfs/libxfs/xfs_alloc.h |    1 -
 2 files changed, 18 insertions(+), 6 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 353e53b892e6..b0678e96ce61 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -82,6 +82,19 @@ xfs_prealloc_blocks(
 }
 
 /*
+ * The number of blocks per AG that we withhold from xfs_mod_fdblocks to
+ * guarantee that we can refill the AGFL prior to allocating space in a nearly
+ * full AG.  We require two blocks per free space btree because free space
+ * btrees shrink to a single block as the AG fills up, and any allocation can
+ * cause a btree split.  The rmap btree uses a per-AG reservation to withhold
+ * space from xfs_mod_fdblocks, so we do not account for that here.
+ */
+#define XFS_ALLOCBT_AGFL_RESERVE	4
+
+/*
+ * Compute the number of blocks that we set aside to guarantee the ability to
+ * refill the AGFL and handle a full bmap btree split.
+ *
  * In order to avoid ENOSPC-related deadlock caused by out-of-order locking of
  * AGF buffer (PV 947395), we place constraints on the relationship among
  * actual allocations for data blocks, freelist blocks, and potential file data
@@ -93,14 +106,14 @@ xfs_prealloc_blocks(
  * extents need to be actually allocated. To get around this, we explicitly set
  * aside a few blocks which will not be reserved in delayed allocation.
  *
- * We need to reserve 4 fsbs _per AG_ for the freelist and 4 more to handle a
- * potential split of the file's bmap btree.
+ * For each AG, we need to reserve enough blocks to replenish a totally empty
+ * AGFL and 4 more to handle a potential split of the file's bmap btree.
  */
 unsigned int
 xfs_alloc_set_aside(
 	struct xfs_mount	*mp)
 {
-	return mp->m_sb.sb_agcount * (XFS_ALLOC_AGFL_RESERVE + 4);
+	return mp->m_sb.sb_agcount * (XFS_ALLOCBT_AGFL_RESERVE + 4);
 }
 
 /*
@@ -124,12 +137,12 @@ xfs_alloc_ag_max_usable(
 	unsigned int		blocks;
 
 	blocks = XFS_BB_TO_FSB(mp, XFS_FSS_TO_BB(mp, 4)); /* ag headers */
-	blocks += XFS_ALLOC_AGFL_RESERVE;
+	blocks += XFS_ALLOCBT_AGFL_RESERVE;
 	blocks += 3;			/* AGF, AGI btree root blocks */
 	if (xfs_has_finobt(mp))
 		blocks++;		/* finobt root block */
 	if (xfs_has_rmapbt(mp))
-		blocks++; 		/* rmap root block */
+		blocks++;		/* rmap root block */
 	if (xfs_has_reflink(mp))
 		blocks++;		/* refcount root block */
 
diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
index 1c14a0b1abea..d4c057b764f9 100644
--- a/fs/xfs/libxfs/xfs_alloc.h
+++ b/fs/xfs/libxfs/xfs_alloc.h
@@ -88,7 +88,6 @@ typedef struct xfs_alloc_arg {
 #define XFS_ALLOC_NOBUSY		(1 << 2)/* Busy extents not allowed */
 
 /* freespace limit calculations */
-#define XFS_ALLOC_AGFL_RESERVE	4
 unsigned int xfs_alloc_set_aside(struct xfs_mount *mp);
 unsigned int xfs_alloc_ag_max_usable(struct xfs_mount *mp);
 



* [PATCH 2/6] xfs: actually set aside enough space to handle a bmbt split
  2022-03-17 21:20 [PATCHSET v2 0/6] xfs: fix incorrect reserve pool calculations and reporting Darrick J. Wong
  2022-03-17 21:21 ` [PATCH 1/6] xfs: document the XFS_ALLOC_AGFL_RESERVE constant Darrick J. Wong
@ 2022-03-17 21:21 ` Darrick J. Wong
  2022-03-18 12:17   ` Brian Foster
  2022-03-17 21:21 ` [PATCH 3/6] xfs: don't include bnobt blocks when reserving free block pool Darrick J. Wong
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 20+ messages in thread
From: Darrick J. Wong @ 2022-03-17 21:21 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, bfoster, david

From: Darrick J. Wong <djwong@kernel.org>

The comment for xfs_alloc_set_aside indicates that we want to set aside
enough space to handle a bmap btree split.  The code, unfortunately,
hardcodes this to 4.

This is incorrect, since file bmap btrees can be taller than that:

xfs_db> btheight bmapbt -n 4294967296 -b 512
bmapbt: worst case per 512-byte block: 13 records (leaf) / 13 keyptrs (node)
level 0: 4294967296 records, 330382100 blocks
level 1: 330382100 records, 25414008 blocks
level 2: 25414008 records, 1954924 blocks
level 3: 1954924 records, 150379 blocks
level 4: 150379 records, 11568 blocks
level 5: 11568 records, 890 blocks
level 6: 890 records, 69 blocks
level 7: 69 records, 6 blocks
level 8: 6 records, 1 block
9 levels, 357913945 blocks total

Fix this by using the actual bmap btree maxlevel value for the
set-aside.  We subtract one because the root is always in the inode and
hence never splits.
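
As a rough worked example (illustrative only, borrowing the 9-level
512-byte case above for the maxlevels value):

    bmbt_splits = max(m_bm_maxlevels[0], m_bm_maxlevels[1]) - 1
                = 9 - 1 = 8

    set-aside   = sb_agcount * (XFS_ALLOCBT_AGFL_RESERVE + bmbt_splits)
                = sb_agcount * (4 + 8)

versus the old hardcoded sb_agcount * (4 + 4).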

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_alloc.c |    7 +++++--
 fs/xfs/libxfs/xfs_sb.c    |    2 --
 fs/xfs/xfs_mount.c        |    7 +++++++
 3 files changed, 12 insertions(+), 4 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index b0678e96ce61..747b3e45303f 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -107,13 +107,16 @@ xfs_prealloc_blocks(
  * aside a few blocks which will not be reserved in delayed allocation.
  *
  * For each AG, we need to reserve enough blocks to replenish a totally empty
- * AGFL and 4 more to handle a potential split of the file's bmap btree.
+ * AGFL and enough to handle a potential split of a file's bmap btree.
  */
 unsigned int
 xfs_alloc_set_aside(
 	struct xfs_mount	*mp)
 {
-	return mp->m_sb.sb_agcount * (XFS_ALLOCBT_AGFL_RESERVE + 4);
+	unsigned int		bmbt_splits;
+
+	bmbt_splits = max(mp->m_bm_maxlevels[0], mp->m_bm_maxlevels[1]) - 1;
+	return mp->m_sb.sb_agcount * (XFS_ALLOCBT_AGFL_RESERVE + bmbt_splits);
 }
 
 /*
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index f4e84aa1d50a..b823beb944e4 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -887,8 +887,6 @@ xfs_sb_mount_common(
 	mp->m_refc_mnr[1] = mp->m_refc_mxr[1] / 2;
 
 	mp->m_bsize = XFS_FSB_TO_BB(mp, 1);
-	mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
-	mp->m_ag_max_usable = xfs_alloc_ag_max_usable(mp);
 }
 
 /*
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index bed73e8002a5..9336176dc706 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -652,6 +652,13 @@ xfs_mountfs(
 
 	xfs_agbtree_compute_maxlevels(mp);
 
+	/*
+	 * Compute the amount of space to set aside to handle btree splits now
+	 * that we have calculated the btree maxlevels.
+	 */
+	mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
+	mp->m_ag_max_usable = xfs_alloc_ag_max_usable(mp);
+
 	/*
 	 * Check if sb_agblocks is aligned at stripe boundary.  If sb_agblocks
 	 * is NOT aligned turn off m_dalign since allocator alignment is within



* [PATCH 3/6] xfs: don't include bnobt blocks when reserving free block pool
  2022-03-17 21:20 [PATCHSET v2 0/6] xfs: fix incorrect reserve pool calculations and reporting Darrick J. Wong
  2022-03-17 21:21 ` [PATCH 1/6] xfs: document the XFS_ALLOC_AGFL_RESERVE constant Darrick J. Wong
  2022-03-17 21:21 ` [PATCH 2/6] xfs: actually set aside enough space to handle a bmbt split Darrick J. Wong
@ 2022-03-17 21:21 ` Darrick J. Wong
  2022-03-18 12:18   ` Brian Foster
  2022-03-17 21:21 ` [PATCH 4/6] xfs: fix infinite loop " Darrick J. Wong
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 20+ messages in thread
From: Darrick J. Wong @ 2022-03-17 21:21 UTC (permalink / raw)
  To: djwong; +Cc: Brian Foster, linux-xfs, bfoster, david

From: Darrick J. Wong <djwong@kernel.org>

xfs_reserve_blocks controls the size of the user-visible free space
reserve pool.  Given the difference between the current and requested
pool sizes, it will try to reserve free space from fdblocks.  However,
the amount requested from fdblocks is also constrained by the amount of
space that we think xfs_mod_fdblocks will give us.  We'll keep trying to
reserve space so long as xfs_mod_fdblocks returns ENOSPC.

In commit fd43cf600cf6, we decided that xfs_mod_fdblocks should not hand
out the "free space" used by the free space btrees, because some portion
of the free space btrees hold in reserve space for future btree
expansion.  Unfortunately, xfs_reserve_blocks' estimation of the number
of blocks that it could request from xfs_mod_fdblocks was not updated to
include m_allocbt_blks, so if space is extremely low, the caller hangs.
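
To illustrate the hang with made-up numbers (purely hypothetical): say
fdblocks = 100, m_alloc_set_aside = 8, m_allocbt_blks = 5, and the pool
wants 100 more blocks:

    estimated:  free = 100 - 8     = 92 blocks, so we request 92
    grantable:         100 - 8 - 5 = 87 blocks, so xfs_mod_fdblocks
                                        returns -ENOSPC

Since the estimate never drops to what xfs_mod_fdblocks will actually
grant, the do/while loop retries forever.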

Fix this by creating a function to estimate the number of blocks that
can be reserved from fdblocks, which needs to exclude the set-aside and
m_allocbt_blks.

Found by running xfs/306 (which formats a single-AG 20MB filesystem)
with an fstests configuration that specifies a 1k blocksize and a
specially crafted log size that will consume 7/8 of the space (17920
blocks, specifically) in that AG.

Cc: Brian Foster <bfoster@redhat.com>
Fixes: fd43cf600cf6 ("xfs: set aside allocation btree blocks from block reservation")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_fsops.c |    7 +++++--
 fs/xfs/xfs_mount.h |   29 +++++++++++++++++++++++++++++
 2 files changed, 34 insertions(+), 2 deletions(-)


diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index 33e26690a8c4..b71799a3acd3 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -433,8 +433,11 @@ xfs_reserve_blocks(
 	 */
 	error = -ENOSPC;
 	do {
-		free = percpu_counter_sum(&mp->m_fdblocks) -
-						mp->m_alloc_set_aside;
+		/*
+		 * The reservation pool cannot take space that xfs_mod_fdblocks
+		 * will not give us.
+		 */
+		free = xfs_fdblocks_available(mp);
 		if (free <= 0)
 			break;
 
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 00720a02e761..998b54c3c454 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -479,6 +479,35 @@ extern void	xfs_unmountfs(xfs_mount_t *);
  */
 #define XFS_FDBLOCKS_BATCH	1024
 
+/*
+ * Estimate the amount of space that xfs_mod_fdblocks might give us without
+ * drawing from the reservation pool.  In other words, estimate the free space
+ * that is available to userspace.
+ *
+ * This quantity is the amount of free space tracked in the on-disk metadata
+ * minus:
+ *
+ * - Delayed allocation reservations
+ * - Per-AG space reservations to guarantee metadata expansion
+ * - Userspace-controlled free space reserve pool
+ *
+ * - Space reserved to ensure that we can always split a bmap btree
+ * - Free space btree blocks that are not available for allocation due to
+ *   per-AG metadata reservations
+ *
+ * The first three are captured in the incore fdblocks counter.
+ */
+static inline int64_t
+xfs_fdblocks_available(
+	struct xfs_mount	*mp)
+{
+	int64_t			free = percpu_counter_sum(&mp->m_fdblocks);
+
+	free -= mp->m_alloc_set_aside;
+	free -= atomic64_read(&mp->m_allocbt_blks);
+	return free;
+}
+
 extern int	xfs_mod_fdblocks(struct xfs_mount *mp, int64_t delta,
 				 bool reserved);
 extern int	xfs_mod_frextents(struct xfs_mount *mp, int64_t delta);



* [PATCH 4/6] xfs: fix infinite loop when reserving free block pool
  2022-03-17 21:20 [PATCHSET v2 0/6] xfs: fix incorrect reserve pool calculations and reporting Darrick J. Wong
                   ` (2 preceding siblings ...)
  2022-03-17 21:21 ` [PATCH 3/6] xfs: don't include bnobt blocks when reserving free block pool Darrick J. Wong
@ 2022-03-17 21:21 ` Darrick J. Wong
  2022-03-18 12:18   ` Brian Foster
  2022-03-17 21:21 ` [PATCH 5/6] xfs: don't report reserved bnobt space as available Darrick J. Wong
  2022-03-17 21:21 ` [PATCH 6/6] xfs: rename "alloc_set_aside" to be more descriptive Darrick J. Wong
  5 siblings, 1 reply; 20+ messages in thread
From: Darrick J. Wong @ 2022-03-17 21:21 UTC (permalink / raw)
  To: djwong; +Cc: Brian Foster, linux-xfs, bfoster, david

From: Darrick J. Wong <djwong@kernel.org>

Don't spin in an infinite loop trying to reserve blocks -- if we can't
do it after 30 tries, we're racing with a nearly full filesystem, so
just give up.

Cc: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_fsops.c |   12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)


diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index b71799a3acd3..4076b9004077 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -379,6 +379,7 @@ xfs_reserve_blocks(
 	int64_t			fdblks_delta = 0;
 	uint64_t		request;
 	int64_t			free;
+	unsigned int		tries;
 	int			error = 0;
 
 	/* If inval is null, report current values and return */
@@ -430,9 +431,16 @@ xfs_reserve_blocks(
 	 * If the request is larger than the current reservation, reserve the
 	 * blocks before we update the reserve counters. Sample m_fdblocks and
 	 * perform a partial reservation if the request exceeds free space.
+	 *
+	 * The loop body estimates how many blocks it can request from fdblocks
+	 * to stash in the reserve pool.  This is a classic TOCTOU race since
+	 * fdblocks updates are not always coordinated via m_sb_lock.  We also
+	 * cannot tell if @free remaining unchanged between iterations is due
+	 * to an idle system or freed blocks being consumed immediately, so
+	 * we'll try a finite number of times to satisfy the request.
 	 */
 	error = -ENOSPC;
-	do {
+	for (tries = 0; tries < 30 && error == -ENOSPC; tries++) {
 		/*
 		 * The reservation pool cannot take space that xfs_mod_fdblocks
 		 * will not give us.
@@ -462,7 +470,7 @@ xfs_reserve_blocks(
 		spin_unlock(&mp->m_sb_lock);
 		error = xfs_mod_fdblocks(mp, -fdblks_delta, 0);
 		spin_lock(&mp->m_sb_lock);
-	} while (error == -ENOSPC);
+	}
 
 	/*
 	 * Update the reserve counters if blocks have been successfully



* [PATCH 5/6] xfs: don't report reserved bnobt space as available
  2022-03-17 21:20 [PATCHSET v2 0/6] xfs: fix incorrect reserve pool calculations and reporting Darrick J. Wong
                   ` (3 preceding siblings ...)
  2022-03-17 21:21 ` [PATCH 4/6] xfs: fix infinite loop " Darrick J. Wong
@ 2022-03-17 21:21 ` Darrick J. Wong
  2022-03-18 12:19   ` Brian Foster
  2022-03-17 21:21 ` [PATCH 6/6] xfs: rename "alloc_set_aside" to be more descriptive Darrick J. Wong
  5 siblings, 1 reply; 20+ messages in thread
From: Darrick J. Wong @ 2022-03-17 21:21 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, bfoster, david

From: Darrick J. Wong <djwong@kernel.org>

On a modern filesystem, we don't allow userspace to allocate blocks for
data storage from the per-AG space reservations, the user-controlled
reservation pool that prevents ENOSPC in the middle of internal
operations, or the internal per-AG set-aside that prevents ENOSPC.
Since we now consider free space btree blocks as unavailable for
allocation for data storage, we shouldn't report those blocks via statfs
either.
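
As a quick sketch of the user-visible effect (made-up numbers): with
fdblocks = 1000, m_alloc_set_aside = 40, and m_allocbt_blks = 12, statfs
used to report f_bfree = 1000 - 40 = 960, whereas it now reports
1000 - 40 - 12 = 948, matching what an allocation could actually get.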

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_fsops.c |    3 +--
 fs/xfs/xfs_mount.h |   13 +++++++++++++
 fs/xfs/xfs_super.c |    4 +---
 3 files changed, 15 insertions(+), 5 deletions(-)


diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index 4076b9004077..b42b8bc55729 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -346,8 +346,7 @@ xfs_fs_counts(
 {
 	cnt->allocino = percpu_counter_read_positive(&mp->m_icount);
 	cnt->freeino = percpu_counter_read_positive(&mp->m_ifree);
-	cnt->freedata = percpu_counter_read_positive(&mp->m_fdblocks) -
-						mp->m_alloc_set_aside;
+	cnt->freedata = xfs_fdblocks_available_fast(mp);
 
 	spin_lock(&mp->m_sb_lock);
 	cnt->freertx = mp->m_sb.sb_frextents;
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 998b54c3c454..74e9b8558162 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -508,6 +508,19 @@ xfs_fdblocks_available(
 	return free;
 }
 
+/* Same as above, but don't take the slow path. */
+static inline int64_t
+xfs_fdblocks_available_fast(
+	struct xfs_mount	*mp)
+{
+	int64_t			free;
+
+	free = percpu_counter_read_positive(&mp->m_fdblocks);
+	free -= mp->m_alloc_set_aside;
+	free -= atomic64_read(&mp->m_allocbt_blks);
+	return free;
+}
+
 extern int	xfs_mod_fdblocks(struct xfs_mount *mp, int64_t delta,
 				 bool reserved);
 extern int	xfs_mod_frextents(struct xfs_mount *mp, int64_t delta);
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index d84714e4e46a..7b6c147e63c4 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -791,7 +791,6 @@ xfs_fs_statfs(
 	uint64_t		fakeinos, id;
 	uint64_t		icount;
 	uint64_t		ifree;
-	uint64_t		fdblocks;
 	xfs_extlen_t		lsize;
 	int64_t			ffree;
 
@@ -806,7 +805,6 @@ xfs_fs_statfs(
 
 	icount = percpu_counter_sum(&mp->m_icount);
 	ifree = percpu_counter_sum(&mp->m_ifree);
-	fdblocks = percpu_counter_sum(&mp->m_fdblocks);
 
 	spin_lock(&mp->m_sb_lock);
 	statp->f_bsize = sbp->sb_blocksize;
@@ -815,7 +813,7 @@ xfs_fs_statfs(
 	spin_unlock(&mp->m_sb_lock);
 
 	/* make sure statp->f_bfree does not underflow */
-	statp->f_bfree = max_t(int64_t, fdblocks - mp->m_alloc_set_aside, 0);
+	statp->f_bfree = max_t(int64_t, xfs_fdblocks_available(mp), 0);
 	statp->f_bavail = statp->f_bfree;
 
 	fakeinos = XFS_FSB_TO_INO(mp, statp->f_bfree);



* [PATCH 6/6] xfs: rename "alloc_set_aside" to be more descriptive
  2022-03-17 21:20 [PATCHSET v2 0/6] xfs: fix incorrect reserve pool calculations and reporting Darrick J. Wong
                   ` (4 preceding siblings ...)
  2022-03-17 21:21 ` [PATCH 5/6] xfs: don't report reserved bnobt space as available Darrick J. Wong
@ 2022-03-17 21:21 ` Darrick J. Wong
  2022-03-18 12:21   ` Brian Foster
  5 siblings, 1 reply; 20+ messages in thread
From: Darrick J. Wong @ 2022-03-17 21:21 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, bfoster, david

From: Darrick J. Wong <djwong@kernel.org>

We've established in this patchset that the "alloc_set_aside" pool is
actually used to ensure that a bmbt split always succeeds so that the
filesystem won't run out of space mid-transaction and crash.  Rename the
variable and the function to be a little more suggestive of the purpose
of this quantity.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_alloc.c |    4 ++--
 fs/xfs/libxfs/xfs_alloc.h |    2 +-
 fs/xfs/xfs_fsops.c        |    2 +-
 fs/xfs/xfs_log_recover.c  |    2 +-
 fs/xfs/xfs_mount.c        |    4 ++--
 fs/xfs/xfs_mount.h        |    7 ++++---
 6 files changed, 11 insertions(+), 10 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 747b3e45303f..a4a6cca1ffd1 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -110,7 +110,7 @@ xfs_prealloc_blocks(
  * AGFL and enough to handle a potential split of a file's bmap btree.
  */
 unsigned int
-xfs_alloc_set_aside(
+xfs_bmbt_split_setaside(
 	struct xfs_mount	*mp)
 {
 	unsigned int		bmbt_splits;
@@ -127,7 +127,7 @@ xfs_alloc_set_aside(
  *	- the AG superblock, AGF, AGI and AGFL
  *	- the AGF (bno and cnt) and AGI btree root blocks, and optionally
  *	  the AGI free inode and rmap btree root blocks.
- *	- blocks on the AGFL according to xfs_alloc_set_aside() limits
+ *	- blocks on the AGFL according to xfs_bmbt_split_setaside() limits
  *	- the rmapbt root block
  *
  * The AG headers are sector sized, so the amount of space they take up is
diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
index d4c057b764f9..7d676c1c66bc 100644
--- a/fs/xfs/libxfs/xfs_alloc.h
+++ b/fs/xfs/libxfs/xfs_alloc.h
@@ -88,7 +88,7 @@ typedef struct xfs_alloc_arg {
 #define XFS_ALLOC_NOBUSY		(1 << 2)/* Busy extents not allowed */
 
 /* freespace limit calculations */
-unsigned int xfs_alloc_set_aside(struct xfs_mount *mp);
+unsigned int xfs_bmbt_split_setaside(struct xfs_mount *mp);
 unsigned int xfs_alloc_ag_max_usable(struct xfs_mount *mp);
 
 xfs_extlen_t xfs_alloc_longest_free_extent(struct xfs_perag *pag,
diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index b42b8bc55729..28a9a6f8eb18 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -190,7 +190,7 @@ xfs_growfs_data_private(
 	if (nagimax)
 		mp->m_maxagi = nagimax;
 	xfs_set_low_space_thresholds(mp);
-	mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
+	mp->m_bmbt_split_setaside = xfs_bmbt_split_setaside(mp);
 
 	if (delta > 0) {
 		/*
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 96c997ed2ec8..30e22cd943c2 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -3351,7 +3351,7 @@ xlog_do_recover(
 		xfs_warn(mp, "Failed post-recovery per-ag init: %d", error);
 		return error;
 	}
-	mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
+	mp->m_bmbt_split_setaside = xfs_bmbt_split_setaside(mp);
 
 	xlog_recover_check_summary(log);
 
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 9336176dc706..eac9534338fd 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -656,7 +656,7 @@ xfs_mountfs(
 	 * Compute the amount of space to set aside to handle btree splits now
 	 * that we have calculated the btree maxlevels.
 	 */
-	mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
+	mp->m_bmbt_split_setaside = xfs_bmbt_split_setaside(mp);
 	mp->m_ag_max_usable = xfs_alloc_ag_max_usable(mp);
 
 	/*
@@ -1153,7 +1153,7 @@ xfs_mod_fdblocks(
 	 * problems (i.e. transaction abort, pagecache discards, etc.) than
 	 * slightly premature -ENOSPC.
 	 */
-	set_aside = mp->m_alloc_set_aside + atomic64_read(&mp->m_allocbt_blks);
+	set_aside = mp->m_bmbt_split_setaside + atomic64_read(&mp->m_allocbt_blks);
 	percpu_counter_add_batch(&mp->m_fdblocks, delta, batch);
 	if (__percpu_counter_compare(&mp->m_fdblocks, set_aside,
 				     XFS_FDBLOCKS_BATCH) >= 0) {
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 74e9b8558162..6c4cbd4a0c32 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -134,7 +134,8 @@ typedef struct xfs_mount {
 	uint			m_refc_maxlevels; /* max refcount btree level */
 	unsigned int		m_agbtree_maxlevels; /* max level of all AG btrees */
 	xfs_extlen_t		m_ag_prealloc_blocks; /* reserved ag blocks */
-	uint			m_alloc_set_aside; /* space we can't use */
+	/* space reserved to ensure bmbt splits always succeed */
+	unsigned int		m_bmbt_split_setaside;
 	uint			m_ag_max_usable; /* max space per AG */
 	int			m_dalign;	/* stripe unit */
 	int			m_swidth;	/* stripe width */
@@ -503,7 +504,7 @@ xfs_fdblocks_available(
 {
 	int64_t			free = percpu_counter_sum(&mp->m_fdblocks);
 
-	free -= mp->m_alloc_set_aside;
+	free -= mp->m_bmbt_split_setaside;
 	free -= atomic64_read(&mp->m_allocbt_blks);
 	return free;
 }
@@ -516,7 +517,7 @@ xfs_fdblocks_available_fast(
 	int64_t			free;
 
 	free = percpu_counter_read_positive(&mp->m_fdblocks);
-	free -= mp->m_alloc_set_aside;
+	free -= mp->m_bmbt_split_setaside;
 	free -= atomic64_read(&mp->m_allocbt_blks);
 	return free;
 }



* Re: [PATCH 1/6] xfs: document the XFS_ALLOC_AGFL_RESERVE constant
  2022-03-17 21:21 ` [PATCH 1/6] xfs: document the XFS_ALLOC_AGFL_RESERVE constant Darrick J. Wong
@ 2022-03-18 12:17   ` Brian Foster
  0 siblings, 0 replies; 20+ messages in thread
From: Brian Foster @ 2022-03-18 12:17 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, david

On Thu, Mar 17, 2022 at 02:21:01PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Currently, we use this undocumented macro to encode the minimum number
> of blocks needed to replenish a completely empty AGFL when an AG is
> nearly full.  This has led to confusion on the part of the maintainers,
> so let's document what the value actually means, and move it to
> xfs_alloc.c since it's not used outside of that module.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/libxfs/xfs_alloc.c |   23 ++++++++++++++++++-----
>  fs/xfs/libxfs/xfs_alloc.h |    1 -
>  2 files changed, 18 insertions(+), 6 deletions(-)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> index 353e53b892e6..b0678e96ce61 100644
> --- a/fs/xfs/libxfs/xfs_alloc.c
> +++ b/fs/xfs/libxfs/xfs_alloc.c
> @@ -82,6 +82,19 @@ xfs_prealloc_blocks(
>  }
>  
>  /*
> + * The number of blocks per AG that we withhold from xfs_mod_fdblocks to
> + * guarantee that we can refill the AGFL prior to allocating space in a nearly
> + * full AG.  We require two blocks per free space btree because free space
> + * btrees shrink to a single block as the AG fills up, and any allocation can
> + * cause a btree split.  The rmap btree uses a per-AG reservation to withhold
> + * space from xfs_mod_fdblocks, so we do not account for that here.
> + */
> +#define XFS_ALLOCBT_AGFL_RESERVE	4
> +
> +/*
> + * Compute the number of blocks that we set aside to guarantee the ability to
> + * refill the AGFL and handle a full bmap btree split.
> + *
>   * In order to avoid ENOSPC-related deadlock caused by out-of-order locking of
>   * AGF buffer (PV 947395), we place constraints on the relationship among
>   * actual allocations for data blocks, freelist blocks, and potential file data
> @@ -93,14 +106,14 @@ xfs_prealloc_blocks(
>   * extents need to be actually allocated. To get around this, we explicitly set
>   * aside a few blocks which will not be reserved in delayed allocation.
>   *
> - * We need to reserve 4 fsbs _per AG_ for the freelist and 4 more to handle a
> - * potential split of the file's bmap btree.
> + * For each AG, we need to reserve enough blocks to replenish a totally empty
> + * AGFL and 4 more to handle a potential split of the file's bmap btree.
>   */
>  unsigned int
>  xfs_alloc_set_aside(
>  	struct xfs_mount	*mp)
>  {
> -	return mp->m_sb.sb_agcount * (XFS_ALLOC_AGFL_RESERVE + 4);
> +	return mp->m_sb.sb_agcount * (XFS_ALLOCBT_AGFL_RESERVE + 4);
>  }
>  
>  /*
> @@ -124,12 +137,12 @@ xfs_alloc_ag_max_usable(
>  	unsigned int		blocks;
>  
>  	blocks = XFS_BB_TO_FSB(mp, XFS_FSS_TO_BB(mp, 4)); /* ag headers */
> -	blocks += XFS_ALLOC_AGFL_RESERVE;
> +	blocks += XFS_ALLOCBT_AGFL_RESERVE;
>  	blocks += 3;			/* AGF, AGI btree root blocks */
>  	if (xfs_has_finobt(mp))
>  		blocks++;		/* finobt root block */
>  	if (xfs_has_rmapbt(mp))
> -		blocks++; 		/* rmap root block */
> +		blocks++;		/* rmap root block */
>  	if (xfs_has_reflink(mp))
>  		blocks++;		/* refcount root block */
>  
> diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
> index 1c14a0b1abea..d4c057b764f9 100644
> --- a/fs/xfs/libxfs/xfs_alloc.h
> +++ b/fs/xfs/libxfs/xfs_alloc.h
> @@ -88,7 +88,6 @@ typedef struct xfs_alloc_arg {
>  #define XFS_ALLOC_NOBUSY		(1 << 2)/* Busy extents not allowed */
>  
>  /* freespace limit calculations */
> -#define XFS_ALLOC_AGFL_RESERVE	4
>  unsigned int xfs_alloc_set_aside(struct xfs_mount *mp);
>  unsigned int xfs_alloc_ag_max_usable(struct xfs_mount *mp);
>  
> 



* Re: [PATCH 2/6] xfs: actually set aside enough space to handle a bmbt split
  2022-03-17 21:21 ` [PATCH 2/6] xfs: actually set aside enough space to handle a bmbt split Darrick J. Wong
@ 2022-03-18 12:17   ` Brian Foster
  2022-03-18 20:52     ` Darrick J. Wong
  0 siblings, 1 reply; 20+ messages in thread
From: Brian Foster @ 2022-03-18 12:17 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, david

On Thu, Mar 17, 2022 at 02:21:06PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> The comment for xfs_alloc_set_aside indicates that we want to set aside
> enough space to handle a bmap btree split.  The code, unfortunately,
> hardcodes this to 4.
> 
> This is incorrect, since file bmap btrees can be taller than that:
> 
> xfs_db> btheight bmapbt -n 4294967296 -b 512
> bmapbt: worst case per 512-byte block: 13 records (leaf) / 13 keyptrs (node)
> level 0: 4294967296 records, 330382100 blocks
> level 1: 330382100 records, 25414008 blocks
> level 2: 25414008 records, 1954924 blocks
> level 3: 1954924 records, 150379 blocks
> level 4: 150379 records, 11568 blocks
> level 5: 11568 records, 890 blocks
> level 6: 890 records, 69 blocks
> level 7: 69 records, 6 blocks
> level 8: 6 records, 1 block
> 9 levels, 357913945 blocks total
> 
> Fix this by using the actual bmap btree maxlevel value for the
> set-aside.  We subtract one because the root is always in the inode and
> hence never splits.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/libxfs/xfs_alloc.c |    7 +++++--
>  fs/xfs/libxfs/xfs_sb.c    |    2 --
>  fs/xfs/xfs_mount.c        |    7 +++++++
>  3 files changed, 12 insertions(+), 4 deletions(-)
> 
> 
...
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index bed73e8002a5..9336176dc706 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -652,6 +652,13 @@ xfs_mountfs(
>  
>  	xfs_agbtree_compute_maxlevels(mp);
>  
> +	/*
> +	 * Compute the amount of space to set aside to handle btree splits now
> +	 * that we have calculated the btree maxlevels.
> +	 */

"... to handle btree splits near -ENOSPC ..." ?

Otherwise LGTM:

Reviewed-by: Brian Foster <bfoster@redhat.com>

> +	mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
> +	mp->m_ag_max_usable = xfs_alloc_ag_max_usable(mp);
> +
>  	/*
>  	 * Check if sb_agblocks is aligned at stripe boundary.  If sb_agblocks
>  	 * is NOT aligned turn off m_dalign since allocator alignment is within
> 



* Re: [PATCH 3/6] xfs: don't include bnobt blocks when reserving free block pool
  2022-03-17 21:21 ` [PATCH 3/6] xfs: don't include bnobt blocks when reserving free block pool Darrick J. Wong
@ 2022-03-18 12:18   ` Brian Foster
  2022-03-18 21:01     ` Darrick J. Wong
  0 siblings, 1 reply; 20+ messages in thread
From: Brian Foster @ 2022-03-18 12:18 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, david

On Thu, Mar 17, 2022 at 02:21:12PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> xfs_reserve_blocks controls the size of the user-visible free space
> reserve pool.  Given the difference between the current and requested
> pool sizes, it will try to reserve free space from fdblocks.  However,
> the amount requested from fdblocks is also constrained by the amount of
> space that we think xfs_mod_fdblocks will give us.  We'll keep trying to
> reserve space so long as xfs_mod_fdblocks returns ENOSPC.
> 
> In commit fd43cf600cf6, we decided that xfs_mod_fdblocks should not hand
> out the "free space" used by the free space btrees, because some portion
> of the free space btrees hold in reserve space for future btree
> expansion.  Unfortunately, xfs_reserve_blocks' estimation of the number
> of blocks that it could request from xfs_mod_fdblocks was not updated to
> include m_allocbt_blks, so if space is extremely low, the caller hangs.
> 
> Fix this by creating a function to estimate the number of blocks that
> can be reserved from fdblocks, which needs to exclude the set-aside and
> m_allocbt_blks.
> 
> Found by running xfs/306 (which formats a single-AG 20MB filesystem)
> with an fstests configuration that specifies a 1k blocksize and a
> specially crafted log size that will consume 7/8 of the space (17920
> blocks, specifically) in that AG.
> 
> Cc: Brian Foster <bfoster@redhat.com>
> Fixes: fd43cf600cf6 ("xfs: set aside allocation btree blocks from block reservation")
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/xfs_fsops.c |    7 +++++--
>  fs/xfs/xfs_mount.h |   29 +++++++++++++++++++++++++++++
>  2 files changed, 34 insertions(+), 2 deletions(-)
> 
> 
> diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> index 33e26690a8c4..b71799a3acd3 100644
> --- a/fs/xfs/xfs_fsops.c
> +++ b/fs/xfs/xfs_fsops.c
> @@ -433,8 +433,11 @@ xfs_reserve_blocks(
>  	 */
>  	error = -ENOSPC;
>  	do {
> -		free = percpu_counter_sum(&mp->m_fdblocks) -
> -						mp->m_alloc_set_aside;
> +		/*
> +		 * The reservation pool cannot take space that xfs_mod_fdblocks
> +		 * will not give us.
> +		 */

This comment seems unnecessary. I'm not sure what this is telling that
the code doesn't already..?

> +		free = xfs_fdblocks_available(mp);
>  		if (free <= 0)
>  			break;
>  
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index 00720a02e761..998b54c3c454 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -479,6 +479,35 @@ extern void	xfs_unmountfs(xfs_mount_t *);
>   */
>  #define XFS_FDBLOCKS_BATCH	1024
>  
> +/*
> + * Estimate the amount of space that xfs_mod_fdblocks might give us without
> + * drawing from the reservation pool.  In other words, estimate the free space
> + * that is available to userspace.
> + *
> + * This quantity is the amount of free space tracked in the on-disk metadata
> + * minus:
> + *
> + * - Delayed allocation reservations
> + * - Per-AG space reservations to guarantee metadata expansion
> + * - Userspace-controlled free space reserve pool
> + *
> + * - Space reserved to ensure that we can always split a bmap btree
> + * - Free space btree blocks that are not available for allocation due to
> + *   per-AG metadata reservations
> + *
> + * The first three are captured in the incore fdblocks counter.
> + */

Hm. Sometimes I wonder if we overdocument things to our own detriment
(reading back my own comments at times suggests I'm terrible at this).
So do we really need to document what other internal reservations are or
are not taken out of ->m_fdblocks here..? I suspect we already have
plenty of sufficient documentation for things like perag res colocated
with the actual code, such that this kind of thing just creates an
external reference that will probably just bitrot as years go by. Can we
reduce this down to just explain how/why this helper has to calculate a
block availability value for blocks that otherwise haven't been
explicitly allocated out of the in-core free block counters?

> +static inline int64_t
> +xfs_fdblocks_available(
> +	struct xfs_mount	*mp)
> +{
> +	int64_t			free = percpu_counter_sum(&mp->m_fdblocks);
> +
> +	free -= mp->m_alloc_set_aside;
> +	free -= atomic64_read(&mp->m_allocbt_blks);
> +	return free;
> +}
> +

FWIW the helper seems fine in context, but will this help us avoid the
duplicate calculation in xfs_mod_fdblocks(), for instance?

Brian

>  extern int	xfs_mod_fdblocks(struct xfs_mount *mp, int64_t delta,
>  				 bool reserved);
>  extern int	xfs_mod_frextents(struct xfs_mount *mp, int64_t delta);
> 



* Re: [PATCH 4/6] xfs: fix infinite loop when reserving free block pool
  2022-03-17 21:21 ` [PATCH 4/6] xfs: fix infinite loop " Darrick J. Wong
@ 2022-03-18 12:18   ` Brian Foster
  0 siblings, 0 replies; 20+ messages in thread
From: Brian Foster @ 2022-03-18 12:18 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, david

On Thu, Mar 17, 2022 at 02:21:17PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Don't spin in an infinite loop trying to reserve blocks -- if we can't
> do it after 30 tries, we're racing with a nearly full filesystem, so
> just give up.
> 
> Cc: Brian Foster <bfoster@redhat.com>
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_fsops.c |   12 ++++++++++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
> 
> 
> diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> index b71799a3acd3..4076b9004077 100644
> --- a/fs/xfs/xfs_fsops.c
> +++ b/fs/xfs/xfs_fsops.c
> @@ -379,6 +379,7 @@ xfs_reserve_blocks(
>  	int64_t			fdblks_delta = 0;
>  	uint64_t		request;
>  	int64_t			free;
> +	unsigned int		tries;
>  	int			error = 0;
>  
>  	/* If inval is null, report current values and return */
> @@ -430,9 +431,16 @@ xfs_reserve_blocks(
>  	 * If the request is larger than the current reservation, reserve the
>  	 * blocks before we update the reserve counters. Sample m_fdblocks and
>  	 * perform a partial reservation if the request exceeds free space.
> +	 *
> +	 * The loop body estimates how many blocks it can request from fdblocks
> +	 * to stash in the reserve pool.  This is a classic TOCTOU race since
> +	 * fdblocks updates are not always coordinated via m_sb_lock.  We also
> +	 * cannot tell if @free remaining unchanged between iterations is due
> +	 * to an idle system or freed blocks being consumed immediately, so
> +	 * we'll try a finite number of times to satisfy the request.
>  	 */
>  	error = -ENOSPC;
> -	do {
> +	for (tries = 0; tries < 30 && error == -ENOSPC; tries++) {
>  		/*
>  		 * The reservation pool cannot take space that xfs_mod_fdblocks
>  		 * will not give us.
> @@ -462,7 +470,7 @@ xfs_reserve_blocks(
>  		spin_unlock(&mp->m_sb_lock);
>  		error = xfs_mod_fdblocks(mp, -fdblks_delta, 0);
>  		spin_lock(&mp->m_sb_lock);
> -	} while (error == -ENOSPC);
> +	}
>  
>  	/*
>  	 * Update the reserve counters if blocks have been successfully
> 



* Re: [PATCH 5/6] xfs: don't report reserved bnobt space as available
  2022-03-17 21:21 ` [PATCH 5/6] xfs: don't report reserved bnobt space as available Darrick J. Wong
@ 2022-03-18 12:19   ` Brian Foster
  2022-03-18 21:19     ` Darrick J. Wong
  0 siblings, 1 reply; 20+ messages in thread
From: Brian Foster @ 2022-03-18 12:19 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, david

On Thu, Mar 17, 2022 at 02:21:23PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> On a modern filesystem, we don't allow userspace to allocate blocks for
> data storage from the per-AG space reservations, the user-controlled
> reservation pool that prevents ENOSPC in the middle of internal
> operations, or the internal per-AG set-aside that prevents ENOSPC.
> Since we now consider free space btree blocks as unavailable for
> allocation for data storage, we shouldn't report those blocks via statfs
> either.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/xfs_fsops.c |    3 +--
>  fs/xfs/xfs_mount.h |   13 +++++++++++++
>  fs/xfs/xfs_super.c |    4 +---
>  3 files changed, 15 insertions(+), 5 deletions(-)
> 
> 
...
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index 998b54c3c454..74e9b8558162 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -508,6 +508,19 @@ xfs_fdblocks_available(
>  	return free;
>  }
>  
> +/* Same as above, but don't take the slow path. */
> +static inline int64_t
> +xfs_fdblocks_available_fast(
> +	struct xfs_mount	*mp)
> +{
> +	int64_t			free;
> +
> +	free = percpu_counter_read_positive(&mp->m_fdblocks);
> +	free -= mp->m_alloc_set_aside;
> +	free -= atomic64_read(&mp->m_allocbt_blks);
> +	return free;
> +}
> +

No objection to the behavior change, but the point of the helper should
be to simplify things and reduce duplication. Here it seems we're going
to continue duplicating the set aside calculation, just in separate
helpers because different contexts apparently have different ways of
reading the free space counters (?).

If that's the case and we want an _available() helper, can we create a
single helper that takes the fdblocks count as a parameter and returns
the final "available" value so the helper can be used more broadly and
consistently? (Or factor out the common bits into an internal helper and
turn these two into simple parameter passing wrappers if you really want
to keep the api as such).
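
Something along these lines is what I have in mind (hypothetical names,
not part of the posted series), so both the sum and read_positive callers
share one calculation:

/* Hypothetical: compute "available" from a caller-supplied fdblocks sample. */
static inline int64_t
xfs_fdblocks_available_from(
	struct xfs_mount	*mp,
	int64_t			fdblocks)
{
	return fdblocks - mp->m_alloc_set_aside -
			atomic64_read(&mp->m_allocbt_blks);
}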

Brian

>  extern int	xfs_mod_fdblocks(struct xfs_mount *mp, int64_t delta,
>  				 bool reserved);
>  extern int	xfs_mod_frextents(struct xfs_mount *mp, int64_t delta);
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index d84714e4e46a..7b6c147e63c4 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -791,7 +791,6 @@ xfs_fs_statfs(
>  	uint64_t		fakeinos, id;
>  	uint64_t		icount;
>  	uint64_t		ifree;
> -	uint64_t		fdblocks;
>  	xfs_extlen_t		lsize;
>  	int64_t			ffree;
>  
> @@ -806,7 +805,6 @@ xfs_fs_statfs(
>  
>  	icount = percpu_counter_sum(&mp->m_icount);
>  	ifree = percpu_counter_sum(&mp->m_ifree);
> -	fdblocks = percpu_counter_sum(&mp->m_fdblocks);
>  
>  	spin_lock(&mp->m_sb_lock);
>  	statp->f_bsize = sbp->sb_blocksize;
> @@ -815,7 +813,7 @@ xfs_fs_statfs(
>  	spin_unlock(&mp->m_sb_lock);
>  
>  	/* make sure statp->f_bfree does not underflow */
> -	statp->f_bfree = max_t(int64_t, fdblocks - mp->m_alloc_set_aside, 0);
> +	statp->f_bfree = max_t(int64_t, xfs_fdblocks_available(mp), 0);
>  	statp->f_bavail = statp->f_bfree;
>  
>  	fakeinos = XFS_FSB_TO_INO(mp, statp->f_bfree);
> 



* Re: [PATCH 6/6] xfs: rename "alloc_set_aside" to be more descriptive
  2022-03-17 21:21 ` [PATCH 6/6] xfs: rename "alloc_set_aside" to be more descriptive Darrick J. Wong
@ 2022-03-18 12:21   ` Brian Foster
  0 siblings, 0 replies; 20+ messages in thread
From: Brian Foster @ 2022-03-18 12:21 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, david

On Thu, Mar 17, 2022 at 02:21:29PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> We've established in this patchset that the "alloc_set_aside" pool is
> actually used to ensure that a bmbt split always succeeds so that the
> filesystem won't run out of space mid-transaction and crash.  Rename the
> variable and the function to be a little more suggestive of the purpose
> of this quantity.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/libxfs/xfs_alloc.c |    4 ++--
>  fs/xfs/libxfs/xfs_alloc.h |    2 +-
>  fs/xfs/xfs_fsops.c        |    2 +-
>  fs/xfs/xfs_log_recover.c  |    2 +-
>  fs/xfs/xfs_mount.c        |    4 ++--
>  fs/xfs/xfs_mount.h        |    7 ++++---
>  6 files changed, 11 insertions(+), 10 deletions(-)
> 
> 
...
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index 9336176dc706..eac9534338fd 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -656,7 +656,7 @@ xfs_mountfs(
>  	 * Compute the amount of space to set aside to handle btree splits now
>  	 * that we have calculated the btree maxlevels.
>  	 */
> -	mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
> +	mp->m_bmbt_split_setaside = xfs_bmbt_split_setaside(mp);
>  	mp->m_ag_max_usable = xfs_alloc_ag_max_usable(mp);
>  
>  	/*
> @@ -1153,7 +1153,7 @@ xfs_mod_fdblocks(
>  	 * problems (i.e. transaction abort, pagecache discards, etc.) than
>  	 * slightly premature -ENOSPC.
>  	 */
> -	set_aside = mp->m_alloc_set_aside + atomic64_read(&mp->m_allocbt_blks);
> +	set_aside = mp->m_bmbt_split_setaside + atomic64_read(&mp->m_allocbt_blks);

IMO the whole end result would be simpler if the helper(s) were
written to fully isolate usage of ->m_bmbt_split_setaside to within those
helpers, as opposed to continuing to leave around the need to open code
the set aside calculation anywhere. That said, this is more commentary
on the previous couple or so patches than this one and this is an
improvement overall:

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  	percpu_counter_add_batch(&mp->m_fdblocks, delta, batch);
>  	if (__percpu_counter_compare(&mp->m_fdblocks, set_aside,
>  				     XFS_FDBLOCKS_BATCH) >= 0) {
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index 74e9b8558162..6c4cbd4a0c32 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -134,7 +134,8 @@ typedef struct xfs_mount {
>  	uint			m_refc_maxlevels; /* max refcount btree level */
>  	unsigned int		m_agbtree_maxlevels; /* max level of all AG btrees */
>  	xfs_extlen_t		m_ag_prealloc_blocks; /* reserved ag blocks */
> -	uint			m_alloc_set_aside; /* space we can't use */
> +	/* space reserved to ensure bmbt splits always succeed */
> +	unsigned int		m_bmbt_split_setaside;
>  	uint			m_ag_max_usable; /* max space per AG */
>  	int			m_dalign;	/* stripe unit */
>  	int			m_swidth;	/* stripe width */
> @@ -503,7 +504,7 @@ xfs_fdblocks_available(
>  {
>  	int64_t			free = percpu_counter_sum(&mp->m_fdblocks);
>  
> -	free -= mp->m_alloc_set_aside;
> +	free -= mp->m_bmbt_split_setaside;
>  	free -= atomic64_read(&mp->m_allocbt_blks);
>  	return free;
>  }
> @@ -516,7 +517,7 @@ xfs_fdblocks_available_fast(
>  	int64_t			free;
>  
>  	free = percpu_counter_read_positive(&mp->m_fdblocks);
> -	free -= mp->m_alloc_set_aside;
> +	free -= mp->m_bmbt_split_setaside;
>  	free -= atomic64_read(&mp->m_allocbt_blks);
>  	return free;
>  }
> 



* Re: [PATCH 2/6] xfs: actually set aside enough space to handle a bmbt split
  2022-03-18 12:17   ` Brian Foster
@ 2022-03-18 20:52     ` Darrick J. Wong
  0 siblings, 0 replies; 20+ messages in thread
From: Darrick J. Wong @ 2022-03-18 20:52 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, david

On Fri, Mar 18, 2022 at 08:17:53AM -0400, Brian Foster wrote:
> On Thu, Mar 17, 2022 at 02:21:06PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > The comment for xfs_alloc_set_aside indicates that we want to set aside
> > enough space to handle a bmap btree split.  The code, unfortunately,
> > hardcodes this to 4.
> > 
> > This is incorrect, since file bmap btrees can be taller than that:
> > 
> > xfs_db> btheight bmapbt -n 4294967296 -b 512
> > bmapbt: worst case per 512-byte block: 13 records (leaf) / 13 keyptrs (node)
> > level 0: 4294967296 records, 330382100 blocks
> > level 1: 330382100 records, 25414008 blocks
> > level 2: 25414008 records, 1954924 blocks
> > level 3: 1954924 records, 150379 blocks
> > level 4: 150379 records, 11568 blocks
> > level 5: 11568 records, 890 blocks
> > level 6: 890 records, 69 blocks
> > level 7: 69 records, 6 blocks
> > level 8: 6 records, 1 block
> > 9 levels, 357913945 blocks total
> > 
> > Fix this by using the actual bmap btree maxlevel value for the
> > set-aside.  We subtract one because the root is always in the inode and
> > hence never splits.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  fs/xfs/libxfs/xfs_alloc.c |    7 +++++--
> >  fs/xfs/libxfs/xfs_sb.c    |    2 --
> >  fs/xfs/xfs_mount.c        |    7 +++++++
> >  3 files changed, 12 insertions(+), 4 deletions(-)
> > 
> > 
> ...
> > diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> > index bed73e8002a5..9336176dc706 100644
> > --- a/fs/xfs/xfs_mount.c
> > +++ b/fs/xfs/xfs_mount.c
> > @@ -652,6 +652,13 @@ xfs_mountfs(
> >  
> >  	xfs_agbtree_compute_maxlevels(mp);
> >  
> > +	/*
> > +	 * Compute the amount of space to set aside to handle btree splits now
> > +	 * that we have calculated the btree maxlevels.
> > +	 */
> 
> "... to handle btree splits near -ENOSPC ..." ?

Fixed; thanks for the review!

--D

> Otherwise LGTM:
> 
> Reviewed-by: Brian Foster <bfoster@redhat.com>
> 
> > +	mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
> > +	mp->m_ag_max_usable = xfs_alloc_ag_max_usable(mp);
> > +
> >  	/*
> >  	 * Check if sb_agblocks is aligned at stripe boundary.  If sb_agblocks
> >  	 * is NOT aligned turn off m_dalign since allocator alignment is within
> > 
> 


* Re: [PATCH 3/6] xfs: don't include bnobt blocks when reserving free block pool
  2022-03-18 12:18   ` Brian Foster
@ 2022-03-18 21:01     ` Darrick J. Wong
  0 siblings, 0 replies; 20+ messages in thread
From: Darrick J. Wong @ 2022-03-18 21:01 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, david

On Fri, Mar 18, 2022 at 08:18:37AM -0400, Brian Foster wrote:
> On Thu, Mar 17, 2022 at 02:21:12PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > xfs_reserve_blocks controls the size of the user-visible free space
> > reserve pool.  Given the difference between the current and requested
> > pool sizes, it will try to reserve free space from fdblocks.  However,
> > the amount requested from fdblocks is also constrained by the amount of
> > space that we think xfs_mod_fdblocks will give us.  We'll keep trying to
> > reserve space so long as xfs_mod_fdblocks returns ENOSPC.
> > 
> > In commit fd43cf600cf6, we decided that xfs_mod_fdblocks should not hand
> > out the "free space" used by the free space btrees, because some portion
> > of the free space btrees hold in reserve space for future btree
> > expansion.  Unfortunately, xfs_reserve_blocks' estimation of the number
> > of blocks that it could request from xfs_mod_fdblocks was not updated to
> > include m_allocbt_blks, so if space is extremely low, the caller hangs.
> > 
> > Fix this by creating a function to estimate the number of blocks that
> > can be reserved from fdblocks, which needs to exclude the set-aside and
> > m_allocbt_blks.
> > 
> > Found by running xfs/306 (which formats a single-AG 20MB filesystem)
> > with an fstests configuration that specifies a 1k blocksize and a
> > specially crafted log size that will consume 7/8 of the space (17920
> > blocks, specifically) in that AG.
> > 
> > Cc: Brian Foster <bfoster@redhat.com>
> > Fixes: fd43cf600cf6 ("xfs: set aside allocation btree blocks from block reservation")
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  fs/xfs/xfs_fsops.c |    7 +++++--
> >  fs/xfs/xfs_mount.h |   29 +++++++++++++++++++++++++++++
> >  2 files changed, 34 insertions(+), 2 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> > index 33e26690a8c4..b71799a3acd3 100644
> > --- a/fs/xfs/xfs_fsops.c
> > +++ b/fs/xfs/xfs_fsops.c
> > @@ -433,8 +433,11 @@ xfs_reserve_blocks(
> >  	 */
> >  	error = -ENOSPC;
> >  	do {
> > -		free = percpu_counter_sum(&mp->m_fdblocks) -
> > -						mp->m_alloc_set_aside;
> > +		/*
> > +		 * The reservation pool cannot take space that xfs_mod_fdblocks
> > +		 * will not give us.
> > +		 */
> 
> This comment seems unnecessary. I'm not sure what this is telling that
> the code doesn't already..?

Yeah, I'll get rid of it.

> > +		free = xfs_fdblocks_available(mp);
> >  		if (free <= 0)
> >  			break;
> >  
> > diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> > index 00720a02e761..998b54c3c454 100644
> > --- a/fs/xfs/xfs_mount.h
> > +++ b/fs/xfs/xfs_mount.h
> > @@ -479,6 +479,35 @@ extern void	xfs_unmountfs(xfs_mount_t *);
> >   */
> >  #define XFS_FDBLOCKS_BATCH	1024
> >  
> > +/*
> > + * Estimate the amount of space that xfs_mod_fdblocks might give us without
> > + * drawing from the reservation pool.  In other words, estimate the free space
> > + * that is available to userspace.
> > + *
> > + * This quantity is the amount of free space tracked in the on-disk metadata
> > + * minus:
> > + *
> > + * - Delayed allocation reservations
> > + * - Per-AG space reservations to guarantee metadata expansion
> > + * - Userspace-controlled free space reserve pool
> > + *
> > + * - Space reserved to ensure that we can always split a bmap btree
> > + * - Free space btree blocks that are not available for allocation due to
> > + *   per-AG metadata reservations
> > + *
> > + * The first three are captured in the incore fdblocks counter.
> > + */
> 
> Hm. Sometimes I wonder if we overdocument things to our own detriment
> (reading back my own comments at times suggests I'm terrible at this).
> So do we really need to document what other internal reservations are or
> are not taken out of ->m_fdblocks here..? I suspect we already have
> plenty of sufficient documentation for things like perag res colocated
> with the actual code, such that this kind of thing just creates an
> external reference that will probably just bitrot as years go by. Can we
> reduce this down to just explain how/why this helper has to calculate a
> block availability value for blocks that otherwise haven't been
> explicitly allocated out of the in-core free block counters?

Hmm.  I suppose I could reduce the comment at the same time that I split
out the code that computes the amount of free space that isn't
available.

> > +static inline int64_t
> > +xfs_fdblocks_available(
> > +	struct xfs_mount	*mp)
> > +{
> > +	int64_t			free = percpu_counter_sum(&mp->m_fdblocks);
> > +
> > +	free -= mp->m_alloc_set_aside;
> > +	free -= atomic64_read(&mp->m_allocbt_blks);
> > +	return free;
> > +}
> > +
> 
> FWIW the helper seems fine in context, but will this help us avoid the
> duplicate calculation in xfs_mod_fdblocks(), for instance?

It will once I turn that into:


/*
 * Estimate the amount of free space that is not available to userspace
 * and is not explicitly reserved from the incore fdblocks:
 *
 * - Space reserved to ensure that we can always split a bmap btree
 * - Free space btree blocks that are not available for allocation due
 *   to per-AG metadata reservations
 */
static inline uint64_t
xfs_fdblocks_unavailable(
	struct xfs_mount	*mp)
{
	return mp->m_alloc_set_aside + atomic64_read(&mp->m_allocbt_blks);
}

/*
 * Estimate the amount of space that xfs_mod_fdblocks might give us
 * without drawing from any reservation pool.  In other words, estimate
 * the free space that is available to userspace.
 */
static inline int64_t
xfs_fdblocks_available(
	struct xfs_mount	*mp)
{
	return percpu_counter_sum(&mp->m_fdblocks) -
			xfs_fdblocks_unavailable(mp);
}

--D

> 
> Brian
> 
> >  extern int	xfs_mod_fdblocks(struct xfs_mount *mp, int64_t delta,
> >  				 bool reserved);
> >  extern int	xfs_mod_frextents(struct xfs_mount *mp, int64_t delta);
> > 
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 5/6] xfs: don't report reserved bnobt space as available
  2022-03-18 12:19   ` Brian Foster
@ 2022-03-18 21:19     ` Darrick J. Wong
  0 siblings, 0 replies; 20+ messages in thread
From: Darrick J. Wong @ 2022-03-18 21:19 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, david

On Fri, Mar 18, 2022 at 08:19:48AM -0400, Brian Foster wrote:
> On Thu, Mar 17, 2022 at 02:21:23PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > On a modern filesystem, we don't allow userspace to allocate blocks for
> > data storage from the per-AG space reservations, the user-controlled
> > reservation pool that prevents ENOSPC in the middle of internal
> > operations, or the internal per-AG set-aside that prevents ENOSPC.
> > Since we now consider free space btree blocks as unavailable for
> > allocation for data storage, we shouldn't report those blocks via statfs
> > either.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  fs/xfs/xfs_fsops.c |    3 +--
> >  fs/xfs/xfs_mount.h |   13 +++++++++++++
> >  fs/xfs/xfs_super.c |    4 +---
> >  3 files changed, 15 insertions(+), 5 deletions(-)
> > 
> > 
> ...
> > diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> > index 998b54c3c454..74e9b8558162 100644
> > --- a/fs/xfs/xfs_mount.h
> > +++ b/fs/xfs/xfs_mount.h
> > @@ -508,6 +508,19 @@ xfs_fdblocks_available(
> >  	return free;
> >  }
> >  
> > +/* Same as above, but don't take the slow path. */
> > +static inline int64_t
> > +xfs_fdblocks_available_fast(
> > +	struct xfs_mount	*mp)
> > +{
> > +	int64_t			free;
> > +
> > +	free = percpu_counter_read_positive(&mp->m_fdblocks);
> > +	free -= mp->m_alloc_set_aside;
> > +	free -= atomic64_read(&mp->m_allocbt_blks);
> > +	return free;
> > +}
> > +
> 
> No objection to the behavior change, but the point of the helper should
> be to simplify things and reduce duplication. Here it seems we're going
> to continue duplicating the set aside calculation, just in separate
> helpers because different contexts apparently have different ways of
> reading the free space counters (?).
> 
> If that's the case and we want an _available() helper, can we create a
> single helper that takes the fdblocks count as a parameter and returns
> the final "available" value so the helper can be used more broadly and
> consistently? (Or factor out the common bits into an internal helper and
> turn these two into simple parameter passing wrappers if you really want
> to keep the api as such).

In the end, I deleted xfs_fdblocks_avail* so the only new function is
the xfs_fdblocks_unavailable() that I mentioned in the last reply.

It does make the patches smaller...
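
If we do end up wanting the parameterised form you're describing, I think
it'd just be a thin wrapper around the same helper.  Purely for
illustration (hypothetical name, not part of this series):

/*
 * Hypothetical helper: compute the space available to userspace from a
 * caller-supplied sample of the free block counter, so each caller can
 * pick percpu_counter_sum() or percpu_counter_read_positive() itself.
 */
static inline int64_t
xfs_fdblocks_available_from(
	struct xfs_mount	*mp,
	int64_t			fdblocks)
{
	return fdblocks - xfs_fdblocks_unavailable(mp);
}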

--D

> Brian
> 
> >  extern int	xfs_mod_fdblocks(struct xfs_mount *mp, int64_t delta,
> >  				 bool reserved);
> >  extern int	xfs_mod_frextents(struct xfs_mount *mp, int64_t delta);
> > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > index d84714e4e46a..7b6c147e63c4 100644
> > --- a/fs/xfs/xfs_super.c
> > +++ b/fs/xfs/xfs_super.c
> > @@ -791,7 +791,6 @@ xfs_fs_statfs(
> >  	uint64_t		fakeinos, id;
> >  	uint64_t		icount;
> >  	uint64_t		ifree;
> > -	uint64_t		fdblocks;
> >  	xfs_extlen_t		lsize;
> >  	int64_t			ffree;
> >  
> > @@ -806,7 +805,6 @@ xfs_fs_statfs(
> >  
> >  	icount = percpu_counter_sum(&mp->m_icount);
> >  	ifree = percpu_counter_sum(&mp->m_ifree);
> > -	fdblocks = percpu_counter_sum(&mp->m_fdblocks);
> >  
> >  	spin_lock(&mp->m_sb_lock);
> >  	statp->f_bsize = sbp->sb_blocksize;
> > @@ -815,7 +813,7 @@ xfs_fs_statfs(
> >  	spin_unlock(&mp->m_sb_lock);
> >  
> >  	/* make sure statp->f_bfree does not underflow */
> > -	statp->f_bfree = max_t(int64_t, fdblocks - mp->m_alloc_set_aside, 0);
> > +	statp->f_bfree = max_t(int64_t, xfs_fdblocks_available(mp), 0);
> >  	statp->f_bavail = statp->f_bfree;
> >  
> >  	fakeinos = XFS_FSB_TO_INO(mp, statp->f_bfree);
> > 
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/6] xfs: actually set aside enough space to handle a bmbt split
  2022-03-24  5:26     ` Darrick J. Wong
@ 2022-03-24  6:00       ` Dave Chinner
  0 siblings, 0 replies; 20+ messages in thread
From: Dave Chinner @ 2022-03-24  6:00 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Brian Foster, linux-xfs

On Wed, Mar 23, 2022 at 10:26:14PM -0700, Darrick J. Wong wrote:
> On Thu, Mar 24, 2022 at 07:48:56AM +1100, Dave Chinner wrote:
> > On Sun, Mar 20, 2022 at 09:43:38AM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > > 
> > > The comment for xfs_alloc_set_aside indicates that we want to set aside
> > > enough space to handle a bmap btree split.  The code, unfortunately,
> > > hardcodes this to 4.
> > > 
> > > This is incorrect, since file bmap btrees can be taller than that:
> > > 
> > > xfs_db> btheight bmapbt -n 4294967295 -b 512
> > > bmapbt: worst case per 512-byte block: 13 records (leaf) / 13 keyptrs (node)
> > > level 0: 4294967295 records, 330382100 blocks
> > > level 1: 330382100 records, 25414008 blocks
> > > level 2: 25414008 records, 1954924 blocks
> > > level 3: 1954924 records, 150379 blocks
> > > level 4: 150379 records, 11568 blocks
> > > level 5: 11568 records, 890 blocks
> > > level 6: 890 records, 69 blocks
> > > level 7: 69 records, 6 blocks
> > > level 8: 6 records, 1 block
> > > 9 levels, 357913945 blocks total
> > > 
> > > Or, for V5 filesystems:
> > > 
> > > xfs_db> btheight bmapbt -n 4294967295 -b 1024
> > > bmapbt: worst case per 1024-byte block: 29 records (leaf) / 29 keyptrs (node)
> > > level 0: 4294967295 records, 148102321 blocks
> > > level 1: 148102321 records, 5106977 blocks
> > > level 2: 5106977 records, 176103 blocks
> > > level 3: 176103 records, 6073 blocks
> > > level 4: 6073 records, 210 blocks
> > > level 5: 210 records, 8 blocks
> > > level 6: 8 records, 1 block
> > > 7 levels, 153391693 blocks total
> > > 
> > > Fix this by using the actual bmap btree maxlevel value for the
> > > set-aside.  We subtract one because the root is always in the inode and
> > > hence never splits.
> > > 
> > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > Reviewed-by: Brian Foster <bfoster@redhat.com>
> > > ---
> > >  fs/xfs/libxfs/xfs_alloc.c |    7 +++++--
> > >  fs/xfs/libxfs/xfs_sb.c    |    2 --
> > >  fs/xfs/xfs_mount.c        |    7 +++++++
> > >  3 files changed, 12 insertions(+), 4 deletions(-)
> > > 
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> > > index b0678e96ce61..747b3e45303f 100644
> > > --- a/fs/xfs/libxfs/xfs_alloc.c
> > > +++ b/fs/xfs/libxfs/xfs_alloc.c
> > > @@ -107,13 +107,16 @@ xfs_prealloc_blocks(
> > >   * aside a few blocks which will not be reserved in delayed allocation.
> > >   *
> > >   * For each AG, we need to reserve enough blocks to replenish a totally empty
> > > - * AGFL and 4 more to handle a potential split of the file's bmap btree.
> > > + * AGFL and enough to handle a potential split of a file's bmap btree.
> > >   */
> > >  unsigned int
> > >  xfs_alloc_set_aside(
> > >  	struct xfs_mount	*mp)
> > >  {
> > > -	return mp->m_sb.sb_agcount * (XFS_ALLOCBT_AGFL_RESERVE + 4);
> > > +	unsigned int		bmbt_splits;
> > > +
> > > +	bmbt_splits = max(mp->m_bm_maxlevels[0], mp->m_bm_maxlevels[1]) - 1;
> > > +	return mp->m_sb.sb_agcount * (XFS_ALLOCBT_AGFL_RESERVE + bmbt_splits);
> > >  }
> > 
> > So right now I'm trying to understand why this global space set
> > aside ever needed to take into account the space used by a single
> > BMBT split. ISTR it was done back in 2006 because the ag selection
> > code, alloc args and/or xfs_alloc_space_available() didn't take into
> > account the BMBT space via args->minleft correctly to ensure that
> > the AGF we select had enough space in it for both the data extent
> > and the followup BMBT split. Hence the original SET ASIDE (which
> > wasn't per AG) was just 8 blocks - 4 for the AGFL, 4 for the BMBT.
> > 
> > The transaction reservation takes into account the space needed by
> > BMBT splits so we don't over-commit global space on user allocation
> > anymore, args->minleft takes it into account so we don't overcommit
> > AG space on extent allocation, and we have the reserved block pool
> > to handle situations where delalloc extents are fragmented more than
> > the initial indirect block reservation that is taken with the
> > delalloc extent reservation.
> > 
> > So where/why is this needed anymore?
> 
> Beats me.  I started by reading the comment and thought "Gee, that's not
> true of bmbts!" :(

Yeah. Time and knowledge are showing through here. I've been running
ENOSPC tests with the BMBT reservation part removed now for several
hours and I haven't seen any issues as a result of removing it. I'll
keep it running though to see if anything falls out, but my initial
impression is that we don't obviously need it anymore.
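
FWIW, the hack I'm testing is nothing fancier than dropping the bmbt term
from the set-aside, i.e. something like this (illustrative only, not a
formal patch):

unsigned int
xfs_alloc_set_aside(
	struct xfs_mount	*mp)
{
	/* Only set aside enough per AG to refill an empty AGFL. */
	return mp->m_sb.sb_agcount * XFS_ALLOCBT_AGFL_RESERVE;
}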

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/6] xfs: actually set aside enough space to handle a bmbt split
  2022-03-23 20:48   ` Dave Chinner
@ 2022-03-24  5:26     ` Darrick J. Wong
  2022-03-24  6:00       ` Dave Chinner
  0 siblings, 1 reply; 20+ messages in thread
From: Darrick J. Wong @ 2022-03-24  5:26 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Brian Foster, linux-xfs

On Thu, Mar 24, 2022 at 07:48:56AM +1100, Dave Chinner wrote:
> On Sun, Mar 20, 2022 at 09:43:38AM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > The comment for xfs_alloc_set_aside indicates that we want to set aside
> > enough space to handle a bmap btree split.  The code, unfortunately,
> > hardcodes this to 4.
> > 
> > This is incorrect, since file bmap btrees can be taller than that:
> > 
> > xfs_db> btheight bmapbt -n 4294967295 -b 512
> > bmapbt: worst case per 512-byte block: 13 records (leaf) / 13 keyptrs (node)
> > level 0: 4294967295 records, 330382100 blocks
> > level 1: 330382100 records, 25414008 blocks
> > level 2: 25414008 records, 1954924 blocks
> > level 3: 1954924 records, 150379 blocks
> > level 4: 150379 records, 11568 blocks
> > level 5: 11568 records, 890 blocks
> > level 6: 890 records, 69 blocks
> > level 7: 69 records, 6 blocks
> > level 8: 6 records, 1 block
> > 9 levels, 357913945 blocks total
> > 
> > Or, for V5 filesystems:
> > 
> > xfs_db> btheight bmapbt -n 4294967295 -b 1024
> > bmapbt: worst case per 1024-byte block: 29 records (leaf) / 29 keyptrs (node)
> > level 0: 4294967295 records, 148102321 blocks
> > level 1: 148102321 records, 5106977 blocks
> > level 2: 5106977 records, 176103 blocks
> > level 3: 176103 records, 6073 blocks
> > level 4: 6073 records, 210 blocks
> > level 5: 210 records, 8 blocks
> > level 6: 8 records, 1 block
> > 7 levels, 153391693 blocks total
> > 
> > Fix this by using the actual bmap btree maxlevel value for the
> > set-aside.  We subtract one because the root is always in the inode and
> > hence never splits.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > Reviewed-by: Brian Foster <bfoster@redhat.com>
> > ---
> >  fs/xfs/libxfs/xfs_alloc.c |    7 +++++--
> >  fs/xfs/libxfs/xfs_sb.c    |    2 --
> >  fs/xfs/xfs_mount.c        |    7 +++++++
> >  3 files changed, 12 insertions(+), 4 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> > index b0678e96ce61..747b3e45303f 100644
> > --- a/fs/xfs/libxfs/xfs_alloc.c
> > +++ b/fs/xfs/libxfs/xfs_alloc.c
> > @@ -107,13 +107,16 @@ xfs_prealloc_blocks(
> >   * aside a few blocks which will not be reserved in delayed allocation.
> >   *
> >   * For each AG, we need to reserve enough blocks to replenish a totally empty
> > - * AGFL and 4 more to handle a potential split of the file's bmap btree.
> > + * AGFL and enough to handle a potential split of a file's bmap btree.
> >   */
> >  unsigned int
> >  xfs_alloc_set_aside(
> >  	struct xfs_mount	*mp)
> >  {
> > -	return mp->m_sb.sb_agcount * (XFS_ALLOCBT_AGFL_RESERVE + 4);
> > +	unsigned int		bmbt_splits;
> > +
> > +	bmbt_splits = max(mp->m_bm_maxlevels[0], mp->m_bm_maxlevels[1]) - 1;
> > +	return mp->m_sb.sb_agcount * (XFS_ALLOCBT_AGFL_RESERVE + bmbt_splits);
> >  }
> 
> So right now I'm trying to understand why this global space set
> aside ever needed to take into account the space used by a single
> BMBT split. ISTR it was done back in 2006 because the ag selection
> code, alloc args and/or xfs_alloc_space_available() didn't take into
> account the BMBT space via args->minleft correctly to ensure that
> the AGF we select had enough space in it for both the data extent
> and the followup BMBT split. Hence the original SET ASIDE (which
> wasn't per AG) was just 8 blocks - 4 for the AGFL, 4 for the BMBT.
> 
> The transaction reservation takes into account the space needed by
> BMBT splits so we don't over-commit global space on user allocation
> anymore, args->minleft takes it into account so we don't overcommit
> AG space on extent allocation, and we have the reserved block pool
> to handle situations where delalloc extents are fragmented more than
> the initial indirect block reservation that is taken with the
> delalloc extent reservation.
> 
> So where/why is this needed anymore?

Beats me.  I started by reading the comment and thought "Gee, that's not
true of bmbts!" :(

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/6] xfs: actually set aside enough space to handle a bmbt split
  2022-03-20 16:43 ` [PATCH 2/6] xfs: actually set aside enough space to handle a bmbt split Darrick J. Wong
@ 2022-03-23 20:48   ` Dave Chinner
  2022-03-24  5:26     ` Darrick J. Wong
  0 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2022-03-23 20:48 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Brian Foster, linux-xfs

On Sun, Mar 20, 2022 at 09:43:38AM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> The comment for xfs_alloc_set_aside indicates that we want to set aside
> enough space to handle a bmap btree split.  The code, unfortunately,
> hardcodes this to 4.
> 
> This is incorrect, since file bmap btrees can be taller than that:
> 
> xfs_db> btheight bmapbt -n 4294967295 -b 512
> bmapbt: worst case per 512-byte block: 13 records (leaf) / 13 keyptrs (node)
> level 0: 4294967295 records, 330382100 blocks
> level 1: 330382100 records, 25414008 blocks
> level 2: 25414008 records, 1954924 blocks
> level 3: 1954924 records, 150379 blocks
> level 4: 150379 records, 11568 blocks
> level 5: 11568 records, 890 blocks
> level 6: 890 records, 69 blocks
> level 7: 69 records, 6 blocks
> level 8: 6 records, 1 block
> 9 levels, 357913945 blocks total
> 
> Or, for V5 filesystems:
> 
> xfs_db> btheight bmapbt -n 4294967295 -b 1024
> bmapbt: worst case per 1024-byte block: 29 records (leaf) / 29 keyptrs (node)
> level 0: 4294967295 records, 148102321 blocks
> level 1: 148102321 records, 5106977 blocks
> level 2: 5106977 records, 176103 blocks
> level 3: 176103 records, 6073 blocks
> level 4: 6073 records, 210 blocks
> level 5: 210 records, 8 blocks
> level 6: 8 records, 1 block
> 7 levels, 153391693 blocks total
> 
> Fix this by using the actual bmap btree maxlevel value for the
> set-aside.  We subtract one because the root is always in the inode and
> hence never splits.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> Reviewed-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_alloc.c |    7 +++++--
>  fs/xfs/libxfs/xfs_sb.c    |    2 --
>  fs/xfs/xfs_mount.c        |    7 +++++++
>  3 files changed, 12 insertions(+), 4 deletions(-)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> index b0678e96ce61..747b3e45303f 100644
> --- a/fs/xfs/libxfs/xfs_alloc.c
> +++ b/fs/xfs/libxfs/xfs_alloc.c
> @@ -107,13 +107,16 @@ xfs_prealloc_blocks(
>   * aside a few blocks which will not be reserved in delayed allocation.
>   *
>   * For each AG, we need to reserve enough blocks to replenish a totally empty
> - * AGFL and 4 more to handle a potential split of the file's bmap btree.
> + * AGFL and enough to handle a potential split of a file's bmap btree.
>   */
>  unsigned int
>  xfs_alloc_set_aside(
>  	struct xfs_mount	*mp)
>  {
> -	return mp->m_sb.sb_agcount * (XFS_ALLOCBT_AGFL_RESERVE + 4);
> +	unsigned int		bmbt_splits;
> +
> +	bmbt_splits = max(mp->m_bm_maxlevels[0], mp->m_bm_maxlevels[1]) - 1;
> +	return mp->m_sb.sb_agcount * (XFS_ALLOCBT_AGFL_RESERVE + bmbt_splits);
>  }

So right now I'm trying to understand why this global space set
aside ever needed to take into account the space used by a single
BMBT split. ISTR it was done back in 2006 because the ag selection
code, alloc args and/or xfs_alloc_space_available() didn't take into
account the BMBT space via args->minleft correctly to ensure that
the AGF we select had enough space in it for both the data extent
and the followup BMBT split. Hence the original SET ASIDE (which
wasn't per AG) was just 8 blocks - 4 for the AGFL, 4 for the BMBT.

The transaction reservation takes into account the space needed by
BMBT splits so we don't over-commit global space on user allocation
anymore, args->minleft takes it into account so we don't overcommit
AG space on extent allocation, and we have the reserved block pool
to handle situations where delalloc extents are fragmented more than
the initial indirect block reservation that is taken with the
delalloc extent reservation.

So where/why is this needed anymore?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 2/6] xfs: actually set aside enough space to handle a bmbt split
  2022-03-20 16:43 [PATCHSET v3 0/6] xfs: fix incorrect reserve pool calculations and reporting Darrick J. Wong
@ 2022-03-20 16:43 ` Darrick J. Wong
  2022-03-23 20:48   ` Dave Chinner
  0 siblings, 1 reply; 20+ messages in thread
From: Darrick J. Wong @ 2022-03-20 16:43 UTC (permalink / raw)
  To: djwong; +Cc: Brian Foster, linux-xfs, bfoster, david

From: Darrick J. Wong <djwong@kernel.org>

The comment for xfs_alloc_set_aside indicates that we want to set aside
enough space to handle a bmap btree split.  The code, unfortunately,
hardcodes this to 4.

This is incorrect, since file bmap btrees can be taller than that:

xfs_db> btheight bmapbt -n 4294967295 -b 512
bmapbt: worst case per 512-byte block: 13 records (leaf) / 13 keyptrs (node)
level 0: 4294967295 records, 330382100 blocks
level 1: 330382100 records, 25414008 blocks
level 2: 25414008 records, 1954924 blocks
level 3: 1954924 records, 150379 blocks
level 4: 150379 records, 11568 blocks
level 5: 11568 records, 890 blocks
level 6: 890 records, 69 blocks
level 7: 69 records, 6 blocks
level 8: 6 records, 1 block
9 levels, 357913945 blocks total

Or, for V5 filesystems:

xfs_db> btheight bmapbt -n 4294967295 -b 1024
bmapbt: worst case per 1024-byte block: 29 records (leaf) / 29 keyptrs (node)
level 0: 4294967295 records, 148102321 blocks
level 1: 148102321 records, 5106977 blocks
level 2: 5106977 records, 176103 blocks
level 3: 176103 records, 6073 blocks
level 4: 6073 records, 210 blocks
level 5: 210 records, 8 blocks
level 6: 8 records, 1 block
7 levels, 153391693 blocks total

Fix this by using the actual bmap btree maxlevel value for the
set-aside.  We subtract one because the root is always in the inode and
hence never splits.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_alloc.c |    7 +++++--
 fs/xfs/libxfs/xfs_sb.c    |    2 --
 fs/xfs/xfs_mount.c        |    7 +++++++
 3 files changed, 12 insertions(+), 4 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index b0678e96ce61..747b3e45303f 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -107,13 +107,16 @@ xfs_prealloc_blocks(
  * aside a few blocks which will not be reserved in delayed allocation.
  *
  * For each AG, we need to reserve enough blocks to replenish a totally empty
- * AGFL and 4 more to handle a potential split of the file's bmap btree.
+ * AGFL and enough to handle a potential split of a file's bmap btree.
  */
 unsigned int
 xfs_alloc_set_aside(
 	struct xfs_mount	*mp)
 {
-	return mp->m_sb.sb_agcount * (XFS_ALLOCBT_AGFL_RESERVE + 4);
+	unsigned int		bmbt_splits;
+
+	bmbt_splits = max(mp->m_bm_maxlevels[0], mp->m_bm_maxlevels[1]) - 1;
+	return mp->m_sb.sb_agcount * (XFS_ALLOCBT_AGFL_RESERVE + bmbt_splits);
 }
 
 /*
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index f4e84aa1d50a..b823beb944e4 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -887,8 +887,6 @@ xfs_sb_mount_common(
 	mp->m_refc_mnr[1] = mp->m_refc_mxr[1] / 2;
 
 	mp->m_bsize = XFS_FSB_TO_BB(mp, 1);
-	mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
-	mp->m_ag_max_usable = xfs_alloc_ag_max_usable(mp);
 }
 
 /*
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index bed73e8002a5..4f8fac8175e8 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -652,6 +652,13 @@ xfs_mountfs(
 
 	xfs_agbtree_compute_maxlevels(mp);
 
+	/*
+	 * Compute the amount of space to set aside to handle btree splits near
+	 * ENOSPC now that we have calculated the btree maxlevels.
+	 */
+	mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
+	mp->m_ag_max_usable = xfs_alloc_ag_max_usable(mp);
+
 	/*
 	 * Check if sb_agblocks is aligned at stripe boundary.  If sb_agblocks
 	 * is NOT aligned turn off m_dalign since allocator alignment is within


^ permalink raw reply related	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2022-03-24  6:00 UTC | newest]

Thread overview: 20+ messages
2022-03-17 21:20 [PATCHSET v2 0/6] xfs: fix incorrect reserve pool calculations and reporting Darrick J. Wong
2022-03-17 21:21 ` [PATCH 1/6] xfs: document the XFS_ALLOC_AGFL_RESERVE constant Darrick J. Wong
2022-03-18 12:17   ` Brian Foster
2022-03-17 21:21 ` [PATCH 2/6] xfs: actually set aside enough space to handle a bmbt split Darrick J. Wong
2022-03-18 12:17   ` Brian Foster
2022-03-18 20:52     ` Darrick J. Wong
2022-03-17 21:21 ` [PATCH 3/6] xfs: don't include bnobt blocks when reserving free block pool Darrick J. Wong
2022-03-18 12:18   ` Brian Foster
2022-03-18 21:01     ` Darrick J. Wong
2022-03-17 21:21 ` [PATCH 4/6] xfs: fix infinite loop " Darrick J. Wong
2022-03-18 12:18   ` Brian Foster
2022-03-17 21:21 ` [PATCH 5/6] xfs: don't report reserved bnobt space as available Darrick J. Wong
2022-03-18 12:19   ` Brian Foster
2022-03-18 21:19     ` Darrick J. Wong
2022-03-17 21:21 ` [PATCH 6/6] xfs: rename "alloc_set_aside" to be more descriptive Darrick J. Wong
2022-03-18 12:21   ` Brian Foster
2022-03-20 16:43 [PATCHSET v3 0/6] xfs: fix incorrect reserve pool calculations and reporting Darrick J. Wong
2022-03-20 16:43 ` [PATCH 2/6] xfs: actually set aside enough space to handle a bmbt split Darrick J. Wong
2022-03-23 20:48   ` Dave Chinner
2022-03-24  5:26     ` Darrick J. Wong
2022-03-24  6:00       ` Dave Chinner
