* [PATCHSET v3 0/6] xfs: fix incorrect reserve pool calculations and reporting
@ 2022-03-20 16:43 Darrick J. Wong
  2022-03-20 16:43 ` [PATCH 1/6] xfs: document the XFS_ALLOC_AGFL_RESERVE constant Darrick J. Wong
                   ` (5 more replies)
  0 siblings, 6 replies; 25+ messages in thread
From: Darrick J. Wong @ 2022-03-20 16:43 UTC (permalink / raw)
  To: djwong; +Cc: Brian Foster, linux-xfs, bfoster, david

Hi all,

Last week, I was running xfs/306 (which formats a single-AG 20MB
filesystem) with an fstests configuration that specifies a 1k blocksize
and a specially crafted log size that will consume 7/8 of the space
(17920 blocks, specifically) in that AG.  This resulted in mount hanging
in the infinite loop in xfs_reserve_blocks because we overestimate the
number of blocks that xfs_mod_fdblocks will give us, and the code
retries forever.
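
For context, here is the rough shape of the loop that hangs, condensed
from the pre-series fs/xfs/xfs_fsops.c with most of the loop body
trimmed.  Because "free" subtracts only m_alloc_set_aside and not
m_allocbt_blks, it can stay positive even though xfs_mod_fdblocks keeps
returning -ENOSPC, so the loop never exits:

	error = -ENOSPC;
	do {
		free = percpu_counter_sum(&mp->m_fdblocks) -
						mp->m_alloc_set_aside;
		if (free <= 0)
			break;

		/* ... carve fdblks_delta out of "free" (trimmed) ... */

		spin_unlock(&mp->m_sb_lock);
		error = xfs_mod_fdblocks(mp, -fdblks_delta, 0);
		spin_lock(&mp->m_sb_lock);
	} while (error == -ENOSPC);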

v2: Initially, this was a single patch that fixed the calculation and
transformed the loop into a finite loop.  However, further discussion
revealed several more issues:

 * People had a hard time grokking that the "alloc_set_aside" is
   actually the amount of space we hide so that a bmbt split will always
   succeed;
 * The author didn't understand what XFS_ALLOC_AGFL_RESERVE actually
   meant or whether it was correctly calculated;
 * The "alloc_set_aside" underestimated the blocks needed to handle any
   bmap btree split;
 * Both statfs and XFS_IOC_FSCOUNTS forgot to exclude the amount of
   space used by the free space btrees on behalf of per-AG reservations,
   leading to overreporting of available space;
 * The loop control change really should have been separate.

Hence, this patchset has now grown to six patches to address each of
these issues separately.

v3: Only add one helper for calculating the total fdblocks set-aside and
tighten some documentation.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=reserve-block-fixes-5.18
---
 fs/xfs/libxfs/xfs_alloc.c |   30 +++++++++++++++++++++++-------
 fs/xfs/libxfs/xfs_alloc.h |    3 +--
 fs/xfs/libxfs/xfs_sb.c    |    2 --
 fs/xfs/xfs_fsops.c        |   18 +++++++++++++-----
 fs/xfs/xfs_log_recover.c  |    2 +-
 fs/xfs/xfs_mount.c        |    9 ++++++++-
 fs/xfs/xfs_mount.h        |   18 +++++++++++++++++-
 fs/xfs/xfs_super.c        |    3 ++-
 8 files changed, 65 insertions(+), 20 deletions(-)


* [PATCH 1/6] xfs: document the XFS_ALLOC_AGFL_RESERVE constant
  2022-03-20 16:43 [PATCHSET v3 0/6] xfs: fix incorrect reserve pool calculations and reporting Darrick J. Wong
@ 2022-03-20 16:43 ` Darrick J. Wong
  2022-03-23 20:39   ` Dave Chinner
  2022-03-20 16:43 ` [PATCH 2/6] xfs: actually set aside enough space to handle a bmbt split Darrick J. Wong
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 25+ messages in thread
From: Darrick J. Wong @ 2022-03-20 16:43 UTC (permalink / raw)
  To: djwong; +Cc: Brian Foster, linux-xfs, bfoster, david

From: Darrick J. Wong <djwong@kernel.org>

Currently, we use this undocumented macro to encode the minimum number
of blocks needed to replenish a completely empty AGFL when an AG is
nearly full.  This has led to confusion on the part of the maintainers,
so let's document what the value actually means, and move it to
xfs_alloc.c since it's not used outside of that module.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_alloc.c |   23 ++++++++++++++++++-----
 fs/xfs/libxfs/xfs_alloc.h |    1 -
 2 files changed, 18 insertions(+), 6 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 353e53b892e6..b0678e96ce61 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -82,6 +82,19 @@ xfs_prealloc_blocks(
 }
 
 /*
+ * The number of blocks per AG that we withhold from xfs_mod_fdblocks to
+ * guarantee that we can refill the AGFL prior to allocating space in a nearly
+ * full AG.  We require two blocks per free space btree because free space
+ * btrees shrink to a single block as the AG fills up, and any allocation can
+ * cause a btree split.  The rmap btree uses a per-AG reservation to withhold
+ * space from xfs_mod_fdblocks, so we do not account for that here.
+ */
+#define XFS_ALLOCBT_AGFL_RESERVE	4
+
+/*
+ * Compute the number of blocks that we set aside to guarantee the ability to
+ * refill the AGFL and handle a full bmap btree split.
+ *
  * In order to avoid ENOSPC-related deadlock caused by out-of-order locking of
  * AGF buffer (PV 947395), we place constraints on the relationship among
  * actual allocations for data blocks, freelist blocks, and potential file data
@@ -93,14 +106,14 @@ xfs_prealloc_blocks(
  * extents need to be actually allocated. To get around this, we explicitly set
  * aside a few blocks which will not be reserved in delayed allocation.
  *
- * We need to reserve 4 fsbs _per AG_ for the freelist and 4 more to handle a
- * potential split of the file's bmap btree.
+ * For each AG, we need to reserve enough blocks to replenish a totally empty
+ * AGFL and 4 more to handle a potential split of the file's bmap btree.
  */
 unsigned int
 xfs_alloc_set_aside(
 	struct xfs_mount	*mp)
 {
-	return mp->m_sb.sb_agcount * (XFS_ALLOC_AGFL_RESERVE + 4);
+	return mp->m_sb.sb_agcount * (XFS_ALLOCBT_AGFL_RESERVE + 4);
 }
 
 /*
@@ -124,12 +137,12 @@ xfs_alloc_ag_max_usable(
 	unsigned int		blocks;
 
 	blocks = XFS_BB_TO_FSB(mp, XFS_FSS_TO_BB(mp, 4)); /* ag headers */
-	blocks += XFS_ALLOC_AGFL_RESERVE;
+	blocks += XFS_ALLOCBT_AGFL_RESERVE;
 	blocks += 3;			/* AGF, AGI btree root blocks */
 	if (xfs_has_finobt(mp))
 		blocks++;		/* finobt root block */
 	if (xfs_has_rmapbt(mp))
-		blocks++; 		/* rmap root block */
+		blocks++;		/* rmap root block */
 	if (xfs_has_reflink(mp))
 		blocks++;		/* refcount root block */
 
diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
index 1c14a0b1abea..d4c057b764f9 100644
--- a/fs/xfs/libxfs/xfs_alloc.h
+++ b/fs/xfs/libxfs/xfs_alloc.h
@@ -88,7 +88,6 @@ typedef struct xfs_alloc_arg {
 #define XFS_ALLOC_NOBUSY		(1 << 2)/* Busy extents not allowed */
 
 /* freespace limit calculations */
-#define XFS_ALLOC_AGFL_RESERVE	4
 unsigned int xfs_alloc_set_aside(struct xfs_mount *mp);
 unsigned int xfs_alloc_ag_max_usable(struct xfs_mount *mp);
 


* [PATCH 2/6] xfs: actually set aside enough space to handle a bmbt split
  2022-03-20 16:43 [PATCHSET v3 0/6] xfs: fix incorrect reserve pool calculations and reporting Darrick J. Wong
  2022-03-20 16:43 ` [PATCH 1/6] xfs: document the XFS_ALLOC_AGFL_RESERVE constant Darrick J. Wong
@ 2022-03-20 16:43 ` Darrick J. Wong
  2022-03-23 20:48   ` Dave Chinner
  2022-03-20 16:43 ` [PATCH 3/6] xfs: don't include bnobt blocks when reserving free block pool Darrick J. Wong
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 25+ messages in thread
From: Darrick J. Wong @ 2022-03-20 16:43 UTC (permalink / raw)
  To: djwong; +Cc: Brian Foster, linux-xfs, bfoster, david

From: Darrick J. Wong <djwong@kernel.org>

The comment for xfs_alloc_set_aside indicates that we want to set aside
enough space to handle a bmap btree split.  The code, unfortunately,
hardcodes this to 4.

This is incorrect, since file bmap btrees can be taller than that:

xfs_db> btheight bmapbt -n 4294967295 -b 512
bmapbt: worst case per 512-byte block: 13 records (leaf) / 13 keyptrs (node)
level 0: 4294967295 records, 330382100 blocks
level 1: 330382100 records, 25414008 blocks
level 2: 25414008 records, 1954924 blocks
level 3: 1954924 records, 150379 blocks
level 4: 150379 records, 11568 blocks
level 5: 11568 records, 890 blocks
level 6: 890 records, 69 blocks
level 7: 69 records, 6 blocks
level 8: 6 records, 1 block
9 levels, 357913945 blocks total

Or, for V5 filesystems:

xfs_db> btheight bmapbt -n 4294967295 -b 1024
bmapbt: worst case per 1024-byte block: 29 records (leaf) / 29 keyptrs (node)
level 0: 4294967295 records, 148102321 blocks
level 1: 148102321 records, 5106977 blocks
level 2: 5106977 records, 176103 blocks
level 3: 176103 records, 6073 blocks
level 4: 6073 records, 210 blocks
level 5: 210 records, 8 blocks
level 6: 8 records, 1 block
7 levels, 153391693 blocks total

Fix this by using the actual bmap btree maxlevel value for the
set-aside.  We subtract one because the root is always in the inode and
hence never splits.
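
(As a rough illustration with the numbers above: on a 512-byte block
filesystem the worst-case bmbt is 9 levels tall, so each AG now sets
aside 4 + (9 - 1) = 12 blocks for this instead of the old 4 + 4 = 8.)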

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_alloc.c |    7 +++++--
 fs/xfs/libxfs/xfs_sb.c    |    2 --
 fs/xfs/xfs_mount.c        |    7 +++++++
 3 files changed, 12 insertions(+), 4 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index b0678e96ce61..747b3e45303f 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -107,13 +107,16 @@ xfs_prealloc_blocks(
  * aside a few blocks which will not be reserved in delayed allocation.
  *
  * For each AG, we need to reserve enough blocks to replenish a totally empty
- * AGFL and 4 more to handle a potential split of the file's bmap btree.
+ * AGFL and enough to handle a potential split of a file's bmap btree.
  */
 unsigned int
 xfs_alloc_set_aside(
 	struct xfs_mount	*mp)
 {
-	return mp->m_sb.sb_agcount * (XFS_ALLOCBT_AGFL_RESERVE + 4);
+	unsigned int		bmbt_splits;
+
+	bmbt_splits = max(mp->m_bm_maxlevels[0], mp->m_bm_maxlevels[1]) - 1;
+	return mp->m_sb.sb_agcount * (XFS_ALLOCBT_AGFL_RESERVE + bmbt_splits);
 }
 
 /*
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index f4e84aa1d50a..b823beb944e4 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -887,8 +887,6 @@ xfs_sb_mount_common(
 	mp->m_refc_mnr[1] = mp->m_refc_mxr[1] / 2;
 
 	mp->m_bsize = XFS_FSB_TO_BB(mp, 1);
-	mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
-	mp->m_ag_max_usable = xfs_alloc_ag_max_usable(mp);
 }
 
 /*
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index bed73e8002a5..4f8fac8175e8 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -652,6 +652,13 @@ xfs_mountfs(
 
 	xfs_agbtree_compute_maxlevels(mp);
 
+	/*
+	 * Compute the amount of space to set aside to handle btree splits near
+	 * ENOSPC now that we have calculated the btree maxlevels.
+	 */
+	mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
+	mp->m_ag_max_usable = xfs_alloc_ag_max_usable(mp);
+
 	/*
 	 * Check if sb_agblocks is aligned at stripe boundary.  If sb_agblocks
 	 * is NOT aligned turn off m_dalign since allocator alignment is within


* [PATCH 3/6] xfs: don't include bnobt blocks when reserving free block pool
  2022-03-20 16:43 [PATCHSET v3 0/6] xfs: fix incorrect reserve pool calculations and reporting Darrick J. Wong
  2022-03-20 16:43 ` [PATCH 1/6] xfs: document the XFS_ALLOC_AGFL_RESERVE constant Darrick J. Wong
  2022-03-20 16:43 ` [PATCH 2/6] xfs: actually set aside enough space to handle a bmbt split Darrick J. Wong
@ 2022-03-20 16:43 ` Darrick J. Wong
  2022-03-21 15:22   ` Brian Foster
  2022-03-23 20:51   ` Dave Chinner
  2022-03-20 16:43 ` [PATCH 4/6] xfs: fix infinite loop " Darrick J. Wong
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 25+ messages in thread
From: Darrick J. Wong @ 2022-03-20 16:43 UTC (permalink / raw)
  To: djwong; +Cc: Brian Foster, linux-xfs, bfoster, david

From: Darrick J. Wong <djwong@kernel.org>

xfs_reserve_blocks controls the size of the user-visible free space
reserve pool.  Given the difference between the current and requested
pool sizes, it will try to reserve free space from fdblocks.  However,
the amount requested from fdblocks is also constrained by the amount of
space that we think xfs_mod_fdblocks will give us.  We'll keep trying to
reserve space so long as xfs_mod_fdblocks returns ENOSPC.

In commit fd43cf600cf6, we decided that xfs_mod_fdblocks should not hand
out the "free space" used by the free space btrees, because some portion
of the free space btrees hold in reserve space for future btree
expansion.  Unfortunately, xfs_reserve_blocks' estimation of the number
of blocks that it could request from xfs_mod_fdblocks was not updated to
include m_allocbt_blks, so if space is extremely low, the caller hangs.

Fix this by creating a function to estimate the number of blocks that
can be reserved from fdblocks, which needs to exclude the set-aside and
m_allocbt_blks.
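
In other words (a sketch, not the literal diff), the estimate used by
the reservation loop changes from

	free = percpu_counter_sum(&mp->m_fdblocks) -
					mp->m_alloc_set_aside;

to

	free = percpu_counter_sum(&mp->m_fdblocks) -
					xfs_fdblocks_unavailable(mp);

where xfs_fdblocks_unavailable() is m_alloc_set_aside plus
m_allocbt_blks, i.e. the same set-aside that xfs_mod_fdblocks enforces.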

Found by running xfs/306 (which formats a single-AG 20MB filesystem)
with an fstests configuration that specifies a 1k blocksize and a
specially crafted log size that will consume 7/8 of the space (17920
blocks, specifically) in that AG.

Cc: Brian Foster <bfoster@redhat.com>
Fixes: fd43cf600cf6 ("xfs: set aside allocation btree blocks from block reservation")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_fsops.c |    2 +-
 fs/xfs/xfs_mount.c |    2 +-
 fs/xfs/xfs_mount.h |   15 +++++++++++++++
 3 files changed, 17 insertions(+), 2 deletions(-)


diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index 33e26690a8c4..710e857bb825 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -434,7 +434,7 @@ xfs_reserve_blocks(
 	error = -ENOSPC;
 	do {
 		free = percpu_counter_sum(&mp->m_fdblocks) -
-						mp->m_alloc_set_aside;
+						xfs_fdblocks_unavailable(mp);
 		if (free <= 0)
 			break;
 
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 4f8fac8175e8..c9fd5219d377 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -1153,7 +1153,7 @@ xfs_mod_fdblocks(
 	 * problems (i.e. transaction abort, pagecache discards, etc.) than
 	 * slightly premature -ENOSPC.
 	 */
-	set_aside = mp->m_alloc_set_aside + atomic64_read(&mp->m_allocbt_blks);
+	set_aside = xfs_fdblocks_unavailable(mp);
 	percpu_counter_add_batch(&mp->m_fdblocks, delta, batch);
 	if (__percpu_counter_compare(&mp->m_fdblocks, set_aside,
 				     XFS_FDBLOCKS_BATCH) >= 0) {
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 00720a02e761..da1b7056e743 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -479,6 +479,21 @@ extern void	xfs_unmountfs(xfs_mount_t *);
  */
 #define XFS_FDBLOCKS_BATCH	1024
 
+/*
+ * Estimate the amount of free space that is not available to userspace and is
+ * not explicitly reserved from the incore fdblocks:
+ *
+ * - Space reserved to ensure that we can always split a bmap btree
+ * - Free space btree blocks that are not available for allocation due to
+ *   per-AG metadata reservations
+ */
+static inline uint64_t
+xfs_fdblocks_unavailable(
+	struct xfs_mount	*mp)
+{
+	return mp->m_alloc_set_aside + atomic64_read(&mp->m_allocbt_blks);
+}
+
 extern int	xfs_mod_fdblocks(struct xfs_mount *mp, int64_t delta,
 				 bool reserved);
 extern int	xfs_mod_frextents(struct xfs_mount *mp, int64_t delta);


* [PATCH 4/6] xfs: fix infinite loop when reserving free block pool
  2022-03-20 16:43 [PATCHSET v3 0/6] xfs: fix incorrect reserve pool calculations and reporting Darrick J. Wong
                   ` (2 preceding siblings ...)
  2022-03-20 16:43 ` [PATCH 3/6] xfs: don't include bnobt blocks when reserving free block pool Darrick J. Wong
@ 2022-03-20 16:43 ` Darrick J. Wong
  2022-03-23 21:11   ` Dave Chinner
  2022-03-20 16:43 ` [PATCH 5/6] xfs: don't report reserved bnobt space as available Darrick J. Wong
  2022-03-20 16:44 ` [PATCH 6/6] xfs: rename "alloc_set_aside" to be more descriptive Darrick J. Wong
  5 siblings, 1 reply; 25+ messages in thread
From: Darrick J. Wong @ 2022-03-20 16:43 UTC (permalink / raw)
  To: djwong; +Cc: Brian Foster, Brian Foster, linux-xfs, bfoster, david

From: Darrick J. Wong <djwong@kernel.org>

Don't spin in an infinite loop trying to reserve blocks -- if we can't
do it after 30 tries, we're racing with a nearly full filesystem, so
just give up.

Cc: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_fsops.c |   12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)


diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index 710e857bb825..615334e4f689 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -379,6 +379,7 @@ xfs_reserve_blocks(
 	int64_t			fdblks_delta = 0;
 	uint64_t		request;
 	int64_t			free;
+	unsigned int		tries;
 	int			error = 0;
 
 	/* If inval is null, report current values and return */
@@ -430,9 +431,16 @@ xfs_reserve_blocks(
 	 * If the request is larger than the current reservation, reserve the
 	 * blocks before we update the reserve counters. Sample m_fdblocks and
 	 * perform a partial reservation if the request exceeds free space.
+	 *
+	 * The loop body estimates how many blocks it can request from fdblocks
+	 * to stash in the reserve pool.  This is a classic TOCTOU race since
+	 * fdblocks updates are not always coordinated via m_sb_lock.  We also
+	 * cannot tell if @free remaining unchanged between iterations is due
+	 * to an idle system or freed blocks being consumed immediately, so
+	 * we'll try a finite number of times to satisfy the request.
 	 */
 	error = -ENOSPC;
-	do {
+	for (tries = 0; tries < 30 && error == -ENOSPC; tries++) {
 		free = percpu_counter_sum(&mp->m_fdblocks) -
 						xfs_fdblocks_unavailable(mp);
 		if (free <= 0)
@@ -459,7 +467,7 @@ xfs_reserve_blocks(
 		spin_unlock(&mp->m_sb_lock);
 		error = xfs_mod_fdblocks(mp, -fdblks_delta, 0);
 		spin_lock(&mp->m_sb_lock);
-	} while (error == -ENOSPC);
+	}
 
 	/*
 	 * Update the reserve counters if blocks have been successfully


* [PATCH 5/6] xfs: don't report reserved bnobt space as available
  2022-03-20 16:43 [PATCHSET v3 0/6] xfs: fix incorrect reserve pool calculations and reporting Darrick J. Wong
                   ` (3 preceding siblings ...)
  2022-03-20 16:43 ` [PATCH 4/6] xfs: fix infinite loop " Darrick J. Wong
@ 2022-03-20 16:43 ` Darrick J. Wong
  2022-03-21 15:22   ` Brian Foster
  2022-03-23 21:12   ` Dave Chinner
  2022-03-20 16:44 ` [PATCH 6/6] xfs: rename "alloc_set_aside" to be more descriptive Darrick J. Wong
  5 siblings, 2 replies; 25+ messages in thread
From: Darrick J. Wong @ 2022-03-20 16:43 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, bfoster, david

From: Darrick J. Wong <djwong@kernel.org>

On a modern filesystem, we don't allow userspace to allocate blocks for
data storage from the per-AG space reservations, the user-controlled
reservation pool that prevents ENOSPC in the middle of internal
operations, or the internal per-AG set-aside that prevents ENOSPC.
Since we now consider free space btree blocks as unavailable for
allocation for data storage, we shouldn't report those blocks via statfs
either.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_fsops.c |    2 +-
 fs/xfs/xfs_super.c |    3 ++-
 2 files changed, 3 insertions(+), 2 deletions(-)


diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index 615334e4f689..863e6389c6ff 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -347,7 +347,7 @@ xfs_fs_counts(
 	cnt->allocino = percpu_counter_read_positive(&mp->m_icount);
 	cnt->freeino = percpu_counter_read_positive(&mp->m_ifree);
 	cnt->freedata = percpu_counter_read_positive(&mp->m_fdblocks) -
-						mp->m_alloc_set_aside;
+						xfs_fdblocks_unavailable(mp);
 
 	spin_lock(&mp->m_sb_lock);
 	cnt->freertx = mp->m_sb.sb_frextents;
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index d84714e4e46a..54be9d64093e 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -815,7 +815,8 @@ xfs_fs_statfs(
 	spin_unlock(&mp->m_sb_lock);
 
 	/* make sure statp->f_bfree does not underflow */
-	statp->f_bfree = max_t(int64_t, fdblocks - mp->m_alloc_set_aside, 0);
+	statp->f_bfree = max_t(int64_t, 0,
+				fdblocks - xfs_fdblocks_unavailable(mp));
 	statp->f_bavail = statp->f_bfree;
 
 	fakeinos = XFS_FSB_TO_INO(mp, statp->f_bfree);


* [PATCH 6/6] xfs: rename "alloc_set_aside" to be more descriptive
  2022-03-20 16:43 [PATCHSET v3 0/6] xfs: fix incorrect reserve pool calculations and reporting Darrick J. Wong
                   ` (4 preceding siblings ...)
  2022-03-20 16:43 ` [PATCH 5/6] xfs: don't report reserved bnobt space as available Darrick J. Wong
@ 2022-03-20 16:44 ` Darrick J. Wong
  2022-03-23 21:21   ` Dave Chinner
  5 siblings, 1 reply; 25+ messages in thread
From: Darrick J. Wong @ 2022-03-20 16:44 UTC (permalink / raw)
  To: djwong; +Cc: Brian Foster, linux-xfs, bfoster, david

From: Darrick J. Wong <djwong@kernel.org>

We've established in this patchset that the "alloc_set_aside" pool is
actually used to ensure that a bmbt split always succeeds so that the
filesystem won't run out of space mid-transaction and crash.  Rename the
variable and the function to be a little more suggestive of the purpose
of this quantity.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_alloc.c |    4 ++--
 fs/xfs/libxfs/xfs_alloc.h |    2 +-
 fs/xfs/xfs_fsops.c        |    2 +-
 fs/xfs/xfs_log_recover.c  |    2 +-
 fs/xfs/xfs_mount.c        |    2 +-
 fs/xfs/xfs_mount.h        |    5 +++--
 6 files changed, 9 insertions(+), 8 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 747b3e45303f..a4a6cca1ffd1 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -110,7 +110,7 @@ xfs_prealloc_blocks(
  * AGFL and enough to handle a potential split of a file's bmap btree.
  */
 unsigned int
-xfs_alloc_set_aside(
+xfs_bmbt_split_setaside(
 	struct xfs_mount	*mp)
 {
 	unsigned int		bmbt_splits;
@@ -127,7 +127,7 @@ xfs_alloc_set_aside(
  *	- the AG superblock, AGF, AGI and AGFL
  *	- the AGF (bno and cnt) and AGI btree root blocks, and optionally
  *	  the AGI free inode and rmap btree root blocks.
- *	- blocks on the AGFL according to xfs_alloc_set_aside() limits
+ *	- blocks on the AGFL according to xfs_bmbt_split_setaside() limits
  *	- the rmapbt root block
  *
  * The AG headers are sector sized, so the amount of space they take up is
diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
index d4c057b764f9..7d676c1c66bc 100644
--- a/fs/xfs/libxfs/xfs_alloc.h
+++ b/fs/xfs/libxfs/xfs_alloc.h
@@ -88,7 +88,7 @@ typedef struct xfs_alloc_arg {
 #define XFS_ALLOC_NOBUSY		(1 << 2)/* Busy extents not allowed */
 
 /* freespace limit calculations */
-unsigned int xfs_alloc_set_aside(struct xfs_mount *mp);
+unsigned int xfs_bmbt_split_setaside(struct xfs_mount *mp);
 unsigned int xfs_alloc_ag_max_usable(struct xfs_mount *mp);
 
 xfs_extlen_t xfs_alloc_longest_free_extent(struct xfs_perag *pag,
diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index 863e6389c6ff..b1840daf89c2 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -190,7 +190,7 @@ xfs_growfs_data_private(
 	if (nagimax)
 		mp->m_maxagi = nagimax;
 	xfs_set_low_space_thresholds(mp);
-	mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
+	mp->m_bmbt_split_setaside = xfs_bmbt_split_setaside(mp);
 
 	if (delta > 0) {
 		/*
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 96c997ed2ec8..30e22cd943c2 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -3351,7 +3351,7 @@ xlog_do_recover(
 		xfs_warn(mp, "Failed post-recovery per-ag init: %d", error);
 		return error;
 	}
-	mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
+	mp->m_bmbt_split_setaside = xfs_bmbt_split_setaside(mp);
 
 	xlog_recover_check_summary(log);
 
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index c9fd5219d377..e9fb61b2290a 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -656,7 +656,7 @@ xfs_mountfs(
 	 * Compute the amount of space to set aside to handle btree splits near
 	 * ENOSPC now that we have calculated the btree maxlevels.
 	 */
-	mp->m_alloc_set_aside = xfs_alloc_set_aside(mp);
+	mp->m_bmbt_split_setaside = xfs_bmbt_split_setaside(mp);
 	mp->m_ag_max_usable = xfs_alloc_ag_max_usable(mp);
 
 	/*
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index da1b7056e743..b948f4002e7f 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -134,7 +134,8 @@ typedef struct xfs_mount {
 	uint			m_refc_maxlevels; /* max refcount btree level */
 	unsigned int		m_agbtree_maxlevels; /* max level of all AG btrees */
 	xfs_extlen_t		m_ag_prealloc_blocks; /* reserved ag blocks */
-	uint			m_alloc_set_aside; /* space we can't use */
+	/* space reserved to ensure bmbt splits always succeed */
+	unsigned int		m_bmbt_split_setaside;
 	uint			m_ag_max_usable; /* max space per AG */
 	int			m_dalign;	/* stripe unit */
 	int			m_swidth;	/* stripe width */
@@ -491,7 +492,7 @@ static inline uint64_t
 xfs_fdblocks_unavailable(
 	struct xfs_mount	*mp)
 {
-	return mp->m_alloc_set_aside + atomic64_read(&mp->m_allocbt_blks);
+	return mp->m_bmbt_split_setaside + atomic64_read(&mp->m_allocbt_blks);
 }
 
 extern int	xfs_mod_fdblocks(struct xfs_mount *mp, int64_t delta,


* Re: [PATCH 3/6] xfs: don't include bnobt blocks when reserving free block pool
  2022-03-20 16:43 ` [PATCH 3/6] xfs: don't include bnobt blocks when reserving free block pool Darrick J. Wong
@ 2022-03-21 15:22   ` Brian Foster
  2022-03-21 20:42     ` Darrick J. Wong
  2022-03-23 20:51   ` Dave Chinner
  1 sibling, 1 reply; 25+ messages in thread
From: Brian Foster @ 2022-03-21 15:22 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, david

On Sun, Mar 20, 2022 at 09:43:43AM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> xfs_reserve_blocks controls the size of the user-visible free space
> reserve pool.  Given the difference between the current and requested
> pool sizes, it will try to reserve free space from fdblocks.  However,
> the amount requested from fdblocks is also constrained by the amount of
> space that we think xfs_mod_fdblocks will give us.  We'll keep trying to
> reserve space so long as xfs_mod_fdblocks returns ENOSPC.
> 
> In commit fd43cf600cf6, we decided that xfs_mod_fdblocks should not hand
> out the "free space" used by the free space btrees, because some portion
> of the free space btrees hold in reserve space for future btree
> expansion.  Unfortunately, xfs_reserve_blocks' estimation of the number
> of blocks that it could request from xfs_mod_fdblocks was not updated to
> include m_allocbt_blks, so if space is extremely low, the caller hangs.
> 
> Fix this by creating a function to estimate the number of blocks that
> can be reserved from fdblocks, which needs to exclude the set-aside and
> m_allocbt_blks.
> 
> Found by running xfs/306 (which formats a single-AG 20MB filesystem)
> with an fstests configuration that specifies a 1k blocksize and a
> specially crafted log size that will consume 7/8 of the space (17920
> blocks, specifically) in that AG.
> 
> Cc: Brian Foster <bfoster@redhat.com>
> Fixes: fd43cf600cf6 ("xfs: set aside allocation btree blocks from block reservation")
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/xfs_fsops.c |    2 +-
>  fs/xfs/xfs_mount.c |    2 +-
>  fs/xfs/xfs_mount.h |   15 +++++++++++++++
>  3 files changed, 17 insertions(+), 2 deletions(-)
> 
> 
...
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index 00720a02e761..da1b7056e743 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -479,6 +479,21 @@ extern void	xfs_unmountfs(xfs_mount_t *);
>   */
>  #define XFS_FDBLOCKS_BATCH	1024
>  
> +/*
> + * Estimate the amount of free space that is not available to userspace and is
> + * not explicitly reserved from the incore fdblocks:
> + *
> + * - Space reserved to ensure that we can always split a bmap btree
> + * - Free space btree blocks that are not available for allocation due to
> + *   per-AG metadata reservations
> + */

What does this mean by "due to" perag res? That sounds like a separate
thing to me. Perhaps this could just say something like:

"Estimate the amount of accounted free space that is not available to
userspace. This includes the minimum number of blocks to support a bmbt
split (calculated at mount time) and the blocks currently in-use by the
allocation btrees."

Comment nit aside, this LGTM. Thanks for the rework..

Reviewed-by: Brian Foster <bfoster@redhat.com>

> +static inline uint64_t
> +xfs_fdblocks_unavailable(
> +	struct xfs_mount	*mp)
> +{
> +	return mp->m_alloc_set_aside + atomic64_read(&mp->m_allocbt_blks);
> +}
> +
>  extern int	xfs_mod_fdblocks(struct xfs_mount *mp, int64_t delta,
>  				 bool reserved);
>  extern int	xfs_mod_frextents(struct xfs_mount *mp, int64_t delta);
> 


* Re: [PATCH 5/6] xfs: don't report reserved bnobt space as available
  2022-03-20 16:43 ` [PATCH 5/6] xfs: don't report reserved bnobt space as available Darrick J. Wong
@ 2022-03-21 15:22   ` Brian Foster
  2022-03-21 20:48     ` Darrick J. Wong
  2022-03-23 21:12   ` Dave Chinner
  1 sibling, 1 reply; 25+ messages in thread
From: Brian Foster @ 2022-03-21 15:22 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, david

On Sun, Mar 20, 2022 at 09:43:55AM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> On a modern filesystem, we don't allow userspace to allocate blocks for
> data storage from the per-AG space reservations, the user-controlled
> reservation pool that prevents ENOSPC in the middle of internal
> operations, or the internal per-AG set-aside that prevents ENOSPC.

We can prevent -ENOSPC now? Neat! :)

> Since we now consider free space btree blocks as unavailable for
> allocation for data storage, we shouldn't report those blocks via statfs
> either.
> 

Might be worth a sentence or two that document the (intentional) side
effects of this from a user perspective. I.e., that technically the
presented free space will be a conservative estimate of actual free
space (since allocbt blocks free up as free extents are consumed, etc.).

Otherwise with that sort of commit log tweak:

Reviewed-by: Brian Foster <bfoster@redhat.com>

> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/xfs_fsops.c |    2 +-
>  fs/xfs/xfs_super.c |    3 ++-
>  2 files changed, 3 insertions(+), 2 deletions(-)
> 
> 
> diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> index 615334e4f689..863e6389c6ff 100644
> --- a/fs/xfs/xfs_fsops.c
> +++ b/fs/xfs/xfs_fsops.c
> @@ -347,7 +347,7 @@ xfs_fs_counts(
>  	cnt->allocino = percpu_counter_read_positive(&mp->m_icount);
>  	cnt->freeino = percpu_counter_read_positive(&mp->m_ifree);
>  	cnt->freedata = percpu_counter_read_positive(&mp->m_fdblocks) -
> -						mp->m_alloc_set_aside;
> +						xfs_fdblocks_unavailable(mp);
>  
>  	spin_lock(&mp->m_sb_lock);
>  	cnt->freertx = mp->m_sb.sb_frextents;
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index d84714e4e46a..54be9d64093e 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -815,7 +815,8 @@ xfs_fs_statfs(
>  	spin_unlock(&mp->m_sb_lock);
>  
>  	/* make sure statp->f_bfree does not underflow */
> -	statp->f_bfree = max_t(int64_t, fdblocks - mp->m_alloc_set_aside, 0);
> +	statp->f_bfree = max_t(int64_t, 0,
> +				fdblocks - xfs_fdblocks_unavailable(mp));
>  	statp->f_bavail = statp->f_bfree;
>  
>  	fakeinos = XFS_FSB_TO_INO(mp, statp->f_bfree);
> 


* Re: [PATCH 3/6] xfs: don't include bnobt blocks when reserving free block pool
  2022-03-21 15:22   ` Brian Foster
@ 2022-03-21 20:42     ` Darrick J. Wong
  0 siblings, 0 replies; 25+ messages in thread
From: Darrick J. Wong @ 2022-03-21 20:42 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, david

On Mon, Mar 21, 2022 at 11:22:27AM -0400, Brian Foster wrote:
> On Sun, Mar 20, 2022 at 09:43:43AM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > xfs_reserve_blocks controls the size of the user-visible free space
> > reserve pool.  Given the difference between the current and requested
> > pool sizes, it will try to reserve free space from fdblocks.  However,
> > the amount requested from fdblocks is also constrained by the amount of
> > space that we think xfs_mod_fdblocks will give us.  We'll keep trying to
> > reserve space so long as xfs_mod_fdblocks returns ENOSPC.
> > 
> > In commit fd43cf600cf6, we decided that xfs_mod_fdblocks should not hand
> > out the "free space" used by the free space btrees, because some portion
> > of the free space btrees hold in reserve space for future btree
> > expansion.  Unfortunately, xfs_reserve_blocks' estimation of the number
> > of blocks that it could request from xfs_mod_fdblocks was not updated to
> > include m_allocbt_blks, so if space is extremely low, the caller hangs.
> > 
> > Fix this by creating a function to estimate the number of blocks that
> > can be reserved from fdblocks, which needs to exclude the set-aside and
> > m_allocbt_blks.
> > 
> > Found by running xfs/306 (which formats a single-AG 20MB filesystem)
> > with an fstests configuration that specifies a 1k blocksize and a
> > specially crafted log size that will consume 7/8 of the space (17920
> > blocks, specifically) in that AG.
> > 
> > Cc: Brian Foster <bfoster@redhat.com>
> > Fixes: fd43cf600cf6 ("xfs: set aside allocation btree blocks from block reservation")
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  fs/xfs/xfs_fsops.c |    2 +-
> >  fs/xfs/xfs_mount.c |    2 +-
> >  fs/xfs/xfs_mount.h |   15 +++++++++++++++
> >  3 files changed, 17 insertions(+), 2 deletions(-)
> > 
> > 
> ...
> > diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> > index 00720a02e761..da1b7056e743 100644
> > --- a/fs/xfs/xfs_mount.h
> > +++ b/fs/xfs/xfs_mount.h
> > @@ -479,6 +479,21 @@ extern void	xfs_unmountfs(xfs_mount_t *);
> >   */
> >  #define XFS_FDBLOCKS_BATCH	1024
> >  
> > +/*
> > + * Estimate the amount of free space that is not available to userspace and is
> > + * not explicitly reserved from the incore fdblocks:
> > + *
> > + * - Space reserved to ensure that we can always split a bmap btree
> > + * - Free space btree blocks that are not available for allocation due to
> > + *   per-AG metadata reservations
> > + */
> 
> What does this mean by "due to" perag res? That sounds like a separate
> thing to me. Perhaps this could just say something like:
> 
> "Estimate the amount of accounted free space that is not available to
> userspace. This includes the minimum number of blocks to support a bmbt
> split (calculated at mount time) and the blocks currently in-use by the
> allocation btrees."

IMO, the last sentence should capture /why/ we subtract the blocks that
are in use by the free space btrees, because we used to advertise freesp
btree blocks in f_bfree/f_bavail, and someone who is accustomed to that
behavior might read the sentence and think "Oh, that's incorrect".

But clearly "due to per-AG metadata reservations" isn't clear enough, so
I'll change it to:

"The blocks currently in use by the freespace btrees because they record
the actual blocks that will fill per-AG metadata space reservations."

Thanks for the review :)

--D

> Comment nit aside, this LGTM. Thanks for the rework..
> 
> Reviewed-by: Brian Foster <bfoster@redhat.com>
> 
> > +static inline uint64_t
> > +xfs_fdblocks_unavailable(
> > +	struct xfs_mount	*mp)
> > +{
> > +	return mp->m_alloc_set_aside + atomic64_read(&mp->m_allocbt_blks);
> > +}
> > +
> >  extern int	xfs_mod_fdblocks(struct xfs_mount *mp, int64_t delta,
> >  				 bool reserved);
> >  extern int	xfs_mod_frextents(struct xfs_mount *mp, int64_t delta);
> > 
> 

* Re: [PATCH 5/6] xfs: don't report reserved bnobt space as available
  2022-03-21 15:22   ` Brian Foster
@ 2022-03-21 20:48     ` Darrick J. Wong
  0 siblings, 0 replies; 25+ messages in thread
From: Darrick J. Wong @ 2022-03-21 20:48 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, david

On Mon, Mar 21, 2022 at 11:22:58AM -0400, Brian Foster wrote:
> On Sun, Mar 20, 2022 at 09:43:55AM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > On a modern filesystem, we don't allow userspace to allocate blocks for
> > data storage from the per-AG space reservations, the user-controlled
> > reservation pool that prevents ENOSPC in the middle of internal
> > operations, or the internal per-AG set-aside that prevents ENOSPC.
> 
> We can prevent -ENOSPC now? Neat! :)

Heh, that was a bad sentence.

I'll update the end of that to read:

"...or the internal per-AG set-aside that prevents unwanted filesystem
shutdowns due to ENOSPC during a bmap btree split."

> > Since we now consider free space btree blocks as unavailable for
> > allocation for data storage, we shouldn't report those blocks via statfs
> > either.
> > 
> 
> Might be worth a sentence or two that document the (intentional) side
> effects of this from a user perspective. I.e., that technically the
> presented free space will be a conservative estimate of actual free
> space (since allocbt blocks free up as free extents are consumed, etc.).

Ok.  Added the following at the end of the commit log:

"This makes the numbers that we return via the statfs f_bavail and
f_bfree fields a more conservative estimate of actual free space."

--D

> 
> Otherwise with that sort of commit log tweak:
> 
> Reviewed-by: Brian Foster <bfoster@redhat.com>
> 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  fs/xfs/xfs_fsops.c |    2 +-
> >  fs/xfs/xfs_super.c |    3 ++-
> >  2 files changed, 3 insertions(+), 2 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> > index 615334e4f689..863e6389c6ff 100644
> > --- a/fs/xfs/xfs_fsops.c
> > +++ b/fs/xfs/xfs_fsops.c
> > @@ -347,7 +347,7 @@ xfs_fs_counts(
> >  	cnt->allocino = percpu_counter_read_positive(&mp->m_icount);
> >  	cnt->freeino = percpu_counter_read_positive(&mp->m_ifree);
> >  	cnt->freedata = percpu_counter_read_positive(&mp->m_fdblocks) -
> > -						mp->m_alloc_set_aside;
> > +						xfs_fdblocks_unavailable(mp);
> >  
> >  	spin_lock(&mp->m_sb_lock);
> >  	cnt->freertx = mp->m_sb.sb_frextents;
> > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > index d84714e4e46a..54be9d64093e 100644
> > --- a/fs/xfs/xfs_super.c
> > +++ b/fs/xfs/xfs_super.c
> > @@ -815,7 +815,8 @@ xfs_fs_statfs(
> >  	spin_unlock(&mp->m_sb_lock);
> >  
> >  	/* make sure statp->f_bfree does not underflow */
> > -	statp->f_bfree = max_t(int64_t, fdblocks - mp->m_alloc_set_aside, 0);
> > +	statp->f_bfree = max_t(int64_t, 0,
> > +				fdblocks - xfs_fdblocks_unavailable(mp));
> >  	statp->f_bavail = statp->f_bfree;
> >  
> >  	fakeinos = XFS_FSB_TO_INO(mp, statp->f_bfree);
> > 
> 

* Re: [PATCH 1/6] xfs: document the XFS_ALLOC_AGFL_RESERVE constant
  2022-03-20 16:43 ` [PATCH 1/6] xfs: document the XFS_ALLOC_AGFL_RESERVE constant Darrick J. Wong
@ 2022-03-23 20:39   ` Dave Chinner
  2022-03-24  5:15     ` Darrick J. Wong
  0 siblings, 1 reply; 25+ messages in thread
From: Dave Chinner @ 2022-03-23 20:39 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Brian Foster, linux-xfs

On Sun, Mar 20, 2022 at 09:43:32AM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Currently, we use this undocumented macro to encode the minimum number
> of blocks needed to replenish a completely empty AGFL when an AG is
> nearly full.  This has lead to confusion on the part of the maintainers,
> so let's document what the value actually means, and move it to
> xfs_alloc.c since it's not used outside of that module.

Code change looks fine, but....

> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> Reviewed-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_alloc.c |   23 ++++++++++++++++++-----
>  fs/xfs/libxfs/xfs_alloc.h |    1 -
>  2 files changed, 18 insertions(+), 6 deletions(-)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> index 353e53b892e6..b0678e96ce61 100644
> --- a/fs/xfs/libxfs/xfs_alloc.c
> +++ b/fs/xfs/libxfs/xfs_alloc.c
> @@ -82,6 +82,19 @@ xfs_prealloc_blocks(
>  }
>  
>  /*
> + * The number of blocks per AG that we withhold from xfs_mod_fdblocks to
> + * guarantee that we can refill the AGFL prior to allocating space in a nearly
> + * full AG.  We require two blocks per free space btree because free space
> + * btrees shrink to a single block as the AG fills up, and any allocation can
> + * cause a btree split.  The rmap btree uses a per-AG reservation to withhold
> + * space from xfs_mod_fdblocks, so we do not account for that here.
> + */
> +#define XFS_ALLOCBT_AGFL_RESERVE	4

.... that comment is not correct.  This number had nothing to do
with btree split reservations and is all about preventing
oversubscription of the AG when the free space trees are completely
empty.  By the time there are enough free space records in the AG for
the bno/cnt trees to be at risk of a single level root split
(several hundred free extent records), there is enough free space to
fully allocate the 4 blocks that the AGFL needs for that split.

That is, the set aside was designed to prevent the AG selection in
xfs_alloc_vextent() from having xfs_alloc_fix_freelist() modify an
AG to fix up the AGFL, only to fail to fully fill the AGFL because,
say, there were only 2 blocks free in the AG and the AGFL needed 3.
Then we try all other AGs and get ENOSPC from all of them, and we
end up cancelling a dirty transaction and shutting down instead of
propagating ENOSPC up to the user.

This "not enough blocks to populate the AGFL" problem arose because
we can allocate extents directly from the AGFL if the free space
btree is empty, resulting in depletion of the AGFL and no free space
to re-populate it. A freed block will then go back into the btree,
and so the next allocation attempt might need 2 blocks for the AGFL,
have one block in the free space tree, and then we fail to fill
the AGFL completely because we still need one block for the AGFL and
there's no space available anymore. If all other AGs are like this
or ENOSPC, then kaboom.

IOWs, I originally added this per-ag set aside so that when the AG
was almost completely empty and we were down to allocating the last
blocks from the AG, users couldn't oversubscribe global free space by
consuming the blocks the AGs required to fill the AGFLs to allow the
last blocks that users could allocate to be allocated safely.

Hence the set aside of 4 blocks per AG was not to ensure the
freespace trees could be split, but to ensure every last possible
block could be allocated from the AG without causing the AG
selection algorithms to select and modify AGs that could not have
their AGFL fully fixed up to allocate the blocks that the caller
needed when near ENOSPC...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 2/6] xfs: actually set aside enough space to handle a bmbt split
  2022-03-20 16:43 ` [PATCH 2/6] xfs: actually set aside enough space to handle a bmbt split Darrick J. Wong
@ 2022-03-23 20:48   ` Dave Chinner
  2022-03-24  5:26     ` Darrick J. Wong
  0 siblings, 1 reply; 25+ messages in thread
From: Dave Chinner @ 2022-03-23 20:48 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Brian Foster, linux-xfs

On Sun, Mar 20, 2022 at 09:43:38AM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> The comment for xfs_alloc_set_aside indicates that we want to set aside
> enough space to handle a bmap btree split.  The code, unfortunately,
> hardcodes this to 4.
> 
> This is incorrect, since file bmap btrees can be taller than that:
> 
> xfs_db> btheight bmapbt -n 4294967295 -b 512
> bmapbt: worst case per 512-byte block: 13 records (leaf) / 13 keyptrs (node)
> level 0: 4294967295 records, 330382100 blocks
> level 1: 330382100 records, 25414008 blocks
> level 2: 25414008 records, 1954924 blocks
> level 3: 1954924 records, 150379 blocks
> level 4: 150379 records, 11568 blocks
> level 5: 11568 records, 890 blocks
> level 6: 890 records, 69 blocks
> level 7: 69 records, 6 blocks
> level 8: 6 records, 1 block
> 9 levels, 357913945 blocks total
> 
> Or, for V5 filesystems:
> 
> xfs_db> btheight bmapbt -n 4294967295 -b 1024
> bmapbt: worst case per 1024-byte block: 29 records (leaf) / 29 keyptrs (node)
> level 0: 4294967295 records, 148102321 blocks
> level 1: 148102321 records, 5106977 blocks
> level 2: 5106977 records, 176103 blocks
> level 3: 176103 records, 6073 blocks
> level 4: 6073 records, 210 blocks
> level 5: 210 records, 8 blocks
> level 6: 8 records, 1 block
> 7 levels, 153391693 blocks total
> 
> Fix this by using the actual bmap btree maxlevel value for the
> set-aside.  We subtract one because the root is always in the inode and
> hence never splits.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> Reviewed-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_alloc.c |    7 +++++--
>  fs/xfs/libxfs/xfs_sb.c    |    2 --
>  fs/xfs/xfs_mount.c        |    7 +++++++
>  3 files changed, 12 insertions(+), 4 deletions(-)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> index b0678e96ce61..747b3e45303f 100644
> --- a/fs/xfs/libxfs/xfs_alloc.c
> +++ b/fs/xfs/libxfs/xfs_alloc.c
> @@ -107,13 +107,16 @@ xfs_prealloc_blocks(
>   * aside a few blocks which will not be reserved in delayed allocation.
>   *
>   * For each AG, we need to reserve enough blocks to replenish a totally empty
> - * AGFL and 4 more to handle a potential split of the file's bmap btree.
> + * AGFL and enough to handle a potential split of a file's bmap btree.
>   */
>  unsigned int
>  xfs_alloc_set_aside(
>  	struct xfs_mount	*mp)
>  {
> -	return mp->m_sb.sb_agcount * (XFS_ALLOCBT_AGFL_RESERVE + 4);
> +	unsigned int		bmbt_splits;
> +
> +	bmbt_splits = max(mp->m_bm_maxlevels[0], mp->m_bm_maxlevels[1]) - 1;
> +	return mp->m_sb.sb_agcount * (XFS_ALLOCBT_AGFL_RESERVE + bmbt_splits);
>  }

So right now I'm trying to understand why this global space set
aside ever needed to take into account the space used by a single
BMBT split. ISTR it was done back in 2006 because the ag selection
code, alloc args and/or xfs_alloc_space_available() didn't take into
account the BMBT space via args->minleft correctly to ensure that
the AGF we select had enough space in it for both the data extent
and the followup BMBT split. Hence the original SET ASIDE (which
wasn't per AG) was just 8 blocks - 4 for the AGFL, 4 for the BMBT.

The transaction reservation takes into account the space needed by
BMBT splits so we don't over-commit global space on user allocation
anymore, args->minleft takes it into account so we don't overcommit
AG space on extent allocation, and we have the reserved block pool
to handle situations where delalloc extents are fragmented more than
the initial indirect block reservation that is taken with the
delalloc extent reservation.

So where/why is this needed anymore?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 3/6] xfs: don't include bnobt blocks when reserving free block pool
  2022-03-20 16:43 ` [PATCH 3/6] xfs: don't include bnobt blocks when reserving free block pool Darrick J. Wong
  2022-03-21 15:22   ` Brian Foster
@ 2022-03-23 20:51   ` Dave Chinner
  1 sibling, 0 replies; 25+ messages in thread
From: Dave Chinner @ 2022-03-23 20:51 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Brian Foster, linux-xfs

On Sun, Mar 20, 2022 at 09:43:43AM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> xfs_reserve_blocks controls the size of the user-visible free space
> reserve pool.  Given the difference between the current and requested
> pool sizes, it will try to reserve free space from fdblocks.  However,
> the amount requested from fdblocks is also constrained by the amount of
> space that we think xfs_mod_fdblocks will give us.  We'll keep trying to
> reserve space so long as xfs_mod_fdblocks returns ENOSPC.
> 
> In commit fd43cf600cf6, we decided that xfs_mod_fdblocks should not hand
> out the "free space" used by the free space btrees, because some portion
> of the free space btrees hold in reserve space for future btree
> expansion.  Unfortunately, xfs_reserve_blocks' estimation of the number
> of blocks that it could request from xfs_mod_fdblocks was not updated to
> include m_allocbt_blks, so if space is extremely low, the caller hangs.
> 
> Fix this by creating a function to estimate the number of blocks that
> can be reserved from fdblocks, which needs to exclude the set-aside and
> m_allocbt_blks.
> 
> Found by running xfs/306 (which formats a single-AG 20MB filesystem)
> with an fstests configuration that specifies a 1k blocksize and a
> specially crafted log size that will consume 7/8 of the space (17920
> blocks, specifically) in that AG.
> 
> Cc: Brian Foster <bfoster@redhat.com>
> Fixes: fd43cf600cf6 ("xfs: set aside allocation btree blocks from block reservation")
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Looks fine.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 4/6] xfs: fix infinite loop when reserving free block pool
  2022-03-20 16:43 ` [PATCH 4/6] xfs: fix infinite loop " Darrick J. Wong
@ 2022-03-23 21:11   ` Dave Chinner
  2022-03-24  5:24     ` Darrick J. Wong
  0 siblings, 1 reply; 25+ messages in thread
From: Dave Chinner @ 2022-03-23 21:11 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Brian Foster, linux-xfs

On Sun, Mar 20, 2022 at 09:43:49AM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Don't spin in an infinite loop trying to reserve blocks -- if we can't
> do it after 30 tries, we're racing with a nearly full filesystem, so
> just give up.
> 
> Cc: Brian Foster <bfoster@redhat.com>
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> Reviewed-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/xfs_fsops.c |   12 ++++++++++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
> 
> 
> diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> index 710e857bb825..615334e4f689 100644
> --- a/fs/xfs/xfs_fsops.c
> +++ b/fs/xfs/xfs_fsops.c
> @@ -379,6 +379,7 @@ xfs_reserve_blocks(
>  	int64_t			fdblks_delta = 0;
>  	uint64_t		request;
>  	int64_t			free;
> +	unsigned int		tries;
>  	int			error = 0;
>  
>  	/* If inval is null, report current values and return */
> @@ -430,9 +431,16 @@ xfs_reserve_blocks(
>  	 * If the request is larger than the current reservation, reserve the
>  	 * blocks before we update the reserve counters. Sample m_fdblocks and
>  	 * perform a partial reservation if the request exceeds free space.
> +	 *
> +	 * The loop body estimates how many blocks it can request from fdblocks
> +	 * to stash in the reserve pool.  This is a classic TOCTOU race since
> +	 * fdblocks updates are not always coordinated via m_sb_lock.  We also
> +	 * cannot tell if @free remaining unchanged between iterations is due
> +	 * to an idle system or freed blocks being consumed immediately, so
> +	 * we'll try a finite number of times to satisfy the request.
>  	 */
>  	error = -ENOSPC;
> -	do {
> +	for (tries = 0; tries < 30 && error == -ENOSPC; tries++) {
>  		free = percpu_counter_sum(&mp->m_fdblocks) -
>  						xfs_fdblocks_unavailable(mp);
>  		if (free <= 0)
> @@ -459,7 +467,7 @@ xfs_reserve_blocks(
>  		spin_unlock(&mp->m_sb_lock);
>  		error = xfs_mod_fdblocks(mp, -fdblks_delta, 0);
>  		spin_lock(&mp->m_sb_lock);
> -	} while (error == -ENOSPC);
> +	}

So the problem here is that if we don't get all of the reservation,
we get none of it, so we try again. This seems like we should simply
punch the error back out to userspace and let it retry rather than
try a few times and then fail anyway. Either way, userspace has to
handle failure to reserve enough blocks.

The other alternative here is that we just change the reserve block
limit when ENOSPC occurs and allow the xfs_mod_fdblocks() refill
code to just fill up the pool as space is freed. That would greatly
simplify the code - we do a single attempt to reserve the new space,
and if it fails we don't care because the reserve space gets
refilled before free space is made available again.  I think that's
preferable to having an arbitrary number of retries like this that's
going to end up failing anyway.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 5/6] xfs: don't report reserved bnobt space as available
  2022-03-20 16:43 ` [PATCH 5/6] xfs: don't report reserved bnobt space as available Darrick J. Wong
  2022-03-21 15:22   ` Brian Foster
@ 2022-03-23 21:12   ` Dave Chinner
  1 sibling, 0 replies; 25+ messages in thread
From: Dave Chinner @ 2022-03-23 21:12 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, bfoster

On Sun, Mar 20, 2022 at 09:43:55AM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> On a modern filesystem, we don't allow userspace to allocate blocks for
> data storage from the per-AG space reservations, the user-controlled
> reservation pool that prevents ENOSPC in the middle of internal
> operations, or the internal per-AG set-aside that prevents ENOSPC.
> Since we now consider free space btree blocks as unavailable for
> allocation for data storage, we shouldn't report those blocks via statfs
> either.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Seems reasonable.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 6/6] xfs: rename "alloc_set_aside" to be more descriptive
  2022-03-20 16:44 ` [PATCH 6/6] xfs: rename "alloc_set_aside" to be more descriptive Darrick J. Wong
@ 2022-03-23 21:21   ` Dave Chinner
  0 siblings, 0 replies; 25+ messages in thread
From: Dave Chinner @ 2022-03-23 21:21 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Brian Foster, linux-xfs

On Sun, Mar 20, 2022 at 09:44:00AM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> We've established in this patchset that the "alloc_set_aside" pool is
> actually used to ensure that a bmbt split always succeeds so that the

I don't think this is correct. See my previous comments about the
reason for set_aside existing and why I think the bmbt portion of it
isn't valid anymore.

> filesystem won't run out of space mid-transaction and crash.

That's exactly what transaction reservations and
args->minleft are for, yes?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 1/6] xfs: document the XFS_ALLOC_AGFL_RESERVE constant
  2022-03-23 20:39   ` Dave Chinner
@ 2022-03-24  5:15     ` Darrick J. Wong
  2022-03-24  5:58       ` Dave Chinner
  0 siblings, 1 reply; 25+ messages in thread
From: Darrick J. Wong @ 2022-03-24  5:15 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Brian Foster, linux-xfs

On Thu, Mar 24, 2022 at 07:39:16AM +1100, Dave Chinner wrote:
> On Sun, Mar 20, 2022 at 09:43:32AM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Currently, we use this undocumented macro to encode the minimum number
> > of blocks needed to replenish a completely empty AGFL when an AG is
> > nearly full.  This has lead to confusion on the part of the maintainers,
> > so let's document what the value actually means, and move it to
> > xfs_alloc.c since it's not used outside of that module.
> 
> Code change looks fine, but....
> 
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > Reviewed-by: Brian Foster <bfoster@redhat.com>
> > ---
> >  fs/xfs/libxfs/xfs_alloc.c |   23 ++++++++++++++++++-----
> >  fs/xfs/libxfs/xfs_alloc.h |    1 -
> >  2 files changed, 18 insertions(+), 6 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> > index 353e53b892e6..b0678e96ce61 100644
> > --- a/fs/xfs/libxfs/xfs_alloc.c
> > +++ b/fs/xfs/libxfs/xfs_alloc.c
> > @@ -82,6 +82,19 @@ xfs_prealloc_blocks(
> >  }
> >  
> >  /*
> > + * The number of blocks per AG that we withhold from xfs_mod_fdblocks to
> > + * guarantee that we can refill the AGFL prior to allocating space in a nearly
> > + * full AG.  We require two blocks per free space btree because free space
> > + * btrees shrink to a single block as the AG fills up, and any allocation can
> > + * cause a btree split.  The rmap btree uses a per-AG reservation to withhold
> > + * space from xfs_mod_fdblocks, so we do not account for that here.
> > + */
> > +#define XFS_ALLOCBT_AGFL_RESERVE	4
> 
> .... that comment is not correct.  This number had nothing to do
> with btree split reservations and is all about preventing
> oversubscription of the AG when the free space trees are completely
> empty.  By the time there are enough free space records in the AG for
> the bno/cnt trees to be at risk of a single level root split
> (several hundred free extent records), there is enough free space to
> fully allocate the 4 blocks that the AGFL needs for that split.
> 
> That is, the set aside was designed to prevent the AG selection in
> xfs_alloc_vextent() from having xfs_alloc_fix_freelist() modify an
> AG to fix up the AGFL and then fail to fully fill the AGFL because,
> say, there were only 2 blocks free in the AG and the AGFL needed 3.
> Then we try all other AGs and get ENOSPC from all of them, and we
> end up cancelling a dirty transaction and shutting down instead of
> propagating ENOSPC up to the user.
> 
> This "not enough blocks to populate the AGFL" problem arose because
> we can allocate extents directly from the AGFL if the free space
> btree is empty, resulting in depletion of the AGFL and no free space
> to re-populate it. A freed block will then go back into the btree,
> and so the next allocation attempt might need 2 blocks for the AGFL,
> have one block in the free space tree, and then we fail to fill
> the AGFL completely because we still need one block for the AGFL and
> there's no space available anymore. If all other AGs are like this
> or ENOSPC, then kaboom.
> 
> IOWs, I originally added this per-AG set aside so that when the AG's
> free space was almost completely exhausted and we were down to
> allocating the last blocks from the AG, users couldn't oversubscribe
> global free space by consuming the blocks the AGs required to fill
> the AGFLs and allow the last user-allocatable blocks to be allocated
> safely.
> 
> Hence the set aside of 4 blocks per AG was not to ensure the
> freespace trees could be split, but to ensure every last possible
> block could be allocated from the AG without causing the AG
> selection algorithms to select and modify AGs that could not have
> their AGFL fully fixed up to allocate the blocks that the caller
> needed when near ENOSPC...

Hmmm, thanks for the context!  What do you think about this revised
comment?

/*
 * The number of blocks per AG that we withhold from xfs_mod_fdblocks to
 * guarantee that we can refill the AGFL prior to allocating space in a
 * nearly full AG.  Although the space described by the free space
 * btrees, the blocks used by the freesp btrees themselves, and the
 * blocks owned by the AGFL are all added to the ondisk fdblocks, it's
 * a mistake to let the ondisk free
 * space in the AG drop so low that the free space btrees cannot refill
 * an empty AGFL up to the minimum level.  Rather than grind through
 * empty AGs until the fs goes down, we subtract this many AG blocks
 * from the incore fdblocks to ENOSPC early.
 */
#define XFS_ALLOCBT_AGFL_RESERVE	4
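
And for context, here's roughly how I picture this per-AG constant
rolling up into the global accounting.  This is only a sketch built
from the helpers and fields we've been discussing in this thread
(xfs_fdblocks_unavailable(), m_alloc_set_aside, m_allocbt_blks), not a
quote of the patches:

/*
 * Sketch only: the global count of blocks that xfs_reserve_blocks()
 * and statfs would treat as unavailable.  m_alloc_set_aside is assumed
 * to be the per-AG sum computed by xfs_alloc_set_aside(), and
 * m_allocbt_blks the running count of blocks used by the bno/cnt
 * btrees.
 */
static inline uint64_t
xfs_fdblocks_unavailable(
	struct xfs_mount	*mp)
{
	return mp->m_alloc_set_aside + atomic64_read(&mp->m_allocbt_blks);
}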

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 4/6] xfs: fix infinite loop when reserving free block pool
  2022-03-23 21:11   ` Dave Chinner
@ 2022-03-24  5:24     ` Darrick J. Wong
  2022-03-24  6:21       ` Dave Chinner
  0 siblings, 1 reply; 25+ messages in thread
From: Darrick J. Wong @ 2022-03-24  5:24 UTC (permalink / raw)
  To: Dave Chinner, g; +Cc: Brian Foster, linux-xfs

On Thu, Mar 24, 2022 at 08:11:47AM +1100, Dave Chinner wrote:
> On Sun, Mar 20, 2022 at 09:43:49AM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Don't spin in an infinite loop trying to reserve blocks -- if we can't
> > do it after 30 tries, we're racing with a nearly full filesystem, so
> > just give up.
> > 
> > Cc: Brian Foster <bfoster@redhat.com>
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > Reviewed-by: Brian Foster <bfoster@redhat.com>
> > ---
> >  fs/xfs/xfs_fsops.c |   12 ++++++++++--
> >  1 file changed, 10 insertions(+), 2 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> > index 710e857bb825..615334e4f689 100644
> > --- a/fs/xfs/xfs_fsops.c
> > +++ b/fs/xfs/xfs_fsops.c
> > @@ -379,6 +379,7 @@ xfs_reserve_blocks(
> >  	int64_t			fdblks_delta = 0;
> >  	uint64_t		request;
> >  	int64_t			free;
> > +	unsigned int		tries;
> >  	int			error = 0;
> >  
> >  	/* If inval is null, report current values and return */
> > @@ -430,9 +431,16 @@ xfs_reserve_blocks(
> >  	 * If the request is larger than the current reservation, reserve the
> >  	 * blocks before we update the reserve counters. Sample m_fdblocks and
> >  	 * perform a partial reservation if the request exceeds free space.
> > +	 *
> > +	 * The loop body estimates how many blocks it can request from fdblocks
> > +	 * to stash in the reserve pool.  This is a classic TOCTOU race since
> > +	 * fdblocks updates are not always coordinated via m_sb_lock.  We also
> > +	 * cannot tell if @free remaining unchanged between iterations is due
> > +	 * to an idle system or freed blocks being consumed immediately, so
> > +	 * we'll try a finite number of times to satisfy the request.
> >  	 */
> >  	error = -ENOSPC;
> > -	do {
> > +	for (tries = 0; tries < 30 && error == -ENOSPC; tries++) {
> >  		free = percpu_counter_sum(&mp->m_fdblocks) -
> >  						xfs_fdblocks_unavailable(mp);
> >  		if (free <= 0)
> > @@ -459,7 +467,7 @@ xfs_reserve_blocks(
> >  		spin_unlock(&mp->m_sb_lock);
> >  		error = xfs_mod_fdblocks(mp, -fdblks_delta, 0);
> >  		spin_lock(&mp->m_sb_lock);
> > -	} while (error == -ENOSPC);
> > +	}
> 
> So the problem here is that if we don't get all of the reservation,
> we get none of it, so we try again. This seems like we should simply
> punch the error back out to userspace and let it retry rather than
> try a few times and then fail anyway. Either way, userspace has to
> handle failure to reserve enough blocks.
> 
> The other alternative here is that we just change the reserve block
> limit when ENOSPC occurs and allow the xfs_mod_fdblocks() refill
> code to just fill up the pool as space is freed. That would greatly
> simplify the code - we do a single attempt to reserve the new space,
> and if it fails we don't care because the reserve space gets
> refilled before free space is made available again.  I think that's
> preferable to having an arbitrary number of retries like this that's
> going to end up failing anyway.

How about I try the following tomorrow?

spin_lock(...);
mp->m_resblks = request;
free = percpu_counter_sum(&mp->m_fdblocks) - xfs_fdblocks_unavailable(mp);
if (request < mp->m_resblks_avail && free > 0) {
	ask = min(request - mp->m_resblks_avail, free);

	spin_unlock(...);
	error = xfs_mod_fdblocks(mp, -ask, 0);
	spin_lock(...);

	if (!error)
		mp->m_resblks_avail += ask;
}
spin_unlock(...);
return error;

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/6] xfs: actually set aside enough space to handle a bmbt split
  2022-03-23 20:48   ` Dave Chinner
@ 2022-03-24  5:26     ` Darrick J. Wong
  2022-03-24  6:00       ` Dave Chinner
  0 siblings, 1 reply; 25+ messages in thread
From: Darrick J. Wong @ 2022-03-24  5:26 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Brian Foster, linux-xfs

On Thu, Mar 24, 2022 at 07:48:56AM +1100, Dave Chinner wrote:
> On Sun, Mar 20, 2022 at 09:43:38AM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > The comment for xfs_alloc_set_aside indicates that we want to set aside
> > enough space to handle a bmap btree split.  The code, unfortunately,
> > hardcodes this to 4.
> > 
> > This is incorrect, since file bmap btrees can be taller than that:
> > 
> > xfs_db> btheight bmapbt -n 4294967295 -b 512
> > bmapbt: worst case per 512-byte block: 13 records (leaf) / 13 keyptrs (node)
> > level 0: 4294967295 records, 330382100 blocks
> > level 1: 330382100 records, 25414008 blocks
> > level 2: 25414008 records, 1954924 blocks
> > level 3: 1954924 records, 150379 blocks
> > level 4: 150379 records, 11568 blocks
> > level 5: 11568 records, 890 blocks
> > level 6: 890 records, 69 blocks
> > level 7: 69 records, 6 blocks
> > level 8: 6 records, 1 block
> > 9 levels, 357913945 blocks total
> > 
> > Or, for V5 filesystems:
> > 
> > xfs_db> btheight bmapbt -n 4294967295 -b 1024
> > bmapbt: worst case per 1024-byte block: 29 records (leaf) / 29 keyptrs (node)
> > level 0: 4294967295 records, 148102321 blocks
> > level 1: 148102321 records, 5106977 blocks
> > level 2: 5106977 records, 176103 blocks
> > level 3: 176103 records, 6073 blocks
> > level 4: 6073 records, 210 blocks
> > level 5: 210 records, 8 blocks
> > level 6: 8 records, 1 block
> > 7 levels, 153391693 blocks total
> > 
> > Fix this by using the actual bmap btree maxlevel value for the
> > set-aside.  We subtract one because the root is always in the inode and
> > hence never splits.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > Reviewed-by: Brian Foster <bfoster@redhat.com>
> > ---
> >  fs/xfs/libxfs/xfs_alloc.c |    7 +++++--
> >  fs/xfs/libxfs/xfs_sb.c    |    2 --
> >  fs/xfs/xfs_mount.c        |    7 +++++++
> >  3 files changed, 12 insertions(+), 4 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> > index b0678e96ce61..747b3e45303f 100644
> > --- a/fs/xfs/libxfs/xfs_alloc.c
> > +++ b/fs/xfs/libxfs/xfs_alloc.c
> > @@ -107,13 +107,16 @@ xfs_prealloc_blocks(
> >   * aside a few blocks which will not be reserved in delayed allocation.
> >   *
> >   * For each AG, we need to reserve enough blocks to replenish a totally empty
> > - * AGFL and 4 more to handle a potential split of the file's bmap btree.
> > + * AGFL and enough to handle a potential split of a file's bmap btree.
> >   */
> >  unsigned int
> >  xfs_alloc_set_aside(
> >  	struct xfs_mount	*mp)
> >  {
> > -	return mp->m_sb.sb_agcount * (XFS_ALLOCBT_AGFL_RESERVE + 4);
> > +	unsigned int		bmbt_splits;
> > +
> > +	bmbt_splits = max(mp->m_bm_maxlevels[0], mp->m_bm_maxlevels[1]) - 1;
> > +	return mp->m_sb.sb_agcount * (XFS_ALLOCBT_AGFL_RESERVE + bmbt_splits);
> >  }
> 
> So right now I'm trying to understand why this global space set
> aside ever needed to take into account the space used by a single
> BMBT split. ISTR it was done back in 2006 because the ag selection
> code, alloc args and/or xfs_alloc_space_available() didn't take into
> account the BMBT space via args->minleft correctly to ensure that
> the AGF we select had enough space in it for both the data extent
> and the followup BMBT split. Hence the original SET ASIDE (which
> wasn't per AG) was just 8 blocks - 4 for the AGFL, 4 for the BMBT.
> 
> The transaction reservation takes into account the space needed by
> BMBT splits so we don't over-commit global space on user allocation
> anymore, args->minleft takes it into account so we don't overcommit
> AG space on extent allocation, and we have the reserved block pool
> to handle situations where delalloc extents are fragmented more than
> the initial indirect block reservation that is taken with the
> delalloc extent reservation.
> 
> So where/why is this needed anymore?

Beats me.  I started by reading the comment and thought "Gee, that's not
true of bmbts!" :(

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/6] xfs: document the XFS_ALLOC_AGFL_RESERVE constant
  2022-03-24  5:15     ` Darrick J. Wong
@ 2022-03-24  5:58       ` Dave Chinner
  0 siblings, 0 replies; 25+ messages in thread
From: Dave Chinner @ 2022-03-24  5:58 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Brian Foster, linux-xfs

On Wed, Mar 23, 2022 at 10:15:42PM -0700, Darrick J. Wong wrote:
> On Thu, Mar 24, 2022 at 07:39:16AM +1100, Dave Chinner wrote:
> > On Sun, Mar 20, 2022 at 09:43:32AM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > > 
> > > Currently, we use this undocumented macro to encode the minimum number
> > > of blocks needed to replenish a completely empty AGFL when an AG is
> > > nearly full.  This has led to confusion on the part of the maintainers,
> > > so let's document what the value actually means, and move it to
> > > xfs_alloc.c since it's not used outside of that module.
> > 
> > Code change looks fine, but....
> > 
> > > 
> > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > Reviewed-by: Brian Foster <bfoster@redhat.com>
> > > ---
> > >  fs/xfs/libxfs/xfs_alloc.c |   23 ++++++++++++++++++-----
> > >  fs/xfs/libxfs/xfs_alloc.h |    1 -
> > >  2 files changed, 18 insertions(+), 6 deletions(-)
> > > 
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> > > index 353e53b892e6..b0678e96ce61 100644
> > > --- a/fs/xfs/libxfs/xfs_alloc.c
> > > +++ b/fs/xfs/libxfs/xfs_alloc.c
> > > @@ -82,6 +82,19 @@ xfs_prealloc_blocks(
> > >  }
> > >  
> > >  /*
> > > + * The number of blocks per AG that we withhold from xfs_mod_fdblocks to
> > > + * guarantee that we can refill the AGFL prior to allocating space in a nearly
> > > + * full AG.  We require two blocks per free space btree because free space
> > > + * btrees shrink to a single block as the AG fills up, and any allocation can
> > > + * cause a btree split.  The rmap btree uses a per-AG reservation to withhold
> > > + * space from xfs_mod_fdblocks, so we do not account for that here.
> > > + */
> > > +#define XFS_ALLOCBT_AGFL_RESERVE	4
> > 
> > .... that comment is not correct.  This number had nothing to do
> > with btree split reservations and is all about preventing
> > oversubscription of the AG when the free space trees are completely
> > empty.  By the time there are enough free space records in the AG for
> > the bno/cnt trees to be at risk of a single level root split
> > (several hundred free extent records), there is enough free space to
> > fully allocate the 4 blocks that the AGFL needs for that split.
> > 
> > That is, the set aside was designed to prevent the AG selection in
> > xfs_alloc_vextent() from having xfs_alloc_fix_freelist() modify an
> > AG to fix up the AGFL and then fail to fully fill the AGFL because,
> > say, there were only 2 blocks free in the AG and the AGFL needed 3.
> > Then we try all other AGs and get ENOSPC from all of them, and we
> > end up cancelling a dirty transaction and shutting down instead of
> > propagating ENOSPC up to the user.
> > 
> > This "not enough blocks to populate the AGFL" problem arose because
> > we can allocate extents directly from the AGFL if the free space
> > btree is empty, resulting in depletion of the AGFL and no free space
> > to re-populate it. A freed block will then go back into the btree,
> > and so the next allocation attempt might need 2 blocks for the AGFL,
> > have one block in the free space tree, and then we fail to fill
> > the AGFL completely because we still need one block for the AGFL and
> > there's no space available anymore. If all other AGs are like this
> > or ENOSPC, then kaboom.
> > 
> > IOWs, I originally added this per-AG set aside so that when the AG's
> > free space was almost completely exhausted and we were down to
> > allocating the last blocks from the AG, users couldn't oversubscribe
> > global free space by consuming the blocks the AGs required to fill
> > the AGFLs and allow the last user-allocatable blocks to be allocated
> > safely.
> > 
> > Hence the set aside of 4 blocks per AG was not to ensure the
> > freespace trees could be split, but to ensure every last possible
> > block could be allocated from the AG without causing the AG
> > selection algorithms to select and modify AGs that could not have
> > their AGFL fully fixed up to allocate the blocks that the caller
> > needed when near ENOSPC...
> 
> Hmmm, thanks for the context!  What do you think about this revised
> comment?
> 
> /*
>  * The number of blocks per AG that we withhold from xfs_mod_fdblocks to
>  * guarantee that we can refill the AGFL prior to allocating space in a
>  * nearly full AG.  Although the space described by the free space
>  * btrees, the blocks used by the freesp btrees themselves, and the
>  * blocks owned by the AGFL are all added to the ondisk fdblocks, it's
>  * a mistake to let the ondisk free
>  * space in the AG drop so low that the free space btrees cannot refill
>  * an empty AGFL up to the minimum level.  Rather than grind through
>  * empty AGs until the fs goes down, we subtract this many AG blocks
>  * from the incore fdblocks to ENOSPC early.

s/ENOSPC early./ensure user allocation does not overcommit the space
the filesystem needs for the AGFLs./

And it's all good by me.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/6] xfs: actually set aside enough space to handle a bmbt split
  2022-03-24  5:26     ` Darrick J. Wong
@ 2022-03-24  6:00       ` Dave Chinner
  0 siblings, 0 replies; 25+ messages in thread
From: Dave Chinner @ 2022-03-24  6:00 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Brian Foster, linux-xfs

On Wed, Mar 23, 2022 at 10:26:14PM -0700, Darrick J. Wong wrote:
> On Thu, Mar 24, 2022 at 07:48:56AM +1100, Dave Chinner wrote:
> > On Sun, Mar 20, 2022 at 09:43:38AM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > > 
> > > The comment for xfs_alloc_set_aside indicates that we want to set aside
> > > enough space to handle a bmap btree split.  The code, unfortunately,
> > > hardcodes this to 4.
> > > 
> > > This is incorrect, since file bmap btrees can be taller than that:
> > > 
> > > xfs_db> btheight bmapbt -n 4294967295 -b 512
> > > bmapbt: worst case per 512-byte block: 13 records (leaf) / 13 keyptrs (node)
> > > level 0: 4294967295 records, 330382100 blocks
> > > level 1: 330382100 records, 25414008 blocks
> > > level 2: 25414008 records, 1954924 blocks
> > > level 3: 1954924 records, 150379 blocks
> > > level 4: 150379 records, 11568 blocks
> > > level 5: 11568 records, 890 blocks
> > > level 6: 890 records, 69 blocks
> > > level 7: 69 records, 6 blocks
> > > level 8: 6 records, 1 block
> > > 9 levels, 357913945 blocks total
> > > 
> > > Or, for V5 filesystems:
> > > 
> > > xfs_db> btheight bmapbt -n 4294967295 -b 1024
> > > bmapbt: worst case per 1024-byte block: 29 records (leaf) / 29 keyptrs (node)
> > > level 0: 4294967295 records, 148102321 blocks
> > > level 1: 148102321 records, 5106977 blocks
> > > level 2: 5106977 records, 176103 blocks
> > > level 3: 176103 records, 6073 blocks
> > > level 4: 6073 records, 210 blocks
> > > level 5: 210 records, 8 blocks
> > > level 6: 8 records, 1 block
> > > 7 levels, 153391693 blocks total
> > > 
> > > Fix this by using the actual bmap btree maxlevel value for the
> > > set-aside.  We subtract one because the root is always in the inode and
> > > hence never splits.
> > > 
> > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > Reviewed-by: Brian Foster <bfoster@redhat.com>
> > > ---
> > >  fs/xfs/libxfs/xfs_alloc.c |    7 +++++--
> > >  fs/xfs/libxfs/xfs_sb.c    |    2 --
> > >  fs/xfs/xfs_mount.c        |    7 +++++++
> > >  3 files changed, 12 insertions(+), 4 deletions(-)
> > > 
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> > > index b0678e96ce61..747b3e45303f 100644
> > > --- a/fs/xfs/libxfs/xfs_alloc.c
> > > +++ b/fs/xfs/libxfs/xfs_alloc.c
> > > @@ -107,13 +107,16 @@ xfs_prealloc_blocks(
> > >   * aside a few blocks which will not be reserved in delayed allocation.
> > >   *
> > >   * For each AG, we need to reserve enough blocks to replenish a totally empty
> > > - * AGFL and 4 more to handle a potential split of the file's bmap btree.
> > > + * AGFL and enough to handle a potential split of a file's bmap btree.
> > >   */
> > >  unsigned int
> > >  xfs_alloc_set_aside(
> > >  	struct xfs_mount	*mp)
> > >  {
> > > -	return mp->m_sb.sb_agcount * (XFS_ALLOCBT_AGFL_RESERVE + 4);
> > > +	unsigned int		bmbt_splits;
> > > +
> > > +	bmbt_splits = max(mp->m_bm_maxlevels[0], mp->m_bm_maxlevels[1]) - 1;
> > > +	return mp->m_sb.sb_agcount * (XFS_ALLOCBT_AGFL_RESERVE + bmbt_splits);
> > >  }
> > 
> > So right now I'm trying to understand why this global space set
> > aside ever needed to take into account the space used by a single
> > BMBT split. ISTR it was done back in 2006 because the ag selection
> > code, alloc args and/or xfs_alloc_space_available() didn't take into
> > account the BMBT space via args->minleft correctly to ensure that
> > the AGF we select had enough space in it for both the data extent
> > and the followup BMBT split. Hence the original SET ASIDE (which
> > wasn't per AG) was just 8 blocks - 4 for the AGFL, 4 for the BMBT.
> > 
> > The transaction reservation takes into account the space needed by
> > BMBT splits so we don't over-commit global space on user allocation
> > anymore, args->minleft takes it into account so we don't overcommit
> > AG space on extent allocation, and we have the reserved block pool
> > > to handle situations where delalloc extents are fragmented more than
> > the initial indirect block reservation that is taken with the
> > delalloc extent reservation.
> > 
> > So where/why is this needed anymore?
> 
> Beats me.  I started by reading the comment and thought "Gee, that's not
> true of bmbts!" :(

Yeah. Time and knowledge are showing through here. I've been running
ENOSPC tests with the BMBT reservation part removed now for several
hours and I haven't seen any issues as a result of removing it. I'll
keep it running though to see if anything falls out, but my initial
impression is that we don't obviously need it anymore.
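
To be clear, the removal I'm testing amounts to dropping the bmbt term
from the xfs_alloc_set_aside() quoted above.  A sketch of that change
(not the actual test patch):

unsigned int
xfs_alloc_set_aside(
	struct xfs_mount	*mp)
{
	/* Keep only the per-AG AGFL refill reserve; no bmbt component. */
	return mp->m_sb.sb_agcount * XFS_ALLOCBT_AGFL_RESERVE;
}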

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 4/6] xfs: fix infinite loop when reserving free block pool
  2022-03-24  5:24     ` Darrick J. Wong
@ 2022-03-24  6:21       ` Dave Chinner
  0 siblings, 0 replies; 25+ messages in thread
From: Dave Chinner @ 2022-03-24  6:21 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: g, Brian Foster, linux-xfs

On Wed, Mar 23, 2022 at 10:24:29PM -0700, Darrick J. Wong wrote:
> On Thu, Mar 24, 2022 at 08:11:47AM +1100, Dave Chinner wrote:
> > On Sun, Mar 20, 2022 at 09:43:49AM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > > 
> > > Don't spin in an infinite loop trying to reserve blocks -- if we can't
> > > do it after 30 tries, we're racing with a nearly full filesystem, so
> > > just give up.
> > > 
> > > Cc: Brian Foster <bfoster@redhat.com>
> > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > Reviewed-by: Brian Foster <bfoster@redhat.com>
> > > ---
> > >  fs/xfs/xfs_fsops.c |   12 ++++++++++--
> > >  1 file changed, 10 insertions(+), 2 deletions(-)
> > > 
> > > 
> > > diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> > > index 710e857bb825..615334e4f689 100644
> > > --- a/fs/xfs/xfs_fsops.c
> > > +++ b/fs/xfs/xfs_fsops.c
> > > @@ -379,6 +379,7 @@ xfs_reserve_blocks(
> > >  	int64_t			fdblks_delta = 0;
> > >  	uint64_t		request;
> > >  	int64_t			free;
> > > +	unsigned int		tries;
> > >  	int			error = 0;
> > >  
> > >  	/* If inval is null, report current values and return */
> > > @@ -430,9 +431,16 @@ xfs_reserve_blocks(
> > >  	 * If the request is larger than the current reservation, reserve the
> > >  	 * blocks before we update the reserve counters. Sample m_fdblocks and
> > >  	 * perform a partial reservation if the request exceeds free space.
> > > +	 *
> > > +	 * The loop body estimates how many blocks it can request from fdblocks
> > > +	 * to stash in the reserve pool.  This is a classic TOCTOU race since
> > > +	 * fdblocks updates are not always coordinated via m_sb_lock.  We also
> > > +	 * cannot tell if @free remaining unchanged between iterations is due
> > > +	 * to an idle system or freed blocks being consumed immediately, so
> > > +	 * we'll try a finite number of times to satisfy the request.
> > >  	 */
> > >  	error = -ENOSPC;
> > > -	do {
> > > +	for (tries = 0; tries < 30 && error == -ENOSPC; tries++) {
> > >  		free = percpu_counter_sum(&mp->m_fdblocks) -
> > >  						xfs_fdblocks_unavailable(mp);
> > >  		if (free <= 0)
> > > @@ -459,7 +467,7 @@ xfs_reserve_blocks(
> > >  		spin_unlock(&mp->m_sb_lock);
> > >  		error = xfs_mod_fdblocks(mp, -fdblks_delta, 0);
> > >  		spin_lock(&mp->m_sb_lock);
> > > -	} while (error == -ENOSPC);
> > > +	}
> > 
> > So the problem here is that if we don't get all of the reservation,
> > we get none of it, so we try again. This seems like we should simply
> > punch the error back out to userspace and let it retry rather than
> > try a few times and then fail anyway. Either way, userspace has to
> > handle failure to reserve enough blocks.
> > 
> > The other alternative here is that we just change the reserve block
> > limit when ENOSPC occurs and allow the xfs_mod_fdblocks() refill
> > code to just fill up the pool as space is freed. That would greatly
> > simplify the code - we do a single attempt to reserve the new space,
> > and if it fails we don't care because the reserve space gets
> > refilled before free space is made available again.  I think that's
> > preferable to having an arbitrary number of retries like this that's
> > going to end up failing anyway.
> 
> How about I try the following tomorrow?
> 
> spin_lock(...);
> mp->m_resblks = request;
> free = percpu_counter_sum(&mp->m_fdblocks) - xfs_fdblocks_unavailable(mp);
> if (request < mp->m_resblks_avail && free > 0) {
              > (i.e. this comparison wants to be ">", not "<")

> 	ask = min(request - mp->m_resblks_avail, free);
> 
> 	spin_unlock(...);
> 	error = xfs_mod_fdblocks(mp, -ask, 0);
> 	spin_lock(...);
> 
> 	if (!error)
> 		mp->m_resblks_avail += ask;
> }
> spin_unlock(...);
> return error;

Yeah, I think that's fine, though to ensure things end up as correct
as possible I think we want:

	spin_unlock(...);
	error = xfs_mod_fdblocks(mp, -ask, 0);
	if (!error)
		xfs_mod_fdblocks(mp, ask, 0);

because if we are racing with other operations, the second
xfs_mod_fdblocks() call will put as much of @ask as needed back on
the resblks counter, and put any left over back on the percpu
counter....
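
Putting your sketch and that tweak together, the whole update might
look something like the function below.  It's only a sketch with an
invented name (xfs_reserve_blocks_sketch), the comparison flipped so
we only pull from fdblocks when the request exceeds the current
reservation, and no handling for shrinking an oversized pool; it's not
the final patch:

/*
 * Sketch only -- invented name, not a committed function.  Shrinking
 * an oversized reserve pool is ignored here, as in the pseudocode
 * above.
 */
int
xfs_reserve_blocks_sketch(
	struct xfs_mount	*mp,
	uint64_t		request)
{
	int64_t			delta, fdblks_delta, free;
	int			error = 0;

	spin_lock(&mp->m_sb_lock);
	mp->m_resblks = request;
	free = percpu_counter_sum(&mp->m_fdblocks) -
					xfs_fdblocks_unavailable(mp);
	delta = request - mp->m_resblks_avail;
	if (delta > 0 && free > 0) {
		/* Take what we can get; this may still return -ENOSPC. */
		fdblks_delta = min(free, delta);
		spin_unlock(&mp->m_sb_lock);
		error = xfs_mod_fdblocks(mp, -fdblks_delta, 0);
		if (!error) {
			/*
			 * Give the blocks straight back; the refill logic in
			 * xfs_mod_fdblocks() tops up m_resblks_avail first
			 * and returns any excess to the free space counter.
			 */
			error = xfs_mod_fdblocks(mp, fdblks_delta, 0);
		}
		spin_lock(&mp->m_sb_lock);
	}
	spin_unlock(&mp->m_sb_lock);
	return error;
}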

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 4/6] xfs: fix infinite loop when reserving free block pool
  2022-03-17 21:21 ` [PATCH 4/6] xfs: fix infinite loop when reserving free block pool Darrick J. Wong
@ 2022-03-18 12:18   ` Brian Foster
  0 siblings, 0 replies; 25+ messages in thread
From: Brian Foster @ 2022-03-18 12:18 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, david

On Thu, Mar 17, 2022 at 02:21:17PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Don't spin in an infinite loop trying to reserve blocks -- if we can't
> do it after 30 tries, we're racing with a nearly full filesystem, so
> just give up.
> 
> Cc: Brian Foster <bfoster@redhat.com>
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_fsops.c |   12 ++++++++++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
> 
> 
> diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> index b71799a3acd3..4076b9004077 100644
> --- a/fs/xfs/xfs_fsops.c
> +++ b/fs/xfs/xfs_fsops.c
> @@ -379,6 +379,7 @@ xfs_reserve_blocks(
>  	int64_t			fdblks_delta = 0;
>  	uint64_t		request;
>  	int64_t			free;
> +	unsigned int		tries;
>  	int			error = 0;
>  
>  	/* If inval is null, report current values and return */
> @@ -430,9 +431,16 @@ xfs_reserve_blocks(
>  	 * If the request is larger than the current reservation, reserve the
>  	 * blocks before we update the reserve counters. Sample m_fdblocks and
>  	 * perform a partial reservation if the request exceeds free space.
> +	 *
> +	 * The loop body estimates how many blocks it can request from fdblocks
> +	 * to stash in the reserve pool.  This is a classic TOCTOU race since
> +	 * fdblocks updates are not always coordinated via m_sb_lock.  We also
> +	 * cannot tell if @free remaining unchanged between iterations is due
> +	 * to an idle system or freed blocks being consumed immediately, so
> +	 * we'll try a finite number of times to satisfy the request.
>  	 */
>  	error = -ENOSPC;
> -	do {
> +	for (tries = 0; tries < 30 && error == -ENOSPC; tries++) {
>  		/*
>  		 * The reservation pool cannot take space that xfs_mod_fdblocks
>  		 * will not give us.
> @@ -462,7 +470,7 @@ xfs_reserve_blocks(
>  		spin_unlock(&mp->m_sb_lock);
>  		error = xfs_mod_fdblocks(mp, -fdblks_delta, 0);
>  		spin_lock(&mp->m_sb_lock);
> -	} while (error == -ENOSPC);
> +	}
>  
>  	/*
>  	 * Update the reserve counters if blocks have been successfully
> 


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 4/6] xfs: fix infinite loop when reserving free block pool
  2022-03-17 21:20 [PATCHSET v2 0/6] xfs: fix incorrect reserve pool calculations and reporting Darrick J. Wong
@ 2022-03-17 21:21 ` Darrick J. Wong
  2022-03-18 12:18   ` Brian Foster
  0 siblings, 1 reply; 25+ messages in thread
From: Darrick J. Wong @ 2022-03-17 21:21 UTC (permalink / raw)
  To: djwong; +Cc: Brian Foster, linux-xfs, bfoster, david

From: Darrick J. Wong <djwong@kernel.org>

Don't spin in an infinite loop trying to reserve blocks -- if we can't
do it after 30 tries, we're racing with a nearly full filesystem, so
just give up.

Cc: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_fsops.c |   12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)


diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index b71799a3acd3..4076b9004077 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -379,6 +379,7 @@ xfs_reserve_blocks(
 	int64_t			fdblks_delta = 0;
 	uint64_t		request;
 	int64_t			free;
+	unsigned int		tries;
 	int			error = 0;
 
 	/* If inval is null, report current values and return */
@@ -430,9 +431,16 @@ xfs_reserve_blocks(
 	 * If the request is larger than the current reservation, reserve the
 	 * blocks before we update the reserve counters. Sample m_fdblocks and
 	 * perform a partial reservation if the request exceeds free space.
+	 *
+	 * The loop body estimates how many blocks it can request from fdblocks
+	 * to stash in the reserve pool.  This is a classic TOCTOU race since
+	 * fdblocks updates are not always coordinated via m_sb_lock.  We also
+	 * cannot tell if @free remaining unchanged between iterations is due
+	 * to an idle system or freed blocks being consumed immediately, so
+	 * we'll try a finite number of times to satisfy the request.
 	 */
 	error = -ENOSPC;
-	do {
+	for (tries = 0; tries < 30 && error == -ENOSPC; tries++) {
 		/*
 		 * The reservation pool cannot take space that xfs_mod_fdblocks
 		 * will not give us.
@@ -462,7 +470,7 @@ xfs_reserve_blocks(
 		spin_unlock(&mp->m_sb_lock);
 		error = xfs_mod_fdblocks(mp, -fdblks_delta, 0);
 		spin_lock(&mp->m_sb_lock);
-	} while (error == -ENOSPC);
+	}
 
 	/*
 	 * Update the reserve counters if blocks have been successfully


^ permalink raw reply related	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2022-03-24  6:21 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-20 16:43 [PATCHSET v3 0/6] xfs: fix incorrect reserve pool calculations and reporting Darrick J. Wong
2022-03-20 16:43 ` [PATCH 1/6] xfs: document the XFS_ALLOC_AGFL_RESERVE constant Darrick J. Wong
2022-03-23 20:39   ` Dave Chinner
2022-03-24  5:15     ` Darrick J. Wong
2022-03-24  5:58       ` Dave Chinner
2022-03-20 16:43 ` [PATCH 2/6] xfs: actually set aside enough space to handle a bmbt split Darrick J. Wong
2022-03-23 20:48   ` Dave Chinner
2022-03-24  5:26     ` Darrick J. Wong
2022-03-24  6:00       ` Dave Chinner
2022-03-20 16:43 ` [PATCH 3/6] xfs: don't include bnobt blocks when reserving free block pool Darrick J. Wong
2022-03-21 15:22   ` Brian Foster
2022-03-21 20:42     ` Darrick J. Wong
2022-03-23 20:51   ` Dave Chinner
2022-03-20 16:43 ` [PATCH 4/6] xfs: fix infinite loop " Darrick J. Wong
2022-03-23 21:11   ` Dave Chinner
2022-03-24  5:24     ` Darrick J. Wong
2022-03-24  6:21       ` Dave Chinner
2022-03-20 16:43 ` [PATCH 5/6] xfs: don't report reserved bnobt space as available Darrick J. Wong
2022-03-21 15:22   ` Brian Foster
2022-03-21 20:48     ` Darrick J. Wong
2022-03-23 21:12   ` Dave Chinner
2022-03-20 16:44 ` [PATCH 6/6] xfs: rename "alloc_set_aside" to be more descriptive Darrick J. Wong
2022-03-23 21:21   ` Dave Chinner
  -- strict thread matches above, loose matches on Subject: below --
2022-03-17 21:20 [PATCHSET v2 0/6] xfs: fix incorrect reserve pool calculations and reporting Darrick J. Wong
2022-03-17 21:21 ` [PATCH 4/6] xfs: fix infinite loop when reserving free block pool Darrick J. Wong
2022-03-18 12:18   ` Brian Foster

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).