[PATCH] xfs: don't include bnobt blocks when reserving free block pool

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH] xfs: don't include bnobt blocks when reserving free block pool
@ 2022-03-14 18:08 Darrick J. Wong
  2022-03-16 11:28 ` Brian Foster
  0 siblings, 1 reply; 14+ messages in thread
From: Darrick J. Wong @ 2022-03-14 18:08 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

From: Darrick J. Wong <djwong@kernel.org>

xfs_reserve_blocks controls the size of the user-visible free space
reserve pool.  Given the difference between the current and requested
pool sizes, it will try to reserve free space from fdblocks.  However,
the amount requested from fdblocks is also constrained by the amount of
space that we think xfs_mod_fdblocks will give us.  We'll keep trying to
reserve space so long as xfs_mod_fdblocks returns ENOSPC.

In commit fd43cf600cf6, we decided that xfs_mod_fdblocks should not hand
out the "free space" used by the free space btrees, because some portion
of the free space btrees hold in reserve space for future btree
expansion.  Unfortunately, xfs_reserve_blocks' estimation of the number
of blocks that it could request from xfs_mod_fdblocks was not updated to
include m_allocbt_blks, so if space is extremely low, the caller hangs.

Fix this by including m_allocbt_blks in the estimation, and modify the
loop so that it will not retry infinitely.

Found by running xfs/306 (which formats a single-AG 20MB filesystem)
with an fstests configuration that specifies a 1k blocksize and a
specially crafted log size that will consume 7/8 of the space (17920
blocks, specifically) in that AG.

Cc: Brian Foster <bfoster@redhat.com>
Fixes: fd43cf600cf6 ("xfs: set aside allocation btree blocks from block reservation")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_fsops.c |   16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
index 33e26690a8c4..78b6982ea5b0 100644
--- a/fs/xfs/xfs_fsops.c
+++ b/fs/xfs/xfs_fsops.c
@@ -379,6 +379,7 @@ xfs_reserve_blocks(
 	int64_t			fdblks_delta = 0;
 	uint64_t		request;
 	int64_t			free;
+	unsigned int		tries;
 	int			error = 0;

 	/* If inval is null, report current values and return */
@@ -432,9 +433,16 @@ xfs_reserve_blocks(
 	 * perform a partial reservation if the request exceeds free space.
 	 */
 	error = -ENOSPC;
-	do {
-		free = percpu_counter_sum(&mp->m_fdblocks) -
-						mp->m_alloc_set_aside;
+	for (tries = 0; tries < 30 && error == -ENOSPC; tries++) {
+		/*
+		 * The reservation pool cannot take space that xfs_mod_fdblocks
+		 * will not give us.  This includes the per-AG set-aside space
+		 * and free space btree blocks that are not available for
+		 * allocation due to per-AG metadata reservations.
+		 */
+		free = percpu_counter_sum(&mp->m_fdblocks);
+		free -= mp->m_alloc_set_aside;
+		free -= atomic64_read(&mp->m_allocbt_blks);
 		if (free <= 0)
 			break;

@@ -459,7 +467,7 @@ xfs_reserve_blocks(
 		spin_unlock(&mp->m_sb_lock);
 		error = xfs_mod_fdblocks(mp, -fdblks_delta, 0);
 		spin_lock(&mp->m_sb_lock);
-	} while (error == -ENOSPC);
+	}

 	/*
 	 * Update the reserve counters if blocks have been successfully

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH] xfs: don't include bnobt blocks when reserving free block pool
  2022-03-14 18:08 [PATCH] xfs: don't include bnobt blocks when reserving free block pool Darrick J. Wong
@ 2022-03-16 11:28 ` Brian Foster
  2022-03-16 16:32   ` Darrick J. Wong
  0 siblings, 1 reply; 14+ messages in thread
From: Brian Foster @ 2022-03-16 11:28 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: xfs

On Mon, Mar 14, 2022 at 11:08:47AM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> xfs_reserve_blocks controls the size of the user-visible free space
> reserve pool.  Given the difference between the current and requested
> pool sizes, it will try to reserve free space from fdblocks.  However,
> the amount requested from fdblocks is also constrained by the amount of
> space that we think xfs_mod_fdblocks will give us.  We'll keep trying to
> reserve space so long as xfs_mod_fdblocks returns ENOSPC.
> 
> In commit fd43cf600cf6, we decided that xfs_mod_fdblocks should not hand
> out the "free space" used by the free space btrees, because some portion
> of the free space btrees hold in reserve space for future btree
> expansion.  Unfortunately, xfs_reserve_blocks' estimation of the number
> of blocks that it could request from xfs_mod_fdblocks was not updated to
> include m_allocbt_blks, so if space is extremely low, the caller hangs.
> 
> Fix this by including m_allocbt_blks in the estimation, and modify the
> loop so that it will not retry infinitely.
> 
> Found by running xfs/306 (which formats a single-AG 20MB filesystem)
> with an fstests configuration that specifies a 1k blocksize and a
> specially crafted log size that will consume 7/8 of the space (17920
> blocks, specifically) in that AG.
> 
> Cc: Brian Foster <bfoster@redhat.com>
> Fixes: fd43cf600cf6 ("xfs: set aside allocation btree blocks from block reservation")
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/xfs_fsops.c |   16 ++++++++++++----
>  1 file changed, 12 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> index 33e26690a8c4..78b6982ea5b0 100644
> --- a/fs/xfs/xfs_fsops.c
> +++ b/fs/xfs/xfs_fsops.c
> @@ -379,6 +379,7 @@ xfs_reserve_blocks(
>  	int64_t			fdblks_delta = 0;
>  	uint64_t		request;
>  	int64_t			free;
> +	unsigned int		tries;
>  	int			error = 0;
>  
>  	/* If inval is null, report current values and return */
> @@ -432,9 +433,16 @@ xfs_reserve_blocks(
>  	 * perform a partial reservation if the request exceeds free space.
>  	 */
>  	error = -ENOSPC;
> -	do {
> -		free = percpu_counter_sum(&mp->m_fdblocks) -
> -						mp->m_alloc_set_aside;
> +	for (tries = 0; tries < 30 && error == -ENOSPC; tries++) {

Any reason for the magic number of retries as opposed to perhaps just
not retrying at all? This seems a little odd when you think about it
given that the request is already intended to take available space into
account and modify the request from userspace. OTOH, another
consideration could be to retry some (really large?) number of times and
then bail out if we happen to iterate without an observable change in
free space (i.e., something is wrong), however I suppose that could be
racy as well. *shrug*

> +		/*
> +		 * The reservation pool cannot take space that xfs_mod_fdblocks
> +		 * will not give us.  This includes the per-AG set-aside space
> +		 * and free space btree blocks that are not available for
> +		 * allocation due to per-AG metadata reservations.
> +		 */
> +		free = percpu_counter_sum(&mp->m_fdblocks);
> +		free -= mp->m_alloc_set_aside;
> +		free -= atomic64_read(&mp->m_allocbt_blks);

Seems reasonable. Do we want to consider ->m_allocbt_blks in other
places where ->m_alloc_set_aside is used (i.e. xfs_fs_statfs(), etc.)?
Not sure how much it matters for space reporting purposes, but if so, it
might also be worth reconsidering the usefulness of a static field and
initialization helper (i.e. xfs_alloc_set_aside()) if the majority of
uses involve a dynamic calculation (due to ->m_allocbt_blks).

Brian

>  		if (free <= 0)
>  			break;
>  
> @@ -459,7 +467,7 @@ xfs_reserve_blocks(
>  		spin_unlock(&mp->m_sb_lock);
>  		error = xfs_mod_fdblocks(mp, -fdblks_delta, 0);
>  		spin_lock(&mp->m_sb_lock);
> -	} while (error == -ENOSPC);
> +	}
>  
>  	/*
>  	 * Update the reserve counters if blocks have been successfully
> 


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] xfs: don't include bnobt blocks when reserving free block pool
  2022-03-16 11:28 ` Brian Foster
@ 2022-03-16 16:32   ` Darrick J. Wong
  2022-03-16 17:29     ` Brian Foster
  0 siblings, 1 reply; 14+ messages in thread
From: Darrick J. Wong @ 2022-03-16 16:32 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Wed, Mar 16, 2022 at 07:28:18AM -0400, Brian Foster wrote:
> On Mon, Mar 14, 2022 at 11:08:47AM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > xfs_reserve_blocks controls the size of the user-visible free space
> > reserve pool.  Given the difference between the current and requested
> > pool sizes, it will try to reserve free space from fdblocks.  However,
> > the amount requested from fdblocks is also constrained by the amount of
> > space that we think xfs_mod_fdblocks will give us.  We'll keep trying to
> > reserve space so long as xfs_mod_fdblocks returns ENOSPC.
> > 
> > In commit fd43cf600cf6, we decided that xfs_mod_fdblocks should not hand
> > out the "free space" used by the free space btrees, because some portion
> > of the free space btrees hold in reserve space for future btree
> > expansion.  Unfortunately, xfs_reserve_blocks' estimation of the number
> > of blocks that it could request from xfs_mod_fdblocks was not updated to
> > include m_allocbt_blks, so if space is extremely low, the caller hangs.
> > 
> > Fix this by including m_allocbt_blks in the estimation, and modify the
> > loop so that it will not retry infinitely.
> > 
> > Found by running xfs/306 (which formats a single-AG 20MB filesystem)
> > with an fstests configuration that specifies a 1k blocksize and a
> > specially crafted log size that will consume 7/8 of the space (17920
> > blocks, specifically) in that AG.
> > 
> > Cc: Brian Foster <bfoster@redhat.com>
> > Fixes: fd43cf600cf6 ("xfs: set aside allocation btree blocks from block reservation")
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  fs/xfs/xfs_fsops.c |   16 ++++++++++++----
> >  1 file changed, 12 insertions(+), 4 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> > index 33e26690a8c4..78b6982ea5b0 100644
> > --- a/fs/xfs/xfs_fsops.c
> > +++ b/fs/xfs/xfs_fsops.c
> > @@ -379,6 +379,7 @@ xfs_reserve_blocks(
> >  	int64_t			fdblks_delta = 0;
> >  	uint64_t		request;
> >  	int64_t			free;
> > +	unsigned int		tries;
> >  	int			error = 0;
> >  
> >  	/* If inval is null, report current values and return */
> > @@ -432,9 +433,16 @@ xfs_reserve_blocks(
> >  	 * perform a partial reservation if the request exceeds free space.
> >  	 */
> >  	error = -ENOSPC;
> > -	do {
> > -		free = percpu_counter_sum(&mp->m_fdblocks) -
> > -						mp->m_alloc_set_aside;
> > +	for (tries = 0; tries < 30 && error == -ENOSPC; tries++) {
> 
> Any reason for the magic number of retries as opposed to perhaps just
> not retrying at all?

I /think/ the origins of the loop was commit dbcabad19aa9 ("[XFS] Fix
block reservation mechanism."), where I guess Dave decided that we
should loop forever trying to satisfy a request from userspace to
increase the reserve pool.  OFC you and I have been patching this
function to fix all its horrible warts over the years, so maybe you're
right that this should only try once...

(For the mount time default reservation, we should only iterate the loop
once (provided the accounting is correct ;) since nobody else is
touching the free space counters.)

> This seems a little odd when you think about it
> given that the request is already intended to take available space into
> account and modify the request from userspace. OTOH, another
> consideration could be to retry some (really large?) number of times and
> then bail out if we happen to iterate without an observable change in
> free space (i.e., something is wrong), however I suppose that could be
> racy as well. *shrug*

...but if you're the sysadmin desperately trying to increase the size of
the reserve pool when the fs is running near ENOSPC, you're going to be
racing with fdblocks bouncing up and down.  The @free samples that we
take here in the loop body are indeed racy since we can't tell the
difference between @free being unchanged from the last iteration because
someone freed a block and someone else immediately consumed it, or a
totally idle system.

Either way, it's better than hanging the whole system. :)

What if I augment the loop control with a comment capturing some of this:

	/*
	 * The loop body estimates how many blocks it can request from
	 * fdblocks to stash in the reserve pool.  This is a classic
	 * TOCTOU race since fdblocks updates are not always coordinated
	 * via m_sb_lock.  We also cannot tell if @free remaining
	 * unchanged between iterations is due to an idle system or
	 * freed blocks being consumed immediately, so we'll try a
	 * finite number of times to satisfy the request.
	 */
	for (tries = 0; tries < 30...) {

> 
> > +		/*
> > +		 * The reservation pool cannot take space that xfs_mod_fdblocks
> > +		 * will not give us.  This includes the per-AG set-aside space
> > +		 * and free space btree blocks that are not available for
> > +		 * allocation due to per-AG metadata reservations.
> > +		 */
> > +		free = percpu_counter_sum(&mp->m_fdblocks);
> > +		free -= mp->m_alloc_set_aside;
> > +		free -= atomic64_read(&mp->m_allocbt_blks);
> 
> Seems reasonable. Do we want to consider ->m_allocbt_blks in other
> places where ->m_alloc_set_aside is used (i.e. xfs_fs_statfs(), etc.)?
> Not sure how much it matters for space reporting purposes, but if so, it
> might also be worth reconsidering the usefulness of a static field and
> initialization helper (i.e. xfs_alloc_set_aside()) if the majority of
> uses involve a dynamic calculation (due to ->m_allocbt_blks).

When I was writing this patch, I very nearly decided to make those three
lines above their own helper.  I didn't see any other spots that looked
like obvious candidates for such a calculation outside of statfs.

Subtracting m_allocbt_blks from statfs' avail field is a behavior
change, since we always used to consider bnobt blocks as available.  We
don't have an exact count of how many blocks are needed to hide the
per-ag reserved extents, so in the end we have to decide whether we want
to hear complaints about over- or under-estimation of available blocks.

So I think the statfs stuff is a separate patch. :)

--D

> 
> Brian
> 
> >  		if (free <= 0)
> >  			break;
> >  
> > @@ -459,7 +467,7 @@ xfs_reserve_blocks(
> >  		spin_unlock(&mp->m_sb_lock);
> >  		error = xfs_mod_fdblocks(mp, -fdblks_delta, 0);
> >  		spin_lock(&mp->m_sb_lock);
> > -	} while (error == -ENOSPC);
> > +	}
> >  
> >  	/*
> >  	 * Update the reserve counters if blocks have been successfully
> > 
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] xfs: don't include bnobt blocks when reserving free block pool
  2022-03-16 16:32   ` Darrick J. Wong
@ 2022-03-16 17:29     ` Brian Foster
  2022-03-16 18:17       ` Darrick J. Wong
  0 siblings, 1 reply; 14+ messages in thread
From: Brian Foster @ 2022-03-16 17:29 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: xfs

On Wed, Mar 16, 2022 at 09:32:16AM -0700, Darrick J. Wong wrote:
> On Wed, Mar 16, 2022 at 07:28:18AM -0400, Brian Foster wrote:
> > On Mon, Mar 14, 2022 at 11:08:47AM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > > 
> > > xfs_reserve_blocks controls the size of the user-visible free space
> > > reserve pool.  Given the difference between the current and requested
> > > pool sizes, it will try to reserve free space from fdblocks.  However,
> > > the amount requested from fdblocks is also constrained by the amount of
> > > space that we think xfs_mod_fdblocks will give us.  We'll keep trying to
> > > reserve space so long as xfs_mod_fdblocks returns ENOSPC.
> > > 
> > > In commit fd43cf600cf6, we decided that xfs_mod_fdblocks should not hand
> > > out the "free space" used by the free space btrees, because some portion
> > > of the free space btrees hold in reserve space for future btree
> > > expansion.  Unfortunately, xfs_reserve_blocks' estimation of the number
> > > of blocks that it could request from xfs_mod_fdblocks was not updated to
> > > include m_allocbt_blks, so if space is extremely low, the caller hangs.
> > > 
> > > Fix this by including m_allocbt_blks in the estimation, and modify the
> > > loop so that it will not retry infinitely.
> > > 
> > > Found by running xfs/306 (which formats a single-AG 20MB filesystem)
> > > with an fstests configuration that specifies a 1k blocksize and a
> > > specially crafted log size that will consume 7/8 of the space (17920
> > > blocks, specifically) in that AG.
> > > 
> > > Cc: Brian Foster <bfoster@redhat.com>
> > > Fixes: fd43cf600cf6 ("xfs: set aside allocation btree blocks from block reservation")
> > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > ---
> > >  fs/xfs/xfs_fsops.c |   16 ++++++++++++----
> > >  1 file changed, 12 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> > > index 33e26690a8c4..78b6982ea5b0 100644
> > > --- a/fs/xfs/xfs_fsops.c
> > > +++ b/fs/xfs/xfs_fsops.c
> > > @@ -379,6 +379,7 @@ xfs_reserve_blocks(
> > >  	int64_t			fdblks_delta = 0;
> > >  	uint64_t		request;
> > >  	int64_t			free;
> > > +	unsigned int		tries;
> > >  	int			error = 0;
> > >  
> > >  	/* If inval is null, report current values and return */
> > > @@ -432,9 +433,16 @@ xfs_reserve_blocks(
> > >  	 * perform a partial reservation if the request exceeds free space.
> > >  	 */
> > >  	error = -ENOSPC;
> > > -	do {
> > > -		free = percpu_counter_sum(&mp->m_fdblocks) -
> > > -						mp->m_alloc_set_aside;
> > > +	for (tries = 0; tries < 30 && error == -ENOSPC; tries++) {
> > 
> > Any reason for the magic number of retries as opposed to perhaps just
> > not retrying at all?
> 
> I /think/ the origins of the loop was commit dbcabad19aa9 ("[XFS] Fix
> block reservation mechanism."), where I guess Dave decided that we
> should loop forever trying to satisfy a request from userspace to
> increase the reserve pool.  OFC you and I have been patching this
> function to fix all its horrible warts over the years, so maybe you're
> right that this should only try once...
> 
> (For the mount time default reservation, we should only iterate the loop
> once (provided the accounting is correct ;) since nobody else is
> touching the free space counters.)
> 
> > This seems a little odd when you think about it
> > given that the request is already intended to take available space into
> > account and modify the request from userspace. OTOH, another
> > consideration could be to retry some (really large?) number of times and
> > then bail out if we happen to iterate without an observable change in
> > free space (i.e., something is wrong), however I suppose that could be
> > racy as well. *shrug*
> 
> ...but if you're the sysadmin desperately trying to increase the size of
> the reserve pool when the fs is running near ENOSPC, you're going to be
> racing with fdblocks bouncing up and down.  The @free samples that we
> take here in the loop body are indeed racy since we can't tell the
> difference between @free being unchanged from the last iteration because
> someone freed a block and someone else immediately consumed it, or a
> totally idle system.
> 
> Either way, it's better than hanging the whole system. :)
> 

Yeah.. I'm not bothered much by whether we retry once, 42 times or
forever. I think what this boils down to for me is whether it's worth
the risk of a behavior change of an -ENOSPC return causing something
unexpected for some random user or use case. Could we just do this in
two separate patches? Patch 1 fixes the calculation and targets stable,
patch 2 does whatever to the retry loop that potentially changes retry
semantics (and doesn't really need backporting)..?

> What if I augment the loop control with a comment capturing some of this:
> 
> 	/*
> 	 * The loop body estimates how many blocks it can request from
> 	 * fdblocks to stash in the reserve pool.  This is a classic
> 	 * TOCTOU race since fdblocks updates are not always coordinated
> 	 * via m_sb_lock.  We also cannot tell if @free remaining
> 	 * unchanged between iterations is due to an idle system or
> 	 * freed blocks being consumed immediately, so we'll try a
> 	 * finite number of times to satisfy the request.
> 	 */
> 	for (tries = 0; tries < 30...) {
> 
> > 
> > > +		/*
> > > +		 * The reservation pool cannot take space that xfs_mod_fdblocks
> > > +		 * will not give us.  This includes the per-AG set-aside space
> > > +		 * and free space btree blocks that are not available for
> > > +		 * allocation due to per-AG metadata reservations.
> > > +		 */
> > > +		free = percpu_counter_sum(&mp->m_fdblocks);
> > > +		free -= mp->m_alloc_set_aside;
> > > +		free -= atomic64_read(&mp->m_allocbt_blks);
> > 
> > Seems reasonable. Do we want to consider ->m_allocbt_blks in other
> > places where ->m_alloc_set_aside is used (i.e. xfs_fs_statfs(), etc.)?
> > Not sure how much it matters for space reporting purposes, but if so, it
> > might also be worth reconsidering the usefulness of a static field and
> > initialization helper (i.e. xfs_alloc_set_aside()) if the majority of
> > uses involve a dynamic calculation (due to ->m_allocbt_blks).
> 
> When I was writing this patch, I very nearly decided to make those three
> lines above their own helper.  I didn't see any other spots that looked
> like obvious candidates for such a calculation outside of statfs.
> 

Indeed..

> Subtracting m_allocbt_blks from statfs' avail field is a behavior
> change, since we always used to consider bnobt blocks as available.  We
> don't have an exact count of how many blocks are needed to hide the
> per-ag reserved extents, so in the end we have to decide whether we want
> to hear complaints about over- or under-estimation of available blocks.
> 
> So I think the statfs stuff is a separate patch. :)
> 

Similar deal as above.. I'm more interested in a potential cleanup of
the code that helps prevent this sort of buglet for the next user of
->m_alloc_set_aside that will (expectedly) have no idea about this
subtle quirk than I am about what's presented in the free space
counters. ISTM that we ought to just ditch ->m_alloc_set_aside, replace
the existing xfs_alloc_set_aside() with an XFS_ALLOC_FS_RESERVED() macro
or something that just does the (agcount << 3) thing, and then define a
new xfs_alloc_set_aside() that combines the macro calculation with
->m_allocbt_blks. Then the whole "set aside" concept is calculated and
documented in one place. Hm?

Brian

> --D
> 
> > 
> > Brian
> > 
> > >  		if (free <= 0)
> > >  			break;
> > >  
> > > @@ -459,7 +467,7 @@ xfs_reserve_blocks(
> > >  		spin_unlock(&mp->m_sb_lock);
> > >  		error = xfs_mod_fdblocks(mp, -fdblks_delta, 0);
> > >  		spin_lock(&mp->m_sb_lock);
> > > -	} while (error == -ENOSPC);
> > > +	}
> > >  
> > >  	/*
> > >  	 * Update the reserve counters if blocks have been successfully
> > > 
> > 
> 


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] xfs: don't include bnobt blocks when reserving free block pool
  2022-03-16 17:29     ` Brian Foster
@ 2022-03-16 18:17       ` Darrick J. Wong
  2022-03-16 18:48         ` Brian Foster
  2022-03-17  2:05         ` Dave Chinner
  0 siblings, 2 replies; 14+ messages in thread
From: Darrick J. Wong @ 2022-03-16 18:17 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Wed, Mar 16, 2022 at 01:29:01PM -0400, Brian Foster wrote:
> On Wed, Mar 16, 2022 at 09:32:16AM -0700, Darrick J. Wong wrote:
> > On Wed, Mar 16, 2022 at 07:28:18AM -0400, Brian Foster wrote:
> > > On Mon, Mar 14, 2022 at 11:08:47AM -0700, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <djwong@kernel.org>
> > > > 
> > > > xfs_reserve_blocks controls the size of the user-visible free space
> > > > reserve pool.  Given the difference between the current and requested
> > > > pool sizes, it will try to reserve free space from fdblocks.  However,
> > > > the amount requested from fdblocks is also constrained by the amount of
> > > > space that we think xfs_mod_fdblocks will give us.  We'll keep trying to
> > > > reserve space so long as xfs_mod_fdblocks returns ENOSPC.
> > > > 
> > > > In commit fd43cf600cf6, we decided that xfs_mod_fdblocks should not hand
> > > > out the "free space" used by the free space btrees, because some portion
> > > > of the free space btrees hold in reserve space for future btree
> > > > expansion.  Unfortunately, xfs_reserve_blocks' estimation of the number
> > > > of blocks that it could request from xfs_mod_fdblocks was not updated to
> > > > include m_allocbt_blks, so if space is extremely low, the caller hangs.
> > > > 
> > > > Fix this by including m_allocbt_blks in the estimation, and modify the
> > > > loop so that it will not retry infinitely.
> > > > 
> > > > Found by running xfs/306 (which formats a single-AG 20MB filesystem)
> > > > with an fstests configuration that specifies a 1k blocksize and a
> > > > specially crafted log size that will consume 7/8 of the space (17920
> > > > blocks, specifically) in that AG.
> > > > 
> > > > Cc: Brian Foster <bfoster@redhat.com>
> > > > Fixes: fd43cf600cf6 ("xfs: set aside allocation btree blocks from block reservation")
> > > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > > ---
> > > >  fs/xfs/xfs_fsops.c |   16 ++++++++++++----
> > > >  1 file changed, 12 insertions(+), 4 deletions(-)
> > > > 
> > > > diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> > > > index 33e26690a8c4..78b6982ea5b0 100644
> > > > --- a/fs/xfs/xfs_fsops.c
> > > > +++ b/fs/xfs/xfs_fsops.c
> > > > @@ -379,6 +379,7 @@ xfs_reserve_blocks(
> > > >  	int64_t			fdblks_delta = 0;
> > > >  	uint64_t		request;
> > > >  	int64_t			free;
> > > > +	unsigned int		tries;
> > > >  	int			error = 0;
> > > >  
> > > >  	/* If inval is null, report current values and return */
> > > > @@ -432,9 +433,16 @@ xfs_reserve_blocks(
> > > >  	 * perform a partial reservation if the request exceeds free space.
> > > >  	 */
> > > >  	error = -ENOSPC;
> > > > -	do {
> > > > -		free = percpu_counter_sum(&mp->m_fdblocks) -
> > > > -						mp->m_alloc_set_aside;
> > > > +	for (tries = 0; tries < 30 && error == -ENOSPC; tries++) {
> > > 
> > > Any reason for the magic number of retries as opposed to perhaps just
> > > not retrying at all?
> > 
> > I /think/ the origins of the loop was commit dbcabad19aa9 ("[XFS] Fix
> > block reservation mechanism."), where I guess Dave decided that we
> > should loop forever trying to satisfy a request from userspace to
> > increase the reserve pool.  OFC you and I have been patching this
> > function to fix all its horrible warts over the years, so maybe you're
> > right that this should only try once...
> > 
> > (For the mount time default reservation, we should only iterate the loop
> > once (provided the accounting is correct ;) since nobody else is
> > touching the free space counters.)
> > 
> > > This seems a little odd when you think about it
> > > given that the request is already intended to take available space into
> > > account and modify the request from userspace. OTOH, another
> > > consideration could be to retry some (really large?) number of times and
> > > then bail out if we happen to iterate without an observable change in
> > > free space (i.e., something is wrong), however I suppose that could be
> > > racy as well. *shrug*
> > 
> > ...but if you're the sysadmin desperately trying to increase the size of
> > the reserve pool when the fs is running near ENOSPC, you're going to be
> > racing with fdblocks bouncing up and down.  The @free samples that we
> > take here in the loop body are indeed racy since we can't tell the
> > difference between @free being unchanged from the last iteration because
> > someone freed a block and someone else immediately consumed it, or a
> > totally idle system.
> > 
> > Either way, it's better than hanging the whole system. :)
> > 
> 
> Yeah.. I'm not bothered much by whether we retry once, 42 times or
> forever. I think what this boils down to for me is whether it's worth
> the risk of a behavior change of an -ENOSPC return causing something
> unexpected for some random user or use case. Could we just do this in
> two separate patches? Patch 1 fixes the calculation and targets stable,
> patch 2 does whatever to the retry loop that potentially changes retry
> semantics (and doesn't really need backporting)..?

Splitting the two patches sounds good to me.

> > What if I augment the loop control with a comment capturing some of this:
> > 
> > 	/*
> > 	 * The loop body estimates how many blocks it can request from
> > 	 * fdblocks to stash in the reserve pool.  This is a classic
> > 	 * TOCTOU race since fdblocks updates are not always coordinated
> > 	 * via m_sb_lock.  We also cannot tell if @free remaining
> > 	 * unchanged between iterations is due to an idle system or
> > 	 * freed blocks being consumed immediately, so we'll try a
> > 	 * finite number of times to satisfy the request.
> > 	 */
> > 	for (tries = 0; tries < 30...) {
> > 
> > > 
> > > > +		/*
> > > > +		 * The reservation pool cannot take space that xfs_mod_fdblocks
> > > > +		 * will not give us.  This includes the per-AG set-aside space
> > > > +		 * and free space btree blocks that are not available for
> > > > +		 * allocation due to per-AG metadata reservations.
> > > > +		 */
> > > > +		free = percpu_counter_sum(&mp->m_fdblocks);
> > > > +		free -= mp->m_alloc_set_aside;
> > > > +		free -= atomic64_read(&mp->m_allocbt_blks);
> > > 
> > > Seems reasonable. Do we want to consider ->m_allocbt_blks in other
> > > places where ->m_alloc_set_aside is used (i.e. xfs_fs_statfs(), etc.)?
> > > Not sure how much it matters for space reporting purposes, but if so, it
> > > might also be worth reconsidering the usefulness of a static field and
> > > initialization helper (i.e. xfs_alloc_set_aside()) if the majority of
> > > uses involve a dynamic calculation (due to ->m_allocbt_blks).
> > 
> > When I was writing this patch, I very nearly decided to make those three
> > lines above their own helper.  I didn't see any other spots that looked
> > like obvious candidates for such a calculation outside of statfs.
> > 
> 
> Indeed..
> 
> > Subtracting m_allocbt_blks from statfs' avail field is a behavior
> > change, since we always used to consider bnobt blocks as available.  We
> > don't have an exact count of how many blocks are needed to hide the
> > per-ag reserved extents, so in the end we have to decide whether we want
> > to hear complaints about over- or under-estimation of available blocks.
> > 
> > So I think the statfs stuff is a separate patch. :)
> > 
> 
> Similar deal as above.. I'm more interested in a potential cleanup of
> the code that helps prevent this sort of buglet for the next user of
> ->m_alloc_set_aside that will (expectedly) have no idea about this
> subtle quirk than I am about what's presented in the free space
> counters. ISTM that we ought to just ditch ->m_alloc_set_aside, replace
> the existing xfs_alloc_set_aside() with an XFS_ALLOC_FS_RESERVED() macro
> or something that just does the (agcount << 3) thing, and then define a

I'm not sure that the current xfs_alloc_set_aside code is correct.
Right now it comes with this comment:

"We need to reserve 4 fsbs _per AG_ for the freelist and 4 more to
handle a potential split of the file's bmap btree."

I think the first part ("4 fsbs _per AG_ for the freelist") is wrong.
AFAICT, that part refers to the number of blocks we need to keep free in
case we have to replenish a completely empty AGFL.  The hardcoded value
of 4 seems wrong, since xfs_alloc_min_freelist() is what _fix_freelist
uses to decide how big the AGFL needs to be, and it returns 6 on a
filesystem that has rmapbt enabled.  So I think XFS_ALLOC_AGFL_RESERVE
is wrong here and should be replaced with the function call.

I also think the second part ("and 4 more to handle a split of the
file's bmap btree") is wrong.  If we're really supposed to save enough
blocks to handle a bmbt split, then I think this ought to be
(mp->m_bm_maxlevels[0] - 1), not 4, right?  According to xfs_db, bmap
btrees can be 9 levels tall:

xfs_db> btheight bmapbt -n 4294967296 -b 512
bmapbt: worst case per 512-byte block: 13 records (leaf) / 13 keyptrs (node)
level 0: 4294967296 records, 330382100 blocks
level 1: 330382100 records, 25414008 blocks
level 2: 25414008 records, 1954924 blocks
level 3: 1954924 records, 150379 blocks
level 4: 150379 records, 11568 blocks
level 5: 11568 records, 890 blocks
level 6: 890 records, 69 blocks
level 7: 69 records, 6 blocks
level 8: 6 records, 1 block
9 levels, 357913945 blocks total

The root level is in the inode, so we need 8 blocks to handle a full
split on a 512b block filesystem.

Granted, it's at least correct for 4k+ block filesystems:

xfs_db> btheight bmapbt -n 4294967296
bmapbt: worst case per 4096-byte block: 125 records (leaf) / 125 keyptrs (node)
level 0: 4294967296 records, 34359739 blocks
level 1: 34359739 records, 274878 blocks
level 2: 274878 records, 2200 blocks
level 3: 2200 records, 18 blocks
level 4: 18 records, 1 block
5 levels, 34636836 blocks total

So in the end, I think that calculation should become:

unsigned int
xfs_alloc_set_aside(
	struct xfs_mount	*mp)
{
	unsigned int		min-agfl = xfs_alloc_min_freelist(mp, NULL);

	return mp->m_sb.sb_agcount * (min_agfl + mp->m_bm_maxlevels[0] - 1);
}

> new xfs_alloc_set_aside() that combines the macro calculation with
> ->m_allocbt_blks. Then the whole "set aside" concept is calculated and
> documented in one place. Hm?

I think I'd rather call the new function xfs_fdblocks_avail() over
reusing an existing name, because I fear that zapping an old function
and replacing it with a new function with the same name will cause
confusion for anyone backporting patches or reading code after an
absence.

Also the only reason we have a mount variable and a function (instead of
a macro) is that Dave asked me to change the codebase away from the
XFS_ALLOC_AG_MAX_USABLE/XFS_ALLOC_SET_ASIDE macros as part of merging
reflink.

--D

> Brian
> 
> > --D
> > 
> > > 
> > > Brian
> > > 
> > > >  		if (free <= 0)
> > > >  			break;
> > > >  
> > > > @@ -459,7 +467,7 @@ xfs_reserve_blocks(
> > > >  		spin_unlock(&mp->m_sb_lock);
> > > >  		error = xfs_mod_fdblocks(mp, -fdblks_delta, 0);
> > > >  		spin_lock(&mp->m_sb_lock);
> > > > -	} while (error == -ENOSPC);
> > > > +	}
> > > >  
> > > >  	/*
> > > >  	 * Update the reserve counters if blocks have been successfully
> > > > 
> > > 
> > 
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] xfs: don't include bnobt blocks when reserving free block pool
  2022-03-16 18:17       ` Darrick J. Wong
@ 2022-03-16 18:48         ` Brian Foster
  2022-03-16 19:17           ` Brian Foster
  2022-03-17  2:05         ` Dave Chinner
  1 sibling, 1 reply; 14+ messages in thread
From: Brian Foster @ 2022-03-16 18:48 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: xfs

On Wed, Mar 16, 2022 at 11:17:26AM -0700, Darrick J. Wong wrote:
> On Wed, Mar 16, 2022 at 01:29:01PM -0400, Brian Foster wrote:
> > On Wed, Mar 16, 2022 at 09:32:16AM -0700, Darrick J. Wong wrote:
> > > On Wed, Mar 16, 2022 at 07:28:18AM -0400, Brian Foster wrote:
> > > > On Mon, Mar 14, 2022 at 11:08:47AM -0700, Darrick J. Wong wrote:
> > > > > From: Darrick J. Wong <djwong@kernel.org>
> > > > > 
> > > > > xfs_reserve_blocks controls the size of the user-visible free space
> > > > > reserve pool.  Given the difference between the current and requested
> > > > > pool sizes, it will try to reserve free space from fdblocks.  However,
> > > > > the amount requested from fdblocks is also constrained by the amount of
> > > > > space that we think xfs_mod_fdblocks will give us.  We'll keep trying to
> > > > > reserve space so long as xfs_mod_fdblocks returns ENOSPC.
> > > > > 
> > > > > In commit fd43cf600cf6, we decided that xfs_mod_fdblocks should not hand
> > > > > out the "free space" used by the free space btrees, because some portion
> > > > > of the free space btrees hold in reserve space for future btree
> > > > > expansion.  Unfortunately, xfs_reserve_blocks' estimation of the number
> > > > > of blocks that it could request from xfs_mod_fdblocks was not updated to
> > > > > include m_allocbt_blks, so if space is extremely low, the caller hangs.
> > > > > 
> > > > > Fix this by including m_allocbt_blks in the estimation, and modify the
> > > > > loop so that it will not retry infinitely.
> > > > > 
> > > > > Found by running xfs/306 (which formats a single-AG 20MB filesystem)
> > > > > with an fstests configuration that specifies a 1k blocksize and a
> > > > > specially crafted log size that will consume 7/8 of the space (17920
> > > > > blocks, specifically) in that AG.
> > > > > 
> > > > > Cc: Brian Foster <bfoster@redhat.com>
> > > > > Fixes: fd43cf600cf6 ("xfs: set aside allocation btree blocks from block reservation")
> > > > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > > > ---
> > > > >  fs/xfs/xfs_fsops.c |   16 ++++++++++++----
> > > > >  1 file changed, 12 insertions(+), 4 deletions(-)
> > > > > 
> > > > > diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> > > > > index 33e26690a8c4..78b6982ea5b0 100644
> > > > > --- a/fs/xfs/xfs_fsops.c
> > > > > +++ b/fs/xfs/xfs_fsops.c
> > > > > @@ -379,6 +379,7 @@ xfs_reserve_blocks(
> > > > >  	int64_t			fdblks_delta = 0;
> > > > >  	uint64_t		request;
> > > > >  	int64_t			free;
> > > > > +	unsigned int		tries;
> > > > >  	int			error = 0;
> > > > >  
> > > > >  	/* If inval is null, report current values and return */
> > > > > @@ -432,9 +433,16 @@ xfs_reserve_blocks(
> > > > >  	 * perform a partial reservation if the request exceeds free space.
> > > > >  	 */
> > > > >  	error = -ENOSPC;
> > > > > -	do {
> > > > > -		free = percpu_counter_sum(&mp->m_fdblocks) -
> > > > > -						mp->m_alloc_set_aside;
> > > > > +	for (tries = 0; tries < 30 && error == -ENOSPC; tries++) {
> > > > 
> > > > Any reason for the magic number of retries as opposed to perhaps just
> > > > not retrying at all?
> > > 
> > > I /think/ the origins of the loop was commit dbcabad19aa9 ("[XFS] Fix
> > > block reservation mechanism."), where I guess Dave decided that we
> > > should loop forever trying to satisfy a request from userspace to
> > > increase the reserve pool.  OFC you and I have been patching this
> > > function to fix all its horrible warts over the years, so maybe you're
> > > right that this should only try once...
> > > 
> > > (For the mount time default reservation, we should only iterate the loop
> > > once (provided the accounting is correct ;) since nobody else is
> > > touching the free space counters.)
> > > 
> > > > This seems a little odd when you think about it
> > > > given that the request is already intended to take available space into
> > > > account and modify the request from userspace. OTOH, another
> > > > consideration could be to retry some (really large?) number of times and
> > > > then bail out if we happen to iterate without an observable change in
> > > > free space (i.e., something is wrong), however I suppose that could be
> > > > racy as well. *shrug*
> > > 
> > > ...but if you're the sysadmin desperately trying to increase the size of
> > > the reserve pool when the fs is running near ENOSPC, you're going to be
> > > racing with fdblocks bouncing up and down.  The @free samples that we
> > > take here in the loop body are indeed racy since we can't tell the
> > > difference between @free being unchanged from the last iteration because
> > > someone freed a block and someone else immediately consumed it, or a
> > > totally idle system.
> > > 
> > > Either way, it's better than hanging the whole system. :)
> > > 
> > 
> > Yeah.. I'm not bothered much by whether we retry once, 42 times or
> > forever. I think what this boils down to for me is whether it's worth
> > the risk of a behavior change of an -ENOSPC return causing something
> > unexpected for some random user or use case. Could we just do this in
> > two separate patches? Patch 1 fixes the calculation and targets stable,
> > patch 2 does whatever to the retry loop that potentially changes retry
> > semantics (and doesn't really need backporting)..?
> 
> Splitting the two patches sounds good to me.
> 
> > > What if I augment the loop control with a comment capturing some of this:
> > > 
> > > 	/*
> > > 	 * The loop body estimates how many blocks it can request from
> > > 	 * fdblocks to stash in the reserve pool.  This is a classic
> > > 	 * TOCTOU race since fdblocks updates are not always coordinated
> > > 	 * via m_sb_lock.  We also cannot tell if @free remaining
> > > 	 * unchanged between iterations is due to an idle system or
> > > 	 * freed blocks being consumed immediately, so we'll try a
> > > 	 * finite number of times to satisfy the request.
> > > 	 */
> > > 	for (tries = 0; tries < 30...) {
> > > 
> > > > 
> > > > > +		/*
> > > > > +		 * The reservation pool cannot take space that xfs_mod_fdblocks
> > > > > +		 * will not give us.  This includes the per-AG set-aside space
> > > > > +		 * and free space btree blocks that are not available for
> > > > > +		 * allocation due to per-AG metadata reservations.
> > > > > +		 */
> > > > > +		free = percpu_counter_sum(&mp->m_fdblocks);
> > > > > +		free -= mp->m_alloc_set_aside;
> > > > > +		free -= atomic64_read(&mp->m_allocbt_blks);
> > > > 
> > > > Seems reasonable. Do we want to consider ->m_allocbt_blks in other
> > > > places where ->m_alloc_set_aside is used (i.e. xfs_fs_statfs(), etc.)?
> > > > Not sure how much it matters for space reporting purposes, but if so, it
> > > > might also be worth reconsidering the usefulness of a static field and
> > > > initialization helper (i.e. xfs_alloc_set_aside()) if the majority of
> > > > uses involve a dynamic calculation (due to ->m_allocbt_blks).
> > > 
> > > When I was writing this patch, I very nearly decided to make those three
> > > lines above their own helper.  I didn't see any other spots that looked
> > > like obvious candidates for such a calculation outside of statfs.
> > > 
> > 
> > Indeed..
> > 
> > > Subtracting m_allocbt_blks from statfs' avail field is a behavior
> > > change, since we always used to consider bnobt blocks as available.  We
> > > don't have an exact count of how many blocks are needed to hide the
> > > per-ag reserved extents, so in the end we have to decide whether we want
> > > to hear complaints about over- or under-estimation of available blocks.
> > > 
> > > So I think the statfs stuff is a separate patch. :)
> > > 
> > 
> > Similar deal as above.. I'm more interested in a potential cleanup of
> > the code that helps prevent this sort of buglet for the next user of
> > ->m_alloc_set_aside that will (expectedly) have no idea about this
> > subtle quirk than I am about what's presented in the free space
> > counters. ISTM that we ought to just ditch ->m_alloc_set_aside, replace
> > the existing xfs_alloc_set_aside() with an XFS_ALLOC_FS_RESERVED() macro
> > or something that just does the (agcount << 3) thing, and then define a
> 
> I'm not sure that the current xfs_alloc_set_aside code is correct.
> Right now it comes with this comment:
> 
> "We need to reserve 4 fsbs _per AG_ for the freelist and 4 more to
> handle a potential split of the file's bmap btree."
> 
> I think the first part ("4 fsbs _per AG_ for the freelist") is wrong.
> AFAICT, that part refers to the number of blocks we need to keep free in
> case we have to replenish a completely empty AGFL.  The hardcoded value
> of 4 seems wrong, since xfs_alloc_min_freelist() is what _fix_freelist
> uses to decide how big the AGFL needs to be, and it returns 6 on a
> filesystem that has rmapbt enabled.  So I think XFS_ALLOC_AGFL_RESERVE
> is wrong here and should be replaced with the function call.
> 
> I also think the second part ("and 4 more to handle a split of the
> file's bmap btree") is wrong.  If we're really supposed to save enough
> blocks to handle a bmbt split, then I think this ought to be
> (mp->m_bm_maxlevels[0] - 1), not 4, right?  According to xfs_db, bmap
> btrees can be 9 levels tall:
> 
> xfs_db> btheight bmapbt -n 4294967296 -b 512
> bmapbt: worst case per 512-byte block: 13 records (leaf) / 13 keyptrs (node)
> level 0: 4294967296 records, 330382100 blocks
> level 1: 330382100 records, 25414008 blocks
> level 2: 25414008 records, 1954924 blocks
> level 3: 1954924 records, 150379 blocks
> level 4: 150379 records, 11568 blocks
> level 5: 11568 records, 890 blocks
> level 6: 890 records, 69 blocks
> level 7: 69 records, 6 blocks
> level 8: 6 records, 1 block
> 9 levels, 357913945 blocks total
> 
> The root level is in the inode, so we need 8 blocks to handle a full
> split on a 512b block filesystem.
> 
> Granted, it's at least correct for 4k+ block filesystems:
> 
> xfs_db> btheight bmapbt -n 4294967296
> bmapbt: worst case per 4096-byte block: 125 records (leaf) / 125 keyptrs (node)
> level 0: 4294967296 records, 34359739 blocks
> level 1: 34359739 records, 274878 blocks
> level 2: 274878 records, 2200 blocks
> level 3: 2200 records, 18 blocks
> level 4: 18 records, 1 block
> 5 levels, 34636836 blocks total
> 
> So in the end, I think that calculation should become:
> 
> unsigned int
> xfs_alloc_set_aside(
> 	struct xfs_mount	*mp)
> {
> 	unsigned int		min-agfl = xfs_alloc_min_freelist(mp, NULL);
> 
> 	return mp->m_sb.sb_agcount * (min_agfl + mp->m_bm_maxlevels[0] - 1);
> }
> 

I'm not familiar enough with the original context, but that seems
reasonable to me at a glance. If we do want to change this, I'd again
just suggest a separate patch since presumably there isn't any known bug
report or anything associated with the current code.

> > new xfs_alloc_set_aside() that combines the macro calculation with
> > ->m_allocbt_blks. Then the whole "set aside" concept is calculated and
> > documented in one place. Hm?
> 
> I think I'd rather call the new function xfs_fdblocks_avail() over
> reusing an existing name, because I fear that zapping an old function
> and replacing it with a new function with the same name will cause
> confusion for anyone backporting patches or reading code after an
> absence.
> 

The logical change is basically just to include the allocbt block count
in the "set aside" calculation, not necessarily redefine what "set
aside" means, so I don't really see that as a problem. The function is
only used in a couple places, but we've managed to make it confusing by
conflating the naming of the function, xfs_mount field and local
variables in certain contexts.

BTW, do we ever really expect the "set aside" value to be of any
significance relative to the perag reservation values for any moderately
recent kernel? IIRC (most) perag res is not presented as free space, so
I'm curious if it would even be noticeable if we just fixed and used
xfs_alloc_set_aside() consistently across the board..

Brian

> Also the only reason we have a mount variable and a function (instead of
> a macro) is that Dave asked me to change the codebase away from the
> XFS_ALLOC_AG_MAX_USABLE/XFS_ALLOC_SET_ASIDE macros as part of merging
> reflink.
> 
> --D
> 
> > Brian
> > 
> > > --D
> > > 
> > > > 
> > > > Brian
> > > > 
> > > > >  		if (free <= 0)
> > > > >  			break;
> > > > >  
> > > > > @@ -459,7 +467,7 @@ xfs_reserve_blocks(
> > > > >  		spin_unlock(&mp->m_sb_lock);
> > > > >  		error = xfs_mod_fdblocks(mp, -fdblks_delta, 0);
> > > > >  		spin_lock(&mp->m_sb_lock);
> > > > > -	} while (error == -ENOSPC);
> > > > > +	}
> > > > >  
> > > > >  	/*
> > > > >  	 * Update the reserve counters if blocks have been successfully
> > > > > 
> > > > 
> > > 
> > 
> 


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] xfs: don't include bnobt blocks when reserving free block pool
  2022-03-16 18:48         ` Brian Foster
@ 2022-03-16 19:17           ` Brian Foster
  2022-03-16 21:15             ` Darrick J. Wong
  0 siblings, 1 reply; 14+ messages in thread
From: Brian Foster @ 2022-03-16 19:17 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: xfs

On Wed, Mar 16, 2022 at 02:48:43PM -0400, Brian Foster wrote:
> On Wed, Mar 16, 2022 at 11:17:26AM -0700, Darrick J. Wong wrote:
> > On Wed, Mar 16, 2022 at 01:29:01PM -0400, Brian Foster wrote:
> > > On Wed, Mar 16, 2022 at 09:32:16AM -0700, Darrick J. Wong wrote:
> > > > On Wed, Mar 16, 2022 at 07:28:18AM -0400, Brian Foster wrote:
> > > > > On Mon, Mar 14, 2022 at 11:08:47AM -0700, Darrick J. Wong wrote:
> > > > > > From: Darrick J. Wong <djwong@kernel.org>
> > > > > > 
> > > > > > xfs_reserve_blocks controls the size of the user-visible free space
> > > > > > reserve pool.  Given the difference between the current and requested
> > > > > > pool sizes, it will try to reserve free space from fdblocks.  However,
> > > > > > the amount requested from fdblocks is also constrained by the amount of
> > > > > > space that we think xfs_mod_fdblocks will give us.  We'll keep trying to
> > > > > > reserve space so long as xfs_mod_fdblocks returns ENOSPC.
> > > > > > 
> > > > > > In commit fd43cf600cf6, we decided that xfs_mod_fdblocks should not hand
> > > > > > out the "free space" used by the free space btrees, because some portion
> > > > > > of the free space btrees hold in reserve space for future btree
> > > > > > expansion.  Unfortunately, xfs_reserve_blocks' estimation of the number
> > > > > > of blocks that it could request from xfs_mod_fdblocks was not updated to
> > > > > > include m_allocbt_blks, so if space is extremely low, the caller hangs.
> > > > > > 
> > > > > > Fix this by including m_allocbt_blks in the estimation, and modify the
> > > > > > loop so that it will not retry infinitely.
> > > > > > 
> > > > > > Found by running xfs/306 (which formats a single-AG 20MB filesystem)
> > > > > > with an fstests configuration that specifies a 1k blocksize and a
> > > > > > specially crafted log size that will consume 7/8 of the space (17920
> > > > > > blocks, specifically) in that AG.
> > > > > > 
> > > > > > Cc: Brian Foster <bfoster@redhat.com>
> > > > > > Fixes: fd43cf600cf6 ("xfs: set aside allocation btree blocks from block reservation")
> > > > > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > > > > ---
> > > > > >  fs/xfs/xfs_fsops.c |   16 ++++++++++++----
> > > > > >  1 file changed, 12 insertions(+), 4 deletions(-)
> > > > > > 
> > > > > > diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> > > > > > index 33e26690a8c4..78b6982ea5b0 100644
> > > > > > --- a/fs/xfs/xfs_fsops.c
> > > > > > +++ b/fs/xfs/xfs_fsops.c
> > > > > > @@ -379,6 +379,7 @@ xfs_reserve_blocks(
> > > > > >  	int64_t			fdblks_delta = 0;
> > > > > >  	uint64_t		request;
> > > > > >  	int64_t			free;
> > > > > > +	unsigned int		tries;
> > > > > >  	int			error = 0;
> > > > > >  
> > > > > >  	/* If inval is null, report current values and return */
> > > > > > @@ -432,9 +433,16 @@ xfs_reserve_blocks(
> > > > > >  	 * perform a partial reservation if the request exceeds free space.
> > > > > >  	 */
> > > > > >  	error = -ENOSPC;
> > > > > > -	do {
> > > > > > -		free = percpu_counter_sum(&mp->m_fdblocks) -
> > > > > > -						mp->m_alloc_set_aside;
> > > > > > +	for (tries = 0; tries < 30 && error == -ENOSPC; tries++) {
> > > > > 
> > > > > Any reason for the magic number of retries as opposed to perhaps just
> > > > > not retrying at all?
> > > > 
> > > > I /think/ the origins of the loop was commit dbcabad19aa9 ("[XFS] Fix
> > > > block reservation mechanism."), where I guess Dave decided that we
> > > > should loop forever trying to satisfy a request from userspace to
> > > > increase the reserve pool.  OFC you and I have been patching this
> > > > function to fix all its horrible warts over the years, so maybe you're
> > > > right that this should only try once...
> > > > 
> > > > (For the mount time default reservation, we should only iterate the loop
> > > > once (provided the accounting is correct ;) since nobody else is
> > > > touching the free space counters.)
> > > > 
> > > > > This seems a little odd when you think about it
> > > > > given that the request is already intended to take available space into
> > > > > account and modify the request from userspace. OTOH, another
> > > > > consideration could be to retry some (really large?) number of times and
> > > > > then bail out if we happen to iterate without an observable change in
> > > > > free space (i.e., something is wrong), however I suppose that could be
> > > > > racy as well. *shrug*
> > > > 
> > > > ...but if you're the sysadmin desperately trying to increase the size of
> > > > the reserve pool when the fs is running near ENOSPC, you're going to be
> > > > racing with fdblocks bouncing up and down.  The @free samples that we
> > > > take here in the loop body are indeed racy since we can't tell the
> > > > difference between @free being unchanged from the last iteration because
> > > > someone freed a block and someone else immediately consumed it, or a
> > > > totally idle system.
> > > > 
> > > > Either way, it's better than hanging the whole system. :)
> > > > 
> > > 
> > > Yeah.. I'm not bothered much by whether we retry once, 42 times or
> > > forever. I think what this boils down to for me is whether it's worth
> > > the risk of a behavior change of an -ENOSPC return causing something
> > > unexpected for some random user or use case. Could we just do this in
> > > two separate patches? Patch 1 fixes the calculation and targets stable,
> > > patch 2 does whatever to the retry loop that potentially changes retry
> > > semantics (and doesn't really need backporting)..?
> > 
> > Splitting the two patches sounds good to me.
> > 
> > > > What if I augment the loop control with a comment capturing some of this:
> > > > 
> > > > 	/*
> > > > 	 * The loop body estimates how many blocks it can request from
> > > > 	 * fdblocks to stash in the reserve pool.  This is a classic
> > > > 	 * TOCTOU race since fdblocks updates are not always coordinated
> > > > 	 * via m_sb_lock.  We also cannot tell if @free remaining
> > > > 	 * unchanged between iterations is due to an idle system or
> > > > 	 * freed blocks being consumed immediately, so we'll try a
> > > > 	 * finite number of times to satisfy the request.
> > > > 	 */
> > > > 	for (tries = 0; tries < 30...) {
> > > > 
> > > > > 
> > > > > > +		/*
> > > > > > +		 * The reservation pool cannot take space that xfs_mod_fdblocks
> > > > > > +		 * will not give us.  This includes the per-AG set-aside space
> > > > > > +		 * and free space btree blocks that are not available for
> > > > > > +		 * allocation due to per-AG metadata reservations.
> > > > > > +		 */
> > > > > > +		free = percpu_counter_sum(&mp->m_fdblocks);
> > > > > > +		free -= mp->m_alloc_set_aside;
> > > > > > +		free -= atomic64_read(&mp->m_allocbt_blks);
> > > > > 
> > > > > Seems reasonable. Do we want to consider ->m_allocbt_blks in other
> > > > > places where ->m_alloc_set_aside is used (i.e. xfs_fs_statfs(), etc.)?
> > > > > Not sure how much it matters for space reporting purposes, but if so, it
> > > > > might also be worth reconsidering the usefulness of a static field and
> > > > > initialization helper (i.e. xfs_alloc_set_aside()) if the majority of
> > > > > uses involve a dynamic calculation (due to ->m_allocbt_blks).
> > > > 
> > > > When I was writing this patch, I very nearly decided to make those three
> > > > lines above their own helper.  I didn't see any other spots that looked
> > > > like obvious candidates for such a calculation outside of statfs.
> > > > 
> > > 
> > > Indeed..
> > > 
> > > > Subtracting m_allocbt_blks from statfs' avail field is a behavior
> > > > change, since we always used to consider bnobt blocks as available.  We
> > > > don't have an exact count of how many blocks are needed to hide the
> > > > per-ag reserved extents, so in the end we have to decide whether we want
> > > > to hear complaints about over- or under-estimation of available blocks.
> > > > 
> > > > So I think the statfs stuff is a separate patch. :)
> > > > 
> > > 
> > > Similar deal as above.. I'm more interested in a potential cleanup of
> > > the code that helps prevent this sort of buglet for the next user of
> > > ->m_alloc_set_aside that will (expectedly) have no idea about this
> > > subtle quirk than I am about what's presented in the free space
> > > counters. ISTM that we ought to just ditch ->m_alloc_set_aside, replace
> > > the existing xfs_alloc_set_aside() with an XFS_ALLOC_FS_RESERVED() macro
> > > or something that just does the (agcount << 3) thing, and then define a
> > 
> > I'm not sure that the current xfs_alloc_set_aside code is correct.
> > Right now it comes with this comment:
> > 
> > "We need to reserve 4 fsbs _per AG_ for the freelist and 4 more to
> > handle a potential split of the file's bmap btree."
> > 
> > I think the first part ("4 fsbs _per AG_ for the freelist") is wrong.
> > AFAICT, that part refers to the number of blocks we need to keep free in
> > case we have to replenish a completely empty AGFL.  The hardcoded value
> > of 4 seems wrong, since xfs_alloc_min_freelist() is what _fix_freelist
> > uses to decide how big the AGFL needs to be, and it returns 6 on a
> > filesystem that has rmapbt enabled.  So I think XFS_ALLOC_AGFL_RESERVE
> > is wrong here and should be replaced with the function call.
> > 
> > I also think the second part ("and 4 more to handle a split of the
> > file's bmap btree") is wrong.  If we're really supposed to save enough
> > blocks to handle a bmbt split, then I think this ought to be
> > (mp->m_bm_maxlevels[0] - 1), not 4, right?  According to xfs_db, bmap
> > btrees can be 9 levels tall:
> > 
> > xfs_db> btheight bmapbt -n 4294967296 -b 512
> > bmapbt: worst case per 512-byte block: 13 records (leaf) / 13 keyptrs (node)
> > level 0: 4294967296 records, 330382100 blocks
> > level 1: 330382100 records, 25414008 blocks
> > level 2: 25414008 records, 1954924 blocks
> > level 3: 1954924 records, 150379 blocks
> > level 4: 150379 records, 11568 blocks
> > level 5: 11568 records, 890 blocks
> > level 6: 890 records, 69 blocks
> > level 7: 69 records, 6 blocks
> > level 8: 6 records, 1 block
> > 9 levels, 357913945 blocks total
> > 
> > The root level is in the inode, so we need 8 blocks to handle a full
> > split on a 512b block filesystem.
> > 
> > Granted, it's at least correct for 4k+ block filesystems:
> > 
> > xfs_db> btheight bmapbt -n 4294967296
> > bmapbt: worst case per 4096-byte block: 125 records (leaf) / 125 keyptrs (node)
> > level 0: 4294967296 records, 34359739 blocks
> > level 1: 34359739 records, 274878 blocks
> > level 2: 274878 records, 2200 blocks
> > level 3: 2200 records, 18 blocks
> > level 4: 18 records, 1 block
> > 5 levels, 34636836 blocks total
> > 
> > So in the end, I think that calculation should become:
> > 
> > unsigned int
> > xfs_alloc_set_aside(
> > 	struct xfs_mount	*mp)
> > {
> > 	unsigned int		min-agfl = xfs_alloc_min_freelist(mp, NULL);
> > 
> > 	return mp->m_sb.sb_agcount * (min_agfl + mp->m_bm_maxlevels[0] - 1);
> > }
> > 
> 
> I'm not familiar enough with the original context, but that seems
> reasonable to me at a glance. If we do want to change this, I'd again
> just suggest a separate patch since presumably there isn't any known bug
> report or anything associated with the current code.
> 
> > > new xfs_alloc_set_aside() that combines the macro calculation with
> > > ->m_allocbt_blks. Then the whole "set aside" concept is calculated and
> > > documented in one place. Hm?
> > 
> > I think I'd rather call the new function xfs_fdblocks_avail() over
> > reusing an existing name, because I fear that zapping an old function
> > and replacing it with a new function with the same name will cause
> > confusion for anyone backporting patches or reading code after an
> > absence.
> > 
> 
> The logical change is basically just to include the allocbt block count
> in the "set aside" calculation, not necessarily redefine what "set
> aside" means, so I don't really see that as a problem. The function is
> only used in a couple places, but we've managed to make it confusing by
> conflating the naming of the function, xfs_mount field and local
> variables in certain contexts.
> 
> BTW, do we ever really expect the "set aside" value to be of any
> significance relative to the perag reservation values for any moderately
> recent kernel? IIRC (most) perag res is not presented as free space, so
> I'm curious if it would even be noticeable if we just fixed and used
> xfs_alloc_set_aside() consistently across the board..
> 

I suppose it's reasonable to want to include allocbt blocks as free
space in general, though, since technically they are eventually usable
blocks. So naming and whatnot aside, at the end the day I'm good with
anything that refactors this code one way or another such that it's
harder to repeat the same mistake.

Brian

> Brian
> 
> > Also the only reason we have a mount variable and a function (instead of
> > a macro) is that Dave asked me to change the codebase away from the
> > XFS_ALLOC_AG_MAX_USABLE/XFS_ALLOC_SET_ASIDE macros as part of merging
> > reflink.
> > 
> > --D
> > 
> > > Brian
> > > 
> > > > --D
> > > > 
> > > > > 
> > > > > Brian
> > > > > 
> > > > > >  		if (free <= 0)
> > > > > >  			break;
> > > > > >  
> > > > > > @@ -459,7 +467,7 @@ xfs_reserve_blocks(
> > > > > >  		spin_unlock(&mp->m_sb_lock);
> > > > > >  		error = xfs_mod_fdblocks(mp, -fdblks_delta, 0);
> > > > > >  		spin_lock(&mp->m_sb_lock);
> > > > > > -	} while (error == -ENOSPC);
> > > > > > +	}
> > > > > >  
> > > > > >  	/*
> > > > > >  	 * Update the reserve counters if blocks have been successfully
> > > > > > 
> > > > > 
> > > > 
> > > 
> > 


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] xfs: don't include bnobt blocks when reserving free block pool
  2022-03-16 19:17           ` Brian Foster
@ 2022-03-16 21:15             ` Darrick J. Wong
  2022-03-17  2:19               ` Dave Chinner
  2022-03-17 12:53               ` Brian Foster
  0 siblings, 2 replies; 14+ messages in thread
From: Darrick J. Wong @ 2022-03-16 21:15 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Wed, Mar 16, 2022 at 03:17:27PM -0400, Brian Foster wrote:
> On Wed, Mar 16, 2022 at 02:48:43PM -0400, Brian Foster wrote:
> > On Wed, Mar 16, 2022 at 11:17:26AM -0700, Darrick J. Wong wrote:
> > > On Wed, Mar 16, 2022 at 01:29:01PM -0400, Brian Foster wrote:
> > > > On Wed, Mar 16, 2022 at 09:32:16AM -0700, Darrick J. Wong wrote:
> > > > > On Wed, Mar 16, 2022 at 07:28:18AM -0400, Brian Foster wrote:
> > > > > > On Mon, Mar 14, 2022 at 11:08:47AM -0700, Darrick J. Wong wrote:
> > > > > > > From: Darrick J. Wong <djwong@kernel.org>
> > > > > > > 
> > > > > > > xfs_reserve_blocks controls the size of the user-visible free space
> > > > > > > reserve pool.  Given the difference between the current and requested
> > > > > > > pool sizes, it will try to reserve free space from fdblocks.  However,
> > > > > > > the amount requested from fdblocks is also constrained by the amount of
> > > > > > > space that we think xfs_mod_fdblocks will give us.  We'll keep trying to
> > > > > > > reserve space so long as xfs_mod_fdblocks returns ENOSPC.
> > > > > > > 
> > > > > > > In commit fd43cf600cf6, we decided that xfs_mod_fdblocks should not hand
> > > > > > > out the "free space" used by the free space btrees, because some portion
> > > > > > > of the free space btrees hold in reserve space for future btree
> > > > > > > expansion.  Unfortunately, xfs_reserve_blocks' estimation of the number
> > > > > > > of blocks that it could request from xfs_mod_fdblocks was not updated to
> > > > > > > include m_allocbt_blks, so if space is extremely low, the caller hangs.
> > > > > > > 
> > > > > > > Fix this by including m_allocbt_blks in the estimation, and modify the
> > > > > > > loop so that it will not retry infinitely.
> > > > > > > 
> > > > > > > Found by running xfs/306 (which formats a single-AG 20MB filesystem)
> > > > > > > with an fstests configuration that specifies a 1k blocksize and a
> > > > > > > specially crafted log size that will consume 7/8 of the space (17920
> > > > > > > blocks, specifically) in that AG.
> > > > > > > 
> > > > > > > Cc: Brian Foster <bfoster@redhat.com>
> > > > > > > Fixes: fd43cf600cf6 ("xfs: set aside allocation btree blocks from block reservation")
> > > > > > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > > > > > ---
> > > > > > >  fs/xfs/xfs_fsops.c |   16 ++++++++++++----
> > > > > > >  1 file changed, 12 insertions(+), 4 deletions(-)
> > > > > > > 
> > > > > > > diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> > > > > > > index 33e26690a8c4..78b6982ea5b0 100644
> > > > > > > --- a/fs/xfs/xfs_fsops.c
> > > > > > > +++ b/fs/xfs/xfs_fsops.c
> > > > > > > @@ -379,6 +379,7 @@ xfs_reserve_blocks(
> > > > > > >  	int64_t			fdblks_delta = 0;
> > > > > > >  	uint64_t		request;
> > > > > > >  	int64_t			free;
> > > > > > > +	unsigned int		tries;
> > > > > > >  	int			error = 0;
> > > > > > >  
> > > > > > >  	/* If inval is null, report current values and return */
> > > > > > > @@ -432,9 +433,16 @@ xfs_reserve_blocks(
> > > > > > >  	 * perform a partial reservation if the request exceeds free space.
> > > > > > >  	 */
> > > > > > >  	error = -ENOSPC;
> > > > > > > -	do {
> > > > > > > -		free = percpu_counter_sum(&mp->m_fdblocks) -
> > > > > > > -						mp->m_alloc_set_aside;
> > > > > > > +	for (tries = 0; tries < 30 && error == -ENOSPC; tries++) {
> > > > > > 
> > > > > > Any reason for the magic number of retries as opposed to perhaps just
> > > > > > not retrying at all?
> > > > > 
> > > > > I /think/ the origins of the loop was commit dbcabad19aa9 ("[XFS] Fix
> > > > > block reservation mechanism."), where I guess Dave decided that we
> > > > > should loop forever trying to satisfy a request from userspace to
> > > > > increase the reserve pool.  OFC you and I have been patching this
> > > > > function to fix all its horrible warts over the years, so maybe you're
> > > > > right that this should only try once...
> > > > > 
> > > > > (For the mount time default reservation, we should only iterate the loop
> > > > > once (provided the accounting is correct ;) since nobody else is
> > > > > touching the free space counters.)
> > > > > 
> > > > > > This seems a little odd when you think about it
> > > > > > given that the request is already intended to take available space into
> > > > > > account and modify the request from userspace. OTOH, another
> > > > > > consideration could be to retry some (really large?) number of times and
> > > > > > then bail out if we happen to iterate without an observable change in
> > > > > > free space (i.e., something is wrong), however I suppose that could be
> > > > > > racy as well. *shrug*
> > > > > 
> > > > > ...but if you're the sysadmin desperately trying to increase the size of
> > > > > the reserve pool when the fs is running near ENOSPC, you're going to be
> > > > > racing with fdblocks bouncing up and down.  The @free samples that we
> > > > > take here in the loop body are indeed racy since we can't tell the
> > > > > difference between @free being unchanged from the last iteration because
> > > > > someone freed a block and someone else immediately consumed it, or a
> > > > > totally idle system.
> > > > > 
> > > > > Either way, it's better than hanging the whole system. :)
> > > > > 
> > > > 
> > > > Yeah.. I'm not bothered much by whether we retry once, 42 times or
> > > > forever. I think what this boils down to for me is whether it's worth
> > > > the risk of a behavior change of an -ENOSPC return causing something
> > > > unexpected for some random user or use case. Could we just do this in
> > > > two separate patches? Patch 1 fixes the calculation and targets stable,
> > > > patch 2 does whatever to the retry loop that potentially changes retry
> > > > semantics (and doesn't really need backporting)..?
> > > 
> > > Splitting the two patches sounds good to me.
> > > 
> > > > > What if I augment the loop control with a comment capturing some of this:
> > > > > 
> > > > > 	/*
> > > > > 	 * The loop body estimates how many blocks it can request from
> > > > > 	 * fdblocks to stash in the reserve pool.  This is a classic
> > > > > 	 * TOCTOU race since fdblocks updates are not always coordinated
> > > > > 	 * via m_sb_lock.  We also cannot tell if @free remaining
> > > > > 	 * unchanged between iterations is due to an idle system or
> > > > > 	 * freed blocks being consumed immediately, so we'll try a
> > > > > 	 * finite number of times to satisfy the request.
> > > > > 	 */
> > > > > 	for (tries = 0; tries < 30...) {
> > > > > 
> > > > > > 
> > > > > > > +		/*
> > > > > > > +		 * The reservation pool cannot take space that xfs_mod_fdblocks
> > > > > > > +		 * will not give us.  This includes the per-AG set-aside space
> > > > > > > +		 * and free space btree blocks that are not available for
> > > > > > > +		 * allocation due to per-AG metadata reservations.
> > > > > > > +		 */
> > > > > > > +		free = percpu_counter_sum(&mp->m_fdblocks);
> > > > > > > +		free -= mp->m_alloc_set_aside;
> > > > > > > +		free -= atomic64_read(&mp->m_allocbt_blks);
> > > > > > 
> > > > > > Seems reasonable. Do we want to consider ->m_allocbt_blks in other
> > > > > > places where ->m_alloc_set_aside is used (i.e. xfs_fs_statfs(), etc.)?
> > > > > > Not sure how much it matters for space reporting purposes, but if so, it
> > > > > > might also be worth reconsidering the usefulness of a static field and
> > > > > > initialization helper (i.e. xfs_alloc_set_aside()) if the majority of
> > > > > > uses involve a dynamic calculation (due to ->m_allocbt_blks).
> > > > > 
> > > > > When I was writing this patch, I very nearly decided to make those three
> > > > > lines above their own helper.  I didn't see any other spots that looked
> > > > > like obvious candidates for such a calculation outside of statfs.
> > > > > 
> > > > 
> > > > Indeed..
> > > > 
> > > > > Subtracting m_allocbt_blks from statfs' avail field is a behavior
> > > > > change, since we always used to consider bnobt blocks as available.  We
> > > > > don't have an exact count of how many blocks are needed to hide the
> > > > > per-ag reserved extents, so in the end we have to decide whether we want
> > > > > to hear complaints about over- or under-estimation of available blocks.
> > > > > 
> > > > > So I think the statfs stuff is a separate patch. :)
> > > > > 
> > > > 
> > > > Similar deal as above.. I'm more interested in a potential cleanup of
> > > > the code that helps prevent this sort of buglet for the next user of
> > > > ->m_alloc_set_aside that will (expectedly) have no idea about this
> > > > subtle quirk than I am about what's presented in the free space
> > > > counters. ISTM that we ought to just ditch ->m_alloc_set_aside, replace
> > > > the existing xfs_alloc_set_aside() with an XFS_ALLOC_FS_RESERVED() macro
> > > > or something that just does the (agcount << 3) thing, and then define a
> > > 
> > > I'm not sure that the current xfs_alloc_set_aside code is correct.
> > > Right now it comes with this comment:
> > > 
> > > "We need to reserve 4 fsbs _per AG_ for the freelist and 4 more to
> > > handle a potential split of the file's bmap btree."
> > > 
> > > I think the first part ("4 fsbs _per AG_ for the freelist") is wrong.
> > > AFAICT, that part refers to the number of blocks we need to keep free in
> > > case we have to replenish a completely empty AGFL.  The hardcoded value
> > > of 4 seems wrong, since xfs_alloc_min_freelist() is what _fix_freelist
> > > uses to decide how big the AGFL needs to be, and it returns 6 on a
> > > filesystem that has rmapbt enabled.  So I think XFS_ALLOC_AGFL_RESERVE
> > > is wrong here and should be replaced with the function call.
> > > 
> > > I also think the second part ("and 4 more to handle a split of the
> > > file's bmap btree") is wrong.  If we're really supposed to save enough
> > > blocks to handle a bmbt split, then I think this ought to be
> > > (mp->m_bm_maxlevels[0] - 1), not 4, right?  According to xfs_db, bmap
> > > btrees can be 9 levels tall:
> > > 
> > > xfs_db> btheight bmapbt -n 4294967296 -b 512
> > > bmapbt: worst case per 512-byte block: 13 records (leaf) / 13 keyptrs (node)
> > > level 0: 4294967296 records, 330382100 blocks
> > > level 1: 330382100 records, 25414008 blocks
> > > level 2: 25414008 records, 1954924 blocks
> > > level 3: 1954924 records, 150379 blocks
> > > level 4: 150379 records, 11568 blocks
> > > level 5: 11568 records, 890 blocks
> > > level 6: 890 records, 69 blocks
> > > level 7: 69 records, 6 blocks
> > > level 8: 6 records, 1 block
> > > 9 levels, 357913945 blocks total
> > > 
> > > The root level is in the inode, so we need 8 blocks to handle a full
> > > split on a 512b block filesystem.
> > > 
> > > Granted, it's at least correct for 4k+ block filesystems:
> > > 
> > > xfs_db> btheight bmapbt -n 4294967296
> > > bmapbt: worst case per 4096-byte block: 125 records (leaf) / 125 keyptrs (node)
> > > level 0: 4294967296 records, 34359739 blocks
> > > level 1: 34359739 records, 274878 blocks
> > > level 2: 274878 records, 2200 blocks
> > > level 3: 2200 records, 18 blocks
> > > level 4: 18 records, 1 block
> > > 5 levels, 34636836 blocks total
> > > 
> > > So in the end, I think that calculation should become:
> > > 
> > > unsigned int
> > > xfs_alloc_set_aside(
> > > 	struct xfs_mount	*mp)
> > > {
> > > 	unsigned int		min-agfl = xfs_alloc_min_freelist(mp, NULL);
> > > 
> > > 	return mp->m_sb.sb_agcount * (min_agfl + mp->m_bm_maxlevels[0] - 1);
> > > }
> > > 
> > 
> > I'm not familiar enough with the original context, but that seems
> > reasonable to me at a glance. If we do want to change this, I'd again
> > just suggest a separate patch since presumably there isn't any known bug
> > report or anything associated with the current code.

Oh definitely, each of these little fixes nets their own patch.

> > > > new xfs_alloc_set_aside() that combines the macro calculation with
> > > > ->m_allocbt_blks. Then the whole "set aside" concept is calculated and
> > > > documented in one place. Hm?
> > > 
> > > I think I'd rather call the new function xfs_fdblocks_avail() over
> > > reusing an existing name, because I fear that zapping an old function
> > > and replacing it with a new function with the same name will cause
> > > confusion for anyone backporting patches or reading code after an
> > > absence.
> > > 
> > 
> > The logical change is basically just to include the allocbt block count
> > in the "set aside" calculation, not necessarily redefine what "set
> > aside" means, so I don't really see that as a problem. The function is
> > only used in a couple places, but we've managed to make it confusing by
> > conflating the naming of the function, xfs_mount field and local
> > variables in certain contexts.

Hmm.  The way I see things, the setaside is the *minimum* amount of free
space that we need to withhold from the user to guarantee that we can
always make progress in a file mapping update.  m_allocbt_blks is the
amount of space used by free space btrees (currently, the entire btree)
that we withhold from the user to guarantee that btree expansions from
the per-AG reservations can actually be filled.

The reserve pool looks to me like a user-controlled chunk of space that
we also withhold to avoid ENOSPC failure from within writeback; and the
per-AG reservations of course are even more space withheld to guarantee
btree expansion.

IOWs, there are four (I hope!) different free space pools that we use to
guarantee forward progress on this or that operation, and the only thing
they have in common is they don't get reported to userspace as available
blocks.  I don't know that I want to go mixing them too much.

Though, I agree that all of this is really confusing and needs a central
function that makes it obvious how we compute available blocks given all
the free space pools that we have.

> > BTW, do we ever really expect the "set aside" value to be of any
> > significance relative to the perag reservation values for any moderately
> > recent kernel?

Nope.  Any non-tiny filesystem with at least finobt=1 is going to have
way more per-AG reservation than either the setaside or the reserve
pool.

> > IIRC (most) perag res is not presented as free space, so
> > I'm curious if it would even be noticeable if we just fixed and used
> > xfs_alloc_set_aside() consistently across the board..

It isn't.  On a 500G test image with reflink, rmap, and finobt turned
on, the per-AG reservations consume ~630kblocks for each of four AGs of
about ~32.7mblocks apiece.  That's about 2%.  The reservation pool is
8192 blocks and the set-aside is 4*8=32 blocks, both of which are
rounding error by comparison.

At maximal free space fragmentation, the free space btrees would consume
(worst case) about 65kblocks per AG.  That might be noticeable.

> I suppose it's reasonable to want to include allocbt blocks as free
> space in general, though, since technically they are eventually usable
> blocks. So naming and whatnot aside, at the end the day I'm good with
> anything that refactors this code one way or another such that it's
> harder to repeat the same mistake.

Yeah, I at least like the idea of having a function that estimates how
much free space we can try to reserve without using reserve pool blocks.
I also am warming to the idea of telling users they can't have /any/ of
those reserved blocks... particularly since I haven't even gotten to the
"I fill my fs to 99% full and only then buy one more gigabyte of space,
wash, rinse, repeat" complaints.

--D

> 
> Brian
> 
> > Brian
> > 
> > > Also the only reason we have a mount variable and a function (instead of
> > > a macro) is that Dave asked me to change the codebase away from the
> > > XFS_ALLOC_AG_MAX_USABLE/XFS_ALLOC_SET_ASIDE macros as part of merging
> > > reflink.
> > > 
> > > --D
> > > 
> > > > Brian
> > > > 
> > > > > --D
> > > > > 
> > > > > > 
> > > > > > Brian
> > > > > > 
> > > > > > >  		if (free <= 0)
> > > > > > >  			break;
> > > > > > >  
> > > > > > > @@ -459,7 +467,7 @@ xfs_reserve_blocks(
> > > > > > >  		spin_unlock(&mp->m_sb_lock);
> > > > > > >  		error = xfs_mod_fdblocks(mp, -fdblks_delta, 0);
> > > > > > >  		spin_lock(&mp->m_sb_lock);
> > > > > > > -	} while (error == -ENOSPC);
> > > > > > > +	}
> > > > > > >  
> > > > > > >  	/*
> > > > > > >  	 * Update the reserve counters if blocks have been successfully
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > > 
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] xfs: don't include bnobt blocks when reserving free block pool
  2022-03-16 18:17       ` Darrick J. Wong
  2022-03-16 18:48         ` Brian Foster
@ 2022-03-17  2:05         ` Dave Chinner
  2022-03-17 12:56           ` Brian Foster
  2022-03-17 17:05           ` Darrick J. Wong
  1 sibling, 2 replies; 14+ messages in thread
From: Dave Chinner @ 2022-03-17  2:05 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Brian Foster, xfs

On Wed, Mar 16, 2022 at 11:17:26AM -0700, Darrick J. Wong wrote:
> On Wed, Mar 16, 2022 at 01:29:01PM -0400, Brian Foster wrote:
> > On Wed, Mar 16, 2022 at 09:32:16AM -0700, Darrick J. Wong wrote:
> > > On Wed, Mar 16, 2022 at 07:28:18AM -0400, Brian Foster wrote:
> > > > On Mon, Mar 14, 2022 at 11:08:47AM -0700, Darrick J. Wong wrote:
> > Similar deal as above.. I'm more interested in a potential cleanup of
> > the code that helps prevent this sort of buglet for the next user of
> > ->m_alloc_set_aside that will (expectedly) have no idea about this
> > subtle quirk than I am about what's presented in the free space
> > counters. ISTM that we ought to just ditch ->m_alloc_set_aside, replace
> > the existing xfs_alloc_set_aside() with an XFS_ALLOC_FS_RESERVED() macro
> > or something that just does the (agcount << 3) thing, and then define a
> 
> I'm not sure that the current xfs_alloc_set_aside code is correct.
> Right now it comes with this comment:
> 
> "We need to reserve 4 fsbs _per AG_ for the freelist and 4 more to
> handle a potential split of the file's bmap btree."
>
> I think the first part ("4 fsbs _per AG_ for the freelist") is wrong.
> AFAICT, that part refers to the number of blocks we need to keep free in
> case we have to replenish a completely empty AGFL.  The hardcoded value
> of 4 seems wrong, since xfs_alloc_min_freelist() is what _fix_freelist
> uses to decide how big the AGFL needs to be, and it returns 6 on a
> filesystem that has rmapbt enabled.  So I think XFS_ALLOC_AGFL_RESERVE
> is wrong here and should be replaced with the function call.

Back when I wrote that code (circa 2007, IIRC), that was actually
correct according to the reservations that were made when freeing
an extent at ENOSPC.

We needed 4 blocks for the AGFL fixup to always succeed  - 2 blocks
for each BNO and CNT btrees, and, IIRC, the extent free reservation
was just 4 blocks at that time. Hence the 4+4 value.

However, you are right that rmap also adds another per-ag btree that
is allocated from the agfl and that set_aside() should be taking
that into accout. That said, I think that xfs_alloc_min_freelist()
might even be wrong by just adding 2 blocks to the AGFL for the
rmapbt.

That is, at ENOSPC the rmapbt can be a *big* btree. It's not like
the BNO and CNT btrees which are completely empty at that point in
time; the RMAP tree could be one level below max height, and freeing
a single block could split a rmap rec and trigger a full height RMAP
split.

So the minimum free list length in that case is 2 + 2 + MAX_RMAP_HEIGHT.

> I also think the second part ("and 4 more to handle a split of the
> file's bmap btree") is wrong.  If we're really supposed to save enough
> blocks to handle a bmbt split, then I think this ought to be
> (mp->m_bm_maxlevels[0] - 1), not 4, right?  According to xfs_db, bmap
> btrees can be 9 levels tall:

Yes, we've changed the BMBT reservations in the years since that
code was written to handle max height reservations correctly, too.
So, like the RMAP btree reservation, we probably should be reserving
MAX_BMAP_HEIGHT in the set-aside calculation.

refcount btree space is handled by the ag_resv code and blocks
aren't allocated from the AGFL, so I don't think we need to take
taht into account for xfs_alloc_set_aside().

> So in the end, I think that calculation should become:
> 
> unsigned int
> xfs_alloc_set_aside(
> 	struct xfs_mount	*mp)
> {
> 	unsigned int		min-agfl = xfs_alloc_min_freelist(mp, NULL);
> 
> 	return mp->m_sb.sb_agcount * (min_agfl + mp->m_bm_maxlevels[0] - 1);
> }

*nod*, but with the proviso that xfs_alloc_min_freelist() doesn't
appear to be correct, either....

Also, that's a fixed value for the physical geometry of the
filesystem, so it should be calculated once at mount time and stored
in the xfs_mount (and only updated if needed at growfs time)...

> > new xfs_alloc_set_aside() that combines the macro calculation with
> > ->m_allocbt_blks. Then the whole "set aside" concept is calculated and
> > documented in one place. Hm?
> 
> I think I'd rather call the new function xfs_fdblocks_avail() over
> reusing an existing name, because I fear that zapping an old function
> and replacing it with a new function with the same name will cause
> confusion for anyone backporting patches or reading code after an
> absence.
> 
> Also the only reason we have a mount variable and a function (instead of
> a macro) is that Dave asked me to change the codebase away from the
> XFS_ALLOC_AG_MAX_USABLE/XFS_ALLOC_SET_ASIDE macros as part of merging
> reflink.

Yeah, macros wrapping a variable or repeated constant calculation
are bad, and it's something we've been cleaning up for a long
time...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] xfs: don't include bnobt blocks when reserving free block pool
  2022-03-16 21:15             ` Darrick J. Wong
@ 2022-03-17  2:19               ` Dave Chinner
  2022-03-17 12:53               ` Brian Foster
  1 sibling, 0 replies; 14+ messages in thread
From: Dave Chinner @ 2022-03-17  2:19 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Brian Foster, xfs

On Wed, Mar 16, 2022 at 02:15:26PM -0700, Darrick J. Wong wrote:
> Yeah, I at least like the idea of having a function that estimates how
> much free space we can try to reserve without using reserve pool blocks.
> I also am warming to the idea of telling users they can't have /any/ of
> those reserved blocks... particularly since I haven't even gotten to the
> "I fill my fs to 99% full and only then buy one more gigabyte of space,
> wash, rinse, repeat" complaints.

Go look at what I considered for the "thinp aware" filesystems a few
years ago:

https://lore.kernel.org/linux-xfs/20171026083322.20428-1-david@fromorbit.com/

The available space the filesystem presented to the user is
completely divorced from the physical device block count and
internal filesystem metadata space usage. Users simply didn't see
any of the space we reserved for metadata as space they could
allocate. i.e. it split the superblock sb_dblocks value away from
the number of data blocks we present to the user that they can
allocate as user data.

(https://lore.kernel.org/linux-xfs/20171026083322.20428-10-david@fromorbit.com/)

So, yeah, moving towards a model that essentially hides all the
reserved space from users and only presenting them with the amount
of space that is available without dipping into any reserves would
make things an awful lot simpler from an architectural POV...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] xfs: don't include bnobt blocks when reserving free block pool
  2022-03-16 21:15             ` Darrick J. Wong
  2022-03-17  2:19               ` Dave Chinner
@ 2022-03-17 12:53               ` Brian Foster
  1 sibling, 0 replies; 14+ messages in thread
From: Brian Foster @ 2022-03-17 12:53 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: xfs

On Wed, Mar 16, 2022 at 02:15:26PM -0700, Darrick J. Wong wrote:
> On Wed, Mar 16, 2022 at 03:17:27PM -0400, Brian Foster wrote:
> > On Wed, Mar 16, 2022 at 02:48:43PM -0400, Brian Foster wrote:
> > > On Wed, Mar 16, 2022 at 11:17:26AM -0700, Darrick J. Wong wrote:
> > > > On Wed, Mar 16, 2022 at 01:29:01PM -0400, Brian Foster wrote:
> > > > > On Wed, Mar 16, 2022 at 09:32:16AM -0700, Darrick J. Wong wrote:
> > > > > > On Wed, Mar 16, 2022 at 07:28:18AM -0400, Brian Foster wrote:
> > > > > > > On Mon, Mar 14, 2022 at 11:08:47AM -0700, Darrick J. Wong wrote:
> > > > > > > > From: Darrick J. Wong <djwong@kernel.org>
> > > > > > > > 
> > > > > > > > xfs_reserve_blocks controls the size of the user-visible free space
> > > > > > > > reserve pool.  Given the difference between the current and requested
> > > > > > > > pool sizes, it will try to reserve free space from fdblocks.  However,
> > > > > > > > the amount requested from fdblocks is also constrained by the amount of
> > > > > > > > space that we think xfs_mod_fdblocks will give us.  We'll keep trying to
> > > > > > > > reserve space so long as xfs_mod_fdblocks returns ENOSPC.
> > > > > > > > 
> > > > > > > > In commit fd43cf600cf6, we decided that xfs_mod_fdblocks should not hand
> > > > > > > > out the "free space" used by the free space btrees, because some portion
> > > > > > > > of the free space btrees hold in reserve space for future btree
> > > > > > > > expansion.  Unfortunately, xfs_reserve_blocks' estimation of the number
> > > > > > > > of blocks that it could request from xfs_mod_fdblocks was not updated to
> > > > > > > > include m_allocbt_blks, so if space is extremely low, the caller hangs.
> > > > > > > > 
> > > > > > > > Fix this by including m_allocbt_blks in the estimation, and modify the
> > > > > > > > loop so that it will not retry infinitely.
> > > > > > > > 
> > > > > > > > Found by running xfs/306 (which formats a single-AG 20MB filesystem)
> > > > > > > > with an fstests configuration that specifies a 1k blocksize and a
> > > > > > > > specially crafted log size that will consume 7/8 of the space (17920
> > > > > > > > blocks, specifically) in that AG.
> > > > > > > > 
> > > > > > > > Cc: Brian Foster <bfoster@redhat.com>
> > > > > > > > Fixes: fd43cf600cf6 ("xfs: set aside allocation btree blocks from block reservation")
> > > > > > > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > > > > > > ---
> > > > > > > >  fs/xfs/xfs_fsops.c |   16 ++++++++++++----
> > > > > > > >  1 file changed, 12 insertions(+), 4 deletions(-)
> > > > > > > > 
> > > > > > > > diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> > > > > > > > index 33e26690a8c4..78b6982ea5b0 100644
> > > > > > > > --- a/fs/xfs/xfs_fsops.c
> > > > > > > > +++ b/fs/xfs/xfs_fsops.c
> > > > > > > > @@ -379,6 +379,7 @@ xfs_reserve_blocks(
> > > > > > > >  	int64_t			fdblks_delta = 0;
> > > > > > > >  	uint64_t		request;
> > > > > > > >  	int64_t			free;
> > > > > > > > +	unsigned int		tries;
> > > > > > > >  	int			error = 0;
> > > > > > > >  
> > > > > > > >  	/* If inval is null, report current values and return */
> > > > > > > > @@ -432,9 +433,16 @@ xfs_reserve_blocks(
> > > > > > > >  	 * perform a partial reservation if the request exceeds free space.
> > > > > > > >  	 */
> > > > > > > >  	error = -ENOSPC;
> > > > > > > > -	do {
> > > > > > > > -		free = percpu_counter_sum(&mp->m_fdblocks) -
> > > > > > > > -						mp->m_alloc_set_aside;
> > > > > > > > +	for (tries = 0; tries < 30 && error == -ENOSPC; tries++) {
> > > > > > > 
> > > > > > > Any reason for the magic number of retries as opposed to perhaps just
> > > > > > > not retrying at all?
> > > > > > 
> > > > > > I /think/ the origins of the loop was commit dbcabad19aa9 ("[XFS] Fix
> > > > > > block reservation mechanism."), where I guess Dave decided that we
> > > > > > should loop forever trying to satisfy a request from userspace to
> > > > > > increase the reserve pool.  OFC you and I have been patching this
> > > > > > function to fix all its horrible warts over the years, so maybe you're
> > > > > > right that this should only try once...
> > > > > > 
> > > > > > (For the mount time default reservation, we should only iterate the loop
> > > > > > once (provided the accounting is correct ;) since nobody else is
> > > > > > touching the free space counters.)
> > > > > > 
> > > > > > > This seems a little odd when you think about it
> > > > > > > given that the request is already intended to take available space into
> > > > > > > account and modify the request from userspace. OTOH, another
> > > > > > > consideration could be to retry some (really large?) number of times and
> > > > > > > then bail out if we happen to iterate without an observable change in
> > > > > > > free space (i.e., something is wrong), however I suppose that could be
> > > > > > > racy as well. *shrug*
> > > > > > 
> > > > > > ...but if you're the sysadmin desperately trying to increase the size of
> > > > > > the reserve pool when the fs is running near ENOSPC, you're going to be
> > > > > > racing with fdblocks bouncing up and down.  The @free samples that we
> > > > > > take here in the loop body are indeed racy since we can't tell the
> > > > > > difference between @free being unchanged from the last iteration because
> > > > > > someone freed a block and someone else immediately consumed it, or a
> > > > > > totally idle system.
> > > > > > 
> > > > > > Either way, it's better than hanging the whole system. :)
> > > > > > 
> > > > > 
> > > > > Yeah.. I'm not bothered much by whether we retry once, 42 times or
> > > > > forever. I think what this boils down to for me is whether it's worth
> > > > > the risk of a behavior change of an -ENOSPC return causing something
> > > > > unexpected for some random user or use case. Could we just do this in
> > > > > two separate patches? Patch 1 fixes the calculation and targets stable,
> > > > > patch 2 does whatever to the retry loop that potentially changes retry
> > > > > semantics (and doesn't really need backporting)..?
> > > > 
> > > > Splitting the two patches sounds good to me.
> > > > 
> > > > > > What if I augment the loop control with a comment capturing some of this:
> > > > > > 
> > > > > > 	/*
> > > > > > 	 * The loop body estimates how many blocks it can request from
> > > > > > 	 * fdblocks to stash in the reserve pool.  This is a classic
> > > > > > 	 * TOCTOU race since fdblocks updates are not always coordinated
> > > > > > 	 * via m_sb_lock.  We also cannot tell if @free remaining
> > > > > > 	 * unchanged between iterations is due to an idle system or
> > > > > > 	 * freed blocks being consumed immediately, so we'll try a
> > > > > > 	 * finite number of times to satisfy the request.
> > > > > > 	 */
> > > > > > 	for (tries = 0; tries < 30...) {
> > > > > > 
> > > > > > > 
> > > > > > > > +		/*
> > > > > > > > +		 * The reservation pool cannot take space that xfs_mod_fdblocks
> > > > > > > > +		 * will not give us.  This includes the per-AG set-aside space
> > > > > > > > +		 * and free space btree blocks that are not available for
> > > > > > > > +		 * allocation due to per-AG metadata reservations.
> > > > > > > > +		 */
> > > > > > > > +		free = percpu_counter_sum(&mp->m_fdblocks);
> > > > > > > > +		free -= mp->m_alloc_set_aside;
> > > > > > > > +		free -= atomic64_read(&mp->m_allocbt_blks);
> > > > > > > 
> > > > > > > Seems reasonable. Do we want to consider ->m_allocbt_blks in other
> > > > > > > places where ->m_alloc_set_aside is used (i.e. xfs_fs_statfs(), etc.)?
> > > > > > > Not sure how much it matters for space reporting purposes, but if so, it
> > > > > > > might also be worth reconsidering the usefulness of a static field and
> > > > > > > initialization helper (i.e. xfs_alloc_set_aside()) if the majority of
> > > > > > > uses involve a dynamic calculation (due to ->m_allocbt_blks).
> > > > > > 
> > > > > > When I was writing this patch, I very nearly decided to make those three
> > > > > > lines above their own helper.  I didn't see any other spots that looked
> > > > > > like obvious candidates for such a calculation outside of statfs.
> > > > > > 
> > > > > 
> > > > > Indeed..
> > > > > 
> > > > > > Subtracting m_allocbt_blks from statfs' avail field is a behavior
> > > > > > change, since we always used to consider bnobt blocks as available.  We
> > > > > > don't have an exact count of how many blocks are needed to hide the
> > > > > > per-ag reserved extents, so in the end we have to decide whether we want
> > > > > > to hear complaints about over- or under-estimation of available blocks.
> > > > > > 
> > > > > > So I think the statfs stuff is a separate patch. :)
> > > > > > 
> > > > > 
> > > > > Similar deal as above.. I'm more interested in a potential cleanup of
> > > > > the code that helps prevent this sort of buglet for the next user of
> > > > > ->m_alloc_set_aside that will (expectedly) have no idea about this
> > > > > subtle quirk than I am about what's presented in the free space
> > > > > counters. ISTM that we ought to just ditch ->m_alloc_set_aside, replace
> > > > > the existing xfs_alloc_set_aside() with an XFS_ALLOC_FS_RESERVED() macro
> > > > > or something that just does the (agcount << 3) thing, and then define a
> > > > 
> > > > I'm not sure that the current xfs_alloc_set_aside code is correct.
> > > > Right now it comes with this comment:
> > > > 
> > > > "We need to reserve 4 fsbs _per AG_ for the freelist and 4 more to
> > > > handle a potential split of the file's bmap btree."
> > > > 
> > > > I think the first part ("4 fsbs _per AG_ for the freelist") is wrong.
> > > > AFAICT, that part refers to the number of blocks we need to keep free in
> > > > case we have to replenish a completely empty AGFL.  The hardcoded value
> > > > of 4 seems wrong, since xfs_alloc_min_freelist() is what _fix_freelist
> > > > uses to decide how big the AGFL needs to be, and it returns 6 on a
> > > > filesystem that has rmapbt enabled.  So I think XFS_ALLOC_AGFL_RESERVE
> > > > is wrong here and should be replaced with the function call.
> > > > 
> > > > I also think the second part ("and 4 more to handle a split of the
> > > > file's bmap btree") is wrong.  If we're really supposed to save enough
> > > > blocks to handle a bmbt split, then I think this ought to be
> > > > (mp->m_bm_maxlevels[0] - 1), not 4, right?  According to xfs_db, bmap
> > > > btrees can be 9 levels tall:
> > > > 
> > > > xfs_db> btheight bmapbt -n 4294967296 -b 512
> > > > bmapbt: worst case per 512-byte block: 13 records (leaf) / 13 keyptrs (node)
> > > > level 0: 4294967296 records, 330382100 blocks
> > > > level 1: 330382100 records, 25414008 blocks
> > > > level 2: 25414008 records, 1954924 blocks
> > > > level 3: 1954924 records, 150379 blocks
> > > > level 4: 150379 records, 11568 blocks
> > > > level 5: 11568 records, 890 blocks
> > > > level 6: 890 records, 69 blocks
> > > > level 7: 69 records, 6 blocks
> > > > level 8: 6 records, 1 block
> > > > 9 levels, 357913945 blocks total
> > > > 
> > > > The root level is in the inode, so we need 8 blocks to handle a full
> > > > split on a 512b block filesystem.
> > > > 
> > > > Granted, it's at least correct for 4k+ block filesystems:
> > > > 
> > > > xfs_db> btheight bmapbt -n 4294967296
> > > > bmapbt: worst case per 4096-byte block: 125 records (leaf) / 125 keyptrs (node)
> > > > level 0: 4294967296 records, 34359739 blocks
> > > > level 1: 34359739 records, 274878 blocks
> > > > level 2: 274878 records, 2200 blocks
> > > > level 3: 2200 records, 18 blocks
> > > > level 4: 18 records, 1 block
> > > > 5 levels, 34636836 blocks total
> > > > 
> > > > So in the end, I think that calculation should become:
> > > > 
> > > > unsigned int
> > > > xfs_alloc_set_aside(
> > > > 	struct xfs_mount	*mp)
> > > > {
> > > > 	unsigned int		min-agfl = xfs_alloc_min_freelist(mp, NULL);
> > > > 
> > > > 	return mp->m_sb.sb_agcount * (min_agfl + mp->m_bm_maxlevels[0] - 1);
> > > > }
> > > > 
> > > 
> > > I'm not familiar enough with the original context, but that seems
> > > reasonable to me at a glance. If we do want to change this, I'd again
> > > just suggest a separate patch since presumably there isn't any known bug
> > > report or anything associated with the current code.
> 
> Oh definitely, each of these little fixes nets their own patch.
> 
> > > > > new xfs_alloc_set_aside() that combines the macro calculation with
> > > > > ->m_allocbt_blks. Then the whole "set aside" concept is calculated and
> > > > > documented in one place. Hm?
> > > > 
> > > > I think I'd rather call the new function xfs_fdblocks_avail() over
> > > > reusing an existing name, because I fear that zapping an old function
> > > > and replacing it with a new function with the same name will cause
> > > > confusion for anyone backporting patches or reading code after an
> > > > absence.
> > > > 
> > > 
> > > The logical change is basically just to include the allocbt block count
> > > in the "set aside" calculation, not necessarily redefine what "set
> > > aside" means, so I don't really see that as a problem. The function is
> > > only used in a couple places, but we've managed to make it confusing by
> > > conflating the naming of the function, xfs_mount field and local
> > > variables in certain contexts.
> 
> Hmm.  The way I see things, the setaside is the *minimum* amount of free
> space that we need to withhold from the user to guarantee that we can
> always make progress in a file mapping update.  m_allocbt_blks is the
> amount of space used by free space btrees (currently, the entire btree)
> that we withhold from the user to guarantee that btree expansions from
> the per-AG reservations can actually be filled.
> 
> The reserve pool looks to me like a user-controlled chunk of space that
> we also withhold to avoid ENOSPC failure from within writeback; and the
> per-AG reservations of course are even more space withheld to guarantee
> btree expansion.
> 
> IOWs, there are four (I hope!) different free space pools that we use to
> guarantee forward progress on this or that operation, and the only thing
> they have in common is they don't get reported to userspace as available
> blocks.  I don't know that I want to go mixing them too much.
> 

Yeah, I'm.. not sure where the idea of mixing them comes from.

To try and clarify, I was just prioritizing reducing confusion of the
set aside bits over whether they happen to be reported as available
space or not. I.e., if we wanted to consolidate to a single helper and
use that everywhere with the side effect that set aside blocks are not
shown as free, then that sounds reasonable to me (re: to Dave's follow
up point as well). OTOH, if you had some reason to not do the latter,
then we should at least try to eliminate the need to duplicate the set
aside naming/calculation where it does happen to be used, because either
way the current code is sort of written to nearly ensure it won't be
used correctly.

Brian

> Though, I agree that all of this is really confusing and needs a central
> function that makes it obvious how we compute available blocks given all
> the free space pools that we have.
> 
> > > BTW, do we ever really expect the "set aside" value to be of any
> > > significance relative to the perag reservation values for any moderately
> > > recent kernel?
> 
> Nope.  Any non-tiny filesystem with at least finobt=1 is going to have
> way more per-AG reservation than either the setaside or the reserve
> pool.
> 
> > > IIRC (most) perag res is not presented as free space, so
> > > I'm curious if it would even be noticeable if we just fixed and used
> > > xfs_alloc_set_aside() consistently across the board..
> 
> It isn't.  On a 500G test image with reflink, rmap, and finobt turned
> on, the per-AG reservations consume ~630kblocks for each of four AGs of
> about ~32.7mblocks apiece.  That's about 2%.  The reservation pool is
> 8192 blocks and the set-aside is 4*8=32 blocks, both of which are
> rounding error by comparison.
> 
> At maximal free space fragmentation, the free space btrees would consume
> (worst case) about 65kblocks per AG.  That might be noticeable.
> 
> > I suppose it's reasonable to want to include allocbt blocks as free
> > space in general, though, since technically they are eventually usable
> > blocks. So naming and whatnot aside, at the end the day I'm good with
> > anything that refactors this code one way or another such that it's
> > harder to repeat the same mistake.
> 
> Yeah, I at least like the idea of having a function that estimates how
> much free space we can try to reserve without using reserve pool blocks.
> I also am warming to the idea of telling users they can't have /any/ of
> those reserved blocks... particularly since I haven't even gotten to the
> "I fill my fs to 99% full and only then buy one more gigabyte of space,
> wash, rinse, repeat" complaints.
> 
> --D
> 
> > 
> > Brian
> > 
> > > Brian
> > > 
> > > > Also the only reason we have a mount variable and a function (instead of
> > > > a macro) is that Dave asked me to change the codebase away from the
> > > > XFS_ALLOC_AG_MAX_USABLE/XFS_ALLOC_SET_ASIDE macros as part of merging
> > > > reflink.
> > > > 
> > > > --D
> > > > 
> > > > > Brian
> > > > > 
> > > > > > --D
> > > > > > 
> > > > > > > 
> > > > > > > Brian
> > > > > > > 
> > > > > > > >  		if (free <= 0)
> > > > > > > >  			break;
> > > > > > > >  
> > > > > > > > @@ -459,7 +467,7 @@ xfs_reserve_blocks(
> > > > > > > >  		spin_unlock(&mp->m_sb_lock);
> > > > > > > >  		error = xfs_mod_fdblocks(mp, -fdblks_delta, 0);
> > > > > > > >  		spin_lock(&mp->m_sb_lock);
> > > > > > > > -	} while (error == -ENOSPC);
> > > > > > > > +	}
> > > > > > > >  
> > > > > > > >  	/*
> > > > > > > >  	 * Update the reserve counters if blocks have been successfully
> > > > > > > > 
> > > > > > > 
> > > > > > 
> > > > > 
> > > > 
> > 
> 


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] xfs: don't include bnobt blocks when reserving free block pool
  2022-03-17  2:05         ` Dave Chinner
@ 2022-03-17 12:56           ` Brian Foster
  2022-03-17 15:46             ` Darrick J. Wong
  2022-03-17 17:05           ` Darrick J. Wong
  1 sibling, 1 reply; 14+ messages in thread
From: Brian Foster @ 2022-03-17 12:56 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Darrick J. Wong, xfs

On Thu, Mar 17, 2022 at 01:05:26PM +1100, Dave Chinner wrote:
> On Wed, Mar 16, 2022 at 11:17:26AM -0700, Darrick J. Wong wrote:
> > On Wed, Mar 16, 2022 at 01:29:01PM -0400, Brian Foster wrote:
> > > On Wed, Mar 16, 2022 at 09:32:16AM -0700, Darrick J. Wong wrote:
> > > > On Wed, Mar 16, 2022 at 07:28:18AM -0400, Brian Foster wrote:
> > > > > On Mon, Mar 14, 2022 at 11:08:47AM -0700, Darrick J. Wong wrote:
> > > Similar deal as above.. I'm more interested in a potential cleanup of
> > > the code that helps prevent this sort of buglet for the next user of
> > > ->m_alloc_set_aside that will (expectedly) have no idea about this
> > > subtle quirk than I am about what's presented in the free space
> > > counters. ISTM that we ought to just ditch ->m_alloc_set_aside, replace
> > > the existing xfs_alloc_set_aside() with an XFS_ALLOC_FS_RESERVED() macro
> > > or something that just does the (agcount << 3) thing, and then define a
> > 
> > I'm not sure that the current xfs_alloc_set_aside code is correct.
> > Right now it comes with this comment:
> > 
> > "We need to reserve 4 fsbs _per AG_ for the freelist and 4 more to
> > handle a potential split of the file's bmap btree."
> >
> > I think the first part ("4 fsbs _per AG_ for the freelist") is wrong.
> > AFAICT, that part refers to the number of blocks we need to keep free in
> > case we have to replenish a completely empty AGFL.  The hardcoded value
> > of 4 seems wrong, since xfs_alloc_min_freelist() is what _fix_freelist
> > uses to decide how big the AGFL needs to be, and it returns 6 on a
> > filesystem that has rmapbt enabled.  So I think XFS_ALLOC_AGFL_RESERVE
> > is wrong here and should be replaced with the function call.
> 
> Back when I wrote that code (circa 2007, IIRC), that was actually
> correct according to the reservations that were made when freeing
> an extent at ENOSPC.
> 
> We needed 4 blocks for the AGFL fixup to always succeed  - 2 blocks
> for each BNO and CNT btrees, and, IIRC, the extent free reservation
> was just 4 blocks at that time. Hence the 4+4 value.
> 
> However, you are right that rmap also adds another per-ag btree that
> is allocated from the agfl and that set_aside() should be taking
> that into accout. That said, I think that xfs_alloc_min_freelist()
> might even be wrong by just adding 2 blocks to the AGFL for the
> rmapbt.
> 
> That is, at ENOSPC the rmapbt can be a *big* btree. It's not like
> the BNO and CNT btrees which are completely empty at that point in
> time; the RMAP tree could be one level below max height, and freeing
> a single block could split a rmap rec and trigger a full height RMAP
> split.
> 
> So the minimum free list length in that case is 2 + 2 + MAX_RMAP_HEIGHT.
> 
> > I also think the second part ("and 4 more to handle a split of the
> > file's bmap btree") is wrong.  If we're really supposed to save enough
> > blocks to handle a bmbt split, then I think this ought to be
> > (mp->m_bm_maxlevels[0] - 1), not 4, right?  According to xfs_db, bmap
> > btrees can be 9 levels tall:
> 
> Yes, we've changed the BMBT reservations in the years since that
> code was written to handle max height reservations correctly, too.
> So, like the RMAP btree reservation, we probably should be reserving
> MAX_BMAP_HEIGHT in the set-aside calculation.
> 
> refcount btree space is handled by the ag_resv code and blocks
> aren't allocated from the AGFL, so I don't think we need to take
> taht into account for xfs_alloc_set_aside().
> 
> > So in the end, I think that calculation should become:
> > 
> > unsigned int
> > xfs_alloc_set_aside(
> > 	struct xfs_mount	*mp)
> > {
> > 	unsigned int		min-agfl = xfs_alloc_min_freelist(mp, NULL);
> > 
> > 	return mp->m_sb.sb_agcount * (min_agfl + mp->m_bm_maxlevels[0] - 1);
> > }
> 
> *nod*, but with the proviso that xfs_alloc_min_freelist() doesn't
> appear to be correct, either....
> 
> Also, that's a fixed value for the physical geometry of the
> filesystem, so it should be calculated once at mount time and stored
> in the xfs_mount (and only updated if needed at growfs time)...
> 

To my earlier point... please just don't call this fixed mount value
"set_aside" if that's not what it actually is. Rename the field and
helper to something self-descriptive based on whatever fixed components
it's made up of (you could even qualify it as a subcomponent of
set_aside with something like ".._agfl_bmap_set_aside" or whatever) then
reserve the _set_aside() name for the helper that calculates and
documents what the actual/final/dynamic "set aside" value is.

Brian

> > > new xfs_alloc_set_aside() that combines the macro calculation with
> > > ->m_allocbt_blks. Then the whole "set aside" concept is calculated and
> > > documented in one place. Hm?
> > 
> > I think I'd rather call the new function xfs_fdblocks_avail() over
> > reusing an existing name, because I fear that zapping an old function
> > and replacing it with a new function with the same name will cause
> > confusion for anyone backporting patches or reading code after an
> > absence.
> > 
> > Also the only reason we have a mount variable and a function (instead of
> > a macro) is that Dave asked me to change the codebase away from the
> > XFS_ALLOC_AG_MAX_USABLE/XFS_ALLOC_SET_ASIDE macros as part of merging
> > reflink.
> 
> Yeah, macros wrapping a variable or repeated constant calculation
> are bad, and it's something we've been cleaning up for a long
> time...
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] xfs: don't include bnobt blocks when reserving free block pool
  2022-03-17 12:56           ` Brian Foster
@ 2022-03-17 15:46             ` Darrick J. Wong
  0 siblings, 0 replies; 14+ messages in thread
From: Darrick J. Wong @ 2022-03-17 15:46 UTC (permalink / raw)
  To: Brian Foster; +Cc: Dave Chinner, xfs

On Thu, Mar 17, 2022 at 08:56:55AM -0400, Brian Foster wrote:
> On Thu, Mar 17, 2022 at 01:05:26PM +1100, Dave Chinner wrote:
> > On Wed, Mar 16, 2022 at 11:17:26AM -0700, Darrick J. Wong wrote:
> > > On Wed, Mar 16, 2022 at 01:29:01PM -0400, Brian Foster wrote:
> > > > On Wed, Mar 16, 2022 at 09:32:16AM -0700, Darrick J. Wong wrote:
> > > > > On Wed, Mar 16, 2022 at 07:28:18AM -0400, Brian Foster wrote:
> > > > > > On Mon, Mar 14, 2022 at 11:08:47AM -0700, Darrick J. Wong wrote:
> > > > Similar deal as above.. I'm more interested in a potential cleanup of
> > > > the code that helps prevent this sort of buglet for the next user of
> > > > ->m_alloc_set_aside that will (expectedly) have no idea about this
> > > > subtle quirk than I am about what's presented in the free space
> > > > counters. ISTM that we ought to just ditch ->m_alloc_set_aside, replace
> > > > the existing xfs_alloc_set_aside() with an XFS_ALLOC_FS_RESERVED() macro
> > > > or something that just does the (agcount << 3) thing, and then define a
> > > 
> > > I'm not sure that the current xfs_alloc_set_aside code is correct.
> > > Right now it comes with this comment:
> > > 
> > > "We need to reserve 4 fsbs _per AG_ for the freelist and 4 more to
> > > handle a potential split of the file's bmap btree."
> > >
> > > I think the first part ("4 fsbs _per AG_ for the freelist") is wrong.
> > > AFAICT, that part refers to the number of blocks we need to keep free in
> > > case we have to replenish a completely empty AGFL.  The hardcoded value
> > > of 4 seems wrong, since xfs_alloc_min_freelist() is what _fix_freelist
> > > uses to decide how big the AGFL needs to be, and it returns 6 on a
> > > filesystem that has rmapbt enabled.  So I think XFS_ALLOC_AGFL_RESERVE
> > > is wrong here and should be replaced with the function call.
> > 
> > Back when I wrote that code (circa 2007, IIRC), that was actually
> > correct according to the reservations that were made when freeing
> > an extent at ENOSPC.
> > 
> > We needed 4 blocks for the AGFL fixup to always succeed  - 2 blocks
> > for each BNO and CNT btrees, and, IIRC, the extent free reservation
> > was just 4 blocks at that time. Hence the 4+4 value.
> > 
> > However, you are right that rmap also adds another per-ag btree that
> > is allocated from the agfl and that set_aside() should be taking
> > that into accout. That said, I think that xfs_alloc_min_freelist()
> > might even be wrong by just adding 2 blocks to the AGFL for the
> > rmapbt.
> > 
> > That is, at ENOSPC the rmapbt can be a *big* btree. It's not like
> > the BNO and CNT btrees which are completely empty at that point in
> > time; the RMAP tree could be one level below max height, and freeing
> > a single block could split a rmap rec and trigger a full height RMAP
> > split.
> > 
> > So the minimum free list length in that case is 2 + 2 + MAX_RMAP_HEIGHT.
> > 
> > > I also think the second part ("and 4 more to handle a split of the
> > > file's bmap btree") is wrong.  If we're really supposed to save enough
> > > blocks to handle a bmbt split, then I think this ought to be
> > > (mp->m_bm_maxlevels[0] - 1), not 4, right?  According to xfs_db, bmap
> > > btrees can be 9 levels tall:
> > 
> > Yes, we've changed the BMBT reservations in the years since that
> > code was written to handle max height reservations correctly, too.
> > So, like the RMAP btree reservation, we probably should be reserving
> > MAX_BMAP_HEIGHT in the set-aside calculation.
> > 
> > refcount btree space is handled by the ag_resv code and blocks
> > aren't allocated from the AGFL, so I don't think we need to take
> > taht into account for xfs_alloc_set_aside().
> > 
> > > So in the end, I think that calculation should become:
> > > 
> > > unsigned int
> > > xfs_alloc_set_aside(
> > > 	struct xfs_mount	*mp)
> > > {
> > > 	unsigned int		min-agfl = xfs_alloc_min_freelist(mp, NULL);
> > > 
> > > 	return mp->m_sb.sb_agcount * (min_agfl + mp->m_bm_maxlevels[0] - 1);
> > > }
> > 
> > *nod*, but with the proviso that xfs_alloc_min_freelist() doesn't
> > appear to be correct, either....
> > 
> > Also, that's a fixed value for the physical geometry of the
> > filesystem, so it should be calculated once at mount time and stored
> > in the xfs_mount (and only updated if needed at growfs time)...
> > 
> 
> To my earlier point... please just don't call this fixed mount value
> "set_aside" if that's not what it actually is. Rename the field and
> helper to something self-descriptive based on whatever fixed components
> it's made up of (you could even qualify it as a subcomponent of
> set_aside with something like ".._agfl_bmap_set_aside" or whatever) then
> reserve the _set_aside() name for the helper that calculates and
> documents what the actual/final/dynamic "set aside" value is.

Ahh, ok, that's what you were getting at.  Every time I look at
"alloc_set_aside" I have to figure out what that /really/ does, and I
think "agfl_bmap_setaside" (or even "bmbt_split_setaside") is a better
hint to what this actually does.

--D

> Brian
> 
> > > > new xfs_alloc_set_aside() that combines the macro calculation with
> > > > ->m_allocbt_blks. Then the whole "set aside" concept is calculated and
> > > > documented in one place. Hm?
> > > 
> > > I think I'd rather call the new function xfs_fdblocks_avail() over
> > > reusing an existing name, because I fear that zapping an old function
> > > and replacing it with a new function with the same name will cause
> > > confusion for anyone backporting patches or reading code after an
> > > absence.
> > > 
> > > Also the only reason we have a mount variable and a function (instead of
> > > a macro) is that Dave asked me to change the codebase away from the
> > > XFS_ALLOC_AG_MAX_USABLE/XFS_ALLOC_SET_ASIDE macros as part of merging
> > > reflink.
> > 
> > Yeah, macros wrapping a variable or repeated constant calculation
> > are bad, and it's something we've been cleaning up for a long
> > time...
> > 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
> > 
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] xfs: don't include bnobt blocks when reserving free block pool
  2022-03-17  2:05         ` Dave Chinner
  2022-03-17 12:56           ` Brian Foster
@ 2022-03-17 17:05           ` Darrick J. Wong
  1 sibling, 0 replies; 14+ messages in thread
From: Darrick J. Wong @ 2022-03-17 17:05 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Brian Foster, xfs

On Thu, Mar 17, 2022 at 01:05:26PM +1100, Dave Chinner wrote:
> On Wed, Mar 16, 2022 at 11:17:26AM -0700, Darrick J. Wong wrote:
> > On Wed, Mar 16, 2022 at 01:29:01PM -0400, Brian Foster wrote:
> > > On Wed, Mar 16, 2022 at 09:32:16AM -0700, Darrick J. Wong wrote:
> > > > On Wed, Mar 16, 2022 at 07:28:18AM -0400, Brian Foster wrote:
> > > > > On Mon, Mar 14, 2022 at 11:08:47AM -0700, Darrick J. Wong wrote:
> > > Similar deal as above.. I'm more interested in a potential cleanup of
> > > the code that helps prevent this sort of buglet for the next user of
> > > ->m_alloc_set_aside that will (expectedly) have no idea about this
> > > subtle quirk than I am about what's presented in the free space
> > > counters. ISTM that we ought to just ditch ->m_alloc_set_aside, replace
> > > the existing xfs_alloc_set_aside() with an XFS_ALLOC_FS_RESERVED() macro
> > > or something that just does the (agcount << 3) thing, and then define a
> > 
> > I'm not sure that the current xfs_alloc_set_aside code is correct.
> > Right now it comes with this comment:
> > 
> > "We need to reserve 4 fsbs _per AG_ for the freelist and 4 more to
> > handle a potential split of the file's bmap btree."
> >
> > I think the first part ("4 fsbs _per AG_ for the freelist") is wrong.
> > AFAICT, that part refers to the number of blocks we need to keep free in
> > case we have to replenish a completely empty AGFL.  The hardcoded value
> > of 4 seems wrong, since xfs_alloc_min_freelist() is what _fix_freelist
> > uses to decide how big the AGFL needs to be, and it returns 6 on a
> > filesystem that has rmapbt enabled.  So I think XFS_ALLOC_AGFL_RESERVE
> > is wrong here and should be replaced with the function call.
> 
> Back when I wrote that code (circa 2007, IIRC), that was actually
> correct according to the reservations that were made when freeing
> an extent at ENOSPC.
> 
> We needed 4 blocks for the AGFL fixup to always succeed  - 2 blocks
> for each BNO and CNT btrees, and, IIRC, the extent free reservation
> was just 4 blocks at that time. Hence the 4+4 value.
> 
> However, you are right that rmap also adds another per-ag btree that
> is allocated from the agfl and that set_aside() should be taking
> that into accout. That said, I think that xfs_alloc_min_freelist()
> might even be wrong by just adding 2 blocks to the AGFL for the
> rmapbt.
> 
> That is, at ENOSPC the rmapbt can be a *big* btree. It's not like
> the BNO and CNT btrees which are completely empty at that point in
> time; the RMAP tree could be one level below max height, and freeing
> a single block could split a rmap rec and trigger a full height RMAP
> split.
> 
> So the minimum free list length in that case is 2 + 2 + MAX_RMAP_HEIGHT.

The rmap btree can become a big btree, but the per-ag rmapbt reservation
ensures that there's enough free space to refill the AGFL to handle the
rmap btree expanding to its maximum allowable size.  XFS_AG_RESV_RMAPBT
is subtracted from fdblocks, so I don't think alloc_set_aside ought to
withhold even more blocks from xfs_mod_fdblocks.

IOWS, I was wrong earlier -- we only need to withhold enough space from
fdblocks to handle splits of the bnobt and cntbt at or near ENOSPC.  The
value 4 is actually correct, but needs much better explanation.
Especially for benefit of the original author. ;)

> > I also think the second part ("and 4 more to handle a split of the
> > file's bmap btree") is wrong.  If we're really supposed to save enough
> > blocks to handle a bmbt split, then I think this ought to be
> > (mp->m_bm_maxlevels[0] - 1), not 4, right?  According to xfs_db, bmap
> > btrees can be 9 levels tall:
> 
> Yes, we've changed the BMBT reservations in the years since that
> code was written to handle max height reservations correctly, too.
> So, like the RMAP btree reservation, we probably should be reserving
> MAX_BMAP_HEIGHT in the set-aside calculation.

Right.

> refcount btree space is handled by the ag_resv code and blocks
> aren't allocated from the AGFL, so I don't think we need to take
> taht into account for xfs_alloc_set_aside().

Right.

> > So in the end, I think that calculation should become:
> > 
> > unsigned int
> > xfs_alloc_set_aside(
> > 	struct xfs_mount	*mp)
> > {
> > 	unsigned int		min-agfl = xfs_alloc_min_freelist(mp, NULL);
> > 
> > 	return mp->m_sb.sb_agcount * (min_agfl + mp->m_bm_maxlevels[0] - 1);
> > }
> 
> *nod*, but with the proviso that xfs_alloc_min_freelist() doesn't
> appear to be correct, either....
> 
> Also, that's a fixed value for the physical geometry of the
> filesystem, so it should be calculated once at mount time and stored
> in the xfs_mount (and only updated if needed at growfs time)...

There are three callers of xfs_alloc_min_freelist(, NULL) now.  One of
them is the the function that does the root inode calculation, which we
only use in mkfs and repair.

The other two are xfs_alloc_set_aside and xfs_alloc_ag_max_usable, and
we already cache the return value of those two functions, so I don't see
why we need to cache xfs_alloc_min_freelist separately?

(Or even touch it at all, really...)

--D

> > > new xfs_alloc_set_aside() that combines the macro calculation with
> > > ->m_allocbt_blks. Then the whole "set aside" concept is calculated and
> > > documented in one place. Hm?
> > 
> > I think I'd rather call the new function xfs_fdblocks_avail() over
> > reusing an existing name, because I fear that zapping an old function
> > and replacing it with a new function with the same name will cause
> > confusion for anyone backporting patches or reading code after an
> > absence.
> > 
> > Also the only reason we have a mount variable and a function (instead of
> > a macro) is that Dave asked me to change the codebase away from the
> > XFS_ALLOC_AG_MAX_USABLE/XFS_ALLOC_SET_ASIDE macros as part of merging
> > reflink.
> 
> Yeah, macros wrapping a variable or repeated constant calculation
> are bad, and it's something we've been cleaning up for a long
> time...
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2022-03-17 17:05 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-14 18:08 [PATCH] xfs: don't include bnobt blocks when reserving free block pool Darrick J. Wong
2022-03-16 11:28 ` Brian Foster
2022-03-16 16:32   ` Darrick J. Wong
2022-03-16 17:29     ` Brian Foster
2022-03-16 18:17       ` Darrick J. Wong
2022-03-16 18:48         ` Brian Foster
2022-03-16 19:17           ` Brian Foster
2022-03-16 21:15             ` Darrick J. Wong
2022-03-17  2:19               ` Dave Chinner
2022-03-17 12:53               ` Brian Foster
2022-03-17  2:05         ` Dave Chinner
2022-03-17 12:56           ` Brian Foster
2022-03-17 15:46             ` Darrick J. Wong
2022-03-17 17:05           ` Darrick J. Wong

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.