All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Darrick J. Wong" <djwong@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: Brian Foster <bfoster@redhat.com>, xfs <linux-xfs@vger.kernel.org>
Subject: Re: [PATCH] xfs: don't include bnobt blocks when reserving free block pool
Date: Thu, 17 Mar 2022 10:05:33 -0700	[thread overview]
Message-ID: <20220317170533.GC8224@magnolia> (raw)
In-Reply-To: <20220317020526.GV3927073@dread.disaster.area>

On Thu, Mar 17, 2022 at 01:05:26PM +1100, Dave Chinner wrote:
> On Wed, Mar 16, 2022 at 11:17:26AM -0700, Darrick J. Wong wrote:
> > On Wed, Mar 16, 2022 at 01:29:01PM -0400, Brian Foster wrote:
> > > On Wed, Mar 16, 2022 at 09:32:16AM -0700, Darrick J. Wong wrote:
> > > > On Wed, Mar 16, 2022 at 07:28:18AM -0400, Brian Foster wrote:
> > > > > On Mon, Mar 14, 2022 at 11:08:47AM -0700, Darrick J. Wong wrote:
> > > Similar deal as above.. I'm more interested in a potential cleanup of
> > > the code that helps prevent this sort of buglet for the next user of
> > > ->m_alloc_set_aside that will (expectedly) have no idea about this
> > > subtle quirk than I am about what's presented in the free space
> > > counters. ISTM that we ought to just ditch ->m_alloc_set_aside, replace
> > > the existing xfs_alloc_set_aside() with an XFS_ALLOC_FS_RESERVED() macro
> > > or something that just does the (agcount << 3) thing, and then define a
> > 
> > I'm not sure that the current xfs_alloc_set_aside code is correct.
> > Right now it comes with this comment:
> > 
> > "We need to reserve 4 fsbs _per AG_ for the freelist and 4 more to
> > handle a potential split of the file's bmap btree."
> >
> > I think the first part ("4 fsbs _per AG_ for the freelist") is wrong.
> > AFAICT, that part refers to the number of blocks we need to keep free in
> > case we have to replenish a completely empty AGFL.  The hardcoded value
> > of 4 seems wrong, since xfs_alloc_min_freelist() is what _fix_freelist
> > uses to decide how big the AGFL needs to be, and it returns 6 on a
> > filesystem that has rmapbt enabled.  So I think XFS_ALLOC_AGFL_RESERVE
> > is wrong here and should be replaced with the function call.
> 
> Back when I wrote that code (circa 2007, IIRC), that was actually
> correct according to the reservations that were made when freeing
> an extent at ENOSPC.
> 
> We needed 4 blocks for the AGFL fixup to always succeed  - 2 blocks
> for each BNO and CNT btrees, and, IIRC, the extent free reservation
> was just 4 blocks at that time. Hence the 4+4 value.
> 
> However, you are right that rmap also adds another per-ag btree that
> is allocated from the agfl and that set_aside() should be taking
> that into accout. That said, I think that xfs_alloc_min_freelist()
> might even be wrong by just adding 2 blocks to the AGFL for the
> rmapbt.
> 
> That is, at ENOSPC the rmapbt can be a *big* btree. It's not like
> the BNO and CNT btrees which are completely empty at that point in
> time; the RMAP tree could be one level below max height, and freeing
> a single block could split a rmap rec and trigger a full height RMAP
> split.
> 
> So the minimum free list length in that case is 2 + 2 + MAX_RMAP_HEIGHT.

The rmap btree can become a big btree, but the per-ag rmapbt reservation
ensures that there's enough free space to refill the AGFL to handle the
rmap btree expanding to its maximum allowable size.  XFS_AG_RESV_RMAPBT
is subtracted from fdblocks, so I don't think alloc_set_aside ought to
withhold even more blocks from xfs_mod_fdblocks.

IOWS, I was wrong earlier -- we only need to withhold enough space from
fdblocks to handle splits of the bnobt and cntbt at or near ENOSPC.  The
value 4 is actually correct, but needs much better explanation.
Especially for benefit of the original author. ;)

> > I also think the second part ("and 4 more to handle a split of the
> > file's bmap btree") is wrong.  If we're really supposed to save enough
> > blocks to handle a bmbt split, then I think this ought to be
> > (mp->m_bm_maxlevels[0] - 1), not 4, right?  According to xfs_db, bmap
> > btrees can be 9 levels tall:
> 
> Yes, we've changed the BMBT reservations in the years since that
> code was written to handle max height reservations correctly, too.
> So, like the RMAP btree reservation, we probably should be reserving
> MAX_BMAP_HEIGHT in the set-aside calculation.

Right.

> refcount btree space is handled by the ag_resv code and blocks
> aren't allocated from the AGFL, so I don't think we need to take
> taht into account for xfs_alloc_set_aside().

Right.

> > So in the end, I think that calculation should become:
> > 
> > unsigned int
> > xfs_alloc_set_aside(
> > 	struct xfs_mount	*mp)
> > {
> > 	unsigned int		min-agfl = xfs_alloc_min_freelist(mp, NULL);
> > 
> > 	return mp->m_sb.sb_agcount * (min_agfl + mp->m_bm_maxlevels[0] - 1);
> > }
> 
> *nod*, but with the proviso that xfs_alloc_min_freelist() doesn't
> appear to be correct, either....
> 
> Also, that's a fixed value for the physical geometry of the
> filesystem, so it should be calculated once at mount time and stored
> in the xfs_mount (and only updated if needed at growfs time)...

There are three callers of xfs_alloc_min_freelist(, NULL) now.  One of
them is the the function that does the root inode calculation, which we
only use in mkfs and repair.

The other two are xfs_alloc_set_aside and xfs_alloc_ag_max_usable, and
we already cache the return value of those two functions, so I don't see
why we need to cache xfs_alloc_min_freelist separately?

(Or even touch it at all, really...)

--D

> > > new xfs_alloc_set_aside() that combines the macro calculation with
> > > ->m_allocbt_blks. Then the whole "set aside" concept is calculated and
> > > documented in one place. Hm?
> > 
> > I think I'd rather call the new function xfs_fdblocks_avail() over
> > reusing an existing name, because I fear that zapping an old function
> > and replacing it with a new function with the same name will cause
> > confusion for anyone backporting patches or reading code after an
> > absence.
> > 
> > Also the only reason we have a mount variable and a function (instead of
> > a macro) is that Dave asked me to change the codebase away from the
> > XFS_ALLOC_AG_MAX_USABLE/XFS_ALLOC_SET_ASIDE macros as part of merging
> > reflink.
> 
> Yeah, macros wrapping a variable or repeated constant calculation
> are bad, and it's something we've been cleaning up for a long
> time...
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

      parent reply	other threads:[~2022-03-17 17:05 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-03-14 18:08 [PATCH] xfs: don't include bnobt blocks when reserving free block pool Darrick J. Wong
2022-03-16 11:28 ` Brian Foster
2022-03-16 16:32   ` Darrick J. Wong
2022-03-16 17:29     ` Brian Foster
2022-03-16 18:17       ` Darrick J. Wong
2022-03-16 18:48         ` Brian Foster
2022-03-16 19:17           ` Brian Foster
2022-03-16 21:15             ` Darrick J. Wong
2022-03-17  2:19               ` Dave Chinner
2022-03-17 12:53               ` Brian Foster
2022-03-17  2:05         ` Dave Chinner
2022-03-17 12:56           ` Brian Foster
2022-03-17 15:46             ` Darrick J. Wong
2022-03-17 17:05           ` Darrick J. Wong [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220317170533.GC8224@magnolia \
    --to=djwong@kernel.org \
    --cc=bfoster@redhat.com \
    --cc=david@fromorbit.com \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.