Re: [PATCH 04/22] xfs: add helpers to dispose of old btree blocks after a repair

From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH 04/22] xfs: add helpers to dispose of old btree blocks after a repair
Date: Wed, 16 May 2018 22:58:05 -0700
Message-ID: <20180517055805.GR23858@magnolia>
In-Reply-To: <20180516231820.GO23858@magnolia>

On Wed, May 16, 2018 at 04:18:20PM -0700, Darrick J. Wong wrote:
> On Thu, May 17, 2018 at 08:32:25AM +1000, Dave Chinner wrote:
> > On Wed, May 16, 2018 at 12:34:25PM -0700, Darrick J. Wong wrote:
> > > On Wed, May 16, 2018 at 06:32:32PM +1000, Dave Chinner wrote:
> > > > On Tue, May 15, 2018 at 03:34:04PM -0700, Darrick J. Wong wrote:
> > > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > 
> > > > > Now that we've plumbed in the ability to construct a list of dead btree
> > > > > blocks following a repair, add more helpers to dispose of them.  This is
> > > > > done by examining the rmapbt -- if the btree was the only owner we can
> > > > > free the block, otherwise it's crosslinked and we can only remove the
> > > > > rmapbt record.
> > > > > 
> > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > ---
> > 
> > [...]
> > 
> > > > > +	struct xfs_owner_info		oinfo;
> > > > > +	struct xfs_perag		*pag;
> > > > > +	int				error;
> > > > > +
> > > > > +	/* Make sure there's space on the freelist. */
> > > > > +	error = xfs_repair_fix_freelist(sc, true);
> > > > > +	if (error)
> > > > > +		return error;
> > > > > +	pag = xfs_perag_get(sc->mp, sc->sa.agno);
> > > > 
> > > > Because this is how it quickly gets to silly numbers of lookups.
> > > > That's two now in this function.
> > > > 
> > > > > +	if (pag->pagf_flcount == 0) {
> > > > > +		xfs_perag_put(pag);
> > > > > +		return -EFSCORRUPTED;
> > > > 
> > > > Why is having an empty freelist a problem here? It's an AG that is
> > > > completely out of space, but it isn't corruption? And I don't see
> > > > why an empty freelist prevents us from adding a block back onto the
> > > > AGFL?
> > 
> > I think you missed a question :P
> 
> Doh, sorry.  I don't remember exactly why I put that in there; judging
> from my notes I think the idea was that if the AG is completely full
> we'd rather shut down with a corruption signal hoping that the admin
> will run xfs_repair.
> 
> I also don't see why it's necessary now, I'll see what happens if I
> remove it.
> 
> > > > > +	/* Can we find any other rmappings? */
> > > > > +	error = xfs_rmap_has_other_keys(cur, agbno, 1, oinfo, &has_other_rmap);
> > > > > +	if (error)
> > > > > +		goto out_cur;
> > > > > +	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
> > > > > +
> > > > > +	/*
> > > > > +	 * If there are other rmappings, this block is cross linked and must
> > > > > +	 * not be freed.  Remove the reverse mapping and move on.  Otherwise,
> > > > 
> > > > Why do we just remove the reverse mapping if the block cannot be
> > > > freed? I have my suspicions that this is removing cross-links one by
> > > > one until there's only one reference left to the extent, but then I
> > > > ask "how do we know which one is the correct mapping"?
> > > 
> > > Right.  Prior to calling this function we built a totally new btree with
> > > blocks from the freespace, so now we need to remove the rmaps that
> > > covered the old btree and/or free the block.  The goal is to rebuild
> > > /all/ the trees that think they own this block so that we can free the
> > > block and not have to care which one is correct.
> > 
> > Ok, so we've already rebuilt the new btree, and this is removing
> > stale references to cross-linked blocks that have owners different
> > to the one we are currently scanning.
> > 
> > What happens if the cross-linked block is cross-linked within the
> > same owner context?
> 
> It won't end up on the reap list in the first place, because we scan every
> block of every object with the same rmap owner to construct sublist.
> Then we subtract sublist from exlist (which we got from rmap) and only
> reap the difference.
> 
> > > > > +	struct xfs_scrub_context	*sc,
> > > > > +	xfs_fsblock_t			fsbno,
> > > > > +	xfs_extlen_t			len,
> > > > > +	struct xfs_owner_info		*oinfo,
> > > > > +	enum xfs_ag_resv_type		resv)
> > > > > +{
> > > > > +	struct xfs_mount		*mp = sc->mp;
> > > > > +	int				error = 0;
> > > > > +
> > > > > +	ASSERT(xfs_sb_version_hasrmapbt(&mp->m_sb));
> > > > > +	ASSERT(sc->ip != NULL || XFS_FSB_TO_AGNO(mp, fsbno) == sc->sa.agno);
> > > > > +
> > > > > +	trace_xfs_repair_dispose_btree_extent(mp, XFS_FSB_TO_AGNO(mp, fsbno),
> > > > > +			XFS_FSB_TO_AGBNO(mp, fsbno), len);
> > > > > +
> > > > > +	for (; len > 0; len--, fsbno++) {
> > > > > +		error = xfs_repair_dispose_btree_block(sc, fsbno, oinfo, resv);
> > > > > +		if (error)
> > > > > +			return error;
> > > > 
> > > > So why do we do this one block at a time, rather than freeing it
> > > > as an entire extent in one go?
> > > 
> > > At the moment the xfs_rmap_has_other_keys helper can only tell you if
> > > there are multiple rmap owners for any part of a given extent.  For
> > > example, if the rmap records were:
> > > 
> > > (start = 35, len = 3, owner = rmap)
> > > (start = 35, len = 1, owner = refcount)
> > > (start = 37, len = 1, owner = inobt)
> > > 
> > > Notice how blocks 35 and 37 are crosslinked, but 36 isn't.  A call to
> > > xfs_rmap_has_other_keys(35, 3) will say "yes" but doesn't have a way to
> > > signal back that the yes applies to 35 but that the caller should try
> > > again with block 36.  Doing so would require _has_other_keys to maintain
> > > a refcount and to return to the caller any time the refcount changed,
> > > and the caller would still have to loop the extent.  It's easier to have
> > > a dumb loop for the initial implementation and optimize it if we start
> > > taking more heat than we'd like on crosslinked filesystems.
> > 
> > Well, I can see why you are doing this now, but the problems with
> > multi-block metadata makes me think that we really need to know more
> > detail of the owner in the rmap. e.g. that it's directory or
> > attribute data, not user file data and hence we can infer things
> > about expected block sizes, do the correct sort of buffer lookups
> > for invalidation, etc.
> 
> I'm not sure we can do that without causing a deadlocking problem, since
> we lock all the AG headers to rebuild a btree and in general we can't
> _iget an inode to find out if it's a dir or not.  But I have more to say
> on this in a few paragraphs...
> 
> > I'm tending towards "this needs a design doc to explain all
> > this stuff" right now. Code is great, but I'm struggling to understand
> > (reverse engineer!) all the algorithms and decisions that have been
> > made from the code...
> 
> Working on it.

Nearly my bedtime, so here's the current draft:

/*
 * Reconstructing per-AG Btrees
 *
 * When a space btree is corrupt, we don't bother trying to fix it.
 * Instead, we scan secondary space metadata to derive the records that
 * should be in the damaged btree, initialize a fresh btree root, and
 * insert the records.  Note that for rebuilding the rmapbt we scan all
 * the primary data.
 *
 * However, that leaves the matter of removing all the metadata
 * describing the old broken structure.  For primary metadata we use the
 * rmap data to construct a first bitmap of every extent with a matching
 * rmap owner; we then iterate all other metadata structures with the
 * same rmap owner to construct a second bitmap of rmaps that cannot be
 * removed.  We then subtract the second bitmap from the first bitmap
 * (first & ~second) to derive the blocks that were used by the old
 * btree.  These blocks can be reaped.
 *
 * For rmapbt reconstructions we must use different tactics.  First we
 * iterate all primary metadata (this excludes the old rmapbt,
 * obviously) to generate new rmap records.  Then we iterate the new
 * rmap records to find the gaps, which should encompass the free
 * space and the old rmapbt blocks.  That corresponds to the 'first
 * bitmap' of the previous section.  The bnobt is iterated to generate
 * the second bitmap of the previous section.  We then reap the blocks
 * corresponding to the difference just like we do for primary data.
 *
 * The comment for xfs_repair_reap_btree_extents will describe the block
 * disposal process in more detail.
 */
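
To make the (first & ~second) step concrete, here is a rough userspace
sketch.  It is not the kernel code: the toy_bitmap_t type and both helpers
are made up for illustration, and it assumes an AG of at most 64 blocks so
that one bit per block fits in a single word.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical toy bitmap: bit N set means AG block N is in the set. */
typedef uint64_t toy_bitmap_t;

/* Mark blocks [start, start + len) in the bitmap. */
static toy_bitmap_t
toy_bitmap_set(toy_bitmap_t bm, unsigned int start, unsigned int len)
{
	unsigned int i;

	assert(start + len <= 64);
	for (i = start; i < start + len; i++)
		bm |= (toy_bitmap_t)1 << i;
	return bm;
}

/* Blocks in @first that are not in @second: the candidates for reaping. */
static toy_bitmap_t
toy_bitmap_reap(toy_bitmap_t first, toy_bitmap_t second)
{
	return first & ~second;
}
```

If the rmap scan puts blocks 10-15 in the first bitmap and the walk of the
other structures with the same owner puts 12-13 in the second, the
difference is blocks 10, 11, 14 and 15.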

And later, down by xfs_repair_reap_btree_extents,

/*
 * Dispose of btree blocks from the old per-AG btree.
 *
 * Now that we've constructed a new btree to replace the damaged one, we
 * want to dispose of the blocks that (we think) the old btree was
 * using.  Previously, we used the rmapbt to construct a list of extents
 * (@exlist) with the rmap owner corresponding to the tree we rebuilt,
 * then subtracted out any other blocks with the same rmap owner that
 * are owned by another data structure.  In theory the extents in
 * @exlist are the old btree's blocks.
 *
 * Unfortunately, it's possible that the btree was crosslinked with
 * other blocks on disk.  The rmap data can tell us if there are
 * multiple owners, so if the rmapbt says there is an owner of this
 * block other than @oinfo, then the block is crosslinked.  Remove the
 * reverse mapping and continue.
 *
 * If there is one rmap record, we can free the block, which removes the
 * reverse mapping but doesn't add the block to the free space.  Our
 * repair strategy is to hope the other metadata objects crosslinked on
 * this block will be rebuilt (atop different blocks), thereby removing
 * all the cross links.
 *
 * If there are no rmap records at all, we also free the block.  If the
 * btree being rebuilt lives in the free space (bnobt/cntbt/rmapbt) then
 * there isn't supposed to be a rmap record and everything is ok.  For
 * other btrees there had to have been an rmap entry for the block to
 * have ended up on @exlist, so if it's gone now there's something wrong
 * and the fs will shut down.
 *
 * The caller is responsible for locking the AG headers for the entire
 * rebuild operation so that nothing else can sneak in and change the AG
 * state while we're not looking.  We also assume that the caller
 * already invalidated any buffers associated with @exlist.
 */
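
The per-block decision described above can be sketched roughly as follows.
This is a simplified userspace model, not the patch itself: struct toy_rmap,
the owner numbers and both functions are hypothetical, and the sketch does
not model the "no rmap record on a non-free-space btree means shut down"
case.

```c
#include <stddef.h>

/* Hypothetical, simplified rmap record; not the on-disk format. */
struct toy_rmap {
	unsigned int	start;
	unsigned int	len;
	int		owner;
};

enum toy_disposition {
	TOY_REMOVE_RMAP_ONLY,	/* crosslinked: drop our rmap, keep the block */
	TOY_FREE_BLOCK,		/* nobody else claims it: free the block */
};

/* Does any record with an owner other than @owner cover @agbno? */
static int
toy_has_other_owner(const struct toy_rmap *recs, size_t nr,
		    unsigned int agbno, int owner)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		if (recs[i].owner == owner)
			continue;
		if (agbno >= recs[i].start &&
		    agbno < recs[i].start + recs[i].len)
			return 1;
	}
	return 0;
}

/* Decide what to do with a single block of the old btree. */
static enum toy_disposition
toy_dispose_block(const struct toy_rmap *recs, size_t nr,
		  unsigned int agbno, int owner)
{
	if (toy_has_other_owner(recs, nr, agbno, owner))
		return TOY_REMOVE_RMAP_ONLY;
	return TOY_FREE_BLOCK;
}
```

Fed the three example records from earlier in the thread (35/3 rmap, 35/1
refcount, 37/1 inobt) and asked about the rmap owner, this reports blocks 35
and 37 as crosslinked and block 36 as freeable, which is why the caller
walks the extent one block at a time.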

Later, for the function that finds AG btree roots for agf/agi
reconstruction:

/*
 * Find the roots of the per-AG btrees described in btree_info.
 *
 * The caller provides information about the btrees to look for by
 * passing in an array (@btree_info) of xfs_repair_find_ag_btree with
 * the (rmap owner, buf_ops, magic) fields set.  The last element of the
 * array should have a NULL buf_ops, and the (root, height) fields will
 * be set on return if anything is found.
 *
 * For every rmapbt record matching any of the rmap owners in
 * @btree_info, read each block referenced by the rmap record.  If the
 * block is a btree block from this filesystem matching any of the magic
 * numbers and has a level higher than what we've already seen, remember
 * the block and the height of the tree required to have such a block.
 * When the call completes, we return the highest block we've found for
 * each btree description; those should be the roots.
 *
 * The caller must lock the applicable per-AG header buffers (AGF, AGI)
 * to prevent other threads from changing the shape of the btrees that
 * we are looking for.  It must maintain those locks until it's safe for
 * other threads to change the btrees' shapes.
 */
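
A rough sketch of the "remember the highest block" idea, again as a
userspace toy rather than the real function.  The struct names and magic
values are made up, the array is passed with an explicit count instead of
being terminated by a NULL buf_ops, and the blocks are assumed to have
already been read and verified as belonging to this filesystem.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical summary of a block we read while walking the rmapbt. */
struct toy_block {
	unsigned int	agbno;
	uint32_t	magic;
	unsigned int	level;
};

/* Hypothetical stand-in for one btree description. */
struct toy_find_btree {
	uint32_t	magic;	/* set by the caller */
	unsigned int	root;	/* set on return if height > 0 */
	unsigned int	height;	/* 0 means nothing was found */
};

/* Track, per btree description, the block with the highest level seen. */
static void
toy_find_roots(struct toy_find_btree *info, size_t nr_info,
	       const struct toy_block *blocks, size_t nr_blocks)
{
	size_t b, i;

	for (b = 0; b < nr_blocks; b++) {
		for (i = 0; i < nr_info; i++) {
			if (blocks[b].magic != info[i].magic)
				continue;
			/* A block at level N implies a tree of height N + 1. */
			if (blocks[b].level + 1 > info[i].height) {
				info[i].root = blocks[b].agbno;
				info[i].height = blocks[b].level + 1;
			}
		}
	}
}
```

Blocks whose magic matches none of the descriptions are simply skipped, so
a stray block with a high level but the wrong magic cannot be mistaken for
a root.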

--D

> 
> > > > > +/*
> > > > > + * Invalidate buffers for per-AG btree blocks we're dumping.  We assume that
> > > > > + * exlist points only to metadata blocks.
> > > > > + */
> > > > > +int
> > > > > +xfs_repair_invalidate_blocks(
> > > > > +	struct xfs_scrub_context	*sc,
> > > > > +	struct xfs_repair_extent_list	*exlist)
> > > > > +{
> > > > > +	struct xfs_repair_extent	*rex;
> > > > > +	struct xfs_repair_extent	*n;
> > > > > +	struct xfs_buf			*bp;
> > > > > +	xfs_agnumber_t			agno;
> > > > > +	xfs_agblock_t			agbno;
> > > > > +	xfs_agblock_t			i;
> > > > > +
> > > > > +	for_each_xfs_repair_extent_safe(rex, n, exlist) {
> > > > > +		agno = XFS_FSB_TO_AGNO(sc->mp, rex->fsbno);
> > > > > +		agbno = XFS_FSB_TO_AGBNO(sc->mp, rex->fsbno);
> > > > > +		for (i = 0; i < rex->len; i++) {
> > > > > +			bp = xfs_btree_get_bufs(sc->mp, sc->tp, agno,
> > > > > +					agbno + i, 0);
> > > > > +			xfs_trans_binval(sc->tp, bp);
> > > > > +		}
> > > > 
> > > > Again, this is doing things by single blocks. We do have multi-block
> > > > metadata (inodes, directory blocks, remote attrs) that, if it
> > > > is already in memory, needs to be treated as multi-block extents. If
> > > > we don't do that, we'll cause aliasing problems in the buffer cache
> > > > (see _xfs_buf_obj_cmp()) and it's all downhill from there.
> > > 
> > > I only recently started testing with filesystems containing multiblock
> > > dir/rmt metadata, and this is an unsolved problem. :(
> > 
> > That needs documenting, too. Perhaps explicitly, by rejecting repair
> > requests on filesystems or types that have multi-block constructs
> > until we solve these problems.
> 
> Trouble is, remote attr values can have an xfs_buf that spans however
> many blocks you need to store a full 64k value, and what happens if the
> rmapbt collides with that?  It sorta implies that we can't do
> invalidation on /any/ filesystem, which is unfortunate....
> 
> ...unless we have an easy way of finding /any/ buffer that points to a
> given block?  Probably not, since iirc they're indexed by the first disk
> block number.  Hm.  I suppose we could use the rmap data to look for
> anything within 64k of the logical offset of an attr/data rmap
> overlapping the same block...
> 
> ...but on second thought we only care about invalidating the buffer if
> the block belonged to the ag btree we've just killed, right?  If there's
> a multi-block buffer because it's part of a directory or an rmt block
> then the buffer is clearly owned by someone else and we don't even have
> to look for that.  Likewise, if it's a single-block buffer but the
> block has some other magic then we don't own it and we should leave it
> alone.
> 
> > > I /think/ the solution is that we need to query the buffer cache to see
> > > if it has a buffer for the given disk blocks, and if it matches the
> > > btree we're discarding (correct magic/uuid/b_length) then we invalidate
> > > it,
> > 
> > I don't think that provides any guarantees. Even ignoring all the
> > problems with invalidation while the buffer is dirty and tracked in
> > the AIL, there's nothing stopping the other code from attempting to
> > re-instantiate the buffer due to some other access. And then we
> > have aliasing problems again....
> 
> Well, we /could/ just freeze the fs while we do repairs on any ag btree.
> 
> --D
> 
> > 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html