From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH 04/22] xfs: add helpers to dispose of old btree blocks after a repair
Date: Wed, 16 May 2018 22:58:05 -0700
Message-ID: <20180517055805.GR23858@magnolia>
In-Reply-To: <20180516231820.GO23858@magnolia>

On Wed, May 16, 2018 at 04:18:20PM -0700, Darrick J. Wong wrote:
> On Thu, May 17, 2018 at 08:32:25AM +1000, Dave Chinner wrote:
> > On Wed, May 16, 2018 at 12:34:25PM -0700, Darrick J. Wong wrote:
> > > On Wed, May 16, 2018 at 06:32:32PM +1000, Dave Chinner wrote:
> > > > On Tue, May 15, 2018 at 03:34:04PM -0700, Darrick J. Wong wrote:
> > > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > 
> > > > > Now that we've plumbed in the ability to construct a list of dead btree
> > > > > blocks following a repair, add more helpers to dispose of them.  This is
> > > > > done by examining the rmapbt -- if the btree was the only owner we can
> > > > > free the block, otherwise it's crosslinked and we can only remove the
> > > > > rmapbt record.
> > > > > 
> > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > ---
> > 
> > [...]
> > 
> > > > > +	struct xfs_owner_info		oinfo;
> > > > > +	struct xfs_perag		*pag;
> > > > > +	int				error;
> > > > > +
> > > > > +	/* Make sure there's space on the freelist. */
> > > > > +	error = xfs_repair_fix_freelist(sc, true);
> > > > > +	if (error)
> > > > > +		return error;
> > > > > +	pag = xfs_perag_get(sc->mp, sc->sa.agno);
> > > > 
> > > > Because this is how it quickly gets to silly numbers of
> > > > lookups. That's two now in this function.
> > > > 
> > > > > +	if (pag->pagf_flcount == 0) {
> > > > > +		xfs_perag_put(pag);
> > > > > +		return -EFSCORRUPTED;
> > > > 
> > > > Why is having an empty freelist a problem here? It's an AG that is
> > > > completely out of space, but it isn't corruption? And I don't see
> > > > why an empty freelist prevents us from adding a block back onto the
> > > > AGFL?
> > 
> > I think you missed a question :P
> 
> Doh, sorry.  I don't remember exactly why I put that in there; judging
> from my notes I think the idea was that if the AG is completely full
> we'd rather shut down with a corruption signal hoping that the admin
> will run xfs_repair.
> 
> I also don't see why it's necessary now, I'll see what happens if I
> remove it.
> 
> > > > > +	/* Can we find any other rmappings? */
> > > > > +	error = xfs_rmap_has_other_keys(cur, agbno, 1, oinfo, &has_other_rmap);
> > > > > +	if (error)
> > > > > +		goto out_cur;
> > > > > +	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
> > > > > +
> > > > > +	/*
> > > > > +	 * If there are other rmappings, this block is cross linked and must
> > > > > +	 * not be freed.  Remove the reverse mapping and move on.  Otherwise,
> > > > 
> > > > Why do we just remove the reverse mapping if the block cannot be
> > > > freed? I have my suspicions that this is removing cross-links one by
> > > > one until there's only one reference left to the extent, but then I
> > > > ask "how do we know which one is the correct mapping"?
> > > 
> > > Right.  Prior to calling this function we built a totally new btree with
> > > blocks from the freespace, so now we need to remove the rmaps that
> > > covered the old btree and/or free the block.  The goal is to rebuild
> > > /all/ the trees that think they own this block so that we can free the
> > > block and not have to care which one is correct.
> > 
> > Ok, so we've already rebuilt the new btree, and this is removing
> > stale references to cross-linked blocks that have owners different
> > to the one we are currently scanning.
> > 
> > What happens if the cross-linked block is cross-linked within the
> > same owner context?
> 
> It won't end up on the reap list in the first place, because we scan
> every block of every object with the same rmap owner to construct
> sublist.  Then we subtract sublist from exlist (which we got from rmap)
> and only reap the difference.
> 
> > > > > +	struct xfs_scrub_context	*sc,
> > > > > +	xfs_fsblock_t			fsbno,
> > > > > +	xfs_extlen_t			len,
> > > > > +	struct xfs_owner_info		*oinfo,
> > > > > +	enum xfs_ag_resv_type		resv)
> > > > > +{
> > > > > +	struct xfs_mount		*mp = sc->mp;
> > > > > +	int				error = 0;
> > > > > +
> > > > > +	ASSERT(xfs_sb_version_hasrmapbt(&mp->m_sb));
> > > > > +	ASSERT(sc->ip != NULL || XFS_FSB_TO_AGNO(mp, fsbno) == sc->sa.agno);
> > > > > +
> > > > > +	trace_xfs_repair_dispose_btree_extent(mp, XFS_FSB_TO_AGNO(mp, fsbno),
> > > > > +			XFS_FSB_TO_AGBNO(mp, fsbno), len);
> > > > > +
> > > > > +	for (; len > 0; len--, fsbno++) {
> > > > > +		error = xfs_repair_dispose_btree_block(sc, fsbno, oinfo, resv);
> > > > > +		if (error)
> > > > > +			return error;
> > > > 
> > > > So why do we do this one block at a time, rather than freeing it
> > > > as an entire extent in one go?
> > > 
> > > At the moment the xfs_rmap_has_other_keys helper can only tell you if
> > > there are multiple rmap owners for any part of a given extent.  For
> > > example, if the rmap records were:
> > > 
> > > (start = 35, len = 3, owner = rmap)
> > > (start = 35, len = 1, owner = refcount)
> > > (start = 37, len = 1, owner = inobt)
> > > 
> > > Notice how block 35 and 37 are crosslinked, but 36 isn't.  A call to
> > > xfs_rmap_has_other_keys(35, 3) will say "yes" but doesn't have a way to
> > > signal back that the yes applies to 35 but that the caller should try
> > > again with block 36.  Doing so would require _has_other_keys to maintain
> > > a refcount and to return to the caller any time the refcount changed,
> > > and the caller would still have to loop the extent.  It's easier to have
> > > a dumb loop for the initial implementation and optimize it if we start
> > > taking more heat than we'd like on crosslinked filesystems.
> > 
> > Well, I can see why you are doing this now, but the problems with
> > multi-block metadata makes me think that we really need to know more
> > detail of the owner in the rmap. e.g. that it's directory or
> > attribute data, not user file data and hence we can infer things
> > about expected block sizes, do the correct sort of buffer lookups
> > for invalidation, etc.
> 
> I'm not sure we can do that without causing a deadlocking problem, since
> we lock all the AG headers to rebuild a btree and in general we can't
> _iget an inode to find out if it's a dir or not.  But I have more to say
> on this in a few paragraphs...
> 
> > I'm tending towards "this needs a design doc to explain all
> > this stuff" right now. Code is great, but I'm struggling to understand
> > (reverse engineer!) all the algorithms and decisions that have been
> > made from the code...
> 
> Working on it.

Nearly my bedtime, so here's the current draft:

/*
 * Reconstructing per-AG Btrees
 *
 * When a space btree is corrupt, we don't bother trying to fix it.
 * Instead, we scan secondary space metadata to derive the records that
 * should be in the damaged btree, initialize a fresh btree root, and
 * insert the records.  Note that for rebuilding the rmapbt we scan all
 * the primary data.
 *
 * However, that leaves the matter of removing all the metadata
 * describing the old broken structure.  For primary metadata we use the
 * rmap data to construct a first bitmap of every extent with a matching
 * rmap owner; we then iterate all other metadata structures with the
 * same rmap owner to construct a second bitmap of rmaps that cannot be
 * removed.  We then subtract the second bitmap from the first bitmap
 * (first & ~second) to derive the blocks that were used by the old
 * btree.  These blocks can be reaped.
 *
 * For rmapbt reconstructions we must use different tactics.  First we
 * iterate all primary metadata (this excludes the old rmapbt,
 * obviously) to generate new rmap records.  Then we iterate the new
 * rmap records to find the gaps, which should encompass the free
 * space and the old rmapbt blocks.  That corresponds to the 'first
 * bitmap' of the previous section.  The bnobt is iterated to generate
 * the second bitmap of the previous section.  We then reap the blocks
 * corresponding to the difference just like we do for primary data.
 *
 * The comment for xfs_repair_reap_btree_extents will describe the block
 * disposal process in more detail.
 */
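
The "(first & ~second)" step in the comment above can be illustrated with a
minimal userspace sketch. This is not the kernel code: the 64-block toy AG,
the single-word bitmap, and the demo_* names are all assumptions made up for
this example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical toy AG with 64 blocks, one bit per block. */
#define DEMO_AG_BLOCKS	64

/* Set the bits for blocks [start, start + len) in a toy bitmap. */
static void demo_bitmap_set(uint64_t *bitmap, unsigned int start,
			    unsigned int len)
{
	unsigned int	i;

	assert(start <= DEMO_AG_BLOCKS && len <= DEMO_AG_BLOCKS - start);
	for (i = start; i < start + len; i++)
		*bitmap |= UINT64_C(1) << i;
}

/*
 * first:  every block whose rmap owner matches the rebuilt btree
 * second: blocks with that owner that other structures still use
 * Returns the blocks presumed to belong to the old btree.
 */
static uint64_t demo_reap_candidates(uint64_t first, uint64_t second)
{
	return first & ~second;
}
```

With blocks 10-15 in the first bitmap and blocks 12-13 in the second, the
reap candidates are blocks 10, 11, 14 and 15.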

And later, down by xfs_repair_reap_btree_extents,


/*
 * Dispose of btree blocks from the old per-AG btree.
 *
 * Now that we've constructed a new btree to replace the damaged one, we
 * want to dispose of the blocks that (we think) the old btree was
 * using.  Previously, we used the rmapbt to construct a list of extents
 * (@exlist) with the rmap owner corresponding to the tree we rebuilt,
 * then subtracted out any other blocks with the same rmap owner that
 * are owned by another data structure.  In theory the extents in
 * @exlist are the old btree's blocks.
 *
 * Unfortunately, it's possible that the btree was crosslinked with
 * other blocks on disk.  The rmap data can tell us if there are
 * multiple owners, so if the rmapbt says there is an owner of this
 * block other than @oinfo, then the block is crosslinked.  Remove the
 * reverse mapping and continue.
 *
 * If there is one rmap record, we can free the block, which removes the
 * reverse mapping but doesn't add the block to the free space.  Our
 * repair strategy is to hope the other metadata objects crosslinked on
 * this block will be rebuilt (atop different blocks), thereby removing
 * all the cross links.
 *
 * If there are no rmap records at all, we also free the block.  If the
 * btree being rebuilt lives in the free space (bnobt/cntbt/rmapbt) then
 * there isn't supposed to be a rmap record and everything is ok.  For
 * other btrees there had to have been an rmap entry for the block to
 * have ended up on @exlist, so if it's gone now there's something wrong
 * and the fs will shut down.
 *
 * The caller is responsible for locking the AG headers for the entire
 * rebuild operation so that nothing else can sneak in and change the AG
 * state while we're not looking.  We also assume that the caller
 * already invalidated any buffers associated with @exlist.
 */
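
A rough userspace sketch of the per-block decision described above, using
the three rmap records from the example quoted earlier in this thread. The
record layout, the enum values, and the demo_* helpers are hypothetical and
stand in for the real rmapbt query; the shutdown path for a missing rmap
record is not modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical owner tags for the toy rmap records. */
enum demo_owner { DEMO_OWN_RMAP, DEMO_OWN_REFC, DEMO_OWN_INOBT };

struct demo_rmap {
	unsigned int	start;
	unsigned int	len;
	enum demo_owner	owner;
};

enum demo_action {
	DEMO_UNMAP_ONLY,	/* crosslinked: only drop our reverse mapping */
	DEMO_FREE_BLOCK,	/* no other owner recorded: free the block */
};

/* Does any record with a different owner cover this single block? */
static int demo_has_other_owner(const struct demo_rmap *recs, size_t nr,
				unsigned int agbno, enum demo_owner owner)
{
	size_t		i;

	for (i = 0; i < nr; i++) {
		if (recs[i].owner == owner)
			continue;
		if (agbno >= recs[i].start &&
		    agbno - recs[i].start < recs[i].len)
			return 1;
	}
	return 0;
}

/* Decide what to do with one block of the old btree. */
static enum demo_action demo_dispose_block(const struct demo_rmap *recs,
					   size_t nr, unsigned int agbno,
					   enum demo_owner owner)
{
	if (demo_has_other_owner(recs, nr, agbno, owner))
		return DEMO_UNMAP_ONLY;
	return DEMO_FREE_BLOCK;
}
```

Walking blocks 35 through 37 one at a time with the rmap owner gives
unmap-only for 35 and 37 and free for 36, which is the distinction a single
whole-extent query cannot report.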

Later, for the function that finds AG btree roots for agf/agi
reconstruction:


/*
 * Find the roots of the per-AG btrees described in btree_info.
 *
 * The caller provides information about the btrees to look for by
 * passing in an array (@btree_info) of xfs_repair_find_ag_btree with
 * the (rmap owner, buf_ops, magic) fields set.  The last element of the
 * array should have a NULL buf_ops, and the (root, height) fields will
 * be set on return if anything is found.
 *
 * For every rmapbt record matching any of the rmap owners in
 * @btree_info, read each block referenced by the rmap record.  If the
 * block is a btree block from this filesystem matching any of the magic
 * numbers and has a level higher than what we've already seen, remember
 * the block and the height of the tree required to have such a block.
 * When the call completes, we return the highest block we've found for
 * each btree description; those should be the roots.
 *
 * The caller must lock the applicable per-AG header buffers (AGF, AGI)
 * to prevent other threads from changing the shape of the btrees that
 * we are looking for.  It must maintain those locks until it's safe for
 * other threads to change the btrees' shapes.
 */
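
The "remember the highest block seen" idea above can be sketched in
userspace as follows. Everything here is an assumption for illustration:
the structs, the made-up magic values, and the explicit array length (the
draft terminates @btree_info with a NULL buf_ops instead). Verifier and
uuid checks are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical summary of one on-disk block found via the rmapbt. */
struct demo_block {
	unsigned int	agbno;
	unsigned int	magic;
	unsigned int	level;	/* 0 for a leaf */
};

/* Hypothetical stand-in for one btree description. */
struct demo_find {
	unsigned int	magic;	/* set by the caller */
	unsigned int	root;	/* set on return if height > 0 */
	unsigned int	height;	/* 0 means nothing was found */
};

/* Track, per description, the block with the highest level seen. */
static void demo_find_roots(const struct demo_block *blocks, size_t nr_blocks,
			    struct demo_find *info, size_t nr_info)
{
	size_t		b, i;

	for (b = 0; b < nr_blocks; b++) {
		for (i = 0; i < nr_info; i++) {
			if (blocks[b].magic != info[i].magic)
				continue;
			/* A block at level N implies a tree of height N + 1. */
			if (blocks[b].level + 1 > info[i].height) {
				info[i].root = blocks[b].agbno;
				info[i].height = blocks[b].level + 1;
			}
		}
	}
}
```

Blocks whose magic matches none of the descriptions are ignored, so a
crosslinked block belonging to some other structure never becomes a root
candidate in this sketch.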

--D

> 
> > > > > +/*
> > > > > + * Invalidate buffers for per-AG btree blocks we're dumping.  We assume that
> > > > > + * exlist points only to metadata blocks.
> > > > > + */
> > > > > +int
> > > > > +xfs_repair_invalidate_blocks(
> > > > > +	struct xfs_scrub_context	*sc,
> > > > > +	struct xfs_repair_extent_list	*exlist)
> > > > > +{
> > > > > +	struct xfs_repair_extent	*rex;
> > > > > +	struct xfs_repair_extent	*n;
> > > > > +	struct xfs_buf			*bp;
> > > > > +	xfs_agnumber_t			agno;
> > > > > +	xfs_agblock_t			agbno;
> > > > > +	xfs_agblock_t			i;
> > > > > +
> > > > > +	for_each_xfs_repair_extent_safe(rex, n, exlist) {
> > > > > +		agno = XFS_FSB_TO_AGNO(sc->mp, rex->fsbno);
> > > > > +		agbno = XFS_FSB_TO_AGBNO(sc->mp, rex->fsbno);
> > > > > +		for (i = 0; i < rex->len; i++) {
> > > > > +			bp = xfs_btree_get_bufs(sc->mp, sc->tp, agno,
> > > > > +					agbno + i, 0);
> > > > > +			xfs_trans_binval(sc->tp, bp);
> > > > > +		}
> > > > 
> > > > Again, this is doing things by single blocks. We do have multi-block
> > > > metadata (inodes, directory blocks, remote attrs) that, if it
> > > > is already in memory, needs to be treated as multi-block extents. If
> > > > we don't do that, we'll cause aliasing problems in the buffer cache
> > > > (see _xfs_buf_obj_cmp()) and it's all downhill from there.
> > > 
> > > I only recently started testing with filesystems containing multiblock
> > > dir/rmt metadata, and this is an unsolved problem. :(
> > 
> > That needs documenting, too. Perhaps explicitly, by rejecting repair
> > requests on filesystems or types that have multi-block constructs
> > until we solve these problems.
> 
> Trouble is, remote attr values can have an xfs_buf that spans however
> many blocks you need to store a full 64k value, and what happens if the
> rmapbt collides with that?  It sorta implies that we can't do
> invalidation on /any/ filesystem, which is unfortunate....
> 
> ...unless we have an easy way of finding /any/ buffer that points to a
> given block?  Probably not, since iirc they're indexed by the first disk
> block number.  Hm.  I suppose we could use the rmap data to look for
> anything within 64k of the logical offset of an attr/data rmap
> overlapping the same block...
> 
> ...but on second thought we only care about invalidating the buffer if
> the block belonged to the ag btree we've just killed, right?  If there's
> a multi-block buffer because it's part of a directory or an rmt block
> then the buffer is clearly owned by someone else and we don't even have
> to look for that.  Likewise, if it's a single-block buffer but the
> block has some other magic then we don't own it and we should leave it
> alone.
> 
> > > I /think/ the solution is that we need to query the buffer cache to see
> > > if it has a buffer for the given disk blocks, and if it matches the
> > > btree we're discarding (correct magic/uuid/b_length) then we invalidate
> > > it,
> > 
> > I don't think that provides any guarantees. Even ignoring all the
> > problems with invalidation while the buffer is dirty and tracked in
> > the AIL, there's nothing stopping the other code from attempting to
> > re-instantiate the buffer due to some other access. And then we
> > have aliasing problems again....
> 
> Well, we /could/ just freeze the fs while we do repairs on any ag btree.
> 
> --D
> 
> > 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
