All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH 04/22] xfs: add helpers to dispose of old btree blocks after a repair
Date: Wed, 16 May 2018 12:34:25 -0700	[thread overview]
Message-ID: <20180516193425.GM23858@magnolia> (raw)
In-Reply-To: <20180516083232.GS23861@dastard>

On Wed, May 16, 2018 at 06:32:32PM +1000, Dave Chinner wrote:
> On Tue, May 15, 2018 at 03:34:04PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > Now that we've plumbed in the ability to construct a list of dead btree
> > blocks following a repair, add more helpers to dispose of them.  This is
> > done by examining the rmapbt -- if the btree was the only owner we can
> > free the block, otherwise it's crosslinked and we can only remove the
> > rmapbt record.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/xfs/scrub/repair.c |  200 +++++++++++++++++++++++++++++++++++++++++++++++++
> >  fs/xfs/scrub/repair.h |    3 +
> >  2 files changed, 202 insertions(+), 1 deletion(-)
> > 
> > 
> > diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
> > index 8e8ecddd7537..d820e01d1146 100644
> > --- a/fs/xfs/scrub/repair.c
> > +++ b/fs/xfs/scrub/repair.c
> > @@ -362,6 +362,173 @@ xfs_repair_init_btblock(
> >  	return 0;
> >  }
> >  
> > +/* Ensure the freelist is the correct size. */
> > +int
> > +xfs_repair_fix_freelist(
> > +	struct xfs_scrub_context	*sc,
> > +	bool				can_shrink)
> > +{
> > +	struct xfs_alloc_arg		args = {0};
> > +	int				error;
> > +
> > +	args.mp = sc->mp;
> > +	args.tp = sc->tp;
> > +	args.agno = sc->sa.agno;
> > +	args.alignment = 1;
> > +	args.pag = xfs_perag_get(args.mp, sc->sa.agno);
> > +
> > +	error = xfs_alloc_fix_freelist(&args,
> > +			can_shrink ? 0 : XFS_ALLOC_FLAG_NOSHRINK);
> > +	xfs_perag_put(args.pag);
> 
> with all these pag lookups, I'm starting to wonder if you should
> just add a lookup and store the pag in the scrub context? That'd
> save a lot of lookups - the pag structures never go away and i never
> really intended them to be used like this in single, very short use
> contexts. Grab once per operation, hold the reference across the
> entire operation context, then release it....

Ok, I'll change xfs_scrub_ag_{init,free} to get / put the perag
structure so we don't have to do this repeatedly.

> > +	return error;
> > +}
> > +
> > +/*
> > + * Put a block back on the AGFL.
> > + */
> > +STATIC int
> > +xfs_repair_put_freelist(
> > +	struct xfs_scrub_context	*sc,
> > +	xfs_agblock_t			agbno)
> > +{
> > +	struct xfs_owner_info		oinfo;
> > +	struct xfs_perag		*pag;
> > +	int				error;
> > +
> > +	/* Make sure there's space on the freelist. */
> > +	error = xfs_repair_fix_freelist(sc, true);
> > +	if (error)
> > +		return error;
> > +	pag = xfs_perag_get(sc->mp, sc->sa.agno);
> 
> Because this is how it quickly gets it gets to silly numbers of
> lookups. That's two now in this function.
> 
> > +	if (pag->pagf_flcount == 0) {
> > +		xfs_perag_put(pag);
> > +		return -EFSCORRUPTED;
> 
> Why is having an empty freelist a problem here? It's an AG thatis
> completely out of space, but it isn't corruption? And I don't see
> why an empty freelist prevents us from adding a backs back onto the
> AGFL?
> 
> > +	}
> > +	xfs_perag_put(pag);
> > +
> > +	/*
> > +	 * Since we're "freeing" a lost block onto the AGFL, we have to
> > +	 * create an rmap for the block prior to merging it or else other
> > +	 * parts will break.
> > +	 */
> > +	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_AG);
> > +	error = xfs_rmap_alloc(sc->tp, sc->sa.agf_bp, sc->sa.agno, agbno, 1,
> > +			&oinfo);
> > +	if (error)
> > +		return error;
> > +
> > +	/* Put the block on the AGFL. */
> > +	error = xfs_alloc_put_freelist(sc->tp, sc->sa.agf_bp, sc->sa.agfl_bp,
> > +			agbno, 0);
> 
> There's another pag lookup in here.
> 
> > +	if (error)
> > +		return error;
> > +	xfs_extent_busy_insert(sc->tp, sc->sa.agno, agbno, 1,
> > +			XFS_EXTENT_BUSY_SKIP_DISCARD);
> 
> And another in here, so 4 perag lookups for the same structure in
> one simple operation. The code here in the function is fine, but I
> really think we need to rethink how we use the perag in our
> allocation code...

Yeah.  If we ever introduce the ability to lock an AG for the duration
of a transaction then we might consider linking the perag structures
through the transaction at the same time so that these library functions
can skip the _perag_get.

> > +
> > +	return 0;
> > +}
> > +
> > +/* Dispose of a single metadata block. */
> > +STATIC int
> > +xfs_repair_dispose_btree_block(
> > +	struct xfs_scrub_context	*sc,
> > +	xfs_fsblock_t			fsbno,
> > +	struct xfs_owner_info		*oinfo,
> > +	enum xfs_ag_resv_type		resv)
> > +{
> > +	struct xfs_btree_cur		*cur;
> > +	struct xfs_buf			*agf_bp = NULL;
> > +	xfs_agnumber_t			agno;
> > +	xfs_agblock_t			agbno;
> > +	bool				has_other_rmap;
> > +	int				error;
> > +
> > +	agno = XFS_FSB_TO_AGNO(sc->mp, fsbno);
> > +	agbno = XFS_FSB_TO_AGBNO(sc->mp, fsbno);
> > +
> > +	if (sc->ip) {
> > +		/* Repairing per-inode metadata, read in the AGF. */
> 
> This would be better with a comment above saying:
> 
> 	/*
> 	 * If we are repairing per-inode metadata, we need to read
> 	 * in the AGF buffer. Otherwise we can re-use the existing
> 	 * AGF buffer we set up for repairing the per-AG btrees.
> 	 */

Ok.

> > +		error = xfs_alloc_read_agf(sc->mp, sc->tp, agno, 0, &agf_bp);
> > +		if (error)
> > +			return error;
> > +		if (!agf_bp)
> > +			return -ENOMEM;
> > +	} else {
> > +		/* Repairing per-AG btree, reuse existing AGF buffer. */
> > +		agf_bp = sc->sa.agf_bp;
> > +	}
> > +	cur = xfs_rmapbt_init_cursor(sc->mp, sc->tp, agf_bp, agno);
> > +
> > +	/* Can we find any other rmappings? */
> > +	error = xfs_rmap_has_other_keys(cur, agbno, 1, oinfo, &has_other_rmap);
> > +	if (error)
> > +		goto out_cur;
> > +	xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
> > +
> > +	/*
> > +	 * If there are other rmappings, this block is cross linked and must
> > +	 * not be freed.  Remove the reverse mapping and move on.  Otherwise,
> 
> Why do we just remove the reverse mapping if the block cannot be
> freed? I have my suspicions that this is removing cross-links one by
> one until there's only one reference left to the extent, but then I
> ask "how do we know which one is the correct mapping"?

Right.  Prior to calling this function we built a totally new btree with
blocks from the freespace, so now we need to remove the rmaps that
covered the old btree and/or free the block.  The goal is to rebuild
/all/ the trees that think they own this block so that we can free the
block and not have to care which one is correct.

> i.e. this comment raised more questions about the algorithm for
> dealing with cross-linked blocks - which doesn't appear to be
> explained anywhere - than it answers....
> 
> > +	 * we were the only owner of the block, so free the extent, which will
> > +	 * also remove the rmap.
> > +	 */
> > +	if (has_other_rmap)
> > +		error = xfs_rmap_free(sc->tp, agf_bp, agno, agbno, 1, oinfo);
> > +	else if (resv == XFS_AG_RESV_AGFL)
> > +		error = xfs_repair_put_freelist(sc, agbno);
> > +	else
> > +		error = xfs_free_extent(sc->tp, fsbno, 1, oinfo, resv);
> > +	if (agf_bp != sc->sa.agf_bp)
> > +		xfs_trans_brelse(sc->tp, agf_bp);
> > +	if (error)
> > +		return error;
> > +
> > +	if (sc->ip)
> > +		return xfs_trans_roll_inode(&sc->tp, sc->ip);
> > +	return xfs_repair_roll_ag_trans(sc);
> > +
> > +out_cur:
> > +	xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
> > +	if (agf_bp != sc->sa.agf_bp)
> > +		xfs_trans_brelse(sc->tp, agf_bp);
> > +	return error;
> > +}
> > +
> > +/*
> > + * Dispose of a given metadata extent.
> > + *
> > + * If the rmapbt says the extent has multiple owners, we simply remove the
> > + * rmap associated with this owner hoping that we'll eventually disentangle
> > + * the crosslinked metadata.  Otherwise, there's one owner, so call the
> > + * regular free code to remove the rmap and free the extent.  Any existing
> > + * buffers for the blocks in the extent must have been _binval'd previously.
> 
> Ok, so there's a little more detail about cross-linked files. Seems
> my suspicions of "remove and hope" were close to the mark :P
> 
> Perhaps we need a document that describes the various algorithms we
> use for resolving these problems, so they can be discussed and
> improved without having to troll through the code to understand?

Yeah.  I'll work on consolidating the crosslink functions after lunch
and documenting how the tree rebuilder interacts with fixing crosslinks.

> > + */
> > +STATIC int
> > +xfs_repair_dispose_btree_extent(
> > +	struct xfs_scrub_context	*sc,
> > +	xfs_fsblock_t			fsbno,
> > +	xfs_extlen_t			len,
> > +	struct xfs_owner_info		*oinfo,
> > +	enum xfs_ag_resv_type		resv)
> > +{
> > +	struct xfs_mount		*mp = sc->mp;
> > +	int				error = 0;
> > +
> > +	ASSERT(xfs_sb_version_hasrmapbt(&mp->m_sb));
> > +	ASSERT(sc->ip != NULL || XFS_FSB_TO_AGNO(mp, fsbno) == sc->sa.agno);
> > +
> > +	trace_xfs_repair_dispose_btree_extent(mp, XFS_FSB_TO_AGNO(mp, fsbno),
> > +			XFS_FSB_TO_AGBNO(mp, fsbno), len);
> > +
> > +	for (; len > 0; len--, fsbno++) {
> > +		error = xfs_repair_dispose_btree_block(sc, fsbno, oinfo, resv);
> > +		if (error)
> > +			return error;
> 
> So why do we do this one block at a time, rather than freeing it
> as an entire extent in one go?

At the moment the xfs_rmap_has_other_keys helper can only tell you if
there are multiple rmap owners for any part of a given extent.  For
example, if the rmap records were:

(start = 35, len = 3, owner = rmap)
(start = 35, len = 1, owner = refcount)
(start = 37, len = 1, owner = inobt)

Notice how block 35 and 37 are crosslinked, but 36 isn't.  A call to
xfs_rmap_has_other_keys(35, 3) will say "yes" but doesn't have a way to
signal back that the yes applies to 35 but that the caller should try
again with block 36.  Doing so would require _has_other_keys to maintain
a refcount and to return to the caller any time the refcount changed,
and the caller would still have to loop the extent.  It's easier to have
a dumb loop for the initial implementation and optimize it if we start
taking more heat than we'd like on crosslinked filesystems.

> And if there is only a single caller, why not just open code the loop
> in the caller?

I certainly could do that.

> 
> > +/*
> > + * Invalidate buffers for per-AG btree blocks we're dumping.  We assume that
> > + * exlist points only to metadata blocks.
> > + */
> > +int
> > +xfs_repair_invalidate_blocks(
> > +	struct xfs_scrub_context	*sc,
> > +	struct xfs_repair_extent_list	*exlist)
> > +{
> > +	struct xfs_repair_extent	*rex;
> > +	struct xfs_repair_extent	*n;
> > +	struct xfs_buf			*bp;
> > +	xfs_agnumber_t			agno;
> > +	xfs_agblock_t			agbno;
> > +	xfs_agblock_t			i;
> > +
> > +	for_each_xfs_repair_extent_safe(rex, n, exlist) {
> > +		agno = XFS_FSB_TO_AGNO(sc->mp, rex->fsbno);
> > +		agbno = XFS_FSB_TO_AGBNO(sc->mp, rex->fsbno);
> > +		for (i = 0; i < rex->len; i++) {
> > +			bp = xfs_btree_get_bufs(sc->mp, sc->tp, agno,
> > +					agbno + i, 0);
> > +			xfs_trans_binval(sc->tp, bp);
> > +		}
> 
> Again, this is doing things by single blocks. We do have multi-block
> metadata (inodes, directory blocks, remote attrs) that, if it
> is already in memory, needs to be treated as multi-block extents. If
> we don't do that, we'll cause aliasing problems in the buffer cache
> (see _xfs_buf_obj_cmp()) and it's all downhill from there.

I only recently started testing with filesystems containing multiblock
dir/rmt metadata, and this is an unsolved problem. :(

I /think/ the solution is that we need to query the buffer cache to see
if it has a buffer for the given disk blocks, and if it matches the
btree we're discarding (correct magic/uuid/b_length) then we invalidate
it, otherwise we assume that something else owns it and move on.  I
think that we have to do this on a block-by-block basis since we have
no idea what else could be crosslinked with the old btree.

for_each_xfs_repair_extent_safe(rex, n, exlist) {
	agno = ...;
	agbno = ...;
	for (i = 0; i < rex->len, i++) {
		/* don't touch ag headers or post-eofs blocks */
		if (!xfs_verify_agbno(sc->mp, agno, agbno))
			continue;
		bp = xfs_buf_get_anything_caching_this_block(...);
		if (!bp)
			continue;
		if (bp does not belong to this tree) {
			xfs_trans_brelse(sc->tp, bp);
			continue;
		}
		xfs_trans_binval(sc->tp, bp);
	}
}

> That's why I get worried when I see assumptions that we can process
> contiguous metadata ranges in single block buffers....

:/

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

  parent reply	other threads:[~2018-05-16 19:34 UTC|newest]

Thread overview: 76+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-05-15 22:33 [PATCH v15.1 00/22] xfs-4.18: online repair support Darrick J. Wong
2018-05-15 22:33 ` [PATCH 01/22] xfs: add helpers to deal with transaction allocation and rolling Darrick J. Wong
2018-05-16  6:51   ` Dave Chinner
2018-05-16 16:46     ` Darrick J. Wong
2018-05-16 21:19       ` Dave Chinner
2018-05-16 16:48   ` Allison Henderson
2018-05-18  3:49   ` [PATCH v2 " Darrick J. Wong
2018-05-15 22:33 ` [PATCH 02/22] xfs: add helpers to allocate and initialize fresh btree roots Darrick J. Wong
2018-05-16  7:07   ` Dave Chinner
2018-05-16 17:15     ` Darrick J. Wong
2018-05-16 17:00   ` Allison Henderson
2018-05-15 22:33 ` [PATCH 03/22] xfs: add helpers to collect and sift btree block pointers during repair Darrick J. Wong
2018-05-16  7:56   ` Dave Chinner
2018-05-16 17:34     ` Allison Henderson
2018-05-16 18:06       ` Darrick J. Wong
2018-05-16 21:23         ` Dave Chinner
2018-05-16 21:33           ` Allison Henderson
2018-05-16 18:01     ` Darrick J. Wong
2018-05-16 21:32       ` Dave Chinner
2018-05-16 22:05         ` Darrick J. Wong
2018-05-17  0:41           ` Dave Chinner
2018-05-17  5:05             ` Darrick J. Wong
2018-05-18  3:51   ` [PATCH v2 " Darrick J. Wong
2018-05-29  3:10     ` Dave Chinner
2018-05-29 15:28       ` Darrick J. Wong
2018-05-15 22:34 ` [PATCH 04/22] xfs: add helpers to dispose of old btree blocks after a repair Darrick J. Wong
2018-05-16  8:32   ` Dave Chinner
2018-05-16 18:02     ` Allison Henderson
2018-05-16 19:34     ` Darrick J. Wong [this message]
2018-05-16 22:32       ` Dave Chinner
2018-05-16 23:18         ` Darrick J. Wong
2018-05-17  5:58           ` Darrick J. Wong
2018-05-18  3:53   ` [PATCH v2 " Darrick J. Wong
2018-05-29  3:14     ` Dave Chinner
2018-05-29 18:01       ` Darrick J. Wong
2018-05-15 22:34 ` [PATCH 05/22] xfs: recover AG btree roots from rmap data Darrick J. Wong
2018-05-16  8:51   ` Dave Chinner
2018-05-16 18:37     ` Darrick J. Wong
2018-05-16 19:18       ` Allison Henderson
2018-05-16 22:36       ` Dave Chinner
2018-05-17  5:53         ` Darrick J. Wong
2018-05-18  3:54   ` [PATCH v2 " Darrick J. Wong
2018-05-29  3:16     ` Dave Chinner
2018-05-15 22:34 ` [PATCH 06/22] xfs: add a repair helper to reset superblock counters Darrick J. Wong
2018-05-16 21:29   ` Allison Henderson
2018-05-18  3:56     ` Darrick J. Wong
2018-05-18  3:56   ` [PATCH v2 " Darrick J. Wong
2018-05-29  3:28     ` Dave Chinner
2018-05-29 22:07       ` Darrick J. Wong
2018-05-29 22:24         ` Dave Chinner
2018-05-29 22:43           ` Darrick J. Wong
2018-05-30  1:23             ` Dave Chinner
2018-05-30  3:22               ` Darrick J. Wong
2018-05-15 22:34 ` [PATCH 07/22] xfs: add helpers to attach quotas to inodes Darrick J. Wong
2018-05-16 22:21   ` Allison Henderson
2018-05-18  3:58   ` [PATCH v2 " Darrick J. Wong
2018-05-29  3:29     ` Dave Chinner
2018-05-15 22:34 ` [PATCH 08/22] xfs: repair superblocks Darrick J. Wong
2018-05-16 22:55   ` Allison Henderson
2018-05-29  3:42   ` Dave Chinner
2018-05-15 22:34 ` [PATCH 09/22] xfs: repair the AGF and AGFL Darrick J. Wong
2018-05-15 22:34 ` [PATCH 10/22] xfs: repair the AGI Darrick J. Wong
2018-05-15 22:34 ` [PATCH 11/22] xfs: repair free space btrees Darrick J. Wong
2018-05-15 22:34 ` [PATCH 12/22] xfs: repair inode btrees Darrick J. Wong
2018-05-15 22:35 ` [PATCH 13/22] xfs: repair the rmapbt Darrick J. Wong
2018-05-15 22:35 ` [PATCH 14/22] xfs: repair refcount btrees Darrick J. Wong
2018-05-15 22:35 ` [PATCH 15/22] xfs: repair inode records Darrick J. Wong
2018-05-15 22:35 ` [PATCH 16/22] xfs: zap broken inode forks Darrick J. Wong
2018-05-15 22:35 ` [PATCH 17/22] xfs: repair inode block maps Darrick J. Wong
2018-05-15 22:35 ` [PATCH 18/22] xfs: repair damaged symlinks Darrick J. Wong
2018-05-15 22:35 ` [PATCH 19/22] xfs: repair extended attributes Darrick J. Wong
2018-05-15 22:35 ` [PATCH 20/22] xfs: scrub should set preen if attr leaf has holes Darrick J. Wong
2018-05-15 22:35 ` [PATCH 21/22] xfs: repair quotas Darrick J. Wong
2018-05-15 22:36 ` [PATCH 22/22] xfs: implement live quotacheck as part of quota repair Darrick J. Wong
2018-05-18  3:47 ` [PATCH 0.5/22] xfs: grab the per-ag structure whenever relevant Darrick J. Wong
2018-05-30  6:44   ` Dave Chinner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180516193425.GM23858@magnolia \
    --to=darrick.wong@oracle.com \
    --cc=david@fromorbit.com \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.