All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH 10/21] xfs: add helper routines for the repair code
Date: Mon, 9 Apr 2018 10:19:43 -0700	[thread overview]
Message-ID: <20180409171943.GO7500@magnolia> (raw)
In-Reply-To: <20180409002342.GS23861@dastard>

On Mon, Apr 09, 2018 at 10:23:42AM +1000, Dave Chinner wrote:
> On Sat, Apr 07, 2018 at 10:55:42AM -0700, Darrick J. Wong wrote:
> > On Fri, Apr 06, 2018 at 10:52:51AM +1000, Dave Chinner wrote:
> > > On Mon, Apr 02, 2018 at 12:57:18PM -0700, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > 
> > > > Add some helper functions for repair functions that will help us to
> > > > allocate and initialize new metadata blocks for btrees that we're
> > > > rebuilding.
> > > 
> > > This is more than "helper functions" - my replay is almost 700 lines
> > > long!
> > > 
> > > i.e. This is a stack of extent invalidation, freeing and rmap/free
> > > space rebuild code. Needs a lot more description and context than
> > > "helper functions".
> > 
> > I agree now; this patch has become overly long and incohesive.  It could
> > be broken into the following pieces:
> > 
> > - modifying transaction allocations to take resblks
> > 	or "ensuring sufficient free space to rebuild"
> > 
> > - rolling transactions
> > 
> > - allocation and reinit functions
> > 
> > - fixing/putting on the freelist
> > 
> > - dealing with leftover blocks after we rebuild a tree
> > 
> > - finding btree roots
> > 
> > - resetting superblock counters
> > 
> > So I'll break this up into seven smaller pieces.
> 
> Thanks!
> 
> > 
> > > .....
> > > > @@ -574,7 +575,10 @@ xfs_scrub_setup_fs(
> > > >  	struct xfs_scrub_context	*sc,
> > > >  	struct xfs_inode		*ip)
> > > >  {
> > > > -	return xfs_scrub_trans_alloc(sc->sm, sc->mp, &sc->tp);
> > > > +	uint				resblks;
> > > > +
> > > > +	resblks = xfs_repair_calc_ag_resblks(sc);
> > > > +	return xfs_scrub_trans_alloc(sc->sm, sc->mp, resblks, &sc->tp);
> > > >  }
> > > 
> > > What's the block reservation for?
> > 
> > We rebuild a metadata structure by constructing a completely new btree
> > with blocks we got from the free space and switching the root when we're
> > done.  This isn't treated any differently from any other btree shape
> > change, so we need to reserve blocks in the transaction.
> 
> Can you put this in a comment?

Added to the comment above xfs_scrub_trans_alloc.

> > > > +	args.tp = sc->tp;
> > > > +	args.mp = sc->mp;
> > > > +	args.oinfo = *oinfo;
> > > > +	args.fsbno = XFS_AGB_TO_FSB(args.mp, sc->sa.agno, 0);
> > > 
> > > So our target BNO is the start of the AG.
> > > 
> > > > +	args.minlen = 1;
> > > > +	args.maxlen = 1;
> > > > +	args.prod = 1;
> > > > +	args.type = XFS_ALLOCTYPE_NEAR_BNO;
> > > 
> > > And we do a "near" search" for a single block. i.e. we are looking
> > > for a single block near to the start of the AG. Basically, we are
> > > looking for the extent at the left edge of the by-bno tree, which
> > > may not be a single block.
> > > 
> > > Would it be better to do a XFS_ALLOCTYPE_THIS_AG allocation here so
> > > we look up the by-size btree for a single block extent? The left
> > > edge of that tree will always be the smallest free extent closest to
> > > the start of the AG, and so using THIS_AG will tend to fill
> > > small freespace holes rather than tear up larger free spaces for
> > > single block allocations.....
> > 
> > Makes sense.  Originally I was thinking that we try to put the rebuilt
> > btree as close to the start of the AG as possible, but there's little
> > point in doing that so long as regular btree block allocation doesn't
> > bother.
> > 
> > TBH I've been wondering on the side if it makes more sense for us to
> > detect things like dm-{thinp,cache} which have large(ish) underlying
> > blocks and try to consolidate metadata into a smaller number of those
> > underlying blocks at the start (or the end) of the AG?  It could be
> > useful to pack the metadata into a smaller number of underlying blocks
> > so that if one piece of metadata gets hot enough the rest will end up in
> > the fast cache as well.
> 
> Separate issue, probably can be done with an in-memory threshold
> for metadata allocation (e.g. search for BNO < 5% from start of AG).
> I've thought about doing somthing like that (reserved area for
> metadata) for some time, but I've never really seen signficant
> advantages to doing it...

<nod> I suppose on a spinny disk it might cut down scrub/repair time,
though maybe for pmem it'll prove advantageous to try to keep free space
as unfragmented as possible to increase the likelihood of large page
use.  Oh well, that's a research question for another thread. :)

> > > > +/* Put a block back on the AGFL. */
> > > > +int
> > > > +xfs_repair_put_freelist(
> > > > +	struct xfs_scrub_context	*sc,
> > > > +	xfs_agblock_t			agbno)
> > > > +{
> > > > +	struct xfs_owner_info		oinfo;
> > > > +	int				error;
> > > > +
> > > > +	/*
> > > > +	 * Since we're "freeing" a lost block onto the AGFL, we have to
> > > > +	 * create an rmap for the block prior to merging it or else other
> > > > +	 * parts will break.
> > > > +	 */
> > > > +	xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_AG);
> > > > +	error = xfs_rmap_alloc(sc->tp, sc->sa.agf_bp, sc->sa.agno, agbno, 1,
> > > > +			&oinfo);
> > > > +	if (error)
> > > > +		return error;
> > > > +
> > > > +	/* Put the block on the AGFL. */
> > > > +	error = xfs_alloc_put_freelist(sc->tp, sc->sa.agf_bp, sc->sa.agfl_bp,
> > > > +			agbno, 0);
> > > 
> > > At what point do we check there's actually room in the AGFL for this
> > > block?
> > > 
> > > > +	if (error)
> > > > +		return error;
> > > > +	xfs_extent_busy_insert(sc->tp, sc->sa.agno, agbno, 1,
> > > > +			XFS_EXTENT_BUSY_SKIP_DISCARD);
> > > > +
> > > > +	/* Make sure the AGFL doesn't overfill. */
> > > > +	return xfs_repair_fix_freelist(sc, true);
> > > 
> > > i.e. shouldn't this be done first?
> > 
> > 
> >
> 
> ???

That was supposed to be "Yes".

> 
> > > > +		else if (resv == XFS_AG_RESV_AGFL)
> > > > +			error = xfs_repair_put_freelist(sc, agbno);
> > > > +		else
> > > > +			error = xfs_free_extent(sc->tp, fsbno, 1, oinfo, resv);
> > > > +		if (error)
> > > > +			break;
> > > > +
> > > > +		if (sc->ip)
> > > > +			error = xfs_trans_roll_inode(&sc->tp, sc->ip);
> > > > +		else
> > > > +			error = xfs_repair_roll_ag_trans(sc);
> > > 
> > > why don't we break on error here?
> > 
> > We do, but it's up in the for loop.
> 
> I missed that completely. I'd much prefer we don't hide "break on
> error" conditions inside the for loop machinery because it is so
> easy to miss...

I refactored the loop body into a separate helper, which decreased the
indentation and made the loop control clearer.

> 
> > > > +
> > > > +	trace_xfs_repair_collect_btree_extent(sc->mp,
> > > > +			XFS_FSB_TO_AGNO(sc->mp, fsbno),
> > > > +			XFS_FSB_TO_AGBNO(sc->mp, fsbno), len);
> > > > +
> > > > +	rae = kmem_alloc(sizeof(struct xfs_repair_extent),
> > > > +			KM_MAYFAIL | KM_NOFS);
> > > 
> > > Why KM_NOFS?
> > 
> > The caller holds a transaction and (potentially) has locked an entire
> > AG, so we want to avoid recursing into the filesystem to free up memory.
> 
> If we're in transaction context, we are already under a KM_NOFS
> allocation contexts. SO it shouldn't be needed here - if we are
> called outside transaction context, then we need a comment
> explaining it (as opposed to using KM_NOFS to shut up lockdep).

Ok, I'll get rid of the KM_NOFS in this and all the other patches.

(I admit I probably /have/ been stuffing in KM_NOFS to shut up
lockdep...)

> > > > +	for_each_xfs_repair_extent_safe(rae, n, exlist) {
> > > > +		agno = XFS_FSB_TO_AGNO(sc->mp, rae->fsbno);
> > > > +		agbno = XFS_FSB_TO_AGBNO(sc->mp, rae->fsbno);
> > > > +		for (i = 0; i < rae->len; i++) {
> > > > +			bp = xfs_btree_get_bufs(sc->mp, sc->tp, agno,
> > > > +					agbno + i, 0);
> > > > +			xfs_trans_binval(sc->tp, bp);
> > > > +		}
> > > > +	}
> > > 
> > > What if they are data blocks we are dumping? xfs_trans_binval is
> > > fine for metadata blocks, but creating a stale metadata buffer and
> > > logging a buffer cancel for user data blocks doesn't make any sense
> > > to me....
> > 
> > exlist is only supposed to contain references to metadata blocks that we
> > collected from the rmapbt.  The current iteration of the repair code
> > shouldn't ever dump file data blocks.
> 
> That needs explanation, then. I had no idea that exlist was only for
> metadata blocks...

Ok, comment changed to:

/*
 * Invalidate buffers for per-AG btree blocks we're dumping.  We assume
 * that exlist points only to metadata blocks.
 */

> > > > +/* Remove all the blocks in sublist from exlist. */
> > > > +#define LEFT_CONTIG	(1 << 0)
> > > > +#define RIGHT_CONTIG	(1 << 1)
> > > 
> > > Namespace, please.
> > > 
> > > > +int
> > > > +xfs_repair_subtract_extents(
> > > > +	struct xfs_scrub_context	*sc,
> > > > +	struct xfs_repair_extent_list	*exlist,
> > > > +	struct xfs_repair_extent_list	*sublist)
> > > 
> > > What is this function supposed to do? I can't really review the code
> > > (looks complex) because I don't know what it's function is supposed
> > > to be.
> > 
> > The intent is that we iterate the rmapbt for all of its records for a
> > given owner to generate exlist.  Then we iterate all blocks of metadata
> > structures that we are not rebuilding but have the same rmapbt owner to
> > generate sublist.  This routine subtracts all the blocks mentioned in
> > sublist from all the extents linked in exlist, which leaves exlist as
> > the list of all blocks owned by the old version of btree that we're
> > rebuilding.  The list is fed into _reap_btree_extents to free the old
> > btree blocks (or merely remove the rmap if the block is crosslinked).
> 
> Comment, please :)

Added.

> > > > +
> > > > +	if (delta_icount) {
> > > > +		error = xfs_mod_icount(mp, delta_icount);
> > > > +		if (error)
> > > > +			goto out;
> > > > +	}
> > > > +	if (delta_ifree) {
> > > > +		error = xfs_mod_ifree(mp, delta_ifree);
> > > > +		if (error)
> > > > +			goto out;
> > > > +	}
> > > > +	if (delta_fdblocks) {
> > > > +		error = xfs_mod_fdblocks(mp, delta_fdblocks, false);
> > > > +		if (error)
> > > > +			goto out;
> > > > +	}
> > > > +
> > > > +out:
> > > > +	xfs_trans_cancel(tp);
> > > > +	return error;
> > > 
> > > Ummmm - why do we do all this superblock modification work in a
> > > transaction context, and then just cancel it?
> > 
> > It's an empty transaction, so nothing gets dirty and nothing needs
> > committing.  TBH I don't even think this is necessary, since we only use
> > it for reading the agi/agf, and for that we can certainly _buf_relse
> > instead of letting the _trans_cancel do it.
> 
> Yeah, it'd be better to get rid of the transaction context if we
> can. It should also explain how it avoids races with ongoing
> superblock counter updates...

<nod>  AFAICT it's a simple matter of taking m_sb_lock to update the
m_sb counters, after which we unlock and do the percpu counters.

> > > > +		inobt_sz = xfs_iallocbt_calc_size(mp, icount /
> > > > +				XFS_INODES_PER_CHUNK);
> > > > +	if (xfs_sb_version_hasfinobt(&mp->m_sb))
> > > > +		inobt_sz *= 2;
> > > > +	if (xfs_sb_version_hasreflink(&mp->m_sb)) {
> > > > +		rmapbt_sz = xfs_rmapbt_calc_size(mp, aglen);
> > > > +		refcbt_sz = xfs_refcountbt_calc_size(mp, usedlen);
> > > > +	} else {
> > > > +		rmapbt_sz = xfs_rmapbt_calc_size(mp, usedlen);
> > > > +		refcbt_sz = 0;
> > > > +	}
> > > > +	if (!xfs_sb_version_hasrmapbt(&mp->m_sb))
> > > > +		rmapbt_sz = 0;
> > > > +
> > > > +	trace_xfs_repair_calc_ag_resblks_btsize(mp, sm->sm_agno, bnobt_sz,
> > > > +			inobt_sz, rmapbt_sz, refcbt_sz);
> > > > +
> > > > +	return max(max(bnobt_sz, inobt_sz), max(rmapbt_sz, refcbt_sz));
> > > 
> > > Why are we only returning the resblks for a single tree rebuild?
> > > Shouldn't it return the total blocks require to rebuild all all the
> > > trees?
> > 
> > A repair call only ever rebuilds one btree in one AG, so I don't see why
> > we'd need to reserve space to rebuild all the btrees in the AG (or the
> > same btree in all AGs)... though it's possible that I don't understand
> > the question?
> 
> It's not clear that we are only doing a reservation for a single
> tree in an AG. needs a comment to explain the context the
> reservation is being made for.

/*
 * Figure out how many blocks to reserve for an AG repair.  We calculate
 * the worst case estimate for the number of blocks we'd need to rebuild
 * one of any type of per-AG btree.
 */

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

  reply	other threads:[~2018-04-09 17:19 UTC|newest]

Thread overview: 39+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-04-02 19:56 [PATCH v14.1 00/21] xfs: online repair support Darrick J. Wong
2018-04-02 19:56 ` [PATCH 01/21] xfs: add helpers to calculate btree size Darrick J. Wong
2018-04-02 19:56 ` [PATCH 02/21] xfs: expose various functions to repair code Darrick J. Wong
2018-04-02 19:56 ` [PATCH 03/21] xfs: add repair helpers for the reverse mapping btree Darrick J. Wong
2018-04-05 23:02   ` Dave Chinner
2018-04-06 16:31     ` Darrick J. Wong
2018-04-02 19:56 ` [PATCH 04/21] xfs: add repair helpers for the reference count btree Darrick J. Wong
2018-04-02 19:56 ` [PATCH 05/21] xfs: add BMAPI_NORMAP flag to perform block remapping without updating rmapbt Darrick J. Wong
2018-04-05 23:07   ` Dave Chinner
2018-04-06 16:41     ` Darrick J. Wong
2018-04-02 19:56 ` [PATCH 06/21] xfs: make xfs_bmapi_remapi work with attribute forks Darrick J. Wong
2018-04-05 23:12   ` Dave Chinner
2018-04-06 17:31     ` Darrick J. Wong
2018-04-02 19:56 ` [PATCH 07/21] xfs: halt auto-reclamation activities while rebuilding rmap Darrick J. Wong
2018-04-05 23:14   ` Dave Chinner
2018-04-02 19:57 ` [PATCH 08/21] xfs: create tracepoints for online repair Darrick J. Wong
2018-04-05 23:15   ` Dave Chinner
2018-04-02 19:57 ` [PATCH 09/21] xfs: implement the metadata repair ioctl flag Darrick J. Wong
2018-04-05 23:24   ` Dave Chinner
2018-04-02 19:57 ` [PATCH 10/21] xfs: add helper routines for the repair code Darrick J. Wong
2018-04-06  0:52   ` Dave Chinner
2018-04-07 17:55     ` Darrick J. Wong
2018-04-09  0:23       ` Dave Chinner
2018-04-09 17:19         ` Darrick J. Wong [this message]
2018-04-02 19:57 ` [PATCH 11/21] xfs: repair superblocks Darrick J. Wong
2018-04-06  1:05   ` Dave Chinner
2018-04-07 18:04     ` Darrick J. Wong
2018-04-09  0:26       ` Dave Chinner
2018-04-09 17:36         ` Darrick J. Wong
2018-04-02 19:57 ` [PATCH 12/21] xfs: repair the AGF and AGFL Darrick J. Wong
2018-04-02 19:57 ` [PATCH 13/21] xfs: repair the AGI Darrick J. Wong
2018-04-02 19:57 ` [PATCH 14/21] xfs: repair free space btrees Darrick J. Wong
2018-04-02 19:57 ` [PATCH 15/21] xfs: repair inode btrees Darrick J. Wong
2018-04-02 19:57 ` [PATCH 16/21] xfs: repair the rmapbt Darrick J. Wong
2018-04-02 19:58 ` [PATCH 17/21] xfs: repair refcount btrees Darrick J. Wong
2018-04-02 19:58 ` [PATCH 18/21] xfs: repair inode records Darrick J. Wong
2018-04-02 19:58 ` [PATCH 19/21] xfs: zap broken inode forks Darrick J. Wong
2018-04-02 19:58 ` [PATCH 20/21] xfs: repair inode block maps Darrick J. Wong
2018-04-02 19:58 ` [PATCH 21/21] xfs: repair damaged symlinks Darrick J. Wong

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180409171943.GO7500@magnolia \
    --to=darrick.wong@oracle.com \
    --cc=david@fromorbit.com \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.