Re: [PATCH 09/10] xfs: scrub/repair should update filesystem metadata health

From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Brian Foster <bfoster@redhat.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH 09/10] xfs: scrub/repair should update filesystem metadata health
Date: Fri, 5 Apr 2019 13:54:47 -0700	[thread overview]
Message-ID: <20190405205447.GB5147@magnolia> (raw)
In-Reply-To: <20190405130739.GB43111@bfoster>

On Fri, Apr 05, 2019 at 09:07:39AM -0400, Brian Foster wrote:
> On Thu, Apr 04, 2019 at 11:01:33AM -0700, Darrick J. Wong wrote:
> > On Thu, Apr 04, 2019 at 07:50:11AM -0400, Brian Foster wrote:
> > > On Mon, Apr 01, 2019 at 10:11:12AM -0700, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > > 
> > > > Now that we have the ability to track sick metadata in-core, make scrub
> > > > and repair update those health assessments after doing work.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > ---
> > > >  fs/xfs/Makefile       |    1 
> > > >  fs/xfs/scrub/health.c |  180 +++++++++++++++++++++++++++++++++++++++++++++++++
> > > >  fs/xfs/scrub/health.h |   12 +++
> > > >  fs/xfs/scrub/scrub.c  |    8 ++
> > > >  fs/xfs/scrub/scrub.h  |   11 +++
> > > >  5 files changed, 212 insertions(+)
> > > >  create mode 100644 fs/xfs/scrub/health.c
> > > >  create mode 100644 fs/xfs/scrub/health.h
> > > > 
> > > > 
> > > ...
> > > > diff --git a/fs/xfs/scrub/health.c b/fs/xfs/scrub/health.c
> > > > new file mode 100644
> > > > index 000000000000..dd9986500801
> > > > --- /dev/null
> > > > +++ b/fs/xfs/scrub/health.c
> > > > @@ -0,0 +1,180 @@
> > > ...
> > > > +/* Update filesystem health assessments based on what we found and did. */
> > > > +void
> > > > +xchk_update_health(
> > > > +	struct xfs_scrub	*sc,
> > > > +	bool			already_fixed)
> > > > +{
> > > > +	/*
> > > > +	 * If the scrubber finds errors, we mark sick whatever's mentioned in
> > > > +	 * sick_mask, no matter whether this is a first scan or an evaluation
> > > > +	 * of repair effectiveness.
> > > > +	 *
> > > > +	 * If there is no direct corruption and we're called after a repair,
> > > > +	 * clear whatever's in heal_mask because that's what we fixed.
> > > > +	 *
> > > > +	 * Otherwise, there's no direct corruption and we didn't repair
> > > > +	 * anything, so mark whatever's in sick_mask as healthy.
> > > > +	 */
> > > > +	if (sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT)
> > > > +		xchk_mark_sick(sc, sc->sick_mask);
> > > > +	else if (already_fixed)
> > > > +		xchk_mark_healthy(sc, sc->heal_mask);
> > > > +	else
> > > > +		xchk_mark_healthy(sc, sc->sick_mask);
> > > > +}
> > > 
> > > Hmm, I think I follow what we're doing here but it's a bit confusing
> > > without the additional context of where these bits will be set/cleared
> > > at the lower scrub layers (or at least without an example). Some
> > > questions on that below...
> > > 
> > > ...
> > > > diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c
> > > > index 1b2344d00525..b1519dfc5811 100644
> > > > --- a/fs/xfs/scrub/scrub.c
> > > > +++ b/fs/xfs/scrub/scrub.c
> > > > @@ -40,6 +40,7 @@
> > > >  #include "scrub/trace.h"
> > > >  #include "scrub/btree.h"
> > > >  #include "scrub/repair.h"
> > > > +#include "scrub/health.h"
> > > >  
> > > >  /*
> > > >   * Online Scrub and Repair
> > > > @@ -468,6 +469,7 @@ xfs_scrub_metadata(
> > > >  {
> > > >  	struct xfs_scrub		sc;
> > > >  	struct xfs_mount		*mp = ip->i_mount;
> > > > +	unsigned int			heal_mask;
> > > >  	bool				try_harder = false;
> > > >  	bool				already_fixed = false;
> > > >  	int				error = 0;
> > > > @@ -488,6 +490,7 @@ xfs_scrub_metadata(
> > > >  	error = xchk_validate_inputs(mp, sm);
> > > >  	if (error)
> > > >  		goto out;
> > > > +	heal_mask = xchk_health_mask_for_scrub_type(sm->sm_type);
> > > >  
> > > >  	xchk_experimental_warning(mp);
> > > >  
> > > > @@ -499,6 +502,8 @@ xfs_scrub_metadata(
> > > >  	sc.ops = &meta_scrub_ops[sm->sm_type];
> > > >  	sc.try_harder = try_harder;
> > > >  	sc.sa.agno = NULLAGNUMBER;
> > > > +	sc.heal_mask = heal_mask;
> > > > +	sc.sick_mask = xchk_health_mask_for_scrub_type(sm->sm_type);
> > > 
> > > Ok, so we initialize the heal/sick masks based on the scrub type that
> > > was requested on the first pass through...
> > > 
> > > >  	error = sc.ops->setup(&sc, ip);
> > > >  	if (error)
> > > >  		goto out_teardown;
> > > > @@ -519,6 +524,8 @@ xfs_scrub_metadata(
> > > >  	} else if (error)
> > > >  		goto out_teardown;
> > > >  
> > > > +	xchk_update_health(&sc, already_fixed);
> > > > +
> > > 
> > > ... then update the in-core fs health state based on the sick mask. Is
> > > it possible for the scrub operation to set more sick mask bits based on
> > > what it finds?
> > 
> > Theoretically, yes, but in practice none of the current scrubbers need
> > to touch sick_mask.
> > 
> > heal_mask, OTOH, will be adjusted by the free space / inode repair
> > functions since they rebuild multiple structures.
> > 
> 
> Ok..
> 
> > > More specifically, I'm wondering why the masks wouldn't start as zero
> > > and toggle based on finding/fixing corruption(s).
> > 
> > sick_mask is also the mask we feed to xfs_*_mark_healthy if the scan
> > returns clean, which is why we set the default value before dispatching
> > the scrub.
> > 
> > > Or if the sick mask value is essentially fixed, whether we need to
> > > store it in the xfs_scrub context...
> > 
> > We could probably get away with generating it in xchk_update_health at
> > the end, but it feels weird to have heal_mask in the scrub context but
> > sick_mask gets auto-generated.
> > 
> 
> Ok.. hmm. Both feel a little weird to me, but this is really just an
> aesthetic/factoring thing so I'll think about it a bit more and come
> back to it.
> 
> > > 
> > > >  	if ((sc.sm->sm_flags & XFS_SCRUB_IFLAG_REPAIR) && !already_fixed) {
> > > >  		bool needs_fix;
> > > >  
> > > > @@ -551,6 +558,7 @@ xfs_scrub_metadata(
> > > >  				xrep_failure(mp);
> > > >  				goto out;
> > > >  			}
> > > > +			heal_mask = sc.heal_mask;
> > > 
> > > And if we end up doing a repair, we presumably can repair multiple
> > > things and so we track that separately and persist the heal mask across
> > > a potential retry.
> > 
> > Right.
> > 
> > > What about the case where we don't retry, but scrub finds something
> > > and then immediately repairs it?
> > 
> > The repair jumps back to retry_op if either (a) we couldn't get all the
> > resources we needed and therefore sc.try_harder = true and we need to
> > start over; or (b) repair thinks it fixed a thing, so we need to scrub
> > the thing again to see if it's really fixed...
> > 
> > > Should we update the fs state after both detecting and clearing the
> > > problem, or does that happen elsewhere?
> > 
> > ...so if scrub immediately repairs a thing, we preserve heal_mask, jump
> > back to the scrub, and if the scrub says clean we'll mark heal mask
> > healthy.
> > 
> > If the repair has to retry then the we'll call the repair function
> > again, which (presumably) will set (again) the heal_mask appropriately,
> > and then we have the same post-repair state updating as above.
> > 
> > Does that make sense? :)
> > 
> 
> Ah, Ok. I didn't realize that a successful repair looped back to the
> scrub code (and thus the health update). Yes, that makes more sense.
> 
> > > Also, if repair can potentially clear multiple bits, what's the
> > > possibility of a repair clearing one failure and then failing on
> > > another, causing the broader repair op to return an error or jump into
> > > this retry?
> > 
> > Scrub doesn't touch the fs health state at all until after the ->scrub
> > or ->repair function succeeds.  If the scrub or the repair functions
> > fail for any non-retry reason, we back out to userspace without updating
> > anything.  It's as if we'd never called the failed function.
> > 
> 
> Right.. what I was getting at above is seeing whether we'd actually
> update partial repair state in-core. E.g., suppose things A and B are
> faulted in-core and it's one of these cases where repair can fix A and B
> at the same time. If it fixes thing A and fails on thing B, it sounds
> like we'd not clear the in-core fault state on A even though it's
> technically repaired.

Hmm.  If the repair function returns a runtime error (having fixed A but
not B) then yes, we won't clear the incore fault state on A (or B) even
though we fixed A.  Something weird happened, so we shouldn't be too
hasty to clear things.  A subsequent re-scrub of A will clear the fault
on A, though.

OTOH... if the A/B repair function returns 0 having fixed A but left B
corrupt, the rescan will see that A is fine and (incorrectly) clear both
A and B.  I would say that's a bug, so maybe I should rethink the need
for sick_mask and heal_mask.

That said, a normal xfs_scrub run will check (or have already checked) B
and noticed that it was corrupt, so it will circle back and try to fix B
separately, so in a sense we don't really need heal_mask at all.

> > Maybe some worked examples will help?
> > 
> > Let's say both inode btrees are corrupt.  We run xfs_scrub -n,
> > xchk_inobt will record the corruption, and (assuming it hits no runtime
> > errors) once we return to xfs_scrub_metadata, it'll set
> > XFS_SICK_AG_INOBT.  Presumably xfs_scrub will also call the finobt scrub
> > and SICK_AG_FINOBT will also get set.
> > 
> > If we run xfs_scrub without the -n, xchk_inobt will record the
> > corruption and set SICK_AG_INOBT per above.  Then it'll run xrep_inobt,
> > which will set heal_mask to SICK_AG_INOBT | SICK_AG_FINOBT.  If the
> > repair fails with a non-retry runtime error, we exit to userspace and
> > ignore heal_mask.
> > 
> 
> Ok, this sounds like the case I'm theorizing about above (where suppose
> repair fixed the inobt and then failed on the finobt, but hasn't cleared
> faults for either..).
> 
> > If instead the repair succeeds, we scan the inobt again.  If that comes
> > up clear then we use heal_mask to clear SICK_AG_INOBT | SICK_AG_FINOBT.
> > xfs_scrub will call again later to repair the finobt, but the initial
> > finobt scan will see no errors in the finobt, clear SICK_AG_FINOBT
> > (which isn't set) and exit.
> > 
> 
> So it sounds like the state would have to be cleared by a subsequent
> scrub request. The scan would find thing A healthy and mark it so
> regardless, to clear any potential previous faults that might have
> already been repaired. Right?

Right.

> > If the inobt repair function is buggy and says it repaired the inode
> > btrees but leaves corruptions, then the rescan of the inobt will notice
> > and set SICK_AG_INOBT (which is already set) and exit.  Similarly, when
> > xfs_scrub calls back about the finobt, it will notice the corrupt
> > finobt, try to set SICK_AG_FINOBT (also already set), try to fix it, and
> > the rescan of the finobt will notice that the finobt is still corrupt
> > and try to set SICK_AG_FINOBT (which is still set).
> > 
> > The end result (I think) is that we always set the sick bits if a scan
> > shows problems, and we only clear the sick bits for things if we can
> > prove that the things are no longer sick.  Does that help?
> > 
> 
> Yes, thanks for the explanation. I think the confusion is mostly due to
> not being able to fully see how these scrub states are managed,
> particularly the bits that warranted the creation of separate masks in
> the first place.

You've convinced me that this patch is too convoluted to understand, so
I think I want to simplify it some more.  First, I'd rename the field
to "sick_mask_update" and change the behavior so that we:

 1. Set sick_mask_update to the default XFS_SICK flag for this scrub
    type (call it A).  (We already do this)

 2. If the scrubber returns an error code, we exit making no changes to
    the incore sick state.

 3. If the scrubber finds that A is clean, clear the incore sick flags
    that are set in s_m_u and exit.

 4. If the scrubber finds that A is corrupt, set the incore sick flags
    that are set in s_m_u.

    a. If the user doesn't want to repair, then we exit, having
       previously set incore sick flags.

 5. Now we know that A is corrupt and the user wants to repair.
    If repair returns an error code, we exit with that error code, having
    made no further changes to the incore sick state.

 6. If repair rebuilds both A & B correctly and the re-scrub of A is
    clean, we'll clear the incore sick flags using s_m_u.  This should
    clear A.

 7. If repair rebuilds both A & B and screws up A, the re-scrub will find
    it corrupt and leave the sick flags as they are, which is to say that
    A is marked sick.

 8. If repair rebuilds A correctly but leaves B corrupt, the re-scrub of
    A will be clean and we'll clear the incore sick flags using s_m_u.
    This should clear A, even though B is corrupt.

 9. No matter whether we encountered scenarios 6, 7, or 8, if xfs_scrub
    previously scrubbed B and found it corrupt, it will call again to
    repair B, which will set the incore sick state appropriately.  If
    xfs_scrub has not yet scrubbed B then it will call later to scrub B,
    which will set the incore sick state appropriately.

I hope that's easier to understand...

> This does still have me wondering if separate masks are necessary, if we
> perhaps had more selective health update logic, for example. I think it
> might be better to either bundle this patch with whatever other changes
> actually make use of the separate masks, or alternatively to simplify
> the current logic and just defer the separate mask thing until those
> more complex repair algorithms come along..

--D

> Brian
> 
> > > ISTM that it might be possible to skip clearing one fail state bit so
> > > long as the original thing remained corrupted, but I feel like I'm
> > > still missing some context on the bigger picture scrub tracking...
> > 
> > Yeah, the state machine is pretty squirrely. :/
> > 
> > --D
> > 
> > > Brian
> > > 
> > > >  			goto retry_op;
> > > >  		}
> > > >  	}
> > > > diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h
> > > > index 22f754fba8e5..05f1ad242a35 100644
> > > > --- a/fs/xfs/scrub/scrub.h
> > > > +++ b/fs/xfs/scrub/scrub.h
> > > > @@ -62,6 +62,17 @@ struct xfs_scrub {
> > > >  	struct xfs_inode		*ip;
> > > >  	void				*buf;
> > > >  	uint				ilock_flags;
> > > > +
> > > > +	/* Metadata to be marked sick if scrub finds errors. */
> > > > +	unsigned int			sick_mask;
> > > > +
> > > > +	/*
> > > > +	 * Metadata to be marked healthy if repair fixes errors.  Some repair
> > > > +	 * functions can fix multiple data structures at once, so we have to
> > > > +	 * treat sick and heal masks separately.
> > > > +	 */
> > > > +	unsigned int			heal_mask;
> > > > +
> > > >  	bool				try_harder;
> > > >  	bool				has_quotaofflock;
> > > >  
> > > >