From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH 04/22] xfs: generic functions to scrub metadata and btrees
Date: Mon, 24 Jul 2017 17:39:44 -0700 [thread overview]
Message-ID: <20170725003944.GM4352@magnolia> (raw)
In-Reply-To: <20170724231542.GF17762@dastard>
On Tue, Jul 25, 2017 at 09:15:42AM +1000, Dave Chinner wrote:
> On Mon, Jul 24, 2017 at 02:58:10PM -0700, Darrick J. Wong wrote:
> > On Mon, Jul 24, 2017 at 11:05:49AM +1000, Dave Chinner wrote:
> > > > + struct xfs_btree_cur *cur,
> > > > + int level,
> > > > + bool fs_ok,
> > > > + const char *check,
> > > > + const char *func,
> > > > + int line)
> > > > +{
> > > > + char bt_ptr[24];
> > > > + char bt_type[48];
> > > > + xfs_fsblock_t fsbno;
> > > > +
> > > > + if (fs_ok)
> > > > + return fs_ok;
> > > > +
> > > > + sc->sm->sm_flags |= XFS_SCRUB_FLAG_CORRUPT;
> > > > + xfs_scrub_btree_format(cur, level, bt_type, 48, bt_ptr, 24, &fsbno);
> > >
> > > Ok, magic numbers for buffer lengths. Please use #defines for these
> > > with an explanation of why the chosen lengths are sufficient for the
> > > information they'll be used to hold.
> > >
> > > > + trace_xfs_scrub_btree_error(cur->bc_mp, bt_type, bt_ptr,
> > > > + XFS_FSB_TO_AGNO(cur->bc_mp, fsbno),
> > > > + XFS_FSB_TO_AGBNO(cur->bc_mp, fsbno),
> > > > + check, func, line);
> > >
> > > hmmmm - tracepoints are conditional, but the formatting call isn't.
> > > Can this formatting be called/run from inside the tracepoint code
> > > itself?
> >
> > I don't know of a way to find out if a given set of tracepoint(s) are
> > enabled. Fortunately the formatting call only happens if corruption is
> > detected.
>
> I was thinking more that the function is called from within the
> tracepoint code itself. Tracepoints can have code embedded in
> them...
Oh. I'll look into that.
> >
> > > > +
> > > > +/*
> > > > + * Make sure this record is in order and doesn't stray outside of the parent
> > > > + * keys.
> > > > + */
> > > > +STATIC int
> > > > +xfs_scrub_btree_rec(
> > > > + struct xfs_scrub_btree *bs)
> > > > +{
> > > > + struct xfs_btree_cur *cur = bs->cur;
> > > > + union xfs_btree_rec *rec;
> > > > + union xfs_btree_key key;
> > > > + union xfs_btree_key hkey;
> > > > + union xfs_btree_key *keyp;
> > > > + struct xfs_btree_block *block;
> > > > + struct xfs_btree_block *keyblock;
> > > > + struct xfs_buf *bp;
> > > > +
> > > > + block = xfs_btree_get_block(cur, 0, &bp);
> > > > + rec = xfs_btree_rec_addr(cur, cur->bc_ptrs[0], block);
> > > > +
> > > > + if (bp)
> > > > + trace_xfs_scrub_btree_rec(cur->bc_mp,
> > > > + XFS_FSB_TO_AGNO(cur->bc_mp,
> > > > + XFS_DADDR_TO_FSB(cur->bc_mp, bp->b_bn)),
> > > > + XFS_FSB_TO_AGBNO(cur->bc_mp,
> > > > + XFS_DADDR_TO_FSB(cur->bc_mp, bp->b_bn)),
> > > > + cur->bc_btnum, 0, cur->bc_nlevels,
> > > > + cur->bc_ptrs[0]);
> > > > + else if (cur->bc_flags & XFS_BTREE_ROOT_IN_INODE)
> > > > + trace_xfs_scrub_btree_rec(cur->bc_mp,
> > > > + XFS_INO_TO_AGNO(cur->bc_mp,
> > > > + cur->bc_private.b.ip->i_ino),
> > > > + XFS_INO_TO_AGBNO(cur->bc_mp,
> > > > + cur->bc_private.b.ip->i_ino),
> > > > + cur->bc_btnum, 0, cur->bc_nlevels,
> > > > + cur->bc_ptrs[0]);
> > > > + else
> > > > + trace_xfs_scrub_btree_rec(cur->bc_mp,
> > > > + NULLAGNUMBER, NULLAGBLOCK,
> > > > + cur->bc_btnum, 0, cur->bc_nlevels,
> > > > + cur->bc_ptrs[0]);
> > >
> > > Hmmm - there's more code in the trace calls than there is in the
> > > scrubbing code. Is this really all necessary? I can see code
> > > getting changed in future but not the tracepoints, similar to how
> > > comment updates get missed...
> >
> > I've found it useful when analyzing the scrub-fuzz xfstests to be able
> > to pinpoint exactly what record in a btree hit some bug or other.
>
> Sure, I'm not questioning where it has some use, more just pondering
> the complexity required to emit them and whether there's a better
> way. I mean, the onl difference is the agno/agbno pair being traced,
> so wouldn't it make more sense to trace an opaque 64 bit number here
> and do:
>
> if (bp)
> num = bp->b_bn;
> else if (cur->bc_flags & XFS_BTREE_ROOT_IN_INODE)
> num = ip->i_ino;
> else
> num = NULLFSBLOCK;
> trace_xfs_scrub_btree_rec(cur->bc_mp, num, cur->bc_btnum, 0,
> cur->bc_nlevels, cur->bc_ptrs[0]);
>
> That's much simpler and easier to maintain, but provides the same
> info. It's also only one trace event callout, so the code size
> should go down as well...
<nod>
> > > > + leftsib = be64_to_cpu(block->bb_u.l.bb_leftsib);
> > > > + rightsib = be64_to_cpu(block->bb_u.l.bb_rightsib);
> > > > + level = xfs_btree_get_level(block);
> > > > +
> > > > + /* Root block should never have siblings. */
> > > > + if (level == bs->cur->bc_nlevels - 1) {
> > > > + XFS_SCRUB_BTKEY_CHECK(bs, level, leftsib == NULLFSBLOCK);
> > > > + XFS_SCRUB_BTKEY_CHECK(bs, level, rightsib == NULLFSBLOCK);
> > > > + return error;
> > > > + }
> > >
> > > This is where the macros force us into silly patterns and blow out
> > > the code size.
> > >
> > > if (level == bs->cur->bc_nlevels - 1 &&
> > > (leftsib != NULLFSBLOCK || rightsib != NULLFSBLOCK) {
> > > /* error trace call */
> > > return error;
> > > }
> >
> > We're also losing information here. With the previous code you can tell
> > between leftsib and rightsib which one is corrupt, whereas with the
> > suggested replacement you can only tell that at least one of them is
> > broken.
>
> Sure, but that's mostly irrelevant detail because...
>
> > I want to avoid the situation where scrub flags a corruption, the user
> > runs ftrace to give us __return_address, and we go looking for that only
> > to find that it points to a line of code that tests two different
> > fields.
>
> ... the fact is you have to go look at the root block header that is
> corrupt to analyse the extent of the corruption and eventually
> repair it. When it comes to analysing a corruption, you don't just
> look at the one field that has been flagged - you look at all the
> metadata in the block to determine the extent of the corruption.
>
> If you know a sibling pointer in a root block is corrupt, then the
> moment you look at the block header it's *obvious* which sibling
> pointer is corrupt. i.e. the one that is not NULLFSBLOCK. It really
> doesn't matter what is reported as the error from scrubbing because
> the corrupted items are trivially observable.
>
> It's all well and good to scan every piece of metadata for validity,
> but that doesn't mean we shouldn't think about what makes sense to
> report/log. It's going to be easy to drown the logs in corruption
> reports and make it impossible to find the needle that first caused
> it.
>
> All I'm saying is blind verbosity like this is an indication that we
> haven't thought about how to group or classify corruptions
> sufficiently. Yes, we need to be able to detect all corrupted
> fields, but we need to recognise that many corruptions are
> essentially the same problem distributed across multiple fields and
> they'll all be repaired by a single repair action.
>
> In this case, it doesn't matter if it is left or right sibling corruption
> because the analysis/repair work we need to do to correct either the
> left or right pointer is the same. i.e. we need to walk and validate
> the entire chain from both ends to repair a single bad pointer.
> Hence it doesn't matter if the left or right sibling is bad, the
> action we need to take to repair it is the same because we don't
> know if we're sitting on a wholly disconnected part of the chain
> (i.e. nasty level of tree corruption) or whether just that link got
> stomped on. i.e. bad sibling is an indication that both siblings may
> be invalid, not just the one we detected as bad....
>
> SO, yeah, "sibling corruption" means we need to check the entire
> sibling chain across all blocks in the btree level, not just in the
> direction of the bad pointer.
On some level it doesn't matter at all, since we return a single bit to
userspace, and repair just nukes the whole data structure and rebuilds
it from scratch...
...so I can go combine adjacent checks when I demacro the code; should
we decide for some reason to hurl floods of trace data back to userspace
we can always re-separate them.
> > > > + /* Does the left sibling match the parent level left block? */
> > > > + if (leftsib != NULLFSBLOCK) {
> > > > + error = xfs_btree_dup_cursor(bs->cur, &ncur);
> > > > + if (error)
> > > > + return error;
> > > > + error = xfs_btree_decrement(ncur, level + 1, &success);
> > > > + XFS_SCRUB_BTKEY_OP_ERROR_GOTO(bs, level + 1, &error, out_cur);
> > >
> > > Hmmm - if I read that right, there's a goto out_cur on error hidden
> > > in this macro....
> >
> > No more hidden than XFS_WANT_CORRUPTED_GOTO. :)
>
> Right, but I've been wanting to get rid of those XFS_WANT_CORRUPTED
> macros for a long, long time... :/
>
> > > > +
> > > > + pblock = xfs_btree_get_block(ncur, level + 1, &pbp);
> > > > + pp = xfs_btree_ptr_addr(ncur, ncur->bc_ptrs[level + 1], pblock);
> > > > + if (!xfs_scrub_btree_ptr(bs, level + 1, pp)) {
> > > > + agbno = be32_to_cpu(pp->s);
> > > > + XFS_SCRUB_BTKEY_CHECK(bs, level, agbno == leftsib);
FWIW I forgot to mention in my reply that the key pointer scrubber
doesn't actually have to check if the pointer is non-null but valid,
because at some point during scrub we'll try to use the pointer and if
it points somewhere bad we'll notice when we either an io error or a
block that doesn't match the contents we want.
> > > > + }
> > > > +
> > > > + xfs_btree_del_cursor(ncur, XFS_BTREE_ERROR);
> > > > + ncur = NULL;
> > > > + }
> > > > +
> > > > +verify_rightsib:
> > > > + if (ncur) {
> > > > + xfs_btree_del_cursor(ncur, XFS_BTREE_ERROR);
> > > > + ncur = NULL;
> > > > + }
> > > > +
> > > > + /* Does the right sibling match the parent level right block? */
> > > > + if (rightsib != NULLAGBLOCK) {
> > >
> > > No "if (!error ...) check here - I'm thinking there's some factoring
> > > needed here to reduce the code duplication going on here...
> >
> > Ok, will do. These two functions can be rewritten as a single function
> > that uses cur->bc_ops->diff_two_keys. Then we discard
> > bs->check_siblings_fn.
>
> Excellent!
>
> > > > +/* AG scrubbing */
> > > > +
> > >
> > > All this from here down doesn't seem related to scrubbing a btree?
> > > It's infrastructure for scanning AGs, but I don't see where it is
> > > called from - it looks unused at this point. I think it should be
> > > separated from the btree validation into it's own patchset and put
> > > before the individual btree verification code...
> > >
> > > I haven't really looked at this AG scrubbing code in depth because I
> > > can't tell how it fits into the code that is supposed to call it
> > > yet. I've read it, but without knowing how it's called I can't tell
> > > if the abstractions are just convoluted or whether they are
> > > necessary due to the infrastructure eventually ending up with
> > > multiple different call sites for some of the functions...
> >
> > Lock AGI/AGF/AGFL, initialize cursors for all AG btree types, pick the
> > cursor we want and scan the AG, and use the other btree cursors to
> > cross-reference each record we see during the scan.
> >
> > I regret now deferring the posting of the cross-referencing patches,
> > because some of the overdone bits in this patchset are done to avoid
> > code churn when that part gets added. That'll make it (I hope) a little
> > more obvious why some things are the way they are.
>
> Yeah, it's not easy to split large chunks of work up and maintain a
> split on a moving codebase. I don't want you to split these patches
> up into fine grained patches because that's just unmanagable, but I
> think it's worthwhile to do split out the bits that don't obviously
> appear to be related.
Yeah, I might have overreacted some; looking at the tracepoint, helper,
and btree scrub patches I think I only have to add a handful of patches.
> For this case, I don't think the follow on patch series would make
> any difference to my comments here, because going from btree
> verfication code to AG setup infrastructure in a single patch is
> quite a shift in context. Patch splits on context switch boundaries
> really do help reviewing large chunks of new code :P
Yeah, sorry about that. I got confused by my own code and thought we
were still talking about the helpers patch, not the btree scrub
framework patch.
--D
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
next prev parent reply other threads:[~2017-07-25 0:39 UTC|newest]
Thread overview: 63+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-07-21 4:38 [PATCH v8 00/22] xfs: online scrub support Darrick J. Wong
2017-07-21 4:38 ` [PATCH 01/22] xfs: query the per-AG reservation counters Darrick J. Wong
2017-07-23 16:16 ` Allison Henderson
2017-07-23 22:25 ` Dave Chinner
2017-07-24 19:07 ` Darrick J. Wong
2017-07-21 4:38 ` [PATCH 02/22] xfs: add scrub tracepoints Darrick J. Wong
2017-07-23 16:23 ` Allison Henderson
2017-07-21 4:38 ` [PATCH 03/22] xfs: create an ioctl to scrub AG metadata Darrick J. Wong
2017-07-23 16:37 ` Allison Henderson
2017-07-23 23:45 ` Dave Chinner
2017-07-24 21:14 ` Darrick J. Wong
2017-07-21 4:38 ` [PATCH 04/22] xfs: generic functions to scrub metadata and btrees Darrick J. Wong
2017-07-23 16:40 ` Allison Henderson
2017-07-24 1:05 ` Dave Chinner
2017-07-24 21:58 ` Darrick J. Wong
2017-07-24 23:15 ` Dave Chinner
2017-07-25 0:39 ` Darrick J. Wong [this message]
2017-07-21 4:39 ` [PATCH 05/22] xfs: scrub in-memory metadata buffers Darrick J. Wong
2017-07-23 16:48 ` Allison Henderson
2017-07-24 1:43 ` Dave Chinner
2017-07-24 22:36 ` Darrick J. Wong
2017-07-24 23:38 ` Dave Chinner
2017-07-25 0:14 ` Darrick J. Wong
2017-07-25 3:32 ` Dave Chinner
2017-07-25 5:27 ` Darrick J. Wong
2017-07-21 4:39 ` [PATCH 06/22] xfs: scrub the backup superblocks Darrick J. Wong
2017-07-23 16:50 ` Allison Henderson
2017-07-25 4:05 ` Dave Chinner
2017-07-25 5:42 ` Darrick J. Wong
2017-07-21 4:39 ` [PATCH 07/22] xfs: scrub AGF and AGFL Darrick J. Wong
2017-07-23 16:59 ` Allison Henderson
2017-07-21 4:39 ` [PATCH 08/22] xfs: scrub the AGI Darrick J. Wong
2017-07-23 17:02 ` Allison Henderson
2017-07-21 4:39 ` [PATCH 09/22] xfs: scrub free space btrees Darrick J. Wong
2017-07-23 17:09 ` Allison Henderson
2017-07-21 4:39 ` [PATCH 10/22] xfs: scrub inode btrees Darrick J. Wong
2017-07-23 17:15 ` Allison Henderson
2017-07-21 4:39 ` [PATCH 11/22] xfs: scrub rmap btrees Darrick J. Wong
2017-07-23 17:21 ` Allison Henderson
2017-07-21 4:39 ` [PATCH 12/22] xfs: scrub refcount btrees Darrick J. Wong
2017-07-23 17:25 ` Allison Henderson
2017-07-21 4:39 ` [PATCH 13/22] xfs: scrub inodes Darrick J. Wong
2017-07-23 17:38 ` Allison Henderson
2017-07-24 20:02 ` Darrick J. Wong
2017-07-21 4:40 ` [PATCH 14/22] xfs: scrub inode block mappings Darrick J. Wong
2017-07-23 17:41 ` Allison Henderson
2017-07-24 20:05 ` Darrick J. Wong
2017-07-21 4:40 ` [PATCH 15/22] xfs: scrub directory/attribute btrees Darrick J. Wong
2017-07-23 17:45 ` Allison Henderson
2017-07-21 4:40 ` [PATCH 16/22] xfs: scrub directory metadata Darrick J. Wong
2017-07-23 17:51 ` Allison Henderson
2017-07-21 4:40 ` [PATCH 17/22] xfs: scrub directory freespace Darrick J. Wong
2017-07-23 17:55 ` Allison Henderson
2017-07-21 4:40 ` [PATCH 18/22] xfs: scrub extended attributes Darrick J. Wong
2017-07-23 17:57 ` Allison Henderson
2017-07-21 4:40 ` [PATCH 19/22] xfs: scrub symbolic links Darrick J. Wong
2017-07-23 17:59 ` Allison Henderson
2017-07-21 4:40 ` [PATCH 20/22] xfs: scrub parent pointers Darrick J. Wong
2017-07-23 18:03 ` Allison Henderson
2017-07-21 4:40 ` [PATCH 21/22] xfs: scrub realtime bitmap/summary Darrick J. Wong
2017-07-23 18:05 ` Allison Henderson
2017-07-21 4:40 ` [PATCH 22/22] xfs: scrub quota information Darrick J. Wong
2017-07-23 18:07 ` Allison Henderson
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20170725003944.GM4352@magnolia \
--to=darrick.wong@oracle.com \
--cc=david@fromorbit.com \
--cc=linux-xfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).