Re: [BUG] generic/475 recovery failure(s)

From: Brian Foster <bfoster@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [BUG] generic/475 recovery failure(s)
Date: Wed, 16 Jun 2021 16:33:26 -0400	[thread overview]
Message-ID: <YMpgFmEzjpWnmZ66@bfoster> (raw)
In-Reply-To: <20210616070542.GY664593@dread.disaster.area>

On Wed, Jun 16, 2021 at 05:05:42PM +1000, Dave Chinner wrote:
> On Sat, Jun 12, 2021 at 08:33:32AM +1000, Dave Chinner wrote:
> > On Fri, Jun 11, 2021 at 03:02:39PM -0400, Brian Foster wrote:
> > > On Thu, Jun 10, 2021 at 11:14:32AM -0400, Brian Foster wrote:
> > > > Hi all,
> > > > 
> > > > I'm seeing what looks like at least one new generic/475 failure on
> > > > current for-next. (I've seen one related to an attr buffer that seems to
> > > > be older and harder to reproduce.). The test devices are a couple ~15GB
> > > > lvm devices formatted with mkfs defaults. I'm still trying to establish
> > > > reproducibility, but so far a failure seems fairly reliable within ~30
> > > > iterations.
> > > > 
> > > > The first [1] looks like log recovery failure processing an EFI. The
> > > > second variant [2] looks like it passes log recovery, but then fails the
> > > > mount in the COW extent cleanup stage due to a refcountbt problem. I've
> > > > also seen one that looks like the same free space corruption error as
> > > > [1], but triggered via the COW recovery codepath in [2], so these could
> > > > very well be related. A snippet of the dmesg output for each failed
> > > > mount is appended below.
> > > > 
> > > ...
> > > 
> > > A couple updates..
> > > 
> > > First (as noted on irc), the generic/475 failure is not new as I was
> > > able to produce it on vanilla 5.13.0-rc4. I'm not quite sure how far
> > > back that one goes, but Dave noted he's seen it on occasion for some
> > > time.
> > > 
> > > The generic/019 failure I'm seeing does appear to be new as I cannot
> > > reproduce on 5.13.0-rc4. This failure looks more like silent fs
> > > corruption. I.e., the test or log recovery doesn't explicitly fail, but
> > > the post-test xfs_repair check detects corruption. Example xfs_repair
> > > output is appended below (note that 'xfs_repair -n' actually crashes,
> > > while destructive repair seems to work). Since this reproduces fairly
> > > reliably on for-next, I bisected it (while also navigating an unmount
> > > hang that I don't otherwise have data on) down to facd77e4e38b ("xfs:
> > > CIL work is serialised, not pipelined"). From a quick glance at that I'm
> > > not quite sure what the problem is there, just that it doesn't occur
> > > prior to that particular commit.
> > 
> > I suspect that there's an underlying bug in the overlapping CIL
> > commit record sequencing. This commit will be the first time we are
> > actually getting overlapping checkpoints that need ordering via the
> > commit record writes. Hence I suspect what is being seen here is a
> > subtle ordering bug that has been in that code since it was first
> > introduced but never exercised until now..
> > 
> > I haven't had any success in reproducing this yet, I'll keep trying
> > to see if I can get it to trigger so I can look at it in more
> > detail...
> 
> I think I have reproduced this enough to have some idea about what
> is happening here. I has RUI recovery fail converting an unwritten
> extent which was added in a RUI+RUD+rmapbt buffer modification a few
> operations prior to the RUI that failed. Turns out that the on-disk
> buffer had an LSN stamped in it more recent than the LSN of the
> checkpoint being recovered and it all goes downhill from there.
> 
> The problematic checkpoints at the end of the log overlap
> like this:
> 
> Oper (0): tid: 53552074  len: 0  clientid: TRANS  flags: START 
> cycle: 52       version: 2              lsn: 52,13938   tail_lsn: 52,10816
> cycle: 52       version: 2              lsn: 52,14002   tail_lsn: 52,10816
> cycle: 52       version: 2              lsn: 52,14066   tail_lsn: 52,10816
> cycle: 52       version: 2              lsn: 52,14130   tail_lsn: 52,10816
> Oper (27): tid: c960e383  len: 0  clientid: TRANS  flags: START 
> cycle: 52       version: 2              lsn: 52,14194   tail_lsn: 52,10816
> cycle: 52       version: 2              lsn: 52,14258   tail_lsn: 52,10816
> .....
> cycle: 52       version: 2              lsn: 52,15410   tail_lsn: 52,10816
> cycle: 52       version: 2              lsn: 52,15474   tail_lsn: 52,10816
> Oper (29): tid: c960e383  len: 0  clientid: TRANS  flags: COMMIT 
> Oper (30): tid: 53552074  len: 0  clientid: TRANS  flags: COMMIT 
> cycle: 52       version: 2              lsn: 52,15513   tail_lsn: 52,10816
> Oper (0): tid: 22790ee5  len: 0  clientid: TRANS  flags: START 
> cycle: 52       version: 2              lsn: 52,15577   tail_lsn: 52,10816
> cycle: 52       version: 2              lsn: 52,15641   tail_lsn: 52,10816
> cycle: 52       version: 2              lsn: 52,15705   tail_lsn: 52,10816
> cycle: 52       version: 2              lsn: 52,15769   tail_lsn: 52,10816
> 
> You can see the commit records for transaction IDs (tid: xxx)
> 53552074 and c960e383 are ordered differently to the start records.
> The checkpoint "tid: 22790ee5" is the last checkpoint in the log
> which is incomplete and so isn't recovered.
> 
> Now, from this information I don't know whether the commit records
> are correctly ordered as the CIL code is suppposed to do, but what
> has happened is that log recovery tags checkpoints with the start
> lsn of the checkpoint and uses that for recovery. That's the lsn it
> stamps into metadata recovered by that commit.
> 
> So, we recover "tid: c960e383" first with a lsn of 0x340003732,
> then recover "tid: 53552074" with a lsn of 0x3400003632. Hence even
> if the commit records are correctly ordered, log recovery screws up
> the LSN used to recover the checkpoints and hence incorrectly
> recovers buffers and stamps incorrect LSNs into buffers.
> 

Oof. :/

> So, once this is all recovered, we go to handle outstanding intents.
> We start with a RUI that is an rmap conversion, and it fails to find
> the rmap record that it is converting and assert fails. It didn't
> find the record because the rmapbt block modification was not
> replayed because the on-disk rmapbt block was updated in "tid:
> c960e383" and so has a lsn of 0x340003732 stamped in it. And so the
> modification that added the record in "tid: 53552074" was skipped
> because it ran with an LSN less than what was already on disk.
> 
> IOWs, we got a RUI conversion failure because the above two nested
> checkpoints were either replayed out of order or were replayed with
> the incorrect LSNs.
> 
> From here, I'm working as I write this because, well, complex... :/
> 
> Hypothesis: If the two nested checkpoints had their commit records
> written out of order, then what wasn't complete in 53552074 should
> be in c960e383 but was ignored. There were only two RUIs left
> incomplete in 53552074 (i.e. had no RUD+rmapbt buffer mods) and they
> weren't in c960e383. Hence c960e383 does not follow 53552074.
> Indeed, I found both RUDs at the start of checkpoint 22790ee5 (which
> wasn't complete or replayed), hence log recovery was correct to be
> replaying those RUIs.
> 
> So, lets use this same intent -> intent done split across
> checkpoints to verify that 53552074 follows c960e383. So:
> 
> ----------------------------------------------------------------------------
> Oper (26): tid: d3047f93  len: 48  clientid: TRANS  flags: none
> RUI:  #regs: 1  num_extents: 1  id: 0xffff888034f3edb0
> (s: 0x25198c, l: 1, own: 518, off: 2714540, f: 0x20000001) 
> ....
> ----------------------------------------------------------------------------
> Oper (2): tid: 53552074  len: 16  clientid: TRANS  flags: none
> RUD:  #regs: 1                   id: 0xffff8880422f1d90
> .....
> ----------------------------------------------------------------------------
> Oper (24): tid: 53552074  len: 48  clientid: TRANS  flags: none
> RUI:  #regs: 1  num_extents: 1  id: 0xffff8880422f35c0
> (s: 0x107de7, l: 1, own: 521, off: 2854494, f: 0x5) 
> ....
> ----------------------------------------------------------------------------
> Oper (39): tid: c960e383  len: 16  clientid: TRANS  flags: none
> RUD:  #regs: 1                   id: 0xffff888034f3edb0
> ....
> ----------------------------------------------------------------------------
> Oper (28): tid: c960e383  len: 48  clientid: TRANS  flags: none
> RUI:  #regs: 1  num_extents: 1  id: 0xffff8880422f1d90
> (s: 0x31285f, l: 1, own: 520, off: 3195131, f: 0x5) 
> ----------------------------------------------------------------------------
> Oper (29): tid: c960e383  len: 0  clientid: TRANS  flags: COMMIT 
> ----------------------------------------------------------------------------
> Oper (30): tid: 53552074  len: 0  clientid: TRANS  flags: COMMIT 
> ....
> ----------------------------------------------------------------------------
> Oper (0): tid: 22790ee5  len: 0  clientid: TRANS  flags: START 
> ----------------------------------------------------------------------------
> Oper (1): tid: 22790ee5  len: 16  clientid: TRANS  flags: none
> TRAN:     tid: e50e7922  num_items: 876
> ----------------------------------------------------------------------------
> Oper (2): tid: 22790ee5  len: 16  clientid: TRANS  flags: none
> RUD:  #regs: 1                   id: 0xffff8880422f35c0
> ----------------------------------------------------------------------------
> 
> Yup, there's the ordering evidence. The checkpoint sequence is:
> 
> d3047f93 (RUI 0xffff888034f3edb0)
> c960e383 (RUD 0xffff888034f3edb0)
> 
> c960e383 (RUI 0xffff8880422f1d90)
> 53552074 (RUD 0xffff8880422f1d90)
> 
> 53552074 (RUI 0xffff8880422f35c0)
> 22790ee5 (RUD 0xffff8880422f35c0)
> 
> So what we have here is log recovery failing to handle checkpoints
> that -start- out of order in the log because it uses the /start lsn/
> for recovery LSN sequencing, not the commit record LSN. However, it
> uses the commit record ordering for sequencing the recovery of
> checkpoints. The code that uses the start lsn for recvery of commit
> records appears to be:
> 
> STATIC int
> xlog_recover_items_pass2(
>         struct xlog                     *log,
>         struct xlog_recover             *trans,
>         struct list_head                *buffer_list,
>         struct list_head                *item_list)
> {
>         struct xlog_recover_item        *item;
>         int                             error = 0;
> 
>         list_for_each_entry(item, item_list, ri_list) {
>                 trace_xfs_log_recover_item_recover(log, trans, item,
>                                 XLOG_RECOVER_PASS2);
> 
>                 if (item->ri_ops->commit_pass2)
>                         error = item->ri_ops->commit_pass2(log, buffer_list,
> >>>>>>>>>>                               item, trans->r_lsn);
>                 if (error)
>                         return error;
>         }
> 
>         return error;
> }
> 
> trans->r_lsn is the LSN where the start record for the
> commit is found, not the LSN the commit record was found. At run
> time, we do all our ordering off start record LSN because new
> journal writes cannot be allowed to overwrite the start of the
> checkpoint before every item in the checkpoint has been written back
> to disk. If we were to use the commit record LSN, the AIL would
> allow the tail of the log to move over the start of the checkpoint
> before every thing was written.
> 
> However, what is the implication of the AIL having -start lsns- for
> checkpoint sequences out of order? And for log recovery finding the
> head of the log? I think it means we have the potential for
> checkpoint N to be complete in the log an needing recovery, but
> checkpoint N+1 is not complete because we've allowed the tail of the
> log to move past the start of that checkpoint. That will also cause
> corruption log recovery corruption. And I have a sneaking suspiciion
> that this may cause fsync/log force issues as well....
> 
> ----
> 
> Ok, time to step back and think about this for a bit. The good news
> is that this can only happen with pipelined CIL commits, while means
> allowing more than one CIL push to be in progress at once. We can
> avoid this whole problem simply by setting the CIL workqueue back to
> running just a single ordered work at a time. Call that Plan B.
> 
> The bad news is that this is zero day bug, so all kernels out there
> will fail to recovery with out of order start records. Back before
> delayed logging, we could have mulitple transactions commit to the
> journal in just about any order and nesting, and they used exactly
> the same start/commit ordering in recovery that is causing us
> problems now. This wouldn't have been noticed, however, because
> transactions were tiny back then, not huge checkpoints like we run
> now. And if is was, it likely would have been blamed on broken
> storage because it's so ephemeral and there was little test
> infrastructure that exercised these paths... :/
> 
> What this means is that we can't just make a fix to log recovery
> because taking a log from a current kernel and replaying it on an
> older kernel might still go very wrong, and do so silently. So I
> think we have to fix what we write to the log.
> 
> Hence I think the only way forward here is to recognise that the
> AIL, log forces and log recovery all require strictly ordered
> checkpoint start records (and hence LSNs) as well as commit records.
> We already strictly order the commit records (and this analysis
> proves it is working correctly), so we should be able to leverage
> the existing functionality to do this.
> 

Ok. I still need to poke at the code and think about this, but what you
describe here all makes sense. I wonder a bit if we have the option to
fix recovery via some kind of backwards incompatibility, but that
requires more thought (didn't we implement or at least consider
something like a transient feature bit state some time reasonably
recently? I thought we did but I can't quite put my finger on what it
was for or if it landed upstream.).

That train of thought aside, ordering start records a la traditional
commit record ordering seems like a sane option. I am starting to think
that your plan B might actually be a wise plan A, though. I already
wished we weren't pushing so much log rewrite/rework through all in one
cycle. Now it sounds like we require even more significant (in
complexity, if not code churn, as I don't yet have a picture of what
"start record ordering" looks like) functional changes to make an
otherwise isolated change safe. Just my .02.

Brian

> So I think I see a simple way out of this. I'll sleep on it and see
> if if I still think that in the morning...
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
>