From: Dave Chinner <david@fromorbit.com>
To: Amir Goldstein <amir73il@gmail.com>
Cc: Qu Wenruo <quwenruo.btrfs@gmx.com>,
Brian Foster <bfoster@redhat.com>,
fstests <fstests@vger.kernel.org>,
linux-xfs <linux-xfs@vger.kernel.org>, Qu Wenruo <wqu@suse.com>,
Josef Bacik <josef@toxicpanda.com>
Subject: Re: [PATCH] generic: skip dm-log-writes tests on XFS v5 superblock filesystems
Date: Wed, 27 Feb 2019 17:15:29 +1100
Message-ID: <20190227061529.GF16436@dastard>
In-Reply-To: <CAOQ4uxhNdzbSiPREmMtv5_81=7bxCRbTaa_KUN00g7De_j6a4Q@mail.gmail.com>
On Wed, Feb 27, 2019 at 06:49:56AM +0200, Amir Goldstein wrote:
> On Wed, Feb 27, 2019 at 6:19 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> >
> >
> >
> > On 2019/2/27 下午12:06, Amir Goldstein wrote:
> > > On Wed, Feb 27, 2019 at 1:22 AM Dave Chinner <david@fromorbit.com> wrote:
> > >>
> > >> On Tue, Feb 26, 2019 at 11:10:02PM +0200, Amir Goldstein wrote:
> > >>> On Tue, Feb 26, 2019 at 8:14 PM Brian Foster <bfoster@redhat.com> wrote:
> > >>>>
> > >>>> The dm-log-writes mechanism runs a workload against a filesystem,
> > >>>> tracks underlying FUAs and restores the filesystem to various points
> > >>>> in time based on FUA marks. This allows fstests to check fs
> > >>>> consistency at various points and verify log recovery works as
> > >>>> expected.
> > >>>>
> > >>>
> > >>> Inaccurate. generic/482 restores to FUA points.
> > >>> generic/45[57] restore to user defined points in time (marks).
> > >>> dm-log-writes mechanism is capable of restoring either.
> > >>>
> > >>>> This mechanism does not play well with LSN based log recovery
> > >>>> ordering behavior on XFS v5 superblocks, however. For example,
> > >>>> generic/482 can reproduce false positive corruptions based on extent
> > >>>> to btree conversion of an inode if the inode and associated btree
> > >>>> block are written back after different checkpoints. Even though both
> > >>>> items are logged correctly in the extent-to-btree transaction, the
> > >>>> btree block can be relogged (multiple times) and only written back
> > >>>> once when the filesystem unmounts. If the inode was written back
> > >>>> after the initial conversion, recovery points between that mark and
> > >>>> when the btree block is ultimately written back will show corruption
> > >>>> because log recovery sees that the destination buffer is newer than
> > >>>> the recovered buffer and intentionally skips the buffer. This is a
> > >>>> false positive because the destination buffer was resiliently
> > >>>> written back after being physically relogged one or more times.
> > >>>>
> > >>>
> > >>> This story doesn't add up.
> > >>> Either dm-log-writes emulated power failure correctly or it doesn't.
> > >>> My understanding is that the issue you are seeing is a result of
> > >>> XFS seeing "data from the future" after a restore of a power failure
> > >>> snapshot, because the scratch device is not a clean slate.
> > >>> If I am right, then the correct solution is to wipe the journal before
> > >>> starting to replay restore points.
> > >>
> > >> If that is the problem, then I think we should be wiping the entire
> > >> block device before replaying the recorded logwrite.
> > >>
> > >
> > > Indeed.
> >
> > May I ask a stupid question?
> >
> > How does it matter whether the device is clean or not?
> > Shouldn't the journal/metadata or whatever be self-contained?
> >
>
> Yes and no.
>
> The most simple example (not limited to xfs and not sure it is like that in xfs)
> is how you find the last valid journal commit entry. It should have correct CRC
> and the largest LSN. But if you replay IO on top of an existing journal without
> wiping it first, then journal recovery will continue past the point you meant to
> replay, or worse.
No, that's not the problem we have with XFS. The problem is that XFS
will not recover the changes in the log if the object on disk it
would recover into is more recent than the information found in the
log. i.e. it's already been written back and so the journal entry
does not need to be replayed.
IOWs, if the block device is not wiped, the first read of a piece
of a newly allocated and modified metadata object in the log will
see the future state of the object, not whatever was there when the
block was first allocated.
i.e.

                        in memory   in journal   on disk
initial contents        0000        n/a          0000
allocate                0001        n/a          0000
modify, checkpoint 1    0002        0002         0000
modify, checkpoint 2    0003        0003         0000
modify, checkpoint 3    0004        0004         0000
write back              0004        0004         0004
checkpoint 4            0004        n/a          0004

Now when we replay up to checkpoint 1, log recovery will read the
object from disk. If the disk has been zeroed before we replay, the
read in log recovery will see 0000 and replay 0002 over the top,
and all will be good. However, if the device hasn't been zeroed,
recovery will read "0004", which is more recent than 0002, and it
will not replay 0002 because it knows there are future changes to
that object in the journal that will be replayed.
IOWs, stale metadata (from the future) prevents log recovery from
replaying the objects it should be replaying.
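
To make the skip concrete, here is a toy model of the replay decision. It is
a minimal sketch and not XFS code: the Block type, the recover() helper, the
integer LSNs 1..3 standing in for checkpoints 1..3, and the ">=" comparison
are all assumptions made for illustration.

```python
# Toy model of LSN-gated log recovery. Illustration only, not XFS code.
from dataclasses import dataclass


@dataclass
class Block:
    lsn: int   # sequence number stamped in the metadata (0 = zeroed disk)
    data: str  # contents, using the values from the table


# Journal contents for one object: (checkpoint LSN, logged contents).
JOURNAL = [(1, "0002"), (2, "0003"), (3, "0004")]


def recover(disk: Block, journal, upto_lsn: int) -> Block:
    """Replay journal entries with LSN <= upto_lsn onto the on-disk block."""
    for lsn, data in journal:
        if lsn > upto_lsn:
            break  # the simulated power failure happened before this entry
        # Recovery is read-modify-write: the object is read first, and the
        # entry is skipped if the disk already holds the same or newer state.
        if disk.lsn >= lsn:
            continue
        disk = Block(lsn, data)
    return disk


# Replay up to checkpoint 1 on a zeroed device: 0002 is replayed.
zeroed = recover(Block(0, "0000"), JOURNAL, upto_lsn=1)
# Same replay on a device that still holds the written-back 0004.
stale = recover(Block(3, "0004"), JOURNAL, upto_lsn=1)

print(zeroed.data)  # 0002
print(stale.data)   # 0004
```

The second case is the false positive: the object stays at 0004 while
everything else on the device is at the checkpoint 1 state.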
> The problem that Brian describes is more complicated than that and not
> limited to the data in the journal IIUC, but I think what I described above
> may plague also ext4 and xfs v4.
It will affect xfs v4, but we can't detect it at all because we
don't have sequence numbers in the metadata. ext4 is in the same
boat as xfs v4, while btrfs is like XFS v5 with transaction
identifiers in the metadata to indicate when it was written...
> > This "discard everything" assumption doesn't look right to me.
> > Although most mkfs would discard at least part of the device, even
> > without discarding the newly created fs should be self-contained, no
> > wild pointer points to some garbage.
If the block device is in an uninitialised state when we start,
then all bets are off - it is not a "self contained" test because
the initial state is completely unknown. We need to zero the device so
that both creation and each replay that occurs start from the same
initial conditions.
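
As a rough sketch of what "zero before each replay" means, the snippet below
zero-fills a regular file that stands in for the scratch device. The
zero_device() helper and the temporary image file are hypothetical and are
not part of fstests; a real block device would need its size obtained some
other way, since os.path.getsize() is only meaningful for regular files.

```python
# Sketch: reset a stand-in "device" (a regular file) to a known zero state.
import os
import tempfile


def zero_device(path: str, chunk: int = 1 << 20) -> None:
    """Overwrite every byte of the file at `path` with zeros, keeping its size."""
    remaining = os.path.getsize(path)
    zeros = bytes(chunk)
    with open(path, "r+b") as dev:
        while remaining > 0:
            n = min(chunk, remaining)
            dev.write(zeros[:n])
            remaining -= n
        dev.flush()
        os.fsync(dev.fileno())  # make sure the zeros reach stable storage


# Create an image that still holds stale contents "from the future".
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\x04" * 4096)
    image = tmp.name

zero_device(image)

with open(image, "rb") as f:
    contents = f.read()
os.unlink(image)

print(contents == bytes(4096))  # True
```

Doing this before mkfs and again before every replay gives each run the same
starting point, so recovery never reads leftovers from a previous run.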
> > Am I missing something? Or do I get too poisoned by btrfs CoW?
>
> I'd be very surprised if btrfs cannot be flipped by seeing stale data "from
> the future" in the block device. Seems to me like the entire concept of
> CoW and metadata checksums is completely subverted by the existence
> of correct checksums on "stale metadata from the future".
No, this is a journal recovery issue - recovery is a
read-modify-write operation and so if the contents that are read are
stale in a specific way we can be exposed to problems like this.
btrfs is not a journalling filesystem, so it shouldn't be doing RMW
cycles on metadata to bring it into a consistent state during
recovery - it should be doing atomic updates of the tree root to
switch from one consistent state on disk to the next....
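
For contrast with the RMW case, here is a toy sketch of an atomic root
switch. It is not btrfs code: the CowStore class and its stage(), commit()
and mount() methods are hypothetical names, and the model ignores checksums,
generations and everything else a real filesystem does.

```python
# Toy copy-on-write store: new state is written elsewhere, then the root flips.
class CowStore:
    def __init__(self):
        self.blocks = {0: "tree v1"}  # block address -> contents
        self.root = 0                 # stands in for the superblock root pointer
        self.next_free = 1

    def stage(self, data: str) -> int:
        """Write new contents to a fresh block without touching live ones."""
        addr = self.next_free
        self.next_free += 1
        self.blocks[addr] = data
        return addr

    def commit(self, addr: int) -> None:
        """Single pointer update: switch to the new consistent state."""
        self.root = addr

    def mount(self) -> str:
        """Recovery just follows the root; nothing is read and then patched."""
        return self.blocks[self.root]


store = CowStore()
new_root = store.stage("tree v2")
before = store.mount()  # a crash here still sees the old tree
store.commit(new_root)
after = store.mount()

print(before)  # tree v1
print(after)   # tree v2
```

In this model there is no step where recovery reads an object and decides
whether to overwrite it, which is the step that stale contents can confuse.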
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com