From: Duncan <1i5t5.duncan@cox.net>
To: linux-btrfs@vger.kernel.org
Subject: Re: filesystem corruption
Date: Tue, 4 Nov 2014 08:25:21 +0000 (UTC)
Message-ID: <pan$e54ac$a490d88a$2b40e1fe$3abb4a79@cox.net>
In-Reply-To: 20141104043130.GN17395@hungrycats.org
Zygo Blaxell posted on Mon, 03 Nov 2014 23:31:45 -0500 as excerpted:
> On Mon, Nov 03, 2014 at 10:11:18AM -0700, Chris Murphy wrote:
>>
>> On Nov 2, 2014, at 8:43 PM, Zygo Blaxell <zblaxell@furryterror.org>
>> wrote:
>> > btrfs seems to assume the data is correct on both disks (the
>> > generation numbers and checksums are OK) but gets confused by equally
>> > plausible but different metadata on each disk. It doesn't take long
>> > before the filesystem becomes data soup or crashes the kernel.
>>
>> This is a pretty significant problem to still be present, honestly. I
>> can understand the "catchup" mechanism is probably not built yet,
>> but clearly the two devices don't have the same generation. The lower
>> generation device should probably be booted/ignored or declared missing
>> in the meantime to prevent trashing the file system.
>
> The problem with generation numbers is when both devices get divergent
> generation numbers but we can't tell them apart
[snip very reasonable scenario]
> Now we have two disks with equal generation numbers.
> Generations 6..9 on sda are not the same as generations 6..9 on sdb, so
> if we mix the two disks' metadata we get bad confusion.
>
> It needs to be more than a sequential number. If one of the disks
> disappears we need to record this fact on the surviving disks, and also
> cope with _both_ disks claiming to be the "surviving" one.
Zygo's absolutely correct. There is an existing catchup mechanism, but
the tracking is /purely/ sequential generation number based, and if the
two generation sequences diverge, "Welcome to the (data) Twilight Zone!"
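
To make the tie concrete, here is a toy sketch in Python. It is not btrfs code: the Device class, the newer_by_generation helper and the five shared generations before the split are all made up for illustration, and only the "6..9 on each disk" part comes from Zygo's description.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    """Toy stand-in for one raid1 member (hypothetical, not a btrfs structure)."""
    name: str
    generation: int = 0
    history: list = field(default_factory=list)  # one entry per commit

    def commit(self, change: str) -> None:
        self.generation += 1
        self.history.append(change)


def newer_by_generation(a: Device, b: Device):
    """Return the device a purely sequential check would trust, or None on a tie."""
    if a.generation == b.generation:
        return None
    return a if a.generation > b.generation else b


sda, sdb = Device("sda"), Device("sdb")

# Assumed starting point: generations 1..5 were written to both mirrors.
for gen in range(1, 6):
    sda.commit(f"common-{gen}")
    sdb.commit(f"common-{gen}")

# sdb is absent while sda alone commits generations 6..9.
for gen in range(6, 10):
    sda.commit(f"sda-only-{gen}")

# Then sda is absent while sdb alone commits its own generations 6..9.
for gen in range(6, 10):
    sdb.commit(f"sdb-only-{gen}")

print("generations:", sda.generation, sdb.generation)
print("trusted by generation:", newer_by_generation(sda, sdb))
print("histories differ:", sda.history != sdb.history)
```

Both devices end at the same generation, so a comparison on the number alone has nothing to pick from, even though the histories behind the numbers are different.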
I noted this in my own early pre-deployment raid1 mode testing as well,
except that I didn't at that point know about generation numbers and never
got as far as letting the filesystem make data soup of itself.
What I did was this:
1) Create a two-device raid1 data and metadata filesystem, mount it and
stick some data on it.
2) Unmount, pull a device, mount the remaining device degraded.
3) Change a file.
4) Unmount, switch devices, mount the other device degraded.
5) Change the same file in a different/incompatible way.
6) Unmount, plug both devices in again, mount (not degraded).
7) Wait for the sync I was used to from mdraid, which of course didn't
occur.
8) Check the file to see which version showed up. I don't recall which
version it was, but it wasn't the common pre-change version.
9) Unmount, pull each device one at a time, mounting the other one
degraded and checking the file again.
10) The file on each device remained different, without a warning or
indication of any problem at all when I mounted undegraded in 6/7.
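
The steps above can be modelled roughly as follows. This is only a toy in Python, under loud assumptions: Mirror and write_mounted are hypothetical names, zlib.crc32 stands in for the real checksum, and the model assumes each copy is only ever verified against its own checksum. It is meant to show why no warning would be expected, not how btrfs actually reads.

```python
import zlib


class Mirror:
    """Toy model of one raid1 member holding a single small file (hypothetical)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.data = b""
        self.csum = zlib.crc32(self.data)

    def write(self, data: bytes) -> None:
        # Data and checksum are updated together, as on any healthy write.
        self.data = data
        self.csum = zlib.crc32(data)

    def read(self) -> bytes:
        # Only this copy's own checksum is verified; the mate is not consulted.
        if zlib.crc32(self.data) != self.csum:
            raise IOError(f"checksum mismatch on {self.name}")
        return self.data


def write_mounted(present, data: bytes) -> None:
    """Write to whichever members are present in the current mount."""
    for mirror in present:
        mirror.write(data)


dev_a, dev_b = Mirror("devA"), Mirror("devB")

write_mounted([dev_a, dev_b], b"original text\n")  # step 1
write_mounted([dev_a], b"edit made on devA\n")     # steps 2-3, degraded
write_mounted([dev_b], b"edit made on devB\n")     # steps 4-5, degraded

# Steps 6-10: both devices present again, each copy still reads cleanly.
copies = {m.name: m.read() for m in (dev_a, dev_b)}
print(copies)
```

Neither read raises, because each copy is internally consistent. The conflict only exists between the two copies, which nothing in this model ever compares.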
Had I initiated a scrub, presumably it would have seen the difference and
if one was a newer generation, it would have taken it, overwriting the
other. I don't know what it would have done if both were the same
generation, tho the file being small (just a few-line text file, big
enough to test the effect of differing edits), I guess it would take one
version or the other. If the file was large enough to be multiple
extents, however, I've no idea whether it'd take one or the other, or
possibly combine the two, picking extents where they differed more or
less randomly.
By that time, the lack of any warning, or of any definite resolution to one
version or the other, even after mounting undegraded and accessing a file
that had incompatible versions on each of the two devices, was bothering me
enough that I didn't test any further.
Since I have only myself to worry about (unlike a multi-admin corporate
scenario where you can never be /sure/ what the other admins will do
regardless of agreed procedure), I simply set myself some rules very
similar to what Zygo proposed:
1) If for whatever reason I ever split a btrfs raid1 with the intent or
even the possibility of bringing the pieces back together again, if at
all possible, never mount the split pieces writable -- mount read-only.
2) If a writable mount is required, keep the writable mounts to one
device of the split. As long as the other device is never mounted
writable, it will have an older generation when they're reunited and a
scrub should take care of things, reliably resolving to the updated
written device, rewriting the older generation on the other device.
What I'd do here is physically put the removed side of the raid1 in
storage, far enough from the remaining side that I couldn't possibly get
them mixed up. I'd clearly label it as well, creating a "defense in
depth" of at least two, the labeling and the physical separation and
storage of the read-only device.
3) If for whatever reason the originally read-only side must be mounted
writable, very clearly mark the originally mounted-writable device
POISONED/TOXIC!! *NEVER* *EVER* let such a POISONED device anywhere near
its original raid1 mate, until it is wiped, such that there's no
possibility of btrfs getting confused and contaminated with the poisoned
data.
Given how unimpressed I was with btrfs' ability to do the right thing in
such cases, I'd be tempted to wipefs the device, then dd from
/dev/zero to it, then badblocks write-pattern test a couple patterns,
then (if it was a full physical device not just a partition) hardware
secure-erase it, then mkfs it to ext4 or vfat, then dd from /dev/zero to it
again and hardware secure-erase it again, then FINALLY mkfs.btrfs it
again. Of course, it being an ssd, a single mkfs.btrfs would issue a trim and
that should suffice, but I was really REALLY not impressed with btrfs'
ability to reliably do the right thing, and would effectively be tearing
up the schoolbooks (at least the workbooks, since they couldn't be bought
back) and feeding them to the furnace at the end of the year, as I used
to do when I was a kid, not because it made a difference, but because it
was so emotionally rewarding! =:^)
Or maybe I'd make that an excuse to try dban[1].
But I'd probably just dd from /dev/zero or secure-erase it, or badblocks-
write-test a couple patterns if I wanted to badblocks-test it anyway, or
mkfs.btrfs it to get the trim from that.
But I'd have fun doing it. =:^)
And then I'd plug it back in and btrfs replace the missing device.
Anyway, the point is: either don't reintroduce absent devices once split
out of a btrfs raid1; or ensure they don't get written and immediately do
a scrub to update them when reintroduced; or, if they were written and the
other device was too, separately, be sure one of them is wiped (Destroy them
with Lazers![2]) before adding it back with a full btrfs replace, to keep the
remaining device(s) and the data on them healthy. =:^)
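
For anyone who prefers a table to prose, the rule above boils down to something like this Python sketch. The reunion_plan function and its wording are mine, purely illustrative, and the "neither half was written" case is my own assumption since the rules above don't spell it out.

```python
def reunion_plan(kept_was_written: bool, removed_was_written: bool) -> str:
    """Map the write history of a split raid1 pair to the action suggested above.

    Hypothetical helper: it only encodes the rules of thumb from this post.
    """
    if kept_was_written and removed_was_written:
        # Both halves diverged: never let them meet again as-is.
        return "wipe one device, then add it back with btrfs replace"
    if kept_was_written or removed_was_written:
        # Only one half moved on, so the other has the older generation.
        return "reunite, then scrub immediately"
    # Assumption: read-only mounts on both sides leave nothing to resolve.
    return "reunite; neither half changed"


for kept in (False, True):
    for removed in (False, True):
        print(f"kept written={kept}, removed written={removed}: "
              f"{reunion_plan(kept, removed)}")
```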
---
[1] https://www.google.com/search?q=dban
[2] Destroy them with Lazers! by Knife Party
https://www.google.com/search?q=destroy+them+with+lazers
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman