* What is the vision for btrfs fs repair?
@ 2014-10-08 19:11 Eric Sandeen
2014-10-09 11:29 ` Austin S Hemmelgarn
` (3 more replies)
0 siblings, 4 replies; 33+ messages in thread
From: Eric Sandeen @ 2014-10-08 19:11 UTC (permalink / raw)
To: linux-btrfs
I was looking at Marc's post:
http://marc.merlins.org/perso/btrfs/post_2014-03-19_Btrfs-Tips_-Btrfs-Scrub-and-Btrfs-Filesystem-Repair.html
and it feels like there isn't exactly a cohesive, overarching vision for
repair of a corrupted btrfs filesystem.
In other words - I'm an admin cruising along, when the kernel throws some
fs corruption error, or for whatever reason btrfs fails to mount.
What should I do?
Marc lays out several steps, but to me this highlights that there seem to
be a lot of disjoint mechanisms out there to deal with these problems;
mostly from Marc's blog, with some bits of my own:
* btrfs scrub
"Errors are corrected along if possible" (what *is* possible?)
* mount -o recovery
"Enable autorecovery attempts if a bad tree root is found at mount time."
* mount -o degraded
"Allow mounts to continue with missing devices."
(This isn't really a way to recover from corruption, right?)
* btrfs-zero-log
"remove the log tree if log tree is corrupt"
* btrfs rescue
"Recover a damaged btrfs filesystem"
chunk-recover
super-recover
How does this relate to btrfs check?
* btrfs check
"repair a btrfs filesystem"
--repair
--init-csum-tree
--init-extent-tree
How does this relate to btrfs rescue?
* btrfs restore
"try to salvage files from a damaged filesystem"
(not really repair, it's disk-scraping)
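Pulling those invocations together in one place (this just prints the
commands rather than running anything, and it is not a recommended order;
/dev/sdX and /mnt are placeholder names):

```shell
#!/bin/sh
# The commands behind the tools listed above.  plan() only prints each
# invocation, so nothing here touches a disk; /dev/sdX and /mnt are
# placeholders for a real device and mountpoint.
plan() { echo "$*"; }

plan btrfs scrub start /mnt                # online; rewrites bad copies from good ones
plan mount -o recovery /dev/sdX /mnt       # autorecovery of a bad tree root at mount
plan mount -o degraded /dev/sdX /mnt       # mount a raid set with missing devices
plan btrfs-zero-log /dev/sdX               # discard a corrupt log tree
plan btrfs rescue super-recover /dev/sdX   # replace superblock from a backup copy
plan btrfs rescue chunk-recover /dev/sdX   # rebuild the chunk tree
plan btrfs check /dev/sdX                  # offline check, read-only by default
plan btrfs check --repair /dev/sdX         # offline repair attempt
plan btrfs restore /dev/sdX /some/dir      # salvage files to another location
```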
What's the vision for, say, scrub vs. check vs. rescue? Should they repair the
same errors, only online vs. offline? If not, what class of errors does one fix vs.
the other? How would an admin know? Can btrfs check recover a bad tree root
in the same way that mount -o recovery does? How would I know if I should use
--init-*-tree, or chunk-recover, and what are the ramifications of using
these options?
It feels like recovery tools have been badly splintered, and if there's an
overarching design or vision for btrfs fs repair, I can't tell what it is.
Can anyone help me?
Thanks,
-Eric
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: What is the vision for btrfs fs repair?
2014-10-08 19:11 What is the vision for btrfs fs repair? Eric Sandeen
@ 2014-10-09 11:29 ` Austin S Hemmelgarn
2014-10-09 11:53 ` Duncan
2014-10-10 1:58 ` Chris Murphy
` (2 subsequent siblings)
3 siblings, 1 reply; 33+ messages in thread
From: Austin S Hemmelgarn @ 2014-10-09 11:29 UTC (permalink / raw)
To: Eric Sandeen, linux-btrfs
[-- Attachment #1: Type: text/plain, Size: 4907 bytes --]
On 2014-10-08 15:11, Eric Sandeen wrote:
> I was looking at Marc's post:
>
> http://marc.merlins.org/perso/btrfs/post_2014-03-19_Btrfs-Tips_-Btrfs-Scrub-and-Btrfs-Filesystem-Repair.html
>
> and it feels like there isn't exactly a cohesive, overarching vision for
> repair of a corrupted btrfs filesystem.
>
> In other words - I'm an admin cruising along, when the kernel throws some
> fs corruption error, or for whatever reason btrfs fails to mount.
> What should I do?
>
> Marc lays out several steps, but to me this highlights that there seem to
> be a lot of disjoint mechanisms out there to deal with these problems;
> mostly from Marc's blog, with some bits of my own:
>
> * btrfs scrub
> "Errors are corrected along if possible" (what *is* possible?)
> * mount -o recovery
> "Enable autorecovery attempts if a bad tree root is found at mount time."
> * mount -o degraded
> "Allow mounts to continue with missing devices."
> (This isn't really a way to recover from corruption, right?)
> * btrfs-zero-log
> "remove the log tree if log tree is corrupt"
> * btrfs rescue
> "Recover a damaged btrfs filesystem"
> chunk-recover
> super-recover
> How does this relate to btrfs check?
> * btrfs check
> "repair a btrfs filesystem"
> --repair
> --init-csum-tree
> --init-extent-tree
> How does this relate to btrfs rescue?
> * btrfs restore
> "try to salvage files from a damaged filesystem"
> (not really repair, it's disk-scraping)
>
>
> What's the vision for, say, scrub vs. check vs. rescue? Should they repair the
> same errors, only online vs. offline? If not, what class of errors does one fix vs.
> the other? How would an admin know? Can btrfs check recover a bad tree root
> in the same way that mount -o recovery does? How would I know if I should use
> --init-*-tree, or chunk-recover, and what are the ramifications of using
> these options?
>
> It feels like recovery tools have been badly splintered, and if there's an
> overarching design or vision for btrfs fs repair, I can't tell what it is.
> Can anyone help me?
Well, based on my understanding:
* btrfs scrub is intended to be almost exactly equivalent to scrubbing a
RAID volume; that is, it fixes disparity between multiple copies of the
same block. IOW, it isn't really repair per se, but more preventative
maintenance. Currently, it only works for cases where you have multiple
copies of a block (dup, raid1, and raid10 profiles), but support is
planned for error correction of raid5 and raid6 profiles.
* mount -o recovery I don't know much about, but AFAICT, it's more for
dealing with metadata related FS corruption.
* mount -o degraded is used to mount a fs configured for a raid storage
profile with fewer devices than the profile minimum. It's primarily so
that you can get the fs into a state where you can run 'btrfs device
replace'.
* btrfs-zero-log only deals with log tree corruption. This would be
roughly equivalent to zeroing out the journal on an XFS or ext4
filesystem, and should almost never be needed.
* btrfs rescue is intended for low-level recovery from corruption on an
offline fs.
* chunk-recover I'm not entirely sure about, but I believe it's
like scrub for a single chunk on an offline fs.
* super-recover is for dealing with corrupted superblocks, and
tries to replace them with one of the other copies (which hopefully
aren't corrupted).
* btrfs check is intended to (eventually) be equivalent to the fsck
utility for most other filesystems. Currently, it's relatively good at
identifying corruption, but less so at actually fixing it. There are,
however, some things that it won't catch, like a superblock pointing to
a corrupted root tree.
* btrfs restore is essentially disk scraping, but with built-in
knowledge of the filesystem's on-disk structure, which makes it more
reliable than more generic tools like scalpel for files that are too big
to fit in the metadata blocks, and it is pretty much essential for
dealing with transparently compressed files.
In general, my personal procedure for handling a misbehaving BTRFS
filesystem is:
* Run btrfs check on it WITHOUT ANY OTHER OPTIONS to try to identify
what's wrong
* Try mounting it using -o recovery
* Try mounting it using -o ro,recovery
* Use -o degraded only if it's a BTRFS raid set that lost a disk
* If btrfs check AND dmesg both seem to indicate that the log tree is
corrupt, try btrfs-zero-log
* If btrfs check indicated a corrupt superblock, try btrfs rescue
super-recover
* If all of the above fails, ask for advice on the mailing list or IRC
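Written out as a do-nothing sketch (it only prints each step; DEV and MNT
are placeholders, and the riskier steps are deliberately left as comments):

```shell
#!/bin/sh
# Dry-run sketch of the triage order above.  try() prints instead of
# executing; DEV and MNT are placeholder names.
DEV=${DEV:-/dev/sdX}
MNT=${MNT:-/mnt}
try() { echo "would run: $*"; }

try btrfs check "$DEV"                  # 1. identify only, WITHOUT --repair
try mount -o recovery "$DEV" "$MNT"     # 2. recovery mount
try mount -o ro,recovery "$DEV" "$MNT"  # 3. read-only recovery mount
# 4. mount -o degraded          - only for a raid set that lost a disk
# 5. btrfs-zero-log             - only if check AND dmesg implicate the log tree
# 6. btrfs rescue super-recover - only for a corrupt superblock
# 7. still stuck: mailing list or IRC before anything destructive
```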
Also, you should be running btrfs scrub regularly to correct bit-rot and
force remapping of blocks with read errors. While BTRFS technically
handles both transparently on reads, it only corrects things on disk when
you do a scrub.
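For example (an illustration only; the schedule and mountpoint are whatever
suits your setup), a weekly scrub scheduled from cron:

```shell
# /etc/cron.d/btrfs-scrub (example): scrub the fs mounted at /mnt/data
# every Sunday at 03:00.  -B keeps scrub in the foreground so cron
# sees its exit status and mails any errors.
0 3 * * 0  root  btrfs scrub start -B /mnt/data
```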
[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 2455 bytes --]
* Re: What is the vision for btrfs fs repair?
2014-10-09 11:29 ` Austin S Hemmelgarn
@ 2014-10-09 11:53 ` Duncan
2014-10-09 11:55 ` Hugo Mills
` (3 more replies)
0 siblings, 4 replies; 33+ messages in thread
From: Duncan @ 2014-10-09 11:53 UTC (permalink / raw)
To: linux-btrfs
Austin S Hemmelgarn posted on Thu, 09 Oct 2014 07:29:23 -0400 as
excerpted:
> Also, you should be running btrfs scrub regularly to correct bit-rot
> and force remapping of blocks with read errors. While BTRFS
> technically handles both transparently on reads, it only corrects things
> on disk when you do a scrub.
AFAIK that isn't quite correct. Currently, the number of copies is
limited to two, meaning if one of the two is bad, there's a 50% chance of
btrfs reading the good one on first try.
If btrfs reads the good copy, it simply uses it. If btrfs reads the bad
one, it checks the other one and assuming it's good, replaces the bad one
with the good one both for the read (which otherwise errors out), and by
overwriting the bad one.
But here's the rub. The chances of detecting that bad block are
relatively low in most cases. First, the system must try reading it for
some reason, but even then, chances are 50% it'll pick the good one and
won't even notice the bad one.
Thus, while btrfs may randomly bump into a bad block and rewrite it with
the good copy, scrub is the only way to systematically detect and (if
there's a good copy) fix these checksum errors. It's not that btrfs
doesn't do it if it finds them, it's that the chances of finding them are
relatively low, unless you do a scrub, which systematically checks the
entire filesystem (well, other than files marked nocsum, or nocow, which
implies nocsum, or files written when mounted with nodatacow or
nodatasum).
At least that's the way it /should/ work. I guess it's possible that
btrfs isn't doing those routine "bump-into-it-and-fix-it" fixes yet, but
if so, that's the first /I/ remember reading of it.
Other than that detail, what you posted matches my knowledge and
experience, such as it may be as a non-dev list regular, as well.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: What is the vision for btrfs fs repair?
2014-10-09 11:53 ` Duncan
@ 2014-10-09 11:55 ` Hugo Mills
2014-10-09 12:07 ` Austin S Hemmelgarn
` (2 subsequent siblings)
3 siblings, 0 replies; 33+ messages in thread
From: Hugo Mills @ 2014-10-09 11:55 UTC (permalink / raw)
To: Duncan; +Cc: linux-btrfs
On Thu, Oct 09, 2014 at 11:53:23AM +0000, Duncan wrote:
> Austin S Hemmelgarn posted on Thu, 09 Oct 2014 07:29:23 -0400 as
> excerpted:
>
> > Also, you should be running btrfs scrub regularly to correct bit-rot
> > and force remapping of blocks with read errors. While BTRFS
> > technically handles both transparently on reads, it only corrects things
> > on disk when you do a scrub.
>
> AFAIK that isn't quite correct. Currently, the number of copies is
> limited to two, meaning if one of the two is bad, there's a 50% chance of
> btrfs reading the good one on first try.
Scrub checks both copies, though. It's ordinary reads that don't.
Hugo.
> If btrfs reads the good copy, it simply uses it. If btrfs reads the bad
> one, it checks the other one and assuming it's good, replaces the bad one
> with the good one both for the read (which otherwise errors out), and by
> overwriting the bad one.
>
> But here's the rub. The chances of detecting that bad block are
> relatively low in most cases. First, the system must try reading it for
> some reason, but even then, chances are 50% it'll pick the good one and
> won't even notice the bad one.
>
> Thus, while btrfs may randomly bump into a bad block and rewrite it with
> the good copy, scrub is the only way to systematically detect and (if
> there's a good copy) fix these checksum errors. It's not that btrfs
> doesn't do it if it finds them, it's that the chances of finding them are
> relatively low, unless you do a scrub, which systematically checks the
> entire filesystem (well, other than files marked nocsum, or nocow, which
> implies nocsum, or files written when mounted with nodatacow or
> nodatasum).
>
> At least that's the way it /should/ work. I guess it's possible that
> btrfs isn't doing those routine "bump-into-it-and-fix-it" fixes yet, but
> if so, that's the first /I/ remember reading of it.
>
> Other than that detail, what you posted matches my knowledge and
> experience, such as it may be as a non-dev list regular, as well.
>
--
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- Great oxymorons of the world, no. 7: The Simple Truth ---
* Re: What is the vision for btrfs fs repair?
2014-10-09 11:53 ` Duncan
2014-10-09 11:55 ` Hugo Mills
@ 2014-10-09 12:07 ` Austin S Hemmelgarn
2014-10-09 12:12 ` Hugo Mills
[not found] ` <107Y1p00G0wm9Bl0107vjZ>
[not found] ` <0zvr1p0162Q6ekd01zvtN0>
3 siblings, 1 reply; 33+ messages in thread
From: Austin S Hemmelgarn @ 2014-10-09 12:07 UTC (permalink / raw)
To: Duncan, linux-btrfs
On 2014-10-09 07:53, Duncan wrote:
> Austin S Hemmelgarn posted on Thu, 09 Oct 2014 07:29:23 -0400 as
> excerpted:
>
>> Also, you should be running btrfs scrub regularly to correct bit-rot
>> and force remapping of blocks with read errors. While BTRFS
>> technically handles both transparently on reads, it only corrects things
>> on disk when you do a scrub.
>
> AFAIK that isn't quite correct. Currently, the number of copies is
> limited to two, meaning if one of the two is bad, there's a 50% chance of
> btrfs reading the good one on first try.
>
> If btrfs reads the good copy, it simply uses it. If btrfs reads the bad
> one, it checks the other one and assuming it's good, replaces the bad one
> with the good one both for the read (which otherwise errors out), and by
> overwriting the bad one.
>
> But here's the rub. The chances of detecting that bad block are
> relatively low in most cases. First, the system must try reading it for
> some reason, but even then, chances are 50% it'll pick the good one and
> won't even notice the bad one.
>
> Thus, while btrfs may randomly bump into a bad block and rewrite it with
> the good copy, scrub is the only way to systematically detect and (if
> there's a good copy) fix these checksum errors. It's not that btrfs
> doesn't do it if it finds them, it's that the chances of finding them are
> relatively low, unless you do a scrub, which systematically checks the
> entire filesystem (well, other than files marked nocsum, or nocow, which
> implies nocsum, or files written when mounted with nodatacow or
> nodatasum).
>
> At least that's the way it /should/ work. I guess it's possible that
> btrfs isn't doing those routine "bump-into-it-and-fix-it" fixes yet, but
> if so, that's the first /I/ remember reading of it.
I'm not 100% certain, but I believe it doesn't actually fix things on
disk when it detects an error during a read. I know it doesn't if the fs
is mounted ro (even if the media is writable), because I did some
testing to see how 'read-only' mounting a btrfs filesystem really is.
Also, that's a much better description of how multiple copies work than
I could probably have ever given.
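The kind of experiment I mean can be reproduced on a scratch image along
these lines (a sketch of my approach, not a canned test: it assumes root
and mkfs.btrfs are available, skips itself otherwise, and just reports
whether the image bytes changed across a ro mount):

```shell
#!/bin/sh
# Compare an image's bytes before and after a read-only mount to see
# whether the ro mount wrote anything.  Skips itself when it can't run.
if [ "$(id -u)" -ne 0 ] || ! command -v mkfs.btrfs >/dev/null 2>&1; then
    result="skipped (needs root and btrfs-progs)"
else
    img=$(mktemp)
    mnt=$(mktemp -d)
    truncate -s 1G "$img"
    mkfs.btrfs -q "$img"
    before=$(sha256sum "$img" | cut -d' ' -f1)
    if mount -o loop,ro "$img" "$mnt" 2>/dev/null; then
        ls -R "$mnt" >/dev/null 2>&1   # trigger reads (and any atime updates)
        umount "$mnt"
        after=$(sha256sum "$img" | cut -d' ' -f1)
        if [ "$before" = "$after" ]; then
            result="image unchanged by ro mount"
        else
            result="image was written to during ro mount"
        fi
    else
        result="skipped (loop mount not permitted here)"
    fi
    rm -f "$img"; rmdir "$mnt"
fi
echo "$result"
```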
* Re: What is the vision for btrfs fs repair?
2014-10-09 12:07 ` Austin S Hemmelgarn
@ 2014-10-09 12:12 ` Hugo Mills
2014-10-09 12:32 ` Austin S Hemmelgarn
0 siblings, 1 reply; 33+ messages in thread
From: Hugo Mills @ 2014-10-09 12:12 UTC (permalink / raw)
To: Austin S Hemmelgarn; +Cc: Duncan, linux-btrfs
On Thu, Oct 09, 2014 at 08:07:51AM -0400, Austin S Hemmelgarn wrote:
> On 2014-10-09 07:53, Duncan wrote:
> >Austin S Hemmelgarn posted on Thu, 09 Oct 2014 07:29:23 -0400 as
> >excerpted:
> >
> >>Also, you should be running btrfs scrub regularly to correct bit-rot
> >>and force remapping of blocks with read errors. While BTRFS
> >>technically handles both transparently on reads, it only corrects things
> >>on disk when you do a scrub.
> >
> >AFAIK that isn't quite correct. Currently, the number of copies is
> >limited to two, meaning if one of the two is bad, there's a 50% chance of
> >btrfs reading the good one on first try.
> >
> >If btrfs reads the good copy, it simply uses it. If btrfs reads the bad
> >one, it checks the other one and assuming it's good, replaces the bad one
> >with the good one both for the read (which otherwise errors out), and by
> >overwriting the bad one.
> >
> >But here's the rub. The chances of detecting that bad block are
> >relatively low in most cases. First, the system must try reading it for
> >some reason, but even then, chances are 50% it'll pick the good one and
> >won't even notice the bad one.
> >
> >Thus, while btrfs may randomly bump into a bad block and rewrite it with
> >the good copy, scrub is the only way to systematically detect and (if
> >there's a good copy) fix these checksum errors. It's not that btrfs
> >doesn't do it if it finds them, it's that the chances of finding them are
> >relatively low, unless you do a scrub, which systematically checks the
> >entire filesystem (well, other than files marked nocsum, or nocow, which
> >implies nocsum, or files written when mounted with nodatacow or
> >nodatasum).
> >
> >At least that's the way it /should/ work. I guess it's possible that
> >btrfs isn't doing those routine "bump-into-it-and-fix-it" fixes yet, but
> >if so, that's the first /I/ remember reading of it.
>
> I'm not 100% certain, but I believe it doesn't actually fix things on disk
> when it detects an error during a read,
I'm fairly sure it does, as I've had it happen to me. :)
> I know it doesn't if the fs is
> mounted ro (even if the media is writable), because I did some testing to
> see how 'read-only' mounting a btrfs filesystem really is.
If the FS is RO, then yes, it won't fix things.
Hugo.
--
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- Great films about cricket: Interview with the Umpire ---
* Re: What is the vision for btrfs fs repair?
2014-10-09 12:12 ` Hugo Mills
@ 2014-10-09 12:32 ` Austin S Hemmelgarn
0 siblings, 0 replies; 33+ messages in thread
From: Austin S Hemmelgarn @ 2014-10-09 12:32 UTC (permalink / raw)
To: Hugo Mills, Duncan, linux-btrfs
On 2014-10-09 08:12, Hugo Mills wrote:
> On Thu, Oct 09, 2014 at 08:07:51AM -0400, Austin S Hemmelgarn wrote:
>> On 2014-10-09 07:53, Duncan wrote:
>>> Austin S Hemmelgarn posted on Thu, 09 Oct 2014 07:29:23 -0400 as
>>> excerpted:
>>>
>>>> Also, you should be running btrfs scrub regularly to correct bit-rot
>>>> and force remapping of blocks with read errors. While BTRFS
>>>> technically handles both transparently on reads, it only corrects things
>>>> on disk when you do a scrub.
>>>
>>> AFAIK that isn't quite correct. Currently, the number of copies is
>>> limited to two, meaning if one of the two is bad, there's a 50% chance of
>>> btrfs reading the good one on first try.
>>>
>>> If btrfs reads the good copy, it simply uses it. If btrfs reads the bad
>>> one, it checks the other one and assuming it's good, replaces the bad one
>>> with the good one both for the read (which otherwise errors out), and by
>>> overwriting the bad one.
>>>
>>> But here's the rub. The chances of detecting that bad block are
>>> relatively low in most cases. First, the system must try reading it for
>>> some reason, but even then, chances are 50% it'll pick the good one and
>>> won't even notice the bad one.
>>>
>>> Thus, while btrfs may randomly bump into a bad block and rewrite it with
>>> the good copy, scrub is the only way to systematically detect and (if
>>> there's a good copy) fix these checksum errors. It's not that btrfs
>>> doesn't do it if it finds them, it's that the chances of finding them are
>>> relatively low, unless you do a scrub, which systematically checks the
>>> entire filesystem (well, other than files marked nocsum, or nocow, which
>>> implies nocsum, or files written when mounted with nodatacow or
>>> nodatasum).
>>>
>>> At least that's the way it /should/ work. I guess it's possible that
>>> btrfs isn't doing those routine "bump-into-it-and-fix-it" fixes yet, but
>>> if so, that's the first /I/ remember reading of it.
>>
>> I'm not 100% certain, but I believe it doesn't actually fix things on disk
>> when it detects an error during a read,
>
> I'm fairly sure it does, as I've had it happen to me. :)
I probably just misinterpreted the source code; while I know enough C to
generally understand things, I'm by far no expert.
>
>> I know it doesn't if the fs is
>> mounted ro (even if the media is writable), because I did some testing to
>> see how 'read-only' mounting a btrfs filesystem really is.
>
> If the FS is RO, then yes, it won't fix things.
>
> Hugo.
>
* Re: What is the vision for btrfs fs repair?
[not found] ` <107Y1p00G0wm9Bl0107vjZ>
@ 2014-10-09 12:34 ` Duncan
2014-10-09 13:18 ` Austin S Hemmelgarn
0 siblings, 1 reply; 33+ messages in thread
From: Duncan @ 2014-10-09 12:34 UTC (permalink / raw)
To: Austin S Hemmelgarn; +Cc: linux-btrfs
On Thu, 09 Oct 2014 08:07:51 -0400
Austin S Hemmelgarn <ahferroin7@gmail.com> wrote:
> On 2014-10-09 07:53, Duncan wrote:
> > Austin S Hemmelgarn posted on Thu, 09 Oct 2014 07:29:23 -0400 as
> > excerpted:
> >
> >> Also, you should be running btrfs scrub regularly to correct
> >> bit-rot and force remapping of blocks with read errors. While
> >> BTRFS technically handles both transparently on reads, it only
> >> corrects things on disk when you do a scrub.
> >
> > AFAIK that isn't quite correct. Currently, the number of copies is
> > limited to two, meaning if one of the two is bad, there's a 50%
> > chance of btrfs reading the good one on first try.
> >
> > If btrfs reads the good copy, it simply uses it. If btrfs reads
> > the bad one, it checks the other one and assuming it's good,
> > replaces the bad one with the good one both for the read (which
> > otherwise errors out), and by overwriting the bad one.
> >
> > But here's the rub. The chances of detecting that bad block are
> > relatively low in most cases. First, the system must try reading
> > it for some reason, but even then, chances are 50% it'll pick the
> > good one and won't even notice the bad one.
> >
> > Thus, while btrfs may randomly bump into a bad block and rewrite it
> > with the good copy, scrub is the only way to systematically detect
> > and (if there's a good copy) fix these checksum errors. It's not
> > that btrfs doesn't do it if it finds them, it's that the chances of
> > finding them are relatively low, unless you do a scrub, which
> > systematically checks the entire filesystem (well, other than files
> > marked nocsum, or nocow, which implies nocsum, or files written
> > when mounted with nodatacow or nodatasum).
> >
> > At least that's the way it /should/ work. I guess it's possible
> > that btrfs isn't doing those routine "bump-into-it-and-fix-it"
> > fixes yet, but if so, that's the first /I/ remember reading of it.
>
> I'm not 100% certain, but I believe it doesn't actually fix things on
> disk when it detects an error during a read. I know it doesn't if the
> fs is mounted ro (even if the media is writable), because I did some
> testing to see how 'read-only' mounting a btrfs filesystem really is.
Definitely it won't with a read-only mount. But then scrub shouldn't
be able to write to a read-only mount either. The only way a read-only
mount should be writable is if it's mounted (bind-mounted or
btrfs-subvolume-mounted) read-write elsewhere, and the write occurs to
that mount, not the read-only mounted location.
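A concrete instance of that situation (a sketch using tmpfs rather than
btrfs, since the bind-mount semantics are the same for any filesystem;
needs root and skips itself otherwise):

```shell
#!/bin/sh
# One filesystem, two views: read-write at $rw, read-only at $ro via a
# ro bind mount.  Writes through the rw view succeed and are visible
# through the ro view; writes through the ro view are refused.
if [ "$(id -u)" -ne 0 ]; then
    result="skipped (needs root)"
else
    rw=$(mktemp -d); ro=$(mktemp -d)
    if ! mount -t tmpfs tmpfs "$rw" 2>/dev/null; then
        result="skipped (mounting not permitted here)"
    else
        mount --bind "$rw" "$ro"
        mount -o remount,bind,ro "$ro"      # classic two-step ro bind mount
        echo hello > "$rw/file"             # write via the rw view
        if sh -c "echo nope > \"$ro/file\"" 2>/dev/null; then
            result="unexpected: ro bind mount accepted a write"
        else
            result="ro write refused; ro view shows: $(cat "$ro/file")"
        fi
        umount "$ro"; umount "$rw"
    fi
    rmdir "$ro" "$rw" 2>/dev/null
fi
echo "$result"
```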
There's even debate about replaying the journal or doing orphan-delete
on read-only mounts (at least on-media, the change could, and arguably
should, occur in RAM and be cached, marking the cache "dirty" at the
same time so it's appropriately flushed if/when the filesystem goes
writable), with some arguing read-only means just that, don't
write /anything/ to it until it's read-write mounted.
But writable-mounted, detected checksum errors (with a good copy
available) should be rewritten as far as I know. If not, I'd call it
a bug. The problem is in the detection, not in the rewriting. Scrub's
the only way to reliably detect these errors since it's the only thing
that systematically checks /everything/.
> Also, that's a much better description of how multiple copies work
> than I could probably have ever given.
Thanks. =:^)
--
Duncan - No HTML messages please, as they are filtered as spam.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: What is the vision for btrfs fs repair?
[not found] ` <0zvr1p0162Q6ekd01zvtN0>
@ 2014-10-09 12:42 ` Duncan
0 siblings, 0 replies; 33+ messages in thread
From: Duncan @ 2014-10-09 12:42 UTC (permalink / raw)
To: Hugo Mills; +Cc: linux-btrfs
On Thu, 9 Oct 2014 12:55:50 +0100
Hugo Mills <hugo@carfax.org.uk> wrote:
> On Thu, Oct 09, 2014 at 11:53:23AM +0000, Duncan wrote:
> > Austin S Hemmelgarn posted on Thu, 09 Oct 2014 07:29:23 -0400 as
> > excerpted:
> >
> > > Also, you should be running btrfs scrub regularly to correct
> > > bit-rot and force remapping of blocks with read errors. While
> > > BTRFS technically handles both transparently on reads, it only
> > > corrects things on disk when you do a scrub.
> >
> > AFAIK that isn't quite correct. Currently, the number of copies is
> > limited to two, meaning if one of the two is bad, there's a 50%
> > chance of btrfs reading the good one on first try.
>
> Scrub checks both copies, though. It's ordinary reads that don't.
While I believe I was clear in full context (see below), agreed. I was
talking about normal reads in the above, not scrub, as the full quote
should make clear. I guess I could have made it clearer in the
immediate context, however. Thanks.
> > Thus, while btrfs may randomly bump into a bad block and rewrite it
> > with the good copy, scrub is the only way to systematically detect
> > and (if there's a good copy) fix these checksum errors.
--
Duncan - No HTML messages please, as they are filtered as spam.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: What is the vision for btrfs fs repair?
2014-10-09 12:34 ` Duncan
@ 2014-10-09 13:18 ` Austin S Hemmelgarn
2014-10-09 13:49 ` Duncan
0 siblings, 1 reply; 33+ messages in thread
From: Austin S Hemmelgarn @ 2014-10-09 13:18 UTC (permalink / raw)
To: Duncan; +Cc: linux-btrfs
On 2014-10-09 08:34, Duncan wrote:
> On Thu, 09 Oct 2014 08:07:51 -0400
> Austin S Hemmelgarn <ahferroin7@gmail.com> wrote:
>
>> On 2014-10-09 07:53, Duncan wrote:
>>> Austin S Hemmelgarn posted on Thu, 09 Oct 2014 07:29:23 -0400 as
>>> excerpted:
>>>
>>>> Also, you should be running btrfs scrub regularly to correct
>>>> bit-rot and force remapping of blocks with read errors. While
>>>> BTRFS technically handles both transparently on reads, it only
>>>> corrects things on disk when you do a scrub.
>>>
>>> AFAIK that isn't quite correct. Currently, the number of copies is
>>> limited to two, meaning if one of the two is bad, there's a 50%
>>> chance of btrfs reading the good one on first try.
>>>
>>> If btrfs reads the good copy, it simply uses it. If btrfs reads
>>> the bad one, it checks the other one and assuming it's good,
>>> replaces the bad one with the good one both for the read (which
>>> otherwise errors out), and by overwriting the bad one.
>>>
>>> But here's the rub. The chances of detecting that bad block are
>>> relatively low in most cases. First, the system must try reading
>>> it for some reason, but even then, chances are 50% it'll pick the
>>> good one and won't even notice the bad one.
>>>
>>> Thus, while btrfs may randomly bump into a bad block and rewrite it
>>> with the good copy, scrub is the only way to systematically detect
>>> and (if there's a good copy) fix these checksum errors. It's not
>>> that btrfs doesn't do it if it finds them, it's that the chances of
>>> finding them are relatively low, unless you do a scrub, which
>>> systematically checks the entire filesystem (well, other than files
>>> marked nocsum, or nocow, which implies nocsum, or files written
>>> when mounted with nodatacow or nodatasum).
>>>
>>> At least that's the way it /should/ work. I guess it's possible
>>> that btrfs isn't doing those routine "bump-into-it-and-fix-it"
>>> fixes yet, but if so, that's the first /I/ remember reading of it.
>>
>> I'm not 100% certain, but I believe it doesn't actually fix things on
>> disk when it detects an error during a read. I know it doesn't if the
>> fs is mounted ro (even if the media is writable), because I did some
>> testing to see how 'read-only' mounting a btrfs filesystem really is.
>
> Definitely it won't with a read-only mount. But then scrub shouldn't
> be able to write to a read-only mount either. The only way a read-only
> mount should be writable is if it's mounted (bind-mounted or
> btrfs-subvolume-mounted) read-write elsewhere, and the write occurs to
> that mount, not the read-only mounted location.
In theory yes, but there are caveats to this, namely:
* atime updates still happen unless you have mounted the fs with noatime
* The superblock gets updated if there are 'any' writes
* The free space cache 'might' be updated if there are any writes
All in all, a BTRFS filesystem mounted ro is much more read-only than,
say, ext4 (which at least updates the sb, and old versions replayed the
journal, in addition to the atime updates).
>
> There's even debate about replaying the journal or doing orphan-delete
> on read-only mounts (at least on-media, the change could, and arguably
> should, occur in RAM and be cached, marking the cache "dirty" at the
> same time so it's appropriately flushed if/when the filesystem goes
> writable), with some arguing read-only means just that, don't
> write /anything/ to it until it's read-write mounted.
>
> But writable-mounted, detected checksum errors (with a good copy
> available) should be rewritten as far as I know. If not, I'd call it
> a bug. The problem is in the detection, not in the rewriting. Scrub's
> the only way to reliably detect these errors since it's the only thing
> that systematically checks /everything/.
>
>> Also, that's a much better description of how multiple copies work
>> than I could probably have ever given.
>
> Thanks. =:^)
>
* Re: What is the vision for btrfs fs repair?
2014-10-09 13:18 ` Austin S Hemmelgarn
@ 2014-10-09 13:49 ` Duncan
2014-10-09 15:44 ` Eric Sandeen
0 siblings, 1 reply; 33+ messages in thread
From: Duncan @ 2014-10-09 13:49 UTC (permalink / raw)
To: linux-btrfs
Austin S Hemmelgarn posted on Thu, 09 Oct 2014 09:18:22 -0400 as
excerpted:
> On 2014-10-09 08:34, Duncan wrote:
>> The only way a read-only
>> mount should be writable is if it's mounted (bind-mounted or
>> btrfs-subvolume-mounted) read-write elsewhere, and the write occurs to
>> that mount, not the read-only mounted location.
> In theory yes, but there are caveats to this, namely:
> * atime updates still happen unless you have mounted the fs with noatime
I've been mounting noatime for well over a decade now, exactly due to
such problems. But I believe at least /some/ filesystems are truly read-
only when they're mounted as such, and atime updates don't happen on them.
These days I actually apply a patch that changes the default relatime to
noatime, so I don't even have to have it in my mount-options. =:^)
> * The superblock gets updated if there are 'any' writes
Yeah. In theory, though, there shouldn't be any. As I said, even
journal replay and orphan delete shouldn't hit media, although handling
them in memory and dirtying the cache, so that they get written if the
filesystem is ever remounted read-write, is reasonable.
> * The free space cache 'might' be updated if there are any writes
Makes sense. But of course that's what I'm arguing, there shouldn't /be/
any writes. Read-only should mean exactly that, don't touch media,
period.
I remember at one point activating an mdraid1 degraded, read-only, just a
single device of the 4-way raid1 I was running at the time, to recover
data from it after the system it was running in died. The idea was don't
write to the device at all, because I was still testing the new system,
and in case I decided to try to reassemble the raid at some point. Read-
only really NEEDS to be read-only, under such conditions.
Similarly for forensic examination, of course. If there's a write, any
write, it's evidence tampering. Read-only needs to MEAN read-only!
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
* Re: What is the vision for btrfs fs repair?
2014-10-09 13:49 ` Duncan
@ 2014-10-09 15:44 ` Eric Sandeen
0 siblings, 0 replies; 33+ messages in thread
From: Eric Sandeen @ 2014-10-09 15:44 UTC (permalink / raw)
To: Duncan, linux-btrfs
On 10/9/14 8:49 AM, Duncan wrote:
> Austin S Hemmelgarn posted on Thu, 09 Oct 2014 09:18:22 -0400 as
> excerpted:
>
>> On 2014-10-09 08:34, Duncan wrote:
>
>>> The only way a read-only
>>> mount should be writable is if it's mounted (bind-mounted or
>>> btrfs-subvolume-mounted) read-write elsewhere, and the write occurs to
>>> that mount, not the read-only mounted location.
>
>> In theory yes, but there are caveats to this, namely:
>> * atime updates still happen unless you have mounted the fs with noatime
Getting off the topic a bit, but that really shouldn't happen:
#define IS_NOATIME(inode) __IS_FLG(inode, MS_RDONLY|MS_NOATIME)
and in touch_atime():
	if (IS_NOATIME(inode))
		return;
-Eric
* Re: What is the vision for btrfs fs repair?
2014-10-08 19:11 What is the vision for btrfs fs repair? Eric Sandeen
2014-10-09 11:29 ` Austin S Hemmelgarn
@ 2014-10-10 1:58 ` Chris Murphy
2014-10-10 3:20 ` Duncan
` (2 more replies)
2014-10-12 10:17 ` Martin Steigerwald
2014-10-13 21:09 ` Josef Bacik
3 siblings, 3 replies; 33+ messages in thread
From: Chris Murphy @ 2014-10-10 1:58 UTC (permalink / raw)
To: Eric Sandeen; +Cc: linux-btrfs
On Oct 8, 2014, at 3:11 PM, Eric Sandeen <sandeen@redhat.com> wrote:
> I was looking at Marc's post:
>
> http://marc.merlins.org/perso/btrfs/post_2014-03-19_Btrfs-Tips_-Btrfs-Scrub-and-Btrfs-Filesystem-Repair.html
>
> and it feels like there isn't exactly a cohesive, overarching vision for
> repair of a corrupted btrfs filesystem.
It's definitely confusing compared to any other filesystem I've used on four different platforms. And that's excluding disk scraping and the functions unique to multiple-device volumes: scrubs and degraded mounts.
To be fair, mdadm doesn't even have a scrub command; it's done via 'echo check > /sys/block/mdX/md/sync_action'. And meanwhile LVM has pvck, vgck, and for scrubs it's 'lvchange --syncaction {check|repair}'. These are also completely non-obvious.
> * mount -o recovery
> "Enable autorecovery attempts if a bad tree root is found at mount time."
I'm confused why it's not the default yet. Maybe it's continuing to evolve at a pace that suggests something could sneak in that makes things worse? It is almost an oxymoron in that I'm manually enabling an autorecovery.
If true, maybe the closest indication we'd get of btrfs stability is the default enabling of autorecovery.
> * btrfs-zero-log
> "remove the log tree if log tree is corrupt"
> * btrfs rescue
> "Recover a damaged btrfs filesystem"
> chunk-recover
> super-recover
> How does this relate to btrfs check?
> * btrfs check
> "repair a btrfs filesystem"
> --repair
> --init-csum-tree
> --init-extent-tree
> How does this relate to btrfs rescue?
These three flags translate into eight combinations of repairs; adding -o recovery makes nine. I think this is the main source of confusion: there are just too many options, and it's completely non-obvious which one to use in which situation.
My expectation is that eventually these get consolidated into just check and check --repair. As the repair code matures, it'd move into the kernel autorecovery code. That's a guess on my part, but it's consistent with the design goals.
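Chris's arithmetic can be sanity-checked with a quick sketch (illustrative only; these strings are built for counting, not validated btrfs invocations):

```python
from itertools import combinations

# The three repair-related btrfs check flags from the list above.
flags = ["--repair", "--init-csum-tree", "--init-extent-tree"]

# Every subset of the flags (including none) is a distinct invocation:
# 2^3 = 8 combinations.
subsets = [c for r in range(len(flags) + 1) for c in combinations(flags, r)]
print(len(subsets))  # 8

# Counting "mount -o recovery" as one more repair path gives 9.
repair_paths = [("btrfs check " + " ".join(s)).strip() for s in subsets]
repair_paths.append("mount -o recovery")
print(len(repair_paths))  # 9
```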
> It feels like recovery tools have been badly splintered, and if there's an
> overarching design or vision for btrfs fs repair, I can't tell what it is.
> Can anyone help me?
I suspect it's unintended splintering, and is an artifact that will go away. I'd rather the convoluted, fractured nature of repair go away before the scary experimental warnings do.
Chris Murphy
* Re: What is the vision for btrfs fs repair?
2014-10-10 1:58 ` Chris Murphy
@ 2014-10-10 3:20 ` Duncan
2014-10-10 10:53 ` Bob Marley
2014-10-12 10:06 ` Martin Steigerwald
2 siblings, 0 replies; 33+ messages in thread
From: Duncan @ 2014-10-10 3:20 UTC (permalink / raw)
To: linux-btrfs
Chris Murphy posted on Thu, 09 Oct 2014 21:58:53 -0400 as excerpted:
> I suspect it's unintended splintering, and is an artifact that will go
> away. I'd rather the convoluted, fractured nature of repair go away
> before the scary experimental warnings do.
Heh, agreed with everything[1], but it's too late for that: the
experimental warnings have been peeled off, while the experimental, or at
least horribly immature, /behavior/ remains. =:^(
---
[1] ... and a much more logically cohesive and well structured reply than
I could have managed as my own thoughts simply weren't that well
organized.
* Re: What is the vision for btrfs fs repair?
2014-10-10 1:58 ` Chris Murphy
2014-10-10 3:20 ` Duncan
@ 2014-10-10 10:53 ` Bob Marley
2014-10-10 10:59 ` Roman Mamedov
` (2 more replies)
2014-10-12 10:06 ` Martin Steigerwald
2 siblings, 3 replies; 33+ messages in thread
From: Bob Marley @ 2014-10-10 10:53 UTC (permalink / raw)
To: linux-btrfs
On 10/10/2014 03:58, Chris Murphy wrote:
>
>> * mount -o recovery
>> "Enable autorecovery attempts if a bad tree root is found at mount time."
> I'm confused why it's not the default yet. Maybe it's continuing to evolve at a pace that suggests something could sneak in that makes things worse? It is almost an oxymoron in that I'm manually enabling an autorecovery.
>
> If true, maybe the closest indication we'd get of btrfs stability is the default enabling of autorecovery.
No way!
I wouldn't want a default like that.
If you think about distributed transactions: suppose a sync was issued on
both sides of a distributed transaction, then power was lost on one
side, and then btrfs had corruption. When I remount it, definitely the worst
thing that can happen is that it auto-rolls-back to a previous
known-good state.
Now if I can express wishes:
I would like an option that spits out all the usable tree roots (or
what's the name, superblocks?) and not just the newest one which is
corrupt. And then another option that lets me mount *readonly* starting
from the tree root I specify. So I can check how much of the data is
still there. After I decide that such a tree root is good, I need another
option that lets me mount with that tree root in readwrite mode, obviously
eliminating all tree roots newer than it.
Some time ago I read that mounting the filesystem with an earlier tree
root was possible, but only by manually erasing the disk regions in
which the newer superblocks are. This is crazy; it's too risky on too
many levels, and, as I wrote, I want to check what data is available
under a certain tree root before mounting readwrite from it.
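For what it's worth, pieces of this wish already exist: btrfs-find-root lists candidate tree roots, and 'btrfs restore -t <bytenr> -D' does a read-only dry run against a specific root. A rough sketch of gluing them together (the sample output below is an assumed, abbreviated format that varies between btrfs-progs versions, and /dev/sdX is a placeholder):

```python
import re

# Assumed, abbreviated btrfs-find-root-style output; treat this sample
# and its exact wording as hypothetical.
sample_output = """\
Well block 29360128 seems great, but generation doesn't match, have=7, want=9
Well block 29558784 seems great, but generation doesn't match, have=8, want=9
Found tree root at 30408704
"""

# Collect every candidate tree-root byte number mentioned in the output.
candidates = [int(n) for n in
              re.findall(r"(?:block|root at) (\d+)", sample_output)]

# For each candidate, print a read-only (dry-run) salvage attempt that an
# admin could run by hand before committing to any particular root.
for bytenr in candidates:
    print(f"btrfs restore -t {bytenr} -D /dev/sdX /tmp/out")
```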
* Re: What is the vision for btrfs fs repair?
2014-10-10 10:53 ` Bob Marley
@ 2014-10-10 10:59 ` Roman Mamedov
2014-10-10 11:12 ` Bob Marley
2014-10-10 14:37 ` Chris Murphy
2014-10-11 7:29 ` Goffredo Baroncelli
2 siblings, 1 reply; 33+ messages in thread
From: Roman Mamedov @ 2014-10-10 10:59 UTC (permalink / raw)
To: Bob Marley; +Cc: linux-btrfs
[-- Attachment #1: Type: text/plain, Size: 1058 bytes --]
On Fri, 10 Oct 2014 12:53:38 +0200
Bob Marley <bobmarley@shiftmail.org> wrote:
> On 10/10/2014 03:58, Chris Murphy wrote:
> >
> >> * mount -o recovery
> >> "Enable autorecovery attempts if a bad tree root is found at mount time."
> > > I'm confused why it's not the default yet. Maybe it's continuing to evolve at a pace that suggests something could sneak in that makes things worse? It is almost an oxymoron in that I'm manually enabling an autorecovery.
> > >
> > > If true, maybe the closest indication we'd get of btrfs stability is the default enabling of autorecovery.
>
> No way!
> I wouldn't want a default like that.
>
> If you think about distributed transactions: suppose a sync was issued on
> both sides of a distributed transaction, then power was lost on one
> side
What distributed transactions? Btrfs is not a clustered filesystem[1], it does
not support and likely will never support being mounted from multiple hosts at
the same time.
[1]http://en.wikipedia.org/wiki/Clustered_file_system
--
With respect,
Roman
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 198 bytes --]
* Re: What is the vision for btrfs fs repair?
2014-10-10 10:59 ` Roman Mamedov
@ 2014-10-10 11:12 ` Bob Marley
2014-10-10 15:18 ` cwillu
0 siblings, 1 reply; 33+ messages in thread
From: Bob Marley @ 2014-10-10 11:12 UTC (permalink / raw)
To: linux-btrfs
On 10/10/2014 12:59, Roman Mamedov wrote:
> On Fri, 10 Oct 2014 12:53:38 +0200
> Bob Marley <bobmarley@shiftmail.org> wrote:
>
>> On 10/10/2014 03:58, Chris Murphy wrote:
>>>> * mount -o recovery
>>>> "Enable autorecovery attempts if a bad tree root is found at mount time."
>>> I'm confused why it's not the default yet. Maybe it's continuing to evolve at a pace that suggests something could sneak in that makes things worse? It is almost an oxymoron in that I'm manually enabling an autorecovery.
>>>
>>> If true, maybe the closest indication we'd get of btrfs stability is the default enabling of autorecovery.
>> No way!
>> I wouldn't want a default like that.
>>
>> If you think about distributed transactions: suppose a sync was issued on
>> both sides of a distributed transaction, then power was lost on one
>> side
> What distributed transactions? Btrfs is not a clustered filesystem[1], it does
> not support and likely will never support being mounted from multiple hosts at
> the same time.
>
> [1]http://en.wikipedia.org/wiki/Clustered_file_system
>
This is not the only way to do a distributed transaction.
Databases can be hosted on the filesystem, and those can do distributed
transactions.
Think of two bank accounts, one in a database on btrfs fs1 here, and the
other in a database on whatever filesystem in another country. You
want to debit one account and credit the other: the filesystems at
the two sides *must not roll back their state*!! (especially not
transparently, without human intervention)
* Re: What is the vision for btrfs fs repair?
2014-10-10 10:53 ` Bob Marley
2014-10-10 10:59 ` Roman Mamedov
@ 2014-10-10 14:37 ` Chris Murphy
2014-10-10 17:43 ` Bob Marley
2014-10-12 10:14 ` Martin Steigerwald
2014-10-11 7:29 ` Goffredo Baroncelli
2 siblings, 2 replies; 33+ messages in thread
From: Chris Murphy @ 2014-10-10 14:37 UTC (permalink / raw)
To: linux-btrfs
On Oct 10, 2014, at 6:53 AM, Bob Marley <bobmarley@shiftmail.org> wrote:
> On 10/10/2014 03:58, Chris Murphy wrote:
>>
>>> * mount -o recovery
>>> "Enable autorecovery attempts if a bad tree root is found at mount time."
>> I'm confused why it's not the default yet. Maybe it's continuing to evolve at a pace that suggests something could sneak in that makes things worse? It is almost an oxymoron in that I'm manually enabling an autorecovery.
>>
>> If true, maybe the closest indication we'd get of btrfs stability is the default enabling of autorecovery.
>
> No way!
> I wouldn't want a default like that.
>
> If you think about distributed transactions: suppose a sync was issued on both sides of a distributed transaction, then power was lost on one side, and then btrfs had corruption. When I remount it, definitely the worst thing that can happen is that it auto-rolls-back to a previous known-good state.
For a general purpose file system, losing 30 seconds (or less) of questionably committed, likely corrupt data beats a file system that won't mount without user intervention, that requires a secret decoder ring to get it to mount at all, and that may require the use of specialized tools to retrieve that data in any case.
The fail safe behavior is to treat the known good tree root as the default tree root, and bypass the bad tree root if it cannot be repaired, so that the volume can be mounted with default mount options (i.e. the ones in fstab). Otherwise it's a filesystem that isn't well suited for general purpose use as rootfs let alone for boot.
Chris Murphy
* Re: What is the vision for btrfs fs repair?
2014-10-10 11:12 ` Bob Marley
@ 2014-10-10 15:18 ` cwillu
0 siblings, 0 replies; 33+ messages in thread
From: cwillu @ 2014-10-10 15:18 UTC (permalink / raw)
To: Bob Marley; +Cc: linux-btrfs
If -o recovery is necessary, then you're either running into a btrfs
bug, or your hardware is lying about when it has actually written
things to disk.
The first case isn't unheard of, although far less common than it used
to be, and it should continue to improve with time.
In the second case, you're potentially screwed regardless of the
filesystem, without doing hacks like "wait a good long time before
returning from fsync in the hopes that the disk might actually have
gotten around to performing the write it said had already finished."
On Fri, Oct 10, 2014 at 5:12 AM, Bob Marley <bobmarley@shiftmail.org> wrote:
> On 10/10/2014 12:59, Roman Mamedov wrote:
>>
>> On Fri, 10 Oct 2014 12:53:38 +0200
>> Bob Marley <bobmarley@shiftmail.org> wrote:
>>
>>> On 10/10/2014 03:58, Chris Murphy wrote:
>>>>>
>>>>> * mount -o recovery
>>>>> "Enable autorecovery attempts if a bad tree root is found at
>>>>> mount time."
>>>>
>>>> I'm confused why it's not the default yet. Maybe it's continuing to
>>>> evolve at a pace that suggests something could sneak in that makes things
>>>> worse? It is almost an oxymoron in that I'm manually enabling an
>>>> autorecovery.
>>>>
>>>> If true, maybe the closest indication we'd get of btrfs stability is the
>>>> default enabling of autorecovery.
>>>
>>> No way!
>>> I wouldn't want a default like that.
>>>
>>> If you think about distributed transactions: suppose a sync was issued on
>>> both sides of a distributed transaction, then power was lost on one
>>> side
>>
>> What distributed transactions? Btrfs is not a clustered filesystem[1], it
>> does
>> not support and likely will never support being mounted from multiple
>> hosts at
>> the same time.
>>
>> [1]http://en.wikipedia.org/wiki/Clustered_file_system
>>
>
> This is not the only way to do a distributed transaction.
> Databases can be hosted on the filesystem, and those can do distributed
> transactions.
> Think of two bank accounts, one on btrfs fs1 here, and another bank account
> on database on a whatever filesystem in another country. You want to debit
> one account and credit the other one: the filesystems at the two sides *must
> not rollback their state* !! (especially not transparently without human
> intervention)
>
>
* Re: What is the vision for btrfs fs repair?
2014-10-10 14:37 ` Chris Murphy
@ 2014-10-10 17:43 ` Bob Marley
2014-10-10 17:53 ` Bardur Arantsson
2014-10-10 19:35 ` Austin S Hemmelgarn
2014-10-12 10:14 ` Martin Steigerwald
1 sibling, 2 replies; 33+ messages in thread
From: Bob Marley @ 2014-10-10 17:43 UTC (permalink / raw)
To: linux-btrfs
On 10/10/2014 16:37, Chris Murphy wrote:
> The fail safe behavior is to treat the known good tree root as the default tree root, and bypass the bad tree root if it cannot be repaired, so that the volume can be mounted with default mount options (i.e. the ones in fstab). Otherwise it's a filesystem that isn't well suited for general purpose use as rootfs let alone for boot.
>
A filesystem which is suited for "general purpose" use is a filesystem
which honors fsync, and doesn't *ever* auto-roll-back without user
intervention.
Anything different is not suited for database transactions at all. Any
paid service which keeps its user database on btrfs is going to be at
risk of losing payments, probably without the company even knowing.
If btrfs goes this way, I hope a big warning is written on the wiki and
in the manpages saying that this filesystem is totally unsuitable for
hosting databases that perform transactions.
At most I can suggest that a flag be added to the metadata to
allow/disallow auto-roll-back-on-error on such a filesystem, so people can
decide between "tolerant" and "transaction-safe" mode at filesystem creation.
* Re: What is the vision for btrfs fs repair?
2014-10-10 17:43 ` Bob Marley
@ 2014-10-10 17:53 ` Bardur Arantsson
2014-10-10 19:35 ` Austin S Hemmelgarn
1 sibling, 0 replies; 33+ messages in thread
From: Bardur Arantsson @ 2014-10-10 17:53 UTC (permalink / raw)
To: linux-btrfs
On 2014-10-10 19:43, Bob Marley wrote:
> On 10/10/2014 16:37, Chris Murphy wrote:
>> The fail safe behavior is to treat the known good tree root as the
>> default tree root, and bypass the bad tree root if it cannot be
>> repaired, so that the volume can be mounted with default mount options
>> (i.e. the ones in fstab). Otherwise it's a filesystem that isn't well
>> suited for general purpose use as rootfs let alone for boot.
>>
>
> A filesystem which is suited for "general purpose" use is a filesystem
> which honors fsync, and doesn't *ever* auto-roll-back without user
> intervention.
>
A file system cannot do anything about the *DISKS* not honouring a sync
command. That's what the PP was talking about.
* Re: What is the vision for btrfs fs repair?
2014-10-10 17:43 ` Bob Marley
2014-10-10 17:53 ` Bardur Arantsson
@ 2014-10-10 19:35 ` Austin S Hemmelgarn
2014-10-10 22:05 ` Eric Sandeen
1 sibling, 1 reply; 33+ messages in thread
From: Austin S Hemmelgarn @ 2014-10-10 19:35 UTC (permalink / raw)
To: Bob Marley, linux-btrfs
[-- Attachment #1: Type: text/plain, Size: 2188 bytes --]
On 2014-10-10 13:43, Bob Marley wrote:
> On 10/10/2014 16:37, Chris Murphy wrote:
>> The fail safe behavior is to treat the known good tree root as the
>> default tree root, and bypass the bad tree root if it cannot be
>> repaired, so that the volume can be mounted with default mount options
>> (i.e. the ones in fstab). Otherwise it's a filesystem that isn't well
>> suited for general purpose use as rootfs let alone for boot.
>>
>
> A filesystem which is suited for "general purpose" use is a filesystem
> which honors fsync, and doesn't *ever* auto-roll-back without user
> intervention.
>
> Anything different is not suited for database transactions at all. Any
> paid service which has the users database on btrfs is going to be at
> risk of losing payments, and probably without the company even knowing.
> If btrfs goes this way I hope a big warning is written on the wiki and
> on the manpages telling that this filesystem is totally unsuitable for
> hosting databases performing transactions.
If they need reliability, they should have some form of redundancy
in-place and/or run the database directly on the block device; because
even ext4, XFS, and pretty much every other filesystem can lose data
sometimes, the difference being that those tend to give worse results
when hardware is misbehaving than BTRFS does, because BTRFS usually has
an old copy of whatever data structure gets corrupted to fall back on.
Also, you really shouldn't be running databases on a BTRFS filesystem at
the moment anyway, because of the significant performance implications.
>
> At most I can suggest that a flag in the metadata be added to
> allow/disallow auto-roll-back-on-error on such filesystem, so people can
> decide the "tolerant" vs. "transaction-safe" mode at filesystem creation.
>
The problem with this is that if the auto-recovery code did run (and
IMHO the kernel should spit out a warning to the system log whenever it
does), then chances are that you wouldn't have had a consistent view if
you had prevented it from running either; and, if the database is
properly distributed/replicated, then it should recover by itself.
[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 2455 bytes --]
* Re: What is the vision for btrfs fs repair?
2014-10-10 19:35 ` Austin S Hemmelgarn
@ 2014-10-10 22:05 ` Eric Sandeen
2014-10-13 11:26 ` Austin S Hemmelgarn
0 siblings, 1 reply; 33+ messages in thread
From: Eric Sandeen @ 2014-10-10 22:05 UTC (permalink / raw)
To: Austin S Hemmelgarn, Bob Marley, linux-btrfs
On 10/10/14 2:35 PM, Austin S Hemmelgarn wrote:
> On 2014-10-10 13:43, Bob Marley wrote:
>> On 10/10/2014 16:37, Chris Murphy wrote:
>>> The fail safe behavior is to treat the known good tree root as
>>> the default tree root, and bypass the bad tree root if it cannot
>>> be repaired, so that the volume can be mounted with default mount
>>> options (i.e. the ones in fstab). Otherwise it's a filesystem
>>> that isn't well suited for general purpose use as rootfs let
>>> alone for boot.
>>>
>>
>> A filesystem which is suited for "general purpose" use is a
>> filesystem which honors fsync, and doesn't *ever* auto-roll-back
>> without user intervention.
>>
>> Anything different is not suited for database transactions at all.
>> Any paid service which has the users database on btrfs is going to
>> be at risk of losing payments, and probably without the company
>> even knowing. If btrfs goes this way I hope a big warning is
>> written on the wiki and on the manpages telling that this
>> filesystem is totally unsuitable for hosting databases performing
>> transactions.
> If they need reliability, they should have some form of redundancy
> in-place and/or run the database directly on the block device;
> because even ext4, XFS, and pretty much every other filesystem can
> lose data sometimes,
Not if, e.g., fsync returns. If the data is gone later, it's a hardware
problem, or occasionally a bug - and such bugs are usually found & fixed
pretty quickly.
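The guarantee being invoked here is the standard durable-write pattern: once fsync() returns successfully, the data must be on stable storage. A minimal Python sketch (file names are illustrative):

```python
import os

def durable_write(path: str, data: bytes) -> None:
    """Write data so that, once this returns, a crash won't lose it."""
    tmp = path + ".tmp"
    # 1. Write to a temporary file and flush it to stable storage.
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)  # the data is durable only once this returns
    finally:
        os.close(fd)
    # 2. Atomically replace the target, then fsync the directory so the
    #    rename itself survives a crash too.
    os.rename(tmp, path)
    dfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)

durable_write("payment.log", b"txn 42 committed\n")
print(open("payment.log", "rb").read())  # b'txn 42 committed\n'
```

If a filesystem rolled back past a state whose fsync had already returned, this pattern's promise would be broken, which is exactly the concern raised in this thread.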
> the difference being that those tend to give
> worse results when hardware is misbehaving than BTRFS does, because
> BTRFS usually has an old copy of whatever data structure gets
> corrupted to fall back on.
I'm curious, is that based on conjecture or real-world testing?
-Eric
* Re: What is the vision for btrfs fs repair?
2014-10-10 10:53 ` Bob Marley
2014-10-10 10:59 ` Roman Mamedov
2014-10-10 14:37 ` Chris Murphy
@ 2014-10-11 7:29 ` Goffredo Baroncelli
2014-11-17 20:55 ` Phillip Susi
2 siblings, 1 reply; 33+ messages in thread
From: Goffredo Baroncelli @ 2014-10-11 7:29 UTC (permalink / raw)
To: Bob Marley, linux-btrfs
On 10/10/2014 12:53 PM, Bob Marley wrote:
>>
>> If true, maybe the closest indication we'd get of btrfs stability is
>> the default enabling of autorecovery.
>
> No way! I wouldn't want a default like that.
>
> If you think about distributed transactions: suppose a sync was issued
> on both sides of a distributed transaction, then power was lost on
> one side, and then btrfs had corruption. When I remount it, definitely
> the worst thing that can happen is that it auto-rolls-back to a
> previous known-good state.
I cannot agree. I consider it a saner default to have a consistent state
with the recently written data lost, rather than to require user
intervention in order not to lose anything.
To address your requirement, we would need a "super sync" command which
ensures that the data are in the filesystem proper and not only
in the log (as sync ensures).
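Something close to this already exists: 'btrfs filesystem sync <path>' asks btrfs to commit the running transaction. The layering of the durability knobs, from narrowest to widest scope, sketched in Python (os.sync() is the system-wide flush; the btrfs command is only recorded as a string here, since it needs a mounted btrfs filesystem):

```python
import os

levels = []

# 1. fsync(fd): flush one file's data and metadata to stable storage.
fd = os.open("scratch.txt", os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
os.write(fd, b"hello\n")
os.fsync(fd)
os.close(fd)
levels.append("fsync: one file durable")

# 2. sync(): flush all dirty data on all filesystems. (Per POSIX this may
#    only schedule the writes; on Linux it waits for completion.)
os.sync()
levels.append("sync: all filesystems flushed")

# 3. Filesystem-specific commit, closest to the "super sync" described
#    above (shown as a string; requires a mounted btrfs filesystem):
levels.append("btrfs filesystem sync /mnt")

print(len(levels))  # 3
```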
BR
--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
* Re: What is the vision for btrfs fs repair?
2014-10-10 1:58 ` Chris Murphy
2014-10-10 3:20 ` Duncan
2014-10-10 10:53 ` Bob Marley
@ 2014-10-12 10:06 ` Martin Steigerwald
2 siblings, 0 replies; 33+ messages in thread
From: Martin Steigerwald @ 2014-10-12 10:06 UTC (permalink / raw)
To: Chris Murphy, linux-btrfs; +Cc: Eric Sandeen
On Thursday, 9 October 2014, 21:58:53, you wrote:
> > * btrfs-zero-log
> > "remove the log tree if log tree is corrupt"
> > * btrfs rescue
> > "Recover a damaged btrfs filesystem"
> > chunk-recover
> > super-recover
> > How does this relate to btrfs check?
> > * btrfs check
> > "repair a btrfs filesystem"
> > --repair
> > --init-csum-tree
> > --init-extent-tree
> > How does this relate to btrfs rescue?
>
> These three translate into eight combinations of repairs, adding -o recovery
> there are 9 combinations. I think this is the main source of confusion,
> there are just too many options, but also it's completely non-obvious which
> one to use in which situation.
>
> My expectation is that eventually these get consolidated into just check and
> check --repair. As the repair code matures, it'd go into kernel
> autorecovery code. That's a guess on my part, but it's consistent with
> design goals.
Also, I think these should at least all be under the btrfs command.
So include btrfs-zero-log in the btrfs command.
And how about "btrfs repair" or "btrfs check" as an upper category, with
the various options added as subcommands below it? Then there would be at
least one command, and one place in the manpage, to learn about the
various options.
But maybe some can be made automatic as well, or folded into btrfs check
--repair. Ideally it would auto-detect which path to take on filesystem
recovery.
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
* Re: What is the vision for btrfs fs repair?
2014-10-10 14:37 ` Chris Murphy
2014-10-10 17:43 ` Bob Marley
@ 2014-10-12 10:14 ` Martin Steigerwald
2014-10-12 23:59 ` Duncan
` (2 more replies)
1 sibling, 3 replies; 33+ messages in thread
From: Martin Steigerwald @ 2014-10-12 10:14 UTC (permalink / raw)
To: Chris Murphy; +Cc: linux-btrfs
On Friday, 10 October 2014, 10:37:44, Chris Murphy wrote:
> On Oct 10, 2014, at 6:53 AM, Bob Marley <bobmarley@shiftmail.org> wrote:
> > On 10/10/2014 03:58, Chris Murphy wrote:
> >>> * mount -o recovery
> >>>
> >>> "Enable autorecovery attempts if a bad tree root is found at mount
> >>> time."
> >>
> >> I'm confused why it's not the default yet. Maybe it's continuing to
> >> evolve at a pace that suggests something could sneak in that makes
> >> things worse? It is almost an oxymoron in that I'm manually enabling an
> >> autorecovery.
> >>
> >> If true, maybe the closest indication we'd get of btrfs stability is the
> >> default enabling of autorecovery.
> > No way!
> > I wouldn't want a default like that.
> >
> > If you think about distributed transactions: suppose a sync was issued on
> > both sides of a distributed transaction, then power was lost on one side,
> > and then btrfs had corruption. When I remount it, definitely the worst
> > thing that can happen is that it auto-rolls-back to a previous known-good
> > state.
> For a general purpose file system, losing 30 seconds (or less) of
> questionably committed data, likely corrupt, is a file system that won't
> mount without user intervention, which requires a secret decoder ring to
> get it to mount at all. And may require the use of specialized tools to
> retrieve that data in any case.
>
> The fail safe behavior is to treat the known good tree root as the default
> tree root, and bypass the bad tree root if it cannot be repaired, so that
> the volume can be mounted with default mount options (i.e. the ones in
> fstab). Otherwise it's a filesystem that isn't well suited for general
> purpose use as rootfs let alone for boot.
To understand this a bit better:
What can be the reasons a recent tree gets corrupted?
I always thought that with a controller, device, and driver combination that
honors fsync, with BTRFS it would be either the new state or the last known
good state *anyway*. So where does the need to roll back arise from?
That said, all journalling filesystems have some sort of rollback as far as I
understand: if the last journal entry is incomplete, they discard it on journal
replay. So even there you lose the last seconds of write activity.
But once fsync() returns, the data needs to be safe on disk. I always
thought BTRFS honors this under *any* circumstance. If some proposed
autorollback breaks this guarantee, I think something is broken elsewhere.
And fsync is an fsync is an fsync. Its semantics are clear as crystal. There
is nothing, absolutely nothing to discuss about it.
An fsync completes when the device itself has reported "Yeah, I have the data
on disk, all safe and cool to go". Anything else is a bug, IMO.
* Re: What is the vision for btrfs fs repair?
2014-10-08 19:11 What is the vision for btrfs fs repair? Eric Sandeen
2014-10-09 11:29 ` Austin S Hemmelgarn
2014-10-10 1:58 ` Chris Murphy
@ 2014-10-12 10:17 ` Martin Steigerwald
2014-10-13 21:09 ` Josef Bacik
3 siblings, 0 replies; 33+ messages in thread
From: Martin Steigerwald @ 2014-10-12 10:17 UTC (permalink / raw)
To: Eric Sandeen; +Cc: linux-btrfs
On Wednesday, 8 October 2014, 14:11:51, Eric Sandeen wrote:
> I was looking at Marc's post:
>
> http://marc.merlins.org/perso/btrfs/post_2014-03-19_Btrfs-Tips_-Btrfs-Scrub-and-Btrfs-Filesystem-Repair.html
>
> and it feels like there isn't exactly a cohesive, overarching vision for
> repair of a corrupted btrfs filesystem.
>
> In other words - I'm an admin cruising along, when the kernel throws some
> fs corruption error, or for whatever reason btrfs fails to mount.
> What should I do?
>
> Marc lays out several steps, but to me this highlights that there seem to
> be a lot of disjoint mechanisms out there to deal with these problems;
> mostly from Marc's blog, with some bits of my own:
>
> * btrfs scrub
> "Errors are corrected along if possible" (what *is* possible?)
> * mount -o recovery
> "Enable autorecovery attempts if a bad tree root is found at mount time."
> * mount -o degraded
> "Allow mounts to continue with missing devices."
> (This isn't really a way to recover from corruption, right?)
> * btrfs-zero-log
> "remove the log tree if log tree is corrupt"
> * btrfs rescue
> "Recover a damaged btrfs filesystem"
> chunk-recover
> super-recover
> How does this relate to btrfs check?
> * btrfs check
> "repair a btrfs filesystem"
> --repair
> --init-csum-tree
> --init-extent-tree
> How does this relate to btrfs rescue?
> * btrfs restore
> "try to salvage files from a damaged filesystem"
> (not really repair, it's disk-scraping)
>
>
> What's the vision for, say, scrub vs. check vs. rescue? Should they repair
> the same errors, only online vs. offline? If not, what class of errors
> does one fix vs. the other? How would an admin know? Can btrfs check
> recover a bad tree root in the same way that mount -o recovery does? How
> would I know if I should use --init-*-tree, or chunk-recover, and what are
> the ramifications of using these options?
>
> It feels like recovery tools have been badly splintered, and if there's an
> overarching design or vision for btrfs fs repair, I can't tell what it is.
> Can anyone help me?
How about taking one step back:
What are the possible corruption cases these tools are meant to address?
*Where* can BTRFS break and *why*?
What of it can be folded into one command? Where can BTRFS be improved to
either prevent a corruption from happening or automatically correct it?
What actions can be determined automatically by the repair tool? What needs to
be presented as options for the user to choose from? And what guidance would
the user need in order to decide?
I.e., really go back to what diagnosing and repairing BTRFS actually
involves, and then, well… work out a vision for how this can all fit together,
as you suggested.
As a minimum I suggest having all repair options as subcommands of the
btrfs command, with no external commands whatsoever; so if btrfs-zero-log is
still needed, add it to the btrfs command.
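For what it's worth, this particular consolidation did happen: later btrfs-progs releases fold the standalone tool in as a rescue subcommand (device name is a placeholder):

```sh
# The old standalone btrfs-zero-log, as a subcommand of btrfs
# (present in later btrfs-progs releases):
btrfs rescue zero-log /dev/sdX
```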
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: What is the vision for btrfs fs repair?
2014-10-12 10:14 ` Martin Steigerwald
@ 2014-10-12 23:59 ` Duncan
2014-10-13 11:37 ` Austin S Hemmelgarn
2014-10-13 11:48 ` Rich Freeman
2 siblings, 0 replies; 33+ messages in thread
From: Duncan @ 2014-10-12 23:59 UTC (permalink / raw)
To: linux-btrfs
Martin Steigerwald posted on Sun, 12 Oct 2014 12:14:01 +0200 as excerpted:
> I always thought with a controller and device and driver combination
> that honors fsync with BTRFS it would either be the new state of the
> last known good state *anyway*. So where does the need to rollback arise
> from?
My understanding here is...
With btrfs a full-tree commit is atomic. You should get either the old
tree or the new tree. However, due to the cascading nature of updates on
cow-based structures, these full-tree commits are done by default
(there's a mount-option to adjust it) every 30 seconds. Between these
atomic commits partial updates may have occurred. The btrfs log (the one
that btrfs-zero-log kills) is limited to between-commit updates, and thus
to the up to 30 seconds (default) worth of changes since the last
full-tree atomic commit.
In addition to that, there's a history of tree-root commits kept (with
the superblocks pointing to the last one). Btrfs-find-root can be
used to list this history. The recovery mount option simply allows btrfs
to fall back to this history, should the current root be corrupted.
Btrfs restore can be used to list tree roots as well, and can be pointed
at an appropriate one if necessary.
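As a rough sketch of the tools described here (device names are placeholders; the history-listing tool ships as btrfs-find-root in btrfs-progs):

```sh
# Read-only: list the tree-root history recorded on the device
btrfs-find-root /dev/sdX

# Fall back to an older tree root at mount time if the current one is bad
mount -o recovery /dev/sdX /mnt

# Disk-scrape files using a specific tree root ("-t" takes a byte number
# reported by btrfs-find-root), copying them out to another filesystem
btrfs restore -t <bytenr> /dev/sdX /path/to/rescue-target
```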
Fsync forces the file and its corresponding metadata update to the log
and barring hardware or software bugs should not return until it's safely
in the log, but I'm not sure whether it forces a full-tree commit.
Either way the guarantees should be the same. If the log can be replayed
or a full-tree commit has occurred since the fsync, the new copy should
appear. If it can't, the rollback to the last atomic tree commit should
return an intact copy of the file from that point. If the recovery mount
option is used and a further rollback to an earlier full-tree commit is
forced, provided it existed at the point of that full-tree commit, the
intact file at that point should appear.
So if the current tree root is a good one, the log will replay the last
up to 30 seconds of activity on top of that last atomic tree root. If the
current root tree itself is corrupt, the recovery mount option will let
an earlier one be used. Obviously in that case the log will be discarded
since it applies to a later root tree that itself has been discarded.
The debate is whether recovery should be automated so the admin doesn't
have to care about it, or whether having to manually add that option
serves as a necessary notifier to the admin that something /did/ go
wrong, and that an earlier root is being used instead, so more than a few
seconds worth of data may have disappeared.
As someone else has already suggested, I'd argue that as long as btrfs
continues to be under the sort of development it's in now, keeping
recovery as a non-default option is desired. Once it's optimized and
considered stable, arguably recovery should be made the default, perhaps
with a no-recovery option for those who prefer that in-the-face
notification in the form of a mount error, if btrfs would otherwise fall
back to an earlier tree root commit.
What worries me, however, is that IMO the recent warning stripping was
premature. Btrfs is certainly NOT fully stable or optimized for normal
use at this point. We're still using the even/odd PID balancing scheme
for raid1 reads, for instance, and multi-device writes are still
serialized when they could be parallelized to a much larger degree (tho
keeping some serialization is arguably good for data safety). Arguably
optimizing that now would be premature optimization since the code itself
is still subject to change, so I'm not complaining, but by that very same
token, it *IS* still subject to change, which by definition means it's
*NOT* stable, so why are we removing all the warnings and giving the
impression that it IS stable?
The decision wasn't mine to make and I don't know, but while a nice
suggestion, making recovery-by-default a measure of when btrfs goes
stable simply won't work, because surely the same folks behind the
warning stripping would then ensure that this indicator, too, said btrfs
was stable, while the state of the code itself continues to say otherwise.
Meanwhile, if your distributed transactions scenario doesn't account for
crash and loss of data on one side with real-time backup/redundancy, such
that loss of a few seconds worth of transactions on a single local
filesystem is going to kill the entire scenario, I don't think too much
of that scenario in the first place, and regardless, btrfs, certainly in
its current state, is definitely NOT an appropriate base for it. Use
appropriate tools for the task. Btrfs at least at this point is simply
not an appropriate tool for that task.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: What is the vision for btrfs fs repair?
2014-10-10 22:05 ` Eric Sandeen
@ 2014-10-13 11:26 ` Austin S Hemmelgarn
0 siblings, 0 replies; 33+ messages in thread
From: Austin S Hemmelgarn @ 2014-10-13 11:26 UTC (permalink / raw)
To: Eric Sandeen, Bob Marley, linux-btrfs
[-- Attachment #1: Type: text/plain, Size: 2993 bytes --]
On 2014-10-10 18:05, Eric Sandeen wrote:
> On 10/10/14 2:35 PM, Austin S Hemmelgarn wrote:
>> On 2014-10-10 13:43, Bob Marley wrote:
>>> On 10/10/2014 16:37, Chris Murphy wrote:
>>>> The fail safe behavior is to treat the known good tree root as
>>>> the default tree root, and bypass the bad tree root if it cannot
>>>> be repaired, so that the volume can be mounted with default mount
>>>> options (i.e. the ones in fstab). Otherwise it's a filesystem
>>>> that isn't well suited for general purpose use as rootfs let
>>>> alone for boot.
>>>>
>>>
>>> A filesystem which is suited for "general purpose" use is a
>>> filesystem which honors fsync, and doesn't *ever* auto-roll-back
>>> without user intervention.
>>>
>>> Anything different is not suited for database transactions at all.
>>> Any paid service which has the users database on btrfs is going to
>>> be at risk of losing payments, and probably without the company
>>> even knowing. If btrfs goes this way I hope a big warning is
>>> written on the wiki and on the manpages telling that this
>>> filesystem is totally unsuitable for hosting databases performing
>>> transactions.
>> If they need reliability, they should have some form of redundancy
>> in-place and/or run the database directly on the block device;
>> because even ext4, XFS, and pretty much every other filesystem can
>> lose data sometimes,
>
> Not if i.e. fsync returns. If the data is gone later, it's a hardware
> problem, or occasionally a bug - bugs that are usually found & fixed
> pretty quickly.
Yes, barring bugs and hardware problems they won't lose data.
>
>> the difference being that those tend to give
>> worse results when hardware is misbehaving than BTRFS does, because
>> BTRFS usually has a old copy of whatever data structure gets
>> corrupted to fall back on.
>
> I'm curious, is that based on conjecture or real-world testing?
>
I wouldn't really call it testing, but based on personal experience I
know that ext4 can lose whole directory sub-trees if it gets a single
corrupt sector in the wrong place. I've also had that happen on FAT32
and (somewhat interestingly) HFS+ with failing/misbehaving hardware; and
I've actually had individual files disappear on HFS+ without any
discernible hardware issues. I don't have as much experience with XFS,
but would assume based on what I do know of it that it could have
similar issues. As for BTRFS, I've only ever had any issues with it 3
times, one was due to the kernel panicking during resume from S1, and
the other two were due to hardware problems that would have caused
issues on most other filesystems as well. In both cases of hardware
issues, while the filesystem was initially unmountable, it was
relatively simple to fix once I knew how. I tried to fix an ext4 fs
that had become unmountable due to dropped writes once, and that was
anything but simple, even with the much greater amount of documentation.
[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 2455 bytes --]
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: What is the vision for btrfs fs repair?
2014-10-12 10:14 ` Martin Steigerwald
2014-10-12 23:59 ` Duncan
@ 2014-10-13 11:37 ` Austin S Hemmelgarn
2014-10-13 11:48 ` Rich Freeman
2 siblings, 0 replies; 33+ messages in thread
From: Austin S Hemmelgarn @ 2014-10-13 11:37 UTC (permalink / raw)
To: Martin Steigerwald; +Cc: Chris Murphy, linux-btrfs
[-- Attachment #1: Type: text/plain, Size: 4031 bytes --]
On 2014-10-12 06:14, Martin Steigerwald wrote:
> On Friday, 10 October 2014, 10:37:44, Chris Murphy wrote:
>> On Oct 10, 2014, at 6:53 AM, Bob Marley <bobmarley@shiftmail.org> wrote:
>>> On 10/10/2014 03:58, Chris Murphy wrote:
>>>>> * mount -o recovery
>>>>>
>>>>> "Enable autorecovery attempts if a bad tree root is found at mount
>>>>> time."
>>>>
>>>> I'm confused why it's not the default yet. Maybe it's continuing to
>>>> evolve at a pace that suggests something could sneak in that makes
>>>> things worse? It is almost an oxymoron in that I'm manually enabling an
>>>> autorecovery
>>>>
>>>> If true, maybe the closest indication we'd get of btrfs stablity is the
>>>> default enabling of autorecovery.>
>>> No way!
>>> I wouldn't want a default like that.
>>>
>>> If you think at distributed transactions: suppose a sync was issued on
>>> both sides of a distributed transaction, then power was lost on one side,
>>> than btrfs had corruption. When I remount it, definitely the worst thing
>>> that can happen is that it auto-rolls-back to a previous known-good
>>> state.
>> For a general purpose file system, losing 30 seconds (or less) of
>> questionably committed data, likely corrupt, is a file system that won't
>> mount without user intervention, which requires a secret decoder ring to
>> get it to mount at all. And may require the use of specialized tools to
>> retrieve that data in any case.
>>
>> The fail safe behavior is to treat the known good tree root as the default
>> tree root, and bypass the bad tree root if it cannot be repaired, so that
>> the volume can be mounted with default mount options (i.e. the ones in
>> fstab). Otherwise it's a filesystem that isn't well suited for general
>> purpose use as rootfs let alone for boot.
>
> To understand this a bit better:
>
> What can be the reasons a recent tree gets corrupted?
>
Well, so far I have had the following cause corrupted trees:
1. Kernel panic during resume from ACPI S1 (suspend to RAM), which just
happened to be in the middle of a tree commit.
2. Generic power loss during a tree commit.
3. A device not properly honoring write-barriers (the operations
immediately adjacent to the write barrier weren't being ordered
correctly all the time).
Based on what I know about BTRFS, the following could also cause problems:
1. A single-event-upset somewhere in the write path.
2. The kernel issuing a write to the wrong device (I haven't had this
happen to me, but know people who have).
In general, any of these will cause problems for pretty much any
filesystem, not just BTRFS.
> I always thought with a controller and device and driver combination that
> honors fsync with BTRFS it would either be the new state of the last known
> good state *anyway*. So where does the need to rollback arise from?
>
I think that in this case the term rollback is a bit ambiguous, here it
means from the point of view of userspace, which sees the FS as having
'rolled-back' from the most recent state to the last known good state.
> That said all journalling filesystems have some sort of rollback as far as I
> understand: If the last journal entry is incomplete they discard it on journal
> replay. So even there you use the last seconds of write activity.
>
> But in case fsync() returns the data needs to be safe on disk. I always
> thought BTRFS honors this under *any* circumstance. If some proposed
> autorollback breaks this guarentee, I think something is broke elsewhere.
>
> And fsync is an fsync is an fsync. Its semantics are clear as crystal. There
> is nothing, absolutely nothing to discuss about it.
>
> An fsync completes if the device itself reported "Yeah, I have the data on
> disk, all safe and cool to go". Anything else is a bug IMO.
>
Or a hardware issue; most filesystems need disks to properly honor write
barriers to provide guaranteed semantics for fsync, and many consumer
disk drives still don't honor them consistently.
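One practical check for the usual culprit named above (a sketch; hdparm only speaks to ATA drives, and the device name is a placeholder):

```sh
# Query whether the drive's volatile write cache is enabled; a cache that
# lies about flushes is the classic way fsync guarantees get broken
hdparm -W /dev/sdX

# On a drive suspected of mishandling flushes, disable the write cache,
# trading performance for safety
hdparm -W0 /dev/sdX
```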
[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 2455 bytes --]
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: What is the vision for btrfs fs repair?
2014-10-12 10:14 ` Martin Steigerwald
2014-10-12 23:59 ` Duncan
2014-10-13 11:37 ` Austin S Hemmelgarn
@ 2014-10-13 11:48 ` Rich Freeman
2 siblings, 0 replies; 33+ messages in thread
From: Rich Freeman @ 2014-10-13 11:48 UTC (permalink / raw)
To: Martin Steigerwald; +Cc: Chris Murphy, linux-btrfs
On Sun, Oct 12, 2014 at 6:14 AM, Martin Steigerwald <Martin@lichtvoll.de> wrote:
> On Friday, 10 October 2014, 10:37:44, Chris Murphy wrote:
>> On Oct 10, 2014, at 6:53 AM, Bob Marley <bobmarley@shiftmail.org> wrote:
>> > On 10/10/2014 03:58, Chris Murphy wrote:
>> >>> * mount -o recovery
>> >>>
>> >>> "Enable autorecovery attempts if a bad tree root is found at mount
>> >>> time."
>> >>
>> >> I'm confused why it's not the default yet. Maybe it's continuing to
>> >> evolve at a pace that suggests something could sneak in that makes
>> >> things worse? It is almost an oxymoron in that I'm manually enabling an
>> >> autorecovery
>> >>
>> >> If true, maybe the closest indication we'd get of btrfs stablity is the
>> >> default enabling of autorecovery.>
>> > No way!
>> > I wouldn't want a default like that.
>> >
>> > If you think at distributed transactions: suppose a sync was issued on
>> > both sides of a distributed transaction, then power was lost on one side,
>> > than btrfs had corruption. When I remount it, definitely the worst thing
>> > that can happen is that it auto-rolls-back to a previous known-good
>> > state.
>> For a general purpose file system, losing 30 seconds (or less) of
>> questionably committed data, likely corrupt, is a file system that won't
>> mount without user intervention, which requires a secret decoder ring to
>> get it to mount at all. And may require the use of specialized tools to
>> retrieve that data in any case.
>>
>> The fail safe behavior is to treat the known good tree root as the default
>> tree root, and bypass the bad tree root if it cannot be repaired, so that
>> the volume can be mounted with default mount options (i.e. the ones in
>> fstab). Otherwise it's a filesystem that isn't well suited for general
>> purpose use as rootfs let alone for boot.
>
> To understand this a bit better:
>
> What can be the reasons a recent tree gets corrupted?
>
> I always thought with a controller and device and driver combination that
> honors fsync with BTRFS it would either be the new state of the last known
> good state *anyway*. So where does the need to rollback arise from?
>
In theory the recovery option should never be necessary. Btrfs makes
all the guarantees everybody wants it to: when the data is fsynced,
it will never be lost.
The question is what should happen when a corrupted tree root, which
should never happen, happens anyway. The options are to refuse to
mount the filesystem by default, or mount it by default discarding
about 30-60s worth of writes. And yes, when this situation happens
(whether it mounts by default or not) btrfs has broken its promise of
data being written after a successful fsync return.
As has been pointed out, braindead drive firmware is the most likely
cause of this sort of issue. However, there are a number of other
hardware and software errors that could cause it, including errors in
linux outside of btrfs, and of course bugs in btrfs as well.
In an ideal world no filesystem would need any kind of recovery/repair
tools. They can often mean that the fsync promise was broken. The
real question is, once that has happened, how do you move on?
I think the best default is to auto-recover, but to have better
facilities for reporting errors to the user. Right now btrfs is very
quiet about failures - maybe a cryptic message in dmesg, and nobody
reads all of that unless they're looking for something. If btrfs
could report significant issues that might mitigate the impact of an
auto-recovery.
Also, another thing to consider during recovery is whether the damaged
data could be optionally stored in a snapshot of some kind - maybe in
the way that ext3/4 rollback data after conversion gets stored in a
snapshot. My knowledge of the underlying structures is weak, but I'd
think that a corrupted tree root practically is a snapshot already,
and turning it into one might even be easier than cleaning it up. Of
course, we would need to ensure the snapshot could be deleted without
further error. Doing anything with the snapshot might require special
tools, but if people want to do disk scraping they could.
--
Rich
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: What is the vision for btrfs fs repair?
2014-10-08 19:11 What is the vision for btrfs fs repair? Eric Sandeen
` (2 preceding siblings ...)
2014-10-12 10:17 ` Martin Steigerwald
@ 2014-10-13 21:09 ` Josef Bacik
3 siblings, 0 replies; 33+ messages in thread
From: Josef Bacik @ 2014-10-13 21:09 UTC (permalink / raw)
To: Eric Sandeen, linux-btrfs
On 10/08/2014 03:11 PM, Eric Sandeen wrote:
> I was looking at Marc's post:
>
> http://marc.merlins.org/perso/btrfs/post_2014-03-19_Btrfs-Tips_-Btrfs-Scrub-and-Btrfs-Filesystem-Repair.html
>
> and it feels like there isn't exactly a cohesive, overarching vision for
> repair of a corrupted btrfs filesystem.
>
> In other words - I'm an admin cruising along, when the kernel throws some
> fs corruption error, or for whatever reason btrfs fails to mount.
> What should I do?
>
> Marc lays out several steps, but to me this highlights that there seem to
> be a lot of disjoint mechanisms out there to deal with these problems;
> mostly from Marc's blog, with some bits of my own:
>
> * btrfs scrub
> "Errors are corrected along if possible" (what *is* possible?)
> * mount -o recovery
> "Enable autorecovery attempts if a bad tree root is found at mount time."
> * mount -o degraded
> "Allow mounts to continue with missing devices."
> (This isn't really a way to recover from corruption, right?)
> * btrfs-zero-log
> "remove the log tree if log tree is corrupt"
> * btrfs rescue
> "Recover a damaged btrfs filesystem"
> chunk-recover
> super-recover
> How does this relate to btrfs check?
> * btrfs check
> "repair a btrfs filesystem"
> --repair
> --init-csum-tree
> --init-extent-tree
> How does this relate to btrfs rescue?
> * btrfs restore
> "try to salvage files from a damaged filesystem"
> (not really repair, it's disk-scraping)
>
>
> What's the vision for, say, scrub vs. check vs. rescue? Should they repair the
> same errors, only online vs. offline? If not, what class of errors does one fix vs.
> the other? How would an admin know? Can btrfs check recover a bad tree root
> in the same way that mount -o recovery does? How would I know if I should use
> --init-*-tree, or chunk-recover, and what are the ramifications of using
> these options?
>
> It feels like recovery tools have been badly splintered, and if there's an
> overarching design or vision for btrfs fs repair, I can't tell what it is.
> Can anyone help me?
>
We probably should just consolidate under 3 commands: one for online
checking, one for offline repair, and one for pulling stuff off of the
disk when things go to hell. A lot of these tools were born out of the
fact that we didn't have a fsck tool for a long time, so these stop gaps
were put into place; now it's time to go back and clean it up.
I'll try to do this after I finish my cleanup/sync between kernel and
progs work, and fill out the documentation a little better so it's clear
when to use what. Thanks,
Josef
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: Re: What is the vision for btrfs fs repair?
2014-10-11 7:29 ` Goffredo Baroncelli
@ 2014-11-17 20:55 ` Phillip Susi
0 siblings, 0 replies; 33+ messages in thread
From: Phillip Susi @ 2014-11-17 20:55 UTC (permalink / raw)
To: kreijack, Bob Marley, linux-btrfs
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
On 10/11/2014 3:29 AM, Goffredo Baroncelli wrote:
> On 10/10/2014 12:53 PM, Bob Marley wrote:
>>>
>>> If true, maybe the closest indication we'd get of btrfs
>>> stablity is the default enabling of autorecovery.
>>
>> No way! I wouldn't want a default like that.
>>
>> If you think at distributed transactions: suppose a sync was
>> issued on both sides of a distributed transaction, then power was
>> lost on one side, than btrfs had corruption. When I remount it,
>> definitely the worst thing that can happen is that it
>> auto-rolls-back to a previous known-good state.
>
> I cannot agree. I consider a sane default to have a consistent
> state with "the recently data written lost", instead of "require
> the user intervention to not lost anything".
>
> To address your requirement, we need a "super sync" command which
> ensure that the data are in the filesystem and not only in the log
> (as sync should ensure).
I have to agree. There is a reason we have fsck -p and why that is what
is run at boot time. Some repairs involve a tradeoff that will result
in permanent data loss that maybe could be avoided by going the other
way, or performing manual recovery. Such repairs should never be done
automatically by default.
For that matter I'm not even sure this sort of thing should be there as
a mount option at all. It really should require a manual fsck run with
a big warning that *THIS WILL THROW OUT SOME DATA*.
Now if the data is saved to a snapshot or something so you can manually
try to recover it later rather than being thrown out wholesale, I can
see that being done automatically at boot time. Of course, if btrfs is
that damaged then wouldn't grub be unable to load your kernel in the
first place?
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.17 (MingW32)
iQEcBAEBAgAGBQJUamDQAAoJEI5FoCIzSKrwaYAIAKXgkGBbBZj6yUuLC1+euim6
6Xqer1DiGywEiO4UPaxmq3rHDOlZlyIamDpUi7nIvbfK+TgBWfEVtLvdd6shjfqA
FvFv7t+X2mlAyk+iGffSK1w9/qgEhE55M35exba95Cdsn0ezos4LpvTsL1128nkx
uGzYQcoYj1irkmDp133JuHYAxhrAp0Q6PB+5gIgWfRsVbGezcxg5FvqzotEq1J/d
7MT1FvdoUo5qt2j/KzTUfD5AlFhsXE5beykakMdFmoHlTCQAxEeUU21z6APclkxF
/b/ppLt603Vpb6rpKvNUyBy1TuPr6FJEx5O2qWUWlhRxkOUB98M86KHyWVBHtMM=
=uG+h
-----END PGP SIGNATURE-----
^ permalink raw reply [flat|nested] 33+ messages in thread
end of thread, other threads:[~2014-11-17 20:56 UTC | newest]
Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-10-08 19:11 What is the vision for btrfs fs repair? Eric Sandeen
2014-10-09 11:29 ` Austin S Hemmelgarn
2014-10-09 11:53 ` Duncan
2014-10-09 11:55 ` Hugo Mills
2014-10-09 12:07 ` Austin S Hemmelgarn
2014-10-09 12:12 ` Hugo Mills
2014-10-09 12:32 ` Austin S Hemmelgarn
[not found] ` <107Y1p00G0wm9Bl0107vjZ>
2014-10-09 12:34 ` Duncan
2014-10-09 13:18 ` Austin S Hemmelgarn
2014-10-09 13:49 ` Duncan
2014-10-09 15:44 ` Eric Sandeen
[not found] ` <0zvr1p0162Q6ekd01zvtN0>
2014-10-09 12:42 ` Duncan
2014-10-10 1:58 ` Chris Murphy
2014-10-10 3:20 ` Duncan
2014-10-10 10:53 ` Bob Marley
2014-10-10 10:59 ` Roman Mamedov
2014-10-10 11:12 ` Bob Marley
2014-10-10 15:18 ` cwillu
2014-10-10 14:37 ` Chris Murphy
2014-10-10 17:43 ` Bob Marley
2014-10-10 17:53 ` Bardur Arantsson
2014-10-10 19:35 ` Austin S Hemmelgarn
2014-10-10 22:05 ` Eric Sandeen
2014-10-13 11:26 ` Austin S Hemmelgarn
2014-10-12 10:14 ` Martin Steigerwald
2014-10-12 23:59 ` Duncan
2014-10-13 11:37 ` Austin S Hemmelgarn
2014-10-13 11:48 ` Rich Freeman
2014-10-11 7:29 ` Goffredo Baroncelli
2014-11-17 20:55 ` Phillip Susi
2014-10-12 10:06 ` Martin Steigerwald
2014-10-12 10:17 ` Martin Steigerwald
2014-10-13 21:09 ` Josef Bacik