* What is the vision for btrfs fs repair?
@ 2014-10-08 19:11 Eric Sandeen
  2014-10-09 11:29 ` Austin S Hemmelgarn
                   ` (3 more replies)
  0 siblings, 4 replies; 33+ messages in thread
From: Eric Sandeen @ 2014-10-08 19:11 UTC (permalink / raw)
  To: linux-btrfs

I was looking at Marc's post:

http://marc.merlins.org/perso/btrfs/post_2014-03-19_Btrfs-Tips_-Btrfs-Scrub-and-Btrfs-Filesystem-Repair.html

and it feels like there isn't exactly a cohesive, overarching vision for
repair of a corrupted btrfs filesystem.

In other words - I'm an admin cruising along, when the kernel throws some
fs corruption error, or for whatever reason btrfs fails to mount.
What should I do?

Marc lays out several steps, but to me this highlights that there seem to
be a lot of disjoint mechanisms out there to deal with these problems;
mostly from Marc's blog, with some bits of my own:

* btrfs scrub
	"Errors are corrected along if possible" (what *is* possible?)
* mount -o recovery
	"Enable autorecovery attempts if a bad tree root is found at mount time."
* mount -o degraded
	"Allow mounts to continue with missing devices."
	(This isn't really a way to recover from corruption, right?)
* btrfs-zero-log
	"remove the log tree if log tree is corrupt"
* btrfs rescue
	"Recover a damaged btrfs filesystem"
	chunk-recover
	super-recover
	How does this relate to btrfs check?
* btrfs check
	"repair a btrfs filesystem"
	--repair
	--init-csum-tree
	--init-extent-tree
	How does this relate to btrfs rescue?
* btrfs restore
	"try to salvage files from a damaged filesystem"
	(not really repair, it's disk-scraping)
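
For concreteness, the invocations behind that list look roughly like
this (device and mount point names are placeholders, and exact option
spellings may vary between btrfs-progs versions):

	btrfs scrub start /mnt                # online; walks the fs and verifies checksums
	mount -o recovery /dev/sdb1 /mnt      # fall back to an older tree root at mount time
	mount -o degraded /dev/sdb1 /mnt      # multi-device fs with a device missing
	btrfs-zero-log /dev/sdb1              # offline; discards the log tree
	btrfs rescue super-recover /dev/sdb1
	btrfs rescue chunk-recover /dev/sdb1
	btrfs check /dev/sdb1                 # read-only report by default
	btrfs check --repair /dev/sdb1
	btrfs restore /dev/sdb1 /some/dir     # copy whatever is readable elsewhere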


What's the vision for, say, scrub vs. check vs. rescue?  Should they repair the
same errors, only online vs. offline?  If not, what class of errors does one fix vs.
the other?  How would an admin know?  Can btrfs check recover a bad tree root
in the same way that mount -o recovery does?  How would I know if I should use
--init-*-tree, or chunk-recover, and what are the ramifications of using
these options?

It feels like recovery tools have been badly splintered, and if there's an
overarching design or vision for btrfs fs repair, I can't tell what it is.
Can anyone help me?

Thanks,
-Eric


* Re: What is the vision for btrfs fs repair?
  2014-10-08 19:11 What is the vision for btrfs fs repair? Eric Sandeen
@ 2014-10-09 11:29 ` Austin S Hemmelgarn
  2014-10-09 11:53   ` Duncan
  2014-10-10  1:58 ` Chris Murphy
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 33+ messages in thread
From: Austin S Hemmelgarn @ 2014-10-09 11:29 UTC (permalink / raw)
  To: Eric Sandeen, linux-btrfs


On 2014-10-08 15:11, Eric Sandeen wrote:
> I was looking at Marc's post:
>
> http://marc.merlins.org/perso/btrfs/post_2014-03-19_Btrfs-Tips_-Btrfs-Scrub-and-Btrfs-Filesystem-Repair.html
>
> and it feels like there isn't exactly a cohesive, overarching vision for
> repair of a corrupted btrfs filesystem.
>
> In other words - I'm an admin cruising along, when the kernel throws some
> fs corruption error, or for whatever reason btrfs fails to mount.
> What should I do?
>
> Marc lays out several steps, but to me this highlights that there seem to
> be a lot of disjoint mechanisms out there to deal with these problems;
> mostly from Marc's blog, with some bits of my own:
>
> * btrfs scrub
> 	"Errors are corrected along if possible" (what *is* possible?)
> * mount -o recovery
> 	"Enable autorecovery attempts if a bad tree root is found at mount time."
> * mount -o degraded
> 	"Allow mounts to continue with missing devices."
> 	(This isn't really a way to recover from corruption, right?)
> * btrfs-zero-log
> 	"remove the log tree if log tree is corrupt"
> * btrfs rescue
> 	"Recover a damaged btrfs filesystem"
> 	chunk-recover
> 	super-recover
> 	How does this relate to btrfs check?
> * btrfs check
> 	"repair a btrfs filesystem"
> 	--repair
> 	--init-csum-tree
> 	--init-extent-tree
> 	How does this relate to btrfs rescue?
> * btrfs restore
> 	"try to salvage files from a damaged filesystem"
> 	(not really repair, it's disk-scraping)
>
>
> What's the vision for, say, scrub vs. check vs. rescue?  Should they repair the
> same errors, only online vs. offline?  If not, what class of errors does one fix vs.
> the other?  How would an admin know?  Can btrfs check recover a bad tree root
> in the same way that mount -o recovery does?  How would I know if I should use
> --init-*-tree, or chunk-recover, and what are the ramifications of using
> these options?
>
> It feels like recovery tools have been badly splintered, and if there's an
> overarching design or vision for btrfs fs repair, I can't tell what it is.
> Can anyone help me?

Well, based on my understanding:
* btrfs scrub is intended to be almost exactly equivalent to scrubbing a 
RAID volume; that is, it fixes disparity between multiple copies of the 
same block.  IOW, it isn't really repair per se, but more preventative 
maintnence.  Currently, it only works for cases where you have multiple 
copies of a block (dup, raid1, and raid10 profiles), but support is 
planned for error correction of raid5 and raid6 profiles.
* mount -o recovery I don't know much about, but AFAICT, it s more for 
dealing with metadata related FS corruption.
* mount -o degraded is used to mount a fs configured for a raid storage 
profile with fewer devices than the profile minimum.  It's primarily so 
that you can get the fs into a state where you can run 'btrfs device 
replace'.
* btrfs-zero-log only deals with log tree corruption.  This would be 
roughly equivalent to zeroing out the journal on an XFS or ext4 
filesystem, and should almost never be needed.
* btrfs rescue is intended for recovery from low-level corruption on an 
offline fs.
     * chunk-recover I'm not entirely sure about, but I believe it's 
like scrub for a single chunk on an offline fs
     * super-recover is for dealing with a corrupted superblock, and 
tries to replace it with one of the other copies (which hopefully isn't 
corrupted)
* btrfs check is intended to (eventually) be equivalent to the fsck 
utility for most other filesystems.  Currently, it's relatively good at 
identifying corruption, but less so at actually fixing it.  There are, 
however, some things that it won't catch, like a superblock pointing to 
a corrupted root tree.
* btrfs restore is essentially disk scraping, but with built-in 
knowledge of the filesystem's on-disk structure, which makes it more 
reliable than more generic tools like scalpel for files that are too big 
to fit in the metadata blocks, and it is pretty much essential for 
dealing with transparently compressed files.

In general, my personal procedure for handling a misbehaving BTRFS 
filesystem is:
* Run btrfs check on it WITHOUT ANY OTHER OPTIONS to try to identify 
what's wrong
* Try mounting it using -o recovery
* Try mounting it using -o ro,recovery
* Use -o degraded only if it's a BTRFS raid set that lost a disk
* If btrfs check AND dmesg both seem to indicate that the log tree is 
corrupt, try btrfs-zero-log
* If btrfs check indicated a corrupt superblock, try btrfs rescue 
super-recover
* If all of the above fails, ask for advice on the mailing list or IRC
Also, you should be running btrfs scrub regularly to correct bit-rot and 
force remapping of blocks with read errors.  While BTRFS technically 
handles both transparently on reads, it only corrects things on disk when 
you do a scrub.
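
For the regular scrub, something like this from a weekly cron job works 
(the mount point is a placeholder):

	btrfs scrub start -B /mnt      # -B: stay in the foreground and print a summary
	btrfs scrub status /mnt        # corrected vs. uncorrectable error counts
	btrfs device stats /mnt        # per-device error counters, useful for spotting a failing disk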




* Re: What is the vision for btrfs fs repair?
  2014-10-09 11:29 ` Austin S Hemmelgarn
@ 2014-10-09 11:53   ` Duncan
  2014-10-09 11:55     ` Hugo Mills
                       ` (3 more replies)
  0 siblings, 4 replies; 33+ messages in thread
From: Duncan @ 2014-10-09 11:53 UTC (permalink / raw)
  To: linux-btrfs

Austin S Hemmelgarn posted on Thu, 09 Oct 2014 07:29:23 -0400 as
excerpted:

> Also, you should be running btrfs scrub regularly to correct bit-rot
> and force remapping of blocks with read errors.  While BTRFS
> technically handles both transparently on reads, it only corrects thing
> on disk when you do a scrub.

AFAIK that isn't quite correct.  Currently, the number of copies is 
limited to two, meaning if one of the two is bad, there's a 50% chance of 
btrfs reading the good one on first try.

If btrfs reads the good copy, it simply uses it.  If btrfs reads the bad 
one, it checks the other one and assuming it's good, replaces the bad one 
with the good one both for the read (which otherwise errors out), and by 
overwriting the bad one.

But here's the rub.  The chances of detecting that bad block are 
relatively low in most cases.  First, the system must try reading it for 
some reason, but even then, chances are 50% it'll pick the good one and 
won't even notice the bad one.

Thus, while btrfs may randomly bump into a bad block and rewrite it with 
the good copy, scrub is the only way to systematically detect and (if 
there's a good copy) fix these checksum errors.  It's not that btrfs 
doesn't do it if it finds them, it's that the chances of finding them are 
relatively low, unless you do a scrub, which systematically checks the 
entire filesystem (well, other than files marked nocsum, or nocow, which 
implies nocsum, or files written when mounted with nodatacow or 
nodatasum).

At least that's the way it /should/ work.  I guess it's possible that 
btrfs isn't doing those routine "bump-into-it-and-fix-it" fixes yet, but 
if so, that's the first /I/ remember reading of it.

Other than that detail, what you posted matches my knowledge and 
experience, such as it may be as a non-dev list regular, as well.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: What is the vision for btrfs fs repair?
  2014-10-09 11:53   ` Duncan
@ 2014-10-09 11:55     ` Hugo Mills
  2014-10-09 12:07     ` Austin S Hemmelgarn
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 33+ messages in thread
From: Hugo Mills @ 2014-10-09 11:55 UTC (permalink / raw)
  To: Duncan; +Cc: linux-btrfs


On Thu, Oct 09, 2014 at 11:53:23AM +0000, Duncan wrote:
> Austin S Hemmelgarn posted on Thu, 09 Oct 2014 07:29:23 -0400 as
> excerpted:
> 
> > Also, you should be running btrfs scrub regularly to correct bit-rot
> > and force remapping of blocks with read errors.  While BTRFS
> > technically handles both transparently on reads, it only corrects thing
> > on disk when you do a scrub.
> 
> AFAIK that isn't quite correct.  Currently, the number of copies is 
> limited to two, meaning if one of the two is bad, there's a 50% chance of 
> btrfs reading the good one on first try.

   Scrub checks both copies, though. It's ordinary reads that don't.

   Hugo.

> If btrfs reads the good copy, it simply uses it.  If btrfs reads the bad 
> one, it checks the other one and assuming it's good, replaces the bad one 
> with the good one both for the read (which otherwise errors out), and by 
> overwriting the bad one.
> 
> But here's the rub.  The chances of detecting that bad block are 
> relatively low in most cases.  First, the system must try reading it for 
> some reason, but even then, chances are 50% it'll pick the good one and 
> won't even notice the bad one.
> 
> Thus, while btrfs may randomly bump into a bad block and rewrite it with 
> the good copy, scrub is the only way to systematically detect and (if 
> there's a good copy) fix these checksum errors.  It's not that btrfs 
> doesn't do it if it finds them, it's that the chances of finding them are 
> relatively low, unless you do a scrub, which systematically checks the 
> entire filesystem (well, other than files marked nocsum, or nocow, which 
> implies nocsum, or files written when mounted with nodatacow or 
> nodatasum).
> 
> At least that's the way it /should/ work.  I guess it's possible that 
> btrfs isn't doing those routine "bump-into-it-and-fix-it" fixes yet, but 
> if so, that's the first /I/ remember reading of it.
> 
> Other than that detail, what you posted matches my knowledge and 
> experience, such as it may be as a non-dev list regular, as well.
> 

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
      --- Great oxymorons of the world, no. 7: The Simple Truth ---      



* Re: What is the vision for btrfs fs repair?
  2014-10-09 11:53   ` Duncan
  2014-10-09 11:55     ` Hugo Mills
@ 2014-10-09 12:07     ` Austin S Hemmelgarn
  2014-10-09 12:12       ` Hugo Mills
       [not found]     ` <107Y1p00G0wm9Bl0107vjZ>
       [not found]     ` <0zvr1p0162Q6ekd01zvtN0>
  3 siblings, 1 reply; 33+ messages in thread
From: Austin S Hemmelgarn @ 2014-10-09 12:07 UTC (permalink / raw)
  To: Duncan, linux-btrfs


On 2014-10-09 07:53, Duncan wrote:
> Austin S Hemmelgarn posted on Thu, 09 Oct 2014 07:29:23 -0400 as
> excerpted:
>
>> Also, you should be running btrfs scrub regularly to correct bit-rot
>> and force remapping of blocks with read errors.  While BTRFS
>> technically handles both transparently on reads, it only corrects thing
>> on disk when you do a scrub.
>
> AFAIK that isn't quite correct.  Currently, the number of copies is
> limited to two, meaning if one of the two is bad, there's a 50% chance of
> btrfs reading the good one on first try.
>
> If btrfs reads the good copy, it simply uses it.  If btrfs reads the bad
> one, it checks the other one and assuming it's good, replaces the bad one
> with the good one both for the read (which otherwise errors out), and by
> overwriting the bad one.
>
> But here's the rub.  The chances of detecting that bad block are
> relatively low in most cases.  First, the system must try reading it for
> some reason, but even then, chances are 50% it'll pick the good one and
> won't even notice the bad one.
>
> Thus, while btrfs may randomly bump into a bad block and rewrite it with
> the good copy, scrub is the only way to systematically detect and (if
> there's a good copy) fix these checksum errors.  It's not that btrfs
> doesn't do it if it finds them, it's that the chances of finding them are
> relatively low, unless you do a scrub, which systematically checks the
> entire filesystem (well, other than files marked nocsum, or nocow, which
> implies nocsum, or files written when mounted with nodatacow or
> nodatasum).
>
> At least that's the way it /should/ work.  I guess it's possible that
> btrfs isn't doing those routine "bump-into-it-and-fix-it" fixes yet, but
> if so, that's the first /I/ remember reading of it.

I'm not 100% certain, but I believe it doesn't actually fix things on 
disk when it detects an error during a read.  I know it doesn't if the fs 
is mounted ro (even if the media is writable), because I did some 
testing to see how 'read-only' mounting a btrfs filesystem really is.

Also, that's a much better description of how multiple copies work than 
I could probably have ever given.





* Re: What is the vision for btrfs fs repair?
  2014-10-09 12:07     ` Austin S Hemmelgarn
@ 2014-10-09 12:12       ` Hugo Mills
  2014-10-09 12:32         ` Austin S Hemmelgarn
  0 siblings, 1 reply; 33+ messages in thread
From: Hugo Mills @ 2014-10-09 12:12 UTC (permalink / raw)
  To: Austin S Hemmelgarn; +Cc: Duncan, linux-btrfs


On Thu, Oct 09, 2014 at 08:07:51AM -0400, Austin S Hemmelgarn wrote:
> On 2014-10-09 07:53, Duncan wrote:
> >Austin S Hemmelgarn posted on Thu, 09 Oct 2014 07:29:23 -0400 as
> >excerpted:
> >
> >>Also, you should be running btrfs scrub regularly to correct bit-rot
> >>and force remapping of blocks with read errors.  While BTRFS
> >>technically handles both transparently on reads, it only corrects thing
> >>on disk when you do a scrub.
> >
> >AFAIK that isn't quite correct.  Currently, the number of copies is
> >limited to two, meaning if one of the two is bad, there's a 50% chance of
> >btrfs reading the good one on first try.
> >
> >If btrfs reads the good copy, it simply uses it.  If btrfs reads the bad
> >one, it checks the other one and assuming it's good, replaces the bad one
> >with the good one both for the read (which otherwise errors out), and by
> >overwriting the bad one.
> >
> >But here's the rub.  The chances of detecting that bad block are
> >relatively low in most cases.  First, the system must try reading it for
> >some reason, but even then, chances are 50% it'll pick the good one and
> >won't even notice the bad one.
> >
> >Thus, while btrfs may randomly bump into a bad block and rewrite it with
> >the good copy, scrub is the only way to systematically detect and (if
> >there's a good copy) fix these checksum errors.  It's not that btrfs
> >doesn't do it if it finds them, it's that the chances of finding them are
> >relatively low, unless you do a scrub, which systematically checks the
> >entire filesystem (well, other than files marked nocsum, or nocow, which
> >implies nocsum, or files written when mounted with nodatacow or
> >nodatasum).
> >
> >At least that's the way it /should/ work.  I guess it's possible that
> >btrfs isn't doing those routine "bump-into-it-and-fix-it" fixes yet, but
> >if so, that's the first /I/ remember reading of it.
> 
> I'm not 100% certain, but I believe it doesn't actually fix things on disk
> when it detects an error during a read,

   I'm fairly sure it does, as I've had it happen to me. :)

> I know it doesn't it the fs is
> mounted ro (even if the media is writable), because I did some testing to
> see how 'read-only' mounting a btrfs filesystem really is.

   If the FS is RO, then yes, it won't fix things.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
      --- Great films about cricket:  Interview with the Umpire ---      



* Re: What is the vision for btrfs fs repair?
  2014-10-09 12:12       ` Hugo Mills
@ 2014-10-09 12:32         ` Austin S Hemmelgarn
  0 siblings, 0 replies; 33+ messages in thread
From: Austin S Hemmelgarn @ 2014-10-09 12:32 UTC (permalink / raw)
  To: Hugo Mills, Duncan, linux-btrfs


On 2014-10-09 08:12, Hugo Mills wrote:
> On Thu, Oct 09, 2014 at 08:07:51AM -0400, Austin S Hemmelgarn wrote:
>> On 2014-10-09 07:53, Duncan wrote:
>>> Austin S Hemmelgarn posted on Thu, 09 Oct 2014 07:29:23 -0400 as
>>> excerpted:
>>>
>>>> Also, you should be running btrfs scrub regularly to correct bit-rot
>>>> and force remapping of blocks with read errors.  While BTRFS
>>>> technically handles both transparently on reads, it only corrects thing
>>>> on disk when you do a scrub.
>>>
>>> AFAIK that isn't quite correct.  Currently, the number of copies is
>>> limited to two, meaning if one of the two is bad, there's a 50% chance of
>>> btrfs reading the good one on first try.
>>>
>>> If btrfs reads the good copy, it simply uses it.  If btrfs reads the bad
>>> one, it checks the other one and assuming it's good, replaces the bad one
>>> with the good one both for the read (which otherwise errors out), and by
>>> overwriting the bad one.
>>>
>>> But here's the rub.  The chances of detecting that bad block are
>>> relatively low in most cases.  First, the system must try reading it for
>>> some reason, but even then, chances are 50% it'll pick the good one and
>>> won't even notice the bad one.
>>>
>>> Thus, while btrfs may randomly bump into a bad block and rewrite it with
>>> the good copy, scrub is the only way to systematically detect and (if
>>> there's a good copy) fix these checksum errors.  It's not that btrfs
>>> doesn't do it if it finds them, it's that the chances of finding them are
>>> relatively low, unless you do a scrub, which systematically checks the
>>> entire filesystem (well, other than files marked nocsum, or nocow, which
>>> implies nocsum, or files written when mounted with nodatacow or
>>> nodatasum).
>>>
>>> At least that's the way it /should/ work.  I guess it's possible that
>>> btrfs isn't doing those routine "bump-into-it-and-fix-it" fixes yet, but
>>> if so, that's the first /I/ remember reading of it.
>>
>> I'm not 100% certain, but I believe it doesn't actually fix things on disk
>> when it detects an error during a read,
>
>     I'm fairly sure it does, as I've had it happen to me. :)
I probably just misinterpreted the source code; while I know enough C to 
generally understand things, I'm by far no expert.
>
>> I know it doesn't it the fs is
>> mounted ro (even if the media is writable), because I did some testing to
>> see how 'read-only' mounting a btrfs filesystem really is.
>
>     If the FS is RO, then yes, it won't fix things.
>
>     Hugo.
>





* Re: What is the vision for btrfs fs repair?
       [not found]     ` <107Y1p00G0wm9Bl0107vjZ>
@ 2014-10-09 12:34       ` Duncan
  2014-10-09 13:18         ` Austin S Hemmelgarn
  0 siblings, 1 reply; 33+ messages in thread
From: Duncan @ 2014-10-09 12:34 UTC (permalink / raw)
  To: Austin S Hemmelgarn; +Cc: linux-btrfs

On Thu, 09 Oct 2014 08:07:51 -0400
Austin S Hemmelgarn <ahferroin7@gmail.com> wrote:

> On 2014-10-09 07:53, Duncan wrote:
> > Austin S Hemmelgarn posted on Thu, 09 Oct 2014 07:29:23 -0400 as
> > excerpted:
> >
> >> Also, you should be running btrfs scrub regularly to correct
> >> bit-rot and force remapping of blocks with read errors.  While
> >> BTRFS technically handles both transparently on reads, it only
> >> corrects thing on disk when you do a scrub.
> >
> > AFAIK that isn't quite correct.  Currently, the number of copies is
> > limited to two, meaning if one of the two is bad, there's a 50%
> > chance of btrfs reading the good one on first try.
> >
> > If btrfs reads the good copy, it simply uses it.  If btrfs reads
> > the bad one, it checks the other one and assuming it's good,
> > replaces the bad one with the good one both for the read (which
> > otherwise errors out), and by overwriting the bad one.
> >
> > But here's the rub.  The chances of detecting that bad block are
> > relatively low in most cases.  First, the system must try reading
> > it for some reason, but even then, chances are 50% it'll pick the
> > good one and won't even notice the bad one.
> >
> > Thus, while btrfs may randomly bump into a bad block and rewrite it
> > with the good copy, scrub is the only way to systematically detect
> > and (if there's a good copy) fix these checksum errors.  It's not
> > that btrfs doesn't do it if it finds them, it's that the chances of
> > finding them are relatively low, unless you do a scrub, which
> > systematically checks the entire filesystem (well, other than files
> > marked nocsum, or nocow, which implies nocsum, or files written
> > when mounted with nodatacow or nodatasum).
> >
> > At least that's the way it /should/ work.  I guess it's possible
> > that btrfs isn't doing those routine "bump-into-it-and-fix-it"
> > fixes yet, but if so, that's the first /I/ remember reading of it.
> 
> I'm not 100% certain, but I believe it doesn't actually fix things on 
> disk when it detects an error during a read, I know it doesn't it the
> fs is mounted ro (even if the media is writable), because I did some 
> testing to see how 'read-only' mounting a btrfs filesystem really is.

Definitely it won't with a read-only mount.  But then scrub shouldn't
be able to write to a read-only mount either.  The only way a read-only
mount should be writable is if it's mounted (bind-mounted or
btrfs-subvolume-mounted) read-write elsewhere, and the write occurs to
that mount, not the read-only mounted location.

There's even debate about replaying the journal or doing orphan-delete
on read-only mounts (at least on-media, the change could, and arguably
should, occur in RAM and be cached, marking the cache "dirty" at the
same time so it's appropriately flushed if/when the filesystem goes
writable), with some arguing read-only means just that, don't
write /anything/ to it until it's read-write mounted.

But writable-mounted, detected checksum errors (with a good copy
available) should be rewritten as far as I know.  If not, I'd call it
a bug.  The problem is in the detection, not in the rewriting.  Scrub's
the only way to reliably detect these errors since it's the only thing
that systematically checks /everything/.

> Also, that's a much better description of how multiple copies work
> than I could probably have ever given.

Thanks.  =:^)

-- 
Duncan - No HTML messages please, as they are filtered as spam.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


* Re: What is the vision for btrfs fs repair?
       [not found]     ` <0zvr1p0162Q6ekd01zvtN0>
@ 2014-10-09 12:42       ` Duncan
  0 siblings, 0 replies; 33+ messages in thread
From: Duncan @ 2014-10-09 12:42 UTC (permalink / raw)
  To: Hugo Mills; +Cc: linux-btrfs

On Thu, 9 Oct 2014 12:55:50 +0100
Hugo Mills <hugo@carfax.org.uk> wrote:

> On Thu, Oct 09, 2014 at 11:53:23AM +0000, Duncan wrote:
> > Austin S Hemmelgarn posted on Thu, 09 Oct 2014 07:29:23 -0400 as
> > excerpted:
> > 
> > > Also, you should be running btrfs scrub regularly to correct
> > > bit-rot and force remapping of blocks with read errors.  While
> > > BTRFS technically handles both transparently on reads, it only
> > > corrects thing on disk when you do a scrub.
> > 
> > AFAIK that isn't quite correct.  Currently, the number of copies is 
> > limited to two, meaning if one of the two is bad, there's a 50%
> > chance of btrfs reading the good one on first try.
> 
>    Scrub checks both copies, though. It's ordinary reads that don't.

While I believe I was clear in full context (see below), agreed.  I was
talking about normal reads in the above, not scrub, as the full quote
should make clear.  I guess I could have made it clearer in the
immediate context, however.  Thanks.

> > Thus, while btrfs may randomly bump into a bad block and rewrite it
> > with the good copy, scrub is the only way to systematically detect
> > and (if there's a good copy) fix these checksum errors.



-- 
Duncan - No HTML messages please, as they are filtered as spam.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


* Re: What is the vision for btrfs fs repair?
  2014-10-09 12:34       ` Duncan
@ 2014-10-09 13:18         ` Austin S Hemmelgarn
  2014-10-09 13:49           ` Duncan
  0 siblings, 1 reply; 33+ messages in thread
From: Austin S Hemmelgarn @ 2014-10-09 13:18 UTC (permalink / raw)
  To: Duncan; +Cc: linux-btrfs


On 2014-10-09 08:34, Duncan wrote:
> On Thu, 09 Oct 2014 08:07:51 -0400
> Austin S Hemmelgarn <ahferroin7@gmail.com> wrote:
>
>> On 2014-10-09 07:53, Duncan wrote:
>>> Austin S Hemmelgarn posted on Thu, 09 Oct 2014 07:29:23 -0400 as
>>> excerpted:
>>>
>>>> Also, you should be running btrfs scrub regularly to correct
>>>> bit-rot and force remapping of blocks with read errors.  While
>>>> BTRFS technically handles both transparently on reads, it only
>>>> corrects thing on disk when you do a scrub.
>>>
>>> AFAIK that isn't quite correct.  Currently, the number of copies is
>>> limited to two, meaning if one of the two is bad, there's a 50%
>>> chance of btrfs reading the good one on first try.
>>>
>>> If btrfs reads the good copy, it simply uses it.  If btrfs reads
>>> the bad one, it checks the other one and assuming it's good,
>>> replaces the bad one with the good one both for the read (which
>>> otherwise errors out), and by overwriting the bad one.
>>>
>>> But here's the rub.  The chances of detecting that bad block are
>>> relatively low in most cases.  First, the system must try reading
>>> it for some reason, but even then, chances are 50% it'll pick the
>>> good one and won't even notice the bad one.
>>>
>>> Thus, while btrfs may randomly bump into a bad block and rewrite it
>>> with the good copy, scrub is the only way to systematically detect
>>> and (if there's a good copy) fix these checksum errors.  It's not
>>> that btrfs doesn't do it if it finds them, it's that the chances of
>>> finding them are relatively low, unless you do a scrub, which
>>> systematically checks the entire filesystem (well, other than files
>>> marked nocsum, or nocow, which implies nocsum, or files written
>>> when mounted with nodatacow or nodatasum).
>>>
>>> At least that's the way it /should/ work.  I guess it's possible
>>> that btrfs isn't doing those routine "bump-into-it-and-fix-it"
>>> fixes yet, but if so, that's the first /I/ remember reading of it.
>>
>> I'm not 100% certain, but I believe it doesn't actually fix things on
>> disk when it detects an error during a read, I know it doesn't it the
>> fs is mounted ro (even if the media is writable), because I did some
>> testing to see how 'read-only' mounting a btrfs filesystem really is.
>
> Definitely it won't with a read-only mount.  But then scrub shouldn't
> be able to write to a read-only mount either.  The only way a read-only
> mount should be writable is if it's mounted (bind-mounted or
> btrfs-subvolume-mounted) read-write elsewhere, and the write occurs to
> that mount, not the read-only mounted location.
In theory yes, but there are caveats to this, namely:
* atime updates still happen unless you have mounted the fs with noatime
* The superblock gets updated if there are 'any' writes
* The free space cache 'might' be updated if there are any writes

All in all, a BTRFS filesystem mounted ro is much more read-only than 
say ext4 (which at least updates the sb, and old versions replayed the 
journal, in addition to the atime updates).
>
> There's even debate about replaying the journal or doing orphan-delete
> on read-only mounts (at least on-media, the change could, and arguably
> should, occur in RAM and be cached, marking the cache "dirty" at the
> same time so it's appropriately flushed if/when the filesystem goes
> writable), with some arguing read-only means just that, don't
> write /anything/ to it until it's read-write mounted.
>
> But writable-mounted, detected checksum errors (with a good copy
> available) should be rewritten as far as I know.  If not, I'd call it
> a bug.  The problem is in the detection, not in the rewriting.  Scrub's
> the only way to reliably detect these errors since it's the only thing
> that systematically checks /everything/.
>
>> Also, that's a much better description of how multiple copies work
>> than I could probably have ever given.
>
> Thanks.  =:^)
>





* Re: What is the vision for btrfs fs repair?
  2014-10-09 13:18         ` Austin S Hemmelgarn
@ 2014-10-09 13:49           ` Duncan
  2014-10-09 15:44             ` Eric Sandeen
  0 siblings, 1 reply; 33+ messages in thread
From: Duncan @ 2014-10-09 13:49 UTC (permalink / raw)
  To: linux-btrfs

Austin S Hemmelgarn posted on Thu, 09 Oct 2014 09:18:22 -0400 as
excerpted:

> On 2014-10-09 08:34, Duncan wrote:

>> The only way a read-only
>> mount should be writable is if it's mounted (bind-mounted or
>> btrfs-subvolume-mounted) read-write elsewhere, and the write occurs to
>> that mount, not the read-only mounted location.

> In theory yes, but there are caveats to this, namely:
> * atime updates still happen unless you have mounted the fs with noatime

I've been mounting noatime for well over a decade now, exactly due to 
such problems.  But I believe at least /some/ filesystems are truly read-
only when they're mounted as such, and atime updates don't happen on them.

These days I actually apply a patch that changes the default relatime to 
noatime, so I don't even have to have it in my mount-options. =:^)

> * The superblock gets updated if there are 'any' writes

Yeah.  At least in theory there shouldn't be any writes, however.  As I 
said, in theory even journal replay and orphan delete shouldn't hit 
media, altho handling them in memory and dirtying the cache, so they get 
written if the filesystem is ever remounted read-write, is reasonable.

> * The free space cache 'might' be updated if there are any writes

Makes sense.  But of course that's what I'm arguing: there shouldn't /be/ 
any writes.  Read-only should mean exactly that; don't touch media, 
period.

I remember at one point activating an mdraid1 degraded, read-only, just a 
single device of the 4-way raid1 I was running at the time, to recover 
data from it after the system it was running in died.  The idea was don't 
write to the device at all, because I was still testing the new system, 
and in case I decided to try to reassemble the raid at some point.  Read-
only really NEEDS to be read-only, under such conditions.

Similarly for forensic examination, of course.  If there's a write, any 
write, it's evidence tampering.  Read-only needs to MEAN read-only!
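
When a write really must not happen, the belt-and-braces approach is to 
set the block device itself read-only before doing anything with it, 
e.g. (device name is a placeholder):

	blockdev --setro /dev/sdb1     # reject writes at the block layer
	mount -o ro /dev/sdb1 /mnt     # any attempted write now fails instead of hitting media
	blockdev --setrw /dev/sdb1     # undo it later, once you're done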

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: What is the vision for btrfs fs repair?
  2014-10-09 13:49           ` Duncan
@ 2014-10-09 15:44             ` Eric Sandeen
  0 siblings, 0 replies; 33+ messages in thread
From: Eric Sandeen @ 2014-10-09 15:44 UTC (permalink / raw)
  To: Duncan, linux-btrfs

On 10/9/14 8:49 AM, Duncan wrote:
> Austin S Hemmelgarn posted on Thu, 09 Oct 2014 09:18:22 -0400 as
> excerpted:
> 
>> On 2014-10-09 08:34, Duncan wrote:
> 
>>> The only way a read-only
>>> mount should be writable is if it's mounted (bind-mounted or
>>> btrfs-subvolume-mounted) read-write elsewhere, and the write occurs to
>>> that mount, not the read-only mounted location.
> 
>> In theory yes, but there are caveats to this, namely:
>> * atime updates still happen unless you have mounted the fs with noatime

Getting off the topic a bit, but that really shouldn't happen:

#define IS_NOATIME(inode)       __IS_FLG(inode, MS_RDONLY|MS_NOATIME)

and in touch_atime():

        if (IS_NOATIME(inode))
                return;

-Eric


* Re: What is the vision for btrfs fs repair?
  2014-10-08 19:11 What is the vision for btrfs fs repair? Eric Sandeen
  2014-10-09 11:29 ` Austin S Hemmelgarn
@ 2014-10-10  1:58 ` Chris Murphy
  2014-10-10  3:20   ` Duncan
                     ` (2 more replies)
  2014-10-12 10:17 ` Martin Steigerwald
  2014-10-13 21:09 ` Josef Bacik
  3 siblings, 3 replies; 33+ messages in thread
From: Chris Murphy @ 2014-10-10  1:58 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: linux-btrfs


On Oct 8, 2014, at 3:11 PM, Eric Sandeen <sandeen@redhat.com> wrote:

> I was looking at Marc's post:
> 
> http://marc.merlins.org/perso/btrfs/post_2014-03-19_Btrfs-Tips_-Btrfs-Scrub-and-Btrfs-Filesystem-Repair.html
> 
> and it feels like there isn't exactly a cohesive, overarching vision for
> repair of a corrupted btrfs filesystem.

It's definitely confusing compared to any other filesystem I've used on four different platforms. And that's when excluding scraping and the functions unique to any multiple device volume: scrubs, degraded mount.

To be fair, mdadm doesn't even have a scrub command; it's done via 'echo check > /sys/block/mdX/md/sync_action'. And meanwhile LVM has pvck, vgck, and for scrubs it's lvchange --syncaction {check|repair}. These are also completely non-obvious.
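
Side by side, the scrub equivalents come out something like this (md 
array, VG/LV and mount point names are placeholders):

	echo check  > /sys/block/md0/md/sync_action   # md: verify only
	echo repair > /sys/block/md0/md/sync_action   # md: rewrite mismatches
	lvchange --syncaction check  myvg/mylv        # LVM RAID scrub
	lvchange --syncaction repair myvg/mylv
	btrfs scrub start /mnt                        # btrfs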

> * mount -o recovery
> 	"Enable autorecovery attempts if a bad tree root is found at mount time."

I'm confused why it's not the default yet. Maybe it's continuing to evolve at a pace that suggests something could sneak in that makes things worse? It is almost an oxymoron in that I'm manually enabling an autorecovery.

If true, maybe the closest indication we'd get of btrfs stability is the default enabling of autorecovery.

> * btrfs-zero-log
> 	"remove the log tree if log tree is corrupt"
> * btrfs rescue
> 	"Recover a damaged btrfs filesystem"
> 	chunk-recover
> 	super-recover
> 	How does this relate to btrfs check?
> * btrfs check
> 	"repair a btrfs filesystem"
> 	--repair
> 	--init-csum-tree
> 	--init-extent-tree
> 	How does this relate to btrfs rescue?

These three translate into eight combinations of repairs; adding -o recovery makes nine. I think this is the main source of confusion: there are just too many options, and it's completely non-obvious which one to use in which situation.

My expectation is that eventually these get consolidated into just check and check --repair. As the repair code matures, it'd go into kernel autorecovery code. That's a guess on my part, but it's consistent with design goals.


> It feels like recovery tools have been badly splintered, and if there's an
> overarching design or vision for btrfs fs repair, I can't tell what it is.
> Can anyone help me?

I suspect it's unintended splintering, and is an artifact that will go away. I'd rather the convoluted, fractured nature of repair go away before the scary experimental warnings do.


Chris Murphy


* Re: What is the vision for btrfs fs repair?
  2014-10-10  1:58 ` Chris Murphy
@ 2014-10-10  3:20   ` Duncan
  2014-10-10 10:53   ` Bob Marley
  2014-10-12 10:06   ` Martin Steigerwald
  2 siblings, 0 replies; 33+ messages in thread
From: Duncan @ 2014-10-10  3:20 UTC (permalink / raw)
  To: linux-btrfs

Chris Murphy posted on Thu, 09 Oct 2014 21:58:53 -0400 as excerpted:

> I suspect it's unintended splintering, and is an artifact that will go
> away. I'd rather the convoluted, fractured nature of repair go away
> before the scary experimental warnings do.

Heh, agreed with everything[1], but it's too late for this: the 
experimental warnings are peeled off, while the experimental, or at least 
horribly immature, /behavior/ remains. =:^(

---
[1] ... and a much more logically cohesive and well structured reply than 
I could have managed as my own thoughts simply weren't that well 
organized.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: What is the vision for btrfs fs repair?
  2014-10-10  1:58 ` Chris Murphy
  2014-10-10  3:20   ` Duncan
@ 2014-10-10 10:53   ` Bob Marley
  2014-10-10 10:59     ` Roman Mamedov
                       ` (2 more replies)
  2014-10-12 10:06   ` Martin Steigerwald
  2 siblings, 3 replies; 33+ messages in thread
From: Bob Marley @ 2014-10-10 10:53 UTC (permalink / raw)
  To: linux-btrfs

On 10/10/2014 03:58, Chris Murphy wrote:
>
>> * mount -o recovery
>> 	"Enable autorecovery attempts if a bad tree root is found at mount time."
> I'm confused why it's not the default yet. Maybe it's continuing to evolve at a pace that suggests something could sneak in that makes things worse? It is almost an oxymoron in that I'm manually enabling an autorecovery
>
> If true, maybe the closest indication we'd get of btrfs stablity is the default enabling of autorecovery.

No way!
I wouldn't want a default like that.

If you think about distributed transactions: suppose a sync was issued on 
both sides of a distributed transaction, then power was lost on one 
side, and then btrfs had corruption. When I remount it, definitely the 
worst thing that can happen is that it auto-rolls-back to a previous 
known-good state.

Now if I can express wishes:

I would like an option that spits out all the usable tree roots (or 
what's the name, superblocks?) and not just the newest one, which is 
corrupt. And then another option that lets me mount *readonly* starting 
from the tree root I specify, so I can check how much of the data is 
still there. Once I decide that such a tree root is good, I need another 
option that lets me mount with that tree root in readwrite mode, 
obviously eliminating all tree roots newer than that.
Some time ago I read that mounting the filesystem with an earlier tree 
root was possible, but only by manually erasing the disk regions in 
which the newer superblocks are. This is crazy; it's too risky on too 
many levels, and also, as I wrote, I want to check what data is available 
on a certain tree root before mounting readwrite from that one.
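
The closest I've found so far only covers the readonly scraping part, and 
only if I trust btrfs-find-root and restore's -t option to do what their 
documentation says (the byte number is a placeholder taken from 
find-root's output):

	btrfs-find-root /dev/sdb1                           # list candidate tree roots and their generations
	btrfs restore -t <bytenr> /dev/sdb1 /mnt/recovery   # copy out what that root can still reach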




* Re: What is the vision for btrfs fs repair?
  2014-10-10 10:53   ` Bob Marley
@ 2014-10-10 10:59     ` Roman Mamedov
  2014-10-10 11:12       ` Bob Marley
  2014-10-10 14:37     ` Chris Murphy
  2014-10-11  7:29     ` Goffredo Baroncelli
  2 siblings, 1 reply; 33+ messages in thread
From: Roman Mamedov @ 2014-10-10 10:59 UTC (permalink / raw)
  To: Bob Marley; +Cc: linux-btrfs


On Fri, 10 Oct 2014 12:53:38 +0200
Bob Marley <bobmarley@shiftmail.org> wrote:

> On 10/10/2014 03:58, Chris Murphy wrote:
> >
> >> * mount -o recovery
> >> 	"Enable autorecovery attempts if a bad tree root is found at mount time."
> > I'm confused why it's not the default yet. Maybe it's continuing to evolve at a pace that suggests something could sneak in that makes things worse? It is almost an oxymoron in that I'm manually enabling an autorecovery
> >
> > If true, maybe the closest indication we'd get of btrfs stablity is the default enabling of autorecovery.
> 
> No way!
> I wouldn't want a default like that.
> 
> If you think at distributed transactions: suppose a sync was issued on 
> both sides of a distributed transaction, then power was lost on one 
> side

What distributed transactions? Btrfs is not a clustered filesystem[1]; it does
not support, and likely never will support, being mounted from multiple hosts
at the same time.

[1]http://en.wikipedia.org/wiki/Clustered_file_system

-- 
With respect,
Roman



* Re: What is the vision for btrfs fs repair?
  2014-10-10 10:59     ` Roman Mamedov
@ 2014-10-10 11:12       ` Bob Marley
  2014-10-10 15:18         ` cwillu
  0 siblings, 1 reply; 33+ messages in thread
From: Bob Marley @ 2014-10-10 11:12 UTC (permalink / raw)
  To: linux-btrfs

On 10/10/2014 12:59, Roman Mamedov wrote:
> On Fri, 10 Oct 2014 12:53:38 +0200
> Bob Marley <bobmarley@shiftmail.org> wrote:
>
>> On 10/10/2014 03:58, Chris Murphy wrote:
>>>> * mount -o recovery
>>>> 	"Enable autorecovery attempts if a bad tree root is found at mount time."
>>> I'm confused why it's not the default yet. Maybe it's continuing to evolve at a pace that suggests something could sneak in that makes things worse? It is almost an oxymoron in that I'm manually enabling an autorecovery
>>>
>>> If true, maybe the closest indication we'd get of btrfs stablity is the default enabling of autorecovery.
>> No way!
>> I wouldn't want a default like that.
>>
>> If you think at distributed transactions: suppose a sync was issued on
>> both sides of a distributed transaction, then power was lost on one
>> side
> What distributed transactions? Btrfs is not a clustered filesystem[1], it does
> not support and likely will never support being mounted from multiple hosts at
> the same time.
>
> [1]http://en.wikipedia.org/wiki/Clustered_file_system
>

This is not the only way to do a distributed transaction.
Databases can be hosted on the filesystem, and those can do distributed 
transactions.
Think of two bank accounts, one in a database on btrfs fs1 here, and 
another bank account in a database on whatever filesystem in another 
country. You want to debit one account and credit the other one: the 
filesystems at the two sides *must not roll back their state* !! 
(especially not transparently without human intervention)



* Re: What is the vision for btrfs fs repair?
  2014-10-10 10:53   ` Bob Marley
  2014-10-10 10:59     ` Roman Mamedov
@ 2014-10-10 14:37     ` Chris Murphy
  2014-10-10 17:43       ` Bob Marley
  2014-10-12 10:14       ` Martin Steigerwald
  2014-10-11  7:29     ` Goffredo Baroncelli
  2 siblings, 2 replies; 33+ messages in thread
From: Chris Murphy @ 2014-10-10 14:37 UTC (permalink / raw)
  To: linux-btrfs


On Oct 10, 2014, at 6:53 AM, Bob Marley <bobmarley@shiftmail.org> wrote:

> On 10/10/2014 03:58, Chris Murphy wrote:
>> 
>>> * mount -o recovery
>>> 	"Enable autorecovery attempts if a bad tree root is found at mount time."
>> I'm confused why it's not the default yet. Maybe it's continuing to evolve at a pace that suggests something could sneak in that makes things worse? It is almost an oxymoron in that I'm manually enabling an autorecovery
>> 
>> If true, maybe the closest indication we'd get of btrfs stablity is the default enabling of autorecovery.
> 
> No way!
> I wouldn't want a default like that.
> 
> If you think at distributed transactions: suppose a sync was issued on both sides of a distributed transaction, then power was lost on one side, than btrfs had corruption. When I remount it, definitely the worst thing that can happen is that it auto-rolls-back to a previous known-good state.

For a general purpose file system, the alternative to losing 30 seconds (or less) of questionably committed, likely corrupt data is a file system that won't mount without user intervention, and which requires a secret decoder ring to get it to mount at all. And it may require the use of specialized tools to retrieve that data in any case.

The fail safe behavior is to treat the known good tree root as the default tree root, and bypass the bad tree root if it cannot be repaired, so that the volume can be mounted with default mount options (i.e. the ones in fstab). Otherwise it's a filesystem that isn't well suited for general purpose use as rootfs let alone for boot.

Chris Murphy



* Re: What is the vision for btrfs fs repair?
  2014-10-10 11:12       ` Bob Marley
@ 2014-10-10 15:18         ` cwillu
  0 siblings, 0 replies; 33+ messages in thread
From: cwillu @ 2014-10-10 15:18 UTC (permalink / raw)
  To: Bob Marley; +Cc: linux-btrfs

If -o recovery is necessary, then you're either running into a btrfs
bug, or your hardware is lying about when it has actually written
things to disk.

The first case isn't unheard of, although far less common than it used
to be, and it should continue to improve with time.

In the second case, you're potentially screwed regardless of the
filesystem, without doing hacks like "wait a good long time before
returning from fsync in the hopes that the disk might actually have
gotten around to performing the write it said had already finished."
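
If you suspect the second case, the usual blunt workaround (at a real 
performance cost) is to turn off the drive's volatile write cache, for 
example on SATA (device name is a placeholder):

	hdparm -W0 /dev/sdX     # disable the write cache
	hdparm -W  /dev/sdX     # query the current setting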

On Fri, Oct 10, 2014 at 5:12 AM, Bob Marley <bobmarley@shiftmail.org> wrote:
> On 10/10/2014 12:59, Roman Mamedov wrote:
>>
>> On Fri, 10 Oct 2014 12:53:38 +0200
>> Bob Marley <bobmarley@shiftmail.org> wrote:
>>
>>> On 10/10/2014 03:58, Chris Murphy wrote:
>>>>>
>>>>> * mount -o recovery
>>>>>         "Enable autorecovery attempts if a bad tree root is found at
>>>>> mount time."
>>>>
>>>> I'm confused why it's not the default yet. Maybe it's continuing to
>>>> evolve at a pace that suggests something could sneak in that makes things
>>>> worse? It is almost an oxymoron in that I'm manually enabling an
>>>> autorecovery
>>>>
>>>> If true, maybe the closest indication we'd get of btrfs stablity is the
>>>> default enabling of autorecovery.
>>>
>>> No way!
>>> I wouldn't want a default like that.
>>>
>>> If you think at distributed transactions: suppose a sync was issued on
>>> both sides of a distributed transaction, then power was lost on one
>>> side
>>
>> What distributed transactions? Btrfs is not a clustered filesystem[1], it
>> does
>> not support and likely will never support being mounted from multiple
>> hosts at
>> the same time.
>>
>> [1]http://en.wikipedia.org/wiki/Clustered_file_system
>>
>
> This is not the only way to do a distributed transaction.
> Databases can be hosted on the filesystem, and those can do distributed
> transations.
> Think of two bank accounts, one on btrfs fs1 here, and another bank account
> on database on a whatever filesystem in another country. You want to debit
> one account and credit the other one: the filesystems at the two sides *must
> not rollback their state* !! (especially not transparently without human
> intervention)
>
>


* Re: What is the vision for btrfs fs repair?
  2014-10-10 14:37     ` Chris Murphy
@ 2014-10-10 17:43       ` Bob Marley
  2014-10-10 17:53         ` Bardur Arantsson
  2014-10-10 19:35         ` Austin S Hemmelgarn
  2014-10-12 10:14       ` Martin Steigerwald
  1 sibling, 2 replies; 33+ messages in thread
From: Bob Marley @ 2014-10-10 17:43 UTC (permalink / raw)
  To: linux-btrfs

On 10/10/2014 16:37, Chris Murphy wrote:
> The fail safe behavior is to treat the known good tree root as the default tree root, and bypass the bad tree root if it cannot be repaired, so that the volume can be mounted with default mount options (i.e. the ones in fstab). Otherwise it's a filesystem that isn't well suited for general purpose use as rootfs let alone for boot.
>

A filesystem which is suited for "general purpose" use is a filesystem 
which honors fsync, and doesn't *ever* auto-roll-back without user 
intervention.

Anything different is not suited for database transactions at all. Any 
paid service which has its user database on btrfs is going to be at 
risk of losing payments, and probably without the company even knowing. 
If btrfs goes this way I hope a big warning is written on the wiki and 
in the manpages saying that this filesystem is totally unsuitable for 
hosting databases performing transactions.

At most I can suggest that a flag in the metadata be added to 
allow/disallow auto-roll-back-on-error on such a filesystem, so people 
can choose between "tolerant" and "transaction-safe" modes at filesystem 
creation.



* Re: What is the vision for btrfs fs repair?
  2014-10-10 17:43       ` Bob Marley
@ 2014-10-10 17:53         ` Bardur Arantsson
  2014-10-10 19:35         ` Austin S Hemmelgarn
  1 sibling, 0 replies; 33+ messages in thread
From: Bardur Arantsson @ 2014-10-10 17:53 UTC (permalink / raw)
  To: linux-btrfs

On 2014-10-10 19:43, Bob Marley wrote:
> On 10/10/2014 16:37, Chris Murphy wrote:
>> The fail safe behavior is to treat the known good tree root as the
>> default tree root, and bypass the bad tree root if it cannot be
>> repaired, so that the volume can be mounted with default mount options
>> (i.e. the ones in fstab). Otherwise it's a filesystem that isn't well
>> suited for general purpose use as rootfs let alone for boot.
>>
> 
> A filesystem which is suited for "general purpose" use is a filesystem
> which honors fsync, and doesn't *ever* auto-roll-back without user
> intervention.
> 

A file system cannot do anything about the *DISKS* not honouring a sync
command. That's what the PP was talking about.





* Re: What is the vision for btrfs fs repair?
  2014-10-10 17:43       ` Bob Marley
  2014-10-10 17:53         ` Bardur Arantsson
@ 2014-10-10 19:35         ` Austin S Hemmelgarn
  2014-10-10 22:05           ` Eric Sandeen
  1 sibling, 1 reply; 33+ messages in thread
From: Austin S Hemmelgarn @ 2014-10-10 19:35 UTC (permalink / raw)
  To: Bob Marley, linux-btrfs


On 2014-10-10 13:43, Bob Marley wrote:
> On 10/10/2014 16:37, Chris Murphy wrote:
>> The fail safe behavior is to treat the known good tree root as the
>> default tree root, and bypass the bad tree root if it cannot be
>> repaired, so that the volume can be mounted with default mount options
>> (i.e. the ones in fstab). Otherwise it's a filesystem that isn't well
>> suited for general purpose use as rootfs let alone for boot.
>>
>
> A filesystem which is suited for "general purpose" use is a filesystem
> which honors fsync, and doesn't *ever* auto-roll-back without user
> intervention.
>
> Anything different is not suited for database transactions at all. Any
> paid service which has the users database on btrfs is going to be at
> risk of losing payments, and probably without the company even knowing.
> If btrfs goes this way I hope a big warning is written on the wiki and
> on the manpages telling that this filesystem is totally unsuitable for
> hosting databases performing transactions.
If they need reliability, they should have some form of redundancy 
in place and/or run the database directly on the block device, because 
even ext4, XFS, and pretty much every other filesystem can lose data 
sometimes.  The difference is that those tend to give worse results 
when hardware is misbehaving than BTRFS does, because BTRFS usually has 
an old copy of whatever data structure gets corrupted to fall back on.

Also, you really shouldn't be running databases on a BTRFS filesystem at 
the moment anyway, because of the significant performance implications.
>
> At most I can suggest that a flag in the metadata be added to
> allow/disallow auto-roll-back-on-error on such filesystem, so people can
> decide the "tolerant" vs. "transaction-safe" mode at filesystem creation.
>

The problem with this is that if the auto-recovery code did run (and 
IMHO the kernel should spit out a warning to the system log whenever it 
does), then chances are that you wouldn't have had a consistent view if 
you had prevented it from running either; and, if the database is 
properly distributed/replicated, then it should recover by itself.





* Re: What is the vision for btrfs fs repair?
  2014-10-10 19:35         ` Austin S Hemmelgarn
@ 2014-10-10 22:05           ` Eric Sandeen
  2014-10-13 11:26             ` Austin S Hemmelgarn
  0 siblings, 1 reply; 33+ messages in thread
From: Eric Sandeen @ 2014-10-10 22:05 UTC (permalink / raw)
  To: Austin S Hemmelgarn, Bob Marley, linux-btrfs

On 10/10/14 2:35 PM, Austin S Hemmelgarn wrote:
> On 2014-10-10 13:43, Bob Marley wrote:
>> On 10/10/2014 16:37, Chris Murphy wrote:
>>> The fail safe behavior is to treat the known good tree root as
>>> the default tree root, and bypass the bad tree root if it cannot
>>> be repaired, so that the volume can be mounted with default mount
>>> options (i.e. the ones in fstab). Otherwise it's a filesystem
>>> that isn't well suited for general purpose use as rootfs let
>>> alone for boot.
>>> 
>> 
>> A filesystem which is suited for "general purpose" use is a
>> filesystem which honors fsync, and doesn't *ever* auto-roll-back
>> without user intervention.
>> 
>> Anything different is not suited for database transactions at all.
>> Any paid service which has the users database on btrfs is going to
>> be at risk of losing payments, and probably without the company
>> even knowing. If btrfs goes this way I hope a big warning is
>> written on the wiki and on the manpages telling that this
>> filesystem is totally unsuitable for hosting databases performing
>> transactions.
> If they need reliability, they should have some form of redundancy
> in-place and/or run the database directly on the block device;
> because even ext4, XFS, and pretty much every other filesystem can
> lose data sometimes,

Not if, say, fsync returns successfully.  If the data is gone later, it's a hardware
problem, or occasionally a bug - and such bugs are usually found & fixed
pretty quickly.

> the difference being that those tend to give
> worse results when hardware is misbehaving than BTRFS does, because
> BTRFS usually has a old copy of whatever data structure gets
> corrupted to fall back on.

I'm curious, is that based on conjecture or real-world testing?

-Eric


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: What is the vision for btrfs fs repair?
  2014-10-10 10:53   ` Bob Marley
  2014-10-10 10:59     ` Roman Mamedov
  2014-10-10 14:37     ` Chris Murphy
@ 2014-10-11  7:29     ` Goffredo Baroncelli
  2014-11-17 20:55       ` Phillip Susi
  2 siblings, 1 reply; 33+ messages in thread
From: Goffredo Baroncelli @ 2014-10-11  7:29 UTC (permalink / raw)
  To: Bob Marley, linux-btrfs

On 10/10/2014 12:53 PM, Bob Marley wrote:
>> 
>> If true, maybe the closest indication we'd get of btrfs stablity is
>> the default enabling of autorecovery.
> 
> No way! I wouldn't want a default like that.
> 
> If you think at distributed transactions: suppose a sync was issued
> on both sides of a distributed transaction, then power was lost on
> one side, than btrfs had corruption. When I remount it, definitely
> the worst thing that can happen is that it auto-rolls-back to a
> previous known-good state.

I cannot agree. I consider a sane default to be a consistent state where 
"the most recently written data is lost", rather than one where "user 
intervention is required so that nothing is lost".

To address your requirement, we would need a "super sync" command which
ensures that the data is in the filesystem proper and not only
in the log (as sync should ensure).
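
For what it's worth, something close to such a "super sync" may already 
be reachable from userspace: syncfs(2) flushes the whole filesystem a 
descriptor belongs to, which on btrfs should amount to a full transaction 
commit rather than only a log-tree write.  A minimal sketch under that 
assumption (this is my reading, not something the btrfs documentation 
promises; the helper name is made up):

/* Hedged sketch: approximating a "super sync" with syncfs(2).
 * Assumption: on btrfs, syncfs() forces a full transaction commit,
 * so the data no longer depends on log replay to survive. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int super_sync(const char *path_on_fs)    /* any path on the target filesystem */
{
        int fd = open(path_on_fs, O_RDONLY);

        if (fd < 0) {
                perror("open");
                return -1;
        }
        if (syncfs(fd) != 0) {            /* flush the entire filesystem */
                perror("syncfs");
                close(fd);
                return -1;
        }
        return close(fd);
}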

BR

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: What is the vision for btrfs fs repair?
  2014-10-10  1:58 ` Chris Murphy
  2014-10-10  3:20   ` Duncan
  2014-10-10 10:53   ` Bob Marley
@ 2014-10-12 10:06   ` Martin Steigerwald
  2 siblings, 0 replies; 33+ messages in thread
From: Martin Steigerwald @ 2014-10-12 10:06 UTC (permalink / raw)
  To: Chris Murphy, linux-btrfs; +Cc: Eric Sandeen

On Thursday, 9 October 2014, 21:58:53 you wrote:
> > * btrfs-zero-log
> >       "remove the log tree if log tree is corrupt"
> > * btrfs rescue
> >       "Recover a damaged btrfs filesystem"
> >       chunk-recover
> >       super-recover
> >       How does this relate to btrfs check?
> > * btrfs check
> >       "repair a btrfs filesystem"
> >       --repair
> >       --init-csum-tree
> >       --init-extent-tree
> >       How does this relate to btrfs rescue?
> 
> These three translate into eight combinations of repairs, adding -o recovery
> there are 9 combinations. I think this is the main source of confusion,
> there are just too many options, but also it's completely non-obvious which
> one to use in which situation.
> 
> My expectation is that eventually these get consolidated into just check and
> check --repair. As the repair code matures, it'd go into kernel
> autorecovery code. That's a guess on my part, but it's consistent with
> design goals.

Also, I think these should at least all be under the btrfs command.

So include btrfs-zero-log in the btrfs command.

And how about "btrfs repair" or "btrfs check" as an upper-level category, with 
the various options added as subcommands below it? Then there is at least one
command, and one place in the manpage, to learn about the various options.

But maybe some can be made automatic as well, or folded into btrfs check --
repair. Ideally it would auto-detect which path to take on filesystem 
recovery.

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: What is the vision for btrfs fs repair?
  2014-10-10 14:37     ` Chris Murphy
  2014-10-10 17:43       ` Bob Marley
@ 2014-10-12 10:14       ` Martin Steigerwald
  2014-10-12 23:59         ` Duncan
                           ` (2 more replies)
  1 sibling, 3 replies; 33+ messages in thread
From: Martin Steigerwald @ 2014-10-12 10:14 UTC (permalink / raw)
  To: Chris Murphy; +Cc: linux-btrfs

On Friday, 10 October 2014, 10:37:44 Chris Murphy wrote:
> On Oct 10, 2014, at 6:53 AM, Bob Marley <bobmarley@shiftmail.org> wrote:
> > On 10/10/2014 03:58, Chris Murphy wrote:
> >>> * mount -o recovery
> >>> 
> >>> 	"Enable autorecovery attempts if a bad tree root is found at mount
> >>> 	time."
> >> 
> >> I'm confused why it's not the default yet. Maybe it's continuing to
> >> evolve at a pace that suggests something could sneak in that makes
> >> things worse? It is almost an oxymoron in that I'm manually enabling an
> >> autorecovery
> >> 
> >> If true, maybe the closest indication we'd get of btrfs stablity is the
> >> default enabling of autorecovery.> 
> > No way!
> > I wouldn't want a default like that.
> > 
> > If you think at distributed transactions: suppose a sync was issued on
> > both sides of a distributed transaction, then power was lost on one side,
> > than btrfs had corruption. When I remount it, definitely the worst thing
> > that can happen is that it auto-rolls-back to a previous known-good
> > state.
> For a general purpose file system, losing 30 seconds (or less) of
> questionably committed data, likely corrupt, is a file system that won't
> mount without user intervention, which requires a secret decoder ring to
> get it to mount at all. And may require the use of specialized tools to
> retrieve that data in any case.
> 
> The fail safe behavior is to treat the known good tree root as the default
> tree root, and bypass the bad tree root if it cannot be repaired, so that
> the volume can be mounted with default mount options (i.e. the ones in
> fstab). Otherwise it's a filesystem that isn't well suited for general
> purpose use as rootfs let alone for boot.

To understand this a bit better:

What can be the reasons that a recent tree gets corrupted?

I always thought that with a controller, device, and driver combination that 
honors fsync, BTRFS would give you either the new state or the last known 
good state *anyway*. So where does the need to roll back arise from?

That said, all journalling filesystems have some sort of rollback as far as I 
understand: if the last journal entry is incomplete, they discard it on journal 
replay. So even there you lose the last seconds of write activity.

But once fsync() returns, the data needs to be safe on disk. I always 
thought BTRFS honors this under *any* circumstance. If some proposed 
auto-rollback breaks this guarantee, I think something is broken elsewhere.

And fsync is an fsync is an fsync. Its semantics are crystal clear. There 
is nothing, absolutely nothing to discuss about it.

An fsync completes only once the device itself has reported "Yeah, I have 
the data on disk, all safe and cool to go". Anything else is a bug IMO.
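
To make that expectation concrete, here is a minimal C sketch (mine, not 
anything btrfs-specific) of the contract an application relies on: if 
every call below returns success, the data is expected to be on stable 
storage before the helper returns, no matter what the filesystem does 
with its trees afterwards.  The helper name is made up; short writes and 
EINTR handling are omitted for brevity.

/* Hedged sketch of the fsync() contract being discussed. */
#include <fcntl.h>
#include <unistd.h>

static int write_durably(const char *path, const void *buf, size_t len)
{
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

        if (fd < 0)
                return -1;
        if (write(fd, buf, len) != (ssize_t)len) {
                close(fd);
                return -1;
        }
        if (fsync(fd) != 0) {   /* must not return before the device reports the data safe */
                close(fd);
                return -1;
        }
        return close(fd);
}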

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: What is the vision for btrfs fs repair?
  2014-10-08 19:11 What is the vision for btrfs fs repair? Eric Sandeen
  2014-10-09 11:29 ` Austin S Hemmelgarn
  2014-10-10  1:58 ` Chris Murphy
@ 2014-10-12 10:17 ` Martin Steigerwald
  2014-10-13 21:09 ` Josef Bacik
  3 siblings, 0 replies; 33+ messages in thread
From: Martin Steigerwald @ 2014-10-12 10:17 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: linux-btrfs

On Wednesday, 8 October 2014, 14:11:51 Eric Sandeen wrote:
> I was looking at Marc's post:
> 
> http://marc.merlins.org/perso/btrfs/post_2014-03-19_Btrfs-Tips_-Btrfs-Scrub-and-Btrfs-Filesystem-Repair.html
> 
> and it feels like there isn't exactly a cohesive, overarching vision for
> repair of a corrupted btrfs filesystem.
> 
> In other words - I'm an admin cruising along, when the kernel throws some
> fs corruption error, or for whatever reason btrfs fails to mount.
> What should I do?
> 
> Marc lays out several steps, but to me this highlights that there seem to
> be a lot of disjoint mechanisms out there to deal with these problems;
> mostly from Marc's blog, with some bits of my own:
> 
> * btrfs scrub
> 	"Errors are corrected along if possible" (what *is* possible?)
> * mount -o recovery
> 	"Enable autorecovery attempts if a bad tree root is found at mount time."
> * mount -o degraded
> 	"Allow mounts to continue with missing devices."
> 	(This isn't really a way to recover from corruption, right?)
> * btrfs-zero-log
> 	"remove the log tree if log tree is corrupt"
> * btrfs rescue
> 	"Recover a damaged btrfs filesystem"
> 	chunk-recover
> 	super-recover
> 	How does this relate to btrfs check?
> * btrfs check
> 	"repair a btrfs filesystem"
> 	--repair
> 	--init-csum-tree
> 	--init-extent-tree
> 	How does this relate to btrfs rescue?
> * btrfs restore
> 	"try to salvage files from a damaged filesystem"
> 	(not really repair, it's disk-scraping)
> 
> 
> What's the vision for, say, scrub vs. check vs. rescue?  Should they repair
> the same errors, only online vs. offline?  If not, what class of errors
> does one fix vs. the other?  How would an admin know?  Can btrfs check
> recover a bad tree root in the same way that mount -o recovery does?  How
> would I know if I should use --init-*-tree, or chunk-recover, and what are
> the ramifications of using these options?
> 
> It feels like recovery tools have been badly splintered, and if there's an
> overarching design or vision for btrfs fs repair, I can't tell what it is.
> Can anyone help me?

How about taking one step back:

What are the possible corruption cases these tools are meant to address? 
*Where* can BTRFS break and *why*?

What of it can be folded into one command? Where can BTRFS be improved to 
either prevent a corruption from happening or correct it automatically? 
Which actions can the repair tool determine automatically? What needs to 
be left as options for the user to choose from? And what guidance would the 
user need to decide?

I.e. really go back to what diagnosing and repairing BTRFS actually 
involves, and then well… work out a vision for how this all can fit together, 
as you suggested.

At a minimum I suggest having all possible options as a main category in 
the btrfs command, with no external commands whatsoever; so if btrfs-zero-log 
is still needed, add it to the btrfs command.

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: What is the vision for btrfs fs repair?
  2014-10-12 10:14       ` Martin Steigerwald
@ 2014-10-12 23:59         ` Duncan
  2014-10-13 11:37         ` Austin S Hemmelgarn
  2014-10-13 11:48         ` Rich Freeman
  2 siblings, 0 replies; 33+ messages in thread
From: Duncan @ 2014-10-12 23:59 UTC (permalink / raw)
  To: linux-btrfs

Martin Steigerwald posted on Sun, 12 Oct 2014 12:14:01 +0200 as excerpted:

> I always thought that with a controller, device, and driver combination
> that honors fsync, BTRFS would give you either the new state or the
> last known good state *anyway*. So where does the need to roll back arise
> from?

My understanding here is...

With btrfs a full-tree commit is atomic.  You should get either the old 
tree or the new tree.  However, due to the cascading nature of updates on 
cow-based structures, these full-tree commits are done by default 
(there's a mount-option to adjust it) every 30 seconds.  Between these 
atomic commits partial updates may have occurred.  The btrfs log (the one 
that btrfs-zero-log kills) is limited to between-commit updates, and thus 
to the upto 30 seconds (default) worth of changes since the last full-
tree atomic commit.

In addition to that, there's a history of tree-root commits kept (with 
the superblocks pointing to the last one).  Btrfs-find-root can be 
used to list this history.  The recovery mount option simply allows btrfs 
to fall back to this history, should the current root be corrupted.  
Btrfs restore can be used to list tree roots as well, and can be pointed 
at an appropriate one if necessary.

Fsync forces the file and its corresponding metadata update to the log 
and barring hardware or software bugs should not return until it's safely 
in the log, but I'm not sure whether it forces a full-tree commit.  
Either way the guarantees should be the same.  If the log can be replayed 
or a full-tree commit has occurred since the fsync, the new copy should 
appear.  If it can't, the rollback to the last atomic tree commit should 
return an intact copy of the file from that point.  If the recovery mount 
option is used and a further rollback to an earlier full-tree commit is 
forced, provided it existed at the point of that full-tree commit, the 
intact file at that point should appear.

So if the current tree root is a good one, the log will replay the last 
up to 30 seconds of activity on top of that last atomic tree root.  If the 
current root tree itself is corrupt, the recovery mount option will let 
an earlier one be used.  Obviously in that case the log will be discarded 
since it applies to a later root tree that itself has been discarded.

The debate is whether recovery should be automated so the admin doesn't 
have to care about it, or whether having to manually add that option 
serves as a necessary notifier to the admin that something /did/ go 
wrong, and that an earlier root is being used instead, so more than a few 
seconds worth of data may have disappeared.


As someone else has already suggested, I'd argue that as long as btrfs 
continues to be under the sort of development it's in now, keeping 
recovery as a non-default option is desired.  Once it's optimized and 
considered stable, arguably recovery should be made the default, perhaps 
with a no-recovery option for those who prefer that in-the-face 
notification in the form of a mount error, if btrfs would otherwise fall 
back to an earlier tree root commit.

What worries me, however, is that IMO the recent warning stripping was 
premature.  Btrfs is certainly NOT fully stable or optimized for normal 
use at this point.  We're still using the even/odd PID balancing scheme 
for raid1 reads, for instance, and multi-device writes are still 
serialized when they could be parallelized to a much larger degree (though 
keeping some serialization is arguably good for data safety).  Arguably, 
optimizing that now would be premature optimization since the code itself 
is still subject to change, so I'm not complaining.  But by that very same 
token, it *IS* still subject to change, which by definition means it's 
*NOT* stable - so why are we removing all the warnings and giving the 
impression that it IS stable?

The decision wasn't mine to make and I don't know, but while a nice 
suggestion, making recovery-by-default a measure of when btrfs goes 
stable simply won't work, because surely the same folks behind the 
warning stripping would then make sure this indicator, too, said btrfs was 
stable, while the state of the code itself continues to say otherwise.

Meanwhile, if your distributed transactions scenario doesn't account for 
crash and loss of data on one side with real-time backup/redundancy, such 
that loss of a few seconds worth of transactions on a single local 
filesystem is going to kill the entire scenario, I don't think too much 
of that scenario in the first place, and regardless, btrfs, certainly in 
its current state, is definitely NOT an appropriate base for it.  Use 
appropriate tools for the task.  Btrfs at least at this point is simply 
not an appropriate tool for that task.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: What is the vision for btrfs fs repair?
  2014-10-10 22:05           ` Eric Sandeen
@ 2014-10-13 11:26             ` Austin S Hemmelgarn
  0 siblings, 0 replies; 33+ messages in thread
From: Austin S Hemmelgarn @ 2014-10-13 11:26 UTC (permalink / raw)
  To: Eric Sandeen, Bob Marley, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 2993 bytes --]

On 2014-10-10 18:05, Eric Sandeen wrote:
> On 10/10/14 2:35 PM, Austin S Hemmelgarn wrote:
>> On 2014-10-10 13:43, Bob Marley wrote:
>>> On 10/10/2014 16:37, Chris Murphy wrote:
>>>> The fail safe behavior is to treat the known good tree root as
>>>> the default tree root, and bypass the bad tree root if it cannot
>>>> be repaired, so that the volume can be mounted with default mount
>>>> options (i.e. the ones in fstab). Otherwise it's a filesystem
>>>> that isn't well suited for general purpose use as rootfs let
>>>> alone for boot.
>>>>
>>>
>>> A filesystem which is suited for "general purpose" use is a
>>> filesystem which honors fsync, and doesn't *ever* auto-roll-back
>>> without user intervention.
>>>
>>> Anything different is not suited for database transactions at all.
>>> Any paid service which has the users database on btrfs is going to
>>> be at risk of losing payments, and probably without the company
>>> even knowing. If btrfs goes this way I hope a big warning is
>>> written on the wiki and on the manpages telling that this
>>> filesystem is totally unsuitable for hosting databases performing
>>> transactions.
>> If they need reliability, they should have some form of redundancy
>> in-place and/or run the database directly on the block device;
>> because even ext4, XFS, and pretty much every other filesystem can
>> lose data sometimes,
>
> Not if i.e. fsync returns.  If the data is gone later, it's a hardware
> problem, or occasionally a bug - bugs that are usually found & fixed
> pretty quickly.
Yes, barring bugs and hardware problems they won't lose data.
>
>> the difference being that those tend to give
>> worse results when hardware is misbehaving than BTRFS does, because
>> BTRFS usually has a old copy of whatever data structure gets
>> corrupted to fall back on.
>
> I'm curious, is that based on conjecture or real-world testing?
>
I wouldn't really call it testing, but based on personal experience I 
know that ext4 can lose whole directory sub-trees if it gets a single 
corrupt sector in the wrong place.  I've also had that happen on FAT32 
and (somewhat interestingly) HFS+ with failing/misbehaving hardware; and 
I've actually had individual files disappear on HFS+ without any 
discernible hardware issues.  I don't have as much experience with XFS, 
but would assume based on what I do know of it that it could have 
similar issues.  As for BTRFS, I've only ever had issues with it 3 
times: one was due to the kernel panicking during resume from S1, and 
the other two were due to hardware problems that would have caused 
issues on most other filesystems as well.  In both cases of hardware 
issues, while the filesystem was initially unmountable, it was 
relatively simple to fix once I knew how.  I tried to fix an ext4 fs 
that had become unmountable due to dropped writes once, and that was 
anything but simple, even with the much greater amount of documentation.



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: What is the vision for btrfs fs repair?
  2014-10-12 10:14       ` Martin Steigerwald
  2014-10-12 23:59         ` Duncan
@ 2014-10-13 11:37         ` Austin S Hemmelgarn
  2014-10-13 11:48         ` Rich Freeman
  2 siblings, 0 replies; 33+ messages in thread
From: Austin S Hemmelgarn @ 2014-10-13 11:37 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: Chris Murphy, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 4031 bytes --]

On 2014-10-12 06:14, Martin Steigerwald wrote:
On Friday, 10 October 2014, 10:37:44 Chris Murphy wrote:
>> On Oct 10, 2014, at 6:53 AM, Bob Marley <bobmarley@shiftmail.org> wrote:
>>> On 10/10/2014 03:58, Chris Murphy wrote:
>>>>> * mount -o recovery
>>>>>
>>>>> 	"Enable autorecovery attempts if a bad tree root is found at mount
>>>>> 	time."
>>>>
>>>> I'm confused why it's not the default yet. Maybe it's continuing to
>>>> evolve at a pace that suggests something could sneak in that makes
>>>> things worse? It is almost an oxymoron in that I'm manually enabling an
>>>> autorecovery
>>>>
>>>> If true, maybe the closest indication we'd get of btrfs stablity is the
>>>> default enabling of autorecovery.>
>>> No way!
>>> I wouldn't want a default like that.
>>>
>>> If you think at distributed transactions: suppose a sync was issued on
>>> both sides of a distributed transaction, then power was lost on one side,
>>> than btrfs had corruption. When I remount it, definitely the worst thing
>>> that can happen is that it auto-rolls-back to a previous known-good
>>> state.
>> For a general purpose file system, losing 30 seconds (or less) of
>> questionably committed data, likely corrupt, is a file system that won't
>> mount without user intervention, which requires a secret decoder ring to
>> get it to mount at all. And may require the use of specialized tools to
>> retrieve that data in any case.
>>
>> The fail safe behavior is to treat the known good tree root as the default
>> tree root, and bypass the bad tree root if it cannot be repaired, so that
>> the volume can be mounted with default mount options (i.e. the ones in
>> fstab). Otherwise it's a filesystem that isn't well suited for general
>> purpose use as rootfs let alone for boot.
>
> To understand this a bit better:
>
> What can be the reasons a recent tree gets corrupted?
>
Well, so far I have had the following cause corrupted trees:
1. Kernel panic during resume from ACPI S1 (suspend to RAM), which just 
happened to be in the middle of a tree commit.
2. Generic power loss during a tree commit.
3. A device not properly honoring write-barriers (the operations 
immediately adjacent to the write barrier weren't being ordered 
correctly all the time).

Based on what I know about BTRFS, the following could also cause problems:
1. A single-event-upset somewhere in the write path.
2. The kernel issuing a write to the wrong device (I haven't had this 
happen to me, but know people who have).

In general, any of these will cause problems for pretty much any 
filesystem, not just BTRFS.
> I always thought that with a controller, device, and driver combination that
> honors fsync, BTRFS would give you either the new state or the last known
> good state *anyway*. So where does the need to roll back arise from?
>
I think that in this case the term rollback is a bit ambiguous; here it 
means from the point of view of userspace, which sees the FS as having 
'rolled back' from the most recent state to the last known good state.
> That said all journalling filesystems have some sort of rollback as far as I
> understand: If the last journal entry is incomplete they discard it on journal
> replay. So even there you lose the last seconds of write activity.
>
> But in case fsync() returns the data needs to be safe on disk. I always
> thought BTRFS honors this under *any* circumstance. If some proposed
> autorollback breaks this guarentee, I think something is broke elsewhere.
>
> And fsync is an fsync is an fsync. Its semantics are clear as crystal. There
> is nothing, absolutely nothing to discuss about it.
>
> An fsync completes if the device itself reported "Yeah, I have the data on
> disk, all safe and cool to go". Anything else is a bug IMO.
>
Or a hardware issue: most filesystems need disks to properly honor write 
barriers to provide the guaranteed semantics of fsync, and many consumer 
disk drives still don't honor them consistently.



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: What is the vision for btrfs fs repair?
  2014-10-12 10:14       ` Martin Steigerwald
  2014-10-12 23:59         ` Duncan
  2014-10-13 11:37         ` Austin S Hemmelgarn
@ 2014-10-13 11:48         ` Rich Freeman
  2 siblings, 0 replies; 33+ messages in thread
From: Rich Freeman @ 2014-10-13 11:48 UTC (permalink / raw)
  To: Martin Steigerwald; +Cc: Chris Murphy, linux-btrfs

On Sun, Oct 12, 2014 at 6:14 AM, Martin Steigerwald <Martin@lichtvoll.de> wrote:
> Am Freitag, 10. Oktober 2014, 10:37:44 schrieb Chris Murphy:
>> On Oct 10, 2014, at 6:53 AM, Bob Marley <bobmarley@shiftmail.org> wrote:
>> > On 10/10/2014 03:58, Chris Murphy wrote:
>> >>> * mount -o recovery
>> >>>
>> >>>   "Enable autorecovery attempts if a bad tree root is found at mount
>> >>>   time."
>> >>
>> >> I'm confused why it's not the default yet. Maybe it's continuing to
>> >> evolve at a pace that suggests something could sneak in that makes
>> >> things worse? It is almost an oxymoron in that I'm manually enabling an
>> >> autorecovery
>> >>
>> >> If true, maybe the closest indication we'd get of btrfs stablity is the
>> >> default enabling of autorecovery.>
>> > No way!
>> > I wouldn't want a default like that.
>> >
>> > If you think at distributed transactions: suppose a sync was issued on
>> > both sides of a distributed transaction, then power was lost on one side,
>> > than btrfs had corruption. When I remount it, definitely the worst thing
>> > that can happen is that it auto-rolls-back to a previous known-good
>> > state.
>> For a general purpose file system, losing 30 seconds (or less) of
>> questionably committed data, likely corrupt, is a file system that won't
>> mount without user intervention, which requires a secret decoder ring to
>> get it to mount at all. And may require the use of specialized tools to
>> retrieve that data in any case.
>>
>> The fail safe behavior is to treat the known good tree root as the default
>> tree root, and bypass the bad tree root if it cannot be repaired, so that
>> the volume can be mounted with default mount options (i.e. the ones in
>> fstab). Otherwise it's a filesystem that isn't well suited for general
>> purpose use as rootfs let alone for boot.
>
> To understand this a bit better:
>
> What can be the reasons a recent tree gets corrupted?
>
> I always thought that with a controller, device, and driver combination that
> honors fsync, BTRFS would give you either the new state or the last known
> good state *anyway*. So where does the need to roll back arise from?
>

In theory the recovery option should never be necessary.  Btrfs makes
all the guarantees everybody wants it to - once data is fsynced,
it will never be lost.

The question is what should happen when a corrupted tree root, which
should never happen, happens anyway.  The options are to refuse to
mount the filesystem by default, or mount it by default discarding
about 30-60s worth of writes.  And yes, when this situation happens
(whether it mounts by default or not) btrfs has broken its promise of
data being written after a successful fsync return.

As has been pointed out, braindead drive firmware is the most likely
cause of this sort of issue.  However, there are a number of other
hardware and software errors that could cause it, including errors in
linux outside of btrfs, and of course bugs in btrfs as well.

In an ideal world no filesystem would need any kind of recovery/repair
tools.  They can often mean that the fsync promise was broken.  The
real question is, once that has happened, how do you move on?

I think the best default is to auto-recover, but to have better
facilities for reporting errors to the user.  Right now btrfs is very
quiet about failures - maybe a cryptic message in dmesg, and nobody
reads all of that unless they're looking for something.  If btrfs
could report significant issues more visibly, that might mitigate the
impact of an auto-recovery.

Also, another thing to consider during recovery is whether the damaged
data could be optionally stored in a snapshot of some kind - maybe in
the way that ext3/4 rollback data after conversion gets stored in a
snapshot.  My knowledge of the underlying structures is weak, but I'd
think that a corrupted tree root practically is a snapshot already,
and turning it into one might even be easier than cleaning it up.  Of
course, we would need to ensure the snapshot could be deleted without
further error.  Doing anything with the snapshot might require special
tools, but if people want to do disk scraping they could.

--
Rich

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: What is the vision for btrfs fs repair?
  2014-10-08 19:11 What is the vision for btrfs fs repair? Eric Sandeen
                   ` (2 preceding siblings ...)
  2014-10-12 10:17 ` Martin Steigerwald
@ 2014-10-13 21:09 ` Josef Bacik
  3 siblings, 0 replies; 33+ messages in thread
From: Josef Bacik @ 2014-10-13 21:09 UTC (permalink / raw)
  To: Eric Sandeen, linux-btrfs

On 10/08/2014 03:11 PM, Eric Sandeen wrote:
> I was looking at Marc's post:
>
> http://marc.merlins.org/perso/btrfs/post_2014-03-19_Btrfs-Tips_-Btrfs-Scrub-and-Btrfs-Filesystem-Repair.html
>
> and it feels like there isn't exactly a cohesive, overarching vision for
> repair of a corrupted btrfs filesystem.
>
> In other words - I'm an admin cruising along, when the kernel throws some
> fs corruption error, or for whatever reason btrfs fails to mount.
> What should I do?
>
> Marc lays out several steps, but to me this highlights that there seem to
> be a lot of disjoint mechanisms out there to deal with these problems;
> mostly from Marc's blog, with some bits of my own:
>
> * btrfs scrub
> 	"Errors are corrected along if possible" (what *is* possible?)
> * mount -o recovery
> 	"Enable autorecovery attempts if a bad tree root is found at mount time."
> * mount -o degraded
> 	"Allow mounts to continue with missing devices."
> 	(This isn't really a way to recover from corruption, right?)
> * btrfs-zero-log
> 	"remove the log tree if log tree is corrupt"
> * btrfs rescue
> 	"Recover a damaged btrfs filesystem"
> 	chunk-recover
> 	super-recover
> 	How does this relate to btrfs check?
> * btrfs check
> 	"repair a btrfs filesystem"
> 	--repair
> 	--init-csum-tree
> 	--init-extent-tree
> 	How does this relate to btrfs rescue?
> * btrfs restore
> 	"try to salvage files from a damaged filesystem"
> 	(not really repair, it's disk-scraping)
>
>
> What's the vision for, say, scrub vs. check vs. rescue?  Should they repair the
> same errors, only online vs. offline?  If not, what class of errors does one fix vs.
> the other?  How would an admin know?  Can btrfs check recover a bad tree root
> in the same way that mount -o recovery does?  How would I know if I should use
> --init-*-tree, or chunk-recover, and what are the ramifications of using
> these options?
>
> It feels like recovery tools have been badly splintered, and if there's an
> overarching design or vision for btrfs fs repair, I can't tell what it is.
> Can anyone help me?
>

We probably should just consolidate under 3 commands: one for online 
checking, one for offline repair, and one for pulling stuff off of the 
disk when things go to hell.  A lot of these tools were born out of the 
fact that we didn't have an fsck tool for a long time, so these 
stopgaps were put into place; now it's time to go back and clean it up.

I'll try to do this after I finish my cleanup/sync between kernel and 
progs work, and fill out the documentation a little better so it's clear 
when to use what.  Thanks,

Josef


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: Re: What is the vision for btrfs fs repair?
  2014-10-11  7:29     ` Goffredo Baroncelli
@ 2014-11-17 20:55       ` Phillip Susi
  0 siblings, 0 replies; 33+ messages in thread
From: Phillip Susi @ 2014-11-17 20:55 UTC (permalink / raw)
  To: kreijack, Bob Marley, linux-btrfs


On 10/11/2014 3:29 AM, Goffredo Baroncelli wrote:
> On 10/10/2014 12:53 PM, Bob Marley wrote:
>>> 
>>> If true, maybe the closest indication we'd get of btrfs
>>> stablity is the default enabling of autorecovery.
>> 
>> No way! I wouldn't want a default like that.
>> 
>> If you think at distributed transactions: suppose a sync was
>> issued on both sides of a distributed transaction, then power was
>> lost on one side, than btrfs had corruption. When I remount it,
>> definitely the worst thing that can happen is that it
>> auto-rolls-back to a previous known-good state.
> 
> I cannot agree. I consider a sane default to be a consistent
> state where "the most recently written data is lost", rather than one
> where "user intervention is required so that nothing is lost".
> 
> To address your requirement, we would need a "super sync" command which 
> ensures that the data is in the filesystem proper and not only in the log
> (as sync should ensure).

I have to agree.  There is a reason we have fsck -p, and a reason that is what
is run at boot time.  Some repairs involve a tradeoff that will result
in permanent data loss that might have been avoided by going the other
way, or by performing manual recovery.  Such repairs should never be done
automatically by default.

For that matter I'm not even sure this sort of thing should be there as
a mount option at all.  It really should require a manual fsck run with
a big warning that *THIS WILL THROW OUT SOME DATA*.

Now if the data is saved to a snapshot or something so you can manually
try to recover it later rather than being thrown out wholesale, I can
see that being done automatically at boot time.  Of course, if btrfs is
that damaged then wouldn't grub be unable to load your kernel in the
first place?


^ permalink raw reply	[flat|nested] 33+ messages in thread
