* unable to mount btrfs pool even with -oro,recovery,degraded, unable to do 'btrfs restore'
@ 2016-04-06 15:34 Ank Ular
  2016-04-06 21:02 ` Duncan
  2016-04-06 23:08 ` Chris Murphy
  0 siblings, 2 replies; 23+ messages in thread
From: Ank Ular @ 2016-04-06 15:34 UTC (permalink / raw)
  To: linux-btrfs

I am currently unable to mount nor recover data from my btrfs storage pool.

To the best of my knowledge, the situation did not arise from hard
disk failure. I believe the sequence of events is:

One or possibly more of my external devices had the USB 3.0
communications link fail. I recall seeing the message which is
generated when a USB based storage device is newly connected.

I was near the end of a 'btrfs balance' run which included adding
devices and converting the pool from RAID5 to RAID6. There were
approximately 1000 chunks {out of 22K+ chunks} left to go.
I was also participating in several torrents {this means my btrfs pool
was active}

From the output of 'dmesg', the section:
[   20.998071] BTRFS: device label FSgyroA devid 9 transid 625039 /dev/sdm
[   20.999984] BTRFS: device label FSgyroA devid 10 transid 625039 /dev/sdn
[   21.004127] BTRFS: device label FSgyroA devid 11 transid 625039 /dev/sds
[   21.011808] BTRFS: device label FSgyroA devid 12 transid 625039 /dev/sdu

bothers me because the transid value of these four devices doesn't
match the other 16 devices in the pool {should be 625065}. In theory,
I believe these should all have the same transid value. These four
devices are all on a single USB 3.0 port and this is the link I
believe went down and came back up. This is an external, four drive
bay case with 4 6T drives in it.

I can no longer mount the storage pool
pyrogyro ~ # mount -t btrfs -o ro,recovery,degraded /dev/sdb /PublicA
mount: wrong fs type, bad option, bad superblock on /dev/sdb,
       missing codepage or helper program, or other error

       In some cases useful info is found in syslog - try
       dmesg | tail or so.

nor can I restore data from the storage pool
pyrogyro ~ # btrfs restore -D -i -v /dev/sdb /dev/null
checksum verify failed on 120890386268160 found D7319043 wanted 33D22DF5
checksum verify failed on 120890386268160 found 50ECAB17 wanted 2D8EEBCA
checksum verify failed on 120890386268160 found D7319043 wanted 33D22DF5
bytenr mismatch, want=120890386268160, have=65536
This is a dry-run, no files are going to be restored
parent transid verify failed on 120874721263616 wanted 625047 found 625039
parent transid verify failed on 120874721263616 wanted 625047 found 625039
checksum verify failed on 120874721263616 found 6FE4916B wanted 824E1F4D
checksum verify failed on 120874721263616 found 6FE4916B wanted 824E1F4D
bytenr mismatch, want=120874721263616, have=45608042283264
Error searching -5

blkid has no problem accessing all the devices in the storage pool.
smartctl tells me that every device in the pool has a 'Passed' health status.
The oldest drive is about 14 months old.
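
(For reference, the per-device health check is along the lines of the
following, with /dev/sdb standing in for each member device:)

   smartctl -H /dev/sdb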

I understand I'll be losing some data. Of course, I'd like to recover
as much as possible. I can think of two possible approaches, though I
don't have any idea how to go about them:

Somehow fix things so I can mount the pool 'in place'. I don't mind
rolling back {if possible} the other 16 devices so that all devices
are at the same transid. I can recreate any corrupt/missing files up
to several weeks back. This might include fixing the chunk-tree,
re-creating any other trees or other repairs.

Somehow fix things so that I can perform 'btrfs restore' which will
copy all recoverable files to a new storage location.

I have not run any 'btrfs' commands such as 'check', 'rescue',
'replace', 'scrub' etc. The idea was to not make things worse. I have
rebooted 3 times before I understood that I had real issues with the
btrfs pool. These reboots all failed to mount the btrfs pool.

Any other information I can provide will be happily provided. All help
will be appreciated.

pyrogyro ~ # uname -a
Linux pyrogyro 4.4.6-gentoo #1 SMP PREEMPT Wed Apr 6 07:45:45 EDT 2016
x86_64 AMD A10-7850K Radeon R7, 12 Compute Cores 4C+8G AuthenticAMD
GNU/Linux
pyrogyro ~ # btrfs --version
btrfs-progs v4.5.1
pyrogyro ~ # btrfs fi show
Label: 'PhoenixRootSSD'  uuid: ed1790a7-87e6-466c-a68c-e375303fd99f
        Total devices 1 FS bytes used 85.50GiB
        devid    1 size 200.04GiB used 110.01GiB path /dev/sda5

Label: 'PhoenixRoot'  uuid: 7ba4f981-c2ff-4a70-96a6-4c4b25f96e96
        Total devices 1 FS bytes used 2.00TiB
        devid    1 size 2.71TiB used 2.38TiB path /dev/sdg5

Label: none  uuid: a7e2e4f6-e324-4cf4-8b76-33bb7dedf5d1
        Total devices 1 FS bytes used 384.00KiB
        devid    1 size 2.73TiB used 2.02GiB path /dev/sdab1

checksum verify failed on 120890386268160 found D7319043 wanted 33D22DF5
checksum verify failed on 120890386268160 found 50ECAB17 wanted 2D8EEBCA
checksum verify failed on 120890386268160 found D7319043 wanted 33D22DF5
bytenr mismatch, want=120890386268160, have=65536
Label: 'FSgyroA'  uuid: 4dae41b0-a459-4c20-a09d-0aca9563b9ad
        Total devices 20 FS bytes used 53.15TiB
        devid    1 size 3.64TiB used 3.46TiB path /dev/sdb
        devid    2 size 3.64TiB used 3.46TiB path /dev/sdd
        devid    3 size 3.64TiB used 3.46TiB path /dev/sdc
        devid    4 size 2.73TiB used 2.73TiB path /dev/sdh
        devid    5 size 4.55TiB used 4.50TiB path /dev/sde
        devid    6 size 4.55TiB used 4.50TiB path /dev/sdf
        devid    7 size 4.55TiB used 4.49TiB path /dev/sdi
        devid    8 size 4.55TiB used 4.50TiB path /dev/sdj
        devid    9 size 5.46TiB used 5.19TiB path /dev/sdm
        devid   10 size 5.46TiB used 5.19TiB path /dev/sdn
        devid   11 size 5.46TiB used 5.19TiB path /dev/sds
        devid   12 size 5.46TiB used 5.19TiB path /dev/sdu
        devid   14 size 2.73TiB used 2.73TiB path /dev/sdag
        devid   15 size 2.73TiB used 2.73TiB path /dev/sdz
        devid   16 size 2.73TiB used 2.73TiB path /dev/sdy
        devid   17 size 2.73TiB used 2.73TiB path /dev/sdac
        devid   18 size 2.73TiB used 2.73TiB path /dev/sdaf
        devid   19 size 2.73TiB used 2.73TiB path /dev/sdx
        devid   20 size 2.73TiB used 2.73TiB path /dev/sdad
        *** Some devices missing

pyrogyro ~ # btrfs fi df /PublicA
Data, single: total=107.00GiB, used=84.23GiB
System, single: total=4.00MiB, used=16.00KiB
Metadata, single: total=3.01GiB, used=1.27GiB
GlobalReserve, single: total=448.00MiB, used=0.00B
pyrogyro ~ # dmesg | grep BTRFS
[   20.295632] BTRFS: device label PhoenixRootSSD devid 1 transid
300544 /dev/sda5
[   20.300144] BTRFS info (device sda5): disk space caching is enabled
[   20.300148] BTRFS: has skinny extents
[   20.321855] BTRFS: detected SSD devices, enabling SSD mode
[   20.998071] BTRFS: device label FSgyroA devid 9 transid 625039 /dev/sdm
[   20.999984] BTRFS: device label FSgyroA devid 10 transid 625039 /dev/sdn
[   21.004127] BTRFS: device label FSgyroA devid 11 transid 625039 /dev/sds
[   21.011808] BTRFS: device label FSgyroA devid 12 transid 625039 /dev/sdu
[   21.109647] BTRFS: device label FSgyroA devid 6 transid 625065 /dev/sdf
[   21.130846] BTRFS: device label FSgyroA devid 5 transid 625065 /dev/sde
[   21.131920] BTRFS: device label FSgyroA devid 3 transid 625065 /dev/sdc
[   21.133196] BTRFS: device label FSgyroA devid 17 transid 625065 /dev/sdac
[   21.152346] BTRFS: device label FSgyroA devid 19 transid 625065 /dev/sdx
[   21.158732] BTRFS: device label FSgyroA devid 15 transid 625065 /dev/sdz
[   21.168634] BTRFS: device label FSgyroA devid 20 transid 625065 /dev/sdad
[   21.172592] BTRFS: device label FSgyroA devid 1 transid 625065 /dev/sdb
[   21.173639] BTRFS: device label FSgyroA devid 18 transid 625065 /dev/sdaf
[   21.178384] BTRFS: device label FSgyroA devid 2 transid 625065 /dev/sdd
[   21.212464] BTRFS: device label FSgyroA devid 16 transid 625065 /dev/sdy
[   21.290614] BTRFS: device label FSgyroA devid 7 transid 625065 /dev/sdi
[   21.309370] BTRFS: device label FSgyroA devid 8 transid 625065 /dev/sdj
[   21.372684] BTRFS: device label FSgyroA devid 4 transid 625065 /dev/sdh
[   21.443467] BTRFS: device label FSgyroA devid 14 transid 625065 /dev/sdag
[   21.495110] BTRFS: device fsid a7e2e4f6-e324-4cf4-8b76-33bb7dedf5d1
devid 1 transid 14 /dev/sdab1
[   21.652071] BTRFS: device label PhoenixRoot devid 1 transid 593561 /dev/sdg5
[   29.881428] BTRFS info (device sda5): enabling auto defrag
[   29.881436] BTRFS info (device sda5): disk space caching is enabled
[   30.063829] BTRFS info (device sdg5): enabling auto defrag
[   30.063837] BTRFS info (device sdg5): disk space caching is enabled
[   30.063838] BTRFS: has skinny extents
[  340.714491] BTRFS info (device sdag): disk space caching is enabled
[  340.714496] BTRFS: has skinny extents
[  341.010175] BTRFS: failed to read chunk tree on sdag
[  341.030490] BTRFS: open_ctree failed
[  341.056664] BTRFS info (device sdag): disk space caching is enabled
[  341.056668] BTRFS: has skinny extents
[  341.070958] BTRFS: failed to read chunk tree on sdag
[  341.090538] BTRFS: open_ctree failed
[  341.176337] BTRFS info (device sdag): disk space caching is enabled
[  341.176340] BTRFS: has skinny extents
[  341.181257] BTRFS: failed to read chunk tree on sdag
[  341.193838] BTRFS: open_ctree failed
[  341.301907] BTRFS info (device sdag): disk space caching is enabled
[  341.301911] BTRFS: has skinny extents
[  341.302754] BTRFS: failed to read chunk tree on sdag
[  341.313773] BTRFS: open_ctree failed
[  341.681433] BTRFS info (device sdag): disk space caching is enabled
[  341.681437] BTRFS: has skinny extents
[  341.682436] BTRFS: failed to read chunk tree on sdag
[  341.700410] BTRFS: open_ctree failed
[  342.535884] BTRFS info (device sdag): disk space caching is enabled
[  342.535887] BTRFS: has skinny extents
[  342.536531] BTRFS: failed to read chunk tree on sdag
[  342.550450] BTRFS: open_ctree failed
[  342.562704] BTRFS info (device sdag): disk space caching is enabled
[  342.562708] BTRFS: has skinny extents
[  342.564068] BTRFS: failed to read chunk tree on sdag
[  342.594017] BTRFS: open_ctree failed
[  343.059777] BTRFS info (device sdag): disk space caching is enabled
[  343.059782] BTRFS: has skinny extents
[  343.061271] BTRFS: failed to read chunk tree on sdag
[  343.083753] BTRFS: open_ctree failed
[  343.501960] BTRFS info (device sdag): disk space caching is enabled
[  343.501963] BTRFS: has skinny extents
[  343.506562] BTRFS: failed to read chunk tree on sdag
[  343.520391] BTRFS: open_ctree failed
[  344.010038] BTRFS info (device sdag): disk space caching is enabled
[  344.010042] BTRFS: has skinny extents
[  344.014591] BTRFS: failed to read chunk tree on sdag
[  344.037124] BTRFS: open_ctree failed
[  344.249147] BTRFS info (device sdag): disk space caching is enabled
[  344.249152] BTRFS: has skinny extents
[  344.270668] BTRFS: failed to read chunk tree on sdag
[  344.283740] BTRFS: open_ctree failed
[  344.312789] BTRFS info (device sdab1): disk space caching is enabled
[  344.312793] BTRFS: has skinny extents
[  570.894920] BTRFS info (device sdag): enabling auto recovery
[  570.894926] BTRFS info (device sdag): disabling disk space caching
[  570.894929] BTRFS info (device sdag): force clearing of disk cache
[  570.894931] BTRFS: has skinny extents
[  570.896272] BTRFS: failed to read chunk tree on sdag
[  570.907534] BTRFS: open_ctree failed
[ 6328.239627] BTRFS info (device sdag): enabling auto recovery
[ 6328.239631] BTRFS info (device sdag): allowing degraded mounts
[ 6328.239634] BTRFS info (device sdag): disk space caching is enabled
[ 6328.239635] BTRFS: has skinny extents
[ 6328.271138] BTRFS warning (device sdag): devid 13 uuid
34774574-5c91-4366-a58a-d2c6799fc162 missing
[ 6328.735082] BTRFS info (device sdag): bdev /dev/sdu errs: wr 75, rd
48, flush 0, corrupt 0, gen 0
[ 6328.735089] BTRFS info (device sdag): bdev /dev/sds errs: wr 75, rd
36, flush 0, corrupt 0, gen 0
[ 6328.735094] BTRFS info (device sdag): bdev /dev/sdn errs: wr 75, rd
32, flush 0, corrupt 0, gen 0
[ 6328.735098] BTRFS info (device sdag): bdev /dev/sdm errs: wr 75, rd
22, flush 0, corrupt 0, gen 0
[ 6329.447352] BTRFS error (device sdag): parent transid verify failed
on 120878845526016 wanted 625047 found 624312
[ 6329.516712] BTRFS error (device sdag): bad tree block start
15703682036217976097 120878845526016
[ 6329.516838] BTRFS: Failed to read block groups: -5
[ 6329.559946] BTRFS: open_ctree failed


* Re: unable to mount btrfs pool even with -oro,recovery,degraded, unable to do 'btrfs restore'
  2016-04-06 15:34 unable to mount btrfs pool even with -oro,recovery,degraded, unable to do 'btrfs restore' Ank Ular
@ 2016-04-06 21:02 ` Duncan
  2016-04-06 22:08   ` Ank Ular
  2016-04-06 23:08 ` Chris Murphy
  1 sibling, 1 reply; 23+ messages in thread
From: Duncan @ 2016-04-06 21:02 UTC (permalink / raw)
  To: linux-btrfs

Ank Ular posted on Wed, 06 Apr 2016 11:34:39 -0400 as excerpted:

> I am currently unable to mount nor recover data from my btrfs storage
> pool.
> 
> To the best of my knowledge, the situation did not arise from hard disk
> failure. I believe the sequence of events is:
> 
> One or possibly more of my external devices had the USB 3.0
> communications link fail. I recall seeing the message which is generated
> when a USB based storage device is newly connected.
> 
> I was near the end of a 'btrfs balance' run which included adding
> devices and converting the pool from RAID5 to RAID6. There were
> approximately 1000 chunks {out of 22K+ chunks} left to go.
> I was also participating in several torrents {this means my btrfs pool
> was active}
> 
> From the output of 'dmesg', the section:
> [   20.998071] BTRFS: device label FSgyroA
> devid 9 transid 625039 /dev/sdm
> [   20.999984] BTRFS: device label FSgyroA
> devid 10 transid 625039 /dev/sdn
> [   21.004127] BTRFS: device label FSgyroA
> devid 11 transid 625039 /dev/sds
> [   21.011808] BTRFS: device label FSgyroA
> devid 12 transid 625039 /dev/sdu
> 
> bothers me because the transid value of these four devices doesn't match
> the other 16 devices in the pool {should be 625065}. In theory,
> I believe these should all have the same transid value. These four
> devices are all on a single USB 3.0 port and this is the link I believe
> went down and came back up. This is an external, four drive bay case
> with 4 6T drives in it.

Unfortunately it's somewhat common to have problems with USB attached 
devices.  On a single-device btrfs it's not so much of a problem because 
it all dies at the same time and should be conventionally rolled back to 
the previous transaction commit, with fsyncs beyond that replayed by the 
log.  A pair-device raid1 mode should be easily recovered as well, as 
while the two may be out of sync, there's only the two copies and one 
will consistently be ahead, so the btrfs should mount and a scrub can 
easily be used to update the device that's behind.  Any other raid than 
raid0 should work similarly when only a single device is behind.

But with multiple devices behind, like your four, things get far more 
complex.

Of course the first thing to note is that with btrfs still considered 
stabilizing, but not fully stable and mature, the sysadmin's rule of 
backups, in simple form that anything without at least one backup is 
defined by that lack of backup as not worth the trouble, applies even 
stronger than it would in the mature filesystem case.  Similarly, btrfs' 
parity-raid is fairly new and not yet at the stability of other btrfs 
raid types (raid1 and raid10, plus of course raid0 which implies you 
don't care about recovery after failure anyway), strengthening the 
application of the backups rule even further.

So you definitely (triple-power!) either had backups you can restore 
from, or were defining that data as not worth the hassle.  That means 
worst-case, you can either restore from your backups, or you considered 
the time and resources saved in not doing them more valuable than the 
data you were risking.  Since either way you saved what was most 
important to you, you can be happy. =:^)

But even if you had backups, there's then a tradeoff between the hassle 
of updating them and the risk of having to revert to them.  Like me, 
you may have backups, but they may not be particularly current, as the 
limited risk didn't really justify updating the backups at a higher 
frequency, so some effort to get more current versions is justified.  
(I've actually been in that situation a couple times with some of my 
btrfs.  Fortunately, in both cases I was able to btrfs restore and thus 
/was/ able to recover basically current versions of everything that 
mattered on the filesystem.)

Anyway, that does sound like where you're at, you have backups but 
they're several weeks old and you'd prefer to recover newer versions if 
possible.  That I can definitely understand as I've been there. =:^)


With four devices behind by (fortunately only) 26 transactions, and 
luckily all at the same transaction/generation number, you're likely 
beyond what the recovery mount option can deal with (I believe up to three 
transactions, tho it might be a few more in newer kernels), and obviously 
from your results, beyond what btrfs restore can deal with automatically 
as well.

There is still hope via btrfs restore, but you have to feed it more 
information than it can get on its own, and while it's reasonably likely 
that you can get that information and as a result a successful restore, 
the process of finding the information and manually verifying that it's 
appropriately complete is definitely rather more technical than the 
automated process.  If you're sufficiently technically inclined (not at a 
dev level, but at an admin level, able to understand technical concepts 
and make use of them on the command line, etc), your chances at recovery 
are still rather good.  If you aren't... better be getting out those 
backups.

There's a page on the wiki that describes the general process, but keep 
in mind that the tools continue to evolve and the wiki page may not be 
absolutely current, so what it describes might not be exactly what you 
get, and you may have to do some translation between the current tools 
and what's on the wiki.  (Actually, it looks like it is much more current 
than it used to be, but I'm not sure whether all parts of the page have 
been updated/rewritten or not.)

https://btrfs.wiki.kernel.org/index.php/Restore

You're at the "advanced usage" point as the automated method didn't work.

The general idea is to use the btrfs-find-root command to get a list of 
previous roots, their generation number (aka transaction ID, aka transid), 
and their corresponding byte number (bytenr).  The bytenr is the value 
you feed to btrfs restore, via the -t option.

I'd start with the 625039 generation/transid that is the latest on the 
four "behind" devices, hoping that the other devices still have it intact 
as well.  Find the corresponding bytenr via btrfs-find-root, and feed it 
to btrfs restore via -t.  But not yet in a live run!!
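
Concretely, that first step is just (/dev/sdb here is just one member
device of the filesystem; the generation/bytenr pairs it prints are
what matter):

   btrfs-find-root /dev/sdb

Note the bytenr reported next to generation 625039; that's the value
to hand to restore's -t option in the steps below.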

First, use -t and -l together, to get a list of the tree-roots available 
at that bytenr.  You want to pick a bytenr/generation that still has its 
tree roots intact as much as possible.  Down near the bottom of the page 
there's a bit of an explanation of what the object-IDs mean.  The low 
number ones are filesystem-global and are quite critical.  256 up are 
subvolumes and snapshots.  If a few snapshots are missing no big deal, 
tho if something critical is in a subvolume, you'll want either it or a 
snapshot of it available to try to restore from.
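
For example, with <bytenr> standing in for the value taken from the
btrfs-find-root output:

   btrfs restore -t <bytenr> -l /dev/sdb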

Once you have a -t bytenr candidate with ideally all of the objects 
intact, or as many as possible if not all of them, do a dry-run using the 
-D option.  The output here will be the list of files it's trying to 
recover and thus may be (hopefully is, with a reasonably populated 
filesystem) quite long.  But if it looks reasonable, you can use the same 
-t bytenr without the -D/dry-run option to do the actual restore.  Be 
sure to use the various options for restoring metadata, symlinks, 
extended attributes, snapshots, etc., if appropriate.
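
As a rough sketch (same placeholder bytenr; double-check the exact
option letters against your btrfs-progs manpage, as they have shifted
between versions):

   # dry run first, only listing what would be restored
   btrfs restore -t <bytenr> -D -i -v /dev/sdb /dev/null

   # real run, also restoring metadata, xattrs, symlinks and snapshots
   btrfs restore -t <bytenr> -m -x -S -s -i -v /dev/sdb /mnt/restore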

Of course you'll need enough space to restore to as well.  If that's an 
issue, you can use the --path-regex option to restore the most important 
stuff only.  There's an example on the page.
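
Purely as an illustration of the regex form the wiki uses (every parent
directory spelled out; the path here is made up):

   btrfs restore -t <bytenr> -i -v \
       --path-regex '^/(|home(|/user(|/important(|/.*))))$' \
       /dev/sdb /mnt/restore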


If that's beyond your technical abilities or otherwise doesn't work, you 
may be able to use some of the advanced options of btrfs check and btrfs 
rescue to help, but I've never tried that myself and you'll be better off 
with help from someone else, because unlike restore, which doesn't write 
to the damaged filesystem the files are being restored from and thus 
can't damage it further, these tools and options can destroy any 
reasonable hope of recovery if they aren't used with appropriate 
knowledge and care.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: unable to mount btrfs pool even with -oro,recovery,degraded, unable to do 'btrfs restore'
  2016-04-06 21:02 ` Duncan
@ 2016-04-06 22:08   ` Ank Ular
  2016-04-07  2:36     ` Duncan
  0 siblings, 1 reply; 23+ messages in thread
From: Ank Ular @ 2016-04-06 22:08 UTC (permalink / raw)
  To: linux-btrfs

On Wed, Apr 6, 2016 at 5:02 PM, Duncan <1i5t5.duncan@cox.net> wrote:
> Ank Ular posted on Wed, 06 Apr 2016 11:34:39 -0400 as excerpted:
>
>> I am currently unable to mount nor recover data from my btrfs storage
>> pool.
>>

> With four devices behind by (fortunately only) 26 transactions, and
> luckily all at the same transaction/generation number, you're likely
> beyond what the recovery mount option can deal with (I believe up to three
> transactions, tho it might be a few more in newer kernels), and obviously
> from your results, beyond what btrfs restore can deal with automatically
> as well.
>
> There is still hope via btrfs restore, but you have to feed it more
> information than it can get on its own, and while it's reasonably likely
> that you can get that information and as a result a successful restore,
> the process of finding the information and manually verifying that it's
> appropriately complete is definitely rather more technical than the
> automated process.  If you're sufficiently technically inclined (not at a
> dev level, but at an admin level, able to understand technical concepts
> and make use of them on the command line, etc), your chances at recovery
> are still rather good.  If you aren't... better be getting out those
> backups.
>
> There's a page on the wiki that describes the general process, but keep
> in mind that the tools continue to evolve and the wiki page may not be
> absolutely current, so what it describes might not be exactly what you
> get, and you may have to do some translation between the current tools
> and what's on the wiki.  (Actually, it looks like it is much more current
> than it used to be, but I'm not sure whether all parts of the page have
> been updated/rewritten or not.)
>
> https://btrfs.wiki.kernel.org/index.php/Restore
>
> You're at the "advanced usage" point as the automated method didn't work.
>
> The general idea is to use the btrfs-find-root command to get a list of
> previous roots, their generation number (aka transaction ID, aka transid),
> and their corresponding byte number (bytenr).  The bytenr is the value
> you feed to btrfs restore, via the -t option.
>
> I'd start with the 625039 generation/transid that is the latest on the
> four "behind" devices, hoping that the other devices still have it intact
> as well.  Find the corresponding bytenr via btrfs-find-root, and feed it
> to btrfs restore via -t.  But not yet in a live run!!
>
> First, use -t and -l together, to get a list of the tree-roots available
> at that bytenr.  You want to pick a bytenr/generation that still has its
> tree roots intact as much as possible.  Down near the bottom of the page
> there's a bit of an explanation of what the object-IDs mean.  The low
> number ones are filesystem-global and are quite critical.  256 up are
> subvolumes and snapshots.  If a few snapshots are missing no big deal,
> tho if something critical is in a subvolume, you'll want either it or a
> snapshot of it available to try to restore from.
>
> Once you have a -t bytenr candidate with ideally all of the objects
> intact, or as many as possible if not all of them, do a dry-run using the
> -D option.  The output here will be the list of files it's trying to
> recover and thus may be (hopefully is, with a reasonably populated
> filesystem) quite long.  But if it looks reasonable, you can use the same
> -t bytenr without the -D/dry-run option to do the actual restore.  Be
> sure to use the various options restoring metadata, symlinks, extended
> attributes, snapshots, etc, if appropriate.
>
> Of course you'll need enough space to restore to as well.  If that's an
> issue, you can use the --path-regex option to restore the most important
> stuff only.  There's an example on the page.
>
>
> If that's beyond your technical abilities or otherwise doesn't work, you
> may be able to use some of the advanced options of btrfs check and btrfs
> rescue to help, but I've never tried that myself and you'll be better off
> with help from someone else, because unlike restore, which doesn't write
> to the damaged filesystem the files are being restored from and thus
> can't damage it further, these tools and options can destroy any
> reasonable hope of recovery if they aren't used with appropriate
> knowledge and care.
>
> --
> Duncan - List replies preferred.   No HTML msgs.

I did read this page: https://btrfs.wiki.kernel.org/index.php/Restore

But, not understanding the meaning of much of the terminology, I
didn't "get it".

Your explanation makes the page much clearer. I do need one
clarification. I'm assuming
that when I issue the command:

   btrfs-find-root /dev/sdb

it doesn't actually matter which device I use and that, in theory, any
of the 20 devices should yield the same listing.

By the same token, when I issue the command:

   btrfs restore -t n /dev/sdb /mnt/restore

any of the 20 devices would work equally well.

I want to be clear on this because this will be the first time I
attempt using 'btrfs restore'. While I think I understand what
is supposed to happen now, there is nothing like experience to make
that 'understanding' more solid. I just want to be
sure I haven't confused myself before I do something more or less irrevocable.

Fortunately, I neither use sub-volumes nor snapshots since nearly all
of the files are static in nature.

As far as backups go, we're talking about a home server/workstation.
While I used to go through an excruciating budget
battle every year on a professional level in my usually futile fight
for disaster recovery planning funding, my personal
budget is much, much more limited.

Of the 53T currently in limbo, about ~6-8T are on several hundred
DVDs. About 10T are on the hard drives of my prior system, which needs
a replacement motherboard. {I had rsynced the data to a new build
system just before its imminent failure.} Most of the rest can be
re-collected from a variety of still-existing sources, as I still have
the file names and link locations on a separate file system. My
'disaster recovery plans' assume patience, a limited budget and knowing
where everything came from originally. Backups are a completely
different issue. My backup planning won't be complete for another 12
months or so, since it essentially means building a duplicate system.
Since my budget funding is limited, my duplicate system is happening
piecemeal every other month or so.

I do understand both backups {having implemented real-time transaction
journalling to tape combined with weekly 'dd' copies to tape, monthly
full backups with 6-month retention, yada-yada-yada} and disaster
recovery planning. Been there. Done that. Saved my ___ multiple times.

The crux is always funding.

Naturally, using 'btrfs restore' successfully will go a long way
towards shortening the recovery process.

It will be about a week before I can begin since I need to acquire the
restore destination storage first.

Thank you for explaining the process.


* Re: unable to mount btrfs pool even with -oro,recovery,degraded, unable to do 'btrfs restore'
  2016-04-06 15:34 unable to mount btrfs pool even with -oro,recovery,degraded, unable to do 'btrfs restore' Ank Ular
  2016-04-06 21:02 ` Duncan
@ 2016-04-06 23:08 ` Chris Murphy
  2016-04-07 11:19   ` Austin S. Hemmelgarn
  2016-04-07 11:29   ` Austin S. Hemmelgarn
  1 sibling, 2 replies; 23+ messages in thread
From: Chris Murphy @ 2016-04-06 23:08 UTC (permalink / raw)
  To: Ank Ular; +Cc: Btrfs BTRFS

On Wed, Apr 6, 2016 at 9:34 AM, Ank Ular <ankular.anime@gmail.com> wrote:

>
> From the output of 'dmesg', the section:
> [   20.998071] BTRFS: device label FSgyroA devid 9 transid 625039 /dev/sdm
> [   20.999984] BTRFS: device label FSgyroA devid 10 transid 625039 /dev/sdn
> [   21.004127] BTRFS: device label FSgyroA devid 11 transid 625039 /dev/sds
> [   21.011808] BTRFS: device label FSgyroA devid 12 transid 625039 /dev/sdu
>
> bothers me because the transid value of these four devices doesn't
> match the other 16 devices in the pool {should be 625065}. In theory,
> I believe these should all have the same transid value. These four
> devices are all on a single USB 3.0 port and this is the link I
> believe went down and came back up.

This is effectively a 4 disk failure and raid6 only allows for 2.

Now, a valid complaint is that as soon as Btrfs is seeing write
failures for 3 devices, it needs to go read-only. Specifically, it
would go read only upon 3 or more write errors affecting a single full
raid stripe (data and parity strips combined); and that's because such
a write is fully failed.

Now, maybe there's a way to just retry that stripe? During heavy
writing, there are probably multiple stripes in flight. But in real
short order the file system, I think, needs to face plant; going read
only (or even a graceful crash) is better than continuing to write to
n-4 drives, which is a bunch of bogus data, in effect.

I'm gonna guess the superblock on all the surviving drives is wrong,
because it sounds like the file system didn't immediately go read only
when the four drives vanished?

However, there is probably really valuable information in the
superblocks of the failed devices. The file system should be
consistent as of the generation on those missing devices. If there's a
way to roll back the file system to those supers, including using
their trees, then it should be possible to get the file system back -
while accepting 100% data loss between generation 625039 and 625065.
That's already 100% data loss anyway, if it was still doing n-4 device
writes - those are bogus generations.

Since this is entirely COW, nothing should be lost. All the data
necessary to go back to generation 625039 is on all drives. And none
of the data after that is usable anyway. Possibly even 625038 is the
last good one on every single drive.

So what you should try to do is get supers on every drive. There are
three super blocks per drive. And there are four backups per super. So
that's potentially 12 slots per drive times 20 drives. That's a lot of
data for you to look through but that's what you have to do. The top
task would be to see if the three supers are the same on each device,
if so, then that cuts the comparison down by 1/3. And then compare the
supers across devices. You can get this with btrfs-show-super -fa.
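
One way to gather those for comparison (device names below are taken
from the 'btrfs fi show' output earlier in the thread; adjust to
whatever they are at the time you run it):

   mkdir -p /tmp/supers
   for d in sdb sdd sdc sdh sde sdf sdi sdj sdm sdn sds sdu \
            sdag sdz sdy sdac sdaf sdx sdad; do
       btrfs-show-super -fa /dev/$d > /tmp/supers/$d.txt
   done
   # e.g. compare one of the lagging devices against a current one:
   diff /tmp/supers/sdm.txt /tmp/supers/sdb.txt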

You might look in another thread about how to setup an overlay for 16
of the 20 drives; making certain you obfuscate the volume UUID of the
original, only allowing that UUID to appear via the overlay (otherwise
the same volume+device UUIDs appear to the kernel twice, e.g. when using
LVM snapshots of either thick or thin variety and making both visible and then
trying to mount one of them). Others have done this I think remotely
to make sure the local system only sees the overlay devices. Anyway,
this allows you to make destructive changes non-destructively. What I
can't tell you off hand is if any of the tools will let you
specifically accept the superblocks from the four "good" devices that
went offline abruptly, and adapt them to the other 16, i.e. rolling
back the 16 that went too far forward without the other 4. Make sense?
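
A minimal sketch of that overlay idea using device-mapper snapshots
(paths, sizes and the single 'sdb' shown are illustrative; the sparse
COW file absorbs all writes, and the original devices must not be
mounted while the overlays exist):

   truncate -s 50G /overlays/sdb.cow            # sparse, grows as needed
   cow=$(losetup -f --show /overlays/sdb.cow)
   size=$(blockdev --getsz /dev/sdb)            # size in 512-byte sectors
   dmsetup create overlay-sdb \
       --table "0 $size snapshot /dev/sdb $cow N 8"
   # repeat per device, then point mount/btrfs tools at /dev/mapper/overlay-*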

Note. You can't exactly copy the super block from one device to
another because it contains a dev UUID. So first you need to look at a
superblock for any two of the four "good" devices, and compare them.
Exactly how do they differ? They should only differ with
dev_item.devid, dev_item.uuid, and maybe dev_item.total_bytes and
hopefully not but maybe dev_item.bytes_used. And then somehow adapt
this for the other 16 drives. I'd love it if there's a tool that does
this, maybe 'btrfs rescue super-recover' but there are no meaningful
options with that command so I'm skeptical how it knows what's bad and
what's good.

You literally might have to splice superblocks and write them to 16
drives in exactly 3 locations per drive (well, maybe just one of them,
and then delete the magic from the other two, and then 'btrfs rescue
super-recover' should then use the one good copy to fix the two bad
copies).

Sigh.... maybe?

In theory it's possible, I just don't know the state of the tools. But
I'm fairly sure the best chance of recovery is going to be on the 4
drives that abruptly vanished.  Their supers will be mostly correct or
close to it: and that's what has all the roots in it: tree, fs, chunk,
extent and csum. And all of those states are better farther in the
past, rather than the 16 drives that have much newer writes.

Of course it is possible there's corruption problems with those four
drives having vanished while writes were incomplete. But if you're
lucky, data write happen first, then metadata writes second, and only
then is the super updated. So the super should point to valid metadata
and that should point to valid data. If that order is wrong, then it's
bad news and you have to look at backup roots. But *if* you get all
the supers correct and on the same page, you can access the backup
roots by using -o recovery if corruption is found with a normal mount.





-- 
Chris Murphy


* Re: unable to mount btrfs pool even with -oro,recovery,degraded, unable to do 'btrfs restore'
  2016-04-06 22:08   ` Ank Ular
@ 2016-04-07  2:36     ` Duncan
  0 siblings, 0 replies; 23+ messages in thread
From: Duncan @ 2016-04-07  2:36 UTC (permalink / raw)
  To: linux-btrfs

Ank Ular posted on Wed, 06 Apr 2016 18:08:53 -0400 as excerpted:

> I did read this page: https://btrfs.wiki.kernel.org/index.php/Restore
> 
> But, not understanding the meaning of much of the terminology, I didn't
> "get it".
> 
> Your explanation makes the page much clearer.

Yeah.  It took me awhile, some help from the list, and actually going 
thru the process for real, once, to understand that page as well.  As I 
said, once you get to the point of the automatic btrfs restore not 
working and needing the advanced stuff, the process gets /far/ more 
technical, and even for admin types used to dealing with technical stuff 
(I've been a gentooer for over a decade, as I actually enjoy its mix of 
customizability, including the ability to override distro defaults where 
found necessary, and automation where I don't particularly care), it's 
not exactly easy reading.

But it sure beats not having that page as a resource! =:^)

> I do need one
> clarification. I'm assuming that when I issue the command:
> 
>    btrfs-find-root /dev/sdb
> 
> it doesn't actually matter which device I use and that, in theory, any
> of the 20 devices should yield the same listing.
> 
> By the same token, when I issue the command:
> 
>    btrfs restore -t n /dev/sdb /mnt/restore
> 
> any of the 20 devices would work equally well.
> 
> I want to be clear on this because this will be the first time I attempt
> using 'btrfs restore'. While I think I understand what is supposed to
> happen now, there is nothing like experience to make that
> 'understanding' more solid. I just want to be sure I haven't confused
> myself before I do something more or less irrevocable.

Keep in mind that one of the advantages of btrfs restore is that it does 
NOT write to the filesystem it's trying to recover files from.  As such, 
it can't damage it further.  So the only way you could be doing something 
irrevocable would be trying to write restore's output back to the device 
in question, instead of to a different filesystem intended to restore the 
files to, or something equally crazy.

As to the question at hand, whether pointing it at one component device 
or another makes a difference, to the best of my knowledge, no.  However, 
it should be stated that my own experience with restore was with a two-
device btrfs raid1, so even if it actually only used the one device it 
was pointed at, it could be expected to work, since the two devices were 
actually raid1 mirrors of the same content.

Between that and the fact that the wiki page in reference, again,
https://btrfs.wiki.kernel.org/index.php/Restore , was clearly written 
from the single-device viewpoint, and the manpage doesn't say either, I 
can't actually say for sure how it deals with multiple-device filesystems 
when a single device doesn't contain effectively a copy of the entire 
filesystem, as was the case with my personal experience, on a two-device 
btrfs raid1 for both data and metadata.

But as I said, restore doesn't write to the devices or filesystem it's 
trying to restore files from, so feel free to experiment with it.  The 
only thing you might lose is a bit of time... unless you start trying to 
redirect its output to the devices it's trying to restore from, or 
something equally strange, and that's pretty difficult to do by accident!

But to my knowledge, pointing restore at any of the device components of 
the filesystem should be fine.  The one thing I'd be sure I had done 
previously would be btrfs device scan.  That's normally used (and 
normally triggered automatically by udev) to update the kernel on what 
component devices are available to form the filesystem before mounting, 
etc, while btrfs restore is otherwise mostly userspace code, so again I'm 
not /sure/ it applies in this case or not, but while btrfs check and 
btrfs restore are normally mostly userspace, I assume they /do/ make use 
of kernel services to at least figure out what device components belong 
to the filesystem in question, so a btrfs device scan would update that 
information for them.  Even if they don't use that mechanism, figuring 
that out entirely in userspace, it won't hurt.
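
For reference, that's simply the following, run as root; with no
arguments it scans all block devices for btrfs members:

   btrfs device scan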

Which BTW, any devs reading this care to clarify for me?  How /do/ 
otherwise userspace only commands such as btrfs check and btrfs restore, 
discover which device components make up a multi-device filesystem?  Do 
they still call kernel services for that, such that btrfs device scan 
matters as it triggers an update of the kernel's btrfs component devices 
list, or do they do it all in userspace?

> Fortunately, I neither use sub-volumes nor snapshots since nearly all of
> the files are static in nature.

FWIW, my use-case doesn't use either subvolumes or snapshots, either.  I 
prefer independent filesystems as they protect against filesystem 
meltdown while snapshots/subvolumes don't, and with an already functional 
configuration of partitions and filesystems from well before btrfs, 
throwing in the additional complexity of snapshots and subvolumes would 
simply complexify administration for me, and keeping my setup simple 
enough that I can actually understand it and manage it in a disaster 
recovery situation is for me an important component of my disaster 
management strategy.  In my time I've rejected more than one 
technological "solution" as too complex to master it sufficiently that I 
can be confident of my ability to recover to a working state under the 
pressures of disaster recovery, and the additional complexity of 
snapshots and subvolumes, when they clearly weren't needed in my 
situation, were simply one more thing to add to that pile of unnecessary 
complexity, rejected as an impediment to confident and reliable disaster 
recovery. =:^)

> As far as backups go, we're talking about a home server/workstation.
> While I used to go through an excruciating budget battle every year on a
> professional level in my usually futile fight for disaster recovery
> planning funding, my personal budget is much, much more limited.

FWIW, here as well on the budget crunch angle, but with the difference 
being that unless it really was throw-away data, there's no way I'd trust 
btrfs raid56 mode without a backup, kept much more current than I tend to 
keep mine, BTW, as it's simply too immature still for that use case, at 
least from my point of view.

And for much the same reason I'd hesitate to recommend btrfs for use in a 
no-at-hand-backups situation as well.  Tho you've made it plain that in 
general, you do have backups... in the form of original source DVDs, etc, 
for most of your content.  It's simply not conveniently at hand and will 
take you quite some work to re-dump/re-convert, etc.  While I'm obviously 
bleeding edge enough to try btrfs here, it /is/ with backups, and I'm 
just conservative enough that I'd really not use btrfs personally for 
your use-case, nor recommend it, because in my judgement your backups, 
original sources in some cases, are simply not conveniently enough at 
hand to be something I'd consider worth the risk.

> Of the 53T currently in limbo, about ~6-8T are on several hundreds of
> DVDs. About 10T are on the hard drives of my prior system, which
> needs a replacement motherboard. {I had rsynced the data to a new build
> system just before imminent
>  failure}. Most of the rest can be re-collected from a variety of
> still existing sources as I still have the file names and link locations
> on a separate file system. My 'disaster recovery plans' assume patience,
> a limited budget and knowing where everything came from originally.
> Backups are a completely different issue. My backup planning won't be
> complete for another 12 months or so, since it essentially means building a
> duplicate system. Since my budget funding is limited, my duplicate
> system is happening piecemeal every other month or so.

I do hope you understand the implications of btrfs restore, in that 
case.  You aren't restoring to the damaged filesystem, you're grabbing 
files off that filesystem and copying them elsewhere, which means you 
must have free space elsewhere to copy them to.

Which means if it's 53T in limbo, you better either prioritize that 
restore and limit it to only what you actually have space elsewhere to 
restore to (using the regex pattern option), or have 53T of space 
available to restore to.

Or did you mean "in limbo" in more global terms, encompassing multiple 
btrfs and perhaps other non-btrfs filesystems as well, with this one 
being a much smaller fraction of that 53T, say 5-10T.  Even that's 
sizable, but it's a lot easier to come up with space to recover 5-10T to, 
than to recover 53T to, for sure.

> I do understand both backups {having implemented real time transaction
> journalling to tape combined with weekly 'dd'
> copies to tape, monthly full backups with 6 month retention
> yada-yada-yada} and disaster recovery planning. Been there.
> Done that. Saved my ___ multiple times.
> 
> The crux is always funding.
> 
> Naturally, using 'btrfs restore' successfully will go a long way
> towards shortening the recovery process.
> 
> It will be about a week before I can begin since I need to acquire the
> restore destination storage first.

OK, it looks like you /do/ understand that you'll need additional space 
to write that data to, and are already working on getting it.

Not to be a killjoy, but perhaps this can be a lesson.  Had you had that 
53T or whatever already backed up, you wouldn't need to be looking for 
that space now, as you'd already have it covered.  And you'd need the 
same space either way.  Tho to be fair, that's 53T of space you did get 
to put off purchase of... tho at the cost of risking losing it and having 
to resort to restoring from original sources.

Personally, flying without a backup net like that, I'd choose for myself 
some other more mature filesystem.  I use reiserfs for my own media 
partitions (on spinning rust, btrfs is all on ssd, here).  I do have it 
backed up, tho not to what I'd like, but given the hardware faults I've 
had with reiserfs and recovered, including head-crashes when the AC went 
out and the drive massively overheated (in Phoenix, it was 50C+ inside 
when I got to it, likely 60C in the computer, and very easily 70C head 
and platter temp, but the unmounted backup partitions on the drive worked 
just fine when everything cooled down), despite the head-crash and 
subsequent damage on some of the mounted partitions, bad memory, caps 
going bad on an old mobo... and the fact that I do have old and very 
stale copies of the data on now out of regular service devices... I 
expect I could recover most of it, and what I couldn't I'd simply do 
without.

> Thank you for explaining the process.

=:^)

FWIW... My dad was a teacher before he retired.  As he always said, the 
best way to learn something, even better than simply doing it yourself, 
is to try to teach it to others.  Others will come from different 
viewpoints and will ask questions and require explanations for areas you 
otherwise allow yourself to gloss over and to never really understand, so 
in explaining it to others, you learn it far better yourself.

For sure my dad knew whereof he spoke, on that one! It's not only the one 
asking that benefits from the answer!  =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: unable to mount btrfs pool even with -oro,recovery,degraded, unable to do 'btrfs restore'
  2016-04-06 23:08 ` Chris Murphy
@ 2016-04-07 11:19   ` Austin S. Hemmelgarn
  2016-04-07 11:31     ` Austin S. Hemmelgarn
  2016-04-07 19:32     ` Chris Murphy
  2016-04-07 11:29   ` Austin S. Hemmelgarn
  1 sibling, 2 replies; 23+ messages in thread
From: Austin S. Hemmelgarn @ 2016-04-07 11:19 UTC (permalink / raw)
  To: Chris Murphy, Ank Ular; +Cc: Btrfs BTRFS

On 2016-04-06 19:08, Chris Murphy wrote:
> On Wed, Apr 6, 2016 at 9:34 AM, Ank Ular <ankular.anime@gmail.com> wrote:
>
>>
>>  From the output of 'dmesg', the section:
>> [   20.998071] BTRFS: device label FSgyroA devid 9 transid 625039 /dev/sdm
>> [   20.999984] BTRFS: device label FSgyroA devid 10 transid 625039 /dev/sdn
>> [   21.004127] BTRFS: device label FSgyroA devid 11 transid 625039 /dev/sds
>> [   21.011808] BTRFS: device label FSgyroA devid 12 transid 625039 /dev/sdu
>>
>> bothers me because the transid value of these four devices doesn't
>> match the other 16 devices in the pool {should be 625065}. In theory,
>> I believe these should all have the same transid value. These four
>> devices are all on a single USB 3.0 port and this is the link I
>> believe went down and came back up.
>
> This is effectively a 4 disk failure and raid6 only allows for 2.
>
> Now, a valid complaint is that as soon as Btrfs is seeing write
> failures for 3 devices, it needs to go read-only. Specifically, it
> would go read only upon 3 or more write errors affecting a single full
> raid stripe (data and parity strips combined); and that's because such
> a write is fully failed.
AFAIUI, currently, BTRFS will fail that stripe but not retry it; _but_ 
after that, it will start writing out narrower stripes across the 
remaining disks if there are enough of them to maintain data 
consistency (at least 3 for raid6, I think; I don't remember whether 
our lower limit is 3, which is degenerate, or 4, which isn't, but most 
other software won't let you use it for some stupid reason).  Based on 
this, if the FS does get recovered, make sure to run a balance on it 
too, otherwise you might have some sub-optimal striping for some data.
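
A plain full balance against the mounted filesystem should be enough
for that, since it rewrites every chunk across the now-complete set of
devices (the mountpoint is just the one used earlier in the thread):

   btrfs balance start /PublicA
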
>
> Now, maybe there's a way to just retry that stripe? During heavy
> writing, there are probably multiple stripes in flight. But in real
> short order the file system, I think, needs to face plant; going read
> only (or even a graceful crash) is better than continuing to write to
> n-4 drives, which is a bunch of bogus data, in effect.
Actually, because of how things get serialized, there probably aren't a 
huge number of stripes in flight (IIRC, there can be at most 8 in flight 
assuming you don't set a custom thread-pool size, but even that is 
extremely unlikely unless you're writing huge amounts of data).  That 
said, we need to at least be very noisy about this happening, and not 
just log something and go on with life.  Ideally, we should have a way 
to retry the failed stripe after narrowing it to the number of drives.
>
> I'm gonna guess the superblock on all the surviving drives is wrong,
> because it sounds like the file system didn't immediately go read only
> when the four drives vanished?
>
> However, there is probably really valuable information in the
> superblocks of the failed devices. The file system should be
> consistent as of the generation on those missing devices. If there's a
> way to roll back the file system to those supers, including using
> their trees, then it should be possible to get the file system back -
> while accepting 100% data loss between generation 625039 and 625065.
> That's already 100% data loss anyway, if it was still doing n-4 device
> writes - those are bogus generations.
>
> Since this is entirely COW, nothing should be lost. All the data
> necessary to go back to generation 625039 is on all drives. And none
> of the data after that is usable anyway. Possibly even 625038 is the
> last good one on every single drive.
>
> So what you should try to do is get supers on every drive. There are
> three super blocks per drive. And there are four backups per super. So
> that's potentially 12 slots per drive times 20 drives. That's a lot of
> data for you to look through but that's what you have to do. The top
> task would be to see if the three supers are the same on each device,
> if so, then that cuts the comparison down by 1/3. And then compare the
> supers across devices. You can get this with btrfs-show-super -fa.
>
> You might look in another thread about how to setup an overlay for 16
> of the 20 drives; making certain you obfuscate the volume UUID of the
> original, only allowing that UUID to appear via the overlay (of the
> same volume+device UUID appear to the kernel, e.g. using LVM snapshots
> of either thick or thin variety and making both visible and then
> trying to mount one of them). Others have done this I think remotely
> to make sure the local system only sees the overlay devices. Anyway,
> this allows you to make destructive changes non-destructively. What I
> can't tell you off hand is if any of the tools will let you
> specifically accept the superblocks from the four "good" devices that
> went offline abruptly, and adapt them to the other 16, i.e. rolling
> back the 16 that went too far forward without the other 4. Make sense?
>
> Note. You can't exactly copy the super block from one device to
> another because it contains a dev UUID. So first you need to look at a
> superblock for any two of the four "good" devices, and compare them.
> Exactly how do they differ? They should only differ with
> dev_item.devid, dev_item.uuid, and maybe dev_item.total_bytes and
> hopefully not but maybe dev_item.bytes_used. And then somehow adapt
> this for the other 16 drives. I'd love it if there's a tool that does
> this, maybe 'btrfs rescue super-recover' but there are no meaningful
> options with that command so I'm skeptical how it knows what's bad and
> what's good.


>
> You literally might have to splice superblocks and write them to 16
> drives in exactly 3 locations per drive (well, maybe just one of them,
> and then delete the magic from the other two, and then 'btrfs rescue
> super-recover' should then use the one good copy to fix the two bad
> copies).
>
> Sigh.... maybe?
>
> In theory it's possible, I just don't know the state of the tools. But
> I'm fairly sure the best chance of recovery is going to be on the 4
> drives that abruptly vanished.  Their supers will be mostly correct or
> close to it: and that's what has all the roots in it: tree, fs, chunk,
> extent and csum. And all of those states are better farther in the
> past, rather than the 16 drives that have much newer writes.
FWIW, it is actually possible to do this, I've done it before myself on 
much smaller raid1 filesystems with single drives disappearing, and once 
with a raid6 filesystem with a double drive failure.  It is by no means 
easy, and there's not much in the tools that helps with it, but it is 
possible (although I sincerely hope I never have to do it again myself).
>
> Of course it is possible there's corruption problems with those four
> drives having vanished while writes were incomplete. But if you're
> lucky, data write happen first, then metadata writes second, and only
> then is the super updated. So the super should point to valid metadata
> and that should point to valid data. If that order is wrong, then it's
> bad news and you have to look at backup roots. But *if* you get all
> the supers correct and on the same page, you can access the backup
> roots by using -o recovery if corruption is found with a normal mount.
This, though, is where the potential issue is.  -o recovery will only go 
back so many generations before refusing to mount, and I think that may 
be why it's not working now.



* Re: unable to mount btrfs pool even with -oro,recovery,degraded, unable to do 'btrfs restore'
  2016-04-06 23:08 ` Chris Murphy
  2016-04-07 11:19   ` Austin S. Hemmelgarn
@ 2016-04-07 11:29   ` Austin S. Hemmelgarn
  1 sibling, 0 replies; 23+ messages in thread
From: Austin S. Hemmelgarn @ 2016-04-07 11:29 UTC (permalink / raw)
  To: Chris Murphy, Ank Ular; +Cc: Btrfs BTRFS

On 2016-04-06 19:08, Chris Murphy wrote:
> On Wed, Apr 6, 2016 at 9:34 AM, Ank Ular <ankular.anime@gmail.com> wrote:
>
>>
>>  From the output of 'dmesg', the section:
>> [   20.998071] BTRFS: device label FSgyroA devid 9 transid 625039 /dev/sdm
>> [   20.999984] BTRFS: device label FSgyroA devid 10 transid 625039 /dev/sdn
>> [   21.004127] BTRFS: device label FSgyroA devid 11 transid 625039 /dev/sds
>> [   21.011808] BTRFS: device label FSgyroA devid 12 transid 625039 /dev/sdu
>>
>> bothers me because the transid value of these four devices doesn't
>> match the other 16 devices in the pool {should be 625065}. In theory,
>> I believe these should all have the same transid value. These four
>> devices are all on a single USB 3.0 port and this is the link I
>> believe went down and came back up.
>
> This is effectively a 4 disk failure and raid6 only allows for 2.
>
> Now, a valid complaint is that as soon as Btrfs is seeing write
> failures for 3 devices, it needs to go read-only. Specifically, it
> would go read only upon 3 or more write errors affecting a single full
> raid stripe (data and parity strips combined); and that's because such
> a write is fully failed.
AFAIUI, currently, BTRFS will fail that stripe but not retry it; _but_ 
after that, it will start writing out narrower stripes across the 
remaining disks if there are enough of them to maintain data 
consistency (at least 3 for raid6, I think; I don't remember whether 
our lower limit is 3, which is degenerate, or 4, which isn't, but most 
other software won't let you use it for some stupid reason).  Based on 
this, if the FS does get recovered, make sure to run a balance on it 
too, otherwise you might have some sub-optimal striping for some data.
>
> Now, maybe there's a way to just retry that stripe? During heavy
> writing, there are probably multiple stripes in flight. But in real
> short order the file system, I think, needs to face plant; going read
> only (or even a graceful crash) is better than continuing to write to
> n-4 drives, which is a bunch of bogus data, in effect.
Actually, because of how things get serialized, there probably aren't a 
huge number of stripes in flight (IIRC, there can be at most 8 in flight 
assuming you don't set a custom thread-pool size, but even that is 
extremely unlikely unless you're writing huge amounts of data).  That 
said, we need to at least be very noisy about this happening, and not 
just log something and go on with life.  Ideally, we should have a way 
to retry the failed stripe after narrowing it to the remaining number of drives.
>
> I'm gonna guess the superblock on all the surviving drives is wrong,
> because it sounds like the file system didn't immediately go read only
> when the four drives vanished?
>
> However, there is probably really valuable information in the
> superblocks of the failed devices. The file system should be
> consistent as of the generation on those missing devices. If there's a
> way to roll back the file system to those supers, including using
> their trees, then it should be possible to get the file system back -
> while accepting 100% data loss between generation 625039 and 625065.
> That's already 100% data loss anyway, if it was still doing n-4 device
> writes - those are bogus generations.
>
> Since this is entirely COW, nothing should be lost. All the data
> necessary to go back to generation 625039 is on all drives. And none
> of the data after that is usable anyway. Possibly even 625038 is the
> last good one on every single drive.
>
> So what you should try to do is get supers on every drive. There are
> three super blocks per drive. And there are four backups per super. So
> that's potentially 12 slots per drive times 20 drives. That's a lot of
> data for you to look through but that's what you have to do. The top
> task would be to see if the three supers are the same on each device,
> if so, then that cuts the comparison down by 1/3. And then compare the
> supers across devices. You can get this with btrfs-show-super -fa.
>
> You might look in another thread about how to setup an overlay for 16
> of the 20 drives; making certain you obfuscate the volume UUID of the
> original, only allowing that UUID to appear via the overlay (of the
> same volume+device UUID appear to the kernel, e.g. using LVM snapshots
> of either thick or thin variety and making both visible and then
> trying to mount one of them). Others have done this I think remotely
> to make sure the local system only sees the overlay devices. Anyway,
> this allows you to make destructive changes non-destructively. What I
> can't tell you off hand is if any of the tools will let you
> specifically accept the superblocks from the four "good" devices that
> went offline abruptly, and adapt them to the other 16, i.e. rolling
> back the 16 that went too far forward without the other 4. Make sense?
>
> Note. You can't exactly copy the super block from one device to
> another because it contains a dev UUID. So first you need to look at a
> superblock for any two of the four "good" devices, and compare them.
> Exactly how do they differ? They should only differ with
> dev_item.devid, dev_item.uuid, and maybe dev_item.total_bytes and
> hopefully not but maybe dev_item.bytes_used. And then somehow adapt
> this for the other 16 drives. I'd love it if there's a tool that does
> this, maybe 'btrfs rescue super-recover' but there are no meaningful
> options with that command so I'm skeptical how it knows what's bad and
> what's good.
While I don't know what exactly it does currently, a roughly ideal 
method would be:
1. Check each SB, if it has both a valid checksum and magic number and 
points to a valid root, mark it valid.
2. If only one SB is valid, copy that over the other two and exit.
3. If more than one SB is valid and two of them point to the same root, 
copy that info to the third and exit (on all the occasions I've needed 
super-recover, this was the state of the super blocks on the filesystem in 
question).
4. If more than one SB is valid and none of them point to the same root, 
or none of them are valid, pick one based on user input (command line 
switches or a prompt).
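
For reference, a rough way to do the comparison by hand (the device 
list below is a placeholder, and this assumes a btrfs-progs that still 
ships btrfs-show-super) would be something like:

# dump every super copy of every device, keeping only the fields that
# should be compared across devices
for dev in /dev/sd[b-u]; do
    for copy in 0 1 2; do
        echo "== $dev, super copy $copy =="
        btrfs-show-super -f -i "$copy" "$dev" | \
            egrep 'generation|dev_item.devid|dev_item.uuid|backup_tree_root'
    done
done
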
>
> You literally might have to splice superblocks and write them to 16
> drives in exactly 3 locations per drive (well, maybe just one of them,
> and then delete the magic from the other two, and then 'btrfs rescue
> super-recover' should then use the one good copy to fix the two bad
> copies).
>
> Sigh.... maybe?
>
> In theory it's possible, I just don't know the state of the tools. But
> I'm fairly sure the best chance of recovery is going to be on the 4
> drives that abruptly vanished.  Their supers will be mostly correct or
> close to it: and that's what has all the roots in it: tree, fs, chunk,
> extent and csum. And all of those states are better farther in the
> past, rather than the 16 drives that have much newer writes.
FWIW, it is actually possible to do this; I've done it before myself on 
much smaller raid1 filesystems with single drives disappearing, and once 
with a raid6 filesystem with a double drive failure.  It is by no means 
easy, and there's not much in the tools that helps with it, but it is 
possible (although I sincerely hope I never have to do it again myself).
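
For anyone wanting to try the overlay route Chris describes above 
non-destructively, a rough sketch using non-persistent device-mapper 
snapshots (device names, sizes and paths are placeholders; the key 
point is that btrfs must only ever be pointed at the overlays, never 
the originals) would be:

# per original device: back a non-persistent dm snapshot with a sparse file
truncate -s 20G /overlays/cow-sdb
cow=$(losetup -f --show /overlays/cow-sdb)
size=$(blockdev --getsz /dev/sdb)        # size in 512-byte sectors
dmsetup create overlay-sdb --table "0 $size snapshot /dev/sdb $cow N 8"
# repeat for each device, then work only on /dev/mapper/overlay-*
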
>
> Of course it is possible there's corruption problems with those four
> drives having vanished while writes were incomplete. But if you're
> lucky, data writes happen first, then metadata writes second, and only
> then is the super updated. So the super should point to valid metadata
> and that should point to valid data. If that order is wrong, then it's
> bad news and you have to look at backup roots. But *if* you get all
> the supers correct and on the same page, you can access the backup
> roots by using -o recovery if corruption is found with a normal mount.
This, though, is where the potential issue is: -o recovery will only go 
back so many generations before refusing to mount, and I think that may 
be why it's not working now.
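
If -o recovery does refuse the mount for that reason, it may still be 
worth pointing btrfs restore at an older tree root by hand. A rough 
sketch (the bytenr and target directory are placeholders; pick a root 
from whatever btrfs-find-root reports):

# list candidate tree roots (and their generations) still present on disk
btrfs-find-root /dev/sdb

# dry-run a restore from one specific root; drop -D once a root looks sane
btrfs restore -t <bytenr> -D -v /dev/sdb /mnt/restore-target
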


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: unable to mount btrfs pool even with -oro,recovery,degraded, unable to do 'btrfs restore'
  2016-04-07 11:19   ` Austin S. Hemmelgarn
@ 2016-04-07 11:31     ` Austin S. Hemmelgarn
  2016-04-07 19:32     ` Chris Murphy
  1 sibling, 0 replies; 23+ messages in thread
From: Austin S. Hemmelgarn @ 2016-04-07 11:31 UTC (permalink / raw)
  To: Chris Murphy, Ank Ular; +Cc: Btrfs BTRFS

Sorry about the almost duplicate mail, Thunderbird's 'Send' button 
happens to be right below 'Undo' when you open the edit menu...

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: unable to mount btrfs pool even with -oro,recovery,degraded, unable to do 'btrfs restore'
  2016-04-07 11:19   ` Austin S. Hemmelgarn
  2016-04-07 11:31     ` Austin S. Hemmelgarn
@ 2016-04-07 19:32     ` Chris Murphy
  2016-04-08 11:29       ` Austin S. Hemmelgarn
  1 sibling, 1 reply; 23+ messages in thread
From: Chris Murphy @ 2016-04-07 19:32 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Chris Murphy, Ank Ular, Btrfs BTRFS

On Thu, Apr 7, 2016 at 5:19 AM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:
> On 2016-04-06 19:08, Chris Murphy wrote:
>>
>> On Wed, Apr 6, 2016 at 9:34 AM, Ank Ular <ankular.anime@gmail.com> wrote:
>>
>>>
>>>  From the output of 'dmesg', the section:
>>> [   20.998071] BTRFS: device label FSgyroA devid 9 transid 625039
>>> /dev/sdm
>>> [   20.999984] BTRFS: device label FSgyroA devid 10 transid 625039
>>> /dev/sdn
>>> [   21.004127] BTRFS: device label FSgyroA devid 11 transid 625039
>>> /dev/sds
>>> [   21.011808] BTRFS: device label FSgyroA devid 12 transid 625039
>>> /dev/sdu
>>>
>>> bothers me because the transid value of these four devices doesn't
>>> match the other 16 devices in the pool {should be 625065}. In theory,
>>> I believe these should all have the same transid value. These four
>>> devices are all on a single USB 3.0 port and this is the link I
>>> believe went down and came back up.
>>
>>
>> This is effectively a 4 disk failure and raid6 only allows for 2.
>>
>> Now, a valid complaint is that as soon as Btrfs is seeing write
>> failures for 3 devices, it needs to go read-only. Specifically, it
>> would go read only upon 3 or more write errors affecting a single full
>> raid stripe (data and parity strips combined); and that's because such
>> a write is fully failed.
>
> AFAIUI, currently, BTRFS will fail that stripe, but not retry it, _but_
> after that, it will start writing out narrower stripes across the remaining
> disks if there are enough for it to maintain the data consistency (so if
> there's at least 3 for raid6 (I think, I don't remember if our lower limit
> is 3 (which is degenerate), or 4 (which isn't, but most other software won't
> let you use it for some stupid reason))).  Based on this, if the FS does get
> recovered, make sure to run a balance on it too, otherwise you might have
> some sub-optimal striping for some data.

I can see this happening automatically with up to 2 device
failures, so that all subsequent writes are fully intact stripe
writes. But the instant there's a 3rd device failure, there's a rather
large hole in the file system that can't be reconstructed. It's an
invalid file system. I'm not sure what can be gained by allowing
writes to continue, other than tying off loose ends (so to speak) with
full stripe metadata writes for the purpose of making recovery
possible and easier, but after that metadata is written - poof, go
read only.




>
>
>
>>
>> You literally might have to splice superblocks and write them to 16
>> drives in exactly 3 locations per drive (well, maybe just one of them,
>> and then delete the magic from the other two, and then 'btrfs rescue
>> super-recover' should then use the one good copy to fix the two bad
>> copies).
>>
>> Sigh.... maybe?
>>
>> In theory it's possible, I just don't know the state of the tools. But
>> I'm fairly sure the best chance of recovery is going to be on the 4
>> drives that abruptly vanished.  Their supers will be mostly correct or
>> close to it: and that's what has all the roots in it: tree, fs, chunk,
>> extent and csum. And all of those states are better farther in the
>> past, rather than the 16 drives that have much newer writes.
>
> FWIW, it is actually possible to do this, I've done it before myself on much
> smaller raid1 filesystems with single drives disappearing, and once with a
> raid6 filesystem with a double drive failure.  It is by no means easy, and
> there's not much in the tools that helps with it, but it is possible
> (although I sincerely hope I never have to do it again myself).

I think considering the idea of Btrfs is to be more scalable than past
storage and filesystems have been, it needs to be able to deal with
transient failures like this. In theory all available information is
written on all the disks. This was a temporary failure. Once all
devices are made available again, the fs should be able to figure out
what to do, even so far as salvaging the writes that happened after
the 4 devices went missing if those were successful full stripe
writes.



>>
>>
>> Of course it is possible there's corruption problems with those four
>> drives having vanished while writes were incomplete. But if you're
>> lucky, data writes happen first, then metadata writes second, and only
>> then is the super updated. So the super should point to valid metadata
>> and that should point to valid data. If that order is wrong, then it's
>> bad news and you have to look at backup roots. But *if* you get all
>> the supers correct and on the same page, you can access the backup
>> roots by using -o recovery if corruption is found with a normal mount.
>
> This though is where the potential issue is.  -o recovery will only go back
> so many generations before refusing to mount, and I think that may be why
> it's not working now..

It also looks like none of the tools are considering the stale supers
on the formerly missing 4 devices. I still think those are the best
chance to recover because even if their most current data is wrong due
to reordered writes not making it to stable storage, one of the
available backups in those supers should be good.




-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: unable to mount btrfs pool even with -oro,recovery,degraded, unable to do 'btrfs restore'
  2016-04-07 19:32     ` Chris Murphy
@ 2016-04-08 11:29       ` Austin S. Hemmelgarn
  2016-04-08 16:17         ` Chris Murphy
  2016-04-08 18:05         ` unable to mount btrfs pool even with -oro,recovery,degraded, unable to do 'btrfs restore' Chris Murphy
  0 siblings, 2 replies; 23+ messages in thread
From: Austin S. Hemmelgarn @ 2016-04-08 11:29 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Ank Ular, Btrfs BTRFS

On 2016-04-07 15:32, Chris Murphy wrote:
> On Thu, Apr 7, 2016 at 5:19 AM, Austin S. Hemmelgarn
> <ahferroin7@gmail.com> wrote:
>> On 2016-04-06 19:08, Chris Murphy wrote:
>>>
>>> On Wed, Apr 6, 2016 at 9:34 AM, Ank Ular <ankular.anime@gmail.com> wrote:
>>>
>>>>
>>>>   From the output of 'dmesg', the section:
>>>> [   20.998071] BTRFS: device label FSgyroA devid 9 transid 625039
>>>> /dev/sdm
>>>> [   20.999984] BTRFS: device label FSgyroA devid 10 transid 625039
>>>> /dev/sdn
>>>> [   21.004127] BTRFS: device label FSgyroA devid 11 transid 625039
>>>> /dev/sds
>>>> [   21.011808] BTRFS: device label FSgyroA devid 12 transid 625039
>>>> /dev/sdu
>>>>
>>>> bothers me because the transid value of these four devices doesn't
>>>> match the other 16 devices in the pool {should be 625065}. In theory,
>>>> I believe these should all have the same transid value. These four
>>>> devices are all on a single USB 3.0 port and this is the link I
>>>> believe went down and came back up.
>>>
>>>
>>> This is effectively a 4 disk failure and raid6 only allows for 2.
>>>
>>> Now, a valid complaint is that as soon as Btrfs is seeing write
>>> failures for 3 devices, it needs to go read-only. Specifically, it
>>> would go read only upon 3 or more write errors affecting a single full
>>> raid stripe (data and parity strips combined); and that's because such
>>> a write is fully failed.
>>
>> AFAIUI, currently, BTRFS will fail that stripe, but not retry it, _but_
>> after that, it will start writing out narrower stripes across the remaining
>> disks if there are enough for it to maintain the data consistency (so if
>> there's at least 3 for raid6 (I think, I don't remember if our lower limit
>> is 3 (which is degenerate), or 4 (which isn't, but most other software won't
>> let you use it for some stupid reason))).  Based on this, if the FS does get
>> recovered, make sure to run a balance on it too, otherwise you might have
>> some sub-optimal striping for some data.
>
> I can see this happening automatically with up to 2 device
> failures, so that all subsequent writes are fully intact stripe
> writes. But the instant there's a 3rd device failure, there's a rather
> large hole in the file system that can't be reconstructed. It's an
> invalid file system. I'm not sure what can be gained by allowing
> writes to continue, other than tying off loose ends (so to speak) with
> full stripe metadata writes for the purpose of making recovery
> possible and easier, but after that metadata is written - poof, go
> read only.
I don't mean writing partial stripes, I mean writing full stripes with a 
reduced width (so in an 8 device filesystem, if 3 devices fail, we can 
still technically write a complete stripe across 5 devices, but it will 
result in less total space we can use).  Whether or not this behavior is 
correct is another argument, but that appears to be what we do 
currently.  Ideally, this should be a mount option, as strictly 
speaking, it's policy, which therefore shouldn't be in the kernel.
>
>>
>>>
>>> You literally might have to splice superblocks and write them to 16
>>> drives in exactly 3 locations per drive (well, maybe just one of them,
>>> and then delete the magic from the other two, and then 'btrfs rescue
>>> super-recover' should then use the one good copy to fix the two bad
>>> copies).
>>>
>>> Sigh.... maybe?
>>>
>>> In theory it's possible, I just don't know the state of the tools. But
>>> I'm fairly sure the best chance of recovery is going to be on the 4
>>> drives that abruptly vanished.  Their supers will be mostly correct or
>>> close to it: and that's what has all the roots in it: tree, fs, chunk,
>>> extent and csum. And all of those states are better farther in the
>>> past, rather than the 16 drives that have much newer writes.
>>
>> FWIW, it is actually possible to do this, I've done it before myself on much
>> smaller raid1 filesystems with single drives disappearing, and once with a
>> raid6 filesystem with a double drive failure.  It is by no means easy, and
>> there's not much in the tools that helps with it, but it is possible
>> (although I sincerely hope I never have to do it again myself).
>
> I think considering the idea of Btrfs is to be more scalable than past
> storage and filesystems have been, it needs to be able to deal with
> transient failures like this. In theory all available information is
> written on all the disks. This was a temporary failure. Once all
> devices are made available again, the fs should be able to figure out
> what to do, even so far as salvaging the writes that happened after
> the 4 devices went missing if those were successful full stripe
> writes.
I entirely agree.  If the fix doesn't require any kind of decision to be 
made other than whether to fix it or not, it should be trivially fixable 
with the tools.  TBH though, this particular issue with devices 
disappearing and reappearing could be fixed easier in the block layer 
(at least, there are things that need to be fixed WRT it in the block 
layer).
>
>>>
>>> Of course it is possible there's corruption problems with those four
>>> drives having vanished while writes were incomplete. But if you're
>>> lucky, data writes happen first, then metadata writes second, and only
>>> then is the super updated. So the super should point to valid metadata
>>> and that should point to valid data. If that order is wrong, then it's
>>> bad news and you have to look at backup roots. But *if* you get all
>>> the supers correct and on the same page, you can access the backup
>>> roots by using -o recovery if corruption is found with a normal mount.
>>
>> This though is where the potential issue is.  -o recovery will only go back
>> so many generations before refusing to mount, and I think that may be why
>> it's not working now..
>
> It also looks like none of the tools are considering the stale supers
> on the formerly missing 4 devices. I still think those are the best
> chance to recover because even if their most current data is wrong due
> to reordered writes not making it to stable storage, one of the
> available backups in those supers should be good.
>
Depending on utilization on the other devices though, they may not point 
to complete roots either.  In this case, they probably will because of 
the low write frequency.  In other cases, they may not though, because 
we try to reuse space in chunks before allocating new chunks.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: unable to mount btrfs pool even with -oro,recovery,degraded, unable to do 'btrfs restore'
  2016-04-08 11:29       ` Austin S. Hemmelgarn
@ 2016-04-08 16:17         ` Chris Murphy
  2016-04-08 19:23           ` Missing device handling (was: 'unable to mount btrfs pool...') Austin S. Hemmelgarn
  2016-04-08 18:05         ` unable to mount btrfs pool even with -oro,recovery,degraded, unable to do 'btrfs restore' Chris Murphy
  1 sibling, 1 reply; 23+ messages in thread
From: Chris Murphy @ 2016-04-08 16:17 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Chris Murphy, Ank Ular, Btrfs BTRFS

On Fri, Apr 8, 2016 at 5:29 AM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:

>> I can see this happening automatically with up to 2 device
>> failures, so that all subsequent writes are fully intact stripe
>> writes. But the instant there's a 3rd device failure, there's a rather
>> large hole in the file system that can't be reconstructed. It's an
>> invalid file system. I'm not sure what can be gained by allowing
>> writes to continue, other than tying off loose ends (so to speak) with
>> full stripe metadata writes for the purpose of making recovery
>> possible and easier, but after that metadata is written - poof, go
>> read only.
>
> I don't mean writing partial stripes, I mean writing full stripes with a
> reduced width (so in an 8 device filesystem, if 3 devices fail, we can still
> technically write a complete stripe across 5 devices, but it will result in
> less total space we can use).

I understand what you mean; it was clear before. The problem is that
once it's below the critical number of drives, the previously existing
file system is busted. So it should go read only. But it can't because
it doesn't yet have the concept of faulty devices, *and* also an
understanding of how many faulty devices can be tolerated before
there's a totally untenable hole in the file system.




>Whether or not this behavior is correct is
> another argument, but that appears to be what we do currently.  Ideally,
> this should be a mount option, as strictly speaking, it's policy, which
> therefore shouldn't be in the kernel.

I think we can definitely agree the current behavior is suboptimal
because in fact whatever it wrote to 16 drives was sufficiently
confusing that mounting all 20 drives again isn't possible no matter
what option is used.




>> I think considering the idea of Btrfs is to be more scalable than past
>> storage and filesystems have been, it needs to be able to deal with
>> transient failures like this. In theory all available information is
>> written on all the disks. This was a temporary failure. Once all
>> devices are made available again, the fs should be able to figure out
>> what to do, even so far as salvaging the writes that happened after
>> the 4 devices went missing if those were successful full stripe
>> writes.
>
> I entirely agree.  If the fix doesn't require any kind of decision to be
> made other than whether to fix it or not, it should be trivially fixable
> with the tools.  TBH though, this particular issue with devices disappearing
> and reappearing could be fixed easier in the block layer (at least, there
> are things that need to be fixed WRT it in the block layer).

Right. The block layer needs a way to communicate device missing to
Btrfs and Btrfs needs to have some tolerance for transience.

>>
>>
>>>>
>>>> Of course it is possible there's corruption problems with those four
>>>> drives having vanished while writes were incomplete. But if you're
>>>> lucky, data writes happen first, then metadata writes second, and only
>>>> then is the super updated. So the super should point to valid metadata
>>>> and that should point to valid data. If that order is wrong, then it's
>>>> bad news and you have to look at backup roots. But *if* you get all
>>>> the supers correct and on the same page, you can access the backup
>>>> roots by using -o recovery if corruption is found with a normal mount.
>>>
>>>
>>> This though is where the potential issue is.  -o recovery will only go
>>> back
>>> so many generations before refusing to mount, and I think that may be why
>>> it's not working now..
>>
>>
>> It also looks like none of the tools are considering the stale supers
>> on the formerly missing 4 devices. I still think those are the best
>> chance to recover because even if their most current data is wrong due
>> to reordered writes not making it to stable storage, one of the
>> available backups in those supers should be good.
>>
> Depending on utilization on the other devices though, they may not point to
> complete roots either.  In this case, they probably will because of the low
> write frequency.  In other cases, they may not though, because we try to
> reuse space in chunks before allocating new chunks.

Based on the superblock posted, I think the *38 generation tree might
be incomplete, but there's a *37 and *36 generation that should be
intact. Chunk generation is the same.

What complicates the rollback is if any deletions were happening at the
time. If it's just file additions, I think a rollback has a good
chance of working. It's just tedious.


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: unable to mount btrfs pool even with -oro,recovery,degraded, unable to do 'btrfs restore'
  2016-04-08 11:29       ` Austin S. Hemmelgarn
  2016-04-08 16:17         ` Chris Murphy
@ 2016-04-08 18:05         ` Chris Murphy
  2016-04-08 18:18           ` Austin S. Hemmelgarn
  1 sibling, 1 reply; 23+ messages in thread
From: Chris Murphy @ 2016-04-08 18:05 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Ank Ular, Btrfs BTRFS

On Fri, Apr 8, 2016 at 5:29 AM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:

> I entirely agree.  If the fix doesn't require any kind of decision to be
> made other than whether to fix it or not, it should be trivially fixable
> with the tools.  TBH though, this particular issue with devices disappearing
> and reappearing could be fixed easier in the block layer (at least, there
> are things that need to be fixed WRT it in the block layer).

Another feature needed for transient failures with large storage is
some kind of partial scrub, along the lines of md's partial resync when
there's a write-intent bitmap.


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: unable to mount btrfs pool even with -oro,recovery,degraded, unable to do 'btrfs restore'
  2016-04-08 18:05         ` unable to mount btrfs pool even with -oro,recovery,degraded, unable to do 'btrfs restore' Chris Murphy
@ 2016-04-08 18:18           ` Austin S. Hemmelgarn
  2016-04-08 18:30             ` Chris Murphy
  0 siblings, 1 reply; 23+ messages in thread
From: Austin S. Hemmelgarn @ 2016-04-08 18:18 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Ank Ular, Btrfs BTRFS

On 2016-04-08 14:05, Chris Murphy wrote:
> On Fri, Apr 8, 2016 at 5:29 AM, Austin S. Hemmelgarn
> <ahferroin7@gmail.com> wrote:
>
>> I entirely agree.  If the fix doesn't require any kind of decision to be
>> made other than whether to fix it or not, it should be trivially fixable
>> with the tools.  TBH though, this particular issue with devices disappearing
>> and reappearing could be fixed easier in the block layer (at least, there
>> are things that need to be fixed WRT it in the block layer).
>
> Another feature needed for transient failures with large storage, is
> some kind of partial scrub, along the lines of md partial resync when
> there's a bitmap write intent log.
>
In this case, I would think the simplest way to do this would be to have 
scrub check if generation matches and not further verify anything that 
does (I think we might be able to prune anything below objects whose 
generation matches, but I'm not 100% certain about how writes cascade up 
the trees).  I hadn't really thought about this before, but now that I 
do, it kind of surprises me that we don't have something to do this.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: unable to mount btrfs pool even with -oro,recovery,degraded, unable to do 'btrfs restore'
  2016-04-08 18:18           ` Austin S. Hemmelgarn
@ 2016-04-08 18:30             ` Chris Murphy
  2016-04-08 19:27               ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 23+ messages in thread
From: Chris Murphy @ 2016-04-08 18:30 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Chris Murphy, Ank Ular, Btrfs BTRFS

On Fri, Apr 8, 2016 at 12:18 PM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:
> On 2016-04-08 14:05, Chris Murphy wrote:
>>
>> On Fri, Apr 8, 2016 at 5:29 AM, Austin S. Hemmelgarn
>> <ahferroin7@gmail.com> wrote:
>>
>>> I entirely agree.  If the fix doesn't require any kind of decision to be
>>> made other than whether to fix it or not, it should be trivially fixable
>>> with the tools.  TBH though, this particular issue with devices
>>> disappearing
>>> and reappearing could be fixed easier in the block layer (at least, there
>>> are things that need to be fixed WRT it in the block layer).
>>
>>
>> Another feature needed for transient failures with large storage, is
>> some kind of partial scrub, along the lines of md partial resync when
>> there's a bitmap write intent log.
>>
> In this case, I would think the simplest way to do this would be to have
> scrub check if generation matches and not further verify anything that does
> (I think we might be able to prune anything below objects whose generation
> matches, but I'm not 100% certain about how writes cascade up the trees).  I
> hadn't really thought about this before, but now that I do, it kind of
> surprises me that we don't have something to do this.
>


And I need to better qualify this: this scrub (or balance) needs to be
initiated automatically, perhaps with some reasonable delay after the
block layer informs Btrfs that the missing device has reappeared. Both
the requirement of a full scrub and the fact that it has to be started
manually are pretty big gotchas.


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Missing device handling (was: 'unable to mount btrfs pool...')
  2016-04-08 16:17         ` Chris Murphy
@ 2016-04-08 19:23           ` Austin S. Hemmelgarn
  2016-04-08 19:53             ` Yauhen Kharuzhy
  0 siblings, 1 reply; 23+ messages in thread
From: Austin S. Hemmelgarn @ 2016-04-08 19:23 UTC (permalink / raw)
  To: Chris Murphy, Btrfs BTRFS

On 2016-04-08 12:17, Chris Murphy wrote:
> On Fri, Apr 8, 2016 at 5:29 AM, Austin S. Hemmelgarn
> <ahferroin7@gmail.com> wrote:
>>
>> I entirely agree.  If the fix doesn't require any kind of decision to be
>> made other than whether to fix it or not, it should be trivially fixable
>> with the tools.  TBH though, this particular issue with devices disappearing
>> and reappearing could be fixed easier in the block layer (at least, there
>> are things that need to be fixed WRT it in the block layer).
>
> Right. The block layer needs a way to communicate device missing to
> Btrfs and Btrfs needs to have some tolerance for transience.

Being notified when a device disappears _shouldn't_ be that hard. A 
uevent gets sent already, and we should be able to associate some kind 
of callback with that happening for devices we have mounted. The bigger 
issue is going to be handling the devices _reappearing_ (if we still 
hold a reference to the device, it appears under a different 
name/major/minor, and if it's more than one device and we have no 
references, they may appear in a different order than they were 
originally), and that is where we really need to fix things. A device 
disappearing forever is bad and all, but a device losing connection and 
reconnecting completely ruining the FS is exponentially worse.

Overall, to provide true reliability here, we need:
1. Some way for userspace to disable writeback caching per-device (this 
is needed for other reasons as well, but those are orthogonal to this 
discussion). This then needs to be used on all removable devices by 
default (Windows and OS X do this, it's part of why small transfers 
appear to complete faster on Linux, and then the disk takes _forever_ to 
unmount). This would reduce the possibility of data loss when a device 
disappears.
2. A way for userspace to be notified (instead of having to poll) of 
state changes in BTRFS. Currently, the only ways for userspace to know 
something is wrong are either parsing dmesg or polling the filesystem 
flags (and based on both personal experience and statements I've seen here 
and elsewhere, polling the FS flags is not reliable for this). Most 
normal installations are going to want to trigger handlers for specific 
state changes (be it e-mail to an admin, or some other notification 
method, or even doing some kind of maintenance on the FS automatically), 
and we need some kind of notification if we want to give userspace the 
ability to properly manage things.
3. A way to tell that a device is gone _when it happens_, not when we 
try to write to it next, not when a write fails, but the moment the 
block layer knows it's not there, we need to know as well. This is a 
prerequisite for the next two items. Sadly, we're probably the only 
thing that would directly benefit from this (LVM uses uevents and 
monitoring daemons to handle this, we don't exactly have that luxury), 
which means it may be hard to get something like this merged.
4. Transparent handling of short, transient loss of a device. This goes 
together to a certain extent with 1, if something disappears for long 
enough that the kernel notices, but it reappears before we have any I/O 
to do on it again, we shouldn't lose our lunch unless userspace tells us 
to (because we told userspace that it's gone due to item 2). In theory, 
we should be able to cache a small number of internal pending writes for 
when it reappears (so for example, if a transaction is being committed, 
and the USB disk disappears for a second, we should be able to pick up 
where we left off (after verifying the last write we sent)). We should 
also have an automatic re-sync if it's a short enough period it's gone 
for. The max timeout here should probably be configurable, but probably 
could just be one tunable for the whole system.
5. Give userspace the option to handle degraded states how it wants to, 
and keep our default of remount RO when degraded when userspace doesn't 
want to handle it itself. This needs to be configured at run-time (not 
stored on the media), and it needs to be per-filesystem, otherwise we 
open up all kinds of other issues. This is a core concept in LVM and 
many other storage management systems; namely, userspace can choose to 
handle a degraded RAID array however the hell it wants, and we'll 
provide a couple of sane default handlers for the common cases.

I would personally suggest adding a per-filesystem node in sysfs to 
handle both 2 and 5. Having it open tells BTRFS to not automatically 
attempt countermeasures when degraded, select/epoll on it will return 
when state changes, reads will return (at minimum): what devices 
comprise the FS, per disk state (is it working, failed, missing, a 
hot-spare, etc), and what effective redundancy we have (how many devices 
we can lose and still be mountable, so 1 for raid1, raid10, and raid5, 2 
for raid6, and 0 for raid0/single/dup, possibly higher for n-way 
replication (n-1), n-order parity (n), or erasure coding). This would 
make it trivial to write a daemon to monitor the filesystem, react when 
something happens, and handle all the policy decisions.
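
None of this exists today, so purely to illustrate the kind of daemon 
such a node would enable (the file names and $FSID below are 
hypothetical; a real daemon would poll()/epoll the node rather than 
loop, which a shell sketch can only approximate):

#!/bin/sh
# Hypothetical interface: neither 'state' nor 'devices_loseable' exist
# under /sys/fs/btrfs/<fsid>/ today; this only sketches the idea.
FS=/sys/fs/btrfs/$FSID            # $FSID = filesystem UUID (placeholder)
while sleep 10; do
    state=$(cat "$FS/state" 2>/dev/null) || continue
    loseable=$(cat "$FS/devices_loseable" 2>/dev/null)
    if [ "$state" != "ok" ] || [ "${loseable:-0}" -lt 1 ]; then
        logger -t btrfs-monitor "$FS degraded: state=$state loseable=$loseable"
        # policy decisions (mail an admin, kick off a replace, ...) go here
    fi
done
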

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: unable to mount btrfs pool even with -oro,recovery,degraded, unable to do 'btrfs restore'
  2016-04-08 18:30             ` Chris Murphy
@ 2016-04-08 19:27               ` Austin S. Hemmelgarn
  2016-04-08 20:16                 ` Chris Murphy
  0 siblings, 1 reply; 23+ messages in thread
From: Austin S. Hemmelgarn @ 2016-04-08 19:27 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS

On 2016-04-08 14:30, Chris Murphy wrote:
> On Fri, Apr 8, 2016 at 12:18 PM, Austin S. Hemmelgarn
> <ahferroin7@gmail.com> wrote:
>> On 2016-04-08 14:05, Chris Murphy wrote:
>>>
>>> On Fri, Apr 8, 2016 at 5:29 AM, Austin S. Hemmelgarn
>>> <ahferroin7@gmail.com> wrote:
>>>
>>>> I entirely agree.  If the fix doesn't require any kind of decision to be
>>>> made other than whether to fix it or not, it should be trivially fixable
>>>> with the tools.  TBH though, this particular issue with devices
>>>> disappearing
>>>> and reappearing could be fixed easier in the block layer (at least, there
>>>> are things that need to be fixed WRT it in the block layer).
>>>
>>>
>>> Another feature needed for transient failures with large storage, is
>>> some kind of partial scrub, along the lines of md partial resync when
>>> there's a bitmap write intent log.
>>>
>> In this case, I would think the simplest way to do this would be to have
>> scrub check if generation matches and not further verify anything that does
>> (I think we might be able to prune anything below objects whose generation
>> matches, but I'm not 100% certain about how writes cascade up the trees).  I
>> hadn't really thought about this before, but now that I do, it kind of
>> surprises me that we don't have something to do this.
>>
>
> And I need to better qualify this: this scrub (or balance) needs to be
> initiated automatically, perhaps have some reasonable delay after the
> block layer informs Btrfs the missing device has reappeared. Both the
> requirement of a full scrub as well as it being a manual scrub, are
> pretty big gotchas.
>
We would still ideally want some way to initiate it manually because:
1. It would make it easier to test.
2. We should have a way to do it on filesystems that have been 
reassembled after a reboot, not just ones that got the device back in 
the same boot (or it was missing on boot and then appeared).


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Missing device handling (was: 'unable to mount btrfs pool...')
  2016-04-08 19:23           ` Missing device handling (was: 'unable to mount btrfs pool...') Austin S. Hemmelgarn
@ 2016-04-08 19:53             ` Yauhen Kharuzhy
  2016-04-09  7:24               ` Duncan
  0 siblings, 1 reply; 23+ messages in thread
From: Yauhen Kharuzhy @ 2016-04-08 19:53 UTC (permalink / raw)
  To: linux-btrfs

On Fri, Apr 08, 2016 at 03:23:28PM -0400, Austin S. Hemmelgarn wrote:
> On 2016-04-08 12:17, Chris Murphy wrote:
> 
> I would personally suggest adding a per-filesystem node in sysfs to handle
> both 2 and 5. Having it open tells BTRFS to not automatically attempt
> countermeasures when degraded, select/epoll on it will return when state
> changes, reads will return (at minimum): what devices comprise the FS, per
> disk state (is it working, failed, missing, a hot-spare, etc), and what
> effective redundancy we have (how many devices we can lose and still be
> mountable, so 1 for raid1, raid10, and raid5, 2 for raid6, and 0 for
> raid0/single/dup, possibly higher for n-way replication (n-1), n-order
> parity (n), or erasure coding). This would make it trivial to write a daemon
> to monitor the filesystem, react when something happens, and handle all the
> policy decisions.

Hm, good proposal. Personally I tried to use uevents for this but they
cause locking troubles, and I didn't continue this attempt.

In any case, we need to have an interface for btrfs-progs to pass FS
state information (presence and IDs of missing devices, for example, or
the degraded/good state of the RAID, etc.).

For testing, as a first attempt, I implemented the following interface. It still
doesn't seem quite right to me, but it is acceptable as a starting point. In
addition, I changed the missing-device name reported by btrfs_ioctl_dev_info() to
'missing', to avoid interference with block devices inserted after the failed
device is closed (adding a 'missing' field to struct btrfs_ioctl_dev_info_args may
be the more correct way). What's your opinion?

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index d9b147f..f9a2fa6 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2716,12 +2716,17 @@ static long btrfs_ioctl_fs_info(struct btrfs_root *root, void __user *arg)
 
        mutex_lock(&fs_devices->device_list_mutex);
        fi_args->num_devices = fs_devices->num_devices;
+       fi_args->missing_devices = fs_devices->missing_devices;
+       fi_args->open_devices = fs_devices->open_devices;
+       fi_args->rw_devices = fs_devices->rw_devices;
+       fi_args->total_devices = fs_devices->total_devices;
        memcpy(&fi_args->fsid, root->fs_info->fsid, sizeof(fi_args->fsid));
 
        list_for_each_entry(device, &fs_devices->devices, dev_list) {
                if (device->devid > fi_args->max_id)
                        fi_args->max_id = device->devid;
        }
+       fi_args->state = root->fs_info->fs_state;
        mutex_unlock(&fs_devices->device_list_mutex);
 
        fi_args->nodesize = root->fs_info->super_copy->nodesize;
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index dea8931..6808bf2 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -186,8 +186,12 @@ struct btrfs_ioctl_fs_info_args {
        __u32 nodesize;                         /* out */
        __u32 sectorsize;                       /* out */
        __u32 clone_alignment;                  /* out */
-       __u32 reserved32;
-       __u64 reserved[122];                    /* pad to 1k */
+       __u32 state;                            /* out */
+       __u64 missing_devices;                  /* out */
+       __u64 open_devices;                     /* out */
+       __u64 rw_devices;                       /* out */
+       __u64 total_devices;                    /* out */
+       __u64 reserved[118];                    /* pad to 1k */
 };
 
 struct btrfs_ioctl_feature_flags {




-- 
Yauhen Kharuzhy

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: unable to mount btrfs pool even with -oro,recovery,degraded, unable to do 'btrfs restore'
  2016-04-08 19:27               ` Austin S. Hemmelgarn
@ 2016-04-08 20:16                 ` Chris Murphy
  2016-04-08 23:01                   ` Chris Murphy
  0 siblings, 1 reply; 23+ messages in thread
From: Chris Murphy @ 2016-04-08 20:16 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Chris Murphy, Btrfs BTRFS

On Fri, Apr 8, 2016 at 1:27 PM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:
> On 2016-04-08 14:30, Chris Murphy wrote:
>>
>> On Fri, Apr 8, 2016 at 12:18 PM, Austin S. Hemmelgarn
>> <ahferroin7@gmail.com> wrote:
>>>
>>> On 2016-04-08 14:05, Chris Murphy wrote:
>>>>
>>>>
>>>> On Fri, Apr 8, 2016 at 5:29 AM, Austin S. Hemmelgarn
>>>> <ahferroin7@gmail.com> wrote:
>>>>
>>>>> I entirely agree.  If the fix doesn't require any kind of decision to
>>>>> be
>>>>> made other than whether to fix it or not, it should be trivially
>>>>> fixable
>>>>> with the tools.  TBH though, this particular issue with devices
>>>>> disappearing
>>>>> and reappearing could be fixed easier in the block layer (at least,
>>>>> there
>>>>> are things that need to be fixed WRT it in the block layer).
>>>>
>>>>
>>>>
>>>> Another feature needed for transient failures with large storage, is
>>>> some kind of partial scrub, along the lines of md partial resync when
>>>> there's a bitmap write intent log.
>>>>
>>> In this case, I would think the simplest way to do this would be to have
>>> scrub check if generation matches and not further verify anything that
>>> does
>>> (I think we might be able to prune anything below objects whose
>>> generation
>>> matches, but I'm not 100% certain about how writes cascade up the trees).
>>> I
>>> hadn't really thought about this before, but now that I do, it kind of
>>> surprises me that we don't have something to do this.
>>>
>>
>> And I need to better qualify this: this scrub (or balance) needs to be
>> initiated automatically, perhaps have some reasonable delay after the
>> block layer informs Btrfs the missing device has reappeared. Both the
>> requirement of a full scrub as well as it being a manual scrub, are
>> pretty big gotchas.
>>
> We would still ideally want some way to initiate it manually because:
> 1. It would make it easier to test.
> 2. We should have a way to do it on filesystems that have been reassembled
> after a reboot, not just ones that got the device back in the same boot (or
> it was missing on boot and then appeared).

I'm OK with a mount option, 'autoraidfixup' (not a proposed name!),
that permits the mechanism to happen, but which isn't yet the default.
However, one day I think it should be, because right now we already
allow mounts of devices with different generations and there is no
message indicating this at all, even though the superblocks clearly
show a discrepancy in generation.

mount with one device missing

[264466.609093] BTRFS: has skinny extents
[264912.547199] BTRFS info (device dm-6): disk space caching is enabled
[264912.547267] BTRFS: has skinny extents
[264912.606266] BTRFS: failed to read chunk tree on dm-6
[264912.621829] BTRFS: open_ctree failed

mount -o degraded

[264953.758518] BTRFS info (device dm-6): allowing degraded mounts
[264953.758794] BTRFS info (device dm-6): disk space caching is enabled
[264953.759055] BTRFS: has skinny extents

copy 800MB file
umount
lvchange -ay
mount

[265082.859201] BTRFS info (device dm-6): disk space caching is enabled
[265082.859474] BTRFS: has skinny extents

btrfs scrub start

[265260.024267] BTRFS error (device dm-6): bdev /dev/dm-7 errs: wr 0, rd 0, flush 0, corrupt 0, gen 1

# btrfs scrub status /mnt/1
scrub status for b01b3922-4012-4de1-af42-63f5b2f68fc3
    scrub started at Fri Apr  8 14:01:41 2016 and finished after 00:00:18
    total bytes scrubbed: 1.70GiB with 1 errors
    error details: super=1
    corrected errors: 0, uncorrectable errors: 0, unverified errors: 0

After scrubbing and fixing everything and zeroing out the counters, if
I fail the device again, I can no longer mount degraded:

[265502.432444] BTRFS: missing devices(1) exceeds the limit(0), writeable mount is not allowed

because of this nonsense:

[root@f23s ~]# btrfs fi df /mnt/1
Data, RAID1: total=1.00GiB, used=458.06MiB
Data, single: total=1.00GiB, used=824.00MiB
System, RAID1: total=64.00MiB, used=16.00KiB
System, single: total=32.00MiB, used=0.00B
Metadata, RAID1: total=2.00GiB, used=576.00KiB
Metadata, single: total=256.00MiB, used=912.00KiB
GlobalReserve, single: total=16.00MiB, used=0.00B

a.) the device I'm mounting degraded contains the single chunks, it's
not like the single chunks are actually missing
b.) the manual scrub only fixed the supers, it did not replicate the
newly copied data since it was placed in new single chunks rather than
existing raid1 chunks.
c.) this requires a manual balance convert,soft to actually get
everything back to raid1.

Very non-obvious.
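
For concreteness, point c.) amounts to something like the following, 
using the mount point from the transcript above ('soft' leaves the 
chunks that are already raid1 alone):

btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt/1
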

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: unable to mount btrfs pool even with -oro,recovery,degraded, unable to do 'btrfs restore'
  2016-04-08 20:16                 ` Chris Murphy
@ 2016-04-08 23:01                   ` Chris Murphy
  0 siblings, 0 replies; 23+ messages in thread
From: Chris Murphy @ 2016-04-08 23:01 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Austin S. Hemmelgarn, Btrfs BTRFS

For raid5 it's different. No single chunks are created while copying
files to a degraded volume.

And the scrub produces very noisy kernel messages. Looks like there's
a message for each missing block (or stripe?), thousands per file. And
also many uncorrectable errors like this:

[267466.792060] f23s.localdomain kernel: BTRFS error (device dm-8): unable to fixup (regular) error at logical 3760582656 on dev /dev/dm-7
[267467.508588] f23s.localdomain kernel: scrub_handle_errored_block: 401 callbacks suppressed

[root@f23s ~]# btrfs scrub start /mnt/1/
ERROR: there are uncorrectable errors

[root@f23s ~]# btrfs scrub status /mnt/1/
scrub status for 51e1efb0-7df3-44d5-8716-9ed4bdadc93e
    scrub started at Fri Apr  8 14:35:25 2016 and finished after 00:11:26
    total bytes scrubbed: 3.21GiB with 45186 errors
    error details: read=95 super=2 verify=8 csum=45081
    corrected errors: 44935, uncorrectable errors: 249, unverified errors: 0

Subsequent balance and scrub have no messages at all. So...
uncorrectable? Really? That's confusing.

FYI, a scrub with no errors takes 4m24s, but with the same data and
half of it needing to be rebuilt during the scrub it took 16m4s, so
about 4x longer to reconstruct. Seems excessive.

Chris Murphy

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Missing device handling (was: 'unable to mount btrfs pool...')
  2016-04-08 19:53             ` Yauhen Kharuzhy
@ 2016-04-09  7:24               ` Duncan
  2016-04-11 11:32                 ` Missing device handling Austin S. Hemmelgarn
  0 siblings, 1 reply; 23+ messages in thread
From: Duncan @ 2016-04-09  7:24 UTC (permalink / raw)
  To: linux-btrfs

Yauhen Kharuzhy posted on Fri, 08 Apr 2016 22:53:00 +0300 as excerpted:

> On Fri, Apr 08, 2016 at 03:23:28PM -0400, Austin S. Hemmelgarn wrote:
>> On 2016-04-08 12:17, Chris Murphy wrote:
>> 
>> I would personally suggest adding a per-filesystem node in sysfs to
>> handle both 2 and 5. Having it open tells BTRFS to not automatically
>> attempt countermeasures when degraded, select/epoll on it will return
>> when state changes, reads will return (at minimum): what devices
>> comprise the FS, per disk state (is it working, failed, missing, a
>> hot-spare, etc), and what effective redundancy we have (how many
>> devices we can lose and still be mountable, so 1 for raid1, raid10, and
>> raid5, 2 for raid6, and 0 for raid0/single/dup, possibly higher for
>> n-way replication (n-1), n-order parity (n), or erasure coding). This
>> would make it trivial to write a daemon to monitor the filesystem,
>> react when something happens, and handle all the policy decisions.
> 
> Hm, good proposal. Personally I tried to use uevents for this but they
> cause locking troubles, and I didn't continue this attempt.

Except that... in sysfs (unlike proc) there's a rather strictly enforced 
rule of one property per file.

So you could NOT hold a single sysfs file open, that upon read would 
return 1) what devices comprise the FS, 2) per device (um, disk in the 
original, except that it can be a non-disk device, so changed to device 
here) state, 3) effective number of can-be-lost devices.

The sysfs style interface would be a filesystem directory containing a 
devices subdir, with (read-only?) per-device state-files in that subdir.  
The listing of per-device state-files would thus provide #1, with the 
contents of each state-file being the status of that device, therefore 
providing #2.  Back in the main filesystem dir, there'd be a devices-
loseable file, which would provide #3.

There could also be a filesystem-level state file which could be read for 
the current state of the filesystem as a whole or selected/epolled for 
state-changes, and probably yet another file, we'll call it leave-be here 
simply because I don't have a better name, that would be read/write 
allowing reading or setting the no-countermeasures property.


Actually, after looking at the existing /sys/fs/btrfs layout, we already 
have filesystem directories, each with a devices subdir, tho the symlinks 
therein point to the /sys/devices tree device dirs.  The listing thereof 
already provides #1, at least for operational devices.

I'm not going to go testing what happens to the current sysfs devices 
listings when a device goes missing, but we already know btrfs doesn't 
dynamically use that information.  Presumably, once it does, the symlinks 
could be replaced with subdirs for missing devices, with the still known 
information in the subdir (which could then be named as either the btrfs 
device ID or as missing-N), and the status of the device being detectable 
by whether it's a symlink to a devices tree device (device online) or a 
subdir (device offline).

The per-filesystem devices-losable, fs-status, and leave-be files could 
be added to the existing sysfs btrfs interface.
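
For reference, the existing per-filesystem layout looks roughly like 
this (UUID and symlink targets shortened, and the exact set of entries 
varies by kernel version):

# ls /sys/fs/btrfs/<UUID>/
allocation  devices  features  label
# ls -l /sys/fs/btrfs/<UUID>/devices/
sdb -> ../../../../devices/.../block/sdb
sdc -> ../../../../devices/.../block/sdc
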

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: Missing device handling
  2016-04-09  7:24               ` Duncan
@ 2016-04-11 11:32                 ` Austin S. Hemmelgarn
  2016-04-18  0:55                   ` Chris Murphy
  0 siblings, 1 reply; 23+ messages in thread
From: Austin S. Hemmelgarn @ 2016-04-11 11:32 UTC (permalink / raw)
  To: Btrfs BTRFS

On 2016-04-09 03:24, Duncan wrote:
> Yauhen Kharuzhy posted on Fri, 08 Apr 2016 22:53:00 +0300 as excerpted:
>
>> On Fri, Apr 08, 2016 at 03:23:28PM -0400, Austin S. Hemmelgarn wrote:
>>> On 2016-04-08 12:17, Chris Murphy wrote:
>>>
>>> I would personally suggest adding a per-filesystem node in sysfs to
>>> handle both 2 and 5. Having it open tells BTRFS to not automatically
>>> attempt countermeasures when degraded, select/epoll on it will return
>>> when state changes, reads will return (at minimum): what devices
>>> comprise the FS, per disk state (is it working, failed, missing, a
>>> hot-spare, etc), and what effective redundancy we have (how many
>>> devices we can lose and still be mountable, so 1 for raid1, raid10, and
>>> raid5, 2 for raid6, and 0 for raid0/single/dup, possibly higher for
>>> n-way replication (n-1), n-order parity (n), or erasure coding). This
>>> would make it trivial to write a daemon to monitor the filesystem,
>>> react when something happens, and handle all the policy decisions.
>>
>> Hm, good proposal. Personally I tried to use uevents for this but they
>> cause locking troubles, and I didn't continue this attempt.
>
> Except that... in sysfs (unlike proc) there's a rather strictly enforced
> rule of one property per file.
Good point, I had forgotten about this.
>
> So you could NOT hold a single sysfs file open, that upon read would
> return 1) what devices comprise the FS, 2) per device (um, disk in the
> original, except that it can be a non-disk device, so changed to device
> here) state, 3) effective number of can-be-lost devices.
>
> The sysfs style interface would be a filesystem directory containing a
> devices subdir, with (read-only?) per-device state-files in that subdir.
> The listing of per-device state-files would thus provide #1, with the
> contents of each state-file being the status of that device, therefore
> providing #2.  Back in the main filesystem dir, there'd be a devices-
> loseable file, which would provide #3.
>
> There could also be a filesystem-level state file which could be read for
> the current state of the filesystem as a whole or selected/epolled for
> state-changes, and probably yet another file, we'll call it leave-be here
> simply because I don't have a better name, that would be read/write
> allowing reading or setting the no-countermeasures property.
I actually rather like this suggestion, with the caveat that we ideally 
should have multiple options for the auto-recovery mode:
1. Full auto-recovery, go read-only when an error is detected.
2. Go read-only when an error is detected but don't do auto-recovery 
(probably not very useful).
3. Do auto-recovery, but don't go read-only when an error is detected.
4. Don't do auto-recovery, and don't go read-only when an error is detected.
5-8. Same as the above, but require that the process that set the state 
keep the file open to maintain it (useful for cases when we need some 
kind of recovery if at all possible, but would prefer the monitoring 
tool to do it if possible).

In theory, we could do it as a bit-field to control what gets recovered 
and what doesn't.
>
>
> Actually, after looking at the existing /sys/fs/btrfs layout, we already
> have filesystem directories, each with a devices subdir, tho the symlinks
> therein point to the /sys/devices tree device dirs.  The listing thereof
> already provides #1, at least for operational devices.
>
> I'm not going to go testing what happens to the current sysfs devices
> listings when a device goes missing, but we already know btrfs doesn't
> dynamically use that information.  Presumably, once it does, the symlinks
> could be replaced with subdirs for missing devices, with the still known
> information in the subdir (which could then be named as either the btrfs
> device ID or as missing-N), and the status of the device being detectable
> by whether it's a symlink to a devices tree device (device online) or a
> subdir (device offline).
IIRC, under the current implementation, the symlink stays around as long 
as the device node in /dev stays around (so usually until the filesystem 
gets unmounted).

That said, there are issues inherent in trying to do something like 
replacing a symlink with a directory in sysfs, especially if the new 
directory contains a different layout than the one the symlink was 
pointing at:
1. You horribly break compatibility with existing tools.
2. You break the expectations of stability that are supposed to be 
guaranteed by sysfs for a given mount of it.
3. Sysfs isn't designed in a way that lets this be done atomically, 
which severely limits usability (a reader might find the node missing, 
or see an empty directory).
This means we would need a separate directory to report device state.
>
> The per-filesystem devices-losable, fs-status, and leave-be files could
> be added to the existing sysfs btrfs interface.
>
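
For reference, adding such single-value files to the existing 
per-filesystem directory would be fairly mechanical with the generic 
kobject attribute API. A hypothetical, heavily simplified sketch (the 
attribute name and the placeholder value are invented, not actual 
btrfs code):

#include <linux/kernel.h>
#include <linux/kobject.h>
#include <linux/sysfs.h>

/* Hypothetical: report how many more devices the filesystem can lose
 * and still be mountable, as a single value in a single file. */
static ssize_t devices_losable_show(struct kobject *kobj,
				    struct kobj_attribute *attr, char *buf)
{
	/* A real implementation would derive this from the current
	 * profiles and missing-device count. */
	int losable = 1;	/* placeholder value */

	return scnprintf(buf, PAGE_SIZE, "%d\n", losable);
}

static struct kobj_attribute devices_losable_attr =
	__ATTR_RO(devices_losable);

/* Called once per mounted filesystem, with fs_kobj being the existing
 * /sys/fs/btrfs/<UUID>/ kobject. */
static int add_losable_file(struct kobject *fs_kobj)
{
	return sysfs_create_file(fs_kobj, &devices_losable_attr.attr);
}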


* Re: Missing device handling
  2016-04-11 11:32                 ` Missing device handling Austin S. Hemmelgarn
@ 2016-04-18  0:55                   ` Chris Murphy
  2016-04-18 12:18                     ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 23+ messages in thread
From: Chris Murphy @ 2016-04-18  0:55 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Btrfs BTRFS

On Mon, Apr 11, 2016 at 5:32 AM, Austin S. Hemmelgarn
<ahferroin7@gmail.com> wrote:
> On 2016-04-09 03:24, Duncan wrote:
>>
>> Yauhen Kharuzhy posted on Fri, 08 Apr 2016 22:53:00 +0300 as excerpted:
>>
>>> On Fri, Apr 08, 2016 at 03:23:28PM -0400, Austin S. Hemmelgarn wrote:
>>>>
>>>>
>>>> I would personally suggest adding a per-filesystem node in sysfs to
>>>> handle both 2 and 5. Having it open tells BTRFS to not automatically
>>>> attempt countermeasures when degraded, select/epoll on it will return
>>>> when state changes, reads will return (at minimum): what devices
>>>> comprise the FS, per disk state (is it working, failed, missing, a
>>>> hot-spare, etc), and what effective redundancy we have (how many
>>>> devices we can lose and still be mountable, so 1 for raid1, raid10, and
>>>> raid5, 2 for raid6, and 0 for raid0/single/dup, possibly higher for
>>>> n-way replication (n-1), n-order parity (n), or erasure coding). This
>>>> would make it trivial to write a daemon to monitor the filesystem,
>>>> react when something happens, and handle all the policy decisions.
>>>
>>>
>>> Hm, good proposal. Personally I tried to use uevents for this but they
>>> cause locking troubles, and I didn't continue this attempt.
>>
>>
>> Except that... in sysfs (unlike proc) there's a rather strictly enforced
>> rule of one property per file.
>
> Good point, I had forgotten about this.

I just ran across this:
https://www.kernel.org/doc/Documentation/block/stat.txt

Q. Why are there multiple statistics in a single file?  Doesn't sysfs
   normally contain a single value per file?
A. By having a single file, the kernel can guarantee that the statistics
   represent a consistent snapshot of the state of the device.

So there might be an exception. I'm using a zram device as a sprout
for a Btrfs seed. And this is what I'm seeing:

[root@f23m 0]# cat /sys/block/zram0/stat
   64258        0   514064       19    19949        0   159592      214        0      233      233

Anyway, there might be a plausible exception to the one-property-per-file
rule when there's a good reason for it.


-- 
Chris Murphy


* Re: Missing device handling
  2016-04-18  0:55                   ` Chris Murphy
@ 2016-04-18 12:18                     ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 23+ messages in thread
From: Austin S. Hemmelgarn @ 2016-04-18 12:18 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS

On 2016-04-17 20:55, Chris Murphy wrote:
> On Mon, Apr 11, 2016 at 5:32 AM, Austin S. Hemmelgarn
> <ahferroin7@gmail.com> wrote:
>> On 2016-04-09 03:24, Duncan wrote:
>>>
>>> Yauhen Kharuzhy posted on Fri, 08 Apr 2016 22:53:00 +0300 as excerpted:
>>>
>>>> On Fri, Apr 08, 2016 at 03:23:28PM -0400, Austin S. Hemmelgarn wrote:
>>>>>
>>>>>
>>>>> I would personally suggest adding a per-filesystem node in sysfs to
>>>>> handle both 2 and 5. Having it open tells BTRFS to not automatically
>>>>> attempt countermeasures when degraded, select/epoll on it will return
>>>>> when state changes, reads will return (at minimum): what devices
>>>>> comprise the FS, per disk state (is it working, failed, missing, a
>>>>> hot-spare, etc), and what effective redundancy we have (how many
>>>>> devices we can lose and still be mountable, so 1 for raid1, raid10, and
>>>>> raid5, 2 for raid6, and 0 for raid0/single/dup, possibly higher for
>>>>> n-way replication (n-1), n-order parity (n), or erasure coding). This
>>>>> would make it trivial to write a daemon to monitor the filesystem,
>>>>> react when something happens, and handle all the policy decisions.
>>>>
>>>>
>>>> Hm, good proposal. Personally I tried to use uevents for this but they
>>>> cause locking troubles, and I didn't continue this attempt.
>>>
>>>
>>> Except that... in sysfs (unlike proc) there's a rather strictly enforced
>>> rule of one property per file.
>>
>> Good point, I had forgotten about this.
>
> I just ran across this:
> https://www.kernel.org/doc/Documentation/block/stat.txt
>
> Q. Why are there multiple statistics in a single file?  Doesn't sysfs
>     normally contain a single value per file?
> A. By having a single file, the kernel can guarantee that the statistics
>     represent a consistent snapshot of the state of the device.
>
> So there might be an exception. I'm using a zram device as a sprout
> for a Btrfs seed. And this is what I'm seeing:
>
> [root@f23m 0]# cat /sys/block/zram0/stat
>     64258        0   514064       19    19949        0   159592      214        0      233      233
>
> Anyway, there might be a plausible exception to the one-property-per-file
> rule when there's a good reason for it.
Part of the requirement for that, though, is that we have to provide a 
consistent set of info.  IOW, we would probably need to use something 
like RCU or locking around the data so that a read returns a consistent 
snapshot of the state.
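
Something like the rough sketch below, where the (invented) state 
structure is copied under a lock and formatted outside it, so the 
values in the file are mutually consistent even though there are 
several of them; none of this corresponds to actual btrfs code:

#include <linux/kernel.h>
#include <linux/kobject.h>
#include <linux/spinlock.h>
#include <linux/sysfs.h>

/* Hypothetical per-filesystem device state; illustration only. */
struct fs_dev_state {
	spinlock_t lock;
	unsigned int num_devices;
	unsigned int missing_devices;
	unsigned int devices_losable;
};

static struct fs_dev_state example_state;

static ssize_t device_state_show(struct kobject *kobj,
				 struct kobj_attribute *attr, char *buf)
{
	unsigned int num, missing, losable;

	/* Grab all values under the lock so they form a consistent
	 * snapshot, then format them outside the critical section. */
	spin_lock(&example_state.lock);
	num = example_state.num_devices;
	missing = example_state.missing_devices;
	losable = example_state.devices_losable;
	spin_unlock(&example_state.lock);

	return scnprintf(buf, PAGE_SIZE, "%u %u %u\n",
			 num, missing, losable);
}

static struct kobj_attribute device_state_attr = __ATTR_RO(device_state);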




Thread overview: 23+ messages
2016-04-06 15:34 unable to mount btrfs pool even with -oro,recovery,degraded, unable to do 'btrfs restore' Ank Ular
2016-04-06 21:02 ` Duncan
2016-04-06 22:08   ` Ank Ular
2016-04-07  2:36     ` Duncan
2016-04-06 23:08 ` Chris Murphy
2016-04-07 11:19   ` Austin S. Hemmelgarn
2016-04-07 11:31     ` Austin S. Hemmelgarn
2016-04-07 19:32     ` Chris Murphy
2016-04-08 11:29       ` Austin S. Hemmelgarn
2016-04-08 16:17         ` Chris Murphy
2016-04-08 19:23           ` Missing device handling (was: 'unable to mount btrfs pool...') Austin S. Hemmelgarn
2016-04-08 19:53             ` Yauhen Kharuzhy
2016-04-09  7:24               ` Duncan
2016-04-11 11:32                 ` Missing device handling Austin S. Hemmelgarn
2016-04-18  0:55                   ` Chris Murphy
2016-04-18 12:18                     ` Austin S. Hemmelgarn
2016-04-08 18:05         ` unable to mount btrfs pool even with -oro,recovery,degraded, unable to do 'btrfs restore' Chris Murphy
2016-04-08 18:18           ` Austin S. Hemmelgarn
2016-04-08 18:30             ` Chris Murphy
2016-04-08 19:27               ` Austin S. Hemmelgarn
2016-04-08 20:16                 ` Chris Murphy
2016-04-08 23:01                   ` Chris Murphy
2016-04-07 11:29   ` Austin S. Hemmelgarn
