* btrfs recovery
@ 2017-01-26  9:18 Oliver Freyermuth
  2017-01-26  9:25 ` Hugo Mills
  0 siblings, 1 reply; 43+ messages in thread
From: Oliver Freyermuth @ 2017-01-26  9:18 UTC (permalink / raw)
  To: linux-btrfs

Hi, 

I have just encountered the following on mounting one of my filesystems (after a clean reboot...):
[  495.303313] BTRFS critical (device sdb1): corrupt node, bad key order: block=35028992, root=1, slot=243
[  495.315642] BTRFS critical (device sdb1): corrupt node, bad key order: block=35028992, root=1, slot=243
[  495.315694] BTRFS error (device sdb1): failed to read block groups: -5
[  495.327865] BTRFS error (device sdb1): open_ctree failed

The system is using a 4.9.0 kernel, and I have btrfs-progs 4.9 installed. 

Since the last backup is a few weeks old (but the data is not so crucial), I'd like to attempt to recover at least some of the files. 

btrfs check tells me:
# btrfs check /dev/sdb1
Checking filesystem on /dev/sdb1
UUID: cfd16c65-7f3b-4f5e-9029-971f2433d7ab
checking extents
bad block 35028992
ERROR: errors found in extent allocation tree or chunk allocation

IIRC, the FS has DUP metadata (but single DATA). It's on a classic spinning disk. 
I use: "space_cache,noatime,compress=lzo,commit=120" as mount options. 

What is the best way to go? 

Should I:
- reinit extent tree
- or collect debug info
- or is there a better way to go?

Cheers and thanks for any suggestions, 
	Oliver

PS: Please put my mail in CC, I'm not subscribed to the list. Thanks! 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: btrfs recovery
  2017-01-26  9:18 btrfs recovery Oliver Freyermuth
@ 2017-01-26  9:25 ` Hugo Mills
  2017-01-26  9:36   ` Oliver Freyermuth
  0 siblings, 1 reply; 43+ messages in thread
From: Hugo Mills @ 2017-01-26  9:25 UTC (permalink / raw)
  To: Oliver Freyermuth; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 1993 bytes --]

On Thu, Jan 26, 2017 at 10:18:40AM +0100, Oliver Freyermuth wrote:
> Hi, 
> 
> I have just encountered on mount of one of my filesystems (after a clean reboot...): 
> [  495.303313] BTRFS critical (device sdb1): corrupt node, bad key order: block=35028992, root=1, slot=243
> [  495.315642] BTRFS critical (device sdb1): corrupt node, bad key order: block=35028992, root=1, slot=243
> [  495.315694] BTRFS error (device sdb1): failed to read block groups: -5
> [  495.327865] BTRFS error (device sdb1): open_ctree failed

   Can you post the output of "btrfs-debug-tree -b 35028992
/dev/sdb1", specifically the 5 or so entries around item 243. It is
quite likely that you have bad RAM, and the output will help confirm
that.

> The system is using a 4.9.0 kernel, and I have btrfs-progs 4.9 installed. 
> 
> Since the last backup is a few weeks old (but the data is not so crucial), I'd like to attempt to recover at least some of the files. 
> 
> btrfs check tells me:
> # btrfs check /dev/sdb1
> Checking filesystem on /dev/sdb1
> UUID: cfd16c65-7f3b-4f5e-9029-971f2433d7ab
> checking extents
> bad block 35028992
> ERROR: errors found in extent allocation tree or chunk allocation
> 
> IIRC, the FS has DUP metadata (but single DATA). It's on a classic spinning disk. 
> I use: "space_cache,noatime,compress=lzo,commit=120" as mount options. 
> 
> What is the best way to go? 
> 
> Should I:
> - reinit extent tree
> - or collect debug info
> - or is there a better way to go?

   Check and fix your hardware first. :)

   If it is bad RAM, then the error is likely to be a simple bitflip,
and there are patches for btrfs check which will fix those in most
cases.

   Hugo.

> Cheers and thanks for any suggestions, 
> 	Oliver
> 
> PS: Please put my mail in CC, I'm not subscribed to the list. Thanks! 

-- 
Hugo Mills             | This: Rock. You throw rock.
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4          |                          Graeme Swann on fast bowlers

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: btrfs recovery
  2017-01-26  9:25 ` Hugo Mills
@ 2017-01-26  9:36   ` Oliver Freyermuth
  2017-01-26 10:00     ` Hugo Mills
  2017-01-26 11:01     ` Oliver Freyermuth
  0 siblings, 2 replies; 43+ messages in thread
From: Oliver Freyermuth @ 2017-01-26  9:36 UTC (permalink / raw)
  To: Hugo Mills, linux-btrfs

Hi and thanks for the quick reply! 

Am 26.01.2017 um 10:25 schrieb Hugo Mills:
>    Can you post the output of "btrfs-debug-tree -b 35028992
> /dev/sdb1", specifically the 5 or so entries around item 243. It is
> quite likely that you have bad RAM, and the output will help confirm
> that.
> 

Since I did not find item 243 in the debug output at all, I uploaded the complete output of the debug-tree command here:
http://pastebin.com/xM8qUnSx

>    Check and fix your hardware first. :)
> 
>    If it is bad RAM, then the error is likely to be a simple bitflip,
> and there are patches for btrfs check which will fix those in most
> cases.

I'll schedule a memcheck as soon as I can turn off the machine for a while,
which sadly may be a week or so from now...

> 
>    Hugo.
> 
>> Cheers and thanks for any suggestions, 
>> 	Oliver
>>
>> PS: Please put my mail in CC, I'm not subscribed to the list. Thanks! 
> 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: btrfs recovery
  2017-01-26  9:36   ` Oliver Freyermuth
@ 2017-01-26 10:00     ` Hugo Mills
  2017-01-26 11:01     ` Oliver Freyermuth
  1 sibling, 0 replies; 43+ messages in thread
From: Hugo Mills @ 2017-01-26 10:00 UTC (permalink / raw)
  To: Oliver Freyermuth; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 2931 bytes --]

On Thu, Jan 26, 2017 at 10:36:55AM +0100, Oliver Freyermuth wrote:
> Hi and thanks for the quick reply! 
> 
> Am 26.01.2017 um 10:25 schrieb Hugo Mills:
> >    Can you post the output of "btrfs-debug-tree -b 35028992
> > /dev/sdb1", specifically the 5 or so entries around item 243. It is
> > quite likely that you have bad RAM, and the output will help confirm
> > that.
> > 
> 
> Since I did not find item 243 in the debug output at all, I uploaded the complete output of the debug-tree command here:
> http://pastebin.com/xM8qUnSx

   It's on line 248 of the paste:

246.   key (5547032576 EXTENT_ITEM 204800) block 596426752 (36403) gen 20441
247.   key (5561905152 EXTENT_ITEM 184320) block 596443136 (36404) gen 20441
248.   key (15606380089319694336 UNKNOWN.76 303104) block 596459520 (36405) gen 20441
249.   key (5726711808 EXTENT_ITEM 524288) block 596475904 (36406) gen 20441
250.   key (5820571648 EXTENT_ITEM 524288) block 350322688 (21382) gen 20427

   I was wrong in my assumption: this isn't a simple bitflip. It looks
like a small random write of data over the item key. That's not to say
that bad hardware isn't the culprit -- it's worth checking anyway --
but it could also be a bug in... well, almost anything.

   It's not corruption on the disk, because that would be caught by
the checksum mechanism. This data was corrupted in RAM, before it was
checksummed and written to disk. That could have happened as a result
of some rogue piece of kernel code writing to an incorrect address, or
as a result of some _other_ memory corruption affecting an address
which is then used as the target of a write.

   Looking at the data, I think this should be manually fixable, with
sufficient effort (and a hex editor).

Looking at the item value:

>>> hex(15606380089319694336)
'0xd89500014da12000'

Compared to the preceding key's value:

>>> hex(5561905152)
'0x14b83f000'

It looks like it's just the top couple of bytes in this field that are
affected, so those (d8, 95) can be zeroed. The second field should
clearly be EXTENT_ITEM, which is 0xa8. The offset field (the third
one) looks OK to me -- the bottom byte is 0.
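
   (A quick sanity check of that reading, using the values above in a
Python shell -- the mask just zeroes the top two bytes:)

>>> bad = 0xd89500014da12000
>>> fixed = bad & 0x0000ffffffffffff
>>> hex(fixed), fixed
('0x14da12000', 5597372416)
>>> 5561905152 < fixed < 5726711808   # sorts between keys 247 and 249
True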

   We can probably talk you through fixing this by hand with a decent
hex editor. I've done it before...

> >    Check and fix your hardware first. :)
> > 
> >    If it is bad RAM, then the error is likely to be a simple bitflip,
> > and there are patches for btrfs check which will fix those in most
> > cases.
> 
> I'll schedule a memcheck as soon as I can turn off the machine for a while,
> which sadly may be a week or so in the future from now... 

   Bear in mind that if it is unreliable hardware, then continued use
of the FS in read-write operation is likely to cause additional
damage.

   Hugo.

-- 
Hugo Mills             | This: Rock. You throw rock.
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4          |                          Graeme Swann on fast bowlers

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: btrfs recovery
  2017-01-26  9:36   ` Oliver Freyermuth
  2017-01-26 10:00     ` Hugo Mills
@ 2017-01-26 11:01     ` Oliver Freyermuth
  2017-01-27 11:01       ` Oliver Freyermuth
  2017-01-28 21:04       ` Oliver Freyermuth
  1 sibling, 2 replies; 43+ messages in thread
From: Oliver Freyermuth @ 2017-01-26 11:01 UTC (permalink / raw)
  To: Hugo Mills; +Cc: linux-btrfs

>    It's on line 248 of the paste:
> 
> 246.   key (5547032576 EXTENT_ITEM 204800) block 596426752 (36403) gen 20441
> 247.   key (5561905152 EXTENT_ITEM 184320) block 596443136 (36404) gen 20441
> 248.   key (15606380089319694336 UNKNOWN.76 303104) block 596459520 (36405) gen 20441
> 249.   key (5726711808 EXTENT_ITEM 524288) block 596475904 (36406) gen 20441
> 250.   key (5820571648 EXTENT_ITEM 524288) block 350322688 (21382) gen 20427
> 
>    I was wrong in my assumption: this isn't a simple bitflip. It looks
> like a small random write of data over the item key. That's not to say
> that bad hardware isn't the culprit -- it's worth checking anyway --
> but it could also be a bug in... well, almost anything.
> 
>    It's not corruption on the disk, because that would be caught by
> the checksum mechanism. This data was corrupted in RAM, before it was
> checksummed and written to disk. That could have happened as a result
> of some rogue piece of kernel code writing to an incorrect address, or
> as a result of some _other_ memory corruption affecting an address
> which is then used to write something to.
In the past, I used the nvidia binary blob on that machine, which would of course be a potential culprit - but for the past few months the machine has been using nouveau and the kernel is not tainted.

In case somebody encounters something similar in the future, a few more details:
The only not-so-common kernel code running on that machine was zram, which I have unloaded just now. Apart from that, there's only very common hardware with in-tree modules (realtek ethernet card, intel chipset + CPU, VIA USB3 controller)
in the machine. Special kernel options are "iommu=soft zswap.enabled=1", I am running Gentoo, no kernel patches apart from those by Gentoo upstream (i.e. I use sys-kernel/gentoo-sources-4.9.0). 

I'm also running 'memtester 12G' right now, which at least tests 2/3 of the memory. I'll leave that running for a day or so, but of course it will not provide a clear answer... 

> 
>    Looking at the data, I think this should be manually fixable, with
> sufficient effort (and a hex editor).
> 
> Looking at the item value:
> 
>>>> hex(15606380089319694336)
> '0xd89500014da12000'
> 
> Compared to the preceding key's value:
> 
>>>> hex(5561905152)
> '0x14b83f000'
> 
> It looks like it's just the top couple of bytes in this field that are
> affected, so those (d8, 95) can be zeroed. The second field should
> clearly be EXTENT_ITEM, which is 0xa8. The offset field (the third
> one) looks OK to me -- the bottom byte is 0.
> 
>    We can probably talk you through fixing this by hand with a decent
> hex editor. I've done it before...
> 
That would be nice! Is it fine via the mailing list? 
Potentially, the instructions could be helpful for future reference, and "real" IRC is not accessible from my current location. 

Do you have suggestions for a decent hexeditor for this job? Until now, I have been mainly using emacs, 
classic hexedit (http://rigaux.org/hexedit.html), or okteta (beware, it's graphical!), but of course these were made for a few MiB of files and are not so well suited for a block device. 

The first thing to do would then probably just be to jump to the offset where 0xd89500014da12000 is written (can I get that via inspect-internal, or do I have to search for it?), fix that to read 
0x00a800014da12000
(if I understood correctly) and then probably adapt a checksum? 

>    Bear in mind that if it is unreliable hardware, then continued use
> of the FS in read-write operation is likely to cause additional
> damage.
Of course. 
I would then, in any case, after the filesystem is up again, clean up, do a fresh external backup, scratch the FS and recreate it. I think it is already over 2 years old, so it has survived several generations of kernels. 

Oliver

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: btrfs recovery
  2017-01-26 11:01     ` Oliver Freyermuth
@ 2017-01-27 11:01       ` Oliver Freyermuth
  2017-01-27 12:58         ` Austin S. Hemmelgarn
  2017-01-28 21:04       ` Oliver Freyermuth
  1 sibling, 1 reply; 43+ messages in thread
From: Oliver Freyermuth @ 2017-01-27 11:01 UTC (permalink / raw)
  To: Hugo Mills; +Cc: linux-btrfs

> I'm also running 'memtester 12G' right now, which at least tests 2/3 of the memory. I'll leave that running for a day or so, but of course it will not provide a clear answer... 

A small update: while the online memtester has not reported any errors so far, I checked old syslogs from the machine and found something intriguing.
Jan 16 10:03:11 xxx kernel: Corrupted low memory at ffff880000009000 (9000 phys) = 00098d39
Jan 16 10:18:33 xxx kernel: Corrupted low memory at ffff880000009000 (9000 phys) = 00099795
Jan 16 17:35:48 xxx kernel: Corrupted low memory at ffff880000009000 (9000 phys) = 000dd64e
This keeps happening from time to time (I have low memory corruption checking compiled in).
The numbers always increase, and after a reboot they start fresh from a small number again.

I suppose this is a BIOS bug and it's storing some counter in low memory. I am unsure whether this could have triggered the BTRFS corruption, 
nor do I know what to do about it (are there kernel quirks for that?). 
The vendor does not provide any updates, as usual. 

If someone could confirm whether this might cause corruption for btrfs (and maybe direct me to the correct place to ask for a kernel quirk for this device - do I ask on MM, or somewhere else?), that would be much appreciated. 

>>    We can probably talk you through fixing this by hand with a decent
>> hex editor. I've done it before...
>>
> That would be nice! Is it fine via the mailing list? 
> Potentially, the instructions could be helpful for future reference, and "real" IRC is not accessible from my current location. 
> 
> Do you have suggestions for a decent hexeditor for this job? Until now, I have been mainly using emacs, 
> classic hexedit (http://rigaux.org/hexedit.html), or okteta (beware, it's graphical!), but of course these were made for a few MiB of files and are not so well suited for a block device. 
> 
> The first thing to do would then probably just be to jump to the offset where 0xd89500014da12000 is written (can I get that via inspect-internal, or do I have to search for it?), fix that to read 
> 0x00a800014da12000
> (if I understood correctly) and then probably adapt a checksum? 
> 
Additionally, I found that "btrfs restore" works on this broken FS. I will take an external backup of the content within the next 24 hours using that, then I am ready to try anything you suggest.

Cheers and thanks!
	Oliver

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: btrfs recovery
  2017-01-27 11:01       ` Oliver Freyermuth
@ 2017-01-27 12:58         ` Austin S. Hemmelgarn
  2017-01-28  5:00           ` Duncan
  0 siblings, 1 reply; 43+ messages in thread
From: Austin S. Hemmelgarn @ 2017-01-27 12:58 UTC (permalink / raw)
  To: Oliver Freyermuth, Hugo Mills; +Cc: linux-btrfs

On 2017-01-27 06:01, Oliver Freyermuth wrote:
>> I'm also running 'memtester 12G' right now, which at least tests 2/3 of the memory. I'll leave that running for a day or so, but of course it will not provide a clear answer...
>
> A small update: while the online memtester is without any errors still, I checked old syslogs from the machine and found something intriguing.
> Jan 16 10:03:11 xxx kernel: Corrupted low memory at ffff880000009000 (9000 phys) = 00098d39
> Jan 16 10:18:33 xxx kernel: Corrupted low memory at ffff880000009000 (9000 phys) = 00099795
> Jan 16 17:35:48 xxx kernel: Corrupted low memory at ffff880000009000 (9000 phys) = 000dd64e
> This seems to be consistently happening from time to time (I have low memory corruption checking compiled in).
> The numbers always consistently increase, and after a reboot, start fresh from a small number again.
>
> I suppose this is a BIOS bug and it's storing some counter in low memory. I am unsure whether this could have triggered the BTRFS corruption,
> nor do I know what to do about it (are there kernel quirks for that?).
> The vendor does not provide any updates, as usual.
>
> If someone could confirm whether this might cause corruption for btrfs (and maybe direct me to the correct place to ask for a kernel quirk for this device - do I ask on MM, or somewhere else?), that would be much appreciated.
It is a firmware bug; Linux doesn't use stuff in that physical address 
range at all.  I don't think it's likely that this specific bug caused 
the corruption, but given that the firmware doesn't have its 
allocations listed correctly in the e820 table (if they were listed 
correctly, you wouldn't be seeing this message), it would not surprise 
me if the firmware was involved somehow.
>
>>>    We can probably talk you through fixing this by hand with a decent
>>> hex editor. I've done it before...
>>>
>> That would be nice! Is it fine via the mailing list?
>> Potentially, the instructions could be helpful for future reference, and "real" IRC is not accessible from my current location.
>>
>> Do you have suggestions for a decent hexeditor for this job? Until now, I have been mainly using emacs,
>> classic hexedit (http://rigaux.org/hexedit.html), or okteta (beware, it's graphical!), but of course these were made for a few MiB of files and are not so well suited for a block device.
>>
>> The first thing to do would then probably just be to jump to the offset where 0xd89500014da12000 is written (can I get that via inspect-internal, or do I have to search for it?), fix that to read
>> 0x00a800014da12000
>> (if I understood correctly) and then probably adapt a checksum?
>>
> Additionally, I found that "btrfs restore" works on this broken FS. I will take an external backup of the content within the next 24 hours using that, then I am ready to try anything you suggest.
FWIW, the fact that btrfs restore works is a good sign; it means that 
the filesystem is almost certainly repairable (even though the tools 
might not be able to repair it themselves).


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: btrfs recovery
  2017-01-27 12:58         ` Austin S. Hemmelgarn
@ 2017-01-28  5:00           ` Duncan
  2017-01-28 12:37             ` Janos Toth F.
                               ` (2 more replies)
  0 siblings, 3 replies; 43+ messages in thread
From: Duncan @ 2017-01-28  5:00 UTC (permalink / raw)
  To: linux-btrfs

Austin S. Hemmelgarn posted on Fri, 27 Jan 2017 07:58:20 -0500 as
excerpted:

> On 2017-01-27 06:01, Oliver Freyermuth wrote:
>>> I'm also running 'memtester 12G' right now, which at least tests 2/3
>>> of the memory. I'll leave that running for a day or so, but of course
>>> it will not provide a clear answer...
>>
>> A small update: while the online memtester is without any errors still,
>> I checked old syslogs from the machine and found something intriguing.

>> kernel: Corrupted low memory at ffff880000009000 (9000 phys) = 00098d39
>> kernel: Corrupted low memory at ffff880000009000 (9000 phys) = 00099795
>> kernel: Corrupted low memory at ffff880000009000 (9000 phys) = 000dd64e

0x9000 = 36K...

>> This seems to be consistently happening from time to time (I have low
>> memory corruption checking compiled in).
>> The numbers always consistently increase, and after a reboot, start
>> fresh from a small number again.
>>
>> I suppose this is a BIOS bug and it's storing some counter in low
>> memory. I am unsure whether this could have triggered the BTRFS
>> corruption, nor do I know what to do about it (are there kernel quirks
>> for that?). The vendor does not provide any updates, as usual.
>>
>> If someone could confirm whether this might cause corruption for btrfs
>> (and maybe direct me to the correct place to ask for a kernel quirk for
>> this device - do I ask on MM, or somewhere else?), that would be much
>> appreciated.

> It is a firmware bug, Linux doesn't use stuff in that physical address
> range at all.  I don't think it's likely that this specific bug caused
> the corruption, but given that the firmware doesn't have its
> allocations listed correctly in the e820 table (if they were listed
> correctly, you wouldn't be seeing this message), it would not surprise
> me if the firmware was involved somehow.

Correct me if I'm wrong (I'm no kernel expert, but I've been building my 
own kernel for well over a decade now, so I have a working familiarity 
with the kernel options, of which the following is my possibly incorrect 
read), but I believe that's only "fact check: mostly correct" (mostly as 
in yes, it's the default, but there's a mainline kernel option to change 
it).

I was just going over the related kernel options again a couple days ago, 
so they're fresh in my head, and AFAICT...

There are THREE semi-related kernel options (config UI option location is 
based on the mainline 4.10-rc5+ git kernel I'm presently running):

DEFAULT_MMAP_MIN_ADDR

Config location: Processor type and features:
Low address space to protect from user allocation

This one is virtual memory according to config help, so likely not 
directly related, but similar idea.

X86_CHECK_BIOS_CORRUPTION

Location: Same section, a few lines below the first one:
Check for low memory corruption

I guess this is the option you (OF) have enabled.  Note that according to 
help, in addition to enabling this in options, a runtime kernel 
commandline option must be given as well, to actually enable the checks.

X86_RESERVE_LOW

Location: Same section, immediately below the check option:
Amount of low memory, in kilobytes, to reserve for the BIOS

Help for this one suggests enabling the check bios corruption option 
above if there are any doubts, so the two are directly related.

All three options apparently default to 64K (as that's what I see here 
and I don't believe I've changed them), but can be changed.  See the 
kernel options help and where it points for more.

My read of the above is that yes, by default the kernel won't use 
physical 0x9000 (36K), as it's well within the 64K default reserve area, 
but a blanket "Linux doesn't use stuff in that physical address range at 
all" is incorrect, as if the defaults have been changed it /could/ use 
that space (#3's minimum is 1 page, 4K, leaving that 36K address 
uncovered) -- there's a mainline-official option to do so, so it doesn't 
even require patching.

Meanwhile, since the defaults cover it, no quirk should be necessary (tho 
I might increase the reserve and test coverage area to the maximum 640K 
and run for awhile to be sure it's not going above the 64K default), but 
were it outside the default 64K coverage area, I would probably file it 
as a bug (my usual method for confirmed bugs), and mark it initially as 
an arch-x86 bug, tho they may switch it to something else, later.  But 
the devs would probably suggest further debugging, possibly giving you 
debug patches to try, etc, to nail down the specific device, before 
setting up a quirk for it.  Because the problem could be an expansion 
card or something, not the mobo/factory-default-machine, too, and it'd be 
a shame to set up a quirk for the wrong hardware.

>> Additionally, I found that "btrfs restore" works on this broken FS. I
>> will take an external backup of the content within the next 24 hours
>> using that, then I am ready to try anything you suggest.

> FWIW the fact that btrfs restore works is a good sign, it means that
> the filesystem is almost certainly repairable (even though the tools
> might not be able to repair it themselves).

Btrfs restore is a very useful tool.  It has gotten me out of a few 
"changes since the last backup weren't valuable enough to have updated 
the backup yet when the risk was theoretical, so nothing serious, but now 
that it's no longer theory only, it'd still be useful to be able to save 
the current version, if it's not /too/ much trouble" type situations, 
myself. =:^)

Just don't count on restore to save your ***: treat whatever it can 
bring back up to current as a pleasant surprise. That way, having it fail 
won't be a downside, while having it work, if it does, will always be an 
upside. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: btrfs recovery
  2017-01-28  5:00           ` Duncan
@ 2017-01-28 12:37             ` Janos Toth F.
  2017-01-28 16:51               ` Oliver Freyermuth
  2017-01-28 16:46             ` Oliver Freyermuth
  2017-01-30 12:41             ` Austin S. Hemmelgarn
  2 siblings, 1 reply; 43+ messages in thread
From: Janos Toth F. @ 2017-01-28 12:37 UTC (permalink / raw)
  To: Btrfs BTRFS

I usually compile my kernels with CONFIG_X86_RESERVE_LOW=640 and
CONFIG_X86_CHECK_BIOS_CORRUPTION=N because 640 kilobyte seems like a
very cheap price to pay in order to avoid worrying about this (and
skip the associated checking + monitoring).

Out of curiosity (after reading this email) I set these to 4 and Y (so
1 page = 4k reserve and checking turned ON and activated by default)
on a useless laptop. Right after reboot, the kernel log was full of
the same kind of Btrfs errors reported in the first email of this
topic ("bad key order", etc). I could run a scrub with zero errors and
successfully reboot with a read-write mounted root filesystem with the
old kernel build (but the kernel log was still full of errors, as you
might imagine). I tried to run "btrfs check --repair", but it seems to
be useless in this situation; the filesystem needs to be recreated
(not too hard in my case, since it's still fully readable). However,
the kernel log was free of the "Corrupted low memory at" kind of
messages (even though I let it run for hours).

On Sat, Jan 28, 2017 at 6:00 AM, Duncan <1i5t5.duncan@cox.net> wrote:
> Austin S. Hemmelgarn posted on Fri, 27 Jan 2017 07:58:20 -0500 as
> excerpted:
>
>> On 2017-01-27 06:01, Oliver Freyermuth wrote:
>>>> I'm also running 'memtester 12G' right now, which at least tests 2/3
>>>> of the memory. I'll leave that running for a day or so, but of course
>>>> it will not provide a clear answer...
>>>
>>> A small update: while the online memtester is without any errors still,
>>> I checked old syslogs from the machine and found something intriguing.
>
>>> kernel: Corrupted low memory at ffff880000009000 (9000 phys) = 00098d39
>>> kernel: Corrupted low memory at ffff880000009000 (9000 phys) = 00099795
>>> kernel: Corrupted low memory at ffff880000009000 (9000 phys) = 000dd64e
>
> 0x9000 = 36K...
>
>>> This seems to be consistently happening from time to time (I have low
>>> memory corruption checking compiled in).
>>> The numbers always consistently increase, and after a reboot, start
>>> fresh from a small number again.
>>>
>>> I suppose this is a BIOS bug and it's storing some counter in low
>>> memory. I am unsure whether this could have triggered the BTRFS
>>> corruption, nor do I know what to do about it (are there kernel quirks
>>> for that?). The vendor does not provide any updates, as usual.
>>>
>>> If someone could confirm whether this might cause corruption for btrfs
>>> (and maybe direct me to the correct place to ask for a kernel quirk for
>>> this device - do I ask on MM, or somewhere else?), that would be much
>>> appreciated.
>
>> It is a firmware bug, Linux doesn't use stuff in that physical address
>> range at all.  I don't think it's likely that this specific bug caused
>> the corruption, but given that the firmware doesn't have its
>> allocations listed correctly in the e820 table (if they were listed
>> correctly, you wouldn't be seeing this message), it would not surprise
>> me if the firmware was involved somehow.
>
> Correct me if I'm wrong (I'm no kernel expert, but I've been building my
> own kernel for well over a decade now so having a working familiarity
> with the kernel options, of which the following is my possibly incorrect
> read), but I believe that's only "fact check: mostly correct" (mostly as
> in yes it's the default, but there's a mainline kernel option to change
> it).
>
> I was just going over the related kernel options again a couple days ago,
> so they're fresh in my head, and AFAICT...
>
> There are THREE semi-related kernel options (config UI option location is
> based on the mainline 4.10-rc5+ git kernel I'm presently running):
>
> DEFAULT_MMAP_MIN_ADDR
>
> Config location: Processor type and features:
> Low address space to protect from user allocation
>
> This one is virtual memory according to config help, so likely not
> directly related, but similar idea.
>
> X86_CHECK_BIOS_CORRUPTION
>
> Location: Same section, a few lines below the first one:
> Check for low memory corruption
>
> I guess this is the option you (OF) have enabled.  Note that according to
> help, in addition to enabling this in options, a runtime kernel
> commandline option must be given as well, to actually enable the checks.
>
> X86_RESERVE_LOW
>
> Location: Same section, immediately below the check option:
> Amount of low memory, in kilobytes, to reserve for the BIOS
>
> Help for this one suggests enabling the check bios corruption option
> above if there are any doubts, so the two are directly related.
>
> All three options apparently default to 64K (as that's what I see here
> and I don't believe I've changed them), but can be changed.  See the
> kernel options help and where it points for more.
>
> My read of the above is that yes, by default the kernel won't use
> physical 0x9000 (36K), as it's well within the 64K default reserve area,
> but a blanket "Linux doesn't use stuff in that physical address range at
> all" is incorrect, as if the defaults have been changed it /could/ use
> that space (#3's minimum is 1 page, 4K, leaving that 36K address
> uncovered) -- there's a mainline-official option to do so, so it doesn't
> even require patching.
>
> Meanwhile, since the defaults cover it, no quirk should be necessary (tho
> I might increase the reserve and test coverage area to the maximum 640K
> and run for awhile to be sure it's not going above the 64K default), but
> were it outside the default 64K coverage area, I would probably file it
> as a bug (my usual method for confirmed bugs), and mark it initially as
> an arch-x86 bug, tho they may switch it to something else, later.  But
> the devs would probably suggest further debugging, possibly giving you
> debug patches to try, etc, to nail down the specific device, before
> setting up a quirk for it.  Because the problem could be an expansion
> card or something, not the mobo/factory-default-machine, too, and it'd be
> a shame to setup a quirk for the wrong hardware.
>
>>> Additionally, I found that "btrfs restore" works on this broken FS. I
>>> will take an external backup of the content within the next 24 hours
>>> using that, then I am ready to try anything you suggest.
>
>> FWIW the fact that btrfs restore works is a good sign, it means that
>> the filesystem is almost certainly repairable (even though the tools
>> might not be able to repair it themselves).
>
> Btrfs restore is a very useful tool.  It has gotten me out of a few
> "changes since the last backup weren't valuable enough to have updated
> the backup yet when the risk was theoretical, so nothing serious, but now
> that it's no longer theory only, it'd still be useful to be able to save
> the current version, if it's not /too/ much trouble" type situations,
> myself. =:^)
>
> Just don't count on restore to save your *** and always treat what it can
> often bring to current as a pleasant surprise, and having it fail won't
> be a down side, while having it work, if it does, will always be up side.
> =:^)
>
> --
> Duncan - List replies preferred.   No HTML msgs.
> "Every nonfree program has a lord, a master --
> and if you use the program, he is your master."  Richard Stallman
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: btrfs recovery
  2017-01-28  5:00           ` Duncan
  2017-01-28 12:37             ` Janos Toth F.
@ 2017-01-28 16:46             ` Oliver Freyermuth
  2017-01-31  4:58               ` Duncan
  2017-01-30 12:41             ` Austin S. Hemmelgarn
  2 siblings, 1 reply; 43+ messages in thread
From: Oliver Freyermuth @ 2017-01-28 16:46 UTC (permalink / raw)
  To: linux-btrfs

Hi Duncan, 

thanks for your extensive reply! 

Am 28.01.2017 um 06:00 schrieb Duncan:
> All three options apparently default to 64K (as that's what I see here 
> and I don't believe I've changed them), but can be changed.  See the 
> kernel options help and where it points for more.
> 
Indeed, I have here:
CONFIG_DEFAULT_MMAP_MIN_ADDR=4096
(still at default)
CONFIG_X86_CHECK_BIOS_CORRUPTION=y
CONFIG_X86_BOOTPARAM_MEMORY_CORRUPTION_CHECK=y
The last option sets the default value for the bootparam,
so my kernel boots with checks on even without the bootparam explicitly set. 

I have now increased 
CONFIG_X86_RESERVE_LOW=640
which was previously indeed set to 64, and rebooted successfully. 

I have also tried to set:
memory_corruption_check_size=640
as a kernel parameter. Sadly, the system froze before it could produce any output on screen, or write anything to PMEM.
So I am now running with the default of checking the first 64k, but I hope I am safer now since I use CONFIG_X86_RESERVE_LOW=640.

> Meanwhile, since the defaults cover it, no quirk should be necessary (tho 
> I might increase the reserve and test coverage area to the maximum 640K 
> and run for awhile to be sure it's not going above the 64K default), but 
> were it outside the default 64K coverage area, I would probably file it 
> as a bug (my usual method for confirmed bugs), and mark it initially as 
> an arch-x86 bug, tho they may switch it to something else, later.  But 
> the devs would probably suggest further debugging, possibly giving you 
> debug patches to try, etc, to nail down the specific device, before 
> setting up a quirk for it.  Because the problem could be an expansion 
> card or something, not the mobo/factory-default-machine, too, and it'd be 
> a shame to setup a quirk for the wrong hardware.
I have another funny addition. I have access to a machine with the exact same hardware configuration, and I am pretty sure it has the same firmware version, though of course it may be configured slightly differently.
That one is running openSUSE 13.1 with standard kernel. 
They use:
CONFIG_X86_CHECK_BIOS_CORRUPTION=y
CONFIG_X86_BOOTPARAM_MEMORY_CORRUPTION_CHECK=y
CONFIG_X86_RESERVE_LOW=64
That machine does *NOT* produce the corruption-messages in syslog. 

The obvious differences are the kernel version (I run 4.9, that machine has 3.12.57-44-default),
and the fact that my machine was booted via UEFI, while the openSUSE machine booted classically via BIOS (i.e. CSM of the UEFI). 
Even the GPU is the same. That machine uses the nvidia binary driver, while I use nouveau by now, but indeed I find the same corruption messages in old syslogs from the time I still used the binary blob.

So my hunch is that it is related to me booting via EFI, but of course it might be a firmware bug triggered by some kernel-firmware-interaction change between 3.12 and 4.9,
or something in the firmware configuration. 

> Btrfs restore is a very useful tool.  It has gotten me out of a few 
> "changes since the last backup weren't valuable enough to have updated 
> the backup yet when the risk was theoretical, so nothing serious, but now 
> that it's no longer theory only, it'd still be useful to be able to save 
> the current version, if it's not /too/ much trouble" type situations, 
> myself. =:^)
> 
> Just don't count on restore to save your *** and always treat what it can 
> often bring to current as a pleasant surprise, and having it fail won't 
> be a down side, while having it work, if it does, will always be up side. 
> =:^)
> 
I'll keep that in mind, and I think that in the future, before trying any "btrfs check" (or even repair)
I will always try restore first if my backup was not fresh enough :-). 


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: btrfs recovery
  2017-01-28 12:37             ` Janos Toth F.
@ 2017-01-28 16:51               ` Oliver Freyermuth
  0 siblings, 0 replies; 43+ messages in thread
From: Oliver Freyermuth @ 2017-01-28 16:51 UTC (permalink / raw)
  To: Janos Toth F., Btrfs BTRFS

Am 28.01.2017 um 13:37 schrieb Janos Toth F.:
> I usually compile my kernels with CONFIG_X86_RESERVE_LOW=640 and
> CONFIG_X86_CHECK_BIOS_CORRUPTION=N because 640 kilobyte seems like a
> very cheap price to pay in order to avoid worrying about this (and
> skip the associated checking + monitoring).
> 
> Out of curiosity (after reading this email) I set these to 4 and Y (so
> 1 page = 4k reserve and checking turned ON and activated by default)
> on a useless laptop. Right after reboot, the kernel log was full of
> the same kind of Btrfs errors reported in the first email of this
> topic ("bad key order", etc). I could run a scrub with zero errors and
> successfully reboot with a read-write mounted root filesystem with the
> old kernel build (but the kernel log was still full of errors, as you
> might imagine). I tried to run "btrfs check --repair" but it seems to
> be useless in this situation, the filesystem needs to be recreated
> (not too hard in my case when it's still fully readable). Although,
> the kernel log was free of the "Corrupted low memory at" kind of
> messages (even though I let it run for hours).
Thanks for this report about the (even destructive...) test! 
It's astonishing how fast this broke... 
As mentioned in my last mail a few minutes ago, I am now following your example and the machine is running with
CONFIG_X86_RESERVE_LOW=640
(but still with checking for the first 64k active). I am considering using that for all the machines I administer,
but it's interesting to see that major distributions targeting desktop use stick with 64.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: btrfs recovery
  2017-01-26 11:01     ` Oliver Freyermuth
  2017-01-27 11:01       ` Oliver Freyermuth
@ 2017-01-28 21:04       ` Oliver Freyermuth
  2017-01-28 22:27         ` Hans van Kranenburg
  1 sibling, 1 reply; 43+ messages in thread
From: Oliver Freyermuth @ 2017-01-28 21:04 UTC (permalink / raw)
  To: Hugo Mills; +Cc: linux-btrfs

Am 26.01.2017 um 12:01 schrieb Oliver Freyermuth:
>Am 26.01.2017 um 11:00 schrieb Hugo Mills:
>>    We can probably talk you through fixing this by hand with a decent
>> hex editor. I've done it before...
>>
> That would be nice! Is it fine via the mailing list? 
> Potentially, the instructions could be helpful for future reference, and "real" IRC is not accessible from my current location. 
> 
> Do you have suggestions for a decent hexeditor for this job? Until now, I have been mainly using emacs, 
> classic hexedit (http://rigaux.org/hexedit.html), or okteta (beware, it's graphical!), but of course these were made for a few MiB of files and are not so well suited for a block device. 
> 
> The first thing to do would then probably just be to jump to the offset where 0xd89500014da12000 is written (can I get that via inspect-internal, or do I have to search for it?), fix that to read 
> 0x00a800014da12000
> (if I understood correctly) and then probably adapt a checksum? 
>
My external backup via btrfs-restore is now done successfully, so I am ready for anything you throw at me. 
Since I was able to pull all data, though, it would mainly be something educational (for me, and likely other list readers). 
If you think that this manual procedure is not worth it, I can also just scratch and recreate the FS. 

Cheers, 
	Oliver

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: btrfs recovery
  2017-01-28 21:04       ` Oliver Freyermuth
@ 2017-01-28 22:27         ` Hans van Kranenburg
  2017-01-29  2:02           ` Oliver Freyermuth
  0 siblings, 1 reply; 43+ messages in thread
From: Hans van Kranenburg @ 2017-01-28 22:27 UTC (permalink / raw)
  To: Oliver Freyermuth, Hugo Mills; +Cc: linux-btrfs

On 01/28/2017 10:04 PM, Oliver Freyermuth wrote:
> Am 26.01.2017 um 12:01 schrieb Oliver Freyermuth:
>> Am 26.01.2017 um 11:00 schrieb Hugo Mills:
>>>    We can probably talk you through fixing this by hand with a decent
>>> hex editor. I've done it before...
>>>
>> That would be nice! Is it fine via the mailing list? 
>> Potentially, the instructions could be helpful for future reference, and "real" IRC is not accessible from my current location. 
>>
>> Do you have suggestions for a decent hexeditor for this job? Until now, I have been mainly using emacs, 
>> classic hexedit (http://rigaux.org/hexedit.html), or okteta (beware, it's graphical!), but of course these were made for a few MiB of files and are not so well suited for a block device. 
>>
>> The first thing to do would then probably just be to jump to the offset where 0xd89500014da12000 is written (can I get that via inspect-internal, or do I have to search for it?), fix that to read 
>> 0x00a800014da12000
>> (if I understood correctly) and then probably adapt a checksum? 
>>
> My external backup via btrfs-restore is now done successfully, so I am ready for anything you throw at me. 
> Since I was able to pull all data, though, it would mainly be something educational (for me, and likely other list readers). 
> If you think that this manual procedure is not worth it, I can also just scratch and recreate the FS. 

OK, let's do it. I also want to practice a bit with stuff like this, so
this is a nice example.

See if you can dump the chunk tree (tree 3) with btrfs inspect-internal
dump-tree -t 3 /dev/xxx

You should get a list of objects like this one:

item 88 key (FIRST_CHUNK_TREE CHUNK_ITEM 1200384638976) itemoff 9067
itemsize 80
  chunk length 1073741824 owner 2 stripe_len 65536
  type DATA num_stripes 1
    stripe 0 devid 1 offset 729108447232
    dev uuid: edae9198-4ea9-4553-9992-af8e27aa6578

Find the one that contains 35028992

So, where it says 1200384638976 and length 1073741824 in the example
above, which is the btrfs virtual address space from 1200384638976 to
1200384638976 + 1GiB, you need to find the one where 35028992 is between
the start and start+length.

Then, look at the stripe line. If you have DUP metadata, it will be a
type METADATA (instead of DATA in the example above) and it will list
two stripe lines, which point at the two physical locations in the
underlying block device.

The place where your 16kiB metadata block is stored is at physical start
of stripe + (35028992 - start of virtual address block).

Then, dump one of the two mirrored 16kiB from disk with something like
`dd if=/dev/sdb1 bs=1 skip=<physical location> count=16384 > foo`

File foo of 16kiB size now contains the data that you dumped in the
pastebin before.

Using hexedit on this can be a quite confusing experience because of the
reordering of bytes in the raw data. When you expect to find
0xd89500014da12000 somewhere, it probably doesn't show up as d8 95 00 01
4d a1 20 00, but in a different order.
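
The on-disk format is little-endian, so the u64 shows up reversed; a
quick way to see the exact byte sequence to look for, in a Python 3 shell:

>>> import struct
>>> struct.pack('<Q', 0xd89500014da12000).hex()
'0020a14d010095d8'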

If you end up here, and if you can find the values in the hexdump
already, please put the 16kiB file somewhere online (or pipe it through
base64 and pastebin it), so we can help a bit more efficiently.

After getting the bytelevel stuff right again, the block needs a new
checksum, and then you have to carefully dd it back in both of the
places which are listed in the stripe lines.

If everything goes right... bam! Mount again and happy btrfsing again.

-- 
Hans van Kranenburg

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: btrfs recovery
  2017-01-28 22:27         ` Hans van Kranenburg
@ 2017-01-29  2:02           ` Oliver Freyermuth
  2017-01-29 16:44             ` Hans van Kranenburg
  0 siblings, 1 reply; 43+ messages in thread
From: Oliver Freyermuth @ 2017-01-29  2:02 UTC (permalink / raw)
  To: Hans van Kranenburg, Hugo Mills; +Cc: linux-btrfs

Am 28.01.2017 um 23:27 schrieb Hans van Kranenburg:
> On 01/28/2017 10:04 PM, Oliver Freyermuth wrote:
>> Am 26.01.2017 um 12:01 schrieb Oliver Freyermuth:
>>> Am 26.01.2017 um 11:00 schrieb Hugo Mills:
>>>>    We can probably talk you through fixing this by hand with a decent
>>>> hex editor. I've done it before...
>>>>
>>> That would be nice! Is it fine via the mailing list? 
>>> Potentially, the instructions could be helpful for future reference, and "real" IRC is not accessible from my current location. 
>>>
>>> Do you have suggestions for a decent hexeditor for this job? Until now, I have been mainly using emacs, 
>>> classic hexedit (http://rigaux.org/hexedit.html), or okteta (beware, it's graphical!), but of course these were made for a few MiB of files and are not so well suited for a block device. 
>>>
>>> The first thing to do would then probably just be to jump to the offset where 0xd89500014da12000 is written (can I get that via inspect-internal, or do I have to search for it?), fix that to read 
>>> 0x00a800014da12000
>>> (if I understood correctly) and then probably adapt a checksum? 
>>>
>> My external backup via btrfs-restore is now done successfully, so I am ready for anything you throw at me. 
>> Since I was able to pull all data, though, it would mainly be something educational (for me, and likely other list readers). 
>> If you think that this manual procedure is not worth it, I can also just scratch and recreate the FS. 
> 
> OK, let's do it. I also want to practice a bit with stuff like this, so
> this is a nice example.
> 
> See if you can dump the chunk tree (tree 3) with btrfs inspect-internal
> dump-tree -t 3 /dev/xxx
> 
Yes, I can! :-)

> You should get a list of objects like this one:
> 
> item 88 key (FIRST_CHUNK_TREE CHUNK_ITEM 1200384638976) itemoff 9067
> itemsize 80
>   chunk length 1073741824 owner 2 stripe_len 65536
>   type DATA num_stripes 1
>     stripe 0 devid 1 offset 729108447232
>     dev uuid: edae9198-4ea9-4553-9992-af8e27aa6578
> 
> Find the one that contains 35028992
>
> So, where it says 1200384638976 and length 1073741824 in the example
> above, which is the btrfs virtual address space from 1200384638976 to
> 1200384638976 + 1GiB, you need to find the one where 35028992 is between
> the start and start+length.
> 
I found:
        item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 29360128) itemoff 15993 itemsize 112
                length 1073741824 owner 2 stripe_len 65536 type METADATA|DUP
                io_align 65536 io_width 65536 sector_size 4096
                num_stripes 2 sub_stripes 0
                        stripe 0 devid 1 offset 37748736
                        dev_uuid 76acfc80-aa73-4a21-890b-34d1d2259728
                        stripe 1 devid 1 offset 1111490560
                        dev_uuid 76acfc80-aa73-4a21-890b-34d1d2259728

So I have Metadata DUP (at least I remembered that correctly). 
Now, for the calculation:
37748736+(35028992-29360128)   =   43417600
1111490560+(35028992-29360128) = 1117159424
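
Or, as a quick Python check of the same arithmetic, with the chunk item
values from above:

>>> logical = 35028992                  # corrupt tree block (virtual address)
>>> chunk_start = 29360128              # CHUNK_ITEM key offset
>>> stripes = [37748736, 1111490560]    # physical stripe offsets
>>> [s + (logical - chunk_start) for s in stripes]
[43417600, 1117159424]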

> Then, look at the stripe line. If you have DUP metadata, it will be a
> type METADATA (instead of DATA in the example above) and it will list
> two stripe lines, which point at the two physical locations in the
> underlying block device.
> 
> The place where your 16kiB metadata block is stored is at physical start
> of stripe + (35028992 - start of virtual address block).
> 
> Then, dump one of the two mirrored 16kiB from disk with something like
> `dd if=/dev/sdb1 bs=1 skip=<physical location> count=16384 > foo`
And the dd'ing:
dd if=/dev/sdb1 bs=1 skip=43417600 count=16384 > mblock_first
dd if=/dev/sdb1 bs=1 skip=1117159424 count=16384 > mblock_second
Just as a cross-check, as expected, the md5sum of both files is the same, so they are identical. 

> 
> File foo of 16kiB size now contains the data that you dumped in the
> pastebin before.
> 
> Using hexedit on this can be a quite confusing experience because of the
> reordering of bytes in the raw data. When you expect to find
> 0xd89500014da12000 somewhere, it probably doesn't show up as d8 95 00 01
> 4d a1 20 00, but in a different order.
> 
Indeed, that's confusing; luckily I'm somewhat used to it from having done some close-to-hardware work.
In the dump, starting at offset 0x1FB8, I get:
00 20 A1 4D  01 00 95 D8
so the expected bytes in reverse. 
So my next step would likely be to change that to:
00 20 A1 4D  01 00 A8 00
and then somehow redo the CRC - correct so far? 

And my very last step would be: 
dd if=mblock_first of=/dev/sdb1 bs=1 seek=43417600 count=16384
dd if=mblock_first of=/dev/sdb1 bs=1 seek=1117159424 count=16384
(of which the "count" is then not really needed, but better safe than sorry). 

> If you end up here, and if you can find the values in the hexdump
> already, please put the 16kiB file somewhere online (or pipe it through
> base64 and pastebin it), so we can help a bit more efficiently.
I've put it online here (ownCloud instance of our University):
https://uni-bonn.sciebo.de/index.php/s/3Vdr7nmmfqPtHot/download
and alternatively as base64 in pastebin:
http://pastebin.com/K1CzCxqi

> After getting the bytelevel stuff right again, the block needs a new
> checksum, and then you have to carefully dd it back in both of the
> places which are listed in the stripe lines.
> 
> If everything goes right... bam! Mount again and happy btrfsing again.
> 

Thanks for all up to here! 
	Oliver

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: btrfs recovery
  2017-01-29  2:02           ` Oliver Freyermuth
@ 2017-01-29 16:44             ` Hans van Kranenburg
  2017-01-29 19:09               ` Oliver Freyermuth
  0 siblings, 1 reply; 43+ messages in thread
From: Hans van Kranenburg @ 2017-01-29 16:44 UTC (permalink / raw)
  To: Oliver Freyermuth, Hugo Mills; +Cc: linux-btrfs

On 01/29/2017 03:02 AM, Oliver Freyermuth wrote:
> Am 28.01.2017 um 23:27 schrieb Hans van Kranenburg:
>> On 01/28/2017 10:04 PM, Oliver Freyermuth wrote:
>>> Am 26.01.2017 um 12:01 schrieb Oliver Freyermuth:
>>>> Am 26.01.2017 um 11:00 schrieb Hugo Mills:
>>>>>    We can probably talk you through fixing this by hand with a decent
>>>>> hex editor. I've done it before...
>>>>>
>>>> That would be nice! Is it fine via the mailing list? 
>>>> Potentially, the instructions could be helpful for future reference, and "real" IRC is not accessible from my current location. 
>>>>
>>>> Do you have suggestions for a decent hexeditor for this job? Until now, I have been mainly using emacs, 
>>>> classic hexedit (http://rigaux.org/hexedit.html), or okteta (beware, it's graphical!), but of course these were made for a few MiB of files and are not so well suited for a block device. 
>>>>
>>>> The first thing to do would then probably just be to jump to the offset where 0xd89500014da12000 is written (can I get that via inspect-internal, or do I have to search for it?), fix that to read 
>>>> 0x00a800014da12000
>>>> (if I understood correctly) and then probably adapt a checksum? 
>>>>
>>> My external backup via btrfs-restore is now done successfully, so I am ready for anything you throw at me. 
>>> Since I was able to pull all data, though, it would mainly be something educational (for me, and likely other list readers). 
>>> If you think that this manual procedure is not worth it, I can also just scratch and recreate the FS. 
>>
>> OK, let's do it. I also want to practice a bit with stuff like this, so
>> this is a nice example.
>>
>> See if you can dump the chunk tree (tree 3) with btrfs inspect-internal
>> dump-tree -t 3 /dev/xxx
>>
> Yes, I can! :-)
> 
>> You should get a list of objects like this one:
>>
>> item 88 key (FIRST_CHUNK_TREE CHUNK_ITEM 1200384638976) itemoff 9067
>> itemsize 80
>>   chunk length 1073741824 owner 2 stripe_len 65536
>>   type DATA num_stripes 1
>>     stripe 0 devid 1 offset 729108447232
>>     dev uuid: edae9198-4ea9-4553-9992-af8e27aa6578
>>
>> Find the one that contains 35028992
>>
>> So, where it says 1200384638976 and length 1073741824 in the example
>> above, which is the btrfs virtual address space from 1200384638976 to
>> 1200384638976 + 1GiB, you need to find the one where 35028992 is between
>> the start and start+length.
>>
> I found:
>         item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 29360128) itemoff 15993 itemsize 112
>                 length 1073741824 owner 2 stripe_len 65536 type METADATA|DUP
>                 io_align 65536 io_width 65536 sector_size 4096
>                 num_stripes 2 sub_stripes 0
>                         stripe 0 devid 1 offset 37748736
>                         dev_uuid 76acfc80-aa73-4a21-890b-34d1d2259728
>                         stripe 1 devid 1 offset 1111490560
>                         dev_uuid 76acfc80-aa73-4a21-890b-34d1d2259728
> 
> So I have Metadata DUP (at least I remembered that correctly). 
> Now, for the calculation:
> 37748736+(35028992-29360128)   =   43417600
> 1111490560+(35028992-29360128) = 1117159424
> 
>> Then, look at the stripe line. If you have DUP metadata, it will be a
>> type METADATA (instead of DATA in the example above) and it will list
>> two stripe lines, which point at the two physical locations in the
>> underlying block device.
>>
>> The place where your 16kiB metadata block is stored is at physical start
>> of stripe + (35028992 - start of virtual address block).
>>
>> Then, dump one of the two mirrored 16kiB from disk with something like
>> `dd if=/dev/sdb1 bs=1 skip=<physical location> count=16384 > foo`
> And the dd'ing:
> dd if=/dev/sdb1 bs=1 skip=43417600 count=16384 > mblock_first
> dd if=/dev/sdb1 bs=1 skip=1117159424 count=16384 > mblock_second
> Just as a cross-check, as expected, the md5sum of both files is the same, so they are identical. 
> 
>>
>> File foo of 16kiB size now contains the data that you dumped in the
>> pastebin before.
>>
>> Using hexedit on this can be a quite confusing experience because of the
>> reordering of bytes in the raw data. When you expect to find
>> 0xd89500014da12000 somewhere, it probably doesn't show up as d8 95 00 01
>> 4d a1 20 00, but in a different order.
>>
> Indeed, that's confusing, luckily I'm used to this a bit since I did some close-to-hardware work. 
> In the dump, starting at offset 0x1FB8, I get:
> 00 20 A1 4D  01 00 95 D8
> so the expected bytes in reverse. 
> So my next step would likely be to change that to:
> 00 20 A1 4D  01 00 A8 00
> and then somehow redo the CRC - correct so far? 

Almost: the 95 d8 was garbage, which needs to be 00 00, and the a8 goes
in place of the 4c, which currently causes it to be displayed as
UNKNOWN.76 instead of EXTENT_ITEM.

I hope the 303104 value is correct, otherwise we have to also fix that.
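
For orientation: each entry in a node is a btrfs_key_ptr -- objectid
(8 bytes), type (1), offset (8), then blockptr (8) and generation (8),
all little-endian. Decoding the bytes at 0x1fb8 of your dump in a Python 3
shell gives back exactly the values dump-tree printed:

>>> import struct
>>> raw = bytes.fromhex('0020a14d010095d8'    # objectid (corrupted)
...                     '4c'                  # type, 76 = UNKNOWN.76
...                     '00a0040000000000'    # offset
...                     '00408d2300000000')   # blockptr
>>> struct.unpack('<QBQQ', raw)
(15606380089319694336, 76, 303104, 596459520)

So only the objectid and the type byte need to change.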

> And my very last step would be: 
> dd if=mblock_first of=/dev/sdb1 bs=1 skip=43417600 count=16384
> dd if=mblock_first of=/dev/sdb1 bs=1 skip=1117159424 count=16384
> (of which the "count" is then not really needed, but better safe than sorry). 
> 
>> If you end up here, and if you can find the values in the hexdump
>> already, please put the 16kiB file somewhere online (or pipe it through
>> base64 and pastebin it), so we can help a bit more efficiently.
> I've put it online here (ownCloud instance of our University):
> https://uni-bonn.sciebo.de/index.php/s/3Vdr7nmmfqPtHot/download
> and alternatively as base64 in pastebin:
> http://pastebin.com/K1CzCxqi
> 
>> After getting the bytelevel stuff right again, the block needs a new
>> checksum, and then you have to carefully dd it back in both of the
>> places which are listed in the stripe lines.
>>
>> If everything goes right... bam! Mount again and happy btrfsing again.

Yes, or... do some btrfs-assisted 'hexedit'. I just added some missing
structures for a metadata Node into python-btrfs, in a branch where I'm
playing around a bit with the first steps of offline editing.

If you clone https://github.com/knorrie/python-btrfs/ and checkout the
branch 'bigmomma', you can do this:

~/src/git/python-btrfs (bigmomma) 4-$ ipython
Python 2.7.13 (default, Dec 18 2016, 20:19:42)
Type "copyright", "credits" or "license" for more information.

IPython 5.1.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: import array

In [2]: import btrfs

In [3]: buf = array.array('B', open('mblock_first').read())

In [4]: node = btrfs.ctree.Node(buf)

In [5]: len(node.ptrs)
Out[5]: 376

In [6]: ptr = node.ptrs[243]

In [7]: print(ptr)
key (15606380089319694336 76 303104) block 596459520 gen 20441

In [8]: ptr.key.objectid &= 0xffffffff

In [9]: ptr.key.type = btrfs.ctree.EXTENT_ITEM_KEY

In [10]: print(ptr)
key (1302405120 EXTENT_ITEM 303104) block 596459520 gen 20441

In [11]: ptr.write()

In [12]: node.header.write()

In [13]: buf.tofile(open('mblock_first_fixed', 'wb'))

And voila:

-$ hexdump -C mblock_first > mblock_first.hexdump
-$ hexdump -C mblock_first_fixed > mblock_first_fixed.hexdump
-$ diff -u0 mblock_first.hexdump mblock_first_fixed.hexdump
--- mblock_first.hexdump	2017-01-29 17:31:57.324537433 +0100
+++ mblock_first_fixed.hexdump	2017-01-29 17:33:48.252683710 +0100
@@ -1 +1 @@
-00000000  00 22 16 2b 00 00 00 00  00 00 00 00 00 00 00 00  |.".+............|
+00000000  8f c0 96 b0 00 00 00 00  00 00 00 00 00 00 00 00  |................|
@@ -508,2 +508,2 @@
-00001fb0  d9 4f 00 00 00 00 00 00  00 20 a1 4d 01 00 95 d8  |.O....... .M....|
-00001fc0  4c 00 a0 04 00 00 00 00  00 00 40 8d 23 00 00 00  |L.........@.#...|
+00001fb0  d9 4f 00 00 00 00 00 00  00 20 a1 4d 00 00 00 00  |.O....... .M....|
+00001fc0  a8 00 a0 04 00 00 00 00  00 00 40 8d 23 00 00 00  |..........@.#...|

:-)

Writing back the information to the byte buffer (the node header) also
recomputes the checksum.
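
(In case someone wants to redo that step by hand instead: as far as I
understand the on-disk format, the csum is a plain CRC-32C of everything
after the 32-byte csum field at the start of the block, stored
little-endian in the first 4 bytes. A rough sketch, assuming the crcmod
package for the CRC-32C implementation:

import struct

import crcmod.predefined  # assumption: the crcmod package is installed

crc32c = crcmod.predefined.mkCrcFun('crc-32c')

with open('mblock_first_fixed', 'rb') as f:
    block = bytearray(f.read())

# checksum covers bytes [32, 16384); result goes into the first 4 bytes
block[0:4] = struct.pack('<I', crc32c(bytes(block[32:])))

with open('mblock_first_fixed', 'wb') as f:
    f.write(block)

If that understanding is right, the first 4 bytes should then match the
8f c0 96 b0 shown in the diff above.)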

If this is the same change that you ended up with while doing it
manually, then try to put it back on disk twice, and see what happens
when mounting.

-- 
Hans van Kranenburg

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: btrfs recovery
  2017-01-29 16:44             ` Hans van Kranenburg
@ 2017-01-29 19:09               ` Oliver Freyermuth
  2017-01-29 19:28                 ` Hans van Kranenburg
  0 siblings, 1 reply; 43+ messages in thread
From: Oliver Freyermuth @ 2017-01-29 19:09 UTC (permalink / raw)
  To: Hans van Kranenburg, Hugo Mills; +Cc: linux-btrfs

On 29.01.2017 at 17:44, Hans van Kranenburg wrote:
> On 01/29/2017 03:02 AM, Oliver Freyermuth wrote:
>> On 28.01.2017 at 23:27, Hans van Kranenburg wrote:
>>> On 01/28/2017 10:04 PM, Oliver Freyermuth wrote:
>>>> On 26.01.2017 at 12:01, Oliver Freyermuth wrote:
>>>>> On 26.01.2017 at 11:00, Hugo Mills wrote:
>>>>>>    We can probably talk you through fixing this by hand with a decent
>>>>>> hex editor. I've done it before...
>>>>>>
>>>>> That would be nice! Is it fine via the mailing list? 
>>>>> Potentially, the instructions could be helpful for future reference, and "real" IRC is not accessible from my current location. 
>>>>>
>>>>> Do you have suggestions for a decent hexeditor for this job? Until now, I have been mainly using emacs, 
>>>>> classic hexedit (http://rigaux.org/hexedit.html), or okteta (beware, it's graphical!), but of course these were made for a few MiB of files and are not so well suited for a block device. 
>>>>>
>>>>> The first thing to do would then probably just be to jump to the offset where 0xd89500014da12000 is written (can I get that via inspect-internal, or do I have to search for it?), fix that to read 
>>>>> 0x00a800014da12000
>>>>> (if I understood correctly) and then probably adapt a checksum? 
>>>>>
>>>> My external backup via btrfs-restore is now done successfully, so I am ready for anything you throw at me. 
>>>> Since I was able to pull all data, though, it would mainly be something educational (for me, and likely other list readers). 
>>>> If you think that this manual procedure is not worth it, I can also just scratch and recreate the FS. 
>>>
>>> OK, let's do it. I also want to practice a bit with stuff like this, so
>>> this is a nice example.
>>>
>>> See if you can dump the chunk tree (tree 3) with btrfs inspect-internal
>>> dump-tree -t 3 /dev/xxx
>>>
>> Yes, I can! :-)
>>
>>> You should get a list of objects like this one:
>>>
>>> item 88 key (FIRST_CHUNK_TREE CHUNK_ITEM 1200384638976) itemoff 9067
>>> itemsize 80
>>>   chunk length 1073741824 owner 2 stripe_len 65536
>>>   type DATA num_stripes 1
>>>     stripe 0 devid 1 offset 729108447232
>>>     dev uuid: edae9198-4ea9-4553-9992-af8e27aa6578
>>>
>>> Find the one that contains 35028992
>>>
>>> So, where it says 1200384638976 and length 1073741824 in the example
>>> above, which is the btrfs virtual address space from 1200384638976 to
>>> 1200384638976 + 1GiB, you need to find the one where 35028992 is between
>>> the start and start+length.
>>>
>> I found:
>>         item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 29360128) itemoff 15993 itemsize 112
>>                 length 1073741824 owner 2 stripe_len 65536 type METADATA|DUP
>>                 io_align 65536 io_width 65536 sector_size 4096
>>                 num_stripes 2 sub_stripes 0
>>                         stripe 0 devid 1 offset 37748736
>>                         dev_uuid 76acfc80-aa73-4a21-890b-34d1d2259728
>>                         stripe 1 devid 1 offset 1111490560
>>                         dev_uuid 76acfc80-aa73-4a21-890b-34d1d2259728
>>
>> So I have Metadata DUP (at least I remembered that correctly). 
>> Now, for the calculation:
>> 37748736+(35028992-29360128)   =   43417600
>> 1111490560+(35028992-29360128) = 1117159424
>>
>>> Then, look at the stripe line. If you have DUP metadata, it will be a
>>> type METADATA (instead of DATA in the example above) and it will list
>>> two stripe lines, which point at the two physical locations in the
>>> underlying block device.
>>>
>>> The place where your 16kiB metadata block is stored is at physical start
>>> of stripe + (35028992 - start of virtual address block).
>>>
>>> Then, dump one of the two mirrored 16kiB from disk with something like
>>> `dd if=/dev/sdb1 bs=1 skip=<physical location> count=16384 > foo`
>> And the dd'ing:
>> dd if=/dev/sdb1 bs=1 skip=43417600 count=16384 > mblock_first
>> dd if=/dev/sdb1 bs=1 skip=1117159424 count=16384 > mblock_second
>> Just as a cross-check, as expected, the md5sum of both files is the same, so they are identical. 
>>
>>>
>>> File foo of 16kiB size now contains the data that you dumped in the
>>> pastebin before.
>>>
>>> Using hexedit on this can be a quite confusing experience because of the
>>> reordering of bytes in the raw data. When you expect to find
>>> 0xd89500014da12000 somewhere, it probably doesn't show up as d8 95 00 01
>>> 4d a1 20 00, but in a different order.
>>>
>> Indeed, that's confusing, luckily I'm used to this a bit since I did some close-to-hardware work. 
>> In the dump, starting at offset 0x1FB8, I get:
>> 00 20 A1 4D  01 00 95 D8
>> so the expected bytes in reverse. 
>> So my next step would likely be to change that to:
>> 00 20 A1 4D  01 00 A8 00
>> and then somehow redo the CRC - correct so far? 
> 
> Almost, the 95 d8 was garbage, which needs to be 00 00, and the a8 goes
> in place of the 4c, which now causes it do be displayed as UNKNOWN.76
> instead of EXTENT_ITEM.
> 
> I hope the 303104 value is correct, otherwise we have to also fix that.
> 
>> And my very last step would be: 
>> dd if=mblock_first of=/dev/sdb1 bs=1 skip=43417600 count=16384
>> dd if=mblock_first of=/dev/sdb1 bs=1 skip=1117159424 count=16384
>> (of which the "count" is then not really needed, but better safe than sorry). 
>>
>>> If you end up here, and if you can find the values in the hexdump
>>> already, please put the 16kiB file somewhere online (or pipe it through
>>> base64 and pastebin it), so we can help a bit more efficiently.
>> I've put it online here (ownCloud instance of our University):
>> https://uni-bonn.sciebo.de/index.php/s/3Vdr7nmmfqPtHot/download
>> and alternatively as base64 in pastebin:
>> http://pastebin.com/K1CzCxqi
>>
>>> After getting the bytelevel stuff right again, the block needs a new
>>> checksum, and then you have to carefully dd it back in both of the
>>> places which are listed in the stripe lines.
>>>
>>> If everything goes right... bam! Mount again and happy btrfsing again.
> 
> Yes, or... do some btrfs-assisted 'hexedit'. I just added some missing
> structures for a metadata Node into python-btrfs, in a branch where I'm
> playing around a bit with the first steps of offline editing.
> 
> If you clone https://github.com/knorrie/python-btrfs/ and checkout the
> branch 'bigmomma', you can do this:
> 
> ~/src/git/python-btrfs (bigmomma) 4-$ ipython
> Python 2.7.13 (default, Dec 18 2016, 20:19:42)
> Type "copyright", "credits" or "license" for more information.
> 
> IPython 5.1.0 -- An enhanced Interactive Python.
> ?         -> Introduction and overview of IPython's features.
> %quickref -> Quick reference.
> help      -> Python's own help system.
> object?   -> Details about 'object', use 'object??' for extra details.
> 
> In [1]: import array
> 
> In [2]: import btrfs
> 
> In [3]: buf = array.array('B', open('mblock_first').read())
> 
> In [4]: node = btrfs.ctree.Node(buf)
> 
> In [5]: len(node.ptrs)
> Out[5]: 376
> 
> In [6]: ptr = node.ptrs[243]
> 
> In [7]: print(ptr)
> key (15606380089319694336 76 303104) block 596459520 gen 20441
> 
> In [8]: ptr.key.objectid &= 0xffffffff
> 
> In [9]: ptr.key.type = btrfs.ctree.EXTENT_ITEM_KEY
> 
> In [10]: print(ptr)
> key (1302405120 EXTENT_ITEM 303104) block 596459520 gen 20441
> 
> In [11]: ptr.write()
> 
> In [12]: node.header.write()
> 
> In [13]: buf.tofile(open('mblock_first_fixed', 'wb'))
> 
> And voila:
> 
> -$ hexdump -C mblock_first > mblock_first.hexdump
> -$ hexdump -C mblock_first_fixed > mblock_first_fixed.hexdump
> -$ diff -u0 mblock_first.hexdump mblock_first_fixed.hexdump
> --- mblock_first.hexdump	2017-01-29 17:31:57.324537433 +0100
> +++ mblock_first_fixed.hexdump	2017-01-29 17:33:48.252683710 +0100
> @@ -1 +1 @@
> -00000000  00 22 16 2b 00 00 00 00  00 00 00 00 00 00 00 00
> |.".+............|
> +00000000  8f c0 96 b0 00 00 00 00  00 00 00 00 00 00 00 00
> |................|
> @@ -508,2 +508,2 @@
> -00001fb0  d9 4f 00 00 00 00 00 00  00 20 a1 4d 01 00 95 d8  |.O.......
> .M....|
> -00001fc0  4c 00 a0 04 00 00 00 00  00 00 40 8d 23 00 00 00
> |L.........@.#...|
> +00001fb0  d9 4f 00 00 00 00 00 00  00 20 a1 4d 00 00 00 00  |.O.......
> .M....|
> +00001fc0  a8 00 a0 04 00 00 00 00  00 00 40 8d 23 00 00 00
> |..........@.#...|
> 
> :-)
> 
> Writing back the information to the byte buffer (the node header) also
> recomputes the checksum.
> 
> If this is the same change that you ended up with while doing it
> manually, then try to put it back on disk twice, and see what happens
> when mounting.
> 
Wow - this nice python toolset really makes it easy, bigmomma holding your hands ;-) . 

Indeed, I get exactly the same output you showed in your example, which almost matches my manual change, apart from one bit here:
-00001fb0  d9 4f 00 00 00 00 00 00  00 20 a1 4d 01 00 95 d8
+00001fb0  d9 4f 00 00 00 00 00 00  00 20 a1 4d 00 00 00 00
I do not understand this change from 01 to 00; is this some parity information which python-btrfs fixed up automatically?

Trusting the output, I did:
dd if=mblock_first_fixed of=/dev/sdb1 bs=1 seek=43417600 count=16384
dd if=mblock_first_fixed of=/dev/sdb1 bs=1 seek=1117159424 count=16384
and re-ran "btrfs-debug-tree -b 35028992 /dev/sdb1" to confirm, item 243 is now:
...
        key (5547032576 EXTENT_ITEM 204800) block 596426752 (36403) gen 20441
        key (5561905152 EXTENT_ITEM 184320) block 596443136 (36404) gen 20441
=>      key (1302405120 EXTENT_ITEM 303104) block 596459520 (36405) gen 20441
        key (5726711808 EXTENT_ITEM 524288) block 596475904 (36406) gen 20441
        key (5820571648 EXTENT_ITEM 524288) block 350322688 (21382) gen 20427
...
Sadly, trying to mount, I still get:
[190422.147717] BTRFS info (device sdb1): use lzo compression
[190422.147846] BTRFS info (device sdb1): disk space caching is enabled
[190422.229227] BTRFS critical (device sdb1): corrupt node, bad key order: block=35028992, root=1, slot=242
[190422.241635] BTRFS critical (device sdb1): corrupt node, bad key order: block=35028992, root=1, slot=242
[190422.241644] BTRFS error (device sdb1): failed to read block groups: -5
[190422.254824] BTRFS error (device sdb1): open_ctree failed
The notable difference is that previously, the message was:
corrupt node, bad key order: block=35028992, root=1, slot=243
So does this tell me that item 242 was also corrupted?

Cheers and thanks for everything up to now!
	Oliver

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: btrfs recovery
  2017-01-29 19:09               ` Oliver Freyermuth
@ 2017-01-29 19:28                 ` Hans van Kranenburg
  2017-01-29 19:52                   ` Oliver Freyermuth
  0 siblings, 1 reply; 43+ messages in thread
From: Hans van Kranenburg @ 2017-01-29 19:28 UTC (permalink / raw)
  To: Oliver Freyermuth, Hugo Mills; +Cc: linux-btrfs

On 01/29/2017 08:09 PM, Oliver Freyermuth wrote:
>> [..whaaa.. text.. see previous message..]
> Wow - this nice python toolset really makes it easy, bigmomma holding your hands ;-) . 
> 
> Indeed, I get exactly the same output you did show in your example, which almost matches my manual change, apart from one bit here:
> -00001fb0  d9 4f 00 00 00 00 00 00  00 20 a1 4d 01 00 95 d8
> +00001fb0  d9 4f 00 00 00 00 00 00  00 20 a1 4d 00 00 00 00
> I do not understand this change from 01 to 00, is this some parity information which python-btrfs fixed up automatically?
> 
> Trusting the output, I did:
> dd if=mblock_first_fixed of=/dev/sdb1 bs=1 seek=43417600 count=16384
> dd if=mblock_first_fixed of=/dev/sdb1 bs=1 seek=1117159424 count=16384
> and re-ran "btrfs-debug-tree -b 35028992 /dev/sdb1" to confirm, item 243 is now:
> ...
>         key (5547032576 EXTENT_ITEM 204800) block 596426752 (36403) gen 20441
>         key (5561905152 EXTENT_ITEM 184320) block 596443136 (36404) gen 20441
> =>      key (1302405120 EXTENT_ITEM 303104) block 596459520 (36405) gen 20441
>         key (5726711808 EXTENT_ITEM 524288) block 596475904 (36406) gen 20441
>         key (5820571648 EXTENT_ITEM 524288) block 350322688 (21382) gen 20427

Ehm, oh yes, that was obviously a mistake in what I showed. The
0xffffffff cuts off too much..

>>> 0xd89500014da12000 & 0xffffffff
1302405120L

This is better...

>>> 0xd89500014da12000 & 0xffffffffff
5597372416L

...which is the value Hugo also mentioned as likely being the one that
has to be there, since it fits nicely in between the surrounding keys.

> ...
> Sadly, trying to mount, I still get:
> [190422.147717] BTRFS info (device sdb1): use lzo compression
> [190422.147846] BTRFS info (device sdb1): disk space caching is enabled
> [190422.229227] BTRFS critical (device sdb1): corrupt node, bad key order: block=35028992, root=1, slot=242
> [190422.241635] BTRFS critical (device sdb1): corrupt node, bad key order: block=35028992, root=1, slot=242
> [190422.241644] BTRFS error (device sdb1): failed to read block groups: -5
> [190422.254824] BTRFS error (device sdb1): open_ctree failed
> The notable difference is that previously, the message was:
> corrupt node, bad key order: block=35028992, root=1, slot=243
> So does this tell me that also item 242 was corrupted?

No, I was just going too fast.

A nice extra exercise is to look up the block at 596459520, which this
item points to, and then see which object is the first one in the part
of the tree stored in that page. It should be (5597372416 EXTENT_ITEM
303104) I guess.

-- 
Hans van Kranenburg

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: btrfs recovery
  2017-01-29 19:28                 ` Hans van Kranenburg
@ 2017-01-29 19:52                   ` Oliver Freyermuth
  2017-01-29 20:13                     ` Hans van Kranenburg
  0 siblings, 1 reply; 43+ messages in thread
From: Oliver Freyermuth @ 2017-01-29 19:52 UTC (permalink / raw)
  To: Hans van Kranenburg, Hugo Mills; +Cc: linux-btrfs

On 29.01.2017 at 20:28, Hans van Kranenburg wrote:
> On 01/29/2017 08:09 PM, Oliver Freyermuth wrote:
>>> [..whaaa.. text.. see previous message..]
>> Wow - this nice python toolset really makes it easy, bigmomma holding your hands ;-) . 
>>
>> Indeed, I get exactly the same output you did show in your example, which almost matches my manual change, apart from one bit here:
>> -00001fb0  d9 4f 00 00 00 00 00 00  00 20 a1 4d 01 00 95 d8
>> +00001fb0  d9 4f 00 00 00 00 00 00  00 20 a1 4d 00 00 00 00
>> I do not understand this change from 01 to 00, is this some parity information which python-btrfs fixed up automatically?
>>
>> Trusting the output, I did:
>> dd if=mblock_first_fixed of=/dev/sdb1 bs=1 seek=43417600 count=16384
>> dd if=mblock_first_fixed of=/dev/sdb1 bs=1 seek=1117159424 count=16384
>> and re-ran "btrfs-debug-tree -b 35028992 /dev/sdb1" to confirm, item 243 is now:
>> ...
>>         key (5547032576 EXTENT_ITEM 204800) block 596426752 (36403) gen 20441
>>         key (5561905152 EXTENT_ITEM 184320) block 596443136 (36404) gen 20441
>> =>      key (1302405120 EXTENT_ITEM 303104) block 596459520 (36405) gen 20441
>>         key (5726711808 EXTENT_ITEM 524288) block 596475904 (36406) gen 20441
>>         key (5820571648 EXTENT_ITEM 524288) block 350322688 (21382) gen 20427
> 
> Ehm, oh yes, that was obviously a mistake in what I showed. The
> 0xffffffff cuts off too much..
> 
>>>> 0xd89500014da12000 & 0xffffffff
> 1302405120L
> 
> This is better...
> 
>>>> 0xd89500014da12000 & 0xffffffffff
> 5597372416L
> 
> ...which is the value Hugo also mentioned to likely be the value that
> has to be there, since it nicely fits in between the surrounding keys.
Understood!
Now the diff matches exactly what I would have done:
-00001fb0  d9 4f 00 00 00 00 00 00  00 20 a1 4d 01 00 95 d8
-00001fc0  4c 00 a0 04 00 00 00 00  00 00 40 8d 23 00 00 00
+00001fb0  d9 4f 00 00 00 00 00 00  00 20 a1 4d 01 00 00 00
+00001fc0  a8 00 a0 04 00 00 00 00  00 00 40 8d 23 00 00 00

It's really nice that python-btrfs takes over all the checksumming stuff. 

Writing things back and re-running "btrfs-debug-tree -b 35028992 /dev/sdb1", I find:

        key (5547032576 EXTENT_ITEM 204800) block 596426752 (36403) gen 20441
        key (5561905152 EXTENT_ITEM 184320) block 596443136 (36404) gen 20441
=>      key (5597372416 EXTENT_ITEM 303104) block 596459520 (36405) gen 20441
        key (5726711808 EXTENT_ITEM 524288) block 596475904 (36406) gen 20441
        key (5820571648 EXTENT_ITEM 524288) block 350322688 (21382) gen 20427

This matches the surroundings much better. 

> 
>> ...
>> Sadly, trying to mount, I still get:
>> [190422.147717] BTRFS info (device sdb1): use lzo compression
>> [190422.147846] BTRFS info (device sdb1): disk space caching is enabled
>> [190422.229227] BTRFS critical (device sdb1): corrupt node, bad key order: block=35028992, root=1, slot=242
>> [190422.241635] BTRFS critical (device sdb1): corrupt node, bad key order: block=35028992, root=1, slot=242
>> [190422.241644] BTRFS error (device sdb1): failed to read block groups: -5
>> [190422.254824] BTRFS error (device sdb1): open_ctree failed
>> The notable difference is that previously, the message was:
>> corrupt node, bad key order: block=35028992, root=1, slot=243
>> So does this tell me that also item 242 was corrupted?
> 
> No, I was just going too fast.
> 
> A nice extra excercise is to look up the block at 596459520, which this
> item points to, and then see which object is the first one in the part
> of the tree stored in that page. It should be (5597372416 EXTENT_ITEM
> 303104) I guess.
> 
That indeed matches your expectation, i.e.:
# btrfs-debug-tree -b 596459520 /dev/sdb1
contains:
        item 0 key (5597372416 EXTENT_ITEM 303104) itemoff 16230 itemsize 53

So all looks well! 

And now the final good news:
I can mount, no error messages in the syslog are shown! 


Finally, just to make sure there are no other issues, I ran a btrfs check in readonly mode:
 # btrfs check --readonly /dev/sdb1
Checking filesystem on /dev/sdb1
UUID: cfd16c65-7f3b-4f5e-9029-971f2433d7ab
checking extents
checking free space cache
checking fs roots
invalid location in dir item 120
root 5 inode 177542 errors 2000, link count wrong
        unresolved ref dir 117670 index 29695 namelen 20 name 2016-07-12_10_26.jpg filetype 1 errors 1, no dir item
root 5 inode 18446744073709551361 errors 2001, no inode item, link count wrong
        unresolved ref dir 117670 index 0 namelen 20 name 2016-07-12_10_26.jpg filetype 1 errors 6, no dir index, no inode ref
found 127774183424 bytes used err is 1
total csum bytes: 124401728
total tree bytes: 346046464
total fs tree bytes: 163315712
total extent tree bytes: 35667968
btree space waste bytes: 53986463
file data blocks allocated: 177184325632
 referenced 130490667008

These errors are unrelated and likely caused by an earlier hard poweroff sometime last year. 

Nevertheless, since I'll now try to use this FS (let's see how long it stays stable), I ran repair:
# btrfs check --repair /dev/sdb1
enabling repair mode
Checking filesystem on /dev/sdb1
UUID: cfd16c65-7f3b-4f5e-9029-971f2433d7ab
checking extents
Fixed 0 roots.
checking free space cache
cache and super generation don't match, space cache will be invalidated
checking fs roots
invalid location in dir item 120
Trying to rebuild inode:18446744073709551361
Failed to reset nlink for inode 18446744073709551361: No such file or directory
        unresolved ref dir 117670 index 0 namelen 20 name 2016-07-12_10_26.jpg filetype 1 errors 6, no dir index, no inode ref
checking csums
checking root refs
found 127774183424 bytes used err is 0
total csum bytes: 124401728
total tree bytes: 346046464
total fs tree bytes: 163315712
total extent tree bytes: 35667968
btree space waste bytes: 53986463
file data blocks allocated: 177184325632
 referenced 130490667008

It still mounts, and now:
[193339.299305] BTRFS info (device sdb1): use lzo compression
[193339.299308] BTRFS info (device sdb1): disk space caching is enabled
[193339.653980] BTRFS info (device sdb1): checking UUID tree

I guess this all is fine :-) . 

So all in all, I have to say a great thanks for all this support - it really was a good educational experience, and I am pretty sure this functionality of python-btrfs will be of help to others, too! 

Cheers and thanks, 
	Oliver

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: btrfs recovery
  2017-01-29 19:52                   ` Oliver Freyermuth
@ 2017-01-29 20:13                     ` Hans van Kranenburg
  0 siblings, 0 replies; 43+ messages in thread
From: Hans van Kranenburg @ 2017-01-29 20:13 UTC (permalink / raw)
  To: Oliver Freyermuth, Hugo Mills; +Cc: linux-btrfs

On 01/29/2017 08:52 PM, Oliver Freyermuth wrote:
> Am 29.01.2017 um 20:28 schrieb Hans van Kranenburg:
>> On 01/29/2017 08:09 PM, Oliver Freyermuth wrote:
>>>> [..whaaa.. text.. see previous message..]
>>> Wow - this nice python toolset really makes it easy, bigmomma holding your hands ;-) . 

Well, bigmomma is the nickname of someone on IRC whom I helped with a
similar issue a few weeks ago, also a quite bizarre case of a random
collection of bytes ending up in a leaf metadata page. While doing
that I started this branch, adding some code for extra data structures
and for writing changed values back.

So far, the python-btrfs project has focused only on working with
filesystems that are already online, mounted, and working correctly.

So even the simple chunk tree lookup we needed to find the location to
dd was not yet possible with it.

The code already hacked together for putting the metadata page into
objects with nice attributes is waiting in an experimental branch for
later, when I'm going to have a look at working with unmounted
filesystems and at how to interface with the C code for doing tree
plumbing. :)

>>> Indeed, I get exactly the same output you did show in your example, which almost matches my manual change, apart from one bit here:
>>> -00001fb0  d9 4f 00 00 00 00 00 00  00 20 a1 4d 01 00 95 d8
>>> +00001fb0  d9 4f 00 00 00 00 00 00  00 20 a1 4d 00 00 00 00
>>> I do not understand this change from 01 to 00, is this some parity information which python-btrfs fixed up automatically?
>>>
>>> Trusting the output, I did:
>>> dd if=mblock_first_fixed of=/dev/sdb1 bs=1 seek=43417600 count=16384
>>> dd if=mblock_first_fixed of=/dev/sdb1 bs=1 seek=1117159424 count=16384
>>> and re-ran "btrfs-debug-tree -b 35028992 /dev/sdb1" to confirm, item 243 is now:
>>> ...
>>>         key (5547032576 EXTENT_ITEM 204800) block 596426752 (36403) gen 20441
>>>         key (5561905152 EXTENT_ITEM 184320) block 596443136 (36404) gen 20441
>>> =>      key (1302405120 EXTENT_ITEM 303104) block 596459520 (36405) gen 20441
>>>         key (5726711808 EXTENT_ITEM 524288) block 596475904 (36406) gen 20441
>>>         key (5820571648 EXTENT_ITEM 524288) block 350322688 (21382) gen 20427
>>
>> Ehm, oh yes, that was obviously a mistake in what I showed. The
>> 0xffffffff cuts off too much..
>>
>>>>> 0xd89500014da12000 & 0xffffffff
>> 1302405120L
>>
>> This is better...
>>
>>>>> 0xd89500014da12000 & 0xffffffffff
>> 5597372416L
>>
>> ...which is the value Hugo also mentioned to likely be the value that
>> has to be there, since it nicely fits in between the surrounding keys.
> Understood!
> Now the diff matches exactly what I would done:
> -00001fb0  d9 4f 00 00 00 00 00 00  00 20 a1 4d 01 00 95 d8
> -00001fc0  4c 00 a0 04 00 00 00 00  00 00 40 8d 23 00 00 00
> +00001fb0  d9 4f 00 00 00 00 00 00  00 20 a1 4d 01 00 00 00
> +00001fc0  a8 00 a0 04 00 00 00 00  00 00 40 8d 23 00 00 00
> 
> It's really nice that python-btrfs takes over all the checksumming stuff. 
> 
> Writing things back and re-running "btrfs-debug-tree -b 35028992 /dev/sdb1", I find:
> 
>         key (5547032576 EXTENT_ITEM 204800) block 596426752 (36403) gen 20441
>         key (5561905152 EXTENT_ITEM 184320) block 596443136 (36404) gen 20441
> =>      key (5597372416 EXTENT_ITEM 303104) block 596459520 (36405) gen 20441
>         key (5726711808 EXTENT_ITEM 524288) block 596475904 (36406) gen 20441
>         key (5820571648 EXTENT_ITEM 524288) block 350322688 (21382) gen 20427
> 
> This matches the surroundings much better. 

Yes, good.

>>> ...
>>> Sadly, trying to mount, I still get:
>>> [190422.147717] BTRFS info (device sdb1): use lzo compression
>>> [190422.147846] BTRFS info (device sdb1): disk space caching is enabled
>>> [190422.229227] BTRFS critical (device sdb1): corrupt node, bad key order: block=35028992, root=1, slot=242
>>> [190422.241635] BTRFS critical (device sdb1): corrupt node, bad key order: block=35028992, root=1, slot=242
>>> [190422.241644] BTRFS error (device sdb1): failed to read block groups: -5
>>> [190422.254824] BTRFS error (device sdb1): open_ctree failed
>>> The notable difference is that previously, the message was:
>>> corrupt node, bad key order: block=35028992, root=1, slot=243
>>> So does this tell me that also item 242 was corrupted?
>>
>> No, I was just going too fast.
>>
>> A nice extra excercise is to look up the block at 596459520, which this
>> item points to, and then see which object is the first one in the part
>> of the tree stored in that page. It should be (5597372416 EXTENT_ITEM
>> 303104) I guess.
>>
> That indeed matches your expectation, i.e.:
> # btrfs-debug-tree -b 596459520 /dev/sdb1
> contains:
>         item 0 key (5597372416 EXTENT_ITEM 303104) itemoff 16230 itemsize 53
> 
> So all looks well! 

Yay.

> And now the final good news:
> I can mount, no error messages in the syslog are shown! 
> 
> 
> Finally, just to make sure there are no other issues, I ran a btrfs check in readonly mode:
>  # btrfs check --readonly /dev/sdb1
> Checking filesystem on /dev/sdb1
> UUID: cfd16c65-7f3b-4f5e-9029-971f2433d7ab
> checking extents
> checking free space cache
> checking fs roots
> invalid location in dir item 120
> root 5 inode 177542 errors 2000, link count wrong
>         unresolved ref dir 117670 index 29695 namelen 20 name 2016-07-12_10_26.jpg filetype 1 errors 1, no dir item
> root 5 inode 18446744073709551361 errors 2001, no inode item, link count wrong
>         unresolved ref dir 117670 index 0 namelen 20 name 2016-07-12_10_26.jpg filetype 1 errors 6, no dir index, no inode ref
> found 127774183424 bytes used err is 1
> total csum bytes: 124401728
> total tree bytes: 346046464
> total fs tree bytes: 163315712
> total extent tree bytes: 35667968
> btree space waste bytes: 53986463
> file data blocks allocated: 177184325632
>  referenced 130490667008
> 
> These errors are unrelated and likely caused by an earlier hard poweroff sometime last year. 
> 
> Nevertheless, since I'll now try to use this FS (let's see how long it keeps stable), I ran repair:
> # btrfs check --repair /dev/sdb1
> enabling repair mode
> Checking filesystem on /dev/sdb1
> UUID: cfd16c65-7f3b-4f5e-9029-971f2433d7ab
> checking extents
> Fixed 0 roots.
> checking free space cache
> cache and super generation don't match, space cache will be invalidated
> checking fs roots
> invalid location in dir item 120
> Trying to rebuild inode:18446744073709551361
> Failed to reset nlink for inode 18446744073709551361: No such file or directory
>         unresolved ref dir 117670 index 0 namelen 20 name 2016-07-12_10_26.jpg filetype 1 errors 6, no dir index, no inode ref
> checking csums
> checking root refs
> found 127774183424 bytes used err is 0
> total csum bytes: 124401728
> total tree bytes: 346046464
> total fs tree bytes: 163315712
> total extent tree bytes: 35667968
> btree space waste bytes: 53986463
> file data blocks allocated: 177184325632
>  referenced 130490667008
> 
> It still mounts, and now:
> [193339.299305] BTRFS info (device sdb1): use lzo compression
> [193339.299308] BTRFS info (device sdb1): disk space caching is enabled
> [193339.653980] BTRFS info (device sdb1): checking UUID tree
> 
> I guess this all is fine :-) . 
> 
> So all in all, I have to say a great thanks for all this support - it really was a good educational experience, and I am pretty sure this functionality of python-btrfs will be of help to others, too! 

Have fun,

-- 
Hans van Kranenburg

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: btrfs recovery
  2017-01-28  5:00           ` Duncan
  2017-01-28 12:37             ` Janos Toth F.
  2017-01-28 16:46             ` Oliver Freyermuth
@ 2017-01-30 12:41             ` Austin S. Hemmelgarn
  2 siblings, 0 replies; 43+ messages in thread
From: Austin S. Hemmelgarn @ 2017-01-30 12:41 UTC (permalink / raw)
  To: linux-btrfs

On 2017-01-28 00:00, Duncan wrote:
> Austin S. Hemmelgarn posted on Fri, 27 Jan 2017 07:58:20 -0500 as
> excerpted:
>
>> On 2017-01-27 06:01, Oliver Freyermuth wrote:
>>>> I'm also running 'memtester 12G' right now, which at least tests 2/3
>>>> of the memory. I'll leave that running for a day or so, but of course
>>>> it will not provide a clear answer...
>>>
>>> A small update: while the online memtester is without any errors still,
>>> I checked old syslogs from the machine and found something intriguing.
>
>>> kernel: Corrupted low memory at ffff880000009000 (9000 phys) = 00098d39
>>> kernel: Corrupted low memory at ffff880000009000 (9000 phys) = 00099795
>>> kernel: Corrupted low memory at ffff880000009000 (9000 phys) = 000dd64e
>
> 0x9000 = 36K...
>
>>> This seems to be consistently happening from time to time (I have low
>>> memory corruption checking compiled in).
>>> The numbers always consistently increase, and after a reboot, start
>>> fresh from a small number again.
>>>
>>> I suppose this is a BIOS bug and it's storing some counter in low
>>> memory. I am unsure whether this could have triggered the BTRFS
>>> corruption, nor do I know what to do about it (are there kernel quirks
>>> for that?). The vendor does not provide any updates, as usual.
>>>
>>> If someone could confirm whether this might cause corruption for btrfs
>>> (and maybe direct me to the correct place to ask for a kernel quirk for
>>> this device - do I ask on MM, or somewhere else?), that would be much
>>> appreciated.
>
>> It is a firmware bug, Linux doesn't use stuff in that physical address
>> range at all.  I don't think it's likely that this specific bug caused
>> the corruption, but given that the firmware doesn't have it's
>> allocations listed correctly in the e820 table (if they were listed
>> correctly, you wouldn't be seeing this message), it would not surprise
>> me if the firmware was involved somehow.
>
> Correct me if I'm wrong (I'm no kernel expert, but I've been building my
own kernel for well over a decade now, so I have a working familiarity
> with the kernel options, of which the following is my possibly incorrect
> read), but I believe that's only "fact check: mostly correct" (mostly as
> in yes it's the default, but there's a mainline kernel option to change
> it).
>
> I was just going over the related kernel options again a couple days ago,
> so they're fresh in my head, and AFAICT...
>
> There are THREE semi-related kernel options (config UI option location is
> based on the mainline 4.10-rc5+ git kernel I'm presently running):
>
> DEFAULT_MMAP_MIN_ADDR
>
> Config location: Processor type and features:
> Low address space to protect from user allocation
>
> This one is virtual memory according to config help, so likely not
> directly related, but similar idea.
Yeah, it really only affects userspace.  In effect, it's the lowest 
virtual address that a userspace program can allocate memory at.  By 
default on most systems it only covers the first page (which is to 
protect against NULL pointer bugs).  Most distros set it at 64k to 
provide a bit of extra protection.  There are a handful that set it to 0 
so that vm86 stuff works, but the number of such distros is going down 
over time because vm86 is not a common use case, and this can be 
configured at runtime through /proc/sys/vm/mmap_min_addr.
>
> X86_CHECK_BIOS_CORRUPTION
>
> Location: Same section, a few lines below the first one:
> Check for low memory corruption
>
> I guess this is the option you (OF) have enabled.  Note that according to
> help, in addition to enabling this in options, a runtime kernel
> commandline option must be given as well, to actually enable the checks.
There's another option that controls the default (I forget the config 
option and I'm too lazy right now to check), but he obviously either has 
that option enabled or has it enabled at run-time, otherwise there 
wouldn't be any messages in the kernel log about the check failing. 
FWIW, the reason this defaults to being off is that it runs every 60 
seconds, and therefore has a significant impact on power usage on mobile 
systems.
>
> X86_RESERVE_LOW
>
> Location: Same section, immediately below the check option:
> Amount of low memory, in kilobytes, to reserve for the BIOS
>
> Help for this one suggests enabling the check bios corruption option
> above if there are any doubts, so the two are directly related.
Yes.  This specifies both the kernel equivalent of DEFAULT_MMAP_MIN_ADDR 
(so the kernel won't use anything with a physical address between 0 and 
this range), and the upper bound for the corruption check.
>
> All three options apparently default to 64K (as that's what I see here
> and I don't believe I've changed them), but can be changed.  See the
> kernel options help and where it points for more.
>
> My read of the above is that yes, by default the kernel won't use
> physical 0x9000 (36K), as it's well within the 64K default reserve area,
> but a blanket "Linux doesn't use stuff in that physical address range at
> all" is incorrect, as if the defaults have been changed it /could/ use
> that space (#3's minimum is 1 page, 4K, leaving that 36K address
> uncovered) -- there's a mainline-official option to do so, so it doesn't
> even require patching.
You're correct, but the only realistic case where Linux will actually 
use that range on x86 is in custom-built kernels and a handful of OEM 
vendor kernels.  Distributions all have it set at the default because 
they want to work safely on most hardware (and with limited exceptions, 
Windows doesn't touch the low 64k either, so BIOS vendors aren't as 
worried as they should be about that range), and that covers almost 
everyone who isn't building a kernel themselves.
>
> Meanwhile, since the defaults cover it, no quirk should be necessary (tho
> I might increase the reserve and test coverage area to the maximum 640K
and run for a while to be sure it's not going above the 64K default), but
> were it outside the default 64K coverage area, I would probably file it
> as a bug (my usual method for confirmed bugs), and mark it initially as
> an arch-x86 bug, tho they may switch it to something else, later.  But
> the devs would probably suggest further debugging, possibly giving you
> debug patches to try, etc, to nail down the specific device, before
> setting up a quirk for it.  Because the problem could be an expansion
> card or something, not the mobo/factory-default-machine, too, and it'd be
a shame to set up a quirk for the wrong hardware.
As a general rule, I just use 640k on everything.  It's simpler than 
trying to fight with OEMs to get the firmware fixed, and on most 
systems it's a fraction of a percent of the RAM so it doesn't matter 
much.  I don't normally have the checking enabled either, but I usually 
do check (and report any issues) during the first few boots on new hardware.
>
>>> Additionally, I found that "btrfs restore" works on this broken FS. I
>>> will take an external backup of the content within the next 24 hours
>>> using that, then I am ready to try anything you suggeest.
>
>> FWIW the fact that btrfs restore works is a good sign, it means that
>> the filesystem is almost certainly repairable (even though the tools
>> might not be able to repair it themselves).
>
> Btrfs restore is a very useful tool.  It has gotten me out of a few
> "changes since the last backup weren't valuable enough to have updated
> the backup yet when the risk was theoretical, so nothing serious, but now
> that it's no longer theory only, it'd still be useful to be able to save
> the current version, if it's not /too/ much trouble" type situations,
> myself. =:^)
>
> Just don't count on restore to save your *** and always treat what it can
> often bring to current as a pleasant surprise, and having it fail won't
> be a down side, while having it work, if it does, will always be up side.
> =:^)
Entirely agreed on restore.  It's a wonderful tool to have, but you 
absolutely should not rely on it.

FWIW, such tools do exist for other filesystems; they just aren't 
usually part of the filesystem's standard tools.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: btrfs recovery
  2017-01-28 16:46             ` Oliver Freyermuth
@ 2017-01-31  4:58               ` Duncan
  2017-01-31 12:45                 ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 43+ messages in thread
From: Duncan @ 2017-01-31  4:58 UTC (permalink / raw)
  To: linux-btrfs

Oliver Freyermuth posted on Sat, 28 Jan 2017 17:46:24 +0100 as excerpted:

>> Just don't count on restore to save your *** and always treat what it
>> can often bring to current as a pleasant surprise, and having it fail
>> won't be a down side, while having it work, if it does, will always be
>> up side.
>> =:^)
>> 
> I'll keep that in mind, and I think that in the future, before trying
> any "btrfs check" (or even repair)
> I will always try restore first if my backup was not fresh enough :-).

That's a wise idea, as long as you have the resources to actually be able 
to write the files somewhere (as people running btrfs really should, 
because it's /not/ fully stable yet).

One of the great things about restore is that all the writing it does is 
to the destination filesystem -- it doesn't attempt to actually write or 
repair anything on the filesystem it's trying to restore /from/, so it's 
far lower risk than anything that /does/ actually attempt to write to or 
repair the potentially damaged filesystem.

That makes it /extremely/ useful as a "first, to the extent possible, 
make sure the backups are safely freshened" tool. =:^)


Meanwhile, FWIW, restore can also be used as a sort of undelete tool.  
Remember, btrfs is COW and writes any changes to a new location.  The old 
location tends to stick around, no longer referenced by anything 
"live", but still there until some other change happens to overwrite it.
 
Just like undelete on a more conventional filesystem, therefore, as long 
as you notice the problem before the old location has been overwritten 
again, it's often possible to recover it, altho the mechanisms involved 
are rather different on btrfs.  Basically, you use btrfs-find-root to get 
a list of old roots, then point restore at them using the -t option.  
There's a page on the wiki that goes into some detail in a more desperate 
"restore anything" context, but here, once you found a root that looked 
promising, you'd use restore's regex option to restore /just/ the file 
you're interested in, as it existed at the time that root was written.

There's actually a btrfs-undelete script on github that turns the 
otherwise multiple manual steps into a nice, smooth, undelete operation.  
Or at least it's supposed to.  I've never actually used it, tho I have 
examined the script out of curiosity to see what it did and how, and it /
looks/ like it should work.  I've kept that trick (and knowledge of where 
to look for the script) filed away in the back of my head in case I need 
it someday. =:^)


-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: btrfs recovery
  2017-01-31  4:58               ` Duncan
@ 2017-01-31 12:45                 ` Austin S. Hemmelgarn
  2017-02-01  4:36                   ` Duncan
  0 siblings, 1 reply; 43+ messages in thread
From: Austin S. Hemmelgarn @ 2017-01-31 12:45 UTC (permalink / raw)
  To: linux-btrfs

On 2017-01-30 23:58, Duncan wrote:
> Oliver Freyermuth posted on Sat, 28 Jan 2017 17:46:24 +0100 as excerpted:
>
>>> Just don't count on restore to save your *** and always treat what it
>>> can often bring to current as a pleasant surprise, and having it fail
>>> won't be a down side, while having it work, if it does, will always be
>>> up side.
>>> =:^)
>>>
>> I'll keep that in mind, and I think that in the future, before trying
>> any "btrfs check" (or even repair)
>> I will always try restore first if my backup was not fresh enough :-).
>
> That's a wise idea, as long as you have the resources to actually be able
> to write the files somewhere (as people running btrfs really should,
> because it's /not/ fully stable yet).
>
> One of the great things about restore is that all the writing it does is
> to the destination filesystem -- it doesn't attempt to actually write or
> repair anything on the filesystem it's trying to restore /from/, so it's
> far lower risk than anything that /does/ actually attempt to write to or
> repair the potentially damaged filesystem.
>
> That makes it /extremely/ useful as a "first, to the extent possibke,
> make sure the backups are safely freshened" tool. =:^)
It also has the interesting side effect that you can (mostly) safely run 
restore against a mounted filesystem.  I've never tried this myself, but 
provided that restore doesn't check whether the FS is mounted first (might 
be something to add an option to disable if it does), the worst that 
could happen is getting a corrupted file out or having restore crash on you.
>
>
> Meanwhile, FWIW, restore can also be used as a sort of undelete tool.
> Remember, btrfs is COW and writes any changes to a new location.  The old
> location tends to stick around, not any more referenced by anything
> "live", but still there until some other change happens to overwrite it.
Note that this becomes harder the more active the FS is.  This is the 
case for most filesystems, but it's a much bigger factor for COW 
filesystems, and even more so for BTRFS (because it will preferentially 
pack data into existing chunks instead of allocating new ones).
>
> Just like undelete on a more conventional filesystem, therefore, as long
> as you notice the problem before the old location has been overwritten
> again, it's often possible to recover it, altho the mechanisms involved
> are rather different on btrfs.  Basically, you use btrfs-find-root to get
> a list of old roots, then point restore at them using the -t option.
> There's a page on the wiki that goes into some detail in a more desperate
> "restore anything" context, but here, once you found a root that looked
> promising, you'd use restore's regex option to restore /just/ the file
> you're interested in, as it existed at the time that root was written.
>
> There's actually a btrfs-undelete script on github that turns the
> otherwise multiple manual steps into a nice, smooth, undelete operation.
> Or at least it's supposed to.  I've never actually used it, tho I have
> examined the script out of curiosity to see what it did and how, and it /
> looks/ like it should work.  I've kept that trick (and knowledge of where
> to look for the script) filed away in the back of my head in case I need
> it someday. =:^)
I've not used the script itself, but I have used the method before 
on a couple of occasions to pull out old versions of files that I should 
have had under some kind of VCS but didn't, and the method does work 
reliably as long as you do it soon.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: btrfs recovery
  2017-01-31 12:45                 ` Austin S. Hemmelgarn
@ 2017-02-01  4:36                   ` Duncan
  0 siblings, 0 replies; 43+ messages in thread
From: Duncan @ 2017-02-01  4:36 UTC (permalink / raw)
  To: linux-btrfs

Austin S. Hemmelgarn posted on Tue, 31 Jan 2017 07:45:42 -0500 as
excerpted:

>> There's actually a btrfs-undelete script on github that turns the
>> otherwise multiple manual steps into a nice, smooth, undelete
>> operation. Or at least it's supposed to.  I've never actually used it,
>> tho I have examined the script out of curiosity to see what it did and
>> how, and it /looks/ like it should work.  I've kept that trick (and
>> knowledge of where to look for the script) filed away in the back of
>> my head in case I need it someday. =:^)

> I've not used the script itself before, but I've used the method before
> on a couple of occasions to pull out old versions of files that I should
> have had under some kind of VCS but didn't, and the method does work
> reliably as long as you do it soon.

From reading the script, the two potentially difficult steps the script 
helpfully automates for you are...

1) going thru the roots find-root has found to find a good one to use

... and...

2) the fiddly regex escaping, so you don't have to pay too much attention 
to that, just feed it a normal path.

IOW, it should be a great help to users who don't know btrfs commands or 
filesystem internals very well, and/or who don't find regex use 
particularly easy.
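
To illustrate 2) above, this is roughly the shape of the escaping
involved (a quick sketch of my own, not code lifted from the script;
the function name and the example path are made up).  My understanding
is that restore's --path-regex wants an anchored regex in which every
parent directory of the target also matches, hence the nested empty
alternatives:

import re

def path_to_restore_regex(path):
    # e.g. /home/user/lost.txt -> ^/(|home(|/user(|/lost\.txt)))$
    parts = [re.escape(p) for p in path.strip('/').split('/')]
    regex = ''
    for i, part in enumerate(parts):
        regex += '(|' + ('' if i == 0 else '/') + part
    return '^/' + regex + ')' * len(parts) + '$'

print(path_to_restore_regex('/home/user/lost.txt'))

The result then gets fed to something like
btrfs restore -t <bytenr from btrfs-find-root> --path-regex '<regex>' <dev> <dest>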

IOW, it'd be an excellent tool to either include in btrfs-tools as-is or 
C-codify and add as a btrfs subcommand, at some point as btrfs nears true 
stability and readiness for ordinary, less technical users.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: btrfs recovery
  2017-01-30 22:37         ` Michael Born
  2017-01-31  0:29           ` GWB
@ 2017-01-31  9:08           ` Graham Cobb
  1 sibling, 0 replies; 43+ messages in thread
From: Graham Cobb @ 2017-01-31  9:08 UTC (permalink / raw)
  To: Btrfs BTRFS

On 30/01/17 22:37, Michael Born wrote:
> Also, I'm not interested in restoring the old Suse 13.2 system. I just
> want some configuration files from it.

If all you really want is to get some important information from some
specific config files, and it is so important it is worth an hour or so
of your time, you could consider a brute-force method such as just
grep-ing the whole image file for a string you know should appear in the
relevant config file and dumping the blocks around those locations to
see if you can see the data you need.
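
Something like the following would do it (a rough sketch; the image name
and the search string are obviously placeholders, and as the next
paragraph says, it only helps where compression was off).  It scans the
raw image for the string and writes a chunk of surrounding bytes per hit
for manual inspection:

image = 'imagefile.dd'       # placeholder: the dd image of the old root fs
needle = b'PermitRootLogin'  # placeholder: text unique to the wanted config file
chunk = 16 * 1024 * 1024     # read the image in 16 MiB pieces
context = 64 * 1024          # how many bytes to dump around each hit
overlap = len(needle) - 1

hits = []
with open(image, 'rb') as f:
    offset = 0
    prev = b''
    while True:
        data = f.read(chunk)
        if not data:
            break
        buf = prev + data
        i = buf.find(needle)
        while i != -1:
            hits.append(offset - len(prev) + i)
            i = buf.find(needle, i + 1)
        # keep an overlap so matches across chunk borders are still found
        prev = data[-overlap:] if overlap else b''
        offset += len(data)

with open(image, 'rb') as f:
    for n, hit in enumerate(hits):
        f.seek(max(0, hit - context // 2))
        out_name = 'hit_%03d.bin' % n
        with open(out_name, 'wb') as out:
            out.write(f.read(context))
        print('match at byte %d -> %s' % (hit, out_name))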

Unfortunately this won't work if you had file compression on. Or if
there is no reasonably unique text to search for, of course. Just a thought.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: btrfs recovery
  2017-01-30 21:07   ` Michael Born
  2017-01-30 21:16     ` Hans van Kranenburg
  2017-01-30 21:20     ` Chris Murphy
@ 2017-01-31  4:30     ` Duncan
  2 siblings, 0 replies; 43+ messages in thread
From: Duncan @ 2017-01-31  4:30 UTC (permalink / raw)
  To: linux-btrfs

Michael Born posted on Mon, 30 Jan 2017 22:07:00 +0100 as excerpted:

> On 30.01.2017 at 21:51, Chris Murphy wrote:
>> On Mon, Jan 30, 2017 at 1:02 PM, Michael Born <Michael.Born@aei.mpg.de>
>> wrote:
>>> Hi btrfs experts.
>>>
>>> Hereby I apply for the stupidity of the month award.
>> 
>> There's still another day :-D
>> 
>> 
>>> Before switching from Suse 13.2 to 42.2, I copied my / partition with
>>> dd to an image file - while the system was online/running.
>>> Now, I can't mount the image.
>> 
>> That won't ever work for any file system. It must be unmounted.
> 
> I could mount and copy the data out of my /home image.dd (encrypted
> xfs). That was also online while dd-ing it.

There's another angle with btrfs that makes block device image copies 
such as that a big problem, even if the dd was done with the filesystem 
unmounted.  This hasn't yet been mentioned in this thread, that I've 
seen, anyway.

* Btrfs takes the filesystem UUID, universally unique ID, at face value, 
considering it *UNIQUE* and actually identifying the various components 
of a possibly multi-device filesystem by the UUID.  Again, this is 
because btrfs, unlike normal filesystems, can be composed of multiple 
devices, so btrfs needs a way to detect what devices form parts of what 
filesystems, and it does this by tracking the UUID and considering 
anything with that UUID (which is supposed to be unique to that 
filesystem, remember, it actually _says_ "unique" in the label, after 
all) to be part of that filesystem.

Now you dd the block device somewhere else, making another copy, and 
btrfs suddenly has more devices that have UUIDs saying they belong to 
this filesystem than it should!

That has actually triggered corruption in some cases, because btrfs gets 
mixed up and writes changes to the wrong device, because after all, it 
*HAS* to be part of the same filesystem, because it has the same 
universally *unique* ID.

Only the supposedly "unique" ID isn't so "unique" any more, because 
someone copied the block device, and now there's two copies of the 
filesystem claiming to be the same one!  "Unique" is no longer "unique" 
and it has created the all too predictable problems as a result.


There are ways to work around the problem.  Basically, don't let btrfs 
see both copies at the same time, and *definitely* don't let it see both 
copies when one is mounted or an attempt is being made to mount it.

(Btrfs "sees" a new device when btrfs device scan is run.  Unfortunately 
for this case, udev tends to run btrfs device scan automatically whenever 
it detects a new device that seems to have btrfs on it.  So it can be 
rather difficult to keep btrfs from seeing it, because udev tends to 
monitor for new devices and see it right away, and tell btrfs about it 
when it does.  But it's possible to avoid damage if you're careful to 
only dd the unmounted filesystem device(s) and to safely hide one copy 
before attempting to mount the other.)


Of course that wasn't the case here.  With the dd of a live-mounted btrfs 
device, it's quite possible that btrfs detected and started writing to 
the dd-destination device instead of the original at some point, screwing 
things up even more than they would have been for a normal filesystem 
live-mounted dd.

In turn, it's quite possible that's why the old xfs /home still mounted, 
but the btrfs / didn't, because the xfs, while potentially damaged a bit, 
didn't suffer the abuse of writes to the wrong device that btrfs may well 
have suffered, due to the non-uniqueness of the supposedly universally 
unique IDs and the very confused btrfs that may well have caused.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: btrfs recovery
  2017-01-30 22:37         ` Michael Born
@ 2017-01-31  0:29           ` GWB
  2017-01-31  9:08           ` Graham Cobb
  1 sibling, 0 replies; 43+ messages in thread
From: GWB @ 2017-01-31  0:29 UTC (permalink / raw)
  To: Michael Born; +Cc: Btrfs BTRFS

Hello, Michael,

Yes, you would certainly run the risk of doing more damage with dd, so
if you have an alternative, use that, and avoid dd.  If nothing else
works and you need the files, you might try it as a last resort.

My guess (and it is only a guess) is that if the image is close to the
same size as the root partition, the file data is there.  But that
doesn't do you much good if btrfs cannot read the "container" or the
specific partition and file system information, which btrfs send
provides.

Does someone on the list know if ext3/4 data recovery tools can also
search btrfs data?  That's another option.

Gordon

On Mon, Jan 30, 2017 at 4:37 PM, Michael Born <Michael.Born@aei.mpg.de> wrote:
> Hi Gordon,
>
> I'm quite sure this is not a good idea.
> I do understand, that dd-ing a running system will miss some changes
> done to the file system while copying. I'm surprised that I didn't end
> up with some corrupted files, but with no files at all.
> Also, I'm not interested in restoring the old Suse 13.2 system. I just
> want some configuration files from it.
>
> Cheers,
> Michael
>
> On 30.01.2017 at 23:24, GWB wrote:
>> <<
>> Hi btrfs experts.
>>
>> Hereby I apply for the stupidity of the month award.
>>>>
>>
>> I have no doubt that I will mount a serious challenge to you for
>> that title, so you haven't won yet.
>>
>> Why not dd the image back onto the original partition (or another
>> partition identical in size) and see if that is readable?
>>
>> My limited experience with btrfs (I am not an expert) is that read
>> only snapshots work well in this situation, but the initial hurdle is
>> using dd to get the image back onto a partition.  So I wonder if you
>> could dd the image back onto the original media (the hd sdd), then
>> make a read only snapshot, and then send the snapshot with btrfs send
>> to another storage medium.  With any luck, the machine might boot, and
>> you might find other snapshots which you may be able to turn into read
>> only snaps for btrfs send.
>>
>> This has worked for me on Ubuntu 14 for quite some time, but luckily I
>> have not had to restore the image file sent from btrfs send yet.  I
>> say luckily, because I realise now that the image created from btrfs
>> send should be tested, but so far no catastrophic failures with my
>> root partition have occurred (knock on wood).
>>
>> dd is (like dumpfs, ddrescue, and the bsd variations) good for what it
>> tries to do, but not so great on for some file systems for more
>> intricate uses.  But why not try:
>>
>> dd if=imagefile.dd of=/dev/sdaX
>>
>> and see if it boots?  If it does not, then perhaps you have another
>> shot at the one time mount for btrfs rw if that works.
>>
>> Or is your root partition now running fine under Suse 42.2, and you
>> are just looking to recover a few files from the image?  If so, you
>> might try to dd from the image to a partition of original size as the
>> previous root, then adjust with gparted or fpart, and see if it is
>> readable.
>>
>> So instead of trying to restore a btrfs file structure, why not just
>> restore a partition with dd that happens to contain a btrfs file
>> structure, and then adjust the partition size to match the original?
>> btrfs cares about the tree structures, etc.  dd does not.
>>
>> What you did is not unusual, and can work fine with a number of file
>> structures, but the potential for disaster with dd is also great.  The
>> only thing I know of in btrfs that does a similar thing is:
>>
>> btrfs send -f btrfs-send-image-file /mount/read-write-snapshot
>>
>> Chances are, of course, good that without having current backups dd
>> could potentially ruin the rest of your file system set up, so maybe
>> transfer the image over to another machine that is expendable and test
>> this out.  I use btrfs on root and zfs for data, and make lots of
>> snapshots and send them to incremental backups (mostly zfs, but btrfs
>> works nicely with Ubuntu on root, with the occasional balance
>> problem).
>>
>> If dd did it, dd might be able to fix it.  Do that first before you
>> try to restore btrfs file structures.
>>
>> Or is this a terrible idea?  Someone else on the list should say so if
>> they know otherwise.
>>
>> Gordon
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: btrfs recovery
  2017-01-30 22:24       ` GWB
@ 2017-01-30 22:37         ` Michael Born
  2017-01-31  0:29           ` GWB
  2017-01-31  9:08           ` Graham Cobb
  0 siblings, 2 replies; 43+ messages in thread
From: Michael Born @ 2017-01-30 22:37 UTC (permalink / raw)
  To: Btrfs BTRFS

Hi Gordon,

I'm quite sure this is not a good idea.
I do understand that dd-ing a running system will miss some changes
done to the file system while copying. I'm surprised that I didn't just
end up with some corrupted files, but with no files at all.
Also, I'm not interested in restoring the old Suse 13.2 system. I just
want some configuration files from it.

Cheers,
Michael

Am 30.01.2017 um 23:24 schrieb GWB:
> <<
> Hi btrfs experts.
> 
> Hereby I apply for the stupidity of the month award.
>>>
> 
> I have no doubt that I will mount a serious challenge to you for
> that title, so you haven't won yet.
> 
> Why not dd the image back onto the original partition (or another
> partition identical in size) and see if that is readable?
> 
> My limited experience with btrfs (I am not an expert) is that read
> only snapshots work well in this situation, but the initial hurdle is
> using dd to get the image back onto a partition.  So I wonder if you
> could dd the image back onto the original media (the hd sdd), then
> make a read only snapshot, and then send the snapshot with btrfs send
> to another storage medium.  With any luck, the machine might boot, and
> you might find other snapshots which you may be able to turn into read
> only snaps for btrfs send.
> 
> This has worked for me on Ubuntu 14 for quite some time, but luckily I
> have not had to restore the image file sent from btrfs send yet.  I
> say luckily, because I realise now that the image created from btrfs
> send should be tested, but so far no catastrophic failures with my
> root partition have occurred (knock on wood).
> 
> dd is (like dumpfs, ddrescue, and the bsd variations) good for what it
> tries to do, but not so great for some file systems for more
> intricate uses.  But why not try:
> 
> dd if=imagefile.dd of=/dev/sdaX
> 
> and see if it boots?  If it does not, then perhaps you have another
> shot at the one time mount for btrfs rw if that works.
> 
> Or is your root partition now running fine under Suse 42.2, and you
> are just looking to recover a few files from the image?  If so, you
> might try to dd from the image to a partition of original size as the
> previous root, then adjust with gparted or fpart, and see if it is
> readable.
> 
> So instead of trying to restore a btrfs file structure, why not just
> restore a partition with dd that happens to contain a btrfs file
> structure, and then adjust the partition size to match the original?
> btrfs cares about the tree structures, etc.  dd does not.
> 
> What you did is not unusual, and can work fine with a number of file
> structures, but the potential for disaster with dd is also great.  The
> only thing I know of in btrfs that does a similar thing is:
> 
> btrfs send -f btrfs-send-image-file /mount/read-write-snapshot
> 
> Chances are, of course, good that without having current backups dd
> could potentially ruin the rest of your file system set up, so maybe
> transfer the image over to another machine that is expendable and test
> this out.  I use btrfs on root and zfs for data, and make lots of
> snapshots and send them to incremental backups (mostly zfs, but btrfs
> works nicely with Ubuntu on root, with the occasional balance
> problem).
> 
> If dd did it, dd might be able to fix it.  Do that first before you
> try to restore btrfs file structures.
> 
> Or is this a terrible idea?  Someone else on the list should say so if
> they know otherwise.
> 
> Gordon


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: btrfs recovery
  2017-01-30 21:16     ` Hans van Kranenburg
@ 2017-01-30 22:24       ` GWB
  2017-01-30 22:37         ` Michael Born
  0 siblings, 1 reply; 43+ messages in thread
From: GWB @ 2017-01-30 22:24 UTC (permalink / raw)
  To: Hans van Kranenburg; +Cc: Michael Born, Btrfs BTRFS

<<
Hi btrfs experts.

Hereby I apply for the stupidity of the month award.
>>

I have no doubt that I will mount a serious challenge to you for
that title, so you haven't won yet.

Why not dd the image back onto the original partition (or another
partition identical in size) and see if that is readable?

My limited experience with btrfs (I am not an expert) is that read
only snapshots work well in this situation, but the initial hurdle is
using dd to get the image back onto a partition.  So I wonder if you
could dd the image back onto the original media (the hd sdd), then
make a read only snapshot, and then send the snapshot with btrfs send
to another storage medium.  With any luck, the machine might boot, and
you might find other snapshots which you may be able to turn into read
only snaps for btrfs send.

This has worked for me on Ubuntu 14 for quite some time, but luckily I
have not had to restore the image file sent from btrfs send yet.  I
say luckily, because I realise now that the image created from btrfs
send should be tested, but so far no catastrophic failures with my
root partition have occurred (knock on wood).

dd is (like dumpfs, ddrescue, and the BSD variations) good for what it
tries to do, but not so great for some file systems in more
intricate uses.  But why not try:

dd if=imagefile.dd of=/dev/sdaX

and see if it boots?  If it does not, then perhaps you have another
shot at the one time mount for btrfs rw if that works.

Or is your root partition now running fine under Suse 42.2, and you
are just looking to recover a few files from the image?  If so, you
might try to dd from the image to a partition of original size as the
previous root, then adjust with gparted or fpart, and see if it is
readable.

So instead of trying to restore a btrfs file structure, why not just
restore a partition with dd that happens to contain a btrfs file
structure, and then adjust the partition size to match the original?
btrfs cares about the tree structures, etc.  dd does not.
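
As a concrete sketch of that (the partition name here is just a
placeholder for an expendable partition at least as large as the image;
anything on it will be destroyed, and on older kernels the usebackuproot
mount option is spelled "recovery"):

dd if=imagefile.dd of=/dev/sdY1 bs=4M conv=fsync status=progress
mount -o ro,usebackuproot /dev/sdY1 /mnt/rescue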

What you did is not unusual, and can work fine with a number of file
structures, but the potential for disaster with dd is also great.  The
only thing I know of in btrfs that does a similar thing is:

btrfs send -f btrfs-send-image-file /mount/read-write-snapshot

Chances are, of course, good that without having current backups dd
could potentially ruin the rest of your file system set up, so maybe
transfer the image over to another machine that is expendable and test
this out.  I use btrfs on root and zfs for data, and make lots of
snapshots and send them to incremental backups (mostly zfs, but btrfs
works nicely with Ubuntu on root, with the occasional balance
problem).

If dd did it, dd might be able to fix it.  Do that first before you
try to restore btrfs file structures.

Or is this a terrible idea?  Someone else on the list should say so if
they know otherwise.

Gordon


On Mon, Jan 30, 2017 at 3:16 PM, Hans van Kranenburg
<hans.van.kranenburg@mendix.com> wrote:
> On 01/30/2017 10:07 PM, Michael Born wrote:
>>
>>
>> Am 30.01.2017 um 21:51 schrieb Chris Murphy:
>>> On Mon, Jan 30, 2017 at 1:02 PM, Michael Born <Michael.Born@aei.mpg.de> wrote:
>>>> Hi btrfs experts.
>>>>
>>>> Hereby I apply for the stupidity of the month award.
>>>
>>> There's still another day :-D
>>>
>>>
>>>
>>>>
>>>> Before switching from Suse 13.2 to 42.2, I copied my / partition with dd
>>>> to an image file - while the system was online/running.
>>>> Now, I can't mount the image.
>>>
>>> That won't ever work for any file system. It must be unmounted.
>>
>> I could mount and copy the data out of my /home image.dd (encrypted
>> xfs). That was also online while dd-ing it.
>>
>>>> Could you give me some instructions how to repair the file system or
>>>> extract some files from it?
>>>
>>> Not possible. The file system was being modified while dd was
>>> happening, so the image you've taken is inconsistent.
>>
>> The files I'm interested in (fstab, NetworkManager.conf, ...) didn't
>> change for months. Why would they change in the moment I copy their
>> blocks with dd?
>
> The metadata of btrfs is organized in a bunch of tree structures. The
> top of the trees (the smallest parts, trees are upside-down here /\ )
> and the superblock get modified quite often. Every time a tree gets
> modified, the new modified parts are written as a modified copy in
> unused space.
>
> So even if the files themselves do not change... if you miss those new
> writes which are being done in space that your dd already left behind...
> you end up with old and new parts of trees all over the place.
>
> In other words, a big puzzle with parts that do not connect with each
> other any more.
>
> And that's exactly what you see in all the errors. E.g. "parent transid
> verify failed on 32869482496 wanted 550112 found 550121" <- a part of a
> tree points to another part, but suddenly something else is found which
> should not be there. In this case wanted 550112 found 550121 means it's
> bumping into something "from the future". Whaaa..
>
> --
> Hans van Kranenburg
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: btrfs recovery
  2017-01-30 21:20     ` Chris Murphy
  2017-01-30 21:35       ` Chris Murphy
@ 2017-01-30 21:40       ` Michael Born
  1 sibling, 0 replies; 43+ messages in thread
From: Michael Born @ 2017-01-30 21:40 UTC (permalink / raw)
  To: Btrfs BTRFS

Am 30.01.2017 um 22:20 schrieb Chris Murphy:
> On Mon, Jan 30, 2017 at 2:07 PM, Michael Born <Michael.Born@aei.mpg.de> wrote:
>> The files I'm interested in (fstab, NetworkManager.conf, ...) didn't
>> change for months. Why would they change in the moment I copy their
>> blocks with dd?
> 
> They didn't change. The file system changed. While dd is reading, it
> might be minutes between capturing different parts of the file system,
> and each superblock is in different locations on the disk,
> guaranteeing that if the dd takes more than 30 seconds, your dd image
> has different generation super blocks. Btrfs notices this at mount
> time and will refuse to mount because the file system is inconsistent.
> 
> It is certainly possible to fix this, but it's likely to be really,
> really tedious. The existing tools don't take this use case into
> account.
> 
> Maybe btrfs-find-root can come up with some suggestions and you can use
> btrfs restore -t with the bytenr from find root, to see if you can get
> this old data, ignoring the changes that don't affect the old data.
> 
> What you do with this is btrfs-find-root and see what it comes up
> with. And work with the most recent (highest) generation going
> backward, plugging in the bytenr into btrfs restore with -t option.
> You'll also want to use the dry run to see if you're getting what you
> want. It's best to use the exact path if you know it, this takes much
> less time for it to search all files in a given tree. If you don't
> know the exact path, but you know part of a file name, then you'll
> need to use the regex option; or just let it dump everything it can
> from the image and go dumpster diving...

I really want to try the "btrfs-find-root / btrfs restore -t" method.
But btrfs-find-root gives me just the 3 lines of output below and then
nothing for 16 hours.
I think I saw a similar report in the mailing list archive that the tool
just doesn't report back (btrfs-find-root duration? Markus Binsteiner,
Sat, 10 Dec 2016 16:12:25 -0800).

./btrfs-find-root /dev/loop0
Couldn't read tree root
Superblock thinks the generation is 550114
Superblock thinks the level is 1

Hans, thank you too for the explanation, even though I'm not sure I
understand.
I would be happy with older parts of the tree, which would then have
lower generation numbers than the 550112.

Michael



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: btrfs recovery
  2017-01-30 21:20     ` Chris Murphy
@ 2017-01-30 21:35       ` Chris Murphy
  2017-01-30 21:40       ` Michael Born
  1 sibling, 0 replies; 43+ messages in thread
From: Chris Murphy @ 2017-01-30 21:35 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Michael Born, Btrfs BTRFS

On Mon, Jan 30, 2017 at 2:20 PM, Chris Murphy <lists@colorremedies.com> wrote:

> What people do with huge databases, which have this same problem,
> they'll take a volume snapshot. This first commits everything in
> flight, freezes the fs so no more changes can happen, then takes a
> snapshot, then unfreezes the original so the database can stay online.
> The freeze takes maybe a second or maybe a bit longer depending on how
> much stuff needs to be committed to stable media. Then backup the
> snapshot as a read-only volume. Once the backup is done, delete the
> snapshot.

In Btrfs land, the way to do it is snapshot a subvolume, and then
rsync or 'btrfs send' the contents of the snapshot somewhere. I
actually often use this for whole volume backups:

## this will capture /boot and /boot/efi on separate file systems and
put the tar on Btrfs root.
cd /
tar -acf boot.tar.gz boot/

## My subvolumes are at the top level, fstab mounts them specifically,
so mount the top level to get access
sudo mount -o noatime <dev> /mnt
## Take a snapshot of rootfs
sudo btrfs sub snap -r /mnt/root /mnt/root.20170130
## Send it to remote server
sudo btrfs send /mnt/root.20170130 | ssh chris@server "cat - > ~/root.20170130.btrfs"
## Restore it from the server, assumes the subvolume/snapshot does not exist
ssh chris@server "cat ~/root.20170130.btrfs" | sudo btrfs receive /mnt/

The same can be done with incremental images, but of course you need
all of the files, named in a sane way so you know in what order to
restore them, since those incrementals are parent/child specific.
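
A minimal sketch of the incremental variant, for what it's worth (the
snapshot names and server path are made up, and it assumes the full
root.20170130 stream above has already been received on the target):

## new read-only snapshot, then send only the delta against the parent
sudo btrfs sub snap -r /mnt/root /mnt/root.20170131
sudo btrfs send -p /mnt/root.20170130 /mnt/root.20170131 | ssh chris@server "cat - > ~/root.20170131.incr.btrfs"
## restore order matters: receive the full stream first, then each incremental
ssh chris@server "cat ~/root.20170131.incr.btrfs" | sudo btrfs receive /mnt/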

The other thing this avoids, critically, is the corruption Btrfs can
suffer whenever two or more copies of the same volume (by UUID) appear
to the kernel at the same time and one of them is mounted. See the
gotchas of block-level copies.

https://btrfs.wiki.kernel.org/index.php/Gotchas

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: btrfs recovery
  2017-01-30 21:07   ` Michael Born
  2017-01-30 21:16     ` Hans van Kranenburg
@ 2017-01-30 21:20     ` Chris Murphy
  2017-01-30 21:35       ` Chris Murphy
  2017-01-30 21:40       ` Michael Born
  2017-01-31  4:30     ` Duncan
  2 siblings, 2 replies; 43+ messages in thread
From: Chris Murphy @ 2017-01-30 21:20 UTC (permalink / raw)
  To: Michael Born; +Cc: Btrfs BTRFS

On Mon, Jan 30, 2017 at 2:07 PM, Michael Born <Michael.Born@aei.mpg.de> wrote:
>
>
> Am 30.01.2017 um 21:51 schrieb Chris Murphy:
>> On Mon, Jan 30, 2017 at 1:02 PM, Michael Born <Michael.Born@aei.mpg.de> wrote:
>>> Hi btrfs experts.
>>>
>>> Hereby I apply for the stupidity of the month award.
>>
>> There's still another day :-D
>>
>>
>>
>>>
>>> Before switching from Suse 13.2 to 42.2, I copied my / partition with dd
>>> to an image file - while the system was online/running.
>>> Now, I can't mount the image.
>>
>> That won't ever work for any file system. It must be unmounted.
>
> I could mount and copy the data out of my /home image.dd (encrypted
> xfs). That was also online while dd-ing it.

If there are no substantial writes happening, it's possible it'll
behave like a power failure, read the journal and continue possibly
with the most recent commits being lost. But any substantial amount of
writes means some part of the volume has changed, the update reflecting
that change is elsewhere, and meanwhile the dd is capturing the volume
at different points in time rather than exactly as it is. It's just not
workable.

What people do with huge databases, which have this same problem, is
take a volume snapshot. This first commits everything in flight, freezes
the fs so no more changes can happen, then takes a snapshot, then
unfreezes the original so the database can stay online. The freeze takes
maybe a second or maybe a bit longer depending on how much stuff needs
to be committed to stable media. Then back up the snapshot as a
read-only volume. Once the backup is done, delete the snapshot.
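
For a generic block device, a rough sketch of that volume-snapshot
approach with LVM might look like this (the volume group and LV names
and the snapshot size are made up; it assumes the filesystem lives on an
LVM logical volume with some free space left in the volume group):

## lvcreate -s briefly freezes the mounted fs, snapshots it, and unfreezes it
sudo lvcreate -s -n root-snap -L 5G /dev/vg0/root
## dd the frozen, consistent snapshot instead of the live device
sudo dd if=/dev/vg0/root-snap of=/backup/root.img bs=4M conv=fsync status=progress
sudo lvremove -y /dev/vg0/root-snap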





>
>>> Could you give me some instructions how to repair the file system or
>>> extract some files from it?
>>
>> Not possible. The file system was being modified while dd was
>> happening, so the image you've taken is inconsistent.
>
> The files I'm interested in (fstab, NetworkManager.conf, ...) didn't
> change for months. Why would they change in the moment I copy their
> blocks with dd?

They didn't change. The file system changed. While dd is reading, it
might be minutes between capturing different parts of the file system,
and each superblock is in a different location on the disk,
guaranteeing that if the dd takes more than 30 seconds, your dd image
has different-generation superblocks. Btrfs notices this at mount
time and will refuse to mount because the file system is inconsistent.
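
One way to see that on the image itself, assuming a reasonably recent
btrfs-progs (older releases ship the same thing as a standalone
btrfs-show-super tool), is to dump all superblock copies and compare
their generation fields:

sudo btrfs inspect-internal dump-super -a /dev/loop0 | grep -E 'superblock:|^generation'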

It is certainly possible to fix this, but it's likely to be really,
really tedious. The existing tools don't take this use case into
account.

Maybe btrfs-find-root can come up with some suggestions and you can use
btrfs restore -t with the bytenr from find root, to see if you can get
this old data, ignoring the changes that don't affect the old data.

What you do with this is btrfs-find-root and see what it comes up
with. And work with the most recent (highest) generation going
backward, plugging in the bytenr into btrfs restore with -t option.
You'll also want to use the dry run to see if you're getting what you
want. It's best to use the exact path if you know it, this takes much
less time for it to search all files in a given tree. If you don't
know the exact path, but you know part of a file name, then you'll
need to use the regex option; or just let it dump everything it can
from the image and go dumpster diving...
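
As a rough sketch of that workflow (the bytenr is a placeholder for
whatever generation/bytenr pairs btrfs-find-root actually reports, and
/mnt/rescue is just an arbitrary destination directory):

sudo btrfs-find-root /dev/loop0
## then, for each candidate bytenr, newest generation first:
sudo btrfs restore -D -t <bytenr> /dev/loop0 /tmp    # dry run, only lists what it would restore
sudo btrfs restore -t <bytenr> --path-regex '^/(|etc(|/fstab))$' /dev/loop0 /mnt/rescue/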



-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: btrfs recovery
  2017-01-30 21:07   ` Michael Born
@ 2017-01-30 21:16     ` Hans van Kranenburg
  2017-01-30 22:24       ` GWB
  2017-01-30 21:20     ` Chris Murphy
  2017-01-31  4:30     ` Duncan
  2 siblings, 1 reply; 43+ messages in thread
From: Hans van Kranenburg @ 2017-01-30 21:16 UTC (permalink / raw)
  To: Michael Born, Btrfs BTRFS

On 01/30/2017 10:07 PM, Michael Born wrote:
> 
> 
> Am 30.01.2017 um 21:51 schrieb Chris Murphy:
>> On Mon, Jan 30, 2017 at 1:02 PM, Michael Born <Michael.Born@aei.mpg.de> wrote:
>>> Hi btrfs experts.
>>>
>>> Hereby I apply for the stupidity of the month award.
>>
>> There's still another day :-D
>>
>>
>>
>>>
>>> Before switching from Suse 13.2 to 42.2, I copied my / partition with dd
>>> to an image file - while the system was online/running.
>>> Now, I can't mount the image.
>>
>> That won't ever work for any file system. It must be unmounted.
> 
> I could mount and copy the data out of my /home image.dd (encrypted
> xfs). That was also online while dd-ing it.
> 
>>> Could you give me some instructions how to repair the file system or
>>> extract some files from it?
>>
>> Not possible. The file system was being modified while dd was
>> happening, so the image you've taken is inconsistent.
> 
> The files I'm interested in (fstab, NetworkManager.conf, ...) didn't
> change for months. Why would they change in the moment I copy their
> blocks with dd?

The metadata of btrfs is organized in a bunch of tree structures. The
top of the trees (the smallest parts, trees are upside-down here /\ )
and the superblock get modified quite often. Every time a tree gets
modified, the new modified parts are written as a modified copy in
unused space.

So even if the files themselves do not change... if you miss those new
writes which are being done in space that your dd already left behind...
you end up with old and new parts of trees all over the place.

In other words, a big puzzle with parts that do not connect with each
other any more.

And that's exactly what you see in all the errors. E.g. "parent transid
verify failed on 32869482496 wanted 550112 found 550121" <- a part of a
tree points to another part, but suddenly something else is found which
should not be there. In this case wanted 550112 found 550121 means it's
bumping into something "from the future". Whaaa..

-- 
Hans van Kranenburg

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: btrfs recovery
  2017-01-30 20:51 ` Chris Murphy
@ 2017-01-30 21:07   ` Michael Born
  2017-01-30 21:16     ` Hans van Kranenburg
                       ` (2 more replies)
  0 siblings, 3 replies; 43+ messages in thread
From: Michael Born @ 2017-01-30 21:07 UTC (permalink / raw)
  To: Btrfs BTRFS



Am 30.01.2017 um 21:51 schrieb Chris Murphy:
> On Mon, Jan 30, 2017 at 1:02 PM, Michael Born <Michael.Born@aei.mpg.de> wrote:
>> Hi btrfs experts.
>>
>> Hereby I apply for the stupidity of the month award.
> 
> There's still another day :-D
> 
> 
> 
>>
>> Before switching from Suse 13.2 to 42.2, I copied my / partition with dd
>> to an image file - while the system was online/running.
>> Now, I can't mount the image.
> 
> That won't ever work for any file system. It must be unmounted.

I could mount and copy the data out of my /home image.dd (encrypted
xfs). That was also online while dd-ing it.

>> Could you give me some instructions how to repair the file system or
>> extract some files from it?
> 
> Not possible. The file system was being modified while dd was
> happening, so the image you've taken is inconsistent.

The files I'm interested in (fstab, NetworkManager.conf, ...) didn't
change for months. Why would they change in the moment I copy their
blocks with dd?

Michael




^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: btrfs recovery
  2017-01-30 20:02 Michael Born
  2017-01-30 20:27 ` Hans van Kranenburg
@ 2017-01-30 20:51 ` Chris Murphy
  2017-01-30 21:07   ` Michael Born
  1 sibling, 1 reply; 43+ messages in thread
From: Chris Murphy @ 2017-01-30 20:51 UTC (permalink / raw)
  To: Michael Born; +Cc: Btrfs BTRFS

On Mon, Jan 30, 2017 at 1:02 PM, Michael Born <Michael.Born@aei.mpg.de> wrote:
> Hi btrfs experts.
>
> Hereby I apply for the stupidity of the month award.

There's still another day :-D



>
> Before switching from Suse 13.2 to 42.2, I copied my / partition with dd
> to an image file - while the system was online/running.
> Now, I can't mount the image.

That won't ever work for any file system. It must be unmounted.


> Could you give me some instructions how to repair the file system or
> extract some files from it?

Not possible. The file system was being modified while dd was
happening, so the image you've taken is inconsistent.



-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: btrfs recovery
  2017-01-30 20:02 Michael Born
@ 2017-01-30 20:27 ` Hans van Kranenburg
  2017-01-30 20:51 ` Chris Murphy
  1 sibling, 0 replies; 43+ messages in thread
From: Hans van Kranenburg @ 2017-01-30 20:27 UTC (permalink / raw)
  To: Michael Born, linux-btrfs

On 01/30/2017 09:02 PM, Michael Born wrote:
> Hi btrfs experts.
> 
> Hereby I apply for the stupidity of the month award.
> But, maybe you can help me restoring my dd backup or extracting some
> files from it?
> 
> Before switching from Suse 13.2 to 42.2, I copied my / partition with dd
> to an image file - while the system was online/running.
> Now, I can't mount the image.

Making a block level copy of a filesystem while it is online and being
modified has a near 100% chance of producing a corrupt result.

Simply think of the fact that something gets written somewhere at the
end of the disk which also relates to something that gets written at the
beginning of the disk, while your dd copy is doing its thing somewhere
in between...

-- 
Hans van Kranenburg

^ permalink raw reply	[flat|nested] 43+ messages in thread

* btrfs recovery
@ 2017-01-30 20:02 Michael Born
  2017-01-30 20:27 ` Hans van Kranenburg
  2017-01-30 20:51 ` Chris Murphy
  0 siblings, 2 replies; 43+ messages in thread
From: Michael Born @ 2017-01-30 20:02 UTC (permalink / raw)
  To: linux-btrfs

Hi btrfs experts.

Hereby I apply for the stupidity of the month award.
But, maybe you can help me restoring my dd backup or extracting some
files from it?

Before switching from Suse 13.2 to 42.2, I copied my / partition with dd
to an image file - while the system was online/running.
Now, I can't mount the image.

I tried many commands (some output is below) that are suggested in the
wiki or blog pages without any success.
Unfortunately, the promising tool btrfs-find-root seems not to work. I
let it run on backup1.dd for 16 hours with the only output being:
./btrfs-find-root /dev/loop0
Couldn't read tree root
Superblock thinks the generation is 550114
Superblock thinks the level is 1

I then stopped it manually. (The 60 GB dd file is on an SSD and one CPU
core was at 100% load all night.)
I also tried the git clone of btrfs-progs which I checked out (the
tagged versions 4.9, 4.7, 4.4, 4.1) and compiled. I always got the
btrfs-find-root output as shown above.

Could you give me some instructions how to repair the file system or
extract some files from it?

Thank you,
Michael

PS: could you please CC me, as I'm not subscribed to the list.

Some commands and their output.

mount -t btrfs -o recovery,ro /dev/loop0 /mnt/oldroot/
mount: wrong fs type, bad option, bad superblock on /dev/loop0,
missing codepage or helper program, or other error

dmesg -T says:
[Mo Jan 30 01:08:20 2017] BTRFS info (device loop0): enabling auto recovery
[Mo Jan 30 01:08:20 2017] BTRFS info (device loop0): disk space caching
is enabled
[Mo Jan 30 01:08:20 2017] BTRFS error (device loop0): bad tree block
start 0 32865271808
[Mo Jan 30 01:08:20 2017] BTRFS: failed to read tree root on loop0
[Mo Jan 30 01:08:20 2017] BTRFS error (device loop0): bad tree block
start 0 32865271808
[Mo Jan 30 01:08:20 2017] BTRFS: failed to read tree root on loop0
[Mo Jan 30 01:08:20 2017] BTRFS error (device loop0): bad tree block
start 0 32862011392
[Mo Jan 30 01:08:20 2017] BTRFS: failed to read tree root on loop0
[Mo Jan 30 01:08:20 2017] BTRFS error (device loop0): parent transid
verify failed on 32869482496 wanted 550112 found 550121
[Mo Jan 30 01:08:20 2017] BTRFS: failed to read tree root on loop0
[Mo Jan 30 01:08:20 2017] BTRFS error (device loop0): bad tree block
start 0 32865353728
[Mo Jan 30 01:08:20 2017] BTRFS: failed to read tree root on loop0
[Mo Jan 30 01:08:20 2017] BTRFS: open_ctree failed

---

btrfs fi show
Label: none  uuid: 1c203c00-2768-4ea8-9e00-94aba5825394
        Total devices 1 FS bytes used 29.28GiB
        devid    1 size 60.00GiB used 32.07GiB path /dev/sda2

Label: none  uuid: 91a79eeb-08e0-470e-beab-916b38e09aca
        Total devices 1 FS bytes used 44.23GiB
        devid    1 size 60.00GiB used 60.00GiB path /dev/loop0

The 1st one is my now running Suse 42.2 /

---

btrfs check /dev/loop0
checksum verify failed on 32865271808 found E4E3BDB6 wanted 00000000
checksum verify failed on 32865271808 found E4E3BDB6 wanted 00000000
bytenr mismatch, want=32865271808, have=0
Couldn't read tree root
ERROR: cannot open file system

---

./btrfs restore -l /dev/loop0
checksum verify failed on 32865271808 found E4E3BDB6 wanted 00000000
checksum verify failed on 32865271808 found E4E3BDB6 wanted 00000000
bytenr mismatch, want=32865271808, have=0
Couldn't read tree root
Could not open root, trying backup super
checksum verify failed on 32865271808 found E4E3BDB6 wanted 00000000
checksum verify failed on 32865271808 found E4E3BDB6 wanted 00000000
bytenr mismatch, want=32865271808, have=0
Couldn't read tree root
Could not open root, trying backup super
ERROR: superblock bytenr 274877906944 is larger than device size 64428703744
Could not open root, trying backup super

---

uname -a
Linux linux-azo5 4.4.36-8-default #1 SMP Fri Dec 9 16:18:38 UTC 2016
(3ec5648) x86_64 x86_64 x86_64 GNU/Linux

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: btrfs recovery
  2017-01-23 11:15   ` Sebastian Gottschall
@ 2017-01-24  0:39     ` Qu Wenruo
  0 siblings, 0 replies; 43+ messages in thread
From: Qu Wenruo @ 2017-01-24  0:39 UTC (permalink / raw)
  To: Sebastian Gottschall, linux-btrfs, Duncan



At 01/23/2017 07:15 PM, Sebastian Gottschall wrote:
> Hello again
>
> by the way. the init-extent-tree is still running (now almost 7 days).
> is there any chance to find out how long it will take at the end?
>
> Sebastian

I think it may have hit a dead loop.

If its output doesn't loop (on a large scale), then you could try waiting.

But by the point you are trying --init-extent-tree, your chance of 
recovering your fs is quite low anyway.
Sorry for that.

Thanks,
Qu

>
> Am 20.01.2017 um 02:08 schrieb Qu Wenruo:
>>
>>
>> At 01/19/2017 06:06 PM, Sebastian Gottschall wrote:
>>> Hello
>>>
>>> I have a question. after a power outage my system was turning into a
>>> unrecoverable state using btrfs (kernel 4.9)
>>> since im running --init-extent-tree now for 3 days i'm asking how long
>>> this process normally takes and why it outputs millions of lines like
>>
>> --init-extent-tree will trash *ALL* current extent tree, and *REBUILD*
>> them from fs-tree.
>>
>> This can takes a long time depending on the size of the fs, and how
>> many shared extents there are (snapshots and reflinks all counts).
>>
>> Such a huge operation should only be used if you're sure only extent
>> tree is corrupted, and other tree are all OK.
>>
>> Or you'll just totally screw your fs further, especially when
>> interrupted.
>>
>>>
>>> Backref 1562890240 root 262 owner 483059214 offset 0 num_refs 0 not
>>> found in extent tree
>>> Incorrect local backref count on 1562890240 root 262 owner 483059214
>>> offset 0 found 1 wanted 0 back 0x23b0211d0
>>> backpointer mismatch on [1562890240 4096]
>>
>> This is common, since --init-extent-tree trash all extent tree, so
>> every tree-block/data extent will trigger such output
>>
>>> adding new data backref on 1562890240 root 262 owner 483059214 offset 0
>>> found 1
>>> Repaired extent references for 1562890240
>>
>> But as you see, it repaired the extent tree by adding back
>> EXTENT_ITEM/METADATA_ITEM into extent tree, so far it works.
>>
>> If you see such output with all the same bytenr, then things goes
>> really wrong and maybe a dead loop.
>>
>>
>> Personally speaking, normal problem like failed to mount should not
>> need --init-extent-tree.
>>
>> Especially, extent-tree corruption normally is not really related to
>> mount failure, but sudden remount to RO and kernel warning.
>>
>> Thanks,
>> Qu
>>
>>>
>>> please avoid typical answers like "potential dangerous operation" since
>>> all repair options are declared as potenial dangerous.
>>>
>>>
>>> Sebastian
>>>
>>
>>
>>
>
>



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: btrfs recovery
  2017-01-20  1:08 ` Qu Wenruo
  2017-01-20  9:45   ` Sebastian Gottschall
@ 2017-01-23 11:15   ` Sebastian Gottschall
  2017-01-24  0:39     ` Qu Wenruo
  1 sibling, 1 reply; 43+ messages in thread
From: Sebastian Gottschall @ 2017-01-23 11:15 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs, Duncan

Hello again

by the way. the init-extent-tree is still running (now almost 7 days). 
is there any chance to find out how long it will take at the end?

Sebastian

Am 20.01.2017 um 02:08 schrieb Qu Wenruo:
>
>
> At 01/19/2017 06:06 PM, Sebastian Gottschall wrote:
>> Hello
>>
>> I have a question. after a power outage my system was turning into a
>> unrecoverable state using btrfs (kernel 4.9)
>> since im running --init-extent-tree now for 3 days i'm asking how long
>> this process normally takes and why it outputs millions of lines like
>
> --init-extent-tree will trash *ALL* current extent tree, and *REBUILD* 
> them from fs-tree.
>
> This can takes a long time depending on the size of the fs, and how 
> many shared extents there are (snapshots and reflinks all counts).
>
> Such a huge operation should only be used if you're sure only extent 
> tree is corrupted, and other tree are all OK.
>
> Or you'll just totally screw your fs further, especially when 
> interrupted.
>
>>
>> Backref 1562890240 root 262 owner 483059214 offset 0 num_refs 0 not
>> found in extent tree
>> Incorrect local backref count on 1562890240 root 262 owner 483059214
>> offset 0 found 1 wanted 0 back 0x23b0211d0
>> backpointer mismatch on [1562890240 4096]
>
> This is common, since --init-extent-tree trash all extent tree, so 
> every tree-block/data extent will trigger such output
>
>> adding new data backref on 1562890240 root 262 owner 483059214 offset 0
>> found 1
>> Repaired extent references for 1562890240
>
> But as you see, it repaired the extent tree by adding back 
> EXTENT_ITEM/METADATA_ITEM into extent tree, so far it works.
>
> If you see such output with all the same bytenr, then things goes 
> really wrong and maybe a dead loop.
>
>
> Personally speaking, normal problem like failed to mount should not 
> need --init-extent-tree.
>
> Especially, extent-tree corruption normally is not really related to 
> mount failure, but sudden remount to RO and kernel warning.
>
> Thanks,
> Qu
>
>>
>> please avoid typical answers like "potential dangerous operation" since
>> all repair options are declared as potenial dangerous.
>>
>>
>> Sebastian
>>
>
>
>


-- 
Mit freundlichen Grüssen / Regards

Sebastian Gottschall / CTO

NewMedia-NET GmbH - DD-WRT
Firmensitz:  Berliner Ring 101, 64625 Bensheim
Registergericht: Amtsgericht Darmstadt, HRB 25473
Geschäftsführer: Peter Steinhäuser, Christian Scheele
http://www.dd-wrt.com
email: s.gottschall@dd-wrt.com
Tel.: +496251-582650 / Fax: +496251-5826565


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: btrfs recovery
  2017-01-20  8:05 ` Duncan
@ 2017-01-20  9:59   ` Sebastian Gottschall
  0 siblings, 0 replies; 43+ messages in thread
From: Sebastian Gottschall @ 2017-01-20  9:59 UTC (permalink / raw)
  To: Duncan, linux-btrfs

Am 20.01.2017 um 09:05 schrieb Duncan:
> Sebastian Gottschall posted on Thu, 19 Jan 2017 11:06:19 +0100 as
> excerpted:
>
>> I have a question. after a power outage my system was turning into a
>> unrecoverable state using btrfs (kernel 4.9)
>> since im running --init-extent-tree now for 3 days i'm asking how long
>> this process normally takes
> QW has the better direct answer for you, but...
>
> This is just a note to remind you, in general questions like "how long"
> can be better answered if we know the size of your filesystem, the mode
> (how many devices and what duplication mode for data and metadata) and
> something about how you use it -- how many subvolumes and snapshots you
> have, whether you have quotas enabled, etc.
Hard to give an answer right now since the fs is still in 
--init-extent-tree, so I cannot get any details from it while running 
this process.
It was a standard openSUSE 42.1 installation with btrfs as rootfs. The 
size is about 1.8 TB. No soft RAID; it's a hardware RAID6 system using 
an Areca controller, running all as a single device.
>
>
> Normally output from commands like btrfs fi usage can answer most of the
> filesystem size and mode stuff, but of course that command requires a
> mount, and you're doing an unmounted check ATM.  However, btrfs fi show
> should still work and give us basic information like file size and number
> of devices, and you can fill in the blanks from there.
0:rescue:~ # btrfs.static fi show
Label: none  uuid: 946b1a04-c321-4a24-bfb4-d6dcfa8b52dc
         Total devices 1 FS bytes used 1.15TiB
         devid    1 size 1.62TiB used 1.37TiB path /dev/sda3

>
> You did mention the kernel version (4.9) however, something that a lot of
> reports miss, and you're current, so kudos for that. =:^)
I was reading other reports first, so I know what's expected :-)
Besides this, I'm a Linux developer as well, so I know what's most 
important to report, and most systems I run are almost up to date.
>
> As to your question, assuming a terabyte scale filesystem, as QW
> suggested, a full extent tree rebuild is a big job and could indeed take
> awhile (days).
4992 minutes now, so about 3.5 days
>
>  From a practical perspective...
>
> Given the state of btrfs as a still stabilizing and maturing filesystem,
> having backups for any data you value more than the time and hassle
> necessary to do the backup is even more a given than on a fully stable
> filesystem, which means, given the time for an extent tree rebuild on
> that size of a filesystem, unless you're doing the rebuild specifically
> to get the experience or test the code, as a practical matter it's
> probably simply easier to restore from that backup if you valued the data
> enough to have one, or simply scrap the filesystem and start over if you
> considered the data worth less than the time and hassle of a backup, and
> thus didn't have one.
I have a backup for sure for the worst case, it's just not always up to 
date. Which means I might lose at most 6-7 days of minor work, since I 
cannot mirror the whole filesystem every second.
Source code in the repository is safe for sure and nothing will be lost 
there, but it always takes some time to get the backup back onto the 
system, reinstall the OS, etc. My OS is not very vanilla; it's all a 
little bit customized, and I'm not sure how I did it last time, so it 
would take some time to find the right path back. So it's worth trying 
without going the hard way.
>


-- 
Mit freundlichen Grüssen / Regards

Sebastian Gottschall / CTO

NewMedia-NET GmbH - DD-WRT
Firmensitz:  Berliner Ring 101, 64625 Bensheim
Registergericht: Amtsgericht Darmstadt, HRB 25473
Geschäftsführer: Peter Steinhäuser, Christian Scheele
http://www.dd-wrt.com
email: s.gottschall@dd-wrt.com
Tel.: +496251-582650 / Fax: +496251-5826565


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: btrfs recovery
  2017-01-20  1:08 ` Qu Wenruo
@ 2017-01-20  9:45   ` Sebastian Gottschall
  2017-01-23 11:15   ` Sebastian Gottschall
  1 sibling, 0 replies; 43+ messages in thread
From: Sebastian Gottschall @ 2017-01-20  9:45 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

Am 20.01.2017 um 02:08 schrieb Qu Wenruo:
>
>
> At 01/19/2017 06:06 PM, Sebastian Gottschall wrote:
>> Hello
>>
>> I have a question. after a power outage my system was turning into a
>> unrecoverable state using btrfs (kernel 4.9)
>> since im running --init-extent-tree now for 3 days i'm asking how long
>> this process normally takes and why it outputs millions of lines like
>
> --init-extent-tree will trash *ALL* current extent tree, and *REBUILD* 
> them from fs-tree.
>
> This can takes a long time depending on the size of the fs, and how 
> many shared extents there are (snapshots and reflinks all counts).
It's about 1.8 TB, so not a huge size, but millions of files. It's a 
build server.
>
> Such a huge operation should only be used if you're sure only extent 
> tree is corrupted, and other tree are all OK.
Since operations like zero-log don't help and scrub cancels after 5 
seconds with an error (can't remember the exact error right now),
I'm sure there is something corrupt in it.
>
> Or you'll just totally screw your fs further, especially when 
> interrupted.
It has been running for 4 days now, and for sure I won't interrupt it 
at this state.
>
>>
>> Backref 1562890240 root 262 owner 483059214 offset 0 num_refs 0 not
>> found in extent tree
>> Incorrect local backref count on 1562890240 root 262 owner 483059214
>> offset 0 found 1 wanted 0 back 0x23b0211d0
>> backpointer mismatch on [1562890240 4096]
>
> This is common, since --init-extent-tree trash all extent tree, so 
> every tree-block/data extent will trigger such output
>
>> adding new data backref on 1562890240 root 262 owner 483059214 offset 0
>> found 1
>> Repaired extent references for 1562890240
>
> But as you see, it repaired the extent tree by adding back 
> EXTENT_ITEM/METADATA_ITEM into extent tree, so far it works.
>
> If you see such output with all the same bytenr, then things goes 
> really wrong and maybe a dead loop.
They are all incremental, so it looks okay then. I just don't know 
where the end is.
>
> Personally speaking, normal problem like failed to mount should not 
> need --init-extent-tree.
>
> Especially, extent-tree corruption normally is not really related to 
> mount failure, but sudden remount to RO and kernel warning.
Initially I was able to mount the fs, but it turned read-only.
>
> Thanks,
> Qu
>
>>
>> please avoid typical answers like "potential dangerous operation" since
>> all repair options are declared as potenial dangerous.
>>
>>
>> Sebastian
>>
>
>
>


-- 
Mit freundlichen Grüssen / Regards

Sebastian Gottschall / CTO

NewMedia-NET GmbH - DD-WRT
Firmensitz:  Berliner Ring 101, 64625 Bensheim
Registergericht: Amtsgericht Darmstadt, HRB 25473
Geschäftsführer: Peter Steinhäuser, Christian Scheele
http://www.dd-wrt.com
email: s.gottschall@dd-wrt.com
Tel.: +496251-582650 / Fax: +496251-5826565


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: btrfs recovery
  2017-01-19 10:06 Sebastian Gottschall
  2017-01-20  1:08 ` Qu Wenruo
@ 2017-01-20  8:05 ` Duncan
  2017-01-20  9:59   ` Sebastian Gottschall
  1 sibling, 1 reply; 43+ messages in thread
From: Duncan @ 2017-01-20  8:05 UTC (permalink / raw)
  To: linux-btrfs

Sebastian Gottschall posted on Thu, 19 Jan 2017 11:06:19 +0100 as
excerpted:

> I have a question. after a power outage my system was turning into a
> unrecoverable state using btrfs (kernel 4.9)
> since im running --init-extent-tree now for 3 days i'm asking how long
> this process normally takes

QW has the better direct answer for you, but...

This is just a note to remind you, in general questions like "how long" 
can be better answered if we know the size of your filesystem, the mode 
(how many devices and what duplication mode for data and metadata) and 
something about how you use it -- how many subvolumes and snapshots you 
have, whether you have quotas enabled, etc.

Normally output from commands like btrfs fi usage can answer most of the 
filesystem size and mode stuff, but of course that command requires a 
mount, and you're doing an unmounted check ATM.  However, btrfs fi show 
should still work and give us basic information like file size and number 
of devices, and you can fill in the blanks from there.

You did mention the kernel version (4.9) however, something that a lot of 
reports miss, and you're current, so kudos for that. =:^)

As to your question, assuming a terabyte scale filesystem, as QW 
suggested, a full extent tree rebuild is a big job and could indeed take 
awhile (days).

From a practical perspective...

Given the state of btrfs as a still stabilizing and maturing filesystem, 
having backups for any data you value more than the time and hassle 
necessary to do the backup is even more a given than on a fully stable 
filesystem, which means, given the time for an extent tree rebuild on 
that size of a filesystem, unless you're doing the rebuild specifically 
to get the experience or test the code, as a practical matter it's 
probably simply easier to restore from that backup if you valued the data 
enough to have one, or simply scrap the filesystem and start over if you 
considered the data worth less than the time and hassle of a backup, and 
thus didn't have one.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: btrfs recovery
  2017-01-19 10:06 Sebastian Gottschall
@ 2017-01-20  1:08 ` Qu Wenruo
  2017-01-20  9:45   ` Sebastian Gottschall
  2017-01-23 11:15   ` Sebastian Gottschall
  2017-01-20  8:05 ` Duncan
  1 sibling, 2 replies; 43+ messages in thread
From: Qu Wenruo @ 2017-01-20  1:08 UTC (permalink / raw)
  To: Sebastian Gottschall, linux-btrfs



At 01/19/2017 06:06 PM, Sebastian Gottschall wrote:
> Hello
>
> I have a question. after a power outage my system was turning into a
> unrecoverable state using btrfs (kernel 4.9)
> since im running --init-extent-tree now for 3 days i'm asking how long
> this process normally takes and why it outputs millions of lines like

--init-extent-tree will trash *ALL* of the current extent tree, and 
*REBUILD* it from the fs-trees.

This can take a long time depending on the size of the fs, and how many 
shared extents there are (snapshots and reflinks all count).

Such a huge operation should only be used if you're sure only the extent 
tree is corrupted, and the other trees are all OK.

Or you'll just totally screw your fs further, especially if interrupted.
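
A minimal way to check that first, as a sketch (/dev/sdX is a 
placeholder; without --repair, btrfs check does not modify anything):

btrfs check /dev/sdX 2>&1 | tee check.log

Only if the reported errors are confined to the extent tree does 
--init-extent-tree really make sense.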

>
> Backref 1562890240 root 262 owner 483059214 offset 0 num_refs 0 not
> found in extent tree
> Incorrect local backref count on 1562890240 root 262 owner 483059214
> offset 0 found 1 wanted 0 back 0x23b0211d0
> backpointer mismatch on [1562890240 4096]

This is common: since --init-extent-tree trashes the whole extent tree, 
every tree-block/data extent will trigger such output.

> adding new data backref on 1562890240 root 262 owner 483059214 offset 0
> found 1
> Repaired extent references for 1562890240

But as you see, it repaired the extent tree by adding the 
EXTENT_ITEM/METADATA_ITEM entries back into the extent tree, so far it 
works.

If you see such output with all the same bytenr, then things have gone 
really wrong and it may be a dead loop.


Personally speaking, a normal problem like a mount failure should not 
need --init-extent-tree.

In particular, extent-tree corruption is normally not related to mount 
failure, but to a sudden remount to RO and a kernel warning.

Thanks,
Qu

>
> please avoid typical answers like "potential dangerous operation" since
> all repair options are declared as potenial dangerous.
>
>
> Sebastian
>



^ permalink raw reply	[flat|nested] 43+ messages in thread

* btrfs recovery
@ 2017-01-19 10:06 Sebastian Gottschall
  2017-01-20  1:08 ` Qu Wenruo
  2017-01-20  8:05 ` Duncan
  0 siblings, 2 replies; 43+ messages in thread
From: Sebastian Gottschall @ 2017-01-19 10:06 UTC (permalink / raw)
  To: linux-btrfs

Hello

I have a question. After a power outage my system turned into an 
unrecoverable state using btrfs (kernel 4.9).
Since I have been running --init-extent-tree for 3 days now, I'm asking 
how long this process normally takes and why it outputs millions of 
lines like

Backref 1562890240 root 262 owner 483059214 offset 0 num_refs 0 not 
found in extent tree
Incorrect local backref count on 1562890240 root 262 owner 483059214 
offset 0 found 1 wanted 0 back 0x23b0211d0
backpointer mismatch on [1562890240 4096]
adding new data backref on 1562890240 root 262 owner 483059214 offset 0 
found 1
Repaired extent references for 1562890240

Please avoid typical answers like "potentially dangerous operation", 
since all repair options are declared as potentially dangerous.


Sebastian

-- 
Mit freundlichen Grüssen / Regards

Sebastian Gottschall / CTO

NewMedia-NET GmbH - DD-WRT
Firmensitz:  Berliner Ring 101, 64625 Bensheim
Registergericht: Amtsgericht Darmstadt, HRB 25473
Geschäftsführer: Peter Steinhäuser, Christian Scheele
http://www.dd-wrt.com
email: s.gottschall@dd-wrt.com
Tel.: +496251-582650 / Fax: +496251-5826565


^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2017-02-01  4:36 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-01-26  9:18 btrfs recovery Oliver Freyermuth
2017-01-26  9:25 ` Hugo Mills
2017-01-26  9:36   ` Oliver Freyermuth
2017-01-26 10:00     ` Hugo Mills
2017-01-26 11:01     ` Oliver Freyermuth
2017-01-27 11:01       ` Oliver Freyermuth
2017-01-27 12:58         ` Austin S. Hemmelgarn
2017-01-28  5:00           ` Duncan
2017-01-28 12:37             ` Janos Toth F.
2017-01-28 16:51               ` Oliver Freyermuth
2017-01-28 16:46             ` Oliver Freyermuth
2017-01-31  4:58               ` Duncan
2017-01-31 12:45                 ` Austin S. Hemmelgarn
2017-02-01  4:36                   ` Duncan
2017-01-30 12:41             ` Austin S. Hemmelgarn
2017-01-28 21:04       ` Oliver Freyermuth
2017-01-28 22:27         ` Hans van Kranenburg
2017-01-29  2:02           ` Oliver Freyermuth
2017-01-29 16:44             ` Hans van Kranenburg
2017-01-29 19:09               ` Oliver Freyermuth
2017-01-29 19:28                 ` Hans van Kranenburg
2017-01-29 19:52                   ` Oliver Freyermuth
2017-01-29 20:13                     ` Hans van Kranenburg
  -- strict thread matches above, loose matches on Subject: below --
2017-01-30 20:02 Michael Born
2017-01-30 20:27 ` Hans van Kranenburg
2017-01-30 20:51 ` Chris Murphy
2017-01-30 21:07   ` Michael Born
2017-01-30 21:16     ` Hans van Kranenburg
2017-01-30 22:24       ` GWB
2017-01-30 22:37         ` Michael Born
2017-01-31  0:29           ` GWB
2017-01-31  9:08           ` Graham Cobb
2017-01-30 21:20     ` Chris Murphy
2017-01-30 21:35       ` Chris Murphy
2017-01-30 21:40       ` Michael Born
2017-01-31  4:30     ` Duncan
2017-01-19 10:06 Sebastian Gottschall
2017-01-20  1:08 ` Qu Wenruo
2017-01-20  9:45   ` Sebastian Gottschall
2017-01-23 11:15   ` Sebastian Gottschall
2017-01-24  0:39     ` Qu Wenruo
2017-01-20  8:05 ` Duncan
2017-01-20  9:59   ` Sebastian Gottschall
