linux-btrfs.vger.kernel.org archive mirror
* btrfs goes read-only when btrfs-cleaner runs
@ 2019-01-13 21:51 Oliver Freyermuth
  2019-01-14  0:48 ` Oliver Freyermuth
  2019-01-16  0:41 ` Chris Murphy
  0 siblings, 2 replies; 12+ messages in thread
From: Oliver Freyermuth @ 2019-01-13 21:51 UTC (permalink / raw)
  To: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 676 bytes --]

(resending with gzipped attachment)

Dear BTRFS experts,

I just upgraded to 4.20.1 from 4.19 (not sure if related) and my btrfs backup volume entered read-only mode when running btrfs-cleaner,
i.e. when purging old subvolumes. 

I have attached the kernel log from when this happens. 

What is the best way to proceed from here? Running "btrfs check --repair" on the device? 
Worst case it's not a huge issue to lose the data stored there, it's my backup volume after all. 
But it would be good to understand the cause and know if there is a better fix than starting from scratch. 

Cheers,
	Oliver

PS: Please include me directly in replies, I am not subscribed to the list. 

[-- Attachment #2: btrfs-issue.txt.gz --]
[-- Type: application/gzip, Size: 16489 bytes --]


* Re: btrfs goes read-only when btrfs-cleaner runs
  2019-01-13 21:51 btrfs goes read-only when btrfs-cleaner runs Oliver Freyermuth
@ 2019-01-14  0:48 ` Oliver Freyermuth
  2019-01-15 22:24   ` Oliver Freyermuth
  2019-01-16  0:41 ` Chris Murphy
  1 sibling, 1 reply; 12+ messages in thread
From: Oliver Freyermuth @ 2019-01-14  0:48 UTC (permalink / raw)
  To: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 1375 bytes --]

On 13.01.19 at 22:51, Oliver Freyermuth wrote:
> (resending with gzipped attachment)
> 
> Dear BTRFS experts,
> 
> I just upgraded to 4.20.1 from 4.19 (not sure if related) and my btrfs backup volume entered read-only mode when running btrfs-cleaner,
> i.e. when purging old subvolumes. 
> 
> I have attached the kernel log from when this happens. 
> 
> What is the best way to proceed from here? Running "btrfs check repair" on the device? 
> Worst case it's not a huge issue to lose the data stored there, it's my backup volume after all. 
> But it would be good to understand the cause and know if there is a better fix than starting from scratch. 
> 
> Cheers,
> 	Oliver
> 
> PS: Please include me directly in replies, I am not subscribed to the list. 
> 

Dear BTRFS experts,

attached is the output of "btrfs check -p /dev/sdc2". 
I can't guarantee the volume has always been cleanly unmounted. 

I found several past occasions of this here:
https://www.spinics.net/lists/linux-btrfs/msg69040.html
and here:
https://unix.stackexchange.com/questions/369133/dealing-with-btrfs-ref-backpointer-mismatches-backref-missing
but without conclusive result. 

Please let me know what's the best way to proceed. From these links, it seems
btrfs check --repair
_should_ help, but I would prefer to get some advice first whether this is really the best approach. 

Cheers,
	Oliver

[-- Attachment #2: btrfsck.txt.gz --]
[-- Type: application/gzip, Size: 3549 bytes --]


* Re: btrfs goes read-only when btrfs-cleaner runs
  2019-01-14  0:48 ` Oliver Freyermuth
@ 2019-01-15 22:24   ` Oliver Freyermuth
  2019-01-15 22:58     ` Oliver Freyermuth
  2019-01-16  7:11     ` Nikolay Borisov
  0 siblings, 2 replies; 12+ messages in thread
From: Oliver Freyermuth @ 2019-01-15 22:24 UTC (permalink / raw)
  To: linux-btrfs

On 14.01.19 at 01:48, Oliver Freyermuth wrote:
> On 13.01.19 at 22:51, Oliver Freyermuth wrote:
>> I just upgraded to 4.20.1 from 4.19 (not sure if related) and my btrfs backup volume entered read-only mode when running btrfs-cleaner,
>> i.e. when purging old subvolumes. 
>>
>> I have attached the kernel log from when this happens. 
>>
>> What is the best way to proceed from here? Running "btrfs check repair" on the device? 
>> Worst case it's not a huge issue to lose the data stored there, it's my backup volume after all. 
>> But it would be good to understand the cause and know if there is a better fix than starting from scratch. 
> attached is the output of "btrfs check -p /dev/sdc2". 
> I can't guarantee the volume has never been cleanly unmounted. 
> 
> I found several past occasions of this here:
> https://www.spinics.net/lists/linux-btrfs/msg69040.html
> and here:
> https://unix.stackexchange.com/questions/369133/dealing-with-btrfs-ref-backpointer-mismatches-backref-missing
> but without conclusive result. 
> 
> Please let me know what's the best way to proceed. From these links, it seems
> btrfs check --repair
> _should_ help, but I would prefer to get some advice first whether this is really the best approach. 
> 

Dear BTRFS experts,

I have now salvaged all my backup subvolumes with btrfs send (using btrbk archive) to a new btrfs partition. 
Interestingly, when the old partition was mounted r/w initially and remounted r/o after the described issue was triggered by btrfs-cleaner:

[34758.491644] BTRFS: error (device sdc2) in __btrfs_free_extent:6828: errno=-2 No such entry
[34758.491647] BTRFS info (device sdc2): forced readonly
[34758.491652] BTRFS: error (device sdc2) in btrfs_run_delayed_refs:2978: errno=-2 No such entry

btrfs send appeared to fail on some subvolumes with:

[41822.676040] BTRFS error (device sdc2): parent transid verify failed on 52633681920 wanted 88063 found 87999
[41822.676260] BTRFS error (device sdc2): parent transid verify failed on 52633681920 wanted 88063 found 87999
[41822.676266] BTRFS info (device sdc2): no csum found for inode 22175978 start 0
[41822.683112] BTRFS warning (device sdc2): csum failed root 25758 ino 22175978 off 4427459514368 csum 0x5d3b8d26 expected csum 0x00000000 mirror 1

Unmounting and remounting the broken file system r/o, all visible subvolumes could be transferred without that issue. 
I presume that there's also a bug when the automatic remount as r/o happens since csum 0x00000000 does not look correct. 
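
To make the "wanted" and "found" numbers in the transid lines above easier to compare, here is a minimal Python sketch that extracts them and computes the difference. The regular expression is an assumption based only on the lines quoted in this mail, not on any documented log format, and the function name is made up for illustration.

```python
import re

# Pattern modelled on the "parent transid verify failed" lines quoted above.
# This is an assumption about the format, derived from these samples only.
TRANSID_RE = re.compile(
    r"BTRFS error \(device (?P<dev>\S+)\): parent transid verify failed "
    r"on (?P<bytenr>\d+) wanted (?P<wanted>\d+) found (?P<found>\d+)"
)


def parse_transid_failure(line):
    """Return (device, bytenr, wanted, found, gap), or None if the line does not match."""
    m = TRANSID_RE.search(line)
    if m is None:
        return None
    wanted = int(m.group("wanted"))
    found = int(m.group("found"))
    # gap > 0 means the block on disk carries an older generation than expected.
    return m.group("dev"), int(m.group("bytenr")), wanted, found, wanted - found


line = (
    "[41822.676040] BTRFS error (device sdc2): parent transid verify failed "
    "on 52633681920 wanted 88063 found 87999"
)
print(parse_transid_failure(line))
```

For the line above the gap is 64, i.e. the tree block found at that address is 64 generations older than the one its parent pointer expects.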

Since there's now nothing to lose and I received no other advice up to now, I'm running "btrfs check --repair" now just for the sake of learning
whether this appears to fix it. I'll shortly report back when that's done. 

If anybody can suggest a better solution in case this happens again (the issue appears to be widespread), I would be happy to learn. 

Cheers,
	Oliver


* Re: btrfs goes read-only when btrfs-cleaner runs
  2019-01-15 22:24   ` Oliver Freyermuth
@ 2019-01-15 22:58     ` Oliver Freyermuth
  2019-01-16  7:11     ` Nikolay Borisov
  1 sibling, 0 replies; 12+ messages in thread
From: Oliver Freyermuth @ 2019-01-15 22:58 UTC (permalink / raw)
  To: linux-btrfs

On 15.01.19 at 23:24, Oliver Freyermuth wrote:
> On 14.01.19 at 01:48, Oliver Freyermuth wrote:
>> On 13.01.19 at 22:51, Oliver Freyermuth wrote:
>>> I just upgraded to 4.20.1 from 4.19 (not sure if related) and my btrfs backup volume entered read-only mode when running btrfs-cleaner,
>>> i.e. when purging old subvolumes. 
>>>
>>> I have attached the kernel log from when this happens. 
>>>
>>> What is the best way to proceed from here? Running "btrfs check repair" on the device? 
>>> Worst case it's not a huge issue to lose the data stored there, it's my backup volume after all. 
>>> But it would be good to understand the cause and know if there is a better fix than starting from scratch. 
>> attached is the output of "btrfs check -p /dev/sdc2". 
>> I can't guarantee the volume has never been cleanly unmounted. 
>>
>> I found several past occasions of this here:
>> https://www.spinics.net/lists/linux-btrfs/msg69040.html
>> and here:
>> https://unix.stackexchange.com/questions/369133/dealing-with-btrfs-ref-backpointer-mismatches-backref-missing
>> but without conclusive result. 
>>
>> Please let me know what's the best way to proceed. From these links, it seems
>> btrfs check --repair
>> _should_ help, but I would prefer to get some advice first whether this is really the best approach. 
>>

> I have now salvaged all my backup subvolumes with btrfs send (using btrbk archive) to a new btrfs partition. 
> Interestingly, when the old partition was mounted r/w initially and remounted r/o after the described issue was triggered by btrfs-cleaner:
> 
> [34758.491644] BTRFS: error (device sdc2) in __btrfs_free_extent:6828: errno=-2 No such entry                                                                                                                                               
> [34758.491647] BTRFS info (device sdc2): forced readonly                                                                                                                                                                                     
> [34758.491652] BTRFS: error (device sdc2) in btrfs_run_delayed_refs:2978: errno=-2 No such entry 
> 
> btrfs send appeared to fail on some subvolumes with:
> 
> [41822.676040] BTRFS error (device sdc2): parent transid verify failed on 52633681920 wanted 88063 found 87999                                                                                                                               
> [41822.676260] BTRFS error (device sdc2): parent transid verify failed on 52633681920 wanted 88063 found 87999                                                                                                                               
> [41822.676266] BTRFS info (device sdc2): no csum found for inode 22175978 start 0                                                                                                                                                           
> [41822.683112] BTRFS warning (device sdc2): csum failed root 25758 ino 22175978 off 4427459514368 csum 0x5d3b8d26 expected csum 0x00000000 mirror 1 
> 
> Unmounting and remounting the broken file system r/o, all visible subvolumes could be transferred without that issue. 
> I presume that there's also a bug when the automatic remount as r/o happens since csum 0x00000000 does not look correct. 
> 
> Since there's now nothing to lose and I received no other advice up to now, I'm running "btrfs check --repair" now just for the sake of learning
> whether this appears to fix it. I'll shortly report back when that's done. 
> 
> If anybody can suggest a better solution in case this happens again (the issue appears to be wide-spread) I would be happy to learn. 

btrfs check --repair started to do its thing - and died. 
Below is the log and trace in the hope that it may help to fix the BUG_ON. 
That's with btrfs-progs 4.19.1 on Kernel 4.20.1. 

I'll run repair again, but I guess the volume is hosed, broken somewhere in subvolume deletion. 
Still seems fine when mounting r/o, though. 

--------------------------------------------------------------------------------------
$ btrfs check -p --repair /dev/sdc2
enabling repair mode
Opening filesystem to check...
Checking filesystem on /dev/sdc2
UUID: 3ded2960-989e-4890-9756-d6e60433e42f
[1/7] checking root items                      (0:06:27 elapsed, 12335857 items checked)
Fixed 0 roots.
ref mismatch on [711065600 16384] extent item 0, found 1
tree backref 711065600 parent 18178 root 18178 not found in extent tree
backpointer mismatch on [711065600 16384]
adding new tree backref on start 711065600 len 16384 parent 0 root 18178
Repaired extent references for 711065600
ref mismatch on [928907264 16384] extent item 0, found 1
tree backref 928907264 parent 25744 root 25744 not found in extent tree
backpointer mismatch on [928907264 16384]
owner ref check failed [928907264 16384]
repair deleting extent record: key [928907264,169,1]
adding new tree backref on start 928907264 len 16384 parent 0 root 25744
Repaired extent references for 928907264
ref mismatch on [28052652032 16384] extent item 0, found 1
tree backref 28052652032 parent 18178 root 18178 not found in extent tree
backpointer mismatch on [28052652032 16384]
owner ref check failed [28052652032 16384]
repair deleting extent record: key [28052652032,169,1]
adding new tree backref on start 28052652032 len 16384 parent 0 root 18178
Repaired extent references for 28052652032
ref mismatch on [28088516608 16384] extent item 0, found 1
tree backref 28088516608 parent 18178 root 18178 not found in extent tree
backpointer mismatch on [28088516608 16384]
owner ref check failed [28088516608 16384]
repair deleting extent record: key [28088516608,169,1]
adding new tree backref on start 28088516608 len 16384 parent 0 root 18178
Repaired extent references for 28088516608
ref mismatch on [52375928832 16384] extent item 0, found 1
tree backref 52375928832 parent 18178 root 18178 not found in extent tree
backpointer mismatch on [52375928832 16384]
adding new tree backref on start 52375928832 len 16384 parent 0 root 18178
Repaired extent references for 52375928832
ref mismatch on [185114099712 16384] extent item 0, found 1
tree backref 185114099712 parent 18178 root 18178 not found in extent tree
backpointer mismatch on [185114099712 16384]
adding new tree backref on start 185114099712 len 16384 parent 0 root 18178
Repaired extent references for 185114099712
ref mismatch on [283321597952 16384] extent item 0, found 1
tree backref 283321597952 parent 18178 root 18178 not found in extent tree
backpointer mismatch on [283321597952 16384]
owner ref check failed [283321597952 16384]
repair deleting extent record: key [283321597952,169,1]
adding new tree backref on start 283321597952 len 16384 parent 0 root 18178
Repaired extent references for 283321597952
ref mismatch on [419430154240 16384] extent item 0, found 1
tree backref 419430154240 parent 18178 root 18178 not found in extent tree
backpointer mismatch on [419430154240 16384]
owner ref check failed [419430154240 16384]
repair deleting extent record: key [419430154240,169,1]
adding new tree backref on start 419430154240 len 16384 parent 0 root 18178
Repaired extent references for 419430154240
ref mismatch on [419638804480 16384] extent item 0, found 1
tree backref 419638804480 parent 18178 root 18178 not found in extent tree
backpointer mismatch on [419638804480 16384]
owner ref check failed [419638804480 16384]
repair deleting extent record: key [419638804480,169,1]
adding new tree backref on start 419638804480 len 16384 parent 0 root 18178
Failed to find [52107100160, 168, 16384]
btrfs unable to find ref byte nr 52107116544 parent 0 root 2  owner 1 offset 0
transaction.c:168: btrfs_commit_transaction: BUG_ON `ret` triggered, value -5
btrfs(+0x507f9)[0x55ec25cf97f9]
btrfs(btrfs_commit_transaction+0x193)[0x55ec25cf9dd3]
btrfs(+0x1d74e)[0x55ec25cc674e]
btrfs(cmd_check+0x1104)[0x55ec25d0f2e4]
btrfs(main+0x82)[0x55ec25cc70f2]
/lib64/libc.so.6(__libc_start_main+0xe7)[0x7f41b421dae7]
btrfs(_start+0x2a)[0x55ec25cc72ca]
--------------------------------------------------------------------------------------
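
To see at a glance which roots the missing backrefs belong to, the check output can be tallied with a few lines of Python. This is only a sketch: the pattern is taken from the "tree backref ... not found in extent tree" lines in the log above, and the function name is hypothetical.

```python
import re
from collections import Counter

# Matches the "tree backref <bytenr> parent <id> root <id> not found" lines
# as they appear in the btrfs check output above (format assumed from this log).
BACKREF_RE = re.compile(
    r"tree backref (\d+) parent (\d+) root (\d+) not found in extent tree"
)


def missing_backrefs_per_root(log_text):
    """Count missing tree backrefs per root id in btrfs check output."""
    counts = Counter()
    for m in BACKREF_RE.finditer(log_text):
        counts[int(m.group(3))] += 1
    return counts


# Three lines copied from the log above, used as sample input.
sample = """\
tree backref 711065600 parent 18178 root 18178 not found in extent tree
tree backref 928907264 parent 25744 root 25744 not found in extent tree
tree backref 28052652032 parent 18178 root 18178 not found in extent tree
"""
print(dict(missing_backrefs_per_root(sample)))
```

In the log above, eight of the nine missing backrefs belong to root 18178 and one to root 25744.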


* Re: btrfs goes read-only when btrfs-cleaner runs
  2019-01-13 21:51 btrfs goes read-only when btrfs-cleaner runs Oliver Freyermuth
  2019-01-14  0:48 ` Oliver Freyermuth
@ 2019-01-16  0:41 ` Chris Murphy
  2019-01-16  1:11   ` Chris Murphy
  2019-01-16  1:13   ` Oliver Freyermuth
  1 sibling, 2 replies; 12+ messages in thread
From: Chris Murphy @ 2019-01-16  0:41 UTC (permalink / raw)
  To: Oliver Freyermuth, Qu Wenruo; +Cc: Btrfs BTRFS

The relevant error messages are:

unable to find ref byte
errno=-2 No such entry

Somehow a reference byte has been corrupted and inserted into multiple
locations in the tree, and it's not repairable: a correct value cannot
be inferred from other available information, and the tools have no
good way to just trim out the item that contains bad key pointers.
Part of the problem with just cutting out the bad parts is that it's
not clear whether doing so would make the problem even worse, or how
far the corruption extends.

What's further troubling though is the idea that this corruption might
have propagated to a separate volume via snapshot send/receive. Either
of the file systems might still be useful for a developer. It seems
important to me to have some kind of check to make sure it's not
possible for corruption to propagate in this manner.

In the meantime, I think it's a good idea to do a memory test. There's
some information in the archives about how to do this in a more
reliable way than just memtest86 type tests, but if you can run even a
memtest86 over a weekend it might confirm there's a memory problem.
Unfortunately a pass doesn't necessarily mean there aren't rare
transient problems.


Chris Murphy


* Re: btrfs goes read-only when btrfs-cleaner runs
  2019-01-16  0:41 ` Chris Murphy
@ 2019-01-16  1:11   ` Chris Murphy
  2019-01-16  1:15     ` Oliver Freyermuth
  2019-01-16  1:13   ` Oliver Freyermuth
  1 sibling, 1 reply; 12+ messages in thread
From: Chris Murphy @ 2019-01-16  1:11 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Oliver Freyermuth, Qu Wenruo, Btrfs BTRFS

On Tue, Jan 15, 2019 at 5:41 PM Chris Murphy <lists@colorremedies.com> wrote:
>
> The relevant error messages are:
>
> unable to find ref byte
> errno=-2 No such entry
>
> Somehow a reference byte has been corrupted and inserted into multiple
> locations in the tree and it's not repairable: i.e. neither a correct
> value can be inferred from other available information, nor do the
> tools have a good way to just trim out the item that contains bad key
> pointers - part of the problem with just cutting out the bad parts is
> it's not clear the problem is made even worse or how far the
> corruption extends.
>
> What's further troubling though is the idea that this corruption might
> have propagated to a separate volume via snapshot send receive. Either
> of the file systems might still be useful for a developer, it seems to
> me important to have some kind of check to make sure it's not possible
> for corruption to propagate in this manner.
>
> In the meantime, I think it's a good idea to do a memory test. There's
> some information in the archives about how to do this in a more
> reliable way than just memtest86 type tests, but if you can run even a
> memtest86 over a weekend it might confirm there's a memory problem.
> Unfortunately a pass doesn't necessarily mean there aren't rare
> transient problems.

Also, this could be related...

[ 4368.361487] CPU: 0 PID: 23915 Comm: kworker/u16:11 Tainted: P
 W  O      4.20.1-gentoo #1

Do you know why it's tainted? It looks like an out-of-tree proprietary
module is loaded. There has also been a previous kernel warning, but
that part of the dmesg isn't included. It's not possible to absolutely
exclude an out-of-tree kernel module as a source of memory
corruption, which makes it a possible suspect and makes
tracking down the source of the problem harder.
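
For reference, the letters after "Tainted:" can be decoded with a small lookup table. The Python sketch below covers only a subset of the flags; the kernel documents more than these, and the helper name is made up for illustration.

```python
# Subset of the kernel's taint flag letters. The kernel documentation
# lists more flags than this table contains.
TAINT_FLAGS = {
    "P": "proprietary module was loaded",
    "F": "module was force-loaded",
    "D": "kernel died recently (oops or BUG)",
    "W": "kernel issued a warning earlier",
    "C": "staging driver was loaded",
    "O": "externally built (out-of-tree) module was loaded",
    "E": "unsigned module was loaded",
    "K": "kernel was live patched",
}


def decode_taint(flags):
    """Map each non-blank taint letter to its meaning."""
    return [
        (ch, TAINT_FLAGS.get(ch, "not in this table"))
        for ch in flags
        if not ch.isspace()
    ]


# The flags from the log line quoted above.
for letter, meaning in decode_taint("P W O"):
    print(letter, "-", meaning)
```

For "P W O" this gives: a proprietary module, an earlier kernel warning, and an out-of-tree module.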

-- 
Chris Murphy


* Re: btrfs goes read-only when btrfs-cleaner runs
  2019-01-16  0:41 ` Chris Murphy
  2019-01-16  1:11   ` Chris Murphy
@ 2019-01-16  1:13   ` Oliver Freyermuth
  1 sibling, 0 replies; 12+ messages in thread
From: Oliver Freyermuth @ 2019-01-16  1:13 UTC (permalink / raw)
  To: Chris Murphy, Qu Wenruo; +Cc: Btrfs BTRFS

Thanks for the reply!

On 16.01.19 at 01:41, Chris Murphy wrote:
> The relevant error messages are:
> 
> unable to find ref byte
> errno=-2 No such entry
> 
> Somehow a reference byte has been corrupted and inserted into multiple
> locations in the tree and it's not repairable: i.e. neither a correct
> value can be inferred from other available information, nor do the
> tools have a good way to just trim out the item that contains bad key
> pointers - part of the problem with just cutting out the bad parts is
> it's not clear the problem is made even worse or how far the
> corruption extends.
> 
> What's further troubling though is the idea that this corruption might
> have propagated to a separate volume via snapshot send receive. Either
> of the file systems might still be useful for a developer, it seems to
> me important to have some kind of check to make sure it's not possible
> for corruption to propagate in this manner.
> 
> In the meantime, I think it's a good idea to do a memory test. There's
> some information in the archives about how to do this in a more
> reliable way than just memtest86 type tests, but if you can run even a
> memtest86 over a weekend it might confirm there's a memory problem.
> Unfortunately a pass doesn't necessarily mean there aren't rare
> transient problems.

There are some things which do not quite match up for a broken-memory explanation,
unless my understanding is wrong. 

I'll try to explain more concisely:
- The broken file system is on an external USB drive (SMR sadly!) and 
  was used as backup target for btrfs send of snapshots. 
- The machine sending data there does not have a corrupted filesystem. 
  It scrubs perfectly fine. The disk was only connected to that machine for backups, 
  from time to time. 
- To salvage data from the broken FS, I have now mounted it read-only (to prevent btrfs-cleaner from kicking in)
  and sent all snapshots (via btrbk archive) to a fresh filesystem (on a non-SMR disk). 
  For the read-only-mounted broken filesystem, no corruption error was shown in syslog. 
  Checking the new filesystem which has received all snapshots with "btrfs check --readonly",
  no corruption is visible. 
  So I must deduce the corruption was not part of a snapshot which was sent - which would mean
  the corruption is only part of a subvolume pending cleanup by btrfs-cleaner. 

So the only way corruption could have crept in from the machine's memory would have been
during actual send / receive. Also, since sending from the corrupted FS worked, I presume this corruption
only affects subvolumes marked for deletion, which can't be deleted due to the corruption. 
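
The salvage steps described above can be sketched as a dry-run plan. Note that I used btrbk archive for the actual transfer; the Python sketch below swaps in plain "btrfs send" and "btrfs receive" with full (non-incremental) sends, so it is not what btrbk does internally. All device names, mount points and snapshot names are hypothetical placeholders, and the function only builds the command strings without running anything.

```python
import shlex


def salvage_plan(old_dev, old_mnt, new_dev, new_mnt, snapshots):
    """Return the shell commands for a read-only salvage without running them."""
    q = shlex.quote
    # Mount the damaged filesystem read-only so btrfs-cleaner does not start.
    plan = [f"mount -o ro {q(old_dev)} {q(old_mnt)}"]
    # Full (non-incremental) send of each read-only snapshot.
    for snap in snapshots:
        plan.append(
            f"btrfs send {q(old_mnt + '/' + snap)} | btrfs receive {q(new_mnt)}"
        )
    # Check the new filesystem offline, without modifying it.
    plan.append(f"umount {q(new_mnt)}")
    plan.append(f"btrfs check --readonly {q(new_dev)}")
    return plan


# Hypothetical devices, mount points and snapshot names.
for cmd in salvage_plan("/dev/sdX2", "/mnt/old", "/dev/sdY1", "/mnt/new",
                        ["snap.20190101", "snap.20190108"]):
    print(cmd)
```

Full sends do not preserve the sharing between snapshots, so the target may need considerably more space than the source; an incremental scheme would pass a parent snapshot to "btrfs send" instead.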

It *might* have happened that during the reboot after the kernel upgrade (after which the corruption appeared), 
the disk did not properly unmount (while btrfs-cleaner was running). Unmounting that SMR disk while deferred
activities are going on may take many minutes, and something may have timed out during shutdown. 
I can't exclude this, and since after the reboot, btrfs-cleaner continued, that's indeed pretty likely. 

Is an interrupted btrfs-cleaner execution a possible explanation for this issue? 
This would also explain why the re-sent snapshots all seem fine. 

The filesystem itself has 1.2 TB with personal content. If there is a way to extract just the important bits for the developers
and remove anything about the actual content, of course I can do that. 

Cheers,
	Oliver

> 
> 
> Chris Murphy
> 


* Re: btrfs goes read-only when btrfs-cleaner runs
  2019-01-16  1:11   ` Chris Murphy
@ 2019-01-16  1:15     ` Oliver Freyermuth
  0 siblings, 0 replies; 12+ messages in thread
From: Oliver Freyermuth @ 2019-01-16  1:15 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Qu Wenruo, Btrfs BTRFS

On 16.01.19 at 02:11, Chris Murphy wrote:
> On Tue, Jan 15, 2019 at 5:41 PM Chris Murphy <lists@colorremedies.com> wrote:
>>
>> The relevant error messages are:
>>
>> unable to find ref byte
>> errno=-2 No such entry
>>
>> Somehow a reference byte has been corrupted and inserted into multiple
>> locations in the tree and it's not repairable: i.e. neither a correct
>> value can be inferred from other available information, nor do the
>> tools have a good way to just trim out the item that contains bad key
>> pointers - part of the problem with just cutting out the bad parts is
>> it's not clear the problem is made even worse or how far the
>> corruption extends.
>>
>> What's further troubling though is the idea that this corruption might
>> have propagated to a separate volume via snapshot send receive. Either
>> of the file systems might still be useful for a developer, it seems to
>> me important to have some kind of check to make sure it's not possible
>> for corruption to propagate in this manner.
>>
>> In the meantime, I think it's a good idea to do a memory test. There's
>> some information in the archives about how to do this in a more
>> reliable way than just memtest86 type tests, but if you can run even a
>> memtest86 over a weekend it might confirm there's a memory problem.
>> Unfortunately a pass doesn't necessarily mean there aren't rare
>> transient problems.
> 
> Also, this could be related...
> 
> [ 4368.361487] CPU: 0 PID: 23915 Comm: kworker/u16:11 Tainted: P
>  W  O      4.20.1-gentoo #1
> 
> Do you know why it's tainted? Looks like an out of tree proprietary
> module. And also there has been a previous kernel warning but that
> part of the dmesg isn't included. It's not possible to absolutely
> exclude an out of tree kernel from being a source of memory
> corruption, but it makes it a possible suspect and therefore it makes
> tracking down the source of the problem harder.
> 

nvidia-drivers, sadly. Nouveau fails with the hardware I have, and it's rare enough
that nobody with enough experience has stepped up to make it work well :-(. 


* Re: btrfs goes read-only when btrfs-cleaner runs
  2019-01-15 22:24   ` Oliver Freyermuth
  2019-01-15 22:58     ` Oliver Freyermuth
@ 2019-01-16  7:11     ` Nikolay Borisov
  2019-01-16 19:40       ` Oliver Freyermuth
  1 sibling, 1 reply; 12+ messages in thread
From: Nikolay Borisov @ 2019-01-16  7:11 UTC (permalink / raw)
  To: Oliver Freyermuth, linux-btrfs



On 16.01.19 at 0:24, Oliver Freyermuth wrote:
> On 14.01.19 at 01:48, Oliver Freyermuth wrote:
>> On 13.01.19 at 22:51, Oliver Freyermuth wrote:
>>> I just upgraded to 4.20.1 from 4.19 (not sure if related) and my btrfs backup volume entered read-only mode when running btrfs-cleaner,
>>> i.e. when purging old subvolumes. 
>>>
>>> I have attached the kernel log from when this happens. 
>>>
>>> What is the best way to proceed from here? Running "btrfs check repair" on the device? 
>>> Worst case it's not a huge issue to lose the data stored there, it's my backup volume after all. 
>>> But it would be good to understand the cause and know if there is a better fix than starting from scratch. 
>> attached is the output of "btrfs check -p /dev/sdc2". 
>> I can't guarantee the volume has never been cleanly unmounted. 
>>
>> I found several past occasions of this here:
>> https://www.spinics.net/lists/linux-btrfs/msg69040.html
>> and here:
>> https://unix.stackexchange.com/questions/369133/dealing-with-btrfs-ref-backpointer-mismatches-backref-missing
>> but without conclusive result. 
>>
>> Please let me know what's the best way to proceed. From these links, it seems
>> btrfs check --repair
>> _should_ help, but I would prefer to get some advice first whether this is really the best approach. 
>>
> 
> Dear BTRFS experts,
> 
> I have now salvaged all my backup subvolumes with btrfs send (using btrbk archive) to a new btrfs partition. 
> Interestingly, when the old partition was mounted r/w initially and remounted r/o after the described issue was triggered by btrfs-cleaner:
> 
> [34758.491644] BTRFS: error (device sdc2) in __btrfs_free_extent:6828: errno=-2 No such entry                                                                                                                                               
> [34758.491647] BTRFS info (device sdc2): forced readonly                                                                                                                                                                                     
> [34758.491652] BTRFS: error (device sdc2) in btrfs_run_delayed_refs:2978: errno=-2 No such entry 
> 

You are likely hitting a known issue. The fix is the patch

  btrfs: run delayed items before dropping the snapshot

which is currently part of 5.0. It has also been marked for stable, so
it should land in some of the stable kernels. You have 2 options:

1. Backport the patch to the kernel you desire
2. Wait until the patch lands in a stable release.

> btrfs send appeared to fail on some subvolumes with:
> 
> [41822.676040] BTRFS error (device sdc2): parent transid verify failed on 52633681920 wanted 88063 found 87999                                                                                                                               
> [41822.676260] BTRFS error (device sdc2): parent transid verify failed on 52633681920 wanted 88063 found 87999                                                                                                                               
> [41822.676266] BTRFS info (device sdc2): no csum found for inode 22175978 start 0                                                                                                                                                           
> [41822.683112] BTRFS warning (device sdc2): csum failed root 25758 ino 22175978 off 4427459514368 csum 0x5d3b8d26 expected csum 0x00000000 mirror 1 
> 
> Unmounting and remounting the broken file system r/o, all visible subvolumes could be transferred without that issue. 
> I presume that there's also a bug when the automatic remount as r/o happens since csum 0x00000000 does not look correct. 
> 
> Since there's now nothing to lose and I received no other advice up to now, I'm running "btrfs check --repair" now just for the sake of learning
> whether this appears to fix it. I'll shortly report back when that's done. 

--repair won't fix the problem, and it's possible it *could* make
things worse.

> 
> If anybody can suggest a better solution in case this happens again (the issue appears to be wide-spread) I would be happy to learn. 
> 
> Cheers,
> 	Oliver
> 


* Re: btrfs goes read-only when btrfs-cleaner runs
  2019-01-16  7:11     ` Nikolay Borisov
@ 2019-01-16 19:40       ` Oliver Freyermuth
  2019-01-17  0:28         ` Chris Murphy
  0 siblings, 1 reply; 12+ messages in thread
From: Oliver Freyermuth @ 2019-01-16 19:40 UTC (permalink / raw)
  To: Nikolay Borisov, linux-btrfs

On 16.01.19 at 08:11, Nikolay Borisov wrote:
> 
> 
> On 16.01.19 at 0:24, Oliver Freyermuth wrote:
>> On 14.01.19 at 01:48, Oliver Freyermuth wrote:
>>> On 13.01.19 at 22:51, Oliver Freyermuth wrote:
>>>> I just upgraded to 4.20.1 from 4.19 (not sure if related) and my btrfs backup volume entered read-only mode when running btrfs-cleaner,
>>>> i.e. when purging old subvolumes. 
>>>>
>>>> I have attached the kernel log from when this happens. 
>>>>
>>>> What is the best way to proceed from here? Running "btrfs check repair" on the device? 
>>>> Worst case it's not a huge issue to lose the data stored there, it's my backup volume after all. 
>>>> But it would be good to understand the cause and know if there is a better fix than starting from scratch. 
>>> attached is the output of "btrfs check -p /dev/sdc2". 
>>> I can't guarantee the volume has never been cleanly unmounted. 
>>>
>>> I found several past occasions of this here:
>>> https://www.spinics.net/lists/linux-btrfs/msg69040.html
>>> and here:
>>> https://unix.stackexchange.com/questions/369133/dealing-with-btrfs-ref-backpointer-mismatches-backref-missing
>>> but without conclusive result. 
>>>
>>> Please let me know what's the best way to proceed. From these links, it seems
>>> btrfs check --repair
>>> _should_ help, but I would prefer to get some advice first whether this is really the best approach. 
>>>
>>
>> Dear BTRFS experts,
>>
>> I have now salvaged all my backup subvolumes with btrfs send (using btrbk archive) to a new btrfs partition. 
>> Interestingly, when the old partition was mounted r/w initially and remounted r/o after the described issue was triggered by btrfs-cleaner:
>>
>> [34758.491644] BTRFS: error (device sdc2) in __btrfs_free_extent:6828: errno=-2 No such entry
>> [34758.491647] BTRFS info (device sdc2): forced readonly
>> [34758.491652] BTRFS: error (device sdc2) in btrfs_run_delayed_refs:2978: errno=-2 No such entry
>>
> 
> You are likely hitting a known issue. You need to apply the patch
> 
>   "btrfs: run delayed items before dropping the snapshot"
> 
> Currently this patch is part of 5.0, but it has also been marked for
> stable, so it should land in some of the stable kernels. So you have
> 2 options:
> 
> 1. Backport the patch to the kernel you desire
> 2. Wait until the patch lands in a stable release.

Thanks a lot for the pointer! 
Sadly, it seems that was already in 4.20.1, which I am using:
https://lkml.org/lkml/2019/1/9/792
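
To make the abort messages quoted above easier to compare against other reports, here is a minimal Python sketch that pulls out the function name, source line and errno. The regular expression is an assumption inferred only from the two error lines in this thread, not from any documented kernel log format, and `parse_abort` is a hypothetical helper name.

```python
import errno
import re

# Layout inferred from the abort lines quoted in this thread (assumption).
ABORT_RE = re.compile(
    r"BTRFS: error \(device (?P<dev>\S+)\) in (?P<func>\w+):(?P<line>\d+): "
    r"errno=(?P<errno>-?\d+) (?P<text>.*)"
)


def parse_abort(log_line):
    """Return a dict describing one abort line, or None if it does not match."""
    m = ABORT_RE.search(log_line)
    if m is None:
        return None
    code = abs(int(m.group("errno")))  # the kernel logs errno as a negative number
    return {
        "device": m.group("dev"),
        "function": m.group("func"),
        "source_line": int(m.group("line")),
        # Map the number to its symbolic name via the standard errno table.
        "errno_name": errno.errorcode.get(code, "unknown"),
        "text": m.group("text").strip(),
    }


sample = (
    "[34758.491644] BTRFS: error (device sdc2) in "
    "__btrfs_free_extent:6828: errno=-2 No such entry"
)
print(parse_abort(sample))
```

The symbolic name for errno 2 is ENOENT, which matches the "No such entry" text: the extent the cleaner tried to free had no matching entry.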

> 
>> btrfs send appeared to fail on some subvolumes with:
>>
>> [41822.676040] BTRFS error (device sdc2): parent transid verify failed on 52633681920 wanted 88063 found 87999
>> [41822.676260] BTRFS error (device sdc2): parent transid verify failed on 52633681920 wanted 88063 found 87999
>> [41822.676266] BTRFS info (device sdc2): no csum found for inode 22175978 start 0
>> [41822.683112] BTRFS warning (device sdc2): csum failed root 25758 ino 22175978 off 4427459514368 csum 0x5d3b8d26 expected csum 0x00000000 mirror 1
>>
>> Unmounting and remounting the broken file system r/o, all visible subvolumes could be transferred without that issue. 
>> I presume that there's also a bug when the automatic remount as r/o happens since csum 0x00000000 does not look correct. 
>>
>> Since there's now nothing to lose and I received no other advice up to now, I'm running "btrfs check --repair" now just for the sake of learning
>> whether this appears to fix it. I'll shortly report back when that's done. 
> 
> --repair won't fix the problem; it's also possible it *could* make
> things worse.

Since repair already ran (and did not really help, but segfaulted after trying some things), I guess the volume is hosed now anyway. 
It's still sad there is no clear explanation for the corruption. I still believe it *might* have been unmounted hard while btrfs-cleaner was running,
but I would hope that cannot lead to a non-recoverable state (especially if "only" deleted / to-be-deleted subvolumes are affected). 

I doubt it's memory corruption, since the source is fine and it only happened for those deleted subvolumes immediately after rebooting from 4.19 to 4.20.
(I don't think the kernel version change itself was the reason; more likely it was the reboot during deletion, which should have unmounted gracefully but might not have.) 
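
For anyone comparing logs, the "parent transid verify failed" lines quoted above can be summarised with a short Python sketch. The pattern is an assumption based only on the two lines in this thread, and `transid_gaps` is a hypothetical helper. As far as I understand the message, "wanted" is the generation the parent node expects and "found" is the generation actually stored in the block.

```python
import re

# Layout inferred from the transid lines quoted in this thread (assumption).
TRANSID_RE = re.compile(
    r"parent transid verify failed on (?P<bytenr>\d+) "
    r"wanted (?P<wanted>\d+) found (?P<found>\d+)"
)


def transid_gaps(log_text):
    """Map each block address to (wanted, found, wanted - found)."""
    gaps = {}
    for m in TRANSID_RE.finditer(log_text):
        wanted = int(m.group("wanted"))
        found = int(m.group("found"))
        # Repeated reports for the same block simply overwrite each other.
        gaps[int(m.group("bytenr"))] = (wanted, found, wanted - found)
    return gaps


sample = """\
[41822.676040] BTRFS error (device sdc2): parent transid verify failed on 52633681920 wanted 88063 found 87999
[41822.676260] BTRFS error (device sdc2): parent transid verify failed on 52633681920 wanted 88063 found 87999
"""
for bytenr, (wanted, found, gap) in transid_gaps(sample).items():
    print(f"block {bytenr}: wanted {wanted}, found {found}, gap {gap}")
```

For the lines above this reports a single block whose stored generation is 64 behind the expected one.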

I'll keep the volume around for a few more days in case somebody is interested in hunting down the cause; just let me know what is needed. 

Cheers,
	Oliver

> 
>>
>> If anybody can suggest a better solution in case this happens again (the issue appears to be wide-spread) I would be happy to learn. 
>>
>> Cheers,
>> 	Oliver
>>


* Re: btrfs goes read-only when btrfs-cleaner runs
  2019-01-16 19:40       ` Oliver Freyermuth
@ 2019-01-17  0:28         ` Chris Murphy
  2019-01-17  1:03           ` Oliver Freyermuth
  0 siblings, 1 reply; 12+ messages in thread
From: Chris Murphy @ 2019-01-17  0:28 UTC (permalink / raw)
  To: Oliver Freyermuth; +Cc: Nikolay Borisov, Btrfs BTRFS

On Wed, Jan 16, 2019 at 12:40 PM Oliver Freyermuth
<o.freyermuth@googlemail.com> wrote:
> Since repair already ran (and did not really help, but segfaulted after trying some things), I guess the volume is hosed now anyway.
> It's still sad there is no clear explanation for the corruption. I still believe it *might* have been unmounted hard while btrfs-cleaner was running,
> but I would hope that cannot lead to a non-recoverable state (especially if "only" deleted / to-be-deleted subvolumes are affected).

Well in theory no, even cleaning is predicated on COW. So what should
be true is the trees are pruned and written into a new location, and
only once that succeeds is a new super written. Of course, a
complicating factor is the tree walking is expensive, likely not all
of it can fit in memory at one time, and the new pruned tree isn't all
written out at once either; and then another complicating factor is if
any new files are being created at the same time. You don't want all
new writes to be held up until the cleaning is done.

But if there's no hardware fault or transient power problem or failure
at the time, then all of this should eventually complete successfully
and in the proper order. That it didn't, suggests a bug. The problem
is where. Btrfs bug? Some other kernel bug? Hardware, including
firmware, bug?
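
The ordering described above can be illustrated with a toy model in Python. This is only a sketch of the copy-on-write principle (write the new tree elsewhere, then switch the super), not btrfs's actual data structures; `ToyCowStore` and all of its fields are hypothetical names invented for illustration.

```python
class ToyCowStore:
    """Toy model: blocks are never overwritten; the super is a single pointer."""

    def __init__(self):
        self.blocks = {0: {"subvols": ["a", "b", "c"]}}  # address -> tree root
        self.super_root = 0   # what a mount after a crash would read
        self.next_free = 1

    def prune(self, subvol, crash_before_super=False):
        # Step 1: write the pruned tree to a NEW location (copy-on-write).
        old = self.blocks[self.super_root]
        new_root = {"subvols": [s for s in old["subvols"] if s != subvol]}
        addr = self.next_free
        self.blocks[addr] = new_root
        self.next_free += 1
        if crash_before_super:
            # Simulated power loss: the super still points at the old tree.
            return
        # Step 2: only now publish the new tree by updating the super.
        self.super_root = addr

    def mounted_view(self):
        return self.blocks[self.super_root]["subvols"]


store = ToyCowStore()
store.prune("a", crash_before_super=True)
print(store.mounted_view())   # old tree is still intact
store.prune("a")
print(store.mounted_view())   # pruned tree is visible after the super update
```

In this model an interruption between the two steps leaves the old, consistent tree in place. The real cleaner spreads the work over many transactions, which is where the complicating factors mentioned above come in.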

-- 
Chris Murphy


* Re: btrfs goes read-only when btrfs-cleaner runs
  2019-01-17  0:28         ` Chris Murphy
@ 2019-01-17  1:03           ` Oliver Freyermuth
  0 siblings, 0 replies; 12+ messages in thread
From: Oliver Freyermuth @ 2019-01-17  1:03 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Nikolay Borisov, Btrfs BTRFS

Am 17.01.19 um 01:28 schrieb Chris Murphy:
> On Wed, Jan 16, 2019 at 12:40 PM Oliver Freyermuth
> <o.freyermuth@googlemail.com> wrote:
>> Since repair already ran (and did not really help, but segfaulted after trying some things), I guess the volume is hosed now anyway.
>> It's still sad there is no clear explanation for the corruption. I still believe it *might* have been unmounted hard while btrfs-cleaner was running,
>> but I would hope that cannot lead to a non-recoverable state (especially if "only" deleted / to-be-deleted subvolumes are affected).
> 
> Well in theory no, even cleaning is predicated on COW. So what should
> be true is the trees are pruned and written into a new location, and
> only once that succeeds is a new super written. Of course, a
> complicating factor is the tree walking is expensive, likely not all
> of it can fit in memory at one time, and the new pruned tree isn't all
> written out at once either; and then another complicating factor is if
> any new files are being created at the same time. You don't want all
> new writes to be held up until the cleaning is done.
> 
> But if there's no hardware fault or transient power problem or failure
> at the time, then all of this should eventually complete successfully
> and in the proper order. That it didn't, suggests a bug. The problem
> is where. Btrfs bug? Some other kernel bug? Hardware, including
> firmware, bug?
> 

Thanks for the explanation, getting a better understanding of the expected
behaviour is really appreciated! 

I would even add the disk controller's firmware to the list of potential causes. It's an SMR disk 
(sadly, fully device-managed) which does _a lot_ of data shuffling, 
sometimes slowing down to something like
300 kB/s, especially when btrfs-cleaner runs and does a lot of random access. 
I have learnt my lesson never to do random writes on device-managed SMR again, 
and to use such disks only as write-once media. 

In case anybody thinks that providing metadata of the image or 
extracting more details from it would help to at least exclude / identify a btrfs bug, please tell me.
Otherwise I'll purge it in a few days; I have salvaged all the data I wanted. 

Excluding everything else is complex. I know the pain, since I have already (almost) lost one
and corrupted another BTRFS volume due to the infamous r8169 bug which has affected
hundreds of thousands of machines (and still affects those with old kernels!):
https://github.com/torvalds/linux/commit/a78e93661c5fd30b9e1dee464b2f62f966883ef7
It took me years to finally invest enough time to identify the cause of that corruption,
since memtests could never show it and I had to actually trigger
it from user space by reading network statistics (and only then, a userspace memtest could see it). 

Cheers,
	Oliver


Thread overview: 12+ messages
2019-01-13 21:51 btrfs goes read-only when btrfs-cleaner runs Oliver Freyermuth
2019-01-14  0:48 ` Oliver Freyermuth
2019-01-15 22:24   ` Oliver Freyermuth
2019-01-15 22:58     ` Oliver Freyermuth
2019-01-16  7:11     ` Nikolay Borisov
2019-01-16 19:40       ` Oliver Freyermuth
2019-01-17  0:28         ` Chris Murphy
2019-01-17  1:03           ` Oliver Freyermuth
2019-01-16  0:41 ` Chris Murphy
2019-01-16  1:11   ` Chris Murphy
2019-01-16  1:15     ` Oliver Freyermuth
2019-01-16  1:13   ` Oliver Freyermuth
