* Re: btrfs send and kernel 3.17
       [not found] <DC336054-F307-4A86-AD6D-204E700DE9AA@prnet.org>
@ 2014-10-07 13:19 ` Chris Mason
  2014-10-07 20:45   ` David Arendt
  0 siblings, 1 reply; 32+ messages in thread
From: Chris Mason @ 2014-10-07 13:19 UTC (permalink / raw)
  To: David Arendt; +Cc: linux-btrfs



On Tue, Oct 7, 2014 at 1:25 AM, David Arendt <admin@prnet.org> wrote:
> I did a revert of this commit. After creating a snapshot, the 
> filesystem was no longer usable, even with kernel 3.16.3 (crashes 10 
> seconds after mount without error message) . Maybe there was some 
> previous damage that just appeared now. This evening, I will restore 
> from backup and report back.
> 
> On October 7, 2014 12:22:11 AM CEST, Chris Mason <clm@fb.com> wrote:
>> On Mon, Oct 6, 2014 at 4:51 PM, David Arendt <admin@prnet.org> wrote:
>>>  I just tried downgrading to 3.16.3 again. In 3.16.3 btrfs send is
>>>  working without any problem. Afterwards I upgraded again to 3.17 and
>>>  the problem reappeared. So the problem seems to be kernel version
>>>  related.
>> 
>> [ backref errors during btrfs-send ]
>> 
>> Ok then, our list of suspects is pretty short.  Can you easily build
>> test kernels?
>> 
>> I'd like to try reverting this commit:
>> 
>> 51f395ad4058883e4273b02fdebe98072dbdc0d2

Oh no!  Reverting this definitely should not have caused corruptions, 
so I think the problem was already there.  Do you still have the 
filesystem image?

Please let us know if you're missing files off the backup, we'll help 
pull them out.

-chris



* Re: btrfs send and kernel 3.17
  2014-10-07 13:19 ` btrfs send and kernel 3.17 Chris Mason
@ 2014-10-07 20:45   ` David Arendt
  2014-10-07 20:46     ` Chris Mason
  0 siblings, 1 reply; 32+ messages in thread
From: David Arendt @ 2014-10-07 20:45 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-btrfs

On 10/07/2014 03:19 PM, Chris Mason wrote:
>
>
> On Tue, Oct 7, 2014 at 1:25 AM, David Arendt <admin@prnet.org> wrote:
>> I did a revert of this commit. After creating a snapshot, the
>> filesystem was no longer usable, even with kernel 3.16.3 (crashes 10
>> seconds after mount without error message) . Maybe there was some
>> previous damage that just appeared now. This evening, I will restore
>> from backup and report back.
>>
>> On October 7, 2014 12:22:11 AM CEST, Chris Mason <clm@fb.com> wrote:
>>> On Mon, Oct 6, 2014 at 4:51 PM, David Arendt <admin@prnet.org> wrote:
>>>>  I just tried downgrading to 3.16.3 again. In 3.16.3 btrfs send is
>>>>  working without any problem. Afterwards I upgraded again to 3.17 and
>>>>  the problem reappeared. So the problem seems to be kernel version
>>>>  related.
>>>
>>> [ backref errors during btrfs-send ]
>>>
>>> Ok then, our list of suspects is pretty short.  Can you easily build
>>> test kernels?
>>>
>>> I'd like to try reverting this commit:
>>>
>>> 51f395ad4058883e4273b02fdebe98072dbdc0d2
>
> Oh no!  Reverting this definitely should not have caused corruptions,
> so I think the problem was already there.  Do you still have the
> filesystem image?
>
> Please let us know if you're missing files off the backup, we'll help
> pull them out.
>
> -chris
>
> -- 
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Due to space constraints, it was not possible to take an image of the
corrupted filesystem. As I do backups daily, and the problems occurred 5
hours after backup, no file was lost. Thanks for offering your help. In
4 days I will do some send tests on the newly created filesystem and
report back.


* Re: btrfs send and kernel 3.17
  2014-10-07 20:45   ` David Arendt
@ 2014-10-07 20:46     ` Chris Mason
  2014-10-12 11:11       ` David Arendt
  0 siblings, 1 reply; 32+ messages in thread
From: Chris Mason @ 2014-10-07 20:46 UTC (permalink / raw)
  To: David Arendt; +Cc: linux-btrfs

On Tue, Oct 7, 2014 at 4:45 PM, David Arendt <admin@prnet.org> wrote:
> On 10/07/2014 03:19 PM, Chris Mason wrote:
>> 
>> 
>>  On Tue, Oct 7, 2014 at 1:25 AM, David Arendt <admin@prnet.org> wrote:
>>>  I did a revert of this commit. After creating a snapshot, the
>>>  filesystem was no longer usable, even with kernel 3.16.3 (crashes 10
>>>  seconds after mount without error message) . Maybe there was some
>>>  previous damage that just appeared now. This evening, I will restore
>>>  from backup and report back.
>>>
>>>  On October 7, 2014 12:22:11 AM CEST, Chris Mason <clm@fb.com> wrote:
>>>>  On Mon, Oct 6, 2014 at 4:51 PM, David Arendt <admin@prnet.org> wrote:
>>>>>   I just tried downgrading to 3.16.3 again. In 3.16.3 btrfs send is
>>>>>   working without any problem. Afterwards I upgraded again to 3.17 and
>>>>>   the problem reappeared. So the problem seems to be kernel version
>>>>>   related.
>>>>
>>>>  [ backref errors during btrfs-send ]
>>>>
>>>>  Ok then, our list of suspects is pretty short.  Can you easily build
>>>>  test kernels?
>>>>
>>>>  I'd like to try reverting this commit:
>>>>
>>>>  51f395ad4058883e4273b02fdebe98072dbdc0d2
>>
>>  Oh no!  Reverting this definitely should not have caused corruptions,
>>  so I think the problem was already there.  Do you still have the
>>  filesystem image?
>>
>>  Please let us know if you're missing files off the backup, we'll help
>>  pull them out.
>>
> Due to space constraints, it was not possible to take an image of the
> corrupted filesystem. As I do backups daily, and the problems occurred 5
> hours after backup, no file was lost. Thanks for offering your help. In
> 4 days I will do some send tests on the newly created filesystem and
> report back.

Ok, if you have the kernel messages from the panic, please send them 
along.

-chris





* Re: btrfs send and kernel 3.17
  2014-10-07 20:46     ` Chris Mason
@ 2014-10-12 11:11       ` David Arendt
  2014-10-12 15:24         ` john terragon
  2014-10-13 17:22         ` Rich Freeman
  0 siblings, 2 replies; 32+ messages in thread
From: David Arendt @ 2014-10-12 11:11 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-btrfs

This weekend I finally had time to try btrfs send again on the newly
created fs. Now I am running into another problem:

btrfs send returns: ERROR: send ioctl failed with -12: Cannot allocate
memory

In dmesg I see only the following output:

parent transid verify failed on 21325004800 wanted 2620 found 8325
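
The transid message above can be split into its fields mechanically, which makes the gap between the wanted and found transaction IDs easy to see. The following is a minimal sketch and was not posted in this thread: `transid_gap` is a hypothetical helper name, and it assumes each kernel message sits on a single line with the `on <address> wanted <N> found <M>` wording shown above.

```shell
# transid_gap (hypothetical helper): read kernel log lines on stdin and,
# for every "parent transid verify failed" message, print
#   <block address> <wanted> <found> <found - wanted>
transid_gap() {
    awk '/parent transid verify failed on/ {
        for (i = 1; i < NF; i++) {
            if ($i == "on")     addr   = $(i + 1)
            if ($i == "wanted") wanted = $(i + 1)
            if ($i == "found")  found  = $(i + 1)
        }
        print addr, wanted, found, found - wanted
    }'
}
```

Typical use would be `dmesg | transid_gap`; lines that do not contain the message are ignored.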





* Re: btrfs send and kernel 3.17
  2014-10-12 11:11       ` David Arendt
@ 2014-10-12 15:24         ` john terragon
  2014-10-12 21:35           ` David Arendt
  2014-10-13 17:22         ` Rich Freeman
  1 sibling, 1 reply; 32+ messages in thread
From: john terragon @ 2014-10-12 15:24 UTC (permalink / raw)
  To: David Arendt; +Cc: Chris Mason, Btrfs BTRFS

Hi.

I just wanted to "confirm David's story" so to speak :)

-kernel 3.17-rc7 (didn't bother to compile 3.17 as there weren't any
btrfs fixes, I think)

-btrfs-progs 3.16.2 (also compiled from source, so no
distribution-specific patches)

-fresh fs

-I get the same two errors David got (first I got the I/O error one
and then the memory allocation one)

-plus now when I ls -la the fs top volume this is what I get

drwxrwsr-x 1 root staff  30 Sep 11 16:15 home
d????????? ? ?    ?       ?            ? home-backup
drwxr-xr-x 1 root root  250 Oct 10 15:37 root
d????????? ? ?    ?       ?            ? root-backup
drwxr-xr-x 1 root root   88 Sep 15 16:02 vms
drwxr-xr-x 1 root root   88 Sep 15 16:02 vms-backup

yes, the question marks on those two *-backup snapshots are really
there. I can't access the snapshots, I can't delete them, I can't do
anything with them.

-btrfs check segfaults

-the events that led to this situation are these:
 1) btrfs su snap -r root root-backup
 2) send |receive (the entire root-backup, not an incremental send)
     immediate I/O error
 3) move on to home: btrfs su snap -r home home-backup
 4) send|receive (again not an incremental send)
     everything goes well (!)
 5) retry with root: btrfs su snap -r root root-backup
 6) send|receive
     and it goes seemingly well
 7) apt-get dist-upgrade just to modify root and try an incremental send
 8) reboot after the dist-upgrade
 9) ls -la the fs top volume: first I get the memory allocation error
     and after that any ls -la gives the output I pasted above. (notice
     that besides the ls -la, the two snapshots were not touched in any
     way since the two send|receive)

A few final notes. I haven't tried send/receive in a while (they were
unreliable) so I can't tell which was the last version in which they
worked for me (well, no version actually :) ).
I've never had any problem with just snapshots. I make them regularly,
I use them, I modify them and I've never had one problem (with 3.17
too, it's just send/receive that murders them).

Best regards

John
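
The `d?????????` entries in a listing like the one above can be picked out automatically, for example to check many snapshot directories at once. This is only a sketch under assumptions: `unreadable_entries` is a made-up name, it expects `ls -l` or `ls -la` output on stdin, and it assumes entry names contain no whitespace.

```shell
# unreadable_entries (hypothetical helper): print the names of entries
# whose mode field is all question marks, i.e. entries that ls could
# not stat. Assumes names without whitespace (last field is the name).
unreadable_entries() {
    awk 'NF >= 2 && substr($1, 2) == "?????????" { print $NF }'
}
```

It could be used as `ls -la /path/to/top-volume 2>/dev/null | unreadable_entries`, where the path is a placeholder.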


* Re: btrfs send and kernel 3.17
  2014-10-12 15:24         ` john terragon
@ 2014-10-12 21:35           ` David Arendt
  2014-10-13  4:11             ` David Arendt
  0 siblings, 1 reply; 32+ messages in thread
From: David Arendt @ 2014-10-12 21:35 UTC (permalink / raw)
  To: john terragon; +Cc: Chris Mason, Btrfs BTRFS

Just to let you know, I just tried an ls -l on 2 machines running kernel
3.17 and btrfs-progs 3.16.2.

Here is my ls -l output:

Machine 1:
ls: cannot access root.20141009.000503.backup: Cannot allocate memory
total 0
d????????? ? ?      ?         ?            ? root.20141009.000503.backup
drwxr-xr-x 1 root   root    182 Oct  7 20:35 root.20141012.095526.backup
drwxr-xr-x 1 root   root    182 Oct  7 20:35 root.20141012.000503.backup
drwxr-xr-x 1 root   root    182 Oct  7 20:35 root.20141011.000502.backup
drwxr-xr-x 1 root   root    182 Oct  7 20:35 root.20141010.000502.backup

root.20141009.000503.backup is not deletable.

Machine 2:
ls: cannot access root.20141006.003239.backup: Cannot allocate memory
ls: cannot access root.20141007.001616.backup: Cannot allocate memory
ls: cannot access root.20141008.000501.backup: Cannot allocate memory
ls: cannot access root.20141009.052436.backup: Cannot allocate memory
total 0
d????????? ? ?    ?      ?            ? root.20141009.052436.backup
d????????? ? ?    ?      ?            ? root.20141008.000501.backup
d????????? ? ?    ?      ?            ? root.20141007.001616.backup
d????????? ? ?    ?      ?            ? root.20141006.003239.backup
drwxr-xr-x 1 root root 232 Aug  3 15:00 root.20140925.001125.backup
drwxr-xr-x 1 root root 232 Aug  3 15:00 root.20140924.001017.backup
drwxr-xr-x 1 root root 232 Aug  3 15:00 root.20140923.001008.backup
drwxr-xr-x 1 root root 232 Aug  3 15:00 root.20140922.001836.backup
drwxr-xr-x 1 root root 232 Aug  3 15:00 root.20140921.001029.backup
drwxr-xr-x 1 root root 232 Aug  3 15:00 root.20140920.001020.backup

The ? ones are also not deletable.

Both machines are giving transid verify failed errors.

I checked my logfiles and this problem never occurred with previous
kernel versions. On machine 1, it is also certain that it was not caused
by any previous corruption, as this filesystem was created with
btrfs-progs 3.16.2 using kernel 3.17.




* Re: btrfs send and kernel 3.17
  2014-10-12 21:35           ` David Arendt
@ 2014-10-13  4:11             ` David Arendt
  2014-10-13 12:40               ` john terragon
  0 siblings, 1 reply; 32+ messages in thread
From: David Arendt @ 2014-10-13  4:11 UTC (permalink / raw)
  To: john terragon; +Cc: Chris Mason, Btrfs BTRFS

Some more info I thought of. For me, the corruption problem seems not
to be send related but snapshot creation related. On machine 2 send was
never used. However both filesystems are stored on SSDs (of different
brands). Another filesystem stored on a normal HDD didn't experience the
problem. Maybe this is pure coincidence and has nothing to do with
whether it is on SSD or HDD. Another thing I noticed is that for me,
the problem only seems to occur for root subvolumes with many small
files. I have no root subvolumes on HDD, so it might not be SSD related.




* Re: btrfs send and kernel 3.17
  2014-10-13  4:11             ` David Arendt
@ 2014-10-13 12:40               ` john terragon
  2014-10-13 15:40                 ` David Arendt
  0 siblings, 1 reply; 32+ messages in thread
From: john terragon @ 2014-10-13 12:40 UTC (permalink / raw)
  To: David Arendt; +Cc: Chris Mason, Btrfs BTRFS

Actually it seems strange that a send operation could corrupt the
source subvolume or fs. Why would the send modify the source subvolume
in any significant way? The only way I can find to reconcile your
observations with mine is that maybe the snapshots get corrupted not
by the send operation by itself but when they are generated with -r
(readonly, as it is needed to send them). Are the corrupted snapshots
you have in machine 2 (the one in which send was never used) readonly?


* Re: btrfs send and kernel 3.17
  2014-10-13 12:40               ` john terragon
@ 2014-10-13 15:40                 ` David Arendt
  0 siblings, 0 replies; 32+ messages in thread
From: David Arendt @ 2014-10-13 15:40 UTC (permalink / raw)
  To: john terragon; +Cc: Chris Mason, Btrfs BTRFS

On 10/13/2014 02:40 PM, john terragon wrote:
> Actually it seems strange that a send operation could corrupt the
> source subvolume or fs. Why would the send modify the source subvolume
> in any significant way? The only way I can find to reconcile your
> observations with mine is that maybe the snapshots get corrupted not
> by the send operation by itself but when they are generated with -r
> (readonly, as it is needed to send them). Are the corrupted snapshots
> you have in machine 2 (the one in which send was never used) readonly?
Yes, on both machines there are only readonly snapshots.


* Re: btrfs send and kernel 3.17
  2014-10-12 11:11       ` David Arendt
  2014-10-12 15:24         ` john terragon
@ 2014-10-13 17:22         ` Rich Freeman
  2014-10-13 20:27           ` btrfs random filesystem corruption in " David Arendt
  1 sibling, 1 reply; 32+ messages in thread
From: Rich Freeman @ 2014-10-13 17:22 UTC (permalink / raw)
  To: David Arendt; +Cc: Chris Mason, Btrfs BTRFS

On Sun, Oct 12, 2014 at 7:11 AM, David Arendt <admin@prnet.org> wrote:
> This weekend I finally had time to try btrfs send again on the newly
> created fs. Now I am running into another problem:
>
> btrfs send returns: ERROR: send ioctl failed with -12: Cannot allocate
> memory
>
> In dmesg I see only the following output:
>
> parent transid verify failed on 21325004800 wanted 2620 found 8325
>

I'm not using send at all, but I've been running into parent transid
verify failed messages where the wanted is way smaller than the found
when trying to balance a raid1 after adding a new drive.  Originally I
had gotten a BUG, and after reboot the drive finished balancing
(interestingly enough without moving any chunks to the new drive -
just consolidating everything on the old drives), and then when I try
to do another balance I get:
[ 4426.987177] BTRFS info (device sdc2): relocating block group
10367073779712 flags 17
[ 4446.287998] BTRFS info (device sdc2): found 13 extents
[ 4451.330887] parent transid verify failed on 10063286579200 wanted
987432 found 993678
[ 4451.350663] parent transid verify failed on 10063286579200 wanted
987432 found 993678

The btrfs program itself outputs:
btrfs balance start -v /data
Dumping filters: flags 0x7, state 0x0, force is off
  DATA (flags 0x0): balancing
  METADATA (flags 0x0): balancing
  SYSTEM (flags 0x0): balancing
ERROR: error during balancing '/data' - Cannot allocate memory
There may be more info in syslog - try dmesg | tail

This is also on 3.17.  This may be completely unrelated, but it seemed
similar enough to be worth mentioning.

The filesystem otherwise seems to work fine, other than the new drive
not having any data on it:
Label: 'datafs'  uuid: cd074207-9bc3-402d-bee8-6a8c77d56959
        Total devices 6 FS bytes used 2.16TiB
        devid    1 size 2.73TiB used 2.40TiB path /dev/sdc2
        devid    2 size 931.32GiB used 695.03GiB path /dev/sda2
        devid    3 size 931.32GiB used 700.00GiB path /dev/sdb2
        devid    4 size 931.32GiB used 700.00GiB path /dev/sdd2
        devid    5 size 931.32GiB used 699.00GiB path /dev/sde2
        devid    6 size 2.73TiB used 0.00 path /dev/sdf2

This is btrfs-progs-3.16.2.

--
Rich
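
A device listing in the format shown above can be scanned for devices that hold no data, which is the symptom described for the new drive. This is a rough sketch, not part of the original message: `empty_devices` is a hypothetical name, and it assumes the `devid N size S used U path P` line layout seen in the listing (the command that produced it is not named in the message).

```shell
# empty_devices (hypothetical helper): read a device listing on stdin and
# print the path of every devid line whose "used" value is zero.
# Assumes the field order: devid N size S used U path P
empty_devices() {
    awk '$1 == "devid" && $5 == "used" && $6 + 0 == 0 { print $NF }'
}
```

Values such as `2.40TiB` evaluate to their leading number in awk, so only a zero value such as `0.00` matches.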


* Re: btrfs random filesystem corruption in kernel 3.17
  2014-10-13 17:22         ` Rich Freeman
@ 2014-10-13 20:27           ` David Arendt
  2014-10-13 20:42             ` Rich Freeman
  2014-10-13 20:48             ` john terragon
  0 siblings, 2 replies; 32+ messages in thread
From: David Arendt @ 2014-10-13 20:27 UTC (permalink / raw)
  To: Rich Freeman; +Cc: Chris Mason, Btrfs BTRFS

From my own experience and based on what other people are saying, I
think there is a random btrfs filesystem corruption problem in kernel
3.17, at least related to snapshots. I therefore decided to post under
another subject to draw the attention of people who are not following
the btrfs send thread. More information can be found in the btrfs send
posts.

Did the filesystem you tried to balance contain snapshots? Read-only ones?




* Re: btrfs random filesystem corruption in kernel 3.17
  2014-10-13 20:27           ` btrfs random filesystem corruption in " David Arendt
@ 2014-10-13 20:42             ` Rich Freeman
  2014-10-13 22:36               ` Duncan
  2014-10-13 20:48             ` john terragon
  1 sibling, 1 reply; 32+ messages in thread
From: Rich Freeman @ 2014-10-13 20:42 UTC (permalink / raw)
  To: David Arendt; +Cc: Chris Mason, Btrfs BTRFS

On Mon, Oct 13, 2014 at 4:27 PM, David Arendt <admin@prnet.org> wrote:
> From my own experience and based on what other people are saying, I
> think there is a random btrfs filesystem corruption problem in kernel
> 3.17 at least related to snapshots, therefore I decided to post using
> another subject to draw attention from people not concerned about btrfs
> send to it. More information can be found in the btrfs send posts.
>
> Did the filesystem you tried to balance contain snapshots? Read only ones?

The filesystem contains numerous subvolumes and snapshots, many of
which are read-only.  I'm managing many with snapper.

The similarity of the transid verify errors made me think this issue
is related, and the root cause may have nothing to do with btrfs send.

As far as I can tell these errors aren't having any effect on my data
- hopefully the system is catching the problems before there are
actual disk writes/etc.

--
Rich


* Re: btrfs random filesystem corruption in kernel 3.17
  2014-10-13 20:27           ` btrfs random filesystem corruption in " David Arendt
  2014-10-13 20:42             ` Rich Freeman
@ 2014-10-13 20:48             ` john terragon
  2014-10-13 20:55               ` Rich Freeman
  2014-10-13 21:22               ` David Arendt
  1 sibling, 2 replies; 32+ messages in thread
From: john terragon @ 2014-10-13 20:48 UTC (permalink / raw)
  To: David Arendt; +Cc: Rich Freeman, Chris Mason, Btrfs BTRFS

I think I just found a consistent simple way to trigger the problem
(at least on my system). And, as I guessed before, it seems to be
related just to readonly snapshots:

1) I create a readonly snapshot
2) I do some changes on the source subvolume for the snapshot (I'm not
sure changes are strictly needed)
3) reboot (or probably just unmount and remount. I reboot because the
fs I've problems with contains my root subvolume)

After rebooting (or remounting) I consistently have the corruption
with the usual multitude of these in dmesg
"parent transid verify failed on 902316032 wanted 2484 found 4101"
and the characteristic ls -la output

drwxr-xr-x 1 root root  250 Oct 10 15:37 root
d????????? ? ?    ?       ?            ? root-b2
drwxr-xr-x 1 root root  250 Oct 10 15:37 root-b3
d????????? ? ?    ?       ?            ? root-backup

root-backup and root-b2 are both readonly whereas root-b3 is rw (and
it didn't get corrupted).

David, maybe you can try the same steps on one of your machines?

John
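
The three steps above can be written down as a small script for anyone who wants to try them on a scratch filesystem. This is a sketch under stated assumptions and was not posted in the thread: `MNT`, `DEV`, the subvolume names `root` and `root-ro`, and the helper names `step` and `repro` are all placeholders. By default it only prints the commands; it runs them only when `RUN=1` is set, which needs root and, per the reports in this thread, may corrupt the snapshot on kernel 3.17.

```shell
# Placeholders (assumptions): mount point and device of a scratch btrfs
# filesystem that already contains a subvolume named "root".
MNT=${MNT:-/mnt/scratch}
DEV=${DEV:-/dev/sdX1}

# step: print the command, or execute it when RUN=1.
step() {
    if [ "${RUN:-0}" = 1 ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# repro: the sequence described above, using unmount/remount in place
# of a reboot.
repro() {
    step btrfs subvolume snapshot -r "$MNT/root" "$MNT/root-ro"  # 1) readonly snapshot
    step touch "$MNT/root/changed"                               # 2) change the source subvolume
    step umount "$MNT"                                           # 3) unmount ...
    step mount "$DEV" "$MNT"                                     #    ... and remount
    step ls -la "$MNT"                                           # look for d????????? entries
}
```

Calling `repro` with no variables set prints the five commands for review; `dmesg` can then be checked by hand for "parent transid verify failed" lines.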


* Re: btrfs random filesystem corruption in kernel 3.17
  2014-10-13 20:48             ` john terragon
@ 2014-10-13 20:55               ` Rich Freeman
  2014-10-13 20:57                 ` Rich Freeman
  2014-10-13 21:22                 ` john terragon
  2014-10-13 21:22               ` David Arendt
  1 sibling, 2 replies; 32+ messages in thread
From: Rich Freeman @ 2014-10-13 20:55 UTC (permalink / raw)
  To: john terragon; +Cc: David Arendt, Chris Mason, Btrfs BTRFS

On Mon, Oct 13, 2014 at 4:48 PM, john terragon <jterragon@gmail.com> wrote:
> I think I just found a consistent simple way to trigger the problem
> (at least on my system). And, as I guessed before, it seems to be
> related just to readonly snapshots:
>
> 1) I create a readonly snapshot
> 2) I do some changes on the source subvolume for the snapshot (I'm not
> sure changes are strictly needed)
> 3) reboot (or probably just unmount and remount. I reboot because the
> fs I've problems with contains my root subvolume)
>
> After the rebooting (or the remount) I consistently have the corruption
> with the usual multitude of these in dmesg
> "parent transid verify failed on 902316032 wanted 2484 found 4101"
> and the characteristic ls -la output
>
> drwxr-xr-x 1 root root  250 Oct 10 15:37 root
> d????????? ? ?    ?       ?            ? root-b2
> drwxr-xr-x 1 root root  250 Oct 10 15:37 root-b3
> d????????? ? ?    ?       ?            ? root-backup
>
> root-backup and root-b2 are both readonly whereas root-b3 is rw (and
> it didn't get corrupted).
>
> David, maybe you can try the same steps on one of your machines?
>

Look at that.  I didn't realize it, but indeed I have a corrupted snapshot:
/data/.snapshots/5338/:
ls: cannot access /data/.snapshots/5338/snapshot: Cannot allocate memory
total 4
drwxr-xr-x 1 root root  32 Oct 11 06:09 .
drwxr-x--- 1 root root  32 Oct 11 07:42 ..
-rw------- 1 root root 135 Oct 11 06:09 info.xml
d????????? ? ?    ?      ?            ? snapshot
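
Entries like the one above can be picked out of a saved listing
mechanically. The following is a minimal sketch that assumes the
"d????????? ? ? ? ? ? name" layout that ls -la prints when it cannot stat
an entry; the function name is made up for illustration.

```python
def unreadable_entries(ls_output):
    """Return names of entries that ls -la could not stat."""
    names = []
    for line in ls_output.splitlines():
        # mode, links, owner, group, size, date, name
        fields = line.split(None, 6)
        # an unreadable entry has a type character followed by nine '?'
        if len(fields) == 7 and fields[0][1:] == "?" * 9:
            names.append(fields[6])
    return names


sample = """\
ls: cannot access /data/.snapshots/5338/snapshot: Cannot allocate memory
total 4
drwxr-xr-x 1 root root  32 Oct 11 06:09 .
drwxr-x--- 1 root root  32 Oct 11 07:42 ..
-rw------- 1 root root 135 Oct 11 06:09 info.xml
d????????? ? ?    ?      ?            ? snapshot
"""

print(unreadable_entries(sample))
```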

Several older snapshots are fine, and those predate my 3.17 upgrade.

I noticed that this corrupted snapshot isn't even listed in my snapper lists.

btrfs su delete /data/.snapshots/5338/snapshot
Transaction commit: none (default)
ERROR: error accessing '/data/.snapshots/5338/snapshot'

Removing them appears to be problematic as well.  I might just disable
compress=lzo and go back to 3.16 to see how that goes.

--
Rich

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: btrfs random filesystem corruption in kernel 3.17
  2014-10-13 20:55               ` Rich Freeman
@ 2014-10-13 20:57                 ` Rich Freeman
  2014-10-13 21:22                 ` john terragon
  1 sibling, 0 replies; 32+ messages in thread
From: Rich Freeman @ 2014-10-13 20:57 UTC (permalink / raw)
  To: john terragon; +Cc: David Arendt, Chris Mason, Btrfs BTRFS

On Mon, Oct 13, 2014 at 4:55 PM, Rich Freeman
<r-btrfs@thefreemanclan.net> wrote:
> On Mon, Oct 13, 2014 at 4:48 PM, john terragon <jterragon@gmail.com> wrote:
>>
>> After the rebooting (or the remount) I consistently have the corruption
>> with the usual multitude of these in dmesg
>> "parent transid verify failed on 902316032 wanted 2484 found 4101"
>> and the characteristic ls -la output

Sorry to double-reply, but I left this out.  I have a long string of
these early in boot as well that I never noticed before.

--
Rich

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: btrfs random filesystem corruption in kernel 3.17
  2014-10-13 20:55               ` Rich Freeman
  2014-10-13 20:57                 ` Rich Freeman
@ 2014-10-13 21:22                 ` john terragon
  2014-10-13 21:25                   ` David Arendt
  2014-10-13 23:18                   ` Rich Freeman
  1 sibling, 2 replies; 32+ messages in thread
From: john terragon @ 2014-10-13 21:22 UTC (permalink / raw)
  To: Rich Freeman; +Cc: David Arendt, Chris Mason, Btrfs BTRFS

I'm using "compress=no" so compression doesn't seem to be related, at
least in my case. Just read-only snapshots on 3.17 (although I haven't
tried 3.16).

John

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: btrfs random filesystem corruption in kernel 3.17
  2014-10-13 20:48             ` john terragon
  2014-10-13 20:55               ` Rich Freeman
@ 2014-10-13 21:22               ` David Arendt
  1 sibling, 0 replies; 32+ messages in thread
From: David Arendt @ 2014-10-13 21:22 UTC (permalink / raw)
  To: john terragon; +Cc: Rich Freeman, Chris Mason, Btrfs BTRFS

As these two machines are running as servers for different purposes (yes,
I know that btrfs is unstable and any corruption or data loss is at my
own risk, therefore I have good backups), I want to reboot them no more
than necessary.

However, I tried to correlate my reboot times with the corruptions:

machine 1:

d????????? ? ?      ?         ?            ? root.20141009.000503.backup

reboot   system boot  3.17.0           Thu Oct  9 23:20   still running
reboot   system boot  3.17.0           Tue Oct  7 21:25 - 23:18 (2+01:53)
reboot   system boot  3.17.0           Mon Oct  6 22:47 - 23:18 (3+00:31)

For this machine, corruption seems to have occurred for a snapshot
created after a reboot.


machine 2:

d????????? ? ?    ?      ?            ? root.20141006.003239.backup
d????????? ? ?    ?      ?            ? root.20141007.001616.backup
d????????? ? ?    ?      ?            ? root.20141008.000501.backup
d????????? ? ?    ?      ?            ? root.20141009.052436.backup

reboot   system boot  3.17.0           Thu Oct  9 21:31   still running
reboot   system boot  3.17.0           Tue Oct  7 21:27 - 21:30 (2+00:03)
reboot   system boot  3.17.0           Tue Oct  7 17:51 - 21:26  (03:34)
reboot   system boot  3.17.0           Sun Oct  5 23:50 - 17:50 (1+17:59)
reboot   system boot  3.17.0           Sun Oct  5 23:47 - 23:49  (00:01)
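
The comparison for machine 2 can be done mechanically: for each corrupted
snapshot, find the first reboot that followed its creation. This is a
minimal sketch. It assumes the snapshot names encode the creation time as
YYYYMMDD.HHMMSS and that all boot times fall in 2014 (the listing above
does not print the year); the helper names are made up for illustration.

```python
from datetime import datetime


def snapshot_time(name):
    """Parse 'root.20141009.000503.backup' into a datetime."""
    _, date, time, _ = name.split(".")
    return datetime.strptime(date + time, "%Y%m%d%H%M%S")


def first_boot_after(when, boots):
    """Return the earliest boot later than 'when', or None."""
    later = [b for b in boots if b > when]
    return min(later) if later else None


# Boot times of machine 2, copied from the listing above (year assumed).
machine2_boots = [
    datetime(2014, 10, 5, 23, 47),
    datetime(2014, 10, 5, 23, 50),
    datetime(2014, 10, 7, 17, 51),
    datetime(2014, 10, 7, 21, 27),
    datetime(2014, 10, 9, 21, 31),
]

machine2_snapshots = [
    "root.20141006.003239.backup",
    "root.20141007.001616.backup",
    "root.20141008.000501.backup",
    "root.20141009.052436.backup",
]

for name in machine2_snapshots:
    created = snapshot_time(name)
    print(name, "created", created,
          "next boot", first_boot_after(created, machine2_boots))
```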

Over the next few days, I will set up a virtual machine to do more tests.

On 10/13/2014 10:48 PM, john terragon wrote:
> I think I just found a consistent simple way to trigger the problem
> (at least on my system). And, as I guessed before, it seems to be
> related just to readonly snapshots:
>
> 1) I create a readonly snapshot
> 2) I do some changes on the source subvolume for the snapshot (I'm not
> sure changes are strictly needed)
> 3) reboot (or probably just unmount and remount. I reboot because the
> fs I've problems with contains my root subvolume)
>
> After the rebooting (or the remount) I consistently have the corruption
> with the usual multitude of these in dmesg
> "parent transid verify failed on 902316032 wanted 2484 found 4101"
> and the characteristic ls -la output
>
> drwxr-xr-x 1 root root  250 Oct 10 15:37 root
> d????????? ? ?    ?       ?            ? root-b2
> drwxr-xr-x 1 root root  250 Oct 10 15:37 root-b3
> d????????? ? ?    ?       ?            ? root-backup
>
> root-backup and root-b2 are both readonly whereas root-b3 is rw (and
> it didn't get corrupted).
>
> David, maybe you can try the same steps on one of your machines?
>
> John


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: btrfs random filesystem corruption in kernel 3.17
  2014-10-13 21:22                 ` john terragon
@ 2014-10-13 21:25                   ` David Arendt
  2014-10-13 21:49                     ` Duncan
  2014-10-13 23:18                   ` Rich Freeman
  1 sibling, 1 reply; 32+ messages in thread
From: David Arendt @ 2014-10-13 21:25 UTC (permalink / raw)
  To: john terragon, Rich Freeman; +Cc: Chris Mason, Btrfs BTRFS

I'm also using no compression.

On 10/13/2014 11:22 PM, john terragon wrote:
> I'm using "compress=no" so compression doesn't seem to be related, at
> least in my case. Just read-only snapshots on 3.17 (although I haven't
> tried 3.16).
>
> John


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: btrfs random filesystem corruption in kernel 3.17
  2014-10-13 21:25                   ` David Arendt
@ 2014-10-13 21:49                     ` Duncan
  0 siblings, 0 replies; 32+ messages in thread
From: Duncan @ 2014-10-13 21:49 UTC (permalink / raw)
  To: linux-btrfs

David Arendt posted on Mon, 13 Oct 2014 23:25:23 +0200 as excerpted:

> I'm also using no compression.
> 
> On 10/13/2014 11:22 PM, john terragon wrote:
>> I'm using "compress=no" so compression doesn't seem to be related, at
>> least in my case. Just read-only snapshots on 3.17 (although I haven't
>> tried 3.16).

While I'm not a mind-reader and thus don't know for sure, Rich's 
reference to 3.16 and compression might not be related to this bug at 
all.  In 3.15 and early 3.16, there was a different bug related to 
compression, tho IIRC it was patched in 3.16.2 and 3.17-rc2 (or maybe .3 
and rc3, it's patched in the latest 3.16.x anyway, and in 3.17).  So how 
I read his comment was that he was considering going back to 3.16 and 
disabling compression to deal with that bug (he may not know the patch 
was marked for stable and is in current 3.16.x), rather than stay on 
3.17, since this bug hasn't even been traced yet, let alone patched.

Meanwhile, this bug makes me glad my use-case doesn't involve snapshots, 
and I've seen nothing of it. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: btrfs random filesystem corruption in kernel 3.17
  2014-10-13 20:42             ` Rich Freeman
@ 2014-10-13 22:36               ` Duncan
  2014-10-14 11:17                 ` admin
  2014-10-14 17:00                 ` David Arendt
  0 siblings, 2 replies; 32+ messages in thread
From: Duncan @ 2014-10-13 22:36 UTC (permalink / raw)
  To: linux-btrfs

Rich Freeman posted on Mon, 13 Oct 2014 16:42:14 -0400 as excerpted:

> On Mon, Oct 13, 2014 at 4:27 PM, David Arendt <admin@prnet.org> wrote:
>> From my own experience and based on what other people are saying, I
>> think there is a random btrfs filesystem corruption problem in kernel
>> 3.17 at least related to snapshots, therefore I decided to post using
>> another subject to draw attention from people not concerned about btrfs
>> send to it. More information can be found in the btrfs send posts.
>>
>> Did the filesystem you tried to balance contain snapshots ? Read only
>> ones ?
> 
> The filesystem contains numerous subvolumes and snapshots, many of which
> are read-only.  I'm managing many with snapper.
> 
> The similarity of the transid verify errors made me think this issue is
> related, and the root cause may have nothing to do with btrfs send.
> 
> As far as I can tell these errors aren't having any effect on my data -
> hopefully the system is catching the problems before there are actual
> disk writes/etc.

Summarizing what I've seen on the threads...

1) The bug seems to be read-only snapshot related.  The connection to 
send is that send creates read-only snapshots, but people creating read-
only snapshots for other purposes are now reporting the same problem, so 
it's not send, it's the read-only snapshots.

2) Writable snapshots haven't been implicated yet, and the working set 
from which the snapshots are taken doesn't seem to be affected, either.  
So in that sense it's not affecting ordinary usage, only the read-only 
snapshots themselves.

3) More problematic, however, is the fact that these apparently corrupted 
read-only snapshots often are not listed properly and can't be deleted, 
tho I'm not sure if that's /all/ the corrupted snapshots or only part of 
them. So while it may not affect ordinary operation in the short term, 
over time until there's a fix, people routinely doing read-only snapshots 
are going to be getting more and more of these undeletable snapshots, and 
depending on whether the eventual patch only prevents more or can 
actually fix the bad ones (possibly via btrfs check or the like), 
affected filesystems may ultimately have to be blown away and recreated 
with a fresh mkfs, in order to kill the currently undeletable snapshots.

So the first thing to do would be to shut off whatever's making read-only 
snapshots, so you don't make the problem worse while it's being 
investigated.  For those who can do that without too big an interruption 
to their normal routine (who don't depend on send/receive, for instance), 
just keep it off for the time being.  For those who depend on read-only 
snapshots (send-receive for backup and the data is too valuable to not do 
the backups for a few days), consider switching back to 3.16-stable -- 
from 3.16.3 at least, the patch for the compress bug is there, so that 
shouldn't be a problem.

And if you're affected, be aware that until we have a fix, we don't know 
if it'll be possible to remove the affected and currently undeletable 
snapshots.  If it's not, at some point you'll need to do a fresh 
mkfs.btrfs, to get rid of the damage.  Since the bug doesn't appear to 
affect writable snapshots or the "head" from which snapshots are made, 
it's not urgent, and a full fix is likely to include a patch to detect 
and fix the problem as well, but until we know what the problem is we 
can't be sure of that, so be prepared to do that mkfs at some point, as 
at this point it's possible that's the only way you'll be able to kill 
the corrupted snapshots.

4) Total speculation on my part, but given the wanted transid (aka 
generation, in different contexts) is significantly lower than the found 
transid, and the fact that the problem appears to be limited to
/read-only/ snapshots, my first suspicion is that something's getting 
updated that would normally apply to all snapshots, but the read-only 
nature of the snapshots is preventing the full update there.  The transid 
of the block is updated, but the snapshot being read-only is preventing 
update of the pointer in that snapshot accordingly.

What I do /not/ know is whether the bug is that something's getting 
updated that should NOT be, and it's simply the read-only snapshots 
letting us know about it since the writable snapshots are fully updated, 
even if that breaks the snapshot (breaking writable snapshots in a 
different and currently undetected way), or if instead, it's a legitimate 
update, like a balance simply moving the snapshot around but not 
affecting it otherwise, and the bug is that the read-only snapshots 
aren't allowing the legitimate update.

Either way, this more or less developed over the weekend, and it's Monday 
now, so the devs should be on it.  If it's anything like the 3.15/3.16 
compression bug, it'll take some time for them to properly trace it, and 
then to figure out an appropriate fix, but they will.  Chances are we'll 
have at least some decent progress on a trace by Friday, and maybe even a 
good-to-go patch. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: btrfs random filesystem corruption in kernel 3.17
  2014-10-13 21:22                 ` john terragon
  2014-10-13 21:25                   ` David Arendt
@ 2014-10-13 23:18                   ` Rich Freeman
  2014-10-14  1:30                     ` john terragon
  1 sibling, 1 reply; 32+ messages in thread
From: Rich Freeman @ 2014-10-13 23:18 UTC (permalink / raw)
  To: john terragon; +Cc: David Arendt, Chris Mason, Btrfs BTRFS

On Mon, Oct 13, 2014 at 5:22 PM, john terragon <jterragon@gmail.com> wrote:
> I'm using "compress=no" so compression doesn't seem to be related, at
> least in my case. Just read-only snapshots on 3.17 (although I haven't
> tried 3.16).

I was using lzo compression, and hence my comment about turning it off
before going back to 3.16 (not realizing that 3.16 has subsequently
been fixed).

Ironically enough I discovered this as I was about to migrate my ext4
backup drive into my btrfs raid1.  Maybe I'll go ahead and wait on
that and have an rsync backup of the filesystem handy (minus
snapshots) just in case.  :)

I'd switch to 3.16, but it sounds like there is no way to remove the
snapshots at the moment, and I can live for a while without the
ability to create new ones.

Interestingly enough, it doesn't look like ALL snapshots are affected.
I checked and some of the snapshots I made last weekend while doing
system updates look accessible.  They are significantly smaller, and
the subvolumes they were made from are also fairly new - though I have
no idea if that is related.

The subvolumes do show up in btrfs su list.  They cannot be examined
using btrfs su show.

It would be VERY nice to have a way of cleaning this up without
blowing away the entire filesystem...

--
Rich

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: btrfs random filesystem corruption in kernel 3.17
  2014-10-13 23:18                   ` Rich Freeman
@ 2014-10-14  1:30                     ` john terragon
  0 siblings, 0 replies; 32+ messages in thread
From: john terragon @ 2014-10-14  1:30 UTC (permalink / raw)
  To: Btrfs BTRFS

Here is another worrying thing I didn't notice before. Two snapshots have
dates that do not make sense. root-b3 and root-b4 were created on
Oct 14th (and btw root's modification time was also Oct 14th).
So why do they show Oct 10th? root-prov really was created
on Oct 10 15:37, as it correctly shows, so it's as if btrfs sub snap
picks up old stale data from who knows where or when or for what
reason. Moreover, root-b4 was created with 3.16.5... not good.

drwxrwsr-x 1 root staff  30 Sep 11 16:15 home
d????????? ? ?    ?       ?            ? home-backup
drwxr-xr-x 1 root root  250 Oct 14 03:02 root
d????????? ? ?    ?       ?            ? root-b2
drwxr-xr-x 1 root root  250 Oct 10 15:37 root-b3
drwxr-xr-x 1 root root  250 Oct 10 15:37 root-b4
drwxr-xr-x 1 root root  250 Oct 14 03:02 root-b5
drwxr-xr-x 1 root root  250 Oct 14 03:02 root-b6
d????????? ? ?    ?       ?            ? root-backup
drwxr-xr-x 1 root root  250 Oct 10 15:37 root-prov
drwxr-xr-x 1 root root   88 Sep 15 16:02 vms

On Tue, Oct 14, 2014 at 1:18 AM, Rich Freeman
<r-btrfs@thefreemanclan.net> wrote:
> On Mon, Oct 13, 2014 at 5:22 PM, john terragon <jterragon@gmail.com> wrote:
>> I'm using "compress=no" so compression doesn't seem to be related, at
>> least in my case. Just read-only snapshots on 3.17 (although I haven't
>> tried 3.16).
>
> I was using lzo compression, and hence my comment about turning it off
> before going back to 3.16 (not realizing that 3.16 has subsequently
> been fixed).
>
> Ironically enough I discovered this as I was about to migrate my ext4
> backup drive into my btrfs raid1.  Maybe I'll go ahead and wait on
> that and have an rsync backup of the filesystem handy (minus
> snapshots) just in case.  :)
>
> I'd switch to 3.16, but it sounds like there is no way to remove the
> snapshots at the moment, and I can live for a while without the
> ability to create new ones.
>
> Interestingly enough, it doesn't look like ALL snapshots are affected.
> I checked and some of the snapshots I made last weekend while doing
> system updates look accessible.  They are significantly smaller, and
> the subvolumes they were made from are also fairly new - though I have
> no idea if that is related.
>
> The subvolumes do show up in btrfs su list.  They cannot be examined
> using btrfs su show.
>
> It would be VERY nice to have a way of cleaning this up without
> blowing away the entire filesystem...
>
> --
> Rich

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: btrfs random filesystem corruption in kernel 3.17
  2014-10-13 22:36               ` Duncan
@ 2014-10-14 11:17                 ` admin
  2014-10-14 21:35                   ` Duncan
  2014-10-14 17:00                 ` David Arendt
  1 sibling, 1 reply; 32+ messages in thread
From: admin @ 2014-10-14 11:17 UTC (permalink / raw)
  To: Duncan; +Cc: linux-btrfs

> Summarizing what I've seen on the threads...

First of all many thanks for summarizing the info.

> 1) The bug seems to be read-only snapshot related.  The connection to
> send is that send creates read-only snapshots, but people creating read-
> only snapshots for other purposes are now reporting the same problem, so
> it's not send, it's the read-only snapshots.

In fact, send does not create a read-only snapshot; snapshots are created
manually prior to calling send.

> 2) Writable snapshots haven't been implicated yet, and the working set
> from which the snapshots are taken doesn't seem to be affected, either.
> So in that sense it's not affecting ordinary usage, only the read-only
> snapshots themselves.
> 
> 3) More problematic, however, is the fact that these apparently corrupted
> read-only snapshots often are not listed properly and can't be deleted,
> tho I'm not sure if that's /all/ the corrupted snapshots or only part of
> them. So while it may not affect ordinary operation in the short term,
> over time until there's a fix, people routinely doing read-only snapshots
> are going to be getting more and more of these undeletable snapshots, and
> depending on whether the eventual patch only prevents more or can
> actually fix the bad ones (possibly via btrfs check or the like),
> affected filesystems may ultimately have to be blown away and recreated
> with a fresh mkfs, in order to kill the currently undeletable snapshots.
> 
> So the first thing to do would be to shut off whatever's making read-only
> snapshots, so you don't make the problem worse while it's being
> investigated.  For those who can do that without too big an interruption
> to their normal routine (who don't depend on send/receive, for instance),
> just keep it off for the time being.  For those who depend on read-only
> snapshots (send-receive for backup and the data is too valuable to not do
> the backups for a few days), consider switching back to 3.16-stable --
> from 3.16.3 at least, the patch for the compress bug is there, so that
> shouldn't be a problem.
> 
> And if you're affected, be aware that until we have a fix, we don't know
> if it'll be possible to remove the affected and currently undeletable
> snapshots.  If it's not, at some point you'll need to do a fresh
> mkfs.btrfs, to get rid of the damage.  Since the bug doesn't appear to
> affect writable snapshots or the "head" from which snapshots are made,
> it's not urgent, and a full fix is likely to include a patch to detect
> and fix the problem as well, but until we know what the problem is we
> can't be sure of that, so be prepared to do that mkfs at some point, as
> at this point it's possible that's the only way you'll be able to kill
> the corrupted snapshots.

I don't agree with you concerning the not urgent part. In my opinion,
any problem leading to filesystem or other data corruption should be
considered urgent, at least as long as it isn't known what exactly is
affected and whether there is a simple way to repair the corruption
without going the backup/restore route.

> 4) Total speculation on my part, but given the wanted transid (aka
> generation, in different contexts) is significantly lower than the found
> transid, and the fact that the problem appears to be limited to
> /read-only/ snapshots, my first suspicion is that something's getting
> updated that would normally apply to all snapshots, but the read-only
> nature of the snapshots is preventing the full update there.  The transid
> of the block is updated, but the snapshot being read-only is preventing
> update of the pointer in that snapshot accordingly.
> 
> What I do /not/ know is whether the bug is that something's getting
> updated that should NOT be, and it's simply the read-only snapshots
> letting us know about it since the writable snapshots are fully updated,
> even if that breaks the snapshot (breaking writable snapshots in a
> different and currently undetected way), or if instead, it's a legitimate
> update, like a balance simply moving the snapshot around but not
> affecting it otherwise, and the bug is that the read-only snapshots
> aren't allowing the legitimate update.
> 
> Either way, this more or less developed over the weekend, and it's Monday
> now, so the devs should be on it.  If it's anything like the 3.15/3.16
> compression bug, it'll take some time for them to properly trace it, and
> then to figure out an appropriate fix, but they will.  Chances are we'll
> have at least some decent progress on a trace by Friday, and maybe even a
> good-to-go patch. =:^)

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: btrfs random filesystem corruption in kernel 3.17
  2014-10-13 22:36               ` Duncan
  2014-10-14 11:17                 ` admin
@ 2014-10-14 17:00                 ` David Arendt
  1 sibling, 0 replies; 32+ messages in thread
From: David Arendt @ 2014-10-14 17:00 UTC (permalink / raw)
  To: Duncan, linux-btrfs

The corruption seems to be worse than expected. With kernel 3.16.5 I
cannot mount this filesystem read/write.

I'm in the process of doing a tar - mkfs.btrfs - untar recovery and am
staying on 3.16.5 for now.

[   55.465584] parent transid verify failed on 51150848 wanted 272368 found 276401
[   55.468415] parent transid verify failed on 918274048 wanted 273135 found 274590
[   55.470915] parent transid verify failed on 508444672 wanted 274054 found 276617
[   55.473758] parent transid verify failed on 18317623296 wanted 275876 found 278431
[   55.476240] parent transid verify failed on 127254528 wanted 276488 found 276490
[   55.479494] ------------[ cut here ]------------
[   55.479499] WARNING: CPU: 1 PID: 1723 at fs/btrfs/extent-tree.c:876 btrfs_lookup_extent_info+0x44c/0x490()
[   55.479500] Modules linked in:
[   55.479502] CPU: 1 PID: 1723 Comm: ls Not tainted 3.16.5 #1
[   55.479502] Hardware name: ASUS All Series/H87M-PRO, BIOS 2101 07/21/2014
[   55.479503]  0000000000000000 0000000000000009 ffffffff816ff873 0000000000000000
[   55.479504]  ffffffff81078261 ffff8807f7084770 ffff8807ed8ca000 000000003dcf4000
[   55.479506]  ffff8807f7133de0 0000000000000000 ffffffff812be9bc 0000000000004000
[   55.479507] Call Trace:
[   55.479511]  [<ffffffff816ff873>] ? dump_stack+0x41/0x51
[   55.479514]  [<ffffffff81078261>] ? warn_slowpath_common+0x81/0xb0
[   55.479515]  [<ffffffff812be9bc>] ? btrfs_lookup_extent_info+0x44c/0x490
[   55.479516]  [<ffffffff812c4998>] ? btrfs_alloc_free_block+0x2c8/0x450
[   55.479519]  [<ffffffff812af7df>] ? update_ref_for_cow+0x1ff/0x3f0
[   55.479520]  [<ffffffff812afc0a>] ? __btrfs_cow_block+0x23a/0x5a0
[   55.479522]  [<ffffffff812d14fd>] ? btrfs_buffer_uptodate+0x6d/0x80
[   55.479524]  [<ffffffff812b0136>] ? btrfs_cow_block+0x126/0x190
[   55.479525]  [<ffffffff812b43bd>] ? btrfs_search_slot+0x1fd/0xaa0
[   55.479527]  [<ffffffff812e07a3>] ? btrfs_truncate_inode_items+0x123/0x8e0
[   55.479529]  [<ffffffff812e204a>] ? btrfs_evict_inode+0x32a/0x490
[   55.479532]  [<ffffffff8112e02a>] ? unlock_new_inode+0x3a/0x60
[   55.479533]  [<ffffffff8113abb5>] ? __inode_wait_for_writeback+0x65/0xb0
[   55.479536]  [<ffffffff810a8f70>] ? wake_atomic_t_function+0x30/0x30
[   55.479537]  [<ffffffff8112f276>] ? evict+0xa6/0x160
[   55.479539]  [<ffffffff812e2c2d>] ? btrfs_orphan_cleanup+0x1ed/0x430
[   55.479540]  [<ffffffff812e31c8>] ? btrfs_lookup_dentry+0x358/0x4c0
[   55.479542]  [<ffffffff812e3339>] ? btrfs_lookup+0x9/0x30
[   55.479543]  [<ffffffff8111f6c4>] ? lookup_real+0x14/0x50
[   55.479545]  [<ffffffff81120292>] ? __lookup_hash+0x32/0x50
[   55.479546]  [<ffffffff81120938>] ? lookup_slow+0x48/0xc0
[   55.479547]  [<ffffffff811227bc>] ? path_lookupat+0x73c/0x770
[   55.479550]  [<ffffffff81164860>] ? posix_acl_xattr_get+0x40/0xb0
[   55.479551]  [<ffffffff81137a80>] ? generic_getxattr+0x50/0x80
[   55.479552]  [<ffffffff8112281e>] ? filename_lookup.isra.51+0x2e/0x90
[   55.479554]  [<ffffffff8112553f>] ? user_path_at_empty+0x5f/0xb0
[   55.479555]  [<ffffffff81125549>] ? user_path_at_empty+0x69/0xb0
[   55.479556]  [<ffffffff8111b690>] ? vfs_fstatat+0x40/0x90
[   55.479557]  [<ffffffff8111b862>] ? SyS_newlstat+0x12/0x30
[   55.479559]  [<ffffffff8111f89d>] ? path_put+0xd/0x20
[   55.479560]  [<ffffffff81138ab7>] ? SyS_getxattr+0x57/0x80
[   55.479562]  [<ffffffff817053d2>] ? system_call_fastpath+0x16/0x1b
[   55.479563] ---[ end trace a8ad56fd476f7474 ]---
[   55.479564] BTRFS: error (device sda2) in update_ref_for_cow:1018: errno=-30 Readonly filesystem
[   55.479565] BTRFS info (device sda2): forced readonly
[   55.479565] ------------[ cut here ]------------
[   55.479567] WARNING: CPU: 1 PID: 1723 at fs/btrfs/super.c:259 __btrfs_abort_transaction+0x5a/0x140()
[   55.479567] BTRFS: Transaction aborted (error -30)
[   55.479568] Modules linked in:
[   55.479569] CPU: 1 PID: 1723 Comm: ls Tainted: G        W 3.16.5 #1
[   55.479569] Hardware name: ASUS All Series/H87M-PRO, BIOS 2101 07/21/2014
[   55.479570]  0000000000000000 0000000000000009 ffffffff816ff873 ffff8807f2dcf788
[   55.479571]  ffffffff81078261 00000000ffffffe2 ffff8807ed8ca000 ffff8807f7133de0
[   55.479572]  ffffffff8184d800 0000000000000488 ffffffff81078345 ffffffff8197afd8
[   55.479573] Call Trace:
[   55.479574]  [<ffffffff816ff873>] ? dump_stack+0x41/0x51
[   55.479576]  [<ffffffff81078261>] ? warn_slowpath_common+0x81/0xb0
[   55.479578]  [<ffffffff81078345>] ? warn_slowpath_fmt+0x45/0x50
[   55.479579]  [<ffffffff812aa41a>] ? __btrfs_abort_transaction+0x5a/0x140
[   55.479580]  [<ffffffff812afe02>] ? __btrfs_cow_block+0x432/0x5a0
[   55.479582]  [<ffffffff812d14fd>] ? btrfs_buffer_uptodate+0x6d/0x80
[   55.479583]  [<ffffffff812b0136>] ? btrfs_cow_block+0x126/0x190
[   55.479584]  [<ffffffff812b43bd>] ? btrfs_search_slot+0x1fd/0xaa0
[   55.479586]  [<ffffffff812e07a3>] ? btrfs_truncate_inode_items+0x123/0x8e0
[   55.479587]  [<ffffffff812e204a>] ? btrfs_evict_inode+0x32a/0x490
[   55.479588]  [<ffffffff8112e02a>] ? unlock_new_inode+0x3a/0x60
[   55.479590]  [<ffffffff8113abb5>] ? __inode_wait_for_writeback+0x65/0xb0
[   55.479591]  [<ffffffff810a8f70>] ? wake_atomic_t_function+0x30/0x30
[   55.479592]  [<ffffffff8112f276>] ? evict+0xa6/0x160
[   55.479594]  [<ffffffff812e2c2d>] ? btrfs_orphan_cleanup+0x1ed/0x430
[   55.479595]  [<ffffffff812e31c8>] ? btrfs_lookup_dentry+0x358/0x4c0
[   55.479596]  [<ffffffff812e3339>] ? btrfs_lookup+0x9/0x30
[   55.479598]  [<ffffffff8111f6c4>] ? lookup_real+0x14/0x50
[   55.479599]  [<ffffffff81120292>] ? __lookup_hash+0x32/0x50
[   55.479600]  [<ffffffff81120938>] ? lookup_slow+0x48/0xc0
[   55.479601]  [<ffffffff811227bc>] ? path_lookupat+0x73c/0x770
[   55.479603]  [<ffffffff81164860>] ? posix_acl_xattr_get+0x40/0xb0
[   55.479605]  [<ffffffff81137a80>] ? generic_getxattr+0x50/0x80
[   55.479606]  [<ffffffff8112281e>] ? filename_lookup.isra.51+0x2e/0x90
[   55.479607]  [<ffffffff8112553f>] ? user_path_at_empty+0x5f/0xb0
[   55.479608]  [<ffffffff81125549>] ? user_path_at_empty+0x69/0xb0
[   55.479609]  [<ffffffff8111b690>] ? vfs_fstatat+0x40/0x90
[   55.479610]  [<ffffffff8111b862>] ? SyS_newlstat+0x12/0x30
[   55.479611]  [<ffffffff8111f89d>] ? path_put+0xd/0x20
[   55.479613]  [<ffffffff81138ab7>] ? SyS_getxattr+0x57/0x80
[   55.479614]  [<ffffffff817053d2>] ? system_call_fastpath+0x16/0x1b
[   55.479615] ---[ end trace a8ad56fd476f7475 ]---
[   55.479620] BTRFS error (device sda2): Error removing orphan entry, stopping orphan cleanup
[   55.479621] BTRFS critical (device sda2): could not do orphan cleanup -22
[   83.454294] parent transid verify failed on 51150848 wanted 272368 found 276401
[   83.454945] parent transid verify failed on 918274048 wanted 273135 found 274590
[   83.455601] parent transid verify failed on 508444672 wanted 274054 found 276617
[   83.456251] parent transid verify failed on 18317623296 wanted 275876 found 278431
[   83.456897] parent transid verify failed on 127254528 wanted 276488 found 276490
[   84.647964] parent transid verify failed on 51150848 wanted 272368 found 276401
[   84.648612] parent transid verify failed on 918274048 wanted 273135 found 274590
[   84.649267] parent transid verify failed on 508444672 wanted 274054 found 276617
[   84.649913] parent transid verify failed on 18317623296 wanted 275876 found 278431
[   84.650557] parent transid verify failed on 127254528 wanted 276488 found 276490


On 10/14/14 12:36 AM, Duncan wrote:
> Rich Freeman posted on Mon, 13 Oct 2014 16:42:14 -0400 as excerpted:
>
>> On Mon, Oct 13, 2014 at 4:27 PM, David Arendt <admin@prnet.org> wrote:
>>>  From my own experience and based on what other people are saying, I
>>> think there is a random btrfs filesystem corruption problem in kernel
>>> 3.17 at least related to snapshots, therefore I decided to post using
>>> another subject to draw attention from people not concerned about btrfs
>>> send to it. More information can be found in the btrfs send posts.
>>>
>>> Did the filesystem you tried to balance contain snapshots? Read-only
>>> ones?
>> The filesystem contains numerous subvolumes and snapshots, many of which
>> are read-only.  I'm managing many with snapper.
>>
>> The similarity of the transid verify errors made me think this issue is
>> related, and the root cause may have nothing to do with btrfs send.
>>
>> As far as I can tell these errors aren't having any effect on my data -
>> hopefully the system is catching the problems before there are actual
>> disk writes/etc.
> Summarizing what I've seen on the threads...
>
> 1) The bug seems to be read-only snapshot related.  The connection to
> send is that send creates read-only snapshots, but people creating read-
> only snapshots for other purposes are now reporting the same problem, so
> it's not send, it's the read-only snapshots.
>
> 2) Writable snapshots haven't been implicated yet, and the working set
> from which the snapshots are taken doesn't seem to be affected, either.
> So in that sense it's not affecting ordinary usage, only the read-only
> snapshots themselves.
>
> 3) More problematic, however, is the fact that these apparently corrupted
> read-only snapshots often are not listed properly and can't be deleted,
> tho I'm not sure if that's /all/ the corrupted snapshots or only part of
> them. So while it may not affect ordinary operation in the short term,
> over time until there's a fix, people routinely doing read-only snapshots
> are going to be getting more and more of these undeletable snapshots, and
> depending on whether the eventual patch only prevents more or can
> actually fix the bad ones (possibly via btrfs check or the like),
> affected filesystems may ultimately have to be blown away and recreated
> with a fresh mkfs, in order to kill the currently undeletable snapshots.
>
> So the first thing to do would be to shut off whatever's making read-only
> snapshots, so you don't make the problem worse while it's being
> investigated.  For those who can do that without too big an interruption
> to their normal routine (who don't depend on send/receive, for instance),
> just keep it off for the time being.  For those who depend on read-only
> snapshots (send-receive for backup and the data is too valuable to not do
> the backups for a few days), consider switching back to 3.16-stable --
> from 3.16.3 at least, the patch for the compress bug is there, so that
> shouldn't be a problem.
>
> And if you're affected, be aware that until we have a fix, we don't know
> if it'll be possible to remove the affected and currently undeletable
> snapshots.  If it's not, at some point you'll need to do a fresh
> mkfs.btrfs, to get rid of the damage.  Since the bug doesn't appear to
> affect writable snapshots or the "head" from which snapshots are made,
> it's not urgent, and a full fix is likely to include a patch to detect
> and fix the problem as well, but until we know what the problem is we
> can't be sure of that, so be prepared to do that mkfs at some point, as
> at this point it's possible that's the only way you'll be able to kill
> the corrupted snapshots.
>
> 4) Total speculation on my part, but given the wanted transid (aka
> generation, in different contexts) is significantly lower than the found
> transid, and the fact that the problem appears to be limited to
> /read-only/ snapshots, my first suspicion is that something's getting
> updated that would normally apply to all snapshots, but the read-only
> nature of the snapshots is preventing the full update there.  The transid
> of the block is updated, but the snapshot being read-only is preventing
> update of the pointer in that snapshot accordingly.
>
> What I do /not/ know is whether the bug is that something's getting
> updated that should NOT be, and it's simply the read-only snapshots
> letting us know about it since the writable snapshots are fully updated,
> even if that breaks the snapshot (breaking writable snapshots in a
> different and currently undetected way), or if instead, it's a legitimate
> update, like a balance simply moving the snapshot around but not
> affecting it otherwise, and the bug is that the read-only snapshots
> aren't allowing the legitimate update.
>
> Either way, this more or less developed over the weekend, and it's Monday
> now, so the devs should be on it.  If it's anything like the 3.15/3.16
> compression bug, it'll take some time for them to properly trace it, and
> then to figure out an appropriate fix, but they will.  Chances are we'll
> have at least some decent progress on a trace by Friday, and maybe even a
> good-to-go patch. =:^)
>



* Re: btrfs random filesystem corruption in kernel 3.17
  2014-10-14 11:17                 ` admin
@ 2014-10-14 21:35                   ` Duncan
  2014-10-14 22:03                     ` Robert White
  0 siblings, 1 reply; 32+ messages in thread
From: Duncan @ 2014-10-14 21:35 UTC (permalink / raw)
  To: linux-btrfs

admin posted on Tue, 14 Oct 2014 13:17:41 +0200 as excerpted:

>> And if you're affected, be aware that until we have a fix, we don't
>> know if it'll be possible to remove the affected and currently
>> undeletable snapshots.  If it's not, at some point you'll need to do a
>> fresh mkfs.btrfs, to get rid of the damage.  Since the bug doesn't
>> appear to affect writable snapshots or the "head" from which snapshots
>> are made, it's not urgent, and a full fix is likely to include a patch
>> to detect and fix the problem as well, but until we know what the
>> problem is we can't be sure of that, so be prepared to do that mkfs at
>> some point, as at this point it's possible that's the only way you'll
>> be able to kill the corrupted snapshots.
> 
> I don't agree with you concerning the not urgent part. In my opinion,
> any problem leading to filesystem or other data corruption should be
> considered as urgent, at least as long as it isn't known what exactly is
> affected and whether there is a simple way to salvage the corruption
> without going the backup/restore route.

I shouldn't have used a pronoun there as "it" wasn't clear.

By "it", I didn't mean the bug, which I agree is urgent for the reasons 
you state, but the mkfs.  Since there's currently no fix for the bug but 
it (the bug) seems to be limited to read-only snapshots at this point, 
_doing_the_mkfs_ isn't urgent.  With the damage limited to the read-only 
snapshots, you don't have to drop everything and do a mkfs _right_now_ to 
be rid of it.

But at some point, presumably after a fix is in place, since the damaged 
snapshots aren't currently always deletable, if the fix only prevents new 
damage from occurring and doesn't provide a way to fix the damaged ones, 
then mkfs would be the only way to do so.  With the damage limited to 
those snapshots and not spreading to normal writable snapshots or the 
working copy, dropping everything to do that mkfs isn't urgent, but it 
(the mkfs) will need to be done at some point to clear the undeletable 
snapshots, again, assuming the fix doesn't provide a way to get rid of 
them (the currently undeletable snapshots).

That's what I meant.  Yes the bug is urgent.  Doing a mkfs _right_now_ to 
get rid of the damage, not so much, because by all accounts so far the 
damage is limited to those read-only snapshots and isn't affecting 
ordinary writable snapshots or the working copies.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: btrfs random filesystem corruption in kernel 3.17
  2014-10-14 21:35                   ` Duncan
@ 2014-10-14 22:03                     ` Robert White
  2014-10-14 22:55                       ` Duncan
  0 siblings, 1 reply; 32+ messages in thread
From: Robert White @ 2014-10-14 22:03 UTC (permalink / raw)
  To: Duncan, linux-btrfs

On 10/14/2014 02:35 PM, Duncan wrote:
> But at some point, presumably after a fix is in place, since the damaged
> snapshots aren't currently always deletable, if the fix only prevents new
> damage from occurring and doesn't provide a way to fix the damaged ones,
> then mkfs would be the only way to do so.  With the damage limited to
> those snapshots and not spreading to normal writable snapshots or the
> working copy, dropping everything to do that mkfs isn't urgent, but it
> (the mkfs) will need to be done at some point to clear the undeletable
> snapshots, again, assuming the fix doesn't provide a way to get rid of
> them (the currently undeletable snapshots).


What happens if "btrfs property set" is used to (attempt to) promote the 
snapshot from read-only to read-write? Can the damaged snapshot then be 
subjected to scrub or btrfsck?

e.g.

btrfs property set /path/to/snapshot ro false
(maintenance here)



* Re: btrfs random filesystem corruption in kernel 3.17
  2014-10-14 22:03                     ` Robert White
@ 2014-10-14 22:55                       ` Duncan
  0 siblings, 0 replies; 32+ messages in thread
From: Duncan @ 2014-10-14 22:55 UTC (permalink / raw)
  To: linux-btrfs

Robert White posted on Tue, 14 Oct 2014 15:03:21 -0700 as excerpted:

> What happens if "btrfs property set" is used to (attempt to) promote the
> snapshot from read-only to read-write? Can the damaged snapshot then be
> subjected to scrub or btrfsck?
> 
> e.g.
> 
> btrfs property set /path/to/snapshot ro false
> (maintenance here)

Very good question not yet answered. =:^)

But it's one I can't answer as my use-case doesn't call for such 
snapshots in the first place and I don't have any to be personally 
affected by this bug, so my interest is academic.

I simply saw the big hairy thread and tried to summarize what I could get 
out of it to that point, with a bit of my own speculation as to what the 
"reversed" transid complaints meant.

(Since transids are normally sequential, in most corruption cases, the 
filesystem has moved on and has a higher transid that's "wanted", but can 
only find an older/lower transid for something or other.  Or at least 
that's what I've seen here and what seems common in the other reports 
I've seen posted.  This bug reverses that, with an older/lower "wanted" 
transid, but finding a newer/higher one.  That's the strange point that 
leapt out to me and I'd guess it's a strong hint at the problem, thus my 
definitely admin-not-coder speculation on that point.)
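
As a rough illustration of the "reversed" direction described above, the
"parent transid verify failed" lines posted earlier in this thread can be
classified by whether the found transid is newer or older than the wanted
one. This is only a sketch: the function name and the direction labels are
made up for the example, and the pattern assumes the message format quoted
in this thread.

```python
import re

# Matches the kernel message format quoted earlier in this thread.
TRANSID_RE = re.compile(
    r"parent transid verify failed on (\d+) wanted (\d+) found (\d+)"
)


def classify_transid_errors(log_text):
    """Return (bytenr, wanted, found, direction) for each transid error.

    The direction labels are hypothetical names used only in this sketch.
    """
    # Mail clients wrap long dmesg lines, so collapse all whitespace first.
    flat = " ".join(log_text.split())
    results = []
    for bytenr, wanted, found in TRANSID_RE.findall(flat):
        wanted, found = int(wanted), int(found)
        if found > wanted:
            direction = "found-newer"   # the pattern reported for this bug
        elif found < wanted:
            direction = "found-older"   # the more commonly reported pattern
        else:
            direction = "equal"
        results.append((int(bytenr), wanted, found, direction))
    return results


# Two of the lines posted earlier in this thread, wrapped as they arrived.
sample = """
[   83.454294] parent transid verify failed on 51150848 wanted 272368
found 276401
[   83.456897] parent transid verify failed on 127254528 wanted 276488
found 276490
"""

for row in classify_transid_errors(sample):
    print(row)
```

Both sample lines come out as "found-newer", which matches the
older-wanted, newer-found pattern noted above.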

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: btrfs send and kernel 3.17
  2014-10-06 20:51   ` David Arendt
@ 2014-10-06 22:22     ` Chris Mason
  0 siblings, 0 replies; 32+ messages in thread
From: Chris Mason @ 2014-10-06 22:22 UTC (permalink / raw)
  To: David Arendt; +Cc: linux-btrfs

On Mon, Oct 6, 2014 at 4:51 PM, David Arendt <admin@prnet.org> wrote:
> I just tried downgrading to 3.16.3 again. In 3.16.3 btrfs send is
> working without any problem. Afterwards I upgraded again to 3.17 and the
> problem reappeared. So the problem seems to be kernel version related.

[ backref errors during btrfs-send ]

Ok then, our list of suspects is pretty short.  Can you easily build 
test kernels?

I'd like to try reverting this commit:

51f395ad4058883e4273b02fdebe98072dbdc0d2

-chris




* Re: btrfs send and kernel 3.17
  2014-10-06 19:06 ` Chris Mason
  2014-10-06 19:48   ` David Arendt
@ 2014-10-06 20:51   ` David Arendt
  2014-10-06 22:22     ` Chris Mason
  1 sibling, 1 reply; 32+ messages in thread
From: David Arendt @ 2014-10-06 20:51 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-btrfs

I just tried downgrading to 3.16.3 again. In 3.16.3 btrfs send is
working without any problem. Afterwards I upgraded again to 3.17 and the
problem reappeared. So the problem seems to be kernel version related.

On 10/06/2014 09:06 PM, Chris Mason wrote:
> On Mon, Oct 6, 2014 at 2:50 PM, David Arendt <admin@prnet.org> wrote:
>> Hi,
>>
>> After upgrading to kernel 3.17 btrfs send has stopped working.
>>
>> ERROR: send ioctl failed with -5: Input/output error
>>
>> The following message is printed by kernel:
>>
>> [75322.782197] BTRFS error (device sda2): did not find backref in
>> send_root. inode=461, offset=0, disk_byte=1094713344 found
>> extent=1094713344
>>
>> btrfs inspect-internal inode-resolve -v 461 /u00/root.snapshot returns:
>>
>> /var/log/emerge-fetch.log
>>
>> After removing this file, the error moves on to another file.
>>
>> btrfs scrub output:
>>
>> scrub status for bc31b068-2c36-4ff2-ac5c-7ce55af5371d
>>     scrub started at Mon Oct  6 19:49:25 2014 and finished after 1748
>> seconds
>>     total bytes scrubbed: 94.21GiB with 0 errors
>>
>> Other than the btrfs send problem, the filesystem works normally.
>>
>> Is this a bug in btrfs-send or is my filesystem corrupted and should be
>> restored from backup?
>>
>> Please tell me if I can do anything else to help debugging this issue.
>
> Which kernel did you upgrade from?  I don't think we have changes in
> 3.17 that should impact this.
>
> Is emerge-fetch.log just a simple append-only log file?
>
> -chris
>
>
>
> -- 
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



* Re: btrfs send and kernel 3.17
  2014-10-06 19:06 ` Chris Mason
@ 2014-10-06 19:48   ` David Arendt
  2014-10-06 20:51   ` David Arendt
  1 sibling, 0 replies; 32+ messages in thread
From: David Arendt @ 2014-10-06 19:48 UTC (permalink / raw)
  To: Chris Mason; +Cc: linux-btrfs

I have upgraded from kernel 3.16.3.

emerge-fetch.log is a simple append-only log file. Other files having
the problems after deleting them one by one have been emerge.log,
mysql.log, freshclam.log and main.cvd from clamav. At this one, I
stopped deleting.

On 10/06/2014 09:06 PM, Chris Mason wrote:
> On Mon, Oct 6, 2014 at 2:50 PM, David Arendt <admin@prnet.org> wrote:
>> Hi,
>>
>> After upgrading to kernel 3.17 btrfs send has stopped working.
>>
>> ERROR: send ioctl failed with -5: Input/output error
>>
>> The following message is printed by kernel:
>>
>> [75322.782197] BTRFS error (device sda2): did not find backref in
>> send_root. inode=461, offset=0, disk_byte=1094713344 found
>> extent=1094713344
>>
>> btrfs inspect-internal inode-resolve -v 461 /u00/root.snapshot returns:
>>
>> /var/log/emerge-fetch.log
>>
>> After removing this file, the error moves on to another file.
>>
>> btrfs scrub output:
>>
>> scrub status for bc31b068-2c36-4ff2-ac5c-7ce55af5371d
>>     scrub started at Mon Oct  6 19:49:25 2014 and finished after 1748
>> seconds
>>     total bytes scrubbed: 94.21GiB with 0 errors
>>
>> Other than the btrfs send problem, the filesystem works normally.
>>
>> Is this a bug in btrfs-send or is my filesystem corrupted and should be
>> restored from backup?
>>
>> Please tell me if I can do anything else to help debugging this issue.
>
> Which kernel did you upgrade from?  I don't think we have changes in
> 3.17 that should impact this.
>
> Is emerge-fetch.log just a simple append-only log file?
>
> -chris
>
>
>



* Re: btrfs send and kernel 3.17
  2014-10-06 18:50 btrfs send and " David Arendt
@ 2014-10-06 19:06 ` Chris Mason
  2014-10-06 19:48   ` David Arendt
  2014-10-06 20:51   ` David Arendt
  0 siblings, 2 replies; 32+ messages in thread
From: Chris Mason @ 2014-10-06 19:06 UTC (permalink / raw)
  To: David Arendt; +Cc: linux-btrfs

On Mon, Oct 6, 2014 at 2:50 PM, David Arendt <admin@prnet.org> wrote:
> Hi,
> 
> After upgrading to kernel 3.17 btrfs send has stopped working.
> 
> ERROR: send ioctl failed with -5: Input/output error
> 
> The following message is printed by kernel:
> 
> [75322.782197] BTRFS error (device sda2): did not find backref in
> send_root. inode=461, offset=0, disk_byte=1094713344 found 
> extent=1094713344
> 
> btrfs inspect-internal inode-resolve -v 461 /u00/root.snapshot returns:
> 
> /var/log/emerge-fetch.log
> 
> After removing this file, the error moves on to another file.
> 
> btrfs scrub output:
> 
> scrub status for bc31b068-2c36-4ff2-ac5c-7ce55af5371d
>     scrub started at Mon Oct  6 19:49:25 2014 and finished after 1748
> seconds
>     total bytes scrubbed: 94.21GiB with 0 errors
> 
> Other than the btrfs send problem, the filesystem works normally.
> 
> Is this a bug in btrfs-send or is my filesystem corrupted and should be
> restored from backup?
> 
> Please tell me if I can do anything else to help debugging this issue.

Which kernel did you upgrade from?  I don't think we have changes in 
3.17 that should impact this.

Is emerge-fetch.log just a simple append-only log file?

-chris





* btrfs send and kernel 3.17
@ 2014-10-06 18:50 David Arendt
  2014-10-06 19:06 ` Chris Mason
  0 siblings, 1 reply; 32+ messages in thread
From: David Arendt @ 2014-10-06 18:50 UTC (permalink / raw)
  To: linux-btrfs

Hi,

After upgrading to kernel 3.17 btrfs send has stopped working.

ERROR: send ioctl failed with -5: Input/output error

The following message is printed by kernel:

[75322.782197] BTRFS error (device sda2): did not find backref in
send_root. inode=461, offset=0, disk_byte=1094713344 found extent=1094713344

btrfs inspect-internal inode-resolve -v 461 /u00/root.snapshot returns:

/var/log/emerge-fetch.log

After removing this file, the error moves on to another file.

btrfs scrub output:

scrub status for bc31b068-2c36-4ff2-ac5c-7ce55af5371d
    scrub started at Mon Oct  6 19:49:25 2014 and finished after 1748
seconds
    total bytes scrubbed: 94.21GiB with 0 errors

Other than the btrfs send problem, the filesystem works normally.

Is this a bug in btrfs-send or is my filesystem corrupted and should be
restored from backup?

Please tell me if I can do anything else to help debugging this issue.

Thanks in advance,
David Arendt
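
The lookup described in this report (kernel message, then inode number,
then path) could be scripted roughly as below. This is a sketch under
assumptions: the helper name is invented for the example, the pattern only
covers the exact message format quoted above, and the code only builds the
inode-resolve command line shown in the report without running it.

```python
import re
import shlex

# Matches the send error quoted in the report above.
BACKREF_RE = re.compile(
    r"did not find backref in send_root\. inode=(\d+), offset=(\d+), "
    r"disk_byte=(\d+) found extent=(\d+)"
)


def inode_resolve_command(kernel_message, snapshot_path):
    """Build the inode-resolve argv for the inode named in a send error.

    Hypothetical helper; returns None if the message does not match.
    """
    # The message may be wrapped across lines, so collapse whitespace first.
    flat = " ".join(kernel_message.split())
    match = BACKREF_RE.search(flat)
    if match is None:
        return None
    inode = match.group(1)
    return ["btrfs", "inspect-internal", "inode-resolve", "-v",
            inode, snapshot_path]


message = """[75322.782197] BTRFS error (device sda2): did not find backref in
send_root. inode=461, offset=0, disk_byte=1094713344 found extent=1094713344"""

cmd = inode_resolve_command(message, "/u00/root.snapshot")
# Print the command in a shell-safe form instead of executing it.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Printing rather than executing keeps the sketch safe to try on a machine
that has no btrfs filesystem mounted.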


end of thread, other threads:[~2014-10-14 22:56 UTC | newest]

Thread overview: 32+ messages
     [not found] <DC336054-F307-4A86-AD6D-204E700DE9AA@prnet.org>
2014-10-07 13:19 ` btrfs send and kernel 3.17 Chris Mason
2014-10-07 20:45   ` David Arendt
2014-10-07 20:46     ` Chris Mason
2014-10-12 11:11       ` David Arendt
2014-10-12 15:24         ` john terragon
2014-10-12 21:35           ` David Arendt
2014-10-13  4:11             ` David Arendt
2014-10-13 12:40               ` john terragon
2014-10-13 15:40                 ` David Arendt
2014-10-13 17:22         ` Rich Freeman
2014-10-13 20:27           ` btrfs random filesystem corruption in " David Arendt
2014-10-13 20:42             ` Rich Freeman
2014-10-13 22:36               ` Duncan
2014-10-14 11:17                 ` admin
2014-10-14 21:35                   ` Duncan
2014-10-14 22:03                     ` Robert White
2014-10-14 22:55                       ` Duncan
2014-10-14 17:00                 ` David Arendt
2014-10-13 20:48             ` john terragon
2014-10-13 20:55               ` Rich Freeman
2014-10-13 20:57                 ` Rich Freeman
2014-10-13 21:22                 ` john terragon
2014-10-13 21:25                   ` David Arendt
2014-10-13 21:49                     ` Duncan
2014-10-13 23:18                   ` Rich Freeman
2014-10-14  1:30                     ` john terragon
2014-10-13 21:22               ` David Arendt
2014-10-06 18:50 btrfs send and " David Arendt
2014-10-06 19:06 ` Chris Mason
2014-10-06 19:48   ` David Arendt
2014-10-06 20:51   ` David Arendt
2014-10-06 22:22     ` Chris Mason
