linux-btrfs.vger.kernel.org archive mirror
* first mount(s) after unclean shutdown always fail
@ 2020-07-01  0:51 Marc Lehmann
  2020-07-01  1:30 ` Qu Wenruo
  2020-07-02 18:31 ` Zygo Blaxell
  0 siblings, 2 replies; 11+ messages in thread
From: Marc Lehmann @ 2020-07-01  0:51 UTC (permalink / raw)
  To: linux-btrfs

Hi!

I have a server with multiple btrfs filesystems and some moderate-sized
dmcache caches (a few million blocks/100s of GBs).

When the server has an unclean shutdown, dmcache treats all cached blocks
as dirty. This has the effect of extremely slow I/O, as dmcache basically
caches a lot of random I/O, and writing these blocks back to the rotating
disk backing store can take hours. This, I think, is related to the
problem.
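
For reference, a rough way I watch the writeback progress on the
lvmcache'd volumes is something like the following (just a sketch -- the
reporting field names depend on the lvm2 version, "vg0" is a placeholder,
and a hand-built dm-cache table would need "dmsetup status" on the cache
device instead):

   # show how many cache blocks are still dirty for each cached LV
   lvs -a -o lv_name,cache_dirty_blocks,cache_used_blocks,cache_total_blocks vg0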

When the server is in this condition, then all btrfs filesystems on slow
stores (regardless of whether they use dmcache or not) fail their first
mount attempt(s) like this:

   [  173.243117] BTRFS info (device dm-7): has skinny extents
   [  864.982108] BTRFS error (device dm-7): open_ctree failed

Recent kernels sometimes additionally fail like this (super_total_bytes):

   [  867.721885] BTRFS info (device dm-7): turning on sync discard
   [  867.722341] BTRFS info (device dm-7): disk space caching is enabled
   [  867.722691] BTRFS info (device dm-7): has skinny extents
   [  871.257020] BTRFS error (device dm-7): super_total_bytes 858976681984 mismatch with fs_devices total_rw_bytes 1717953363968
   [  871.257487] BTRFS error (device dm-7): failed to read chunk tree: -22
   [  871.269989] BTRFS error (device dm-7): open_ctree failed

All the filesystems in question are mounted twice during normal boots,
with different subvolumes, and systemd parallelises these mounts. This might
play a role in these failures.

Simply trying to mount the filesystems again then (usually) succeeds with
seemingly no issues, so these are spurious mount failures. These repeated
mount attempts are also much faster, presumably because a lot of the data
is already in memory.

As far as I am concerned, this is 100% reproducible (i.e. it happens on every
unclean shutdown). It also happens on "old" (4.19 era) filesystems as well as
on filesystems that have never seen anything older than 5.4 kernels.

It does _not_ happen with filesystems on SSDs, regardless of whether they
are mounted multiple times or not. It does happen to all filesystems that
are on rotating disks affected by dm-cache writes, regardless of whether
the filesystem itself uses dmcache or not.

The system in question is currently running 5.6.17, but the same thing
happens with 5.4 and 5.2 kernels, and it might have happened with much
earlier kernels as well, but I didn't have time to report this (as I
secretly hoped newer kernels would fix this, and unclean shutdowns are
rare).

Example btrfs kernel messages for one such unclean boot. This involved
normal boot, followed by an unsuccessful "mount -va" in the emergency shell
(i.e. a second mount failure for the same filesystem), followed by a
successful "mount -va" in the shell.

[  122.856787] BTRFS: device label LOCALVOL devid 1 transid 152865 /dev/mapper/cryptlocalvol scanned by btrfs (727)
[  173.242545] BTRFS info (device dm-7): disk space caching is enabled
[  173.243117] BTRFS info (device dm-7): has skinny extents
[  363.573875] INFO: task mount:1103 blocked for more than 120 seconds.
the above message repeats multiple times, backtrace &c has been removed for clarity
[  484.405875] INFO: task mount:1103 blocked for more than 241 seconds.
[  605.237859] INFO: task mount:1103 blocked for more than 362 seconds.
[  605.252478] INFO: task mount:1211 blocked for more than 120 seconds.
[  726.069900] INFO: task mount:1103 blocked for more than 483 seconds.
[  726.084415] INFO: task mount:1211 blocked for more than 241 seconds.
[  846.901874] INFO: task mount:1103 blocked for more than 604 seconds.
[  846.916431] INFO: task mount:1211 blocked for more than 362 seconds.
[  864.982108] BTRFS error (device dm-7): open_ctree failed
[  867.551400] BTRFS info (device dm-7): turning on sync discard
[  867.551875] BTRFS info (device dm-7): disk space caching is enabled
[  867.552242] BTRFS info (device dm-7): has skinny extents
[  867.565896] BTRFS error (device dm-7): open_ctree failed
[  867.721885] BTRFS info (device dm-7): turning on sync discard
[  867.722341] BTRFS info (device dm-7): disk space caching is enabled
[  867.722691] BTRFS info (device dm-7): has skinny extents
[  871.257020] BTRFS error (device dm-7): super_total_bytes 858976681984 mismatch with fs_devices total_rw_bytes 1717953363968
[  871.257487] BTRFS error (device dm-7): failed to read chunk tree: -22
[  871.269989] BTRFS error (device dm-7): open_ctree failed
[  872.535935] BTRFS info (device dm-7): disk space caching is enabled
[  872.536438] BTRFS info (device dm-7): has skinny extents

Example fstab entries for the mounts above:

/dev/mapper/cryptlocalvol       /localvol       btrfs           defaults,nossd,discard                  0       0
/dev/mapper/cryptlocalvol       /cryptlocalvol  btrfs           defaults,nossd,subvol=/                 0       0
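
(If the parallel mounting turns out to matter, the second mount could be
serialised behind the first for a test with something like this -- an
untested sketch, using the systemd-specific x-systemd.requires-mounts-for
option on the same two entries:)

/dev/mapper/cryptlocalvol       /localvol       btrfs           defaults,nossd,discard                                           0       0
/dev/mapper/cryptlocalvol       /cryptlocalvol  btrfs           defaults,nossd,subvol=/,x-systemd.requires-mounts-for=/localvol  0       0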

I don't need assistance; I merely write this in the hope that this
information helps improve btrfs.

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\


* Re: first mount(s) after unclean shutdown always fail
  2020-07-01  0:51 first mount(s) after unclean shutdown always fail Marc Lehmann
@ 2020-07-01  1:30 ` Qu Wenruo
  2020-07-01 20:14   ` Marc Lehmann
  2020-07-02 18:31 ` Zygo Blaxell
  1 sibling, 1 reply; 11+ messages in thread
From: Qu Wenruo @ 2020-07-01  1:30 UTC (permalink / raw)
  To: Marc Lehmann, linux-btrfs





On 2020/7/1 8:51 AM, Marc Lehmann wrote:
> Hi!
> 
> I have a server with multiple btrfs filesystems and some moderate-sized
> dmcache caches (a few million blocks/100s of GBs).
> 
> When the server has an unclean shutdown, dmcache treats all cached blocks
> as dirty. This has the effect of extremely slow I/O, as dmcache basically
> caches a lot of random I/O, and writing these blocks back to the rotating
> disk backing store can take hours. This, I think, is related to the
> problem.
> 
> When the server is in this condition, then all btrfs filesystems on slow
> stores (regardless of whether they use dmcache or not) fail their first
> mount attempt(s) like this:
> 
>    [  173.243117] BTRFS info (device dm-7): has skinny extents
>    [  864.982108] BTRFS error (device dm-7): open_ctree failed
> 
> Recent kernels sometimes additionally fail like this (super_total_bytes):
> 
>    [  867.721885] BTRFS info (device dm-7): turning on sync discard
>    [  867.722341] BTRFS info (device dm-7): disk space caching is enabled
>    [  867.722691] BTRFS info (device dm-7): has skinny extents
>    [  871.257020] BTRFS error (device dm-7): super_total_bytes 858976681984 mismatch with fs_devices total_rw_bytes 1717953363968
>    [  871.257487] BTRFS error (device dm-7): failed to read chunk tree: -22
>    [  871.269989] BTRFS error (device dm-7): open_ctree failed

This looks like an old fs with some bad accounting numbers.

Have you tried btrfs rescue fix-device-size?

Thanks,
Qu
> 
> All the filesystems in question are mounted twice during normal boots,
> with different subvolumes, and systemd parallelises these mounts. This might
> play a role in these failures.
> 
> Simply trying to mount the filesystems again then (usually) succeeds with
> seemingly no issues, so these are spurious mount failures. These repeated
> mount attempts are also much faster, presumably because a lot of the data
> is already in memory.
> 
> As far as I am concerned, this is 100% reproducible (i.e. it happens on every
> unclean shutdown). It also happens on "old" (4.19 era) filesystems as well as
> on filesystems that have never seen anything older than 5.4 kernels.
> 
> It does _not_ happen with filesystems on SSDs, regardless of whether they
> are mounted multiple times or not. It does happen to all filesystems that
> are on rotating disks affected by dm-cache writes, regardless of whether
> the filesystem itself uses dmcache or not.
> 
> The system in question is currently running 5.6.17, but the same thing
> happens with 5.4 and 5.2 kernels, and it might have happened with much
> earlier kernels as well, but I didn't have time to report this (as I
> secretly hoped newer kernels would fix this, and unclean shutdowns are
> rare).
> 
> Example btrfs kernel messages for one such unclean boot. This involved
> normal boot, followed by an unsuccessful "mount -va" in the emergency shell
> (i.e. a second mount failure for the same filesystem), followed by a
> successful "mount -va" in the shell.
> 
> [  122.856787] BTRFS: device label LOCALVOL devid 1 transid 152865 /dev/mapper/cryptlocalvol scanned by btrfs (727)
> [  173.242545] BTRFS info (device dm-7): disk space caching is enabled
> [  173.243117] BTRFS info (device dm-7): has skinny extents
> [  363.573875] INFO: task mount:1103 blocked for more than 120 seconds.
> the above message repeats multiple times, backtrace &c has been removed for clarity
> [  484.405875] INFO: task mount:1103 blocked for more than 241 seconds.
> [  605.237859] INFO: task mount:1103 blocked for more than 362 seconds.
> [  605.252478] INFO: task mount:1211 blocked for more than 120 seconds.
> [  726.069900] INFO: task mount:1103 blocked for more than 483 seconds.
> [  726.084415] INFO: task mount:1211 blocked for more than 241 seconds.
> [  846.901874] INFO: task mount:1103 blocked for more than 604 seconds.
> [  846.916431] INFO: task mount:1211 blocked for more than 362 seconds.
> [  864.982108] BTRFS error (device dm-7): open_ctree failed
> [  867.551400] BTRFS info (device dm-7): turning on sync discard
> [  867.551875] BTRFS info (device dm-7): disk space caching is enabled
> [  867.552242] BTRFS info (device dm-7): has skinny extents
> [  867.565896] BTRFS error (device dm-7): open_ctree failed
> [  867.721885] BTRFS info (device dm-7): turning on sync discard
> [  867.722341] BTRFS info (device dm-7): disk space caching is enabled
> [  867.722691] BTRFS info (device dm-7): has skinny extents
> [  871.257020] BTRFS error (device dm-7): super_total_bytes 858976681984 mismatch with fs_devices total_rw_bytes 1717953363968
> [  871.257487] BTRFS error (device dm-7): failed to read chunk tree: -22
> [  871.269989] BTRFS error (device dm-7): open_ctree failed
> [  872.535935] BTRFS info (device dm-7): disk space caching is enabled
> [  872.536438] BTRFS info (device dm-7): has skinny extents
> 
> Example fstab entries for the mounts above:
> 
> /dev/mapper/cryptlocalvol       /localvol       btrfs           defaults,nossd,discard                  0       0
> /dev/mapper/cryptlocalvol       /cryptlocalvol  btrfs           defaults,nossd,subvol=/                 0       0
> 
> I don't need assistance; I merely write this in the hope that this
> information helps improve btrfs.
> 




* Re: first mount(s) after unclean shutdown always fail
  2020-07-01  1:30 ` Qu Wenruo
@ 2020-07-01 20:14   ` Marc Lehmann
  2020-07-01 23:45     ` Qu Wenruo
  0 siblings, 1 reply; 11+ messages in thread
From: Marc Lehmann @ 2020-07-01 20:14 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Wed, Jul 01, 2020 at 09:30:25AM +0800, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> This looks like an old fs with some bad accounting numbers.

Yeah, but it's not.

> Have you tried btrfs rescue fix-device-size?

Why would I want to try this?

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\


* Re: first mount(s) after unclean shutdown always fail
  2020-07-01 20:14   ` Marc Lehmann
@ 2020-07-01 23:45     ` Qu Wenruo
  2020-07-01 23:55       ` Marc Lehmann
  0 siblings, 1 reply; 11+ messages in thread
From: Qu Wenruo @ 2020-07-01 23:45 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: linux-btrfs





On 2020/7/2 4:14 AM, Marc Lehmann wrote:
> On Wed, Jul 01, 2020 at 09:30:25AM +0800, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>> This looks like an old fs with some bad accounting numbers.
> 
> Yeah, but it's not.
> 
>> Have you tried btrfs rescue fix-device-size?
> 
> Why would I want to try this?
> 

Read the man page of "btrfs-rescue".




* Re: first mount(s) after unclean shutdown always fail
  2020-07-01 23:45     ` Qu Wenruo
@ 2020-07-01 23:55       ` Marc Lehmann
  2020-07-02  0:02         ` Qu Wenruo
  0 siblings, 1 reply; 11+ messages in thread
From: Marc Lehmann @ 2020-07-01 23:55 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Thu, Jul 02, 2020 at 07:45:07AM +0800, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> On 2020/7/2 4:14 AM, Marc Lehmann wrote:
> > On Wed, Jul 01, 2020 at 09:30:25AM +0800, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> >> This looks like an old fs with some bad accounting numbers.
> > 
> > Yeah, but it's not.
> > 
> >> Have you tried btrfs rescue fix-device-size?
> > 
> > Why would I want to try this?
> 
> Read the man page of "btrfs-rescue".

Well, nothing in there explains why I should use it in my situation, or
what it has to do with the problem I reported, so again, why would I want
to do this?

Also, shouldn't btrfs be fixed instead? I was under the impression that
one of the goals of btrfs is to be safe w.r.t. crashes.

The bug I reported has very little or nothing to do with strict checking.

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\


* Re: first mount(s) after unclean shutdown always fail
  2020-07-01 23:55       ` Marc Lehmann
@ 2020-07-02  0:02         ` Qu Wenruo
  2020-07-02  1:11           ` Marc Lehmann
  0 siblings, 1 reply; 11+ messages in thread
From: Qu Wenruo @ 2020-07-02  0:02 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: linux-btrfs





On 2020/7/2 7:55 AM, Marc Lehmann wrote:
> On Thu, Jul 02, 2020 at 07:45:07AM +0800, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>> On 2020/7/2 4:14 AM, Marc Lehmann wrote:
>>> On Wed, Jul 01, 2020 at 09:30:25AM +0800, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>>> This looks like an old fs with some bad accounting numbers.
>>>
>>> Yeah, but it's not.
>>>
>>>> Have you tried btrfs rescue fix-device-size?
>>>
>>> Why would I want to try this?
>>
>> Read the man page of "btrfs-rescue".
> 
> Well, nothing in there explains why I should use it in my situation, or
> what it has to do with the problem I reported, so again, why would I want
> to do this?

Well, if you want to go this way, let me show the code here.

From fs/btrfs/volumes.c:btrfs_read_chunk_tree():

        if (btrfs_super_total_bytes(fs_info->super_copy) <
            fs_info->fs_devices->total_rw_bytes) {
                btrfs_err(fs_info,
        "super_total_bytes %llu mismatch with fs_devices total_rw_bytes
%llu",
                          btrfs_super_total_bytes(fs_info->super_copy),
                          fs_info->fs_devices->total_rw_bytes);
                ret = -EINVAL;
                goto error;
        }

Doesn't this explain why we abort the mount?

> 
> Also, shouldn't btrfs be fixed instead? I was under the impression that
> one of the goals of btrfs is to be safe w.r.t. crashes.

That's why we provide the btrfs rescue fix-device-size.

We don't want to spend overly complex logic on some already screwed-up
accounting numbers.

Furthermore, a btrfs check run without --repair may expose more problems.
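
Roughly like this (a sketch only -- run it with the filesystem unmounted,
and the device path is just the one from your report):

   # rewrite the per-device size items / super total_bytes if they disagree
   btrfs rescue fix-device-size /dev/mapper/cryptlocalvol

   # read-only check (no --repair) to see whether anything else is wrong
   btrfs check --readonly /dev/mapper/cryptlocalvol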

> 
> The bug I reported has very little or nothing to with strict checking.
> 

I have provided the code to prove why it's related.
Whether you believe it or not is your problem then.

Thanks,
Qu




* Re: first mount(s) after unclean shutdown always fail
  2020-07-02  0:02         ` Qu Wenruo
@ 2020-07-02  1:11           ` Marc Lehmann
  2020-07-02  1:28             ` Qu Wenruo
  0 siblings, 1 reply; 11+ messages in thread
From: Marc Lehmann @ 2020-07-02  1:11 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Thu, Jul 02, 2020 at 08:02:52AM +0800, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> Well, if you want to go this way, let me show the code here.
> 
> From fs/btrfs/volumes.c:btrfs_read_chunk_tree():
> 
>         if (btrfs_super_total_bytes(fs_info->super_copy) <
>             fs_info->fs_devices->total_rw_bytes) {
>                 btrfs_err(fs_info,
>         "super_total_bytes %llu mismatch with fs_devices total_rw_bytes %llu",
>                           btrfs_super_total_bytes(fs_info->super_copy),
>                           fs_info->fs_devices->total_rw_bytes);
>                 ret = -EINVAL;
>                 goto error;
>         }
> 
> Doesn't this explain why we abort the mount?

I wouldn't see how, especially if the code doesn't do anything _unless_ it
also prints the message.

When it doesn't produce the message, all it does is compare two numbers
(unless btrfs_super_total_bytes does something very funny) - how does this
explain that the mount fails, then succeeds, in the cases where the message
is _not_ logged, as reported?

> > Also, shouldn't btrfs be fixed instead? I was under the impression that
> > one of the goals of btrfs is to be safe w.r.t. crashes.
> 
> That's why we provide the btrfs rescue fix-device-size.

Not sure how that follows - there is a bug in the kernel filesystem and
you provide a userspace tool that should be run on every crash, to what
end?

Spurious mount failures are a bug in the btrfs kernel driver.

> > The bug I reported has very little or nothing to do with strict checking.
> 
> I have provided the code to prove why it's related.

The code proves only that you are wrong - the code _always_ prints the
message. Unless btrfs_super_total_bytes does more than just read some
data, it cannot explain the bug I reported, simply because the message is
not always produced, and the mount is not always aborted.

> Whether you believe it or not is your problem then.

No, it's not, simply because I don't have a problem...

btrfs has problems, and I reported one, that's all that has happened.

I slowly get the distinct feeling that reporting bugs in btrfs is a futile
exercise, though.

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\


* Re: first mount(s) after unclean shutdown always fail
  2020-07-02  1:11           ` Marc Lehmann
@ 2020-07-02  1:28             ` Qu Wenruo
  2020-07-02  2:13               ` Marc Lehmann
  0 siblings, 1 reply; 11+ messages in thread
From: Qu Wenruo @ 2020-07-02  1:28 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: linux-btrfs





On 2020/7/2 9:11 AM, Marc Lehmann wrote:
> On Thu, Jul 02, 2020 at 08:02:52AM +0800, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>> Well, if you want to go this way, let me show the code here.
>>
>> From fs/btrfs/volumes.c:btrfs_read_chunk_tree():
>>
>>         if (btrfs_super_total_bytes(fs_info->super_copy) <
>>             fs_info->fs_devices->total_rw_bytes) {
>>                 btrfs_err(fs_info,
>>         "super_total_bytes %llu mismatch with fs_devices total_rw_bytes %llu",
>>                           btrfs_super_total_bytes(fs_info->super_copy),
>>                           fs_info->fs_devices->total_rw_bytes);
>>                 ret = -EINVAL;
>>                 goto error;
>>         }
>>
>> Doesn't this explain why we abort the mount?
> 
> I wouldn't see how, especially if the code doesn't do anything _unless_ it
> also prints the message.
> 
> When it doesn't produce the message, all it does is compare two numbers
> (unless btrfs_super_total_bytes does something very funny) - how does this
> explain that the mount fails, then succeeds, in the cases where the message
> is _not_ logged, as reported?

When the error is logged, this snippet gets triggered and aborts the mount.

And you have reported that this happened at least once.

Then for that case, you should run btrfs rescue fix-device-size.

> 
>>> Also, shouldn't btrfs be fixed instead? I was under the impression that
>>> one of the goals of btrfs is to be safe w.r.t. crashes.
>>
>> That's why we provide the btrfs rescue fix-device-size.
> 
> Not sure how that follows - there is a bug in the kernel filesystem and
> you provide a userspace tool that should be run on every crash, to what
> end?

Nope, it gets executed once and that specific problem will be gone.

As said, that's caused by some older kernel; newer kernels have an extra safety
net to ensure the accounting numbers are safe.

> 
> Spurious mount failures are a bug in the btrfs kernel driver.

Then report them as separate bugs.

The bug behind that message is well known and we have had a solution for a while.

> 
>>> The bug I reported has very little or nothing to do with strict checking.
>>
>> I have provided the code to prove why it's related.
> 
> The code proves only that you are wrong - the code _always_ prints the
> message. Unless btrfs_super_total_bytes does more than just read some
> data, it cannot explain the bug I reported, simply because the message is
> not always produced, and the mount is not always aborted.

Solve one problem and go on to solve the next one.

If you don't even bother with the solution to that specific problem, you
won't bother with any debug procedure provided by any developer.

> 
>> Whether you believe it or not is your problem then.
> 
> No, it's not, simply because I don't have a problem...
> 
> btrfs has problems, and I reported one, that's all that has happened.

You reported several problems without a proper reproducer.

That you can reproduce it on your system is not a proper reproducer.
I provided one solution to one of your problems, you ignored it and that's
your problem.

I don't see any point in debugging bugs reported by someone who doesn't
even want to try a known solution but insists on whatever he believes is
correct.

> 
> I slowly get the distinct feeling that reporting bugs in btrfs is a futile
> exercise, though.
> 




* Re: first mount(s) after unclean shutdown always fail
  2020-07-02  1:28             ` Qu Wenruo
@ 2020-07-02  2:13               ` Marc Lehmann
  0 siblings, 0 replies; 11+ messages in thread
From: Marc Lehmann @ 2020-07-02  2:13 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Thu, Jul 02, 2020 at 09:28:39AM +0800, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> > explain that the mount fails, then succeeds, in the cases where the message
> > is _not_ logged, as reported?
> 
> When the error is logged, this snippet gets triggered and aborts the mount.
> 
> And you have reported that this happened at least once.

I reported that it sometimes happens, sometimes not.

> Then for that case, you should run btrfs rescue fix-device-size.

I have done this in the past, it doesn't fix the issue I reported.

Again, the problem is spurious mount failures, and *sometimes*,
*occasionally*, *not always*, I also get the super_total_bytes message,
but subsequent mount attempts succeed. Until the next crash.

> Nope, it gets executed once and that specific problem will be gone.

Until one of the next unclean shutdowns, yes.

That specific problem, of course, also goes away by doing nothing, as
reported.

That command has no visible effect on the problem.

> As said, that's caused by some older kernel; newer kernels have an extra safety
> net to ensure the accounting numbers are safe.

You haven't said that, but it's wrong either way: linux 5.4 and 5.6 are
NOT older than 4.11 - the code was introduced in 4.11, according to the
btrfs-rescue manpage, while the filesystem has never seen an older kernel
than 5.4, and the problem most recently happened with 5.6.

> > Spurious mount failures are a bug in the btrfs kernel driver.
> 
> Then report them as separate bugs.

That's what I did in the beginning of this thread. I can re-send a copy of
the mail, but that seems silly.

> The bug behind that message is well known and we have had a solution for a while.

And what is the solution? The problem clearly is still in linux 5.6.

> > The code proves only that you are wrong - the code _always_ prints the
> > message. Unless btrfs_super_total_bytes does more than just read some
> > data, it cannot explain the bug I reported, simply because the message is
> > not always produced, and the mount is not always aborted.
> 
> Solve one problem and go on to solve the next one.

Yup. The problem I reported is spurious mount errors after a crash.

> If you don't even bother with the solution to that specific problem, you
> won't bother with any debug procedure provided by any developer.

You haven't given me a solution to any problem I reported, you only made a
lot of provably wrong claims.

> > btrfs has problems, and I reported one, that's all that has happened.
> 
> You reported several problems

No, I only reported one problem, spurious mount failures.

> without a proper reproducer.

Define "proper reproducer"? What's missing?

> That you can reproduce it on your system is not a proper reproducer.

That's fine - but that's all I can do - reproduce it on my systems,
because I don't have any other systems to reproduce it on.

So in effect, what you are saying is that I cannot possibly provide a
"proper reproducer", while at the same time refusing to even think about a
btrfs bug because I didn't provide one.

Doesn't this strike you as farcical? Do you realise that it is impossible to
report bugs according to your standard?

> I provided one solution to one of your problems, you ignored it and that's
> your problem.

No, you haven't. You provided something that you claim would fix it, but
it doesn't fix it, as pointed out multiple times.

> I don't see any point in debugging bugs reported by someone who doesn't
> even want to try a known solution but insists on whatever he believes is
> correct.

You have never provided a solution, of course.

> > I slowly get the distinct feeling that reporting bugs in btrfs is a futile
> > exercise, though.

What is wrong with this list? In the last decades I have reported dozens
of bugs in ext2/3/4, xfs and even udf. In all cases, filesystem developers
were open to fixing bugs and were interested in details. All except one
(which was a known but extremely rare bug) of the bugs I reported have
been fixed.

What is so different with btrfs that each time I report an actual bug, I get
attacked instead?

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\


* Re: first mount(s) after unclean shutdown always fail
  2020-07-01  0:51 first mount(s) after unclean shutdown always fail Marc Lehmann
  2020-07-01  1:30 ` Qu Wenruo
@ 2020-07-02 18:31 ` Zygo Blaxell
  2020-07-03  8:04   ` Marc Lehmann
  1 sibling, 1 reply; 11+ messages in thread
From: Zygo Blaxell @ 2020-07-02 18:31 UTC (permalink / raw)
  To: Marc Lehmann; +Cc: linux-btrfs

On Wed, Jul 01, 2020 at 02:51:16AM +0200, Marc Lehmann wrote:
> Hi!
> 
> I have a server with multiple btrfs filesystems and some moderate-sized
> dmcache caches (a few million blocks/100s of GBs).
> 
> When the server has an unclean shutdown, dmcache treats all cached blocks
> as dirty. This has the effect of extremely slow I/O, as dmcache basically
> caches a lot of random I/O, and writing these blocks back to the rotating
> disk backing store can take hours. This, I think, is related to the
> problem.
> 
> When the server is in this condition, then all btrfs filesystems on slow
> stores (regardless of whether they use dmcache or not) fail their first
> mount attempt(s) like this:
> 
>    [  173.243117] BTRFS info (device dm-7): has skinny extents
>    [  864.982108] BTRFS error (device dm-7): open_ctree failed
> 
> Recent kernels sometimes additionally fail like this (super_total_bytes):
> 
>    [  867.721885] BTRFS info (device dm-7): turning on sync discard
>    [  867.722341] BTRFS info (device dm-7): disk space caching is enabled
>    [  867.722691] BTRFS info (device dm-7): has skinny extents
>    [  871.257020] BTRFS error (device dm-7): super_total_bytes 858976681984 mismatch with fs_devices total_rw_bytes 1717953363968
>    [  871.257487] BTRFS error (device dm-7): failed to read chunk tree: -22
>    [  871.269989] BTRFS error (device dm-7): open_ctree failed
> 
> All the filesystems in question are mounted twice during normal boots,
> with different subvolumes, and systemd parallelises these mounts. This might
> play a role in these failures.

When LVM is manipulating the device-mapper table it can sometimes trigger
udev events and run btrfs dev scan, which temporarily points btrfs to
the wrong device.  

It's possible that the backing device for lvmcache is visible temporarily
during startup (userspace cache_check has to be able to access the cache
layers separately while it runs during lvm startup).  If systemd attempts
to mount the btrfs, and btrfs ends up using the backing store directly
while cache_check is running, then btrfs gets confused when LVM enables
the cache and flips btrfs to the cached LV (not to mention any metadata
inconsistencies from looking at the backing disk that will be missing
unflushed writeback updates).  Normally this isn't a problem if you run
vgchange then btrfs dev scan then mount, but systemd's hyperaggressive
mount execution and potentially incomplete parallel execution dependency
model might make it a problem.
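
For reference, the ordering that avoids that window is roughly the
following (a sketch -- the VG/LV names are made up):

    vgchange -ay vg0            # assemble the LVs, including the cache layers
    udevadm settle              # let udev finish processing the device events
    btrfs device scan           # (re)register btrfs devices on the final dm nodes
    mount /dev/vg0/data /mnt    # only then mount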

btrfs is not blameless here either.  Once a btrfs has been mounted
on a set of devices, it should be difficult or impossible for udev
rules to switch to different devices without umounting, possibly with
narrow exceptions e.g. for filesystems mounted on removable USB media.
It certainly shouldn't happen several times each time lvm reloads the
device-mapper table, as the btrfs signature can appear on a lot of
different dm-table devices.

This happens all the time with pvmove and other LVM volume management
operations, you'll see a handful of btrfs corruption errors scrolling
by while moving btrfs LVs between disks, sometimes also when creating
new LVs.  In one instance, I pvmoved a btrfs raid1 LV and deleted a
swap LV from a disk, then attempted to recreate the swap LV.  The new
swap LV ended up in the same physical position as the btrfs raid1 LV,
which was still mounted and active.  udev immediately flipped btrfs over
to using the swap LV as its raid1 mirror, then the swap LV was formatted
and used, with predictably disastrous results: btrfs raid1 interpreted
the swapped out pages as data corruption errors and repaired them, while
programs were killed because their swapped-out pages were overwritten
with filesystem data and crashed hard when the pages were swapped back in.
A fine example of worst-case udev and LVM behavior.

> Simply trying to mount the filesystems again then (usually) succeeds with
> seemingly no issues, so these are spurious mount failures. These repeated
> mount attempts are also much faster, presumably because a lot of the data
> is already in memory.
> 
> As far as I am concerned, this is 100% reproducible (i.e. it happens on every
> unclean shutdown). It also happens on "old" (4.19 era) filesystems as well as
> on filesystems that have never seen anything older than 5.4 kernels.
> 
> It does _not_ happen with filesystems on SSDs, regardless of whether they
> are mounted multiple times or not. It does happen to all filesystems that
> are on rotating disks affected by dm-cache writes, regardless of whether
> the filesystem itself uses dmcache or not.

Does this mean that you have a rotating disk which contains a dm-cached
filesystem and a btrfs filesystem that does not use dm-cache, and the
btrfs filesystem is affected?  That would dismiss my theory above.
I have no theory to handle that case.

> The system in question is currently running 5.6.17, but the same thing
> happens with 5.4 and 5.2 kernels, and it might have happened with much
> earlier kernels as well, but I didn't have time to report this (as I
> secretly hoped newer kernels would fix this, and unclean shutdowns are
> rare).

FWIW unclean shutdowns are the only kind of shutdown I do with lvmcache
and btrfs (intentionally, in order to detect bugs like this), and I
haven't seen this problem.  I don't run systemd.

> Example btrfs kernel messages for one such unclean boot. This involved
> normal boot, followed by an unsuccessful "mount -va" in the emergency shell
> (i.e. a second mount failure for the same filesystem), followed by a
> successful "mount -va" in the shell.
> 
> [  122.856787] BTRFS: device label LOCALVOL devid 1 transid 152865 /dev/mapper/cryptlocalvol scanned by btrfs (727)
> [  173.242545] BTRFS info (device dm-7): disk space caching is enabled
> [  173.243117] BTRFS info (device dm-7): has skinny extents
> [  363.573875] INFO: task mount:1103 blocked for more than 120 seconds.
> the above message repeats multiple times, backtrace &c has been removed for clarity
> [  484.405875] INFO: task mount:1103 blocked for more than 241 seconds.
> [  605.237859] INFO: task mount:1103 blocked for more than 362 seconds.
> [  605.252478] INFO: task mount:1211 blocked for more than 120 seconds.
> [  726.069900] INFO: task mount:1103 blocked for more than 483 seconds.
> [  726.084415] INFO: task mount:1211 blocked for more than 241 seconds.
> [  846.901874] INFO: task mount:1103 blocked for more than 604 seconds.
> [  846.916431] INFO: task mount:1211 blocked for more than 362 seconds.
> [  864.982108] BTRFS error (device dm-7): open_ctree failed
> [  867.551400] BTRFS info (device dm-7): turning on sync discard
> [  867.551875] BTRFS info (device dm-7): disk space caching is enabled
> [  867.552242] BTRFS info (device dm-7): has skinny extents
> [  867.565896] BTRFS error (device dm-7): open_ctree failed

Have you deleted some log messages there?  There's no explanation for
the first two open_ctree failures.

> [  867.721885] BTRFS info (device dm-7): turning on sync discard
> [  867.722341] BTRFS info (device dm-7): disk space caching is enabled
> [  867.722691] BTRFS info (device dm-7): has skinny extents
> [  871.257020] BTRFS error (device dm-7): super_total_bytes 858976681984 mismatch with fs_devices total_rw_bytes 1717953363968
> [  871.257487] BTRFS error (device dm-7): failed to read chunk tree: -22
> [  871.269989] BTRFS error (device dm-7): open_ctree failed
> [  872.535935] BTRFS info (device dm-7): disk space caching is enabled
> [  872.536438] BTRFS info (device dm-7): has skinny extents
> 
> Example fstab entries for the mounts above:
> 
> /dev/mapper/cryptlocalvol       /localvol       btrfs           defaults,nossd,discard                  0       0
> /dev/mapper/cryptlocalvol       /cryptlocalvol  btrfs           defaults,nossd,subvol=/                 0       0
> 
> I don't need assistance; I merely write this in the hope that this
> information helps improve btrfs.
> 
> -- 
>                 The choice of a       Deliantra, the free code+content MORPG
>       -----==-     _GNU_              http://www.deliantra.net
>       ----==-- _       generation
>       ---==---(_)__  __ ____  __      Marc Lehmann
>       --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
>       -=====/_/_//_/\_,_/ /_/\_\


* Re: first mount(s) after unclean shutdown always fail
  2020-07-02 18:31 ` Zygo Blaxell
@ 2020-07-03  8:04   ` Marc Lehmann
  0 siblings, 0 replies; 11+ messages in thread
From: Marc Lehmann @ 2020-07-03  8:04 UTC (permalink / raw)
  To: Zygo Blaxell; +Cc: linux-btrfs

On Thu, Jul 02, 2020 at 02:31:34PM -0400, Zygo Blaxell <ce3g8jdj@umail.furryterror.org> wrote:
> When LVM is manipulating the device-mapper table it can sometimes trigger
> udev events and run btrfs dev scan, which temporarily points btrfs to
> the wrong device.  
> 
> It's possible that the backing device for lvmcache is visible temporarily
> during startup (userspace cache_check has to be able to access the cache

Ah, that is very interesting, but doesn't apply to this report, as dmcache
is not involved other than in slowing down the system (it is true that
the mount failures *also* happen on dmcache devices, but the example I
reported is for an uncached lv).

Furthermore, the dmcache'd lvs are all encrypted, either using
luks/cryptsetup or by another means, and this should ensure that there can
never be any confusion: btrfs can never see the unencrypted filesystem
data on different dm devices, since at any one time there is either the
dmcrypt device or none, and none of them are managed by automatic systems.

> unflushed writeback updates).  Normally this isn't a problem if you run
> vgchange then btrfs dev scan then mount, but systemd's hyperaggressive
> mount execution and potentially incomplete parallel execution dependency
> model might make it a problem.

In this specific case, the lvs are also activated in the initrd (which
doesn't use systemd), which runs a considerable time before systemd
takes over, due to an interactive password prompt - although the dirty
dmcache might slow the system down enough for it to matter.  But the extra
dmcrypt layer should completely rule out any confusion.

> btrfs is not blameless here either.  Once a btrfs has been mounted
> on a set of devices, it should be difficult or impossible for udev
> rules to switch to different devices without umounting, possibly with
> narrow exceptions e.g. for filesystems mounted on removable USB media.
> It certainly shouldn't happen several times each time lvm reloads the
> device-mapper table, as the btrfs signature can appear on a lot of
> different dm-table devices.

I have seen btrfs scan messages switch multiple times between devices
during boot in the past, so maybe I should look out for this more.

However, unfortunately, it seems this cannot be the explanation in this
case.

For this boot, the only other btrfs messages before the problem happened are:

[  122.856787] BTRFS: device label LOCALVOL devid 1 transid 152865 /dev/mapper/cryptlocalvol scanned by btrfs (727)
[  122.857710] BTRFS: device label ROOT devid 1 transid 862260 /dev/mapper/cryptroot scanned by btrfs (727)
[  122.863350] BTRFS: device fsid a01ee881-c53e-4653-ab6a-33973a07c3e4 devid 1 transid 3426 /dev/sdc scanned by btrfs (727)
[  122.886161] BTRFS info (device dm-6): disk space caching is enabled
[  122.886712] BTRFS info (device dm-6): has skinny extents
[  123.015032] BTRFS info (device dm-6): enabling ssd optimizations

The first message refers to the failing fs. The fsid message refers to a fs
that isn't being mounted at all and the rest refer to the root fs, which is
on an SSD, and never caused trouble.

The only common pattern to this problem I see is that it happens when
mounting takes a very long time - it feels as if there is a timeout
somewhere (not necessarily in btrfs), i.e. it doesn't happen to other
filesystems on the same controller, only filesystems that share a disk
with a busy dmcached fs, when mount times are excessive.
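
(If it really is a timeout, the obvious suspect would be systemd's default
start timeout for the mount job; as a test it could be raised per mount
with something like this -- an untested sketch, the option is from
systemd.mount(5):)

/dev/mapper/cryptlocalvol  /localvol  btrfs  defaults,nossd,discard,x-systemd.mount-timeout=30min  0  0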

> This happens all the time with pvmove and other LVM volume management
> operations, you'll see a handful of btrfs corruption errors scrolling
> by while moving btrfs LVs between disks, sometimes also when creating
> new LVs.

I semi-regularly use pvmove (it's considerably less work than btrfs dev
replace, so tempting for small devices where copying speed is not that
important), and fortunately, this has never happened to me. I would
remember, as BTRFS corruption messages make me squirm considerably.

> Does this mean that you have a rotating disk which contains a dm-cached
> filesystem and a btrfs filesystem that does not use dm-cache, and the
> btrfs filesystem is affected?  That would dismiss my theory above.
> I have no theory to handle that case.

Yes, unfortunately.

In this box, it is now:

(slow rotating raid) -> lvm ->             dmcrypt -> example fs from my report
(slow rotating raid) -> lvm -> lvmcache -> dmcrypt -> another fs

Until recently, I also had a third dmcache'd fs.

The problem did appear for all of them on an unclean shutdown, but most
commonly for the non-dmcache device, likely because it is mounted before the
others, i.e. it is simply the first.

The first filesystem (the one from my example) is mounted via
systemd/fstab, the others use a different mechanism and are mounted later,
and serially.

> > The system in question is currently running 5.6.17, but the same thing
> > happens with 5.4 and 5.2 kernels, and it might have happened with much
> > earlier kernels as well, but I didn't have time to report this (as I
> > secretly hoped newer kernels would fix this, and unclean shutdowns are
> > rare).
> 
> FWIW unclean shutdowns are the only kind of shutdown I do with lvmcache
> and btrfs (intentionally, in order to detect bugs like this), and I
> haven't seen this problem.  I don't run systemd.

I wish I could run without systemd :)

lvmcache is the reason why I didn't report this earlier, though, because
all filesystems once used lvmcache, and I can't distinguish between
dmcache issues (which have eaten my btrfs filesystems multiple times due
to kernel bugs) and btrfs issues.

However, I removed the dmcache on the first filesystem on purpose a few
months ago, to see if it would cause trouble.

I do like to avoid unclean shutdowns, because it means the server is close
to unusable for a rather long time (just booting it can take 15 minutes,
as you can see from the log).

> > [  122.856787] BTRFS: device label LOCALVOL devid 1 transid 152865 /dev/mapper/cryptlocalvol scanned by btrfs (727)
> > [  173.242545] BTRFS info (device dm-7): disk space caching is enabled
> > [  173.243117] BTRFS info (device dm-7): has skinny extents
> > [  363.573875] INFO: task mount:1103 blocked for more than 120 seconds.
> > the above message repeats multiple times, backtrace &c has been removed for clarity
> > [  484.405875] INFO: task mount:1103 blocked for more than 241 seconds.
> > [  605.237859] INFO: task mount:1103 blocked for more than 362 seconds.
> > [  605.252478] INFO: task mount:1211 blocked for more than 120 seconds.
> > [  726.069900] INFO: task mount:1103 blocked for more than 483 seconds.
> > [  726.084415] INFO: task mount:1211 blocked for more than 241 seconds.
> > [  846.901874] INFO: task mount:1103 blocked for more than 604 seconds.
> > [  846.916431] INFO: task mount:1211 blocked for more than 362 seconds.
> > [  864.982108] BTRFS error (device dm-7): open_ctree failed
> > [  867.551400] BTRFS info (device dm-7): turning on sync discard
> > [  867.551875] BTRFS info (device dm-7): disk space caching is enabled
> > [  867.552242] BTRFS info (device dm-7): has skinny extents
> > [  867.565896] BTRFS error (device dm-7): open_ctree failed
> 
> Have you deleted some log messages there?  There's no explanation for
> the first two open_ctree failures.

I only deleted kernel backtraces and register dumps. And yes, the *only*
messages I get when the mount fails are, in most cases:

> > [  173.242545] BTRFS info (device dm-7): disk space caching is enabled
> > [  173.243117] BTRFS info (device dm-7): has skinny extents
> > [  864.982108] BTRFS error (device dm-7): open_ctree failed

I.e. when I run into this problem there are no other messages, other than
mount itself returning an error (when I immediately re-run the mount) -
the last message is the "open_ctree failed" message, with no explanation
on what went wrong.

I only started getting the super_total_bytes mismatch messages recently,
after recreating the filesystems from scratch (and restoring from backup)
under 5.4, and they appear maybe in 20% of the cases or so (the machine
doesn't crash/is being reset as often under linux 5.x as it did under
4.19).

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp@schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\

