* Growing number of "invalid tree nritems" errors
@ 2020-07-05  8:37 Thilo-Alexander Ginkel
  2020-07-05  9:53 ` Qu Wenruo
  2020-07-05 11:15 ` Patrik Lundquist
  0 siblings, 2 replies; 8+ messages in thread
From: Thilo-Alexander Ginkel @ 2020-07-05  8:37 UTC (permalink / raw)
  To: linux-btrfs

Hello everyone,

one of our servers just started producing loads of "BTRFS error
(device dm-0): invalid tree nritems" errors and eventually caught a
hung task (not sure if those are related):

[...]
[126990.493897] BTRFS error (device dm-0): invalid tree nritems,
bytenr=201179545600 nritems=0 expect >0
[127041.504620] BTRFS error (device dm-0): invalid tree nritems,
bytenr=204159336448 nritems=0 expect >0
[127106.733494] BTRFS error (device dm-0): invalid tree nritems,
bytenr=233554296832 nritems=0 expect >0
[127125.504302] BTRFS error (device dm-0): invalid tree nritems,
bytenr=233693298688 nritems=0 expect >0
[127254.512800] BTRFS error (device dm-0): invalid tree nritems,
bytenr=299654774784 nritems=0 expect >0
[127544.739078] BTRFS error (device dm-0): invalid tree nritems,
bytenr=435922501632 nritems=0 expect >0
[127544.739190] BTRFS error (device dm-0): invalid tree nritems,
bytenr=435922714624 nritems=0 expect >0
[...]
[129532.769484] INFO: task kcompactd0:64 blocked for more than 120 seconds.
[129532.769569]       Tainted: G            E    4.15.0-109-generic #110-Ubuntu
[129532.769651] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[129532.769749] kcompactd0      D    0    64      2 0x80000000
[129532.769751] Call Trace:
[129532.769756]  __schedule+0x24e/0x880
[129532.769758]  schedule+0x2c/0x80
[129532.769759]  io_schedule+0x16/0x40
[129532.769761]  __lock_page+0xff/0x140
[129532.769763]  ? page_cache_tree_insert+0xe0/0xe0
[129532.769765]  migrate_pages+0x91f/0xb80
[129532.769766]  ? __ClearPageMovable+0x10/0x10
[129532.769768]  ? isolate_freepages_block+0x3b0/0x3b0
[129532.769769]  compact_zone+0x681/0x950
[129532.769770]  kcompactd_do_work+0xfe/0x2a0
[129532.769772]  ? __switch_to_asm+0x35/0x70
[129532.769773]  ? __switch_to_asm+0x41/0x70
[129532.769774]  kcompactd+0x86/0x1c0
[129532.769775]  ? kcompactd+0x86/0x1c0
[129532.769778]  ? wait_woken+0x80/0x80
[129532.769780]  kthread+0x121/0x140
[129532.769781]  ? kcompactd_do_work+0x2a0/0x2a0
[129532.769782]  ? kthread_create_worker_on_cpu+0x70/0x70
[129532.769783]  ret_from_fork+0x35/0x40

I took the server offline and ran `btrfs check`, which did not bring
up any errors:

# btrfs check -p --check-data-csum /dev/mapper/luks
Opening filesystem to check...
Checking filesystem on /dev/mapper/luks
UUID: b5872f47-c87e-47ac-b036-4f2725cf6dc6
[1/7] checking root items                      (0:00:20 elapsed,
12381226 items checked)
[2/7] checking extents                         (0:05:38 elapsed,
5163753 items checked)
[3/7] checking free space cache                (0:00:12 elapsed, 376
items checked)
[4/7] checking fs roots                        (0:41:33 elapsed,
5021296 items checked)
[5/7] checking csums against data              (0:05:35 elapsed,
3911047 items checked)
[6/7] checking root refs                       (0:00:00 elapsed, 28110
items checked)
[7/7] checking quota groups skipped (not enabled on this FS)
found 292229652480 bytes used, no error found
total csum bytes: 200196548
total tree bytes: 84142292992
total fs tree bytes: 82578096128
total extent tree bytes: 1175896064
btree space waste bytes: 24570610642
file data blocks allocated: 245858725888
 referenced 202896068608

I will be running memtester to make sure the problems are not RAM-related.

Any ideas?

Thanks,
Thilo


* Re: Growing number of "invalid tree nritems" errors
  2020-07-05  8:37 Growing number of "invalid tree nritems" errors Thilo-Alexander Ginkel
@ 2020-07-05  9:53 ` Qu Wenruo
  2020-07-05 10:30   ` Thilo-Alexander Ginkel
  2020-07-05 11:15 ` Patrik Lundquist
  1 sibling, 1 reply; 8+ messages in thread
From: Qu Wenruo @ 2020-07-05  9:53 UTC (permalink / raw)
  To: Thilo-Alexander Ginkel, linux-btrfs





On 2020/7/5 4:37 PM, Thilo-Alexander Ginkel wrote:
> Hello everyone,
> 
> one of our servers just started producing loads of "BTRFS error
> (device dm-0): invalid tree nritems" errors and eventually caught a
> hung task (not sure if those are related):
> 
> [...]
> [126990.493897] BTRFS error (device dm-0): invalid tree nritems,
> bytenr=201179545600 nritems=0 expect >0

This means we got a child tree block whose nritems is 0.

That is never valid for a child tree block, so btrfs warns about it.

Unfortunately, the message doesn't include enough information to pin
down the problem further.

The only good news is that, at this stage, nothing wrong has reached disk,
so the fs should be fine, just as your later btrfs check run shows.
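
For readers unfamiliar with the check, here is a toy Python model of the rule
described above. It is only a sketch, not the kernel code: the function name
and the `parent_has_first_key` flag are hypothetical, and the errno value is
hard-coded on the assumption of a typical Linux system.

```python
# Assumption: EUCLEAN ("Structure needs cleaning") is errno 117 on Linux.
EUCLEAN = 117


def verify_child_nritems(bytenr, nritems, parent_has_first_key=True):
    """Toy model: a child tree block reached via a parent key must not be empty.

    Returns 0 when the block passes, -EUCLEAN when it is rejected.
    """
    if not parent_has_first_key:
        # Assumption: with no key from the parent to compare against,
        # the emptiness check is skipped.
        return 0
    if nritems == 0:
        # Mirrors the wording of the message seen in the log.
        print(f"invalid tree nritems, bytenr={bytenr} nritems=0 expect >0")
        return -EUCLEAN
    return 0
```

The point of the model is that the block is rejected when it is read, which
is why nothing bad needs to have reached disk for the message to appear.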


> [127041.504620] BTRFS error (device dm-0): invalid tree nritems,
> bytenr=204159336448 nritems=0 expect >0
> [127106.733494] BTRFS error (device dm-0): invalid tree nritems,
> bytenr=233554296832 nritems=0 expect >0
> [127125.504302] BTRFS error (device dm-0): invalid tree nritems,
> bytenr=233693298688 nritems=0 expect >0
> [127254.512800] BTRFS error (device dm-0): invalid tree nritems,
> bytenr=299654774784 nritems=0 expect >0
> [127544.739078] BTRFS error (device dm-0): invalid tree nritems,
> bytenr=435922501632 nritems=0 expect >0
> [127544.739190] BTRFS error (device dm-0): invalid tree nritems,
> bytenr=435922714624 nritems=0 expect >0
> [...]
> [129532.769484] INFO: task kcompactd0:64 blocked for more than 120 seconds.
> [129532.769569]       Tainted: G            E    4.15.0-109-generic #110-Ubuntu
> [129532.769651] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [129532.769749] kcompactd0      D    0    64      2 0x80000000
> [129532.769751] Call Trace:
> [129532.769756]  __schedule+0x24e/0x880
> [129532.769758]  schedule+0x2c/0x80
> [129532.769759]  io_schedule+0x16/0x40
> [129532.769761]  __lock_page+0xff/0x140
> [129532.769763]  ? page_cache_tree_insert+0xe0/0xe0
> [129532.769765]  migrate_pages+0x91f/0xb80
> [129532.769766]  ? __ClearPageMovable+0x10/0x10
> [129532.769768]  ? isolate_freepages_block+0x3b0/0x3b0
> [129532.769769]  compact_zone+0x681/0x950
> [129532.769770]  kcompactd_do_work+0xfe/0x2a0
> [129532.769772]  ? __switch_to_asm+0x35/0x70
> [129532.769773]  ? __switch_to_asm+0x41/0x70
> [129532.769774]  kcompactd+0x86/0x1c0
> [129532.769775]  ? kcompactd+0x86/0x1c0
> [129532.769778]  ? wait_woken+0x80/0x80
> [129532.769780]  kthread+0x121/0x140
> [129532.769781]  ? kcompactd_do_work+0x2a0/0x2a0
> [129532.769782]  ? kthread_create_worker_on_cpu+0x70/0x70
> [129532.769783]  ret_from_fork+0x35/0x40
> 
> I took the server offline and ran `btrfs check`, which did not bring
> up any errors:
> 
> # btrfs check -p --check-data-csum /dev/mapper/luks
> Opening filesystem to check...
> Checking filesystem on /dev/mapper/luks
> UUID: b5872f47-c87e-47ac-b036-4f2725cf6dc6
> [1/7] checking root items                      (0:00:20 elapsed,
> 12381226 items checked)
> [2/7] checking extents                         (0:05:38 elapsed,
> 5163753 items checked)
> [3/7] checking free space cache                (0:00:12 elapsed, 376
> items checked)
> [4/7] checking fs roots                        (0:41:33 elapsed,
> 5021296 items checked)
> [5/7] checking csums against data              (0:05:35 elapsed,
> 3911047 items checked)
> [6/7] checking root refs                       (0:00:00 elapsed, 28110
> items checked)
> [7/7] checking quota groups skipped (not enabled on this FS)
> found 292229652480 bytes used, no error found
> total csum bytes: 200196548
> total tree bytes: 84142292992
> total fs tree bytes: 82578096128
> total extent tree bytes: 1175896064
> btree space waste bytes: 24570610642
> file data blocks allocated: 245858725888
>  referenced 202896068608
> 
> I will be running memtester to make sure the problems are not RAM-related.

That would help rule out a RAM-related problem.

> 
> Any ideas?

How reproducible is this?

If it still shows the same symptom after verifying the RAM, would you
please apply this small debug diff on your kernel?

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index c27022f13150..92dd9a3e5644 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -406,8 +406,9 @@ int btrfs_verify_level_key(struct extent_buffer *eb, int level,
        /* We have @first_key, so this @eb must have at least one item */
        if (btrfs_header_nritems(eb) == 0) {
                btrfs_err(fs_info,
-               "invalid tree nritems, bytenr=%llu nritems=0 expect >0",
-                         eb->start);
+               "invalid tree nritems, bytenr=%llu owner=%lld nritems=0 expect >0",
+                         eb->start, btrfs_header_owner(eb));
+               WARN_ON_ONCE(1);
                WARN_ON(IS_ENABLED(CONFIG_BTRFS_DEBUG));
                return -EUCLEAN;
        }



It would provide:
- Which tree owns the offending tree block
  If it's an essential tree, it should never be empty, so this would be
  a real problem rather than a false alert.

- The call trace of the first occurrence
  This may give some info on how it's happening.
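
As an aid for reading the `owner=` value the debug diff would print, here is
a small Python sketch that maps a root objectid to a tree name. The table
lists the well-known btrfs root objectids; the function name and the wording
of the returned strings are made up for illustration, and the negative branch
assumes the value is printed signed, as `%lld` in the diff would do.

```python
# Well-known btrfs root objectids (as defined in the kernel's btrfs_tree.h).
WELL_KNOWN_TREES = {
    1: "root tree",
    2: "extent tree",
    3: "chunk tree",
    4: "device tree",
    5: "top-level fs tree",
    7: "checksum tree",
    8: "quota tree",
    9: "UUID tree",
    10: "free space tree",
}

# Subvolume and snapshot trees use objectids starting at 256.
FIRST_FREE_OBJECTID = 256


def describe_owner(owner):
    """Return a human-readable guess for a signed owner objectid."""
    if owner in WELL_KNOWN_TREES:
        return WELL_KNOWN_TREES[owner]
    if owner >= FIRST_FREE_OBJECTID:
        return f"subvolume or snapshot tree {owner}"
    if owner < 0:
        # Internal trees (log, relocation, ...) use small negative ids.
        return f"special internal tree (objectid {owner})"
    return f"unknown objectid {owner}"
```

For example, `describe_owner(2)` yields "extent tree", one of the essential
trees that should never be empty.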

Thanks,
Qu

> 
> Thanks,
> Thilo
> 




* Re: Growing number of "invalid tree nritems" errors
  2020-07-05  9:53 ` Qu Wenruo
@ 2020-07-05 10:30   ` Thilo-Alexander Ginkel
  2020-07-05 12:10     ` Qu Wenruo
  0 siblings, 1 reply; 8+ messages in thread
From: Thilo-Alexander Ginkel @ 2020-07-05 10:30 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Sun, Jul 5, 2020 at 11:53 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> How reproducible is this?

I did some log analysis: The problem started showing up on two of
three servers starting July 3rd, 2020. This coincides with an applied
Ubuntu Linux kernel update to 4.15.0-109-generic whose changelog shows
plenty of btrfs changes:
https://launchpad.net/ubuntu/+source/linux/4.15.0-109.110

Server #2 (still online) shows 16 error messages in its log since
2020-07-03 whereas server #3 shows 310 error messages.
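
A log analysis like the one above could be scripted roughly as follows. This
is only a sketch: the function name is hypothetical, and the regular
expression assumes the message appears on a single line, as the kernel prints
it (the copies in this mail are wrapped by the mail client).

```python
import re
from collections import Counter

# Matches e.g.
# "[126990.493897] BTRFS error (device dm-0): invalid tree nritems, bytenr=201179545600 nritems=0 expect >0"
NRITEMS_RE = re.compile(
    r"BTRFS error \(device (?P<dev>[^)]+)\): invalid tree nritems, "
    r"bytenr=(?P<bytenr>\d+) nritems=0"
)


def summarize_nritems_errors(lines):
    """Return (error count per device, sorted distinct bytenr values)."""
    per_device = Counter()
    bytenrs = set()
    for line in lines:
        match = NRITEMS_RE.search(line)
        if match:
            per_device[match.group("dev")] += 1
            bytenrs.add(int(match.group("bytenr")))
    return per_device, sorted(bytenrs)
```

Feeding it the lines of a kernel log file (any iterable of strings works)
gives one count per device, which makes the servers easy to compare.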

One thing special about server #3 is that its btrfs file system has a
huge metadata section (probably due to it hosting many [~50 million]
small files), which doesn't seem too healthy:

# btrfs filesystem usage /mnt
Overall:
    Device size:                 476.30GiB
    Device allocated:            372.02GiB
    Device unallocated:          104.28GiB
    Device missing:                  0.00B
    Used:                        272.16GiB
    Free (estimated):            194.49GiB      (min: 194.49GiB)
    Data ratio:                       1.00
    Metadata ratio:                   1.00
    Global reserve:              512.00MiB      (used: 0.00B)

Data,single: Size:284.01GiB, Used:193.80GiB
   /dev/mapper/luks      284.01GiB

Metadata,single: Size:88.01GiB, Used:78.36GiB
   /dev/mapper/luks       88.01GiB

System,single: Size:4.00MiB, Used:80.00KiB
   /dev/mapper/luks        4.00MiB

Unallocated:
   /dev/mapper/luks      104.28GiB

> If it still shows the same symptom after verifying the RAM, would you
> please apply this small debug diff on your kernel?

I'll see what I can do.

Thanks,
Thilo


* Re: Growing number of "invalid tree nritems" errors
  2020-07-05  8:37 Growing number of "invalid tree nritems" errors Thilo-Alexander Ginkel
  2020-07-05  9:53 ` Qu Wenruo
@ 2020-07-05 11:15 ` Patrik Lundquist
  1 sibling, 0 replies; 8+ messages in thread
From: Patrik Lundquist @ 2020-07-05 11:15 UTC (permalink / raw)
  To: Thilo-Alexander Ginkel; +Cc: Linux Btrfs

On Sun, 5 Jul 2020 at 10:44, Thilo-Alexander Ginkel <thilo@ginkel.com> wrote:
>
> [129532.769569]       Tainted: G            E    4.15.0-109-generic #110-Ubuntu

That's a pretty old kernel from a Btrfs perspective. I'd consider
installing the Ubuntu HWE kernel which is currently at Linux 5.3.

https://wiki.ubuntu.com/Kernel/LTSEnablementStack


* Re: Growing number of "invalid tree nritems" errors
  2020-07-05 10:30   ` Thilo-Alexander Ginkel
@ 2020-07-05 12:10     ` Qu Wenruo
  2020-07-05 13:24       ` Thilo-Alexander Ginkel
  0 siblings, 1 reply; 8+ messages in thread
From: Qu Wenruo @ 2020-07-05 12:10 UTC (permalink / raw)
  To: Thilo-Alexander Ginkel; +Cc: linux-btrfs





On 2020/7/5 6:30 PM, Thilo-Alexander Ginkel wrote:
> On Sun, Jul 5, 2020 at 11:53 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>> How reproducible is this?
> 
> I did some log analysis: The problem started showing up on two of
> three servers starting July 3rd, 2020. This coincides with an applied
> Ubuntu Linux kernel update to 4.15.0-109-generic whose changelog shows
> plenty of btrfs changes:
> https://launchpad.net/ubuntu/+source/linux/4.15.0-109.110

So it backported all the stricter self-checks from recent kernels.

That's great for exposing any unexpected metadata.
Although a backport may itself sometimes introduce new bugs (very rare),
especially in heavily backported kernels.

So if possible, trying an upstream kernel is also a way to test whether
something is really wrong.

Another factor is the btrfs-progs version, which normally gets fewer
backports, while upstream normally has stricter checks overall.
So trying upstream btrfs check would also be a good idea if possible.

> 
> Server #2 (still online) shows 16 error messages in its log since
> 2020-07-03 whereas server #3 shows 310 error messages.

Then it shouldn't be a hardware problem unless all servers have the same
problem.

In such cases, I would recommend trying upstream kernels first,
especially when heavily backported kernels are involved.

If you can reproduce it with an upstream kernel, then I strongly recommend
applying the diff mentioned above to provide more debug info, as it could
still be a false alert.

> 
> One thing special about server #3 is that its btrfs file system has a
> huge metadata section (probably due to it hosting many [~50 million]
> small files), which doesn't seem too healthy:
> 
> # btrfs filesystem usage /mnt
> Overall:
>     Device size:                 476.30GiB
>     Device allocated:            372.02GiB
>     Device unallocated:          104.28GiB
>     Device missing:                  0.00B
>     Used:                        272.16GiB
>     Free (estimated):            194.49GiB      (min: 194.49GiB)
>     Data ratio:                       1.00
>     Metadata ratio:                   1.00
>     Global reserve:              512.00MiB      (used: 0.00B)
> 
> Data,single: Size:284.01GiB, Used:193.80GiB
>    /dev/mapper/luks      284.01GiB
> 
> Metadata,single: Size:88.01GiB, Used:78.36GiB
>    /dev/mapper/luks       88.01GiB

In fact, your metadata is not that unhealthy.

> 
> System,single: Size:4.00MiB, Used:80.00KiB
>    /dev/mapper/luks        4.00MiB
> 
> Unallocated:
>    /dev/mapper/luks      104.28GiB

And there is plenty of unallocated space, so your fs looks pretty healthy.
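
One way to read the numbers quoted above is as simple ratios. The sketch
below only redoes the arithmetic on the figures from the `btrfs filesystem
usage` output; the helper name is made up, and the ratios themselves are not
a health criterion btrfs defines.

```python
def usage_ratio(used_gib, size_gib):
    """Fraction of size_gib that is in use."""
    return used_gib / size_gib


# Figures taken from the quoted `btrfs filesystem usage /mnt` output.
metadata_fill = usage_ratio(78.36, 88.01)         # metadata used / allocated
unallocated_share = usage_ratio(104.28, 476.30)   # unallocated / device size

print(f"metadata chunks are {metadata_fill:.1%} full")
print(f"{unallocated_share:.1%} of the device is still unallocated")
```

The metadata chunks are roughly 89% full, but more than a fifth of the device
is unallocated, so new metadata chunks can still be allocated when needed.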

Thanks,
Qu
> 
>> If it still shows the same symptom after verifying the RAM, would you
>> please apply this small debug diff on your kernel?
> 
> I'll see what I can do.
> 
> Thanks,
> Thilo
> 




* Re: Growing number of "invalid tree nritems" errors
  2020-07-05 12:10     ` Qu Wenruo
@ 2020-07-05 13:24       ` Thilo-Alexander Ginkel
  2020-07-06 20:19         ` Thilo-Alexander Ginkel
  0 siblings, 1 reply; 8+ messages in thread
From: Thilo-Alexander Ginkel @ 2020-07-05 13:24 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Sun, Jul 5, 2020 at 2:10 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> > I did some log analysis: The problem started showing up on two of
> > three servers starting July 3rd, 2020. This coincides with an applied
> > Ubuntu Linux kernel update to 4.15.0-109-generic whose changelog shows
> > plenty of btrfs changes:
> > https://launchpad.net/ubuntu/+source/linux/4.15.0-109.110
>
> So it backported all the stricter self-checks from recent kernels.
>
> That's great for exposing any unexpected metadata.
> Although a backport may itself sometimes introduce new bugs (very rare),
> especially in heavily backported kernels.
>
> So if possible, trying an upstream kernel is also a way to test whether
> something is really wrong.

I took Patrik Lundquist's advice and upgraded to the latest HWE
kernel, which is based on 5.3.0. I'll follow your suggestions if the
problem manifests again.

> > Server #2 (still online) shows 16 error messages in its log since
> > 2020-07-03 whereas server #3 shows 310 error messages.
>
> Then it shouldn't be a hardware problem unless all servers have the same
> problem.

ACK. Memory test also came back negative.

> > One thing special about server #3 is that its btrfs file system has a
> > huge metadata section (probably due to it hosting many [~50 million]
> > small files), which doesn't seem too healthy:
> >
> > # btrfs filesystem usage /mnt
> > Overall:
> >     Device size:                 476.30GiB
> >     Device allocated:            372.02GiB
> >     Device unallocated:          104.28GiB
> >     Device missing:                  0.00B
> >     Used:                        272.16GiB
> >     Free (estimated):            194.49GiB      (min: 194.49GiB)
> >     Data ratio:                       1.00
> >     Metadata ratio:                   1.00
> >     Global reserve:              512.00MiB      (used: 0.00B)
> >
> > Data,single: Size:284.01GiB, Used:193.80GiB
> >    /dev/mapper/luks      284.01GiB
> >
> > Metadata,single: Size:88.01GiB, Used:78.36GiB
> >    /dev/mapper/luks       88.01GiB
>
> In fact, your metadata is not that unhealthy.

All right, thanks for pointing this out. The other servers have ~ 1.5
GB allocated for metadata, so this seemed way off (but can probably be
explained by the vastly different file system usage on #3).

Thanks,
Thilo


* Re: Growing number of "invalid tree nritems" errors
  2020-07-05 13:24       ` Thilo-Alexander Ginkel
@ 2020-07-06 20:19         ` Thilo-Alexander Ginkel
  2020-07-06 22:44           ` Qu Wenruo
  0 siblings, 1 reply; 8+ messages in thread
From: Thilo-Alexander Ginkel @ 2020-07-06 20:19 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Sun, Jul 5, 2020 at 3:24 PM Thilo-Alexander Ginkel <thilo@ginkel.com> wrote:
> On Sun, Jul 5, 2020 at 2:10 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> > So if possible, trying an upstream kernel is also a way to test whether
> > something is really wrong.
>
> I took Patrik Lundquist's advice and upgraded to the latest HWE
> kernel, which is based on 5.3.0. I'll follow your suggestions if the
> problem manifests again.

Good news, the problem is gone after upgrading to 5.3.0 (the most
recent Ubuntu 18.04 HWE kernel).

Thanks for your support!

Regards,
Thilo


* Re: Growing number of "invalid tree nritems" errors
  2020-07-06 20:19         ` Thilo-Alexander Ginkel
@ 2020-07-06 22:44           ` Qu Wenruo
  0 siblings, 0 replies; 8+ messages in thread
From: Qu Wenruo @ 2020-07-06 22:44 UTC (permalink / raw)
  To: Thilo-Alexander Ginkel; +Cc: linux-btrfs





On 2020/7/7 4:19 AM, Thilo-Alexander Ginkel wrote:
> On Sun, Jul 5, 2020 at 3:24 PM Thilo-Alexander Ginkel <thilo@ginkel.com> wrote:
>> On Sun, Jul 5, 2020 at 2:10 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>> So if possible, trying an upstream kernel is also a way to test whether
>>> something is really wrong.
>>
>> I took Patrik Lundquist's advice and upgraded to the latest HWE
>> kernel, which is based on 5.3.0. I'll follow your suggestions if the
>> problem manifests again.
> 
> Good news, the problem is gone after upgrading to 5.3.0 (the most
> recent Ubuntu 18.04 HWE kernel).

Then it would be a good idea to report this incident to the Ubuntu team, to
let them know about the backport problem.

Thanks,
Qu
> 
> Thanks for your support!
> 
> Regards,
> Thilo
> 



