Re: 'watch btrfs fi show' crash while 'btrfs device delete'

From: Peter Hjalmarsson @ 2019-06-01 12:35 UTC
  To: lists; +Cc: linux-btrfs

Hi,

I was the one who reported the issue to the Red Hat Bugzilla, and I
was able to reproduce it as well.

The problem is related to resizing a btrfs filesystem (at least with
the help of "btrfs dev del") and running "btrfs fi sh" at the same
moment the size changes.
Something in the logic for "btrfs filesystem show" seems to run some
checks against the size of the filesystem, and if their results
mismatch (as with a before and after removing a device), the btrfs
tool dies with SIGABRT.
I think the btrfs tool could handle this more gracefully, for example
by printing a message such as "device resize in progress" instead of
aborting.

The original system is an x86_64-based machine with a couple of HDDs
in a btrfs raid setup.

I was able to reproduce this on the following test system:
A Raspberry Pi 3 running Fedora 30 aarch64 from SD card
Two HDDs, each partitioned into two equal-sized partitions, for a
total of four partitions
The test system was used because with SSDs or RAM-based storage it
seems to be harder to hit this (possibly due to the access speeds
involved).

After this I ran:
# mkfs.btrfs -d raid1 -m raid1 /dev/sd[a-b][1-2]
<..>
# mkdir /mnt/test && mount /dev/sda1 /mnt/test
# btrfs fi df /mnt/test/
Data, RAID1: total=1.00GiB, used=0.00B
System, RAID1: total=8.00MiB, used=16.00KiB
Metadata, RAID1: total=1.00GiB, used=112.00KiB
GlobalReserve, single: total=16.00MiB, used=0.00B
# btrfs fi sh
Label: none  uuid: c34e4190-674b-4111-ba37-8128c1f120f4
        Total devices 4 FS bytes used 128.00KiB
        devid    1 size 149.04GiB used 1.00GiB path /dev/sda1
        devid    2 size 149.04GiB used 1.00GiB path /dev/sda2
        devid    3 size 149.04GiB used 1.01GiB path /dev/sdb1
        devid    4 size 149.04GiB used 1.01GiB path /dev/sdb2
# btrfs dev del /dev/sda2 /mnt/test
# btrfs dev del /dev/sdb2 /mnt/test
# btrfs dev add /dev/sda2 /mnt/test
# btrfs dev add /dev/sdb2 /mnt/test
# btrfs fi sh
Label: none  uuid: c34e4190-674b-4111-ba37-8128c1f120f4
        Total devices 4 FS bytes used 640.00KiB
        devid    1 size 149.04GiB used 2.03GiB path /dev/sda1
        devid    3 size 149.04GiB used 2.03GiB path /dev/sdb1
        devid    4 size 149.04GiB used 0.00B path /dev/sda2
        devid    5 size 149.04GiB used 0.00B path /dev/sdb2

This setup makes it possible to maximize the number of device
add/remove operations on a volume, as removing either sda2 or sdb2
does not require moving any large amount of data, and the add/remove
seems to be what triggers the behaviour in "btrfs fi sh".
I then ditched "watch" and ran a simple "while true" loop for "btrfs
fi sh" instead, so that a device add/remove cannot slip by during
watch's 2 s sleep.

So after this I start the following in one shell:
---
i="0"
while true
do echo $((i++))
btrfs dev del /dev/sda2 /mnt/test/
btrfs dev add /dev/sda2 /mnt/test/
btrfs dev del /dev/sdb2 /mnt/test/
btrfs dev add /dev/sdb2 /mnt/test/
done
---

And in the other shell:
---
while btrfs fi sh ; do true ; done
---
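
A possible variant of the second loop, as a sketch only: it counts
the iterations and reports how the command died, which makes it
easier to compare runs. The helper name run_until_failure is made up
for this mail and is not part of btrfs-progs; it assumes a shell with
"local" and "kill -l <number>" (bash and dash both have them).

```shell
#!/bin/bash
# Run the given command repeatedly until it fails, then report the
# iteration number and whether it exited normally or was killed by a signal.
run_until_failure() {
    local n=0 status
    while true; do
        n=$((n + 1))
        # Output of the command itself is discarded; only its status matters.
        "$@" >/dev/null 2>&1
        status=$?
        if [ "$status" -ne 0 ]; then
            if [ "$status" -gt 128 ]; then
                # The shell reports death by signal N as status 128 + N.
                echo "iteration $n: killed by SIG$(kill -l "$((status - 128))")"
            else
                echo "iteration $n: exit status $status"
            fi
            return "$status"
        fi
    done
}

# Usage (as root, with the add/remove loop running in another shell):
#   run_until_failure btrfs fi sh
```

On Linux SIGABRT is signal 6, so an abort like the ones described in
this mail would show up as status 134.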

Often the last command does not need to run more than 3 to 4 times
before it prints a message like the following:

corrupted size vs. prev_size
Aborted (core dumped)

This one leaves the following in the journal:
May 24 13:57:14 localhost systemd-coredump[3198]: Process 3193 (btrfs) of user 0 dumped core.

    Stack trace of thread 3193:
    #0  0x0000ffffb669fca0 raise (libc.so.6)
    #1  0x0000ffffb668daa8 abort (libc.so.6)
    #2  0x0000ffffb66d9a0c __libc_message (libc.so.6)
    #3  0x0000ffffb66dffd4 malloc_printerr (libc.so.6)
    #4  0x0000ffffb66e0730 unlink_chunk.isra.0 (libc.so.6)
    #5  0x0000ffffb66e193c _int_free (libc.so.6)
    #6  0x0000ffffb6709c40 closedir (libc.so.6)
    #7  0x0000aaaab1debf48 close_file_or_dir (btrfs)
    #8  0x0000aaaab1dece00 get_fs_info (btrfs)
    #9  0x0000aaaab1e027cc btrfs_scan_kernel (btrfs)
    #10 0x0000aaaab1dcc8dc main (btrfs)
    #11 0x0000ffffb668deec __libc_start_main (libc.so.6)
    #12 0x0000aaaab1dccad8 .annobin_stubs.c_end.startup (btrfs)
    #13 0x0000aaaab1dccad8 .annobin_stubs.c_end.startup (btrfs)
May 24 13:57:14 localhost kernel: BTRFS info (device sda1): device deleted: /dev/sdb2

I also get this from time to time:

free(): invalid next size (normal)
Aborted (core dumped)

May 24 14:02:32 localhost systemd-coredump[5153]: Process 5148 (btrfs) of user 0 dumped core.

    Stack trace of thread 5148:
    #0  0x0000ffffa1491ca0 raise (libc.so.6)
    #1  0x0000ffffa147faa8 abort (libc.so.6)
    #2  0x0000ffffa14cba0c __libc_message (libc.so.6)
    #3  0x0000ffffa14d1fd4 malloc_printerr (libc.so.6)
    #4  0x0000ffffa14d3920 _int_free (libc.so.6)
    #5  0x0000aaaac0ed18c8 btrfs_scan_kernel (btrfs)
    #6  0x0000aaaac0e9b8dc main (btrfs)
    #7  0x0000ffffa147feec __libc_start_main (libc.so.6)
    #8  0x0000aaaac0e9bad8 .annobin_stubs.c_end.startup (btrfs)
    #9  0x0000aaaac0e9bad8 .annobin_stubs.c_end.startup (btrfs)
May 24 14:02:32 localhost kernel: BTRFS info (device sda1): device deleted: /dev/sda2

Please ask if you need more info.

Best regards,
Peter
