linux-btrfs.vger.kernel.org archive mirror
* Re: 'watch btrfs fi show' crash while 'btrfs device delete'
@ 2019-06-01 12:35 Peter Hjalmarsson
  2019-06-03  1:35 ` Su Yue
  0 siblings, 1 reply; 3+ messages in thread
From: Peter Hjalmarsson @ 2019-06-01 12:35 UTC (permalink / raw)
  To: lists; +Cc: linux-btrfs

Hi,

I was the one who reported the issue to the Red Hat Bugzilla, and I was
able to reproduce it as well.

The problem is related to resizing a btrfs filesystem (at least with
the help of "btrfs dev del") and running "btrfs fi sh" at the same
time as the size changes.
Something in the logic for "btrfs filesystem show" seems to run some
tests against the size of the filesystem, and if their results do not
match (as they would before and after removing a device), the btrfs
tool dies with SIGABRT.
I think the btrfs tool could handle this more gracefully, for example
by printing a message such as "device resize in progress" instead of
aborting.

The original system is an x86_64-based machine with a couple of HDDs in
a btrfs raid setup.

I was able to reproduce this on the following test system:
A Raspberry Pi 3 running Fedora 30 aarch64 from an SD card
Two HDDs, each split into two equal-sized partitions, for a total of
four partitions
I used this test system because with SSDs or RAM-based storage it
seems harder to hit this (possibly due to the access speeds
involved).

After this I ran:
# mkfs.btrfs -d raid1 -m raid1 /dev/sd[a-b][1-2]
<..>
# mkdir /mnt/test && mount /dev/sda1 /mnt/test
# btrfs fi df /mnt/test/
Data, RAID1: total=1.00GiB, used=0.00B
System, RAID1: total=8.00MiB, used=16.00KiB
Metadata, RAID1: total=1.00GiB, used=112.00KiB
GlobalReserve, single: total=16.00MiB, used=0.00B
# btrfs fi sh
Label: none  uuid: c34e4190-674b-4111-ba37-8128c1f120f4
        Total devices 4 FS bytes used 128.00KiB
        devid    1 size 149.04GiB used 1.00GiB path /dev/sda1
        devid    2 size 149.04GiB used 1.00GiB path /dev/sda2
        devid    3 size 149.04GiB used 1.01GiB path /dev/sdb1
        devid    4 size 149.04GiB used 1.01GiB path /dev/sdb2
# btrfs dev del /dev/sda2 /mnt/test
# btrfs dev del /dev/sdb2 /mnt/test
# btrfs dev add /dev/sda2 /mnt/test
# btrfs dev add /dev/sdb2 /mnt/test
# btrfs fi sh
Label: none  uuid: c34e4190-674b-4111-ba37-8128c1f120f4
        Total devices 4 FS bytes used 640.00KiB
        devid    1 size 149.04GiB used 2.03GiB path /dev/sda1
        devid    3 size 149.04GiB used 2.03GiB path /dev/sdb1
        devid    4 size 149.04GiB used 0.00B path /dev/sda2
        devid    5 size 149.04GiB used 0.00B path /dev/sdb2


This layout makes it possible to maximize the number of device
add/remove operations on the volume, as removing either sda2 or sdb2
does not require moving any large amount of data, and the add/remove
seems to be what triggers the behaviour in "btrfs fi sh".
I then ditched "watch" and ran a simple "while true" loop around "btrfs
fi sh", so that a device add/remove cannot slip by entirely during
watch's 2 s sleep.

So after this I start the following in one shell:
---
i="0"
while true
do echo $((i++))
btrfs dev del /dev/sda2 /mnt/test/
btrfs dev add /dev/sda2 /mnt/test/
btrfs dev del /dev/sdb2 /mnt/test/
btrfs dev add /dev/sdb2 /mnt/test/
done
---

And in the other shell:
---
while btrfs fi sh ; do true ; done
---


Often the "btrfs fi sh" loop needs no more than 3 to 4 iterations
before it dies with a message like the following:

corrupted size vs. prev_size
Aborted (core dumped)

This one leaves the following in the journal:
May 24 13:57:14 localhost systemd-coredump[3198]: Process 3193 (btrfs) of user 0 dumped core.

    Stack trace of thread 3193:
    #0  0x0000ffffb669fca0 raise (libc.so.6)
    #1  0x0000ffffb668daa8 abort (libc.so.6)
    #2  0x0000ffffb66d9a0c __libc_message (libc.so.6)
    #3  0x0000ffffb66dffd4 malloc_printerr (libc.so.6)
    #4  0x0000ffffb66e0730 unlink_chunk.isra.0 (libc.so.6)
    #5  0x0000ffffb66e193c _int_free (libc.so.6)
    #6  0x0000ffffb6709c40 closedir (libc.so.6)
    #7  0x0000aaaab1debf48 close_file_or_dir (btrfs)
    #8  0x0000aaaab1dece00 get_fs_info (btrfs)
    #9  0x0000aaaab1e027cc btrfs_scan_kernel (btrfs)
    #10 0x0000aaaab1dcc8dc main (btrfs)
    #11 0x0000ffffb668deec __libc_start_main (libc.so.6)
    #12 0x0000aaaab1dccad8 .annobin_stubs.c_end.startup (btrfs)
    #13 0x0000aaaab1dccad8 .annobin_stubs.c_end.startup (btrfs)
May 24 13:57:14 localhost kernel: BTRFS info (device sda1): device deleted: /dev/sdb2


I also get this from time to time:

free(): invalid next size (normal)
Aborted (core dumped)

May 24 14:02:32 localhost systemd-coredump[5153]: Process 5148 (btrfs) of user 0 dumped core.

    Stack trace of thread 5148:
    #0  0x0000ffffa1491ca0 raise (libc.so.6)
    #1  0x0000ffffa147faa8 abort (libc.so.6)
    #2  0x0000ffffa14cba0c __libc_message (libc.so.6)
    #3  0x0000ffffa14d1fd4 malloc_printerr (libc.so.6)
    #4  0x0000ffffa14d3920 _int_free (libc.so.6)
    #5  0x0000aaaac0ed18c8 btrfs_scan_kernel (btrfs)
    #6  0x0000aaaac0e9b8dc main (btrfs)
    #7  0x0000ffffa147feec __libc_start_main (libc.so.6)
    #8  0x0000aaaac0e9bad8 .annobin_stubs.c_end.startup (btrfs)
    #9  0x0000aaaac0e9bad8 .annobin_stubs.c_end.startup (btrfs)
May 24 14:02:32 localhost kernel: BTRFS info (device sda1): device deleted: /dev/sda2
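
For anyone repeating the test: a plain "while btrfs fi sh" loop does not
record how many runs succeeded or how the last one ended. A minimal
sketch of a wrapper that does, assuming bash or dash; "run_until_fail"
is a hypothetical helper name and not part of btrfs-progs:

```shell
# Hypothetical helper: repeat a command until it exits non-zero, then
# report how many runs succeeded and the failing exit status.
run_until_fail() {
    runs=0
    while :; do
        if "$@" >/dev/null 2>&1; then
            runs=$((runs + 1))
        else
            status=$?      # exit status of the command that just failed
            break
        fi
    done
    echo "stopped after $runs successful runs, exit status $status"
    # bash and dash report death by signal N as exit status 128 + N,
    # so SIGABRT (signal 6 on Linux) shows up as 134.
    if [ "$status" -gt 128 ]; then
        echo "terminated by signal $((status - 128))"
    fi
}
```

In the second shell it would be started as "run_until_fail btrfs fi sh"
while the add/remove loop runs in the first one.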

Please ask if you need more info.

Best regards,
Peter


* Re: 'watch btrfs fi show' crash while 'btrfs device delete'
  2019-06-01 12:35 'watch btrfs fi show' crash while 'btrfs device delete' Peter Hjalmarsson
@ 2019-06-03  1:35 ` Su Yue
  0 siblings, 0 replies; 3+ messages in thread
From: Su Yue @ 2019-06-03  1:35 UTC (permalink / raw)
  To: Peter Hjalmarsson, lists; +Cc: linux-btrfs



On 2019/6/1 8:35 PM, Peter Hjalmarsson wrote:
> Hi,
>
> I was the one reporting the issue to the Red Hat Bugzilla, and was
> able to reproduce it as well
>

Thanks for the report. I reproduced the bug by following your steps.
I just sent a patch named "btrfs-progs: fix invalid memory write in
get_fs_info()". It should fix the bug (at least it does on my test
machine).

Could you try it? It shouldn't take much time, since the fix is only
in btrfs-progs.
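
One possible way to try a rebuilt binary without installing it is to
point the "btrfs fi sh" loop from the report at it directly. A sketch
only: the BTRFS_BIN variable and the fi_show_rounds function are
illustrative names, and the path to the rebuilt binary is whatever
your build tree produced:

```shell
# Assumption: BTRFS_BIN holds the path of the binary under test;
# it falls back to the installed "btrfs" when unset.
BTRFS_BIN=${BTRFS_BIN:-btrfs}

# Run "filesystem show" a fixed number of times and stop at the first
# failure, passing that run's exit status back to the caller.
fi_show_rounds() {
    rounds=$1
    i=0
    while [ "$i" -lt "$rounds" ]; do
        "$BTRFS_BIN" filesystem show >/dev/null || return $?
        i=$((i + 1))
    done
    echo "$rounds runs completed without a crash"
}
```

Running, say, "fi_show_rounds 1000" while the device add/remove loop is
active in another shell should then either print the summary line or
stop with the failing run's exit status.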


---
Su




* 'watch btrfs fi show' crash while 'btrfs device delete'
@ 2019-05-20 19:23 Chris Murphy
  0 siblings, 0 replies; 3+ messages in thread
From: Chris Murphy @ 2019-05-20 19:23 UTC (permalink / raw)
  To: Btrfs BTRFS

btrfs-progs: btrfs_scan_kernel(): btrfs killed by SIGABRT
https://bugzilla.redhat.com/show_bug.cgi?id=1711787

btrfs-progs-4.20.2-1.fc29
kernel:5.0.16-200.fc29.x86_64

If I understand the bug report correctly, one shell is running 'watch
btrfs fi show' while another deletes a device, and then btrfs
crashes.


-- 
Chris Murphy


