linux-btrfs.vger.kernel.org archive mirror
* "BUG: kernel NULL pointer dereference," when unmounting filesystem hitted by enospc error
@ 2019-10-19 19:29 Peter Hjalmarsson
  2019-10-21  9:17 ` Johannes Thumshirn
  0 siblings, 1 reply; 9+ messages in thread
From: Peter Hjalmarsson @ 2019-10-19 19:29 UTC
  To: linux-btrfs

Hi,

While trying to reproduce another problem I have seen with BTRFS while
running balance and raid6 I hit an issue resulting in:
BUG: kernel NULL pointer dereference, address: 00000000000002ce

I created a script trying to pinpoint the problem utilizing zram, and
the run goes like this:

1. run the script, which sets up a "single" filesystem on two larger
devices, fills it with an adequate amount of data, and then tries to
convert it to raid6
2. the filesystem will fail to convert due to not enough space to
convert all data to raid6 (not enough space on at least 3 devices at
the same time), hitting an enospc-error
* Up until this point everything still seems to work without crashing, and
the system seems stable, just with two different profiles for the data
3. issue the umount command which will be "Killed", and a backtrace
will be written to dmesg

This results in a filesystem that cannot be synced or unmounted and,
all in all, just seems to have crashed.
I have run the script 6 times. It passed once and crashed 5 times, and
each time I rebooted the computer, since that seems to be the only way
to get rid of the filesystem. So the issue seems pretty
reproducible.

The system is a laptop running Fedora 31 Beta with the latest updates,
and the latest kernel (5.3.6-300.fc31.x86_64)

Do you have any input, or any other questions or stuff you want me to test?

The script looks as follows:
-----
$ cat run-btrfs-test
modprobe -iv zram num_devices=8
udevadm trigger
sync
zramctl /dev/zram0 -s 8G && \
zramctl /dev/zram1 -s 8G && \
zramctl /dev/zram2 -s 4G && \
zramctl /dev/zram3 -s 4G && \
zramctl /dev/zram4 -s 4G && \
zramctl /dev/zram5 -s 2G && \
zramctl /dev/zram6 -s 2G && \
zramctl /dev/zram7 -s 4G && \
mkfs.btrfs /dev/zram0 && \
mkdir -p /mnt/btrfs-test && \
mount /dev/zram0 /mnt/btrfs-test && \
echo "FS Mounted" && \
btrfs dev add /dev/zram1 /mnt/btrfs-test && \
echo "Devices added" && \
for int in {1..500} ; do dd if=/dev/zero of=/mnt/btrfs-test/file${int} bs=32M count=1 && sync ; done
btrfs dev add /dev/zram[2-7] /mnt/btrfs-test && \
btrfs fi sh /mnt/btrfs-test && \
btrfs fi df /mnt/btrfs-test && \
btrfs bal star -mconvert=raid6 /mnt/btrfs-test && \
btrfs bal star -dconvert=raid6 /mnt/btrfs-test
btrfs fi sh /mnt/btrfs-test && \
btrfs fi df /mnt/btrfs-test
=====

The output from the script, trimmed to what comes after the for-loop:
------
Label: none  uuid: 87150918-e487-4f59-994b-ccb13ee05446
Total devices 8 FS bytes used 15.64GiB
devid    1 size 8.00GiB used 8.00GiB path /dev/zram0
devid    2 size 8.00GiB used 8.00GiB path /dev/zram1
devid    3 size 4.00GiB used 0.00B path /dev/zram2
devid    4 size 4.00GiB used 0.00B path /dev/zram3
devid    5 size 4.00GiB used 0.00B path /dev/zram4
devid    6 size 2.00GiB used 0.00B path /dev/zram5
devid    7 size 2.00GiB used 0.00B path /dev/zram6
devid    8 size 4.00GiB used 0.00B path /dev/zram7

Data, single: total=15.74GiB, used=15.63GiB
System, single: total=4.00MiB, used=16.00KiB
Metadata, single: total=264.00MiB, used=17.06MiB
GlobalReserve, single: total=16.70MiB, used=0.00B
Done, had to relocate 3 out of 20 chunks
ERROR: error during balancing '/mnt/btrfs-test': No space left on device
There may be more info in syslog - try dmesg | tail
Label: none  uuid: 87150918-e487-4f59-994b-ccb13ee05446
Total devices 8 FS bytes used 15.65GiB
devid    1 size 8.00GiB used 2.99GiB path /dev/zram0
devid    2 size 8.00GiB used 2.00GiB path /dev/zram1
devid    3 size 4.00GiB used 4.00GiB path /dev/zram2
devid    4 size 4.00GiB used 4.00GiB path /dev/zram3
devid    5 size 4.00GiB used 4.00GiB path /dev/zram4
devid    6 size 2.00GiB used 2.00GiB path /dev/zram5
devid    7 size 2.00GiB used 2.00GiB path /dev/zram6
devid    8 size 4.00GiB used 4.00GiB path /dev/zram7

Data, single: total=2.00GiB, used=1.97GiB
Data, RAID6: total=14.41GiB, used=13.66GiB
System, RAID6: total=80.00MiB, used=16.00KiB
Metadata, RAID6: total=512.00MiB, used=17.52MiB
GlobalReserve, single: total=17.16MiB, used=0.00B
===
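
To make the enospc condition from step 2 concrete: in the second
listing above only devid 1 and 2 still have unallocated space (size
larger than used), which is fewer than the three devices mentioned
earlier. A minimal sketch for counting this, assuming the
"devid N size X used Y path Z" line format shown above; the helper
name devs_with_free_space is made up for illustration and is not part
of btrfs-progs:

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of btrfs-progs): read `btrfs fi sh` output on
# stdin and print how many devices still have unallocated space (size > used).
# Assumes the "devid N size X used Y path Z" line format shown in this mail.
devs_with_free_space() {
    awk '
        # Convert values such as "8.00GiB" or "0.00B" to bytes.
        function to_bytes(v,    n, u) {
            n = v + 0                 # leading numeric part
            u = v
            sub(/^[0-9.]+/, "", u)    # remaining unit suffix
            if (u == "KiB") return n * 1024
            if (u == "MiB") return n * 1024 ^ 2
            if (u == "GiB") return n * 1024 ^ 3
            if (u == "TiB") return n * 1024 ^ 4
            return n                  # plain "B"
        }
        $1 == "devid" && $3 == "size" && $5 == "used" {
            if (to_bytes($4) > to_bytes($6)) free++
        }
        END { print free + 0 }
    '
}
```

Used as `btrfs fi sh /mnt/btrfs-test | devs_with_free_space`, this
prints 2 for the second listing above.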

Issuing the umount:
----
# umount /mnt/btrfs-test && modprobe -rv zram
Killed
===

And last but not least: the output in dmesg:
---
[  205.960233] BTRFS info (device zram0): 2 enospc errors during balance
[  205.960235] BTRFS info (device zram0): balance: ended with status: -28
[  235.774821] BUG: kernel NULL pointer dereference, address: 00000000000002ce
[  235.774826] #PF: supervisor read access in kernel mode
[  235.774828] #PF: error_code(0x0000) - not-present page
[  235.774830] PGD 0 P4D 0
[  235.774834] Oops: 0000 [#1] SMP PTI
[  235.774838] CPU: 3 PID: 5421 Comm: umount Not tainted
5.3.6-300.fc31.x86_64 #1
[  235.774840] Hardware name: LENOVO 80JV/Lenovo U41-70, BIOS BDCN71WW
08/03/2016
[  235.774847] RIP: 0010:__free_pages+0x5/0x30
[  235.774850] Code: 00 48 89 c3 fa 66 0f 1f 44 00 00 48 89 ef 4c 89
e6 e8 2f ef ff ff 48 89 df 57 9d 0f 1f 44 00 00 5b 5d 41 5c c3 0f 1f
44 00 00 <8b> 47 34 85 c0 74 12 f0 ff 4f 34 75 06 85 f6 75 03 eb 88 c3
e9 82
[  235.774852] RSP: 0018:ffffc3cf0ffb7db0 EFLAGS: 00010046
[  235.774854] RAX: ffffa0ad0ffa0118 RBX: 0000000000000045 RCX: 0000000000000000
[  235.774856] RDX: ffffa0aeceaee2f0 RSI: 0000000000000000 RDI: 000000000000029a
[  235.774858] RBP: ffffa0ad0ffa0000 R08: fffffb06c23fd108 R09: fffffb06c23fd108
[  235.774860] R10: 0000000000068879 R11: fffffb06c02cf820 R12: 0000000000000045
[  235.774861] R13: ffffa0ade5c00010 R14: ffffa0ae0d19e5ac R15: ffffa0ae0d19e578
[  235.774864] FS:  00007f3b40f6cc80(0000) GS:ffffa0aeceac0000(0000)
knlGS:0000000000000000
[  235.774866] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  235.774867] CR2: 00000000000002ce CR3: 000000004c1f0004 CR4: 00000000003606e0
[  235.774869] Call Trace:
[  235.774920]  __free_raid_bio+0x72/0xb0 [btrfs]
[  235.774961]  btrfs_free_stripe_hash_table+0x3d/0x70 [btrfs]
[  235.774992]  close_ctree+0x1ea/0x2f0 [btrfs]
[  235.774998]  generic_shutdown_super+0x6c/0x100
[  235.775001]  kill_anon_super+0x14/0x30
[  235.775024]  btrfs_kill_super+0x12/0xa0 [btrfs]
[  235.775029]  deactivate_locked_super+0x36/0x70
[  235.775033]  cleanup_mnt+0x104/0x150
[  235.775038]  task_work_run+0x87/0xa0
[  235.775043]  exit_to_usermode_loop+0xda/0x100
[  235.775047]  do_syscall_64+0x183/0x1a0
[  235.775053]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  235.775056] RIP: 0033:0x7f3b411b767b
[  235.775060] Code: 08 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 90 f3
0f 1e fa 31 f6 e9 05 00 00 00 0f 1f 44 00 00 f3 0f 1e fa b8 a6 00 00
00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d dd 07 0c 00 f7 d8 64 89
01 48
[  235.775062] RSP: 002b:00007fffb57488e8 EFLAGS: 00000246 ORIG_RAX:
00000000000000a6
[  235.775065] RAX: 0000000000000000 RBX: 00007f3b412e11e4 RCX: 00007f3b411b767b
[  235.775066] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000055dc3815f730
[  235.775068] RBP: 000055dc3815f520 R08: 0000000000000000 R09: 00007fffb5747660
[  235.775070] R10: 000055dc38160750 R11: 0000000000000246 R12: 000055dc3815f730
[  235.775072] R13: 0000000000000000 R14: 000055dc3815f618 R15: 0000000000000000
[  235.775074] Modules linked in: zram uinput rfcomm ccm xt_CHECKSUM
xt_MASQUERADE nf_nat_tftp nf_conntrack_tftp nf_conntrack_netbios_ns
nf_conntrack_broadcast xt_CT ip6t_REJECT nf_reject_ipv6 ip6t_rpfilter
ipt_REJECT nf_reject_ipv4 xt_conntrack tun bridge stp llc ebtable_nat
ebtable_broute ip6table_nat ip6_udp_tunnel udp_tunnel ip6table_mangle
ip6table_raw ip6table_security iptable_nat nf_nat iptable_mangle
iptable_raw iptable_security nf_conntrack nf_defrag_ipv6
nf_defrag_ipv4 ip_set nfnetlink ebtable_filter ebtables
ip6table_filter ip6_tables iptable_filter cmac bnep cachefiles fscache
sunrpc vfat fat iwlmvm intel_rapl_msr intel_rapl_common
x86_pkg_temp_thermal intel_powerclamp coretemp mac80211 kvm_intel
libarc4 iwlwifi kvm uvcvideo videobuf2_vmalloc mei_hdcp
videobuf2_memops videobuf2_v4l2 videobuf2_common cfg80211
snd_hda_codec_realtek irqbypass snd_hda_codec_generic ledtrig_audio
intel_cstate snd_hda_codec_hdmi intel_uncore btusb videodev
snd_hda_intel btrtl btbcm btintel snd_hda_codec
[  235.775113]  mc bluetooth snd_hda_core snd_hwdep intel_rapl_perf
wdat_wdt snd_seq snd_seq_device snd_pcm joydev asus_wmi input_polldev
i2c_i801 intel_wmi_thunderbolt intel_pch_thermal wmi_bmof ecdh_generic
ideapad_laptop ecc snd_timer lpc_ich sparse_keymap mei_me snd mei
soundcore rfkill acpi_pad lz4 lz4_compress binfmt_misc ip_tables btrfs
libcrc32c xor zstd_decompress zstd_compress raid6_pq dm_crypt i915
nouveau mxm_wmi ttm crct10dif_pclmul crc32_pclmul crc32c_intel
i2c_algo_bit drm_kms_helper drm ghash_clmulni_intel serio_raw r8169
wmi video fuse [last unloaded: zram]
[  235.775143] CR2: 00000000000002ce
[  235.775146] ---[ end trace 1ed5f1c3085019fd ]---
[  235.775151] RIP: 0010:__free_pages+0x5/0x30
[  235.775154] Code: 00 48 89 c3 fa 66 0f 1f 44 00 00 48 89 ef 4c 89
e6 e8 2f ef ff ff 48 89 df 57 9d 0f 1f 44 00 00 5b 5d 41 5c c3 0f 1f
44 00 00 <8b> 47 34 85 c0 74 12 f0 ff 4f 34 75 06 85 f6 75 03 eb 88 c3
e9 82
[  235.775156] RSP: 0018:ffffc3cf0ffb7db0 EFLAGS: 00010046
[  235.775158] RAX: ffffa0ad0ffa0118 RBX: 0000000000000045 RCX: 0000000000000000
[  235.775160] RDX: ffffa0aeceaee2f0 RSI: 0000000000000000 RDI: 000000000000029a
[  235.775162] RBP: ffffa0ad0ffa0000 R08: fffffb06c23fd108 R09: fffffb06c23fd108
[  235.775164] R10: 0000000000068879 R11: fffffb06c02cf820 R12: 0000000000000045
[  235.775165] R13: ffffa0ade5c00010 R14: ffffa0ae0d19e5ac R15: ffffa0ae0d19e578
[  235.775168] FS:  00007f3b40f6cc80(0000) GS:ffffa0aeceac0000(0000)
knlGS:0000000000000000
[  235.775170] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  235.775172] CR2: 00000000000002ce CR3: 000000004c1f0004 CR4: 00000000003606e0

Best Regards,
Peter Hjalmarsson


* Re: "BUG: kernel NULL pointer dereference," when unmounting filesystem hitted by enospc error
  2019-10-19 19:29 "BUG: kernel NULL pointer dereference," when unmounting filesystem hitted by enospc error Peter Hjalmarsson
@ 2019-10-21  9:17 ` Johannes Thumshirn
  2019-10-21 14:32   ` Johannes Thumshirn
  0 siblings, 1 reply; 9+ messages in thread
From: Johannes Thumshirn @ 2019-10-21  9:17 UTC
  To: Peter Hjalmarsson, linux-btrfs

On 19/10/2019 21:29, Peter Hjalmarsson wrote:

Hi Peter,

Thanks for the report.

> While trying to reproduce another problem I have seen with BTRFS while
> running balance and raid6 I hit an issue resulting in:
> BUG: kernel NULL pointer dereference, address: 00000000000002ce
> 
> I created a script trying to pinpoint the problem utilizing zram, and
> the run goes like this:
> 
> 1. run the script, which sets up a "single" filesystem on two larger
> devices, fills it with an adequate amount of data, and then tries to
> convert it to raid6
> 2. the filesystem will fail to convert due to not enough space to
> convert all data to raid6 (not enough space on at least 3 devices at
> the same time), hitting an enospc-error
> * Up until this point everything still seems to work without crashing, and
> the system seems stable, just with two different profiles for the data
> 3. issue the umount command which will be "Killed", and a backtrace
> will be written to dmesg
> 
> This results in a filesystem that cannot be synced or unmounted and,
> all in all, just seems to have crashed.
> I have run the script 6 times. It passed once and crashed 5 times, and
> each time I rebooted the computer, since that seems to be the only way
> to get rid of the filesystem. So the issue seems pretty
> reproducible.
> 
> The system is a laptop running Fedora 31 Beta with the latest updates,
> and the latest kernel (5.3.6-300.fc31.x86_64)
> 
> Do you have any input, or any other questions or stuff you want me to test?
> 
> The script looks as follows:
> -----
> $ cat run-btrfs-test
> modprobe -iv zram num_devices=8
> udevadm trigger
> sync
> zramctl /dev/zram0 -s 8G && \
> zramctl /dev/zram1 -s 8G && \
> zramctl /dev/zram2 -s 4G && \
> zramctl /dev/zram3 -s 4G && \
> zramctl /dev/zram4 -s 4G && \
> zramctl /dev/zram5 -s 2G && \
> zramctl /dev/zram6 -s 2G && \
> zramctl /dev/zram7 -s 4G && \
> mkfs.btrfs /dev/zram0 && \
> mkdir -p /mnt/btrfs-test && \
> mount /dev/zram0 /mnt/btrfs-test && \
> echo "FS Mounted" && \
> btrfs dev add /dev/zram1 /mnt/btrfs-test && \
> echo "Devices added" && \
> for int in {1..500} ; do dd if=/dev/zero of=/mnt/btrfs-test/file${int}
> bs=32M count=1 && sync ; done
> btrfs dev add /dev/zram[2-7] /mnt/btrfs-test && \
> btrfs fi sh /mnt/btrfs-test && \
> btrfs fi df /mnt/btrfs-test && \
> btrfs bal star -mconvert=raid6 /mnt/btrfs-test && \
> btrfs bal star -dconvert=raid6 /mnt/btrfs-test
> btrfs fi sh /mnt/btrfs-test && \
> btrfs fi df /mnt/btrfs-test
> =====
> 
> The output from the script, I will trim the output down to after the for-loop:
> ------
> Label: none  uuid: 87150918-e487-4f59-994b-ccb13ee05446
> Total devices 8 FS bytes used 15.64GiB
> devid    1 size 8.00GiB used 8.00GiB path /dev/zram0
> devid    2 size 8.00GiB used 8.00GiB path /dev/zram1
> devid    3 size 4.00GiB used 0.00B path /dev/zram2
> devid    4 size 4.00GiB used 0.00B path /dev/zram3
> devid    5 size 4.00GiB used 0.00B path /dev/zram4
> devid    6 size 2.00GiB used 0.00B path /dev/zram5
> devid    7 size 2.00GiB used 0.00B path /dev/zram6
> devid    8 size 4.00GiB used 0.00B path /dev/zram7
> 
> Data, single: total=15.74GiB, used=15.63GiB
> System, single: total=4.00MiB, used=16.00KiB
> Metadata, single: total=264.00MiB, used=17.06MiB
> GlobalReserve, single: total=16.70MiB, used=0.00B
> Done, had to relocate 3 out of 20 chunks
> ERROR: error during balancing '/mnt/btrfs-test': No space left on device
> There may be more info in syslog - try dmesg | tail
> Label: none  uuid: 87150918-e487-4f59-994b-ccb13ee05446
> Total devices 8 FS bytes used 15.65GiB
> devid    1 size 8.00GiB used 2.99GiB path /dev/zram0
> devid    2 size 8.00GiB used 2.00GiB path /dev/zram1
> devid    3 size 4.00GiB used 4.00GiB path /dev/zram2
> devid    4 size 4.00GiB used 4.00GiB path /dev/zram3
> devid    5 size 4.00GiB used 4.00GiB path /dev/zram4
> devid    6 size 2.00GiB used 2.00GiB path /dev/zram5
> devid    7 size 2.00GiB used 2.00GiB path /dev/zram6
> devid    8 size 4.00GiB used 4.00GiB path /dev/zram7
> 
> Data, single: total=2.00GiB, used=1.97GiB
> Data, RAID6: total=14.41GiB, used=13.66GiB
> System, RAID6: total=80.00MiB, used=16.00KiB
> Metadata, RAID6: total=512.00MiB, used=17.52MiB
> GlobalReserve, single: total=17.16MiB, used=0.00B
> ===
> 
> Issuing the umount:
> ----
> # umount /mnt/btrfs-test && modprobe -rv zram

OK, I couldn't reproduce it in my environment (5.4-rc3+ based
btrfs-devel/misc-next from David) with this script. I'll dig deeper.


> Killed
> ===
> 
> And last but not least: the output in dmesg:
> ---
> [  205.960233] BTRFS info (device zram0): 2 enospc errors during balance
> [  205.960235] BTRFS info (device zram0): balance: ended with status: -28

Here balance ended with -ENOSPC (28).

> [  235.774821] BUG: kernel NULL pointer dereference, address: 00000000000002ce

That's a NULL pointer dereference with an offset of 0x2ce (718).

> [  235.774826] #PF: supervisor read access in kernel mode
> [  235.774828] #PF: error_code(0x0000) - not-present page
> [  235.774830] PGD 0 P4D 0
> [  235.774834] Oops: 0000 [#1] SMP PTI
> [  235.774838] CPU: 3 PID: 5421 Comm: umount Not tainted
> 5.3.6-300.fc31.x86_64 #1
> [  235.774840] Hardware name: LENOVO 80JV/Lenovo U41-70, BIOS BDCN71WW
> 08/03/2016
> [  235.774847] RIP: 0010:__free_pages+0x5/0x30
> [  235.774850] Code: 00 48 89 c3 fa 66 0f 1f 44 00 00 48 89 ef 4c 89
> e6 e8 2f ef ff ff 48 89 df 57 9d 0f 1f 44 00 00 5b 5d 41 5c c3 0f 1f
> 44 00 00 <8b> 47 34 85 c0 74 12 f0 ff 4f 34 75 06 85 f6 75 03 eb 88 c3
> e9 82
> [  235.774852] RSP: 0018:ffffc3cf0ffb7db0 EFLAGS: 00010046
> [  235.774854] RAX: ffffa0ad0ffa0118 RBX: 0000000000000045 RCX: 0000000000000000
> [  235.774856] RDX: ffffa0aeceaee2f0 RSI: 0000000000000000 RDI: 000000000000029a
> [  235.774858] RBP: ffffa0ad0ffa0000 R08: fffffb06c23fd108 R09: fffffb06c23fd108
> [  235.774860] R10: 0000000000068879 R11: fffffb06c02cf820 R12: 0000000000000045
> [  235.774861] R13: ffffa0ade5c00010 R14: ffffa0ae0d19e5ac R15: ffffa0ae0d19e578
> [  235.774864] FS:  00007f3b40f6cc80(0000) GS:ffffa0aeceac0000(0000)
> knlGS:0000000000000000
> [  235.774866] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  235.774867] CR2: 00000000000002ce CR3: 000000004c1f0004 CR4: 00000000003606e0

CR2 has it as well, so we're page faulting on an access to 0x2ce.

Let's look at __free_pages():

(gdb) l *(__free_pages+0x5)
0xffffffff81179e45 is in __free_pages (mm/page_alloc.c:4818).
4813			__free_pages_ok(page, order);
4814	}
4815	
4816	void __free_pages(struct page *page, unsigned int order)
4817	{
4818		if (put_page_testzero(page))
4819			free_the_page(page, order);
4820	}
4821	EXPORT_SYMBOL(__free_pages);
4822	
(gdb)

We're getting called by __free_page() so we know order is 0. So
something must have passed in a NULL page (somehow).

Looking at __free_raid_bio() one step further up the call-chain I see this:

for (i = 0; i < rbio->nr_pages; i++) {
         if (rbio->stripe_pages[i]) {
                 __free_page(rbio->stripe_pages[i]);
                 rbio->stripe_pages[i] = NULL;
         }
}
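
As a cross-check on which pointer actually reached __free_pages(), the
oops can be read arithmetically. The marked bytes "<8b> 47 34" in the
kernel Code: line decode to "mov 0x34(%rdi),%eax", and the instruction
sits at +0x5, directly after a 5-byte nop, so RDI should still hold
the page argument at that point. A small sketch of the sum, assuming
that reading is right (the function name fault_addr is made up for
illustration):

```shell
#!/usr/bin/env bash
# Illustrative only: effective address of "mov disp(%rdi),%eax".
# fault_addr BASE DISP prints BASE + DISP in hex.
fault_addr() {
    printf '0x%x\n' "$(( $1 + $2 ))"
}

# RDI from the register dump above, displacement from the faulting instruction.
fault_addr 0x29a 0x34    # prints 0x2ce, the value in CR2
```

If this holds, the pointer handed in was 0x29a, a small non-NULL
value, which would also explain why the "if (rbio->stripe_pages[i])"
check in the loop above does not skip it.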

> [  235.774869] Call Trace:
> [  235.774920]  __free_raid_bio+0x72/0xb0 [btrfs]
> [  235.774961]  btrfs_free_stripe_hash_table+0x3d/0x70 [btrfs]
> [  235.774992]  close_ctree+0x1ea/0x2f0 [btrfs]
> [  235.774998]  generic_shutdown_super+0x6c/0x100
> [  235.775001]  kill_anon_super+0x14/0x30
> [  235.775024]  btrfs_kill_super+0x12/0xa0 [btrfs]
> [  235.775029]  deactivate_locked_super+0x36/0x70
> [  235.775033]  cleanup_mnt+0x104/0x150
> [  235.775038]  task_work_run+0x87/0xa0
> [  235.775043]  exit_to_usermode_loop+0xda/0x100
> [  235.775047]  do_syscall_64+0x183/0x1a0
> [  235.775053]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [  235.775056] RIP: 0033:0x7f3b411b767b
> [  235.775060] Code: 08 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 90 f3
> 0f 1e fa 31 f6 e9 05 00 00 00 0f 1f 44 00 00 f3 0f 1e fa b8 a6 00 00
> 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d dd 07 0c 00 f7 d8 64 89
> 01 48

This decodes to:

All code
========
   0:	08 0c 00             	or     %cl,(%rax,%rax,1)
   3:	f7 d8                	neg    %eax
   5:	64 89 01             	mov    %eax,%fs:(%rcx)
   8:	48 83 c8 ff          	or     $0xffffffffffffffff,%rax
   c:	c3                   	retq
   d:	66 90                	xchg   %ax,%ax
   f:	f3 0f 1e fa          	endbr64
  13:	31 f6                	xor    %esi,%esi
  15:	e9 05 00 00 00       	jmpq   0x1f
  1a:	0f 1f 44 00 00       	nopl   0x0(%rax,%rax,1)
  1f:	f3 0f 1e fa          	endbr64
  23:	b8 a6 00 00 00       	mov    $0xa6,%eax
  28:	0f 05                	syscall
  2a:*	48 3d 01 f0 ff ff    	cmp    $0xfffffffffffff001,%rax		<-- trapping instruction
  30:	73 01                	jae    0x33
  32:	c3                   	retq
  33:	48 8b 0d dd 07 0c 00 	mov    0xc07dd(%rip),%rcx        # 0xc0817
  3a:	f7 d8                	neg    %eax
  3c:	64 89 01             	mov    %eax,%fs:(%rcx)
  3f:	48                   	rex.W

Code starting with the faulting instruction
===========================================
   0:	48 3d 01 f0 ff ff    	cmp    $0xfffffffffffff001,%rax
   6:	73 01                	jae    0x9
   8:	c3                   	retq
   9:	48 8b 0d dd 07 0c 00 	mov    0xc07dd(%rip),%rcx        # 0xc07ed
  10:	f7 d8                	neg    %eax
  12:	64 89 01             	mov    %eax,%fs:(%rcx)
  15:	48                   	rex.W

This doesn't look like __free_pages

(gdb) disassemble __free_pages
Dump of assembler code for function __free_pages:
   0xffffffff81179e40 <+0>:	lock decl 0x34(%rdi)
   0xffffffff81179e44 <+4>:	jne    0xffffffff81179e51 <__free_pages+17>
   0xffffffff81179e46 <+6>:	test   %esi,%esi
   0xffffffff81179e48 <+8>:	je     0xffffffff81179e4f <__free_pages+15>
   0xffffffff81179e4a <+10>:	jmpq   0xffffffff81178500 <__free_pages_ok>
   0xffffffff81179e4f <+15>:	jmp    0xffffffff81179de0 <free_unref_page>
   0xffffffff81179e51 <+17>:	repz retq
End of assembler dump.
(gdb)
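
One thing to watch when decoding: the oops contains two Code: lines.
The bytes decoded above follow "RIP: 0033:0x7f3b411b767b", and 0x33 is
the user-space code segment on x86_64, so as far as I can tell they
show the syscall stub of the umount process rather than kernel text.
The kernel-side bytes are the ones that follow
"RIP: 0010:__free_pages+0x5/0x30". A rough sketch for picking that
line out of a mail-wrapped (and possibly quoted) oops; the helper name
kernel_code_line is hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helper: read an oops (possibly mail-quoted and line-wrapped) on
# stdin and print the bytes of the "Code:" record that follows "RIP: 0010:",
# i.e. the kernel-mode frame. Records are assumed to start with "[timestamp]".
kernel_code_line() {
    awk '
        # Handle one complete record once the next one begins.
        function flush() {
            if (want && rec ~ /\] Code: /) {
                sub(/^.*\] Code: /, "", rec)
                print rec
                found = 1
            }
            want = (rec ~ /\] RIP: 0010:/)
            rec = ""
        }
        found   { next }                    # already printed, ignore the rest
                { sub(/^(> ?)+/, "") }      # drop mail quote markers
        NF == 0 { next }                    # skip blank lines
        /^\[/   { flush(); rec = $0; next } # a new record starts
                { rec = rec " " $0 }        # wrapped continuation line
        END     { if (!found) flush() }
    '
}
```

The printed bytes can then be handed to whatever disassembly workflow
is in use.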

-- 
Johannes Thumshirn                            SUSE Labs Filesystems
jthumshirn@suse.de                                +49 911 74053 689
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5
90409 Nürnberg
Germany
(HRB 36809, AG Nürnberg)
Geschäftsführer: Felix Imendörffer
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


* Re: "BUG: kernel NULL pointer dereference," when unmounting filesystem hitted by enospc error
  2019-10-21  9:17 ` Johannes Thumshirn
@ 2019-10-21 14:32   ` Johannes Thumshirn
  2019-10-24 20:23     ` Peter Hjalmarsson
  2019-10-27 13:24     ` Su Yue
  0 siblings, 2 replies; 9+ messages in thread
From: Johannes Thumshirn @ 2019-10-21 14:32 UTC
  To: Peter Hjalmarsson, linux-btrfs

On 21/10/2019 11:17, Johannes Thumshirn wrote:
[...]
>> -----
>> $ cat run-btrfs-test
>> modprobe -iv zram num_devices=8
>> udevadm trigger
>> sync
>> zramctl /dev/zram0 -s 8G && \
>> zramctl /dev/zram1 -s 8G && \
>> zramctl /dev/zram2 -s 4G && \
>> zramctl /dev/zram3 -s 4G && \
>> zramctl /dev/zram4 -s 4G && \
>> zramctl /dev/zram5 -s 2G && \
>> zramctl /dev/zram6 -s 2G && \
>> zramctl /dev/zram7 -s 4G && \
>> mkfs.btrfs /dev/zram0 && \
>> mkdir -p /mnt/btrfs-test && \
>> mount /dev/zram0 /mnt/btrfs-test && \
>> echo "FS Mounted" && \
>> btrfs dev add /dev/zram1 /mnt/btrfs-test && \
>> echo "Devices added" && \
>> for int in {1..500} ; do dd if=/dev/zero of=/mnt/btrfs-test/file${int}
>> bs=32M count=1 && sync ; done
>> btrfs dev add /dev/zram[2-7] /mnt/btrfs-test && \
>> btrfs fi sh /mnt/btrfs-test && \
>> btrfs fi df /mnt/btrfs-test && \
>> btrfs bal star -mconvert=raid6 /mnt/btrfs-test && \
>> btrfs bal star -dconvert=raid6 /mnt/btrfs-test
>> btrfs fi sh /mnt/btrfs-test && \
>> btrfs fi df /mnt/btrfs-test

I'm sorry. I ran this script in a loop for 35 iterations on 5.3.6 and
couldn't reproduce a single crash.
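
For anyone else who wants to try, such a loop could look roughly like
the sketch below. It assumes the script is saved as ./run-btrfs-test,
that everything runs as root, and that a crash shows up as a failing
umount (as in the report, where it was "Killed"). The function names
repeat_until_fail and one_iteration are made up for this example, and
this is only a sketch, not necessarily the loop used for the runs
above:

```shell
#!/usr/bin/env bash
# repeat_until_fail MAX CMD [ARGS...]
# Run CMD up to MAX times; stop at the first failure.
# Prints the number of successful runs and returns 1 if a run failed.
repeat_until_fail() {
    local max=$1 ok=0
    shift
    while [ "$ok" -lt "$max" ]; do
        if ! "$@"; then
            echo "$ok"
            return 1
        fi
        ok=$((ok + 1))
    done
    echo "$ok"
}

# One iteration: build and fill the filesystem, then tear it down again.
# The commands are the ones from the report; they need root and zram.
one_iteration() {
    ./run-btrfs-test
    umount /mnt/btrfs-test && modprobe -rv zram
}

# Example (not run here): repeat_until_fail 35 one_iteration
```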


Is there anything else needed?


* Re: "BUG: kernel NULL pointer dereference," when unmounting filesystem hitted by enospc error
  2019-10-21 14:32   ` Johannes Thumshirn
@ 2019-10-24 20:23     ` Peter Hjalmarsson
  2019-10-25  8:07       ` Johannes Thumshirn
  2019-10-27 13:24     ` Su Yue
  1 sibling, 1 reply; 9+ messages in thread
From: Peter Hjalmarsson @ 2019-10-24 20:23 UTC
  To: Johannes Thumshirn; +Cc: linux-btrfs

Hi,

Sorry for the late answer; I have been out of town for work.
Interesting that you could not reproduce it. What kind of system do you run?
I have now tried on several systems, and they behave slightly
differently, so maybe the architecture determines what kind of crash,
if any, gets triggered?
Also make sure that the "dd" line in the script is not wrapped the way it
was in the mail.

Running the test on several systems, I found the following:

The systems that crash always did so during the first or the
second run of the script (so there is probably no need to run it more
than three or four times to verify).

Two systems (32-bit ARM) passed 5 times.
One system (Pine64 Rock64, rk3328) crashes during umount with a slightly
different traceback, but essentially the same as previously reported:
[503397.433500] Call trace:
[503397.436338]  __free_pages+0x1c/0x80
[503397.440506]  __free_raid_bio+0x84/0xf8 [btrfs]
[503397.445713]  __remove_rbio_from_cache+0x134/0x1b8 [btrfs]
[503397.451957]  btrfs_clear_rbio_cache.isra.0+0x5c/0x98 [btrfs]
[503397.458473]  btrfs_free_stripe_hash_table+0x24/0x40 [btrfs]
[503397.464891]  close_ctree+0x1b0/0x2c8 [btrfs]
[503397.469851]  btrfs_put_super+0x20/0x30 [btrfs]

Two systems (one x86_64 Atom, one RPi 3 aarch64) crash during balance
with a dmesg like:
[10282.926420] Call Trace:
[10282.926488]  lock_stripe_add+0x292/0x370 [btrfs]
[10282.926560]  __raid56_parity_write+0x20/0x40 [btrfs]
[10282.926633]  run_plug+0x131/0x150 [btrfs]
[10282.926671]  blk_flush_plug_list+0xc2/0x110
[10282.926708]  blk_finish_plug+0x21/0x2e
[10282.926769]  btrfs_write_and_wait_transaction.isra.0+0x57/0xa0 [btrfs]
[10282.926851]  btrfs_commit_transaction+0x72e/0x9a0 [btrfs]

Three systems (all x86_64; the "tainted" one has the nvidia
binary blobs) crash during umount with a dmesg like:
[  658.646613] Call Trace:
[  658.646675]  __free_raid_bio+0x72/0xb0 [btrfs]
[  658.646728]  btrfs_free_stripe_hash_table+0x3d/0x70 [btrfs]
[  658.646766]  close_ctree+0x1ea/0x2f0 [btrfs]
[  658.646773]  generic_shutdown_super+0x6c/0x100
[  658.646778]  kill_anon_super+0x14/0x30
[  658.646808]  btrfs_kill_super+0x12/0xa0 [btrfs]

I have saved the full dmesg from every run, in case you want to look at
any of the ones not posted below.

The dmesg for the three x86_64 crashes during umount follows (since
that is what I reported in this thread):
[ 7322.868716] BUG: kernel NULL pointer dereference, address: 00000000000002ce
[ 7322.868720] #PF: supervisor read access in kernel mode
[ 7322.868721] #PF: error_code(0x0000) - not-present page
[ 7322.868722] PGD 0 P4D 0
[ 7322.868725] Oops: 0000 [#1] SMP PTI
[ 7322.868727] CPU: 1 PID: 18329 Comm: umount Tainted: P           OE
   5.3.6-200.fc30.x86_64 #1
[ 7322.868728] Hardware name: System manufacturer System Product
Name/Z170 PRO GAMING, BIOS 3805 05/16/2018
[ 7322.868733] RIP: 0010:__free_pages+0x5/0x30
[ 7322.868735] Code: 00 48 89 c3 fa 66 0f 1f 44 00 00 48 89 ef 4c 89
e6 e8 2f ef ff ff 48 89 df 57 9d 0f 1f 44 00 00 5b 5d 41 5c c3 0f 1f
44 00 00 <8b> 47 34 85 c0 74 12 f0 ff 4f 34 75 06 85 f6 75 03 eb 88 c3
e9 82
[ 7322.868737] RSP: 0018:ffffb481d632fdb0 EFLAGS: 00010046
[ 7322.868738] RAX: ffff8932a0b32118 RBX: 0000000000000045 RCX: 0000000000000000
[ 7322.868740] RDX: ffff893366a6e2f0 RSI: 0000000000000000 RDI: 000000000000029a
[ 7322.868741] RBP: ffff8932a0b32000 R08: ffffd62ac0e41b48 R09: ffffd62ac0e41b48
[ 7322.868742] R10: 000000000004f4b1 R11: ffffd62ac055d220 R12: 0000000000000045
[ 7322.868743] R13: ffff893135a20010 R14: ffff89326792e72c R15: ffff89326792e6f8
[ 7322.868744] FS:  00007fc8c95d8080(0000) GS:ffff893366a40000(0000)
knlGS:0000000000000000
[ 7322.868746] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7322.868747] CR2: 00000000000002ce CR3: 0000000085bd6001 CR4: 00000000003606e0
[ 7322.868748] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 7322.868749] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 7322.868750] Call Trace:
[ 7322.868782]  __free_raid_bio+0x72/0xb0 [btrfs]
[ 7322.868809]  btrfs_free_stripe_hash_table+0x3d/0x70 [btrfs]
[ 7322.868827]  close_ctree+0x1ea/0x2f0 [btrfs]
[ 7322.868831]  generic_shutdown_super+0x6c/0x100
[ 7322.868834]  kill_anon_super+0x14/0x30
[ 7322.868848]  btrfs_kill_super+0x12/0xa0 [btrfs]
[ 7322.868851]  deactivate_locked_super+0x36/0x70
[ 7322.868853]  cleanup_mnt+0x104/0x150
[ 7322.868856]  task_work_run+0x87/0xa0
[ 7322.868860]  exit_to_usermode_loop+0xda/0x100
[ 7322.868862]  do_syscall_64+0x183/0x1a0
[ 7322.868866]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 7322.868868] RIP: 0033:0x7fc8c982358b
[ 7322.868870] Code: 39 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 90 f3
0f 1e fa 31 f6 e9 05 00 00 00 0f 1f 44 00 00 f3 0f 1e fa b8 a6 00 00
00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d cd 38 0c 00 f7 d8 64 89
01 48
[ 7322.868872] RSP: 002b:00007ffd1ba2b9f8 EFLAGS: 00000246 ORIG_RAX:
00000000000000a6
[ 7322.868873] RAX: 0000000000000000 RBX: 00007fc8c994e1c4 RCX: 00007fc8c982358b
[ 7322.868874] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000055588393a6f0
[ 7322.868875] RBP: 000055588393a4e0 R08: 0000000000000000 R09: 00007ffd1ba2a7a0
[ 7322.868876] R10: 000055588393b750 R11: 0000000000000246 R12: 000055588393a6f0
[ 7322.868877] R13: 0000000000000000 R14: 000055588393a5d8 R15: 0000000000000000
[ 7322.868879] Modules linked in: zram rpcsec_gss_krb5 auth_rpcgss
nfsv4 dns_resolver nfs lockd grace rfcomm fuse xt_CHECKSUM
xt_MASQUERADE tun xt_addrtype br_netfilter bridge stp llc
nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT ip6t_rpfilter
ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack
ebtable_nat ebtable_broute ip6table_nat ip6table_mangle ip6table_raw
ip6table_security iptable_nat nf_nat iptable_mangle iptable_raw
iptable_security nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set
nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables
iptable_filter ip_tables overlay cmac bnep nct6775 cachefiles
hwmon_vid fscache sunrpc vfat fat nvidia_drm(POE) nvidia_modeset(POE)
nvidia_uvm(OE) nvidia(POE) intel_rapl_msr intel_rapl_common
snd_hda_codec_hdmi x86_pkg_temp_thermal intel_powerclamp coretemp
kvm_intel snd_hda_codec_realtek snd_hda_codec_generic kvm
ledtrig_audio irqbypass snd_hda_intel btusb snd_hda_codec btrtl btbcm
btintel crct10dif_pclmul snd_hda_core
[ 7322.868904]  ucsi_ccg crc32_pclmul typec_ucsi bluetooth snd_hwdep
mei_hdcp typec snd_seq eeepc_wmi iTCO_wdt ghash_clmulni_intel
iTCO_vendor_support asus_wmi intel_cstate snd_seq_device intel_uncore
intel_rapl_perf wmi_bmof sparse_keymap drm_kms_helper snd_pcm
ecdh_generic i2c_i801 rfkill mei_me ecc snd_timer drm cp210x mei snd
ipmi_devintf soundcore i2c_nvidia_gpu ipmi_msghandler acpi_pad
binfmt_misc btrfs libcrc32c xor zstd_decompress zstd_compress raid6_pq
mxm_wmi e1000e nvme crc32c_intel nvme_core wmi video [last unloaded:
zram]
[ 7322.868923] CR2: 00000000000002ce
[ 7322.868924] ---[ end trace 9c6f3e1ed9db6ba7 ]---
[ 7322.868927] RIP: 0010:__free_pages+0x5/0x30
[ 7322.868929] Code: 00 48 89 c3 fa 66 0f 1f 44 00 00 48 89 ef 4c 89
e6 e8 2f ef ff ff 48 89 df 57 9d 0f 1f 44 00 00 5b 5d 41 5c c3 0f 1f
44 00 00 <8b> 47 34 85 c0 74 12 f0 ff 4f 34 75 06 85 f6 75 03 eb 88 c3
e9 82
[ 7322.868930] RSP: 0018:ffffb481d632fdb0 EFLAGS: 00010046
[ 7322.868931] RAX: ffff8932a0b32118 RBX: 0000000000000045 RCX: 0000000000000000
[ 7322.868932] RDX: ffff893366a6e2f0 RSI: 0000000000000000 RDI: 000000000000029a
[ 7322.868933] RBP: ffff8932a0b32000 R08: ffffd62ac0e41b48 R09: ffffd62ac0e41b48
[ 7322.868934] R10: 000000000004f4b1 R11: ffffd62ac055d220 R12: 0000000000000045
[ 7322.868935] R13: ffff893135a20010 R14: ffff89326792e72c R15: ffff89326792e6f8
[ 7322.868937] FS:  00007fc8c95d8080(0000) GS:ffff893366a40000(0000)
knlGS:0000000000000000
[ 7322.868938] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7322.868939] CR2: 00000000000002ce CR3: 0000000085bd6001 CR4: 00000000003606e0
[ 7322.868940] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 7322.868941] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400


[ 3270.915492] BUG: kernel NULL pointer dereference, address: 00000000000002d1
[ 3270.915499] #PF: supervisor read access in kernel mode
[ 3270.915502] #PF: error_code(0x0000) - not-present page
[ 3270.915504] PGD 0 P4D 0
[ 3270.915509] Oops: 0000 [#1] SMP PTI
[ 3270.915514] CPU: 1 PID: 2578 Comm: umount Not tainted
5.3.6-200.fc30.x86_64 #1
[ 3270.915516] Hardware name: System manufacturer System Product
Name/P7P55D-E LX, BIOS 1701    09/27/2012
[ 3270.915525] RIP: 0010:__free_pages+0x5/0x30
[ 3270.915529] Code: 90 48 89 c3 fa 66 66 90 66 66 90 48 89 ef 4c 89
e6 e8 2f ef ff ff 48 89 df 57 9d 66 66 90 66 90 5b 5d 41 5c c3 66 66
66 66 90 <8b> 47 34 85 c0 74 12 f0 ff 4f 34 75 06 85 f6 75 03 eb 88 c3
e9 82
[ 3270.915532] RSP: 0018:ffffb41a52367db0 EFLAGS: 00010046
[ 3270.915535] RAX: ffffa01a356ed918 RBX: 0000000000000045 RCX: 0000000000000000
[ 3270.915538] RDX: ffffa01ad586e350 RSI: 0000000000000000 RDI: 000000000000029d
[ 3270.915541] RBP: ffffa01a356ed800 R08: fffff43e482f9688 R09: fffff43e482f9688
[ 3270.915543] R10: 000000000021592f R11: fffff43e482b8420 R12: 0000000000000045
[ 3270.915546] R13: ffffa01a63040010 R14: ffffa01a6ae125ec R15: ffffa01a6ae125b8
[ 3270.915549] FS:  00007fbfce760080(0000) GS:ffffa01ad5840000(0000)
knlGS:0000000000000000
[ 3270.915552] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3270.915554] CR2: 00000000000002d1 CR3: 000000020b64c000 CR4: 00000000000006e0
[ 3270.915557] Call Trace:
[ 3270.915617]  __free_raid_bio+0x72/0xb0 [btrfs]
[ 3270.915668]  btrfs_free_stripe_hash_table+0x3d/0x70 [btrfs]
[ 3270.915709]  close_ctree+0x1ea/0x2f0 [btrfs]
[ 3270.915715]  generic_shutdown_super+0x6c/0x100
[ 3270.915720]  kill_anon_super+0x14/0x30
[ 3270.915751]  btrfs_kill_super+0x12/0xa0 [btrfs]
[ 3270.915756]  deactivate_locked_super+0x36/0x70
[ 3270.915760]  cleanup_mnt+0x104/0x150
[ 3270.915765]  task_work_run+0x87/0xa0
[ 3270.915771]  exit_to_usermode_loop+0xda/0x100
[ 3270.915776]  do_syscall_64+0x183/0x1a0
[ 3270.915782]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 3270.915786] RIP: 0033:0x7fbfce9ab58b
[ 3270.915790] Code: 39 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 90 f3
0f 1e fa 31 f6 e9 05 00 00 00 0f 1f 44 00 00 f3 0f 1e fa b8 a6 00 00
00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d cd 38 0c 00 f7 d8 64 89
01 48
[ 3270.915792] RSP: 002b:00007ffcf5f6a918 EFLAGS: 00000246 ORIG_RAX:
00000000000000a6
[ 3270.915796] RAX: 0000000000000000 RBX: 00007fbfcead61c4 RCX: 00007fbfce9ab58b
[ 3270.915798] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000055b3dcd106f0
[ 3270.915801] RBP: 000055b3dcd104e0 R08: 0000000000000000 R09: 00007ffcf5f696c0
[ 3270.915803] R10: 000055b3dcd10710 R11: 0000000000000246 R12: 000055b3dcd106f0
[ 3270.915805] R13: 0000000000000000 R14: 000055b3dcd105d8 R15: 0000000000000000
[ 3270.915809] Modules linked in: zram rpcsec_gss_krb5 auth_rpcgss
nfsv4 dns_resolver nfs lockd grace sunrpc rfcomm bluetooth
ecdh_generic rfkill ecc ip6t_rpfilter ip6t_REJECT nf_reject_ipv6
ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute
ip6table_nat ip6table_mangle ip6table_raw ip6table_security
iptable_nat nf_nat iptable_mangle iptable_raw iptable_security
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nfnetlink
ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter
ip_tables intel_powerclamp kvm_intel kvm irqbypass snd_hda_codec_via
iTCO_wdt snd_hda_codec_generic iTCO_vendor_support intel_cstate
intel_uncore snd_hda_codec_hdmi ledtrig_audio uinput snd_hda_intel
snd_hda_codec cachefiles fscache hwmon_vid coretemp snd_hda_core
i7core_edac snd_hwdep xpad i2c_i801 snd_seq joydev ff_memless lpc_ich
snd_seq_device snd_pcm asus_atk0110 snd_timer snd soundcore
acpi_cpufreq amdgpu amd_iommu_v2 gpu_sched btrfs radeon libcrc32c xor
zstd_decompress zstd_compress
[ 3270.915860]  raid6_pq i2c_algo_bit drm_kms_helper crc32c_intel ttm
serio_raw drm r8169
[ 3270.915870] CR2: 00000000000002d1
[ 3270.915874] ---[ end trace 03b4a864514336a0 ]---
[ 3270.915879] RIP: 0010:__free_pages+0x5/0x30
[ 3270.915882] Code: 90 48 89 c3 fa 66 66 90 66 66 90 48 89 ef 4c 89
e6 e8 2f ef ff ff 48 89 df 57 9d 66 66 90 66 90 5b 5d 41 5c c3 66 66
66 66 90 <8b> 47 34 85 c0 74 12 f0 ff 4f 34 75 06 85 f6 75 03 eb 88 c3
e9 82
[ 3270.915885] RSP: 0018:ffffb41a52367db0 EFLAGS: 00010046
[ 3270.915888] RAX: ffffa01a356ed918 RBX: 0000000000000045 RCX: 0000000000000000
[ 3270.915890] RDX: ffffa01ad586e350 RSI: 0000000000000000 RDI: 000000000000029d
[ 3270.915893] RBP: ffffa01a356ed800 R08: fffff43e482f9688 R09: fffff43e482f9688
[ 3270.915895] R10: 000000000021592f R11: fffff43e482b8420 R12: 0000000000000045
[ 3270.915898] R13: ffffa01a63040010 R14: ffffa01a6ae125ec R15: ffffa01a6ae125b8
[ 3270.915901] FS:  00007fbfce760080(0000) GS:ffffa01ad5840000(0000)
knlGS:0000000000000000
[ 3270.915904] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3270.915906] CR2: 00000000000002d1 CR3: 000000020b64c000 CR4: 00000000000006e0


[  658.646553] BUG: kernel NULL pointer dereference, address: 00000000000002d1
[  658.646560] #PF: supervisor read access in kernel mode
[  658.646562] #PF: error_code(0x0000) - not-present page
[  658.646564] PGD 0 P4D 0
[  658.646569] Oops: 0000 [#1] SMP PTI
[  658.646574] CPU: 0 PID: 6418 Comm: umount Not tainted
5.3.6-300.fc31.x86_64 #1
[  658.646576] Hardware name: LENOVO 80JV/Lenovo U41-70, BIOS BDCN71WW
08/03/2016
[  658.646584] RIP: 0010:__free_pages+0x5/0x30
[  658.646588] Code: 00 48 89 c3 fa 66 0f 1f 44 00 00 48 89 ef 4c 89
e6 e8 2f ef ff ff 48 89 df 57 9d 0f 1f 44 00 00 5b 5d 41 5c c3 0f 1f
44 00 00 <8b> 47 34 85 c0 74 12 f0 ff 4f 34 75 06 85 f6 75 03 eb 88 c3
e9 82
[  658.646591] RSP: 0018:ffffb23bc8d73db0 EFLAGS: 00010046
[  658.646594] RAX: ffff9d285707c918 RBX: 0000000000000045 RCX: 0000000000000000
[  658.646597] RDX: ffff9d290ea2e2f0 RSI: 0000000000000000 RDI: 000000000000029d
[  658.646599] RBP: ffff9d285707c800 R08: ffffdc18806c2d48 R09: ffffdc18806c2d48
[  658.646601] R10: 000000000001b12d R11: ffffdc1888e5a020 R12: 0000000000000045
[  658.646603] R13: ffff9d2825350010 R14: ffff9d2882cec66c R15: ffff9d2882cec638
[  658.646606] FS:  00007f28dbf23c80(0000) GS:ffff9d290ea00000(0000)
knlGS:0000000000000000
[  658.646609] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  658.646611] CR2: 00000000000002d1 CR3: 0000000092fd0003 CR4: 00000000003606f0
[  658.646613] Call Trace:
[  658.646675]  __free_raid_bio+0x72/0xb0 [btrfs]
[  658.646728]  btrfs_free_stripe_hash_table+0x3d/0x70 [btrfs]
[  658.646766]  close_ctree+0x1ea/0x2f0 [btrfs]
[  658.646773]  generic_shutdown_super+0x6c/0x100
[  658.646778]  kill_anon_super+0x14/0x30
[  658.646808]  btrfs_kill_super+0x12/0xa0 [btrfs]
[  658.646814]  deactivate_locked_super+0x36/0x70
[  658.646819]  cleanup_mnt+0x104/0x150
[  658.646825]  task_work_run+0x87/0xa0
[  658.646831]  exit_to_usermode_loop+0xda/0x100
[  658.646836]  do_syscall_64+0x183/0x1a0
[  658.646843]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  658.646847] RIP: 0033:0x7f28dc16e67b
[  658.646851] Code: 08 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 90 f3
0f 1e fa 31 f6 e9 05 00 00 00 0f 1f 44 00 00 f3 0f 1e fa b8 a6 00 00
00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d dd 07 0c 00 f7 d8 64 89
01 48
[  658.646854] RSP: 002b:00007ffe46290828 EFLAGS: 00000246 ORIG_RAX:
00000000000000a6
[  658.646858] RAX: 0000000000000000 RBX: 00007f28dc2981e4 RCX: 00007f28dc16e67b
[  658.646860] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000055d87bf48730
[  658.646862] RBP: 000055d87bf48520 R08: 0000000000000000 R09: 00007ffe4628f5a0
[  658.646864] R10: 000055d87bf49750 R11: 0000000000000246 R12: 000055d87bf48730
[  658.646867] R13: 0000000000000000 R14: 000055d87bf48618 R15: 0000000000000000
[  658.646870] Modules linked in: zram uinput rfcomm ccm xt_CHECKSUM
xt_MASQUERADE nf_nat_tftp nf_conntrack_tftp nf_conntrack_netbios_ns
nf_conntrack_broadcast xt_CT ip6t_REJECT nf_reject_ipv6 ip6t_rpfilter
ipt_REJECT nf_reject_ipv4 xt_conntrack tun bridge ebtable_nat stp
ebtable_broute ip6table_nat llc ip6table_mangle ip6table_raw
ip6table_security iptable_nat nf_nat iptable_mangle ip6_udp_tunnel
udp_tunnel iptable_raw iptable_security nf_conntrack nf_defrag_ipv6
nf_defrag_ipv4 ip_set nfnetlink ebtable_filter ebtables
ip6table_filter ip6_tables iptable_filter cmac bnep cachefiles fscache
sunrpc vfat fat intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal
intel_powerclamp coretemp kvm_intel snd_hda_codec_realtek iwlmvm
snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi mei_hdcp
snd_hda_intel mac80211 kvm snd_hda_codec uvcvideo libarc4 snd_hda_core
videobuf2_vmalloc iwlwifi btusb snd_hwdep videobuf2_memops irqbypass
snd_seq videobuf2_v4l2 btrtl btbcm btintel snd_seq_device
[  658.646916]  videobuf2_common intel_cstate intel_uncore snd_pcm
videodev mc intel_rapl_perf bluetooth cfg80211 asus_wmi joydev
input_polldev wmi_bmof i2c_i801 mei_me intel_wmi_thunderbolt wdat_wdt
snd_timer intel_pch_thermal ecdh_generic ecc mei snd soundcore lpc_ich
ideapad_laptop sparse_keymap rfkill acpi_pad lz4 lz4_compress
binfmt_misc ip_tables btrfs libcrc32c xor zstd_decompress
zstd_compress raid6_pq dm_crypt i915 nouveau crct10dif_pclmul
crc32_pclmul crc32c_intel mxm_wmi ttm i2c_algo_bit drm_kms_helper
ghash_clmulni_intel drm serio_raw r8169 wmi video fuse [last unloaded:
zram]
[  658.646954] CR2: 00000000000002d1
[  658.646958] ---[ end trace 0e45be4afd3f4e04 ]---
[  658.646964] RIP: 0010:__free_pages+0x5/0x30
[  658.646967] Code: 00 48 89 c3 fa 66 0f 1f 44 00 00 48 89 ef 4c 89
e6 e8 2f ef ff ff 48 89 df 57 9d 0f 1f 44 00 00 5b 5d 41 5c c3 0f 1f
44 00 00 <8b> 47 34 85 c0 74 12 f0 ff 4f 34 75 06 85 f6 75 03 eb 88 c3
e9 82
[  658.646970] RSP: 0018:ffffb23bc8d73db0 EFLAGS: 00010046
[  658.646973] RAX: ffff9d285707c918 RBX: 0000000000000045 RCX: 0000000000000000
[  658.646975] RDX: ffff9d290ea2e2f0 RSI: 0000000000000000 RDI: 000000000000029d
[  658.646978] RBP: ffff9d285707c800 R08: ffffdc18806c2d48 R09: ffffdc18806c2d48
[  658.646980] R10: 000000000001b12d R11: ffffdc1888e5a020 R12: 0000000000000045
[  658.646982] R13: ffff9d2825350010 R14: ffff9d2882cec66c R15: ffff9d2882cec638
[  658.646986] FS:  00007f28dbf23c80(0000) GS:ffff9d290ea00000(0000)
knlGS:0000000000000000
[  658.646988] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  658.646991] CR2: 00000000000002d1 CR3: 0000000092fd0003 CR4: 00000000003606f0
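
The register dumps above are internally consistent. The bytes marked at the faulting instruction, `<8b> 47 34`, encode `mov 0x34(%rdi),%eax`, so the address reported in CR2 should equal RDI plus 0x34. A minimal sketch of that check in shell arithmetic follows; the helper name `fault_addr` is made up for this illustration and is not part of any tool.

```shell
#!/usr/bin/env bash
# fault_addr BASE OFFSET: print BASE + OFFSET in hex.
# Hypothetical helper, only for checking the oops arithmetic by hand.
fault_addr() {
    printf '0x%x\n' $(( $1 + $2 ))
}

# Traces with RDI = 0x29d report CR2 = 0x2d1:
fault_addr 0x29d 0x34
# The trace with RDI = 0x29a reports CR2 = 0x2ce:
fault_addr 0x29a 0x34
```

Both sums match the logged CR2 values, which suggests __free_pages() was handed a small bogus pointer rather than a literal NULL. That reading is an interpretation of the logs, not something stated in the thread. The kernel tree's scripts/decodecode can be used to disassemble the whole "Code:" line.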


BR,
Peter Hjalmarsson

On Mon, 21 Oct 2019 at 16:33, Johannes Thumshirn <jthumshirn@suse.de> wrote:
>
> On 21/10/2019 11:17, Johannes Thumshirn wrote:
> [...]
> >> -----
> >> $ cat run-btrfs-test
> >> modprobe -iv zram num_devices=8
> >> udevadm trigger
> >> sync
> >> zramctl /dev/zram0 -s 8G && \
> >> zramctl /dev/zram1 -s 8G && \
> >> zramctl /dev/zram2 -s 4G && \
> >> zramctl /dev/zram3 -s 4G && \
> >> zramctl /dev/zram4 -s 4G && \
> >> zramctl /dev/zram5 -s 2G && \
> >> zramctl /dev/zram6 -s 2G && \
> >> zramctl /dev/zram7 -s 4G && \
> >> mkfs.btrfs /dev/zram0 && \
> >> mkdir -p /mnt/btrfs-test && \
> >> mount /dev/zram0 /mnt/btrfs-test && \
> >> echo "FS Mounted" && \
> >> btrfs dev add /dev/zram1 /mnt/btrfs-test && \
> >> echo "Devices added" && \
> >> for int in {1..500} ; do dd if=/dev/zero of=/mnt/btrfs-test/file${int} bs=32M count=1 && sync ; done
> >> btrfs dev add /dev/zram[2-7] /mnt/btrfs-test && \
> >> btrfs fi sh /mnt/btrfs-test && \
> >> btrfs fi df /mnt/btrfs-test && \
> >> btrfs bal star -mconvert=raid6 /mnt/btrfs-test && \
> >> btrfs bal star -dconvert=raid6 /mnt/btrfs-test
> >> btrfs fi sh /mnt/btrfs-test && \
> >> btrfs fi df /mnt/btrfs-test
>
> I'm sorry. I ran this script in a loop for 35 iterations on 5.3.6 and
> couldn't reproduce a single crash.
>
>
> Is there anything else needed?
> --
> Johannes Thumshirn                            SUSE Labs Filesystems
> jthumshirn@suse.de                                +49 911 74053 689
> SUSE Software Solutions Germany GmbH
> Maxfeldstr. 5
> 90409 Nürnberg
> Germany
> (HRB 36809, AG Nürnberg)
> Geschäftsführer: Felix Imendörffer
> Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


* Re: "BUG: kernel NULL pointer dereference," when unmounting filesystem hitted by enospc error
  2019-10-24 20:23     ` Peter Hjalmarsson
@ 2019-10-25  8:07       ` Johannes Thumshirn
  2019-10-25 10:02         ` Johannes Thumshirn
  0 siblings, 1 reply; 9+ messages in thread
From: Johannes Thumshirn @ 2019-10-25  8:07 UTC (permalink / raw)
  To: Peter Hjalmarsson; +Cc: linux-btrfs

On 24/10/2019 22:23, Peter Hjalmarsson wrote:
> Hi,
> 
> Sorry for the late answer. I have been out of town for work.
> Interesting that you could not reproduce it. What kind of system do you run?

I ran it up to 64 times, but in a very minimal VM. Let me try to reproduce
the issue on a bare metal system.

> I have now tried on a couple of systems, and they behave slightly
> differently, so maybe the architecture determines what kind of crash
> is triggered, if any?
> Also make sure that the "dd" line in the script is not wrapped like it
> became in the mail.

Yes, I've fixed that up.

> Running the test on a couple of systems, I found the following:
> 
> The systems that crash always did so during the first or the
> second run of the script (so there is probably no need to run it more
> than three or four times to verify).
>
> Two systems (32-bit ARM) pass 5 times.
> One system (Pine64 Rock64, rk3328) crashes during umount with a slightly
> different traceback, but essentially the same as previously reported:
> [503397.433500] Call trace:
> [503397.436338]  __free_pages+0x1c/0x80
> [503397.440506]  __free_raid_bio+0x84/0xf8 [btrfs]
> [503397.445713]  __remove_rbio_from_cache+0x134/0x1b8 [btrfs]
> [503397.451957]  btrfs_clear_rbio_cache.isra.0+0x5c/0x98 [btrfs]
> [503397.458473]  btrfs_free_stripe_hash_table+0x24/0x40 [btrfs]
> [503397.464891]  close_ctree+0x1b0/0x2c8 [btrfs]
> [503397.469851]  btrfs_put_super+0x20/0x30 [btrfs]
> 
> Two systems (one x86_64 Atom, one RPi3 aarch64) crash during balance
> with a dmesg like:
> [10282.926420] Call Trace:
> [10282.926488]  lock_stripe_add+0x292/0x370 [btrfs]
> [10282.926560]  __raid56_parity_write+0x20/0x40 [btrfs]
> [10282.926633]  run_plug+0x131/0x150 [btrfs]
> [10282.926671]  blk_flush_plug_list+0xc2/0x110
> [10282.926708]  blk_finish_plug+0x21/0x2e
> [10282.926769]  btrfs_write_and_wait_transaction.isra.0+0x57/0xa0 [btrfs]
> [10282.926851]  btrfs_commit_transaction+0x72e/0x9a0 [btrfs]
> 
> Three systems (all x86_64; the one marked "tainted" has NVIDIA
> binary blobs) crash during umount with a dmesg like:
> [  658.646613] Call Trace:
> [  658.646675]  __free_raid_bio+0x72/0xb0 [btrfs]
> [  658.646728]  btrfs_free_stripe_hash_table+0x3d/0x70 [btrfs]
> [  658.646766]  close_ctree+0x1ea/0x2f0 [btrfs]
> [  658.646773]  generic_shutdown_super+0x6c/0x100
> [  658.646778]  kill_anon_super+0x14/0x30
> [  658.646808]  btrfs_kill_super+0x12/0xa0 [btrfs]
> 
> I have saved the full dmesg from every run, in case you want to look at
> any of those not posted below.

This could be handy, but I'm not sure yet.
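
Running the reproducer a fixed number of times and stopping at the first crash, as done by hand in this thread, could be scripted roughly as follows. This is only a sketch: `repeat_until_fail` and `one_iteration` are hypothetical names, the paths are taken from the quoted script, and resetting the zram devices and clearing old kernel log entries between runs is left out.

```shell
#!/usr/bin/env bash
# repeat_until_fail MAX CMD [ARGS...]
# Run CMD up to MAX times and stop at the first non-zero exit status.
# Prints the number of successful runs; returns 1 if a run failed.
repeat_until_fail() {
    local max=$1 i
    shift
    for (( i = 1; i <= max; i++ )); do
        if ! "$@"; then
            echo "$(( i - 1 ))"
            return 1
        fi
    done
    echo "$max"
}

# One iteration of the reproducer (needs root and real zram devices).
# The script's exit status is ignored on purpose, because the final
# balance is expected to fail with ENOSPC.
one_iteration() {
    ./run-btrfs-test
    umount /mnt/btrfs-test || return 1
    # Assumption: a crash shows up as this line in the kernel log.
    ! dmesg | grep -q 'BUG: kernel NULL pointer dereference'
}

# Example (not run here): repeat_until_fail 5 one_iteration
```

Since the affected systems reportedly crash within the first or second run, a small iteration count should be enough on bare metal.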


Byte,
	Johannes
-- 
Johannes Thumshirn                            SUSE Labs Filesystems
jthumshirn@suse.de                                +49 911 74053 689
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5
90409 Nürnberg
Germany
(HRB 36809, AG Nürnberg)
Geschäftsführer: Felix Imendörffer
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


* Re: "BUG: kernel NULL pointer dereference," when unmounting filesystem hitted by enospc error
  2019-10-25  8:07       ` Johannes Thumshirn
@ 2019-10-25 10:02         ` Johannes Thumshirn
  0 siblings, 0 replies; 9+ messages in thread
From: Johannes Thumshirn @ 2019-10-25 10:02 UTC (permalink / raw)
  To: Peter Hjalmarsson; +Cc: linux-btrfs

On 25/10/2019 10:07, Johannes Thumshirn wrote:
> On 24/10/2019 22:23, Peter Hjalmarsson wrote:
>> Hi,
>>
>> Sorry for the late answer. I have been out of town for work.
>> Interesting that you could not reproduce it. What kind of system do you run?
> 
> I ran it up to 64 times, but in a very minimal VM. Let me try to reproduce
> the issue on a bare metal system.
> 
>> I have now tried on a couple of systems, and they behave slightly
>> differently, so maybe the architecture determines what kind of crash
>> is triggered, if any?
>> Also make sure that the "dd" line in the script is not wrapped like it
>> became in the mail.
> 
> Yes, I've fixed that up.
> 
>> Running the test on a couple of systems, I found the following:
>>
>> The systems that crash always did so during the first or the
>> second run of the script (so there is probably no need to run it more
>> than three or four times to verify).

Good news. I was able to reproduce this on bare metal and have a kdump core.







* Re: "BUG: kernel NULL pointer dereference," when unmounting filesystem hitted by enospc error
  2019-10-21 14:32   ` Johannes Thumshirn
  2019-10-24 20:23     ` Peter Hjalmarsson
@ 2019-10-27 13:24     ` Su Yue
  2019-10-28  7:50       ` Johannes Thumshirn
  1 sibling, 1 reply; 9+ messages in thread
From: Su Yue @ 2019-10-27 13:24 UTC (permalink / raw)
  To: Johannes Thumshirn, Peter Hjalmarsson, linux-btrfs



On 2019/10/21 10:32 PM, Johannes Thumshirn wrote:
> On 21/10/2019 11:17, Johannes Thumshirn wrote:
> [...]
>>> -----
>>> $ cat run-btrfs-test
>>> modprobe -iv zram num_devices=8
>>> udevadm trigger
>>> sync
>>> zramctl /dev/zram0 -s 8G && \
>>> zramctl /dev/zram1 -s 8G && \
>>> zramctl /dev/zram2 -s 4G && \
>>> zramctl /dev/zram3 -s 4G && \
>>> zramctl /dev/zram4 -s 4G && \
>>> zramctl /dev/zram5 -s 2G && \
>>> zramctl /dev/zram6 -s 2G && \
>>> zramctl /dev/zram7 -s 4G && \
>>> mkfs.btrfs /dev/zram0 && \
>>> mkdir -p /mnt/btrfs-test && \
>>> mount /dev/zram0 /mnt/btrfs-test && \
>>> echo "FS Mounted" && \
>>> btrfs dev add /dev/zram1 /mnt/btrfs-test && \
>>> echo "Devices added" && \
>>> for int in {1..500} ; do dd if=/dev/zero of=/mnt/btrfs-test/file${int} bs=32M count=1 && sync ; done
>>> btrfs dev add /dev/zram[2-7] /mnt/btrfs-test && \
>>> btrfs fi sh /mnt/btrfs-test && \
>>> btrfs fi df /mnt/btrfs-test && \
>>> btrfs bal star -mconvert=raid6 /mnt/btrfs-test && \
>>> btrfs bal star -dconvert=raid6 /mnt/btrfs-test
>>> btrfs fi sh /mnt/btrfs-test && \
>>> btrfs fi df /mnt/btrfs-test
>
> I'm sorry. I ran this script in a loop for 35 iterations on 5.3.6 and
> couldn't reproduce a single crash.
>

I ran into the same interesting thing. It is not reproducible in my VM, but
it is on the host (Arch Linux, v5.3.6, same kernel config).

What's more interesting is that v5.3.7 seems to have fixed the bug.
After some bisecting, the commit is

commit 417d26300214f7b593a99c6bc8badb66492ae322
Author: Qu Wenruo <wqu@suse.com>
Date:   Mon Sep 23 14:56:14 2019 +0800

     btrfs: relocation: fix use-after-free on dead relocation roots

     commit 1fac4a54374f7ef385938f3c6cf7649c0fe4f6cd upstream.


--
Su

>
> Is there anything else needed?
>


* Re: "BUG: kernel NULL pointer dereference," when unmounting filesystem hitted by enospc error
  2019-10-27 13:24     ` Su Yue
@ 2019-10-28  7:50       ` Johannes Thumshirn
  2019-10-28 10:22         ` Peter Hjalmarsson
  0 siblings, 1 reply; 9+ messages in thread
From: Johannes Thumshirn @ 2019-10-28  7:50 UTC (permalink / raw)
  To: Su Yue, Peter Hjalmarsson, linux-btrfs

On 27/10/2019 14:24, Su Yue wrote:
[...]
> 
> I ran into the same interesting thing. It is not reproducible in my VM, but
> it is on the host (Arch Linux, v5.3.6, same kernel config).
> 
> What's more interesting is that v5.3.7 seems to have fixed the bug.
> After some bisecting, the commit is
> 
> commit 417d26300214f7b593a99c6bc8badb66492ae322
> Author: Qu Wenruo <wqu@suse.com>
> Date:   Mon Sep 23 14:56:14 2019 +0800
> 
>     btrfs: relocation: fix use-after-free on dead relocation roots
> 
>     commit 1fac4a54374f7ef385938f3c6cf7649c0fe4f6cd upstream.
> 

Good catch, cherry-picking this commit on top of v5.3.5 resolves the
issue in my setup.
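
Whether a given tag already contains the fix, and how to apply it to an older tag, can be checked with plain git. The sketch below assumes a clone of the stable kernel tree at the hypothetical path ./linux-stable; `commit_in_ref` is a made-up helper name, and the commit id is the stable-tree one quoted above.

```shell
#!/usr/bin/env bash
# commit_in_ref COMMIT REF [REPO]
# Succeeds if COMMIT is an ancestor of REF, i.e. REF contains it.
commit_in_ref() {
    git -C "${3:-.}" merge-base --is-ancestor "$1" "$2"
}

# Example, assuming a stable-tree clone in ./linux-stable (hypothetical path):
#   commit_in_ref 417d26300214f7b593a99c6bc8badb66492ae322 v5.3.7 ./linux-stable \
#       && echo "fix is contained in v5.3.7"
#
# Applying the fix on top of an older tag, as described above:
#   git -C ./linux-stable checkout -b test-fix v5.3.5
#   git -C ./linux-stable cherry-pick -x 417d26300214f7b593a99c6bc8badb66492ae322
```

The `-x` flag records the original commit id in the new commit message, which keeps the origin of the backport traceable.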



* Re: "BUG: kernel NULL pointer dereference," when unmounting filesystem hitted by enospc error
  2019-10-28  7:50       ` Johannes Thumshirn
@ 2019-10-28 10:22         ` Peter Hjalmarsson
  0 siblings, 0 replies; 9+ messages in thread
From: Peter Hjalmarsson @ 2019-10-28 10:22 UTC (permalink / raw)
  To: Johannes Thumshirn; +Cc: Su Yue, linux-btrfs

Thanks!

5.3.7 reached my system from the Fedora repos just a day or two after my
last mail, and I spent last evening trying to reproduce both the problem
the script shows and the problem I was trying to track down when
writing the script. Both seem to be gone after upgrading on the
systems I have been able to run the test on so far.
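
A quick way to tell whether a machine is already on a kernel at or above the release reported as fixed is to compare version strings. This is a heuristic sketch only: `kernel_at_least` is a hypothetical helper, it looks only at the upstream part of the release string, and a distribution kernel with an older version number may still carry the fix as a backport.

```shell
#!/usr/bin/env bash
# kernel_at_least MIN [RELEASE]
# Succeeds if the upstream part of RELEASE (default: `uname -r`) is >= MIN.
kernel_at_least() {
    local min=$1 rel=${2:-$(uname -r)}
    local ver=${rel%%-*}    # "5.3.6-300.fc31.x86_64" -> "5.3.6"
    # sort -V orders version strings; if MIN sorts first (or ties), ver >= MIN.
    [ "$(printf '%s\n' "$min" "$ver" | sort -V | head -n 1)" = "$min" ]
}

# Check the running kernel against the release mentioned in this thread.
if kernel_at_least 5.3.7; then
    echo "running kernel is 5.3.7 or newer"
else
    echo "running kernel is older than 5.3.7"
fi
```

`sort -V` is a GNU coreutils option, so this assumes a GNU userland such as the Fedora systems used here.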

Best Regards,
Peter Hjalmarsson

On Mon, 28 Oct 2019 at 08:50, Johannes Thumshirn <jthumshirn@suse.de> wrote:
>
> On 27/10/2019 14:24, Su Yue wrote:
> [...]
> >
> > I ran into the same interesting thing. It is not reproducible in my VM, but
> > it is on the host (Arch Linux, v5.3.6, same kernel config).
> >
> > What's more interesting is that v5.3.7 seems to have fixed the bug.
> > After some bisecting, the commit is
> >
> > commit 417d26300214f7b593a99c6bc8badb66492ae322
> > Author: Qu Wenruo <wqu@suse.com>
> > Date:   Mon Sep 23 14:56:14 2019 +0800
> >
> >     btrfs: relocation: fix use-after-free on dead relocation roots
> >
> >     commit 1fac4a54374f7ef385938f3c6cf7649c0fe4f6cd upstream.
> >
>
> Good catch, cherry-picking this commit on top of v5.3.5 resolves the
> issue in my setup.
>


Thread overview: 9+ messages
2019-10-19 19:29 "BUG: kernel NULL pointer dereference," when unmounting filesystem hitted by enospc error Peter Hjalmarsson
2019-10-21  9:17 ` Johannes Thumshirn
2019-10-21 14:32   ` Johannes Thumshirn
2019-10-24 20:23     ` Peter Hjalmarsson
2019-10-25  8:07       ` Johannes Thumshirn
2019-10-25 10:02         ` Johannes Thumshirn
2019-10-27 13:24     ` Su Yue
2019-10-28  7:50       ` Johannes Thumshirn
2019-10-28 10:22         ` Peter Hjalmarsson
