* 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
@ 2021-10-12  0:59 Chris Murphy
  2021-10-12  5:25 ` Nikolay Borisov
  0 siblings, 1 reply; 62+ messages in thread
From: Chris Murphy @ 2021-10-12  0:59 UTC (permalink / raw)
  To: Btrfs BTRFS

Linux version 5.14.9-300.fc35.aarch64 Fedora-Cloud-Base-35-20211004.n.0.aarch64
[ 2164.477113] Unable to handle kernel paging request at virtual
address fffffffffffffdd0
[ 2164.483166] Mem abort info:
[ 2164.485300]   ESR = 0x96000004
[ 2164.487824]   EC = 0x25: DABT (current EL), IL = 32 bits
[ 2164.493361]   SET = 0, FnV = 0
[ 2164.496336]   EA = 0, S1PTW = 0
[ 2164.498762]   FSC = 0x04: level 0 translation fault
[ 2164.503031] Data abort info:
[ 2164.509584]   ISV = 0, ISS = 0x00000004
[ 2164.516918]   CM = 0, WnR = 0
[ 2164.523438] swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000000158751000
[ 2164.533628] [fffffffffffffdd0] pgd=0000000000000000, p4d=0000000000000000
[ 2164.543741] Internal error: Oops: 96000004 [#1] SMP
[ 2164.551652] Modules linked in: virtio_gpu virtio_dma_buf
drm_kms_helper cec fb_sys_fops syscopyarea sysfillrect sysimgblt
joydev virtio_net virtio_balloon net_failover failover vfat fat drm
fuse zram ip_tables crct10dif_ce ghash_ce virtio_blk qemu_fw_cfg
virtio_mmio aes_neon_bs
[ 2164.583368] CPU: 2 PID: 8910 Comm: kworker/u8:3 Not tainted
5.14.9-300.fc35.aarch64 #1
[ 2164.593732] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
[ 2164.603204] Workqueue: btrfs-delalloc btrfs_work_helper
[ 2164.611402] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO BTYPE=--)
[ 2164.620165] pc : submit_compressed_extents+0x38/0x3d0
[ 2164.628056] lr : async_cow_submit+0x50/0xd0
[ 2164.635258] sp : ffff800010bfbc20
[ 2164.642585] x29: ffff800010bfbc30 x28: 0000000000000000 x27: ffffdf2b47b11000
[ 2164.652135] x26: fffffffffffffdd0 x25: dead000000000100 x24: ffff00014152d608
[ 2164.661614] x23: 0000000000000000 x22: 0000000000000000 x21: ffff0000c6106980
[ 2164.670886] x20: ffff0000c55e2000 x19: 0000000000000001 x18: ffff0000d3f00bd4
[ 2164.680050] x17: ffff00016f467ff8 x16: 0000000000000006 x15: 72a308ccefd184e0
[ 2164.689179] x14: 5378ed9c2ad24340 x13: 0000000000000020 x12: ffff0001fefa68c0
[ 2164.698178] x11: ffffdf2b47b2b500 x10: 0000000000000000 x9 : ffffdf2b462f2b70
[ 2164.707265] x8 : ffff20d6b742d000 x7 : ffff800010bfbbe0 x6 : ffffdf2b4805ad40
[ 2164.716368] x5 : 0000000000000000 x4 : 0000000000000000 x3 : ffff0000c61069a0
[ 2164.725454] x2 : 0000000000000000 x1 : ffff00014152d630 x0 : ffff00014152d630
[ 2164.734445] Call trace:
[ 2164.739675]  submit_compressed_extents+0x38/0x3d0
[ 2164.746728]  async_cow_submit+0x50/0xd0
[ 2164.752980]  run_ordered_work+0xc8/0x280
[ 2164.759248]  btrfs_work_helper+0x98/0x250
[ 2164.765449]  process_one_work+0x1f0/0x4ac
[ 2164.771558]  worker_thread+0x188/0x504
[ 2164.777395]  kthread+0x110/0x114
[ 2164.782791]  ret_from_fork+0x10/0x18
[ 2164.788343] Code: a9056bf9 f8428437 f9401400 d108c2fa (f9400356)
[ 2164.795833] ---[ end trace e44350b86ce16830 ]---


The downstream bug report has been proposed as a btrfs release-blocking bug.
https://bugzilla.redhat.com/show_bug.cgi?id=2011928

-- 
Chris Murphy


* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
  2021-10-12  0:59 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper Chris Murphy
@ 2021-10-12  5:25 ` Nikolay Borisov
  2021-10-12  6:47   ` Qu Wenruo
  0 siblings, 1 reply; 62+ messages in thread
From: Nikolay Borisov @ 2021-10-12  5:25 UTC (permalink / raw)
  To: Chris Murphy, Btrfs BTRFS; +Cc: Qu Wenruo



On 12.10.21 г. 3:59, Chris Murphy wrote:
> Linux version 5.14.9-300.fc35.aarch64 Fedora-Cloud-Base-35-20211004.n.0.aarch64
> [ 2164.477113] Unable to handle kernel paging request at virtual
> address fffffffffffffdd0
> [ 2164.483166] Mem abort info:
> [ 2164.485300]   ESR = 0x96000004
> [ 2164.487824]   EC = 0x25: DABT (current EL), IL = 32 bits
> [ 2164.493361]   SET = 0, FnV = 0
> [ 2164.496336]   EA = 0, S1PTW = 0
> [ 2164.498762]   FSC = 0x04: level 0 translation fault
> [ 2164.503031] Data abort info:
> [ 2164.509584]   ISV = 0, ISS = 0x00000004
> [ 2164.516918]   CM = 0, WnR = 0
> [ 2164.523438] swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000000158751000
> [ 2164.533628] [fffffffffffffdd0] pgd=0000000000000000, p4d=0000000000000000
> [ 2164.543741] Internal error: Oops: 96000004 [#1] SMP
> [ 2164.551652] Modules linked in: virtio_gpu virtio_dma_buf
> drm_kms_helper cec fb_sys_fops syscopyarea sysfillrect sysimgblt
> joydev virtio_net virtio_balloon net_failover failover vfat fat drm
> fuse zram ip_tables crct10dif_ce ghash_ce virtio_blk qemu_fw_cfg
> virtio_mmio aes_neon_bs
> [ 2164.583368] CPU: 2 PID: 8910 Comm: kworker/u8:3 Not tainted
> 5.14.9-300.fc35.aarch64 #1
> [ 2164.593732] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
> [ 2164.603204] Workqueue: btrfs-delalloc btrfs_work_helper
> [ 2164.611402] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO BTYPE=--)
> [ 2164.620165] pc : submit_compressed_extents+0x38/0x3d0

Qu, isn't this the subpage bug you narrowed down a couple of days ago?

> [ 2164.628056] lr : async_cow_submit+0x50/0xd0
> [ 2164.635258] sp : ffff800010bfbc20
> [ 2164.642585] x29: ffff800010bfbc30 x28: 0000000000000000 x27: ffffdf2b47b11000
> [ 2164.652135] x26: fffffffffffffdd0 x25: dead000000000100 x24: ffff00014152d608
> [ 2164.661614] x23: 0000000000000000 x22: 0000000000000000 x21: ffff0000c6106980
> [ 2164.670886] x20: ffff0000c55e2000 x19: 0000000000000001 x18: ffff0000d3f00bd4
> [ 2164.680050] x17: ffff00016f467ff8 x16: 0000000000000006 x15: 72a308ccefd184e0
> [ 2164.689179] x14: 5378ed9c2ad24340 x13: 0000000000000020 x12: ffff0001fefa68c0
> [ 2164.698178] x11: ffffdf2b47b2b500 x10: 0000000000000000 x9 : ffffdf2b462f2b70
> [ 2164.707265] x8 : ffff20d6b742d000 x7 : ffff800010bfbbe0 x6 : ffffdf2b4805ad40
> [ 2164.716368] x5 : 0000000000000000 x4 : 0000000000000000 x3 : ffff0000c61069a0
> [ 2164.725454] x2 : 0000000000000000 x1 : ffff00014152d630 x0 : ffff00014152d630
> [ 2164.734445] Call trace:
> [ 2164.739675]  submit_compressed_extents+0x38/0x3d0
> [ 2164.746728]  async_cow_submit+0x50/0xd0
> [ 2164.752980]  run_ordered_work+0xc8/0x280
> [ 2164.759248]  btrfs_work_helper+0x98/0x250
> [ 2164.765449]  process_one_work+0x1f0/0x4ac
> [ 2164.771558]  worker_thread+0x188/0x504
> [ 2164.777395]  kthread+0x110/0x114
> [ 2164.782791]  ret_from_fork+0x10/0x18
> [ 2164.788343] Code: a9056bf9 f8428437 f9401400 d108c2fa (f9400356)
> [ 2164.795833] ---[ end trace e44350b86ce16830 ]---
> 
> 
> The downstream bug report has been proposed as a btrfs release-blocking bug.
> https://bugzilla.redhat.com/show_bug.cgi?id=2011928
> 


* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
  2021-10-12  5:25 ` Nikolay Borisov
@ 2021-10-12  6:47   ` Qu Wenruo
  2021-10-12 14:30     ` Chris Murphy
  0 siblings, 1 reply; 62+ messages in thread
From: Qu Wenruo @ 2021-10-12  6:47 UTC (permalink / raw)
  To: Nikolay Borisov, Chris Murphy, Btrfs BTRFS; +Cc: Qu Wenruo



On 2021/10/12 13:25, Nikolay Borisov wrote:
>
>
> On 12.10.21 г. 3:59, Chris Murphy wrote:
>> Linux version 5.14.9-300.fc35.aarch64 Fedora-Cloud-Base-35-20211004.n.0.aarch64
>> [ 2164.477113] Unable to handle kernel paging request at virtual
>> address fffffffffffffdd0
>> [ 2164.483166] Mem abort info:
>> [ 2164.485300]   ESR = 0x96000004
>> [ 2164.487824]   EC = 0x25: DABT (current EL), IL = 32 bits
>> [ 2164.493361]   SET = 0, FnV = 0
>> [ 2164.496336]   EA = 0, S1PTW = 0
>> [ 2164.498762]   FSC = 0x04: level 0 translation fault
>> [ 2164.503031] Data abort info:
>> [ 2164.509584]   ISV = 0, ISS = 0x00000004
>> [ 2164.516918]   CM = 0, WnR = 0
>> [ 2164.523438] swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000000158751000
>> [ 2164.533628] [fffffffffffffdd0] pgd=0000000000000000, p4d=0000000000000000
>> [ 2164.543741] Internal error: Oops: 96000004 [#1] SMP
>> [ 2164.551652] Modules linked in: virtio_gpu virtio_dma_buf
>> drm_kms_helper cec fb_sys_fops syscopyarea sysfillrect sysimgblt
>> joydev virtio_net virtio_balloon net_failover failover vfat fat drm
>> fuse zram ip_tables crct10dif_ce ghash_ce virtio_blk qemu_fw_cfg
>> virtio_mmio aes_neon_bs
>> [ 2164.583368] CPU: 2 PID: 8910 Comm: kworker/u8:3 Not tainted
>> 5.14.9-300.fc35.aarch64 #1
>> [ 2164.593732] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
>> [ 2164.603204] Workqueue: btrfs-delalloc btrfs_work_helper
>> [ 2164.611402] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO BTYPE=--)
>> [ 2164.620165] pc : submit_compressed_extents+0x38/0x3d0
>
> Qu, isn't this the subpage bug you narrowed down a couple of days ago?

Not exactly.

The bug I pinned down is inside my refactored LZO code, not the
generic part, and that refactoring is not yet merged.

Chris, would you mind sharing the code context of the stack?

A quick glance at the code shows it could be a use-after-free bug,
where btrfs_debug() is referencing a member of a freed async_extent
structure.
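
For illustration only, a minimal sketch of that kind of pattern
(hypothetical names, not the actual btrfs source). One hint: x25 in the
register dump above holds dead000000000100, which is LIST_POISON1, the
value list_del() stores in a removed entry's ->next pointer:

	/* hypothetical sketch of a use-after-free while draining the
	 * async_chunk extent list */
	while (!list_empty(&async_chunk->extents)) {
		struct async_extent *ae;

		ae = list_entry(async_chunk->extents.next,
				struct async_extent, list);
		list_del(&ae->list);
		submit_one_extent(ae);	/* may free 'ae' on some paths */
		/* use-after-free: 'ae' may already be gone here */
		btrfs_debug(fs_info, "extent start %llu", ae->start);
	}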

Thanks,
Qu

>
>> [ 2164.628056] lr : async_cow_submit+0x50/0xd0
>> [ 2164.635258] sp : ffff800010bfbc20
>> [ 2164.642585] x29: ffff800010bfbc30 x28: 0000000000000000 x27: ffffdf2b47b11000
>> [ 2164.652135] x26: fffffffffffffdd0 x25: dead000000000100 x24: ffff00014152d608
>> [ 2164.661614] x23: 0000000000000000 x22: 0000000000000000 x21: ffff0000c6106980
>> [ 2164.670886] x20: ffff0000c55e2000 x19: 0000000000000001 x18: ffff0000d3f00bd4
>> [ 2164.680050] x17: ffff00016f467ff8 x16: 0000000000000006 x15: 72a308ccefd184e0
>> [ 2164.689179] x14: 5378ed9c2ad24340 x13: 0000000000000020 x12: ffff0001fefa68c0
>> [ 2164.698178] x11: ffffdf2b47b2b500 x10: 0000000000000000 x9 : ffffdf2b462f2b70
>> [ 2164.707265] x8 : ffff20d6b742d000 x7 : ffff800010bfbbe0 x6 : ffffdf2b4805ad40
>> [ 2164.716368] x5 : 0000000000000000 x4 : 0000000000000000 x3 : ffff0000c61069a0
>> [ 2164.725454] x2 : 0000000000000000 x1 : ffff00014152d630 x0 : ffff00014152d630
>> [ 2164.734445] Call trace:
>> [ 2164.739675]  submit_compressed_extents+0x38/0x3d0
>> [ 2164.746728]  async_cow_submit+0x50/0xd0
>> [ 2164.752980]  run_ordered_work+0xc8/0x280
>> [ 2164.759248]  btrfs_work_helper+0x98/0x250
>> [ 2164.765449]  process_one_work+0x1f0/0x4ac
>> [ 2164.771558]  worker_thread+0x188/0x504
>> [ 2164.777395]  kthread+0x110/0x114
>> [ 2164.782791]  ret_from_fork+0x10/0x18
>> [ 2164.788343] Code: a9056bf9 f8428437 f9401400 d108c2fa (f9400356)
>> [ 2164.795833] ---[ end trace e44350b86ce16830 ]---
>>
>>
>> The downstream bug report has been proposed as a btrfs release-blocking bug.
>> https://bugzilla.redhat.com/show_bug.cgi?id=2011928
>>


* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
  2021-10-12  6:47   ` Qu Wenruo
@ 2021-10-12 14:30     ` Chris Murphy
  2021-10-12 21:24       ` Chris Murphy
  2021-10-12 23:55       ` Qu Wenruo
  0 siblings, 2 replies; 62+ messages in thread
From: Chris Murphy @ 2021-10-12 14:30 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Nikolay Borisov, Chris Murphy, Btrfs BTRFS, Qu Wenruo

On Tue, Oct 12, 2021 at 2:47 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2021/10/12 13:25, Nikolay Borisov wrote:
> >
> >
> > On 12.10.21 г. 3:59, Chris Murphy wrote:
> >> Linux version 5.14.9-300.fc35.aarch64 Fedora-Cloud-Base-35-20211004.n.0.aarch64
> >> [ 2164.477113] Unable to handle kernel paging request at virtual
> >> address fffffffffffffdd0
> >> [ 2164.483166] Mem abort info:
> >> [ 2164.485300]   ESR = 0x96000004
> >> [ 2164.487824]   EC = 0x25: DABT (current EL), IL = 32 bits
> >> [ 2164.493361]   SET = 0, FnV = 0
> >> [ 2164.496336]   EA = 0, S1PTW = 0
> >> [ 2164.498762]   FSC = 0x04: level 0 translation fault
> >> [ 2164.503031] Data abort info:
> >> [ 2164.509584]   ISV = 0, ISS = 0x00000004
> >> [ 2164.516918]   CM = 0, WnR = 0
> >> [ 2164.523438] swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000000158751000
> >> [ 2164.533628] [fffffffffffffdd0] pgd=0000000000000000, p4d=0000000000000000
> >> [ 2164.543741] Internal error: Oops: 96000004 [#1] SMP
> >> [ 2164.551652] Modules linked in: virtio_gpu virtio_dma_buf
> >> drm_kms_helper cec fb_sys_fops syscopyarea sysfillrect sysimgblt
> >> joydev virtio_net virtio_balloon net_failover failover vfat fat drm
> >> fuse zram ip_tables crct10dif_ce ghash_ce virtio_blk qemu_fw_cfg
> >> virtio_mmio aes_neon_bs
> >> [ 2164.583368] CPU: 2 PID: 8910 Comm: kworker/u8:3 Not tainted
> >> 5.14.9-300.fc35.aarch64 #1
> >> [ 2164.593732] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
> >> [ 2164.603204] Workqueue: btrfs-delalloc btrfs_work_helper
> >> [ 2164.611402] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO BTYPE=--)
> >> [ 2164.620165] pc : submit_compressed_extents+0x38/0x3d0
> >
> > Qu, isn't this the subpage bug you narrowed down a couple of days ago?
>
> Not exactly.
>
> The bug I pinned down is inside my refactored LZO code, not the
> generic part, and that refactoring is not yet merged.
>
> Chris, would you mind sharing the code context of the stack?

From the bug report:

* provision a Fedora 35 aarch64 cloud-based VM in openstack
* try rebuilding the kernel rpm (it seems some load on the system is
needed to trigger the issue, but it triggers reliably for me)


So it seems reliably reproducible when compiling the kernel...



-- 
Chris Murphy


* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
  2021-10-12 14:30     ` Chris Murphy
@ 2021-10-12 21:24       ` Chris Murphy
  2021-10-12 23:55       ` Qu Wenruo
  1 sibling, 0 replies; 62+ messages in thread
From: Chris Murphy @ 2021-10-12 21:24 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Qu Wenruo, Nikolay Borisov, Btrfs BTRFS, Qu Wenruo

As it turns out, we have several bugs that look similar, in that they
share a stack trace including

Workqueue: btrfs-delalloc btrfs_work_helper

All three also show compression-related functions.

https://bugzilla.redhat.com/show_bug.cgi?id=2011928
https://bugzilla.redhat.com/show_bug.cgi?id=2006295
https://bugzilla.redhat.com/show_bug.cgi?id=1949334


* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
  2021-10-12 14:30     ` Chris Murphy
  2021-10-12 21:24       ` Chris Murphy
@ 2021-10-12 23:55       ` Qu Wenruo
  2021-10-13 12:14         ` Chris Murphy
  1 sibling, 1 reply; 62+ messages in thread
From: Qu Wenruo @ 2021-10-12 23:55 UTC (permalink / raw)
  To: Chris Murphy, Qu Wenruo; +Cc: Nikolay Borisov, Btrfs BTRFS



On 2021/10/12 22:30, Chris Murphy wrote:
> On Tue, Oct 12, 2021 at 2:47 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>>
>>
>> On 2021/10/12 13:25, Nikolay Borisov wrote:
>>>
>>>
>>> On 12.10.21 г. 3:59, Chris Murphy wrote:
>>>> Linux version 5.14.9-300.fc35.aarch64 Fedora-Cloud-Base-35-20211004.n.0.aarch64
>>>> [ 2164.477113] Unable to handle kernel paging request at virtual
>>>> address fffffffffffffdd0
>>>> [ 2164.483166] Mem abort info:
>>>> [ 2164.485300]   ESR = 0x96000004
>>>> [ 2164.487824]   EC = 0x25: DABT (current EL), IL = 32 bits
>>>> [ 2164.493361]   SET = 0, FnV = 0
>>>> [ 2164.496336]   EA = 0, S1PTW = 0
>>>> [ 2164.498762]   FSC = 0x04: level 0 translation fault
>>>> [ 2164.503031] Data abort info:
>>>> [ 2164.509584]   ISV = 0, ISS = 0x00000004
>>>> [ 2164.516918]   CM = 0, WnR = 0
>>>> [ 2164.523438] swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000000158751000
>>>> [ 2164.533628] [fffffffffffffdd0] pgd=0000000000000000, p4d=0000000000000000
>>>> [ 2164.543741] Internal error: Oops: 96000004 [#1] SMP
>>>> [ 2164.551652] Modules linked in: virtio_gpu virtio_dma_buf
>>>> drm_kms_helper cec fb_sys_fops syscopyarea sysfillrect sysimgblt
>>>> joydev virtio_net virtio_balloon net_failover failover vfat fat drm
>>>> fuse zram ip_tables crct10dif_ce ghash_ce virtio_blk qemu_fw_cfg
>>>> virtio_mmio aes_neon_bs
>>>> [ 2164.583368] CPU: 2 PID: 8910 Comm: kworker/u8:3 Not tainted
>>>> 5.14.9-300.fc35.aarch64 #1
>>>> [ 2164.593732] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
>>>> [ 2164.603204] Workqueue: btrfs-delalloc btrfs_work_helper
>>>> [ 2164.611402] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO BTYPE=--)
>>>> [ 2164.620165] pc : submit_compressed_extents+0x38/0x3d0
>>>
>>> Qu, isn't this the subpage bug you narrowed down a couple of days ago?
>>
>> Not exactly.
>>
>> The bug I pinned down is inside my refactored LZO code, not the
>> generic part, and that refactoring is not yet merged.
>>
>> Chris, would you mind sharing the code context of the stack?
> 
>  From the bug report:
> 
> * provision a Fedora 35 aarch64 cloud-based VM in openstack
> * try rebuilding the kernel rpm (it seems some load on the system is
> needed to trigger the issue, but it triggers reliably for me)
> 
> 
> So it seems reliably reproducible when compiling the kernel...

I mean, the code line number...

It can be extracted using the <linux_source>/scripts/faddr2line script.
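
The general form, if I recall the script's usage correctly, is:

$ ./scripts/faddr2line <object file with debug info> <func>+<offset>

with one func+offset pair per frame of interest.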

Thanks,
Qu



* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
  2021-10-12 23:55       ` Qu Wenruo
@ 2021-10-13 12:14         ` Chris Murphy
  2021-10-13 12:18           ` Qu Wenruo
  0 siblings, 1 reply; 62+ messages in thread
From: Chris Murphy @ 2021-10-13 12:14 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Chris Murphy, Qu Wenruo, Nikolay Borisov, Btrfs BTRFS

OK so something like this:

/usr/src/kernels/5.14.9-300.fc35.aarch64/scripts/faddr2line
/boot/vmlinuz-5.14.9-300.fc35.aarch64 submit_compressed_extents+0x38
async_cow_submit+0x50 run_ordered_work+0xc8 btrfs_work_helper+0x98
process_one_work+0x1f0 worker_thread+0x188 kthread+0x110
ret_from_fork+0x10

Chris Murphy


* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
  2021-10-13 12:14         ` Chris Murphy
@ 2021-10-13 12:18           ` Qu Wenruo
  2021-10-13 12:27             ` Chris Murphy
  0 siblings, 1 reply; 62+ messages in thread
From: Qu Wenruo @ 2021-10-13 12:18 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Qu Wenruo, Nikolay Borisov, Btrfs BTRFS



On 2021/10/13 20:14, Chris Murphy wrote:
> OK so something like this:
> 
> /usr/src/kernels/5.14.9-300.fc35.aarch64/scripts/faddr2line
> /boot/vmlinuz-5.14.9-300.fc35.aarch64 submit_compressed_extents+0x38
> async_cow_submit+0x50 run_ordered_work+0xc8 btrfs_work_helper+0x98
> process_one_work+0x1f0 worker_thread+0x188 kthread+0x110
> ret_from_fork+0x10


Sorry, it only needs the last frame of the stack (submit_compressed_extents+0x38).

The full command would look like this:

$ ./scripts/faddr2line fs/btrfs/btrfs.ko submit_compressed_extents+0x38

The module needs to have debug info though.

Example on my x86_64 VM (which would definitely give a wrong line number):

$ ./scripts/faddr2line fs/btrfs/btrfs.ko  submit_compressed_extents+0x38
submit_compressed_extents+0x38/0x440:
submit_compressed_extents at /home/adam/linux/fs/btrfs/inode.c:1041

Thanks,
Qu
> 
> Chris Murphy
> 



* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
  2021-10-13 12:18           ` Qu Wenruo
@ 2021-10-13 12:27             ` Chris Murphy
  2021-10-13 12:29               ` Nikolay Borisov
  0 siblings, 1 reply; 62+ messages in thread
From: Chris Murphy @ 2021-10-13 12:27 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Chris Murphy, Qu Wenruo, Nikolay Borisov, Btrfs BTRFS

On Wed, Oct 13, 2021 at 8:18 AM Qu Wenruo <wqu@suse.com> wrote:

> Sorry, it only needs the last frame of the stack (submit_compressed_extents+0x38).
>
> The full command would look like this:
>
> $ ./scripts/faddr2line fs/btrfs/btrfs.ko submit_compressed_extents+0x38

btrfs is built-in on Fedora kernels, so there's no btrfs.ko. When I do:

$ /usr/src/kernels/5.14.9-300.fc35.x86_64/scripts/faddr2line
/boot/vmlinuz-5.14.9-300.fc35.x86_64 submit_compressed_extents+0x38
readelf: /boot/vmlinuz-5.14.9-300.fc35.x86_64: Error: Not an ELF file
- it has the wrong magic bytes at the start
nm: /boot/vmlinuz-5.14.9-300.fc35.x86_64: no symbols
nm: /boot/vmlinuz-5.14.9-300.fc35.x86_64: no symbols
no match for submit_compressed_extents+0x38


> The module needs to have debug info though.

CONFIG_BTRFS_DEBUG?

Neither regular nor debug kernels have this set; we're only setting
CONFIG_BTRFS_ASSERT on debug kernels.


-- 
Chris Murphy


* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
  2021-10-13 12:27             ` Chris Murphy
@ 2021-10-13 12:29               ` Nikolay Borisov
  2021-10-13 12:43                 ` Chris Murphy
  0 siblings, 1 reply; 62+ messages in thread
From: Nikolay Borisov @ 2021-10-13 12:29 UTC (permalink / raw)
  To: Chris Murphy, Qu Wenruo; +Cc: Qu Wenruo, Btrfs BTRFS



On 13.10.21 г. 15:27, Chris Murphy wrote:
> On Wed, Oct 13, 2021 at 8:18 AM Qu Wenruo <wqu@suse.com> wrote:
> 
>> Sorry, it only needs the last frame of the stack (submit_compressed_extents+0x38).
>>
>> The full command would look like this:
>>
>> $ ./scripts/faddr2line fs/btrfs/btrfs.ko submit_compressed_extents+0x38
> 
> btrfs is built-in on Fedora kernels, so there's no btrfs.ko. When I do:
> 
> $ /usr/src/kernels/5.14.9-300.fc35.x86_64/scripts/faddr2line
> /boot/vmlinuz-5.14.9-300.fc35.x86_64 submit_compressed_extents+0x38
> readelf: /boot/vmlinuz-5.14.9-300.fc35.x86_64: Error: Not an ELF file
> - it has the wrong magic bytes at the start
> nm: /boot/vmlinuz-5.14.9-300.fc35.x86_64: no symbols
> nm: /boot/vmlinuz-5.14.9-300.fc35.x86_64: no symbols
> no match for submit_compressed_extents+0x38
> 
> 
>> The module needs to have debug info though.
> 
> CONFIG_BTRFS_DEBUG?
>
> Neither regular nor debug kernels have this set; we're only setting
> CONFIG_BTRFS_ASSERT on debug kernels.
> 

No, debug info is introduced by CONFIG_DEBUG_INFO, so you need the
kernel debug package for Fedora, i.e. vmlinuz.debug or some such?
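
On Fedora that would presumably be something like this (assuming the
debuginfo repositories are enabled):

$ sudo dnf debuginfo-install kernel

which should pull in a vmlinux with full debug info.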

> 


* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
  2021-10-13 12:29               ` Nikolay Borisov
@ 2021-10-13 12:43                 ` Chris Murphy
  2021-10-13 12:46                   ` Nikolay Borisov
  0 siblings, 1 reply; 62+ messages in thread
From: Chris Murphy @ 2021-10-13 12:43 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Chris Murphy, Qu Wenruo, Qu Wenruo, Btrfs BTRFS

On Wed, Oct 13, 2021 at 8:29 AM Nikolay Borisov <nborisov@suse.com> wrote:
>
>
>
> On 13.10.21 г. 15:27, Chris Murphy wrote:
> > On Wed, Oct 13, 2021 at 8:18 AM Qu Wenruo <wqu@suse.com> wrote:
> >
> >> Sorry, it only needs the last frame of the stack (submit_compressed_extents+0x38).
> >>
> >> The full command would look like this:
> >>
> >> $ ./scripts/faddr2line fs/btrfs/btrfs.ko submit_compressed_extents+0x38
> >
> > btrfs is built-in on Fedora kernels, so there's no btrfs.ko. When I do:
> >
> > $ /usr/src/kernels/5.14.9-300.fc35.x86_64/scripts/faddr2line
> > /boot/vmlinuz-5.14.9-300.fc35.x86_64 submit_compressed_extents+0x38
> > readelf: /boot/vmlinuz-5.14.9-300.fc35.x86_64: Error: Not an ELF file
> > - it has the wrong magic bytes at the start
> > nm: /boot/vmlinuz-5.14.9-300.fc35.x86_64: no symbols
> > nm: /boot/vmlinuz-5.14.9-300.fc35.x86_64: no symbols
> > no match for submit_compressed_extents+0x38
> >
> >
> >> The module needs to have debug info though.
> >
> > CONFIG_BTRFS_DEBUG?
> >
> > Neither regular nor debug kernels have this set; we're only setting
> > CONFIG_BTRFS_ASSERT on debug kernels.
> >
>
> No, debug info is introduced by CONFIG_DEBUG_INFO

CONFIG_DEBUG_INFO=y even in regular kernels.

> so you need the kernel debug package for Fedora, i.e. vmlinuz.debug or
> some such?

Each kernel has kernel-debuginfo and kernel-debuginfo-common, ~735M.
I installed that and yet I get the same error, so I'm not sure I'm
pointing at the correct object.

-- 
Chris Murphy


* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
  2021-10-13 12:43                 ` Chris Murphy
@ 2021-10-13 12:46                   ` Nikolay Borisov
  2021-10-13 12:55                     ` Chris Murphy
  0 siblings, 1 reply; 62+ messages in thread
From: Nikolay Borisov @ 2021-10-13 12:46 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Qu Wenruo, Qu Wenruo, Btrfs BTRFS



On 13.10.21 г. 15:43, Chris Murphy wrote:
> On Wed, Oct 13, 2021 at 8:29 AM Nikolay Borisov <nborisov@suse.com> wrote:
>>
>>
>>
>> On 13.10.21 г. 15:27, Chris Murphy wrote:
>>> On Wed, Oct 13, 2021 at 8:18 AM Qu Wenruo <wqu@suse.com> wrote:
>>>
>>>> Sorry, it only needs the last frame of the stack (submit_compressed_extents+0x38).
>>>>
>>>> The full command would look like this:
>>>>
>>>> $ ./scripts/faddr2line fs/btrfs/btrfs.ko submit_compressed_extents+0x38
>>>
>>> btrfs is built-in on Fedora kernels, so there's no btrfs.ko. When I do:
>>>
>>> $ /usr/src/kernels/5.14.9-300.fc35.x86_64/scripts/faddr2line
>>> /boot/vmlinuz-5.14.9-300.fc35.x86_64 submit_compressed_extents+0x38
>>> readelf: /boot/vmlinuz-5.14.9-300.fc35.x86_64: Error: Not an ELF file
>>> - it has the wrong magic bytes at the start
>>> nm: /boot/vmlinuz-5.14.9-300.fc35.x86_64: no symbols
>>> nm: /boot/vmlinuz-5.14.9-300.fc35.x86_64: no symbols
>>> no match for submit_compressed_extents+0x38
>>>
>>>
>>>> The module needs to have debug info though.
>>>
>>> CONFIG_BTRFS_DEBUG?
>>>
>>> Neither regular nor debug kernels have this set; we're only setting
>>> CONFIG_BTRFS_ASSERT on debug kernels.
>>>
>>
>> No, debug info is introduced by CONFIG_DEBUG_INFO
> 
> CONFIG_DEBUG_INFO=y even in regular kernels.
> 
>> so you need the kernel debug package for Fedora, i.e. vmlinuz.debug or
>> some such?
> 
> Each kernel has kernel-debuginfo and kernel-debuginfo-common, ~735M.
> Installed that and yet I get the same error, so I'm not sure I'm
> pointing to the correct object.

Your kernel's debug info is likely somewhere under /usr/lib/debug/`uname
-r`/vmlinux - I got this from
https://fedoraproject.org/wiki/How_to_use_kdump_to_debug_kernel_crashes,
step 3

> 


* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
  2021-10-13 12:46                   ` Nikolay Borisov
@ 2021-10-13 12:55                     ` Chris Murphy
  2021-10-13 19:21                       ` Chris Murphy
  0 siblings, 1 reply; 62+ messages in thread
From: Chris Murphy @ 2021-10-13 12:55 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Chris Murphy, Qu Wenruo, Qu Wenruo, Btrfs BTRFS

OK here we go (I think):
$ /usr/src/kernels/5.14.9-300.fc35.x86_64/scripts/faddr2line
/usr/lib/debug/lib/modules/5.14.9-300.fc35.x86_64/vmlinux
submit_compressed_extents+0x38
submit_compressed_extents+0x38/0x3f0:
BTRFS_I at fs/btrfs/btrfs_inode.h:234
(inlined by) submit_compressed_extents at fs/btrfs/inode.c:844
[chris@fovo Downloads]$


So I just need to ask the original reporter to do this on aarch64
(above I'm using the function+offset from aarch64 but x86_64
debuginfo, so it may not be a valid line number).


Chris Murphy


* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
  2021-10-13 12:55                     ` Chris Murphy
@ 2021-10-13 19:21                       ` Chris Murphy
  2021-10-18  1:57                         ` Chris Murphy
  0 siblings, 1 reply; 62+ messages in thread
From: Chris Murphy @ 2021-10-13 19:21 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Nikolay Borisov, Qu Wenruo, Qu Wenruo, Btrfs BTRFS

From the downstream bug:

[root@openqa-a64-worker03 adamwill][PROD]#
/usr/src/kernels/5.14.9-300.fc35.aarch64/scripts/faddr2line
/usr/lib/debug/lib/modules/5.14.9-300.fc35.aarch64/vmlinux
submit_compressed_extents+0x38
submit_compressed_extents+0x38/0x3d0:
submit_compressed_extents at
/usr/src/debug/kernel-5.14.9/linux-5.14.9-300.fc35.aarch64/fs/btrfs/inode.c:845
[root@openqa-a64-worker03 adamwill][PROD]#

https://bugzilla.redhat.com/show_bug.cgi?id=2011928#c26

Also curious: this problem is only happening in openstack
environments, as if the host environment matters. Does that make
sense?


--
Chris Murphy


* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
  2021-10-13 19:21                       ` Chris Murphy
@ 2021-10-18  1:57                         ` Chris Murphy
  2021-10-18 11:32                           ` Su Yue
  0 siblings, 1 reply; 62+ messages in thread
From: Chris Murphy @ 2021-10-18  1:57 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Nikolay Borisov, Qu Wenruo, Qu Wenruo, Btrfs BTRFS

Any update on this problem and whether+what more info is needed?

Thanks,
Chris Murphy

On Wed, Oct 13, 2021 at 3:21 PM Chris Murphy <lists@colorremedies.com> wrote:
>
> From the downstream bug:
>
> [root@openqa-a64-worker03 adamwill][PROD]#
> /usr/src/kernels/5.14.9-300.fc35.aarch64/scripts/faddr2line
> /usr/lib/debug/lib/modules/5.14.9-300.fc35.aarch64/vmlinux
> submit_compressed_extents+0x38
> submit_compressed_extents+0x38/0x3d0:
> submit_compressed_extents at
> /usr/src/debug/kernel-5.14.9/linux-5.14.9-300.fc35.aarch64/fs/btrfs/inode.c:845
> [root@openqa-a64-worker03 adamwill][PROD]#
>
> https://bugzilla.redhat.com/show_bug.cgi?id=2011928#c26
>
> Also curious: this problem is only happening in openstack
> environments, as if the host environment matters. Does that make
> sense?
>
>
> --
> Chris Murphy



-- 
Chris Murphy


* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
  2021-10-18  1:57                         ` Chris Murphy
@ 2021-10-18 11:32                           ` Su Yue
  2021-10-18 13:28                             ` Qu Wenruo
  0 siblings, 1 reply; 62+ messages in thread
From: Su Yue @ 2021-10-18 11:32 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Nikolay Borisov, Qu Wenruo, Qu Wenruo, Btrfs BTRFS


On Sun 17 Oct 2021 at 21:57, Chris Murphy 
<lists@colorremedies.com> wrote:

> Any update on this problem and whether+what more info is needed?
>
It's interesting that the OOPS only happens in the openstack environment.
Is it possible to provide the kernel core dump?

--
Su
> Thanks,
> Chris Murphy
>
> On Wed, Oct 13, 2021 at 3:21 PM Chris Murphy 
> <lists@colorremedies.com> wrote:
>>
>> From the downstream bug:
>>
>> [root@openqa-a64-worker03 adamwill][PROD]#
>> /usr/src/kernels/5.14.9-300.fc35.aarch64/scripts/faddr2line
>> /usr/lib/debug/lib/modules/5.14.9-300.fc35.aarch64/vmlinux
>> submit_compressed_extents+0x38
>> submit_compressed_extents+0x38/0x3d0:
>> submit_compressed_extents at
>> /usr/src/debug/kernel-5.14.9/linux-5.14.9-300.fc35.aarch64/fs/btrfs/inode.c:845
>> [root@openqa-a64-worker03 adamwill][PROD]#
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=2011928#c26
>>
>> Also curious: this problem is only happening in openstack
>> environments, as if the host environment matters. Does that 
>> make
>> sense?
>>
>>
>> --
>> Chris Murphy


* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
  2021-10-18 11:32                           ` Su Yue
@ 2021-10-18 13:28                             ` Qu Wenruo
  2021-10-18 14:49                               ` Chris Murphy
  0 siblings, 1 reply; 62+ messages in thread
From: Qu Wenruo @ 2021-10-18 13:28 UTC (permalink / raw)
  To: Su Yue, Chris Murphy; +Cc: Nikolay Borisov, Qu Wenruo, Btrfs BTRFS



On 2021/10/18 19:32, Su Yue wrote:
>
> On Sun 17 Oct 2021 at 21:57, Chris Murphy <lists@colorremedies.com> wrote:
>
>> Any update on this problem and whether+what more info is needed?
>>
> It's interesting that the OOPS only happens in the openstack environment.

Or the toolchain?

I also tried my misc-next build using GCC on my RPi4; fstests with
compression never crashes the kernel.

Thanks,
Qu

> Is it possible to provide the kernel core dump?
>
> --
> Su
>> Thanks,
>> Chris Murphy
>>
>> On Wed, Oct 13, 2021 at 3:21 PM Chris Murphy <lists@colorremedies.com>
>> wrote:
>>>
>>> From the downstream bug:
>>>
>>> [root@openqa-a64-worker03 adamwill][PROD]#
>>> /usr/src/kernels/5.14.9-300.fc35.aarch64/scripts/faddr2line
>>> /usr/lib/debug/lib/modules/5.14.9-300.fc35.aarch64/vmlinux
>>> submit_compressed_extents+0x38
>>> submit_compressed_extents+0x38/0x3d0:
>>> submit_compressed_extents at
>>> /usr/src/debug/kernel-5.14.9/linux-5.14.9-300.fc35.aarch64/fs/btrfs/inode.c:845
>>>
>>> [root@openqa-a64-worker03 adamwill][PROD]#
>>>
>>> https://bugzilla.redhat.com/show_bug.cgi?id=2011928#c26
>>>
>>> Also curious: this problem is only happening in openstack
>>> environments, as if the host environment matters. Does that make
>>> sense?
>>>
>>>
>>> --
>>> Chris Murphy


* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
  2021-10-18 13:28                             ` Qu Wenruo
@ 2021-10-18 14:49                               ` Chris Murphy
  2021-10-18 18:24                                 ` Chris Murphy
  0 siblings, 1 reply; 62+ messages in thread
From: Chris Murphy @ 2021-10-18 14:49 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Su Yue, Chris Murphy, Nikolay Borisov, Qu Wenruo, Btrfs BTRFS

On Mon, Oct 18, 2021 at 9:28 AM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2021/10/18 19:32, Su Yue wrote:
> >
> > On Sun 17 Oct 2021 at 21:57, Chris Murphy <lists@colorremedies.com> wrote:
> >
> >> Any update on this problem and whether+what more info is needed?
> >>
> > It's interesting that the OOPS only happens in the openstack environment.
>
> Or the toolchain?

In the earliest known instance so far, from April 2021, it was this kernel:

Apr 13 20:47:35 fedora kernel: Linux version 5.11.12-300.fc34.aarch64
(mockbuild@buildvm-a64-10.iad2.fedoraproject.org) (gcc (GCC) 11.0.1
20210324 (Red Hat 11.0.1-0), GNU ld version 2.35.1-41.fc34) #1 SMP Wed
Apr 7 16:12:21 UTC 2021

> > Is it possible to provide the kernel core dump?

I'll look into it.

-- 
Chris Murphy


* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
  2021-10-18 14:49                               ` Chris Murphy
@ 2021-10-18 18:24                                 ` Chris Murphy
  2021-10-19  1:24                                   ` Su Yue
  2021-10-19  1:25                                   ` Qu Wenruo
  0 siblings, 2 replies; 62+ messages in thread
From: Chris Murphy @ 2021-10-18 18:24 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Qu Wenruo, Su Yue, Nikolay Borisov, Qu Wenruo, Btrfs BTRFS

I've got kdump.service set up and ready, but I'm not sure about two things:

1.
$ git clone https://github.com/kdave/btrfs-devel
Cloning into 'btrfs-devel'...
fatal: error reading section header 'shallow-info'

2.
How to capture the kernel core dump: do I need to do anything to
trigger it other than reproducing the reported problem, or will I need
to do sysrq+c or similar?

If it's faster, I can also get any developer access to the VM...

-- 
Chris Murphy


* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
  2021-10-18 18:24                                 ` Chris Murphy
@ 2021-10-19  1:24                                   ` Su Yue
  2021-10-19 18:26                                     ` Chris Murphy
  2021-10-19  1:25                                   ` Qu Wenruo
  1 sibling, 1 reply; 62+ messages in thread
From: Su Yue @ 2021-10-19  1:24 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Qu Wenruo, Nikolay Borisov, Qu Wenruo, Btrfs BTRFS


On Mon 18 Oct 2021 at 14:24, Chris Murphy 
<lists@colorremedies.com> wrote:

> I've got kdump.service set up and ready, but I'm not sure about
> two things:
>
> 1.
> $ git clone https://github.com/kdave/btrfs-devel
> Cloning into 'btrfs-devel'...
> fatal: error reading section header 'shallow-info'
>
> 2.
> How to capture the kernel core dump: do I need to do anything to
> trigger it other than reproducing the reported problem, or will I
> need to do sysrq+c or similar?
>
No need to do anything else; you can test whether kdump works by
triggering a panic using sysrq.
Since it's just a kernel panic, rebuilding the kernel rpm to reproduce
it is enough.

Note: the crashkernel parameter can be adjusted if no vmcore was
produced.
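
For example, something like this (standard sysrq interface; it crashes
the machine immediately, so make sure kdump is armed first):

# echo 1 > /proc/sys/kernel/sysrq
# echo c > /proc/sysrq-trigger

Then check /var/crash/ for a new vmcore after the reboot.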

--
Su

> If it's faster, I can also get any developer access to the VM...


* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
  2021-10-18 18:24                                 ` Chris Murphy
  2021-10-19  1:24                                   ` Su Yue
@ 2021-10-19  1:25                                   ` Qu Wenruo
  1 sibling, 0 replies; 62+ messages in thread
From: Qu Wenruo @ 2021-10-19  1:25 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Su Yue, Nikolay Borisov, Qu Wenruo, Btrfs BTRFS



On 2021/10/19 02:24, Chris Murphy wrote:
> I've got kdump.service set up and ready, but I'm not sure about two things:
>
> 1.
> $ git clone https://github.com/kdave/btrfs-devel
> Cloning into 'btrfs-devel'...
> fatal: error reading section header 'shallow-info'

This looks strange; maybe you need to clone Torvalds' tree, and just add
that tree as a remote?

>
> 2.
> How to capture the kernel core dump: do I need to do anything to
> trigger it other than reproducing the reported problem, or will I need
> to do sysrq+c or similar?
>
> If it's faster, I can also get any developer access to the VM...
>
The current backtrace already points us to exactly where the problem is.

Even if I had access to the VM, I would still need to download the source
code and rebuild the kernel myself to add extra debugging code.
And I suspect that if I built the kernel using a different config/compiler,
it would no longer reproduce the bug.

Thanks,
Qu


* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
  2021-10-19  1:24                                   ` Su Yue
@ 2021-10-19 18:26                                     ` Chris Murphy
  2021-10-19 23:42                                       ` Su Yue
  0 siblings, 1 reply; 62+ messages in thread
From: Chris Murphy @ 2021-10-19 18:26 UTC (permalink / raw)
  To: Su Yue; +Cc: Chris Murphy, Qu Wenruo, Nikolay Borisov, Qu Wenruo, Btrfs BTRFS

Still working on the kernel core dump and should have something soon
(I blew up the VM and had to start over); should I run the 'crash'
command on it afterward? Or upload the dump file to e.g. google drive?

Also, I came across this ext4 issue happening on aarch64 (openstack
too), but I have no idea if it's related. And if so, whether it means
there's a common problem outside of btrfs?
https://github.com/coreos/fedora-coreos-tracker/issues/965

I mentioned this bug report up thread:
https://bugzilla.redhat.com/show_bug.cgi?id=1949334
but to summarize: it has the same btrfs call trace we've been looking
at in this email thread, but it's NOT on openstack; it's on actual
hardware (amberwing).



--
Chris Murphy


* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
  2021-10-19 18:26                                     ` Chris Murphy
@ 2021-10-19 23:42                                       ` Su Yue
  2021-10-20  1:21                                         ` Qu Wenruo
                                                           ` (3 more replies)
  0 siblings, 4 replies; 62+ messages in thread
From: Su Yue @ 2021-10-19 23:42 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Qu Wenruo, Nikolay Borisov, Qu Wenruo, Btrfs BTRFS


On Tue 19 Oct 2021 at 14:26, Chris Murphy 
<lists@colorremedies.com> wrote:

> Still working on the kernel core dump and should have something 
> soon
> (I blew up the VM and had to start over); should I run the 
> 'crash'
> command on it afterward? Or upload the dump file to e.g. google 
> drive?
>
The dump file and the vmlinu[zx] kernel file are needed.
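
The analysis would then typically be something like this (Fedora's
default paths assumed; the vmcore directory name is a placeholder):

$ crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux \
	/var/crash/<timestamp>/vmcore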

> Also, I came across this ext4 issue happening on aarch64 
> (openstack
> too), but I have no idea if it's related. And if so, whether it 
> means
> there's a common problem outside of btrfs?
> https://github.com/coreos/fedora-coreos-tracker/issues/965
>
Already noticed that issue. Let's wait for the vmcore.

Any idea, Qu?

--
Su
> I mentioned this bug report up thread:
> https://bugzilla.redhat.com/show_bug.cgi?id=1949334
> but to summarize: it has the same btrfs call trace we've been 
> looking
> at in this email thread, but it's NOT on openstack, but actual
> hardware (amberwing).


* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
  2021-10-19 23:42                                       ` Su Yue
@ 2021-10-20  1:21                                         ` Qu Wenruo
  2021-10-20  1:25                                         ` Chris Murphy
                                                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 62+ messages in thread
From: Qu Wenruo @ 2021-10-20  1:21 UTC (permalink / raw)
  To: Su Yue, Chris Murphy; +Cc: Nikolay Borisov, Qu Wenruo, Btrfs BTRFS



On 2021/10/20 07:42, Su Yue wrote:
>
> On Tue 19 Oct 2021 at 14:26, Chris Murphy <lists@colorremedies.com> wrote:
>
>> Still working on the kernel core dump and should have something soon
>> (I blew up the VM and had to start over); should I run the 'crash'
>> command on it afterward? Or upload the dump file to e.g. google drive?
>>
> Dump file and vmlinu[zx] kernel file are needed.
>
>> Also, I came across this ext4 issue happening on aarch64 (openstack
>> too), but I have no idea if it's related. And if so, whether it means
>> there's a common problem outside of btrfs?
>> https://github.com/coreos/fedora-coreos-tracker/issues/965
>>
> Already noticed that issue. Let's wait for the vmcore.

No idea at all.

In fact I'm not even familiar with kdump-based analysis, and would prefer
to manually add extra debugging output to make sure things are going as
expected.

BTW, where can I find the compiler used for those pre-compiled kernels?
Currently I'm suspecting the toolchain as the root cause.

Thanks,
Qu
>
> Any idea, Qu?
>
> --
> Su
>> I mentioned this bug report up thread:
>> https://bugzilla.redhat.com/show_bug.cgi?id=1949334
>> but to summarize: it has the same btrfs call trace we've been looking
>> at in this email thread, but it's NOT on openstack, but actual
>> hardware (amberwing).


* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
  2021-10-19 23:42                                       ` Su Yue
  2021-10-20  1:21                                         ` Qu Wenruo
@ 2021-10-20  1:25                                         ` Chris Murphy
  2021-10-20 23:55                                         ` Chris Murphy
  2021-10-22  2:36                                         ` Chris Murphy
  3 siblings, 0 replies; 62+ messages in thread
From: Chris Murphy @ 2021-10-20  1:25 UTC (permalink / raw)
  To: Su Yue; +Cc: Chris Murphy, Qu Wenruo, Nikolay Borisov, Qu Wenruo, Btrfs BTRFS

On Tue, Oct 19, 2021 at 9:10 PM Su Yue <l@damenly.su> wrote:
>
>
> On Tue 19 Oct 2021 at 14:26, Chris Murphy
> <lists@colorremedies.com> wrote:
>
> > Still working on the kernel core dump and should have something
> > soon
> > (I blew up the VM and had to start over); should I run the
> > 'crash'
> > command on it afterward? Or upload the dump file to e.g. google
> > drive?
> >
> Dump file and vmlinu[zx] kernel file are needed.
>
> > Also, I came across this ext4 issue happening on aarch64
> > (openstack
> > too), but I have no idea if it's related. And if so, whether it
> > means
> > there's a common problem outside of btrfs?
> > https://github.com/coreos/fedora-coreos-tracker/issues/965
> >
> Already noticed that issue. Let's wait for the vmcore.
>
> Any idea, Qu?
>

So it's been compiling for multiple hours, while also doing a large
package installation for about an hour of that time, and still no oops
or kernel messages...


-- 
Chris Murphy


* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
  2021-10-19 23:42                                       ` Su Yue
  2021-10-20  1:21                                         ` Qu Wenruo
  2021-10-20  1:25                                         ` Chris Murphy
@ 2021-10-20 23:55                                         ` Chris Murphy
  2021-10-21  0:29                                           ` Su Yue
  2021-10-21  5:56                                           ` Nikolay Borisov
  2021-10-22  2:36                                         ` Chris Murphy
  3 siblings, 2 replies; 62+ messages in thread
From: Chris Murphy @ 2021-10-20 23:55 UTC (permalink / raw)
  To: Su Yue; +Cc: Chris Murphy, Qu Wenruo, Nikolay Borisov, Qu Wenruo, Btrfs BTRFS

On Tue, Oct 19, 2021 at 9:10 PM Su Yue <l@damenly.su> wrote:
>
> Dump file and vmlinu[zx] kernel file are needed.

So we get a splat but kdump doesn't create a vmcore. Do we need to
issue sysrq+c at the time of the hang and splat to create it?

Fedora Linux 35 (Cloud Edition)
Kernel 5.14.10-300.fc35.aarch64 on an aarch64 (ttyAMA0)

eth0: 199.204.45.141 2604:e100:1:0:f816:3eff:fe72:c876
dusty-35 login: [  286.982605] Unable to handle kernel paging request
at virtual address fffffffffffffdd0
[  286.988338] Mem abort info:
[  286.990307]   ESR = 0x96000004
[  286.992596]   EC = 0x25: DABT (current EL), IL = 32 bits
[  286.996316]   SET = 0, FnV = 0
[  286.998454]   EA = 0, S1PTW = 0
[  287.000791]   FSC = 0x04: level 0 translation fault
[  287.004472] Data abort info:
[  287.006540]   ISV = 0, ISS = 0x00000004
[  287.009239]   CM = 0, WnR = 0
[  287.011344] swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000000054181000
[  287.018245] [fffffffffffffdd0] pgd=0000000000000000, p4d=0000000000000000
[  287.024209] Internal error: Oops: 96000004 [#1] SMP
[  287.027615] Modules linked in: virtio_gpu virtio_dma_buf
drm_kms_helper cec joydev fb_sys_fops syscopyarea virtio_net
sysfillrect sysimgblt net_failover virtio_balloon failover vfat fat
drm fuse zram ip_tables crct10dif_ce ghash_ce virtio_blk qemu_fw_cfg
virtio_mmio aes_neon_bs
[  287.047659] CPU: 0 PID: 3558 Comm: kworker/u8:7 Kdump: loaded Not
tainted 5.14.10-300.fc35.aarch64 #1
[  287.055269] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
[  287.060932] Workqueue: btrfs-delalloc btrfs_work_helper
[  287.065353] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO BTYPE=--)
[  287.070568] pc : submit_compressed_extents+0x38/0x3d0
[  287.074825] lr : async_cow_submit+0x50/0xd0
[  287.078217] sp : ffff800015d4bc20
[  287.081008] x29: ffff800015d4bc30 x28: 0000000000000000 x27: ffffb8a2fa941000
[  287.087022] x26: fffffffffffffdd0 x25: dead000000000100 x24: ffff000115873608
[  287.092822] x23: 0000000000000000 x22: 0000000000000001 x21: ffff0000c6f25800
[  287.098591] x20: ffff0000c0596000 x19: 0000000000000001 x18: ffff0000c2100bd4
[  287.104387] x17: ffff000115875ff8 x16: 0000000000000006 x15: 50006a3d10a961cd
[  287.110159] x14: f0668b836620caa1 x13: 0000000000000020 x12: ffff0001fefa68c0
[  287.116170] x11: ffffb8a2fa95b500 x10: 0000000000000000 x9 : ffffb8a2f9131c40
[  287.122120] x8 : ffff475f045bb000 x7 : ffff800015d4bbe0 x6 : ffffb8a2fae8ad40
[  287.128086] x5 : 0000000000000000 x4 : 0000000000000000 x3 : ffff0000c6f25820
[  287.133953] x2 : 0000000000000000 x1 : ffff000115873630 x0 : ffff000115873630
[  287.139760] Call trace:
[  287.141784]  submit_compressed_extents+0x38/0x3d0
[  287.145620]  async_cow_submit+0x50/0xd0
[  287.148801]  run_ordered_work+0xc8/0x280
[  287.152005]  btrfs_work_helper+0x98/0x250
[  287.155450]  process_one_work+0x1f0/0x4ac
[  287.161577]  worker_thread+0x188/0x504
[  287.167461]  kthread+0x110/0x114
[  287.172872]  ret_from_fork+0x10/0x18
[  287.178558] Code: a9056bf9 f8428437 f9401400 d108c2fa (f9400356)
[  287.186268] ---[ end trace 41ec405ced3786b6 ]---



-- 
Chris Murphy


* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
  2021-10-20 23:55                                         ` Chris Murphy
@ 2021-10-21  0:29                                           ` Su Yue
  2021-10-21  0:37                                             ` Qu Wenruo
  2021-10-21 14:43                                             ` Chris Murphy
  2021-10-21  5:56                                           ` Nikolay Borisov
  1 sibling, 2 replies; 62+ messages in thread
From: Su Yue @ 2021-10-21  0:29 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Qu Wenruo, Nikolay Borisov, Qu Wenruo, Btrfs BTRFS


On Wed 20 Oct 2021 at 19:55, Chris Murphy 
<lists@colorremedies.com> wrote:

> On Tue, Oct 19, 2021 at 9:10 PM Su Yue <l@damenly.su> wrote:
>>
>> Dump file and vmlinu[zx] kernel file are needed.
>
> So we get a splat but kdump doesn't create a vmcore. Do we need 
> to
> issue sysrq+c at the time of the hang and splat to create it?
>
Yes, please.

BTW, I ran xfstests with 5.14.10-300.fc35.aarch64 and
5.14.12-200.fc34.aarch64 in several rounds. No panic/hang found,
so I think we can exclude the possibility of the toolchain.

--
Su

> Fedora Linux 35 (Cloud Edition)
> Kernel 5.14.10-300.fc35.aarch64 on an aarch64 (ttyAMA0)
>
> eth0: 199.204.45.141 2604:e100:1:0:f816:3eff:fe72:c876
> dusty-35 login: [  286.982605] Unable to handle kernel paging 
> request
> at virtual address fffffffffffffdd0
> [  286.988338] Mem abort info:
> [  286.990307]   ESR = 0x96000004
> [  286.992596]   EC = 0x25: DABT (current EL), IL = 32 bits
> [  286.996316]   SET = 0, FnV = 0
> [  286.998454]   EA = 0, S1PTW = 0
> [  287.000791]   FSC = 0x04: level 0 translation fault
> [  287.004472] Data abort info:
> [  287.006540]   ISV = 0, ISS = 0x00000004
> [  287.009239]   CM = 0, WnR = 0
> [  287.011344] swapper pgtable: 4k pages, 48-bit VAs, 
> pgdp=0000000054181000
> [  287.018245] [fffffffffffffdd0] pgd=0000000000000000, 
> p4d=0000000000000000
> [  287.024209] Internal error: Oops: 96000004 [#1] SMP
> [  287.027615] Modules linked in: virtio_gpu virtio_dma_buf
> drm_kms_helper cec joydev fb_sys_fops syscopyarea virtio_net
> sysfillrect sysimgblt net_failover virtio_balloon failover vfat 
> fat
> drm fuse zram ip_tables crct10dif_ce ghash_ce virtio_blk 
> qemu_fw_cfg
> virtio_mmio aes_neon_bs
> [  287.047659] CPU: 0 PID: 3558 Comm: kworker/u8:7 Kdump: loaded 
> Not
> tainted 5.14.10-300.fc35.aarch64 #1
> [  287.055269] Hardware name: QEMU KVM Virtual Machine, BIOS 
> 0.0.0 02/06/2015
> [  287.060932] Workqueue: btrfs-delalloc btrfs_work_helper
> [  287.065353] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO 
> BTYPE=--)
> [  287.070568] pc : submit_compressed_extents+0x38/0x3d0
> [  287.074825] lr : async_cow_submit+0x50/0xd0
> [  287.078217] sp : ffff800015d4bc20
> [  287.081008] x29: ffff800015d4bc30 x28: 0000000000000000 x27: 
> ffffb8a2fa941000
> [  287.087022] x26: fffffffffffffdd0 x25: dead000000000100 x24: 
> ffff000115873608
> [  287.092822] x23: 0000000000000000 x22: 0000000000000001 x21: 
> ffff0000c6f25800
> [  287.098591] x20: ffff0000c0596000 x19: 0000000000000001 x18: 
> ffff0000c2100bd4
> [  287.104387] x17: ffff000115875ff8 x16: 0000000000000006 x15: 
> 50006a3d10a961cd
> [  287.110159] x14: f0668b836620caa1 x13: 0000000000000020 x12: 
> ffff0001fefa68c0
> [  287.116170] x11: ffffb8a2fa95b500 x10: 0000000000000000 x9 : 
> ffffb8a2f9131c40
> [  287.122120] x8 : ffff475f045bb000 x7 : ffff800015d4bbe0 x6 : 
> ffffb8a2fae8ad40
> [  287.128086] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 
> ffff0000c6f25820
> [  287.133953] x2 : 0000000000000000 x1 : ffff000115873630 x0 : 
> ffff000115873630
> [  287.139760] Call trace:
> [  287.141784]  submit_compressed_extents+0x38/0x3d0
> [  287.145620]  async_cow_submit+0x50/0xd0
> [  287.148801]  run_ordered_work+0xc8/0x280
> [  287.152005]  btrfs_work_helper+0x98/0x250
> [  287.155450]  process_one_work+0x1f0/0x4ac
> [  287.161577]  worker_thread+0x188/0x504
> [  287.167461]  kthread+0x110/0x114
> [  287.172872]  ret_from_fork+0x10/0x18
> [  287.178558] Code: a9056bf9 f8428437 f9401400 d108c2fa 
> (f9400356)
> [  287.186268] ---[ end trace 41ec405ced3786b6 ]---


* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
  2021-10-21  0:29                                           ` Su Yue
@ 2021-10-21  0:37                                             ` Qu Wenruo
  2021-10-21  0:46                                               ` Su Yue
  2021-10-21 14:43                                             ` Chris Murphy
  1 sibling, 1 reply; 62+ messages in thread
From: Qu Wenruo @ 2021-10-21  0:37 UTC (permalink / raw)
  To: Su Yue, Chris Murphy; +Cc: Nikolay Borisov, Qu Wenruo, Btrfs BTRFS



On 2021/10/21 08:29, Su Yue wrote:
>
> On Wed 20 Oct 2021 at 19:55, Chris Murphy <lists@colorremedies.com> wrote:
>
>> On Tue, Oct 19, 2021 at 9:10 PM Su Yue <l@damenly.su> wrote:
>>>
>>> Dump file and vmlinu[zx] kernel file are needed.
>>
>> So we get a splat but kdump doesn't create a vmcore. Do we need to
>> issue sysrq+c at the time of the hang and splat to create it?
>>
> Yes, please.
>
> BTW, I ran xfstests with 5.14.10-300.fc35.aarch64 and
> 5.14.12-200.fc34.aarch64 in several rounds. No panic/hang found,
> so I think we can exclude the possibility of the toolchain.

Or could this also mean that fstests is not enough to trigger it?

Thanks,
Qu

>
> --
> Su
>
>> Fedora Linux 35 (Cloud Edition)
>> Kernel 5.14.10-300.fc35.aarch64 on an aarch64 (ttyAMA0)
>>
>> eth0: 199.204.45.141 2604:e100:1:0:f816:3eff:fe72:c876
>> dusty-35 login: [  286.982605] Unable to handle kernel paging request
>> at virtual address fffffffffffffdd0
>> [  286.988338] Mem abort info:
>> [  286.990307]   ESR = 0x96000004
>> [  286.992596]   EC = 0x25: DABT (current EL), IL = 32 bits
>> [  286.996316]   SET = 0, FnV = 0
>> [  286.998454]   EA = 0, S1PTW = 0
>> [  287.000791]   FSC = 0x04: level 0 translation fault
>> [  287.004472] Data abort info:
>> [  287.006540]   ISV = 0, ISS = 0x00000004
>> [  287.009239]   CM = 0, WnR = 0
>> [  287.011344] swapper pgtable: 4k pages, 48-bit VAs,
>> pgdp=0000000054181000
>> [  287.018245] [fffffffffffffdd0] pgd=0000000000000000,
>> p4d=0000000000000000
>> [  287.024209] Internal error: Oops: 96000004 [#1] SMP
>> [  287.027615] Modules linked in: virtio_gpu virtio_dma_buf
>> drm_kms_helper cec joydev fb_sys_fops syscopyarea virtio_net
>> sysfillrect sysimgblt net_failover virtio_balloon failover vfat fat
>> drm fuse zram ip_tables crct10dif_ce ghash_ce virtio_blk qemu_fw_cfg
>> virtio_mmio aes_neon_bs
>> [  287.047659] CPU: 0 PID: 3558 Comm: kworker/u8:7 Kdump: loaded Not
>> tainted 5.14.10-300.fc35.aarch64 #1
>> [  287.055269] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0
>> 02/06/2015
>> [  287.060932] Workqueue: btrfs-delalloc btrfs_work_helper
>> [  287.065353] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO BTYPE=--)
>> [  287.070568] pc : submit_compressed_extents+0x38/0x3d0
>> [  287.074825] lr : async_cow_submit+0x50/0xd0
>> [  287.078217] sp : ffff800015d4bc20
>> [  287.081008] x29: ffff800015d4bc30 x28: 0000000000000000 x27:
>> ffffb8a2fa941000
>> [  287.087022] x26: fffffffffffffdd0 x25: dead000000000100 x24:
>> ffff000115873608
>> [  287.092822] x23: 0000000000000000 x22: 0000000000000001 x21:
>> ffff0000c6f25800
>> [  287.098591] x20: ffff0000c0596000 x19: 0000000000000001 x18:
>> ffff0000c2100bd4
>> [  287.104387] x17: ffff000115875ff8 x16: 0000000000000006 x15:
>> 50006a3d10a961cd
>> [  287.110159] x14: f0668b836620caa1 x13: 0000000000000020 x12:
>> ffff0001fefa68c0
>> [  287.116170] x11: ffffb8a2fa95b500 x10: 0000000000000000 x9 :
>> ffffb8a2f9131c40
>> [  287.122120] x8 : ffff475f045bb000 x7 : ffff800015d4bbe0 x6 :
>> ffffb8a2fae8ad40
>> [  287.128086] x5 : 0000000000000000 x4 : 0000000000000000 x3 :
>> ffff0000c6f25820
>> [  287.133953] x2 : 0000000000000000 x1 : ffff000115873630 x0 :
>> ffff000115873630
>> [  287.139760] Call trace:
>> [  287.141784]  submit_compressed_extents+0x38/0x3d0
>> [  287.145620]  async_cow_submit+0x50/0xd0
>> [  287.148801]  run_ordered_work+0xc8/0x280
>> [  287.152005]  btrfs_work_helper+0x98/0x250
>> [  287.155450]  process_one_work+0x1f0/0x4ac
>> [  287.161577]  worker_thread+0x188/0x504
>> [  287.167461]  kthread+0x110/0x114
>> [  287.172872]  ret_from_fork+0x10/0x18
>> [  287.178558] Code: a9056bf9 f8428437 f9401400 d108c2fa (f9400356)
>> [  287.186268] ---[ end trace 41ec405ced3786b6 ]---

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
  2021-10-21  0:37                                             ` Qu Wenruo
@ 2021-10-21  0:46                                               ` Su Yue
  0 siblings, 0 replies; 62+ messages in thread
From: Su Yue @ 2021-10-21  0:46 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Chris Murphy, Nikolay Borisov, Qu Wenruo, Btrfs BTRFS


On Thu 21 Oct 2021 at 08:37, Qu Wenruo <quwenruo.btrfs@gmx.com> 
wrote:

> On 2021/10/21 08:29, Su Yue wrote:
>>
>> On Wed 20 Oct 2021 at 19:55, Chris Murphy 
>> <lists@colorremedies.com> wrote:
>>
>>> On Tue, Oct 19, 2021 at 9:10 PM Su Yue <l@damenly.su> wrote:
>>>>
>>>> Dump file and vmlinu[zx] kernel file are needed.
>>>
>>> So we get a splat but kdump doesn't create a vmcore. Do we 
>>> need to
>>> issue sysrq+c at the time of the hang and splat to create it?
>>>
>> Yes, please.
>>
>> BTW, I ran xfstests with 5.14.10-300.fc35.aarch64 and
>> 5.14.12-200.fc34.aarch64 in several rounds. No panic/hang 
>> found,
>> so I think we can exclude the possibility of the toolchain.
>
> Or this can also mean, fstests is not enough to trigger it?
>
Right...Can't deny the possibility without any evidence for now.

--
Su
> Thanks,
> Qu
>
>>
>> --
>> Su
>>
>>> Fedora Linux 35 (Cloud Edition)
>>> Kernel 5.14.10-300.fc35.aarch64 on an aarch64 (ttyAMA0)
>>>
>>> eth0: 199.204.45.141 2604:e100:1:0:f816:3eff:fe72:c876
>>> dusty-35 login: [  286.982605] Unable to handle kernel paging 
>>> request
>>> at virtual address fffffffffffffdd0
>>> [  286.988338] Mem abort info:
>>> [  286.990307]   ESR = 0x96000004
>>> [  286.992596]   EC = 0x25: DABT (current EL), IL = 32 bits
>>> [  286.996316]   SET = 0, FnV = 0
>>> [  286.998454]   EA = 0, S1PTW = 0
>>> [  287.000791]   FSC = 0x04: level 0 translation fault
>>> [  287.004472] Data abort info:
>>> [  287.006540]   ISV = 0, ISS = 0x00000004
>>> [  287.009239]   CM = 0, WnR = 0
>>> [  287.011344] swapper pgtable: 4k pages, 48-bit VAs,
>>> pgdp=0000000054181000
>>> [  287.018245] [fffffffffffffdd0] pgd=0000000000000000,
>>> p4d=0000000000000000
>>> [  287.024209] Internal error: Oops: 96000004 [#1] SMP
>>> [  287.027615] Modules linked in: virtio_gpu virtio_dma_buf
>>> drm_kms_helper cec joydev fb_sys_fops syscopyarea virtio_net
>>> sysfillrect sysimgblt net_failover virtio_balloon failover 
>>> vfat fat
>>> drm fuse zram ip_tables crct10dif_ce ghash_ce virtio_blk 
>>> qemu_fw_cfg
>>> virtio_mmio aes_neon_bs
>>> [  287.047659] CPU: 0 PID: 3558 Comm: kworker/u8:7 Kdump: 
>>> loaded Not
>>> tainted 5.14.10-300.fc35.aarch64 #1
>>> [  287.055269] Hardware name: QEMU KVM Virtual Machine, BIOS 
>>> 0.0.0
>>> 02/06/2015
>>> [  287.060932] Workqueue: btrfs-delalloc btrfs_work_helper
>>> [  287.065353] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO 
>>> BTYPE=--)
>>> [  287.070568] pc : submit_compressed_extents+0x38/0x3d0
>>> [  287.074825] lr : async_cow_submit+0x50/0xd0
>>> [  287.078217] sp : ffff800015d4bc20
>>> [  287.081008] x29: ffff800015d4bc30 x28: 0000000000000000 
>>> x27:
>>> ffffb8a2fa941000
>>> [  287.087022] x26: fffffffffffffdd0 x25: dead000000000100 
>>> x24:
>>> ffff000115873608
>>> [  287.092822] x23: 0000000000000000 x22: 0000000000000001 
>>> x21:
>>> ffff0000c6f25800
>>> [  287.098591] x20: ffff0000c0596000 x19: 0000000000000001 
>>> x18:
>>> ffff0000c2100bd4
>>> [  287.104387] x17: ffff000115875ff8 x16: 0000000000000006 
>>> x15:
>>> 50006a3d10a961cd
>>> [  287.110159] x14: f0668b836620caa1 x13: 0000000000000020 
>>> x12:
>>> ffff0001fefa68c0
>>> [  287.116170] x11: ffffb8a2fa95b500 x10: 0000000000000000 x9 
>>> :
>>> ffffb8a2f9131c40
>>> [  287.122120] x8 : ffff475f045bb000 x7 : ffff800015d4bbe0 x6 
>>> :
>>> ffffb8a2fae8ad40
>>> [  287.128086] x5 : 0000000000000000 x4 : 0000000000000000 x3 
>>> :
>>> ffff0000c6f25820
>>> [  287.133953] x2 : 0000000000000000 x1 : ffff000115873630 x0 
>>> :
>>> ffff000115873630
>>> [  287.139760] Call trace:
>>> [  287.141784]  submit_compressed_extents+0x38/0x3d0
>>> [  287.145620]  async_cow_submit+0x50/0xd0
>>> [  287.148801]  run_ordered_work+0xc8/0x280
>>> [  287.152005]  btrfs_work_helper+0x98/0x250
>>> [  287.155450]  process_one_work+0x1f0/0x4ac
>>> [  287.161577]  worker_thread+0x188/0x504
>>> [  287.167461]  kthread+0x110/0x114
>>> [  287.172872]  ret_from_fork+0x10/0x18
>>> [  287.178558] Code: a9056bf9 f8428437 f9401400 d108c2fa 
>>> (f9400356)
>>> [  287.186268] ---[ end trace 41ec405ced3786b6 ]---

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
  2021-10-20 23:55                                         ` Chris Murphy
  2021-10-21  0:29                                           ` Su Yue
@ 2021-10-21  5:56                                           ` Nikolay Borisov
  1 sibling, 0 replies; 62+ messages in thread
From: Nikolay Borisov @ 2021-10-21  5:56 UTC (permalink / raw)
  To: Chris Murphy, Su Yue; +Cc: Qu Wenruo, Qu Wenruo, Btrfs BTRFS



On 21.10.21 г. 2:55, Chris Murphy wrote:
> On Tue, Oct 19, 2021 at 9:10 PM Su Yue <l@damenly.su> wrote:
>>
>> Dump file and vmlinu[zx] kernel file are needed.
> 
> So we get a splat but kdump doesn't create a vmcore. Do we need to
> issue sysrq+c at the time of the hang and splat to create it?

Alternatively you can set the following sysctl to 1;

kernel.panic_on_warn = 1
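
E.g. (the sysctl.d file name is just an example):

sudo sysctl -w kernel.panic_on_warn=1

or, to make it stick across reboots:

echo 'kernel.panic_on_warn = 1' | sudo tee /etc/sysctl.d/99-panic-on-warn.conf

That way any WARN will panic the machine, and with kdump configured you
should get a vmcore out of it.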


> 
> Fedora Linux 35 (Cloud Edition)
> Kernel 5.14.10-300.fc35.aarch64 on an aarch64 (ttyAMA0)
> 
> eth0: 199.204.45.141 2604:e100:1:0:f816:3eff:fe72:c876
> dusty-35 login: [  286.982605] Unable to handle kernel paging request
> at virtual address fffffffffffffdd0
> [  286.988338] Mem abort info:
> [  286.990307]   ESR = 0x96000004
> [  286.992596]   EC = 0x25: DABT (current EL), IL = 32 bits
> [  286.996316]   SET = 0, FnV = 0
> [  286.998454]   EA = 0, S1PTW = 0
> [  287.000791]   FSC = 0x04: level 0 translation fault
> [  287.004472] Data abort info:
> [  287.006540]   ISV = 0, ISS = 0x00000004
> [  287.009239]   CM = 0, WnR = 0
> [  287.011344] swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000000054181000
> [  287.018245] [fffffffffffffdd0] pgd=0000000000000000, p4d=0000000000000000
> [  287.024209] Internal error: Oops: 96000004 [#1] SMP
> [  287.027615] Modules linked in: virtio_gpu virtio_dma_buf
> drm_kms_helper cec joydev fb_sys_fops syscopyarea virtio_net
> sysfillrect sysimgblt net_failover virtio_balloon failover vfat fat
> drm fuse zram ip_tables crct10dif_ce ghash_ce virtio_blk qemu_fw_cfg
> virtio_mmio aes_neon_bs
> [  287.047659] CPU: 0 PID: 3558 Comm: kworker/u8:7 Kdump: loaded Not
> tainted 5.14.10-300.fc35.aarch64 #1
> [  287.055269] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
> [  287.060932] Workqueue: btrfs-delalloc btrfs_work_helper
> [  287.065353] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO BTYPE=--)
> [  287.070568] pc : submit_compressed_extents+0x38/0x3d0
> [  287.074825] lr : async_cow_submit+0x50/0xd0
> [  287.078217] sp : ffff800015d4bc20
> [  287.081008] x29: ffff800015d4bc30 x28: 0000000000000000 x27: ffffb8a2fa941000
> [  287.087022] x26: fffffffffffffdd0 x25: dead000000000100 x24: ffff000115873608
> [  287.092822] x23: 0000000000000000 x22: 0000000000000001 x21: ffff0000c6f25800
> [  287.098591] x20: ffff0000c0596000 x19: 0000000000000001 x18: ffff0000c2100bd4
> [  287.104387] x17: ffff000115875ff8 x16: 0000000000000006 x15: 50006a3d10a961cd
> [  287.110159] x14: f0668b836620caa1 x13: 0000000000000020 x12: ffff0001fefa68c0
> [  287.116170] x11: ffffb8a2fa95b500 x10: 0000000000000000 x9 : ffffb8a2f9131c40
> [  287.122120] x8 : ffff475f045bb000 x7 : ffff800015d4bbe0 x6 : ffffb8a2fae8ad40
> [  287.128086] x5 : 0000000000000000 x4 : 0000000000000000 x3 : ffff0000c6f25820
> [  287.133953] x2 : 0000000000000000 x1 : ffff000115873630 x0 : ffff000115873630
> [  287.139760] Call trace:
> [  287.141784]  submit_compressed_extents+0x38/0x3d0
> [  287.145620]  async_cow_submit+0x50/0xd0
> [  287.148801]  run_ordered_work+0xc8/0x280
> [  287.152005]  btrfs_work_helper+0x98/0x250
> [  287.155450]  process_one_work+0x1f0/0x4ac
> [  287.161577]  worker_thread+0x188/0x504
> [  287.167461]  kthread+0x110/0x114
> [  287.172872]  ret_from_fork+0x10/0x18
> [  287.178558] Code: a9056bf9 f8428437 f9401400 d108c2fa (f9400356)
> [  287.186268] ---[ end trace 41ec405ced3786b6 ]---
> 
> 
> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
  2021-10-21  0:29                                           ` Su Yue
  2021-10-21  0:37                                             ` Qu Wenruo
@ 2021-10-21 14:43                                             ` Chris Murphy
  2021-10-21 14:48                                               ` Chris Murphy
  1 sibling, 1 reply; 62+ messages in thread
From: Chris Murphy @ 2021-10-21 14:43 UTC (permalink / raw)
  To: Su Yue; +Cc: Chris Murphy, Qu Wenruo, Nikolay Borisov, Qu Wenruo, Btrfs BTRFS

On Wed, Oct 20, 2021 at 8:34 PM Su Yue <l@damenly.su> wrote:
>
>
> On Wed 20 Oct 2021 at 19:55, Chris Murphy
> <lists@colorremedies.com> wrote:
>
> > On Tue, Oct 19, 2021 at 9:10 PM Su Yue <l@damenly.su> wrote:
> >>
> >> Dump file and vmlinu[zx] kernel file are needed.
> >
> > So we get a splat but kdump doesn't create a vmcore. Do we need
> > to
> > issue sysrq+c at the time of the hang and splat to create it?
> >
> Yes, please.
>
> BTW, I ran xfstests with 5.14.10-300.fc35.aarch64 and
> 5.14.12-200.fc34.aarch64 in several rounds. No panic/hang found,
> so I think we can exclude the possibility of the toolchain.

It's really weird. I was given a vexxhost aarch64 VM to play in and
try to get a vmcore for you guys, but nothing I did triggered the
splat. Then a colleague tried it, same hosting company, and was able
to reproduce it almost immediately. Same distro and kernel. So I don't
know what that means; it's possible the provisioning of the VM
could end up on different hardware, and some aspect of that
hardware is what's causing this issue.

But anyway, he will be able to get a kernel core dump soon, and maybe
that'll tell us what's going on.


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
  2021-10-21 14:43                                             ` Chris Murphy
@ 2021-10-21 14:48                                               ` Chris Murphy
  2021-10-21 14:51                                                 ` Nikolay Borisov
  0 siblings, 1 reply; 62+ messages in thread
From: Chris Murphy @ 2021-10-21 14:48 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Su Yue, Qu Wenruo, Nikolay Borisov, Qu Wenruo, Btrfs BTRFS

[  287.139760] Call trace:
[  287.141784]  submit_compressed_extents+0x38/0x3d0
[  287.145620]  async_cow_submit+0x50/0xd0
[  287.148801]  run_ordered_work+0xc8/0x280
[  287.152005]  btrfs_work_helper+0x98/0x250
[  287.155450]  process_one_work+0x1f0/0x4ac
[  287.161577]  worker_thread+0x188/0x504
[  287.167461]  kthread+0x110/0x114
[  287.172872]  ret_from_fork+0x10/0x18
[  287.178558] Code: a9056bf9 f8428437 f9401400 d108c2fa (f9400356)
[  287.186268] ---[ end trace 41ec405ced3786b6 ]---
[61620.974232] audit: audit_backlog=2976 > audit_backlog_limit=64
[61620.978698] audit: audit_lost=1 audit_rate_limit=0 audit_backlog_limit=64


So it's been at least 17 hours since the splat. Is it worth sysrq+c
now this long after? Or should I set it up like Nikolay suggests with
kernel.panic_on_warn = 1? Maybe I should also put /var/crash on XFS to
avoid problems dumping the kernel core file?

--
Chris Murphy

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
  2021-10-21 14:48                                               ` Chris Murphy
@ 2021-10-21 14:51                                                 ` Nikolay Borisov
  2021-10-21 14:55                                                   ` Chris Murphy
  0 siblings, 1 reply; 62+ messages in thread
From: Nikolay Borisov @ 2021-10-21 14:51 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Su Yue, Qu Wenruo, Qu Wenruo, Btrfs BTRFS



On 21.10.21 г. 17:48, Chris Murphy wrote:
> [  287.139760] Call trace:
> [  287.141784]  submit_compressed_extents+0x38/0x3d0
> [  287.145620]  async_cow_submit+0x50/0xd0
> [  287.148801]  run_ordered_work+0xc8/0x280
> [  287.152005]  btrfs_work_helper+0x98/0x250
> [  287.155450]  process_one_work+0x1f0/0x4ac
> [  287.161577]  worker_thread+0x188/0x504
> [  287.167461]  kthread+0x110/0x114
> [  287.172872]  ret_from_fork+0x10/0x18
> [  287.178558] Code: a9056bf9 f8428437 f9401400 d108c2fa (f9400356)
> [  287.186268] ---[ end trace 41ec405ced3786b6 ]---
> [61620.974232] audit: audit_backlog=2976 > audit_backlog_limit=64
> [61620.978698] audit: audit_lost=1 audit_rate_limit=0 audit_backlog_limit=64
> 
> 
> So it's been at least 17 hours since the splat. Is it worth sysrq+c
> now this long after? Or should I set it up like Nikolay suggests with
> kernel.panic_on_warn = 1? Maybe I should also put /var/crash on XFS to
> avoid problems dumping the kernel core file?

Doing sysrq+c would not have yielded any useful information; it was a red
herring. To get actionable information the core dump needs to be
initiated from the offending context, which means either a BUG_ON or a
WARN that triggers the panic.

> 
> --
> Chris Murphy
> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
  2021-10-21 14:51                                                 ` Nikolay Borisov
@ 2021-10-21 14:55                                                   ` Chris Murphy
  2021-10-21 15:01                                                     ` Nikolay Borisov
  0 siblings, 1 reply; 62+ messages in thread
From: Chris Murphy @ 2021-10-21 14:55 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Chris Murphy, Su Yue, Qu Wenruo, Qu Wenruo, Btrfs BTRFS

On Thu, Oct 21, 2021 at 10:51 AM Nikolay Borisov <nborisov@suse.com> wrote:
>
>
>
> On 21.10.21 г. 17:48, Chris Murphy wrote:
> > [  287.139760] Call trace:
> > [  287.141784]  submit_compressed_extents+0x38/0x3d0
> > [  287.145620]  async_cow_submit+0x50/0xd0
> > [  287.148801]  run_ordered_work+0xc8/0x280
> > [  287.152005]  btrfs_work_helper+0x98/0x250
> > [  287.155450]  process_one_work+0x1f0/0x4ac
> > [  287.161577]  worker_thread+0x188/0x504
> > [  287.167461]  kthread+0x110/0x114
> > [  287.172872]  ret_from_fork+0x10/0x18
> > [  287.178558] Code: a9056bf9 f8428437 f9401400 d108c2fa (f9400356)
> > [  287.186268] ---[ end trace 41ec405ced3786b6 ]---
> > [61620.974232] audit: audit_backlog=2976 > audit_backlog_limit=64
> > [61620.978698] audit: audit_lost=1 audit_rate_limit=0 audit_backlog_limit=64
> >
> >
> > So it's been at least 17 hours since the splat. Is it worth sysrq+c
> > now this long after? Or should I set it up like Nikolay suggests with
> > kernel.panic_on_warn = 1? Maybe I should also put /var/crash on XFS to
> > avoid problems dumping the kernel core file?
>
> Doing sysrq+c would not have yielded any useful information; it was a red
> herring. To get actionable information the core dump needs to be
> initiated from the offending context, which means either a BUG_ON or a
> WARN that triggers the panic.


OK so I'll put /var/crash on XFS and set kernel.panic_on_warn = 1 and
try to reproduce the problem; and hopefully that triggers kdump.


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
  2021-10-21 14:55                                                   ` Chris Murphy
@ 2021-10-21 15:01                                                     ` Nikolay Borisov
  2021-10-21 15:06                                                       ` Chris Murphy
  2021-10-21 18:07                                                       ` Chris Murphy
  0 siblings, 2 replies; 62+ messages in thread
From: Nikolay Borisov @ 2021-10-21 15:01 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Su Yue, Qu Wenruo, Qu Wenruo, Btrfs BTRFS



On 21.10.21 г. 17:55, Chris Murphy wrote:
> On Thu, Oct 21, 2021 at 10:51 AM Nikolay Borisov <nborisov@suse.com> wrote:
>>
>>
>>
>> On 21.10.21 г. 17:48, Chris Murphy wrote:
>>> [  287.139760] Call trace:
>>> [  287.141784]  submit_compressed_extents+0x38/0x3d0
>>> [  287.145620]  async_cow_submit+0x50/0xd0
>>> [  287.148801]  run_ordered_work+0xc8/0x280
>>> [  287.152005]  btrfs_work_helper+0x98/0x250
>>> [  287.155450]  process_one_work+0x1f0/0x4ac
>>> [  287.161577]  worker_thread+0x188/0x504
>>> [  287.167461]  kthread+0x110/0x114
>>> [  287.172872]  ret_from_fork+0x10/0x18
>>> [  287.178558] Code: a9056bf9 f8428437 f9401400 d108c2fa (f9400356)
>>> [  287.186268] ---[ end trace 41ec405ced3786b6 ]---
>>> [61620.974232] audit: audit_backlog=2976 > audit_backlog_limit=64
>>> [61620.978698] audit: audit_lost=1 audit_rate_limit=0 audit_backlog_limit=64
>>>
>>>
>>> So it's been at least 17 hours since the splat. Is it worth sysrq+c
>>> now this long after? Or should I set it up like Nikolay suggests with
>>> kernel.panic_on_warn = 1? Maybe I should also put /var/crash on XFS to
>>> avoid problems dumping the kernel core file?
>>
>> Doing sysrq+c would not have yielded any useful information; it was a red
>> herring. To get actionable information the core dump needs to be
>> initiated from the offending context, which means either a BUG_ON or a
>> WARN that triggers the panic.
> 
> 
> OK so I'll put /var/crash on XFS and set kernel.panic_on_warn = 1 and
> try to reproduce the problem; and hopefully that triggers kdump.

Just to be clear, when you initiate a crash with sysrq+c does it capture
a crashdump? That's the basic test that needs to pass in order to ensure
kdump works as expected.
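
I.e. something like this (as root; note it crashes the machine on the
spot, so only do it on a scratch instance):

echo 1 > /proc/sys/kernel/sysrq
echo c > /proc/sysrq-trigger

and then check whether a vmcore appears under /var/crash after the
reboot.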

> 
> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
  2021-10-21 15:01                                                     ` Nikolay Borisov
@ 2021-10-21 15:06                                                       ` Chris Murphy
  2021-10-21 15:32                                                         ` Chris Murphy
  2021-10-21 18:07                                                       ` Chris Murphy
  1 sibling, 1 reply; 62+ messages in thread
From: Chris Murphy @ 2021-10-21 15:06 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Chris Murphy, Su Yue, Qu Wenruo, Qu Wenruo, Btrfs BTRFS

On Thu, Oct 21, 2021 at 11:01 AM Nikolay Borisov <nborisov@suse.com> wrote:
>
> Just to be clear, when you initiate a crash with sysrq+c does it capture
> a crashdump? That's the basic test that needs to pass in order to ensure
> kdump works as expected.

Yes it does.


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
  2021-10-21 15:06                                                       ` Chris Murphy
@ 2021-10-21 15:32                                                         ` Chris Murphy
  0 siblings, 0 replies; 62+ messages in thread
From: Chris Murphy @ 2021-10-21 15:32 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Nikolay Borisov, Su Yue, Qu Wenruo, Qu Wenruo, Btrfs BTRFS

There is a significant hang in either sshd or dnf when doing package
installs that precedes the splat. The splat doesn't always happen, but
the hang does seem to be reproducible. I have a sysrq+t taken during
this hang here:

https://drive.google.com/file/d/14qsIb3HNlSx91kPq1Uvo_IHnivQ3S3NO/view?usp=sharing

Maybe there's a hint why we're hung up as a prelude to the splat even
though I haven't gotten the warning yet...

--
Chris Murphy

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
  2021-10-21 15:01                                                     ` Nikolay Borisov
  2021-10-21 15:06                                                       ` Chris Murphy
@ 2021-10-21 18:07                                                       ` Chris Murphy
  1 sibling, 0 replies; 62+ messages in thread
From: Chris Murphy @ 2021-10-21 18:07 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Chris Murphy, Su Yue, Qu Wenruo, Qu Wenruo, Btrfs BTRFS

[fedora@dusty-353 ~]$ sudo sysctl -n kernel.panic_on_warn
1

I get the oops, but no kdump activity at all. (Presumably because
panic_on_warn only fires on an actual WARN; a plain oops would need
kernel.panic_on_oops = 1 to panic the box and kick off kdump.)

Oct 21 17:35:54 dusty-353.novalocal kernel: Unable to handle kernel
paging request at virtual address fffffffffffffdd0
Oct 21 17:35:54 dusty-353.novalocal kernel: Mem abort info:
Oct 21 17:35:54 dusty-353.novalocal kernel:   ESR = 0x96000004
Oct 21 17:35:54 dusty-353.novalocal kernel:   EC = 0x25: DABT (current
EL), IL = 32 bits
Oct 21 17:35:54 dusty-353.novalocal kernel:   SET = 0, FnV = 0
Oct 21 17:35:54 dusty-353.novalocal kernel:   EA = 0, S1PTW = 0
Oct 21 17:35:54 dusty-353.novalocal kernel:   FSC = 0x04: level 0
translation fault
Oct 21 17:35:54 dusty-353.novalocal kernel: Data abort info:
Oct 21 17:35:54 dusty-353.novalocal kernel:   ISV = 0, ISS = 0x00000004
Oct 21 17:35:54 dusty-353.novalocal kernel:   CM = 0, WnR = 0
Oct 21 17:35:54 dusty-353.novalocal kernel: swapper pgtable: 4k pages,
48-bit VAs, pgdp=0000000125461000
Oct 21 17:35:54 dusty-353.novalocal kernel: [fffffffffffffdd0]
pgd=0000000000000000, p4d=0000000000000000
Oct 21 17:35:54 dusty-353.novalocal kernel: Internal error: Oops:
96000004 [#1] SMP
Oct 21 17:35:54 dusty-353.novalocal kernel: Modules linked in:
binfmt_misc virtio_gpu virtio_dma_buf drm_kms_helper joydev cec
fb_sys_fops syscopyarea virtio_net sysfillrect sysimgblt
virtio_balloon net_failover failover vfat fat xfs drm fuse zram
ip_tables crct10dif_ce ghash_ce virtio_blk qemu_fw_cfg virtio_mmio
aes_neon_bs
Oct 21 17:35:54 dusty-353.novalocal kernel: CPU: 1 PID: 4392 Comm:
kworker/u8:12 Kdump: loaded Not tainted 5.14.10-300.fc35.aarch64 #1
Oct 21 17:35:54 dusty-353.novalocal kernel: Hardware name: QEMU KVM
Virtual Machine, BIOS 0.0.0 02/06/2015
Oct 21 17:35:54 dusty-353.novalocal kernel: Workqueue: btrfs-delalloc
btrfs_work_helper
Oct 21 17:35:54 dusty-353.novalocal kernel: pstate: 80400005 (Nzcv
daif +PAN -UAO -TCO BTYPE=--)
Oct 21 17:35:54 dusty-353.novalocal kernel: pc :
submit_compressed_extents+0x38/0x3d0
Oct 21 17:35:54 dusty-353.novalocal kernel: lr : async_cow_submit+0x50/0xd0
Oct 21 17:35:54 dusty-353.novalocal kernel: sp : ffff800010d6bc20
Oct 21 17:35:54 dusty-353.novalocal kernel: x29: ffff800010d6bc30 x28:
0000000000000000 x27: ffffbb96c7421000
Oct 21 17:35:54 dusty-353.novalocal kernel: x26: fffffffffffffdd0 x25:
dead000000000100 x24: ffff00012f950408
Oct 21 17:35:54 dusty-353.novalocal kernel: x23: 0000000000000000 x22:
0000000000000001 x21: ffff0000c07e1f80
Oct 21 17:35:54 dusty-353.novalocal kernel: x20: ffff0000c5af0000 x19:
0000000000000001 x18: ffff0000c2500bd4
Oct 21 17:35:54 dusty-353.novalocal kernel: x17: ffff00012fa0eff8 x16:
0000000000000006 x15: bd47b4a638083142
Oct 21 17:35:54 dusty-353.novalocal kernel: x14: ab8f4df43188bcf5 x13:
0000000000000020 x12: ffff0001fefa78c0
Oct 21 17:35:54 dusty-353.novalocal kernel: x11: ffffbb96c743b500 x10:
0000000000000000 x9 : ffffbb96c5c11c40
Oct 21 17:35:54 dusty-353.novalocal kernel: x8 : ffff446b37afd000 x7 :
ffff800010d6bbe0 x6 : ffffbb96c6c11000
Oct 21 17:35:54 dusty-353.novalocal kernel: x5 : 0000000000000000 x4 :
0000000000000000 x3 : ffff0000c07e1fa0
Oct 21 17:35:54 dusty-353.novalocal kernel: x2 : 0000000000000000 x1 :
ffff00012f950430 x0 : ffff00012f950430
Oct 21 17:35:54 dusty-353.novalocal kernel: Call trace:
Oct 21 17:35:54 dusty-353.novalocal kernel:
submit_compressed_extents+0x38/0x3d0
Oct 21 17:35:54 dusty-353.novalocal kernel:  async_cow_submit+0x50/0xd0
Oct 21 17:35:54 dusty-353.novalocal kernel:  run_ordered_work+0xc8/0x280
Oct 21 17:35:54 dusty-353.novalocal kernel:  btrfs_work_helper+0x98/0x250
Oct 21 17:35:54 dusty-353.novalocal kernel:  process_one_work+0x1f0/0x4ac
Oct 21 17:35:54 dusty-353.novalocal kernel:  worker_thread+0x188/0x504
Oct 21 17:35:54 dusty-353.novalocal kernel:  kthread+0x110/0x114
Oct 21 17:35:54 dusty-353.novalocal kernel:  ret_from_fork+0x10/0x18
Oct 21 17:35:54 dusty-353.novalocal kernel: Code: a9056bf9 f8428437
f9401400 d108c2fa (f9400356)
Oct 21 17:35:54 dusty-353.novalocal kernel: ---[ end trace 718fed28301aa13b ]---


Whereas sysrq+c does create a kdump file...

--
Chris Murphy

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
  2021-10-19 23:42                                       ` Su Yue
                                                           ` (2 preceding siblings ...)
  2021-10-20 23:55                                         ` Chris Murphy
@ 2021-10-22  2:36                                         ` Chris Murphy
  2021-10-22  6:02                                           ` Nikolay Borisov
  2021-10-22 10:44                                           ` Nikolay Borisov
  3 siblings, 2 replies; 62+ messages in thread
From: Chris Murphy @ 2021-10-22  2:36 UTC (permalink / raw)
  To: Su Yue; +Cc: Chris Murphy, Qu Wenruo, Nikolay Borisov, Qu Wenruo, Btrfs BTRFS

OK I have a vmcore file:
https://dustymabe.fedorapeople.org/bz2011928-vmcore/

lib/modules/5.14.10-300.fc35.aarch64/vmlinuz
https://drive.google.com/file/d/1xXM8XGRi_Wzyupbm4MSNteF0rwUzO4GE/view?usp=sharing


--
Chris Murphy

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
  2021-10-22  2:36                                         ` Chris Murphy
@ 2021-10-22  6:02                                           ` Nikolay Borisov
  2021-10-22  6:17                                             ` Su Yue
  2021-10-22 10:44                                           ` Nikolay Borisov
  1 sibling, 1 reply; 62+ messages in thread
From: Nikolay Borisov @ 2021-10-22  6:02 UTC (permalink / raw)
  To: Chris Murphy, Su Yue; +Cc: Qu Wenruo, Qu Wenruo, Btrfs BTRFS



On 22.10.21 г. 5:36, Chris Murphy wrote:
> OK I have a vmcore file:
> https://dustymabe.fedorapeople.org/bz2011928-vmcore/
> 
> lib/modules/5.14.10-300.fc35.aarch64/vmlinuz
> https://drive.google.com/file/d/1xXM8XGRi_Wzyupbm4MSNteF0rwUzO4GE/view?usp=sharing

In order to open the dump we need the debug vmlinux (not just the
stripped vmlinuz), and the btrfs.ko.debug file as well.
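
I.e. so that the dump can be opened roughly like this (the paths are
from memory and may differ on Fedora):

crash /usr/lib/debug/lib/modules/5.14.10-300.fc35.aarch64/vmlinux vmcore
crash> mod -s btrfs /usr/lib/debug/.../btrfs.ko.debug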

> 
> 
> --
> Chris Murphy
> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
  2021-10-22  6:02                                           ` Nikolay Borisov
@ 2021-10-22  6:17                                             ` Su Yue
  0 siblings, 0 replies; 62+ messages in thread
From: Su Yue @ 2021-10-22  6:17 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Chris Murphy, Qu Wenruo, Qu Wenruo, Btrfs BTRFS


On Fri 22 Oct 2021 at 09:02, Nikolay Borisov <nborisov@suse.com> 
wrote:

> On 22.10.21 г. 5:36, Chris Murphy wrote:
>> OK I have a vmcore file:
>> https://dustymabe.fedorapeople.org/bz2011928-vmcore/
>>
>> lib/modules/5.14.10-300.fc35.aarch64/vmlinuz
>> https://drive.google.com/file/d/1xXM8XGRi_Wzyupbm4MSNteF0rwUzO4GE/view?usp=sharing
>
> In order to open the dump we need the debug vmlinux (not just the
> stripped vmlinuz), and the btrfs.ko.debug file as well.
>
kernel-debuginfo-5.14.10-300.fc35.aarch64.rpm is on
https://koji.fedoraproject.org/koji/buildinfo?buildID=1843224
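
If it helps, the debug files can be pulled out of the rpm without
installing it, something like (the exact layout inside the rpm is from
memory):

rpm2cpio kernel-debuginfo-5.14.10-300.fc35.aarch64.rpm | cpio -idmv
# vmlinux lands under usr/lib/debug/lib/modules/<version>/
# btrfs.ko.debug under .../kernel/fs/btrfs/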

--
Su
>>
>>
>> --
>> Chris Murphy
>>

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
  2021-10-22  2:36                                         ` Chris Murphy
  2021-10-22  6:02                                           ` Nikolay Borisov
@ 2021-10-22 10:44                                           ` Nikolay Borisov
  2021-10-22 11:43                                             ` Nikolay Borisov
  1 sibling, 1 reply; 62+ messages in thread
From: Nikolay Borisov @ 2021-10-22 10:44 UTC (permalink / raw)
  To: Chris Murphy, Su Yue; +Cc: Qu Wenruo, Qu Wenruo, Btrfs BTRFS



On 22.10.21 г. 5:36, Chris Murphy wrote:
> OK I have a vmcore file:
> https://dustymabe.fedorapeople.org/bz2011928-vmcore/
> 
> lib/modules/5.14.10-300.fc35.aarch64/vmlinuz
> https://drive.google.com/file/d/1xXM8XGRi_Wzyupbm4MSNteF0rwUzO4GE/view?usp=sharing
> 

So the problem is we have a null inode:


crash> struct async_chunk ffff00012a78eb08
struct async_chunk {
  inode = 0x0,
  locked_page = 0xfffffc000508c240,
  start = 0,
  end = 4095,
  write_flags = 0,
  extents = {
    next = 0xffff00012a78eb30,
    prev = 0xffff00012a78eb30
  },
  blkcg_css = 0x0,
  work = {
    func = 0xffffd7c4c03c05c0 <async_cow_start>,
    ordered_func = 0xffffd7c4c03c1bf0 <async_cow_submit>,
    ordered_free = 0xffffd7c4c03be2e0 <async_cow_free>,
    normal_work = {
      data = {
        counter = 256
      },
      entry = {
        next = 0xffff00012a78eb68,
        prev = 0xffff00012a78eb68
      },
      func = 0xffffd7c4c03f9e84 <btrfs_work_helper>
    },
    ordered_list = {
      next = 0xffff00012a78ee80,
      prev = 0xffff0000c6d83510
    },
    wq = 0xffff0000c6d83500,
    flags = 3
  },
  pending = 0xffff00012a78eb00
}


But this makes no sense, since before submit_compressed_extents is called
we have an explicit check for async_chunk->inode presence, and AFAICS this
is not done in a concurrent context. So this leaves either some hw issue
or some race which manifests due to ARM's weak memory model.

> 
> --
> Chris Murphy
> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
  2021-10-22 10:44                                           ` Nikolay Borisov
@ 2021-10-22 11:43                                             ` Nikolay Borisov
  2021-10-22 17:18                                               ` Chris Murphy
  0 siblings, 1 reply; 62+ messages in thread
From: Nikolay Borisov @ 2021-10-22 11:43 UTC (permalink / raw)
  To: Chris Murphy, Su Yue; +Cc: Qu Wenruo, Qu Wenruo, Btrfs BTRFS



On 22.10.21 г. 13:44, Nikolay Borisov wrote:
> 
> 
> On 22.10.21 г. 5:36, Chris Murphy wrote:
>> OK I have a vmcore file:
>> https://dustymabe.fedorapeople.org/bz2011928-vmcore/
>>
>> lib/modules/5.14.10-300.fc35.aarch64/vmlinuz
>> https://drive.google.com/file/d/1xXM8XGRi_Wzyupbm4MSNteF0rwUzO4GE/view?usp=sharing
>>
> 
> So the problem is we have a null inode:
> 
> 
> crash> struct async_chunk ffff00012a78eb08
> struct async_chunk {
>   inode = 0x0,
>   locked_page = 0xfffffc000508c240,
>   start = 0,
>   end = 4095,
>   write_flags = 0,
>   extents = {
>     next = 0xffff00012a78eb30,
>     prev = 0xffff00012a78eb30
>   },
>   blkcg_css = 0x0,
>   work = {
>     func = 0xffffd7c4c03c05c0 <async_cow_start>,
>     ordered_func = 0xffffd7c4c03c1bf0 <async_cow_submit>,
>     ordered_free = 0xffffd7c4c03be2e0 <async_cow_free>,
>     normal_work = {
>       data = {
>         counter = 256
>       },
>       entry = {
>         next = 0xffff00012a78eb68,
>         prev = 0xffff00012a78eb68
>       },
>       func = 0xffffd7c4c03f9e84 <btrfs_work_helper>
>     },
>     ordered_list = {
>       next = 0xffff00012a78ee80,
>       prev = 0xffff0000c6d83510
>     },
>     wq = 0xffff0000c6d83500,
>     flags = 3
>   },
>   pending = 0xffff00012a78eb00
> }
> 
> 
> But this makes no sense, since before submit_compressed_extents is called
> we have an explicit check for async_chunk->inode presence, and AFAICS this
> is not done in a concurrent context. So this leaves either some hw issue
> or some race which manifests due to ARM's weak memory model.

I also looked at the assembly generated in async_cow_submit to see if
anything funny happens while the async_chunk->inode check is performed -
everything looks fine. Also given that the extents list is empty and the
inode is NULL I'd assume that the "write" side is also correct i.e the
code in async_cow_start. This pretty much excludes a codegen problem.

Chris can you add the following line in submit_compressed_extents right
before the BTRFS_I() function is called:

 WARN_ON(!async_chunk->inode);

And re-run the workload again?


> 
>>
>> --
>> Chris Murphy
>>
> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
  2021-10-22 11:43                                             ` Nikolay Borisov
@ 2021-10-22 17:18                                               ` Chris Murphy
  2021-10-23 10:09                                                 ` Nikolay Borisov
  0 siblings, 1 reply; 62+ messages in thread
From: Chris Murphy @ 2021-10-22 17:18 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Chris Murphy, Su Yue, Qu Wenruo, Qu Wenruo, Btrfs BTRFS

On Fri, Oct 22, 2021 at 7:43 AM Nikolay Borisov <nborisov@suse.com> wrote:
>
> I also looked at the assembly generated in async_cow_submit to see if
> anything funny happens while the async_chunk->inode check is performed -
> everything looks fine. Also given that the extents list is empty and the
> inode is NULL I'd assume that the "write" side is also correct i.e the
> code in async_cow_start. This pretty much excludes a codegen problem.
>
> Chris can you add the following line in submit_compressed_extents right
> before the BTRFS_I() function is called:
>
>  WARN_ON(!async_chunk->inode);
>
> And re-run the workload again?

I'll look into how we can do this. I build kernels per
https://kernelnewbies.org/KernelBuild but maybe it's better to do it
within Fedora infrastructure to keep things consistent and
reproducible? I'm not really sure, so I've asked in the bug
https://bugzilla.redhat.com/show_bug.cgi?id=2011928#c41 - if you have
two cents to add let me know in this thread or that one.

Any other configs to change while we're building a new kernel?
CONFIG_BTRFS_ASSERT=y ?

inode.c
849:static noinline void submit_compressed_extents(struct async_chunk
*async_chunk)
850-{
851-    struct btrfs_inode *inode = BTRFS_I(async_chunk->inode);

becomes

849:static noinline void submit_compressed_extents(struct async_chunk
*async_chunk)
850-{
851-    WARN_ON(!async_chunk->inode);
852-    struct btrfs_inode *inode = BTRFS_I(async_chunk->inode);

?
(I'm looking at 5.15-rc6)
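
For the record, the build steps I'd use on top of the distro config are
roughly this (untested as written):

scripts/config --enable CONFIG_BTRFS_ASSERT
make olddefconfig
make -j$(nproc)
sudo make modules_install install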



-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
  2021-10-22 17:18                                               ` Chris Murphy
@ 2021-10-23 10:09                                                 ` Nikolay Borisov
  2021-10-25 14:48                                                   ` Chris Murphy
  0 siblings, 1 reply; 62+ messages in thread
From: Nikolay Borisov @ 2021-10-23 10:09 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Su Yue, Qu Wenruo, Qu Wenruo, Btrfs BTRFS



On 22.10.21 г. 20:18, Chris Murphy wrote:
> On Fri, Oct 22, 2021 at 7:43 AM Nikolay Borisov <nborisov@suse.com> wrote:
>>
>> I also looked at the assembly generated in async_cow_submit to see if
>> anything funny happens while the async_chunk->inode check is performed -
>> everything looks fine. Also given that the extents list is empty and the
>> inode is NULL I'd assume that the "write" side is also correct i.e the
>> code in async_cow_start. This pretty much excludes a codegen problem.
>>
>> Chris can you add the following line in submit_compressed_extents right
>> before the BTRFS_I() function is called:
>>
>>  WARN_ON(!async_chunk->inode);
>>
>> And re-run the workload again?
> 
> I'll look into how we can do this. I build kernels per
> https://kernelnewbies.org/KernelBuild but maybe it's better to do it
> within Fedora infrastructure to keep things consistent and
> reproducible? I'm not really sure, so I've asked in the bug
> https://bugzilla.redhat.com/show_bug.cgi?id=2011928#c41 - if you have
> two cents to add let me know in this thread or that one.
> 
> Any other configs to change while we're building a new kernel?
> CONFIG_BTRFS_ASSERT=y ?
> 
> inode.c
> 849:static noinline void submit_compressed_extents(struct async_chunk
> *async_chunk)
> 850-{
> 851-    struct btrfs_inode *inode = BTRFS_I(async_chunk->inode);
> 
> becomes
> 
> 849:static noinline void submit_compressed_extents(struct async_chunk
> *async_chunk)
> 850-{
> 851-    WARN_ON(!async_chunk->inode);
> 852-    struct btrfs_inode *inode = BTRFS_I(async_chunk->inode);
> 
> ?
> (I'm looking at 5.15-rc6)

Yes.

> 
> 
> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
  2021-10-23 10:09                                                 ` Nikolay Borisov
@ 2021-10-25 14:48                                                   ` Chris Murphy
  2021-10-25 18:34                                                     ` Chris Murphy
  0 siblings, 1 reply; 62+ messages in thread
From: Chris Murphy @ 2021-10-25 14:48 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Chris Murphy, Su Yue, Qu Wenruo, Qu Wenruo, Btrfs BTRFS

https://bugzilla.redhat.com/show_bug.cgi?id=2011928

Comment 45 (attachment) is a dmesg with a sysrq+t taken during the hang
on a 5.14.14 kernel with the WARN_ON added; no OOPS or call trace
occurred.

Comment 46 (attachment) is a dmesg from a 5.14.10 kernel with the
WARN_ON added, with the OOPS and call trace; an excerpt is pasted below.


[  992.788137] ------------[ cut here ]------------
[  992.793018] WARNING: CPU: 0 PID: 1509 at fs/btrfs/inode.c:844
submit_compressed_extents+0x3d4/0x3e0
[  992.802276] Modules linked in: rfkill virtio_gpu virtio_dma_buf
drm_kms_helper joydev cec fb_sys_fops virtio_net syscopyarea
net_failover sysfillrect sysimgblt virtio_balloon failover vfat fat
drm fuse zram ip_tables crct10dif_ce ghash_ce virtio_blk qemu_fw_cfg
virtio_mmio aes_neon_bs
[  992.828320] CPU: 0 PID: 1509 Comm: kworker/u8:12 Not tainted
5.14.10-300.fc35.dusty.aarch64 #1
[  992.837159] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
[  992.844076] Workqueue: btrfs-delalloc btrfs_work_helper
[  992.849339] pstate: 20400005 (nzCv daif +PAN -UAO -TCO BTYPE=--)
[  992.855262] pc : submit_compressed_extents+0x3d4/0x3e0
[  992.860357] lr : async_cow_submit+0x50/0xd0
[  992.864444] sp : ffff800012023c20
[  992.867667] x29: ffff800012023c30 x28: 0000000000000000 x27: ffffdd47ca411000
[  992.874799] x26: ffff000128f2c548 x25: dead000000000100 x24: ffff000128f2c508
[  992.881862] x23: 0000000000000000 x22: 0000000000000001 x21: ffff00018f9d5e80
[  992.888931] x20: ffff0000c0672000 x19: 0000000000000001 x18: ffff0000c4c00bd4
[  992.896105] x17: ffff00012d53aff8 x16: 0000000000000006 x15: 7a1cde357ab19b01
[  992.903348] x14: 5eac0029a606c741 x13: 0000000000000020 x12: ffff0001fefa78c0
[  992.910639] x11: ffffdd47ca42b500 x10: 0000000000000000 x9 : ffffdd47c8c01c50
[  992.917872] x8 : ffff22ba34aec000 x7 : ffff800012023be0 x6 : ffffdd47ca95ad40
[  992.925086] x5 : 0000000000000000 x4 : 0000000000000000 x3 : ffff00018f9d5ea0
[  992.932221] x2 : 0000000000000000 x1 : ffff000128f2c508 x0 : ffff000128f2c508
[  992.939392] Call trace:
[  992.941854]  submit_compressed_extents+0x3d4/0x3e0
[  992.946737]  async_cow_submit+0x50/0xd0
[  992.950574]  run_ordered_work+0xc8/0x280
[  992.954560]  btrfs_work_helper+0x98/0x250
[  992.958594]  process_one_work+0x1f0/0x4ac
[  992.962619]  worker_thread+0x188/0x504
[  992.966390]  kthread+0x110/0x114
[  992.969681]  ret_from_fork+0x10/0x18
[  992.973313] ---[ end trace 11b751608cbdcfac ]---
[  992.978203] Unable to handle kernel paging request at virtual
address fffffffffffffdd0
[  992.986011] Mem abort info:
[  992.993975]   ESR = 0x96000004
[  992.996786]   EC = 0x25: DABT (current EL), IL = 32 bits
[  993.001795]   SET = 0, FnV = 0
[  993.004646]   EA = 0, S1PTW = 0
[  993.007455]   FSC = 0x04: level 0 translation fault
[  993.012081] Data abort info:
[  993.014712]   ISV = 0, ISS = 0x00000004
[  993.021058]   CM = 0, WnR = 0
[  993.026357] swapper pgtable: 4k pages, 48-bit VAs, pgdp=000000009c051000
[  993.035411] [fffffffffffffdd0] pgd=0000000000000000, p4d=0000000000000000
[  993.044400] Internal error: Oops: 96000004 [#1] SMP
[  993.051651] Modules linked in: rfkill virtio_gpu virtio_dma_buf
drm_kms_helper joydev cec fb_sys_fops virtio_net syscopyarea
net_failover sysfillrect sysimgblt virtio_balloon failover vfat fat
drm fuse zram ip_tables crct10dif_ce ghash_ce virtio_blk qemu_fw_cfg
virtio_mmio aes_neon_bs
[  993.083344] CPU: 0 PID: 1509 Comm: kworker/u8:12 Tainted: G
W         5.14.10-300.fc35.dusty.aarch64 #1
[  993.095545] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
[  993.104796] Workqueue: btrfs-delalloc btrfs_work_helper
[  993.112752] pstate: 20400005 (nzCv daif +PAN -UAO -TCO BTYPE=--)
[  993.121096] pc : submit_compressed_extents+0x44/0x3e0
[  993.128333] lr : async_cow_submit+0x50/0xd0
[  993.134773] sp : ffff800012023c20
[  993.140397] x29: ffff800012023c30 x28: 0000000000000000 x27: ffffdd47ca411000
[  993.149489] x26: fffffffffffffdd0 x25: dead000000000100 x24: ffff000128f2c508
[  993.158723] x23: 0000000000000000 x22: 0000000000000001 x21: ffff00018f9d5e80
[  993.167904] x20: fffffffffffffe18 x19: 0000000000000001 x18: ffff0000c4c00bd4
[  993.177039] x17: ffff00012d53aff8 x16: 0000000000000006 x15: 7a1cde357ab19b01
[  993.186386] x14: 5eac0029a606c741 x13: 0000000000000020 x12: ffff0001fefa78c0
[  993.195490] x11: ffffdd47ca42b500 x10: 0000000000000000 x9 : ffffdd47c8c01c50
[  993.204603] x8 : ffff22ba34aec000 x7 : ffff800012023be0 x6 : ffffdd47ca95ad40
[  993.213749] x5 : 0000000000000000 x4 : 0000000000000000 x3 : ffff00018f9d5ea0
[  993.222960] x2 : 0000000000000000 x1 : ffff000128f2c530 x0 : ffff000128f2c530
[  993.232079] Call trace:
[  993.236821]  submit_compressed_extents+0x44/0x3e0
[  993.243682]  async_cow_submit+0x50/0xd0
[  993.249829]  run_ordered_work+0xc8/0x280
[  993.255974]  btrfs_work_helper+0x98/0x250
[  993.262187]  process_one_work+0x1f0/0x4ac
[  993.268381]  worker_thread+0x188/0x504
[  993.274252]  kthread+0x110/0x114
[  993.279894]  ret_from_fork+0x10/0x18
[  993.285819] Code: d108c2fa 9100a301 f9401700 d107a2f4 (f9400356)
[  993.294256] ---[ end trace 11b751608cbdcfad ]---


I don't see any new information here though.


--
Chris Murphy

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
  2021-10-25 14:48                                                   ` Chris Murphy
@ 2021-10-25 18:34                                                     ` Chris Murphy
  2021-10-25 19:40                                                       ` Chris Murphy
  0 siblings, 1 reply; 62+ messages in thread
From: Chris Murphy @ 2021-10-25 18:34 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Nikolay Borisov, Su Yue, Qu Wenruo, Qu Wenruo, Btrfs BTRFS

> > Vendor ID:           Cavium
> > Model:               1
> > Model name:          ThunderX 88XX

I still haven't hit the WARN_ON. But weirdly I'm not getting the oops
with 5.14.14 but can hit it with 5.14.10... though the sample size is
small. And it definitely smells like a race. I'll keep trying to
hit it with 5.14.10 because I want to see if this WARN_ON will get hit
and give us more information.


--
Chris Murphy

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
  2021-10-25 18:34                                                     ` Chris Murphy
@ 2021-10-25 19:40                                                       ` Chris Murphy
  2021-10-26  7:14                                                         ` Nikolay Borisov
  0 siblings, 1 reply; 62+ messages in thread
From: Chris Murphy @ 2021-10-25 19:40 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Nikolay Borisov, Su Yue, Qu Wenruo, Qu Wenruo, Btrfs BTRFS

Got another sysrq+t here, taken while dnf was completely hung for a
long time during 'dnf install kernel-debuginfo', with no call traces
or other indication of why it's stuck. ps aux shows it's running, but
consuming no meaningful cpu; top shows very high ~25% wa, the rest is
idle. Essentially no user or system process consumption.

https://bugzilla.redhat.com/attachment.cgi?id=1836995


Excerpts of items that are in D state:

[ 9595.270460] kernel: task:kworker/u8:7    state:D stack:    0 pid:
1296 ppid:     2 flags:0x00000008
[ 9595.280269] kernel: Workqueue: events_unbound
btrfs_async_reclaim_metadata_space
[ 9595.288593] kernel: Call trace:
[ 9595.292822] kernel:  __switch_to+0x160/0x1d4
[ 9595.298383] kernel:  __schedule+0x22c/0x5f0
[ 9595.303605] kernel:  schedule+0x54/0xdc
[ 9595.308644] kernel:  schedule_preempt_disabled+0x1c/0x30
[ 9595.314929] kernel:  __mutex_lock.constprop.0+0x184/0x544
[ 9595.321559] kernel:  __mutex_lock_slowpath+0x1c/0x30
[ 9595.327579] kernel:  mutex_lock+0x6c/0x80
[ 9595.332600] kernel:  btrfs_start_delalloc_roots+0x78/0x320
[ 9595.339303] kernel:  shrink_delalloc+0xf4/0x260
[ 9595.344883] kernel:  flush_space+0x110/0x2a0
[ 9595.350402] kernel:  btrfs_async_reclaim_metadata_space+0x130/0x350
[ 9595.357574] kernel:  process_one_work+0x1f0/0x4ac
[ 9595.363215] kernel:  worker_thread+0x188/0x504
[ 9595.368921] kernel:  kthread+0x110/0x114
[ 9595.373958] kernel:  ret_from_fork+0x10/0x18
[ 9595.379413] kernel: task:kworker/u8:9    state:D stack:    0 pid:
1300 ppid:     2 flags:0x00000008
[ 9595.389417] kernel: Workqueue: writeback wb_workfn (flush-btrfs-1)
[ 9595.396867] kernel: Call trace:
[ 9595.401256] kernel:  __switch_to+0x160/0x1d4
[ 9595.406688] kernel:  __schedule+0x22c/0x5f0
[ 9595.411998] kernel:  schedule+0x54/0xdc
[ 9595.417000] kernel:  inode_sleep_on_writeback+0x8c/0xb0
[ 9595.423152] kernel:  wb_writeback+0x174/0x3dc
[ 9595.428734] kernel:  wb_do_writeback+0x114/0x394
[ 9595.434404] kernel:  wb_workfn+0x80/0x2a0
[ 9595.439815] kernel:  process_one_work+0x1f0/0x4ac
[ 9595.445807] kernel:  worker_thread+0x260/0x504
[ 9595.451559] kernel:  kthread+0x110/0x114
[ 9595.456623] kernel:  ret_from_fork+0x10/0x18
[ 9595.461987] kernel: task:kworker/u8:13   state:D stack:    0 pid:
1304 ppid:     2 flags:0x00000008
[ 9595.472144] kernel: Workqueue: events_unbound
btrfs_preempt_reclaim_metadata_space
[ 9595.480865] kernel: Call trace:
[ 9595.485360] kernel:  __switch_to+0x160/0x1d4
[ 9595.491154] kernel:  __schedule+0x22c/0x5f0
[ 9595.496601] kernel:  schedule+0x54/0xdc
[ 9595.501702] kernel:  io_schedule+0x48/0x6c
[ 9595.507098] kernel:  wait_on_page_bit_common+0x15c/0x400
[ 9595.513421] kernel:  __lock_page+0x60/0x80
[ 9595.518791] kernel:  extent_write_cache_pages+0x29c/0x3cc
[ 9595.525199] kernel:  extent_writepages+0x44/0xb0
[ 9595.531110] kernel:  btrfs_writepages+0x1c/0x30
[ 9595.536813] kernel:  do_writepages+0x44/0xf0
[ 9595.542223] kernel:  __writeback_single_inode+0x48/0x400
[ 9595.548938] kernel:  writeback_single_inode+0xf4/0x240
[ 9595.555245] kernel:  sync_inode+0x1c/0x2c
[ 9595.560604] kernel:  start_delalloc_inodes+0x188/0x450
[ 9595.567634] kernel:  btrfs_start_delalloc_roots+0x194/0x320
[ 9595.574325] kernel:  shrink_delalloc+0xf4/0x260
[ 9595.580087] kernel:  flush_space+0x110/0x2a0
[ 9595.585381] kernel:  btrfs_preempt_reclaim_metadata_space+0x148/0x270
[ 9595.593048] kernel:  process_one_work+0x1f0/0x4ac
[ 9595.599040] kernel:  worker_thread+0x188/0x504
[ 9595.604515] kernel:  kthread+0x110/0x114
[ 9595.609959] kernel:  ret_from_fork+0x10/0x18

...

[ 9596.146831] kernel: task:dnf             state:D stack:    0
pid:14580 ppid: 14579 flags:0x00000000
[ 9596.156309] kernel: Call trace:
[ 9596.160424] kernel:  __switch_to+0x160/0x1d4
[ 9596.165512] kernel:  __schedule+0x22c/0x5f0
[ 9596.170758] kernel:  schedule+0x54/0xdc
[ 9596.175419] kernel:  wb_wait_for_completion+0x78/0xac
[ 9596.181577] kernel:  __writeback_inodes_sb_nr+0x80/0xa0
[ 9596.187695] kernel:  writeback_inodes_sb+0x58/0x70
[ 9596.193322] kernel:  sync_filesystem+0x50/0xc0
[ 9596.198714] kernel:  __arm64_sys_syncfs+0x54/0xb0
[ 9596.204163] kernel:  invoke_syscall+0x50/0x120
[ 9596.209724] kernel:  el0_svc_common+0x48/0x100
[ 9596.214986] kernel:  do_el0_svc+0x34/0xa0
[ 9596.220105] kernel:  el0_svc+0x2c/0x54
[ 9596.224755] kernel:  el0t_64_sync_handler+0xa4/0x130
[ 9596.230745] kernel:  el0t_64_sync+0x19c/0x1a0

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
  2021-10-25 19:40                                                       ` Chris Murphy
@ 2021-10-26  7:14                                                         ` Nikolay Borisov
  2021-10-26 12:51                                                           ` Chris Murphy
  2021-10-27 18:22                                                           ` Chris Murphy
  0 siblings, 2 replies; 62+ messages in thread
From: Nikolay Borisov @ 2021-10-26  7:14 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Su Yue, Qu Wenruo, Qu Wenruo, Btrfs BTRFS



On 25.10.21 г. 22:40, Chris Murphy wrote:
> Got another sysrq+t here, taken while dnf was completely hung for a
> long time during 'dnf install kernel-debuginfo', with no call traces
> or other indication of why it's stuck. ps aux shows it's running, but
> consuming no meaningful cpu; top shows very high ~25% wa, the rest is
> idle. Essentially no user or system process consumption.
> 
> https://bugzilla.redhat.com/attachment.cgi?id=1836995
> 

<snip>


I think I identified a race that could cause the crash, can you apply the 
following diff and re-run the tests and leave them for a couple of days. 
Preferably apply it on 5.4.10 so that there is the highest chance to reproduce: 

diff --git a/fs/btrfs/async-thread.c b/fs/btrfs/async-thread.c
index 309516e6a968..a3d788dcbd34 100644
--- a/fs/btrfs/async-thread.c
+++ b/fs/btrfs/async-thread.c
@@ -234,6 +234,11 @@ static void run_ordered_work(struct __btrfs_workqueue *wq,
                                  ordered_list);
                if (!test_bit(WORK_DONE_BIT, &work->flags))
                        break;
+               /*
+                * Orders all subsequent loads after WORK_DONE_BIT, paired with
+                * the smp_mb__before_atomic in btrfs_work_helper
+                */
+               smp_rmb();
 
                /*
                 * we are going to call the ordered done function, but
@@ -317,6 +322,12 @@ static void btrfs_work_helper(struct work_struct *normal_work)
        thresh_exec_hook(wq);
        work->func(work);
        if (need_order) {
+               /*
+                * Ensures all вритес done in ->func are ordered before
+                * setting the WORK_DONE_BIT making them visible to ordered
+                * func
+                */
+               smp_mb__before_atomic();
                set_bit(WORK_DONE_BIT, &work->flags);
                run_ordered_work(wq, work);
        } else {
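
To spell out the pairing the two barriers are meant to form, here's a
stripped-down, kernel-style sketch (illustrative only, not the actual
btrfs code):

#include <linux/bitops.h>
#include <asm/barrier.h>

static unsigned long flags;
static int payload;

/* normal-work side, i.e. what ->func plus btrfs_work_helper do */
static void publish(void)
{
        payload = 42;                   /* the writes done in ->func */
        smp_mb__before_atomic();        /* order them before the flag store */
        set_bit(0, &flags);             /* WORK_DONE_BIT */
}

/* ordered side, i.e. run_ordered_work, possibly on another CPU */
static int consume(void)
{
        if (!test_bit(0, &flags))
                return -1;              /* not published yet */
        smp_rmb();                      /* order later loads after the flag load */
        return payload;                 /* now guaranteed to observe 42 */
}

If this is the bug it would also explain why only aarch64 trips over
it: on x86 set_bit is a locked op and loads are not reordered against
other loads, so the pairing is implicit there.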


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
  2021-10-26  7:14                                                         ` Nikolay Borisov
@ 2021-10-26 12:51                                                           ` Chris Murphy
  2021-10-26 13:05                                                             ` Nikolay Borisov
  2021-10-27 18:22                                                           ` Chris Murphy
  1 sibling, 1 reply; 62+ messages in thread
From: Chris Murphy @ 2021-10-26 12:51 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Chris Murphy, Su Yue, Qu Wenruo, Qu Wenruo, Btrfs BTRFS

On Tue, Oct 26, 2021 at 3:14 AM Nikolay Borisov <nborisov@suse.com> wrote:
>
> I think I identified a race that could cause the crash, can you apply the
> following diff and re-run the tests and leave them for a couple of days.
> Preferably apply it on 5.4.10 so that there is the highest chance to reproduce:
>
> diff --git a/fs/btrfs/async-thread.c b/fs/btrfs/async-thread.c
> index 309516e6a968..a3d788dcbd34 100644
> --- a/fs/btrfs/async-thread.c
> +++ b/fs/btrfs/async-thread.c
> @@ -234,6 +234,11 @@ static void run_ordered_work(struct __btrfs_workqueue *wq,
>                                   ordered_list);
>                 if (!test_bit(WORK_DONE_BIT, &work->flags))
>                         break;
> +               /*
> +                * Orders all subsequent loads after WORK_DONE_BIT, paired with
> +                * the smp_mb__before_atomic in btrfs_work_helper
> +                */
> +               smp_rmb();
>
>                 /*
>                  * we are going to call the ordered done function, but
> @@ -317,6 +322,12 @@ static void btrfs_work_helper(struct work_struct *normal_work)
>         thresh_exec_hook(wq);
>         work->func(work);
>         if (need_order) {
> +               /*
> +                * Ensures all вритес done in ->func are ordered before
> +                * setting the WORK_DONE_BIT making them visible to ordered
> +                * func
> +                */
> +               smp_mb__before_atomic();
>                 set_bit(WORK_DONE_BIT, &work->flags);
>                 run_ordered_work(wq, work);
>         } else {
>

A couple of typos: 'вритес' looks like a keyboard layout hiccup and
should be 'writes'; and 5.4.10 should be 5.14.10 (I'm betting all the
tea in China that upstream isn't asking me to test a patch on a
two-year-old kernel).

Unfortunately the test we have is non-automated, it's "install this
package set" and wait. It always hangs, usually recovers without an
oops, but sometimes there's an oops. So it's pretty tedious to test it
with the "testcase" we currently have. I'd like a better one that
triggers this faster, but more importantly would be a reliable one.
We'll do our best though. Thanks!

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
  2021-10-26 12:51                                                           ` Chris Murphy
@ 2021-10-26 13:05                                                             ` Nikolay Borisov
  2021-10-26 18:08                                                               ` Chris Murphy
  0 siblings, 1 reply; 62+ messages in thread
From: Nikolay Borisov @ 2021-10-26 13:05 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Su Yue, Qu Wenruo, Qu Wenruo, Btrfs BTRFS



On 26.10.21 г. 15:51, Chris Murphy wrote:
> On Tue, Oct 26, 2021 at 3:14 AM Nikolay Borisov <nborisov@suse.com> wrote:
>>
>> I think I identified a race that could cause the crash, can you apply the
>> following diff and re-run the tests and leave them for a couple of days.
>> Preferably apply it on 5.4.10 so that there is the highest chance to reproduce:
>>
>> diff --git a/fs/btrfs/async-thread.c b/fs/btrfs/async-thread.c
>> index 309516e6a968..a3d788dcbd34 100644
>> --- a/fs/btrfs/async-thread.c
>> +++ b/fs/btrfs/async-thread.c
>> @@ -234,6 +234,11 @@ static void run_ordered_work(struct __btrfs_workqueue *wq,
>>                                   ordered_list);
>>                 if (!test_bit(WORK_DONE_BIT, &work->flags))
>>                         break;
>> +               /*
>> +                * Orders all subsequent loads after WORK_DONE_BIT, paired with
>> +                * the smp_mb__before_atomic in btrfs_work_helper
>> +                */
>> +               smp_rmb();
>>
>>                 /*
>>                  * we are going to call the ordered done function, but
>> @@ -317,6 +322,12 @@ static void btrfs_work_helper(struct work_struct *normal_work)
>>         thresh_exec_hook(wq);
>>         work->func(work);
>>         if (need_order) {
>> +               /*
>> +                * Ensures all вритес done in ->func are ordered before
>> +                * setting the WORK_DONE_BIT making them visible to ordered
>> +                * func
>> +                */
>> +               smp_mb__before_atomic();
>>                 set_bit(WORK_DONE_BIT, &work->flags);
>>                 run_ordered_work(wq, work);
>>         } else {
>>
> 
> Couple typos: 'вритес' looks like keyboard layout hiccup and should be
> 'writes'; and 5.4.10 should be 5.14.10 (I'm betting all the tea in
> China that upstream isn't asking me to test a patch on a two year old
> kernel).

Correct in both cases :)

> 
Unfortunately the test we have is non-automated: it's "install this
package set" and wait. It always hangs, usually recovers without an
oops, but sometimes there's an oops. So it's pretty tedious to test
with the "testcase" we currently have. I'd like a better one that
triggers this faster, and more importantly one that's reliable.
We'll do our best though. Thanks!

I thought the hang and the crash are two different issues. What the
above diff is supposed to solve is the case in which
submit_compressed_extents() is called with async_chunk->inode being NULL.
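
Schematically (a simplified sketch of the race, not the literal btrfs
code paths):

  CPU0: btrfs_work_helper()       CPU1: run_ordered_work()
  -------------------------       ------------------------
  work->func(work);  /* stores */
  set_bit(WORK_DONE_BIT);
                                  test_bit(WORK_DONE_BIT); /* sees it */
                                  /* but without the barriers a weakly
                                     ordered CPU may still load stale,
                                     pre-func() values of the work item,
                                     e.g. a stale async_chunk->inode */

The smp_mb__before_atomic()/smp_rmb() pair forbids exactly that
reordering.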

The lockup issue might or might not be related to this. But it
would be best if a crashdump is provided when the hang has occurred.
The task call trace in
https://bugzilla-attachments.redhat.com/attachment.cgi?id=1836995
doesn't point at a hang - just a bunch of threads waiting on IO in the
metadata reclaim path.

> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
  2021-10-26 13:05                                                             ` Nikolay Borisov
@ 2021-10-26 18:08                                                               ` Chris Murphy
  2021-10-26 18:14                                                                 ` Nikolay Borisov
  0 siblings, 1 reply; 62+ messages in thread
From: Chris Murphy @ 2021-10-26 18:08 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Chris Murphy, Su Yue, Qu Wenruo, Qu Wenruo, Btrfs BTRFS

On Tue, Oct 26, 2021 at 9:05 AM Nikolay Borisov <nborisov@suse.com> wrote:
>
>
>
> On 26.10.21 г. 15:51, Chris Murphy wrote:

> > Unfortunately the test we have is non-automated: it's "install this
> > package set" and wait. It always hangs, usually recovers without an
> > oops, but sometimes there's an oops. So it's pretty tedious to test it
> > with the "testcase" we currently have. I'd like a better one that
> > triggers this faster, but more importantly one that's reliable.
> > We'll do our best though. Thanks!
>
> I thought the hang and the crash are two different issues. What the
> above diff is supposed to solve is the case in which
> submit_compressed_extents() is called with async_chunk->inode being NULL.

I don't know whether the hang and crash are related at all. I've been
unable to get a sysrq+t that shows anything when "dnf install
libreoffice" hangs, which I suspect could be dbus related where a
bunch of services get clobbered and restarted during the metric ton of
dependencies that libreoffice brings into a cloud base image. But
there is a consistent hang just installing kernel debug info, and maybe
half the time the VM just falls over and isn't responsive at all -
later we sometimes see the submit_compressed_extents call trace on the
virtual serial console. So yeah, I don't know...


> The lockup issue might or might not be related to this. But it
> would be best if a crashdump is provided when the hang has occurred.

How do I trigger the crashdump for the hang? Maybe set one of these to 1?

kernel.hardlockup_panic = 0
kernel.hung_task_panic = 0
kernel.max_rcu_stall_to_panic = 0
kernel.panic_on_rcu_stall = 0
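
For example, my guess (untested, and assuming kdump is already set up
to write a vmcore on panic) would be:

  # panic when a task is stuck longer than two minutes, so the
  # crash kernel captures a dump of the hang
  sysctl -w kernel.hung_task_timeout_secs=120
  sysctl -w kernel.hung_task_panic=1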

> The task call trace in
> https://bugzilla-attachments.redhat.com/attachment.cgi?id=1836995
> doesn't point at a hang - just a bunch of threads waiting on IO in the
> metadata reclaim path.

Well, it stayed that way for hours and never recovered; I couldn't ssh
in either. And in the most recent case there was an oops with the
submit_compressed_extents call trace.



-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
  2021-10-26 18:08                                                               ` Chris Murphy
@ 2021-10-26 18:14                                                                 ` Nikolay Borisov
  2021-10-26 18:26                                                                   ` Chris Murphy
  0 siblings, 1 reply; 62+ messages in thread
From: Nikolay Borisov @ 2021-10-26 18:14 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Su Yue, Qu Wenruo, Qu Wenruo, Btrfs BTRFS



On 26.10.21 г. 21:08, Chris Murphy wrote:
> I don't know whether the hang and crash are related at all. I've been
> unable to get a sysrq+t that shows anything when "dnf install
> libreoffice" hangs, which I suspect could be dbus related where a
> bunch of services get clobbered and restarted during the metric ton of
> dependencies that libreoffice brings into a cloud base image. But


Since this is a qemu virtual machine, it's possible to acquire a direct
memory dump. There's a dump-guest-memory command in qemu's management
console; alternatively, via virsh, one can do the procedure described
here:
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/virtualization_deployment_and_administration_guide/sect-domain_commands-creating_a_dump_file_of_a_domains_core


If you can provide a memory dump + kernel vmlinux then I will be happy
to look into this. In the meantime the barrier fixes should remedy the
crash.

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
  2021-10-26 18:14                                                                 ` Nikolay Borisov
@ 2021-10-26 18:26                                                                   ` Chris Murphy
  2021-10-26 18:31                                                                     ` Chris Murphy
  0 siblings, 1 reply; 62+ messages in thread
From: Chris Murphy @ 2021-10-26 18:26 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Chris Murphy, Su Yue, Qu Wenruo, Qu Wenruo, Btrfs BTRFS

On Tue, Oct 26, 2021 at 2:14 PM Nikolay Borisov <nborisov@suse.com> wrote:
>
>
>
> On 26.10.21 г. 21:08, Chris Murphy wrote:
> > I don't know whether the hang and crash are related at all. I've been
> > unable to get a sysrq+t that shows anything when "dnf install
> > libreoffice" hangs, which I suspect could be dbus related where a
> > bunch of services get clobbered and restarted during the metric ton of
> > dependencies that libreoffice brings into a cloud base image. But
>
>
> Since this is a qemu virtual machine, it's possible to acquire a direct
> memory dump. There's a dump-guest-memory command in qemu's management
> console; alternatively, via virsh, one can do the procedure described
> here:
> https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/virtualization_deployment_and_administration_guide/sect-domain_commands-creating_a_dump_file_of_a_domains_core
>
>
> If you can provide a memory dump + kernel vmlinux then I will be happy
> to look into this. In the meantime the barrier fixes should remedy the
> crash.

OK thanks. I'll start testing a kernel built with this patch, and then
move on to capturing a memory dump of the VM if we're still seeing
hangs.

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
  2021-10-26 18:26                                                                   ` Chris Murphy
@ 2021-10-26 18:31                                                                     ` Chris Murphy
  2021-10-26 18:35                                                                       ` Nikolay Borisov
  0 siblings, 1 reply; 62+ messages in thread
From: Chris Murphy @ 2021-10-26 18:31 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Nikolay Borisov, Su Yue, Qu Wenruo, Qu Wenruo, Btrfs BTRFS

On Tue, Oct 26, 2021 at 2:26 PM Chris Murphy <lists@colorremedies.com> wrote:
>
> On Tue, Oct 26, 2021 at 2:14 PM Nikolay Borisov <nborisov@suse.com> wrote:
> >
> >
> >
> > On 26.10.21 г. 21:08, Chris Murphy wrote:
> > > I don't know whether the hang and crash are related at all. I've been
> > > unable to get a sysrq+t that shows anything when "dnf install
> > > libreoffice" hangs, which I suspect could be dbus related where a
> > > bunch of services get clobbered and restarted during the metric ton of
> > > dependencies that libreoffice brings into a cloud base image. But
> >
> >
> > Since this is a qemu virtual machine, it's possible to acquire a direct
> > memory dump. There's a dump-guest-memory command in qemu's management
> > console; alternatively, via virsh, one can do the procedure described
> > here:
> > https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/virtualization_deployment_and_administration_guide/sect-domain_commands-creating_a_dump_file_of_a_domains_core
> >
> >
> > If you can provide a memory dump + kernel vmlinux then I will be happy
> > to look into this. In the meantime the barrier fixes should remedy the
> > crash.
>
> OK thanks. I'll start testing a kernel built with this patch, and then
> move on to capturing a memory dump of the VM if we're still seeing
> hangs.

With or without the --memory-only option?


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
  2021-10-26 18:31                                                                     ` Chris Murphy
@ 2021-10-26 18:35                                                                       ` Nikolay Borisov
  0 siblings, 0 replies; 62+ messages in thread
From: Nikolay Borisov @ 2021-10-26 18:35 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Su Yue, Qu Wenruo, Qu Wenruo, Btrfs BTRFS



On 26.10.21 г. 21:31, Chris Murphy wrote:
> On Tue, Oct 26, 2021 at 2:26 PM Chris Murphy <lists@colorremedies.com> wrote:
>>
>> On Tue, Oct 26, 2021 at 2:14 PM Nikolay Borisov <nborisov@suse.com> wrote:
>>>
>>>
>>>
>>> On 26.10.21 г. 21:08, Chris Murphy wrote:
>>>> I don't know whether the hang and crash are related at all. I've been
>>>> unable to get a sysrq+t that shows anything when "dnf install
>>>> libreoffice" hangs, which I suspect could be dbus related where a
>>>> bunch of services get clobbered and restarted during the metric ton of
>>>> dependencies that libreoffice brings into a cloud base image. But
>>>
>>>
>>> Since this is a qemu virtual machine, it's possible to acquire a direct
>>> memory dump. There's a dump-guest-memory command in qemu's management
>>> console; alternatively, via virsh, one can do the procedure described
>>> here:
>>> https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/virtualization_deployment_and_administration_guide/sect-domain_commands-creating_a_dump_file_of_a_domains_core
>>>
>>>
>>> If you can provide a memory dump + kernel vmlinux then I will be happy
>>> to look into this. In the meantime the barrier fixes should remedy the
>>> crash.
>>
>> OK thanks. I'll start testing a kernel built with this patch, and then
>> move on to capturing a memory dump of the VM if we're still seeing
>> hangs.
> 
> With or without the --memory-only option?


Yes (though I have never used the virsh method, just the straight HMP one).
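
Something like the following, e.g. (untested sketch; the domain name
and output path are made up):

  # via libvirt; --memory-only writes an ELF vmcore readable by crash(8)
  virsh dump --memory-only --verbose f35-aarch64 /var/tmp/vmcore

  # or directly at the qemu HMP monitor prompt
  (qemu) dump-guest-memory /var/tmp/vmcore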
> 
> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
  2021-10-26  7:14                                                         ` Nikolay Borisov
  2021-10-26 12:51                                                           ` Chris Murphy
@ 2021-10-27 18:22                                                           ` Chris Murphy
  2021-10-28  5:36                                                             ` Nikolay Borisov
  1 sibling, 1 reply; 62+ messages in thread
From: Chris Murphy @ 2021-10-27 18:22 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Chris Murphy, Su Yue, Qu Wenruo, Qu Wenruo, Btrfs BTRFS

On Tue, Oct 26, 2021 at 3:14 AM Nikolay Borisov <nborisov@suse.com> wrote:

> I think I identified a race that could cause the crash, can you apply the
> following diff and re-run the tests and leave them for a couple of days.
> Preferably apply it on 5.4.10 so that there is the highest chance to reproduce:
>
> diff --git a/fs/btrfs/async-thread.c b/fs/btrfs/async-thread.c
> index 309516e6a968..a3d788dcbd34 100644
> --- a/fs/btrfs/async-thread.c
> +++ b/fs/btrfs/async-thread.c
> @@ -234,6 +234,11 @@ static void run_ordered_work(struct __btrfs_workqueue *wq,
>                                   ordered_list);
>                 if (!test_bit(WORK_DONE_BIT, &work->flags))
>                         break;
> +               /*
> +                * Orders all subsequent loads after WORK_DONE_BIT, paired with
> +                * the smp_mb__before_atomic in btrfs_work_helper
> +                */
> +               smp_rmb();
>
>                 /*
>                  * we are going to call the ordered done function, but
> @@ -317,6 +322,12 @@ static void btrfs_work_helper(struct work_struct *normal_work)
>         thresh_exec_hook(wq);
>         work->func(work);
>         if (need_order) {
> +               /*
> +                * Ensures all вритес done in ->func are ordered before
> +                * setting the WORK_DONE_BIT making them visible to ordered
> +                * func
> +                */
> +               smp_mb__before_atomic();
>                 set_bit(WORK_DONE_BIT, &work->flags);
>                 run_ordered_work(wq, work);
>         } else {
>

So far this appears to be working well - thanks!
https://bugzilla.redhat.com/show_bug.cgi?id=2011928#c54


--
Chris Murphy

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
  2021-10-27 18:22                                                           ` Chris Murphy
@ 2021-10-28  5:36                                                             ` Nikolay Borisov
  2021-11-02 14:23                                                               ` Chris Murphy
  0 siblings, 1 reply; 62+ messages in thread
From: Nikolay Borisov @ 2021-10-28  5:36 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Su Yue, Qu Wenruo, Qu Wenruo, Btrfs BTRFS



On 27.10.21 г. 21:22, Chris Murphy wrote:
> On Tue, Oct 26, 2021 at 3:14 AM Nikolay Borisov <nborisov@suse.com> wrote:
> 
>> I think I identified a race that could cause the crash, can you apply the
>> following diff and re-run the tests and leave them for a couple of days.
>> Preferably apply it on 5.4.10 so that there is the highest chance to reproduce:
>>
>> diff --git a/fs/btrfs/async-thread.c b/fs/btrfs/async-thread.c
>> index 309516e6a968..a3d788dcbd34 100644
>> --- a/fs/btrfs/async-thread.c
>> +++ b/fs/btrfs/async-thread.c
>> @@ -234,6 +234,11 @@ static void run_ordered_work(struct __btrfs_workqueue *wq,
>>                                   ordered_list);
>>                 if (!test_bit(WORK_DONE_BIT, &work->flags))
>>                         break;
>> +               /*
>> +                * Orders all subsequent loads after WORK_DONE_BIT, paired with
>> +                * the smp_mb__before_atomic in btrfs_work_helper
>> +                */
>> +               smp_rmb();
>>
>>                 /*
>>                  * we are going to call the ordered done function, but
>> @@ -317,6 +322,12 @@ static void btrfs_work_helper(struct work_struct *normal_work)
>>         thresh_exec_hook(wq);
>>         work->func(work);
>>         if (need_order) {
>> +               /*
>> +                * Ensures all вритес done in ->func are ordered before
>> +                * setting the WORK_DONE_BIT making them visible to ordered
>> +                * func
>> +                */
>> +               smp_mb__before_atomic();
>>                 set_bit(WORK_DONE_BIT, &work->flags);
>>                 run_ordered_work(wq, work);
>>         } else {
>>
> 
> So far this appears to be working well - thanks!
> https://bugzilla.redhat.com/show_bug.cgi?id=2011928#c54

Great, but due to the nature of the bug I'd rather wait at least until
the beginning of next week before sending an official patch so that this
can be tested more. In your comment you state 3/3 kernel debug info
installs and 6/6 libreoffice installs; how do those numbers compare
without the fix?

> 
> 
> --
> Chris Murphy
> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
  2021-10-28  5:36                                                             ` Nikolay Borisov
@ 2021-11-02 14:23                                                               ` Chris Murphy
  2021-11-02 14:25                                                                 ` Nikolay Borisov
  0 siblings, 1 reply; 62+ messages in thread
From: Chris Murphy @ 2021-11-02 14:23 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Chris Murphy, Su Yue, Qu Wenruo, Qu Wenruo, Btrfs BTRFS

On Thu, Oct 28, 2021 at 1:36 AM Nikolay Borisov <nborisov@suse.com> wrote:
>
>
>
> On 27.10.21 г. 21:22, Chris Murphy wrote:
> > On Tue, Oct 26, 2021 at 3:14 AM Nikolay Borisov <nborisov@suse.com> wrote:
> >
> >> I think I identified a race that could cause the crash, can you apply the
> >> following diff and re-run the tests and leave them for a couple of days.
> >> Preferably apply it on 5.4.10 so that there is the highest chance to reproduce:
> >>
> >> diff --git a/fs/btrfs/async-thread.c b/fs/btrfs/async-thread.c
> >> index 309516e6a968..a3d788dcbd34 100644
> >> --- a/fs/btrfs/async-thread.c
> >> +++ b/fs/btrfs/async-thread.c
> >> @@ -234,6 +234,11 @@ static void run_ordered_work(struct __btrfs_workqueue *wq,
> >>                                   ordered_list);
> >>                 if (!test_bit(WORK_DONE_BIT, &work->flags))
> >>                         break;
> >> +               /*
> >> +                * Orders all subsequent loads after WORK_DONE_BIT, paired with
> >> +                * the smp_mb__before_atomic in btrfs_work_helper
> >> +                */
> >> +               smp_rmb();
> >>
> >>                 /*
> >>                  * we are going to call the ordered done function, but
> >> @@ -317,6 +322,12 @@ static void btrfs_work_helper(struct work_struct *normal_work)
> >>         thresh_exec_hook(wq);
> >>         work->func(work);
> >>         if (need_order) {
> >> +               /*
> >> +                * Ensures all вритес done in ->func are ordered before
> >> +                * setting the WORK_DONE_BIT making them visible to ordered
> >> +                * func
> >> +                */
> >> +               smp_mb__before_atomic();
> >>                 set_bit(WORK_DONE_BIT, &work->flags);
> >>                 run_ordered_work(wq, work);
> >>         } else {
> >>
> >
> > So far this appears to be working well - thanks!
> > https://bugzilla.redhat.com/show_bug.cgi?id=2011928#c54
>
> Great, but due to the nature of the bug I'd rather wait at least until
> the beginning of next week before sending an official patch so that this
> can be tested more. In your comment you state 3/3 kernel debug info
> installs and 6/6 libreoffice installs; how do those numbers compare
> without the fix?

More than 1/2 of the time there'd be an indefinite hang. Perhaps 1/3
of those would result in a call trace.



--
Chris Murphy

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
  2021-11-02 14:23                                                               ` Chris Murphy
@ 2021-11-02 14:25                                                                 ` Nikolay Borisov
  2021-11-05 16:12                                                                   ` Chris Murphy
  0 siblings, 1 reply; 62+ messages in thread
From: Nikolay Borisov @ 2021-11-02 14:25 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Su Yue, Qu Wenruo, Qu Wenruo, Btrfs BTRFS



On 2.11.21 г. 16:23, Chris Murphy wrote:
> On Thu, Oct 28, 2021 at 1:36 AM Nikolay Borisov <nborisov@suse.com> wrote:

<snip>

>>>
>>> So far this appears to be working well - thanks!
>>> https://bugzilla.redhat.com/show_bug.cgi?id=2011928#c54
>>
>> Great, but due to the nature of the bug I'd rather wait at least until
>> the beginning of next week before sending an official patch so that this
>> can be tested more. In your comment you state 3/3 kernel debug info
>> installs and 6/6 libreoffice installs; how do those numbers compare
>> without the fix?
> 
> More than 1/2 of the time there'd be an indefinite hang. Perhaps 1/3
> of those would result in a call trace.

As you might have seen, I did send a proper patch. If you've continued
testing it over the weekend and still haven't encountered an issue, you
can reply with a Tested-by to the patch.

> 
> 
> 
> --
> Chris Murphy
> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
  2021-11-02 14:25                                                                 ` Nikolay Borisov
@ 2021-11-05 16:12                                                                   ` Chris Murphy
  2021-11-07  9:11                                                                     ` Nikolay Borisov
  0 siblings, 1 reply; 62+ messages in thread
From: Chris Murphy @ 2021-11-05 16:12 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: Chris Murphy, Su Yue, Qu Wenruo, Qu Wenruo, Btrfs BTRFS

On Tue, Nov 2, 2021 at 10:25 AM Nikolay Borisov <nborisov@suse.com> wrote:
>
>
>
> On 2.11.21 г. 16:23, Chris Murphy wrote:
> > On Thu, Oct 28, 2021 at 1:36 AM Nikolay Borisov <nborisov@suse.com> wrote:
>
> <snip>
>
> >>>
> >>> So far this appears to be working well - thanks!
> >>> https://bugzilla.redhat.com/show_bug.cgi?id=2011928#c54
> >>
> >> Great, but due to the nature of the bug I'd rather wait at least until
> >> the beginning of next week before sending an official patch so that this
> >> can be tested more. In your comment you state 3/3 kernel debug info
> >> installs and 6/6 libreoffice installs; how do those numbers compare
> >> without the fix?
> >
> > More than 1/2 of the time there'd be an indefinite hang. Perhaps 1/3
> > of those would result in a call trace.
>
> As you might have seen, I did send a proper patch. If you've continued
> testing it over the weekend and still haven't encountered an issue, you
> can reply with a Tested-by to the patch.

Did that.

Also, I just noticed the downstream bug comment that another tester
has run the original patch for several days and can't reproduce the
problem.

But as a side note, without the patch they were experiencing
file system corruption, i.e. it would not mount following the crash.
Let me know if it's worth asking the tester for mount-time failure
kernel messages, or for a btrfs check of the corrupted filesystem. I
guess this race is expected to never manifest on x86?
https://bugzilla.redhat.com/show_bug.cgi?id=2011928#c55



-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 62+ messages in thread

* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
  2021-11-05 16:12                                                                   ` Chris Murphy
@ 2021-11-07  9:11                                                                     ` Nikolay Borisov
  0 siblings, 0 replies; 62+ messages in thread
From: Nikolay Borisov @ 2021-11-07  9:11 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Su Yue, Qu Wenruo, Qu Wenruo, Btrfs BTRFS



On 5.11.21 г. 18:12, Chris Murphy wrote:
> On Tue, Nov 2, 2021 at 10:25 AM Nikolay Borisov <nborisov@suse.com> wrote:
>>
>>
>>
>> On 2.11.21 г. 16:23, Chris Murphy wrote:
>>> On Thu, Oct 28, 2021 at 1:36 AM Nikolay Borisov <nborisov@suse.com> wrote:
>>
>> <snip>
>>
>>>>>
>>>>> So far this appears to be working well - thanks!
>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=2011928#c54
>>>>
>>>> Great, but due to the nature of the bug I'd rather wait at least until
>>>> the beginning of next week before sending an official patch so that this
>>>> can be tested more. In your comment you state 3/3 kernel debug info
>>>> installs and 6/6 libreoffice installs; how do those numbers compare
>>>> without the fix?
>>>
>>> More than 1/2 of the time there'd be an indefinite hang. Perhaps 1/3
>>> of those would result in a call trace.
>>
>> As you might have seen I did send a proper patch, if you've continued
>> testing it over the weekend and still haven't encountered an issue you
>> can reply with a Tested-by to the patch .
> 
> Did that.
> 
> Also, I just noticed the downstream bug comment that another tester
> has run the original patch for several days and can't reproduce the
> problem.
> 
> But the side note is that without the patch, they were experiencing
> file system corruption, i.e. it would not mount following the crash.
> Let me know if it's worth asking the tester for mount time failure
> kernel messages; or a btrfs check of the corrupted system. I guess

Sure, let's see if there's anything else stemming from this.

> this race is expected to never manifest on x86?

Yes, x86 is strongly ordered, so it doesn't need the barriers; hence
the issue doesn't exist there.
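
To illustrate (a userspace C11 analogue of the pairing the patch adds;
the struct and function names here are made up and this is not the
kernel code itself):

  #include <stdatomic.h>
  #include <stdbool.h>

  struct work_item {
          int payload;        /* stands in for whatever ->func() stores */
          atomic_bool done;   /* stands in for WORK_DONE_BIT            */
  };

  /* writer side, analogous to btrfs_work_helper() */
  static void mark_done(struct work_item *w, int v)
  {
          w->payload = v;                             /* plain store */
          atomic_thread_fence(memory_order_release);  /* ~ smp_mb__before_atomic() */
          atomic_store_explicit(&w->done, true, memory_order_relaxed);
  }

  /* reader side, analogous to run_ordered_work() */
  static bool try_consume(struct work_item *w, int *out)
  {
          if (!atomic_load_explicit(&w->done, memory_order_relaxed))
                  return false;
          atomic_thread_fence(memory_order_acquire);  /* ~ smp_rmb() */
          *out = w->payload;  /* now guaranteed to see the writer's store */
          return true;
  }

On x86 both fences compile down to nothing more than a compiler
barrier, which is why the unfixed code happens to work there.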

> https://bugzilla.redhat.com/show_bug.cgi?id=2011928#c55
> 
> 
> 

^ permalink raw reply	[flat|nested] 62+ messages in thread

end of thread, other threads:[~2021-11-07  9:11 UTC | newest]

Thread overview: 62+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-10-12  0:59 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper Chris Murphy
2021-10-12  5:25 ` Nikolay Borisov
2021-10-12  6:47   ` Qu Wenruo
2021-10-12 14:30     ` Chris Murphy
2021-10-12 21:24       ` Chris Murphy
2021-10-12 23:55       ` Qu Wenruo
2021-10-13 12:14         ` Chris Murphy
2021-10-13 12:18           ` Qu Wenruo
2021-10-13 12:27             ` Chris Murphy
2021-10-13 12:29               ` Nikolay Borisov
2021-10-13 12:43                 ` Chris Murphy
2021-10-13 12:46                   ` Nikolay Borisov
2021-10-13 12:55                     ` Chris Murphy
2021-10-13 19:21                       ` Chris Murphy
2021-10-18  1:57                         ` Chris Murphy
2021-10-18 11:32                           ` Su Yue
2021-10-18 13:28                             ` Qu Wenruo
2021-10-18 14:49                               ` Chris Murphy
2021-10-18 18:24                                 ` Chris Murphy
2021-10-19  1:24                                   ` Su Yue
2021-10-19 18:26                                     ` Chris Murphy
2021-10-19 23:42                                       ` Su Yue
2021-10-20  1:21                                         ` Qu Wenruo
2021-10-20  1:25                                         ` Chris Murphy
2021-10-20 23:55                                         ` Chris Murphy
2021-10-21  0:29                                           ` Su Yue
2021-10-21  0:37                                             ` Qu Wenruo
2021-10-21  0:46                                               ` Su Yue
2021-10-21 14:43                                             ` Chris Murphy
2021-10-21 14:48                                               ` Chris Murphy
2021-10-21 14:51                                                 ` Nikolay Borisov
2021-10-21 14:55                                                   ` Chris Murphy
2021-10-21 15:01                                                     ` Nikolay Borisov
2021-10-21 15:06                                                       ` Chris Murphy
2021-10-21 15:32                                                         ` Chris Murphy
2021-10-21 18:07                                                       ` Chris Murphy
2021-10-21  5:56                                           ` Nikolay Borisov
2021-10-22  2:36                                         ` Chris Murphy
2021-10-22  6:02                                           ` Nikolay Borisov
2021-10-22  6:17                                             ` Su Yue
2021-10-22 10:44                                           ` Nikolay Borisov
2021-10-22 11:43                                             ` Nikolay Borisov
2021-10-22 17:18                                               ` Chris Murphy
2021-10-23 10:09                                                 ` Nikolay Borisov
2021-10-25 14:48                                                   ` Chris Murphy
2021-10-25 18:34                                                     ` Chris Murphy
2021-10-25 19:40                                                       ` Chris Murphy
2021-10-26  7:14                                                         ` Nikolay Borisov
2021-10-26 12:51                                                           ` Chris Murphy
2021-10-26 13:05                                                             ` Nikolay Borisov
2021-10-26 18:08                                                               ` Chris Murphy
2021-10-26 18:14                                                                 ` Nikolay Borisov
2021-10-26 18:26                                                                   ` Chris Murphy
2021-10-26 18:31                                                                     ` Chris Murphy
2021-10-26 18:35                                                                       ` Nikolay Borisov
2021-10-27 18:22                                                           ` Chris Murphy
2021-10-28  5:36                                                             ` Nikolay Borisov
2021-11-02 14:23                                                               ` Chris Murphy
2021-11-02 14:25                                                                 ` Nikolay Borisov
2021-11-05 16:12                                                                   ` Chris Murphy
2021-11-07  9:11                                                                     ` Nikolay Borisov
2021-10-19  1:25                                   ` Qu Wenruo
