* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
2021-10-19 23:42 ` Su Yue
@ 2021-10-20 1:21 ` Qu Wenruo
2021-10-20 1:25 ` Chris Murphy
` (2 subsequent siblings)
3 siblings, 0 replies; 62+ messages in thread
From: Qu Wenruo @ 2021-10-20 1:21 UTC (permalink / raw)
To: Su Yue, Chris Murphy; +Cc: Nikolay Borisov, Qu Wenruo, Btrfs BTRFS
On 2021/10/20 07:42, Su Yue wrote:
>
> On Tue 19 Oct 2021 at 14:26, Chris Murphy <lists@colorremedies.com> wrote:
>
>> Still working on the kernel core dump and should have something soon
>> (I blew up the VM and had to start over); should I run the 'crash'
>> command on it afterward? Or upload the dump file to e.g. google drive?
>>
> Dump file and vmlinu[zx] kernel file are needed.
>
>> Also, I came across this ext4 issue happening on aarch64 (openstack
>> too), but I have no idea if it's related. And if so, whether it means
>> there's a common problem outside of btrfs?
>> https://github.com/coreos/fedora-coreos-tracker/issues/965
>>
> Already noticed the thing. Let's wait for the vmcore.
No idea at all.
In fact I'm not even familiar with kdump-based analysis, and would prefer
to manually add extra debugging output to make sure things are going as
expected.
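For example, something like this at the top of the crashing function
(only a sketch of the kind of instrumentation I mean, untested; the
printed fields are guessed from the oops, not from a reproduced setup):

static noinline void submit_compressed_extents(struct async_chunk *async_chunk)
{
	/* dump the async_chunk before anything dereferences it */
	pr_info("btrfs: async_chunk=%p inode=%p start=%llu end=%llu flags=%lu\n",
		async_chunk, async_chunk->inode, async_chunk->start,
		async_chunk->end, async_chunk->work.flags);
	...
}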
BTW, where can I find the compiler used for those pre-compiled kernels?
Currently I'm suspecting the toolchain as the root cause.
Thanks,
Qu
>
> Any idea, Qu?
>
> --
> Su
>> I mentioned this bug report up thread:
>> https://bugzilla.redhat.com/show_bug.cgi?id=1949334
>> but to summarize: it has the same btrfs call trace we've been looking
>> at in this email thread, but it's NOT on openstack, but actual
>> hardware (amberwing).
* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
2021-10-19 23:42 ` Su Yue
2021-10-20 1:21 ` Qu Wenruo
@ 2021-10-20 1:25 ` Chris Murphy
2021-10-20 23:55 ` Chris Murphy
2021-10-22 2:36 ` Chris Murphy
3 siblings, 0 replies; 62+ messages in thread
From: Chris Murphy @ 2021-10-20 1:25 UTC (permalink / raw)
To: Su Yue; +Cc: Chris Murphy, Qu Wenruo, Nikolay Borisov, Qu Wenruo, Btrfs BTRFS
On Tue, Oct 19, 2021 at 9:10 PM Su Yue <l@damenly.su> wrote:
>
>
> On Tue 19 Oct 2021 at 14:26, Chris Murphy
> <lists@colorremedies.com> wrote:
>
> > Still working on the kernel core dump and should have something
> > soon
> > (I blew up the VM and had to start over); should I run the
> > 'crash'
> > command on it afterward? Or upload the dump file to e.g. google
> > drive?
> >
> Dump file and vmlinu[zx] kernel file are needed.
>
> > Also, I came across this ext4 issue happening on aarch64
> > (openstack
> > too), but I have no idea if it's related. And if so, whether it
> > means
> > there's a common problem outside of btrfs?
> > https://github.com/coreos/fedora-coreos-tracker/issues/965
> >
> Already noticed the thing. Let's wait for the vmcore.
>
> Any idea, Qu?
>
So it's been compiling for multiple hours, while also doing a large
package installation for about an hour of that time, and still no oops
or kernel messages...
--
Chris Murphy
* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
2021-10-19 23:42 ` Su Yue
2021-10-20 1:21 ` Qu Wenruo
2021-10-20 1:25 ` Chris Murphy
@ 2021-10-20 23:55 ` Chris Murphy
2021-10-21 0:29 ` Su Yue
2021-10-21 5:56 ` Nikolay Borisov
2021-10-22 2:36 ` Chris Murphy
3 siblings, 2 replies; 62+ messages in thread
From: Chris Murphy @ 2021-10-20 23:55 UTC (permalink / raw)
To: Su Yue; +Cc: Chris Murphy, Qu Wenruo, Nikolay Borisov, Qu Wenruo, Btrfs BTRFS
On Tue, Oct 19, 2021 at 9:10 PM Su Yue <l@damenly.su> wrote:
>
> Dump file and vmlinu[zx] kernel file are needed.
So we get a splat but kdump doesn't create a vmcore. Do we need to
issue sysrq+c at the time of the hang and splat to create it?
Fedora Linux 35 (Cloud Edition)
Kernel 5.14.10-300.fc35.aarch64 on an aarch64 (ttyAMA0)
eth0: 199.204.45.141 2604:e100:1:0:f816:3eff:fe72:c876
dusty-35 login: [ 286.982605] Unable to handle kernel paging request
at virtual address fffffffffffffdd0
[ 286.988338] Mem abort info:
[ 286.990307] ESR = 0x96000004
[ 286.992596] EC = 0x25: DABT (current EL), IL = 32 bits
[ 286.996316] SET = 0, FnV = 0
[ 286.998454] EA = 0, S1PTW = 0
[ 287.000791] FSC = 0x04: level 0 translation fault
[ 287.004472] Data abort info:
[ 287.006540] ISV = 0, ISS = 0x00000004
[ 287.009239] CM = 0, WnR = 0
[ 287.011344] swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000000054181000
[ 287.018245] [fffffffffffffdd0] pgd=0000000000000000, p4d=0000000000000000
[ 287.024209] Internal error: Oops: 96000004 [#1] SMP
[ 287.027615] Modules linked in: virtio_gpu virtio_dma_buf
drm_kms_helper cec joydev fb_sys_fops syscopyarea virtio_net
sysfillrect sysimgblt net_failover virtio_balloon failover vfat fat
drm fuse zram ip_tables crct10dif_ce ghash_ce virtio_blk qemu_fw_cfg
virtio_mmio aes_neon_bs
[ 287.047659] CPU: 0 PID: 3558 Comm: kworker/u8:7 Kdump: loaded Not
tainted 5.14.10-300.fc35.aarch64 #1
[ 287.055269] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
[ 287.060932] Workqueue: btrfs-delalloc btrfs_work_helper
[ 287.065353] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO BTYPE=--)
[ 287.070568] pc : submit_compressed_extents+0x38/0x3d0
[ 287.074825] lr : async_cow_submit+0x50/0xd0
[ 287.078217] sp : ffff800015d4bc20
[ 287.081008] x29: ffff800015d4bc30 x28: 0000000000000000 x27: ffffb8a2fa941000
[ 287.087022] x26: fffffffffffffdd0 x25: dead000000000100 x24: ffff000115873608
[ 287.092822] x23: 0000000000000000 x22: 0000000000000001 x21: ffff0000c6f25800
[ 287.098591] x20: ffff0000c0596000 x19: 0000000000000001 x18: ffff0000c2100bd4
[ 287.104387] x17: ffff000115875ff8 x16: 0000000000000006 x15: 50006a3d10a961cd
[ 287.110159] x14: f0668b836620caa1 x13: 0000000000000020 x12: ffff0001fefa68c0
[ 287.116170] x11: ffffb8a2fa95b500 x10: 0000000000000000 x9 : ffffb8a2f9131c40
[ 287.122120] x8 : ffff475f045bb000 x7 : ffff800015d4bbe0 x6 : ffffb8a2fae8ad40
[ 287.128086] x5 : 0000000000000000 x4 : 0000000000000000 x3 : ffff0000c6f25820
[ 287.133953] x2 : 0000000000000000 x1 : ffff000115873630 x0 : ffff000115873630
[ 287.139760] Call trace:
[ 287.141784] submit_compressed_extents+0x38/0x3d0
[ 287.145620] async_cow_submit+0x50/0xd0
[ 287.148801] run_ordered_work+0xc8/0x280
[ 287.152005] btrfs_work_helper+0x98/0x250
[ 287.155450] process_one_work+0x1f0/0x4ac
[ 287.161577] worker_thread+0x188/0x504
[ 287.167461] kthread+0x110/0x114
[ 287.172872] ret_from_fork+0x10/0x18
[ 287.178558] Code: a9056bf9 f8428437 f9401400 d108c2fa (f9400356)
[ 287.186268] ---[ end trace 41ec405ced3786b6 ]---
--
Chris Murphy
* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
2021-10-20 23:55 ` Chris Murphy
@ 2021-10-21 0:29 ` Su Yue
2021-10-21 0:37 ` Qu Wenruo
2021-10-21 14:43 ` Chris Murphy
2021-10-21 5:56 ` Nikolay Borisov
1 sibling, 2 replies; 62+ messages in thread
From: Su Yue @ 2021-10-21 0:29 UTC (permalink / raw)
To: Chris Murphy; +Cc: Qu Wenruo, Nikolay Borisov, Qu Wenruo, Btrfs BTRFS
On Wed 20 Oct 2021 at 19:55, Chris Murphy
<lists@colorremedies.com> wrote:
> On Tue, Oct 19, 2021 at 9:10 PM Su Yue <l@damenly.su> wrote:
>>
>> Dump file and vmlinu[zx] kernel file are needed.
>
> So we get a splat but kdump doesn't create a vmcore. Do we need
> to
> issue sysrq+c at the time of the hang and splat to create it?
>
Yes, please.
BTW, I ran xfstests with 5.14.10-300.fc35.aarch64 and
5.14.12-200.fc34.aarch64 in several rounds. No panic/hang found,
so I think we can exclude the possibility of the toolchain.
--
Su
> Fedora Linux 35 (Cloud Edition)
> Kernel 5.14.10-300.fc35.aarch64 on an aarch64 (ttyAMA0)
>
> eth0: 199.204.45.141 2604:e100:1:0:f816:3eff:fe72:c876
> dusty-35 login: [ 286.982605] Unable to handle kernel paging
> request
> at virtual address fffffffffffffdd0
> [ 286.988338] Mem abort info:
> [ 286.990307] ESR = 0x96000004
> [ 286.992596] EC = 0x25: DABT (current EL), IL = 32 bits
> [ 286.996316] SET = 0, FnV = 0
> [ 286.998454] EA = 0, S1PTW = 0
> [ 287.000791] FSC = 0x04: level 0 translation fault
> [ 287.004472] Data abort info:
> [ 287.006540] ISV = 0, ISS = 0x00000004
> [ 287.009239] CM = 0, WnR = 0
> [ 287.011344] swapper pgtable: 4k pages, 48-bit VAs,
> pgdp=0000000054181000
> [ 287.018245] [fffffffffffffdd0] pgd=0000000000000000,
> p4d=0000000000000000
> [ 287.024209] Internal error: Oops: 96000004 [#1] SMP
> [ 287.027615] Modules linked in: virtio_gpu virtio_dma_buf
> drm_kms_helper cec joydev fb_sys_fops syscopyarea virtio_net
> sysfillrect sysimgblt net_failover virtio_balloon failover vfat
> fat
> drm fuse zram ip_tables crct10dif_ce ghash_ce virtio_blk
> qemu_fw_cfg
> virtio_mmio aes_neon_bs
> [ 287.047659] CPU: 0 PID: 3558 Comm: kworker/u8:7 Kdump: loaded
> Not
> tainted 5.14.10-300.fc35.aarch64 #1
> [ 287.055269] Hardware name: QEMU KVM Virtual Machine, BIOS
> 0.0.0 02/06/2015
> [ 287.060932] Workqueue: btrfs-delalloc btrfs_work_helper
> [ 287.065353] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO
> BTYPE=--)
> [ 287.070568] pc : submit_compressed_extents+0x38/0x3d0
> [ 287.074825] lr : async_cow_submit+0x50/0xd0
> [ 287.078217] sp : ffff800015d4bc20
> [ 287.081008] x29: ffff800015d4bc30 x28: 0000000000000000 x27:
> ffffb8a2fa941000
> [ 287.087022] x26: fffffffffffffdd0 x25: dead000000000100 x24:
> ffff000115873608
> [ 287.092822] x23: 0000000000000000 x22: 0000000000000001 x21:
> ffff0000c6f25800
> [ 287.098591] x20: ffff0000c0596000 x19: 0000000000000001 x18:
> ffff0000c2100bd4
> [ 287.104387] x17: ffff000115875ff8 x16: 0000000000000006 x15:
> 50006a3d10a961cd
> [ 287.110159] x14: f0668b836620caa1 x13: 0000000000000020 x12:
> ffff0001fefa68c0
> [ 287.116170] x11: ffffb8a2fa95b500 x10: 0000000000000000 x9 :
> ffffb8a2f9131c40
> [ 287.122120] x8 : ffff475f045bb000 x7 : ffff800015d4bbe0 x6 :
> ffffb8a2fae8ad40
> [ 287.128086] x5 : 0000000000000000 x4 : 0000000000000000 x3 :
> ffff0000c6f25820
> [ 287.133953] x2 : 0000000000000000 x1 : ffff000115873630 x0 :
> ffff000115873630
> [ 287.139760] Call trace:
> [ 287.141784] submit_compressed_extents+0x38/0x3d0
> [ 287.145620] async_cow_submit+0x50/0xd0
> [ 287.148801] run_ordered_work+0xc8/0x280
> [ 287.152005] btrfs_work_helper+0x98/0x250
> [ 287.155450] process_one_work+0x1f0/0x4ac
> [ 287.161577] worker_thread+0x188/0x504
> [ 287.167461] kthread+0x110/0x114
> [ 287.172872] ret_from_fork+0x10/0x18
> [ 287.178558] Code: a9056bf9 f8428437 f9401400 d108c2fa
> (f9400356)
> [ 287.186268] ---[ end trace 41ec405ced3786b6 ]---
* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
2021-10-21 0:29 ` Su Yue
@ 2021-10-21 0:37 ` Qu Wenruo
2021-10-21 0:46 ` Su Yue
2021-10-21 14:43 ` Chris Murphy
1 sibling, 1 reply; 62+ messages in thread
From: Qu Wenruo @ 2021-10-21 0:37 UTC (permalink / raw)
To: Su Yue, Chris Murphy; +Cc: Nikolay Borisov, Qu Wenruo, Btrfs BTRFS
On 2021/10/21 08:29, Su Yue wrote:
>
> On Wed 20 Oct 2021 at 19:55, Chris Murphy <lists@colorremedies.com> wrote:
>
>> On Tue, Oct 19, 2021 at 9:10 PM Su Yue <l@damenly.su> wrote:
>>>
>>> Dump file and vmlinu[zx] kernel file are needed.
>>
>> So we get a splat but kdump doesn't create a vmcore. Do we need to
>> issue sysrq+c at the time of the hang and splat to create it?
>>
> Yes, please.
>
> BTW, I ran xfstests with 5.14.10-300.fc35.aarch64 and
> 5.14.12-200.fc34.aarch64 in several rounds. No panic/hang found,
> so I think we can exclude the possibility of the toolchain.
Or this can also mean, fstests is not enough to trigger it?
Thanks,
Qu
>
> --
> Su
>
>> Fedora Linux 35 (Cloud Edition)
>> Kernel 5.14.10-300.fc35.aarch64 on an aarch64 (ttyAMA0)
>>
>> eth0: 199.204.45.141 2604:e100:1:0:f816:3eff:fe72:c876
>> dusty-35 login: [ 286.982605] Unable to handle kernel paging request
>> at virtual address fffffffffffffdd0
>> [ 286.988338] Mem abort info:
>> [ 286.990307] ESR = 0x96000004
>> [ 286.992596] EC = 0x25: DABT (current EL), IL = 32 bits
>> [ 286.996316] SET = 0, FnV = 0
>> [ 286.998454] EA = 0, S1PTW = 0
>> [ 287.000791] FSC = 0x04: level 0 translation fault
>> [ 287.004472] Data abort info:
>> [ 287.006540] ISV = 0, ISS = 0x00000004
>> [ 287.009239] CM = 0, WnR = 0
>> [ 287.011344] swapper pgtable: 4k pages, 48-bit VAs,
>> pgdp=0000000054181000
>> [ 287.018245] [fffffffffffffdd0] pgd=0000000000000000,
>> p4d=0000000000000000
>> [ 287.024209] Internal error: Oops: 96000004 [#1] SMP
>> [ 287.027615] Modules linked in: virtio_gpu virtio_dma_buf
>> drm_kms_helper cec joydev fb_sys_fops syscopyarea virtio_net
>> sysfillrect sysimgblt net_failover virtio_balloon failover vfat fat
>> drm fuse zram ip_tables crct10dif_ce ghash_ce virtio_blk qemu_fw_cfg
>> virtio_mmio aes_neon_bs
>> [ 287.047659] CPU: 0 PID: 3558 Comm: kworker/u8:7 Kdump: loaded Not
>> tainted 5.14.10-300.fc35.aarch64 #1
>> [ 287.055269] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0
>> 02/06/2015
>> [ 287.060932] Workqueue: btrfs-delalloc btrfs_work_helper
>> [ 287.065353] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO BTYPE=--)
>> [ 287.070568] pc : submit_compressed_extents+0x38/0x3d0
>> [ 287.074825] lr : async_cow_submit+0x50/0xd0
>> [ 287.078217] sp : ffff800015d4bc20
>> [ 287.081008] x29: ffff800015d4bc30 x28: 0000000000000000 x27:
>> ffffb8a2fa941000
>> [ 287.087022] x26: fffffffffffffdd0 x25: dead000000000100 x24:
>> ffff000115873608
>> [ 287.092822] x23: 0000000000000000 x22: 0000000000000001 x21:
>> ffff0000c6f25800
>> [ 287.098591] x20: ffff0000c0596000 x19: 0000000000000001 x18:
>> ffff0000c2100bd4
>> [ 287.104387] x17: ffff000115875ff8 x16: 0000000000000006 x15:
>> 50006a3d10a961cd
>> [ 287.110159] x14: f0668b836620caa1 x13: 0000000000000020 x12:
>> ffff0001fefa68c0
>> [ 287.116170] x11: ffffb8a2fa95b500 x10: 0000000000000000 x9 :
>> ffffb8a2f9131c40
>> [ 287.122120] x8 : ffff475f045bb000 x7 : ffff800015d4bbe0 x6 :
>> ffffb8a2fae8ad40
>> [ 287.128086] x5 : 0000000000000000 x4 : 0000000000000000 x3 :
>> ffff0000c6f25820
>> [ 287.133953] x2 : 0000000000000000 x1 : ffff000115873630 x0 :
>> ffff000115873630
>> [ 287.139760] Call trace:
>> [ 287.141784] submit_compressed_extents+0x38/0x3d0
>> [ 287.145620] async_cow_submit+0x50/0xd0
>> [ 287.148801] run_ordered_work+0xc8/0x280
>> [ 287.152005] btrfs_work_helper+0x98/0x250
>> [ 287.155450] process_one_work+0x1f0/0x4ac
>> [ 287.161577] worker_thread+0x188/0x504
>> [ 287.167461] kthread+0x110/0x114
>> [ 287.172872] ret_from_fork+0x10/0x18
>> [ 287.178558] Code: a9056bf9 f8428437 f9401400 d108c2fa (f9400356)
>> [ 287.186268] ---[ end trace 41ec405ced3786b6 ]---
* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
2021-10-21 0:37 ` Qu Wenruo
@ 2021-10-21 0:46 ` Su Yue
0 siblings, 0 replies; 62+ messages in thread
From: Su Yue @ 2021-10-21 0:46 UTC (permalink / raw)
To: Qu Wenruo; +Cc: Chris Murphy, Nikolay Borisov, Qu Wenruo, Btrfs BTRFS
On Thu 21 Oct 2021 at 08:37, Qu Wenruo <quwenruo.btrfs@gmx.com>
wrote:
> On 2021/10/21 08:29, Su Yue wrote:
>>
>> On Wed 20 Oct 2021 at 19:55, Chris Murphy
>> <lists@colorremedies.com> wrote:
>>
>>> On Tue, Oct 19, 2021 at 9:10 PM Su Yue <l@damenly.su> wrote:
>>>>
>>>> Dump file and vmlinu[zx] kernel file are needed.
>>>
>>> So we get a splat but kdump doesn't create a vmcore. Do we
>>> need to
>>> issue sysrq+c at the time of the hang and splat to create it?
>>>
>> Yes, please.
>>
>> BTW, I ran xfstests with 5.14.10-300.fc35.aarch64 and
>> 5.14.12-200.fc34.aarch64 in several rounds. No panic/hang
>> found,
>> so I think we can exclude the possibility of the toolchain.
>
> Or this can also mean, fstests is not enough to trigger it?
>
Right...Can't deny the possibility without any evidence for now.
--
Su
> Thanks,
> Qu
>
>>
>> --
>> Su
>>
>>> Fedora Linux 35 (Cloud Edition)
>>> Kernel 5.14.10-300.fc35.aarch64 on an aarch64 (ttyAMA0)
>>>
>>> eth0: 199.204.45.141 2604:e100:1:0:f816:3eff:fe72:c876
>>> dusty-35 login: [ 286.982605] Unable to handle kernel paging
>>> request
>>> at virtual address fffffffffffffdd0
>>> [ 286.988338] Mem abort info:
>>> [ 286.990307] ESR = 0x96000004
>>> [ 286.992596] EC = 0x25: DABT (current EL), IL = 32 bits
>>> [ 286.996316] SET = 0, FnV = 0
>>> [ 286.998454] EA = 0, S1PTW = 0
>>> [ 287.000791] FSC = 0x04: level 0 translation fault
>>> [ 287.004472] Data abort info:
>>> [ 287.006540] ISV = 0, ISS = 0x00000004
>>> [ 287.009239] CM = 0, WnR = 0
>>> [ 287.011344] swapper pgtable: 4k pages, 48-bit VAs,
>>> pgdp=0000000054181000
>>> [ 287.018245] [fffffffffffffdd0] pgd=0000000000000000,
>>> p4d=0000000000000000
>>> [ 287.024209] Internal error: Oops: 96000004 [#1] SMP
>>> [ 287.027615] Modules linked in: virtio_gpu virtio_dma_buf
>>> drm_kms_helper cec joydev fb_sys_fops syscopyarea virtio_net
>>> sysfillrect sysimgblt net_failover virtio_balloon failover
>>> vfat fat
>>> drm fuse zram ip_tables crct10dif_ce ghash_ce virtio_blk
>>> qemu_fw_cfg
>>> virtio_mmio aes_neon_bs
>>> [ 287.047659] CPU: 0 PID: 3558 Comm: kworker/u8:7 Kdump:
>>> loaded Not
>>> tainted 5.14.10-300.fc35.aarch64 #1
>>> [ 287.055269] Hardware name: QEMU KVM Virtual Machine, BIOS
>>> 0.0.0
>>> 02/06/2015
>>> [ 287.060932] Workqueue: btrfs-delalloc btrfs_work_helper
>>> [ 287.065353] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO
>>> BTYPE=--)
>>> [ 287.070568] pc : submit_compressed_extents+0x38/0x3d0
>>> [ 287.074825] lr : async_cow_submit+0x50/0xd0
>>> [ 287.078217] sp : ffff800015d4bc20
>>> [ 287.081008] x29: ffff800015d4bc30 x28: 0000000000000000
>>> x27:
>>> ffffb8a2fa941000
>>> [ 287.087022] x26: fffffffffffffdd0 x25: dead000000000100
>>> x24:
>>> ffff000115873608
>>> [ 287.092822] x23: 0000000000000000 x22: 0000000000000001
>>> x21:
>>> ffff0000c6f25800
>>> [ 287.098591] x20: ffff0000c0596000 x19: 0000000000000001
>>> x18:
>>> ffff0000c2100bd4
>>> [ 287.104387] x17: ffff000115875ff8 x16: 0000000000000006
>>> x15:
>>> 50006a3d10a961cd
>>> [ 287.110159] x14: f0668b836620caa1 x13: 0000000000000020
>>> x12:
>>> ffff0001fefa68c0
>>> [ 287.116170] x11: ffffb8a2fa95b500 x10: 0000000000000000 x9
>>> :
>>> ffffb8a2f9131c40
>>> [ 287.122120] x8 : ffff475f045bb000 x7 : ffff800015d4bbe0 x6
>>> :
>>> ffffb8a2fae8ad40
>>> [ 287.128086] x5 : 0000000000000000 x4 : 0000000000000000 x3
>>> :
>>> ffff0000c6f25820
>>> [ 287.133953] x2 : 0000000000000000 x1 : ffff000115873630 x0
>>> :
>>> ffff000115873630
>>> [ 287.139760] Call trace:
>>> [ 287.141784] submit_compressed_extents+0x38/0x3d0
>>> [ 287.145620] async_cow_submit+0x50/0xd0
>>> [ 287.148801] run_ordered_work+0xc8/0x280
>>> [ 287.152005] btrfs_work_helper+0x98/0x250
>>> [ 287.155450] process_one_work+0x1f0/0x4ac
>>> [ 287.161577] worker_thread+0x188/0x504
>>> [ 287.167461] kthread+0x110/0x114
>>> [ 287.172872] ret_from_fork+0x10/0x18
>>> [ 287.178558] Code: a9056bf9 f8428437 f9401400 d108c2fa
>>> (f9400356)
>>> [ 287.186268] ---[ end trace 41ec405ced3786b6 ]---
* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
2021-10-21 0:29 ` Su Yue
2021-10-21 0:37 ` Qu Wenruo
@ 2021-10-21 14:43 ` Chris Murphy
2021-10-21 14:48 ` Chris Murphy
1 sibling, 1 reply; 62+ messages in thread
From: Chris Murphy @ 2021-10-21 14:43 UTC (permalink / raw)
To: Su Yue; +Cc: Chris Murphy, Qu Wenruo, Nikolay Borisov, Qu Wenruo, Btrfs BTRFS
On Wed, Oct 20, 2021 at 8:34 PM Su Yue <l@damenly.su> wrote:
>
>
> On Wed 20 Oct 2021 at 19:55, Chris Murphy
> <lists@colorremedies.com> wrote:
>
> > On Tue, Oct 19, 2021 at 9:10 PM Su Yue <l@damenly.su> wrote:
> >>
> >> Dump file and vmlinu[zx] kernel file are needed.
> >
> > So we get a splat but kdump doesn't create a vmcore. Do we need
> > to
> > issue sysrq+c at the time of the hang and splat to create it?
> >
> Yes, please.
>
> BTW, I ran xfstests with 5.14.10-300.fc35.aarch64 and
> 5.14.12-200.fc34.aarch64 in several rounds. No panic/hang found,
> so I think we can exclude the possibility of the toolchain.
It's really weird. I was given a vexxhost aarch64 VM to play in and
try to get a vmcore for you guys, but nothing I did triggered the
splat. Then a colleague tried it, same hosting company, and was able
to reproduce it almost immediately. Same distro and kernel. So I don't
know what that means; maybe the provisioning of the VM can land on
different hardware, and it's some aspect of the hardware that's
resulting in this issue.
But anyway, he will be able to get a kernel core dump soon, and maybe
that'll tell us what's going on.
--
Chris Murphy
* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
2021-10-21 14:43 ` Chris Murphy
@ 2021-10-21 14:48 ` Chris Murphy
2021-10-21 14:51 ` Nikolay Borisov
0 siblings, 1 reply; 62+ messages in thread
From: Chris Murphy @ 2021-10-21 14:48 UTC (permalink / raw)
To: Chris Murphy; +Cc: Su Yue, Qu Wenruo, Nikolay Borisov, Qu Wenruo, Btrfs BTRFS
[ 287.139760] Call trace:
[ 287.141784] submit_compressed_extents+0x38/0x3d0
[ 287.145620] async_cow_submit+0x50/0xd0
[ 287.148801] run_ordered_work+0xc8/0x280
[ 287.152005] btrfs_work_helper+0x98/0x250
[ 287.155450] process_one_work+0x1f0/0x4ac
[ 287.161577] worker_thread+0x188/0x504
[ 287.167461] kthread+0x110/0x114
[ 287.172872] ret_from_fork+0x10/0x18
[ 287.178558] Code: a9056bf9 f8428437 f9401400 d108c2fa (f9400356)
[ 287.186268] ---[ end trace 41ec405ced3786b6 ]---
[61620.974232] audit: audit_backlog=2976 > audit_backlog_limit=64
[61620.978698] audit: audit_lost=1 audit_rate_limit=0 audit_backlog_limit=64
So it's now at least 17 hours since the splat. Is it worth sysrq+c
this long after? Or should I set it up like Nikolay suggests with
kernel.panic_on_warn = 1? Maybe I should also put /var/crash on XFS to
avoid problems dumping the kernel core file?
--
Chris Murphy
* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
2021-10-21 14:48 ` Chris Murphy
@ 2021-10-21 14:51 ` Nikolay Borisov
2021-10-21 14:55 ` Chris Murphy
0 siblings, 1 reply; 62+ messages in thread
From: Nikolay Borisov @ 2021-10-21 14:51 UTC (permalink / raw)
To: Chris Murphy; +Cc: Su Yue, Qu Wenruo, Qu Wenruo, Btrfs BTRFS
On 21.10.21 г. 17:48, Chris Murphy wrote:
> [ 287.139760] Call trace:
> [ 287.141784] submit_compressed_extents+0x38/0x3d0
> [ 287.145620] async_cow_submit+0x50/0xd0
> [ 287.148801] run_ordered_work+0xc8/0x280
> [ 287.152005] btrfs_work_helper+0x98/0x250
> [ 287.155450] process_one_work+0x1f0/0x4ac
> [ 287.161577] worker_thread+0x188/0x504
> [ 287.167461] kthread+0x110/0x114
> [ 287.172872] ret_from_fork+0x10/0x18
> [ 287.178558] Code: a9056bf9 f8428437 f9401400 d108c2fa (f9400356)
> [ 287.186268] ---[ end trace 41ec405ced3786b6 ]---
> [61620.974232] audit: audit_backlog=2976 > audit_backlog_limit=64
> [61620.978698] audit: audit_lost=1 audit_rate_limit=0 audit_backlog_limit=64
>
>
> So it's at least 17 hours later since the splat. Is it worth sysrq+c
> now this long after? Or should I set it up like Nikolay suggests with
> kernel.panic_on_warn = 1? Maybe I should also put /var/crash on XFS to
> avoid problems dumping the kernel core file?
Doing sysrq+c would not have yielded any useful information; it was a red
herring. In order to have actionable information the core dump needs to
be initiated from the offending context, which means either having a BUG_ON
or a WARN which triggers the panic.
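E.g. something like this in the suspect function (a sketch, assuming
kernel.panic_on_warn=1 is set; note WARN_ON() evaluates to its
condition, so it can gate a bail-out as well):

	/*
	 * With panic_on_warn=1 this WARN escalates to a panic, so kdump
	 * fires from this context while the bad pointer is still live.
	 */
	if (WARN_ON(!async_chunk->inode))
		return;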
>
> --
> Chris Murphy
>
* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
2021-10-21 14:51 ` Nikolay Borisov
@ 2021-10-21 14:55 ` Chris Murphy
2021-10-21 15:01 ` Nikolay Borisov
0 siblings, 1 reply; 62+ messages in thread
From: Chris Murphy @ 2021-10-21 14:55 UTC (permalink / raw)
To: Nikolay Borisov; +Cc: Chris Murphy, Su Yue, Qu Wenruo, Qu Wenruo, Btrfs BTRFS
On Thu, Oct 21, 2021 at 10:51 AM Nikolay Borisov <nborisov@suse.com> wrote:
>
>
>
> On 21.10.21 г. 17:48, Chris Murphy wrote:
> > [ 287.139760] Call trace:
> > [ 287.141784] submit_compressed_extents+0x38/0x3d0
> > [ 287.145620] async_cow_submit+0x50/0xd0
> > [ 287.148801] run_ordered_work+0xc8/0x280
> > [ 287.152005] btrfs_work_helper+0x98/0x250
> > [ 287.155450] process_one_work+0x1f0/0x4ac
> > [ 287.161577] worker_thread+0x188/0x504
> > [ 287.167461] kthread+0x110/0x114
> > [ 287.172872] ret_from_fork+0x10/0x18
> > [ 287.178558] Code: a9056bf9 f8428437 f9401400 d108c2fa (f9400356)
> > [ 287.186268] ---[ end trace 41ec405ced3786b6 ]---
> > [61620.974232] audit: audit_backlog=2976 > audit_backlog_limit=64
> > [61620.978698] audit: audit_lost=1 audit_rate_limit=0 audit_backlog_limit=64
> >
> >
> > So it's now at least 17 hours since the splat. Is it worth sysrq+c
> > this long after? Or should I set it up like Nikolay suggests with
> > kernel.panic_on_warn = 1? Maybe I should also put /var/crash on XFS to
> > avoid problems dumping the kernel core file?
>
> Doing sysrq+c would not have yielded any useful information; it was a red
> herring. In order to have actionable information the core dump needs to
> be initiated from the offending context, which means either having a BUG_ON
> or a WARN which triggers the panic.
OK so I'll put /var/crash on XFS and set kernel.panic_on_warn = 1 and
try to reproduce the problem; and hopefully that triggers kdump.
--
Chris Murphy
* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
2021-10-21 14:55 ` Chris Murphy
@ 2021-10-21 15:01 ` Nikolay Borisov
2021-10-21 15:06 ` Chris Murphy
2021-10-21 18:07 ` Chris Murphy
0 siblings, 2 replies; 62+ messages in thread
From: Nikolay Borisov @ 2021-10-21 15:01 UTC (permalink / raw)
To: Chris Murphy; +Cc: Su Yue, Qu Wenruo, Qu Wenruo, Btrfs BTRFS
On 21.10.21 г. 17:55, Chris Murphy wrote:
> On Thu, Oct 21, 2021 at 10:51 AM Nikolay Borisov <nborisov@suse.com> wrote:
>>
>>
>>
>> On 21.10.21 г. 17:48, Chris Murphy wrote:
>>> [ 287.139760] Call trace:
>>> [ 287.141784] submit_compressed_extents+0x38/0x3d0
>>> [ 287.145620] async_cow_submit+0x50/0xd0
>>> [ 287.148801] run_ordered_work+0xc8/0x280
>>> [ 287.152005] btrfs_work_helper+0x98/0x250
>>> [ 287.155450] process_one_work+0x1f0/0x4ac
>>> [ 287.161577] worker_thread+0x188/0x504
>>> [ 287.167461] kthread+0x110/0x114
>>> [ 287.172872] ret_from_fork+0x10/0x18
>>> [ 287.178558] Code: a9056bf9 f8428437 f9401400 d108c2fa (f9400356)
>>> [ 287.186268] ---[ end trace 41ec405ced3786b6 ]---
>>> [61620.974232] audit: audit_backlog=2976 > audit_backlog_limit=64
>>> [61620.978698] audit: audit_lost=1 audit_rate_limit=0 audit_backlog_limit=64
>>>
>>>
>>> So it's now at least 17 hours since the splat. Is it worth sysrq+c
>>> this long after? Or should I set it up like Nikolay suggests with
>>> kernel.panic_on_warn = 1? Maybe I should also put /var/crash on XFS to
>>> avoid problems dumping the kernel core file?
>>
>> Doing sysrq+c would not have yielded any useful information; it was a red
>> herring. In order to have actionable information the core dump needs to
>> be initiated from the offending context, which means either having a BUG_ON
>> or a WARN which triggers the panic.
>
>
> OK so I'll put /var/crash on XFS and set kernel.panic_on_warn = 1 and
> try to reproduce the problem; and hopefully that triggers kdump.
Just to be clear, when you initiate a crash with sysrq+c does it capture
a crashdump? That's the basic test that needs to pass in order to ensure
kdump works as expected.
>
>
* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
2021-10-21 15:01 ` Nikolay Borisov
@ 2021-10-21 15:06 ` Chris Murphy
2021-10-21 15:32 ` Chris Murphy
2021-10-21 18:07 ` Chris Murphy
1 sibling, 1 reply; 62+ messages in thread
From: Chris Murphy @ 2021-10-21 15:06 UTC (permalink / raw)
To: Nikolay Borisov; +Cc: Chris Murphy, Su Yue, Qu Wenruo, Qu Wenruo, Btrfs BTRFS
On Thu, Oct 21, 2021 at 11:01 AM Nikolay Borisov <nborisov@suse.com> wrote:
>
> Just to be clear, when you initiate a crash with sysrq+c does it capture
> a crashdump? That's the basic test that needs to pass in order to ensure
> kdump works as expected.
Yes it does.
--
Chris Murphy
* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
2021-10-21 15:01 ` Nikolay Borisov
2021-10-21 15:06 ` Chris Murphy
@ 2021-10-21 18:07 ` Chris Murphy
1 sibling, 0 replies; 62+ messages in thread
From: Chris Murphy @ 2021-10-21 18:07 UTC (permalink / raw)
To: Nikolay Borisov; +Cc: Chris Murphy, Su Yue, Qu Wenruo, Qu Wenruo, Btrfs BTRFS
[fedora@dusty-353 ~]$ sudo sysctl -n kernel.panic_on_warn
1
I get the oops, but no kdump activity at all.
Oct 21 17:35:54 dusty-353.novalocal kernel: Unable to handle kernel
paging request at virtual address fffffffffffffdd0
Oct 21 17:35:54 dusty-353.novalocal kernel: Mem abort info:
Oct 21 17:35:54 dusty-353.novalocal kernel: ESR = 0x96000004
Oct 21 17:35:54 dusty-353.novalocal kernel: EC = 0x25: DABT (current
EL), IL = 32 bits
Oct 21 17:35:54 dusty-353.novalocal kernel: SET = 0, FnV = 0
Oct 21 17:35:54 dusty-353.novalocal kernel: EA = 0, S1PTW = 0
Oct 21 17:35:54 dusty-353.novalocal kernel: FSC = 0x04: level 0
translation fault
Oct 21 17:35:54 dusty-353.novalocal kernel: Data abort info:
Oct 21 17:35:54 dusty-353.novalocal kernel: ISV = 0, ISS = 0x00000004
Oct 21 17:35:54 dusty-353.novalocal kernel: CM = 0, WnR = 0
Oct 21 17:35:54 dusty-353.novalocal kernel: swapper pgtable: 4k pages,
48-bit VAs, pgdp=0000000125461000
Oct 21 17:35:54 dusty-353.novalocal kernel: [fffffffffffffdd0]
pgd=0000000000000000, p4d=0000000000000000
Oct 21 17:35:54 dusty-353.novalocal kernel: Internal error: Oops:
96000004 [#1] SMP
Oct 21 17:35:54 dusty-353.novalocal kernel: Modules linked in:
binfmt_misc virtio_gpu virtio_dma_buf drm_kms_helper joydev cec
fb_sys_fops syscopyarea virtio_net sysfillrect sysimgblt
virtio_balloon net_failover failover vfat fat xfs drm fuse zram
ip_tables crct10dif_ce ghash_ce virtio_blk qemu_fw_cfg virtio_mmio
aes_neon_bs
Oct 21 17:35:54 dusty-353.novalocal kernel: CPU: 1 PID: 4392 Comm:
kworker/u8:12 Kdump: loaded Not tainted 5.14.10-300.fc35.aarch64 #1
Oct 21 17:35:54 dusty-353.novalocal kernel: Hardware name: QEMU KVM
Virtual Machine, BIOS 0.0.0 02/06/2015
Oct 21 17:35:54 dusty-353.novalocal kernel: Workqueue: btrfs-delalloc
btrfs_work_helper
Oct 21 17:35:54 dusty-353.novalocal kernel: pstate: 80400005 (Nzcv
daif +PAN -UAO -TCO BTYPE=--)
Oct 21 17:35:54 dusty-353.novalocal kernel: pc :
submit_compressed_extents+0x38/0x3d0
Oct 21 17:35:54 dusty-353.novalocal kernel: lr : async_cow_submit+0x50/0xd0
Oct 21 17:35:54 dusty-353.novalocal kernel: sp : ffff800010d6bc20
Oct 21 17:35:54 dusty-353.novalocal kernel: x29: ffff800010d6bc30 x28:
0000000000000000 x27: ffffbb96c7421000
Oct 21 17:35:54 dusty-353.novalocal kernel: x26: fffffffffffffdd0 x25:
dead000000000100 x24: ffff00012f950408
Oct 21 17:35:54 dusty-353.novalocal kernel: x23: 0000000000000000 x22:
0000000000000001 x21: ffff0000c07e1f80
Oct 21 17:35:54 dusty-353.novalocal kernel: x20: ffff0000c5af0000 x19:
0000000000000001 x18: ffff0000c2500bd4
Oct 21 17:35:54 dusty-353.novalocal kernel: x17: ffff00012fa0eff8 x16:
0000000000000006 x15: bd47b4a638083142
Oct 21 17:35:54 dusty-353.novalocal kernel: x14: ab8f4df43188bcf5 x13:
0000000000000020 x12: ffff0001fefa78c0
Oct 21 17:35:54 dusty-353.novalocal kernel: x11: ffffbb96c743b500 x10:
0000000000000000 x9 : ffffbb96c5c11c40
Oct 21 17:35:54 dusty-353.novalocal kernel: x8 : ffff446b37afd000 x7 :
ffff800010d6bbe0 x6 : ffffbb96c6c11000
Oct 21 17:35:54 dusty-353.novalocal kernel: x5 : 0000000000000000 x4 :
0000000000000000 x3 : ffff0000c07e1fa0
Oct 21 17:35:54 dusty-353.novalocal kernel: x2 : 0000000000000000 x1 :
ffff00012f950430 x0 : ffff00012f950430
Oct 21 17:35:54 dusty-353.novalocal kernel: Call trace:
Oct 21 17:35:54 dusty-353.novalocal kernel:
submit_compressed_extents+0x38/0x3d0
Oct 21 17:35:54 dusty-353.novalocal kernel: async_cow_submit+0x50/0xd0
Oct 21 17:35:54 dusty-353.novalocal kernel: run_ordered_work+0xc8/0x280
Oct 21 17:35:54 dusty-353.novalocal kernel: btrfs_work_helper+0x98/0x250
Oct 21 17:35:54 dusty-353.novalocal kernel: process_one_work+0x1f0/0x4ac
Oct 21 17:35:54 dusty-353.novalocal kernel: worker_thread+0x188/0x504
Oct 21 17:35:54 dusty-353.novalocal kernel: kthread+0x110/0x114
Oct 21 17:35:54 dusty-353.novalocal kernel: ret_from_fork+0x10/0x18
Oct 21 17:35:54 dusty-353.novalocal kernel: Code: a9056bf9 f8428437
f9401400 d108c2fa (f9400356)
Oct 21 17:35:54 dusty-353.novalocal kernel: ---[ end trace 718fed28301aa13b ]---
Whereas sysrq+c does create a kdump file...
--
Chris Murphy
* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
2021-10-20 23:55 ` Chris Murphy
2021-10-21 0:29 ` Su Yue
@ 2021-10-21 5:56 ` Nikolay Borisov
1 sibling, 0 replies; 62+ messages in thread
From: Nikolay Borisov @ 2021-10-21 5:56 UTC (permalink / raw)
To: Chris Murphy, Su Yue; +Cc: Qu Wenruo, Qu Wenruo, Btrfs BTRFS
On 21.10.21 г. 2:55, Chris Murphy wrote:
> On Tue, Oct 19, 2021 at 9:10 PM Su Yue <l@damenly.su> wrote:
>>
>> Dump file and vmlinu[zx] kernel file are needed.
>
> So we get a splat but kdump doesn't create a vmcore. Do we need to
> issue sysrq+c at the time of the hang and splat to create it?
Alternatively you can set the following sysctl to 1;
kernel.panic_on_warn = 1
>
> Fedora Linux 35 (Cloud Edition)
> Kernel 5.14.10-300.fc35.aarch64 on an aarch64 (ttyAMA0)
>
> eth0: 199.204.45.141 2604:e100:1:0:f816:3eff:fe72:c876
> dusty-35 login: [ 286.982605] Unable to handle kernel paging request
> at virtual address fffffffffffffdd0
> [ 286.988338] Mem abort info:
> [ 286.990307] ESR = 0x96000004
> [ 286.992596] EC = 0x25: DABT (current EL), IL = 32 bits
> [ 286.996316] SET = 0, FnV = 0
> [ 286.998454] EA = 0, S1PTW = 0
> [ 287.000791] FSC = 0x04: level 0 translation fault
> [ 287.004472] Data abort info:
> [ 287.006540] ISV = 0, ISS = 0x00000004
> [ 287.009239] CM = 0, WnR = 0
> [ 287.011344] swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000000054181000
> [ 287.018245] [fffffffffffffdd0] pgd=0000000000000000, p4d=0000000000000000
> [ 287.024209] Internal error: Oops: 96000004 [#1] SMP
> [ 287.027615] Modules linked in: virtio_gpu virtio_dma_buf
> drm_kms_helper cec joydev fb_sys_fops syscopyarea virtio_net
> sysfillrect sysimgblt net_failover virtio_balloon failover vfat fat
> drm fuse zram ip_tables crct10dif_ce ghash_ce virtio_blk qemu_fw_cfg
> virtio_mmio aes_neon_bs
> [ 287.047659] CPU: 0 PID: 3558 Comm: kworker/u8:7 Kdump: loaded Not
> tainted 5.14.10-300.fc35.aarch64 #1
> [ 287.055269] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
> [ 287.060932] Workqueue: btrfs-delalloc btrfs_work_helper
> [ 287.065353] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO BTYPE=--)
> [ 287.070568] pc : submit_compressed_extents+0x38/0x3d0
> [ 287.074825] lr : async_cow_submit+0x50/0xd0
> [ 287.078217] sp : ffff800015d4bc20
> [ 287.081008] x29: ffff800015d4bc30 x28: 0000000000000000 x27: ffffb8a2fa941000
> [ 287.087022] x26: fffffffffffffdd0 x25: dead000000000100 x24: ffff000115873608
> [ 287.092822] x23: 0000000000000000 x22: 0000000000000001 x21: ffff0000c6f25800
> [ 287.098591] x20: ffff0000c0596000 x19: 0000000000000001 x18: ffff0000c2100bd4
> [ 287.104387] x17: ffff000115875ff8 x16: 0000000000000006 x15: 50006a3d10a961cd
> [ 287.110159] x14: f0668b836620caa1 x13: 0000000000000020 x12: ffff0001fefa68c0
> [ 287.116170] x11: ffffb8a2fa95b500 x10: 0000000000000000 x9 : ffffb8a2f9131c40
> [ 287.122120] x8 : ffff475f045bb000 x7 : ffff800015d4bbe0 x6 : ffffb8a2fae8ad40
> [ 287.128086] x5 : 0000000000000000 x4 : 0000000000000000 x3 : ffff0000c6f25820
> [ 287.133953] x2 : 0000000000000000 x1 : ffff000115873630 x0 : ffff000115873630
> [ 287.139760] Call trace:
> [ 287.141784] submit_compressed_extents+0x38/0x3d0
> [ 287.145620] async_cow_submit+0x50/0xd0
> [ 287.148801] run_ordered_work+0xc8/0x280
> [ 287.152005] btrfs_work_helper+0x98/0x250
> [ 287.155450] process_one_work+0x1f0/0x4ac
> [ 287.161577] worker_thread+0x188/0x504
> [ 287.167461] kthread+0x110/0x114
> [ 287.172872] ret_from_fork+0x10/0x18
> [ 287.178558] Code: a9056bf9 f8428437 f9401400 d108c2fa (f9400356)
> [ 287.186268] ---[ end trace 41ec405ced3786b6 ]---
>
>
>
* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
2021-10-19 23:42 ` Su Yue
` (2 preceding siblings ...)
2021-10-20 23:55 ` Chris Murphy
@ 2021-10-22 2:36 ` Chris Murphy
2021-10-22 6:02 ` Nikolay Borisov
2021-10-22 10:44 ` Nikolay Borisov
3 siblings, 2 replies; 62+ messages in thread
From: Chris Murphy @ 2021-10-22 2:36 UTC (permalink / raw)
To: Su Yue; +Cc: Chris Murphy, Qu Wenruo, Nikolay Borisov, Qu Wenruo, Btrfs BTRFS
OK I have a vmcore file:
https://dustymabe.fedorapeople.org/bz2011928-vmcore/
lib/modules/5.14.10-300.fc35.aarch64/vmlinuz
https://drive.google.com/file/d/1xXM8XGRi_Wzyupbm4MSNteF0rwUzO4GE/view?usp=sharing
--
Chris Murphy
* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
2021-10-22 2:36 ` Chris Murphy
@ 2021-10-22 6:02 ` Nikolay Borisov
2021-10-22 6:17 ` Su Yue
2021-10-22 10:44 ` Nikolay Borisov
1 sibling, 1 reply; 62+ messages in thread
From: Nikolay Borisov @ 2021-10-22 6:02 UTC (permalink / raw)
To: Chris Murphy, Su Yue; +Cc: Qu Wenruo, Qu Wenruo, Btrfs BTRFS
On 22.10.21 г. 5:36, Chris Murphy wrote:
> OK I have a vmcore file:
> https://dustymabe.fedorapeople.org/bz2011928-vmcore/
>
> lib/modules/5.14.10-300.fc35.aarch64/vmlinuz
> https://drive.google.com/file/d/1xXM8XGRi_Wzyupbm4MSNteF0rwUzO4GE/view?usp=sharing
In order to open the dump we need the vmlinux as well as the debug
vmlinuz, and also the btrfs.ko.debug file.
>
>
> --
> Chris Murphy
>
* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
2021-10-22 2:36 ` Chris Murphy
2021-10-22 6:02 ` Nikolay Borisov
@ 2021-10-22 10:44 ` Nikolay Borisov
2021-10-22 11:43 ` Nikolay Borisov
1 sibling, 1 reply; 62+ messages in thread
From: Nikolay Borisov @ 2021-10-22 10:44 UTC (permalink / raw)
To: Chris Murphy, Su Yue; +Cc: Qu Wenruo, Qu Wenruo, Btrfs BTRFS
On 22.10.21 г. 5:36, Chris Murphy wrote:
> OK I have a vmcore file:
> https://dustymabe.fedorapeople.org/bz2011928-vmcore/
>
> lib/modules/5.14.10-300.fc35.aarch64/vmlinuz
> https://drive.google.com/file/d/1xXM8XGRi_Wzyupbm4MSNteF0rwUzO4GE/view?usp=sharing
>
So the problem is we have a null inode:
crash> struct async_chunk ffff00012a78eb08
struct async_chunk {
inode = 0x0,
locked_page = 0xfffffc000508c240,
start = 0,
end = 4095,
write_flags = 0,
extents = {
next = 0xffff00012a78eb30,
prev = 0xffff00012a78eb30
},
blkcg_css = 0x0,
work = {
func = 0xffffd7c4c03c05c0 <async_cow_start>,
ordered_func = 0xffffd7c4c03c1bf0 <async_cow_submit>,
ordered_free = 0xffffd7c4c03be2e0 <async_cow_free>,
normal_work = {
data = {
counter = 256
},
entry = {
next = 0xffff00012a78eb68,
prev = 0xffff00012a78eb68
},
func = 0xffffd7c4c03f9e84 <btrfs_work_helper>
},
ordered_list = {
next = 0xffff00012a78ee80,
prev = 0xffff0000c6d83510
},
wq = 0xffff0000c6d83500,
flags = 3
},
pending = 0xffff00012a78eb00
}
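That NULL also lines up with the faulting address (a sketch of the
arithmetic; the exact offsetof() value is inferred from the fault
address rather than checked against the 5.14 struct layout):

/*
 * BTRFS_I() is just container_of():
 *   BTRFS_I(i) == (struct btrfs_inode *)((char *)i -
 *                  offsetof(struct btrfs_inode, vfs_inode));
 * so with i == NULL the first field access lands a few hundred bytes
 * below 0, i.e. 0xfffffffffffffdd0 == -0x230, exactly the address the
 * oops reports (and the value sitting in x26).
 */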
But this makes no sense, since before submit_compressed_extents is called
we have an explicit check for async_chunk->inode presence, and AFAICS this
is not done in a concurrent context. So this leaves either some hw issue
or some race which manifests due to ARM's weak memory model.
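For reference, the shape of that check (paraphrased from memory, not a
verbatim quote of 5.14's fs/btrfs/inode.c):

static void async_cow_submit(struct btrfs_work *work)
{
	struct async_chunk *async_chunk =
		container_of(work, struct async_chunk, work);

	/*
	 * ->inode may legitimately be NULL here: async_cow_start() clears
	 * it when compression produced nothing to submit. If that clearing
	 * store is not yet visible at the check below, but becomes visible
	 * before submit_compressed_extents() re-loads the pointer, the
	 * guard passes and the callee still sees NULL.
	 */
	if (async_chunk->inode)
		submit_compressed_extents(async_chunk);
}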
>
> --
> Chris Murphy
>
* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
2021-10-22 10:44 ` Nikolay Borisov
@ 2021-10-22 11:43 ` Nikolay Borisov
2021-10-22 17:18 ` Chris Murphy
0 siblings, 1 reply; 62+ messages in thread
From: Nikolay Borisov @ 2021-10-22 11:43 UTC (permalink / raw)
To: Chris Murphy, Su Yue; +Cc: Qu Wenruo, Qu Wenruo, Btrfs BTRFS
On 22.10.21 г. 13:44, Nikolay Borisov wrote:
>
>
> On 22.10.21 г. 5:36, Chris Murphy wrote:
>> OK I have a vmcore file:
>> https://dustymabe.fedorapeople.org/bz2011928-vmcore/
>>
>> lib/modules/5.14.10-300.fc35.aarch64/vmlinuz
>> https://drive.google.com/file/d/1xXM8XGRi_Wzyupbm4MSNteF0rwUzO4GE/view?usp=sharing
>>
>
> So the problem is we have a null inode:
>
>
> crash> struct async_chunk ffff00012a78eb08
> struct async_chunk {
> inode = 0x0,
> locked_page = 0xfffffc000508c240,
> start = 0,
> end = 4095,
> write_flags = 0,
> extents = {
> next = 0xffff00012a78eb30,
> prev = 0xffff00012a78eb30
> },
> blkcg_css = 0x0,
> work = {
> func = 0xffffd7c4c03c05c0 <async_cow_start>,
> ordered_func = 0xffffd7c4c03c1bf0 <async_cow_submit>,
> ordered_free = 0xffffd7c4c03be2e0 <async_cow_free>,
> normal_work = {
> data = {
> counter = 256
> },
> entry = {
> next = 0xffff00012a78eb68,
> prev = 0xffff00012a78eb68
> },
> func = 0xffffd7c4c03f9e84 <btrfs_work_helper>
> },
> ordered_list = {
> next = 0xffff00012a78ee80,
> prev = 0xffff0000c6d83510
> },
> wq = 0xffff0000c6d83500,
> flags = 3
> },
> pending = 0xffff00012a78eb00
> }
>
>
> But this makes no sense, since before submit_compressed_extents is called
> we have an explicit check for async_chunk->inode presence, and AFAICS this
> is not done in a concurrent context. So this leaves either some hw issue
> or some race which manifests due to ARM's weak memory model.
I also looked at the assembly generated in async_cow_submit to see if
anything funny happens while the async_chunk->inode check is performed -
everything looks fine. Also given that the extents list is empty and the
inode is NULL I'd assume that the "write" side is also correct, i.e. the
code in async_cow_start. This pretty much excludes a codegen problem.
Chris can you add the following line in submit_compressed_extents right
before the BTRFS_I() function is called:
WARN_ON(!async_chunk->inode);
And re-run the workload again?
>
>>
>> --
>> Chris Murphy
>>
>
* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
2021-10-22 11:43 ` Nikolay Borisov
@ 2021-10-22 17:18 ` Chris Murphy
2021-10-23 10:09 ` Nikolay Borisov
0 siblings, 1 reply; 62+ messages in thread
From: Chris Murphy @ 2021-10-22 17:18 UTC (permalink / raw)
To: Nikolay Borisov; +Cc: Chris Murphy, Su Yue, Qu Wenruo, Qu Wenruo, Btrfs BTRFS
On Fri, Oct 22, 2021 at 7:43 AM Nikolay Borisov <nborisov@suse.com> wrote:
>
> I also looked at the assembly generated in async_cow_submit to see if
> anything funny happens while the async_chunk->inode check is performed -
> everything looks fine. Also given that the extents list is empty and the
> inode is NULL I'd assume that the "write" side is also correct, i.e. the
> code in async_cow_start. This pretty much excludes a codegen problem.
>
> Chris can you add the following line in submit_compressed_extents right
> before the BTRFS_I() function is called:
>
> WARN_ON(!async_chunk->inode);
>
> And re-run the workload again?
I'll look into how we can do this. I build kernels per
https://kernelnewbies.org/KernelBuild but maybe it's better to do it
within Fedora infrastructure to keep things more consistent and
reproducible? I'm not really sure, so I've asked in the bug
https://bugzilla.redhat.com/show_bug.cgi?id=2011928#c41 - if you have
two cents to add let me know in this thread or that one.
Any other configs to change while we're building a new kernel?
CONFIG_BTRFS_ASSERT=y ?
inode.c
849:static noinline void submit_compressed_extents(struct async_chunk
*async_chunk)
850-{
851- struct btrfs_inode *inode = BTRFS_I(async_chunk->inode);
becomes
849:static noinline void submit_compressed_extents(struct async_chunk
*async_chunk)
850-{
851- WARN_ON(!async_chunk->inode);
852- struct btrfs_inode *inode = BTRFS_I(async_chunk->inode);
?
(I'm looking at 5.15-rc6)
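Or, since the kernel builds with -Wdeclaration-after-statement, maybe
it's cleaner to keep the declaration first? Something like this (a
sketch with the same intent, not tested):

static noinline void submit_compressed_extents(struct async_chunk
*async_chunk)
{
	struct btrfs_inode *inode;

	/* catch a NULL ->inode before anything dereferences it */
	WARN_ON(!async_chunk->inode);
	inode = BTRFS_I(async_chunk->inode);
	...
}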
--
Chris Murphy
* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
2021-10-22 17:18 ` Chris Murphy
@ 2021-10-23 10:09 ` Nikolay Borisov
2021-10-25 14:48 ` Chris Murphy
0 siblings, 1 reply; 62+ messages in thread
From: Nikolay Borisov @ 2021-10-23 10:09 UTC (permalink / raw)
To: Chris Murphy; +Cc: Su Yue, Qu Wenruo, Qu Wenruo, Btrfs BTRFS
On 22.10.21 г. 20:18, Chris Murphy wrote:
> On Fri, Oct 22, 2021 at 7:43 AM Nikolay Borisov <nborisov@suse.com> wrote:
>>
>> I also looked at the assembly generated in async_cow_submit to see if
>> anything funny happens while the async_chunk->inode check is performed -
>> everything looks fine. Also given that the extents list is empty and the
>> inode is NULL I'd assume that the "write" side is also correct, i.e. the
>> code in async_cow_start. This pretty much excludes a codegen problem.
>>
>> Chris can you add the following line in submit_compressed_extents right
>> before the BTRFS_I() function is called:
>>
>> WARN_ON(!async_chunk->inode);
>>
>> And re-run the workload again?
>
> I'll look into how we can do this. I build kernels per
> https://kernelnewbies.org/KernelBuild but maybe it's better to do it
> within Fedora infrastructure to keep things more consistent and
> reproducible? I'm not really sure, so I've asked in the bug
> https://bugzilla.redhat.com/show_bug.cgi?id=2011928#c41 - if you have
> two cents to add let me know in this thread or that one.
>
> Any other configs to change while we're building a new kernel?
> CONFIG_BTRFS_ASSERT=y ?
>
> inode.c
> 849:static noinline void submit_compressed_extents(struct async_chunk
> *async_chunk)
> 850-{
> 851- struct btrfs_inode *inode = BTRFS_I(async_chunk->inode);
>
> becomes
>
> 849:static noinline void submit_compressed_extents(struct async_chunk
> *async_chunk)
> 850-{
> 851- WARN_ON(!async_chunk->inode);
> 852- struct btrfs_inode *inode = BTRFS_I(async_chunk->inode);
>
> ?
> (I'm looking at 5.15-rc6)
Yes.
>
>
>
* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
2021-10-23 10:09 ` Nikolay Borisov
@ 2021-10-25 14:48 ` Chris Murphy
2021-10-25 18:34 ` Chris Murphy
0 siblings, 1 reply; 62+ messages in thread
From: Chris Murphy @ 2021-10-25 14:48 UTC (permalink / raw)
To: Nikolay Borisov; +Cc: Chris Murphy, Su Yue, Qu Wenruo, Qu Wenruo, Btrfs BTRFS
https://bugzilla.redhat.com/show_bug.cgi?id=2011928
Comment 45 (attachment) is a dmesg sysrq+t during the hang with a
5.14.14 kernel with the WARN_ON added; no OOPS or call trace occurred.
Comment 46 (attachment) is a dmesg with a 5.14.10 kernel with the
WARN_ON added, with OOPS and call trace; an excerpt is pasted below.
[ 992.788137] ------------[ cut here ]------------
[ 992.793018] WARNING: CPU: 0 PID: 1509 at fs/btrfs/inode.c:844
submit_compressed_extents+0x3d4/0x3e0
[ 992.802276] Modules linked in: rfkill virtio_gpu virtio_dma_buf
drm_kms_helper joydev cec fb_sys_fops virtio_net syscopyarea
net_failover sysfillrect sysimgblt virtio_balloon failover vfat fat
drm fuse zram ip_tables crct10dif_ce ghash_ce virtio_blk qemu_fw_cfg
virtio_mmio aes_neon_bs
[ 992.828320] CPU: 0 PID: 1509 Comm: kworker/u8:12 Not tainted
5.14.10-300.fc35.dusty.aarch64 #1
[ 992.837159] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
[ 992.844076] Workqueue: btrfs-delalloc btrfs_work_helper
[ 992.849339] pstate: 20400005 (nzCv daif +PAN -UAO -TCO BTYPE=--)
[ 992.855262] pc : submit_compressed_extents+0x3d4/0x3e0
[ 992.860357] lr : async_cow_submit+0x50/0xd0
[ 992.864444] sp : ffff800012023c20
[ 992.867667] x29: ffff800012023c30 x28: 0000000000000000 x27: ffffdd47ca411000
[ 992.874799] x26: ffff000128f2c548 x25: dead000000000100 x24: ffff000128f2c508
[ 992.881862] x23: 0000000000000000 x22: 0000000000000001 x21: ffff00018f9d5e80
[ 992.888931] x20: ffff0000c0672000 x19: 0000000000000001 x18: ffff0000c4c00bd4
[ 992.896105] x17: ffff00012d53aff8 x16: 0000000000000006 x15: 7a1cde357ab19b01
[ 992.903348] x14: 5eac0029a606c741 x13: 0000000000000020 x12: ffff0001fefa78c0
[ 992.910639] x11: ffffdd47ca42b500 x10: 0000000000000000 x9 : ffffdd47c8c01c50
[ 992.917872] x8 : ffff22ba34aec000 x7 : ffff800012023be0 x6 : ffffdd47ca95ad40
[ 992.925086] x5 : 0000000000000000 x4 : 0000000000000000 x3 : ffff00018f9d5ea0
[ 992.932221] x2 : 0000000000000000 x1 : ffff000128f2c508 x0 : ffff000128f2c508
[ 992.939392] Call trace:
[ 992.941854] submit_compressed_extents+0x3d4/0x3e0
[ 992.946737] async_cow_submit+0x50/0xd0
[ 992.950574] run_ordered_work+0xc8/0x280
[ 992.954560] btrfs_work_helper+0x98/0x250
[ 992.958594] process_one_work+0x1f0/0x4ac
[ 992.962619] worker_thread+0x188/0x504
[ 992.966390] kthread+0x110/0x114
[ 992.969681] ret_from_fork+0x10/0x18
[ 992.973313] ---[ end trace 11b751608cbdcfac ]---
[ 992.978203] Unable to handle kernel paging request at virtual
address fffffffffffffdd0
[ 992.986011] Mem abort info:
[ 992.993975] ESR = 0x96000004
[ 992.996786] EC = 0x25: DABT (current EL), IL = 32 bits
[ 993.001795] SET = 0, FnV = 0
[ 993.004646] EA = 0, S1PTW = 0
[ 993.007455] FSC = 0x04: level 0 translation fault
[ 993.012081] Data abort info:
[ 993.014712] ISV = 0, ISS = 0x00000004
[ 993.021058] CM = 0, WnR = 0
[ 993.026357] swapper pgtable: 4k pages, 48-bit VAs, pgdp=000000009c051000
[ 993.035411] [fffffffffffffdd0] pgd=0000000000000000, p4d=0000000000000000
[ 993.044400] Internal error: Oops: 96000004 [#1] SMP
[ 993.051651] Modules linked in: rfkill virtio_gpu virtio_dma_buf
drm_kms_helper joydev cec fb_sys_fops virtio_net syscopyarea
net_failover sysfillrect sysimgblt virtio_balloon failover vfat fat
drm fuse zram ip_tables crct10dif_ce ghash_ce virtio_blk qemu_fw_cfg
virtio_mmio aes_neon_bs
[ 993.083344] CPU: 0 PID: 1509 Comm: kworker/u8:12 Tainted: G
W 5.14.10-300.fc35.dusty.aarch64 #1
[ 993.095545] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
[ 993.104796] Workqueue: btrfs-delalloc btrfs_work_helper
[ 993.112752] pstate: 20400005 (nzCv daif +PAN -UAO -TCO BTYPE=--)
[ 993.121096] pc : submit_compressed_extents+0x44/0x3e0
[ 993.128333] lr : async_cow_submit+0x50/0xd0
[ 993.134773] sp : ffff800012023c20
[ 993.140397] x29: ffff800012023c30 x28: 0000000000000000 x27: ffffdd47ca411000
[ 993.149489] x26: fffffffffffffdd0 x25: dead000000000100 x24: ffff000128f2c508
[ 993.158723] x23: 0000000000000000 x22: 0000000000000001 x21: ffff00018f9d5e80
[ 993.167904] x20: fffffffffffffe18 x19: 0000000000000001 x18: ffff0000c4c00bd4
[ 993.177039] x17: ffff00012d53aff8 x16: 0000000000000006 x15: 7a1cde357ab19b01
[ 993.186386] x14: 5eac0029a606c741 x13: 0000000000000020 x12: ffff0001fefa78c0
[ 993.195490] x11: ffffdd47ca42b500 x10: 0000000000000000 x9 : ffffdd47c8c01c50
[ 993.204603] x8 : ffff22ba34aec000 x7 : ffff800012023be0 x6 : ffffdd47ca95ad40
[ 993.213749] x5 : 0000000000000000 x4 : 0000000000000000 x3 : ffff00018f9d5ea0
[ 993.222960] x2 : 0000000000000000 x1 : ffff000128f2c530 x0 : ffff000128f2c530
[ 993.232079] Call trace:
[ 993.236821] submit_compressed_extents+0x44/0x3e0
[ 993.243682] async_cow_submit+0x50/0xd0
[ 993.249829] run_ordered_work+0xc8/0x280
[ 993.255974] btrfs_work_helper+0x98/0x250
[ 993.262187] process_one_work+0x1f0/0x4ac
[ 993.268381] worker_thread+0x188/0x504
[ 993.274252] kthread+0x110/0x114
[ 993.279894] ret_from_fork+0x10/0x18
[ 993.285819] Code: d108c2fa 9100a301 f9401700 d107a2f4 (f9400356)
[ 993.294256] ---[ end trace 11b751608cbdcfad ]---
I don't see any new information here though.
--
Chris Murphy
* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
2021-10-25 14:48 ` Chris Murphy
@ 2021-10-25 18:34 ` Chris Murphy
2021-10-25 19:40 ` Chris Murphy
0 siblings, 1 reply; 62+ messages in thread
From: Chris Murphy @ 2021-10-25 18:34 UTC (permalink / raw)
To: Chris Murphy; +Cc: Nikolay Borisov, Su Yue, Qu Wenruo, Qu Wenruo, Btrfs BTRFS
> > Vendor ID: Cavium
> > Model: 1
> > Model name: ThunderX 88XX
I still haven't hit the WARN_ON. But weirdly I'm not getting the oops
with 5.14.14, yet can hit it with 5.14.10... though the sample size is
small. And it's definitely smelling like a race. I'll keep trying to
hit it with 5.14.10 because I want to see if this WARN_ON will get hit
and give us more information.
--
Chris Murphy
* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
2021-10-25 18:34 ` Chris Murphy
@ 2021-10-25 19:40 ` Chris Murphy
2021-10-26 7:14 ` Nikolay Borisov
0 siblings, 1 reply; 62+ messages in thread
From: Chris Murphy @ 2021-10-25 19:40 UTC (permalink / raw)
To: Chris Murphy; +Cc: Nikolay Borisov, Su Yue, Qu Wenruo, Qu Wenruo, Btrfs BTRFS
Got another sysrq+t here, taken while dnf has been completely hung in
'dnf install kernel-debuginfo' for a long time, without any call traces
or other indication of why it's stuck. ps aux shows it's running, but
consuming no meaningful cpu; top shows very high ~25% wa, the rest is
idle. Essentially no user or system process consumption.
https://bugzilla.redhat.com/attachment.cgi?id=1836995
Excerpts of items that are in D state:
[ 9595.270460] kernel: task:kworker/u8:7 state:D stack: 0 pid:
1296 ppid: 2 flags:0x00000008
[ 9595.280269] kernel: Workqueue: events_unbound
btrfs_async_reclaim_metadata_space
[ 9595.288593] kernel: Call trace:
[ 9595.292822] kernel: __switch_to+0x160/0x1d4
[ 9595.298383] kernel: __schedule+0x22c/0x5f0
[ 9595.303605] kernel: schedule+0x54/0xdc
[ 9595.308644] kernel: schedule_preempt_disabled+0x1c/0x30
[ 9595.314929] kernel: __mutex_lock.constprop.0+0x184/0x544
[ 9595.321559] kernel: __mutex_lock_slowpath+0x1c/0x30
[ 9595.327579] kernel: mutex_lock+0x6c/0x80
[ 9595.332600] kernel: btrfs_start_delalloc_roots+0x78/0x320
[ 9595.339303] kernel: shrink_delalloc+0xf4/0x260
[ 9595.344883] kernel: flush_space+0x110/0x2a0
[ 9595.350402] kernel: btrfs_async_reclaim_metadata_space+0x130/0x350
[ 9595.357574] kernel: process_one_work+0x1f0/0x4ac
[ 9595.363215] kernel: worker_thread+0x188/0x504
[ 9595.368921] kernel: kthread+0x110/0x114
[ 9595.373958] kernel: ret_from_fork+0x10/0x18
[ 9595.379413] kernel: task:kworker/u8:9 state:D stack: 0 pid:
1300 ppid: 2 flags:0x00000008
[ 9595.389417] kernel: Workqueue: writeback wb_workfn (flush-btrfs-1)
[ 9595.396867] kernel: Call trace:
[ 9595.401256] kernel: __switch_to+0x160/0x1d4
[ 9595.406688] kernel: __schedule+0x22c/0x5f0
[ 9595.411998] kernel: schedule+0x54/0xdc
[ 9595.417000] kernel: inode_sleep_on_writeback+0x8c/0xb0
[ 9595.423152] kernel: wb_writeback+0x174/0x3dc
[ 9595.428734] kernel: wb_do_writeback+0x114/0x394
[ 9595.434404] kernel: wb_workfn+0x80/0x2a0
[ 9595.439815] kernel: process_one_work+0x1f0/0x4ac
[ 9595.445807] kernel: worker_thread+0x260/0x504
[ 9595.451559] kernel: kthread+0x110/0x114
[ 9595.456623] kernel: ret_from_fork+0x10/0x18
[ 9595.461987] kernel: task:kworker/u8:13 state:D stack: 0 pid:
1304 ppid: 2 flags:0x00000008
[ 9595.472144] kernel: Workqueue: events_unbound
btrfs_preempt_reclaim_metadata_space
[ 9595.480865] kernel: Call trace:
[ 9595.485360] kernel: __switch_to+0x160/0x1d4
[ 9595.491154] kernel: __schedule+0x22c/0x5f0
[ 9595.496601] kernel: schedule+0x54/0xdc
[ 9595.501702] kernel: io_schedule+0x48/0x6c
[ 9595.507098] kernel: wait_on_page_bit_common+0x15c/0x400
[ 9595.513421] kernel: __lock_page+0x60/0x80
[ 9595.518791] kernel: extent_write_cache_pages+0x29c/0x3cc
[ 9595.525199] kernel: extent_writepages+0x44/0xb0
[ 9595.531110] kernel: btrfs_writepages+0x1c/0x30
[ 9595.536813] kernel: do_writepages+0x44/0xf0
[ 9595.542223] kernel: __writeback_single_inode+0x48/0x400
[ 9595.548938] kernel: writeback_single_inode+0xf4/0x240
[ 9595.555245] kernel: sync_inode+0x1c/0x2c
[ 9595.560604] kernel: start_delalloc_inodes+0x188/0x450
[ 9595.567634] kernel: btrfs_start_delalloc_roots+0x194/0x320
[ 9595.574325] kernel: shrink_delalloc+0xf4/0x260
[ 9595.580087] kernel: flush_space+0x110/0x2a0
[ 9595.585381] kernel: btrfs_preempt_reclaim_metadata_space+0x148/0x270
[ 9595.593048] kernel: process_one_work+0x1f0/0x4ac
[ 9595.599040] kernel: worker_thread+0x188/0x504
[ 9595.604515] kernel: kthread+0x110/0x114
[ 9595.609959] kernel: ret_from_fork+0x10/0x18
...
[ 9596.146831] kernel: task:dnf state:D stack: 0
pid:14580 ppid: 14579 flags:0x00000000
[ 9596.156309] kernel: Call trace:
[ 9596.160424] kernel: __switch_to+0x160/0x1d4
[ 9596.165512] kernel: __schedule+0x22c/0x5f0
[ 9596.170758] kernel: schedule+0x54/0xdc
[ 9596.175419] kernel: wb_wait_for_completion+0x78/0xac
[ 9596.181577] kernel: __writeback_inodes_sb_nr+0x80/0xa0
[ 9596.187695] kernel: writeback_inodes_sb+0x58/0x70
[ 9596.193322] kernel: sync_filesystem+0x50/0xc0
[ 9596.198714] kernel: __arm64_sys_syncfs+0x54/0xb0
[ 9596.204163] kernel: invoke_syscall+0x50/0x120
[ 9596.209724] kernel: el0_svc_common+0x48/0x100
[ 9596.214986] kernel: do_el0_svc+0x34/0xa0
[ 9596.220105] kernel: el0_svc+0x2c/0x54
[ 9596.224755] kernel: el0t_64_sync_handler+0xa4/0x130
[ 9596.230745] kernel: el0t_64_sync+0x19c/0x1a0
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
2021-10-25 19:40 ` Chris Murphy
@ 2021-10-26 7:14 ` Nikolay Borisov
2021-10-26 12:51 ` Chris Murphy
2021-10-27 18:22 ` Chris Murphy
0 siblings, 2 replies; 62+ messages in thread
From: Nikolay Borisov @ 2021-10-26 7:14 UTC (permalink / raw)
To: Chris Murphy; +Cc: Su Yue, Qu Wenruo, Qu Wenruo, Btrfs BTRFS
On 25.10.21 г. 22:40, Chris Murphy wrote:
> Got another sysrq+t here, taken while 'dnf install kernel-debuginfo'
> was completely hung for a long time without any call
> traces or indication of why it's stuck. ps aux shows it's running, but
> consuming no meaningful cpu; top shows very high ~25% wa, the rest is
> idle. Essentially no user or system process consumption.
>
> https://bugzilla.redhat.com/attachment.cgi?id=1836995
>
<snip>
I think I identified a race that could cause the crash; can you apply the
following diff, re-run the tests, and leave them running for a couple of days?
Preferably apply it on 5.4.10 so that there is the highest chance to reproduce:
diff --git a/fs/btrfs/async-thread.c b/fs/btrfs/async-thread.c
index 309516e6a968..a3d788dcbd34 100644
--- a/fs/btrfs/async-thread.c
+++ b/fs/btrfs/async-thread.c
@@ -234,6 +234,11 @@ static void run_ordered_work(struct __btrfs_workqueue *wq,
ordered_list);
if (!test_bit(WORK_DONE_BIT, &work->flags))
break;
+ /*
+ * Orders all subsequent loads after WORK_DONE_BIT, paired with
+ * the smp_mb__before_atomic in btrfs_work_helper
+ */
+ smp_rmb();
/*
* we are going to call the ordered done function, but
@@ -317,6 +322,12 @@ static void btrfs_work_helper(struct work_struct *normal_work)
thresh_exec_hook(wq);
work->func(work);
if (need_order) {
+ /*
+ * Ensures all вритес done in ->func are ordered before
+ * setting the WORK_DONE_BIT making them visible to ordered
+ * func
+ */
+ smp_mb__before_atomic();
set_bit(WORK_DONE_BIT, &work->flags);
run_ordered_work(wq, work);
} else {
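
For illustration, here is a minimal sketch of the publish/consume pattern
the two barriers establish (not the actual btrfs code; the result field
and the do_work()/consume() helpers are made up):

	/* Producer, as in btrfs_work_helper(): publish the results,
	 * then the flag. */
	work->result = do_work(work);         /* plain stores from ->func */
	smp_mb__before_atomic();              /* order those stores before... */
	set_bit(WORK_DONE_BIT, &work->flags); /* ...the flag becoming visible */

	/* Consumer, as in run_ordered_work(): check the flag, then read. */
	if (test_bit(WORK_DONE_BIT, &work->flags)) {
		smp_rmb();               /* order later loads after the bit */
		consume(work->result);   /* now guaranteed to see the stores */
	}

Without the pair, a weakly ordered CPU may observe WORK_DONE_BIT set while
the stores from ->func are still invisible, so the ordered function can
read stale data.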
^ permalink raw reply related [flat|nested] 62+ messages in thread
* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
2021-10-26 7:14 ` Nikolay Borisov
@ 2021-10-26 12:51 ` Chris Murphy
2021-10-26 13:05 ` Nikolay Borisov
2021-10-27 18:22 ` Chris Murphy
1 sibling, 1 reply; 62+ messages in thread
From: Chris Murphy @ 2021-10-26 12:51 UTC (permalink / raw)
To: Nikolay Borisov; +Cc: Chris Murphy, Su Yue, Qu Wenruo, Qu Wenruo, Btrfs BTRFS
On Tue, Oct 26, 2021 at 3:14 AM Nikolay Borisov <nborisov@suse.com> wrote:
>
> I think I identified a race that could cause the crash; can you apply the
> following diff, re-run the tests, and leave them running for a couple of days?
> Preferably apply it on 5.4.10 so that there is the highest chance to reproduce:
>
> diff --git a/fs/btrfs/async-thread.c b/fs/btrfs/async-thread.c
> index 309516e6a968..a3d788dcbd34 100644
> --- a/fs/btrfs/async-thread.c
> +++ b/fs/btrfs/async-thread.c
> @@ -234,6 +234,11 @@ static void run_ordered_work(struct __btrfs_workqueue *wq,
> ordered_list);
> if (!test_bit(WORK_DONE_BIT, &work->flags))
> break;
> + /*
> + * Orders all subsequent loads after WORK_DONE_BIT, paired with
> + * the smp_mb__before_atomic in btrfs_work_helper
> + */
> + smp_rmb();
>
> /*
> * we are going to call the ordered done function, but
> @@ -317,6 +322,12 @@ static void btrfs_work_helper(struct work_struct *normal_work)
> thresh_exec_hook(wq);
> work->func(work);
> if (need_order) {
> + /*
> + * Ensures all вритес done in ->func are ordered before
> + * setting the WORK_DONE_BIT making them visible to ordered
> + * func
> + */
> + smp_mb__before_atomic();
> set_bit(WORK_DONE_BIT, &work->flags);
> run_ordered_work(wq, work);
> } else {
>
A couple of typos: 'вритес' looks like a keyboard-layout hiccup and should
be 'writes'; and 5.4.10 should be 5.14.10 (I'm betting all the tea in
China that upstream isn't asking me to test a patch on a two-year-old
kernel).
Unfortunately the test we have is non-automated: it's "install this
package set" and wait. It always hangs, usually recovers without an
oops, but sometimes there's an oops. So it's pretty tedious to test
with the "testcase" we currently have. I'd like a better one that
triggers this faster, but more importantly one that's reliable.
We'll do our best though. Thanks!
--
Chris Murphy
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
2021-10-26 12:51 ` Chris Murphy
@ 2021-10-26 13:05 ` Nikolay Borisov
2021-10-26 18:08 ` Chris Murphy
0 siblings, 1 reply; 62+ messages in thread
From: Nikolay Borisov @ 2021-10-26 13:05 UTC (permalink / raw)
To: Chris Murphy; +Cc: Su Yue, Qu Wenruo, Qu Wenruo, Btrfs BTRFS
On 26.10.21 г. 15:51, Chris Murphy wrote:
> On Tue, Oct 26, 2021 at 3:14 AM Nikolay Borisov <nborisov@suse.com> wrote:
>>
>> I think I identified a race that could cause the crash; can you apply the
>> following diff, re-run the tests, and leave them running for a couple of days?
>> Preferably apply it on 5.4.10 so that there is the highest chance to reproduce:
>>
>> diff --git a/fs/btrfs/async-thread.c b/fs/btrfs/async-thread.c
>> index 309516e6a968..a3d788dcbd34 100644
>> --- a/fs/btrfs/async-thread.c
>> +++ b/fs/btrfs/async-thread.c
>> @@ -234,6 +234,11 @@ static void run_ordered_work(struct __btrfs_workqueue *wq,
>> ordered_list);
>> if (!test_bit(WORK_DONE_BIT, &work->flags))
>> break;
>> + /*
>> + * Orders all subsequent loads after WORK_DONE_BIT, paired with
>> + * the smp_mb__before_atomic in btrfs_work_helper
>> + */
>> + smp_rmb();
>>
>> /*
>> * we are going to call the ordered done function, but
>> @@ -317,6 +322,12 @@ static void btrfs_work_helper(struct work_struct *normal_work)
>> thresh_exec_hook(wq);
>> work->func(work);
>> if (need_order) {
>> + /*
>> + * Ensures all вритес done in ->func are ordered before
>> + * setting the WORK_DONE_BIT making them visible to ordered
>> + * func
>> + */
>> + smp_mb__before_atomic();
>> set_bit(WORK_DONE_BIT, &work->flags);
>> run_ordered_work(wq, work);
>> } else {
>>
>
> A couple of typos: 'вритес' looks like a keyboard-layout hiccup and should
> be 'writes'; and 5.4.10 should be 5.14.10 (I'm betting all the tea in
> China that upstream isn't asking me to test a patch on a two-year-old
> kernel).
Correct in both cases :)
>
> Unfortunately the test we have is non-automated: it's "install this
> package set" and wait. It always hangs, usually recovers without an
> oops, but sometimes there's an oops. So it's pretty tedious to test
> with the "testcase" we currently have. I'd like a better one that
> triggers this faster, but more importantly one that's reliable.
> We'll do our best though. Thanks!
I thought the hang and the crash were two different issues. What the
above diff is supposed to solve is the case in which
submit_compressed_extent is called while async_chunk->inode is NULL.
The lockup issue might or might not be related to this, but it
would be best if a crashdump is provided once the hang has occurred.
The task call trace in
https://bugzilla-attachments.redhat.com/attachment.cgi?id=1836995
doesn't point at a hang, just a bunch of threads waiting on IO in the
metadata reclaim path.
>
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
2021-10-26 13:05 ` Nikolay Borisov
@ 2021-10-26 18:08 ` Chris Murphy
2021-10-26 18:14 ` Nikolay Borisov
0 siblings, 1 reply; 62+ messages in thread
From: Chris Murphy @ 2021-10-26 18:08 UTC (permalink / raw)
To: Nikolay Borisov; +Cc: Chris Murphy, Su Yue, Qu Wenruo, Qu Wenruo, Btrfs BTRFS
On Tue, Oct 26, 2021 at 9:05 AM Nikolay Borisov <nborisov@suse.com> wrote:
>
>
>
> On 26.10.21 г. 15:51, Chris Murphy wrote:
> > Unfortunately the test we have is non-automated: it's "install this
> > package set" and wait. It always hangs, usually recovers without an
> > oops, but sometimes there's an oops. So it's pretty tedious to test
> > with the "testcase" we currently have. I'd like a better one that
> > triggers this faster, but more importantly one that's reliable.
> > We'll do our best though. Thanks!
>
> I thought the hang and the crash were two different issues. What the
> above diff is supposed to solve is the case in which
> submit_compressed_extent is called while async_chunk->inode is NULL.
I don't know whether the hang and crash are related at all. I've been
unable to get a sysrq+t that shows anything when "dnf install
libreoffice" hangs, which I suspect could be dbus related where a
bunch of services get clobbered and restarted during the metric ton of
dependencies that libreoffice brings into a cloud base image. But
there is a consistent hang just installing kernel debug info and maybe
half the time the VM just falls over and isn't responsive at all -
later we sometimes see the submit_compressed_extent call trace in
virtual serial console. So yeah, I don't know...
> The lockup issue might or might not be related to this, but it
> would be best if a crashdump is provided once the hang has occurred.
How do I trigger the crashdump for the hang? Maybe set one of these to 1?
kernel.hardlockup_panic = 0
kernel.hung_task_panic = 0
kernel.max_rcu_stall_to_panic = 0
kernel.panic_on_rcu_stall = 0
> The task call trace in
> https://bugzilla-attachments.redhat.com/attachment.cgi?id=1836995
> doesn't point at a hang, just a bunch of threads waiting on IO in the
> metadata reclaim path.
Well it stayed that way for hours and never recovered; I couldn't ssh
in either. And in the most recent case there was an oops with the
submit_compressed_extent call trace.
--
Chris Murphy
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
2021-10-26 18:08 ` Chris Murphy
@ 2021-10-26 18:14 ` Nikolay Borisov
2021-10-26 18:26 ` Chris Murphy
0 siblings, 1 reply; 62+ messages in thread
From: Nikolay Borisov @ 2021-10-26 18:14 UTC (permalink / raw)
To: Chris Murphy; +Cc: Su Yue, Qu Wenruo, Qu Wenruo, Btrfs BTRFS
On 26.10.21 г. 21:08, Chris Murphy wrote:
> I don't know whether the hang and crash are related at all. I've been
> unable to get a sysrq+t that shows anything when "dnf install
> libreoffice" hangs, which I suspect could be dbus related where a
> bunch of services get clobbered and restarted during the metric ton of
> dependencies that libreoffice brings into a cloud base image. But
Since this is a qemu virtual machine it's possible to acquire a direct
memory dump from qemu's management console with the dump-guest-memory
command; alternatively, via virsh one can follow the procedure
described here:
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/virtualization_deployment_and_administration_guide/sect-domain_commands-creating_a_dump_file_of_a_domains_core
If you can provide a memory dump + kernel vmlinux then I will be happy
to look into this. In the meantime the barrier fixes should remedy the crash.
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
2021-10-26 18:14 ` Nikolay Borisov
@ 2021-10-26 18:26 ` Chris Murphy
2021-10-26 18:31 ` Chris Murphy
0 siblings, 1 reply; 62+ messages in thread
From: Chris Murphy @ 2021-10-26 18:26 UTC (permalink / raw)
To: Nikolay Borisov; +Cc: Chris Murphy, Su Yue, Qu Wenruo, Qu Wenruo, Btrfs BTRFS
On Tue, Oct 26, 2021 at 2:14 PM Nikolay Borisov <nborisov@suse.com> wrote:
>
>
>
> On 26.10.21 г. 21:08, Chris Murphy wrote:
> > I don't know whether the hang and crash are related at all. I've been
> > unable to get a sysrq+t that shows anything when "dnf install
> > libreoffice" hangs, which I suspect could be dbus related where a
> > bunch of services get clobbered and restarted during the metric ton of
> > dependencies that libreoffice brings into a cloud base image. But
>
>
> Since this is a qemu virtual machine it's possible to acquire a direct
> memory dump from qemu's management console with the dump-guest-memory
> command; alternatively, via virsh one can follow the procedure
> described here:
> https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/virtualization_deployment_and_administration_guide/sect-domain_commands-creating_a_dump_file_of_a_domains_core
>
>
> If you can provide a memory dump + kernel vmlinux then I will be happy
> to look into this. In the meantime the barrier fixes should remedy the crash.
OK thanks. I'll start testing a kernel built with this patch, and then
move on to capturing a memory dump of the VM if we're still seeing
hangs.
--
Chris Murphy
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
2021-10-26 18:26 ` Chris Murphy
@ 2021-10-26 18:31 ` Chris Murphy
2021-10-26 18:35 ` Nikolay Borisov
0 siblings, 1 reply; 62+ messages in thread
From: Chris Murphy @ 2021-10-26 18:31 UTC (permalink / raw)
To: Chris Murphy; +Cc: Nikolay Borisov, Su Yue, Qu Wenruo, Qu Wenruo, Btrfs BTRFS
On Tue, Oct 26, 2021 at 2:26 PM Chris Murphy <lists@colorremedies.com> wrote:
>
> On Tue, Oct 26, 2021 at 2:14 PM Nikolay Borisov <nborisov@suse.com> wrote:
> >
> >
> >
> > On 26.10.21 г. 21:08, Chris Murphy wrote:
> > > I don't know whether the hang and crash are related at all. I've been
> > > unable to get a sysrq+t that shows anything when "dnf install
> > > libreoffice" hangs, which I suspect could be dbus related where a
> > > bunch of services get clobbered and restarted during the metric ton of
> > > dependencies that libreoffice brings into a cloud base image. But
> >
> >
> > Since this is a qemu virtual machine it's possible to acquire a direct
> > memory dump from qemu's management console with the dump-guest-memory
> > command; alternatively, via virsh one can follow the procedure
> > described here:
> > https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/virtualization_deployment_and_administration_guide/sect-domain_commands-creating_a_dump_file_of_a_domains_core
> >
> >
> > If you can provide a memory dump + kernel vmlinux then I will be happy
> > to look into this. In the meantime the barrier fixes should remedy the crash.
>
> OK thanks. I'll start testing a kernel built with this patch, and then
> move on to capturing a memory dump of the VM if we're still seeing
> hangs.
With or without the --memory-only option?
--
Chris Murphy
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
2021-10-26 18:31 ` Chris Murphy
@ 2021-10-26 18:35 ` Nikolay Borisov
0 siblings, 0 replies; 62+ messages in thread
From: Nikolay Borisov @ 2021-10-26 18:35 UTC (permalink / raw)
To: Chris Murphy; +Cc: Su Yue, Qu Wenruo, Qu Wenruo, Btrfs BTRFS
On 26.10.21 г. 21:31, Chris Murphy wrote:
> On Tue, Oct 26, 2021 at 2:26 PM Chris Murphy <lists@colorremedies.com> wrote:
>>
>> On Tue, Oct 26, 2021 at 2:14 PM Nikolay Borisov <nborisov@suse.com> wrote:
>>>
>>>
>>>
>>> On 26.10.21 г. 21:08, Chris Murphy wrote:
>>>> I don't know whether the hang and crash are related at all. I've been
>>>> unable to get a sysrq+t that shows anything when "dnf install
>>>> libreoffice" hangs, which I suspect could be dbus related where a
>>>> bunch of services get clobbered and restarted during the metric ton of
>>>> dependencies that libreoffice brings into a cloud base image. But
>>>
>>>
>>> Since this is a qemu virtual machine it's possible to acquire a direct
>>> memory dump from qemu's management console with the dump-guest-memory
>>> command; alternatively, via virsh one can follow the procedure
>>> described here:
>>> https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/virtualization_deployment_and_administration_guide/sect-domain_commands-creating_a_dump_file_of_a_domains_core
>>>
>>>
>>> If you can provide a memory dump + kernel vmlinux then I will be happy
>>> to look into this. In the meantime the barrier fixes should remedy the crash.
>>
>> OK thanks. I'll start testing a kernel built with this patch, and then
>> move on to capturing a memory dump of the VM if we're still seeing
>> hangs.
>
> With or without the --memory-only option?
Yes (though I have never used the virsh method, only the HMP one directly).
>
>
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
2021-10-26 7:14 ` Nikolay Borisov
2021-10-26 12:51 ` Chris Murphy
@ 2021-10-27 18:22 ` Chris Murphy
2021-10-28 5:36 ` Nikolay Borisov
1 sibling, 1 reply; 62+ messages in thread
From: Chris Murphy @ 2021-10-27 18:22 UTC (permalink / raw)
To: Nikolay Borisov; +Cc: Chris Murphy, Su Yue, Qu Wenruo, Qu Wenruo, Btrfs BTRFS
On Tue, Oct 26, 2021 at 3:14 AM Nikolay Borisov <nborisov@suse.com> wrote:
> I think I identified a race that could cause the crash; can you apply the
> following diff, re-run the tests, and leave them running for a couple of days?
> Preferably apply it on 5.4.10 so that there is the highest chance to reproduce:
>
> diff --git a/fs/btrfs/async-thread.c b/fs/btrfs/async-thread.c
> index 309516e6a968..a3d788dcbd34 100644
> --- a/fs/btrfs/async-thread.c
> +++ b/fs/btrfs/async-thread.c
> @@ -234,6 +234,11 @@ static void run_ordered_work(struct __btrfs_workqueue *wq,
> ordered_list);
> if (!test_bit(WORK_DONE_BIT, &work->flags))
> break;
> + /*
> + * Orders all subsequent loads after WORK_DONE_BIT, paired with
> + * the smp_mb__before_atomic in btrfs_work_helper
> + */
> + smp_rmb();
>
> /*
> * we are going to call the ordered done function, but
> @@ -317,6 +322,12 @@ static void btrfs_work_helper(struct work_struct *normal_work)
> thresh_exec_hook(wq);
> work->func(work);
> if (need_order) {
> + /*
> + * Ensures all вритес done in ->func are ordered before
> + * setting the WORK_DONE_BIT making them visible to ordered
> + * func
> + */
> + smp_mb__before_atomic();
> set_bit(WORK_DONE_BIT, &work->flags);
> run_ordered_work(wq, work);
> } else {
>
So far this appears to be working well - thanks!
https://bugzilla.redhat.com/show_bug.cgi?id=2011928#c54
--
Chris Murphy
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
2021-10-27 18:22 ` Chris Murphy
@ 2021-10-28 5:36 ` Nikolay Borisov
2021-11-02 14:23 ` Chris Murphy
0 siblings, 1 reply; 62+ messages in thread
From: Nikolay Borisov @ 2021-10-28 5:36 UTC (permalink / raw)
To: Chris Murphy; +Cc: Su Yue, Qu Wenruo, Qu Wenruo, Btrfs BTRFS
On 27.10.21 г. 21:22, Chris Murphy wrote:
> On Tue, Oct 26, 2021 at 3:14 AM Nikolay Borisov <nborisov@suse.com> wrote:
>
>> I think I identified a race that could cause the crash; can you apply the
>> following diff, re-run the tests, and leave them running for a couple of days?
>> Preferably apply it on 5.4.10 so that there is the highest chance to reproduce:
>>
>> diff --git a/fs/btrfs/async-thread.c b/fs/btrfs/async-thread.c
>> index 309516e6a968..a3d788dcbd34 100644
>> --- a/fs/btrfs/async-thread.c
>> +++ b/fs/btrfs/async-thread.c
>> @@ -234,6 +234,11 @@ static void run_ordered_work(struct __btrfs_workqueue *wq,
>> ordered_list);
>> if (!test_bit(WORK_DONE_BIT, &work->flags))
>> break;
>> + /*
>> + * Orders all subsequent loads after WORK_DONE_BIT, paired with
>> + * the smp_mb__before_atomic in btrfs_work_helper
>> + */
>> + smp_rmb();
>>
>> /*
>> * we are going to call the ordered done function, but
>> @@ -317,6 +322,12 @@ static void btrfs_work_helper(struct work_struct *normal_work)
>> thresh_exec_hook(wq);
>> work->func(work);
>> if (need_order) {
>> + /*
>> + * Ensures all вритес done in ->func are ordered before
>> + * setting the WORK_DONE_BIT making them visible to ordered
>> + * func
>> + */
>> + smp_mb__before_atomic();
>> set_bit(WORK_DONE_BIT, &work->flags);
>> run_ordered_work(wq, work);
>> } else {
>>
>
> So far this appears to be working well - thanks!
> https://bugzilla.redhat.com/show_bug.cgi?id=2011928#c54
Great, but due to the nature of the bug I'd rather wait at least until
the beginning of next week before sending an official patch so that this
can be tested more. In your comment you state 3/3 kernel debug info
installs and 6/6 libreoffice installs; how do those numbers compare
without the fix?
>
>
> --
> Chris Murphy
>
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
2021-10-28 5:36 ` Nikolay Borisov
@ 2021-11-02 14:23 ` Chris Murphy
2021-11-02 14:25 ` Nikolay Borisov
0 siblings, 1 reply; 62+ messages in thread
From: Chris Murphy @ 2021-11-02 14:23 UTC (permalink / raw)
To: Nikolay Borisov; +Cc: Chris Murphy, Su Yue, Qu Wenruo, Qu Wenruo, Btrfs BTRFS
On Thu, Oct 28, 2021 at 1:36 AM Nikolay Borisov <nborisov@suse.com> wrote:
>
>
>
> On 27.10.21 г. 21:22, Chris Murphy wrote:
> > On Tue, Oct 26, 2021 at 3:14 AM Nikolay Borisov <nborisov@suse.com> wrote:
> >
> >> I think I identified a race that could cause the crash; can you apply the
> >> following diff, re-run the tests, and leave them running for a couple of days?
> >> Preferably apply it on 5.4.10 so that there is the highest chance to reproduce:
> >>
> >> diff --git a/fs/btrfs/async-thread.c b/fs/btrfs/async-thread.c
> >> index 309516e6a968..a3d788dcbd34 100644
> >> --- a/fs/btrfs/async-thread.c
> >> +++ b/fs/btrfs/async-thread.c
> >> @@ -234,6 +234,11 @@ static void run_ordered_work(struct __btrfs_workqueue *wq,
> >> ordered_list);
> >> if (!test_bit(WORK_DONE_BIT, &work->flags))
> >> break;
> >> + /*
> >> + * Orders all subsequent loads after WORK_DONE_BIT, paired with
> >> + * the smp_mb__before_atomic in btrfs_work_helper
> >> + */
> >> + smp_rmb();
> >>
> >> /*
> >> * we are going to call the ordered done function, but
> >> @@ -317,6 +322,12 @@ static void btrfs_work_helper(struct work_struct *normal_work)
> >> thresh_exec_hook(wq);
> >> work->func(work);
> >> if (need_order) {
> >> + /*
> >> + * Ensures all вритес done in ->func are ordered before
> >> + * setting the WORK_DONE_BIT making them visible to ordered
> >> + * func
> >> + */
> >> + smp_mb__before_atomic();
> >> set_bit(WORK_DONE_BIT, &work->flags);
> >> run_ordered_work(wq, work);
> >> } else {
> >>
> >
> > So far this appears to be working well - thanks!
> > https://bugzilla.redhat.com/show_bug.cgi?id=2011928#c54
>
> Great, but due to the nature of the bug I'd rather wait at least until
> the beginning of next week before sending an official patch so that this
> can be tested more. In your comment you state 3/3 kernel debug info
> installs and 6/6 libreoffice installs; how do those numbers compare
> without the fix?
More than 1/2 of the time there'd be an indefinite hang. Perhaps 1/3
of those would result in a call trace.
--
Chris Murphy
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
2021-11-02 14:23 ` Chris Murphy
@ 2021-11-02 14:25 ` Nikolay Borisov
2021-11-05 16:12 ` Chris Murphy
0 siblings, 1 reply; 62+ messages in thread
From: Nikolay Borisov @ 2021-11-02 14:25 UTC (permalink / raw)
To: Chris Murphy; +Cc: Su Yue, Qu Wenruo, Qu Wenruo, Btrfs BTRFS
On 2.11.21 г. 16:23, Chris Murphy wrote:
> On Thu, Oct 28, 2021 at 1:36 AM Nikolay Borisov <nborisov@suse.com> wrote:
<snip>
>>>
>>> So far this appears to be working well - thanks!
>>> https://bugzilla.redhat.com/show_bug.cgi?id=2011928#c54
>>
>> Great, but due to the nature of the bug I'd rather wait at least until
>> the beginning of next week before sending an official patch so that this
>> can be tested more. In your comment you state 3/3 kernel debug info
>> installs and 6/6 libreoffice installs; how do those numbers compare
>> without the fix?
>
> More than 1/2 of the time there'd be an indefinite hang. Perhaps 1/3
> of those would result in a call trace.
As you might have seen, I did send a proper patch; if you've continued
testing it over the weekend and still haven't encountered an issue you
can reply with a Tested-by to the patch.
>
>
>
> --
> Chris Murphy
>
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
2021-11-02 14:25 ` Nikolay Borisov
@ 2021-11-05 16:12 ` Chris Murphy
2021-11-07 9:11 ` Nikolay Borisov
0 siblings, 1 reply; 62+ messages in thread
From: Chris Murphy @ 2021-11-05 16:12 UTC (permalink / raw)
To: Nikolay Borisov; +Cc: Chris Murphy, Su Yue, Qu Wenruo, Qu Wenruo, Btrfs BTRFS
On Tue, Nov 2, 2021 at 10:25 AM Nikolay Borisov <nborisov@suse.com> wrote:
>
>
>
> On 2.11.21 г. 16:23, Chris Murphy wrote:
> > On Thu, Oct 28, 2021 at 1:36 AM Nikolay Borisov <nborisov@suse.com> wrote:
>
> <snip>
>
> >>>
> >>> So far this appears to be working well - thanks!
> >>> https://bugzilla.redhat.com/show_bug.cgi?id=2011928#c54
> >>
> >> Great, but due to the nature of the bug I'd rather wait at least until
> >> the beginning of next week before sending an official patch so that this
> >> can be tested more. In your comment you state 3/3 kernel debug info
> >> installs and 6/6 libreoffice installs; how do those numbers compare
> >> without the fix?
> >
> > More than 1/2 of the time there'd be an indefinite hang. Perhaps 1/3
> > of those would result in a call trace.
>
> As you might have seen, I did send a proper patch; if you've continued
> testing it over the weekend and still haven't encountered an issue you
> can reply with a Tested-by to the patch.
Did that.
Also, I just noticed the downstream bug comment that another tester
has run the original patch for several days and can't reproduce the
problem.
But the side note is that without the patch, they were experiencing
file system corruption, i.e. it would not mount following the crash.
Let me know if it's worth asking the tester for the mount-time failure
kernel messages, or a btrfs check of the corrupted filesystem. I guess
this race is expected to never manifest on x86?
https://bugzilla.redhat.com/show_bug.cgi?id=2011928#c55
--
Chris Murphy
^ permalink raw reply [flat|nested] 62+ messages in thread
* Re: 5.14.9 aarch64 OOPS Workqueue: btrfs-delalloc btrfs_work_helper
2021-11-05 16:12 ` Chris Murphy
@ 2021-11-07 9:11 ` Nikolay Borisov
0 siblings, 0 replies; 62+ messages in thread
From: Nikolay Borisov @ 2021-11-07 9:11 UTC (permalink / raw)
To: Chris Murphy; +Cc: Su Yue, Qu Wenruo, Qu Wenruo, Btrfs BTRFS
On 5.11.21 г. 18:12, Chris Murphy wrote:
> On Tue, Nov 2, 2021 at 10:25 AM Nikolay Borisov <nborisov@suse.com> wrote:
>>
>>
>>
>> On 2.11.21 г. 16:23, Chris Murphy wrote:
>>> On Thu, Oct 28, 2021 at 1:36 AM Nikolay Borisov <nborisov@suse.com> wrote:
>>
>> <snip>
>>
>>>>>
>>>>> So far this appears to be working well - thanks!
>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=2011928#c54
>>>>
>>>> Great, but due to the nature of the bug I'd rather wait at least until
>>>> the beginning of next week before sending an official patch so that this
>>>> can be tested more. In your comment you state 3/3 kernel debug info
>>>> installs and 6/6 libreoffice installs; how do those numbers compare
>>>> without the fix?
>>>
>>> More than 1/2 of the time there'd be an indefinite hang. Perhaps 1/3
>>> of those would result in a call trace.
>>
>> As you might have seen, I did send a proper patch; if you've continued
>> testing it over the weekend and still haven't encountered an issue you
>> can reply with a Tested-by to the patch.
>
> Did that.
>
> Also, I just noticed the downstream bug comment that another tester
> has run the original patch for several days and can't reproduce the
> problem.
>
> But the side note is that without the patch, they were experiencing
> file system corruption, i.e. it would not mount following the crash.
> Let me know if it's worth asking the tester for the mount-time failure
> kernel messages, or a btrfs check of the corrupted filesystem. I guess
Sure, let's see if there's anything else stemming from this.
> this race is expected to never manifest on x86?
Yes, x86 is strongly ordered, so it won't need the barriers; hence the
issue doesn't exist there.
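
For illustration, a hedged sketch of the difference (the inode field here
stands in for whatever ->func publishes; simplified, not verbatim code):

	/* Without the barriers, arm64 may make these two stores visible
	 * to other CPUs in either order: */
	work->inode = inode;                  /* store A: set up by ->func */
	set_bit(WORK_DONE_BIT, &work->flags); /* store B: the "done" flag */

	/* A reader that observes store B can then still load a stale
	 * (e.g. NULL) value for store A. On x86, set_bit() compiles to a
	 * lock-prefixed RMW, which is a full memory barrier, so store A
	 * is always visible before store B and the added
	 * smp_mb__before_atomic() costs nothing at run time there. */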
> https://bugzilla.redhat.com/show_bug.cgi?id=2011928#c55
>
>
>
^ permalink raw reply [flat|nested] 62+ messages in thread