* [Cluster-devel] FS/DLM module triggered kernel BUG
@ 2021-08-23  5:42 Gang He
  2021-08-23 13:49 ` Alexander Aring
From: Gang He @ 2021-08-23  5:42 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hello Guys,

I am using kernel 5.13.8, and I sometimes encounter a kernel BUG triggered by the dlm module.
Since this dlm kernel module is not built from the latest source code, I am not sure whether the problem has already been fixed upstream.

The backtrace is as below,

[Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: remove member 172204615
[Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_members 2 nodes
[Fri Aug 20 16:24:14 2021] dlm: connection 000000005ef82293 got EOF from 172204615
[Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: generation 4 slots 2 1:172204786 2:172204748
[Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_directory
[Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_directory 8 in 1 new
[Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_directory 1 out 1 messages
[Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_masters
[Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_masters 33587 of 33599
[Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_locks 0 out
[Fri Aug 20 16:24:14 2021] BUG: unable to handle page fault for address: ffffdd99ffd16650
[Fri Aug 20 16:24:14 2021] #PF: supervisor write access in kernel mode
[Fri Aug 20 16:24:14 2021] #PF: error_code(0x0002) - not-present page
[Fri Aug 20 16:24:14 2021] PGD 1040067 P4D 1040067 PUD 19c3067 PMD 19c4067 PTE 0
[Fri Aug 20 16:24:14 2021] Oops: 0002 [#1] SMP PTI
[Fri Aug 20 16:24:14 2021] CPU: 1 PID: 25221 Comm: kworker/u4:1 Tainted: G        W         5.13.8-1-default #1 openSUSE Tumbleweed
[Fri Aug 20 16:24:14 2021] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
[Fri Aug 20 16:24:14 2021] Workqueue: dlm_recv process_recv_sockets [dlm]
[Fri Aug 20 16:24:14 2021] RIP: 0010:__srcu_read_unlock+0x15/0x20
[Fri Aug 20 16:24:14 2021] Code: 01 65 48 ff 04 c2 f0 83 44 24 fc 00 44 89 c0 c3 0f 1f 44 00 00 0f 1f 44 00 00 f0 83 44 24 fc 00 48 8b 87 e8 0c 00 00 48 63 f6 <65> 48 ff 44 f0 10 c3 0f 1f 40 00 0f 1f 44 00 00 41 54 49 89 fc 55
[Fri Aug 20 16:24:14 2021] RSP: 0018:ffffbd9a041ebd80 EFLAGS: 00010282
[Fri Aug 20 16:24:14 2021] RAX: 00003cc9c100ec00 RBX: 00000000000000dc RCX: 0000000000000830
[Fri Aug 20 16:24:14 2021] RDX: 0000000000000000 RSI: 0000000000000f48 RDI: ffffffffc06b4420
[Fri Aug 20 16:24:14 2021] RBP: ffffa0d028423974 R08: 0000000000000001 R09: 0000000000000004
[Fri Aug 20 16:24:14 2021] R10: 0000000000000000 R11: 0000000000000000 R12: ffffa0d028425000
[Fri Aug 20 16:24:14 2021] R13: 000000000a43a2f2 R14: ffffa0d028425770 R15: 000000000a43a2f2
[Fri Aug 20 16:24:14 2021] FS:  0000000000000000(0000) GS:ffffa0d03ed00000(0000) knlGS:0000000000000000
[Fri Aug 20 16:24:14 2021] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Fri Aug 20 16:24:14 2021] CR2: ffffdd99ffd16650 CR3: 0000000002696000 CR4: 00000000000406e0
[Fri Aug 20 16:24:14 2021] Call Trace:
[Fri Aug 20 16:24:14 2021]  dlm_receive_buffer+0x66/0x150 [dlm]
[Fri Aug 20 16:24:14 2021]  dlm_process_incoming_buffer+0x38/0x90 [dlm]
[Fri Aug 20 16:24:14 2021]  receive_from_sock+0xd4/0x1f0 [dlm]
[Fri Aug 20 16:24:14 2021]  process_recv_sockets+0x1a/0x20 [dlm]
[Fri Aug 20 16:24:14 2021]  process_one_work+0x1df/0x370
[Fri Aug 20 16:24:14 2021]  worker_thread+0x50/0x400
[Fri Aug 20 16:24:14 2021]  ? process_one_work+0x370/0x370
[Fri Aug 20 16:24:14 2021]  kthread+0x127/0x150
[Fri Aug 20 16:24:14 2021]  ? set_kthread_struct+0x40/0x40
[Fri Aug 20 16:24:14 2021]  ret_from_fork+0x22/0x30
[Fri Aug 20 16:24:14 2021] Modules linked in: rdma_ucm ib_uverbs rdma_cm iw_cm ib_cm ib_core ocfs2_stack_user ocfs2 ocfs2_nodemanager ocfs2_stackglue quota_tree dlm af_packet iscsi_ibft iscsi_boot_sysfs rfkill intel_rapl_msr hid_generic intel_rapl_common usbhid virtio_net pcspkr joydev net_failover virtio_balloon i2c_piix4 failover tiny_power_button button fuse configfs crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ata_generic uhci_hcd ehci_pci ehci_hcd cirrus drm_kms_helper aesni_intel usbcore crypto_simd syscopyarea sysfillrect sysimgblt fb_sys_fops cec cryptd rc_core drm serio_raw i6300esb virtio_blk ata_piix floppy qemu_fw_cfg btrfs blake2b_generic libcrc32c crc32c_intel xor raid6_pq sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua virtio_rng
[Fri Aug 20 16:24:14 2021] CR2: ffffdd99ffd16650
[Fri Aug 20 16:24:14 2021] ---[ end trace 2ddfa38b9d824d93 ]---
[Fri Aug 20 16:24:14 2021] RIP: 0010:__srcu_read_unlock+0x15/0x20
[Fri Aug 20 16:24:14 2021] Code: 01 65 48 ff 04 c2 f0 83 44 24 fc 00 44 89 c0 c3 0f 1f 44 00 00 0f 1f 44 00 00 f0 83 44 24 fc 00 48 8b 87 e8 0c 00 00 48 63 f6 <65> 48 ff 44 f0 10 c3 0f 1f 40 00 0f 1f 44 00 00 41 54 49 89 fc 55
[Fri Aug 20 16:24:14 2021] RSP: 0018:ffffbd9a041ebd80 EFLAGS: 00010282
[Fri Aug 20 16:24:14 2021] RAX: 00003cc9c100ec00 RBX: 00000000000000dc RCX: 0000000000000830
[Fri Aug 20 16:24:14 2021] RDX: 0000000000000000 RSI: 0000000000000f48 RDI: ffffffffc06b4420
[Fri Aug 20 16:24:14 2021] RBP: ffffa0d028423974 R08: 0000000000000001 R09: 0000000000000004
[Fri Aug 20 16:24:14 2021] R10: 0000000000000000 R11: 0000000000000000 R12: ffffa0d028425000
[Fri Aug 20 16:24:14 2021] R13: 000000000a43a2f2 R14: ffffa0d028425770 R15: 000000000a43a2f2
[Fri Aug 20 16:24:14 2021] FS:  0000000000000000(0000) GS:ffffa0d03ed00000(0000) knlGS:0000000000000000
[Fri Aug 20 16:24:14 2021] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Fri Aug 20 16:24:14 2021] CR2: ffffdd99ffd16650 CR3: 0000000002696000 CR4: 00000000000406e0

Thanks
Gang





* [Cluster-devel] FS/DLM module triggered kernel BUG
  2021-08-23  5:42 [Cluster-devel] FS/DLM module triggered kernel BUG Gang He
@ 2021-08-23 13:49 ` Alexander Aring
  2021-08-24  5:36   ` Gang He
From: Alexander Aring @ 2021-08-23 13:49 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi Gang He,

On Mon, Aug 23, 2021 at 1:43 AM Gang He <GHe@suse.com> wrote:
>
> Hello Guys,
>
> I am using kernel 5.13.8, and I sometimes encounter a kernel BUG triggered by the dlm module.

What exactly are you doing when this happens? I would like to test it
on a recent upstream version, or can you do that?

> Since this dlm kernel module is not built from the latest source code, I am not sure whether the problem has already been fixed upstream.
>

could be, see below.

> The backtrace is as below,
>
> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: remove member 172204615
> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_members 2 nodes
> [Fri Aug 20 16:24:14 2021] dlm: connection 000000005ef82293 got EOF from 172204615

here we disconnect from nodeid 172204615.

> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: generation 4 slots 2 1:172204786 2:172204748
> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_directory
> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_directory 8 in 1 new
> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_directory 1 out 1 messages
> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_masters
> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_masters 33587 of 33599
> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_locks 0 out
> [Fri Aug 20 16:24:14 2021] BUG: unable to handle page fault for address: ffffdd99ffd16650
> [Fri Aug 20 16:24:14 2021] #PF: supervisor write access in kernel mode
> [Fri Aug 20 16:24:14 2021] #PF: error_code(0x0002) - not-present page
> [Fri Aug 20 16:24:14 2021] PGD 1040067 P4D 1040067 PUD 19c3067 PMD 19c4067 PTE 0
> [Fri Aug 20 16:24:14 2021] Oops: 0002 [#1] SMP PTI
> [Fri Aug 20 16:24:14 2021] CPU: 1 PID: 25221 Comm: kworker/u4:1 Tainted: G        W         5.13.8-1-default #1 openSUSE Tumbleweed
> [Fri Aug 20 16:24:14 2021] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
> [Fri Aug 20 16:24:14 2021] Workqueue: dlm_recv process_recv_sockets [dlm]
> [Fri Aug 20 16:24:14 2021] RIP: 0010:__srcu_read_unlock+0x15/0x20
> [Fri Aug 20 16:24:14 2021] Code: 01 65 48 ff 04 c2 f0 83 44 24 fc 00 44 89 c0 c3 0f 1f 44 00 00 0f 1f 44 00 00 f0 83 44 24 fc 00 48 8b 87 e8 0c 00 00 48 63 f6 <65> 48 ff 44 f0 10 c3 0f 1f 40 00 0f 1f 44 00 00 41 54 49 89 fc 55
> [Fri Aug 20 16:24:14 2021] RSP: 0018:ffffbd9a041ebd80 EFLAGS: 00010282
> [Fri Aug 20 16:24:14 2021] RAX: 00003cc9c100ec00 RBX: 00000000000000dc RCX: 0000000000000830
> [Fri Aug 20 16:24:14 2021] RDX: 0000000000000000 RSI: 0000000000000f48 RDI: ffffffffc06b4420
> [Fri Aug 20 16:24:14 2021] RBP: ffffa0d028423974 R08: 0000000000000001 R09: 0000000000000004
> [Fri Aug 20 16:24:14 2021] R10: 0000000000000000 R11: 0000000000000000 R12: ffffa0d028425000
> [Fri Aug 20 16:24:14 2021] R13: 000000000a43a2f2 R14: ffffa0d028425770 R15: 000000000a43a2f2
> [Fri Aug 20 16:24:14 2021] FS:  0000000000000000(0000) GS:ffffa0d03ed00000(0000) knlGS:0000000000000000
> [Fri Aug 20 16:24:14 2021] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [Fri Aug 20 16:24:14 2021] CR2: ffffdd99ffd16650 CR3: 0000000002696000 CR4: 00000000000406e0
> [Fri Aug 20 16:24:14 2021] Call Trace:
> [Fri Aug 20 16:24:14 2021]  dlm_receive_buffer+0x66/0x150 [dlm]

It would be interesting to know whether the message being processed
here came from nodeid 172204615; I think that is what happened. There
may be a use-after-free going on, since we should not be receiving any
more messages from nodeid 172204615.
I recently added some dlm tracing infrastructure. It should be simple
to add a trace event here, print out the nodeid, and compare
timestamps.
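
For illustration, such a trace event could look roughly like the
following. This is only a minimal sketch: the event name and fields are
made up here and are not necessarily what the actual dlm tracing
patches define.

#undef TRACE_SYSTEM
#define TRACE_SYSTEM dlm

#if !defined(_TRACE_DLM_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_DLM_H

#include <linux/tracepoint.h>

/* fires once per received dlm message and records the sending nodeid */
TRACE_EVENT(dlm_recv_message,

	TP_PROTO(int nodeid),

	TP_ARGS(nodeid),

	TP_STRUCT__entry(
		__field(int, nodeid)
	),

	TP_fast_assign(
		__entry->nodeid = nodeid;
	),

	TP_printk("nodeid=%d", __entry->nodeid)
);

#endif /* _TRACE_DLM_H */

/* this part must be outside the include guard */
#include <trace/define_trace.h>

The call site would then be a one-line trace_dlm_recv_message(nodeid)
at the top of dlm_receive_buffer(), and its timestamps could be
compared against the "got EOF" message above.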

I recently fixed a synchronization issue which is not part of kernel
5.13.8 and is probably related to what you are seeing here.
There is a workaround, which is also a simple test of whether this
really affects you: create a dummy lockspace on all nodes, so that we
never actually disconnect, and check whether you run into this issue
again.
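
As a concrete example, a small libdlm program along the following
lines, run on every node, should be enough to hold such a dummy
lockspace open. This is only a rough sketch under the assumption that
libdlm and dlm_controld are set up; the lockspace name is arbitrary,
and "dlm_tool join"/"dlm_tool leave" should work just as well.

/* build with: gcc -o dummy_ls dummy_ls.c -ldlm   (run as root) */
#include <stdio.h>
#include <unistd.h>
#include <libdlm.h>

int main(void)
{
	dlm_lshandle_t ls;

	/* creating/joining the lockspace keeps this node a dlm member,
	 * so the lowcomms connections stay up while it exists */
	ls = dlm_create_lockspace("dummy_keepalive", 0600);
	if (!ls) {
		perror("dlm_create_lockspace");
		return 1;
	}

	pause();	/* hold the lockspace open until the process is killed */

	dlm_release_lockspace("dummy_keepalive", ls, 1);
	return 0;
}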

> [Fri Aug 20 16:24:14 2021]  dlm_process_incoming_buffer+0x38/0x90 [dlm]
> [Fri Aug 20 16:24:14 2021]  receive_from_sock+0xd4/0x1f0 [dlm]
> [Fri Aug 20 16:24:14 2021]  process_recv_sockets+0x1a/0x20 [dlm]
> [Fri Aug 20 16:24:14 2021]  process_one_work+0x1df/0x370
> [Fri Aug 20 16:24:14 2021]  worker_thread+0x50/0x400
> [Fri Aug 20 16:24:14 2021]  ? process_one_work+0x370/0x370
> [Fri Aug 20 16:24:14 2021]  kthread+0x127/0x150
> [Fri Aug 20 16:24:14 2021]  ? set_kthread_struct+0x40/0x40
> [Fri Aug 20 16:24:14 2021]  ret_from_fork+0x22/0x30

- Alex




* [Cluster-devel] FS/DLM module triggered kernel BUG
  2021-08-23 13:49 ` Alexander Aring
@ 2021-08-24  5:36   ` Gang He
  2021-08-24 14:18     ` Alexander Aring
From: Gang He @ 2021-08-24  5:36 UTC (permalink / raw)
  To: cluster-devel.redhat.com



On 2021/8/23 21:49, Alexander Aring wrote:
> Hi Gang He,
> 
> On Mon, Aug 23, 2021 at 1:43 AM Gang He <GHe@suse.com> wrote:
>>
>> Hello Guys,
>>
>> I am using kernel 5.13.8, and I sometimes encounter a kernel BUG triggered by the dlm module.
> 
> What exactly are you doing when this happens? I would like to test it
> on a recent upstream version, or can you do that?
I am not specifically testing the dlm kernel module.
I am doing ocfs2-related testing on openSUSE Tumbleweed, which ships a
very recent kernel version.
But sometimes the ocfs2 test cases are blocked or aborted due to this
DLM problem.

> 
>> Since this dlm kernel module is not built from the latest source code, I am not sure whether the problem has already been fixed upstream.
>>
> 
> could be, see below.
> 
>> The backtrace is as below,
>>
>> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: remove member 172204615
>> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_members 2 nodes
>> [Fri Aug 20 16:24:14 2021] dlm: connection 000000005ef82293 got EOF from 172204615
> 
> here we disconnect from nodeid 172204615.
> 
>> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: generation 4 slots 2 1:172204786 2:172204748
>> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_directory
>> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_directory 8 in 1 new
>> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_directory 1 out 1 messages
>> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_masters
>> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_masters 33587 of 33599
>> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_locks 0 out
>> [Fri Aug 20 16:24:14 2021] BUG: unable to handle page fault for address: ffffdd99ffd16650
>> [Fri Aug 20 16:24:14 2021] #PF: supervisor write access in kernel mode
>> [Fri Aug 20 16:24:14 2021] #PF: error_code(0x0002) - not-present page
>> [Fri Aug 20 16:24:14 2021] PGD 1040067 P4D 1040067 PUD 19c3067 PMD 19c4067 PTE 0
>> [Fri Aug 20 16:24:14 2021] Oops: 0002 [#1] SMP PTI
>> [Fri Aug 20 16:24:14 2021] CPU: 1 PID: 25221 Comm: kworker/u4:1 Tainted: G        W         5.13.8-1-default #1 openSUSE Tumbleweed
>> [Fri Aug 20 16:24:14 2021] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
>> [Fri Aug 20 16:24:14 2021] Workqueue: dlm_recv process_recv_sockets [dlm]
>> [Fri Aug 20 16:24:14 2021] RIP: 0010:__srcu_read_unlock+0x15/0x20
>> [Fri Aug 20 16:24:14 2021] Code: 01 65 48 ff 04 c2 f0 83 44 24 fc 00 44 89 c0 c3 0f 1f 44 00 00 0f 1f 44 00 00 f0 83 44 24 fc 00 48 8b 87 e8 0c 00 00 48 63 f6 <65> 48 ff 44 f0 10 c3 0f 1f 40 00 0f 1f 44 00 00 41 54 49 89 fc 55
>> [Fri Aug 20 16:24:14 2021] RSP: 0018:ffffbd9a041ebd80 EFLAGS: 00010282
>> [Fri Aug 20 16:24:14 2021] RAX: 00003cc9c100ec00 RBX: 00000000000000dc RCX: 0000000000000830
>> [Fri Aug 20 16:24:14 2021] RDX: 0000000000000000 RSI: 0000000000000f48 RDI: ffffffffc06b4420
>> [Fri Aug 20 16:24:14 2021] RBP: ffffa0d028423974 R08: 0000000000000001 R09: 0000000000000004
>> [Fri Aug 20 16:24:14 2021] R10: 0000000000000000 R11: 0000000000000000 R12: ffffa0d028425000
>> [Fri Aug 20 16:24:14 2021] R13: 000000000a43a2f2 R14: ffffa0d028425770 R15: 000000000a43a2f2
>> [Fri Aug 20 16:24:14 2021] FS:  0000000000000000(0000) GS:ffffa0d03ed00000(0000) knlGS:0000000000000000
>> [Fri Aug 20 16:24:14 2021] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [Fri Aug 20 16:24:14 2021] CR2: ffffdd99ffd16650 CR3: 0000000002696000 CR4: 00000000000406e0
>> [Fri Aug 20 16:24:14 2021] Call Trace:
>> [Fri Aug 20 16:24:14 2021]  dlm_receive_buffer+0x66/0x150 [dlm]
> 
> It would be interesting to know whether the message being processed
> here came from nodeid 172204615; I think that is what happened. There
> may be a use-after-free going on, since we should not be receiving any
> more messages from nodeid 172204615.
> I recently added some dlm tracing infrastructure. It should be simple
> to add a trace event here, print out the nodeid, and compare
> timestamps.
> 
> I recently fixed a synchronization issue which is not part of kernel
> 5.13.8 and is probably related to what you are seeing here.
> There is a workaround, which is also a simple test of whether this
> really affects you: create a dummy lockspace on all nodes, so that we
> never actually disconnect, and check whether you run into this issue
> again.
Which git commit is that? I do not want to see any kernel (warning)
prints from the DLM kernel module; sometimes DLM enters a stuck state
after such a print.
Since there have been a few commits in the past weeks, I just wonder
whether there is a regression.

Thanks
Gang


> 
>> [Fri Aug 20 16:24:14 2021]  dlm_process_incoming_buffer+0x38/0x90 [dlm]
>> [Fri Aug 20 16:24:14 2021]  receive_from_sock+0xd4/0x1f0 [dlm]
>> [Fri Aug 20 16:24:14 2021]  process_recv_sockets+0x1a/0x20 [dlm]
>> [Fri Aug 20 16:24:14 2021]  process_one_work+0x1df/0x370
>> [Fri Aug 20 16:24:14 2021]  worker_thread+0x50/0x400
>> [Fri Aug 20 16:24:14 2021]  ? process_one_work+0x370/0x370
>> [Fri Aug 20 16:24:14 2021]  kthread+0x127/0x150
>> [Fri Aug 20 16:24:14 2021]  ? set_kthread_struct+0x40/0x40
>> [Fri Aug 20 16:24:14 2021]  ret_from_fork+0x22/0x30
> 
> - Alex
> 




* [Cluster-devel] FS/DLM module triggered kernel BUG
  2021-08-24  5:36   ` Gang He
@ 2021-08-24 14:18     ` Alexander Aring
  2021-08-24 20:31       ` Alexander Aring
From: Alexander Aring @ 2021-08-24 14:18 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi Gang He,

On Tue, Aug 24, 2021 at 1:36 AM Gang He <ghe@suse.com> wrote:
>
>
>
> On 2021/8/23 21:49, Alexander Aring wrote:
> > Hi Gang He,
> >
> > On Mon, Aug 23, 2021 at 1:43 AM Gang He <GHe@suse.com> wrote:
> >>
> >> Hello Guys,
> >>
> >> I am using kernel 5.13.8, and I sometimes encounter a kernel BUG triggered by the dlm module.
> >
> > What exactly are you doing when this happens? I would like to test it
> > on a recent upstream version, or can you do that?
> I am not specifically testing the dlm kernel module.
> I am doing ocfs2-related testing on openSUSE Tumbleweed, which ships a
> very recent kernel version.
> But sometimes the ocfs2 test cases are blocked or aborted due to this
> DLM problem.
>

I see. What is the ocfs2 test trying to do? Maybe I will be able to
reproduce it.

> >
> >> Since this dlm kernel module is not built from the latest source code, I am not sure whether the problem has already been fixed upstream.
> >>
> >
> > could be, see below.
> >
> >> The backtrace is as below,
> >>
> >> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: remove member 172204615
> >> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_members 2 nodes
> >> [Fri Aug 20 16:24:14 2021] dlm: connection 000000005ef82293 got EOF from 172204615
> >
> > here we disconnect from nodeid 172204615.
> >
> >> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: generation 4 slots 2 1:172204786 2:172204748
> >> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_directory
> >> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_directory 8 in 1 new
> >> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_directory 1 out 1 messages
> >> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_masters
> >> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_masters 33587 of 33599
> >> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_locks 0 out
> >> [Fri Aug 20 16:24:14 2021] BUG: unable to handle page fault for address: ffffdd99ffd16650
> >> [Fri Aug 20 16:24:14 2021] #PF: supervisor write access in kernel mode
> >> [Fri Aug 20 16:24:14 2021] #PF: error_code(0x0002) - not-present page
> >> [Fri Aug 20 16:24:14 2021] PGD 1040067 P4D 1040067 PUD 19c3067 PMD 19c4067 PTE 0
> >> [Fri Aug 20 16:24:14 2021] Oops: 0002 [#1] SMP PTI
> >> [Fri Aug 20 16:24:14 2021] CPU: 1 PID: 25221 Comm: kworker/u4:1 Tainted: G        W         5.13.8-1-default #1 openSUSE Tumbleweed
> >> [Fri Aug 20 16:24:14 2021] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
> >> [Fri Aug 20 16:24:14 2021] Workqueue: dlm_recv process_recv_sockets [dlm]
> >> [Fri Aug 20 16:24:14 2021] RIP: 0010:__srcu_read_unlock+0x15/0x20
> >> [Fri Aug 20 16:24:14 2021] Code: 01 65 48 ff 04 c2 f0 83 44 24 fc 00 44 89 c0 c3 0f 1f 44 00 00 0f 1f 44 00 00 f0 83 44 24 fc 00 48 8b 87 e8 0c 00 00 48 63 f6 <65> 48 ff 44 f0 10 c3 0f 1f 40 00 0f 1f 44 00 00 41 54 49 89 fc 55
> >> [Fri Aug 20 16:24:14 2021] RSP: 0018:ffffbd9a041ebd80 EFLAGS: 00010282
> >> [Fri Aug 20 16:24:14 2021] RAX: 00003cc9c100ec00 RBX: 00000000000000dc RCX: 0000000000000830
> >> [Fri Aug 20 16:24:14 2021] RDX: 0000000000000000 RSI: 0000000000000f48 RDI: ffffffffc06b4420
> >> [Fri Aug 20 16:24:14 2021] RBP: ffffa0d028423974 R08: 0000000000000001 R09: 0000000000000004
> >> [Fri Aug 20 16:24:14 2021] R10: 0000000000000000 R11: 0000000000000000 R12: ffffa0d028425000
> >> [Fri Aug 20 16:24:14 2021] R13: 000000000a43a2f2 R14: ffffa0d028425770 R15: 000000000a43a2f2
> >> [Fri Aug 20 16:24:14 2021] FS:  0000000000000000(0000) GS:ffffa0d03ed00000(0000) knlGS:0000000000000000
> >> [Fri Aug 20 16:24:14 2021] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> [Fri Aug 20 16:24:14 2021] CR2: ffffdd99ffd16650 CR3: 0000000002696000 CR4: 00000000000406e0
> >> [Fri Aug 20 16:24:14 2021] Call Trace:
> >> [Fri Aug 20 16:24:14 2021]  dlm_receive_buffer+0x66/0x150 [dlm]
> >
> > It would be interesting to know whether the message being processed
> > here came from nodeid 172204615; I think that is what happened. There
> > may be a use-after-free going on, since we should not be receiving any
> > more messages from nodeid 172204615.
> > I recently added some dlm tracing infrastructure. It should be simple
> > to add a trace event here, print out the nodeid, and compare
> > timestamps.
> >
> > I recently fixed a synchronization issue which is not part of kernel
> > 5.13.8 and is probably related to what you are seeing here.
> > There is a workaround, which is also a simple test of whether this
> > really affects you: create a dummy lockspace on all nodes, so that we
> > never actually disconnect, and check whether you run into this issue
> > again.
> Which git commit is that? I do not want to see any kernel (warning)
> prints from the DLM kernel module; sometimes DLM enters a stuck state
> after such a print.

Yes, a BUG() usually leaves the kernel in an unusable state. It is not
a warning; it is a serious issue.

> Since there have been a few commits in the past weeks, I just wonder
> whether there is a regression.
>

Possible, or they made an existing issue more likely to hit than
before. It would be useful to know which nodeid the message currently
being parsed by dlm_receive_buffer() came from.
As I said, I experienced issues myself with dlm connection termination
which should be fixed in recent upstream. You could try these
commands:

git cherry-pick 8aa31cbf20ad..7d3848c03e09
git cherry-pick 700ab1c363c7..957adb68b3f7
git cherry-pick 2df6b7627a81

These should sync the dlm code in v5.13.8 up to mainline. It would be
very interesting to know whether you still experience such problems
with those patches. Please report back.

Normally (and I think in your case as well) you have lvmlockd running
in the background all the time, which leaves lockspaces such as
lvm_global and lvm_$VG open. In my opinion that is one reason why most
users did not hit issues in this area before. Please also test without
lvmlockd running.

- Alex




* [Cluster-devel] FS/DLM module triggered kernel BUG
  2021-08-24 14:18     ` Alexander Aring
@ 2021-08-24 20:31       ` Alexander Aring
From: Alexander Aring @ 2021-08-24 20:31 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi Gang He,

On Tue, Aug 24, 2021 at 10:18 AM Alexander Aring <aahringo@redhat.com> wrote:
...
> > >
> > >> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: generation 4 slots 2 1:172204786 2:172204748
> > >> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_directory
> > >> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_directory 8 in 1 new
> > >> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_directory 1 out 1 messages
> > >> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_masters
> > >> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_masters 33587 of 33599
> > >> [Fri Aug 20 16:24:14 2021] dlm: CEC5E19D749E473B99A0B792AD570441: dlm_recover_locks 0 out
> > >> [Fri Aug 20 16:24:14 2021] BUG: unable to handle page fault for address: ffffdd99ffd16650
> > >> [Fri Aug 20 16:24:14 2021] #PF: supervisor write access in kernel mode
> > >> [Fri Aug 20 16:24:14 2021] #PF: error_code(0x0002) - not-present page
> > >> [Fri Aug 20 16:24:14 2021] PGD 1040067 P4D 1040067 PUD 19c3067 PMD 19c4067 PTE 0
> > >> [Fri Aug 20 16:24:14 2021] Oops: 0002 [#1] SMP PTI
> > >> [Fri Aug 20 16:24:14 2021] CPU: 1 PID: 25221 Comm: kworker/u4:1 Tainted: G        W         5.13.8-1-default #1 openSUSE Tumbleweed
> > >> [Fri Aug 20 16:24:14 2021] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
> > >> [Fri Aug 20 16:24:14 2021] Workqueue: dlm_recv process_recv_sockets [dlm]
> > >> [Fri Aug 20 16:24:14 2021] RIP: 0010:__srcu_read_unlock+0x15/0x20

That was suspicious to me, so I looked into the v5.13.8 code again and
found an issue. I believe you are hitting an out-of-bounds array access
in __srcu_read_unlock() because some concurrent handling updated the
idx parameter, which became invalid at that moment. The idx handling
could be invalid in several other cases as well. It is fixed in the
current mainline kernel, but v5.13.8 is still broken. I will send a
patch marked as RFC for you. Please test it and report back; then I
will resend it for v5.13.8.
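
To make the failure mode clearer, the SRCU read-side pattern in
question looks roughly like this (a sketch with illustrative names,
not the actual fs/dlm/lowcomms.c code):

#include <linux/srcu.h>

static DEFINE_SRCU(example_srcu);

static void receive_path_example(void)
{
	int idx;

	/* srcu_read_lock() returns an index that must be passed back,
	 * unchanged, to the matching srcu_read_unlock() */
	idx = srcu_read_lock(&example_srcu);

	/* ... look up and use the srcu-protected connection here ... */

	/* if idx were instead kept in shared state and overwritten by a
	 * concurrent context, __srcu_read_unlock() would index its
	 * per-CPU counters with a bogus value; that is the kind of wild
	 * write the page fault above points at */
	srcu_read_unlock(&example_srcu, idx);
}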

- Alex



