* BUG: unable to handle kernel paging request
From: Igor Druzhinin @ 2017-09-28 16:33 UTC
  To: dm-devel, snitzer

Hi Mike,

I'm from XenServer engineering. We're currently using the 4.4 stable
kernel (4.4.52 at the moment) and intermittently seeing the following
issue with the dm-multipath target during our storage certification test:

[   2619.971820]    ERR: blk_update_request: I/O error, dev sdd, sector 39176
[   2619.971838]   WARN: device-mapper: multipath: Failing path 8:48.
[   2641.971739]    ERR:  connection158:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4295548436, last ping 4295549688, now 4295550940
[   2641.971786]   INFO:  connection158:0: detected conn error (1022)
[   2646.979770]   INFO:  session158: session recovery timed out after 5 secs
[   2646.979787]    ERR: sd 162:0:0:0: rejecting I/O to offline device
[   2646.979812]   INFO: sd 162:0:0:0: [sdc] killing request
[   2646.979841]   INFO: sd 162:0:0:0: [sdc] FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
[   2646.979848]   INFO: sd 162:0:0:0: [sdc] CDB: Write(10) 2a 00 00 00 99 08 00 08 00 00
[   2646.979852]    ERR: blk_update_request: I/O error, dev sdc, sector 39176
[   2646.979873]   WARN: device-mapper: multipath: Failing path 8:32.
[   2647.095324]  ALERT: BUG: unable to handle kernel paging request at ffffc90040154048
[   2647.095354]  ALERT: IP: [<ffffffffa03cb52c>] dm_softirq_done+0xcc/0x220 [dm_mod]
[   2647.095377]   WARN: PGD 2ee66067 PUD 29093067 PMD 29094067 PTE 0
[   2647.095391]   WARN: Oops: 0000 [#1] SMP 
[   2647.095400]   WARN: Modules linked in: dm_service_time bnx2i(O) iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi arc4 md4 cifs bnx2fc(O) cnic(O) uio fcoe libfcoe libfc scsi_transport_fc openvswitch nf_defrag_ipv6 libcrc32c 8021q garp mrp stp llc ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 xt_tcpudp xt_multiport xt_conntrack nf_conntrack iptable_filter dm_multipath ipmi_devintf dcdbas ipmi_si ipmi_msghandler dm_mod sg coretemp i7core_edac edac_core hed lpc_ich tpm_tis tpm mfd_core shpchp nls_utf8 isofs nfsd auth_rpcgss oid_registry nfs_acl lockd grace sunrpc ip_tables x_tables autofs4 sd_mod sr_mod cdrom hid_generic usbhid hid ata_generic pata_acpi ata_piix libata bnx2(O) mptsas scsi_transport_sas mptscsih mptbase ehci_pci ehci_hcd ixgbe(O) vxlan ip6_udp_tunnel udp_t
 unnel ptp pps_core scsi_dh_rdac
[   2647.095651]   WARN:  scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua scsi_mod ipv6
[   2647.095672]   WARN: CPU: 2 PID: 0 Comm: swapper/2 Tainted: G           O    4.4.0+10 #1
[   2647.095693]   WARN: Hardware name: Dell Inc. PowerEdge R310/0P229K, BIOS 1.0.0 02/11/2010
[   2647.095717]   WARN: task: ffff88002919d040 ti: ffff8800291a8000 task.ti: ffff8800291a8000
[   2647.095737]   WARN: RIP: e030:[<ffffffffa03cb52c>]  [<ffffffffa03cb52c>] dm_softirq_done+0xcc/0x220 [dm_mod]
[   2647.095768]   WARN: RSP: e02b:ffff88002d643d48  EFLAGS: 00010082
[   2647.095781]   WARN: RAX: 0000000000000001 RBX: 0000000000000000 RCX: ffff88002d640000
[   2647.095797]   WARN: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffc90040154040
[   2647.095814]   WARN: RBP: ffff88002d643d70 R08: 000000000000001c R09: 000000000000001c
[   2647.095829]   WARN: R10: 0000000000000009 R11: 0000000000007ffe R12: ffff880013ee0c60
[   2647.095844]   WARN: R13: ffff880013ee0bf0 R14: 0000000000000000 R15: ffff8800053a3e58
[   2647.095866]   WARN: FS:  00007fc029a14900(0000) GS:ffff88002d640000(0000) knlGS:0000000000000000
[   2647.095887]   WARN: CS:  e033 DS: 002b ES: 002b CR0: 000000008005003b
[   2647.095900]   WARN: CR2: ffffc90040154048 CR3: 0000000026ed4000 CR4: 0000000000002660
[   2647.095916]   WARN: Stack:
[   2647.095925]   WARN:  ffff880013ee0a80 ffffe8ffffa04a00 0000000000000002 0000000000000000
[   2647.095949]   WARN:  ffff8800053a3e58 ffff88002d643d98 ffffffff812fb9d1 0000000000000000
[   2647.095973]   WARN:  ffff880013ee0bf0 ffff880013e3e600 ffff88002d643da8 ffffffff812fba0c
[   2647.095996]   WARN: Call Trace:
[   2647.096007]   WARN:  <IRQ> 
[   2647.096019]   WARN:  [<ffffffff812fb9d1>] __blk_mq_complete_request+0xc1/0xe0
[   2647.096041]   WARN:  [<ffffffff812fba0c>] blk_mq_complete_request+0x1c/0x20
[   2647.096062]   WARN:  [<ffffffffa03cb195>] dm_complete_request+0x45/0x50 [dm_mod]
[   2647.096082]   WARN:  [<ffffffffa03cb1da>] end_clone_request+0x3a/0x40 [dm_mod]
[   2647.096099]   WARN:  [<ffffffff812f29a9>] blk_finish_request+0x99/0xd0
[   2647.096126]   WARN:  [<ffffffffa005d9c3>] scsi_end_request+0x153/0x1a0 [scsi_mod]
[   2647.096157]   WARN:  [<ffffffffa00607f9>] scsi_io_completion+0x239/0x610 [scsi_mod]
[   2647.096188]   WARN:  [<ffffffffa0057f98>] scsi_finish_command+0x108/0x120 [scsi_mod]
[   2647.096226]   WARN:  [<ffffffffa005ff08>] scsi_softirq_done+0x128/0x140 [scsi_mod]
[   2647.096248]   WARN:  [<ffffffff812f9888>] blk_done_softirq+0x78/0x90
[   2647.096272]   WARN:  [<ffffffff810761e9>] __do_softirq+0x129/0x290
[   2647.096295]   WARN:  [<ffffffff81076522>] irq_exit+0x42/0x90
[   2647.096315]   WARN:  [<ffffffff813c64d5>] xen_evtchn_do_upcall+0x35/0x50
[   2647.096335]   WARN:  [<ffffffff815a4dae>] xen_do_hypervisor_callback+0x1e/0x40
[   2647.096351]   WARN:  <EOI> 
[   2647.096365]   WARN:  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
[   2647.096393]   WARN:  [<ffffffff810013aa>] ? xen_hypercall_sched_op+0xa/0x20
[   2647.096412]   WARN:  [<ffffffff8100c460>] ? xen_safe_halt+0x10/0x20
[   2647.096433]   WARN:  [<ffffffff81020d67>] ? default_idle+0x57/0xf0
[   2647.096454]   WARN:  [<ffffffff8102149f>] ? arch_cpu_idle+0xf/0x20
[   2647.096474]   WARN:  [<ffffffff810aadb2>] ? default_idle_call+0x32/0x40
[   2647.096495]   WARN:  [<ffffffff810ab00c>] ? cpu_startup_entry+0x1ec/0x330
[   2647.096517]   WARN:  [<ffffffff81013c18>] ? cpu_bringup_and_idle+0x18/0x20
[   2647.096531]   WARN: Code: ff ff e9 67 01 00 00 48 8b 43 40 41 8b 5d 40 4d 8b ac 24 60 01 00 00 48 c1 e8 17 49 8b 7d 08 48 83 f0 01 83 e0 01 48 85 ff 74 1f <48> 8b 57 08 84 c0 4c 8b 42 60 74 13 4d 85 c0 74 0e 89 da 49 8d 
[   2647.096640]  ALERT: RIP  [<ffffffffa03cb52c>] dm_softirq_done+0xcc/0x220 [dm_mod]
[   2647.096662]   WARN:  RSP <ffff88002d643d48>
[   2647.096673]   WARN: CR2: ffffc90040154048
[   2647.096993]   WARN: ---[ end trace 450fe22bd6f6d100 ]---
[   2647.100979]  EMERG: Kernel panic - not syncing: Fatal exception in interrupt

This happens during a multipath failover stress test, and only when
multiqueue is enabled for DM. Preliminary analysis suggests some sort of
race condition: the completing request is accessing an already-reallocated
array of dm_target structures (the faulting address ffffc90040154048 is
in the vmalloc area, where the table's targets array is allocated).

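To make the suspected interleaving concrete, here is a contrived
user-space analogue of the race (illustrative only -- the names are
made-up stand-ins, not the real dm-core code paths):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Deliberate use-after-free; build with "cc -pthread race.c".  Depending
 * on scheduling it may print, crash, or pass silently -- which is exactly
 * what makes the real bug intermittent.
 */
struct target { void (*rq_end_io)(void); };	/* stand-in for struct dm_target */
struct tio    { struct target *ti; };		/* stand-in for struct dm_rq_target_io */

static struct tio inflight;			/* a clone completion still in flight */

static void end_io(void) { puts("rq_end_io"); }

static void *reload_table(void *unused)
{
	free(inflight.ti);			/* table reload frees the old targets array... */
	return NULL;
}

static void *complete_clone(void *unused)
{
	/* ...while the completion path still holds a stale target pointer.
	 * A NULL check does not help: the pointer is non-NULL, just dangling. */
	if (inflight.ti && inflight.ti->rq_end_io)
		inflight.ti->rq_end_io();
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	inflight.ti = malloc(sizeof(*inflight.ti));
	inflight.ti->rq_end_io = end_io;
	pthread_create(&t1, NULL, reload_table, NULL);
	pthread_create(&t2, NULL, complete_clone, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	return 0;
}
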
As I don't have much experience debugging this sort of problem in the
block layer, I'm stuck here. It's also complicated to test a modern
kernel in our environment, so my questions are:

Is this a known problem? And if not, what can we do to get to the root
cause of it?

Thanks,
Igor

* Re: BUG: unable to handle kernel paging request
From: Mike Snitzer @ 2017-09-28 17:56 UTC
  To: Igor Druzhinin; +Cc: dm-devel

On Thu, Sep 28 2017 at 12:33pm -0400,
Igor Druzhinin <igor.druzhinin@citrix.com> wrote:

> Hi Mike,
> 
> I'm from XenServer engineering. We're currently using the 4.4 stable
> kernel (4.4.52 at the moment) and intermittently seeing the following
> issue with the dm-multipath target during our storage certification test:
> 
> [... crash log snipped ...]
> 
> This happens during a multipath failover stress test, and only when
> multiqueue is enabled for DM. Preliminary analysis suggests some sort of
> race condition: the completing request is accessing an already-reallocated
> array of dm_target structures.
> 
> As I don't have much experience debugging this sort of problem in the
> block layer, I'm stuck here. It's also complicated to test a modern
> kernel in our environment, so my questions are:
> 
> Is this a known problem? And if not, what can we do to get to the root
> cause of it?

This was likely fixed with commit 61febef40bfe8ab68259d8545257686e8a0d91d1

* Re: BUG: unable to handle kernel paging request
From: Igor Druzhinin @ 2017-09-28 19:45 UTC
  To: Mike Snitzer; +Cc: dm-devel

On 28/09/17 18:56, Mike Snitzer wrote:
> On Thu, Sep 28 2017 at 12:33pm -0400,
> Igor Druzhinin <igor.druzhinin@citrix.com> wrote:
> 
>> [... original report and crash log snipped ...]
> 
> This was likely fixed with commit 61febef40bfe8ab68259d8545257686e8a0d91d1
> 

I doubt it. Its commit message says:
Fixes: eb8db831be80 ("dm: always defer request allocation to the owner of the request_queue")
and eb8db831be80 is not present in 4.4.

And the crash happens in the other code branch (clone completion):

static void dm_done(struct request *clone, int error, bool mapped)
{
	int r = error;
	struct dm_rq_target_io *tio = clone->end_io_data;
	dm_request_endio_fn rq_end_io = NULL;

	if (tio->ti) {
		rq_end_io = tio->ti->type->rq_end_io; <=== crash while dereferencing tio->ti

		if (mapped && rq_end_io)
			r = rq_end_io(tio->ti, clone, error, &tio->info);
	}
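
Note that the NULL check in dm_done() cannot catch this: tio->ti is
non-NULL, it just points at freed memory. Whatever the proper fix is, it
presumably has to guarantee that the targets array outlives every
in-flight clone. A minimal sketch of one such scheme, reference-counting
the table (made-up names, not a claim about how dm core actually
implements it):

#include <stdatomic.h>
#include <stdlib.h>

struct target;			/* stand-in for struct dm_target */

struct table {
	atomic_int refs;	/* one ref per in-flight clone + one for the live map */
	struct target *targets;
};

static void table_get(struct table *t)
{
	atomic_fetch_add(&t->refs, 1);
}

static void table_put(struct table *t)
{
	if (atomic_fetch_sub(&t->refs, 1) == 1) {	/* last user frees */
		free(t->targets);
		free(t);
	}
}

/*
 * map:      table_get(t); tio->ti = &t->targets[i];
 * complete: use tio->ti, then table_put(t);
 * reload:   publish the new table, then table_put(old) -- the old
 *           targets stay valid until the last in-flight completion
 *           drops its reference.
 */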

Igor

* Re: BUG: unable to handle kernel paging request
From: Mike Snitzer @ 2017-09-28 20:17 UTC
  To: Igor Druzhinin; +Cc: dm-devel

On Thu, Sep 28 2017 at  3:45pm -0400,
Igor Druzhinin <igor.druzhinin@citrix.com> wrote:

> On 28/09/17 18:56, Mike Snitzer wrote:
> > On Thu, Sep 28 2017 at 12:33pm -0400,
> > Igor Druzhinin <igor.druzhinin@citrix.com> wrote:
> > 
> >> [... original report and crash log snipped ...]
> > 
> > This was likely fixed with commit 61febef40bfe8ab68259d8545257686e8a0d91d1
> > 
> 
> I doubt it. Its commit message says:
> Fixes: eb8db831be80 ("dm: always defer request allocation to the owner of the request_queue")
> and eb8db831be80 is not present in 4.4.

Don't let that fool you.  The issue was still relevant and the fix is
needed; it is just more acute with newer kernels.

> And the crash happens in the other code branch (clone completion):
> 
> static void dm_done(struct request *clone, int error, bool mapped)
> {
> 	int r = error;
> 	struct dm_rq_target_io *tio = clone->end_io_data;
> 	dm_request_endio_fn rq_end_io = NULL;
> 
> 	if (tio->ti) {
> 		rq_end_io = tio->ti->type->rq_end_io; <=== crash while dereferencing tio->ti
> 
> 		if (mapped && rq_end_io)
> 			r = rq_end_io(tio->ti, clone, error, &tio->info);
> 	}

Anyway, the issue you're talking about was also fixed; I'm just not sure
which commit fixed it right now, but I'll keep looking.

See: https://www.redhat.com/archives/dm-devel/2016-February/msg00073.html

It could be one of the various fixes that Bart did to cope with queue
teardown races.

Mike

* Re: BUG: unable to handle kernel paging request
From: Igor Druzhinin @ 2017-09-28 23:19 UTC
  To: Mike Snitzer; +Cc: bart.vanassche, dm-devel

On 28/09/17 21:17, Mike Snitzer wrote:
> On Thu, Sep 28 2017 at  3:45pm -0400,
> Igor Druzhinin <igor.druzhinin@citrix.com> wrote:
> 
>> On 28/09/17 18:56, Mike Snitzer wrote:
>>> On Thu, Sep 28 2017 at 12:33pm -0400,
>>> Igor Druzhinin <igor.druzhinin@citrix.com> wrote:
>>>
>>>> [... original report and crash log snipped ...]
>>>
>>> This was likely fixed with commit 61febef40bfe8ab68259d8545257686e8a0d91d1
>>>
>>
>> I doubt it. Its commit message says:
>> Fixes: eb8db831be80 ("dm: always defer request allocation to the owner of the request_queue")
>> and eb8db831be80 is not present in 4.4.
> 
> Don't let that fool you.  The issue was still relevant and the fix is
> needed; it is just more acute with newer kernels.
> 
>> And the crash happens in the other code branch (clone completion):
>>
>> [... dm_done() excerpt snipped ...]
> 
> Anyway, the issue you're talking about was also fixed; I'm just not sure
> which commit fixed it right now, but I'll keep looking.
> 
> See: https://www.redhat.com/archives/dm-devel/2016-February/msg00073.html
> 
> It could be one of the various fixes that Bart did to cope with queue
> teardown races.

Do you mean those fixes?

https://www.spinics.net/lists/linux-scsi/msg101312.html

CC'ing Bart as he probably knows a lot about the issue.

Igor

* Re: BUG: unable to handle kernel paging request
From: Mike Snitzer @ 2017-09-29 15:08 UTC
  To: Igor Druzhinin; +Cc: bart.vanassche, dm-devel

On Thu, Sep 28 2017 at  7:19pm -0400,
Igor Druzhinin <igor.druzhinin@citrix.com> wrote:

> On 28/09/17 21:17, Mike Snitzer wrote:
> > 
> > Anyway, the issue you're talking about was also fixed; I'm just not sure
> > which commit fixed it right now, but I'll keep looking.
> > 
> > See: https://www.redhat.com/archives/dm-devel/2016-February/msg00073.html
> > 
> > It could be one of the various fixes that Bart did to cope with queue
> > teardown races.
> 
> Do you mean those fixes?
> 
> https://www.spinics.net/lists/linux-scsi/msg101312.html

Yes, but there have been _a lot_ of fixes and performance improvements
to DM multipath's blk-mq support independent of those (the fixes range
from SCSI to block core to DM core to DM multipath).

Trying to make a "stable" 4.4 kernel carry all those fixes is a noble
task, but I have no interest in taking it on, nor do I have time to
coach you (or others) through it.

Just rebase to a newer kernel!  (Sorry)

Mike

* Re: BUG: unable to handle kernel paging request
From: Igor Druzhinin @ 2017-09-29 15:57 UTC
  To: Mike Snitzer; +Cc: bart.vanassche, dm-devel

On 29/09/17 16:08, Mike Snitzer wrote:
> On Thu, Sep 28 2017 at  7:19pm -0400,
> Igor Druzhinin <igor.druzhinin@citrix.com> wrote:
> 
>> On 28/09/17 21:17, Mike Snitzer wrote:
>>>
>>> Anyway, the issue you're talking about was also fixed; I'm just not sure
>>> which commit fixed it right now, but I'll keep looking.
>>>
>>> See: https://www.redhat.com/archives/dm-devel/2016-February/msg00073.html
>>>
>>> It could be one of the various fixes that Bart did to cope with queue
>>> teardown races.
>>
>> Do you mean those fixes?
>>
>> https://www.spinics.net/lists/linux-scsi/msg101312.html
> 
> Yes, but there have been _a lot_ of fixes and performance improvements
> to DM multipath's blk-mq support independent of those (the fixes range
> from SCSI to block core to DM core to DM multipath).
> 
> Trying to make a "stable" 4.4 kernel carry all those fixes is a noble
> task, but I have no interest in taking it on, nor do I have time to
> coach you (or others) through it.
> 
> Just rebase to a newer kernel!  (Sorry)

Ok, thank you for the reply.

This was actually my initial concern: the changes are substantial and
depend on each other. I just wanted to check whether there is a
"targeted" fix that I'm not aware of or was unable to recognize.

Igor

* Re: [dm-devel] BUG: unable to handle kernel paging request
From: Casey Chen @ 2022-04-20 21:14 UTC
  To: igor.druzhinin; +Cc: bart.vanassche, dm-devel, Casey Chen, snitzer, yzhong

Hi,

From the previous discussion it is unclear to me whether this issue was
ever solved. We found a similar crash on v4.4.262, at the exact line of
code described by Igor. I wonder what the fix is and how we could test
it. Thanks!

[14501930.727650,06] BUG: unable to handle kernel paging request at ffffc901aed4d048
[14501930.730983,06] IP: [<ffffffff816612cf>] dm_softirq_done+0x8f/0x270
[14501930.732776,06] PGD 18fcbb067 PUD 2e3b7b067 PMD 2cab08067 PTE 0
[14501930.734640,06] Oops: 0000 [#1] SMP
[14501930.736507,06] Dumping ftrace buffer:
...
[14501930.818509,06] CPU: 6 PID: 14767 Comm: <masked off> Tainted: P        W  O    4.4.262+
...
[14501930.938477,06] Call Trace:
[14501930.941301,06]  <IRQ>
[14501930.941351,06]  [<ffffffff8130b6c3>] __blk_mq_complete_request+0x83/0xf0
[14501930.946847,06]  [<ffffffff8130b74c>] blk_mq_complete_request+0x1c/0x20
[14501930.949574,06]  [<ffffffff81660494>] dm_complete_request+0x34/0x50
[14501930.952256,06]  [<ffffffff816604df>] end_clone_request+0x2f/0x50
[14501930.954883,06]  [<ffffffff8130a7cd>] blk_mq_end_request+0x3d/0x70
[14501930.957460,06]  [<ffffffffa028484c>] nvme_rdma_complete_rq+0x4c/0x210 [nvme_rdma]
[14501930.962447,06]  [<ffffffff810760ac>] ? __local_bh_enable+0x3c/0x70
[14501930.964972,06]  [<ffffffff813097f0>] ? blkdev_issue_zeroout+0x260/0x260
[14501930.967459,06]  [<ffffffff81309809>] __blk_mq_complete_request_remote+0x19/0x20
[14501930.972291,06]  [<ffffffff810ef099>] flush_smp_call_function_queue+0xa9/0x170
[14501930.977138,06]  [<ffffffff810efaa3>] generic_smp_call_function_single_interrupt+0x13/0x60
[14501930.982005,06]  [<ffffffff810495bd>] smp_call_function_interrupt+0x2d/0x40
[14501930.984492,06]  [<ffffffff810496be>] smp_call_function_single_interrupt+0xe/0x10
[14501930.989316,06]  [<ffffffff817d0221>] call_function_single_interrupt+0x91/0xa0
[14501930.994301,06]  <EOI>
[14501930.994350,06]  [<ffffffff810031a0>] ? exit_to_usermode_loop+0x50/0xb0
[14501930.999364,06]  [<ffffffff81003199>] ? exit_to_usermode_loop+0x49/0xb0
[14501931.001829,06]  [<ffffffff81003b62>] prepare_exit_to_usermode+0x32/0x40
[14501931.004261,06]  [<ffffffff817cf979>] retint_user+0x8/0x20
[14501931.006626,06] Code: 16 41 83 fc 02 0f 85 df 01 00 00 49 8b 76 10 49 8b 3e e8 05 fe ff ff 5b 41 5c 41 5d 41 5e 41 5f 5d c3 49 8b 7e 08 48 85 ff 74 bd <48> 8b 47 08 48 8b 40 60 48 85 c0 74 b0 44 89 e2 49 8d 4e 48 4c
[14501931.014038,06] RIP  [<ffffffff816612cf>] dm_softirq_done+0x8f/0x270
[14501931.016532,06]  RSP <ffff8817e0ac3e28>
[14501931.018971,06] CR2: ffffc901aed4d048
[14501931.021382,06] ---[ end trace 67b8f26e34c76805 ]---

Regards,
Casey
