linux-nvme.lists.infradead.org archive mirror
* NVMeoF RDMA IB:  I/O timeout and NULL pointer observed during rescan_controller/reset_controller with fio background
       [not found] <1823328454.445263.1568796846850.JavaMail.zimbra@redhat.com>
@ 2019-09-18  9:13 ` Yi Zhang
  2019-09-18 14:21   ` Max Gurtovoy
  0 siblings, 1 reply; 7+ messages in thread
From: Yi Zhang @ 2019-09-18  9:13 UTC (permalink / raw)
  To: linux-nvme; +Cc: sagi

Hello,
I observed the below I/O timeout and NULL pointer dereference on 5.3.0. Please help check it, and let me know if you need more info or want me to test a patch. Thanks.

reproducer:
1. run fio in the background
2. stress the rescan_controller/reset_controller operations (see the loop sketch below):
echo 1 > /sys/block/nvme2n1/device/nvme2/rescan_controller
echo 1 > /sys/block/nvme2n1/device/nvme2/reset_controller
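
A minimal sketch of the stress loop, for reference; the fio parameters, namespace path, and loop count here are only examples and will differ per setup:

# example fio job running in the background against the NVMeoF namespace
fio --name=bg --filename=/dev/nvme2n1 --rw=randrw --bs=4k --iodepth=32 \
    --ioengine=libaio --direct=1 --time_based --runtime=600 &

# hammer rescan/reset while fio is running
for i in $(seq 1 100); do
    echo 1 > /sys/block/nvme2n1/device/nvme2/rescan_controller
    echo 1 > /sys/block/nvme2n1/device/nvme2/reset_controller
done
wait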

kernel log:
[  384.865550] nvme nvme2: creating 48 I/O queues.
[  386.784069] nvme nvme2: creating 48 I/O queues.
[  387.771002] nvme_ns_head_make_request: 159989 callbacks suppressed
[  387.771004] block nvme2n1: no usable path - requeuing I/O
[  387.771012] block nvme2n1: no usable path - requeuing I/O
[  387.771051] block nvme2n1: no usable path - requeuing I/O
[  387.771061] block nvme2n1: no usable path - requeuing I/O
[  387.771065] block nvme2n1: no usable path - requeuing I/O
[  387.771070] block nvme2n1: no usable path - requeuing I/O
[  387.771077] block nvme2n1: no usable path - requeuing I/O
[  387.771124] block nvme2n1: no usable path - requeuing I/O
[  387.771146] block nvme2n1: no usable path - requeuing I/O
[  387.771155] block nvme2n1: no usable path - requeuing I/O
[  449.670780] nvme nvme2: I/O 0 QID 0 timeout
[  449.691674] nvme nvme2: Connect command failed, error wo/DNR bit: 7
[  449.697974] BUG: kernel NULL pointer dereference, address: 00000000000000c8
[  449.704945] #PF: supervisor read access in kernel mode
[  449.710082] #PF: error_code(0x0000) - not-present page
[  449.715221] PGD 0 P4D 0 
[  449.717761] Oops: 0000 [#1] SMP PTI
[  449.721254] CPU: 45 PID: 1145 Comm: kworker/u98:2 Not tainted 5.3.0 #12
[  449.727866] Hardware name: Dell Inc. PowerEdge R740/00WGD1, BIOS 2.2.11 06/13/2019
[  449.735448] Workqueue: nvme-reset-wq nvme_rdma_reset_ctrl_work [nvme_rdma]
[  449.742325] RIP: 0010:rdma_disconnect+0x2e/0x90 [rdma_cm]
[  449.747722] Code: 00 55 53 48 89 fb 48 8b bf a8 02 00 00 48 85 ff 74 65 48 8b 0b 0f b6 83 c0 01 00 00 48 69 c0 b8 00 00 00 48 03 81 80 04 00 00 <8b> 40 10 a8 04 75 0d a8 08 74 42 5b 31 f6 5d e9 fe 72 b2 ff 48 89
[  449.766466] RSP: 0018:ffffb01f87323de0 EFLAGS: 00010206
[  449.771693] RAX: 00000000000000b8 RBX: ffff9e4d5a474c00 RCX: ffff9e4d5a475c00
[  449.778825] RDX: 0000000000000819 RSI: ffff9e4dffb96b88 RDI: ffff9e41af404e00
[  449.785956] RBP: 0000000000000000 R08: 00000000000008e7 R09: 000000000000002d
[  449.793098] R10: ffffb01f87323df8 R11: ffffb01f87323ac0 R12: 0000000000000007
[  449.800229] R13: ffff9e4db1332000 R14: ffff9e42045e2540 R15: ffff9e4db1332000
[  449.807361] FS:  0000000000000000(0000) GS:ffff9e4dffb80000(0000) knlGS:0000000000000000
[  449.815456] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  449.821195] CR2: 00000000000000c8 CR3: 000000183e60a003 CR4: 00000000007606e0
[  449.828327] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  449.835457] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  449.842581] PKRU: 55555554
[  449.845287] Call Trace:
[  449.847758]  nvme_rdma_start_queue+0x8f/0xc0 [nvme_rdma]
[  449.853070]  nvme_rdma_setup_ctrl+0x4ef/0x6a0 [nvme_rdma]
[  449.858469]  nvme_rdma_reset_ctrl_work+0x4e/0x70 [nvme_rdma]
[  449.864132]  process_one_work+0x1a1/0x360
[  449.868140]  worker_thread+0x1c9/0x380
[  449.871894]  ? process_one_work+0x360/0x360
[  449.876081]  kthread+0x10c/0x130
[  449.879310]  ? kthread_create_on_node+0x60/0x60
[  449.883846]  ret_from_fork+0x35/0x40
[  449.887422] Modules linked in: nvme_rdma nvme_fabrics nvmet_rdma nvmet 8021q garp mrp stp llc ib_isert iscsi_target_mod ib_srpt target_core_mod ib_srp scsi_transport_srp iw_cxgb4 libcxgb mlx5_ib vfat fat opa_vnic ib_umad ib_ipoib intel_rapl_msr intel_rapl_common isst_if_common rpcrdma sunrpc skx_edac nfit rdma_ucm libnvdimm ib_iser rdma_cm x86_pkg_temp_thermal intel_powerclamp iw_cm ib_cm coretemp libiscsi kvm_intel scsi_transport_iscsi hfi1 kvm iTCO_wdt iTCO_vendor_support dcdbas ipmi_ssif irqbypass rdmavt bnxt_re crct10dif_pclmul crc32_pclmul ib_uverbs ghash_clmulni_intel intel_cstate intel_uncore ib_core dell_smbios mei_me ipmi_si intel_rapl_perf wmi_bmof dell_wmi_descriptor pcspkr sg mei i2c_i801 lpc_ich ipmi_devintf ipmi_msghandler acpi_power_meter xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_vram_helper mlx5_core ttm drm_kms_helper syscopyarea sysfillrect sysimgblt ahci fb_sys_fops nvme libahci csiostor drm cxgb4 bnxt_en crc32c_intel nvme_core libata megaraid_sas scsi_trans
 port_fc
[  449.887456]  mlxfw tg3 wmi dm_mirror dm_region_hash dm_log dm_mod
[  449.980847] CR2: 00000000000000c8
[  449.984182] ---[ end trace aeab63ac2e6510db ]---
[  450.046884] RIP: 0010:rdma_disconnect+0x2e/0x90 [rdma_cm]
[  450.052282] Code: 00 55 53 48 89 fb 48 8b bf a8 02 00 00 48 85 ff 74 65 48 8b 0b 0f b6 83 c0 01 00 00 48 69 c0 b8 00 00 00 48 03 81 80 04 00 00 <8b> 40 10 a8 04 75 0d a8 08 74 42 5b 31 f6 5d e9 fe 72 b2 ff 48 89
[  450.071027] RSP: 0018:ffffb01f87323de0 EFLAGS: 00010206
[  450.076255] RAX: 00000000000000b8 RBX: ffff9e4d5a474c00 RCX: ffff9e4d5a475c00
[  450.083387] RDX: 0000000000000819 RSI: ffff9e4dffb96b88 RDI: ffff9e41af404e00
[  450.090517] RBP: 0000000000000000 R08: 00000000000008e7 R09: 000000000000002d
[  450.097642] R10: ffffb01f87323df8 R11: ffffb01f87323ac0 R12: 0000000000000007
[  450.104774] R13: ffff9e4db1332000 R14: ffff9e42045e2540 R15: ffff9e4db1332000
[  450.111899] FS:  0000000000000000(0000) GS:ffff9e4dffb80000(0000) knlGS:0000000000000000
[  450.119985] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  450.125729] CR2: 00000000000000c8 CR3: 000000183e60a003 CR4: 00000000007606e0
[  450.132853] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  450.139987] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  450.147118] PKRU: 55555554
[  450.149831] Kernel panic - not syncing: Fatal exception
[  450.155135] Kernel Offset: 0x17200000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[  450.224883] ---[ end Kernel panic - not syncing: Fatal exception ]---

# gdb /lib/modules/5.3.0/kernel/drivers/nvme/host/nvme-rdma.ko 
Reading symbols from /lib/modules/5.3.0/kernel/drivers/nvme/host/nvme-rdma.ko...done.

(gdb) l *(nvme_rdma_start_queue+0x8f)
0x65f is in nvme_rdma_start_queue (drivers/nvme/host/rdma.c:568).
563	}
564	
565	static void __nvme_rdma_stop_queue(struct nvme_rdma_queue *queue)
566	{
567		rdma_disconnect(queue->cm_id);
568		ib_drain_qp(queue->qp);
569	}
570	
571	static void nvme_rdma_stop_queue(struct nvme_rdma_queue *queue)
572	{
(gdb) 
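
The same line can also be resolved non-interactively with the kernel's scripts/faddr2line helper, run from the matching 5.3.0 source tree and assuming the installed module still carries debug info:

./scripts/faddr2line /lib/modules/5.3.0/kernel/drivers/nvme/host/nvme-rdma.ko nvme_rdma_start_queue+0x8f/0xc0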

# lspci | grep -i mel
3b:00.0 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5]



Best Regards,
  Yi Zhang



_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: NVMeoF RDMA IB: I/O timeout and NULL pointer observed during rescan_controller/reset_controller with fio background
  2019-09-18  9:13 ` NVMeoF RDMA IB: I/O timeout and NULL pointer observed during rescan_controller/reset_controller with fio background Yi Zhang
@ 2019-09-18 14:21   ` Max Gurtovoy
  2019-09-20  3:37     ` Yi Zhang
  0 siblings, 1 reply; 7+ messages in thread
From: Max Gurtovoy @ 2019-09-18 14:21 UTC (permalink / raw)
  To: Yi Zhang, linux-nvme; +Cc: sagi


On 9/18/2019 12:13 PM, Yi Zhang wrote:
> Hello
> I observed the below I/O timeout and NULL pointer dereference on 5.3.0. Please help check it, and let me know if you need more info or want me to test a patch. Thanks.

Hi,

Can you try to reproduce it with an older kernel (5.2.0)?

I want to understand whether it's a regression from the last few months...


>
> reproducer:
> 1. run fio in the background
> 2. stress the rescan_controller/reset_controller operations:
> echo 1 > /sys/block/nvme2n1/device/nvme2/rescan_controller
> echo 1 > /sys/block/nvme2n1/device/nvme2/reset_controller
>
> kernel log:
> [  384.865550] nvme nvme2: creating 48 I/O queues.
> [  386.784069] nvme nvme2: creating 48 I/O queues.
> [  387.771002] nvme_ns_head_make_request: 159989 callbacks suppressed
> [  387.771004] block nvme2n1: no usable path - requeuing I/O
> [  387.771012] block nvme2n1: no usable path - requeuing I/O
> [  387.771051] block nvme2n1: no usable path - requeuing I/O
> [  387.771061] block nvme2n1: no usable path - requeuing I/O
> [  387.771065] block nvme2n1: no usable path - requeuing I/O
> [  387.771070] block nvme2n1: no usable path - requeuing I/O
> [  387.771077] block nvme2n1: no usable path - requeuing I/O
> [  387.771124] block nvme2n1: no usable path - requeuing I/O
> [  387.771146] block nvme2n1: no usable path - requeuing I/O
> [  387.771155] block nvme2n1: no usable path - requeuing I/O
> [  449.670780] nvme nvme2: I/O 0 QID 0 timeout
> [  449.691674] nvme nvme2: Connect command failed, error wo/DNR bit: 7
> [  449.697974] BUG: kernel NULL pointer dereference, address: 00000000000000c8
> [  449.704945] #PF: supervisor read access in kernel mode
> [  449.710082] #PF: error_code(0x0000) - not-present page
> [  449.715221] PGD 0 P4D 0
> [  449.717761] Oops: 0000 [#1] SMP PTI
> [  449.721254] CPU: 45 PID: 1145 Comm: kworker/u98:2 Not tainted 5.3.0 #12
> [  449.727866] Hardware name: Dell Inc. PowerEdge R740/00WGD1, BIOS 2.2.11 06/13/2019
> [  449.735448] Workqueue: nvme-reset-wq nvme_rdma_reset_ctrl_work [nvme_rdma]
> [  449.742325] RIP: 0010:rdma_disconnect+0x2e/0x90 [rdma_cm]
> [  449.747722] Code: 00 55 53 48 89 fb 48 8b bf a8 02 00 00 48 85 ff 74 65 48 8b 0b 0f b6 83 c0 01 00 00 48 69 c0 b8 00 00 00 48 03 81 80 04 00 00 <8b> 40 10 a8 04 75 0d a8 08 74 42 5b 31 f6 5d e9 fe 72 b2 ff 48 89
> [  449.766466] RSP: 0018:ffffb01f87323de0 EFLAGS: 00010206
> [  449.771693] RAX: 00000000000000b8 RBX: ffff9e4d5a474c00 RCX: ffff9e4d5a475c00
> [  449.778825] RDX: 0000000000000819 RSI: ffff9e4dffb96b88 RDI: ffff9e41af404e00
> [  449.785956] RBP: 0000000000000000 R08: 00000000000008e7 R09: 000000000000002d
> [  449.793098] R10: ffffb01f87323df8 R11: ffffb01f87323ac0 R12: 0000000000000007
> [  449.800229] R13: ffff9e4db1332000 R14: ffff9e42045e2540 R15: ffff9e4db1332000
> [  449.807361] FS:  0000000000000000(0000) GS:ffff9e4dffb80000(0000) knlGS:0000000000000000
> [  449.815456] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  449.821195] CR2: 00000000000000c8 CR3: 000000183e60a003 CR4: 00000000007606e0
> [  449.828327] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  449.835457] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  449.842581] PKRU: 55555554
> [  449.845287] Call Trace:
> [  449.847758]  nvme_rdma_start_queue+0x8f/0xc0 [nvme_rdma]
> [  449.853070]  nvme_rdma_setup_ctrl+0x4ef/0x6a0 [nvme_rdma]
> [  449.858469]  nvme_rdma_reset_ctrl_work+0x4e/0x70 [nvme_rdma]
> [  449.864132]  process_one_work+0x1a1/0x360
> [  449.868140]  worker_thread+0x1c9/0x380
> [  449.871894]  ? process_one_work+0x360/0x360
> [  449.876081]  kthread+0x10c/0x130
> [  449.879310]  ? kthread_create_on_node+0x60/0x60
> [  449.883846]  ret_from_fork+0x35/0x40
> [  449.887422] Modules linked in: nvme_rdma nvme_fabrics nvmet_rdma nvmet 8021q garp mrp stp llc ib_isert iscsi_target_mod ib_srpt target_core_mod ib_srp scsi_transport_srp iw_cxgb4 libcxgb mlx5_ib vfat fat opa_vnic ib_umad ib_ipoib intel_rapl_msr intel_rapl_common isst_if_common rpcrdma sunrpc skx_edac nfit rdma_ucm libnvdimm ib_iser rdma_cm x86_pkg_temp_thermal intel_powerclamp iw_cm ib_cm coretemp libiscsi kvm_intel scsi_transport_iscsi hfi1 kvm iTCO_wdt iTCO_vendor_support dcdbas ipmi_ssif irqbypass rdmavt bnxt_re crct10dif_pclmul crc32_pclmul ib_uverbs ghash_clmulni_intel intel_cstate intel_uncore ib_core dell_smbios mei_me ipmi_si intel_rapl_perf wmi_bmof dell_wmi_descriptor pcspkr sg mei i2c_i801 lpc_ich ipmi_devintf ipmi_msghandler acpi_power_meter xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_vram_helper mlx5_core ttm drm_kms_helper syscopyarea sysfillrect sysimgblt ahci fb_sys_fops nvme libahci csiostor drm cxgb4 bnxt_en crc32c_intel nvme_core libata megaraid_sas scsi_tra
 nsport_fc
> [  449.887456]  mlxfw tg3 wmi dm_mirror dm_region_hash dm_log dm_mod
> [  449.980847] CR2: 00000000000000c8
> [  449.984182] ---[ end trace aeab63ac2e6510db ]---
> [  450.046884] RIP: 0010:rdma_disconnect+0x2e/0x90 [rdma_cm]
> [  450.052282] Code: 00 55 53 48 89 fb 48 8b bf a8 02 00 00 48 85 ff 74 65 48 8b 0b 0f b6 83 c0 01 00 00 48 69 c0 b8 00 00 00 48 03 81 80 04 00 00 <8b> 40 10 a8 04 75 0d a8 08 74 42 5b 31 f6 5d e9 fe 72 b2 ff 48 89
> [  450.071027] RSP: 0018:ffffb01f87323de0 EFLAGS: 00010206
> [  450.076255] RAX: 00000000000000b8 RBX: ffff9e4d5a474c00 RCX: ffff9e4d5a475c00
> [  450.083387] RDX: 0000000000000819 RSI: ffff9e4dffb96b88 RDI: ffff9e41af404e00
> [  450.090517] RBP: 0000000000000000 R08: 00000000000008e7 R09: 000000000000002d
> [  450.097642] R10: ffffb01f87323df8 R11: ffffb01f87323ac0 R12: 0000000000000007
> [  450.104774] R13: ffff9e4db1332000 R14: ffff9e42045e2540 R15: ffff9e4db1332000
> [  450.111899] FS:  0000000000000000(0000) GS:ffff9e4dffb80000(0000) knlGS:0000000000000000
> [  450.119985] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  450.125729] CR2: 00000000000000c8 CR3: 000000183e60a003 CR4: 00000000007606e0
> [  450.132853] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  450.139987] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  450.147118] PKRU: 55555554
> [  450.149831] Kernel panic - not syncing: Fatal exception
> [  450.155135] Kernel Offset: 0x17200000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> [  450.224883] ---[ end Kernel panic - not syncing: Fatal exception ]---
>
> # gdb /lib/modules/5.3.0/kernel/drivers/nvme/host/nvme-rdma.ko
> Reading symbols from /lib/modules/5.3.0/kernel/drivers/nvme/host/nvme-rdma.ko...done.
>
> (gdb) l *(nvme_rdma_start_queue+0x8f)
> 0x65f is in nvme_rdma_start_queue (drivers/nvme/host/rdma.c:568).
> 563	}
> 564	
> 565	static void __nvme_rdma_stop_queue(struct nvme_rdma_queue *queue)
> 566	{
> 567		rdma_disconnect(queue->cm_id);
> 568		ib_drain_qp(queue->qp);
> 569	}
> 570	
> 571	static void nvme_rdma_stop_queue(struct nvme_rdma_queue *queue)
> 572	{
> (gdb)
>
> # lspci | grep -i mel
> 3b:00.0 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5]
>
>
>
> Best Regards,
>    Yi Zhang
>
>
>
> _______________________________________________
> Linux-nvme mailing list
> Linux-nvme@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-nvme


_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: NVMeoF RDMA IB: I/O timeout and NULL pointer observed during rescan_controller/reset_controller with fio background
  2019-09-18 14:21   ` Max Gurtovoy
@ 2019-09-20  3:37     ` Yi Zhang
  2019-09-20 16:58       ` Sagi Grimberg
  0 siblings, 1 reply; 7+ messages in thread
From: Yi Zhang @ 2019-09-20  3:37 UTC (permalink / raw)
  To: Max Gurtovoy, linux-nvme; +Cc: sagi



On 9/18/19 10:21 PM, Max Gurtovoy wrote:
>
> On 9/18/2019 12:13 PM, Yi Zhang wrote:
>> Hello
>> I observed the below I/O timeout and NULL pointer dereference on 5.3.0. Please help
>> check it, and let me know if you need more info or want me to test a patch. Thanks.
>
> Hi,
>
> Can you try to reproduce it with an older kernel (5.2.0)?
>
> I want to understand whether it's a regression from the last few months...
>
Hi Max,
From my testing, there seem to be two issues here: one is the I/O timeout during reset_controller;
the other is that reconnecting after the I/O timeout leads to a kernel NULL pointer dereference, and that one was introduced in v5.2.0-rc1.

v5.1
#echo 1 > /sys/block/nvme2n1/device/reset_controller
echo: write error: Network dropped connection on reset

[55542.073929] nvme nvme2: Please enable CONFIG_NVME_MULTIPATH for full support of multi-port devices.
[55542.083098] nvme nvme2: creating 48 I/O queues.
[55607.920015] nvme nvme2: I/O 0 QID 0 timeout
[55607.937863] nvme nvme2: Connect command failed, error wo/DNR bit: 7
[55607.944157] nvme nvme2: failed to connect queue: 0 ret=7
[55607.949478] nvme nvme2: Reconnecting in 10 seconds...
[55618.166395] nvme nvme2: Please enable CONFIG_NVME_MULTIPATH for full support of multi-port devices.
[55618.175575] nvme nvme2: creating 48 I/O queues.
[55618.960387] nvme nvme2: Successfully reconnected (2 attempts)


v5.2.0-rc1
[ 1463.516023] nvme nvme2: creating 48 I/O queues.
[ 1464.311387] nvme2c2n1: detected capacity change from 0 to 1600321314816
[ 1464.435225] nvme2c2n1: detected capacity change from 0 to 1600321314816
[ 1486.515899] nvme_ns_head_make_request: 167 callbacks suppressed
[ 1486.515903] block nvme2n1: no path available - requeuing I/O
[ 1486.527513] block nvme2n1: no path available - requeuing I/O
[ 1486.533183] block nvme2n1: no path available - requeuing I/O
[ 1486.538852] block nvme2n1: no path available - requeuing I/O
[ 1486.544536] block nvme2n1: no path available - requeuing I/O
[ 1486.550214] block nvme2n1: no path available - requeuing I/O
[ 1486.555880] block nvme2n1: no path available - requeuing I/O
[ 1486.561548] block nvme2n1: no path available - requeuing I/O
[ 1486.567232] block nvme2n1: no path available - requeuing I/O
[ 1486.572910] block nvme2n1: no path available - requeuing I/O
[ 1526.450439] nvme nvme2: I/O 0 QID 0 timeout
[ 1526.469411] nvme nvme2: Connect command failed, error wo/DNR bit: 7
[ 1526.475702] nvme nvme2: failed to connect queue: 0 ret=7
[ 1526.481034] general protection fault: 0000 [#1] SMP PTI
[ 1526.486281] CPU: 21 PID: 16138 Comm: kworker/u98:2 Not tainted 5.2.0-rc1 #15
[ 1526.493336] Hardware name: Dell Inc. PowerEdge R740/00WGD1, BIOS 2.2.11 06/13/2019
[ 1526.500903] Workqueue: nvme-reset-wq nvme_rdma_reset_ctrl_work [nvme_rdma]
[ 1526.507794] RIP: 0010:__x86_indirect_thunk_rax+0x10/0x20
[ 1526.513120] Code: 03 01 d1 89 ca e9 80 ec ca ff 48 8d 0c c8 e9 81 ea ca ff 90 90 90 90 90 90 90 e8 07 00 00 00 f3 90 0f ae e8 eb f9 48 89 04 24 <c3> 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 e8 07 00 00 00 f3
[ 1526.531864] RSP: 0018:ffffc19d2ade7e18 EFLAGS: 00010246
[ 1526.537089] RAX: 730424073f013200 RBX: 0000000000000000 RCX: 0000000000000001
[ 1526.544223] RDX: 0000000000000040 RSI: 0000000be2a19e00 RDI: ffff9e02ff907d80
[ 1526.551355] RBP: 0000000000000007 R08: 0000000000000000 R09: 0000000000000015
[ 1526.558513] R10: 0000000000000000 R11: ffffc19d2ade7b60 R12: 0000000000000000
[ 1526.565644] R13: ffff9e0e8b9a02f8 R14: ffff9e0e8b9a0000 R15: ffff9e0e8b9a06e0
[ 1526.572779] FS:  0000000000000000(0000) GS:ffff9e0f06680000(0000) knlGS:0000000000000000
[ 1526.580863] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1526.586634] CR2: 00007f0179cd1ec8 CR3: 000000183f00a006 CR4: 00000000007606e0
[ 1526.593769] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1526.600900] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1526.608033] PKRU: 55555554
[ 1526.610753] Call Trace:
[ 1526.613201]  ? nvme_rdma_setup_ctrl+0x1ae/0x6b0 [nvme_rdma]
[ 1526.618782]  ? nvme_rdma_reset_ctrl_work+0x4e/0x70 [nvme_rdma]
[ 1526.624642]  ? process_one_work+0x1a1/0x3a0
[ 1526.628824]  ? worker_thread+0x1c9/0x380
[ 1526.632751]  ? process_one_work+0x3a0/0x3a0
[ 1526.636937]  ? kthread+0x10c/0x130
[ 1526.640341]  ? kthread_create_on_node+0x60/0x60
[ 1526.644876]  ? ret_from_fork+0x35/0x40
[ 1526.648635] Modules linked in: nvme_rdma nvme_fabrics nvmet_rdma nvmet 8021q garp mrp stp llc ib_isert iscsi_target_mod ib_srpt target_core_mod ib_srp scsi_transport_srp vfat fat mlx5_ib opa_vnic ib_umad ib_ipoib intel_rapl skx_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp rpcrdma sunrpc kvm_intel hfi1 kvm rdma_ucm ib_iser libiscsi scsi_transport_iscsi iw_cxgb4 ipmi_ssif rdma_cm iw_cm rdmavt ib_cm bnxt_re libcxgb irqbypass ib_uverbs crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_cstate iTCO_wdt iTCO_vendor_support intel_uncore ib_core dcdbas mei_me dell_smbios ipmi_si wmi_bmof dell_wmi_descriptor intel_rapl_perf sg pcspkr mei ipmi_devintf i2c_i801 lpc_ich ipmi_msghandler acpi_power_meter xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops mlx5_core ttm drm cxgb4 csiostor ahci nvme libahci bnxt_en crc32c_intel nvme_core libata megaraid_sas scsi_transport_fc mlxfw tg3 wmi dm_mirror dm_region_hash dm_log dm_mod
[ 1526.736119] ---[ end trace 0124b9a2f8dcf21d ]---
[ 1526.757400] RIP: 0010:__x86_indirect_thunk_rax+0x10/0x20
[ 1526.762713] Code: 03 01 d1 89 ca e9 80 ec ca ff 48 8d 0c c8 e9 81 ea ca ff 90 90 90 90 90 90 90 e8 07 00 00 00 f3 90 0f ae e8 eb f9 48 89 04 24 <c3> 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00 00 00 e8 07 00 00 00 f3
[ 1526.781472] RSP: 0018:ffffc19d2ade7e18 EFLAGS: 00010246
[ 1526.786715] RAX: 730424073f013200 RBX: 0000000000000000 RCX: 0000000000000001
[ 1526.793863] RDX: 0000000000000040 RSI: 0000000be2a19e00 RDI: ffff9e02ff907d80
[ 1526.801014] RBP: 0000000000000007 R08: 0000000000000000 R09: 0000000000000015
[ 1526.808162] R10: 0000000000000000 R11: ffffc19d2ade7b60 R12: 0000000000000000
[ 1526.815304] R13: ffff9e0e8b9a02f8 R14: ffff9e0e8b9a0000 R15: ffff9e0e8b9a06e0
[ 1526.822457] FS:  0000000000000000(0000) GS:ffff9e0f06680000(0000) knlGS:0000000000000000
[ 1526.830558] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1526.836337] CR2: 00007f0179cd1ec8 CR3: 000000183f00a006 CR4: 00000000007606e0
[ 1526.843472] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1526.850628] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1526.857767] PKRU: 55555554
[ 1526.860501] Kernel panic - not syncing: Fatal exception
[ 1526.865813] Kernel Offset: 0x13000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 1526.893337] ---[ end Kernel panic - not syncing: Fatal exception ]---

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: NVMeoF RDMA IB: I/O timeout and NULL pointer observed during rescan_controller/reset_controller with fio background
  2019-09-20  3:37     ` Yi Zhang
@ 2019-09-20 16:58       ` Sagi Grimberg
  2019-09-23 15:25         ` Max Gurtovoy
  0 siblings, 1 reply; 7+ messages in thread
From: Sagi Grimberg @ 2019-09-20 16:58 UTC (permalink / raw)
  To: Yi Zhang, Max Gurtovoy, linux-nvme

Thanks for reporting Yi,

Does this fix your issue?

--
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index dfa07bb9dfeb..981da9ce3cfc 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -614,7 +614,8 @@ static int nvme_rdma_start_queue(struct nvme_rdma_ctrl *ctrl, int idx)
         if (!ret) {
                 set_bit(NVME_RDMA_Q_LIVE, &queue->flags);
         } else {
-               __nvme_rdma_stop_queue(queue);
+               if (test_bit(NVME_RDMA_Q_ALLOCATED, &queue->flags))
+                       __nvme_rdma_stop_queue(queue);
                 dev_info(ctrl->ctrl.device,
                         "failed to connect queue: %d ret=%d\n", idx, ret);
         }
--
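
If it helps, a rough sketch for trying it on the 5.3.0 test kernel; the tree path and patch file name are placeholders, adjust them to how that kernel was built:

cd /path/to/linux-5.3                      # source tree the test kernel was built from (placeholder)
git apply nvme-rdma-stop-queue-fix.diff    # the diff above, saved to a file (placeholder name)
make drivers/nvme/host/nvme-rdma.ko
nvme disconnect-all                        # drop existing NVMeoF connections before reloading
modprobe -r nvme_rdma
cp drivers/nvme/host/nvme-rdma.ko /lib/modules/5.3.0/kernel/drivers/nvme/host/
modprobe nvme_rdma
# then re-run the fio + rescan/reset_controller reproducer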

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: NVMeoF RDMA IB: I/O timeout and NULL pointer observed during rescan_controller/reset_controller with fio background
  2019-09-20 16:58       ` Sagi Grimberg
@ 2019-09-23 15:25         ` Max Gurtovoy
  2019-09-24  4:52           ` Yi Zhang
  2019-09-27  8:17           ` Yi Zhang
  0 siblings, 2 replies; 7+ messages in thread
From: Max Gurtovoy @ 2019-09-23 15:25 UTC (permalink / raw)
  To: Sagi Grimberg, Yi Zhang, linux-nvme

Any update, Yi?

We must fix this issue...

On 9/20/2019 7:58 PM, Sagi Grimberg wrote:
> Thanks for reporting Yi,
>
> Does this fix your issue?
>
> -- 
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index dfa07bb9dfeb..981da9ce3cfc 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -614,7 +614,8 @@ static int nvme_rdma_start_queue(struct 
> nvme_rdma_ctrl *ctrl, int idx)
>         if (!ret) {
>                 set_bit(NVME_RDMA_Q_LIVE, &queue->flags);
>         } else {
> -               __nvme_rdma_stop_queue(queue);
> +               if (test_bit(NVME_RDMA_Q_ALLOCATED, &queue->flags))
> +                       __nvme_rdma_stop_queue(queue);
>                 dev_info(ctrl->ctrl.device,
>                         "failed to connect queue: %d ret=%d\n", idx, 
> ret);
>         }
> -- 

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: NVMeoF RDMA IB: I/O timeout and NULL pointer observed during rescan_controller/reset_controller with fio background
  2019-09-23 15:25         ` Max Gurtovoy
@ 2019-09-24  4:52           ` Yi Zhang
  2019-09-27  8:17           ` Yi Zhang
  1 sibling, 0 replies; 7+ messages in thread
From: Yi Zhang @ 2019-09-24  4:52 UTC (permalink / raw)
  To: Max Gurtovoy; +Cc: Sagi Grimberg, linux-nvme

Hi Max/Sagi,
Sorry for the late response; I will update it today.

Best Regards,
  Yi Zhang


----- Original Message -----
From: "Max Gurtovoy" <maxg@mellanox.com>
To: "Sagi Grimberg" <sagi@grimberg.me>, "Yi Zhang" <yi.zhang@redhat.com>, linux-nvme@lists.infradead.org
Sent: Monday, September 23, 2019 11:25:31 PM
Subject: Re: NVMeoF RDMA IB: I/O timeout and NULL pointer observed during rescan_controller/reset_controller with fio background

Any update, Yi?

We must fix this issue...

On 9/20/2019 7:58 PM, Sagi Grimberg wrote:
> Thanks for reporting Yi,
>
> Does this fix your issue?
>
> -- 
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index dfa07bb9dfeb..981da9ce3cfc 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -614,7 +614,8 @@ static int nvme_rdma_start_queue(struct 
> nvme_rdma_ctrl *ctrl, int idx)
>         if (!ret) {
>                 set_bit(NVME_RDMA_Q_LIVE, &queue->flags);
>         } else {
> -               __nvme_rdma_stop_queue(queue);
> +               if (test_bit(NVME_RDMA_Q_ALLOCATED, &queue->flags))
> +                       __nvme_rdma_stop_queue(queue);
>                 dev_info(ctrl->ctrl.device,
>                         "failed to connect queue: %d ret=%d\n", idx, 
> ret);
>         }
> -- 

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: NVMeoF RDMA IB: I/O timeout and NULL pointer observed during rescan_controller/reset_controller with fio background
  2019-09-23 15:25         ` Max Gurtovoy
  2019-09-24  4:52           ` Yi Zhang
@ 2019-09-27  8:17           ` Yi Zhang
  1 sibling, 0 replies; 7+ messages in thread
From: Yi Zhang @ 2019-09-27  8:17 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Max Gurtovoy, linux-nvme

Hi Sagi,

Confirmed: the NULL pointer issue is fixed by this patch, but the "I/O 1 QID 0 timeout" still exists. Thanks.

<6>[ 5807.293577] nvme nvme2: creating 48 I/O queues.
<6>[ 5817.210168] nvme nvme2: Removing ctrl: NQN "testnqn"
<6>[ 5821.130348] nvme nvme2: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 172.31.0.186:4420
<6>[ 5821.139830] nvme nvme2: Removing ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery"
<6>[ 5821.185128] nvme nvme2: creating 48 I/O queues.
<6>[ 5821.925736] nvme nvme2: mapped 48/0/0 default/read/poll queues.
<6>[ 5821.950855] nvme nvme2: new ctrl: NQN "testnqn", addr 172.31.0.186:4420
<6>[ 5821.953152] nvme2n1: detected capacity change from 0 to 1600321314816
<4>[ 5826.546586] nvme_ns_head_make_request: 250644 callbacks suppressed
<4>[ 5826.546589] block nvme2n1: no usable path - requeuing I/O
<4>[ 5826.546591] block nvme2n1: no usable path - requeuing I/O
<4>[ 5826.546606] block nvme2n1: no usable path - requeuing I/O
<4>[ 5826.546609] block nvme2n1: no usable path - requeuing I/O
<4>[ 5826.546624] block nvme2n1: no usable path - requeuing I/O
<4>[ 5826.546627] block nvme2n1: no usable path - requeuing I/O
<4>[ 5826.546629] block nvme2n1: no usable path - requeuing I/O
<4>[ 5826.546633] block nvme2n1: no usable path - requeuing I/O
<4>[ 5826.546635] block nvme2n1: no usable path - requeuing I/O
<4>[ 5826.546636] block nvme2n1: no usable path - requeuing I/O
<4>[ 5837.481054] hfi1_opa0.8024: P_Key 0x8024 is not found
<4>[ 5837.486199] hfi1_opa0.8022: P_Key 0x8022 is not found
<6>[ 5837.503278] IPv6: ADDRCONF(NETDEV_CHANGE): hfi1_opa0: link becomes ready
<4>[ 5882.465388] hfi1_opa0.8024: P_Key 0x8024 is not found
<4>[ 5882.470520] hfi1_opa0.8022: P_Key 0x8022 is not found
<6>[ 5882.487647] IPv6: ADDRCONF(NETDEV_CHANGE): hfi1_opa0: link becomes ready
<4>[ 5888.515395] nvme nvme2: I/O 1 QID 0 timeout
<3>[ 5888.533361] nvme nvme2: Connect command failed, error wo/DNR bit: 7
<6>[ 5888.539645] nvme nvme2: failed to connect queue: 0 ret=7
<6>[ 5888.544994] nvme nvme2: Reconnecting in 10 seconds...
<6>[ 5898.774955] nvme nvme2: creating 48 I/O queues.
<6>[ 5899.570053] nvme nvme2: Successfully reconnected (2 attempts)
<4>[ 5927.818466] hfi1_opa0.8024: P_Key 0x8024 is not found
<4>[ 5927.823550] hfi1_opa0.8022: P_Key 0x8022 is not found
<6>[ 5927.843972] IPv6: ADDRCONF(NETDEV_CHANGE): hfi1_opa0: link becomes ready
<6>[ 6004.970479] nvme nvme2: Removing ctrl: NQN "testnqn"


Best Regards,
  Yi Zhang


----- Original Message -----
From: "Max Gurtovoy" <maxg@mellanox.com>
To: "Sagi Grimberg" <sagi@grimberg.me>, "Yi Zhang" <yi.zhang@redhat.com>, linux-nvme@lists.infradead.org
Sent: Monday, September 23, 2019 11:25:31 PM
Subject: Re: NVMeoF RDMA IB: I/O timeout and NULL pointer observed during rescan_controller/reset_controller with fio background

Any update, Yi?

We must fix this issue...

On 9/20/2019 7:58 PM, Sagi Grimberg wrote:
> Thanks for reporting Yi,
>
> Does this fix your issue?
>
> -- 
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index dfa07bb9dfeb..981da9ce3cfc 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -614,7 +614,8 @@ static int nvme_rdma_start_queue(struct 
> nvme_rdma_ctrl *ctrl, int idx)
>         if (!ret) {
>                 set_bit(NVME_RDMA_Q_LIVE, &queue->flags);
>         } else {
> -               __nvme_rdma_stop_queue(queue);
> +               if (test_bit(NVME_RDMA_Q_ALLOCATED, &queue->flags))
> +                       __nvme_rdma_stop_queue(queue);
>                 dev_info(ctrl->ctrl.device,
>                         "failed to connect queue: %d ret=%d\n", idx, 
> ret);
>         }
> -- 

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

_______________________________________________
Linux-nvme mailing list
Linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2019-09-27  8:17 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <1823328454.445263.1568796846850.JavaMail.zimbra@redhat.com>
2019-09-18  9:13 ` NVMeoF RDMA IB: I/O timeout and NULL pointer observed during rescan_controller/reset_controller with fio background Yi Zhang
2019-09-18 14:21   ` Max Gurtovoy
2019-09-20  3:37     ` Yi Zhang
2019-09-20 16:58       ` Sagi Grimberg
2019-09-23 15:25         ` Max Gurtovoy
2019-09-24  4:52           ` Yi Zhang
2019-09-27  8:17           ` Yi Zhang
