* Unexpected issues with 2 NVME initiators using the same target
@ 2017-02-21 19:38 shahar.salzman
       [not found] ` <08131a05-1f56-ef61-990a-7fff04eea095-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 171+ messages in thread
From: shahar.salzman @ 2017-02-21 19:38 UTC (permalink / raw)


Hi experts,

I have been running into some unexplained behaviour with NVMEf. I hope
that you can help me, or at least point me in the right direction with
this issue.

I am using 2 initiators and 1 target (nvmet) with 1 subsystem and 4
backend devices. The kernel is 4.9.6, and the NVMe/RDMA drivers are all
from the vanilla kernel; I had a problem connecting NVMe using the OFED
drivers, so I removed mlx_compat and everything that depends on it.
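
For reference, the setup is roughly equivalent to the sketch below (the
NQN, IP address and backing devices are placeholders rather than my
actual configuration):

#!/usr/bin/env python3
# Sketch of the nvmet target configuration and the initiator connect.
# The NQN, backing devices and IP address are placeholders, not the
# values from my setup.
import os

CFG = "/sys/kernel/config/nvmet"
NQN = "nqn.2017-02.com.example:subsys1"                      # placeholder
BACKENDS = ["/dev/sdb", "/dev/sdc", "/dev/sdd", "/dev/sde"]  # 4 backend devices
TRADDR = "192.168.1.10"                                      # target RDMA IP, placeholder

def write(path, value):
    with open(path, "w") as f:
        f.write(value)

# One subsystem, open to any host
subsys = os.path.join(CFG, "subsystems", NQN)
os.makedirs(subsys, exist_ok=True)
write(os.path.join(subsys, "attr_allow_any_host"), "1")

# Four namespaces, one per backend device
for nsid, dev in enumerate(BACKENDS, start=1):
    ns = os.path.join(subsys, "namespaces", str(nsid))
    os.makedirs(ns, exist_ok=True)
    write(os.path.join(ns, "device_path"), dev)
    write(os.path.join(ns, "enable"), "1")

# One RDMA port exporting the subsystem
port = os.path.join(CFG, "ports", "1")
os.makedirs(port, exist_ok=True)
write(os.path.join(port, "addr_trtype"), "rdma")
write(os.path.join(port, "addr_adrfam"), "ipv4")
write(os.path.join(port, "addr_traddr"), TRADDR)
write(os.path.join(port, "addr_trsvcid"), "4420")
os.symlink(subsys, os.path.join(port, "subsystems", NQN))

# Each of the two initiators then connects with nvme-cli:
#   nvme connect -t rdma -n nqn.2017-02.com.example:subsys1 -a 192.168.1.10 -s 4420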

When I perform simultaneous writes (non-direct fio) from both of the
initiators to the same device (overlapping areas), I get an NVMEf
disconnect followed by "dump error cqe", a successful reconnect, and
then on one of the servers I get a WARN_ON. After this the server gets
stuck and I have to power cycle it to get it back up...
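
The per-initiator workload is essentially the following (again a sketch;
the device name and fio parameters are illustrative rather than my exact
job file):

#!/usr/bin/env python3
# Sketch of the reproducer workload: a buffered (non-direct), sequential
# write job over the same region of the shared namespace, started on both
# initiators at the same time. Device name and job parameters are
# illustrative.
import subprocess

DEVICE = "/dev/nvme0n1"   # the shared namespace as seen on each initiator

subprocess.run([
    "fio",
    "--name=overlap-write",
    "--filename=" + DEVICE,
    "--rw=write",          # sequential writes, so both hosts hit the same LBAs
    "--bs=128k",
    "--direct=0",          # buffered I/O, as in the runs that fail
    "--ioengine=libaio",
    "--iodepth=32",
    "--numjobs=4",
    "--size=10g",
    "--time_based",
    "--runtime=600",
    "--group_reporting",
], check=True)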

The other server, BTW, also gets stuck, but it returns to normal after
several minutes.

Below are the errors on the problematic server; I can provide the full
logs from both initiators and the target if needed.

When trying to recreate, I also got the initiator stuck (this time
without the WARN_ON) by attempting to "kill -9" the fio process.

I was thinking of adding some ftrace events to the NVMe host/target to
ease debugging. Do you think that this would be beneficial? Is there
existing work on this?
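
In the meantime, something like the sketch below can at least capture the
existing block-layer tracepoints during a reproduction (assuming tracefs
is mounted in the usual debugfs location); dedicated nvme host/target
events would still need to be added in the source:

#!/usr/bin/env python3
# Sketch: until nvme host/target trace events exist, enable the existing
# block-layer tracepoints around a reproduction and stream them from
# trace_pipe. Assumes tracefs under the usual debugfs mount point.
import os

TRACING = "/sys/kernel/debug/tracing"

def write(path, value):
    with open(path, "w") as f:
        f.write(value)

# Enable all block-layer events (block_rq_issue, block_rq_complete, ...)
write(os.path.join(TRACING, "events", "block", "enable"), "1")
write(os.path.join(TRACING, "tracing_on"), "1")

# Stream events while the fio workload runs; interrupt to stop.
with open(os.path.join(TRACING, "trace_pipe")) as pipe:
    for line in pipe:
        print(line, end="")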

Thanks,
Shahar

Here are the printouts from the server that got stuck:

Feb  6 09:20:13 kblock01-knode02 kernel: [59976.204216] 
mlx5_0:dump_cqe:262:(pid 0): dump error cqe
Feb  6 09:20:13 kblock01-knode02 kernel: [59976.204219] 00000000 
00000000 00000000 00000000
Feb  6 09:20:13 kblock01-knode02 kernel: [59976.204220] 00000000 
00000000 00000000 00000000
Feb  6 09:20:13 kblock01-knode02 kernel: [59976.204220] 00000000 
00000000 00000000 00000000
Feb  6 09:20:13 kblock01-knode02 kernel: [59976.204221] 00000000 
08007806 25000129 015557d0
Feb  6 09:20:13 kblock01-knode02 kernel: [59976.204234] nvme nvme0: 
MEMREG for CQE 0xffff96ddd747a638 failed with status
memory management operation error (6)
Feb  6 09:20:13 kblock01-knode02 kernel: [59976.204375] nvme nvme0: 
reconnecting in 10 seconds
Feb  6 09:20:13 kblock01-knode02 kernel: [59976.205512] 
mlx5_0:dump_cqe:262:(pid 0): dump error cqe
Feb  6 09:20:13 kblock01-knode02 kernel: [59976.205514] 00000000 
00000000 00000000 00000000
Feb  6 09:20:13 kblock01-knode02 kernel: [59976.205515] 00000000 
00000000 00000000 00000000
Feb  6 09:20:13 kblock01-knode02 kernel: [59976.205515] 00000000 
00000000 00000000 00000000
Feb  6 09:20:13 kblock01-knode02 kernel: [59976.205516] 00000000 
08007806 25000126 00692bd0
Feb  6 09:20:23 kblock01-knode02 kernel: [59986.452887] nvme nvme0: 
Successfully reconnected
Feb  6 09:20:24 kblock01-knode02 kernel: [59986.682887] 
mlx5_0:dump_cqe:262:(pid 0): dump error cqe
Feb  6 09:20:24 kblock01-knode02 kernel: [59986.682890] 00000000 
00000000 00000000 00000000
Feb  6 09:20:24 kblock01-knode02 kernel: [59986.682891] 00000000 
00000000 00000000 00000000
Feb  6 09:20:24 kblock01-knode02 kernel: [59986.682892] 00000000 
00000000 00000000 00000000
Feb  6 09:20:24 kblock01-knode02 kernel: [59986.682892] 00000000 
08007806 25000158 04cdd7d0
...
Feb  6 09:20:24 kblock01-knode02 kernel: [59986.687737] 
mlx5_0:dump_cqe:262:(pid 0): dump error cqe
Feb  6 09:20:24 kblock01-knode02 kernel: [59986.687739] 00000000 
00000000 00000000 00000000
Feb  6 09:20:24 kblock01-knode02 kernel: [59986.687740] 00000000 
00000000 00000000 00000000
Feb  6 09:20:24 kblock01-knode02 kernel: [59986.687740] 00000000 
00000000 00000000 00000000
Feb  6 09:20:24 kblock01-knode02 kernel: [59986.687741] 00000000 
93005204 00000155 00a385e0
Feb  6 09:20:34 kblock01-knode02 kernel: [59997.389290] nvme nvme0: 
Successfully reconnected
Feb  6 09:21:19 kblock01-knode02 rsyslogd: -- MARK --
Feb  6 09:21:38 kblock01-knode02 kernel: [60060.927832] 
mlx5_0:dump_cqe:262:(pid 0): dump error cqe
Feb  6 09:21:38 kblock01-knode02 kernel: [60060.927835] 00000000 
00000000 00000000 00000000
Feb  6 09:21:38 kblock01-knode02 kernel: [60060.927836] 00000000 
00000000 00000000 00000000
Feb  6 09:21:38 kblock01-knode02 kernel: [60060.927837] 00000000 
00000000 00000000 00000000
Feb  6 09:21:38 kblock01-knode02 kernel: [60060.927837] 00000000 
93005204 00000167 b44e76e0
Feb  6 09:21:38 kblock01-knode02 kernel: [60060.927846] nvme nvme0: RECV 
for CQE 0xffff96fe64f18750 failed with status local protection error (4)
Feb  6 09:21:38 kblock01-knode02 kernel: [60060.928200] nvme nvme0: 
reconnecting in 10 seconds
...
Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736182] mlx5_core 
0000:04:00.0: wait_func:879:(pid 22709): 2ERR_QP(0x507) timeout. Will 
cause a leak of a command resource
Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736190] ------------[ 
cut here ]------------
Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736211] WARNING: CPU: 18 
PID: 22709 at drivers/infiniband/core/verbs.c:1963 
__ib_drain_sq+0x135/0x1d0 [ib_core]
Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736212] failed to drain 
send queue: -110
Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736213] Modules linked 
in: nvme_rdma rdma_cm ib_cm iw_cm nvme_fabrics nvme_core 
ocs_fc_scst(POE) scst(OE) mptctl mptbase qla2xxx_scst(OE) 
scsi_transport_fc dm_multipath drbd lru_cache netconsole mst_pciconf(OE) 
nfsd nfs_acl auth_rpcgss ipmi_devintf lockd sunrpc grace ipt_MASQUERADE 
nf_nat_masquerade_ipv4 xt_nat iptable_nat nf_nat_ipv4 nf_nat 
nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack fuse binfmt_misc 
iTCO_wdt iTCO_vendor_support pcspkr serio_raw joydev igb i2c_i801 
i2c_smbus lpc_ich mei_me mei ioatdma dca ses enclosure ipmi_ssif ipmi_si 
ipmi_msghandler bnx2x libcrc32c mdio mlx5_ib ib_core mlx5_core devlink 
ptp pps_core tpm_tis tpm_tis_core tpm ext4(E) mbcache(E) jbd2(E) isci(E) 
libsas(E) mpt3sas(E) scsi_transport_sas(E) raid_class(E) megaraid_sas(E) 
wmi(E) mgag200(E) ttm(E) drm_kms_helper(E)
Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736293]  drm(E) 
i2c_algo_bit(E) [last unloaded: nvme_core]
Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736301] CPU: 18 PID: 
22709 Comm: kworker/18:4 Tainted: P           OE   4.9.6-KM1 #0
Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736303] Hardware name: 
Supermicro SYS-1027R-72BRFTP5-EI007/X9DRW-7/iTPF, BIOS 3.0a 01/22/2014
Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736312] Workqueue: 
nvme_rdma_wq nvme_rdma_reconnect_ctrl_work [nvme_rdma]
Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736315] ffff96de27e6f9b8 
ffffffffb537f3ff ffffffffc06a00e5 ffff96de27e6fa18
Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736320] ffff96de27e6fa18 
0000000000000000 ffff96de27e6fa08 ffffffffb5091a7d
Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736324] 0000000300000000 
000007ab00000006 0507000000000000 ffff96de27e6fad8
Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736328] Call Trace:
Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736341] 
[<ffffffffb537f3ff>] dump_stack+0x67/0x98
Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736352] 
[<ffffffffc06a00e5>] ? __ib_drain_sq+0x135/0x1d0 [ib_core]
Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736364] 
[<ffffffffb5091a7d>] __warn+0xfd/0x120
Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736368] 
[<ffffffffb5091b59>] warn_slowpath_fmt+0x49/0x50
Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736378] 
[<ffffffffc069fdb5>] ? ib_modify_qp+0x45/0x50 [ib_core]
Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736388] 
[<ffffffffc06a00e5>] __ib_drain_sq+0x135/0x1d0 [ib_core]
Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736398] 
[<ffffffffc069f5a0>] ? ib_create_srq+0xa0/0xa0 [ib_core]
Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736408] 
[<ffffffffc06a01a5>] ib_drain_sq+0x25/0x30 [ib_core]
Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736418] 
[<ffffffffc06a01c6>] ib_drain_qp+0x16/0x40 [ib_core]
Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736422] 
[<ffffffffc0c4d25b>] nvme_rdma_stop_and_free_queue+0x2b/0x50 [nvme_rdma]
Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736426] 
[<ffffffffc0c4d2ad>] nvme_rdma_free_io_queues+0x2d/0x40 [nvme_rdma]
Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736429] 
[<ffffffffc0c4d884>] nvme_rdma_reconnect_ctrl_work+0x34/0x1e0 [nvme_rdma]
Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736434] 
[<ffffffffb50ac7ce>] process_one_work+0x17e/0x4f0
Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736444] 
[<ffffffffb50cefc5>] ? dequeue_task_fair+0x85/0x870
Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736454] 
[<ffffffffb5799c6a>] ? schedule+0x3a/0xa0
Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736456] 
[<ffffffffb50ad653>] worker_thread+0x153/0x660
Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736464] 
[<ffffffffb5026b4c>] ? __switch_to+0x1dc/0x670
Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736468] 
[<ffffffffb5799706>] ? __schedule+0x226/0x6a0
Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736471] 
[<ffffffffb50be7c2>] ? default_wake_function+0x12/0x20
Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736474] 
[<ffffffffb50d7636>] ? __wake_up_common+0x56/0x90
Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736477] 
[<ffffffffb50ad500>] ? workqueue_prepare_cpu+0x80/0x80
Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736480] 
[<ffffffffb5799c6a>] ? schedule+0x3a/0xa0
Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736483] 
[<ffffffffb50ad500>] ? workqueue_prepare_cpu+0x80/0x80
Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736487] 
[<ffffffffb50b237d>] kthread+0xcd/0xf0
Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736492] 
[<ffffffffb50bc40e>] ? schedule_tail+0x1e/0xc0
Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736495] 
[<ffffffffb50b22b0>] ? __kthread_init_worker+0x40/0x40
Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736499] 
[<ffffffffb579ded5>] ret_from_fork+0x25/0x30
Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736502] ---[ end trace 
eb0e5ba7dc81a687 ]---
Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176054] mlx5_core 
0000:04:00.0: wait_func:879:(pid 22709): 2ERR_QP(0x507) timeout. Will 
cause a leak of a command resource
Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176059] ------------[ 
cut here ]------------
Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176073] WARNING: CPU: 18 
PID: 22709 at drivers/infiniband/core/verbs.c:1998 
__ib_drain_rq+0x12a/0x1c0 [ib_core]
Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176075] failed to drain 
recv queue: -110
Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176076] Modules linked 
in: nvme_rdma rdma_cm ib_cm iw_cm nvme_fabrics nvme_core 
ocs_fc_scst(POE) scst(OE) mptctl mptbase qla2xxx_scst(OE) 
scsi_transport_fc dm_multipath drbd lru_cache netconsole mst_pciconf(OE) 
nfsd nfs_acl auth_rpcgss ipmi_devintf lockd sunrpc grace ipt_MASQUERADE 
nf_nat_masquerade_ipv4 xt_nat iptable_nat nf_nat_ipv4 nf_nat 
nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack fuse binfmt_misc 
iTCO_wdt iTCO_vendor_support pcspkr serio_raw joydev igb i2c_i801 
i2c_smbus lpc_ich mei_me mei ioatdma dca ses enclosure ipmi_ssif ipmi_si 
ipmi_msghandler bnx2x libcrc32c mdio mlx5_ib ib_core mlx5_core devlink 
ptp pps_core tpm_tis tpm_tis_core tpm ext4(E) mbcache(E) jbd2(E) isci(E) 
libsas(E) mpt3sas(E) scsi_transport_sas(E) raid_class(E) megaraid_sas(E) 
wmi(E) mgag200(E) ttm(E) drm_kms_helper(E)
Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176134]  drm(E) 
i2c_algo_bit(E) [last unloaded: nvme_core]
Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176140] CPU: 18 PID: 
22709 Comm: kworker/18:4 Tainted: P        W  OE   4.9.6-KM1 #0
Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176141] Hardware name: 
Supermicro SYS-1027R-72BRFTP5-EI007/X9DRW-7/iTPF, BIOS 3.0a 01/22/2014
Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176147] Workqueue: 
nvme_rdma_wq nvme_rdma_reconnect_ctrl_work [nvme_rdma]
Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176150] ffff96de27e6f9b8 
ffffffffb537f3ff ffffffffc069feea ffff96de27e6fa18
Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176155] ffff96de27e6fa18 
0000000000000000 ffff96de27e6fa08 ffffffffb5091a7d
Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176159] 0000000300000000 
000007ce00000006 0507000000000000 ffff96de27e6fad8
Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176163] Call Trace:
Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176169] 
[<ffffffffb537f3ff>] dump_stack+0x67/0x98
Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176180] 
[<ffffffffc069feea>] ? __ib_drain_rq+0x12a/0x1c0 [ib_core]
Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176185] 
[<ffffffffb5091a7d>] __warn+0xfd/0x120
Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176189] 
[<ffffffffb5091b59>] warn_slowpath_fmt+0x49/0x50
Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176199] 
[<ffffffffc069fdb5>] ? ib_modify_qp+0x45/0x50 [ib_core]
Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176216] 
[<ffffffffc06de770>] ? mlx5_ib_modify_qp+0x980/0xec0 [mlx5_ib]
Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176225] 
[<ffffffffc069feea>] __ib_drain_rq+0x12a/0x1c0 [ib_core]
Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176235] 
[<ffffffffc069f5a0>] ? ib_create_srq+0xa0/0xa0 [ib_core]
Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176246] 
[<ffffffffc069ffa5>] ib_drain_rq+0x25/0x30 [ib_core]
Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176255] 
[<ffffffffc06a01dc>] ib_drain_qp+0x2c/0x40 [ib_core]
Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176259] 
[<ffffffffc0c4d25b>] nvme_rdma_stop_and_free_queue+0x2b/0x50 [nvme_rdma]
Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176263] 
[<ffffffffc0c4d2ad>] nvme_rdma_free_io_queues+0x2d/0x40 [nvme_rdma]
Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176267] 
[<ffffffffc0c4d884>] nvme_rdma_reconnect_ctrl_work+0x34/0x1e0 [nvme_rdma]
Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176270] 
[<ffffffffb50ac7ce>] process_one_work+0x17e/0x4f0
Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176275] 
[<ffffffffb50cefc5>] ? dequeue_task_fair+0x85/0x870
Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176278] 
[<ffffffffb5799c6a>] ? schedule+0x3a/0xa0
Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176281] 
[<ffffffffb50ad653>] worker_thread+0x153/0x660
Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176285] 
[<ffffffffb5026b4c>] ? __switch_to+0x1dc/0x670
Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176289] 
[<ffffffffb5799706>] ? __schedule+0x226/0x6a0
Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176291] 
[<ffffffffb50be7c2>] ? default_wake_function+0x12/0x20
Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176294] 
[<ffffffffb50d7636>] ? __wake_up_common+0x56/0x90
Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176297] 
[<ffffffffb50ad500>] ? workqueue_prepare_cpu+0x80/0x80
Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176300] 
[<ffffffffb5799c6a>] ? schedule+0x3a/0xa0
Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176302] 
[<ffffffffb50ad500>] ? workqueue_prepare_cpu+0x80/0x80
Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176306] 
[<ffffffffb50b237d>] kthread+0xcd/0xf0
Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176310] 
[<ffffffffb50bc40e>] ? schedule_tail+0x1e/0xc0
Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176313] 
[<ffffffffb50b22b0>] ? __kthread_init_worker+0x40/0x40
Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176316] 
[<ffffffffb579ded5>] ret_from_fork+0x25/0x30
Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176318] ---[ end trace 
eb0e5ba7dc81a688 ]---
Feb  6 09:25:26 kblock01-knode02 udevd[9344]: worker [21322] 
unexpectedly returned with status 0x0100
Feb  6 09:25:26 kblock01-knode02 udevd[9344]: worker [21322] failed 
while handling '/devices/virtual/nvme-fabrics/ctl/nvme0/nvme0n1'
Feb  6 09:25:26 kblock01-knode02 udevd[9344]: worker [21323] 
unexpectedly returned with status 0x0100
Feb  6 09:25:26 kblock01-knode02 udevd[9344]: worker [21323] failed 
while handling '/devices/virtual/nvme-fabrics/ctl/nvme0/nvme0n2'
Feb  6 09:25:26 kblock01-knode02 udevd[9344]: worker [21741] 
unexpectedly returned with status 0x0100
Feb  6 09:25:26 kblock01-knode02 udevd[9344]: worker [21741] failed 
while handling '/devices/virtual/nvme-fabrics/ctl/nvme0/nvme0n3'
Feb  6 09:25:57 kblock01-knode02 kernel: [60319.615916] mlx5_core 
0000:04:00.0: wait_func:879:(pid 22709): 2RST_QP(0x50a) timeout. Will 
cause a leak of a command resource
Feb  6 09:25:57 kblock01-knode02 kernel: [60319.615922] 
mlx5_0:destroy_qp_common:1936:(pid 22709): mlx5_ib: modify QP 0x00015e 
to RESET failed
Feb  6 09:26:19 kblock01-knode02 udevd[9344]: worker [21740] 
unexpectedly returned with status 0x0100
Feb  6 09:26:19 kblock01-knode02 udevd[9344]: worker [21740] failed 
while handling '/devices/virtual/nvme-fabrics/ctl/nvme0/nvme0n4'


* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-02-21 19:38 Unexpected issues with 2 NVME initiators using the same target shahar.salzman
@ 2017-02-21 22:50     ` Sagi Grimberg
  0 siblings, 0 replies; 171+ messages in thread
From: Sagi Grimberg @ 2017-02-21 22:50 UTC (permalink / raw)
  To: shahar.salzman, linux-nvme@lists.infradead.org,
	linux-rdma@vger.kernel.org


> I am using 2 initiators and 1 target (nvmet) with 1 subsystem and 4
> backend devices. The kernel is 4.9.6, and the NVMe/RDMA drivers are all
> from the vanilla kernel; I had a problem connecting NVMe using the OFED
> drivers, so I removed mlx_compat and everything that depends on it.

Would it be possible to test with the latest upstream kernel?

>
> When I perform simultaneous writes (non-direct fio) from both of the
> initiators to the same device (overlapping areas), I get an NVMEf
> disconnect followed by "dump error cqe", a successful reconnect, and
> then on one of the servers I get a WARN_ON. After this the server gets
> stuck and I have to power cycle it to get it back up...

The error CQEs seem to indicate that a memory registration operation
failed, which then escalated into something worse.

I have noticed issues before with CX4 having problems with memory
registration in the presence of network retransmissions (due to
network congestion).

I notified Mellanox folks on that too; CC'ing linux-rdma for some
more attention.

After that, I see that ib_modify_qp failed, which I've never seen
before (it might indicate that the device is in bad shape), and the
WARN_ON is really weird given that nvme-rdma never uses IB_POLL_DIRECT.

> Here are the printouts from the server that got stuck:
>
> [kernel log snipped]


* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-02-21 22:50     ` Sagi Grimberg
@ 2017-02-22 16:52         ` Laurence Oberman
  -1 siblings, 0 replies; 171+ messages in thread
From: Laurence Oberman @ 2017-02-22 16:52 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: shahar.salzman, linux-nvme@lists.infradead.org,
	linux-rdma@vger.kernel.org



----- Original Message -----
> From: "Sagi Grimberg" <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
> To: "shahar.salzman" <shahar.salzman-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Sent: Tuesday, February 21, 2017 5:50:39 PM
> Subject: Re: Unexpected issues with 2 NVME initiators using the same target
> 
> 
> > I am using 2 initiators and 1 target (nvmet) with 1 subsystem and 4
> > backend devices. The kernel is 4.9.6, and the NVMe/RDMA drivers are all
> > from the vanilla kernel; I had a problem connecting NVMe using the OFED
> > drivers, so I removed mlx_compat and everything that depends on it.
> 
> Would it be possible to test with the latest upstream kernel?
> 
> >
> > When I perform simultaneous writes (non-direct fio) from both of the
> > initiators to the same device (overlapping areas), I get an NVMEf
> > disconnect followed by "dump error cqe", a successful reconnect, and
> > then on one of the servers I get a WARN_ON. After this the server gets
> > stuck and I have to power cycle it to get it back up...
> 
> The error CQEs seem to indicate that a memory registration operation
> failed, which then escalated into something worse.
> 
> I have noticed issues before with CX4 having problems with memory
> registration in the presence of network retransmissions (due to
> network congestion).
> 
> I notified Mellanox folks on that too; CC'ing linux-rdma for some
> more attention.
> 
> After that, I see that ib_modify_qp failed, which I've never seen
> before (it might indicate that the device is in bad shape), and the
> WARN_ON is really weird given that nvme-rdma never uses IB_POLL_DIRECT.
> 
> > Here are the printouts from the server that got stuck:
> >
> > [kernel log snipped]
> > [<ffffffffb537f3ff>] dump_stack+0x67/0x98
> > Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736352]
> > [<ffffffffc06a00e5>] ? __ib_drain_sq+0x135/0x1d0 [ib_core]
> > Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736364]
> > [<ffffffffb5091a7d>] __warn+0xfd/0x120
> > Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736368]
> > [<ffffffffb5091b59>] warn_slowpath_fmt+0x49/0x50
> > Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736378]
> > [<ffffffffc069fdb5>] ? ib_modify_qp+0x45/0x50 [ib_core]
> > Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736388]
> > [<ffffffffc06a00e5>] __ib_drain_sq+0x135/0x1d0 [ib_core]
> > Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736398]
> > [<ffffffffc069f5a0>] ? ib_create_srq+0xa0/0xa0 [ib_core]
> > Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736408]
> > [<ffffffffc06a01a5>] ib_drain_sq+0x25/0x30 [ib_core]
> > Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736418]
> > [<ffffffffc06a01c6>] ib_drain_qp+0x16/0x40 [ib_core]
> > Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736422]
> > [<ffffffffc0c4d25b>] nvme_rdma_stop_and_free_queue+0x2b/0x50 [nvme_rdma]
> > Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736426]
> > [<ffffffffc0c4d2ad>] nvme_rdma_free_io_queues+0x2d/0x40 [nvme_rdma]
> > Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736429]
> > [<ffffffffc0c4d884>] nvme_rdma_reconnect_ctrl_work+0x34/0x1e0 [nvme_rdma]
> > Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736434]
> > [<ffffffffb50ac7ce>] process_one_work+0x17e/0x4f0
> > Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736444]
> > [<ffffffffb50cefc5>] ? dequeue_task_fair+0x85/0x870
> > Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736454]
> > [<ffffffffb5799c6a>] ? schedule+0x3a/0xa0
> > Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736456]
> > [<ffffffffb50ad653>] worker_thread+0x153/0x660
> > Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736464]
> > [<ffffffffb5026b4c>] ? __switch_to+0x1dc/0x670
> > Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736468]
> > [<ffffffffb5799706>] ? __schedule+0x226/0x6a0
> > Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736471]
> > [<ffffffffb50be7c2>] ? default_wake_function+0x12/0x20
> > Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736474]
> > [<ffffffffb50d7636>] ? __wake_up_common+0x56/0x90
> > Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736477]
> > [<ffffffffb50ad500>] ? workqueue_prepare_cpu+0x80/0x80
> > Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736480]
> > [<ffffffffb5799c6a>] ? schedule+0x3a/0xa0
> > Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736483]
> > [<ffffffffb50ad500>] ? workqueue_prepare_cpu+0x80/0x80
> > Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736487]
> > [<ffffffffb50b237d>] kthread+0xcd/0xf0
> > Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736492]
> > [<ffffffffb50bc40e>] ? schedule_tail+0x1e/0xc0
> > Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736495]
> > [<ffffffffb50b22b0>] ? __kthread_init_worker+0x40/0x40
> > Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736499]
> > [<ffffffffb579ded5>] ret_from_fork+0x25/0x30
> > Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736502] ---[ end trace
> > eb0e5ba7dc81a687 ]---
> > Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176054] mlx5_core
> > 0000:04:00.0: wait_func:879:(pid 22709): 2ERR_QP(0x507) timeout. Will
> > cause a leak of a command resource
> > Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176059] ------------[
> > cut here ]------------
> > Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176073] WARNING: CPU: 18
> > PID: 22709 at drivers/infiniband/core/verbs.c:1998
> > __ib_drain_rq+0x12a/0x1c0 [ib_core]
> > Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176075] failed to drain
> > recv queue: -110
> > Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176076] Modules linked
> > in: nvme_rdma rdma_cm ib_cm iw_cm nvme_fabrics nvme_core
> > ocs_fc_scst(POE) scst(OE) mptctl mptbase qla2xxx_scst(OE)
> > scsi_transport_fc dm_multipath drbd lru_cache netconsole mst_pciconf(OE)
> > nfsd nfs_acl auth_rpcgss ipmi_devintf lockd sunrpc grace ipt_MASQUERADE
> > nf_nat_masquerade_ipv4 xt_nat iptable_nat nf_nat_ipv4 nf_nat
> > nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack fuse binfmt_misc
> > iTCO_wdt iTCO_vendor_support pcspkr serio_raw joydev igb i2c_i801
> > i2c_smbus lpc_ich mei_me mei ioatdma dca ses enclosure ipmi_ssif ipmi_si
> > ipmi_msghandler bnx2x libcrc32c mdio mlx5_ib ib_core mlx5_core devlink
> > ptp pps_core tpm_tis tpm_tis_core tpm ext4(E) mbcache(E) jbd2(E) isci(E)
> > libsas(E) mpt3sas(E) scsi_transport_sas(E) raid_class(E) megaraid_sas(E)
> > wmi(E) mgag200(E) ttm(E) drm_kms_helper(E)
> > Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176134]  drm(E)
> > i2c_algo_bit(E) [last unloaded: nvme_core]
> > Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176140] CPU: 18 PID:
> > 22709 Comm: kworker/18:4 Tainted: P        W  OE   4.9.6-KM1 #0
> > Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176141] Hardware name:
> > Supermicro SYS-1027R-72BRFTP5-EI007/X9DRW-7/iTPF, BIOS 3.0a 01/22/2014
> > Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176147] Workqueue:
> > nvme_rdma_wq nvme_rdma_reconnect_ctrl_work [nvme_rdma]
> > Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176150] ffff96de27e6f9b8
> > ffffffffb537f3ff ffffffffc069feea ffff96de27e6fa18
> > Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176155] ffff96de27e6fa18
> > 0000000000000000 ffff96de27e6fa08 ffffffffb5091a7d
> > Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176159] 0000000300000000
> > 000007ce00000006 0507000000000000 ffff96de27e6fad8
> > Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176163] Call Trace:
> > Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176169]
> > [<ffffffffb537f3ff>] dump_stack+0x67/0x98
> > Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176180]
> > [<ffffffffc069feea>] ? __ib_drain_rq+0x12a/0x1c0 [ib_core]
> > Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176185]
> > [<ffffffffb5091a7d>] __warn+0xfd/0x120
> > Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176189]
> > [<ffffffffb5091b59>] warn_slowpath_fmt+0x49/0x50
> > Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176199]
> > [<ffffffffc069fdb5>] ? ib_modify_qp+0x45/0x50 [ib_core]
> > Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176216]
> > [<ffffffffc06de770>] ? mlx5_ib_modify_qp+0x980/0xec0 [mlx5_ib]
> > Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176225]
> > [<ffffffffc069feea>] __ib_drain_rq+0x12a/0x1c0 [ib_core]
> > Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176235]
> > [<ffffffffc069f5a0>] ? ib_create_srq+0xa0/0xa0 [ib_core]
> > Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176246]
> > [<ffffffffc069ffa5>] ib_drain_rq+0x25/0x30 [ib_core]
> > Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176255]
> > [<ffffffffc06a01dc>] ib_drain_qp+0x2c/0x40 [ib_core]
> > Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176259]
> > [<ffffffffc0c4d25b>] nvme_rdma_stop_and_free_queue+0x2b/0x50 [nvme_rdma]
> > Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176263]
> > [<ffffffffc0c4d2ad>] nvme_rdma_free_io_queues+0x2d/0x40 [nvme_rdma]
> > Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176267]
> > [<ffffffffc0c4d884>] nvme_rdma_reconnect_ctrl_work+0x34/0x1e0 [nvme_rdma]
> > Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176270]
> > [<ffffffffb50ac7ce>] process_one_work+0x17e/0x4f0
> > Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176275]
> > [<ffffffffb50cefc5>] ? dequeue_task_fair+0x85/0x870
> > Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176278]
> > [<ffffffffb5799c6a>] ? schedule+0x3a/0xa0
> > Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176281]
> > [<ffffffffb50ad653>] worker_thread+0x153/0x660
> > Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176285]
> > [<ffffffffb5026b4c>] ? __switch_to+0x1dc/0x670
> > Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176289]
> > [<ffffffffb5799706>] ? __schedule+0x226/0x6a0
> > Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176291]
> > [<ffffffffb50be7c2>] ? default_wake_function+0x12/0x20
> > Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176294]
> > [<ffffffffb50d7636>] ? __wake_up_common+0x56/0x90
> > Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176297]
> > [<ffffffffb50ad500>] ? workqueue_prepare_cpu+0x80/0x80
> > Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176300]
> > [<ffffffffb5799c6a>] ? schedule+0x3a/0xa0
> > Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176302]
> > [<ffffffffb50ad500>] ? workqueue_prepare_cpu+0x80/0x80
> > Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176306]
> > [<ffffffffb50b237d>] kthread+0xcd/0xf0
> > Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176310]
> > [<ffffffffb50bc40e>] ? schedule_tail+0x1e/0xc0
> > Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176313]
> > [<ffffffffb50b22b0>] ? __kthread_init_worker+0x40/0x40
> > Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176316]
> > [<ffffffffb579ded5>] ret_from_fork+0x25/0x30
> > Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176318] ---[ end trace
> > eb0e5ba7dc81a688 ]---
> > Feb  6 09:25:26 kblock01-knode02 udevd[9344]: worker [21322]
> > unexpectedly returned with status 0x0100
> > Feb  6 09:25:26 kblock01-knode02 udevd[9344]: worker [21322] failed
> > while handling '/devices/virtual/nvme-fabrics/ctl/nvme0/nvme0n1'
> > Feb  6 09:25:26 kblock01-knode02 udevd[9344]: worker [21323]
> > unexpectedly returned with status 0x0100
> > Feb  6 09:25:26 kblock01-knode02 udevd[9344]: worker [21323] failed
> > while handling '/devices/virtual/nvme-fabrics/ctl/nvme0/nvme0n2'
> > Feb  6 09:25:26 kblock01-knode02 udevd[9344]: worker [21741]
> > unexpectedly returned with status 0x0100
> > Feb  6 09:25:26 kblock01-knode02 udevd[9344]: worker [21741] failed
> > while handling '/devices/virtual/nvme-fabrics/ctl/nvme0/nvme0n3'
> > Feb  6 09:25:57 kblock01-knode02 kernel: [60319.615916] mlx5_core
> > 0000:04:00.0: wait_func:879:(pid 22709): 2RST_QP(0x50a) timeout. Will
> > cause a leak of a command resource
> > Feb  6 09:25:57 kblock01-knode02 kernel: [60319.615922]
> > mlx5_0:destroy_qp_common:1936:(pid 22709): mlx5_ib: modify QP 0x00015e
> > to RESET failed
> > Feb  6 09:26:19 kblock01-knode02 udevd[9344]: worker [21740]
> > unexpectedly returned with status 0x0100
> > Feb  6 09:26:19 kblock01-knode02 udevd[9344]: worker [21740] failed
> > while handling '/devices/virtual/nvme-fabrics/ctl/nvme0/nvme0n4'
> >
> >
> > _______________________________________________
> > Linux-nvme mailing list
> > Linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org
> > http://lists.infradead.org/mailman/listinfo/linux-nvme
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

Hi Sagi,

I have not looked in depth here, but is this maybe the GAPS issue I bumped into again, where we had to revert the patch for the SRP transport?
Would we have to do a similar revert in the NVME space, or modify mlx5 where this issue exists?

Thanks
Laurence




^ permalink raw reply	[flat|nested] 171+ messages in thread


* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-02-22 16:52         ` Laurence Oberman
@ 2017-02-22 19:39             ` Sagi Grimberg
  -1 siblings, 0 replies; 171+ messages in thread
From: Sagi Grimberg @ 2017-02-22 19:39 UTC (permalink / raw)
  To: Laurence Oberman
  Cc: shahar.salzman, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA


> Hi Sagi,
>
> Have not looked in depth here but is this maybe the GAPS issue again I bumped into where we had to revert the patch for the SRP transport.
> Would we have to do a similar revert in the NVME space or modify mlx5 where this issue exists.

We didn't add gaps support to nvme; the block layer splits gappy
I/O for us.
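
Roughly, the mechanism is the sketch below: nvme core puts a
virt_boundary on each namespace request queue, so a bio whose vectors
would leave a hole inside an MR page list is split by the block layer
before the fabrics driver ever builds a registration. This is only a
minimal illustration against the 4.9-era block API; the helper name is
made up and is not the upstream code:

/*
 * Sketch only: ask the block layer to split "gappy" I/O before it
 * reaches the RDMA transport.  With a boundary mask of page_size - 1,
 * a bvec that does not end/start on a controller page boundary forces
 * a bio split, so the SG list handed to the HCA has no gaps.
 */
#include <linux/blkdev.h>

static void example_set_gap_limit(struct request_queue *q,
				  unsigned int ctrl_page_size)
{
	blk_queue_virt_boundary(q, ctrl_page_size - 1);
}

Doing the split in the block layer keeps the transport simple; the
trade-off is more, smaller requests for badly aligned buffers.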

^ permalink raw reply	[flat|nested] 171+ messages in thread


* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-02-22 16:52         ` Laurence Oberman
@ 2017-02-26  8:03             ` shahar.salzman
  -1 siblings, 0 replies; 171+ messages in thread
From: shahar.salzman @ 2017-02-26  8:03 UTC (permalink / raw)
  To: Laurence Oberman, Sagi Grimberg
  Cc: linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA

Hi,

I will try with kernel 4.10 when I get my hands on the machine.

In addition, I found that the machine did not have the kernel-firmware 
package installed. Could this cause the "strange" CX4 behavior? I will 
obviously re-test with the kernel-firmware package.
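
As a side note, whichever package ends up installed, the firmware the
CX4 actually reports can be read back through libibverbs. A minimal,
illustrative sketch (assuming the libibverbs headers are present; build
with "gcc fwver.c -libverbs", where fwver.c is just an example name):

#include <stdio.h>
#include <infiniband/verbs.h>

/* Print the firmware version each RDMA device reports, e.g. mlx5_0. */
int main(void)
{
	struct ibv_device **devs;
	int i, num;

	devs = ibv_get_device_list(&num);
	if (!devs)
		return 1;

	for (i = 0; i < num; i++) {
		struct ibv_context *ctx = ibv_open_device(devs[i]);
		struct ibv_device_attr attr;

		if (!ctx)
			continue;
		if (!ibv_query_device(ctx, &attr))
			printf("%s: fw %s\n",
			       ibv_get_device_name(devs[i]), attr.fw_ver);
		ibv_close_device(ctx);
	}
	ibv_free_device_list(devs);
	return 0;
}

The same string also shows up in /sys/class/infiniband/<dev>/fw_ver, so
either check works.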

Shahar


On 02/22/2017 06:52 PM, Laurence Oberman wrote:
>
> ----- Original Message -----
>> From: "Sagi Grimberg" <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
>> To: "shahar.salzman" <shahar.salzman-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> Sent: Tuesday, February 21, 2017 5:50:39 PM
>> Subject: Re: Unexpected issues with 2 NVME initiators using the same target
>>
>>
>>> I am using 2 initiators + 1 target using nvmet with 1 subsystem and 4
>>> backend
>>> devices. Kernel is 4.9.6, NVME/rdma drivers are all from the vanilla
>>> kernel, I
>>> had a probelm connecting the NVME using the OFED drivers, so I removed
>>> all the
>>> mlx_compat and everything which depends on it.
>> Would it be possible to test with latest upstream kernel?
>>
>>> When I perform simultaneous writes (non direct fio) from both of the
>>> initiators
>>> to the same device (overlapping areas), I get NVMEf disconnect followed
>>> by "dump
>>> error cqe", successful reconnect, and then on one of the servers I get a
>>> WARN_ON. After this the server gets stuck and I have to power cycle it
>>> to get it
>>> back up...
>> The error cqes seem to indicate that a memory registration operation
>> failed which escalated to something worse.
>>
>> I noticed some issues before with CX4 having problems with memory
>> registration in the presence of network retransmissions (due to
>> network congestion).
>>
>> I notified Mellanox folks on that too, CC'ing Linux-rdma for some
>> more attention.
>>
>> After that, I see that ib_modify_qp failed which I've never seen
>> before (might indicate the the device is in bad shape), and the WARN_ON
>> is really weird given that nvme-rdma never uses IB_POLL_DIRECT.
>>
>>> Here are the printouts from the server that got stuck:
>>>
>>> Feb  6 09:20:13 kblock01-knode02 kernel: [59976.204216]
>>> mlx5_0:dump_cqe:262:(pid 0): dump error cqe
>>> Feb  6 09:20:13 kblock01-knode02 kernel: [59976.204219] 00000000
>>> 00000000 00000000 00000000
>>> Feb  6 09:20:13 kblock01-knode02 kernel: [59976.204220] 00000000
>>> 00000000 00000000 00000000
>>> Feb  6 09:20:13 kblock01-knode02 kernel: [59976.204220] 00000000
>>> 00000000 00000000 00000000
>>> Feb  6 09:20:13 kblock01-knode02 kernel: [59976.204221] 00000000
>>> 08007806 25000129 015557d0
>>> Feb  6 09:20:13 kblock01-knode02 kernel: [59976.204234] nvme nvme0:
>>> MEMREG for CQE 0xffff96ddd747a638 failed with status
>>> memory management operation error (6)
>>> Feb  6 09:20:13 kblock01-knode02 kernel: [59976.204375] nvme nvme0:
>>> reconnecting in 10 seconds
>>> Feb  6 09:20:13 kblock01-knode02 kernel: [59976.205512]
>>> mlx5_0:dump_cqe:262:(pid 0): dump error cqe
>>> Feb  6 09:20:13 kblock01-knode02 kernel: [59976.205514] 00000000
>>> 00000000 00000000 00000000
>>> Feb  6 09:20:13 kblock01-knode02 kernel: [59976.205515] 00000000
>>> 00000000 00000000 00000000
>>> Feb  6 09:20:13 kblock01-knode02 kernel: [59976.205515] 00000000
>>> 00000000 00000000 00000000
>>> Feb  6 09:20:13 kblock01-knode02 kernel: [59976.205516] 00000000
>>> 08007806 25000126 00692bd0
>>> Feb  6 09:20:23 kblock01-knode02 kernel: [59986.452887] nvme nvme0:
>>> Successfully reconnected
>>> Feb  6 09:20:24 kblock01-knode02 kernel: [59986.682887]
>>> mlx5_0:dump_cqe:262:(pid 0): dump error cqe
>>> Feb  6 09:20:24 kblock01-knode02 kernel: [59986.682890] 00000000
>>> 00000000 00000000 00000000
>>> Feb  6 09:20:24 kblock01-knode02 kernel: [59986.682891] 00000000
>>> 00000000 00000000 00000000
>>> Feb  6 09:20:24 kblock01-knode02 kernel: [59986.682892] 00000000
>>> 00000000 00000000 00000000
>>> Feb  6 09:20:24 kblock01-knode02 kernel: [59986.682892] 00000000
>>> 08007806 25000158 04cdd7d0
>>> ...
>>> Feb  6 09:20:24 kblock01-knode02 kernel: [59986.687737]
>>> mlx5_0:dump_cqe:262:(pid 0): dump error cqe
>>> Feb  6 09:20:24 kblock01-knode02 kernel: [59986.687739] 00000000
>>> 00000000 00000000 00000000
>>> Feb  6 09:20:24 kblock01-knode02 kernel: [59986.687740] 00000000
>>> 00000000 00000000 00000000
>>> Feb  6 09:20:24 kblock01-knode02 kernel: [59986.687740] 00000000
>>> 00000000 00000000 00000000
>>> Feb  6 09:20:24 kblock01-knode02 kernel: [59986.687741] 00000000
>>> 93005204 00000155 00a385e0
>>> Feb  6 09:20:34 kblock01-knode02 kernel: [59997.389290] nvme nvme0:
>>> Successfully reconnected
>>> Feb  6 09:21:19 kblock01-knode02 rsyslogd: -- MARK --
>>> Feb  6 09:21:38 kblock01-knode02 kernel: [60060.927832]
>>> mlx5_0:dump_cqe:262:(pid 0): dump error cqe
>>> Feb  6 09:21:38 kblock01-knode02 kernel: [60060.927835] 00000000
>>> 00000000 00000000 00000000
>>> Feb  6 09:21:38 kblock01-knode02 kernel: [60060.927836] 00000000
>>> 00000000 00000000 00000000
>>> Feb  6 09:21:38 kblock01-knode02 kernel: [60060.927837] 00000000
>>> 00000000 00000000 00000000
>>> Feb  6 09:21:38 kblock01-knode02 kernel: [60060.927837] 00000000
>>> 93005204 00000167 b44e76e0
>>> Feb  6 09:21:38 kblock01-knode02 kernel: [60060.927846] nvme nvme0: RECV
>>> for CQE 0xffff96fe64f18750 failed with status local protection error (4)
>>> Feb  6 09:21:38 kblock01-knode02 kernel: [60060.928200] nvme nvme0:
>>> reconnecting in 10 seconds
>>> ...
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736182] mlx5_core
>>> 0000:04:00.0: wait_func:879:(pid 22709): 2ERR_QP(0x507) timeout. Will
>>> cause a leak of a command resource
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736190] ------------[
>>> cut here ]------------
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736211] WARNING: CPU: 18
>>> PID: 22709 at drivers/infiniband/core/verbs.c:1963
>>> __ib_drain_sq+0x135/0x1d0 [ib_core]
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736212] failed to drain
>>> send queue: -110
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736213] Modules linked
>>> in: nvme_rdma rdma_cm ib_cm iw_cm nvme_fabrics nvme_core
>>> ocs_fc_scst(POE) scst(OE) mptctl mptbase qla2xxx_scst(OE)
>>> scsi_transport_fc dm_multipath drbd lru_cache netconsole mst_pciconf(OE)
>>> nfsd nfs_acl auth_rpcgss ipmi_devintf lockd sunrpc grace ipt_MASQUERADE
>>> nf_nat_masquerade_ipv4 xt_nat iptable_nat nf_nat_ipv4 nf_nat
>>> nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack fuse binfmt_misc
>>> iTCO_wdt iTCO_vendor_support pcspkr serio_raw joydev igb i2c_i801
>>> i2c_smbus lpc_ich mei_me mei ioatdma dca ses enclosure ipmi_ssif ipmi_si
>>> ipmi_msghandler bnx2x libcrc32c mdio mlx5_ib ib_core mlx5_core devlink
>>> ptp pps_core tpm_tis tpm_tis_core tpm ext4(E) mbcache(E) jbd2(E) isci(E)
>>> libsas(E) mpt3sas(E) scsi_transport_sas(E) raid_class(E) megaraid_sas(E)
>>> wmi(E) mgag200(E) ttm(E) drm_kms_helper(E)
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736293]  drm(E)
>>> i2c_algo_bit(E) [last unloaded: nvme_core]
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736301] CPU: 18 PID:
>>> 22709 Comm: kworker/18:4 Tainted: P           OE   4.9.6-KM1 #0
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736303] Hardware name:
>>> Supermicro SYS-1027R-72BRFTP5-EI007/X9DRW-7/iTPF, BIOS 3.0a 01/22/2014
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736312] Workqueue:
>>> nvme_rdma_wq nvme_rdma_reconnect_ctrl_work [nvme_rdma]
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736315] ffff96de27e6f9b8
>>> ffffffffb537f3ff ffffffffc06a00e5 ffff96de27e6fa18
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736320] ffff96de27e6fa18
>>> 0000000000000000 ffff96de27e6fa08 ffffffffb5091a7d
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736324] 0000000300000000
>>> 000007ab00000006 0507000000000000 ffff96de27e6fad8
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736328] Call Trace:
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736341]
>>> [<ffffffffb537f3ff>] dump_stack+0x67/0x98
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736352]
>>> [<ffffffffc06a00e5>] ? __ib_drain_sq+0x135/0x1d0 [ib_core]
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736364]
>>> [<ffffffffb5091a7d>] __warn+0xfd/0x120
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736368]
>>> [<ffffffffb5091b59>] warn_slowpath_fmt+0x49/0x50
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736378]
>>> [<ffffffffc069fdb5>] ? ib_modify_qp+0x45/0x50 [ib_core]
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736388]
>>> [<ffffffffc06a00e5>] __ib_drain_sq+0x135/0x1d0 [ib_core]
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736398]
>>> [<ffffffffc069f5a0>] ? ib_create_srq+0xa0/0xa0 [ib_core]
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736408]
>>> [<ffffffffc06a01a5>] ib_drain_sq+0x25/0x30 [ib_core]
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736418]
>>> [<ffffffffc06a01c6>] ib_drain_qp+0x16/0x40 [ib_core]
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736422]
>>> [<ffffffffc0c4d25b>] nvme_rdma_stop_and_free_queue+0x2b/0x50 [nvme_rdma]
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736426]
>>> [<ffffffffc0c4d2ad>] nvme_rdma_free_io_queues+0x2d/0x40 [nvme_rdma]
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736429]
>>> [<ffffffffc0c4d884>] nvme_rdma_reconnect_ctrl_work+0x34/0x1e0 [nvme_rdma]
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736434]
>>> [<ffffffffb50ac7ce>] process_one_work+0x17e/0x4f0
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736444]
>>> [<ffffffffb50cefc5>] ? dequeue_task_fair+0x85/0x870
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736454]
>>> [<ffffffffb5799c6a>] ? schedule+0x3a/0xa0
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736456]
>>> [<ffffffffb50ad653>] worker_thread+0x153/0x660
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736464]
>>> [<ffffffffb5026b4c>] ? __switch_to+0x1dc/0x670
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736468]
>>> [<ffffffffb5799706>] ? __schedule+0x226/0x6a0
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736471]
>>> [<ffffffffb50be7c2>] ? default_wake_function+0x12/0x20
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736474]
>>> [<ffffffffb50d7636>] ? __wake_up_common+0x56/0x90
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736477]
>>> [<ffffffffb50ad500>] ? workqueue_prepare_cpu+0x80/0x80
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736480]
>>> [<ffffffffb5799c6a>] ? schedule+0x3a/0xa0
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736483]
>>> [<ffffffffb50ad500>] ? workqueue_prepare_cpu+0x80/0x80
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736487]
>>> [<ffffffffb50b237d>] kthread+0xcd/0xf0
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736492]
>>> [<ffffffffb50bc40e>] ? schedule_tail+0x1e/0xc0
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736495]
>>> [<ffffffffb50b22b0>] ? __kthread_init_worker+0x40/0x40
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736499]
>>> [<ffffffffb579ded5>] ret_from_fork+0x25/0x30
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736502] ---[ end trace
>>> eb0e5ba7dc81a687 ]---
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176054] mlx5_core
>>> 0000:04:00.0: wait_func:879:(pid 22709): 2ERR_QP(0x507) timeout. Will
>>> cause a leak of a command resource
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176059] ------------[
>>> cut here ]------------
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176073] WARNING: CPU: 18
>>> PID: 22709 at drivers/infiniband/core/verbs.c:1998
>>> __ib_drain_rq+0x12a/0x1c0 [ib_core]
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176075] failed to drain
>>> recv queue: -110
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176076] Modules linked
>>> in: nvme_rdma rdma_cm ib_cm iw_cm nvme_fabrics nvme_core
>>> ocs_fc_scst(POE) scst(OE) mptctl mptbase qla2xxx_scst(OE)
>>> scsi_transport_fc dm_multipath drbd lru_cache netconsole mst_pciconf(OE)
>>> nfsd nfs_acl auth_rpcgss ipmi_devintf lockd sunrpc grace ipt_MASQUERADE
>>> nf_nat_masquerade_ipv4 xt_nat iptable_nat nf_nat_ipv4 nf_nat
>>> nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack fuse binfmt_misc
>>> iTCO_wdt iTCO_vendor_support pcspkr serio_raw joydev igb i2c_i801
>>> i2c_smbus lpc_ich mei_me mei ioatdma dca ses enclosure ipmi_ssif ipmi_si
>>> ipmi_msghandler bnx2x libcrc32c mdio mlx5_ib ib_core mlx5_core devlink
>>> ptp pps_core tpm_tis tpm_tis_core tpm ext4(E) mbcache(E) jbd2(E) isci(E)
>>> libsas(E) mpt3sas(E) scsi_transport_sas(E) raid_class(E) megaraid_sas(E)
>>> wmi(E) mgag200(E) ttm(E) drm_kms_helper(E)
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176134]  drm(E)
>>> i2c_algo_bit(E) [last unloaded: nvme_core]
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176140] CPU: 18 PID:
>>> 22709 Comm: kworker/18:4 Tainted: P        W  OE   4.9.6-KM1 #0
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176141] Hardware name:
>>> Supermicro SYS-1027R-72BRFTP5-EI007/X9DRW-7/iTPF, BIOS 3.0a 01/22/2014
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176147] Workqueue:
>>> nvme_rdma_wq nvme_rdma_reconnect_ctrl_work [nvme_rdma]
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176150] ffff96de27e6f9b8
>>> ffffffffb537f3ff ffffffffc069feea ffff96de27e6fa18
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176155] ffff96de27e6fa18
>>> 0000000000000000 ffff96de27e6fa08 ffffffffb5091a7d
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176159] 0000000300000000
>>> 000007ce00000006 0507000000000000 ffff96de27e6fad8
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176163] Call Trace:
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176169]
>>> [<ffffffffb537f3ff>] dump_stack+0x67/0x98
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176180]
>>> [<ffffffffc069feea>] ? __ib_drain_rq+0x12a/0x1c0 [ib_core]
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176185]
>>> [<ffffffffb5091a7d>] __warn+0xfd/0x120
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176189]
>>> [<ffffffffb5091b59>] warn_slowpath_fmt+0x49/0x50
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176199]
>>> [<ffffffffc069fdb5>] ? ib_modify_qp+0x45/0x50 [ib_core]
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176216]
>>> [<ffffffffc06de770>] ? mlx5_ib_modify_qp+0x980/0xec0 [mlx5_ib]
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176225]
>>> [<ffffffffc069feea>] __ib_drain_rq+0x12a/0x1c0 [ib_core]
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176235]
>>> [<ffffffffc069f5a0>] ? ib_create_srq+0xa0/0xa0 [ib_core]
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176246]
>>> [<ffffffffc069ffa5>] ib_drain_rq+0x25/0x30 [ib_core]
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176255]
>>> [<ffffffffc06a01dc>] ib_drain_qp+0x2c/0x40 [ib_core]
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176259]
>>> [<ffffffffc0c4d25b>] nvme_rdma_stop_and_free_queue+0x2b/0x50 [nvme_rdma]
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176263]
>>> [<ffffffffc0c4d2ad>] nvme_rdma_free_io_queues+0x2d/0x40 [nvme_rdma]
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176267]
>>> [<ffffffffc0c4d884>] nvme_rdma_reconnect_ctrl_work+0x34/0x1e0 [nvme_rdma]
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176270]
>>> [<ffffffffb50ac7ce>] process_one_work+0x17e/0x4f0
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176275]
>>> [<ffffffffb50cefc5>] ? dequeue_task_fair+0x85/0x870
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176278]
>>> [<ffffffffb5799c6a>] ? schedule+0x3a/0xa0
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176281]
>>> [<ffffffffb50ad653>] worker_thread+0x153/0x660
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176285]
>>> [<ffffffffb5026b4c>] ? __switch_to+0x1dc/0x670
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176289]
>>> [<ffffffffb5799706>] ? __schedule+0x226/0x6a0
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176291]
>>> [<ffffffffb50be7c2>] ? default_wake_function+0x12/0x20
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176294]
>>> [<ffffffffb50d7636>] ? __wake_up_common+0x56/0x90
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176297]
>>> [<ffffffffb50ad500>] ? workqueue_prepare_cpu+0x80/0x80
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176300]
>>> [<ffffffffb5799c6a>] ? schedule+0x3a/0xa0
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176302]
>>> [<ffffffffb50ad500>] ? workqueue_prepare_cpu+0x80/0x80
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176306]
>>> [<ffffffffb50b237d>] kthread+0xcd/0xf0
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176310]
>>> [<ffffffffb50bc40e>] ? schedule_tail+0x1e/0xc0
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176313]
>>> [<ffffffffb50b22b0>] ? __kthread_init_worker+0x40/0x40
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176316]
>>> [<ffffffffb579ded5>] ret_from_fork+0x25/0x30
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176318] ---[ end trace
>>> eb0e5ba7dc81a688 ]---
>>> Feb  6 09:25:26 kblock01-knode02 udevd[9344]: worker [21322]
>>> unexpectedly returned with status 0x0100
>>> Feb  6 09:25:26 kblock01-knode02 udevd[9344]: worker [21322] failed
>>> while handling '/devices/virtual/nvme-fabrics/ctl/nvme0/nvme0n1'
>>> Feb  6 09:25:26 kblock01-knode02 udevd[9344]: worker [21323]
>>> unexpectedly returned with status 0x0100
>>> Feb  6 09:25:26 kblock01-knode02 udevd[9344]: worker [21323] failed
>>> while handling '/devices/virtual/nvme-fabrics/ctl/nvme0/nvme0n2'
>>> Feb  6 09:25:26 kblock01-knode02 udevd[9344]: worker [21741]
>>> unexpectedly returned with status 0x0100
>>> Feb  6 09:25:26 kblock01-knode02 udevd[9344]: worker [21741] failed
>>> while handling '/devices/virtual/nvme-fabrics/ctl/nvme0/nvme0n3'
>>> Feb  6 09:25:57 kblock01-knode02 kernel: [60319.615916] mlx5_core
>>> 0000:04:00.0: wait_func:879:(pid 22709): 2RST_QP(0x50a) timeout. Will
>>> cause a leak of a command resource
>>> Feb  6 09:25:57 kblock01-knode02 kernel: [60319.615922]
>>> mlx5_0:destroy_qp_common:1936:(pid 22709): mlx5_ib: modify QP 0x00015e
>>> to RESET failed
>>> Feb  6 09:26:19 kblock01-knode02 udevd[9344]: worker [21740]
>>> unexpectedly returned with status 0x0100
>>> Feb  6 09:26:19 kblock01-knode02 udevd[9344]: worker [21740] failed
>>> while handling '/devices/virtual/nvme-fabrics/ctl/nvme0/nvme0n4'
>>>
>>>
>>> _______________________________________________
>>> Linux-nvme mailing list
>>> Linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org
>>> http://lists.infradead.org/mailman/listinfo/linux-nvme
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
> Hi Sagi,
>
> Have not looked in depth here but is this maybe the GAPS issue again I bumped into where we had to revert the patch for the SRP transport.
> Would we have to do a similar revert in the NVME space or modify mlx5 where this issue exists.
>
> Thanks
> Laurnce
>
>
>


^ permalink raw reply	[flat|nested] 171+ messages in thread

* Unexpected issues with 2 NVME initiators using the same target
@ 2017-02-26  8:03             ` shahar.salzman
  0 siblings, 0 replies; 171+ messages in thread
From: shahar.salzman @ 2017-02-26  8:03 UTC (permalink / raw)


Hi,

I will try with kernel 4.10 when I get my hands on the machine.

In addition, I found that the machine did not have the kernel-firmware 
package installed. Could this cause the "strange" CX4 behavior? I will 
obviously re-test with the kernel-firmware package.

Shahar


On 02/22/2017 06:52 PM, Laurence Oberman wrote:
>
> ----- Original Message -----
>> From: "Sagi Grimberg" <sagi at grimberg.me>
>> To: "shahar.salzman" <shahar.salzman at gmail.com>, linux-nvme at lists.infradead.org, linux-rdma at vger.kernel.org
>> Sent: Tuesday, February 21, 2017 5:50:39 PM
>> Subject: Re: Unexpected issues with 2 NVME initiators using the same target
>>
>>
>>> I am using 2 initiators + 1 target using nvmet with 1 subsystem and 4
>>> backend
>>> devices. Kernel is 4.9.6, NVME/rdma drivers are all from the vanilla
>>> kernel, I
>>> had a probelm connecting the NVME using the OFED drivers, so I removed
>>> all the
>>> mlx_compat and everything which depends on it.
>> Would it be possible to test with latest upstream kernel?
>>
>>> When I perform simultaneous writes (non direct fio) from both of the
>>> initiators
>>> to the same device (overlapping areas), I get NVMEf disconnect followed
>>> by "dump
>>> error cqe", successful reconnect, and then on one of the servers I get a
>>> WARN_ON. After this the server gets stuck and I have to power cycle it
>>> to get it
>>> back up...
>> The error cqes seem to indicate that a memory registration operation
>> failed which escalated to something worse.
>>
>> I noticed some issues before with CX4 having problems with memory
>> registration in the presence of network retransmissions (due to
>> network congestion).
>>
>> I notified Mellanox folks on that too, CC'ing Linux-rdma for some
>> more attention.
>>
>> After that, I see that ib_modify_qp failed which I've never seen
>> before (might indicate that the device is in bad shape), and the WARN_ON
>> is really weird given that nvme-rdma never uses IB_POLL_DIRECT.
>>
>>> Here are the printouts from the server that got stuck:
>>>
>>> Feb  6 09:20:13 kblock01-knode02 kernel: [59976.204216]
>>> mlx5_0:dump_cqe:262:(pid 0): dump error cqe
>>> Feb  6 09:20:13 kblock01-knode02 kernel: [59976.204219] 00000000
>>> 00000000 00000000 00000000
>>> Feb  6 09:20:13 kblock01-knode02 kernel: [59976.204220] 00000000
>>> 00000000 00000000 00000000
>>> Feb  6 09:20:13 kblock01-knode02 kernel: [59976.204220] 00000000
>>> 00000000 00000000 00000000
>>> Feb  6 09:20:13 kblock01-knode02 kernel: [59976.204221] 00000000
>>> 08007806 25000129 015557d0
>>> Feb  6 09:20:13 kblock01-knode02 kernel: [59976.204234] nvme nvme0:
>>> MEMREG for CQE 0xffff96ddd747a638 failed with status
>>> memory management operation error (6)
>>> Feb  6 09:20:13 kblock01-knode02 kernel: [59976.204375] nvme nvme0:
>>> reconnecting in 10 seconds
>>> Feb  6 09:20:13 kblock01-knode02 kernel: [59976.205512]
>>> mlx5_0:dump_cqe:262:(pid 0): dump error cqe
>>> Feb  6 09:20:13 kblock01-knode02 kernel: [59976.205514] 00000000
>>> 00000000 00000000 00000000
>>> Feb  6 09:20:13 kblock01-knode02 kernel: [59976.205515] 00000000
>>> 00000000 00000000 00000000
>>> Feb  6 09:20:13 kblock01-knode02 kernel: [59976.205515] 00000000
>>> 00000000 00000000 00000000
>>> Feb  6 09:20:13 kblock01-knode02 kernel: [59976.205516] 00000000
>>> 08007806 25000126 00692bd0
>>> Feb  6 09:20:23 kblock01-knode02 kernel: [59986.452887] nvme nvme0:
>>> Successfully reconnected
>>> Feb  6 09:20:24 kblock01-knode02 kernel: [59986.682887]
>>> mlx5_0:dump_cqe:262:(pid 0): dump error cqe
>>> Feb  6 09:20:24 kblock01-knode02 kernel: [59986.682890] 00000000
>>> 00000000 00000000 00000000
>>> Feb  6 09:20:24 kblock01-knode02 kernel: [59986.682891] 00000000
>>> 00000000 00000000 00000000
>>> Feb  6 09:20:24 kblock01-knode02 kernel: [59986.682892] 00000000
>>> 00000000 00000000 00000000
>>> Feb  6 09:20:24 kblock01-knode02 kernel: [59986.682892] 00000000
>>> 08007806 25000158 04cdd7d0
>>> ...
>>> Feb  6 09:20:24 kblock01-knode02 kernel: [59986.687737]
>>> mlx5_0:dump_cqe:262:(pid 0): dump error cqe
>>> Feb  6 09:20:24 kblock01-knode02 kernel: [59986.687739] 00000000
>>> 00000000 00000000 00000000
>>> Feb  6 09:20:24 kblock01-knode02 kernel: [59986.687740] 00000000
>>> 00000000 00000000 00000000
>>> Feb  6 09:20:24 kblock01-knode02 kernel: [59986.687740] 00000000
>>> 00000000 00000000 00000000
>>> Feb  6 09:20:24 kblock01-knode02 kernel: [59986.687741] 00000000
>>> 93005204 00000155 00a385e0
>>> Feb  6 09:20:34 kblock01-knode02 kernel: [59997.389290] nvme nvme0:
>>> Successfully reconnected
>>> Feb  6 09:21:19 kblock01-knode02 rsyslogd: -- MARK --
>>> Feb  6 09:21:38 kblock01-knode02 kernel: [60060.927832]
>>> mlx5_0:dump_cqe:262:(pid 0): dump error cqe
>>> Feb  6 09:21:38 kblock01-knode02 kernel: [60060.927835] 00000000
>>> 00000000 00000000 00000000
>>> Feb  6 09:21:38 kblock01-knode02 kernel: [60060.927836] 00000000
>>> 00000000 00000000 00000000
>>> Feb  6 09:21:38 kblock01-knode02 kernel: [60060.927837] 00000000
>>> 00000000 00000000 00000000
>>> Feb  6 09:21:38 kblock01-knode02 kernel: [60060.927837] 00000000
>>> 93005204 00000167 b44e76e0
>>> Feb  6 09:21:38 kblock01-knode02 kernel: [60060.927846] nvme nvme0: RECV
>>> for CQE 0xffff96fe64f18750 failed with status local protection error (4)
>>> Feb  6 09:21:38 kblock01-knode02 kernel: [60060.928200] nvme nvme0:
>>> reconnecting in 10 seconds
>>> ...
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736182] mlx5_core
>>> 0000:04:00.0: wait_func:879:(pid 22709): 2ERR_QP(0x507) timeout. Will
>>> cause a leak of a command resource
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736190] ------------[
>>> cut here ]------------
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736211] WARNING: CPU: 18
>>> PID: 22709 at drivers/infiniband/core/verbs.c:1963
>>> __ib_drain_sq+0x135/0x1d0 [ib_core]
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736212] failed to drain
>>> send queue: -110
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736213] Modules linked
>>> in: nvme_rdma rdma_cm ib_cm iw_cm nvme_fabrics nvme_core
>>> ocs_fc_scst(POE) scst(OE) mptctl mptbase qla2xxx_scst(OE)
>>> scsi_transport_fc dm_multipath drbd lru_cache netconsole mst_pciconf(OE)
>>> nfsd nfs_acl auth_rpcgss ipmi_devintf lockd sunrpc grace ipt_MASQUERADE
>>> nf_nat_masquerade_ipv4 xt_nat iptable_nat nf_nat_ipv4 nf_nat
>>> nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack fuse binfmt_misc
>>> iTCO_wdt iTCO_vendor_support pcspkr serio_raw joydev igb i2c_i801
>>> i2c_smbus lpc_ich mei_me mei ioatdma dca ses enclosure ipmi_ssif ipmi_si
>>> ipmi_msghandler bnx2x libcrc32c mdio mlx5_ib ib_core mlx5_core devlink
>>> ptp pps_core tpm_tis tpm_tis_core tpm ext4(E) mbcache(E) jbd2(E) isci(E)
>>> libsas(E) mpt3sas(E) scsi_transport_sas(E) raid_class(E) megaraid_sas(E)
>>> wmi(E) mgag200(E) ttm(E) drm_kms_helper(E)
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736293]  drm(E)
>>> i2c_algo_bit(E) [last unloaded: nvme_core]
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736301] CPU: 18 PID:
>>> 22709 Comm: kworker/18:4 Tainted: P           OE   4.9.6-KM1 #0
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736303] Hardware name:
>>> Supermicro SYS-1027R-72BRFTP5-EI007/X9DRW-7/iTPF, BIOS 3.0a 01/22/2014
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736312] Workqueue:
>>> nvme_rdma_wq nvme_rdma_reconnect_ctrl_work [nvme_rdma]
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736315] ffff96de27e6f9b8
>>> ffffffffb537f3ff ffffffffc06a00e5 ffff96de27e6fa18
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736320] ffff96de27e6fa18
>>> 0000000000000000 ffff96de27e6fa08 ffffffffb5091a7d
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736324] 0000000300000000
>>> 000007ab00000006 0507000000000000 ffff96de27e6fad8
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736328] Call Trace:
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736341]
>>> [<ffffffffb537f3ff>] dump_stack+0x67/0x98
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736352]
>>> [<ffffffffc06a00e5>] ? __ib_drain_sq+0x135/0x1d0 [ib_core]
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736364]
>>> [<ffffffffb5091a7d>] __warn+0xfd/0x120
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736368]
>>> [<ffffffffb5091b59>] warn_slowpath_fmt+0x49/0x50
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736378]
>>> [<ffffffffc069fdb5>] ? ib_modify_qp+0x45/0x50 [ib_core]
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736388]
>>> [<ffffffffc06a00e5>] __ib_drain_sq+0x135/0x1d0 [ib_core]
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736398]
>>> [<ffffffffc069f5a0>] ? ib_create_srq+0xa0/0xa0 [ib_core]
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736408]
>>> [<ffffffffc06a01a5>] ib_drain_sq+0x25/0x30 [ib_core]
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736418]
>>> [<ffffffffc06a01c6>] ib_drain_qp+0x16/0x40 [ib_core]
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736422]
>>> [<ffffffffc0c4d25b>] nvme_rdma_stop_and_free_queue+0x2b/0x50 [nvme_rdma]
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736426]
>>> [<ffffffffc0c4d2ad>] nvme_rdma_free_io_queues+0x2d/0x40 [nvme_rdma]
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736429]
>>> [<ffffffffc0c4d884>] nvme_rdma_reconnect_ctrl_work+0x34/0x1e0 [nvme_rdma]
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736434]
>>> [<ffffffffb50ac7ce>] process_one_work+0x17e/0x4f0
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736444]
>>> [<ffffffffb50cefc5>] ? dequeue_task_fair+0x85/0x870
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736454]
>>> [<ffffffffb5799c6a>] ? schedule+0x3a/0xa0
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736456]
>>> [<ffffffffb50ad653>] worker_thread+0x153/0x660
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736464]
>>> [<ffffffffb5026b4c>] ? __switch_to+0x1dc/0x670
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736468]
>>> [<ffffffffb5799706>] ? __schedule+0x226/0x6a0
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736471]
>>> [<ffffffffb50be7c2>] ? default_wake_function+0x12/0x20
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736474]
>>> [<ffffffffb50d7636>] ? __wake_up_common+0x56/0x90
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736477]
>>> [<ffffffffb50ad500>] ? workqueue_prepare_cpu+0x80/0x80
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736480]
>>> [<ffffffffb5799c6a>] ? schedule+0x3a/0xa0
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736483]
>>> [<ffffffffb50ad500>] ? workqueue_prepare_cpu+0x80/0x80
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736487]
>>> [<ffffffffb50b237d>] kthread+0xcd/0xf0
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736492]
>>> [<ffffffffb50bc40e>] ? schedule_tail+0x1e/0xc0
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736495]
>>> [<ffffffffb50b22b0>] ? __kthread_init_worker+0x40/0x40
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736499]
>>> [<ffffffffb579ded5>] ret_from_fork+0x25/0x30
>>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736502] ---[ end trace
>>> eb0e5ba7dc81a687 ]---
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176054] mlx5_core
>>> 0000:04:00.0: wait_func:879:(pid 22709): 2ERR_QP(0x507) timeout. Will
>>> cause a leak of a command resource
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176059] ------------[
>>> cut here ]------------
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176073] WARNING: CPU: 18
>>> PID: 22709 at drivers/infiniband/core/verbs.c:1998
>>> __ib_drain_rq+0x12a/0x1c0 [ib_core]
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176075] failed to drain
>>> recv queue: -110
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176076] Modules linked
>>> in: nvme_rdma rdma_cm ib_cm iw_cm nvme_fabrics nvme_core
>>> ocs_fc_scst(POE) scst(OE) mptctl mptbase qla2xxx_scst(OE)
>>> scsi_transport_fc dm_multipath drbd lru_cache netconsole mst_pciconf(OE)
>>> nfsd nfs_acl auth_rpcgss ipmi_devintf lockd sunrpc grace ipt_MASQUERADE
>>> nf_nat_masquerade_ipv4 xt_nat iptable_nat nf_nat_ipv4 nf_nat
>>> nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack fuse binfmt_misc
>>> iTCO_wdt iTCO_vendor_support pcspkr serio_raw joydev igb i2c_i801
>>> i2c_smbus lpc_ich mei_me mei ioatdma dca ses enclosure ipmi_ssif ipmi_si
>>> ipmi_msghandler bnx2x libcrc32c mdio mlx5_ib ib_core mlx5_core devlink
>>> ptp pps_core tpm_tis tpm_tis_core tpm ext4(E) mbcache(E) jbd2(E) isci(E)
>>> libsas(E) mpt3sas(E) scsi_transport_sas(E) raid_class(E) megaraid_sas(E)
>>> wmi(E) mgag200(E) ttm(E) drm_kms_helper(E)
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176134]  drm(E)
>>> i2c_algo_bit(E) [last unloaded: nvme_core]
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176140] CPU: 18 PID:
>>> 22709 Comm: kworker/18:4 Tainted: P        W  OE   4.9.6-KM1 #0
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176141] Hardware name:
>>> Supermicro SYS-1027R-72BRFTP5-EI007/X9DRW-7/iTPF, BIOS 3.0a 01/22/2014
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176147] Workqueue:
>>> nvme_rdma_wq nvme_rdma_reconnect_ctrl_work [nvme_rdma]
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176150] ffff96de27e6f9b8
>>> ffffffffb537f3ff ffffffffc069feea ffff96de27e6fa18
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176155] ffff96de27e6fa18
>>> 0000000000000000 ffff96de27e6fa08 ffffffffb5091a7d
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176159] 0000000300000000
>>> 000007ce00000006 0507000000000000 ffff96de27e6fad8
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176163] Call Trace:
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176169]
>>> [<ffffffffb537f3ff>] dump_stack+0x67/0x98
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176180]
>>> [<ffffffffc069feea>] ? __ib_drain_rq+0x12a/0x1c0 [ib_core]
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176185]
>>> [<ffffffffb5091a7d>] __warn+0xfd/0x120
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176189]
>>> [<ffffffffb5091b59>] warn_slowpath_fmt+0x49/0x50
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176199]
>>> [<ffffffffc069fdb5>] ? ib_modify_qp+0x45/0x50 [ib_core]
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176216]
>>> [<ffffffffc06de770>] ? mlx5_ib_modify_qp+0x980/0xec0 [mlx5_ib]
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176225]
>>> [<ffffffffc069feea>] __ib_drain_rq+0x12a/0x1c0 [ib_core]
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176235]
>>> [<ffffffffc069f5a0>] ? ib_create_srq+0xa0/0xa0 [ib_core]
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176246]
>>> [<ffffffffc069ffa5>] ib_drain_rq+0x25/0x30 [ib_core]
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176255]
>>> [<ffffffffc06a01dc>] ib_drain_qp+0x2c/0x40 [ib_core]
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176259]
>>> [<ffffffffc0c4d25b>] nvme_rdma_stop_and_free_queue+0x2b/0x50 [nvme_rdma]
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176263]
>>> [<ffffffffc0c4d2ad>] nvme_rdma_free_io_queues+0x2d/0x40 [nvme_rdma]
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176267]
>>> [<ffffffffc0c4d884>] nvme_rdma_reconnect_ctrl_work+0x34/0x1e0 [nvme_rdma]
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176270]
>>> [<ffffffffb50ac7ce>] process_one_work+0x17e/0x4f0
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176275]
>>> [<ffffffffb50cefc5>] ? dequeue_task_fair+0x85/0x870
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176278]
>>> [<ffffffffb5799c6a>] ? schedule+0x3a/0xa0
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176281]
>>> [<ffffffffb50ad653>] worker_thread+0x153/0x660
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176285]
>>> [<ffffffffb5026b4c>] ? __switch_to+0x1dc/0x670
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176289]
>>> [<ffffffffb5799706>] ? __schedule+0x226/0x6a0
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176291]
>>> [<ffffffffb50be7c2>] ? default_wake_function+0x12/0x20
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176294]
>>> [<ffffffffb50d7636>] ? __wake_up_common+0x56/0x90
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176297]
>>> [<ffffffffb50ad500>] ? workqueue_prepare_cpu+0x80/0x80
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176300]
>>> [<ffffffffb5799c6a>] ? schedule+0x3a/0xa0
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176302]
>>> [<ffffffffb50ad500>] ? workqueue_prepare_cpu+0x80/0x80
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176306]
>>> [<ffffffffb50b237d>] kthread+0xcd/0xf0
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176310]
>>> [<ffffffffb50bc40e>] ? schedule_tail+0x1e/0xc0
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176313]
>>> [<ffffffffb50b22b0>] ? __kthread_init_worker+0x40/0x40
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176316]
>>> [<ffffffffb579ded5>] ret_from_fork+0x25/0x30
>>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176318] ---[ end trace
>>> eb0e5ba7dc81a688 ]---
>>> Feb  6 09:25:26 kblock01-knode02 udevd[9344]: worker [21322]
>>> unexpectedly returned with status 0x0100
>>> Feb  6 09:25:26 kblock01-knode02 udevd[9344]: worker [21322] failed
>>> while handling '/devices/virtual/nvme-fabrics/ctl/nvme0/nvme0n1'
>>> Feb  6 09:25:26 kblock01-knode02 udevd[9344]: worker [21323]
>>> unexpectedly returned with status 0x0100
>>> Feb  6 09:25:26 kblock01-knode02 udevd[9344]: worker [21323] failed
>>> while handling '/devices/virtual/nvme-fabrics/ctl/nvme0/nvme0n2'
>>> Feb  6 09:25:26 kblock01-knode02 udevd[9344]: worker [21741]
>>> unexpectedly returned with status 0x0100
>>> Feb  6 09:25:26 kblock01-knode02 udevd[9344]: worker [21741] failed
>>> while handling '/devices/virtual/nvme-fabrics/ctl/nvme0/nvme0n3'
>>> Feb  6 09:25:57 kblock01-knode02 kernel: [60319.615916] mlx5_core
>>> 0000:04:00.0: wait_func:879:(pid 22709): 2RST_QP(0x50a) timeout. Will
>>> cause a leak of a command resource
>>> Feb  6 09:25:57 kblock01-knode02 kernel: [60319.615922]
>>> mlx5_0:destroy_qp_common:1936:(pid 22709): mlx5_ib: modify QP 0x00015e
>>> to RESET failed
>>> Feb  6 09:26:19 kblock01-knode02 udevd[9344]: worker [21740]
>>> unexpectedly returned with status 0x0100
>>> Feb  6 09:26:19 kblock01-knode02 udevd[9344]: worker [21740] failed
>>> while handling '/devices/virtual/nvme-fabrics/ctl/nvme0/nvme0n4'
>>>
>>>
>>> _______________________________________________
>>> Linux-nvme mailing list
>>> Linux-nvme at lists.infradead.org
>>> http://lists.infradead.org/mailman/listinfo/linux-nvme
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>> the body of a message to majordomo at vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
> Hi Sagi,
>
> Have not looked in depth here, but is this maybe the GAPS issue I bumped into again, where we had to revert the patch for the SRP transport?
> Would we have to do a similar revert in the NVMe space, or modify mlx5 where this issue exists?
>
> Thanks
> Laurence
>
>
>

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Unexpected issues with 2 NVME initiators using the same target
  2017-02-26  8:03             ` shahar.salzman
  (?)
@ 2017-02-26 17:58             ` Gruher, Joseph R
       [not found]               ` <DE927C68B458BE418D582EC97927A92854655137-8oqHQFITsIFcIJlls4ac1rfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  -1 siblings, 1 reply; 171+ messages in thread
From: Gruher, Joseph R @ 2017-02-26 17:58 UTC (permalink / raw)


> Hi,
> 
> I will try with kernel 4.10 when I get my hands on the machine.
> 
> In addition, I found that the machine did not have the kernel-firmware package
> installed. Could this cause the "strange" CX4 behavior? I will obviously re-test
> with the kernel-firmware package.
> 
> Shahar
> 
> 
> On 02/22/2017 06:52 PM, Laurence Oberman wrote:
> >
> > ----- Original Message -----
> >> From: "Sagi Grimberg" <sagi at grimberg.me>
> >> To: "shahar.salzman" <shahar.salzman at gmail.com>,
> >> linux-nvme at lists.infradead.org, linux-rdma at vger.kernel.org
> >> Sent: Tuesday, February 21, 2017 5:50:39 PM
> >> Subject: Re: Unexpected issues with 2 NVME initiators using the same
> >> target
> >>
> >>
> >>> I am using 2 initiators + 1 target using nvmet with 1 subsystem and
> >>> 4 backend devices. Kernel is 4.9.6, NVME/rdma drivers are all from
> >>> the vanilla kernel, I had a problem connecting the NVME using the
> >>> OFED drivers, so I removed all the mlx_compat and everything which
> >>> depends on it.
> >> Would it be possible to test with latest upstream kernel?
> >>
> >>> When I perform simultaneous writes (non direct fio) from both of the
> >>> initiators to the same device (overlapping areas), I get NVMEf
> >>> disconnect followed by "dump error cqe", successful reconnect, and
> >>> then on one of the servers I get a WARN_ON. After this the server
> >>> gets stuck and I have to power cycle it to get it back up...
> >> The error cqes seem to indicate that a memory registration operation
> >> failed which escalated to something worse.

In our lab we are dealing with an issue which has some of the same symptoms.  Wanted to add to the thread in case it is useful here.  We have a target system with 16 Intel P3520 disks and a Mellanox CX4 50Gb NIC directly connected (no switch) to a single initiator system with a matching Mellanox CX4 50Gb NIC.  We are running Ubuntu 16.10 with 4.10-RC8 mainline kernel.  All drivers are kernel default drivers.  I've attached our nvmetcli json, and FIO workload, and dmesg from both systems.  

We are able to provoke this problem with a variety of workloads but a high bandwidth read operation seems to cause it the most reliably, harder to produce with smaller block sizes.  For some reason the problem seems produced when we stop and restart IO - I can run the FIO workload on the initiator system for 1-2 hours without any new events in dmesg, pushing about 5500MB/sec the whole time, then kill it and wait 10 seconds and restart it, and the errors and reconnect events happen reliably at that point.  Working to characterize further this week and also to see if we can produce on a smaller configuration.  Happy to provide any additional details that would be useful or try any fixes!
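
For reference, the read workload we use to reproduce this is shaped roughly like the fio invocation below. The exact settings are in the attached job file, so the device name, block size and queue depth here are representative values only, not the attached configuration.

# Illustrative fio command line (representative values, not the attached job file):
# large sequential reads against one of the NVMe-oF namespaces on the initiator;
# the problem is harder to hit with smaller block sizes, and shows up most
# reliably after killing a run like this and restarting it a few seconds later.
fio --name=nvmf-read --filename=/dev/nvme0n1 --ioengine=libaio --direct=1 \
    --rw=read --bs=128k --iodepth=32 --numjobs=4 \
    --time_based --runtime=3600 --group_reporting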

On the initiator we see events like this:

[51390.065641] mlx5_0:dump_cqe:262:(pid 0): dump error cqe
[51390.065644] 00000000 00000000 00000000 00000000
[51390.065645] 00000000 00000000 00000000 00000000
[51390.065646] 00000000 00000000 00000000 00000000
[51390.065648] 00000000 08007806 250003ab 02b9dcd2
[51390.065666] nvme nvme3: MEMREG for CQE 0xffff9fc845039410 failed with status memory management operation error (6)
[51390.079156] nvme nvme3: reconnecting in 10 seconds
[51400.432782] nvme nvme3: Successfully reconnected

On the target we see events like this:

[51370.394694] mlx5_0:dump_cqe:262:(pid 6623): dump error cqe
[51370.394696] 00000000 00000000 00000000 00000000
[51370.394697] 00000000 00000000 00000000 00000000
[51370.394699] 00000000 00000000 00000000 00000000
[51370.394701] 00000000 00008813 080003ea 00c3b1d2

Sometimes, but less frequently, we also will see events on the target like this as part of the problem:

[21322.678571] nvmet: ctrl 1 fatal error occurred!


> >> I noticed some issues before with CX4 having problems with memory
> >> registration in the presence of network retransmissions (due to
> >> network congestion).
> >>
> >> I notified Mellanox folks on that too, CC'ing Linux-rdma for some
> >> more attention.
> >>
> >> After that, I see that ib_modify_qp failed which I've never seen
> >> before (might indicate that the device is in bad shape), and the
> >> WARN_ON is really weird given that nvme-rdma never uses
> IB_POLL_DIRECT.
> >>
> >>> Here are the printouts from the server that got stuck:
> >>>
> >>> Feb  6 09:20:13 kblock01-knode02 kernel: [59976.204216]
> >>> mlx5_0:dump_cqe:262:(pid 0): dump error cqe Feb  6 09:20:13
> >>> kblock01-knode02 kernel: [59976.204219] 00000000
> >>> 00000000 00000000 00000000
> >>> Feb  6 09:20:13 kblock01-knode02 kernel: [59976.204220] 00000000
> >>> 00000000 00000000 00000000
> >>> Feb  6 09:20:13 kblock01-knode02 kernel: [59976.204220] 00000000
> >>> 00000000 00000000 00000000
> >>> Feb  6 09:20:13 kblock01-knode02 kernel: [59976.204221] 00000000
> >>> 08007806 25000129 015557d0
> >>> Feb  6 09:20:13 kblock01-knode02 kernel: [59976.204234] nvme nvme0:
> >>> MEMREG for CQE 0xffff96ddd747a638 failed with status memory
> >>> management operation error (6) Feb  6 09:20:13 kblock01-knode02
> >>> kernel: [59976.204375] nvme nvme0:
> >>> reconnecting in 10 seconds
> >>> Feb  6 09:20:13 kblock01-knode02 kernel: [59976.205512]
> >>> mlx5_0:dump_cqe:262:(pid 0): dump error cqe Feb  6 09:20:13
> >>> kblock01-knode02 kernel: [59976.205514] 00000000
> >>> 00000000 00000000 00000000
> >>> Feb  6 09:20:13 kblock01-knode02 kernel: [59976.205515] 00000000
> >>> 00000000 00000000 00000000
> >>> Feb  6 09:20:13 kblock01-knode02 kernel: [59976.205515] 00000000
> >>> 00000000 00000000 00000000
> >>> Feb  6 09:20:13 kblock01-knode02 kernel: [59976.205516] 00000000
> >>> 08007806 25000126 00692bd0
> >>> Feb  6 09:20:23 kblock01-knode02 kernel: [59986.452887] nvme nvme0:
> >>> Successfully reconnected
> >>> Feb  6 09:20:24 kblock01-knode02 kernel: [59986.682887]
> >>> mlx5_0:dump_cqe:262:(pid 0): dump error cqe Feb  6 09:20:24
> >>> kblock01-knode02 kernel: [59986.682890] 00000000
> >>> 00000000 00000000 00000000
> >>> Feb  6 09:20:24 kblock01-knode02 kernel: [59986.682891] 00000000
> >>> 00000000 00000000 00000000
> >>> Feb  6 09:20:24 kblock01-knode02 kernel: [59986.682892] 00000000
> >>> 00000000 00000000 00000000
> >>> Feb  6 09:20:24 kblock01-knode02 kernel: [59986.682892] 00000000
> >>> 08007806 25000158 04cdd7d0
> >>> ...
> >>> Feb  6 09:20:24 kblock01-knode02 kernel: [59986.687737]
> >>> mlx5_0:dump_cqe:262:(pid 0): dump error cqe Feb  6 09:20:24
> >>> kblock01-knode02 kernel: [59986.687739] 00000000
> >>> 00000000 00000000 00000000
> >>> Feb  6 09:20:24 kblock01-knode02 kernel: [59986.687740] 00000000
> >>> 00000000 00000000 00000000
> >>> Feb  6 09:20:24 kblock01-knode02 kernel: [59986.687740] 00000000
> >>> 00000000 00000000 00000000
> >>> Feb  6 09:20:24 kblock01-knode02 kernel: [59986.687741] 00000000
> >>> 93005204 00000155 00a385e0
> >>> Feb  6 09:20:34 kblock01-knode02 kernel: [59997.389290] nvme nvme0:
> >>> Successfully reconnected
> >>> Feb  6 09:21:19 kblock01-knode02 rsyslogd: -- MARK -- Feb  6
> >>> 09:21:38 kblock01-knode02 kernel: [60060.927832]
> >>> mlx5_0:dump_cqe:262:(pid 0): dump error cqe Feb  6 09:21:38
> >>> kblock01-knode02 kernel: [60060.927835] 00000000
> >>> 00000000 00000000 00000000
> >>> Feb  6 09:21:38 kblock01-knode02 kernel: [60060.927836] 00000000
> >>> 00000000 00000000 00000000
> >>> Feb  6 09:21:38 kblock01-knode02 kernel: [60060.927837] 00000000
> >>> 00000000 00000000 00000000
> >>> Feb  6 09:21:38 kblock01-knode02 kernel: [60060.927837] 00000000
> >>> 93005204 00000167 b44e76e0
> >>> Feb  6 09:21:38 kblock01-knode02 kernel: [60060.927846] nvme nvme0:
> >>> RECV for CQE 0xffff96fe64f18750 failed with status local protection
> >>> error (4) Feb  6 09:21:38 kblock01-knode02 kernel: [60060.928200] nvme
> nvme0:
> >>> reconnecting in 10 seconds
> >>> ...
> >>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736182] mlx5_core
> >>> 0000:04:00.0: wait_func:879:(pid 22709): 2ERR_QP(0x507) timeout.
> >>> Will cause a leak of a command resource Feb  6 09:23:54
> >>> kblock01-knode02 kernel: [60196.736190] ------------[ cut here
> >>> ]------------ Feb  6 09:23:54 kblock01-knode02 kernel:
> >>> [60196.736211] WARNING: CPU: 18
> >>> PID: 22709 at drivers/infiniband/core/verbs.c:1963
> >>> __ib_drain_sq+0x135/0x1d0 [ib_core]
> >>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736212] failed to
> >>> drain send queue: -110 Feb  6 09:23:54 kblock01-knode02 kernel:
> >>> [60196.736213] Modules linked
> >>> in: nvme_rdma rdma_cm ib_cm iw_cm nvme_fabrics nvme_core
> >>> ocs_fc_scst(POE) scst(OE) mptctl mptbase qla2xxx_scst(OE)
> >>> scsi_transport_fc dm_multipath drbd lru_cache netconsole
> >>> mst_pciconf(OE) nfsd nfs_acl auth_rpcgss ipmi_devintf lockd sunrpc
> >>> grace ipt_MASQUERADE
> >>> nf_nat_masquerade_ipv4 xt_nat iptable_nat nf_nat_ipv4 nf_nat
> >>> nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack fuse
> >>> binfmt_misc iTCO_wdt iTCO_vendor_support pcspkr serio_raw joydev igb
> >>> i2c_i801 i2c_smbus lpc_ich mei_me mei ioatdma dca ses enclosure
> >>> ipmi_ssif ipmi_si ipmi_msghandler bnx2x libcrc32c mdio mlx5_ib
> >>> ib_core mlx5_core devlink ptp pps_core tpm_tis tpm_tis_core tpm
> >>> ext4(E) mbcache(E) jbd2(E) isci(E)
> >>> libsas(E) mpt3sas(E) scsi_transport_sas(E) raid_class(E)
> >>> megaraid_sas(E)
> >>> wmi(E) mgag200(E) ttm(E) drm_kms_helper(E) Feb  6 09:23:54
> >>> kblock01-knode02 kernel: [60196.736293]  drm(E)
> >>> i2c_algo_bit(E) [last unloaded: nvme_core] Feb  6 09:23:54
> >>> kblock01-knode02 kernel: [60196.736301] CPU: 18 PID:
> >>> 22709 Comm: kworker/18:4 Tainted: P           OE   4.9.6-KM1 #0
> >>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736303] Hardware name:
> >>> Supermicro SYS-1027R-72BRFTP5-EI007/X9DRW-7/iTPF, BIOS 3.0a
> >>> 01/22/2014 Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736312]
> Workqueue:
> >>> nvme_rdma_wq nvme_rdma_reconnect_ctrl_work [nvme_rdma] Feb  6
> >>> 09:23:54 kblock01-knode02 kernel: [60196.736315] ffff96de27e6f9b8
> >>> ffffffffb537f3ff ffffffffc06a00e5 ffff96de27e6fa18 Feb  6 09:23:54
> >>> kblock01-knode02 kernel: [60196.736320] ffff96de27e6fa18
> >>> 0000000000000000 ffff96de27e6fa08 ffffffffb5091a7d Feb  6 09:23:54
> >>> kblock01-knode02 kernel: [60196.736324] 0000000300000000
> >>> 000007ab00000006 0507000000000000 ffff96de27e6fad8 Feb  6 09:23:54
> >>> kblock01-knode02 kernel: [60196.736328] Call Trace:
> >>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736341]
> >>> [<ffffffffb537f3ff>] dump_stack+0x67/0x98 Feb  6 09:23:54
> >>> kblock01-knode02 kernel: [60196.736352] [<ffffffffc06a00e5>] ?
> >>> __ib_drain_sq+0x135/0x1d0 [ib_core] Feb  6 09:23:54 kblock01-knode02
> >>> kernel: [60196.736364] [<ffffffffb5091a7d>] __warn+0xfd/0x120 Feb  6
> >>> 09:23:54 kblock01-knode02 kernel: [60196.736368]
> >>> [<ffffffffb5091b59>] warn_slowpath_fmt+0x49/0x50 Feb  6 09:23:54
> >>> kblock01-knode02 kernel: [60196.736378] [<ffffffffc069fdb5>] ?
> >>> ib_modify_qp+0x45/0x50 [ib_core] Feb  6 09:23:54 kblock01-knode02
> >>> kernel: [60196.736388] [<ffffffffc06a00e5>]
> >>> __ib_drain_sq+0x135/0x1d0 [ib_core] Feb  6 09:23:54 kblock01-knode02
> >>> kernel: [60196.736398] [<ffffffffc069f5a0>] ?
> >>> ib_create_srq+0xa0/0xa0 [ib_core] Feb  6 09:23:54 kblock01-knode02
> >>> kernel: [60196.736408] [<ffffffffc06a01a5>] ib_drain_sq+0x25/0x30
> >>> [ib_core] Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736418]
> >>> [<ffffffffc06a01c6>] ib_drain_qp+0x16/0x40 [ib_core] Feb  6 09:23:54
> >>> kblock01-knode02 kernel: [60196.736422] [<ffffffffc0c4d25b>]
> >>> nvme_rdma_stop_and_free_queue+0x2b/0x50 [nvme_rdma] Feb  6
> 09:23:54
> >>> kblock01-knode02 kernel: [60196.736426] [<ffffffffc0c4d2ad>]
> >>> nvme_rdma_free_io_queues+0x2d/0x40 [nvme_rdma] Feb  6 09:23:54
> >>> kblock01-knode02 kernel: [60196.736429] [<ffffffffc0c4d884>]
> >>> nvme_rdma_reconnect_ctrl_work+0x34/0x1e0 [nvme_rdma] Feb  6
> 09:23:54
> >>> kblock01-knode02 kernel: [60196.736434] [<ffffffffb50ac7ce>]
> >>> process_one_work+0x17e/0x4f0 Feb  6 09:23:54 kblock01-knode02
> >>> kernel: [60196.736444] [<ffffffffb50cefc5>] ?
> >>> dequeue_task_fair+0x85/0x870 Feb  6 09:23:54 kblock01-knode02
> >>> kernel: [60196.736454] [<ffffffffb5799c6a>] ? schedule+0x3a/0xa0 Feb
> >>> 6 09:23:54 kblock01-knode02 kernel: [60196.736456]
> >>> [<ffffffffb50ad653>] worker_thread+0x153/0x660 Feb  6 09:23:54
> >>> kblock01-knode02 kernel: [60196.736464] [<ffffffffb5026b4c>] ?
> >>> __switch_to+0x1dc/0x670 Feb  6 09:23:54 kblock01-knode02 kernel:
> >>> [60196.736468] [<ffffffffb5799706>] ? __schedule+0x226/0x6a0 Feb  6
> >>> 09:23:54 kblock01-knode02 kernel: [60196.736471]
> >>> [<ffffffffb50be7c2>] ? default_wake_function+0x12/0x20 Feb  6
> >>> 09:23:54 kblock01-knode02 kernel: [60196.736474]
> >>> [<ffffffffb50d7636>] ? __wake_up_common+0x56/0x90 Feb  6 09:23:54
> >>> kblock01-knode02 kernel: [60196.736477] [<ffffffffb50ad500>] ?
> >>> workqueue_prepare_cpu+0x80/0x80 Feb  6 09:23:54 kblock01-knode02
> >>> kernel: [60196.736480] [<ffffffffb5799c6a>] ? schedule+0x3a/0xa0 Feb
> >>> 6 09:23:54 kblock01-knode02 kernel: [60196.736483]
> >>> [<ffffffffb50ad500>] ? workqueue_prepare_cpu+0x80/0x80 Feb  6
> >>> 09:23:54 kblock01-knode02 kernel: [60196.736487]
> >>> [<ffffffffb50b237d>] kthread+0xcd/0xf0 Feb  6 09:23:54
> >>> kblock01-knode02 kernel: [60196.736492] [<ffffffffb50bc40e>] ?
> >>> schedule_tail+0x1e/0xc0 Feb  6 09:23:54 kblock01-knode02 kernel:
> >>> [60196.736495] [<ffffffffb50b22b0>] ?
> >>> __kthread_init_worker+0x40/0x40 Feb  6 09:23:54 kblock01-knode02
> >>> kernel: [60196.736499] [<ffffffffb579ded5>] ret_from_fork+0x25/0x30
> >>> Feb  6 09:23:54 kblock01-knode02 kernel: [60196.736502] ---[ end
> >>> trace
> >>> eb0e5ba7dc81a687 ]---
> >>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176054] mlx5_core
> >>> 0000:04:00.0: wait_func:879:(pid 22709): 2ERR_QP(0x507) timeout.
> >>> Will cause a leak of a command resource Feb  6 09:24:55
> >>> kblock01-knode02 kernel: [60258.176059] ------------[ cut here
> >>> ]------------ Feb  6 09:24:55 kblock01-knode02 kernel:
> >>> [60258.176073] WARNING: CPU: 18
> >>> PID: 22709 at drivers/infiniband/core/verbs.c:1998
> >>> __ib_drain_rq+0x12a/0x1c0 [ib_core]
> >>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176075] failed to
> >>> drain recv queue: -110 Feb  6 09:24:55 kblock01-knode02 kernel:
> >>> [60258.176076] Modules linked
> >>> in: nvme_rdma rdma_cm ib_cm iw_cm nvme_fabrics nvme_core
> >>> ocs_fc_scst(POE) scst(OE) mptctl mptbase qla2xxx_scst(OE)
> >>> scsi_transport_fc dm_multipath drbd lru_cache netconsole
> >>> mst_pciconf(OE) nfsd nfs_acl auth_rpcgss ipmi_devintf lockd sunrpc
> >>> grace ipt_MASQUERADE
> >>> nf_nat_masquerade_ipv4 xt_nat iptable_nat nf_nat_ipv4 nf_nat
> >>> nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack fuse
> >>> binfmt_misc iTCO_wdt iTCO_vendor_support pcspkr serio_raw joydev igb
> >>> i2c_i801 i2c_smbus lpc_ich mei_me mei ioatdma dca ses enclosure
> >>> ipmi_ssif ipmi_si ipmi_msghandler bnx2x libcrc32c mdio mlx5_ib
> >>> ib_core mlx5_core devlink ptp pps_core tpm_tis tpm_tis_core tpm
> >>> ext4(E) mbcache(E) jbd2(E) isci(E)
> >>> libsas(E) mpt3sas(E) scsi_transport_sas(E) raid_class(E)
> >>> megaraid_sas(E)
> >>> wmi(E) mgag200(E) ttm(E) drm_kms_helper(E) Feb  6 09:24:55
> >>> kblock01-knode02 kernel: [60258.176134]  drm(E)
> >>> i2c_algo_bit(E) [last unloaded: nvme_core] Feb  6 09:24:55
> >>> kblock01-knode02 kernel: [60258.176140] CPU: 18 PID:
> >>> 22709 Comm: kworker/18:4 Tainted: P        W  OE   4.9.6-KM1 #0
> >>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176141] Hardware name:
> >>> Supermicro SYS-1027R-72BRFTP5-EI007/X9DRW-7/iTPF, BIOS 3.0a
> >>> 01/22/2014 Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176147]
> Workqueue:
> >>> nvme_rdma_wq nvme_rdma_reconnect_ctrl_work [nvme_rdma] Feb  6
> >>> 09:24:55 kblock01-knode02 kernel: [60258.176150] ffff96de27e6f9b8
> >>> ffffffffb537f3ff ffffffffc069feea ffff96de27e6fa18 Feb  6 09:24:55
> >>> kblock01-knode02 kernel: [60258.176155] ffff96de27e6fa18
> >>> 0000000000000000 ffff96de27e6fa08 ffffffffb5091a7d Feb  6 09:24:55
> >>> kblock01-knode02 kernel: [60258.176159] 0000000300000000
> >>> 000007ce00000006 0507000000000000 ffff96de27e6fad8 Feb  6 09:24:55
> >>> kblock01-knode02 kernel: [60258.176163] Call Trace:
> >>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176169]
> >>> [<ffffffffb537f3ff>] dump_stack+0x67/0x98 Feb  6 09:24:55
> >>> kblock01-knode02 kernel: [60258.176180] [<ffffffffc069feea>] ?
> >>> __ib_drain_rq+0x12a/0x1c0 [ib_core] Feb  6 09:24:55 kblock01-knode02
> >>> kernel: [60258.176185] [<ffffffffb5091a7d>] __warn+0xfd/0x120 Feb  6
> >>> 09:24:55 kblock01-knode02 kernel: [60258.176189]
> >>> [<ffffffffb5091b59>] warn_slowpath_fmt+0x49/0x50 Feb  6 09:24:55
> >>> kblock01-knode02 kernel: [60258.176199] [<ffffffffc069fdb5>] ?
> >>> ib_modify_qp+0x45/0x50 [ib_core] Feb  6 09:24:55 kblock01-knode02
> >>> kernel: [60258.176216] [<ffffffffc06de770>] ?
> >>> mlx5_ib_modify_qp+0x980/0xec0 [mlx5_ib] Feb  6 09:24:55
> >>> kblock01-knode02 kernel: [60258.176225] [<ffffffffc069feea>]
> >>> __ib_drain_rq+0x12a/0x1c0 [ib_core] Feb  6 09:24:55 kblock01-knode02
> >>> kernel: [60258.176235] [<ffffffffc069f5a0>] ?
> >>> ib_create_srq+0xa0/0xa0 [ib_core] Feb  6 09:24:55 kblock01-knode02
> >>> kernel: [60258.176246] [<ffffffffc069ffa5>] ib_drain_rq+0x25/0x30
> >>> [ib_core] Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176255]
> >>> [<ffffffffc06a01dc>] ib_drain_qp+0x2c/0x40 [ib_core] Feb  6 09:24:55
> >>> kblock01-knode02 kernel: [60258.176259] [<ffffffffc0c4d25b>]
> >>> nvme_rdma_stop_and_free_queue+0x2b/0x50 [nvme_rdma] Feb  6
> 09:24:55
> >>> kblock01-knode02 kernel: [60258.176263] [<ffffffffc0c4d2ad>]
> >>> nvme_rdma_free_io_queues+0x2d/0x40 [nvme_rdma] Feb  6 09:24:55
> >>> kblock01-knode02 kernel: [60258.176267] [<ffffffffc0c4d884>]
> >>> nvme_rdma_reconnect_ctrl_work+0x34/0x1e0 [nvme_rdma] Feb  6
> 09:24:55
> >>> kblock01-knode02 kernel: [60258.176270] [<ffffffffb50ac7ce>]
> >>> process_one_work+0x17e/0x4f0 Feb  6 09:24:55 kblock01-knode02
> >>> kernel: [60258.176275] [<ffffffffb50cefc5>] ?
> >>> dequeue_task_fair+0x85/0x870 Feb  6 09:24:55 kblock01-knode02
> >>> kernel: [60258.176278] [<ffffffffb5799c6a>] ? schedule+0x3a/0xa0 Feb
> >>> 6 09:24:55 kblock01-knode02 kernel: [60258.176281]
> >>> [<ffffffffb50ad653>] worker_thread+0x153/0x660 Feb  6 09:24:55
> >>> kblock01-knode02 kernel: [60258.176285] [<ffffffffb5026b4c>] ?
> >>> __switch_to+0x1dc/0x670 Feb  6 09:24:55 kblock01-knode02 kernel:
> >>> [60258.176289] [<ffffffffb5799706>] ? __schedule+0x226/0x6a0 Feb  6
> >>> 09:24:55 kblock01-knode02 kernel: [60258.176291]
> >>> [<ffffffffb50be7c2>] ? default_wake_function+0x12/0x20 Feb  6
> >>> 09:24:55 kblock01-knode02 kernel: [60258.176294]
> >>> [<ffffffffb50d7636>] ? __wake_up_common+0x56/0x90 Feb  6 09:24:55
> >>> kblock01-knode02 kernel: [60258.176297] [<ffffffffb50ad500>] ?
> >>> workqueue_prepare_cpu+0x80/0x80 Feb  6 09:24:55 kblock01-knode02
> >>> kernel: [60258.176300] [<ffffffffb5799c6a>] ? schedule+0x3a/0xa0 Feb
> >>> 6 09:24:55 kblock01-knode02 kernel: [60258.176302]
> >>> [<ffffffffb50ad500>] ? workqueue_prepare_cpu+0x80/0x80 Feb  6
> >>> 09:24:55 kblock01-knode02 kernel: [60258.176306]
> >>> [<ffffffffb50b237d>] kthread+0xcd/0xf0 Feb  6 09:24:55
> >>> kblock01-knode02 kernel: [60258.176310] [<ffffffffb50bc40e>] ?
> >>> schedule_tail+0x1e/0xc0 Feb  6 09:24:55 kblock01-knode02 kernel:
> >>> [60258.176313] [<ffffffffb50b22b0>] ?
> >>> __kthread_init_worker+0x40/0x40 Feb  6 09:24:55 kblock01-knode02
> >>> kernel: [60258.176316] [<ffffffffb579ded5>] ret_from_fork+0x25/0x30
> >>> Feb  6 09:24:55 kblock01-knode02 kernel: [60258.176318] ---[ end
> >>> trace
> >>> eb0e5ba7dc81a688 ]---
> >>> Feb  6 09:25:26 kblock01-knode02 udevd[9344]: worker [21322]
> >>> unexpectedly returned with status 0x0100 Feb  6 09:25:26
> >>> kblock01-knode02 udevd[9344]: worker [21322] failed while handling
> >>> '/devices/virtual/nvme-fabrics/ctl/nvme0/nvme0n1'
> >>> Feb  6 09:25:26 kblock01-knode02 udevd[9344]: worker [21323]
> >>> unexpectedly returned with status 0x0100 Feb  6 09:25:26
> >>> kblock01-knode02 udevd[9344]: worker [21323] failed while handling
> >>> '/devices/virtual/nvme-fabrics/ctl/nvme0/nvme0n2'
> >>> Feb  6 09:25:26 kblock01-knode02 udevd[9344]: worker [21741]
> >>> unexpectedly returned with status 0x0100 Feb  6 09:25:26
> >>> kblock01-knode02 udevd[9344]: worker [21741] failed while handling
> >>> '/devices/virtual/nvme-fabrics/ctl/nvme0/nvme0n3'
> >>> Feb  6 09:25:57 kblock01-knode02 kernel: [60319.615916] mlx5_core
> >>> 0000:04:00.0: wait_func:879:(pid 22709): 2RST_QP(0x50a) timeout.
> >>> Will cause a leak of a command resource Feb  6 09:25:57
> >>> kblock01-knode02 kernel: [60319.615922]
> >>> mlx5_0:destroy_qp_common:1936:(pid 22709): mlx5_ib: modify QP
> >>> 0x00015e to RESET failed Feb  6 09:26:19 kblock01-knode02
> >>> udevd[9344]: worker [21740] unexpectedly returned with status 0x0100
> >>> Feb  6 09:26:19 kblock01-knode02 udevd[9344]: worker [21740] failed
> >>> while handling '/devices/virtual/nvme-fabrics/ctl/nvme0/nvme0n4'
> >>>
> >>>
> >>> _______________________________________________
> >>> Linux-nvme mailing list
> >>> Linux-nvme at lists.infradead.org
> >>> http://lists.infradead.org/mailman/listinfo/linux-nvme
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe linux-rdma"
> >> in the body of a message to majordomo at vger.kernel.org More majordomo
> >> info at  http://vger.kernel.org/majordomo-info.html
> >>
> > Hi Sagi,
> >
> > Have not looked in depth here, but is this maybe the GAPS issue I
> > bumped into again, where we had to revert the patch for the SRP transport?
> > Would we have to do a similar revert in the NVMe space, or modify mlx5
> > where this issue exists?
> >
> > Thanks
> > Laurence
> >
> >
> >
> 
> 
> _______________________________________________
> Linux-nvme mailing list
> Linux-nvme at lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-nvme
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 16-disk-nvmetcli-config.json
Type: application/octet-stream
Size: 6141 bytes
Desc: 16-disk-nvmetcli-config.json
URL: <http://lists.infradead.org/pipermail/linux-nvme/attachments/20170226/d15552a1/attachment-0001.obj>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: 1_Warning.txt
URL: <http://lists.infradead.org/pipermail/linux-nvme/attachments/20170226/d15552a1/attachment-0003.txt>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: initiator-dmesg.txt
URL: <http://lists.infradead.org/pipermail/linux-nvme/attachments/20170226/d15552a1/attachment-0004.txt>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: target-dmesg.txt
URL: <http://lists.infradead.org/pipermail/linux-nvme/attachments/20170226/d15552a1/attachment-0005.txt>

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-02-26  8:03             ` shahar.salzman
@ 2017-02-27 20:13                 ` Sagi Grimberg
  -1 siblings, 0 replies; 171+ messages in thread
From: Sagi Grimberg @ 2017-02-27 20:13 UTC (permalink / raw)
  To: shahar.salzman, Laurence Oberman
  Cc: linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA


> Hi,
>
> I will try with kernel 4.10 when I get my hands on the machine.
>
> In addition, I found that the machine did not have the kernel-firmware
> package installed. Could this cause the "strange" CX4 behavior? I will
> obviously re-test with the kernel-firmware package.

Not really, Mellanox FW is obtained via Mellanox-provided tools.
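
If it helps, the firmware the CX4 is actually running can be checked from the host (interface name and PCI address below are just examples):

# firmware version as reported by the netdev driver
ethtool -i enp4s0f0 | grep firmware-version
# or via the RDMA stack
ibv_devinfo -d mlx5_0 | grep fw_ver
# querying/updating the firmware itself is done with the Mellanox tools, e.g.
mstflint -d 0000:04:00.0 query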
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Unexpected issues with 2 NVME initiators using the same target
@ 2017-02-27 20:13                 ` Sagi Grimberg
  0 siblings, 0 replies; 171+ messages in thread
From: Sagi Grimberg @ 2017-02-27 20:13 UTC (permalink / raw)



> Hi,
>
> I will try with kernel 4.10 when I get my hands on the machine.
>
> In addition, I found that the machine did not have the kernel-firmware
> package installed. Could this cause the "strange" CX4 behavior? I will
> obviously re-test with the kernel-firmware package.

Not really, Mellanox FW is obtained via Mellanox-provided tools.

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-02-26 17:58             ` Gruher, Joseph R
@ 2017-02-27 20:33                   ` Sagi Grimberg
  0 siblings, 0 replies; 171+ messages in thread
From: Sagi Grimberg @ 2017-02-27 20:33 UTC (permalink / raw)
  To: Gruher, Joseph R, shahar.salzman, Laurence Oberman, Riches Jr, Robert M
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r


Hey Joseph,

> In our lab we are dealing with an issue which has some of the same symptoms.  Wanted to add to the thread in case it is useful here.  We have a target system with 16 Intel P3520 disks and a Mellanox CX4 50Gb NIC directly connected (no switch) to a single initiator system with a matching Mellanox CX4 50Gb NIC.  We are running Ubuntu 16.10 with 4.10-RC8 mainline kernel.  All drivers are kernel default drivers.  I've attached our nvmetcli json, and FIO workload, and dmesg from both systems.
>
> We are able to provoke this problem with a variety of workloads but a high bandwidth read operation seems to cause it the most reliably, harder to produce with smaller block sizes.  For some reason the problem seems produced when we stop and restart IO - I can run the FIO workload on the initiator system for 1-2 hours without any new events in dmesg, pushing about 5500MB/sec the whole time, then kill it and wait 10 seconds and restart it, and the errors and reconnect events happen reliably at that point.  Working to characterize further this week and also to see if we can produce on a smaller configuration.  Happy to provide any additional details that would be useful or try any fixes!
>
> On the initiator we see events like this:
>
> [51390.065641] mlx5_0:dump_cqe:262:(pid 0): dump error cqe
> [51390.065644] 00000000 00000000 00000000 00000000
> [51390.065645] 00000000 00000000 00000000 00000000
> [51390.065646] 00000000 00000000 00000000 00000000
> [51390.065648] 00000000 08007806 250003ab 02b9dcd2
> [51390.065666] nvme nvme3: MEMREG for CQE 0xffff9fc845039410 failed with status memory management operation error (6)
> [51390.079156] nvme nvme3: reconnecting in 10 seconds
> [51400.432782] nvme nvme3: Successfully reconnected

Seems to me this is a CX4 FW issue. Mellanox can elaborate on these
vendor specific syndromes on this output.

> On the target we see events like this:
>
> [51370.394694] mlx5_0:dump_cqe:262:(pid 6623): dump error cqe
> [51370.394696] 00000000 00000000 00000000 00000000
> [51370.394697] 00000000 00000000 00000000 00000000
> [51370.394699] 00000000 00000000 00000000 00000000
> [51370.394701] 00000000 00008813 080003ea 00c3b1d2

If the host is failing on memory mapping while the target is initiating
rdma access it makes sense that it will see errors.

>
> Sometimes, but less frequently, we also will see events on the target like this as part of the problem:
>
> [21322.678571] nvmet: ctrl 1 fatal error occurred!

Again, also makes sense because for nvmet this is a fatal error and we
need to teardown the controller.

You can try out this patch to see if it makes the memreg issues go
away:
--
diff --git a/drivers/infiniband/hw/mlx5/qp.c 
b/drivers/infiniband/hw/mlx5/qp.c
index ad8a2638e339..0f9a12570262 100644
--- a/drivers/infiniband/hw/mlx5/qp.c
+++ b/drivers/infiniband/hw/mlx5/qp.c
@@ -3893,7 +3893,7 @@ int mlx5_ib_post_send(struct ib_qp *ibqp, struct 
ib_send_wr *wr,
                                 goto out;

                         case IB_WR_LOCAL_INV:
-                               next_fence = 
MLX5_FENCE_MODE_INITIATOR_SMALL;
+                               next_fence = 
MLX5_FENCE_MODE_STRONG_ORDERING;
                                 qp->sq.wr_data[idx] = IB_WR_LOCAL_INV;
                                 ctrl->imm = 
cpu_to_be32(wr->ex.invalidate_rkey);
                                 set_linv_wr(qp, &seg, &size);
@@ -3901,7 +3901,7 @@ int mlx5_ib_post_send(struct ib_qp *ibqp, struct 
ib_send_wr *wr,
                                 break;

                         case IB_WR_REG_MR:
-                               next_fence = 
MLX5_FENCE_MODE_INITIATOR_SMALL;
+                               next_fence = 
MLX5_FENCE_MODE_STRONG_ORDERING;
                                 qp->sq.wr_data[idx] = IB_WR_REG_MR;
                                 ctrl->imm = cpu_to_be32(reg_wr(wr)->key);
                                 err = set_reg_wr(qp, reg_wr(wr), &seg, 
&size);
--

Note that this will have a big performance (negative) impact on small
read workloads.
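
If you want to quantify that before and after applying it, a small-block random read run is the worst case for the stronger fencing (a sketch only; the device name is an example):

# 4k random reads; compare latency/IOPS with and without the patch applied
fio --name=smallread --filename=/dev/nvme0n1 --ioengine=libaio --direct=1 \
    --rw=randread --bs=4k --iodepth=16 --numjobs=1 --time_based --runtime=120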
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 171+ messages in thread

* Unexpected issues with 2 NVME initiators using the same target
@ 2017-02-27 20:33                   ` Sagi Grimberg
  0 siblings, 0 replies; 171+ messages in thread
From: Sagi Grimberg @ 2017-02-27 20:33 UTC (permalink / raw)



Hey Joseph,

> In our lab we are dealing with an issue which has some of the same symptoms.  Wanted to add to the thread in case it is useful here.  We have a target system with 16 Intel P3520 disks and a Mellanox CX4 50Gb NIC directly connected (no switch) to a single initiator system with a matching Mellanox CX4 50Gb NIC.  We are running Ubuntu 16.10 with 4.10-RC8 mainline kernel.  All drivers are kernel default drivers.  I've attached our nvmetcli json, and FIO workload, and dmesg from both systems.
>
> We are able to provoke this problem with a variety of workloads but a high bandwidth read operation seems to cause it the most reliably, harder to produce with smaller block sizes.  For some reason the problem seems produced when we stop and restart IO - I can run the FIO workload on the initiator system for 1-2 hours without any new events in dmesg, pushing about 5500MB/sec the whole time, then kill it and wait 10 seconds and restart it, and the errors and reconnect events happen reliably at that point.  Working to characterize further this week and also to see if we can produce on a smaller configuration.  Happy to provide any additional details that would be useful or try any fixes!
>
> On the initiator we see events like this:
>
> [51390.065641] mlx5_0:dump_cqe:262:(pid 0): dump error cqe
> [51390.065644] 00000000 00000000 00000000 00000000
> [51390.065645] 00000000 00000000 00000000 00000000
> [51390.065646] 00000000 00000000 00000000 00000000
> [51390.065648] 00000000 08007806 250003ab 02b9dcd2
> [51390.065666] nvme nvme3: MEMREG for CQE 0xffff9fc845039410 failed with status memory management operation error (6)
> [51390.079156] nvme nvme3: reconnecting in 10 seconds
> [51400.432782] nvme nvme3: Successfully reconnected

Seems to me this is a CX4 FW issue. Mellanox can elaborate on these
vendor specific syndromes on this output.

> On the target we see events like this:
>
> [51370.394694] mlx5_0:dump_cqe:262:(pid 6623): dump error cqe
> [51370.394696] 00000000 00000000 00000000 00000000
> [51370.394697] 00000000 00000000 00000000 00000000
> [51370.394699] 00000000 00000000 00000000 00000000
> [51370.394701] 00000000 00008813 080003ea 00c3b1d2

If the host is failing on memory mapping while the target is initiating
rdma access it makes sense that it will see errors.

>
> Sometimes, but less frequently, we also will see events on the target like this as part of the problem:
>
> [21322.678571] nvmet: ctrl 1 fatal error occurred!

Again, also makes sense because for nvmet this is a fatal error and we
need to teardown the controller.

You can try out this patch to see if it makes the memreg issues go
away:
--
diff --git a/drivers/infiniband/hw/mlx5/qp.c 
b/drivers/infiniband/hw/mlx5/qp.c
index ad8a2638e339..0f9a12570262 100644
--- a/drivers/infiniband/hw/mlx5/qp.c
+++ b/drivers/infiniband/hw/mlx5/qp.c
@@ -3893,7 +3893,7 @@ int mlx5_ib_post_send(struct ib_qp *ibqp, struct 
ib_send_wr *wr,
                                 goto out;

                         case IB_WR_LOCAL_INV:
-                               next_fence = 
MLX5_FENCE_MODE_INITIATOR_SMALL;
+                               next_fence = 
MLX5_FENCE_MODE_STRONG_ORDERING;
                                 qp->sq.wr_data[idx] = IB_WR_LOCAL_INV;
                                 ctrl->imm = 
cpu_to_be32(wr->ex.invalidate_rkey);
                                 set_linv_wr(qp, &seg, &size);
@@ -3901,7 +3901,7 @@ int mlx5_ib_post_send(struct ib_qp *ibqp, struct 
ib_send_wr *wr,
                                 break;

                         case IB_WR_REG_MR:
-                               next_fence = 
MLX5_FENCE_MODE_INITIATOR_SMALL;
+                               next_fence = 
MLX5_FENCE_MODE_STRONG_ORDERING;
                                 qp->sq.wr_data[idx] = IB_WR_REG_MR;
                                 ctrl->imm = cpu_to_be32(reg_wr(wr)->key);
                                 err = set_reg_wr(qp, reg_wr(wr), &seg, 
&size);
--

Note that this will have a big performance (negative) impact on small
read workloads.

^ permalink raw reply related	[flat|nested] 171+ messages in thread

* Unexpected issues with 2 NVME initiators using the same target
  2017-02-27 20:33                   ` Sagi Grimberg
  (?)
@ 2017-02-27 20:57                   ` Gruher, Joseph R
  -1 siblings, 0 replies; 171+ messages in thread
From: Gruher, Joseph R @ 2017-02-27 20:57 UTC (permalink / raw)


> Seems to me this is a CX4 FW issue. Mellanox can elaborate on these vendor
> specific syndromes on this output.

> You can try out this patch to see if it makes the memreg issues go
> away:

Thanks for the response Sagi!  We will try to engage Mellanox and also see if we can load the patch.

-Joe


> -----Original Message-----
> From: Sagi Grimberg [mailto:sagi at grimberg.me]
> Sent: Monday, February 27, 2017 12:33 PM
> To: Gruher, Joseph R <joseph.r.gruher at intel.com>; shahar.salzman
> <shahar.salzman at gmail.com>; Laurence Oberman <loberman at redhat.com>;
> Riches Jr, Robert M <robert.m.riches.jr at intel.com>
> Cc: linux-rdma at vger.kernel.org; linux-nvme at lists.infradead.org
> Subject: Re: Unexpected issues with 2 NVME initiators using the same target
> 
> 
> Hey Joseph,
> 
> > In our lab we are dealing with an issue which has some of the same
> symptoms.  Wanted to add to the thread in case it is useful here.  We have a
> target system with 16 Intel P3520 disks and a Mellanox CX4 50Gb NIC directly
> connected (no switch) to a single initiator system with a matching Mellanox
> CX4 50Gb NIC.  We are running Ubuntu 16.10 with 4.10-RC8 mainline kernel.
> All drivers are kernel default drivers.  I've attached our nvmetcli json, and FIO
> workload, and dmesg from both systems.
> >
> > We are able to provoke this problem with a variety of workloads but a high
> bandwidth read operation seems to cause it the most reliably, harder to
> produce with smaller block sizes.  For some reason the problem seems
> produced when we stop and restart IO - I can run the FIO workload on the
> initiator system for 1-2 hours without any new events in dmesg, pushing about
> 5500MB/sec the whole time, then kill it and wait 10 seconds and restart it, and
> the errors and reconnect events happen reliably at that point.  Working to
> characterize further this week and also to see if we can produce on a smaller
> configuration.  Happy to provide any additional details that would be useful or
> try any fixes!
> >
> > On the initiator we see events like this:
> >
> > [51390.065641] mlx5_0:dump_cqe:262:(pid 0): dump error cqe
> > [51390.065644] 00000000 00000000 00000000 00000000 [51390.065645]
> > 00000000 00000000 00000000 00000000 [51390.065646] 00000000
> 00000000
> > 00000000 00000000 [51390.065648] 00000000 08007806 250003ab
> 02b9dcd2
> > [51390.065666] nvme nvme3: MEMREG for CQE 0xffff9fc845039410 failed
> > with status memory management operation error (6) [51390.079156] nvme
> > nvme3: reconnecting in 10 seconds [51400.432782] nvme nvme3:
> > Successfully reconnected
> 
> Seems to me this is a CX4 FW issue. Mellanox can elaborate on these vendor
> specific syndromes on this output.
> 
> > On the target we see events like this:
> >
> > [51370.394694] mlx5_0:dump_cqe:262:(pid 6623): dump error cqe
> > [51370.394696] 00000000 00000000 00000000 00000000 [51370.394697]
> > 00000000 00000000 00000000 00000000 [51370.394699] 00000000
> 00000000
> > 00000000 00000000 [51370.394701] 00000000 00008813 080003ea
> 00c3b1d2
> 
> If the host is failing on memory mapping while the target is initiating rdma
> access it makes sense that it will see errors.
> 
> >
> > Sometimes, but less frequently, we also will see events on the target like this
> as part of the problem:
> >
> > [21322.678571] nvmet: ctrl 1 fatal error occurred!
> 
> Again, also makes sense because for nvmet this is a fatal error and we need to
> teardown the controller.
> 
> You can try out this patch to see if it makes the memreg issues go
> away:
> --
> diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
> index ad8a2638e339..0f9a12570262 100644
> --- a/drivers/infiniband/hw/mlx5/qp.c
> +++ b/drivers/infiniband/hw/mlx5/qp.c
> @@ -3893,7 +3893,7 @@ int mlx5_ib_post_send(struct ib_qp *ibqp, struct
> ib_send_wr *wr,
>                                  goto out;
> 
>                          case IB_WR_LOCAL_INV:
> -                               next_fence =
> MLX5_FENCE_MODE_INITIATOR_SMALL;
> +                               next_fence =
> MLX5_FENCE_MODE_STRONG_ORDERING;
>                                  qp->sq.wr_data[idx] = IB_WR_LOCAL_INV;
>                                  ctrl->imm = cpu_to_be32(wr->ex.invalidate_rkey);
>                                  set_linv_wr(qp, &seg, &size); @@ -3901,7 +3901,7 @@ int
> mlx5_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
>                                  break;
> 
>                          case IB_WR_REG_MR:
> -                               next_fence =
> MLX5_FENCE_MODE_INITIATOR_SMALL;
> +                               next_fence =
> MLX5_FENCE_MODE_STRONG_ORDERING;
>                                  qp->sq.wr_data[idx] = IB_WR_REG_MR;
>                                  ctrl->imm = cpu_to_be32(reg_wr(wr)->key);
>                                  err = set_reg_wr(qp, reg_wr(wr), &seg, &size);
> --
> 
> Note that this will have a big performance (negative) impact on small read
> workloads.

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Unexpected issues with 2 NVME initiators using the same target
  2017-02-27 20:33                   ` Sagi Grimberg
  (?)
  (?)
@ 2017-03-05 18:23                   ` Leon Romanovsky
  -1 siblings, 0 replies; 171+ messages in thread
From: Leon Romanovsky @ 2017-03-05 18:23 UTC (permalink / raw)


On Mon, Feb 27, 2017@10:33:16PM +0200, Sagi Grimberg wrote:
>
> Hey Joseph,
>
> > In our lab we are dealing with an issue which has some of the same symptoms.  Wanted to add to the thread in case it is useful here.  We have a target system with 16 Intel P3520 disks and a Mellanox CX4 50Gb NIC directly connected (no switch) to a single initiator system with a matching Mellanox CX4 50Gb NIC.  We are running Ubuntu 16.10 with 4.10-RC8 mainline kernel.  All drivers are kernel default drivers.  I've attached our nvmetcli json, and FIO workload, and dmesg from both systems.
> >
> > We are able to provoke this problem with a variety of workloads but a high bandwidth read operation seems to cause it the most reliably, harder to produce with smaller block sizes.  For some reason the problem seems produced when we stop and restart IO - I can run the FIO workload on the initiator system for 1-2 hours without any new events in dmesg, pushing about 5500MB/sec the whole time, then kill it and wait 10 seconds and restart it, and the errors and reconnect events happen reliably at that point.  Working to characterize further this week and also to see if we can produce on a smaller configuration.  Happy to provide any additional details that would be useful or try any fixes!
> >
> > On the initiator we see events like this:
> >
> > [51390.065641] mlx5_0:dump_cqe:262:(pid 0): dump error cqe
> > [51390.065644] 00000000 00000000 00000000 00000000
> > [51390.065645] 00000000 00000000 00000000 00000000
> > [51390.065646] 00000000 00000000 00000000 00000000
> > [51390.065648] 00000000 08007806 250003ab 02b9dcd2
> > [51390.065666] nvme nvme3: MEMREG for CQE 0xffff9fc845039410 failed with status memory management operation error (6)
> > [51390.079156] nvme nvme3: reconnecting in 10 seconds
> > [51400.432782] nvme nvme3: Successfully reconnected
>
> Seems to me this is a CX4 FW issue. Mellanox can elaborate on these
> vendor specific syndromes on this output.

0x06 - Memory_Window_Bind_Error
0x78 - MEMOP_FRWR_TPT
0x08 - Not free

The check is for both umr.check_free and mkey.free.
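
For anyone decoding these dumps by hand, here is a minimal user-space
sketch of where those bytes sit in the last printed line of the 64-byte
CQE ("00000000 08007806 250003ab 02b9dcd2" above). The offsets are my
reading of struct mlx5_err_cqe in include/linux/mlx5/device.h
(vendor_err_synd and syndrome are the two bytes just before
s_wqe_opcode_qpn), so please double-check against your tree:

/*
 * Illustrative only: decode the last 16 bytes of an mlx5
 * "dump error cqe" print.  The 0x08 byte sits in an area that is
 * reserved in the upstream struct, so treat it as a vendor-internal
 * detail.
 */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* bytes 48..63 of the CQE, exactly as printed (big-endian words) */
	uint32_t w[4] = { 0x00000000, 0x08007806, 0x250003ab, 0x02b9dcd2 };

	uint8_t hw_synd     = (w[1] >> 24) & 0xff; /* 0x08: "not free"          */
	uint8_t vendor_synd = (w[1] >>  8) & 0xff; /* 0x78: MEMOP_FRWR_TPT      */
	uint8_t syndrome    =  w[1]        & 0xff; /* 0x06: mem window bind err */
	uint32_t qpn        =  w[2] & 0xffffff;    /* low bits of s_wqe_opcode_qpn */

	printf("hw 0x%02x vendor 0x%02x syndrome 0x%02x qpn 0x%06x\n",
	       hw_synd, vendor_synd, syndrome, qpn);
	return 0;
}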

Hope it helps.

Thanks
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: not available
URL: <http://lists.infradead.org/pipermail/linux-nvme/attachments/20170305/a8859651/attachment.sig>

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Unexpected issues with 2 NVME initiators using the same target
  2017-02-27 20:33                   ` Sagi Grimberg
                                     ` (2 preceding siblings ...)
  (?)
@ 2017-03-06  0:07                   ` Max Gurtovoy
       [not found]                     ` <26912d0c-578f-26e9-490d-94fc95bdf259-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
                                       ` (2 more replies)
  -1 siblings, 3 replies; 171+ messages in thread
From: Max Gurtovoy @ 2017-03-06  0:07 UTC (permalink / raw)




On 2/27/2017 10:33 PM, Sagi Grimberg wrote:
>
> Hey Joseph,
>
>> In our lab we are dealing with an issue which has some of the same
>> symptoms.  Wanted to add to the thread in case it is useful here.  We
>> have a target system with 16 Intel P3520 disks and a Mellanox CX4 50Gb
>> NIC directly connected (no switch) to a single initiator system with a
>> matching Mellanox CX4 50Gb NIC.  We are running Ubuntu 16.10 with
>> 4.10-RC8 mainline kernel.  All drivers are kernel default drivers.
>> I've attached our nvmetcli json, and FIO workload, and dmesg from both
>> systems.
>>
>> We are able to provoke this problem with a variety of workloads but a
>> high bandwidth read operation seems to cause it the most reliably,
>> harder to produce with smaller block sizes.  For some reason the
>> problem seems produced when we stop and restart IO - I can run the FIO
>> workload on the initiator system for 1-2 hours without any new events
>> in dmesg, pushing about 5500MB/sec the whole time, then kill it and
>> wait 10 seconds and restart it, and the errors and reconnect events
>> happen reliably at that point.  Working to characterize further this
>> week and also to see if we can produce on a smaller configuration.
>> Happy to provide any additional details that would be useful or try
>> any fixes!
>>
>> On the initiator we see events like this:
>>
>> [51390.065641] mlx5_0:dump_cqe:262:(pid 0): dump error cqe
>> [51390.065644] 00000000 00000000 00000000 00000000
>> [51390.065645] 00000000 00000000 00000000 00000000
>> [51390.065646] 00000000 00000000 00000000 00000000
>> [51390.065648] 00000000 08007806 250003ab 02b9dcd2
>> [51390.065666] nvme nvme3: MEMREG for CQE 0xffff9fc845039410 failed
>> with status memory management operation error (6)
>> [51390.079156] nvme nvme3: reconnecting in 10 seconds
>> [51400.432782] nvme nvme3: Successfully reconnected
>
> Seems to me this is a CX4 FW issue. Mellanox can elaborate on these
> vendor specific syndromes on this output.
>
>> On the target we see events like this:
>>
>> [51370.394694] mlx5_0:dump_cqe:262:(pid 6623): dump error cqe
>> [51370.394696] 00000000 00000000 00000000 00000000
>> [51370.394697] 00000000 00000000 00000000 00000000
>> [51370.394699] 00000000 00000000 00000000 00000000
>> [51370.394701] 00000000 00008813 080003ea 00c3b1d2
>
> If the host is failing on memory mapping while the target is initiating
> rdma access it makes sense that it will see errors.
>
>>
>> Sometimes, but less frequently, we also will see events on the target
>> like this as part of the problem:
>>
>> [21322.678571] nvmet: ctrl 1 fatal error occurred!
>
> Again, also makes sense because for nvmet this is a fatal error and we
> need to teardown the controller.
>
> You can try out this patch to see if it makes the memreg issues to go
> away:
> --
> diff --git a/drivers/infiniband/hw/mlx5/qp.c
> b/drivers/infiniband/hw/mlx5/qp.c
> index ad8a2638e339..0f9a12570262 100644
> --- a/drivers/infiniband/hw/mlx5/qp.c
> +++ b/drivers/infiniband/hw/mlx5/qp.c
> @@ -3893,7 +3893,7 @@ int mlx5_ib_post_send(struct ib_qp *ibqp, struct
> ib_send_wr *wr,
>                                 goto out;
>
>                         case IB_WR_LOCAL_INV:
> -                               next_fence =
> MLX5_FENCE_MODE_INITIATOR_SMALL;
> +                               next_fence =
> MLX5_FENCE_MODE_STRONG_ORDERING;
>                                 qp->sq.wr_data[idx] = IB_WR_LOCAL_INV;
>                                 ctrl->imm =
> cpu_to_be32(wr->ex.invalidate_rkey);
>                                 set_linv_wr(qp, &seg, &size);
> @@ -3901,7 +3901,7 @@ int mlx5_ib_post_send(struct ib_qp *ibqp, struct
> ib_send_wr *wr,
>                                 break;
>
>                         case IB_WR_REG_MR:
> -                               next_fence =
> MLX5_FENCE_MODE_INITIATOR_SMALL;
> +                               next_fence =
> MLX5_FENCE_MODE_STRONG_ORDERING;
>                                 qp->sq.wr_data[idx] = IB_WR_REG_MR;
>                                 ctrl->imm = cpu_to_be32(reg_wr(wr)->key);
>                                 err = set_reg_wr(qp, reg_wr(wr), &seg,
> &size);
> --
>
> Note that this will have a big performance (negative) impact on small
> read workloads.
>

Hi Sagi,

I think we need to add fence to the UMR wqe.

so lets try this one:

diff --git a/drivers/infiniband/hw/mlx5/qp.c 
b/drivers/infiniband/hw/mlx5/qp.c
index ad8a263..c38c4fa 100644
--- a/drivers/infiniband/hw/mlx5/qp.c
+++ b/drivers/infiniband/hw/mlx5/qp.c
@@ -3737,8 +3737,7 @@ static void dump_wqe(struct mlx5_ib_qp *qp, int 
idx, int size_16)

  static u8 get_fence(u8 fence, struct ib_send_wr *wr)
  {
-       if (unlikely(wr->opcode == IB_WR_LOCAL_INV &&
-                    wr->send_flags & IB_SEND_FENCE))
+       if (wr->opcode == IB_WR_LOCAL_INV || wr->opcode == IB_WR_REG_MR)
                 return MLX5_FENCE_MODE_STRONG_ORDERING;

         if (unlikely(fence)) {


Couldn't repro that case, but I ran some initial tests in my lab (with my 
patch above) - these are not performance servers:

Initiator with 24 CPUs (2 threads/core, 6 cores/socket, 2 sockets), 
Connect IB (same driver mlx5_ib), kernel 4.10.0, fio test with 24 jobs 
and 128 iodepth.
register_always=N

Target - 1 subsystem with 1 ns (null_blk)

bs   read (without/with patch)   write (without/with patch)
--- --------------------------  ---------------------------
512     1019k / 1008k                 1004k / 992k
1k      1021k / 1013k                 1002k / 991k
4k      1030k / 1022k                 978k  / 969k

CPU usage is 100% in both cases on the initiator side.
I haven't seen a difference with bs = 16k.
Not as big a drop as we would expect.

Joseph,
please update after trying the 2 patches (separately) + perf numbers.

I'll take it internally and run some more tests with stronger servers 
using ConnectX4 NICs.

These patches are only for testing and not for submission yet. If we 
find them good enough for upstream then we need to distinguish between 
ConnectX-4/Connect-IB and ConnectX-5 (we probably won't see the issue there).

Thanks,
Max.

> _______________________________________________
> Linux-nvme mailing list
> Linux-nvme at lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply related	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-03-06  0:07                   ` Max Gurtovoy
@ 2017-03-06 11:28                         ` Sagi Grimberg
  2017-03-14 19:57                     ` Gruher, Joseph R
  2017-03-17 18:37                     ` Gruher, Joseph R
  2 siblings, 0 replies; 171+ messages in thread
From: Sagi Grimberg @ 2017-03-06 11:28 UTC (permalink / raw)
  To: Max Gurtovoy, Gruher, Joseph R, shahar.salzman, Laurence Oberman,
	Riches Jr, Robert M
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Robert LeBlanc

> Hi Sagi,
>
> I think we need to add fence to the UMR wqe.
>
> so lets try this one:
>
> diff --git a/drivers/infiniband/hw/mlx5/qp.c
> b/drivers/infiniband/hw/mlx5/qp.c
> index ad8a263..c38c4fa 100644
> --- a/drivers/infiniband/hw/mlx5/qp.c
> +++ b/drivers/infiniband/hw/mlx5/qp.c
> @@ -3737,8 +3737,7 @@ static void dump_wqe(struct mlx5_ib_qp *qp, int
> idx, int size_16)
>
>  static u8 get_fence(u8 fence, struct ib_send_wr *wr)
>  {
> -       if (unlikely(wr->opcode == IB_WR_LOCAL_INV &&
> -                    wr->send_flags & IB_SEND_FENCE))
> +       if (wr->opcode == IB_WR_LOCAL_INV || wr->opcode == IB_WR_REG_MR)
>                 return MLX5_FENCE_MODE_STRONG_ORDERING;
>
>         if (unlikely(fence)) {

This will kill performance, isn't there another fix that can
be applied just for retransmission flow?

> Couldn't repro that case but I run some initial tests in my Lab (with my
> patch above) - not performace servers:
>
> Initiator with 24 CPUs (2 threads/core, 6 cores/socket, 2 sockets),
> Connect IB (same driver mlx5_ib), kernel 4.10.0, fio test with 24 jobs
> and 128 iodepth.
> register_always=N
>
> Target - 1 subsystem with 1 ns (null_blk)
>
> bs   read (without/with patch)   write (without/with patch)
> --- --------------------------  ---------------------------
> 512     1019k / 1008k                 1004k / 992k
> 1k      1021k / 1013k                 1002k / 991k
> 4k      1030k / 1022k                 978k  / 969k
>
> CPU usage is 100% for both cases in the initiator side.
> haven't seen difference with bs = 16k.
> No so big drop like we would expect,

Obviously you won't see a drop without registering memory
for small IO (register_always=N), this would bypass registration
altogether... Please retest with register_always=Y.
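
To make the point concrete, here is a toy model of the host-side mapping
decision (names are invented for illustration and only approximate the
upstream nvme-rdma logic): with register_always=N a single-segment I/O
takes the inline or global-rkey path and never posts REG_MR/LOCAL_INV,
so a fence on those opcodes costs nothing in that test:

/*
 * Toy model, not the upstream code: all names here are invented for
 * illustration.  The takeaway is that with register_always=N a
 * single-segment I/O goes down the inline or global-rkey path, posts
 * no REG_MR/LOCAL_INV, and therefore cannot show the cost of a fence
 * on those opcodes.
 */
#include <stdbool.h>
#include <stdio.h>

enum map_kind { MAP_INLINE, MAP_GLOBAL_RKEY, MAP_FRWR };

static enum map_kind map_decision(bool is_write, int nents,
				  bool fits_inline, bool register_always)
{
	if (nents == 1) {
		if (is_write && fits_inline)
			return MAP_INLINE;      /* data carried in the command capsule */
		if (!register_always)
			return MAP_GLOBAL_RKEY; /* no per-I/O memory registration */
	}
	return MAP_FRWR;                        /* REG_MR + LOCAL_INV per I/O */
}

int main(void)
{
	/* 4k single-segment read, register_always=N: FRWR never exercised */
	printf("N: %d\n", map_decision(false, 1, false, false));
	/* same read with register_always=Y: FRWR path, the fence matters  */
	printf("Y: %d\n", map_decision(false, 1, false, true));
	return 0;
}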

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Unexpected issues with 2 NVME initiators using the same target
@ 2017-03-06 11:28                         ` Sagi Grimberg
  0 siblings, 0 replies; 171+ messages in thread
From: Sagi Grimberg @ 2017-03-06 11:28 UTC (permalink / raw)


> Hi Sagi,
>
> I think we need to add fence to the UMR wqe.
>
> so lets try this one:
>
> diff --git a/drivers/infiniband/hw/mlx5/qp.c
> b/drivers/infiniband/hw/mlx5/qp.c
> index ad8a263..c38c4fa 100644
> --- a/drivers/infiniband/hw/mlx5/qp.c
> +++ b/drivers/infiniband/hw/mlx5/qp.c
> @@ -3737,8 +3737,7 @@ static void dump_wqe(struct mlx5_ib_qp *qp, int
> idx, int size_16)
>
>  static u8 get_fence(u8 fence, struct ib_send_wr *wr)
>  {
> -       if (unlikely(wr->opcode == IB_WR_LOCAL_INV &&
> -                    wr->send_flags & IB_SEND_FENCE))
> +       if (wr->opcode == IB_WR_LOCAL_INV || wr->opcode == IB_WR_REG_MR)
>                 return MLX5_FENCE_MODE_STRONG_ORDERING;
>
>         if (unlikely(fence)) {

This will kill performance, isn't there another fix that can
be applied just for retransmission flow?

> Couldn't repro that case but I run some initial tests in my Lab (with my
> patch above) - not performace servers:
>
> Initiator with 24 CPUs (2 threads/core, 6 cores/socket, 2 sockets),
> Connect IB (same driver mlx5_ib), kernel 4.10.0, fio test with 24 jobs
> and 128 iodepth.
> register_always=N
>
> Target - 1 subsystem with 1 ns (null_blk)
>
> bs   read (without/with patch)   write (without/with patch)
> --- --------------------------  ---------------------------
> 512     1019k / 1008k                 1004k / 992k
> 1k      1021k / 1013k                 1002k / 991k
> 4k      1030k / 1022k                 978k  / 969k
>
> CPU usage is 100% for both cases in the initiator side.
> haven't seen difference with bs = 16k.
> No so big drop like we would expect,

Obviously you won't see a drop without registering memory
for small IO (register_always=N), this would bypass registration
altogether... Please retest with register_always=Y.

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Unexpected issues with 2 NVME initiators using the same target
  2017-03-06 11:28                         ` Sagi Grimberg
  (?)
@ 2017-03-07  9:27                         ` Max Gurtovoy
       [not found]                           ` <fbd647dd-3a16-8155-107d-f98e8326cc63-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2017-03-12 12:33                           ` Vladimir Neyelov
  -1 siblings, 2 replies; 171+ messages in thread
From: Max Gurtovoy @ 2017-03-07  9:27 UTC (permalink / raw)


Hi,

Shahar/Joseph, what is your link layer configuration (IB/Eth)?
In the Eth case, have you configured PFC? If not, can you try it?
I suspect that this is the root cause, and it might help you avoid
this case in the meantime while we're looking for the best solution.

Adding Vladimir, who will run iSER on his performance setup with the new 
fencing patch (this is not an NVMEoF-specific issue).
We can run also NVMEoF later on if needed.

Max.

On 3/6/2017 1:28 PM, Sagi Grimberg wrote:
>> Hi Sagi,
>>
>> I think we need to add fence to the UMR wqe.
>>
>> so lets try this one:
>>
>> diff --git a/drivers/infiniband/hw/mlx5/qp.c
>> b/drivers/infiniband/hw/mlx5/qp.c
>> index ad8a263..c38c4fa 100644
>> --- a/drivers/infiniband/hw/mlx5/qp.c
>> +++ b/drivers/infiniband/hw/mlx5/qp.c
>> @@ -3737,8 +3737,7 @@ static void dump_wqe(struct mlx5_ib_qp *qp, int
>> idx, int size_16)
>>
>>  static u8 get_fence(u8 fence, struct ib_send_wr *wr)
>>  {
>> -       if (unlikely(wr->opcode == IB_WR_LOCAL_INV &&
>> -                    wr->send_flags & IB_SEND_FENCE))
>> +       if (wr->opcode == IB_WR_LOCAL_INV || wr->opcode == IB_WR_REG_MR)
>>                 return MLX5_FENCE_MODE_STRONG_ORDERING;
>>
>>         if (unlikely(fence)) {
>
> This will kill performance, isn't there another fix that can
> be applied just for retransmission flow?
>
>> Couldn't repro that case but I run some initial tests in my Lab (with my
>> patch above) - not performace servers:
>>
>> Initiator with 24 CPUs (2 threads/core, 6 cores/socket, 2 sockets),
>> Connect IB (same driver mlx5_ib), kernel 4.10.0, fio test with 24 jobs
>> and 128 iodepth.
>> register_always=N
>>
>> Target - 1 subsystem with 1 ns (null_blk)
>>
>> bs   read (without/with patch)   write (without/with patch)
>> --- --------------------------  ---------------------------
>> 512     1019k / 1008k                 1004k / 992k
>> 1k      1021k / 1013k                 1002k / 991k
>> 4k      1030k / 1022k                 978k  / 969k
>>
>> CPU usage is 100% for both cases in the initiator side.
>> haven't seen difference with bs = 16k.
>> No so big drop like we would expect,
>
> Obviously you won't see a drop without registering memory
> for small IO (register_always=N), this would bypass registration
> altogether... Please retest with register_always=Y.

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-03-07  9:27                         ` Max Gurtovoy
@ 2017-03-07 13:41                               ` Sagi Grimberg
  2017-03-12 12:33                           ` Vladimir Neyelov
  1 sibling, 0 replies; 171+ messages in thread
From: Sagi Grimberg @ 2017-03-07 13:41 UTC (permalink / raw)
  To: Max Gurtovoy, Gruher, Joseph R, shahar.salzman, Laurence Oberman,
	Riches Jr, Robert M
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Robert LeBlanc,
	Vladimir Neyelov


> Hi,
>
> Shahar/Joseph, what is your link layer conf (IB/Eth) ?
> In eth case, have you configured some PFC ? if not, can you try it ?
> I suspect that this is the root cause

The root cause is that the device fails FRWR in the retransmission
flow. If PFC is not on, it will happen almost immediately; if it is,
it will still happen at some point...
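
To spell out the hazard as I understand it (a toy model, not a statement
of the device spec; all names below are invented for illustration): each
I/O registers and later invalidates the same MR, and if a REG_MR (UMR)
is processed while the previous use/invalidate of that mkey is still
outstanding - for example stalled behind retransmission - the mkey is
not free and the memreg fails, which is what the strong fence prevents:

/*
 * Toy simulation of the ordering hazard as I read this thread;
 * everything here is invented for illustration.  An MR cycles
 * REG_MR -> (I/O) -> LOCAL_INV.  If the next REG_MR is executed
 * before the delayed LOCAL_INV has taken effect, it finds the mkey
 * "not free" and the work request completes with an error.
 */
#include <stdbool.h>
#include <stdio.h>

enum op { REG_MR, LOCAL_INV };

static bool run(const enum op *ops, int n)
{
	bool mr_free = true;

	for (int i = 0; i < n; i++) {
		if (ops[i] == REG_MR) {
			if (!mr_free) {
				printf("REG_MR on a busy mkey -> memreg error\n");
				return false;
			}
			mr_free = false;	/* mkey now owns a mapping */
		} else {			/* LOCAL_INV */
			mr_free = true;		/* mapping invalidated     */
		}
	}
	printf("all operations completed\n");
	return true;
}

int main(void)
{
	/* fenced: the invalidate is forced to finish before the next REG_MR */
	enum op ordered[]   = { REG_MR, LOCAL_INV, REG_MR, LOCAL_INV };
	/* unfenced + retransmission delay: REG_MR overtakes the LOCAL_INV   */
	enum op reordered[] = { REG_MR, REG_MR, LOCAL_INV, LOCAL_INV };

	run(ordered, 4);
	run(reordered, 4);
	return 0;
}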

> and it might help you avoiding
> this case, meanwhile we're looking for for the best solution.
>
> Adding Vladimir that will run iSER on his performance setup with the new
> fencing patch (not an NVMEoF related issue).
> We can run also NVMEoF later on if needed.

Thanks.

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Unexpected issues with 2 NVME initiators using the same target
@ 2017-03-07 13:41                               ` Sagi Grimberg
  0 siblings, 0 replies; 171+ messages in thread
From: Sagi Grimberg @ 2017-03-07 13:41 UTC (permalink / raw)



> Hi,
>
> Shahar/Joseph, what is your link layer conf (IB/Eth) ?
> In eth case, have you configured some PFC ? if not, can you try it ?
> I suspect that this is the root cause

The root cause is that the device fails FRWR in the retransmission
flow. If PFC is not on, it will happen almost immediately; if it is,
it will still happen at some point...

> and it might help you avoiding
> this case, meanwhile we're looking for for the best solution.
>
> Adding Vladimir that will run iSER on his performance setup with the new
> fencing patch (not an NVMEoF related issue).
> We can run also NVMEoF later on if needed.

Thanks.

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-03-07 13:41                               ` Sagi Grimberg
@ 2017-03-09 12:18                                   ` shahar.salzman
  -1 siblings, 0 replies; 171+ messages in thread
From: shahar.salzman @ 2017-03-09 12:18 UTC (permalink / raw)
  To: Sagi Grimberg, Max Gurtovoy, Gruher, Joseph R, Laurence Oberman,
	Riches Jr, Robert M
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Robert LeBlanc,
	Vladimir Neyelov

Hi,

Sorry for the delay, I have been OOO for the past few days.

Indeed the underlying transport is Ethernet, and we have found that flow 
control is disabled on the switch side.
I still do not have access to the system to retest with flow control 
enabled and with 4.10, but I am working on assembling another system.

A little off-topic, but it may help me get the other system up faster: 
can I use a dual-port CX4 card as both initiator and target (i.e. one 
port going up to the switch, and the other coming back into the same 
system) without the kernel looping back? I have a spare CX4, so if 
possible, I will use it to build a mini system for recreates of this sort.

Thanks,
Shahar


On 03/07/2017 03:41 PM, Sagi Grimberg wrote:
>
>> Hi,
>>
>> Shahar/Joseph, what is your link layer conf (IB/Eth) ?
>> In eth case, have you configured some PFC ? if not, can you try it ?
>> I suspect that this is the root cause
>
> The root cause is that the device fails frwr in retransmission
> flow, if PFC is not on, it will happen almost immediately, if not
> it will happen at some point...
>
>> and it might help you avoiding
>> this case, meanwhile we're looking for for the best solution.
>>
>> Adding Vladimir that will run iSER on his performance setup with the new
>> fencing patch (not an NVMEoF related issue).
>> We can run also NVMEoF later on if needed.
>
> Thanks.


^ permalink raw reply	[flat|nested] 171+ messages in thread

* Unexpected issues with 2 NVME initiators using the same target
@ 2017-03-09 12:18                                   ` shahar.salzman
  0 siblings, 0 replies; 171+ messages in thread
From: shahar.salzman @ 2017-03-09 12:18 UTC (permalink / raw)


Hi,

Sorry for the delay, I have been OOO for the past few days.

Indeed the underlying transport is Ethernet, and we have found that flow 
control is disabled on the switch side.
I still do not have access to the system for re-tests with the flow 
control, and 4.10. But I am working on assembling another system.

A little off-topic, but something that may help me get the other system 
up faster, can I use a dual CX4 card as initiator and target (i.e. one 
port going up to the switch, and the other coming back into the system) 
without the kernel looping back? I have a spare CX4, so if possible, I 
will use it to build a mini system for recreates of this sort.

Thanks,
Shahar


On 03/07/2017 03:41 PM, Sagi Grimberg wrote:
>
>> Hi,
>>
>> Shahar/Joseph, what is your link layer conf (IB/Eth) ?
>> In eth case, have you configured some PFC ? if not, can you try it ?
>> I suspect that this is the root cause
>
> The root cause is that the device fails frwr in retransmission
> flow, if PFC is not on, it will happen almost immediately, if not
> it will happen at some point...
>
>> and it might help you avoiding
>> this case, meanwhile we're looking for for the best solution.
>>
>> Adding Vladimir that will run iSER on his performance setup with the new
>> fencing patch (not an NVMEoF related issue).
>> We can run also NVMEoF later on if needed.
>
> Thanks.

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Unexpected issues with 2 NVME initiators using the same target
  2017-03-07  9:27                         ` Max Gurtovoy
       [not found]                           ` <fbd647dd-3a16-8155-107d-f98e8326cc63-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2017-03-12 12:33                           ` Vladimir Neyelov
       [not found]                             ` <AM4PR0501MB278621363209E177A738D75FCB220-dp/nxUn679hhbxXPg6FtWcDSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
  1 sibling, 1 reply; 171+ messages in thread
From: Vladimir Neyelov @ 2017-03-12 12:33 UTC (permalink / raw)


Hi,
I tested for a performance regression with and without Max's patch.
I used two HP servers.

Initiator:
CPUs: 48
CPU: Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz
Hardware: ConnectX-4
Interface: Infiniband / Ethernet
Kernel: 3.10.0-327.el7.x86_64
OFED: MLNX_OFED_LINUX-4.0-1.5.6.0
OS: RHEL 7.2
Tuning commands:
modprobe ib_iser always_register=N
for i in `ls sd*`;do echo 2 > /sys/block/$i/queue/nomerges;done
for i in `ls sd*`;do echo 0 > /sys/block/$i/queue/add_random ;done
for i in `ls sd*`;do echo 1 > /sys/block/$i/queue/rq_affinity;done
for i in `ls sd*`;do echo noop > /sys/block/$i/queue/scheduler;done
echo performance > /sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor
service irqbalance stop
set_irq_affinity.sh ens4f0


Target:
Hardware: ConnectX-4
CPUs: 48
CPU: Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz
Interface: Infiniband

OS: RHEL 7.2
Kernel: 3.10.0-327.el7.x86_64
Tuning commands:
service irqbalance stop
set_irq_affinity.sh ens4f0
Target type: LIO

Command:												
fio --rw=write -bs=4K --numjobs=3 --iodepth=128 --runtime=60 --time_based --size=300k --loops=1 --ioengine=libaio --direct=1 --invalidate=1 --fsync_on_close=1 --randrepeat=1 --norandommap --group_reporting --exitall `cat disks`

Results:

With Max's patch (block 4K):

always reg          Y                N
write              1902K            1923.3K
read               1315K            2009K

Original OFED code (block 4K):

always reg          Y                N
write              1947K            1982K
read               1273K            1978K

Thanks,
Vladimir

-----Original Message-----
From: Max Gurtovoy 
Sent: Tuesday, March 7, 2017 11:28 AM
To: Sagi Grimberg <sagi at grimberg.me>; Gruher, Joseph R <joseph.r.gruher at intel.com>; shahar.salzman <shahar.salzman at gmail.com>; Laurence Oberman <loberman at redhat.com>; Riches Jr, Robert M <robert.m.riches.jr at intel.com>
Cc: linux-rdma at vger.kernel.org; linux-nvme at lists.infradead.org; Robert LeBlanc <robert at leblancnet.us>; Vladimir Neyelov <vladimirn at mellanox.com>
Subject: Re: Unexpected issues with 2 NVME initiators using the same target

Hi,

Shahar/Joseph, what is your link layer conf (IB/Eth) ?
In eth case, have you configured some PFC ? if not, can you try it ?
I suspect that this is the root cause and it might help you avoiding this case, meanwhile we're looking for for the best solution.

Adding Vladimir that will run iSER on his performance setup with the new fencing patch (not an NVMEoF related issue).
We can run also NVMEoF later on if needed.

Max.

On 3/6/2017 1:28 PM, Sagi Grimberg wrote:
>> Hi Sagi,
>>
>> I think we need to add fence to the UMR wqe.
>>
>> so lets try this one:
>>
>> diff --git a/drivers/infiniband/hw/mlx5/qp.c 
>> b/drivers/infiniband/hw/mlx5/qp.c index ad8a263..c38c4fa 100644
>> --- a/drivers/infiniband/hw/mlx5/qp.c
>> +++ b/drivers/infiniband/hw/mlx5/qp.c
>> @@ -3737,8 +3737,7 @@ static void dump_wqe(struct mlx5_ib_qp *qp, int 
>> idx, int size_16)
>>
>>  static u8 get_fence(u8 fence, struct ib_send_wr *wr)  {
>> -       if (unlikely(wr->opcode == IB_WR_LOCAL_INV &&
>> -                    wr->send_flags & IB_SEND_FENCE))
>> +       if (wr->opcode == IB_WR_LOCAL_INV || wr->opcode == 
>> + IB_WR_REG_MR)
>>                 return MLX5_FENCE_MODE_STRONG_ORDERING;
>>
>>         if (unlikely(fence)) {
>
> This will kill performance, isn't there another fix that can be 
> applied just for retransmission flow?
>
>> Couldn't repro that case but I run some initial tests in my Lab (with 
>> my patch above) - not performace servers:
>>
>> Initiator with 24 CPUs (2 threads/core, 6 cores/socket, 2 sockets), 
>> Connect IB (same driver mlx5_ib), kernel 4.10.0, fio test with 24 
>> jobs and 128 iodepth.
>> register_always=N
>>
>> Target - 1 subsystem with 1 ns (null_blk)
>>
>> bs   read (without/with patch)   write (without/with patch)
>> --- --------------------------  ---------------------------
>> 512     1019k / 1008k                 1004k / 992k
>> 1k      1021k / 1013k                 1002k / 991k
>> 4k      1030k / 1022k                 978k  / 969k
>>
>> CPU usage is 100% for both cases in the initiator side.
>> haven't seen difference with bs = 16k.
>> No so big drop like we would expect,
>
> Obviously you won't see a drop without registering memory for small IO 
> (register_always=N), this would bypass registration altogether... 
> Please retest with register_always=Y.

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-03-12 12:33                           ` Vladimir Neyelov
@ 2017-03-13  9:43                                 ` Sagi Grimberg
  0 siblings, 0 replies; 171+ messages in thread
From: Sagi Grimberg @ 2017-03-13  9:43 UTC (permalink / raw)
  To: Vladimir Neyelov, Max Gurtovoy, Gruher, Joseph R, shahar.salzman,
	Laurence Oberman, Riches Jr, Robert M
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Robert LeBlanc


> Patched  by patch of Max (block 4K):
>
> allways reg          Y                N
> write              1902K            1923.3K
> read               1315K            2009K
>
> Original  OFED code (block 4K)
>
> allways reg          Y                N
> write              1947K           1982K
> read               1273K           1978K												

First, the write comparison is redundant because
we send immediate data without memory registration.

And, I'd compare against upstream code and not OFED.

So it seems that strong fencing does not affect performance
from the ULP point of view, surprising...
I'd suggest comparing on nvmf and srp as well.

If this is the case, and it indeed resolves the issue, we
should move forward with it as is.

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Unexpected issues with 2 NVME initiators using the same target
@ 2017-03-13  9:43                                 ` Sagi Grimberg
  0 siblings, 0 replies; 171+ messages in thread
From: Sagi Grimberg @ 2017-03-13  9:43 UTC (permalink / raw)



> Patched  by patch of Max (block 4K):
>
> allways reg          Y                N
> write              1902K            1923.3K
> read               1315K            2009K
>
> Original  OFED code (block 4K)
>
> allways reg          Y                N
> write              1947K           1982K
> read               1273K           1978K												

First, the write comparison is redundant because
we send immediate data without memory registration.

And, I'd compare against upstream code and not OFED.

So it seems that strong fencing does not affect performance
from the ULP point of view, surprising...
I'd suggest comparing on nvmf and srp as well.

If this is the case, and it indeed resolves the issue, we
should move forward with it as is.

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Unexpected issues with 2 NVME initiators using the same target
  2017-03-13  9:43                                 ` Sagi Grimberg
  (?)
@ 2017-03-14  8:55                                 ` Max Gurtovoy
  -1 siblings, 0 replies; 171+ messages in thread
From: Max Gurtovoy @ 2017-03-14  8:55 UTC (permalink / raw)




On 3/13/2017 11:43 AM, Sagi Grimberg wrote:
>
>> Patched  by patch of Max (block 4K):
>>
>> allways reg          Y                N
>> write              1902K            1923.3K
>> read               1315K            2009K
>>
>> Original  OFED code (block 4K)
>>
>> allways reg          Y                N
>> write              1947K           1982K
>> read               1273K
>> 1978K
>
> First, the write comparison is redundant because
> we send immediate data without memory registration.
>
> And, I'd compare against upstream code and not OFED.

I don't think we'll get different results (my first tests were with the 
upstream nvmf driver).

>
> So it seems that strong fencing does not effect performance
> from the ULP point of view, surprising...
> I'd suggest comparing on nvmf and srp as well.

I agree. We also want to run other application types with it in our 
performance lab.

>
> If this is the case, and it indeed resolves the issue, we
> should move forward with it as is.

Not as is, because it will also affect ConnectX-5, where it is not needed.
We'll need to update the patch before submission.

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Unexpected issues with 2 NVME initiators using the same target
  2017-03-06  0:07                   ` Max Gurtovoy
       [not found]                     ` <26912d0c-578f-26e9-490d-94fc95bdf259-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2017-03-14 19:57                     ` Gruher, Joseph R
  2017-03-14 23:42                       ` Gruher, Joseph R
  2017-03-17 18:37                     ` Gruher, Joseph R
  2 siblings, 1 reply; 171+ messages in thread
From: Gruher, Joseph R @ 2017-03-14 19:57 UTC (permalink / raw)



> >> On the initiator we see events like this:
> >>
> >> [51390.065641] mlx5_0:dump_cqe:262:(pid 0): dump error cqe
> >> [51390.065644] 00000000 00000000 00000000 00000000
> >> [51390.065645] 00000000 00000000 00000000 00000000
> >> [51390.065646] 00000000 00000000 00000000 00000000
> >> [51390.065648] 00000000 08007806 250003ab 02b9dcd2
> >> [51390.065666] nvme nvme3: MEMREG for CQE 0xffff9fc845039410 failed
> >> with status memory management operation error (6)
> >> [51390.079156] nvme nvme3: reconnecting in 10 seconds
> >> [51400.432782] nvme nvme3: Successfully reconnected
> >
> > Seems to me this is a CX4 FW issue. Mellanox can elaborate on these
> > vendor specific syndromes on this output.
> >
> >> On the target we see events like this:
> >>
> >> [51370.394694] mlx5_0:dump_cqe:262:(pid 6623): dump error cqe
> >> [51370.394696] 00000000 00000000 00000000 00000000
> >> [51370.394697] 00000000 00000000 00000000 00000000
> >> [51370.394699] 00000000 00000000 00000000 00000000
> >> [51370.394701] 00000000 00008813 080003ea 00c3b1d2
> >
> > If the host is failing on memory mapping while the target is
> > initiating rdma access it makes sense that it will see errors.
> >
> > You can try out this patch to see if it makes the memreg issues to go
> > away:
> > --
> > diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
> > index ad8a2638e339..0f9a12570262 100644
> > --- a/drivers/infiniband/hw/mlx5/qp.c
> > +++ b/drivers/infiniband/hw/mlx5/qp.c
> > @@ -3893,7 +3893,7 @@ int mlx5_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
> >                                 goto out;
> >
> >                         case IB_WR_LOCAL_INV:
> > -                               next_fence = MLX5_FENCE_MODE_INITIATOR_SMALL;
> > +                               next_fence = MLX5_FENCE_MODE_STRONG_ORDERING;
> >                                 qp->sq.wr_data[idx] = IB_WR_LOCAL_INV;
> >                                 ctrl->imm = cpu_to_be32(wr->ex.invalidate_rkey);
> >                                 set_linv_wr(qp, &seg, &size);
> > @@ -3901,7 +3901,7 @@ int mlx5_ib_post_send(struct ib_qp *ibqp, struct ib_send_wr *wr,
> >                                 break;
> >
> >                         case IB_WR_REG_MR:
> > -                               next_fence = MLX5_FENCE_MODE_INITIATOR_SMALL;
> > +                               next_fence = MLX5_FENCE_MODE_STRONG_ORDERING;
> >                                 qp->sq.wr_data[idx] = IB_WR_REG_MR;
> >                                 ctrl->imm = cpu_to_be32(reg_wr(wr)->key);
> >                                 err = set_reg_wr(qp, reg_wr(wr), &seg, &size);
> > --
> >
> > Note that this will have a big performance (negative) impact on small
> > read workloads.
> >
> 
> Hi Sagi,
> 
> I think we need to add fence to the UMR wqe.
> 
> so lets try this one:
> 
> diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
> index ad8a263..c38c4fa 100644
> --- a/drivers/infiniband/hw/mlx5/qp.c
> +++ b/drivers/infiniband/hw/mlx5/qp.c
> @@ -3737,8 +3737,7 @@ static void dump_wqe(struct mlx5_ib_qp *qp, int idx,
> int size_16)
> 
>   static u8 get_fence(u8 fence, struct ib_send_wr *wr)
>   {
> -       if (unlikely(wr->opcode == IB_WR_LOCAL_INV &&
> -                    wr->send_flags & IB_SEND_FENCE))
> +       if (wr->opcode == IB_WR_LOCAL_INV || wr->opcode == IB_WR_REG_MR)
>                  return MLX5_FENCE_MODE_STRONG_ORDERING;
> 
>          if (unlikely(fence)) {
> 
> 
> please update after trying the 2 patches (seperatly) + perf numbers.
> 
> These patches are only for testing and not for submission yet. If we find them
> good enought for upstream then we need to distinguish between ConnexcX4/IB
> and ConnectX5 (we probably won't see it there).

Sorry for the slow response here, we had to churn through some other testing on our test bed before we could try out these patches.  We tested the patches with a single target system and a single initiator system connected via CX4s at 25Gb through an Arista 7060X switch with regular Ethernet flow control enabled (no PFC/DCB - but the switch has no other traffic on it).  We connected 8 Intel P3520 1.2 TB SSDs from the target to the initiator with 16 IO queues per disk.  Then we ran FIO with a 4KB workload, random IO pattern, 4 jobs per disk, queue depth 32 per job, testing 100% read, 70/30 read/write, and 100% write workloads.  We used the default 4.10-RC8 kernel, then patched the same kernel with Sagi's patch, and then separately with Max's patch, and then both patches at the same time (just for fun).  The patches were applied on both target and initiator.  In general we do seem to see a performance hit on small block read workloads but it is not massive, looks like about 10%.  We also tested some large block transfers and didn't see any impact.  Results here are in 4KB IOPS:

Read/Write	4.10-RC8	Patch 1 (Sagi)	Patch 2 (Max)	Both Patches
100/0		667,158		611,737		619,586		607,080
70/30		941,352		890,962		884,222		876,926
0/100		667,379		666,000		666,093		666,144

The next step for us is to retest at 50Gb - please note the failure we originally described has only been seen when running 50Gb, and has not been observed at 25Gb, so we don't yet have a conclusion on whether the patch fixes the original issue.  We should have those results later this week if all goes well.  

Let me know if you need more details on the results so far or the test configuration.

It is also worth noting the max throughput above is being limited by the 25Gb link.  When we test at 50Gb should we include some tests with fewer drives that would be disk IO bound instead of network bound, or is network bound the more interesting test case for these patches?

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Unexpected issues with 2 NVME initiators using the same target
  2017-03-14 19:57                     ` Gruher, Joseph R
@ 2017-03-14 23:42                       ` Gruher, Joseph R
  2017-03-16  0:03                         ` Gruher, Joseph R
  0 siblings, 1 reply; 171+ messages in thread
From: Gruher, Joseph R @ 2017-03-14 23:42 UTC (permalink / raw)


> We tested the patches
> with a single target system and a single initiator system connected via CX4s at
> 25Gb through an Arista 7060X switch with regular Ethernet flow control
> enabled (no PFC/DCB - but the switch has no other traffic on it).  We connected
> 8 Intel P3520 1.2 TB SSDs from the target to the initiator with 16 IO queues per
> disk.  Then we ran FIO with a 4KB workload, random IO pattern, 4 jobs per disk,
> queue depth 32 per job, testing 100% read, 70/30 read/write, and 100% write
> workloads.  We used the default 4.10-RC8 kernel, then patched the same kernel
> with Sagi's patch, and then separately with Max's patch, and then both patches
> at the same time (just for fun).  The patches were applied on both target and
> initiator.  In general we do see to see a performance hit on small block read
> workloads but it is not massive, looks like about 10%.  We also tested some
> large block transfers and didn't see any impact.  Results here are in 4KB IOPS:
> 
> Read/Write	4.10-RC8	Patch 1 (Sagi)	Patch 2 (Max)	Both Patches
> 100/0		667,158		611,737		619,586		607,080
> 70/30		941,352		890,962		884,222		876,926
> 0/100		667,379		666,000		666,093		666,144
> 

One additional result from our 25Gb testing - we did do an additional test with the same configuration as above but we ran just a single disk, and a single FIO job with queue depth 8.  This is a light workload designed to examine latency under lower load, when not bottlenecked on network or disk throughput, as opposed to driving the system to max IOPS.  Here we see about a 30usec (20%) increase to latency on 4KB random reads when we apply Sagi's patch and a corresponding dip in IOPS (only about a 2% hit to latency was seen with Max's patch):

4.10-RC8	Patch 1		4.10-RC8 Kernel		Patch 1
IOPS		IOPS		Latency (usec)		Latency (usec)
49,304		40,490		160.3			192.9 

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Unexpected issues with 2 NVME initiators using the same target
  2017-03-14 23:42                       ` Gruher, Joseph R
@ 2017-03-16  0:03                         ` Gruher, Joseph R
  0 siblings, 0 replies; 171+ messages in thread
From: Gruher, Joseph R @ 2017-03-16  0:03 UTC (permalink / raw)


> > We tested the patches
> > with a single target system and a single initiator system connected
> > via CX4s at 25Gb through an Arista 7060X switch with regular Ethernet
> > flow control enabled (no PFC/DCB - but the switch has no other traffic
> > on it).  We connected
> > 8 Intel P3520 1.2 TB SSDs from the target to the initiator with 16 IO
> > queues per disk.  Then we ran FIO with a 4KB workload, random IO
> > pattern, 4 jobs per disk, queue depth 32 per job, testing 100% read,
> > 70/30 read/write, and 100% write workloads.  We used the default
> > 4.10-RC8 kernel, then patched the same kernel with Sagi's patch, and
> > then separately with Max's patch, and then both patches at the same
> > time (just for fun).  The patches were applied on both target and
> > initiator.  In general we do see to see a performance hit on small
> > block read workloads but it is not massive, looks like about 10%.  We also
> tested some large block transfers and didn't see any impact.  Results here are
> in 4KB IOPS:
> >
> > Read/Write	4.10-RC8	Patch 1 (Sagi)	Patch 2 (Max)	Both Patches
> > 100/0		667,158		611,737		619,586		607,080
> > 70/30		941,352		890,962		884,222		876,926
> > 0/100		667,379		666,000		666,093		666,144
> >
> 
> One additional result from our 25Gb testing - we did do an additional test with
> the same configuration as above but we ran just a single disk, and a single FIO
> job with queue depth 8.  This is a light workload designed to examine latency
> under lower load, when not bottlenecked on network or disk throughput, as
> opposed to driving the system to max IOPS.  Here we see about a 30usec (20%)
> increase to latency on 4KB random reads when we apply Sagi's patch and a
> corresponding dip in IOPS (only about a 2% hit to latency was seen with Max's
> patch):
> 
> 4.10-RC8	Patch 1		4.10-RC8 Kernel		Patch 1
> IOPS		IOPS		Latency (usec)		Latency (usec)
> 49,304		40,490		160.3			192.9

After moving back to 50Gb CX4 NICs we tested the patches from Sagi and Max.  With Sagi's patch we seem to see a reduced frequency of errors, especially on the target, but errors still definitely occur.  We ran 48 different two-minute workloads and saw roughly 30 errors on the initiator and exactly two on the target.

Target error example:

[ 4336.224633] mlx5_0:dump_cqe:262:(pid 12397): dump error cqe
[ 4336.224636] 00000000 00000000 00000000 00000000
[ 4336.224636] 00000000 00000000 00000000 00000000
[ 4336.224637] 00000000 00000000 00000000 00000000
[ 4336.224637] 00000000 00008813 080000ca 3fb97fd3

Initiator error example:

[ 3134.447002] mlx5_0:dump_cqe:262:(pid 0): dump error cqe
[ 3134.447006] 00000000 00000000 00000000 00000000
[ 3134.447007] 00000000 00000000 00000000 00000000
[ 3134.447008] 00000000 00000000 00000000 00000000
[ 3134.447010] 00000000 08007806 250001a1 55a128d3
[ 3134.447032] nvme nvme0: MEMREG for CQE 0xffff91458a81a650 failed with status memory management operation error (6)
[ 3134.460612] nvme nvme0: reconnecting in 10 seconds
[ 3144.733988] nvme nvme0: Successfully reconnected

Full dmesg output from both systems is attached (it has a few annotations in it about what workload were running at the time of the errors - please just ignore those).

With Max's patch we have so far not produced any errors!  We will continue testing it.  We are also still working to assess the performance impact of Max's patch on the 50Gb configuration.  Since we get the errors without the patch (which then cause the initiator to disconnect and reconnect and thus affect performance) we cannot just run our automated test with and without the patch and compare the two results.  We will do some targeted testing to see if we can capture some unpatched runs that don't have errors and use those to assess the performance impact of Max's patch on the same workloads.

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: patch-test-01-50g-patch1-i03-dmesg.txt
URL: <http://lists.infradead.org/pipermail/linux-nvme/attachments/20170316/6ef575ae/attachment-0002.txt>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: patch-test-01-50g-patch1-t01-dmesg.txt
URL: <http://lists.infradead.org/pipermail/linux-nvme/attachments/20170316/6ef575ae/attachment-0003.txt>

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Unexpected issues with 2 NVME initiators using the same target
  2017-03-06  0:07                   ` Max Gurtovoy
       [not found]                     ` <26912d0c-578f-26e9-490d-94fc95bdf259-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2017-03-14 19:57                     ` Gruher, Joseph R
@ 2017-03-17 18:37                     ` Gruher, Joseph R
  2017-03-17 19:49                       ` Max Gurtovoy
  2 siblings, 1 reply; 171+ messages in thread
From: Gruher, Joseph R @ 2017-03-17 18:37 UTC (permalink / raw)




> -----Original Message-----
> From: Max Gurtovoy [mailto:maxg at mellanox.com]
> 
> I think we need to add fence to the UMR wqe.
> 
> diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
> index ad8a263..c38c4fa 100644
> --- a/drivers/infiniband/hw/mlx5/qp.c
> +++ b/drivers/infiniband/hw/mlx5/qp.c
> @@ -3737,8 +3737,7 @@ static void dump_wqe(struct mlx5_ib_qp *qp, int idx,
> int size_16)
> 
>   static u8 get_fence(u8 fence, struct ib_send_wr *wr)
>   {
> -       if (unlikely(wr->opcode == IB_WR_LOCAL_INV &&
> -                    wr->send_flags & IB_SEND_FENCE))
> +       if (wr->opcode == IB_WR_LOCAL_INV || wr->opcode == IB_WR_REG_MR)
>                  return MLX5_FENCE_MODE_STRONG_ORDERING;
> 
>          if (unlikely(fence)) {
> 
> Joseph,
> please update after trying the 2 patches (seperatly) + perf numbers.
> 
> I'll take it internally and run some more tests with stronger servers using
> ConnectX4 NICs.
> 
> These patches are only for testing and not for submission yet. If we find them
> good enought for upstream then we need to distinguish between ConnexcX4/IB
> and ConnectX5 (we probably won't see it there).

Hi Max-

Our testing on this patch looks good, failures seem completely alleviated.  We are not really detecting any performance impact to small block read workloads.  Data below uses 50Gb CX4 initiator and target and FIO to generate load.  Each disk runs 4KB random reads with 4 jobs and queue depth 32 per job.  Initiator uses 16 IO queues per attached subsystem.  We tested with 2 P3520 disks attached, and again with 7 disks attached.

				IOPS		Latency (usec)
4.10-RC8	2 disks		545,695		466.0 
With Patch	2 disks		587,663		432.8
4.10-RC8	7 disks		1,074,311	829.5
With Patch	7 disks		1,080,099	825.4

You mention these patches are only for testing.  How do we get to something which can be submitted to upstream?

Thanks!

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Unexpected issues with 2 NVME initiators using the same target
  2017-03-17 18:37                     ` Gruher, Joseph R
@ 2017-03-17 19:49                       ` Max Gurtovoy
       [not found]                         ` <DE927C68B458BE418D582EC97927A928550391C2@ORSMSX113.amr.corp.intel.com>
       [not found]                         ` <809f87ab-b787-9d40-5840-07500d12e81a-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 2 replies; 171+ messages in thread
From: Max Gurtovoy @ 2017-03-17 19:49 UTC (permalink / raw)




On 3/17/2017 8:37 PM, Gruher, Joseph R wrote:
>
>
>> -----Original Message-----
>> From: Max Gurtovoy [mailto:maxg at mellanox.com]
>>
>> I think we need to add fence to the UMR wqe.
>>
>> diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
>> index ad8a263..c38c4fa 100644
>> --- a/drivers/infiniband/hw/mlx5/qp.c
>> +++ b/drivers/infiniband/hw/mlx5/qp.c
>> @@ -3737,8 +3737,7 @@ static void dump_wqe(struct mlx5_ib_qp *qp, int idx,
>> int size_16)
>>
>>   static u8 get_fence(u8 fence, struct ib_send_wr *wr)
>>   {
>> -       if (unlikely(wr->opcode == IB_WR_LOCAL_INV &&
>> -                    wr->send_flags & IB_SEND_FENCE))
>> +       if (wr->opcode == IB_WR_LOCAL_INV || wr->opcode == IB_WR_REG_MR)
>>                  return MLX5_FENCE_MODE_STRONG_ORDERING;
>>
>>          if (unlikely(fence)) {
>>
>> Joseph,
>> please update after trying the 2 patches (seperatly) + perf numbers.
>>
>> I'll take it internally and run some more tests with stronger servers using
>> ConnectX4 NICs.
>>
>> These patches are only for testing and not for submission yet. If we find them
>> good enought for upstream then we need to distinguish between ConnexcX4/IB
>> and ConnectX5 (we probably won't see it there).
>
> Hi Max-
>
> Our testing on this patch looks good, failures seem completely alleviated.  We are not really detecting any performance impact to small block read workloads.  Data below uses 50Gb CX4 initiator and target and FIO to generate load.  Each disk runs 4KB random reads with 4 jobs and queue depth 32 per job.  Initiator uses 16 IO queues per attached subsystem.  We tested with 2 P3520 disks attached, and again with 7 disks attached.
>
> 				IOPS		Latency (usec)
> 4.10-RC8	2 disks		545,695		466.0
> With Patch	2 disks		587,663		432.8
> 4.10-RC8	7 disks		1,074,311	829.5
> With Patch	7 disks		1,080,099	825.4

Very nice.
We also ran testing on null devices in our labs with iSER/NVMf (these 
show the network/transport layer performance better) and the impact was 
tolerable.

>
> You mention these patches are only for testing.  How do we get to something which can be submitted to upstream?

Yes, we need to be careful and not add the strong fence if it's not a must.
I'll be out for the upcoming week, but I'll ask our mlx5 maintainers to 
prepare a suitable patch and check the performance numbers of some other 
applications.
Thanks for the testing; you can use this patch in the meantime until we 
push the formal solution.

>
> Thanks!
>

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Unexpected issues with 2 NVME initiators using the same target
       [not found]                         ` <DE927C68B458BE418D582EC97927A928550391C2@ORSMSX113.amr.corp.intel.com>
@ 2017-03-24 18:30                           ` Gruher, Joseph R
  2017-03-27 14:17                             ` Max Gurtovoy
       [not found]                             ` <DE927C68B458BE418D582EC97927A928550419FA-8oqHQFITsIFQxe9IK+vIArfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  0 siblings, 2 replies; 171+ messages in thread
From: Gruher, Joseph R @ 2017-03-24 18:30 UTC (permalink / raw)


> > >> From: Max Gurtovoy [mailto:maxg at mellanox.com]
> > >>
> > >> I think we need to add fence to the UMR wqe.
> > >>
> > >> diff --git a/drivers/infiniband/hw/mlx5/qp.c
> > >> b/drivers/infiniband/hw/mlx5/qp.c index ad8a263..c38c4fa 100644
> > >> --- a/drivers/infiniband/hw/mlx5/qp.c
> > >> +++ b/drivers/infiniband/hw/mlx5/qp.c
> > >> @@ -3737,8 +3737,7 @@ static void dump_wqe(struct mlx5_ib_qp *qp,
> > >> int idx, int size_16)
> > >>
> > >>   static u8 get_fence(u8 fence, struct ib_send_wr *wr)
> > >>   {
> > >> -       if (unlikely(wr->opcode == IB_WR_LOCAL_INV &&
> > >> -                    wr->send_flags & IB_SEND_FENCE))
> > >> +       if (wr->opcode == IB_WR_LOCAL_INV || wr->opcode ==
> > >> + IB_WR_REG_MR)
> > >>                  return MLX5_FENCE_MODE_STRONG_ORDERING;
> > >>
> > >>          if (unlikely(fence)) {
> > >>
> > >> Joseph,
> > >> please update after trying the 2 patches (seperatly) + perf numbers.
> > >>
> > >> I'll take it internally and run some more tests with stronger
> > >> servers using
> > >> ConnectX4 NICs.
> > >>
> > >> These patches are only for testing and not for submission yet. If
> > >> we find them good enought for upstream then we need to distinguish
> > >> between ConnexcX4/IB and ConnectX5 (we probably won't see it there).
> > >
> > > Hi Max-
> > >
> > > Our testing on this patch looks good, failures seem completely
> > > alleviated.  We
> > are not really detecting any performance impact to small block read
> > workloads.  Data below uses 50Gb CX4 initiator and target and FIO to
> > generate load.  Each disk runs 4KB random reads with 4 jobs and queue depth
> 32 per job.
> > Initiator uses 16 IO queues per attached subsystem.  We tested with 2
> > P3520 disks attached, and again with 7 disks attached.
> > >
> > > 				IOPS		Latency (usec)
> > > 4.10-RC8	2 disks		545,695		466.0
> > > With Patch	2 disks		587,663		432.8
> > > 4.10-RC8	7 disks		1,074,311	829.5
> > > With Patch	7 disks		1,080,099	825.4
> >
> > Very nice.
> > We also run testing on null devices in our labs on iSER/NVMf (show
> > better the network/transport layer performance) and the impact was
> Atolerable.
> >
> > >
> > > You mention these patches are only for testing.  How do we get to
> > > something
> > which can be submitted to upstream?
> >
> > Yes, we need to be careful and not put the strong_fence if it's not a must.
> > I'll be out for the upcoming week, but I'll ask our mlx5 maintainers
> > to prepare a suitable patch and check some other applications
> > performance numbers.
> > Thanks for the testing, you can use this patch meanwhile till we push
> > the formal solution.
> 
> With additional testing on this patch we are now encountering what seems to
> be a new failure.  It takes hours of testing to reproduce but we've been able to
> reproduce on 2 out of 2 overnight runs of continuous testing.  I cannot say
> conclusively the failure is due to the patch, as we cannot run this exact
> configuration without the patch, but I can say we did not see this failure mode
> in previous testing.  Either the patch induced the new failure mode, or perhaps
> this problem was always present and was just exposed by adding the patch.  We
> will continue trying to characterize the failure.
> 
> I've attached a couple dmesg logs of two different reproductions.  The t01 files
> are the target system and the i03 files are the initiator system.  Failure seems
> to start with a sequence like below on the target.  Thoughts?
> 
> [38875.102023] nvmet: ctrl 1 keep-alive timer (15 seconds) expired!
> [38875.108750] nvmet: ctrl 1 fatal error occurred!
> [39028.696921] INFO: task kworker/7:3:10534 blocked for more than 120 seconds.
> [39028.704813]       Not tainted 4.10.0-rc8patch-2-get-fence #5
> [39028.711147] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [39028.719900] kworker/7:3     D    0 10534      2 0x00000000
> [39028.719908] Workqueue: events nvmet_rdma_release_queue_work [nvmet_rdma]
> [39028.719909] Call Trace:
> [39028.719914]  __schedule+0x233/0x6f0
> [39028.719918]  ? sched_clock+0x9/0x10
> [39028.719919]  schedule+0x36/0x80
> [39028.719921]  schedule_timeout+0x22a/0x3f0
> [39028.719924]  ? vprintk_emit+0x312/0x4a0
> [39028.719927]  ? __kfifo_to_user_r+0xb0/0xb0
> [39028.719929]  wait_for_completion+0xb4/0x140
> [39028.719930]  ? wake_up_q+0x80/0x80
> [39028.719933]  nvmet_sq_destroy+0x41/0xf0 [nvmet]
> [39028.719935]  nvmet_rdma_free_queue+0x28/0xa0 [nvmet_rdma]
> [39028.719936]  nvmet_rdma_release_queue_work+0x25/0x50 [nvmet_rdma]
> [39028.719939]  process_one_work+0x1fc/0x4b0
> [39028.719940]  worker_thread+0x4b/0x500
> [39028.719942]  kthread+0x101/0x140
> [39028.719943]  ? process_one_work+0x4b0/0x4b0
> [39028.719945]  ? kthread_create_on_node+0x60/0x60
> [39028.719946]  ret_from_fork+0x2c/0x40

Hey folks.  Apologies if this message comes through twice, but when I originally sent it the list flagged it as too large due to the dmesg log attachments, and then a coworker just told me they never saw it, so I don't think it made it through on the first attempt.

Please see last note above and dmesg example attached - after more extensive testing with Max's patch we are still able to produce cqe dump errors (at a much lower frequency) as well as a new failure mode involving a crash dump.
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: full-std-50g-round2-t01-dmesg.txt
URL: <http://lists.infradead.org/pipermail/linux-nvme/attachments/20170324/6a4aff65/attachment-0002.txt>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: full-std-50g-single-t01-dmesg.txt
URL: <http://lists.infradead.org/pipermail/linux-nvme/attachments/20170324/6a4aff65/attachment-0003.txt>

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Unexpected issues with 2 NVME initiators using the same target
  2017-03-24 18:30                           ` Gruher, Joseph R
@ 2017-03-27 14:17                             ` Max Gurtovoy
  2017-03-27 15:39                               ` Gruher, Joseph R
       [not found]                             ` <DE927C68B458BE418D582EC97927A928550419FA-8oqHQFITsIFQxe9IK+vIArfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  1 sibling, 1 reply; 171+ messages in thread
From: Max Gurtovoy @ 2017-03-27 14:17 UTC (permalink / raw)




On 3/24/2017 9:30 PM, Gruher, Joseph R wrote:
>>>>> From: Max Gurtovoy [mailto:maxg at mellanox.com]
>>>>>
>>>>> I think we need to add fence to the UMR wqe.
>>>>>
>>>>> diff --git a/drivers/infiniband/hw/mlx5/qp.c
>>>>> b/drivers/infiniband/hw/mlx5/qp.c index ad8a263..c38c4fa 100644
>>>>> --- a/drivers/infiniband/hw/mlx5/qp.c
>>>>> +++ b/drivers/infiniband/hw/mlx5/qp.c
>>>>> @@ -3737,8 +3737,7 @@ static void dump_wqe(struct mlx5_ib_qp *qp,
>>>>> int idx, int size_16)
>>>>>
>>>>>   static u8 get_fence(u8 fence, struct ib_send_wr *wr)
>>>>>   {
>>>>> -       if (unlikely(wr->opcode == IB_WR_LOCAL_INV &&
>>>>> -                    wr->send_flags & IB_SEND_FENCE))
>>>>> +       if (wr->opcode == IB_WR_LOCAL_INV || wr->opcode ==
>>>>> + IB_WR_REG_MR)
>>>>>                  return MLX5_FENCE_MODE_STRONG_ORDERING;
>>>>>
>>>>>          if (unlikely(fence)) {
>>>>>
>>>>> Joseph,
>>>>> please update after trying the 2 patches (seperatly) + perf numbers.
>>>>>
>>>>> I'll take it internally and run some more tests with stronger
>>>>> servers using
>>>>> ConnectX4 NICs.
>>>>>
>>>>> These patches are only for testing and not for submission yet. If
>>>>> we find them good enought for upstream then we need to distinguish
>>>>> between ConnexcX4/IB and ConnectX5 (we probably won't see it there).
>>>>
>>>> Hi Max-
>>>>
>>>> Our testing on this patch looks good, failures seem completely
>>>> alleviated.  We
>>> are not really detecting any performance impact to small block read
>>> workloads.  Data below uses 50Gb CX4 initiator and target and FIO to
>>> generate load.  Each disk runs 4KB random reads with 4 jobs and queue depth
>> 32 per job.
>>> Initiator uses 16 IO queues per attached subsystem.  We tested with 2
>>> P3520 disks attached, and again with 7 disks attached.
>>>>
>>>> 				IOPS		Latency (usec)
>>>> 4.10-RC8	2 disks		545,695		466.0
>>>> With Patch	2 disks		587,663		432.8
>>>> 4.10-RC8	7 disks		1,074,311	829.5
>>>> With Patch	7 disks		1,080,099	825.4
>>>
>>> Very nice.
>>> We also run testing on null devices in our labs on iSER/NVMf (show
>>> better the network/transport layer performance) and the impact was
>> Atolerable.
>>>
>>>>
>>>> You mention these patches are only for testing.  How do we get to
>>>> something
>>> which can be submitted to upstream?
>>>
>>> Yes, we need to be careful and not put the strong_fence if it's not a must.
>>> I'll be out for the upcoming week, but I'll ask our mlx5 maintainers
>>> to prepare a suitable patch and check some other applications
>>> performance numbers.
>>> Thanks for the testing, you can use this patch meanwhile till we push
>>> the formal solution.
>>
>> With additional testing on this patch we are now encountering what seems to
>> be a new failure.  It takes hours of testing to reproduce but we've been able to
>> reproduce on 2 out of 2 overnight runs of continuous testing.  I cannot say
>> conclusively the failure is due to the patch, as we cannot run this exact
>> configuration without the patch, but I can say we did not see this failure mode
>> in previous testing.  Either the patch induced the new failure mode, or perhaps
>> this problem was always present and was just exposed by adding the patch.  We
>> will continue trying to characterize the failure.
>>
>> I've attached a couple dmesg logs of two different reproductions.  The t01 files
>> are the target system and the i03 files are the initiator system.  Failure seems
>> to start with a sequence like below on the target.  Thoughts?
>>
>> [38875.102023] nvmet: ctrl 1 keep-alive timer (15 seconds) expired!
>> [38875.108750] nvmet: ctrl 1 fatal error occurred!
>> [39028.696921] INFO: task kworker/7:3:10534 blocked for more than 120
>> seconds.
>> [39028.704813]       Not tainted 4.10.0-rc8patch-2-get-fence #5
>> [39028.711147] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
>> this message.
>> [39028.719900] kworker/7:3     D    0 10534      2 0x00000000
>> [39028.719908] Workqueue: events nvmet_rdma_release_queue_work
>> [nvmet_rdma] [39028.719909] Call Trace:
>> [39028.719914]  __schedule+0x233/0x6f0
>> [39028.719918]  ? sched_clock+0x9/0x10
>> [39028.719919]  schedule+0x36/0x80
>> [39028.719921]  schedule_timeout+0x22a/0x3f0 [39028.719924]  ?
>> vprintk_emit+0x312/0x4a0 [39028.719927]  ? __kfifo_to_user_r+0xb0/0xb0
>> [39028.719929]  wait_for_completion+0xb4/0x140 [39028.719930]  ?
>> wake_up_q+0x80/0x80 [39028.719933]  nvmet_sq_destroy+0x41/0xf0 [nvmet]
>> [39028.719935]  nvmet_rdma_free_queue+0x28/0xa0 [nvmet_rdma]
>> [39028.719936]  nvmet_rdma_release_queue_work+0x25/0x50 [nvmet_rdma]
>> [39028.719939]  process_one_work+0x1fc/0x4b0 [39028.719940]
>> worker_thread+0x4b/0x500 [39028.719942]  kthread+0x101/0x140
>> [39028.719943]  ? process_one_work+0x4b0/0x4b0 [39028.719945]  ?
>> kthread_create_on_node+0x60/0x60 [39028.719946]
>> ret_from_fork+0x2c/0x40
>
> Hey folks.  Apologies if this message comes through twice, but when I originally sent it the list flagged it as too large due to the dmesg log attachments, and then a coworker just told me they never saw it, so I don't think it made it through on the first attempt.
>
> Please see last note above and dmesg example attached - after more extensive testing with Max's patch we are still able to produce cqe dump errors (at a much lower frequency) as well as a new failure mode involving a crash dump.
>

Hi Joseph,

You mentioned that you saw the cqe dump with my patch, but I can't find
it in the attached dmesg. I only see a hung wait_for_completion. Can
you share the initiator log (since the fix was done on the initiator side)?

Max.

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Unexpected issues with 2 NVME initiators using the same target
  2017-03-27 14:17                             ` Max Gurtovoy
@ 2017-03-27 15:39                               ` Gruher, Joseph R
  2017-03-28  8:38                                 ` Max Gurtovoy
  0 siblings, 1 reply; 171+ messages in thread
From: Gruher, Joseph R @ 2017-03-27 15:39 UTC (permalink / raw)




> On 3/24/2017 9:30 PM, Gruher, Joseph R wrote:
> >>>>> From: Max Gurtovoy [mailto:maxg at mellanox.com]
> >>>>>
> >>>>> I think we need to add fence to the UMR wqe.
> >>>>>
> >>>>> diff --git a/drivers/infiniband/hw/mlx5/qp.c
> >>>>> b/drivers/infiniband/hw/mlx5/qp.c index ad8a263..c38c4fa 100644
> >>>>> --- a/drivers/infiniband/hw/mlx5/qp.c
> >>>>> +++ b/drivers/infiniband/hw/mlx5/qp.c
> >>>>> @@ -3737,8 +3737,7 @@ static void dump_wqe(struct mlx5_ib_qp *qp,
> >>>>> int idx, int size_16)
> >>>>>
> >>>>>   static u8 get_fence(u8 fence, struct ib_send_wr *wr)
> >>>>>   {
> >>>>> -       if (unlikely(wr->opcode == IB_WR_LOCAL_INV &&
> >>>>> -                    wr->send_flags & IB_SEND_FENCE))
> >>>>> +       if (wr->opcode == IB_WR_LOCAL_INV || wr->opcode ==
> >>>>> + IB_WR_REG_MR)
> >>>>>                  return MLX5_FENCE_MODE_STRONG_ORDERING;
> >>>>>
> >>>>>          if (unlikely(fence)) {
> >>>>>
> >>>>> Joseph,
> >>>>> please update after trying the 2 patches (seperatly) + perf numbers.
> >>>>>
> >>>>> I'll take it internally and run some more tests with stronger
> >>>>> servers using
> >>>>> ConnectX4 NICs.
> >>>>>
> >>>>> These patches are only for testing and not for submission yet. If
> >>>>> we find them good enought for upstream then we need to distinguish
> >>>>> between ConnexcX4/IB and ConnectX5 (we probably won't see it there).
> >>>>
> >>>> Hi Max-
> >>>>
> >>>> Our testing on this patch looks good, failures seem completely
> >>>> alleviated.  We
> >>> are not really detecting any performance impact to small block read
> >>> workloads.  Data below uses 50Gb CX4 initiator and target and FIO to
> >>> generate load.  Each disk runs 4KB random reads with 4 jobs and
> >>> queue depth
> >> 32 per job.
> >>> Initiator uses 16 IO queues per attached subsystem.  We tested with
> >>> 2
> >>> P3520 disks attached, and again with 7 disks attached.
> >>>>
> >>>> 				IOPS		Latency (usec)
> >>>> 4.10-RC8	2 disks		545,695		466.0
> >>>> With Patch	2 disks		587,663		432.8
> >>>> 4.10-RC8	7 disks		1,074,311	829.5
> >>>> With Patch	7 disks		1,080,099	825.4
> >>>
> >>> Very nice.
> >>> We also run testing on null devices in our labs on iSER/NVMf (show
> >>> better the network/transport layer performance) and the impact was
> >> Atolerable.
> >>>
> >>>>
> >>>> You mention these patches are only for testing.  How do we get to
> >>>> something
> >>> which can be submitted to upstream?
> >>>
> >>> Yes, we need to be careful and not put the strong_fence if it's not a must.
> >>> I'll be out for the upcoming week, but I'll ask our mlx5 maintainers
> >>> to prepare a suitable patch and check some other applications
> >>> performance numbers.
> >>> Thanks for the testing, you can use this patch meanwhile till we
> >>> push the formal solution.
> >>
> >> With additional testing on this patch we are now encountering what
> >> seems to be a new failure.  It takes hours of testing to reproduce
> >> but we've been able to reproduce on 2 out of 2 overnight runs of
> >> continuous testing.  I cannot say conclusively the failure is due to
> >> the patch, as we cannot run this exact configuration without the
> >> patch, but I can say we did not see this failure mode in previous
> >> testing.  Either the patch induced the new failure mode, or perhaps
> >> this problem was always present and was just exposed by adding the patch.
> We will continue trying to characterize the failure.
> >>
> >> I've attached a couple dmesg logs of two different reproductions.
> >> The t01 files are the target system and the i03 files are the
> >> initiator system.  Failure seems to start with a sequence like below on the
> target.  Thoughts?
> >>
> >> [38875.102023] nvmet: ctrl 1 keep-alive timer (15 seconds) expired!
> >> [38875.108750] nvmet: ctrl 1 fatal error occurred!
> >> [39028.696921] INFO: task kworker/7:3:10534 blocked for more than 120
> >> seconds.
> >> [39028.704813]       Not tainted 4.10.0-rc8patch-2-get-fence #5
> >> [39028.711147] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> >> disables this message.
> >> [39028.719900] kworker/7:3     D    0 10534      2 0x00000000
> >> [39028.719908] Workqueue: events nvmet_rdma_release_queue_work
> >> [nvmet_rdma] [39028.719909] Call Trace:
> >> [39028.719914]  __schedule+0x233/0x6f0 [39028.719918]  ?
> >> sched_clock+0x9/0x10 [39028.719919]  schedule+0x36/0x80
> >> [39028.719921]  schedule_timeout+0x22a/0x3f0 [39028.719924]  ?
> >> vprintk_emit+0x312/0x4a0 [39028.719927]  ?
> >> __kfifo_to_user_r+0xb0/0xb0 [39028.719929]
> wait_for_completion+0xb4/0x140 [39028.719930]  ?
> >> wake_up_q+0x80/0x80 [39028.719933]  nvmet_sq_destroy+0x41/0xf0
> >> [nvmet] [39028.719935]  nvmet_rdma_free_queue+0x28/0xa0
> [nvmet_rdma]
> >> [39028.719936]  nvmet_rdma_release_queue_work+0x25/0x50
> [nvmet_rdma]
> >> [39028.719939]  process_one_work+0x1fc/0x4b0 [39028.719940]
> >> worker_thread+0x4b/0x500 [39028.719942]  kthread+0x101/0x140
> >> [39028.719943]  ? process_one_work+0x4b0/0x4b0 [39028.719945]  ?
> >> kthread_create_on_node+0x60/0x60 [39028.719946]
> >> ret_from_fork+0x2c/0x40
> >
> > Hey folks.  Apologies if this message comes through twice, but when I
> originally sent it the list flagged it as too large due to the dmesg log
> attachments, and then a coworker just told me they never saw it, so I don't
> think it made it through on the first attempt.
> >
> > Please see last note above and dmesg example attached - after more
> extensive testing with Max's patch we are still able to produce cqe dump errors
> (at a much lower frequency) as well as a new failure mode involving a crash
> dump.
> >
> 
> Hi Joseph,
> 
> you mentioned that you saw the cqe dump with my patch but I can't find it in
> the attached dmesg. I only see some hang wait_for_completion. can you share
> the initiator log (since the fix was done in initiator side).
> 
> Max.

Hey Max, sure, here are both the initiator-side and target-side dmesg logs for two test runs.  The logs tagged i03 are from the initiator and t01 is from the target.  We weren't aware the fix was only for the initiator side; we've been applying the patch to both sides and trying to get an overall clean run.  As you say, no errors are directly observed on the initiator end, though we do see disconnects/reconnects and various timeouts (I am assuming these are due to the errors on the target end).  At the target end, each log does have an instance of "dump error cqe", plus stack traces and other obvious problems.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: tmp.zip
Type: application/x-zip-compressed
Size: 92459 bytes
Desc: tmp.zip
URL: <http://lists.infradead.org/pipermail/linux-nvme/attachments/20170327/22bb493d/attachment-0001.bin>

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Unexpected issues with 2 NVME initiators using the same target
  2017-03-27 15:39                               ` Gruher, Joseph R
@ 2017-03-28  8:38                                 ` Max Gurtovoy
  2017-03-28 10:21                                   ` shahar.salzman
  0 siblings, 1 reply; 171+ messages in thread
From: Max Gurtovoy @ 2017-03-28  8:38 UTC (permalink / raw)




On 3/27/2017 6:39 PM, Gruher, Joseph R wrote:
>
>
>> On 3/24/2017 9:30 PM, Gruher, Joseph R wrote:
>>>>>>> From: Max Gurtovoy [mailto:maxg at mellanox.com]
>>>>>>>
>>>>>>> I think we need to add fence to the UMR wqe.
>>>>>>>
>>>>>>> diff --git a/drivers/infiniband/hw/mlx5/qp.c
>>>>>>> b/drivers/infiniband/hw/mlx5/qp.c index ad8a263..c38c4fa 100644
>>>>>>> --- a/drivers/infiniband/hw/mlx5/qp.c
>>>>>>> +++ b/drivers/infiniband/hw/mlx5/qp.c
>>>>>>> @@ -3737,8 +3737,7 @@ static void dump_wqe(struct mlx5_ib_qp *qp,
>>>>>>> int idx, int size_16)
>>>>>>>
>>>>>>>   static u8 get_fence(u8 fence, struct ib_send_wr *wr)
>>>>>>>   {
>>>>>>> -       if (unlikely(wr->opcode == IB_WR_LOCAL_INV &&
>>>>>>> -                    wr->send_flags & IB_SEND_FENCE))
>>>>>>> +       if (wr->opcode == IB_WR_LOCAL_INV || wr->opcode ==
>>>>>>> + IB_WR_REG_MR)
>>>>>>>                  return MLX5_FENCE_MODE_STRONG_ORDERING;
>>>>>>>
>>>>>>>          if (unlikely(fence)) {
>>>>>>>
>>>>>>> Joseph,
>>>>>>> please update after trying the 2 patches (seperatly) + perf numbers.
>>>>>>>
>>>>>>> I'll take it internally and run some more tests with stronger
>>>>>>> servers using
>>>>>>> ConnectX4 NICs.
>>>>>>>
>>>>>>> These patches are only for testing and not for submission yet. If
>>>>>>> we find them good enought for upstream then we need to distinguish
>>>>>>> between ConnexcX4/IB and ConnectX5 (we probably won't see it there).
>>>>>>
>>>>>> Hi Max-
>>>>>>
>>>>>> Our testing on this patch looks good, failures seem completely
>>>>>> alleviated.  We
>>>>> are not really detecting any performance impact to small block read
>>>>> workloads.  Data below uses 50Gb CX4 initiator and target and FIO to
>>>>> generate load.  Each disk runs 4KB random reads with 4 jobs and
>>>>> queue depth
>>>> 32 per job.
>>>>> Initiator uses 16 IO queues per attached subsystem.  We tested with
>>>>> 2
>>>>> P3520 disks attached, and again with 7 disks attached.
>>>>>>
>>>>>> 				IOPS		Latency (usec)
>>>>>> 4.10-RC8	2 disks		545,695		466.0
>>>>>> With Patch	2 disks		587,663		432.8
>>>>>> 4.10-RC8	7 disks		1,074,311	829.5
>>>>>> With Patch	7 disks		1,080,099	825.4
>>>>>
>>>>> Very nice.
>>>>> We also run testing on null devices in our labs on iSER/NVMf (show
>>>>> better the network/transport layer performance) and the impact was
>>>> Atolerable.
>>>>>
>>>>>>
>>>>>> You mention these patches are only for testing.  How do we get to
>>>>>> something
>>>>> which can be submitted to upstream?
>>>>>
>>>>> Yes, we need to be careful and not put the strong_fence if it's not a must.
>>>>> I'll be out for the upcoming week, but I'll ask our mlx5 maintainers
>>>>> to prepare a suitable patch and check some other applications
>>>>> performance numbers.
>>>>> Thanks for the testing, you can use this patch meanwhile till we
>>>>> push the formal solution.
>>>>
>>>> With additional testing on this patch we are now encountering what
>>>> seems to be a new failure.  It takes hours of testing to reproduce
>>>> but we've been able to reproduce on 2 out of 2 overnight runs of
>>>> continuous testing.  I cannot say conclusively the failure is due to
>>>> the patch, as we cannot run this exact configuration without the
>>>> patch, but I can say we did not see this failure mode in previous
>>>> testing.  Either the patch induced the new failure mode, or perhaps
>>>> this problem was always present and was just exposed by adding the patch.
>> We will continue trying to characterize the failure.
>>>>
>>>> I've attached a couple dmesg logs of two different reproductions.
>>>> The t01 files are the target system and the i03 files are the
>>>> initiator system.  Failure seems to start with a sequence like below on the
>> target.  Thoughts?
>>>>
>>>> [38875.102023] nvmet: ctrl 1 keep-alive timer (15 seconds) expired!
>>>> [38875.108750] nvmet: ctrl 1 fatal error occurred!
>>>> [39028.696921] INFO: task kworker/7:3:10534 blocked for more than 120
>>>> seconds.
>>>> [39028.704813]       Not tainted 4.10.0-rc8patch-2-get-fence #5
>>>> [39028.711147] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>>>> disables this message.
>>>> [39028.719900] kworker/7:3     D    0 10534      2 0x00000000
>>>> [39028.719908] Workqueue: events nvmet_rdma_release_queue_work
>>>> [nvmet_rdma] [39028.719909] Call Trace:
>>>> [39028.719914]  __schedule+0x233/0x6f0 [39028.719918]  ?
>>>> sched_clock+0x9/0x10 [39028.719919]  schedule+0x36/0x80
>>>> [39028.719921]  schedule_timeout+0x22a/0x3f0 [39028.719924]  ?
>>>> vprintk_emit+0x312/0x4a0 [39028.719927]  ?
>>>> __kfifo_to_user_r+0xb0/0xb0 [39028.719929]
>> wait_for_completion+0xb4/0x140 [39028.719930]  ?
>>>> wake_up_q+0x80/0x80 [39028.719933]  nvmet_sq_destroy+0x41/0xf0
>>>> [nvmet] [39028.719935]  nvmet_rdma_free_queue+0x28/0xa0
>> [nvmet_rdma]
>>>> [39028.719936]  nvmet_rdma_release_queue_work+0x25/0x50
>> [nvmet_rdma]
>>>> [39028.719939]  process_one_work+0x1fc/0x4b0 [39028.719940]
>>>> worker_thread+0x4b/0x500 [39028.719942]  kthread+0x101/0x140
>>>> [39028.719943]  ? process_one_work+0x4b0/0x4b0 [39028.719945]  ?
>>>> kthread_create_on_node+0x60/0x60 [39028.719946]
>>>> ret_from_fork+0x2c/0x40
>>>
>>> Hey folks.  Apologies if this message comes through twice, but when I
>> originally sent it the list flagged it as too large due to the dmesg log
>> attachments, and then a coworker just told me they never saw it, so I don't
>> think it made it through on the first attempt.
>>>
>>> Please see last note above and dmesg example attached - after more
>> extensive testing with Max's patch we are still able to produce cqe dump errors
>> (at a much lower frequency) as well as a new failure mode involving a crash
>> dump.
>>>
>>
>> Hi Joseph,
>>
>> you mentioned that you saw the cqe dump with my patch but I can't find it in
>> the attached dmesg. I only see some hang wait_for_completion. can you share
>> the initiator log (since the fix was done in initiator side).
>>
>> Max.
>
> Hey Max, sure, here are both the initiator and target side dmesg logs for two test runs.  The logs tagged i03 are the initiator and t01 is the target.  We weren't aware fix was only for initiator side.

The fix is for the initiator but it's ok that you patched the target.

> We've been applying the patch to both sides and trying to get an
> overall clean run.  As you say, no errors are directly observed on the
> initiator end, we do see disconnects/reconnects and various timeouts (I
> am assuming these are due to the errors on the target end).

Are you running simple fio to 16 different devices (each device being a
different subsystem with 1 ns on the target)?

Can you try 1 subsystem with 16 ns?

What is the workload that causes the dump error?

> At the target end, each log does have an instance of "dump error
> cqe", plus stack traces and other obvious problems.
>

You are getting a "remote access error" for some unclear reason; we
need to find out why. The cause should show up on the initiator side,
but I only see IO errors and reconnects there. My guess is that some IO
timeout expires and the command is aborted (without the target really
aborting it), and after some time the target finally tries to access an
invalid resource.

What is the CPU utilization on the target side with your load?

I think this is a different issue now...

Max.

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Unexpected issues with 2 NVME initiators using the same target
  2017-03-28  8:38                                 ` Max Gurtovoy
@ 2017-03-28 10:21                                   ` shahar.salzman
  0 siblings, 0 replies; 171+ messages in thread
From: shahar.salzman @ 2017-03-28 10:21 UTC (permalink / raw)


Hi all,

Sorry for taking so long to reply; we had issues with ConnectX-4
availability, so I had a hard time getting HW for testing.

I have only applied the patch on the initiator side (over 4.9.6); the
target is untouched.

Performance also looks good in my setup (no noticeable change with or
without the patch). I haven't seen the issue recreate, but I'll leave
the system running for the night just in case.

I'm also bumping the question of whether this fix (or a variation) will
reach upstream.

Cheers,
Shahar


On 03/28/2017 11:38 AM, Max Gurtovoy wrote:
>
>
> On 3/27/2017 6:39 PM, Gruher, Joseph R wrote:
>>
>>
>>> On 3/24/2017 9:30 PM, Gruher, Joseph R wrote:
>>>>>>>> From: Max Gurtovoy [mailto:maxg at mellanox.com]
>>>>>>>>
>>>>>>>> I think we need to add fence to the UMR wqe.
>>>>>>>>
>>>>>>>> diff --git a/drivers/infiniband/hw/mlx5/qp.c
>>>>>>>> b/drivers/infiniband/hw/mlx5/qp.c index ad8a263..c38c4fa 100644
>>>>>>>> --- a/drivers/infiniband/hw/mlx5/qp.c
>>>>>>>> +++ b/drivers/infiniband/hw/mlx5/qp.c
>>>>>>>> @@ -3737,8 +3737,7 @@ static void dump_wqe(struct mlx5_ib_qp *qp,
>>>>>>>> int idx, int size_16)
>>>>>>>>
>>>>>>>>   static u8 get_fence(u8 fence, struct ib_send_wr *wr)
>>>>>>>>   {
>>>>>>>> -       if (unlikely(wr->opcode == IB_WR_LOCAL_INV &&
>>>>>>>> -                    wr->send_flags & IB_SEND_FENCE))
>>>>>>>> +       if (wr->opcode == IB_WR_LOCAL_INV || wr->opcode ==
>>>>>>>> + IB_WR_REG_MR)
>>>>>>>>                  return MLX5_FENCE_MODE_STRONG_ORDERING;
>>>>>>>>
>>>>>>>>          if (unlikely(fence)) {
>>>>>>>>
>>>>>>>> Joseph,
>>>>>>>> please update after trying the 2 patches (seperatly) + perf 
>>>>>>>> numbers.
>>>>>>>>
>>>>>>>> I'll take it internally and run some more tests with stronger
>>>>>>>> servers using
>>>>>>>> ConnectX4 NICs.
>>>>>>>>
>>>>>>>> These patches are only for testing and not for submission yet. If
>>>>>>>> we find them good enought for upstream then we need to distinguish
>>>>>>>> between ConnexcX4/IB and ConnectX5 (we probably won't see it 
>>>>>>>> there).
>>>>>>>
>>>>>>> Hi Max-
>>>>>>>
>>>>>>> Our testing on this patch looks good, failures seem completely
>>>>>>> alleviated.  We
>>>>>> are not really detecting any performance impact to small block read
>>>>>> workloads.  Data below uses 50Gb CX4 initiator and target and FIO to
>>>>>> generate load.  Each disk runs 4KB random reads with 4 jobs and
>>>>>> queue depth
>>>>> 32 per job.
>>>>>> Initiator uses 16 IO queues per attached subsystem.  We tested with
>>>>>> 2
>>>>>> P3520 disks attached, and again with 7 disks attached.
>>>>>>>
>>>>>>>                 IOPS        Latency (usec)
>>>>>>> 4.10-RC8    2 disks        545,695        466.0
>>>>>>> With Patch    2 disks        587,663        432.8
>>>>>>> 4.10-RC8    7 disks        1,074,311    829.5
>>>>>>> With Patch    7 disks        1,080,099    825.4
>>>>>>
>>>>>> Very nice.
>>>>>> We also run testing on null devices in our labs on iSER/NVMf (show
>>>>>> better the network/transport layer performance) and the impact was
>>>>> Atolerable.
>>>>>>
>>>>>>>
>>>>>>> You mention these patches are only for testing.  How do we get to
>>>>>>> something
>>>>>> which can be submitted to upstream?
>>>>>>
>>>>>> Yes, we need to be careful and not put the strong_fence if it's 
>>>>>> not a must.
>>>>>> I'll be out for the upcoming week, but I'll ask our mlx5 maintainers
>>>>>> to prepare a suitable patch and check some other applications
>>>>>> performance numbers.
>>>>>> Thanks for the testing, you can use this patch meanwhile till we
>>>>>> push the formal solution.
>>>>>
>>>>> With additional testing on this patch we are now encountering what
>>>>> seems to be a new failure.  It takes hours of testing to reproduce
>>>>> but we've been able to reproduce on 2 out of 2 overnight runs of
>>>>> continuous testing.  I cannot say conclusively the failure is due to
>>>>> the patch, as we cannot run this exact configuration without the
>>>>> patch, but I can say we did not see this failure mode in previous
>>>>> testing.  Either the patch induced the new failure mode, or perhaps
>>>>> this problem was always present and was just exposed by adding the 
>>>>> patch.
>>> We will continue trying to characterize the failure.
>>>>>
>>>>> I've attached a couple dmesg logs of two different reproductions.
>>>>> The t01 files are the target system and the i03 files are the
>>>>> initiator system.  Failure seems to start with a sequence like 
>>>>> below on the
>>> target.  Thoughts?
>>>>>
>>>>> [38875.102023] nvmet: ctrl 1 keep-alive timer (15 seconds) expired!
>>>>> [38875.108750] nvmet: ctrl 1 fatal error occurred!
>>>>> [39028.696921] INFO: task kworker/7:3:10534 blocked for more than 120
>>>>> seconds.
>>>>> [39028.704813]       Not tainted 4.10.0-rc8patch-2-get-fence #5
>>>>> [39028.711147] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>>>>> disables this message.
>>>>> [39028.719900] kworker/7:3     D    0 10534      2 0x00000000
>>>>> [39028.719908] Workqueue: events nvmet_rdma_release_queue_work
>>>>> [nvmet_rdma] [39028.719909] Call Trace:
>>>>> [39028.719914]  __schedule+0x233/0x6f0 [39028.719918]  ?
>>>>> sched_clock+0x9/0x10 [39028.719919]  schedule+0x36/0x80
>>>>> [39028.719921]  schedule_timeout+0x22a/0x3f0 [39028.719924]  ?
>>>>> vprintk_emit+0x312/0x4a0 [39028.719927]  ?
>>>>> __kfifo_to_user_r+0xb0/0xb0 [39028.719929]
>>> wait_for_completion+0xb4/0x140 [39028.719930]  ?
>>>>> wake_up_q+0x80/0x80 [39028.719933] nvmet_sq_destroy+0x41/0xf0
>>>>> [nvmet] [39028.719935]  nvmet_rdma_free_queue+0x28/0xa0
>>> [nvmet_rdma]
>>>>> [39028.719936] nvmet_rdma_release_queue_work+0x25/0x50
>>> [nvmet_rdma]
>>>>> [39028.719939] process_one_work+0x1fc/0x4b0 [39028.719940]
>>>>> worker_thread+0x4b/0x500 [39028.719942] kthread+0x101/0x140
>>>>> [39028.719943]  ? process_one_work+0x4b0/0x4b0 [39028.719945]  ?
>>>>> kthread_create_on_node+0x60/0x60 [39028.719946]
>>>>> ret_from_fork+0x2c/0x40
>>>>
>>>> Hey folks.  Apologies if this message comes through twice, but when I
>>> originally sent it the list flagged it as too large due to the dmesg 
>>> log
>>> attachments, and then a coworker just told me they never saw it, so 
>>> I don't
>>> think it made it through on the first attempt.
>>>>
>>>> Please see last note above and dmesg example attached - after more
>>> extensive testing with Max's patch we are still able to produce cqe 
>>> dump errors
>>> (at a much lower frequency) as well as a new failure mode involving 
>>> a crash
>>> dump.
>>>>
>>>
>>> Hi Joseph,
>>>
>>> you mentioned that you saw the cqe dump with my patch but I can't 
>>> find it in
>>> the attached dmesg. I only see some hang wait_for_completion. can 
>>> you share
>>> the initiator log (since the fix was done in initiator side).
>>>
>>> Max.
>>
>> Hey Max, sure, here are both the initiator and target side dmesg logs 
>> for two test runs.  The logs tagged i03 are the initiator and t01 is 
>> the target.  We weren't aware fix was only for initiator side.
>
> The fix is for the initiator but it's ok that you patched the target.
>
>   We've been applying the patch to both sides and trying to get an 
> overall clean run.  As you say, no errors are directly observed on the 
> initiator end, we do see disconnects/reconnects and various timeouts 
> (I am assuming these are due to the errors on the target end).
>
> are you running simple fio to 16 different devices (each device is 
> different subsystem with 1 ns on the target ?) ?
>
> can you try 1 subsystem with 16 ns ?
>
> what is the workload that cause the dump error ?
>
>   At the target end, each log does have an instance of "dump error 
> cqe", plus stack traces and other obvious problems.
>>
>
> you are getting "remote access error" for some unclear reason.
> need to find why.
> The reason should be found in the initiator but I only see IO error 
> and reconnects there. My guess is that some IO tmo is expired and 
> aborted (and the target doesn't really abort the cmd) and after some 
> time the target finaly try to access some invalid resource.
>
> what is the CPU utilization in the target side with your load ?
>
> I think this is different issue now...
>
> Max.
>

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-03-24 18:30                           ` Gruher, Joseph R
@ 2017-03-28 11:34                                 ` Sagi Grimberg
       [not found]                             ` <DE927C68B458BE418D582EC97927A928550419FA-8oqHQFITsIFQxe9IK+vIArfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  1 sibling, 0 replies; 171+ messages in thread
From: Sagi Grimberg @ 2017-03-28 11:34 UTC (permalink / raw)
  To: Gruher, Joseph R, 'Max Gurtovoy',
	'shahar.salzman', 'Laurence Oberman',
	Riches Jr, Robert M
  Cc: 'linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org',
	'linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org',
	'Robert LeBlanc',
	Sen, Sujoy, Knapp, Anthony J

Hey Joseph,

> Hey folks.  Apologies if this message comes through twice, but when I originally sent it the list flagged it as too large due to the dmesg log attachments, and then a coworker just told me they never saw it, so I don't think it made it through on the first attempt.
>
> Please see last note above and dmesg example attached - after more extensive testing with Max's patch we are still able to produce cqe dump errors (at a much lower frequency) as well as a new failure mode involving a crash dump.
>

This is a different issue AFAICT,

Looks like nvmet_sq_destroy() is stuck waiting for
the final reference to drop (which seems to never happen).

I'm trying to look for a code path where this may happen.
Can you tell if the backend block device completed all of
its I/O when this happens (you can check for active tags in debugfs)?
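
To make that concrete, here is a minimal userspace C analogue of the
pattern (illustration only -- the real nvmet_sq_destroy() uses a
percpu_ref and wait_for_completion(), and every name below is made up
for the sketch): a teardown that waits for all in-flight requests to
drop their references blocks forever, exactly like the stuck
nvmet_rdma_release_queue_work kworker above, if even one request never
completes.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

struct sq {
    pthread_mutex_t lock;
    pthread_cond_t  all_dropped;
    int             refs;   /* one reference per in-flight request */
    bool            dying;  /* set at teardown; no new requests taken */
};

static void sq_get(struct sq *sq)       /* request arrives */
{
    pthread_mutex_lock(&sq->lock);
    sq->refs++;
    pthread_mutex_unlock(&sq->lock);
}

static void sq_put(struct sq *sq)       /* request completes */
{
    pthread_mutex_lock(&sq->lock);
    if (--sq->refs == 0 && sq->dying)
        pthread_cond_signal(&sq->all_dropped);
    pthread_mutex_unlock(&sq->lock);
}

static void sq_destroy(struct sq *sq)   /* the wait_for_completion() step */
{
    pthread_mutex_lock(&sq->lock);
    sq->dying = true;
    while (sq->refs > 0)                /* hangs if one "put" never happens */
        pthread_cond_wait(&sq->all_dropped, &sq->lock);
    pthread_mutex_unlock(&sq->lock);
    printf("sq torn down\n");
}

int main(void)
{
    struct sq sq = { PTHREAD_MUTEX_INITIALIZER,
                     PTHREAD_COND_INITIALIZER, 0, false };

    sq_get(&sq);    /* one request in flight */
    sq_put(&sq);    /* comment this out to reproduce the hang */
    sq_destroy(&sq);
    return 0;
}

That is also why the active-tags check matters: a still-busy tag on the
backend device would point at the request that never did its final put.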
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Unexpected issues with 2 NVME initiators using the same target
@ 2017-03-28 11:34                                 ` Sagi Grimberg
  0 siblings, 0 replies; 171+ messages in thread
From: Sagi Grimberg @ 2017-03-28 11:34 UTC (permalink / raw)


Hey Joseph,

> Hey folks.  Apologies if this message comes through twice, but when I originally sent it the list flagged it as too large due to the dmesg log attachments, and then a coworker just told me they never saw it, so I don't think it made it through on the first attempt.
>
> Please see last note above and dmesg example attached - after more extensive testing with Max's patch we are still able to produce cqe dump errors (at a much lower frequency) as well as a new failure mode involving a crash dump.
>

This is a different issue AFAICT,

Looks like nvmet_sq_destroy() is stuck waiting for
the final reference to drop (which seems to never happen).

I'm trying to look for a code path where this may happen.
Can you tell if the backend block device completed all of
its I/O when this happens (you can check for active tags in debugfs)?

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-03-17 19:49                       ` Max Gurtovoy
@ 2017-04-10 11:40                             ` Marta Rybczynska
       [not found]                         ` <809f87ab-b787-9d40-5840-07500d12e81a-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  1 sibling, 0 replies; 171+ messages in thread
From: Marta Rybczynska @ 2017-04-10 11:40 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: Gruher, Joseph R, Sagi Grimberg, shahar.salzman,
	Laurence Oberman, Riches Jr, Robert M,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Robert LeBlanc,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

> On 3/17/2017 8:37 PM, Gruher, Joseph R wrote:
>>
>>
>>> -----Original Message-----
>>> From: Max Gurtovoy [mailto:maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org]
>>>
>>> I think we need to add fence to the UMR wqe.
>>>
>>> diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
>>> index ad8a263..c38c4fa 100644
>>> --- a/drivers/infiniband/hw/mlx5/qp.c
>>> +++ b/drivers/infiniband/hw/mlx5/qp.c
>>> @@ -3737,8 +3737,7 @@ static void dump_wqe(struct mlx5_ib_qp *qp, int idx,
>>> int size_16)
>>>
>>>   static u8 get_fence(u8 fence, struct ib_send_wr *wr)
>>>   {
>>> -       if (unlikely(wr->opcode == IB_WR_LOCAL_INV &&
>>> -                    wr->send_flags & IB_SEND_FENCE))
>>> +       if (wr->opcode == IB_WR_LOCAL_INV || wr->opcode == IB_WR_REG_MR)
>>>                  return MLX5_FENCE_MODE_STRONG_ORDERING;
>>>
>>>          if (unlikely(fence)) {
>>>
>>> 
>>
>> You mention these patches are only for testing.  How do we get to something
>> which can be submitted to upstream?
> 
> Yes, we need to be careful and not put the strong_fence if it's not a must.
> I'll be out for the upcoming week, but I'll ask our mlx5 maintainers to
> prepare a suitable patch and check some other applications performance
> numbers.
> Thanks for the testing, you can use this patch meanwhile till we push
> the formal solution.
> 

Hello Max,
We're seeing the same issue in our setup and we've been running this patch
on our system for some time already. It seems to have fixed the issue.
When there is a final patch available, we can test it too.

Thanks,
Marta
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Unexpected issues with 2 NVME initiators using the same target
@ 2017-04-10 11:40                             ` Marta Rybczynska
  0 siblings, 0 replies; 171+ messages in thread
From: Marta Rybczynska @ 2017-04-10 11:40 UTC (permalink / raw)


> On 3/17/2017 8:37 PM, Gruher, Joseph R wrote:
>>
>>
>>> -----Original Message-----
>>> From: Max Gurtovoy [mailto:maxg at mellanox.com]
>>>
>>> I think we need to add fence to the UMR wqe.
>>>
>>> diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
>>> index ad8a263..c38c4fa 100644
>>> --- a/drivers/infiniband/hw/mlx5/qp.c
>>> +++ b/drivers/infiniband/hw/mlx5/qp.c
>>> @@ -3737,8 +3737,7 @@ static void dump_wqe(struct mlx5_ib_qp *qp, int idx,
>>> int size_16)
>>>
>>>   static u8 get_fence(u8 fence, struct ib_send_wr *wr)
>>>   {
>>> -       if (unlikely(wr->opcode == IB_WR_LOCAL_INV &&
>>> -                    wr->send_flags & IB_SEND_FENCE))
>>> +       if (wr->opcode == IB_WR_LOCAL_INV || wr->opcode == IB_WR_REG_MR)
>>>                  return MLX5_FENCE_MODE_STRONG_ORDERING;
>>>
>>>          if (unlikely(fence)) {
>>>
>>> 
>>
>> You mention these patches are only for testing.  How do we get to something
>> which can be submitted to upstream?
> 
> Yes, we need to be careful and not put the strong_fence if it's not a must.
> I'll be out for the upcoming week, but I'll ask our mlx5 maintainers to
> prepare a suitable patch and check some other applications performance
> numbers.
> Thanks for the testing, you can use this patch meanwhile till we push
> the formal solution.
> 

Hello Max,
We're seeing the same issue in our setup and we've been running this patch
on our system for some time already. It seems to have fixed the issue.
When there is a final patch available, we can test it too.

Thanks,
Marta

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Unexpected issues with 2 NVME initiators using the same target
  2017-04-10 11:40                             ` Marta Rybczynska
  (?)
@ 2017-04-10 14:09                             ` Max Gurtovoy
       [not found]                               ` <33e2cc35-f147-d4a4-9a42-8f1245e35842-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  -1 siblings, 1 reply; 171+ messages in thread
From: Max Gurtovoy @ 2017-04-10 14:09 UTC (permalink / raw)




On 4/10/2017 2:40 PM, Marta Rybczynska wrote:
>> On 3/17/2017 8:37 PM, Gruher, Joseph R wrote:
>>>
>>>
>>>> -----Original Message-----
>>>> From: Max Gurtovoy [mailto:maxg at mellanox.com]
>>>>
>>>> I think we need to add fence to the UMR wqe.
>>>>
>>>> diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
>>>> index ad8a263..c38c4fa 100644
>>>> --- a/drivers/infiniband/hw/mlx5/qp.c
>>>> +++ b/drivers/infiniband/hw/mlx5/qp.c
>>>> @@ -3737,8 +3737,7 @@ static void dump_wqe(struct mlx5_ib_qp *qp, int idx,
>>>> int size_16)
>>>>
>>>>   static u8 get_fence(u8 fence, struct ib_send_wr *wr)
>>>>   {
>>>> -       if (unlikely(wr->opcode == IB_WR_LOCAL_INV &&
>>>> -                    wr->send_flags & IB_SEND_FENCE))
>>>> +       if (wr->opcode == IB_WR_LOCAL_INV || wr->opcode == IB_WR_REG_MR)
>>>>                  return MLX5_FENCE_MODE_STRONG_ORDERING;
>>>>
>>>>          if (unlikely(fence)) {
>>>>
>>>>
>>>
>>> You mention these patches are only for testing.  How do we get to something
>>> which can be submitted to upstream?
>>
>> Yes, we need to be careful and not put the strong_fence if it's not a must.
>> I'll be out for the upcoming week, but I'll ask our mlx5 maintainers to
>> prepare a suitable patch and check some other applications performance
>> numbers.
>> Thanks for the testing, you can use this patch meanwhile till we push
>> the formal solution.
>>
>
> Hello Max,
> We're seeing the same issue in our setup and we're running this patch
> on our system for some time already. It seems to have fixed the issue.
> When there is a final patch available, we can test it too.
>
> Thanks,
> Marta
>

Hi Marta,
Thanks for testing my patch. I'll send it early next week (holidays in
our country), so hopefully it will be available in the 4.11 kernel.
If you can share which NICs you tested it on and the perf numbers you
get (with and without the patch), that would be great.

thanks,
Max.

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-04-10 14:09                             ` Max Gurtovoy
@ 2017-04-11 12:47                                   ` Marta Rybczynska
  0 siblings, 0 replies; 171+ messages in thread
From: Marta Rybczynska @ 2017-04-11 12:47 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: Gruher, Joseph R, Sagi Grimberg, shahar.salzman,
	Laurence Oberman, Riches Jr, Robert M,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Robert LeBlanc,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

----- Mail original -----
> On 4/10/2017 2:40 PM, Marta Rybczynska wrote:
>>> On 3/17/2017 8:37 PM, Gruher, Joseph R wrote:
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: Max Gurtovoy [mailto:maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org]
>>>>>
>>>>> I think we need to add fence to the UMR wqe.
>>>>>
>>>>> diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
>>>>> index ad8a263..c38c4fa 100644
>>>>> --- a/drivers/infiniband/hw/mlx5/qp.c
>>>>> +++ b/drivers/infiniband/hw/mlx5/qp.c
>>>>> @@ -3737,8 +3737,7 @@ static void dump_wqe(struct mlx5_ib_qp *qp, int idx,
>>>>> int size_16)
>>>>>
>>>>>   static u8 get_fence(u8 fence, struct ib_send_wr *wr)
>>>>>   {
>>>>> -       if (unlikely(wr->opcode == IB_WR_LOCAL_INV &&
>>>>> -                    wr->send_flags & IB_SEND_FENCE))
>>>>> +       if (wr->opcode == IB_WR_LOCAL_INV || wr->opcode == IB_WR_REG_MR)
>>>>>                  return MLX5_FENCE_MODE_STRONG_ORDERING;
>>>>>
>>>>>          if (unlikely(fence)) {
>>>>>
>>>>>
>>>>
>>>> You mention these patches are only for testing.  How do we get to something
>>>> which can be submitted to upstream?
>>>
>>> Yes, we need to be careful and not put the strong_fence if it's not a must.
>>> I'll be out for the upcoming week, but I'll ask our mlx5 maintainers to
>>> prepare a suitable patch and check some other applications performance
>>> numbers.
>>> Thanks for the testing, you can use this patch meanwhile till we push
>>> the formal solution.
>>>
>>
>> Hello Max,
>> We're seeing the same issue in our setup and we're running this patch
>> on our system for some time already. It seems to have fixed the issue.
>> When there is a final patch available, we can test it too.
>>
>> Thanks,
>> Marta
>>
> 
> Hi Marta,
> thanks for testing my patch. I'll send it early next week (holiday's in
> our country) so it will be available in 4.11 kernel hopefully.
> if you can share on which NIC's you tested it and the perf numbers you
> get (with and without the patch), it will be great.
> 

Hello Max,
That's great news. We're ready to start testing as soon as the final
version of the patch is out. We're using ConnectX-4. Unfortunately, in
our use case the patch is not a question of performance: the workload
doesn't work without it. We may think about running other workloads for
the tests.

Marta
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Unexpected issues with 2 NVME initiators using the same target
@ 2017-04-11 12:47                                   ` Marta Rybczynska
  0 siblings, 0 replies; 171+ messages in thread
From: Marta Rybczynska @ 2017-04-11 12:47 UTC (permalink / raw)


----- Mail original -----
> On 4/10/2017 2:40 PM, Marta Rybczynska wrote:
>>> On 3/17/2017 8:37 PM, Gruher, Joseph R wrote:
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: Max Gurtovoy [mailto:maxg at mellanox.com]
>>>>>
>>>>> I think we need to add fence to the UMR wqe.
>>>>>
>>>>> diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
>>>>> index ad8a263..c38c4fa 100644
>>>>> --- a/drivers/infiniband/hw/mlx5/qp.c
>>>>> +++ b/drivers/infiniband/hw/mlx5/qp.c
>>>>> @@ -3737,8 +3737,7 @@ static void dump_wqe(struct mlx5_ib_qp *qp, int idx,
>>>>> int size_16)
>>>>>
>>>>>   static u8 get_fence(u8 fence, struct ib_send_wr *wr)
>>>>>   {
>>>>> -       if (unlikely(wr->opcode == IB_WR_LOCAL_INV &&
>>>>> -                    wr->send_flags & IB_SEND_FENCE))
>>>>> +       if (wr->opcode == IB_WR_LOCAL_INV || wr->opcode == IB_WR_REG_MR)
>>>>>                  return MLX5_FENCE_MODE_STRONG_ORDERING;
>>>>>
>>>>>          if (unlikely(fence)) {
>>>>>
>>>>>
>>>>
>>>> You mention these patches are only for testing.  How do we get to something
>>>> which can be submitted to upstream?
>>>
>>> Yes, we need to be careful and not put the strong_fence if it's not a must.
>>> I'll be out for the upcoming week, but I'll ask our mlx5 maintainers to
>>> prepare a suitable patch and check some other applications performance
>>> numbers.
>>> Thanks for the testing, you can use this patch meanwhile till we push
>>> the formal solution.
>>>
>>
>> Hello Max,
>> We're seeing the same issue in our setup and we're running this patch
>> on our system for some time already. It seems to have fixed the issue.
>> When there is a final patch available, we can test it too.
>>
>> Thanks,
>> Marta
>>
> 
> Hi Marta,
> thanks for testing my patch. I'll send it early next week (holiday's in
> our country) so it will be available in 4.11 kernel hopefully.
> if you can share on which NIC's you tested it and the perf numbers you
> get (with and without the patch), it will be great.
> 

Hello Max,
That's great news. We're ready to start testing as soon as the final
version of the patch is out. We're using ConnectX-4. Unfortunately, in
our use case the patch is not a question of performance: the workload
doesn't work without it. We may think about running other workloads for
the tests.

Marta

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-04-10 14:09                             ` Max Gurtovoy
@ 2017-04-20 10:18                                   ` Sagi Grimberg
  0 siblings, 0 replies; 171+ messages in thread
From: Sagi Grimberg @ 2017-04-20 10:18 UTC (permalink / raw)
  To: Max Gurtovoy, Marta Rybczynska
  Cc: Gruher, Joseph R, shahar.salzman, Laurence Oberman, Riches Jr,
	Robert M, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Robert LeBlanc,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

Max,

> Hi Marta,
> thanks for testing my patch. I'll send it early next week (holiday's in
> our country) so it will be available in 4.11 kernel hopefully.
> if you can share on which NIC's you tested it and the perf numbers you
> get (with and without the patch), it will be great.

Can we get it into 4.12, please (we need you to send it out today)?
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Unexpected issues with 2 NVME initiators using the same target
@ 2017-04-20 10:18                                   ` Sagi Grimberg
  0 siblings, 0 replies; 171+ messages in thread
From: Sagi Grimberg @ 2017-04-20 10:18 UTC (permalink / raw)


Max,

> Hi Marta,
> thanks for testing my patch. I'll send it early next week (holiday's in
> our country) so it will be available in 4.11 kernel hopefully.
> if you can share on which NIC's you tested it and the perf numbers you
> get (with and without the patch), it will be great.

Can we get it into 4.12, please (we need you to send it out today)?

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Unexpected issues with 2 NVME initiators using the same target
  2017-04-20 10:18                                   ` Sagi Grimberg
  (?)
@ 2017-04-26 11:56                                   ` Max Gurtovoy
       [not found]                                     ` <af9044c2-fb7c-e8b3-d8fc-4874cfd1bb67-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  -1 siblings, 1 reply; 171+ messages in thread
From: Max Gurtovoy @ 2017-04-26 11:56 UTC (permalink / raw)




On 4/20/2017 1:18 PM, Sagi Grimberg wrote:
> Max,
>
>> Hi Marta,
>> thanks for testing my patch. I'll send it early next week (holiday's in
>> our country) so it will be available in 4.11 kernel hopefully.
>> if you can share on which NIC's you tested it and the perf numbers you
>> get (with and without the patch), it will be great.
>
> Can we get it to 4.12 please (we need you to send it out today).

Sagi,
we are preparing a cap bit for it.
I'll send it as soon as it is ready (otherwise we would be adding the
fence even for devices that can handle without it, such as ConnectX-5).
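
To sketch the shape of that idea (illustration only, not the pending
patch: umr_fence_required is a hypothetical stand-in for whatever
capability bit the real patch will read, and the ib_verbs/mlx5
definitions are assumed from the surrounding drivers/infiniband/hw/mlx5/qp.c):

static u8 get_fence_sketch(u8 fence, struct ib_send_wr *wr,
                           bool umr_fence_required)
{
    /*
     * Fence LOCAL_INV/REG_MR only on HCAs that actually need it
     * (e.g. ConnectX-4/ConnectX-IB); others keep the cheaper default.
     */
    if (umr_fence_required &&
        (wr->opcode == IB_WR_LOCAL_INV || wr->opcode == IB_WR_REG_MR))
        return MLX5_FENCE_MODE_STRONG_ORDERING;

    /*
     * Simplified: the real get_fence() goes on to handle IB_SEND_FENCE;
     * that part would stay unchanged and is omitted here.
     */
    return fence;
}

At probe time the driver would set that flag from the HCA capability,
so ConnectX-4/ConnectX-IB get the strong fence and ConnectX-5 avoids
the cost.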

Max.

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-04-26 11:56                                   ` Max Gurtovoy
@ 2017-04-26 14:45                                         ` Sagi Grimberg
  0 siblings, 0 replies; 171+ messages in thread
From: Sagi Grimberg @ 2017-04-26 14:45 UTC (permalink / raw)
  To: Max Gurtovoy, Marta Rybczynska
  Cc: Gruher, Joseph R, shahar.salzman, Laurence Oberman, Riches Jr,
	Robert M, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Robert LeBlanc,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r


>> Can we get it to 4.12 please (we need you to send it out today).
>
> Sagi,
> we are preparing a cap bit for it.
> I'll send it as soon as it will be ready (otherwise we'll add fence for
> devices that can handle without it - such as ConnectX5).

Why? Can't you just get what you have in as a fix and incrementally add
the optimization for CX5?
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Unexpected issues with 2 NVME initiators using the same target
@ 2017-04-26 14:45                                         ` Sagi Grimberg
  0 siblings, 0 replies; 171+ messages in thread
From: Sagi Grimberg @ 2017-04-26 14:45 UTC (permalink / raw)



>> Can we get it to 4.12 please (we need you to send it out today).
>
> Sagi,
> we are preparing a cap bit for it.
> I'll send it as soon as it will be ready (otherwise we'll add fence for
> devices that can handle without it - such as ConnectX5).

Why? Can't you just get what you have in as a fix and incrementally add
the optimization for CX5?

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Unexpected issues with 2 NVME initiators using the same target
  2017-04-26 14:45                                         ` Sagi Grimberg
  (?)
@ 2017-05-12 19:20                                         ` Gruher, Joseph R
       [not found]                                           ` <DE927C68B458BE418D582EC97927A92855088C6F-8oqHQFITsIFQxe9IK+vIArfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  -1 siblings, 1 reply; 171+ messages in thread
From: Gruher, Joseph R @ 2017-05-12 19:20 UTC (permalink / raw)


> >> Can we get it to 4.12 please (we need you to send it out today).
> >
> > Sagi,
> > we are preparing a cap bit for it.
> > I'll send it as soon as it will be ready (otherwise we'll add fence
> > for devices that can handle without it - such as ConnectX5).
> 
> Why? can't you just get what you have as a fix and incrementally add the
> optimization for CX5?

Any update here?  Would love to be able to load up the new kernels without patching them every time. :)

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-05-12 19:20                                         ` Gruher, Joseph R
@ 2017-05-15 12:00                                               ` Sagi Grimberg
  0 siblings, 0 replies; 171+ messages in thread
From: Sagi Grimberg @ 2017-05-15 12:00 UTC (permalink / raw)
  To: Gruher, Joseph R, Max Gurtovoy, Marta Rybczynska
  Cc: shahar.salzman, Laurence Oberman, Riches Jr, Robert M,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Robert LeBlanc,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

> Any update here?  Would love to be able to load up the new kernels without patching them every time. :)

I would like that too.

Max, can you send a patch, or should I?
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Unexpected issues with 2 NVME initiators using the same target
@ 2017-05-15 12:00                                               ` Sagi Grimberg
  0 siblings, 0 replies; 171+ messages in thread
From: Sagi Grimberg @ 2017-05-15 12:00 UTC (permalink / raw)


> Any update here?  Would love to be able to load up the new kernels without patching them every time. :)

I would like that too.

Max, can you send a patch, or should I?

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-05-15 12:00                                               ` Sagi Grimberg
@ 2017-05-15 13:31                                                   ` Leon Romanovsky
  -1 siblings, 0 replies; 171+ messages in thread
From: Leon Romanovsky @ 2017-05-15 13:31 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Gruher, Joseph R, Max Gurtovoy, Marta Rybczynska, shahar.salzman,
	Laurence Oberman, Riches Jr, Robert M,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Robert LeBlanc,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

[-- Attachment #1: Type: text/plain, Size: 637 bytes --]

On Mon, May 15, 2017 at 03:00:07PM +0300, Sagi Grimberg wrote:
> > Any update here?  Would love to be able to load up the new kernels without patching them every time. :)
>
> I would like that too,
>
> Max, can you send a patch? or should I?

Sagi,

Max is doing his best to provide a patch; unfortunately, he is limited
by various architecture implications which he needs to resolve before
sending it.

Thanks

> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Unexpected issues with 2 NVME initiators using the same target
@ 2017-05-15 13:31                                                   ` Leon Romanovsky
  0 siblings, 0 replies; 171+ messages in thread
From: Leon Romanovsky @ 2017-05-15 13:31 UTC (permalink / raw)


On Mon, May 15, 2017 at 03:00:07PM +0300, Sagi Grimberg wrote:
> > Any update here?  Would love to be able to load up the new kernels without patching them every time. :)
>
> I would like that too,
>
> Max, can you send a patch? or should I?

Sagi,

Max is doing his best to provide a patch; unfortunately, he is limited
by various architecture implications which he needs to resolve before
sending it.

Thanks

> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo at vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: not available
URL: <http://lists.infradead.org/pipermail/linux-nvme/attachments/20170515/d9229a4b/attachment.sig>

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-05-15 13:31                                                   ` Leon Romanovsky
@ 2017-05-15 13:43                                                       ` Sagi Grimberg
  -1 siblings, 0 replies; 171+ messages in thread
From: Sagi Grimberg @ 2017-05-15 13:43 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Gruher, Joseph R, Max Gurtovoy, Marta Rybczynska, shahar.salzman,
	Laurence Oberman, Riches Jr, Robert M,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Robert LeBlanc,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

>>> Any update here?  Would love to be able to load up the new kernels without patching them every time. :)
>>
>> I would like that too,
>>
>> Max, can you send a patch? or should I?
>
> Sagi,
>
> Max is doing his best to provide a patch; unfortunately, he is limited
> by various architecture implications which he needs to resolve before
> sending it.

Well, he already sent a patch that fixes the issue.
He said that he needs additional optimization for CX5 (which I assume
doesn't need the strong fence), but that still does not change the
fact that CX4 is broken. I asked to include his patch, fix the
existing bug and incrementally optimize CX5.

What do you mean by architecture implications? It's broken, and there is
a request from the community to fix it. Are you suggesting that it
doesn't solve the issue?

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-05-15 13:43                                                       ` Sagi Grimberg
@ 2017-05-15 14:36                                                           ` Leon Romanovsky
  -1 siblings, 0 replies; 171+ messages in thread
From: Leon Romanovsky @ 2017-05-15 14:36 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Gruher, Joseph R, Max Gurtovoy, Marta Rybczynska, shahar.salzman,
	Laurence Oberman, Riches Jr, Robert M,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Robert LeBlanc,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

On Mon, May 15, 2017 at 04:43:18PM +0300, Sagi Grimberg wrote:
> > > > Any update here?  Would love to be able to load up the new kernels without patching them every time. :)
> > >
> > > I would like that too,
> > >
> > > Max, can you send a patch? or should I?
> >
> > Sagi,
> >
> > Max is doing his best to provide a patch; unfortunately, he is limited
> > by various architecture implications which he needs to resolve before
> > sending it.
>
> Well, he already sent a patch that fixes the issue.
> He said that he needs additional optimization for CX5 (which I assume
> doesn't need the strong fence), but that still does not change the
> fact that CX4 is broken. I asked to include his patch, fix the
> existing bug and incrementally optimize CX5.
>
> What do you mean by architecture implications? It's broken, and there is
> a request from the community to fix it. Are you suggesting that it
> doesn't solve the issue?

I understand you, and both Max and I feel the same way. For more than
2 months we have constantly (almost on a daily basis) asked the
architecture group for a solution, but received different answers. The
proposals were extremely broad, ranging from a strong fence being needed
for all cards to no strong fence being needed at all.

Thanks

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-05-15 14:36                                                           ` Leon Romanovsky
@ 2017-05-15 14:59                                                               ` Christoph Hellwig
  -1 siblings, 0 replies; 171+ messages in thread
From: Christoph Hellwig @ 2017-05-15 14:59 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Sagi Grimberg, Gruher, Joseph R, Max Gurtovoy, Marta Rybczynska,
	shahar.salzman, Laurence Oberman, Riches Jr, Robert M,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Robert LeBlanc,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

On Mon, May 15, 2017 at 05:36:32PM +0300, Leon Romanovsky wrote:
> I understand you, and both Max and I feel the same way. For more than
> 2 months we have constantly (almost on a daily basis) asked the
> architecture group for a solution, but received different answers. The
> proposals were extremely broad, ranging from a strong fence being needed
> for all cards to no strong fence being needed at all.

So let's get the patch to do a strong fence everywhere now, and relax
it later where possible.

Correctness before speed.
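
For readers following along, a fence here is just an ordering flag on a
work request: the HCA will not start a fenced WQE until the operations it
is fenced against have completed. The sketch below shows where such a
flag is applied at the generic verbs level; it is for illustration only
(the helper name is made up), and it is not the mlx5-internal UMR fence
patch being discussed, which orders registration WQEs inside the driver.

#include <rdma/ib_verbs.h>

/*
 * Illustration only: IB_SEND_FENCE, as defined by the IB spec, orders a
 * work request after previously posted RDMA Read/Atomic work requests on
 * the same QP.  The "strong fence" discussed in this thread is a
 * stronger, mlx5-internal ordering applied to memory-registration WQEs.
 */
static int post_fenced_send(struct ib_qp *qp, struct ib_sge *sge)
{
	struct ib_send_wr wr = {}, *bad_wr;

	wr.opcode     = IB_WR_SEND;
	wr.sg_list    = sge;
	wr.num_sge    = 1;
	/* do not start this WQE until the prior fenced-against WQEs complete */
	wr.send_flags = IB_SEND_SIGNALED | IB_SEND_FENCE;

	return ib_post_send(qp, &wr, &bad_wr);
}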

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-05-15 14:59                                                               ` Christoph Hellwig
@ 2017-05-15 17:05                                                                   ` Leon Romanovsky
  -1 siblings, 0 replies; 171+ messages in thread
From: Leon Romanovsky @ 2017-05-15 17:05 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Sagi Grimberg, Gruher, Joseph R, Max Gurtovoy, Marta Rybczynska,
	shahar.salzman, Laurence Oberman, Riches Jr, Robert M,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Robert LeBlanc,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

On Mon, May 15, 2017 at 07:59:52AM -0700, Christoph Hellwig wrote:
> On Mon, May 15, 2017 at 05:36:32PM +0300, Leon Romanovsky wrote:
> > I understand you, and both Max and I feel the same way. For more than
> > 2 months we have constantly (almost on a daily basis) asked the
> > architecture group for a solution, but received different answers. The
> > proposals were extremely broad, ranging from a strong fence being needed
> > for all cards to no strong fence being needed at all.
>
> So let's get the patch to do a strong fence everywhere now, and relax
> it later where possible.
>
> Correctness before speed.

OK, please give me and Max till EOW to stop this saga. One of two
things will happen: either Max will resend the original patch, or he
will send a patch blessed by the architecture group.

Thanks

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-05-15 17:05                                                                   ` Leon Romanovsky
@ 2017-05-17 12:56                                                                       ` Marta Rybczynska
  -1 siblings, 0 replies; 171+ messages in thread
From: Marta Rybczynska @ 2017-05-17 12:56 UTC (permalink / raw)
  To: Leon Romanovsky, Max Gurtovoy
  Cc: Christoph Hellwig, Sagi Grimberg, Gruher, Joseph R,
	shahar.salzman, Laurence Oberman, Riches Jr, Robert M,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Robert LeBlanc,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

> On Mon, May 15, 2017 at 07:59:52AM -0700, Christoph Hellwig wrote:
>> On Mon, May 15, 2017 at 05:36:32PM +0300, Leon Romanovsky wrote:
>> > I understand you, and both Max and I feel the same way. For more than
>> > 2 months we have constantly (almost on a daily basis) asked the
>> > architecture group for a solution, but received different answers. The
>> > proposals were extremely broad, ranging from a strong fence being needed
>> > for all cards to no strong fence being needed at all.
>>
>> So let's get the patch to do a strong fence everywhere now, and relax
>> it later where possible.
>>
>> Correctness before speed.
>
> OK, please give me and Max till EOW to stop this saga. One of two
> things will happen: either Max will resend the original patch, or he
> will send a patch blessed by the architecture group.
> 

Good luck with this, Max & Leon! It seems to be a complicated problem.
Just an idea: in our case it *seems* that the problem started appearing
after a firmware upgrade; older firmware versions do not seem to show
the same behaviour. Maybe that's a hint for you.

Thanks!
Marta

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-05-17 12:56                                                                       ` Marta Rybczynska
@ 2017-05-18 13:34                                                                           ` Leon Romanovsky
  -1 siblings, 0 replies; 171+ messages in thread
From: Leon Romanovsky @ 2017-05-18 13:34 UTC (permalink / raw)
  To: Marta Rybczynska
  Cc: Max Gurtovoy, Christoph Hellwig, Sagi Grimberg, Gruher, Joseph R,
	shahar.salzman, Laurence Oberman, Riches Jr, Robert M,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Robert LeBlanc,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

On Wed, May 17, 2017 at 02:56:36PM +0200, Marta Rybczynska wrote:
> > On Mon, May 15, 2017 at 07:59:52AM -0700, Christoph Hellwig wrote:
> >> On Mon, May 15, 2017 at 05:36:32PM +0300, Leon Romanovsky wrote:
> >> > I understand you, and both Max and I feel the same way. For more than
> >> > 2 months we have constantly (almost on a daily basis) asked the
> >> > architecture group for a solution, but received different answers. The
> >> > proposals were extremely broad, ranging from a strong fence being needed
> >> > for all cards to no strong fence being needed at all.
> >>
> >> So let's get the patch to do a strong fence everywhere now, and relax
> >> it later where possible.
> >>
> >> Correctness before speed.
> >
> > OK, please give me and Max till EOW to stop this saga. One of two
> > things will happen: either Max will resend the original patch, or he
> > will send a patch blessed by the architecture group.
> >
>
> Good luck with this, Max & Leon! It seems to be a complicated problem.
> Just an idea: in our case it *seems* that the problem started appearing
> after a firmware upgrade; older firmware versions do not seem to show
> the same behaviour. Maybe that's a hint for you.

OK, we came to an agreement on which capability bits we should add. Max
will return to the office in the middle of next week, and we will
proceed with the submission of a proper patch once our shared code is
accepted.

In the meantime, I have put the original patch into our regression testing.
https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git/commit/?h=testing/queue-next&id=a40ac569f243db552661e6efad70080bb406823c

Thank you for your patience.

>
> Thanks!
> Marta

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-05-18 13:34                                                                           ` Leon Romanovsky
@ 2017-06-19 17:21                                                                               ` Robert LeBlanc
  -1 siblings, 0 replies; 171+ messages in thread
From: Robert LeBlanc @ 2017-06-19 17:21 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Marta Rybczynska, Max Gurtovoy, Christoph Hellwig, Sagi Grimberg,
	Gruher, Joseph R, shahar.salzman, Laurence Oberman, Riches Jr,
	Robert M, linux-rdma,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

I ran into this with 4.9.32 when I rebooted the target. I tested
4.12-rc6 and this particular error seems to have been resolved, but I
now get a new one on the initiator. This one doesn't seem as
impactful.

[Mon Jun 19 11:17:20 2017] mlx5_0:dump_cqe:275:(pid 0): dump error cqe
[Mon Jun 19 11:17:20 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:17:20 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:17:20 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:17:20 2017] 00000000 93005204 0a0001bd 45c8e0d2
[Mon Jun 19 11:17:20 2017] iser: iser_err_comp: command failure: local
protection error (4) vend_err 52
[Mon Jun 19 11:17:20 2017]  connection3:0: detected conn error (1011)
[Mon Jun 19 11:17:31 2017] mlx5_0:dump_cqe:275:(pid 0): dump error cqe
[Mon Jun 19 11:17:31 2017] mlx5_0:dump_cqe:275:(pid 0): dump error cqe
[Mon Jun 19 11:17:31 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:17:31 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:17:31 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:17:31 2017] 00000000 93005204 0a0001e7 45dd82d2
[Mon Jun 19 11:17:31 2017] iser: iser_err_comp: command failure: local
protection error (4) vend_err 52
[Mon Jun 19 11:17:31 2017]  connection4:0: detected conn error (1011)
[Mon Jun 19 11:17:31 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:17:31 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:17:31 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:17:31 2017] 00000000 93005204 0a0001f4 004915d2
[Mon Jun 19 11:17:31 2017] iser: iser_err_comp: command failure: local
protection error (4) vend_err 52
[Mon Jun 19 11:17:31 2017]  connection3:0: detected conn error (1011)
[Mon Jun 19 11:17:44 2017] mlx5_0:dump_cqe:275:(pid 0): dump error cqe
[Mon Jun 19 11:17:44 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:17:44 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:17:44 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:17:44 2017] 00000000 93005204 0a0001f6 004519d2
[Mon Jun 19 11:17:44 2017] iser: iser_err_comp: command failure: local
protection error (4) vend_err 52
[Mon Jun 19 11:17:44 2017]  connection3:0: detected conn error (1011)
[Mon Jun 19 11:18:55 2017] mlx5_0:dump_cqe:275:(pid 0): dump error cqe
[Mon Jun 19 11:18:55 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:18:55 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:18:55 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:18:55 2017] 00000000 93005204 0a0001f7 01934fd2
[Mon Jun 19 11:18:55 2017] iser: iser_err_comp: command failure: local
protection error (4) vend_err 52
[Mon Jun 19 11:18:55 2017]  connection3:0: detected conn error (1011)
[Mon Jun 19 11:20:25 2017] mlx5_0:dump_cqe:275:(pid 0): dump error cqe
[Mon Jun 19 11:20:25 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:20:25 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:20:25 2017] 00000000 00000000 00000000 00000000
[Mon Jun 19 11:20:25 2017] 00000000 93005204 0a0001f8 0274edd2
[Mon Jun 19 11:20:25 2017] iser: iser_err_comp: command failure: local
protection error (4) vend_err 52
[Mon Jun 19 11:20:25 2017]  connection3:0: detected conn error (1011)

I'm going to try to cherry-pick the fix to 4.9.x and do some testing there.

Thanks,

----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Thu, May 18, 2017 at 7:34 AM, Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> On Wed, May 17, 2017 at 02:56:36PM +0200, Marta Rybczynska wrote:
>> > On Mon, May 15, 2017 at 07:59:52AM -0700, Christoph Hellwig wrote:
>> >> On Mon, May 15, 2017 at 05:36:32PM +0300, Leon Romanovsky wrote:
>> >> > I understand you, and both Max and I feel the same way. For more than
>> >> > 2 months we have constantly (almost on a daily basis) asked the
>> >> > architecture group for a solution, but received different answers. The
>> >> > proposals were extremely broad, ranging from a strong fence being needed
>> >> > for all cards to no strong fence being needed at all.
>> >>
>> >> So let's get the patch to do a strong fence everywhere now, and relax
>> >> it later where possible.
>> >>
>> >> Correctness before speed.
>> >
>> > OK, please give me and Max till EOW to stop this saga. One of two
>> > things will happen: either Max will resend the original patch, or he
>> > will send a patch blessed by the architecture group.
>> >
>>
>> Good luck with this, Max & Leon! It seems to be a complicated problem.
>> Just an idea: in our case it *seems* that the problem started appearing
>> after a firmware upgrade; older firmware versions do not seem to show
>> the same behaviour. Maybe that's a hint for you.
>
> OK, we came to an agreement on which capability bits we should add. Max
> will return to the office in the middle of next week, and we will
> proceed with the submission of a proper patch once our shared code is
> accepted.
>
> In the meantime, I have put the original patch into our regression testing.
> https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git/commit/?h=testing/queue-next&id=a40ac569f243db552661e6efad70080bb406823c
>
> Thank you for your patience.
>
>>
>> Thanks!
>> Marta

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-06-19 17:21                                                                               ` Robert LeBlanc
@ 2017-06-20  6:39                                                                                   ` Sagi Grimberg
  -1 siblings, 0 replies; 171+ messages in thread
From: Sagi Grimberg @ 2017-06-20  6:39 UTC (permalink / raw)
  To: Robert LeBlanc, Leon Romanovsky
  Cc: Marta Rybczynska, Max Gurtovoy, Christoph Hellwig, Gruher,
	Joseph R, shahar.salzman, Laurence Oberman, Riches Jr, Robert M,
	linux-rdma, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

Hi Robert,

> I ran into this with 4.9.32 when I rebooted the target. I tested
> 4.12-rc6 and this particular error seems to have been resolved, but I
> now get a new one on the initiator. This one doesn't seem as
> impactful.
> 
> [Mon Jun 19 11:17:20 2017] mlx5_0:dump_cqe:275:(pid 0): dump error cqe
> [Mon Jun 19 11:17:20 2017] 00000000 00000000 00000000 00000000
> [Mon Jun 19 11:17:20 2017] 00000000 00000000 00000000 00000000
> [Mon Jun 19 11:17:20 2017] 00000000 00000000 00000000 00000000
> [Mon Jun 19 11:17:20 2017] 00000000 93005204 0a0001bd 45c8e0d2

Max, Leon,

Care to parse this syndrome for us? ;)

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-06-20  6:39                                                                                   ` Sagi Grimberg
@ 2017-06-20  7:46                                                                                       ` Leon Romanovsky
  -1 siblings, 0 replies; 171+ messages in thread
From: Leon Romanovsky @ 2017-06-20  7:46 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Robert LeBlanc, Marta Rybczynska, Max Gurtovoy,
	Christoph Hellwig, Gruher, Joseph R, shahar.salzman,
	Laurence Oberman, Riches Jr, Robert M, linux-rdma,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

On Tue, Jun 20, 2017 at 09:39:36AM +0300, Sagi Grimberg wrote:
> Hi Robert,
>
> > I ran into this with 4.9.32 when I rebooted the target. I tested
> > 4.12-rc6 and this particular error seems to have been resolved, but I
> > now get a new one on the initiator. This one doesn't seem as
> > impactful.
> >
> > [Mon Jun 19 11:17:20 2017] mlx5_0:dump_cqe:275:(pid 0): dump error cqe
> > [Mon Jun 19 11:17:20 2017] 00000000 00000000 00000000 00000000
> > [Mon Jun 19 11:17:20 2017] 00000000 00000000 00000000 00000000
> > [Mon Jun 19 11:17:20 2017] 00000000 00000000 00000000 00000000
> > [Mon Jun 19 11:17:20 2017] 00000000 93005204 0a0001bd 45c8e0d2
>
> Max, Leon,
>
> Care to parse this syndrome for us? ;)

Here is the parsed output; it says that the access was to an mkey
which is free.

======== cqe_with_error ========
wqe_id                           : 0x0
srqn_usr_index                   : 0x0
byte_cnt                         : 0x0
hw_error_syndrome                : 0x93
hw_syndrome_type                 : 0x0
vendor_error_syndrome            : 0x52
syndrome                         : LOCAL_PROTECTION_ERROR (0x4)
s_wqe_opcode                     : SEND (0xa)
qpn_dctn_flow_tag                : 0x1bd
wqe_counter                      : 0x45c8
signature                        : 0xe0
opcode                           : REQUESTOR_ERROR (0xd)
cqe_format                       : NO_INLINE_DATA (0x0)
owner                            : 0x0

Thanks
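
For anyone who wants to reproduce this decoding by hand: the fields above
line up with the last row of the dumped CQE (00000000 93005204 0a0001bd
45c8e0d2). Below is a minimal user-space sketch of that unpacking; it is
not part of any kernel or vendor tool mentioned in this thread, and the
bit layout is inferred from this single example, so treat it as an
approximation rather than a definitive description of the CQE format.

#include <stdint.h>
#include <stdio.h>

/* Decode the last three non-zero dwords of an mlx5 "dump error cqe" print. */
static void decode_err_cqe_tail(uint32_t dw13, uint32_t dw14, uint32_t dw15)
{
	unsigned int hw_err_synd  = dw13 >> 24;          /* 0x93 */
	unsigned int vendor_synd  = (dw13 >> 8) & 0xff;  /* 0x52 */
	unsigned int syndrome     = dw13 & 0xff;         /* 0x04: local protection error */
	unsigned int s_wqe_opcode = dw14 >> 24;          /* 0x0a: SEND */
	unsigned int qpn          = dw14 & 0xffffff;     /* 0x1bd */
	unsigned int wqe_counter  = dw15 >> 16;          /* 0x45c8 */
	unsigned int signature    = (dw15 >> 8) & 0xff;  /* 0xe0 */
	unsigned int cqe_opcode   = (dw15 & 0xff) >> 4;  /* 0xd: requestor error */

	printf("syndrome 0x%x vendor 0x%x hw 0x%x wqe opcode 0x%x qpn 0x%x "
	       "wqe_counter 0x%x signature 0x%x cqe opcode 0x%x\n",
	       syndrome, vendor_synd, hw_err_synd, s_wqe_opcode, qpn,
	       wqe_counter, signature, cqe_opcode);
}

int main(void)
{
	/* last row of the dump quoted above */
	decode_err_cqe_tail(0x93005204, 0x0a0001bd, 0x45c8e0d2);
	return 0;
}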

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-06-20  7:46                                                                                       ` Leon Romanovsky
@ 2017-06-20  7:58                                                                                           ` Sagi Grimberg
  -1 siblings, 0 replies; 171+ messages in thread
From: Sagi Grimberg @ 2017-06-20  7:58 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Robert LeBlanc, Marta Rybczynska, Max Gurtovoy,
	Christoph Hellwig, Gruher, Joseph R, shahar.salzman,
	Laurence Oberman, Riches Jr, Robert M, linux-rdma,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r


>> Hi Robert,
>>
>>> I ran into this with 4.9.32 when I rebooted the target. I tested
>>> 4.12-rc6 and this particular error seems to have been resolved, but I
>>> now get a new one on the initiator. This one doesn't seem as
>>> impactful.
>>>
>>> [Mon Jun 19 11:17:20 2017] mlx5_0:dump_cqe:275:(pid 0): dump error cqe
>>> [Mon Jun 19 11:17:20 2017] 00000000 00000000 00000000 00000000
>>> [Mon Jun 19 11:17:20 2017] 00000000 00000000 00000000 00000000
>>> [Mon Jun 19 11:17:20 2017] 00000000 00000000 00000000 00000000
>>> [Mon Jun 19 11:17:20 2017] 00000000 93005204 0a0001bd 45c8e0d2
>>
>> Max, Leon,
>>
>> Care to parse this syndrome for us? ;)
> 
> Here is the parsed output; it says that the access was to an mkey
> which is free.
> 
> ======== cqe_with_error ========
> wqe_id                           : 0x0
> srqn_usr_index                   : 0x0
> byte_cnt                         : 0x0
> hw_error_syndrome                : 0x93
> hw_syndrome_type                 : 0x0
> vendor_error_syndrome            : 0x52

Can you share the check that correlates to the vendor+hw syndrome?

> syndrome                         : LOCAL_PROTECTION_ERROR (0x4)
> s_wqe_opcode                     : SEND (0xa)

That's interesting, the opcode is a send operation. I'm assuming
that this is immediate-data write? Robert, did this happen when
you issued >4k writes to the target?

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-06-20  7:58                                                                                           ` Sagi Grimberg
@ 2017-06-20  8:33                                                                                               ` Leon Romanovsky
  -1 siblings, 0 replies; 171+ messages in thread
From: Leon Romanovsky @ 2017-06-20  8:33 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Robert LeBlanc, Marta Rybczynska, Max Gurtovoy,
	Christoph Hellwig, Gruher, Joseph R, shahar.salzman,
	Laurence Oberman, Riches Jr, Robert M, linux-rdma,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

On Tue, Jun 20, 2017 at 10:58:47AM +0300, Sagi Grimberg wrote:
>
> > > Hi Robert,
> > >
> > > > I ran into this with 4.9.32 when I rebooted the target. I tested
> > > > 4.12-rc6 and this particular error seems to have been resolved, but I
> > > > now get a new one on the initiator. This one doesn't seem as
> > > > impactful.
> > > >
> > > > [Mon Jun 19 11:17:20 2017] mlx5_0:dump_cqe:275:(pid 0): dump error cqe
> > > > [Mon Jun 19 11:17:20 2017] 00000000 00000000 00000000 00000000
> > > > [Mon Jun 19 11:17:20 2017] 00000000 00000000 00000000 00000000
> > > > [Mon Jun 19 11:17:20 2017] 00000000 00000000 00000000 00000000
> > > > [Mon Jun 19 11:17:20 2017] 00000000 93005204 0a0001bd 45c8e0d2
> > >
> > > Max, Leon,
> > >
> > > Care to parse this syndrome for us? ;)
> >
> > Here is the parsed output; it says that the access was to an mkey
> > which is free.
> >
> > ======== cqe_with_error ========
> > wqe_id                           : 0x0
> > srqn_usr_index                   : 0x0
> > byte_cnt                         : 0x0
> > hw_error_syndrome                : 0x93
> > hw_syndrome_type                 : 0x0
> > vendor_error_syndrome            : 0x52
>
> Can you share the check that correlates to the vendor+hw syndrome?

mkey.free == 1

>
> > syndrome                         : LOCAL_PROTECTION_ERROR (0x4)
> > s_wqe_opcode                     : SEND (0xa)
>
> That's interesting, the opcode is a send operation. I'm assuming
> that this is immediate-data write? Robert, did this happen when
> you issued >4k writes to the target?

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-06-20  8:33                                                                                               ` Leon Romanovsky
@ 2017-06-20  9:33                                                                                                   ` Sagi Grimberg
  -1 siblings, 0 replies; 171+ messages in thread
From: Sagi Grimberg @ 2017-06-20  9:33 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Robert LeBlanc, Marta Rybczynska, Max Gurtovoy,
	Christoph Hellwig, Gruher, Joseph R, shahar.salzman,
	Laurence Oberman, Riches Jr, Robert M, linux-rdma,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Jason Gunthorpe,
	Liran Liss


>>> Here is the parsed output; it says that the access was to an mkey
>>> which is free.

Missed that :)

>>> ======== cqe_with_error ========
>>> wqe_id                           : 0x0
>>> srqn_usr_index                   : 0x0
>>> byte_cnt                         : 0x0
>>> hw_error_syndrome                : 0x93
>>> hw_syndrome_type                 : 0x0
>>> vendor_error_syndrome            : 0x52
>>
>> Can you share the check that correlates to the vendor+hw syndrome?
> 
> mkey.free == 1

Hmm, the way I understand it is that the HW is trying to access
(locally via send) a MR which was already invalidated.

Thinking of this further, this can happen in a case where the target
already completed the transaction, sent SEND_WITH_INVALIDATE but the
original send ack was lost somewhere causing the device to retransmit
from the MR (which was already invalidated). This is highly unlikely
though.

Shouldn't this be protected somehow by the device?
Can someone explain why the above cannot happen? Jason? Liran? Anyone?

Say the host registers MR (a) and issues send (1) from that MR to a target,
the ack for send (1) gets lost, and the target issues SEND_WITH_INVALIDATE
on MR (a) which the host HCA processes; then the host HCA times out on
send (1) and retries it, but ehh, it's already invalidated.

Or, we can also have a race where we destroy all our MRs when I/O
is still running (but from the code we should be safe here).

Robert, when you rebooted the target, I assume iscsi ping
timeout expired and the connection teardown started correct?

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Unexpected issues with 2 NVME initiators using the same target
  2017-06-20  9:33                                                                                                   ` Sagi Grimberg
  (?)
@ 2017-06-20 10:31                                                                                                   ` Max Gurtovoy
       [not found]                                                                                                     ` <78b2c1db-6ece-0274-c4c9-5ee1f7c88469-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  -1 siblings, 1 reply; 171+ messages in thread
From: Max Gurtovoy @ 2017-06-20 10:31 UTC (permalink / raw)




On 6/20/2017 12:33 PM, Sagi Grimberg wrote:
>
>>>> Here is the parsed output; it says that the access was to an mkey
>>>> which is free.
>
> Missed that :)
>
>>>> ======== cqe_with_error ========
>>>> wqe_id                           : 0x0
>>>> srqn_usr_index                   : 0x0
>>>> byte_cnt                         : 0x0
>>>> hw_error_syndrome                : 0x93
>>>> hw_syndrome_type                 : 0x0
>>>> vendor_error_syndrome            : 0x52
>>>
>>> Can you share the check that correlates to the vendor+hw syndrome?
>>
>> mkey.free == 1
>
> Hmm, the way I understand it is that the HW is trying to access
> (locally via send) a MR which was already invalidated.
>
> Thinking of this further, this can happen in a case where the target
> already completed the transaction, sent SEND_WITH_INVALIDATE but the
> original send ack was lost somewhere causing the device to retransmit
> from the MR (which was already invalidated). This is highly unlikely
> though.
>
> Shouldn't this be protected somehow by the device?
> Can someone explain why the above cannot happen? Jason? Liran? Anyone?
>
> Say the host registers MR (a) and issues send (1) from that MR to a target,
> the ack for send (1) gets lost, and the target issues SEND_WITH_INVALIDATE
> on MR (a) which the host HCA processes; then the host HCA times out on
> send (1) and retries it, but ehh, it's already invalidated.

This might happen IMO.
Robert, can you test this untested patch (this is not the full solution, 
just something to think about):

diff --git a/drivers/infiniband/ulp/iser/iser_verbs.c b/drivers/infiniband/ulp/iser/iser_verbs.c
index c538a38..e93bd40 100644
--- a/drivers/infiniband/ulp/iser/iser_verbs.c
+++ b/drivers/infiniband/ulp/iser/iser_verbs.c
@@ -1079,7 +1079,7 @@ int iser_post_send(struct ib_conn *ib_conn, struct iser_tx_desc *tx_desc,
         wr->sg_list = tx_desc->tx_sg;
         wr->num_sge = tx_desc->num_sge;
         wr->opcode = IB_WR_SEND;
-       wr->send_flags = signal ? IB_SEND_SIGNALED : 0;
+       wr->send_flags = IB_SEND_SIGNALED;

         ib_ret = ib_post_send(ib_conn->qp, &tx_desc->wrs[0].send, &bad_wr);
         if (ib_ret)

>
> Or, we can also have a race where we destroy all our MRs when I/O
> is still running (but from the code we should be safe here).
>
> Robert, when you rebooted the target, I assume iscsi ping
> timeout expired and the connection teardown started correct?

^ permalink raw reply related	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-06-20  9:33                                                                                                   ` Sagi Grimberg
@ 2017-06-20 12:02                                                                                                       ` Sagi Grimberg
  -1 siblings, 0 replies; 171+ messages in thread
From: Sagi Grimberg @ 2017-06-20 12:02 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Robert LeBlanc, Marta Rybczynska, Max Gurtovoy,
	Christoph Hellwig, Gruher, Joseph R, shahar.salzman,
	Laurence Oberman, Riches Jr, Robert M, linux-rdma,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Jason Gunthorpe,
	Liran Liss, Bart Van Assche, Chuck Lever


>>> Can you share the check that correlates to the vendor+hw syndrome?
>>
>> mkey.free == 1
> 
> Hmm, the way I understand it is that the HW is trying to access
> (locally via send) a MR which was already invalidated.
> 
> Thinking of this further, this can happen in a case where the target
> already completed the transaction, sent SEND_WITH_INVALIDATE but the
> original send ack was lost somewhere causing the device to retransmit
> from the MR (which was already invalidated). This is highly unlikely
> though.
> 
> Shouldn't this be protected somehow by the device?
> Can someone explain why the above cannot happen? Jason? Liran? Anyone?
> 
> Say the host registers MR (a) and issues send (1) from that MR to a target,
> the ack for send (1) gets lost, and the target issues SEND_WITH_INVALIDATE
> on MR (a) which the host HCA processes; then the host HCA times out on
> send (1) and retries it, but ehh, it's already invalidated.

Well, this entire flow is broken: why should the host send the MR rkey
to the target if it is not using it for remote access? The target
should never have a chance to remotely invalidate something it did not
access.

I think we have a bug in the iSER code, as we should not send the key
for remote invalidation if we do an inline data send...

Robert, can you try the following:
--
diff --git a/drivers/infiniband/ulp/iser/iser_initiator.c b/drivers/infiniband/ulp/iser/iser_initiator.c
index 12ed62ce9ff7..2a07692007bd 100644
--- a/drivers/infiniband/ulp/iser/iser_initiator.c
+++ b/drivers/infiniband/ulp/iser/iser_initiator.c
@@ -137,8 +137,10 @@ iser_prepare_write_cmd(struct iscsi_task *task,

         if (unsol_sz < edtl) {
                 hdr->flags     |= ISER_WSV;
-               hdr->write_stag = cpu_to_be32(mem_reg->rkey);
-               hdr->write_va   = cpu_to_be64(mem_reg->sge.addr + unsol_sz);
+               if (buf_out->data_len > imm_sz) {
+                       hdr->write_stag = cpu_to_be32(mem_reg->rkey);
+                       hdr->write_va = cpu_to_be64(mem_reg->sge.addr + unsol_sz);
+               }

                 iser_dbg("Cmd itt:%d, WRITE tags, RKEY:%#.4X "
                          "VA:%#llX + unsol:%d\n",
--

Although, I still don't think it's enough. We need to delay the
local invalidate until we receive a send completion (which guarantees
that the ack was received)...

If this is indeed the case, _all_ ULP initiator drivers share it, because we
never condition on a send completion in order to complete an I/O, and
in the case of a lost ack on a send, it looks like we need to... It *will*
hurt performance.
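
To make the "delay the local invalidate until the send completion" idea concrete, a
rough sketch (hypothetical names: my_tx, my_mr, my_mr_local_invalidate are not from the
iser code) could look like:

#include <linux/kernel.h>
#include <rdma/ib_verbs.h>

struct my_mr;                           /* whatever backs the inline data */

struct my_tx {                          /* hypothetical per-command descriptor */
        struct ib_cqe   send_cqe;
        struct my_mr    *mr;
};

void my_mr_local_invalidate(struct my_mr *mr);  /* assumed to exist */

/* Runs as the ib_cqe done handler of the (signaled) send WR. */
static void my_tx_send_done(struct ib_cq *cq, struct ib_wc *wc)
{
        struct my_tx *tx = container_of(wc->wr_cqe, struct my_tx, send_cqe);

        /* A send completion on an RC QP means the send was acked (or the
         * QP moved to error), so the HCA will not read this MR again;
         * only now is it safe to locally invalidate / recycle it. */
        my_mr_local_invalidate(tx->mr);
        tx->mr = NULL;
}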

What do other folks think?

CC'ing Bart, Chuck, Christoph.

Guys, to summarize, I think we might have broken behavior in the
initiator-mode drivers. We never condition on send completions (for
requests) before we complete an I/O. The issue is that the ack for those
sends might get lost, which means that the HCA will retry them (they are
dropped by the peer HCA), but if we happen to complete the I/O before
that, we may either unmap the request area or, for inline data,
invalidate it (so the HCA will try to access an MR which was already
invalidated).

Signalling all send completions and also finishing I/Os only after we
got them will add latency, and that sucks...

^ permalink raw reply related	[flat|nested] 171+ messages in thread

* Unexpected issues with 2 NVME initiators using the same target
  2017-06-20 12:02                                                                                                       ` Sagi Grimberg
  (?)
@ 2017-06-20 13:28                                                                                                       ` Max Gurtovoy
  -1 siblings, 0 replies; 171+ messages in thread
From: Max Gurtovoy @ 2017-06-20 13:28 UTC (permalink / raw)




On 6/20/2017 3:02 PM, Sagi Grimberg wrote:
>
>>>> Can you share the check that correlates to the vendor+hw syndrome?
>>>
>>> mkey.free == 1
>>
>> Hmm, the way I understand it is that the HW is trying to access
>> (locally via send) a MR which was already invalidated.
>>
>> Thinking of this further, this can happen in a case where the target
>> already completed the transaction, sent SEND_WITH_INVALIDATE but the
>> original send ack was lost somewhere causing the device to retransmit
>> from the MR (which was already invalidated). This is highly unlikely
>> though.
>>
>> Shouldn't this be protected somehow by the device?
>> Can someone explain why the above cannot happen? Jason? Liran? Anyone?
>>
>> Say host register MR (a) and send (1) from that MR to a target,
>> send (1) ack got lost, and the target issues SEND_WITH_INVALIDATE
>> on MR (a) and the host HCA process it, then host HCA timeout on send (1)
>> so it retries, but ehh, its already invalidated.
>
> Well, this entire flow is broken, why should the host send the MR rkey
> to the target if it is not using it for remote access, the target
> should never have a chance to remote invalidate something it did not
> access.
>
> I think we have a bug in iSER code, as we should not send the key
> for remote invalidation if we do inline data send...
>
> Robert, can you try the following:
> --
> diff --git a/drivers/infiniband/ulp/iser/iser_initiator.c
> b/drivers/infiniband/ulp/iser/iser_initiator.c
> index 12ed62ce9ff7..2a07692007bd 100644
> --- a/drivers/infiniband/ulp/iser/iser_initiator.c
> +++ b/drivers/infiniband/ulp/iser/iser_initiator.c
> @@ -137,8 +137,10 @@ iser_prepare_write_cmd(struct iscsi_task *task,
>
>         if (unsol_sz < edtl) {
>                 hdr->flags     |= ISER_WSV;
> -               hdr->write_stag = cpu_to_be32(mem_reg->rkey);
> -               hdr->write_va   = cpu_to_be64(mem_reg->sge.addr +
> unsol_sz);
> +               if (buf_out->data_len > imm_sz) {
> +                       hdr->write_stag = cpu_to_be32(mem_reg->rkey);
> +                       hdr->write_va = cpu_to_be64(mem_reg->sge.addr +
> unsol_sz);
> +               }
>
>                 iser_dbg("Cmd itt:%d, WRITE tags, RKEY:%#.4X "
>                          "VA:%#llX + unsol:%d\n",
> --
>
> Although, I still don't think its enough. We need to delay the
> local invalidate till we received a send completion (guarantees
> that ack was received)...
>
> If this indeed the case, _all_ ULP initiator drivers share it because we
> never condition on a send completion in order to complete an I/O, and
> in the case of lost ack on send, looks like we need to... It *will* hurt
> performance.
>
> What do other folks think?

Sagi,
As I mentioned in my earlier mail (with my "patch"), your scenario
may happen, and it actually did happen.
I agree we need to fix all the initiator ULP drivers.

>
> CC'ing Bart, Chuck, Christoph.
>
> Guys, for summary, I think we might have a broken behavior in the
> initiator mode drivers. We never condition send completions (for
> requests) before we complete an I/O. The issue is that the ack for those
> sends might get lost, which means that the HCA will retry them (dropped
> by the peer HCA) but if we happen to complete the I/O before, either we
> can unmap the request area, or for inline data, we invalidate it (so the
> HCA will try to access a MR which was invalidated).
>
> Signalling all send completions and also finishing I/Os only after we
> got them will add latency, and that sucks...

Yes, you're right. We must complete the I/O only after receiving both
the send and recv completions.
I can run some performance checks with iSER/NVMf, signalling every
send and releasing the I/O task only after both completions have arrived.
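
A minimal sketch of that "release only after both arrived" scheme, with invented names
(my_cmd, my_cmd_complete) and assuming every send is signaled, might look like:

#include <linux/atomic.h>
#include <linux/kernel.h>
#include <rdma/ib_verbs.h>

struct my_cmd {                 /* hypothetical per-I/O descriptor */
        struct ib_cqe   send_cqe;
        atomic_t        ref;    /* initialized to 2 when the command is posted */
};

void my_cmd_complete(struct my_cmd *cmd);       /* hand the I/O back to the ULP */

static void my_cmd_put(struct my_cmd *cmd)
{
        /* Whoever drops the last reference - send completion or response -
         * completes the I/O, so buffers/MRs are only reused once the HCA
         * is done with the send. */
        if (atomic_dec_and_test(&cmd->ref))
                my_cmd_complete(cmd);
}

static void my_cmd_send_done(struct ib_cq *cq, struct ib_wc *wc)
{
        my_cmd_put(container_of(wc->wr_cqe, struct my_cmd, send_cqe));
}

static void my_cmd_response(struct my_cmd *cmd) /* called from the recv path */
{
        my_cmd_put(cmd);
}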

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-06-20  7:58                                                                                           ` Sagi Grimberg
@ 2017-06-20 14:41                                                                                               ` Robert LeBlanc
  -1 siblings, 0 replies; 171+ messages in thread
From: Robert LeBlanc @ 2017-06-20 14:41 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Leon Romanovsky, Marta Rybczynska, Max Gurtovoy,
	Christoph Hellwig, Gruher, Joseph R, shahar.salzman,
	Laurence Oberman, Riches Jr, Robert M, linux-rdma,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

On Tue, Jun 20, 2017 at 1:58 AM, Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org> wrote:
>
>>> Hi Robert,
>>>
>>>> I ran into this with 4.9.32 when I rebooted the target. I tested
>>>> 4.12-rc6 and this particular error seems to have been resolved, but I
>>>> now get a new one on the initiator. This one doesn't seem as
>>>> impactful.
>>>>
>>>> [Mon Jun 19 11:17:20 2017] mlx5_0:dump_cqe:275:(pid 0): dump error cqe
>>>> [Mon Jun 19 11:17:20 2017] 00000000 00000000 00000000 00000000
>>>> [Mon Jun 19 11:17:20 2017] 00000000 00000000 00000000 00000000
>>>> [Mon Jun 19 11:17:20 2017] 00000000 00000000 00000000 00000000
>>>> [Mon Jun 19 11:17:20 2017] 00000000 93005204 0a0001bd 45c8e0d2
>>>
>>>
>>> Max, Leon,
>>>
>>> Care to parse this syndrome for us? ;)
>>
>>
>> Here the parsed output, it says that it was access to mkey which is
>> free.
>>
>> ======== cqe_with_error ========
>> wqe_id                           : 0x0
>> srqn_usr_index                   : 0x0
>> byte_cnt                         : 0x0
>> hw_error_syndrome                : 0x93
>> hw_syndrome_type                 : 0x0
>> vendor_error_syndrome            : 0x52
>
>
> Can you share the check that correlates to the vendor+hw syndrome?
>
>> syndrome                         : LOCAL_PROTECTION_ERROR (0x4)
>> s_wqe_opcode                     : SEND (0xa)
>
>
> That's interesting, the opcode is a send operation. I'm assuming
> that this is immediate-data write? Robert, did this happen when
> you issued >4k writes to the target?

I was running dd with oflag=direct, so yes.

----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-06-20  9:33                                                                                                   ` Sagi Grimberg
@ 2017-06-20 14:43                                                                                                       ` Robert LeBlanc
  -1 siblings, 0 replies; 171+ messages in thread
From: Robert LeBlanc @ 2017-06-20 14:43 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Leon Romanovsky, Marta Rybczynska, Max Gurtovoy,
	Christoph Hellwig, Gruher, Joseph R, shahar.salzman,
	Laurence Oberman, Riches Jr, Robert M, linux-rdma,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Jason Gunthorpe,
	Liran Liss

On Tue, Jun 20, 2017 at 3:33 AM, Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org> wrote:
>
>>>> Here the parsed output, it says that it was access to mkey which is
>>>> free.
>
>
> Missed that :)
>
>>>> ======== cqe_with_error ========
>>>> wqe_id                           : 0x0
>>>> srqn_usr_index                   : 0x0
>>>> byte_cnt                         : 0x0
>>>> hw_error_syndrome                : 0x93
>>>> hw_syndrome_type                 : 0x0
>>>> vendor_error_syndrome            : 0x52
>>>
>>>
>>> Can you share the check that correlates to the vendor+hw syndrome?
>>
>>
>> mkey.free == 1
>
>
> Hmm, the way I understand it is that the HW is trying to access
> (locally via send) a MR which was already invalidated.
>
> Thinking of this further, this can happen in a case where the target
> already completed the transaction, sent SEND_WITH_INVALIDATE but the
> original send ack was lost somewhere causing the device to retransmit
> from the MR (which was already invalidated). This is highly unlikely
> though.
>
> Shouldn't this be protected somehow by the device?
> Can someone explain why the above cannot happen? Jason? Liran? Anyone?
>
> Say host register MR (a) and send (1) from that MR to a target,
> send (1) ack got lost, and the target issues SEND_WITH_INVALIDATE
> on MR (a) and the host HCA process it, then host HCA timeout on send (1)
> so it retries, but ehh, its already invalidated.
>
> Or, we can also have a race where we destroy all our MRs when I/O
> is still running (but from the code we should be safe here).
>
> Robert, when you rebooted the target, I assume iscsi ping
> timeout expired and the connection teardown started correct?

I do remember that the ping timed out and the connection was torn down
according to the messages.

----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-06-20 12:02                                                                                                       ` Sagi Grimberg
@ 2017-06-20 17:01                                                                                                           ` Chuck Lever
  -1 siblings, 0 replies; 171+ messages in thread
From: Chuck Lever @ 2017-06-20 17:01 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Leon Romanovsky, Robert LeBlanc, Marta Rybczynska, Max Gurtovoy,
	Christoph Hellwig, Gruher, Joseph R, shahar.salzman,
	Laurence Oberman, Riches Jr, Robert M, linux-rdma,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Jason Gunthorpe,
	Liran Liss, Bart Van Assche


> On Jun 20, 2017, at 8:02 AM, Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org> wrote:
> 
> 
>>>> Can you share the check that correlates to the vendor+hw syndrome?
>>> 
>>> mkey.free == 1
>> Hmm, the way I understand it is that the HW is trying to access
>> (locally via send) a MR which was already invalidated.
>> Thinking of this further, this can happen in a case where the target
>> already completed the transaction, sent SEND_WITH_INVALIDATE but the
>> original send ack was lost somewhere causing the device to retransmit
>> from the MR (which was already invalidated). This is highly unlikely
>> though.
>> Shouldn't this be protected somehow by the device?
>> Can someone explain why the above cannot happen? Jason? Liran? Anyone?
>> Say host register MR (a) and send (1) from that MR to a target,
>> send (1) ack got lost, and the target issues SEND_WITH_INVALIDATE
>> on MR (a) and the host HCA process it, then host HCA timeout on send (1)
>> so it retries, but ehh, its already invalidated.
> 
> Well, this entire flow is broken, why should the host send the MR rkey
> to the target if it is not using it for remote access, the target
> should never have a chance to remote invalidate something it did not
> access.
> 
> I think we have a bug in iSER code, as we should not send the key
> for remote invalidation if we do inline data send...
> 
> Robert, can you try the following:
> --
> diff --git a/drivers/infiniband/ulp/iser/iser_initiator.c b/drivers/infiniband/ulp/iser/iser_initiator.c
> index 12ed62ce9ff7..2a07692007bd 100644
> --- a/drivers/infiniband/ulp/iser/iser_initiator.c
> +++ b/drivers/infiniband/ulp/iser/iser_initiator.c
> @@ -137,8 +137,10 @@ iser_prepare_write_cmd(struct iscsi_task *task,
> 
>        if (unsol_sz < edtl) {
>                hdr->flags     |= ISER_WSV;
> -               hdr->write_stag = cpu_to_be32(mem_reg->rkey);
> -               hdr->write_va   = cpu_to_be64(mem_reg->sge.addr + unsol_sz);
> +               if (buf_out->data_len > imm_sz) {
> +                       hdr->write_stag = cpu_to_be32(mem_reg->rkey);
> +                       hdr->write_va = cpu_to_be64(mem_reg->sge.addr + unsol_sz);
> +               }
> 
>                iser_dbg("Cmd itt:%d, WRITE tags, RKEY:%#.4X "
>                         "VA:%#llX + unsol:%d\n",
> --
> 
> Although, I still don't think its enough. We need to delay the
> local invalidate till we received a send completion (guarantees
> that ack was received)...
> 
> If this indeed the case, _all_ ULP initiator drivers share it because we
> never condition on a send completion in order to complete an I/O, and
> in the case of lost ack on send, looks like we need to... It *will* hurt
> performance.
> 
> What do other folks think?
> 
> CC'ing Bart, Chuck, Christoph.
> 
> Guys, for summary, I think we might have a broken behavior in the
> initiator mode drivers. We never condition send completions (for
> requests) before we complete an I/O. The issue is that the ack for those
> sends might get lost, which means that the HCA will retry them (dropped
> by the peer HCA) but if we happen to complete the I/O before, either we
> can unmap the request area, or for inline data, we invalidate it (so the
> HCA will try to access a MR which was invalidated).

So on occasion there is a Remote Access Error. That would
trigger connection loss, and the retransmitted Send request
is discarded (if there was externally exposed memory involved
with the original transaction that is now invalid).

NFS has a duplicate replay cache. If it sees a repeated RPC
XID it will send a cached reply. I guess the trick there is
to squelch remote invalidation for such retransmits to avoid
spurious Remote Access Errors. Should be rare, though.

RPC-over-RDMA uses persistent registration for its inline
buffers. The problem there is avoiding buffer reuse too soon.
Otherwise a garbled inline message is presented on retransmit.
Those would probably not be caught by the DRC.

But the real problem is preventing retransmitted Sends from
causing a ULP request to be executed multiple times.


> Signalling all send completions and also finishing I/Os only after we
> got them will add latency, and that sucks...

Typically, Sends will complete before the response arrives.
The additional cost will be handling extra interrupts, IMO.

With FRWR, won't subsequent WRs be delayed until the HCA is
done with the Send? I don't think a signal is necessary in
every case. Send Queue accounting currently relies on that.
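
(For reference, the Send Queue accounting convention I mean typically looks roughly
like the sketch below; MY_SIG_INTERVAL and my_queue are made-up names, not from any
driver. Only every Nth send is signaled, and that one completion is taken to mean the
N-1 unsignaled sends before it have left the queue as well, which is exactly the
assumption a retransmitted, unsignaled send undermines.)

#include <rdma/ib_verbs.h>

#define MY_SIG_INTERVAL 32      /* signal one send out of every 32 */

struct my_queue {
        struct ib_qp    *qp;
        unsigned int    sig_count;      /* sends posted since the last signaled one */
};

static int my_post_send(struct my_queue *q, struct ib_send_wr *wr)
{
        struct ib_send_wr *bad_wr;

        if (++q->sig_count == MY_SIG_INTERVAL) {
                wr->send_flags |= IB_SEND_SIGNALED;
                q->sig_count = 0;
        }
        return ib_post_send(q->qp, wr, &bad_wr);
}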

RPC-over-RDMA relies on the completion of Local Invalidation
to ensure that the initial Send WR is complete. For Remote
Invalidation and pure inline, there is nothing to fence that
Send.

The question I have is: how often do these Send retransmissions
occur? Is it enough to have a robust recovery mechanism, or
do we have to wire in assumptions about retransmission to
every Send operation?


--
Chuck Lever




^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-06-20 12:02                                                                                                       ` Sagi Grimberg
@ 2017-06-20 17:08                                                                                                           ` Robert LeBlanc
  -1 siblings, 0 replies; 171+ messages in thread
From: Robert LeBlanc @ 2017-06-20 17:08 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Leon Romanovsky, Marta Rybczynska, Max Gurtovoy,
	Christoph Hellwig, Gruher, Joseph R, shahar.salzman,
	Laurence Oberman, Riches Jr, Robert M, linux-rdma,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Jason Gunthorpe,
	Liran Liss, Bart Van Assche, Chuck Lever

On Tue, Jun 20, 2017 at 6:02 AM, Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org> wrote:
>
>>>> Can you share the check that correlates to the vendor+hw syndrome?
>>>
>>>
>>> mkey.free == 1
>>
>>
>> Hmm, the way I understand it is that the HW is trying to access
>> (locally via send) a MR which was already invalidated.
>>
>> Thinking of this further, this can happen in a case where the target
>> already completed the transaction, sent SEND_WITH_INVALIDATE but the
>> original send ack was lost somewhere causing the device to retransmit
>> from the MR (which was already invalidated). This is highly unlikely
>> though.
>>
>> Shouldn't this be protected somehow by the device?
>> Can someone explain why the above cannot happen? Jason? Liran? Anyone?
>>
>> Say host register MR (a) and send (1) from that MR to a target,
>> send (1) ack got lost, and the target issues SEND_WITH_INVALIDATE
>> on MR (a) and the host HCA process it, then host HCA timeout on send (1)
>> so it retries, but ehh, its already invalidated.
>
>
> Well, this entire flow is broken, why should the host send the MR rkey
> to the target if it is not using it for remote access, the target
> should never have a chance to remote invalidate something it did not
> access.
>
> I think we have a bug in iSER code, as we should not send the key
> for remote invalidation if we do inline data send...
>
> Robert, can you try the following:
> --
> diff --git a/drivers/infiniband/ulp/iser/iser_initiator.c
> b/drivers/infiniband/ulp/iser/iser_initiator.c
> index 12ed62ce9ff7..2a07692007bd 100644
> --- a/drivers/infiniband/ulp/iser/iser_initiator.c
> +++ b/drivers/infiniband/ulp/iser/iser_initiator.c
> @@ -137,8 +137,10 @@ iser_prepare_write_cmd(struct iscsi_task *task,
>
>         if (unsol_sz < edtl) {
>                 hdr->flags     |= ISER_WSV;
> -               hdr->write_stag = cpu_to_be32(mem_reg->rkey);
> -               hdr->write_va   = cpu_to_be64(mem_reg->sge.addr + unsol_sz);
> +               if (buf_out->data_len > imm_sz) {
> +                       hdr->write_stag = cpu_to_be32(mem_reg->rkey);
> +                       hdr->write_va = cpu_to_be64(mem_reg->sge.addr +
> unsol_sz);
> +               }
>
>                 iser_dbg("Cmd itt:%d, WRITE tags, RKEY:%#.4X "
>                          "VA:%#llX + unsol:%d\n",
> --
>
> Although, I still don't think its enough. We need to delay the
> local invalidate till we received a send completion (guarantees
> that ack was received)...
>
> If this indeed the case, _all_ ULP initiator drivers share it because we
> never condition on a send completion in order to complete an I/O, and
> in the case of lost ack on send, looks like we need to... It *will* hurt
> performance.
>
> What do other folks think?
>
> CC'ing Bart, Chuck, Christoph.
>
> Guys, for summary, I think we might have a broken behavior in the
> initiator mode drivers. We never condition send completions (for
> requests) before we complete an I/O. The issue is that the ack for those
> sends might get lost, which means that the HCA will retry them (dropped
> by the peer HCA) but if we happen to complete the I/O before, either we
> can unmap the request area, or for inline data, we invalidate it (so the
> HCA will try to access a MR which was invalidated).
>
> Signalling all send completions and also finishing I/Os only after we
> got them will add latency, and that sucks...

Testing this patch I didn't see these new messages even when rebooting
the targets multiple times. It also resolved some performance problems
I was seeing (I think our switches are having bugs with IPv6 and
routing) and I was receiving expected performance. At one point in the
test, one target (4.9.33) showed:
[Tue Jun 20 10:11:20 2017] iSCSI Login timeout on Network Portal [::]:3260
[Tue Jun 20 10:11:39 2017] iSCSI Login timeout on Network Portal [::]:3260
[Tue Jun 20 10:11:58 2017] isert: isert_print_wc: login send failure:
transport retry counter exceeded (12) vend_err 81

After this and a reboot of the target, the initiator would drop the
connection after 1.5-2 minutes, then faster and faster until it was
every 5 seconds. It is almost like it set up the connection and then
lost the first ping, or the ping wasn't set up right. I tried rebooting
the target multiple times.

I tried to log out the "bad" session and got a backtrace.

[Tue Jun 20 10:30:08 2017] ------------[ cut here ]------------
[Tue Jun 20 10:30:08 2017] WARNING: CPU: 20 PID: 783 at
fs/sysfs/group.c:237 sysfs_remove_group+0x82/0x90
[Tue Jun 20 10:30:08 2017] Modules linked in: ib_iser rdma_ucm ib_ucm
ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx5_ib mlx4_ib ib_core 8021q
garp mrp sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp
kvm_intel kvm iTCO_wdt iTCO_vendor_support mei_me irqbypass
crct10dif_pclmul mei crc32_pclmul ioatdma ghash_clmulni_intel lpc_ich
i2c_i801 pcbc mfd_core aesni_intel crypto_simd glue_helper cryptd
joydev sg pcspkr shpchp wmi ipmi_si ipmi_devintf ipmi_msghandler
acpi_power_meter ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 acpi_pad
nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter nfsd
auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c raid1
dm_service_time sd_mod mlx4_en be2iscsi bnx2i cnic uio qla4xxx
iscsi_boot_sysfs ast drm_kms_helper mlx5_core syscopyarea sysfillrect
sysimgblt fb_sys_fops ttm
[Tue Jun 20 10:30:08 2017]  drm mlx4_core igb ahci libahci ptp libata
pps_core dca i2c_algo_bit dm_multipath dm_mirror dm_region_hash dm_log
dm_mod dax
[Tue Jun 20 10:30:08 2017] CPU: 20 PID: 783 Comm: kworker/u64:2 Not
tainted 4.12.0-rc6+ #5
[Tue Jun 20 10:30:08 2017] Hardware name: Supermicro
SYS-6028TP-HTFR/X10DRT-PIBF, BIOS 1.1 08/03/2015
[Tue Jun 20 10:30:08 2017] Workqueue: scsi_wq_12 __iscsi_unbind_session
[Tue Jun 20 10:30:08 2017] task: ffff887f5c45cb00 task.stack: ffffc90032ef4000
[Tue Jun 20 10:30:08 2017] RIP: 0010:sysfs_remove_group+0x82/0x90
[Tue Jun 20 10:30:08 2017] RSP: 0018:ffffc90032ef7d18 EFLAGS: 00010246
[Tue Jun 20 10:30:08 2017] RAX: 0000000000000038 RBX: 0000000000000000
RCX: 0000000000000000
[Tue Jun 20 10:30:08 2017] RDX: 0000000000000000 RSI: ffff883f7fd0e068
RDI: ffff883f7fd0e068
[Tue Jun 20 10:30:08 2017] RBP: ffffc90032ef7d30 R08: 0000000000000000
R09: 0000000000000676
[Tue Jun 20 10:30:08 2017] R10: 00000000000003ff R11: 0000000000000001
R12: ffffffff81da8a40
[Tue Jun 20 10:30:08 2017] R13: ffff887f52ec0838 R14: ffff887f52ec08d8
R15: ffff883f4c611000
[Tue Jun 20 10:30:08 2017] FS:  0000000000000000(0000)
GS:ffff883f7fd00000(0000) knlGS:0000000000000000
[Tue Jun 20 10:30:08 2017] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Tue Jun 20 10:30:08 2017] CR2: 000055eef886b398 CR3: 0000000001c09000
CR4: 00000000001406e0
[Tue Jun 20 10:30:08 2017] Call Trace:
[Tue Jun 20 10:30:08 2017]  dpm_sysfs_remove+0x57/0x60
[Tue Jun 20 10:30:08 2017]  device_del+0x107/0x330
[Tue Jun 20 10:30:08 2017]  scsi_target_reap_ref_release+0x2d/0x40
[Tue Jun 20 10:30:08 2017]  scsi_target_reap+0x2e/0x40
[Tue Jun 20 10:30:08 2017]  scsi_remove_target+0x197/0x1b0
[Tue Jun 20 10:30:08 2017]  __iscsi_unbind_session+0xbe/0x170
[Tue Jun 20 10:30:08 2017]  process_one_work+0x149/0x360
[Tue Jun 20 10:30:08 2017]  worker_thread+0x4d/0x3c0
[Tue Jun 20 10:30:08 2017]  kthread+0x109/0x140
[Tue Jun 20 10:30:08 2017]  ? rescuer_thread+0x380/0x380
[Tue Jun 20 10:30:08 2017]  ? kthread_park+0x60/0x60
[Tue Jun 20 10:30:08 2017]  ret_from_fork+0x25/0x30
[Tue Jun 20 10:30:08 2017] Code: d5 c0 ff ff 5b 41 5c 41 5d 5d c3 48
89 df e8 66 bd ff ff eb c6 49 8b 55 00 49 8b 34 24 48 c7 c7 d0 3c a7 81
31 c0 e8 3c 01 ee ff <0f> ff eb d5 66 2e 0f 1f 84 00 00 00 00 00 0f 1f
44 00 00 55 48
[Tue Jun 20 10:30:08 2017] ---[ end trace 6161f21139b6a1ea ]---

Logging back into that target didn't help stabilize the connection. I
rebooted both the initiator and the targets to clear things up, and after
the initiator went down, the target showed the timeout message again. It
seems something got out of whack, never recovered, and "poisoned"
the other node in the process.

I'll test Max's patch now and report back.

----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-06-20 17:01                                                                                                           ` Chuck Lever
@ 2017-06-20 17:12                                                                                                               ` Sagi Grimberg
  -1 siblings, 0 replies; 171+ messages in thread
From: Sagi Grimberg @ 2017-06-20 17:12 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Leon Romanovsky, Robert LeBlanc, Marta Rybczynska, Max Gurtovoy,
	Christoph Hellwig, Gruher, Joseph R, shahar.salzman,
	Laurence Oberman, Riches Jr, Robert M, linux-rdma,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Jason Gunthorpe,
	Liran Liss, Bart Van Assche


> So on occasion there is a Remote Access Error. That would
> trigger connection loss, and the retransmitted Send request
> is discarded (if there was externally exposed memory involved
> with the original transaction that is now invalid).

I'm actually not concerned about the remote invalidation; it is
good that it's discarded or failed. It's the inline sends that
are a bug here.

> But the real problem is preventing retransmitted Sends from
> causing a ULP request to be executed multiple times.

Exactly.

>> Signalling all send completions and also finishing I/Os only after we
>> got them will add latency, and that sucks...
> 
> Typically, Sends will complete before the response arrives.
> The additional cost will be handling extra interrupts, IMO.

Not quite, heavy traffic _can_ result in dropped acks; my gut
feeling is that it happens more often than we suspect.

And yeah, extra interrupts, extra cachelines, extra state,
but I do not see any other way around it.

> With FRWR, won't subsequent WRs be delayed until the HCA is
> done with the Send? I don't think a signal is necessary in
> every case. Send Queue accounting currently relies on that.

Not really; the Send after the FRWR might have a fence (not a strongly
ordered one), and CX3/CX4 strongly order FRWR, so for them that is a
non-issue. The problem is that ULPs can't rely on it.

> RPC-over-RDMA relies on the completion of Local Invalidation
> to ensure that the initial Send WR is complete.

Wait, is that guaranteed?

> For Remote
> Invalidation and pure inline, there is nothing to fence that
> Send.
> 
> The question I have is: how often do these Send retransmissions
> occur? Is it enough to have a robust recovery mechanism, or
> do we have to wire in assumptions about retransmission to
> every Send operation?

Even if it's rare, we don't have any way to protect against devices
retrying the send operation.

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-06-20 17:08                                                                                                           ` Robert LeBlanc
@ 2017-06-20 17:19                                                                                                               ` Sagi Grimberg
  -1 siblings, 0 replies; 171+ messages in thread
From: Sagi Grimberg @ 2017-06-20 17:19 UTC (permalink / raw)
  To: Robert LeBlanc
  Cc: Leon Romanovsky, Marta Rybczynska, Max Gurtovoy,
	Christoph Hellwig, Gruher, Joseph R, shahar.salzman,
	Laurence Oberman, Riches Jr, Robert M, linux-rdma,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Jason Gunthorpe,
	Liran Liss, Bart Van Assche, Chuck Lever


> Testing this patch I didn't see these new messages even when rebooting
> the targets multiple times. It also resolved some performance problems
> I was seeing (I think our switches are having bugs with IPv6 and
> routing) and I was receiving expected performance. At one point in the
> test, one target (4.9.33) showed:
> [Tue Jun 20 10:11:20 2017] iSCSI Login timeout on Network Portal [::]:3260
> [Tue Jun 20 10:11:39 2017] iSCSI Login timeout on Network Portal [::]:3260
> [Tue Jun 20 10:11:58 2017] isert: isert_print_wc: login send failure:
> transport retry counter exceeded (12) vend_err 81

I don't understand, is this new with the patch applied?

> After this and a reboot of the target, the initiator would drop the
> connection after 1.5-2 minutes then faster and faster until it was
> every 5 seconds. It is almost like it set up the connection then lose
> the first ping, or the ping wasn't set-up right. I tried rebooting the
> target multiple times.

So the initiator could not recover even after the target was available
again?

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-06-20 17:19                                                                                                               ` Sagi Grimberg
@ 2017-06-20 17:28                                                                                                                   ` Robert LeBlanc
  -1 siblings, 0 replies; 171+ messages in thread
From: Robert LeBlanc @ 2017-06-20 17:28 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Leon Romanovsky, Marta Rybczynska, Max Gurtovoy,
	Christoph Hellwig, Gruher, Joseph R, shahar.salzman,
	Laurence Oberman, Riches Jr, Robert M, linux-rdma,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Jason Gunthorpe,
	Liran Liss, Bart Van Assche, Chuck Lever

On Tue, Jun 20, 2017 at 11:19 AM, Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org> wrote:
>
>> Testing this patch I didn't see these new messages even when rebooting
>> the targets multiple times. It also resolved some performance problems
>> I was seeing (I think our switches are having bugs with IPv6 and
>> routing) and I was receiving expected performance. At one point in the
>> test, one target (4.9.33) showed:
>> [Tue Jun 20 10:11:20 2017] iSCSI Login timeout on Network Portal [::]:3260
>> [Tue Jun 20 10:11:39 2017] iSCSI Login timeout on Network Portal [::]:3260
>> [Tue Jun 20 10:11:58 2017] isert: isert_print_wc: login send failure:
>> transport retry counter exceeded (12) vend_err 81
>
>
> I don't understand, is this new with the patch applied?

I applied your patch to 4.12-rc6 on the initiator, but my targets are
still 4.9.33 since it looked like the patch only affected the
initiator. I did not see this before your patch, but I also didn't try
rebooting the targets multiple times before because of the previous
messages.

>> After this and a reboot of the target, the initiator would drop the
>> connection after 1.5-2 minutes then faster and faster until it was
>> every 5 seconds. It is almost like it set up the connection then lost
>> the first ping, or the ping wasn't set up right. I tried rebooting the
>> target multiple times.
>
>
> So the initiator could not recover even after the target was available
> again?

The initiator recovered the connection when the target came back, but
the connection was not stable. I/O would happen on the connection,
then it would get shaky and then finally disconnect. Then it would
reconnect, pass more I/O, then get shaky and go down again. With the
5-second disconnects, it would pass traffic for 5 seconds, then as soon
as I saw the ping timeout, the I/O would stop until it reconnected. At
that point it seems that the lack of pings would kill the I/O, unlike
earlier, where there was a stall in I/O and then the connection would
be torn down. I can try to see if I can get it to happen again.

----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-06-20 17:01                                                                                                           ` Chuck Lever
@ 2017-06-20 17:35                                                                                                               ` Jason Gunthorpe
  -1 siblings, 0 replies; 171+ messages in thread
From: Jason Gunthorpe @ 2017-06-20 17:35 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Sagi Grimberg, Leon Romanovsky, Robert LeBlanc, Marta Rybczynska,
	Max Gurtovoy, Christoph Hellwig, Gruher, Joseph R,
	shahar.salzman, Laurence Oberman, Riches Jr, Robert M,
	linux-rdma, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	Liran Liss, Bart Van Assche

On Tue, Jun 20, 2017 at 01:01:39PM -0400, Chuck Lever wrote:

> >> Shouldn't this be protected somehow by the device?
> >> Can someone explain why the above cannot happen? Jason? Liran? Anyone?
> >> Say host register MR (a) and send (1) from that MR to a target,
> >> send (1) ack got lost, and the target issues SEND_WITH_INVALIDATE
> >> on MR (a) and the host HCA process it, then host HCA timeout on send (1)
> >> so it retries, but ehh, its already invalidated.

I'm not sure I understand the example.. but...

If you pass a MR key to a send, then that MR must remain valid until
the send completion is implied by an observation on the CQ. The HCA is
free to re-execute the SEND against the MR at any time up until the
completion reaches the CQ.

As I've explained before, a ULP must not use 'implied completion', eg
a receive that could only have happened if the far side got the
send. In particular this means it cannot use an incoming SEND_INV/etc
to invalidate an MR associated with a local SEND, as that is a form
of 'implied completion'.

For sanity a MR associated with a local send should not be remote
accessible at all, and shouldn't even have a 'rkey', just a 'lkey'.

Similarly, you cannot use a MR with SEND and remote access sanely, as
the far end could corrupt or invalidate the MR while the local HCA is
still using it.

> So on occasion there is a Remote Access Error. That would
> trigger connection loss, and the retransmitted Send request
> is discarded (if there was externally exposed memory involved
> with the original transaction that is now invalid).

Once you get a connection loss I would think the state of all the MRs
needs to be resync'd. Running through the CQ should indicate which ones
are invalidated and which ones are still good.

> NFS has a duplicate replay cache. If it sees a repeated RPC
> XID it will send a cached reply. I guess the trick there is
> to squelch remote invalidation for such retransmits to avoid
> spurious Remote Access Errors. Should be rare, though.

.. and because of the above if a RPC is re-issued it must be re-issued
with corrected, now-valid rkeys, and the sender must somehow detect
that the far side dropped it for replay and tear down the MRs.

> RPC-over-RDMA uses persistent registration for its inline
> buffers. The problem there is avoiding buffer reuse too soon.
> Otherwise a garbled inline message is presented on retransmit.
> Those would probably not be caught by the DRC.

We've had this discussion on the list before. You can *never* re-use a
SEND, or RDMA WRITE buffer until you observe the HCA is done with it
via a CQ poll.

> But the real problem is preventing retransmitted Sends from
> causing a ULP request to be executed multiple times.

IB RC guarantees single delivery for SEND, so that doesn't seem
possible unless the ULP re-transmits the SEND on a new QP.

> > Signalling all send completions and also finishing I/Os only after
> > we got them will add latency, and that sucks...

There is no choice, you *MUST* see the send completion before
reclaiming any resources associated with the send. Only the
completion guarantees that the HCA will not resend the packet or
otherwise continue to use the resources.
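
As a rough, untested sketch of that rule against the kernel verbs API
(the my_send_ctx/my_send_done names are made up here, not taken from
any ULP in this thread), the payload is reclaimed only from the send
completion handler:

#include <linux/kernel.h>
#include <linux/slab.h>
#include <rdma/ib_verbs.h>

struct my_send_ctx {
	struct ib_cqe cqe;       /* routes the CQE back to my_send_done() */
	struct ib_device *dev;
	void *buf;               /* SEND payload */
	u64 dma_addr;
	size_t len;
};

static void my_send_done(struct ib_cq *cq, struct ib_wc *wc)
{
	struct my_send_ctx *ctx =
		container_of(wc->wr_cqe, struct my_send_ctx, cqe);

	/* Only here is the HCA guaranteed to be done with ctx->buf. */
	ib_dma_unmap_single(ctx->dev, ctx->dma_addr, ctx->len, DMA_TO_DEVICE);
	kfree(ctx->buf);
	kfree(ctx);
}

static int my_post_send(struct ib_qp *qp, struct ib_pd *pd,
			struct my_send_ctx *ctx)
{
	struct ib_sge sge = {
		.addr   = ctx->dma_addr,
		.length = ctx->len,
		.lkey   = pd->local_dma_lkey,   /* never invalidated */
	};
	struct ib_send_wr wr = {
		.wr_cqe     = &ctx->cqe,
		.sg_list    = &sge,
		.num_sge    = 1,
		.opcode     = IB_WR_SEND,
		.send_flags = IB_SEND_SIGNALED, /* force a CQE for this WR */
	};
	struct ib_send_wr *bad_wr;

	ctx->cqe.done = my_send_done;
	return ib_post_send(qp, &wr, &bad_wr);
}

Since the buffer sits behind the PD's local_dma_lkey, only the CQE (and
not any invalidate) gates its reuse.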

> With FRWR, won't subsequent WRs be delayed until the HCA is
> done with the Send? I don't think a signal is necessary in
> every case. Send Queue accounting currently relies on that.

No. The SQ side is asynchronous to the CQ side, the HCA will pipeline
send packets on the wire up to some internal limit.

Only the local state changed by FRWR related op codes happens
sequentially with other SQ work.

Jason

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-06-20 17:35                                                                                                               ` Jason Gunthorpe
@ 2017-06-20 18:17                                                                                                                   ` Chuck Lever
  -1 siblings, 0 replies; 171+ messages in thread
From: Chuck Lever @ 2017-06-20 18:17 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Sagi Grimberg, Leon Romanovsky, Robert LeBlanc, Marta Rybczynska,
	Max Gurtovoy, Christoph Hellwig, Gruher, Joseph R,
	shahar.salzman, Laurence Oberman, Riches Jr, Robert M,
	linux-rdma, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	Liran Liss, Bart Van Assche


> On Jun 20, 2017, at 1:35 PM, Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
> 
> On Tue, Jun 20, 2017 at 01:01:39PM -0400, Chuck Lever wrote:
> 
>>>> Shouldn't this be protected somehow by the device?
>>>> Can someone explain why the above cannot happen? Jason? Liran? Anyone?
>>>> Say host register MR (a) and send (1) from that MR to a target,
>>>> send (1) ack got lost, and the target issues SEND_WITH_INVALIDATE
>>>> on MR (a) and the host HCA process it, then host HCA timeout on send (1)
>>>> so it retries, but ehh, its already invalidated.
> 
> I'm not sure I understand the example.. but...
> 
> If you pass a MR key to a send, then that MR must remain valid until
> the send completion is implied by an observation on the CQ. The HCA is
> free to re-execute the SEND against the MR at any time up until the
> completion reaches the CQ.
> 
> As I've explained before, a ULP must not use 'implied completion', eg
> a receive that could only have happened if the far side got the
> send. In particular this means it cannot use an incoming SEND_INV/etc
> to invalidate an MR associated with a local SEND, as that is a form
> of 'implied completion'
> 
> For sanity a MR associated with a local send should not be remote
> accessible at all, and shouldn't even have a 'rkey', just a 'lkey'.
> 
> Similarly, you cannot use a MR with SEND and remote access sanely, as
> the far end could corrupt or invalidate the MR while the local HCA is
> still using it.
> 
>> So on occasion there is a Remote Access Error. That would
>> trigger connection loss, and the retransmitted Send request
>> is discarded (if there was externally exposed memory involved
>> with the original transaction that is now invalid).
> 
> Once you get a connection loss I would think the state of all the MRs
> needs to be resync'd. Running through the CQ should indicate which ones
> are invalidated and which ones are still good.
> 
>> NFS has a duplicate replay cache. If it sees a repeated RPC
>> XID it will send a cached reply. I guess the trick there is
>> to squelch remote invalidation for such retransmits to avoid
>> spurious Remote Access Errors. Should be rare, though.
> 
> .. and because of the above if a RPC is re-issued it must be re-issued
> with corrected, now-valid rkeys, and the sender must somehow detect
> that the far side dropped it for replay and tear down the MRs.

Yes, if RPC-over-RDMA ULP is involved, any externally accessible
memory will be re-registered before an RPC retransmission.

The concern is whether a retransmitted Send will be exposed
to the receiving ULP. Below you imply that it will not be, so
perhaps this is not a concern after all.


>> RPC-over-RDMA uses persistent registration for its inline
>> buffers. The problem there is avoiding buffer reuse too soon.
>> Otherwise a garbled inline message is presented on retransmit.
>> Those would probably not be caught by the DRC.
> 
> We've had this discussion on the list before. You can *never* re-use a
> SEND, or RDMA WRITE buffer until you observe the HCA is done with it
> via a CQ poll.

RPC-over-RDMA is careful to invalidate buffers that are the
target of RDMA Write before RPC completion, as we have
discussed before.

Sends are assumed to be complete when a LocalInv completes.

When we had this discussion before, you explained the problem
with retransmitted Sends, but it appears that all the ULPs we
have operate without Send completion. Others whom I trust have
suggested that operating without that extra interrupt is
preferred. The client has operated this way since it was added
to the kernel almost 10 years ago.

So I took it as a "in a perfect world" kind of admonition.
You are making a stronger and more normative assertion here.


>> But the real problem is preventing retransmitted Sends from
>> causing a ULP request to be executed multiple times.
> 
> IB RC guarantees single delivery for SEND, so that doesn't seem
> possible unless the ULP re-transmits the SEND on a new QP.
> 
>>> Signalling all send completions and also finishing I/Os only after
>>> we got them will add latency, and that sucks...
> 
> There is no choice, you *MUST* see the send completion before
> reclaiming any resources associated with the send. Only the
> completion guarantees that the HCA will not resend the packet or
> otherwise continue to use the resources.

On the NFS server side, I believe every Send is signaled.

On the NFS client side, we assume LocalInv completion is
good enough.


>> With FRWR, won't subsequent WRs be delayed until the HCA is
>> done with the Send? I don't think a signal is necessary in
>> every case. Send Queue accounting currently relies on that.
> 
> No. The SQ side is asynchronous to the CQ side, the HCA will pipeline
> send packets on the wire up to some internal limit.

So if my ULP issues FastReg followed by Send followed by
LocalInv (signaled), I can't rely on the LocalInv completion
to imply that the Send is also complete?


> Only the local state changed by FRWR related op codes happens
> sequentially with other SQ work.


--
Chuck Lever




^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-06-20 18:17                                                                                                                   ` Chuck Lever
@ 2017-06-20 19:27                                                                                                                       ` Jason Gunthorpe
  -1 siblings, 0 replies; 171+ messages in thread
From: Jason Gunthorpe @ 2017-06-20 19:27 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Sagi Grimberg, Leon Romanovsky, Robert LeBlanc, Marta Rybczynska,
	Max Gurtovoy, Christoph Hellwig, Gruher, Joseph R,
	shahar.salzman, Laurence Oberman, Riches Jr, Robert M,
	linux-rdma, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	Liran Liss, Bart Van Assche

On Tue, Jun 20, 2017 at 02:17:39PM -0400, Chuck Lever wrote:

> The concern is whether a retransmitted Send will be exposed
> to the receiving ULP. Below you imply that it will not be, so
> perhaps this is not a concern after all.

A retransmitted SEND will never be exposed to the Receiver ULP for
Reliable Connected. That is part of the guarantee.

> > We've had this discussion on the list before. You can *never* re-use a
> > SEND, or RDMA WRITE buffer until you observe the HCA is done with it
> > via a CQ poll.
> 
> RPC-over-RDMA is careful to invalidate buffers that are the
> target of RDMA Write before RPC completion, as we have
> discussed before.
> 
> Sends are assumed to be complete when a LocalInv completes.
> 
> When we had this discussion before, you explained the problem
> with retransmitted Sends, but it appears that all the ULPs we
> have operate without Send completion. Others whom I trust have
> suggested that operating without that extra interrupt is

Operating without the interrupt is of course preferred, but that means
you have to defer the invalidate for MR's referred to by SEND until a
CQ observation as well.

> preferred. The client has operated this way since it was added
> to the kernel almost 10 years ago.

I thought the use of MR's with SEND was a new invention? If you use
the local rdma lkey with send, it is never invalidated, and this is
not an issue, which IIRC, was the historical configuration for NFS.

> So I took it as a "in a perfect world" kind of admonition.
> You are making a stronger and more normative assertion here.

All ULPs must have periodic (related to SQ depth) signaled completions
or some of our supported hardware will explode.

All ULPs must flow control additions to the SQ based on CQ feedback,
or they will fail under load with SQ overflows, if this is done, then
the above happens correctly for free.

All ULPs must ensure SEND/RDMA Write resources remain stable until the
CQ indicates that work is completed. 'In a perfect world' this
includes not changing the source memory as that would cause
retransmitted packets to be different.

All ULPs must ensure the lkey remains valid until the CQ confirms
the work is done. This is not important if the lkey is always the
local rdma lkey, which is always valid.
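
A minimal, untested sketch of the first two rules above (periodic
signaling plus CQ-driven SQ accounting); the my_queue structure and
helpers are made up for illustration and are not thread-safe:

#include <linux/atomic.h>
#include <rdma/ib_verbs.h>

struct my_queue {
	struct ib_qp *qp;
	u32 sq_depth;       /* max_send_wr the QP was created with */
	atomic_t sq_free;   /* free SQ slots, initialized to sq_depth */
	u32 unsignaled;     /* WRs posted since the last signaled one */
};

static int my_queue_post(struct my_queue *q, struct ib_send_wr *wr)
{
	struct ib_send_wr *bad_wr;

	/* Flow control: never allow more than sq_depth WRs outstanding
	 * between posting and the CQE that retires them. */
	if (atomic_dec_if_positive(&q->sq_free) < 0)
		return -EAGAIN; /* caller retries after the next completion */

	/* Signal at least every sq_depth/2 WRs so the SQ can be reaped. */
	if (++q->unsignaled >= q->sq_depth / 2) {
		wr->send_flags |= IB_SEND_SIGNALED;
		q->unsignaled = 0;
	}
	return ib_post_send(q->qp, wr, &bad_wr);
}

/* From the send completion handler: one signaled CQE also retires all
 * unsignaled WRs posted before it since the previous signaled one. */
static void my_queue_reap(struct my_queue *q, u32 nr_retired)
{
	atomic_add(nr_retired, &q->sq_free);
}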

> > No. The SQ side is asynchronous to the CQ side, the HCA will pipeline
> > send packets on the wire up to some internal limit.
> 
> So if my ULP issues FastReg followed by Send followed by
> LocalInv (signaled), I can't rely on the LocalInv completion
> to imply that the Send is also complete?

Correct.

This is explicitly defined in Table 79 of the IBA.

It describes the ordering requirements, if you order Send followed by
LocalInv the ordering is 'L' which means they are not ordered unless
the WR has the Local Invalidate Fence bit set.

LIF is an optional feature, I do not know if any of our hardware
supports it, but it is defined to cause the local invalidate to wait
until all ongoing references to the MR are completed.

No idea on the relative performance of LIF vs doing it manually, but
the need for one or the other is unambiguously clear in the spec.
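
For illustration only, an untested sketch of chaining the FRWR
registration with a Send that is signaled in its own right, instead of
inferring its completion from a later LocalInv; the helper name and
parameters are made up, and ib_map_mr_sg()/ib_update_fast_reg_key() are
assumed to have already been done by the caller:

#include <rdma/ib_verbs.h>

static int my_post_fastreg_then_send(struct ib_qp *qp, struct ib_mr *mr,
				     struct ib_cqe *send_cqe,
				     struct ib_sge *inline_sge)
{
	struct ib_reg_wr reg = {
		.wr = {
			.opcode = IB_WR_REG_MR,   /* FRWR registration */
		},
		.mr     = mr,
		.key    = mr->rkey,
		.access = IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE,
	};
	struct ib_send_wr send = {
		.wr_cqe     = send_cqe,           /* Send gets its own CQE... */
		.sg_list    = inline_sge,         /* ...covering this buffer */
		.num_sge    = 1,
		.opcode     = IB_WR_SEND,
		.send_flags = IB_SEND_SIGNALED,
	};
	struct ib_send_wr *bad_wr;

	reg.wr.next = &send;    /* both WRs go in with one ib_post_send() */
	return ib_post_send(qp, &reg.wr, &bad_wr);
}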

Why are you invalidating lkeys anyhow, that doesn't seem like something
that needs to happen synchronously.

Jason

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-06-20 19:27                                                                                                                       ` Jason Gunthorpe
@ 2017-06-20 20:56                                                                                                                           ` Chuck Lever
  -1 siblings, 0 replies; 171+ messages in thread
From: Chuck Lever @ 2017-06-20 20:56 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Sagi Grimberg, Leon Romanovsky, Robert LeBlanc, Marta Rybczynska,
	Max Gurtovoy, Christoph Hellwig, Gruher, Joseph R,
	shahar.salzman, Laurence Oberman, Riches Jr, Robert M,
	linux-rdma, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	Liran Liss, Bart Van Assche


> On Jun 20, 2017, at 3:27 PM, Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
> 
> On Tue, Jun 20, 2017 at 02:17:39PM -0400, Chuck Lever wrote:
> 
>> The concern is whether a retransmitted Send will be exposed
>> to the receiving ULP. Below you imply that it will not be, so
>> perhaps this is not a concern after all.
> 
> A retransmitted SEND will never be exposed to the Receiver ULP for
> Reliable Connected. That is part of the guarantee.
> 
>>> We've had this discussion on the list before. You can *never* re-use a
>>> SEND, or RDMA WRITE buffer until you observe the HCA is done with it
>>> via a CQ poll.
>> 
>> RPC-over-RDMA is careful to invalidate buffers that are the
>> target of RDMA Write before RPC completion, as we have
>> discussed before.
>> 
>> Sends are assumed to be complete when a LocalInv completes.
>> 
>> When we had this discussion before, you explained the problem
>> with retransmitted Sends, but it appears that all the ULPs we
>> have operate without Send completion. Others whom I trust have
>> suggested that operating without that extra interrupt is
> 
> Operating without the interrupt is of course preferred, but that means
> you have to defer the invalidate for MR's referred to by SEND until a
> CQ observation as well.
> 
>> preferred. The client has operated this way since it was added
>> to the kernel almost 10 years ago.
> 
> I thought the use of MR's with SEND was a new invention? If you use
> the local rdma lkey with send, it is never invalidated, and this is
> not an issue, which IIRC, was the historical configuration for NFS.

We may be conflating things a bit.

RPC-over-RDMA client uses persistently registered buffers, using
the lkey, for inline data. The use of MRs is reserved for NFS READ
and WRITE payloads. The inline buffers are never explicitly
invalidated by RPC-over-RDMA.


>> So I took it as a "in a perfect world" kind of admonition.
>> You are making a stronger and more normative assertion here.
> 
> All ULPs must have periodic (related to SQ depth) signaled completions
> or some of our supported hardware will explode.

RPC-over-RDMA client does that.


> All ULPs must flow control additions to the SQ based on CQ feedback,
> or they will fail under load with SQ overflows, if this is done, then
> the above happens correctly for free.

RPC-over-RDMA client does that.


> All ULPs must ensure SEND/RDMA Write resources remain stable until the
> CQ indicates that work is completed. 'In a perfect world' this
> includes not changing the source memory as that would cause
> retransmitted packets to be different.

I assume you mean the sending side (the server) for RDMA
Write. I believe rdma_rw uses the local rdma lkey by default
for RDMA Write source buffers.


> All ULPs must ensure the lkey remains valid until the CQ confirms
> the work is done. This is not important if the lkey is always the
> local rdma lkey, which is always valid.

As above, Send buffers use the local rdma lkey.


>>> No. The SQ side is asynchronous to the CQ side, the HCA will pipeline
>>> send packets on the wire up to some internal limit.
>> 
>> So if my ULP issues FastReg followed by Send followed by
>> LocalInv (signaled), I can't rely on the LocalInv completion
>> to imply that the Send is also complete?
> 
> Correct.
> 
> This is explicitly defined in Table 79 of the IBA.
> 
> It describes the ordering requirements, if you order Send followed by
> LocalInv the ordering is 'L' which means they are not ordered unless
> the WR has the Local Invalidate Fence bit set.
> 
> LIF is an optional feature, I do not know if any of our hardware
> supports it, but it is defined to cause the local invalidate to wait
> until all ongoing references to the MR are completed.

Now, since there was confusion about using an MR for a
Send operation, let me clarify. If the client does:

FastReg(payload buffer)
Send(inline buffer)
...
Recv
LocalInv(payload buffer)
wait for LI completion

Is setting IB_SEND_FENCE on the LocalInv enough to ensure
that the Send is complete?

cscope seems to suggest all our devices support IB_SEND_FENCE.
Sagi mentioned some devices do this fencing automatically.


> No idea on the relative performance of LIF vs doing it manually, but
> the need for one or the other is unambiguously clear in the spec.

It seems to me that the guarantee that the server sees
only one copy of the Send payload is good enough. That
means that by the time Recv completion occurs on the
client, even if the client HCA still thinks it needs to
retransmit the Send containing the RPC Call, the server
ULP has already seen and processed that Send payload,
and the HCA on the server won't deliver that payload a
second time.

The RPC Reply is evidence that the server saw the correct
RPC Call message payload, and the client always preserves
the Send's inline buffer until the reply has been received.

If the only concern about preserving that inline buffer is
guaranteeing that retransmits contain the same content, I
don't think we have a problem. All HCA retransmits of an
RPC Call, until the matching RPC Reply is received on the
client, will contain the same content.

The issue about the HCA not being able to access the inline
buffer during a retransmit is also not an issue for RPC-
over-RDMA because these buffers are always registered with
the local rdma lkey.


> Why are you invalidating lkeys anyhow, that doesn't seem like something
> that needs to happen synchronously.


--
Chuck Lever




^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-06-20 20:56                                                                                                                           ` Chuck Lever
@ 2017-06-20 21:19                                                                                                                               ` Jason Gunthorpe
  -1 siblings, 0 replies; 171+ messages in thread
From: Jason Gunthorpe @ 2017-06-20 21:19 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Sagi Grimberg, Leon Romanovsky, Robert LeBlanc, Marta Rybczynska,
	Max Gurtovoy, Christoph Hellwig, Gruher, Joseph R,
	shahar.salzman, Laurence Oberman, Riches Jr, Robert M,
	linux-rdma, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	Liran Liss, Bart Van Assche

On Tue, Jun 20, 2017 at 04:56:39PM -0400, Chuck Lever wrote:

> > I thought the use of MR's with SEND was a new invention? If you use
> > the local rdma lkey with send, it is never invalidated, and this is
> > not an issue, which IIRC, was the historical configuration for NFS.
> 
> We may be conflating things a bit.
> 
> RPC-over-RDMA client uses persistently registered buffers, using
> the lkey, for inline data. The use of MRs is reserved for NFS READ
> and WRITE payloads. The inline buffers are never explicitly
> invalidated by RPC-over-RDMA.

That makes much more sense, but is that the original question in this
thread? Why are we even talking about invalidate ordering then?

> > All ULPs must ensure SEND/RDMA Write resources remain stable until the
> > CQ indicates that work is completed. 'In a perfect world' this
> > includes not changing the source memory as that would cause
> > retransmitted packets to be different.
> 
> I assume you mean the sending side (the server) for RDMA
> Write. I believe rdma_rw uses the local rdma lkey by default
> for RDMA Write source buffers.

RDMA Write or SEND

> >>> No. The SQ side is asynchronous to the CQ side, the HCA will pipeline
> >>> send packets on the wire up to some internal limit.
> >> 
> >> So if my ULP issues FastReg followed by Send followed by
> >> LocalInv (signaled), I can't rely on the LocalInv completion
> >> to imply that the Send is also complete?
> > 
> > Correct.
> > 
> > This is explicitly defined in Table 79 of the IBA.
> > 
> > It describes the ordering requirements, if you order Send followed by
> > LocalInv the ordering is 'L' which means they are not ordered unless
> > the WR has the Local Invalidate Fence bit set.
> > 
> > LIF is an optional feature, I do not know if any of our hardware
> > supports it, but it is defined to cause the local invalidate to wait
> > until all ongoing references to the MR are completed.
> 
> Now, since there was confusion about using an MR for a
> Send operation, let me clarify. If the client does:

> FastReg(payload buffer)
> Send(inline buffer)
> ...
> Recv
> LocalInv(payload buffer)
> wait for LI completion

Not sure what you are describing?

Is Recv landing memory for a SEND? In that case it is using an lkey;
lkeys are not remotely usable, so it does not need synchronous
invalidation. In all cases the LocalInv must only be posted once a CQE
for the Recv is observed.

If Recv is RDMA WRITE target memory, then it is using the rkey and it
does need synchronous invalidation. This must be done once a Recv
CQE is observed, or optimized by having the other side send one of the
_INV operations.

In no case can you pipeline a LocalInv into the SQ that would impact
RQ activity, even with any of the fences.
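
A rough, untested sketch of that ordering with the kernel verbs API
(the my_rpc/my_recv_done names are made up): the LocalInv for the rkey
is posted only from the Recv completion handler, and skipped when the
peer already invalidated it remotely:

#include <linux/kernel.h>
#include <rdma/ib_verbs.h>

struct my_rpc {
	struct ib_cqe recv_cqe;
	struct ib_cqe inv_cqe;
	struct ib_qp *qp;
	struct ib_mr *mr;   /* FRWR MR whose rkey was advertised to the peer */
};

static void my_inv_done(struct ib_cq *cq, struct ib_wc *wc)
{
	/* The MR is no longer referenced; it may be reused or freed now. */
}

static void my_recv_done(struct ib_cq *cq, struct ib_wc *wc)
{
	struct my_rpc *rpc = container_of(wc->wr_cqe, struct my_rpc, recv_cqe);
	struct ib_send_wr inv = {}, *bad_wr;

	if (wc->status != IB_WC_SUCCESS)
		return; /* error path (reconnect etc.) omitted in this sketch */

	if ((wc->wc_flags & IB_WC_WITH_INVALIDATE) &&
	    wc->ex.invalidate_rkey == rpc->mr->rkey)
		return; /* remote invalidation already did the work */

	rpc->inv_cqe.done = my_inv_done;
	inv.opcode = IB_WR_LOCAL_INV;
	inv.ex.invalidate_rkey = rpc->mr->rkey;
	inv.wr_cqe = &rpc->inv_cqe;
	inv.send_flags = IB_SEND_SIGNALED;
	if (ib_post_send(rpc->qp, &inv, &bad_wr))
		pr_warn("posting LOCAL_INV failed\n"); /* teardown omitted */
}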

> Is setting IB_SEND_FENCE on the LocalInv enough to ensure
> that the Send is complete?

No.

There are two fences in the spec, IB_SEND_FENCE is the mandatory one,
and it only interacts with RDMA READ and ATOMIC entries.

Local Invalidate Fence (the optional one) also will not order the two
because LIF is only defined to order against SQE's that use the
MR. Since Send is using the global dma lkey it does not interact with
the LocalInv and LIF will not order them.

> > No idea on the relative performance of LIF vs doing it manually, but
> > the need for one or the other is unambiguously clear in the spec.
> 
> It seems to me that the guarantee that the server sees
> only one copy of the Send payload is good enough. That
> means that by the time Recv completion occurs on the
> client, even if the client HCA still thinks it needs to
> retransmit the Send containing the RPC Call, the server
> ULP has already seen and processed that Send payload,
> and the HCA on the server won't deliver that payload a
> second time.

Yes, that is OK reasoning.

> If the only concern about preserving that inline buffer is
> guaranteeing that retransmits contain the same content, I
> don't think we have a problem. All HCA retransmits of an
> RPC Call, until the matching RPC Reply is received on the
> client, will contain the same content.

Right.

> The issue about the HCA not being able to access the inline
> buffer during a retransmit is also not an issue for RPC-
> over-RDMA because these buffers are always registered with
> the local rdma lkey.

Exactly.

Jason

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-06-20 10:31                                                                                                   ` Max Gurtovoy
@ 2017-06-20 22:58                                                                                                         ` Robert LeBlanc
  0 siblings, 0 replies; 171+ messages in thread
From: Robert LeBlanc @ 2017-06-20 22:58 UTC (permalink / raw)
  To: Max Gurtovoy
  Cc: Sagi Grimberg, Leon Romanovsky, Marta Rybczynska,
	Christoph Hellwig, Gruher, Joseph R, shahar.salzman,
	Laurence Oberman, Riches Jr, Robert M, linux-rdma,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Jason Gunthorpe,
	Liran Liss

On Tue, Jun 20, 2017 at 4:31 AM, Max Gurtovoy <maxg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
>
>
> On 6/20/2017 12:33 PM, Sagi Grimberg wrote:
>>
>>
>>>>> Here the parsed output, it says that it was access to mkey which is
>>>>> free.
>>
>>
>> Missed that :)
>>
>>>>> ======== cqe_with_error ========
>>>>> wqe_id                           : 0x0
>>>>> srqn_usr_index                   : 0x0
>>>>> byte_cnt                         : 0x0
>>>>> hw_error_syndrome                : 0x93
>>>>> hw_syndrome_type                 : 0x0
>>>>> vendor_error_syndrome            : 0x52
>>>>
>>>>
>>>> Can you share the check that correlates to the vendor+hw syndrome?
>>>
>>>
>>> mkey.free == 1
>>
>>
>> Hmm, the way I understand it is that the HW is trying to access
>> (locally via send) a MR which was already invalidated.
>>
>> Thinking of this further, this can happen in a case where the target
>> already completed the transaction, sent SEND_WITH_INVALIDATE but the
>> original send ack was lost somewhere causing the device to retransmit
>> from the MR (which was already invalidated). This is highly unlikely
>> though.
>>
>> Shouldn't this be protected somehow by the device?
>> Can someone explain why the above cannot happen? Jason? Liran? Anyone?
>>
>> Say host register MR (a) and send (1) from that MR to a target,
>> send (1) ack got lost, and the target issues SEND_WITH_INVALIDATE
>> on MR (a) and the host HCA process it, then host HCA timeout on send (1)
>> so it retries, but ehh, its already invalidated.
>
>
> This might happen IMO.
> Robert, can you test this untested patch (this is not the full solution,
> just something to think about):
>
> diff --git a/drivers/infiniband/ulp/iser/iser_verbs.c
> b/drivers/infiniband/ulp/iser/iser_verbs.c
> index c538a38..e93bd40 100644
> --- a/drivers/infiniband/ulp/iser/iser_verbs.c
> +++ b/drivers/infiniband/ulp/iser/iser_verbs.c
> @@ -1079,7 +1079,7 @@ int iser_post_send(struct ib_conn *ib_conn, struct
> iser_tx_desc *tx_desc,
>         wr->sg_list = tx_desc->tx_sg;
>         wr->num_sge = tx_desc->num_sge;
>         wr->opcode = IB_WR_SEND;
> -       wr->send_flags = signal ? IB_SEND_SIGNALED : 0;
> +       wr->send_flags = IB_SEND_SIGNALED;
>
>         ib_ret = ib_post_send(ib_conn->qp, &tx_desc->wrs[0].send, &bad_wr);
>         if (ib_ret)
>
>
>>
>> Or, we can also have a race where we destroy all our MRs when I/O
>> is still running (but from the code we should be safe here).
>>
>> Robert, when you rebooted the target, I assume iscsi ping
>> timeout expired and the connection teardown started correct?

I still get the local protection errors with this patch. I am seeing a
ping timeout on the initiator when I reboot the target.
[Tue Jun 20 16:41:21 2017]  connection7:0: detected conn error (1011)
[Tue Jun 20 16:41:26 2017]  session7: session recovery timed out after 5 secs

Since I'm gracefully shutting down the targets in this case, shouldn't
the connection be closed gracefully by the target instead of the
initiator having to wait for ping to fail?

----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-06-20 22:58                                                                                                         ` Robert LeBlanc
@ 2017-06-27  7:16                                                                                                           ` Sagi Grimberg
  -1 siblings, 0 replies; 171+ messages in thread
From: Sagi Grimberg @ 2017-06-27  7:16 UTC (permalink / raw)
  To: Robert LeBlanc, Max Gurtovoy
  Cc: Leon Romanovsky, Marta Rybczynska, Christoph Hellwig, Gruher,
	Joseph R, shahar.salzman, Laurence Oberman, Riches Jr, Robert M,
	linux-rdma, linux-nvme, Jason Gunthorpe, Liran Liss,
	target-devel


> I still get the local protection errors with this patch. I am seeing a
> ping timeout on the initiator when I reboot the target.
> [Tue Jun 20 16:41:21 2017]  connection7:0: detected conn error (1011)
> [Tue Jun 20 16:41:26 2017]  session7: session recovery timed out after 5 secs
> 

Not a big surprise as its not really addressing the issue...

> Since I'm gracefully shutting down the targets in this case, shouldn't
> the connection be closed gracefully by the target instead of the
> initiator having to wait for ping to fail?

Not really. Even in an orderly shutdown, the device driver (mlx5 in
this case) shutdown sequence is triggered before ib_isert and
fires DEVICE_REMOVAL events to all its upper layer users (ib_isert
being one of them), which forces resource teardown (no disconnect).

We could register a shutdown handler in ib_isert, but it's not really
its responsibility as a transport driver...

It would be nice if we had targetcli daemonized as a service that
registers a shutdown notification, does an orderly removal and saves the
existing configuration before the kernel even sees it. But that's
a different scope really...

CC'ing target-devel

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-06-20 17:28                                                                                                                   ` Robert LeBlanc
@ 2017-06-27  7:22                                                                                                                       ` Sagi Grimberg
  -1 siblings, 0 replies; 171+ messages in thread
From: Sagi Grimberg @ 2017-06-27  7:22 UTC (permalink / raw)
  To: Robert LeBlanc
  Cc: Leon Romanovsky, Marta Rybczynska, Max Gurtovoy,
	Christoph Hellwig, Gruher, Joseph R, shahar.salzman,
	Laurence Oberman, Riches Jr, Robert M, linux-rdma,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Jason Gunthorpe,
	Liran Liss, Bart Van Assche, Chuck Lever, target-devel


>> I don't understand, is this new with the patch applied?
> 
> I applied your patch to 4.12-rc6 on the initiator, but my targets are
> still 4.9.33 since it looked like the patch only affected the
> initiator. I did not see this before your patch, but I also didn't try
> rebooting the targets multiple times before because of the previous
> messages.

That sounds like a separate issue. Should we move forward with the
suggested patch?

>>> After this and a reboot of the target, the initiator would drop the
>>> connection after 1.5-2 minutes then faster and faster until it was
>>> every 5 seconds. It is almost like it set up the connection then lose
>>> the first ping, or the ping wasn't set-up right. I tried rebooting the
>>> target multiple times.
>>
>>
>> So the initiator could not recover even after the target was available
>> again?
> 
> The initiator recovered the connection when the target came back, but
> the connection was not stable. I/O would happen on the connection,
> then it would get shaky and then finally disconnect. Then it would
> reconnect, pass more I/O, then get shaky and go down again. With the 5
> second disconnects, it would pass traffic for 5 seconds, then as soon
> as I saw the ping timeout, the I/O would stop until it reconnected. At
> that point it seems that the lack of pings would kill the I/O unlike
> earlier where there was a stall in I/O and then the connection would
> be torn down. I can try to see if I can get it to happen again.

So it looks like the target is not responding to NOOP_OUTs (or to traffic
at all, for that matter).

The messages:
[Tue Jun 20 10:11:20 2017] iSCSI Login timeout on Network Portal [::]:3260

Are indicating that something is stuck in the login thread, not sure
where though. Did you see a watchdog popping on a hang?

And the message:
[Tue Jun 20 10:11:58 2017] isert: isert_print_wc: login send failure:
transport retry counter exceeded (12) vend_err 81

Is an indication that the rdma fabric is in some error state.

On which reboot attempt all this happened? the first one?

Again, CCing target-devel.

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-06-20 21:19                                                                                                                               ` Jason Gunthorpe
@ 2017-06-27  7:37                                                                                                                                   ` Sagi Grimberg
  -1 siblings, 0 replies; 171+ messages in thread
From: Sagi Grimberg @ 2017-06-27  7:37 UTC (permalink / raw)
  To: Jason Gunthorpe, Chuck Lever
  Cc: Leon Romanovsky, Robert LeBlanc, Marta Rybczynska, Max Gurtovoy,
	Christoph Hellwig, Gruher, Joseph R, shahar.salzman,
	Laurence Oberman, Riches Jr, Robert M, linux-rdma,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, Liran Liss,
	Bart Van Assche

Jason,

>> The issue about the HCA not being able to access the inline
>> buffer during a retransmit is also not an issue for RPC-
>> over-RDMA because these buffers are always registered with
>> the local rdma lkey.
> 
> Exactly.

Lost track of the thread...


Indeed you raised this issue lots of times before, and I failed to see
why it's important or why it's error-prone, but now I do...

My apologies for not listening :(

We should fix _all_ initiators for it, nvme-rdma, iser, srp
and xprtrdma (and probably some more ULPs out there)...

It also means that we cannot really suppress any send completions as
that would result in an unpredictable latency (which is not acceptable).

I wish we could somehow tell the HCA that it can ignore an access failure
to a specific address when retransmitting... but maybe it's too much to ask...

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-06-27  7:37                                                                                                                                   ` Sagi Grimberg
@ 2017-06-27 14:42                                                                                                                                       ` Chuck Lever
  -1 siblings, 0 replies; 171+ messages in thread
From: Chuck Lever @ 2017-06-27 14:42 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Jason Gunthorpe, Leon Romanovsky, Robert LeBlanc,
	Marta Rybczynska, Max Gurtovoy, Christoph Hellwig, Gruher,
	Joseph R, shahar.salzman, Laurence Oberman, Riches Jr, Robert M,
	linux-rdma, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	Liran Liss, Bart Van Assche


> On Jun 27, 2017, at 3:37 AM, Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org> wrote:
> 
> Jason,
> 
>>> The issue about the HCA not being able to access the inline
>>> buffer during a retransmit is also not an issue for RPC-
>>> over-RDMA because these buffers are always registered with
>>> the local rdma lkey.
>> Exactly.
> 
> Lost track of the thread...
> 
> 
> Indeed you raised this issue lots of times before, and I failed to see
> why its important or why its error prone, but now I do...
> 
> My apologies for not listening :(
> 
> We should fix _all_ initiators for it, nvme-rdma, iser, srp
> and xprtrdma (and probably some more ULPs out there)...

Go back and browse the end of the thread: there's no need to change
xprtrdma, and maybe no need to change the others either.


> It also means that we cannot really suppress any send completions as
> that would result in an unpredictable latency (which is not acceptable).
> 
> I wish we could somehow tell the HCA that it can ignore access fail to a
> specific address when retransmitting.. but maybe its too much to ask...

--
Chuck Lever




^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-06-27 14:42                                                                                                                                       ` Chuck Lever
@ 2017-06-27 16:07                                                                                                                                           ` Sagi Grimberg
  -1 siblings, 0 replies; 171+ messages in thread
From: Sagi Grimberg @ 2017-06-27 16:07 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Jason Gunthorpe, Leon Romanovsky, Robert LeBlanc,
	Marta Rybczynska, Max Gurtovoy, Christoph Hellwig, Gruher,
	Joseph R, shahar.salzman, Laurence Oberman, Riches Jr, Robert M,
	linux-rdma, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	Liran Liss, Bart Van Assche


> Go back and browse the end of the thread: there's no need to change
> xprtrdma, and maybe no need to change the others either.

I think there is. Even with inline, xprtrdma DMA maps the immediate
buffers, including the message head and tail, so unmapping these buffers
without waiting for the Send completion would trigger an IOMMU access
error (the HCA (re)tries to access an already unmapped buffer).

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-06-27 16:07                                                                                                                                           ` Sagi Grimberg
@ 2017-06-27 16:28                                                                                                                                               ` Jason Gunthorpe
  -1 siblings, 0 replies; 171+ messages in thread
From: Jason Gunthorpe @ 2017-06-27 16:28 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Chuck Lever, Leon Romanovsky, Robert LeBlanc, Marta Rybczynska,
	Max Gurtovoy, Christoph Hellwig, Gruher, Joseph R,
	shahar.salzman, Laurence Oberman, Riches Jr, Robert M,
	linux-rdma, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	Liran Liss, Bart Van Assche

On Tue, Jun 27, 2017 at 07:07:08PM +0300, Sagi Grimberg wrote:
> 
> >Go back and browse the end of the thread: there's no need to change
> >xprtrdma, and maybe no need to change the others either.
> 
> I think there is, even with inline, xprtrdma dma maps the immediate
> buffers, also the message head and tail so unmapping these buffers
> without waiting for the send completion would trigger a IOMMU access
> error (the HCA (re)tries to access an already unmapped buffer).

Yes, that is an excellent observation. When using the local rdma lkey
you still need to ensure the Linux DMA API mapping remains until
completion.

Send completion mitigation is still possible, if it is OK for the
backing pages to remain, but I think a more sophisticated strategy is
needed - e.g. maybe push some kind of NOP through the send queue after a
timer, or only request a completion when the last available work slot is
stuffed, or something.
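
One possible shape of such a strategy, as a rough sketch (the ulp_queue
structure and batch size are assumptions, not existing driver code):
signal every Nth Send so a CQE arrives well before the SQ can wrap, and
defer DMA unmap of the unsignaled Sends until that CQE.

struct ulp_queue {
	struct ib_qp	*qp;
	unsigned int	unsignaled;	/* Sends posted since the last signaled one */
};

#define ULP_SIGNAL_BATCH 16		/* assumption: must stay well below the SQ depth */

static void ulp_prepare_send(struct ulp_queue *q, struct ib_send_wr *wr)
{
	/*
	 * SQ completions are delivered in order, so the CQE of a signaled
	 * Send also covers every unsignaled Send posted before it on the
	 * same QP; their DMA mappings may only be torn down at that point.
	 */
	if (++q->unsignaled >= ULP_SIGNAL_BATCH) {
		wr->send_flags |= IB_SEND_SIGNALED;
		q->unsignaled = 0;
	} else {
		wr->send_flags &= ~IB_SEND_SIGNALED;
	}
}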

Jason

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-06-27 16:07                                                                                                                                           ` Sagi Grimberg
@ 2017-06-27 16:28                                                                                                                                               ` Chuck Lever
  -1 siblings, 0 replies; 171+ messages in thread
From: Chuck Lever @ 2017-06-27 16:28 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Jason Gunthorpe, Leon Romanovsky, Robert LeBlanc,
	Marta Rybczynska, Max Gurtovoy, Christoph Hellwig, Gruher,
	Joseph R, shahar.salzman, Laurence Oberman, Riches Jr, Robert M,
	linux-rdma, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	Liran Liss, Bart Van Assche


> On Jun 27, 2017, at 12:07 PM, Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org> wrote:
> 
> 
>> Go back and browse the end of the thread: there's no need to change
>> xprtrdma, and maybe no need to change the others either.
> 
> I think there is, even with inline, xprtrdma dma maps the immediate
> buffers, also the message head and tail so unmapping these buffers
> without waiting for the send completion would trigger a IOMMU access
> error (the HCA (re)tries to access an already unmapped buffer).

Thinking out loud:

IIRC the message head and tail reside in the persistently registered
and DMA mapped buffers with few exceptions.

However, when page cache pages are involved, xprtrdma will do a DMA
unmap as you say.

So:
- we don't have a problem transmitting a garbled request thanks to
  exactly-once receive semantics
- we don't have a problem with the timing of registration and
  invalidation on the initiator because the PD's DMA lkey is used
- we do have a problem with DMA unmap

Using only persistently registered and DMA mapped Send buffers
should avoid the need to signal all Sends. However, when page
cache pages are involved, then the Send needs to be signaled,
and the pages unmapped only after Send completion, to be completely
safe.
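
A rough sketch of that signaling decision (the helper below is an
illustration, not xprtrdma code):

/* Signal the Send only when per-request DMA mappings hang off it. */
static void ulp_finish_send_wr(struct ib_send_wr *wr, struct ib_cqe *cqe,
			       bool has_mapped_pagecache_pages)
{
	wr->opcode = IB_WR_SEND;
	wr->wr_cqe = cqe;
	/*
	 * Persistently registered and persistently DMA mapped Send buffers
	 * need no per-Send completion; page cache pages mapped just for
	 * this request must only be unmapped after the Send CQE arrives.
	 */
	wr->send_flags = has_mapped_pagecache_pages ? IB_SEND_SIGNALED : 0;
}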

Or, the initiator should just use RDMA Read and Write, and stick
with small inline sizes.


--
Chuck Lever




^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-06-27  7:37                                                                                                                                   ` Sagi Grimberg
@ 2017-06-27 18:08                                                                                                                                       ` Bart Van Assche
  -1 siblings, 0 replies; 171+ messages in thread
From: Bart Van Assche @ 2017-06-27 18:08 UTC (permalink / raw)
  To: jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/,
	chuck.lever-QHcLZuEGTsvQT0dZR+AlfA, sagi-NQWnxTmZq1alnMjI0IkVqw
  Cc: mrybczyn-FNhOzJFKnXGHXe+LvDLADg, hch-wEGCiKHe2LqWVfeAwA7xHQ,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	shahar.salzman-Re5JQEeQqe8AvxtiuMwx3w, Bart Van Assche,
	robert.m.riches.jr-ral2JQCrhuEAvxtiuMwx3w,
	robert-4JaGZRWAfWbajFs6igw21g,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	maxg-VPRAkNaXOzVWk0Htik3J/w, loberman-H+wXaHxf7aLQT0dZR+AlfA,
	leon-DgEjT+Ai2ygdnm+yROfE0A, liranl-VPRAkNaXOzVWk0Htik3J/w,
	joseph.r.gruher-ral2JQCrhuEAvxtiuMwx3w

On Tue, 2017-06-27 at 10:37 +0300, Sagi Grimberg wrote:
> Jason,
> 
> > > The issue about the HCA not being able to access the inline
> > > buffer during a retransmit is also not an issue for RPC-
> > > over-RDMA because these buffers are always registered with
> > > the local rdma lkey.
> > 
> > Exactly.
> 
> Lost track of the thread...
> 
> 
> Indeed you raised this issue lots of times before, and I failed to see
> why its important or why its error prone, but now I do...
> 
> My apologies for not listening :(
> 
> We should fix _all_ initiators for it, nvme-rdma, iser, srp
> and xprtrdma (and probably some more ULPs out there)...
> 
> It also means that we cannot really suppress any send completions as
> that would result in an unpredictable latency (which is not acceptable).
> 
> I wish we could somehow tell the HCA that it can ignore access fail to a
> specific address when retransmitting.. but maybe its too much to ask...

Hello Sagi,

Can you clarify why you think that the SRP initiator needs to be changed?
The SRP initiator submits the local invalidate work request after the RDMA
write request. According to table 79 "Work Request Operation Ordering" the
order of these work requests must be maintained by the HCA. I think that if
an HCA started invalidating the MR before the remote HCA has acknowledged
the written data, that would be a firmware bug.

The upstream SRP initiator does not use inline data.

Bart.

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-06-27 18:08                                                                                                                                       ` Bart Van Assche
@ 2017-06-27 18:14                                                                                                                                           ` Jason Gunthorpe
  -1 siblings, 0 replies; 171+ messages in thread
From: Jason Gunthorpe @ 2017-06-27 18:14 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: chuck.lever-QHcLZuEGTsvQT0dZR+AlfA, sagi-NQWnxTmZq1alnMjI0IkVqw,
	mrybczyn-FNhOzJFKnXGHXe+LvDLADg, hch-wEGCiKHe2LqWVfeAwA7xHQ,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	shahar.salzman-Re5JQEeQqe8AvxtiuMwx3w,
	robert.m.riches.jr-ral2JQCrhuEAvxtiuMwx3w,
	robert-4JaGZRWAfWbajFs6igw21g,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	maxg-VPRAkNaXOzVWk0Htik3J/w, loberman-H+wXaHxf7aLQT0dZR+AlfA,
	leon-DgEjT+Ai2ygdnm+yROfE0A, liranl-VPRAkNaXOzVWk0Htik3J/w,
	joseph.r.gruher-ral2JQCrhuEAvxtiuMwx3w

On Tue, Jun 27, 2017 at 06:08:57PM +0000, Bart Van Assche wrote:

> Can you clarify why you think that the SRP initiator needs to be changed?
> The SRP initiator submits the local invalidate work request after the RDMA
> write request. According to table 79 "Work Request Operation
> Ordering" the

That table has an 'L' for an invalidate that follows an RDMA Write.

L means they are not ordered unless the optional Local Invalidate
Fence mode is used.

Jason

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-06-27 16:28                                                                                                                                               ` Jason Gunthorpe
@ 2017-06-28  7:03                                                                                                                                                   ` Sagi Grimberg
  -1 siblings, 0 replies; 171+ messages in thread
From: Sagi Grimberg @ 2017-06-28  7:03 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Chuck Lever, Leon Romanovsky, Robert LeBlanc, Marta Rybczynska,
	Max Gurtovoy, Christoph Hellwig, Gruher, Joseph R,
	shahar.salzman, Laurence Oberman, Riches Jr, Robert M,
	linux-rdma, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	Liran Liss, Bart Van Assche


> send completion mitigation is still possible, if it is OK for the
> backing pages to remain, but I think a more sophisticated strategy is
> needed - eg maybe push some kind of NOP through the send q after a
> timer or only complete when the last available work is stuffed or
> something.

I'm not smart enough to come up with something like that.

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-06-27 16:28                                                                                                                                               ` Chuck Lever
@ 2017-06-28  7:08                                                                                                                                                   ` Sagi Grimberg
  -1 siblings, 0 replies; 171+ messages in thread
From: Sagi Grimberg @ 2017-06-28  7:08 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Jason Gunthorpe, Leon Romanovsky, Robert LeBlanc,
	Marta Rybczynska, Max Gurtovoy, Christoph Hellwig, Gruher,
	Joseph R, shahar.salzman, Laurence Oberman, Riches Jr, Robert M,
	linux-rdma, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	Liran Liss, Bart Van Assche


> Thinking out loud:
> 
> IIRC the message head and tail reside in the persistently registered
> and DMA mapped buffers with few exceptions.
> 
> However, when page cache pages are involved, xprtrdma will do a DMA
> unmap as you say.
> 
> So:
> - we don't have a problem transmitting a garbled request thanks to
>    exactly-once receive semantics
> - we don't have a problem with the timing of registration and
>    invalidation on the initiator because the PD's DMA lkey is used
> - we do have a problem with DMA unmap
> 
> Using only persistently registered and DMA mapped Send buffers
> should avoid the need to signal all Sends. However, when page
> cache pages are involved, then the Send needs to be signaled,
> and the pages unmapped only after Send completion, to be completely
> safe.

How do you know when that happens?

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-06-27 18:08                                                                                                                                       ` Bart Van Assche
@ 2017-06-28  7:16                                                                                                                                           ` Sagi Grimberg
  -1 siblings, 0 replies; 171+ messages in thread
From: Sagi Grimberg @ 2017-06-28  7:16 UTC (permalink / raw)
  To: Bart Van Assche, jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/,
	chuck.lever-QHcLZuEGTsvQT0dZR+AlfA
  Cc: mrybczyn-FNhOzJFKnXGHXe+LvDLADg, hch-wEGCiKHe2LqWVfeAwA7xHQ,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	shahar.salzman-Re5JQEeQqe8AvxtiuMwx3w,
	robert.m.riches.jr-ral2JQCrhuEAvxtiuMwx3w,
	robert-4JaGZRWAfWbajFs6igw21g,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	maxg-VPRAkNaXOzVWk0Htik3J/w, loberman-H+wXaHxf7aLQT0dZR+AlfA,
	leon-DgEjT+Ai2ygdnm+yROfE0A, liranl-VPRAkNaXOzVWk0Htik3J/w,
	joseph.r.gruher-ral2JQCrhuEAvxtiuMwx3w

> Hello Sagi,

Hi Bart,

> Can you clarify why you think that the SRP initiator needs to be changed?
> The SRP initiator submits the local invalidate work request after the RDMA
> write request. According to table 79 "Work Request Operation Ordering" the
> order of these work requests must be maintained by the HCA. I think if a HCA
> would start with invalidating the MR before the remote HCA has acknowledged
> the written data that that's a firmware bug.

That flow is fine, we were discussing immediate data sends.

SRP only needs fixing by waiting for all local invalidates to
complete before unmapping the user buffers and completing the I/O.

BTW, did the efforts on standardizing remote invalidate in SRP ever
evolve into something? Would it be acceptable to add it as a non-standard
extension to ib_srp and ib_srpt? We can default to off and allow the user
to opt in if they know the two sides comply...

We need to fix that in nvmf and iser too btw (luckily we have remote
invalidate so it's not a big issue).

> The upstream SRP initiator does not use inline data.

Yet :)

IIRC you have a branch with immediate-data support against scst so
it might be relevant there...

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-06-28  7:16                                                                                                                                           ` Sagi Grimberg
@ 2017-06-28  9:43                                                                                                                                               ` Bart Van Assche
  -1 siblings, 0 replies; 171+ messages in thread
From: Bart Van Assche @ 2017-06-28  9:43 UTC (permalink / raw)
  To: jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/,
	chuck.lever-QHcLZuEGTsvQT0dZR+AlfA, sagi-NQWnxTmZq1alnMjI0IkVqw
  Cc: mrybczyn-FNhOzJFKnXGHXe+LvDLADg, hch-wEGCiKHe2LqWVfeAwA7xHQ,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	shahar.salzman-Re5JQEeQqe8AvxtiuMwx3w,
	robert.m.riches.jr-ral2JQCrhuEAvxtiuMwx3w,
	robert-4JaGZRWAfWbajFs6igw21g,
	linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	maxg-VPRAkNaXOzVWk0Htik3J/w, loberman-H+wXaHxf7aLQT0dZR+AlfA,
	leon-DgEjT+Ai2ygdnm+yROfE0A, liranl-VPRAkNaXOzVWk0Htik3J/w,
	joseph.r.gruher-ral2JQCrhuEAvxtiuMwx3w

On Wed, 2017-06-28 at 10:16 +0300, Sagi Grimberg wrote:
> BTW, did the efforts on standardizing remote invalidate in SRP ever
> evolved to something? Would it be acceptable to add a non-standard
> ib_srp and ib_srpt? We can default to off and allow the user to opt
> it in if it knows the two sides comply...

I'd like to hear Doug's opinion about whether it's OK to add this
feature to the upstream drivers before support for remote invalidate
is standardized.

> We need to fix that in nvmf and iser too btw (luckily we have remote
> invalidate so its not a big issue).
> 
> > The upstream SRP initiator does not use inline data.
> 
> Yet :)
> 
> IIRC you have a branch with immediate-data support against scst so
> it might be relevant there...

Support for immediate data is being standardized. My plan is to add
support for immediate data to the upstream drivers once the T10 committee
agrees about the implementation. See also
http://www.t10.org/cgi-bin/ac.pl?t=f&f=srp2r02a.pdf

Bart.

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-06-28  7:08                                                                                                                                                   ` Sagi Grimberg
@ 2017-06-28 16:11                                                                                                                                                       ` Chuck Lever
  -1 siblings, 0 replies; 171+ messages in thread
From: Chuck Lever @ 2017-06-28 16:11 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Jason Gunthorpe, Leon Romanovsky, Robert LeBlanc,
	Marta Rybczynska, Max Gurtovoy, Christoph Hellwig, Gruher,
	Joseph R, shahar.salzman, Laurence Oberman, Riches Jr, Robert M,
	linux-rdma, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	Liran Liss, Bart Van Assche


> On Jun 28, 2017, at 3:08 AM, Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org> wrote:
> 
> 
>> Thinking out loud:
>> IIRC the message head and tail reside in the persistently registered
>> and DMA mapped buffers with few exceptions.
>> However, when page cache pages are involved, xprtrdma will do a DMA
>> unmap as you say.
>> So:
>> - we don't have a problem transmitting a garbled request thanks to
>>   exactly-once receive semantics
>> - we don't have a problem with the timing of registration and
>>   invalidation on the initiator because the PD's DMA lkey is used
>> - we do have a problem with DMA unmap
>> Using only persistently registered and DMA mapped Send buffers
>> should avoid the need to signal all Sends. However, when page
>> cache pages are involved, then the Send needs to be signaled,
>> and the pages unmapped only after Send completion, to be completely
>> safe.
> 
> How do you know when that happens?

The RPC Call send path sets up the Send SGE array. If it includes
page cache pages, it can set IB_SEND_SIGNALED.

The SGE array and the ib_cqe for the send are in the same data
structure, so the Send completion handler can find the SGE array
and figure out what needs to be unmapped.
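
Roughly like this, as a sketch (the ulp_req layout and field names are
assumptions, not xprtrdma's actual structures):

struct ulp_req {
	struct ib_cqe	rl_cqe;		/* embedded so container_of() works */
	struct ib_sge	rl_sge[8];	/* leading entries: persistent head/tail buffers */
	unsigned int	rl_num_sge;
	unsigned int	rl_nr_mapped;	/* trailing entries mapping page cache pages */
};

static void ulp_send_done(struct ib_cq *cq, struct ib_wc *wc)
{
	struct ulp_req *req = container_of(wc->wr_cqe, struct ulp_req, rl_cqe);
	unsigned int i;

	/* Only the page cache pages were DMA mapped per-request. */
	for (i = req->rl_num_sge - req->rl_nr_mapped; i < req->rl_num_sge; i++)
		ib_dma_unmap_page(cq->device, req->rl_sge[i].addr,
				  req->rl_sge[i].length, DMA_TO_DEVICE);
	req->rl_nr_mapped = 0;
}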

The only problem is if a POSIX signal fires. In that case the
data structure can be released before the Send completion fires,
and we get touch-after-free in the completion handler.

I'm thinking that it just isn't going to be practical to handle
unmapping this way, and I should just revert back to using RDMA
Read instead of adding page cache pages to the Send SGE.


--
Chuck Lever




^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-06-28 16:11                                                                                                                                                       ` Chuck Lever
@ 2017-06-29  5:35                                                                                                                                                           ` Sagi Grimberg
  -1 siblings, 0 replies; 171+ messages in thread
From: Sagi Grimberg @ 2017-06-29  5:35 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Jason Gunthorpe, Leon Romanovsky, Robert LeBlanc,
	Marta Rybczynska, Max Gurtovoy, Christoph Hellwig, Gruher,
	Joseph R, shahar.salzman, Laurence Oberman, Riches Jr, Robert M,
	linux-rdma, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	Liran Liss, Bart Van Assche


>> How do you know when that happens?
> 
> The RPC Call send path sets up the Send SGE array. If it includes
> page cache pages, it can set IB_SEND_SIGNALED.
> 
> The SGE array and the ib_cqe for the send are in the same data
> structure, so the Send completion handler can find the SGE array
> and figure out what needs to be unmapped.
> 
> The only problem is if a POSIX signal fires. In that case the
> data structure can be released before the Send completion fires,
> and we get touch-after-free in the completion handler.
> 
> I'm thinking that it just isn't going to be practical to handle
> unmapping this way, and I should just revert back to using RDMA
> Read instead of adding page cache pages to the Send SGE.

Or wait for the send completion before completing the I/O?

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-06-29  5:35                                                                                                                                                           ` Sagi Grimberg
@ 2017-06-29 14:55                                                                                                                                                               ` Chuck Lever
  -1 siblings, 0 replies; 171+ messages in thread
From: Chuck Lever @ 2017-06-29 14:55 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Jason Gunthorpe, Leon Romanovsky, Robert LeBlanc,
	Marta Rybczynska, Max Gurtovoy, Christoph Hellwig, Gruher,
	Joseph R, shahar.salzman, Laurence Oberman, Riches Jr, Robert M,
	linux-rdma, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	Liran Liss, Bart Van Assche


> On Jun 29, 2017, at 1:35 AM, Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org> wrote:
> 
> 
>>> How do you know when that happens?
>> The RPC Call send path sets up the Send SGE array. If it includes
>> page cache pages, it can set IB_SEND_SIGNALED.
>> The SGE array and the ib_cqe for the send are in the same data
>> structure, so the Send completion handler can find the SGE array
>> and figure out what needs to be unmapped.
>> The only problem is if a POSIX signal fires. In that case the
>> data structure can be released before the Send completion fires,
>> and we get touch-after-free in the completion handler.
>> I'm thinking that it just isn't going to be practical to handle
>> unmapping this way, and I should just revert back to using RDMA
>> Read instead of adding page cache pages to the Send SGE.
> 
> Or wait for the send completion before completing the I/O?

In the normal case, that works.

If a POSIX signal occurs (^C, RPC timeout), the RPC exits immediately
and recovers all resources. The Send can still be running at that
point, and it can't be stopped (without transitioning the QP to
error state, I guess).

The alternative is reference-counting the data structure that has
the ib_cqe and the SGE array. That adds one or more atomic_t
operations per I/O that I'd like to avoid.


--
Chuck Lever




^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-06-29 14:55                                                                                                                                                               ` Chuck Lever
@ 2017-07-02  9:45                                                                                                                                                                   ` Sagi Grimberg
  -1 siblings, 0 replies; 171+ messages in thread
From: Sagi Grimberg @ 2017-07-02  9:45 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Jason Gunthorpe, Leon Romanovsky, Robert LeBlanc,
	Marta Rybczynska, Max Gurtovoy, Christoph Hellwig, Gruher,
	Joseph R, shahar.salzman, Laurence Oberman, Riches Jr, Robert M,
	linux-rdma, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	Liran Liss, Bart Van Assche


>> Or wait for the send completion before completing the I/O?
> 
> In the normal case, that works.
> 
> If a POSIX signal occurs (^C, RPC timeout), the RPC exits immediately
> and recovers all resources. The Send can still be running at that
> point, and it can't be stopped (without transitioning the QP to
> error state, I guess).

In that case we can't complete the I/O either (or move the
QP into error state), we need to defer/sleep on send completion.


> The alternative is reference-counting the data structure that has
> the ib_cqe and the SGE array. That adds one or more atomic_t
> operations per I/O that I'd like to avoid.

Why atomics?

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-07-02  9:45                                                                                                                                                                   ` Sagi Grimberg
@ 2017-07-02 18:17                                                                                                                                                                       ` Chuck Lever
  -1 siblings, 0 replies; 171+ messages in thread
From: Chuck Lever @ 2017-07-02 18:17 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Jason Gunthorpe, Leon Romanovsky, Robert LeBlanc,
	Marta Rybczynska, Max Gurtovoy, Christoph Hellwig, Gruher,
	Joseph R, shahar.salzman, Laurence Oberman, Riches Jr, Robert M,
	linux-rdma, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	Liran Liss, Bart Van Assche


> On Jul 2, 2017, at 5:45 AM, Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org> wrote:
> 
> 
>>> Or wait for the send completion before completing the I/O?
>> In the normal case, that works.
>> If a POSIX signal occurs (^C, RPC timeout), the RPC exits immediately
>> and recovers all resources. The Send can still be running at that
>> point, and it can't be stopped (without transitioning the QP to
>> error state, I guess).
> 
> In that case we can't complete the I/O either (or move the
> QP into error state), we need to defer/sleep on send completion.

Unfortunately the RPC client finite state machine mustn't
sleep when a POSIX signal fires. xprtrdma has to unblock the
waiting application process but clean up the resources
asynchronously.

The RPC completion doesn't have to wait on DMA unmapping the
send buffer. What would have to wait is cleaning up the
resources -- in particular, allowing the rpcrdma_req
structure, where the send SGEs are kept, to be re-used. In
the current design, both happen at the same time.


>> The alternative is reference-counting the data structure that has
>> the ib_cqe and the SGE array. That adds one or more atomic_t
>> operations per I/O that I'd like to avoid.
> 
> Why atomics?

Either an atomic reference count or a spin lock is necessary
because there are two different ways an RPC can exit:

1. The common way, which is through receipt of an RPC reply,
handled by rpcrdma_reply_handler.

2. POSIX signal, where the RPC reply races with the wake-up
of the application process (in other words, the reply can
still arrive while the RPC is terminating).

In both cases, the RPC client has to invalidate any
registered memory, and it has to be done no more and no less
than once.
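
One way to get that exactly-once behaviour is a single atomic flag per
request, tested by whichever path reaches it first; a minimal sketch
with hypothetical names (test_and_set_bit() itself is real):

#include <linux/bitops.h>

enum { MY_REQ_INVALIDATED = 0 };

struct my_rpc_req {
        unsigned long   flags;
        /* ... registered MRs, Send SGEs, ib_cqe, and so on ... */
};

/* Called from both the reply handler and the signal/termination path.
 * Whichever caller gets here first performs the invalidation; the
 * loser sees the bit already set and does nothing. */
static void my_invalidate_once(struct my_rpc_req *req)
{
        if (test_and_set_bit(MY_REQ_INVALIDATED, &req->flags))
                return;

        /* ... post LOCAL_INV WRs (or note that the peer already
         * remotely invalidated) for this request's MRs ... */
}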

I deal with some of this in my for-4.13 patches:

http://marc.info/?l=linux-nfs&m=149693711119727&w=2

The first seven patches handle the race condition and the
need for exactly-once invalidation.

But the issue with unmapping the Send buffers has to do
with how the Send SGEs are managed. The data structure
containing the SGEs goes away once the RPC is complete.

So there are two "users": one is the RPC completion, and
one is the Send completion. Once both are done, the data
structure can be released. But RPC completion can't wait
if the Send completion hasn't yet fired.

I could kmalloc the SGE array instead, signal each Send,
and then in the Send completion handler, unmap the SGEs
and then kfree the SGE array. That's a lot of overhead.

Or I could revert all the "map page cache pages" logic and
just use memcpy for small NFS WRITEs, and RDMA the rest of
the time. That keeps everything simple, but means large
inline thresholds can't use send-in-place.

I'm still open to suggestion. for-4.14 will deal with other
problems, unless an obvious and easy fix arises.


--
Chuck Lever




^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-07-02 18:17                                                                                                                                                                       ` Chuck Lever
@ 2017-07-09 16:47                                                                                                                                                                           ` Jason Gunthorpe
  -1 siblings, 0 replies; 171+ messages in thread
From: Jason Gunthorpe @ 2017-07-09 16:47 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Sagi Grimberg, Leon Romanovsky, Robert LeBlanc, Marta Rybczynska,
	Max Gurtovoy, Christoph Hellwig, Gruher, Joseph R,
	shahar.salzman, Laurence Oberman, Riches Jr, Robert M,
	linux-rdma, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	Liran Liss, Bart Van Assche

On Sun, Jul 02, 2017 at 02:17:52PM -0400, Chuck Lever wrote:

> I could kmalloc the SGE array instead, signal each Send,
> and then in the Send completion handler, unmap the SGEs
> and then kfree the SGE array. That's a lot of overhead.

Usually after allocating the send queue you'd pre-allocate all the
tracking memory needed for each SQE - eg enough information to do the
dma unmaps/etc?

> Or I could revert all the "map page cache pages" logic and
> just use memcpy for small NFS WRITEs, and RDMA the rest of
> the time. That keeps everything simple, but means large
> inline thresholds can't use send-in-place.

Don't you have the same problem with RDMA WRITE?

Jason

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-07-09 16:47                                                                                                                                                                           ` Jason Gunthorpe
@ 2017-07-10 19:03                                                                                                                                                                               ` Chuck Lever
  -1 siblings, 0 replies; 171+ messages in thread
From: Chuck Lever @ 2017-07-10 19:03 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Sagi Grimberg, Leon Romanovsky, Robert LeBlanc, Marta Rybczynska,
	Max Gurtovoy, Christoph Hellwig, Gruher, Joseph R,
	shahar.salzman, Laurence Oberman, Riches Jr, Robert M,
	linux-rdma, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	Liran Liss, Bart Van Assche


> On Jul 9, 2017, at 12:47 PM, Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
> 
> On Sun, Jul 02, 2017 at 02:17:52PM -0400, Chuck Lever wrote:
> 
>> I could kmalloc the SGE array instead, signal each Send,
>> and then in the Send completion handler, unmap the SGEs
>> and then kfree the SGE array. That's a lot of overhead.
> 
> Usually after allocating the send queue you'd pre-allocate all the
> tracking memory needed for each SQE - eg enough information to do the
> dma unmaps/etc?

Right. In xprtrdma, the QP resources are allocated in rpcrdma_ep_create.
For every RPC-over-RDMA credit, rpcrdma_buffer_create allocates an
rpcrdma_req structure, which contains an ib_cqe and an array of SGEs for
the Send, and a number of other resources used to maintain registration
state during an RPC-over-RDMA call. Both of these functions are invoked
during transport instance set-up.
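
A rough sketch of that kind of set-up-time pre-allocation (hypothetical
names and sizes; the real rpcrdma_req carries much more state): each
credit gets a context whose inline buffer is DMA-mapped exactly once
and reused for every RPC on that transport.

#include <rdma/ib_verbs.h>
#include <linux/slab.h>

#define MY_INLINE_SIZE   4096           /* assumed inline threshold */
#define MY_MAX_SEND_SGES 16

struct my_req {
        struct ib_cqe   cqe;            /* Send completion cqe */
        struct ib_sge   sges[MY_MAX_SEND_SGES]; /* SGE 0 = inline buffer */
        void            *inline_buf;    /* persistently mapped */
        u64             inline_dma;
};

/* Called once per credit at transport set-up. */
static struct my_req *my_req_create(struct ib_pd *pd)
{
        struct ib_device *dev = pd->device;
        struct my_req *req = kzalloc(sizeof(*req), GFP_KERNEL);

        if (!req)
                return NULL;
        req->inline_buf = kmalloc(MY_INLINE_SIZE, GFP_KERNEL);
        if (!req->inline_buf)
                goto out_free;

        /* Mapped here, unmapped only at transport tear-down. */
        req->inline_dma = ib_dma_map_single(dev, req->inline_buf,
                                            MY_INLINE_SIZE, DMA_TO_DEVICE);
        if (ib_dma_mapping_error(dev, req->inline_dma))
                goto out_free_buf;

        req->sges[0].addr = req->inline_dma;
        req->sges[0].length = MY_INLINE_SIZE;
        req->sges[0].lkey = pd->local_dma_lkey;
        return req;

out_free_buf:
        kfree(req->inline_buf);
out_free:
        kfree(req);
        return NULL;
}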

The problem is the lifetime for the rpcrdma_req structure. Currently it
is acquired when an RPC is started, and it is released when the RPC
terminates.

Inline send buffers are never unmapped until transport tear-down, but
since:

commit 655fec6987be05964e70c2e2efcbb253710e282f
Author:     Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
AuthorDate: Thu Sep 15 10:57:24 2016 -0400
Commit:     Anna Schumaker <Anna.Schumaker-ZwjVKphTwtPQT0dZR+AlfA@public.gmane.org>
CommitDate: Mon Sep 19 13:08:38 2016 -0400

    xprtrdma: Use gathered Send for large inline messages

Part of the Send payload can come from page cache pages for NFS WRITE
and NFS SYMLINK operations. Send buffers that are page cache pages are
DMA unmapped when rpcrdma_req is released.

IIUC what Sagi found is that Send WRs can continue running even after an
RPC completes in certain pathological cases. Therefore the Send WR can
complete after the rpcrdma_req is released and page cache-related Send
buffers have been unmapped.

It's not an issue to make the RPC reply handler wait for Send completion.
In most cases this is not going to add any additional latency, because
the Send will complete long before the RPC reply arrives. That is by far
the common case, and there the signaled Send is an extra completion
interrupt for nothing.

The problem arises if the RPC is terminated locally before the reply
arrives. Suppose, for example, user hits ^C, or a timer fires. Then the
rpcrdma_req can be released and re-used before the Send completes.
There's no way to make RPC completion wait for Send completion.

One option is to somehow split the Send-related data structures from
rpcrdma_req, and manage them independently. I've already done that for
MRs: MR state is now located in rpcrdma_mw.

If instead I just never DMA map page cache pages, then all Send buffers
are always left DMA mapped while the transport is active. There's no
problem there with Send retransmits. The overhead is that I have to
either copy data into the Send buffers, or force the server to use RDMA
Read, which has a palpable overhead.
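
A bare-bones illustration of the copy side of that trade-off (all names
are invented; the destination is assumed to be the persistently
DMA-mapped inline send buffer described above): small NFS WRITE
payloads get pulled up with memcpy so there is nothing to unmap at Send
completion, and anything larger is registered so the server moves it
with RDMA Read.

#include <linux/highmem.h>
#include <linux/kernel.h>
#include <linux/string.h>

/* Copy the page cache pages backing a small WRITE into 'sendbuf',
 * which was DMA-mapped once at transport set-up. Returns the number
 * of bytes copied. */
static size_t my_pullup_pages(char *sendbuf, struct page **pages,
                              unsigned int npages, unsigned int pgbase,
                              size_t len)
{
        size_t copied = 0;
        unsigned int i;

        for (i = 0; i < npages && copied < len; i++) {
                size_t chunk = min_t(size_t, PAGE_SIZE - pgbase,
                                     len - copied);
                char *src = kmap_atomic(pages[i]);

                memcpy(sendbuf + copied, src + pgbase, chunk);
                kunmap_atomic(src);

                copied += chunk;
                pgbase = 0;     /* only the first page may start mid-page */
        }
        return copied;
}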


>> Or I could revert all the "map page cache pages" logic and
>> just use memcpy for small NFS WRITEs, and RDMA the rest of
>> the time. That keeps everything simple, but means large
>> inline thresholds can't use send-in-place.
> 
> Don't you have the same problem with RDMA WRITE?

The server side initiates RDMA Writes. The final RDMA Write in a WR
chain is signaled, but a subsequent Send completion is used to
determine when the server may release resources used for the Writes.
We're already doing it the slow way there, and there's no ^C hazard
on the server.


--
Chuck Lever




^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-07-10 19:03                                                                                                                                                                               ` Chuck Lever
@ 2017-07-10 20:05                                                                                                                                                                                   ` Jason Gunthorpe
  -1 siblings, 0 replies; 171+ messages in thread
From: Jason Gunthorpe @ 2017-07-10 20:05 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Sagi Grimberg, Leon Romanovsky, Robert LeBlanc, Marta Rybczynska,
	Max Gurtovoy, Christoph Hellwig, Gruher, Joseph R,
	shahar.salzman, Laurence Oberman, Riches Jr, Robert M,
	linux-rdma, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	Liran Liss, Bart Van Assche

On Mon, Jul 10, 2017 at 03:03:18PM -0400, Chuck Lever wrote:

> One option is to somehow split the Send-related data structures from
> rpcrdma_req, and manage them independently. I've already done that for
> MRs: MR state is now located in rpcrdma_mw.

Yes, this is what I was implying.. Track the SQE related stuff
separately in memory allocated during SQ setup - MR, dma maps, etc.

No need for an atomic/lock then, right? The required memory is bounded
since the inline send depth is bounded.

Jason

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-07-10 20:05                                                                                                                                                                                   ` Jason Gunthorpe
@ 2017-07-10 20:51                                                                                                                                                                                       ` Chuck Lever
  -1 siblings, 0 replies; 171+ messages in thread
From: Chuck Lever @ 2017-07-10 20:51 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Sagi Grimberg, Leon Romanovsky, Robert LeBlanc, Marta Rybczynska,
	Max Gurtovoy, Christoph Hellwig, Gruher, Joseph R,
	shahar.salzman, Laurence Oberman, Riches Jr, Robert M,
	linux-rdma, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	Liran Liss, Bart Van Assche


> On Jul 10, 2017, at 4:05 PM, Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
> 
> On Mon, Jul 10, 2017 at 03:03:18PM -0400, Chuck Lever wrote:
> 
>> One option is to somehow split the Send-related data structures from
>> rpcrdma_req, and manage them independently. I've already done that for
>> MRs: MR state is now located in rpcrdma_mw.
> 
> Yes, this is what I was implying.. Track the SQE related stuff
> separately in memory allocated during SQ setup - MR, dma maps, etc.

> No need for an atomic/lock then, right? The required memory is bounded
> since the inline send depth is bounded.

Perhaps I lack some imagination, but I don't see how I can manage
these small objects without a serialized free list or circular
array that would be accessed in the forward path and also in a
Send completion handler.

And I would still have to signal all Sends, which is extra
interrupts and context switches.

This seems like a lot of overhead to deal with a very uncommon
case. I can reduce this overhead by signaling only Sends that
need to unmap page cache pages, but still.


But I also realized that Send Queue accounting can be broken by a
delayed Send completion.

As we previously discussed, xprtrdma does SQ accounting using RPC
completion as the gate. Basically xprtrdma will send another RPC
as soon as a previous one is terminated. If the Send WR is still
running when the RPC terminates, I can potentially overrun the
Send Queue.
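
One way to keep that accounting precise (sketch only, hypothetical
names): charge a Send Queue credit when the WR is posted and return it
from the Send completion handler, so that RPC termination alone can
never free up a slot the HCA is still using.

#include <linux/atomic.h>
#include <linux/errno.h>

struct my_xprt {
        atomic_t        sq_avail;       /* initialized to the SQ depth */
};

/* Post path: fail (caller defers the Send) if no SQE is free. */
static int my_charge_sqe(struct my_xprt *xprt)
{
        if (atomic_dec_if_positive(&xprt->sq_avail) < 0)
                return -EAGAIN;
        return 0;
}

/* Called from the Send completion handler, not at RPC termination,
 * so a still-running Send keeps its slot accounted for. */
static void my_return_sqe(struct my_xprt *xprt)
{
        atomic_inc(&xprt->sq_avail);
}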


--
Chuck Lever




^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-07-10 20:51                                                                                                                                                                                       ` Chuck Lever
@ 2017-07-10 21:14                                                                                                                                                                                           ` Jason Gunthorpe
  -1 siblings, 0 replies; 171+ messages in thread
From: Jason Gunthorpe @ 2017-07-10 21:14 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Sagi Grimberg, Leon Romanovsky, Robert LeBlanc, Marta Rybczynska,
	Max Gurtovoy, Christoph Hellwig, Gruher, Joseph R,
	shahar.salzman, Laurence Oberman, Riches Jr, Robert M,
	linux-rdma, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	Liran Liss, Bart Van Assche

On Mon, Jul 10, 2017 at 04:51:20PM -0400, Chuck Lever wrote:
> 
> > On Jul 10, 2017, at 4:05 PM, Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
> > 
> > On Mon, Jul 10, 2017 at 03:03:18PM -0400, Chuck Lever wrote:
> > 
> >> One option is to somehow split the Send-related data structures from
> >> rpcrdma_req, and manage them independently. I've already done that for
> >> MRs: MR state is now located in rpcrdma_mw.
> > 
> > Yes, this is what I was implying.. Track the SQE related stuff
> > separately in memory allocated during SQ setup - MR, dma maps, etc.
> 
> > No need for an atomic/lock then, right? The required memory is bounded
> > since the inline send depth is bounded.
> 
> Perhaps I lack some imagination, but I don't see how I can manage
> these small objects without a serialized free list or circular
> array that would be accessed in the forward path and also in a
> Send completion handler.

I don't get it: DMA unmap can only ever happen in the send completion
handler, it can never happen in the forward path (this is the whole
point of this thread).

Since you are not using send completions today, you can just use the
wr_id to point to the pre-allocated memory describing the pages to
invalidate. Completely remove DMA unmap from the forward path.

Usually I work things out so that the meta-data array is a ring and
every SQE post consumes a meta-data entry. Then occasionally I signal
a completion and provide a wr_id of the latest ring index, and the
completion handler runs through all the accumulated meta-data and acts
on it (e.g. unmaps, etc.). This approach still allows batching
completions.

Since ring entries are of bounded size we just preallocate the largest
size at QP creation. In this case it is some multiple of the number of
inline send pages * the number of SQE entries.
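
A condensed sketch of that ring (sizes, names, and the every-16th
signaling interval are invented; it assumes the plain wr_id is used,
as described above, and that the ring is at least as deep as the Send
Queue):

#include <rdma/ib_verbs.h>

#define MY_RING_SIZE    256     /* >= Send Queue depth */
#define MY_MAX_PAGES    4       /* max mapped pages per Send (assumed) */

struct my_sqe_meta {
        u64             dma_addr[MY_MAX_PAGES];
        unsigned int    len[MY_MAX_PAGES];
        unsigned int    npages;         /* pages to unmap for this SQE */
};

struct my_send_ring {
        struct my_sqe_meta      slots[MY_RING_SIZE];
        unsigned int            head;   /* next slot consumed on post */
        unsigned int            tail;   /* next slot cleaned on completion */
        struct ib_device        *device;
};

/* Post path: slots[head] was filled in by the caller; signal every
 * 16th Send and stamp the WR with its ring index. */
static void my_ring_post_prepare(struct my_send_ring *r,
                                 struct ib_send_wr *wr)
{
        wr->wr_id = r->head;
        wr->send_flags = (r->head % 16 == 0) ? IB_SEND_SIGNALED : 0;
        r->head = (r->head + 1) % MY_RING_SIZE;
}

/* Completion path: unmap everything accumulated up to and including
 * the slot named by wc->wr_id. */
static void my_ring_send_done(struct my_send_ring *r, const struct ib_wc *wc)
{
        unsigned int stop = ((unsigned int)wc->wr_id + 1) % MY_RING_SIZE;

        while (r->tail != stop) {
                struct my_sqe_meta *m = &r->slots[r->tail];
                unsigned int i;

                for (i = 0; i < m->npages; i++)
                        ib_dma_unmap_page(r->device, m->dma_addr[i],
                                          m->len[i], DMA_TO_DEVICE);
                m->npages = 0;
                r->tail = (r->tail + 1) % MY_RING_SIZE;
        }
}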

> This seems like a lot of overhead to deal with a very uncommon
> case. I can reduce this overhead by signaling only Sends that
> need to unmap page cache pages, but still.

Yes, but it is not avoidable..

> As we previously discussed, xprtrdma does SQ accounting using RPC
> completion as the gate. Basically xprtrdma will send another RPC
> as soon as a previous one is terminated. If the Send WR is still
> running when the RPC terminates, I can potentially overrun the
> Send Queue.

Makes sense. The SQ accounting must be precise.

Jason

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-07-10 19:03                                                                                                                                                                               ` Chuck Lever
@ 2017-07-10 21:24                                                                                                                                                                                   ` Jason Gunthorpe
  -1 siblings, 0 replies; 171+ messages in thread
From: Jason Gunthorpe @ 2017-07-10 21:24 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Sagi Grimberg, Leon Romanovsky, Robert LeBlanc, Marta Rybczynska,
	Max Gurtovoy, Christoph Hellwig, Gruher, Joseph R,
	shahar.salzman, Laurence Oberman, Riches Jr, Robert M,
	linux-rdma, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	Liran Liss, Bart Van Assche

On Mon, Jul 10, 2017 at 03:03:18PM -0400, Chuck Lever wrote:

> >> Or I could revert all the "map page cache pages" logic and
> >> just use memcpy for small NFS WRITEs, and RDMA the rest of
> >> the time. That keeps everything simple, but means large
> >> inline thresholds can't use send-in-place.
> > 
> > Don't you have the same problem with RDMA WRITE?
> 
> The server side initiates RDMA Writes. The final RDMA Write in a WR
> chain is signaled, but a subsequent Send completion is used to
> determine when the server may release resources used for the Writes.
> We're already doing it the slow way there, and there's no ^C hazard
> on the server.

Wait, I guess I meant RDMA READ path.

The same constraints apply to RKeys as inline send - you cannot DMA
unmap rkey memory until the rkey is invalidated at the HCA.

So posting an invalidate SQE and then immediately unmapping the DMA
pages is bad too..

No matter how the data is transferred the unmapping must follow the
same HCA synchronous model.. DMA unmap must only be done from the send
completion handler (inline send or invalidate rkey), from the recv
completion handler (send with invalidate), or from QP error state teardown.

Anything that does DMA memory unmap from another thread is very, very
suspect, eg async from a ctrl-c trigger event.

Jason

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-07-10 21:24                                                                                                                                                                                   ` Jason Gunthorpe
@ 2017-07-10 21:29                                                                                                                                                                                       ` Chuck Lever
  -1 siblings, 0 replies; 171+ messages in thread
From: Chuck Lever @ 2017-07-10 21:29 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Sagi Grimberg, Leon Romanovsky, Robert LeBlanc, Marta Rybczynska,
	Max Gurtovoy, Christoph Hellwig, Gruher, Joseph R,
	shahar.salzman, Laurence Oberman, Riches Jr, Robert M,
	linux-rdma, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	Liran Liss, Bart Van Assche


> On Jul 10, 2017, at 5:24 PM, Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
> 
> On Mon, Jul 10, 2017 at 03:03:18PM -0400, Chuck Lever wrote:
> 
>>>> Or I could revert all the "map page cache pages" logic and
>>>> just use memcpy for small NFS WRITEs, and RDMA the rest of
>>>> the time. That keeps everything simple, but means large
>>>> inline thresholds can't use send-in-place.
>>> 
>>> Don't you have the same problem with RDMA WRITE?
>> 
>> The server side initiates RDMA Writes. The final RDMA Write in a WR
>> chain is signaled, but a subsequent Send completion is used to
>> determine when the server may release resources used for the Writes.
>> We're already doing it the slow way there, and there's no ^C hazard
>> on the server.
> 
> Wait, I guess I meant RDMA READ path.
> 
> The same constraints apply to RKeys as inline send - you cannot DMA
> unmap rkey memory until the rkey is invalidated at the HCA.
> 
> So posting an invalidate SQE and then immediately unmapping the DMA
> pages is bad too..
> 
> No matter how the data is transferred the unmapping must follow the
> same HCA synchronous model.. DMA unmap must only be done from the send
> completion handler (inline send or invalidate rkey), from the recv
> completion handler (send with invalidate), or from QP error state teardown.
> 
> Anything that does DMA memory unmap from another thread is very, very
> suspect, eg async from a ctrl-c trigger event.

4.13 server side is converted to use the rdma_rw API for
handling RDMA Read. For non-iWARP cases, it's using the
local DMA key for Read sink buffers. For iWARP it should
be using Read-with-invalidate (IIRC).
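
For reference, a bare-bones outline of a target-side Read through the
rdma_rw API, showing how the unmap stays on the completion path
(rdma_rw_ctx_init/rdma_rw_ctx_post/rdma_rw_ctx_destroy are the real
entry points; the context struct, callback name, and all setup and
error handling around them are invented here):

#include <rdma/rw.h>

struct my_read_ctx {
        struct rdma_rw_ctx      rw;
        struct ib_cqe           cqe;    /* cqe.done = my_read_done */
        struct scatterlist      *sgl;   /* sink buffers */
        u32                     sg_cnt;
        struct ib_qp            *qp;
        u8                      port_num;
};

/* Map the sink pages and post the RDMA Read(s) for one request. */
static int my_post_read(struct my_read_ctx *c, u64 remote_addr, u32 rkey)
{
        int ret;

        ret = rdma_rw_ctx_init(&c->rw, c->qp, c->port_num, c->sgl,
                               c->sg_cnt, 0, remote_addr, rkey,
                               DMA_FROM_DEVICE);
        if (ret < 0)
                return ret;

        return rdma_rw_ctx_post(&c->rw, c->qp, c->port_num, &c->cqe, NULL);
}

/* Read completion: only now is it safe to tear the mapping down. */
static void my_read_done(struct ib_cq *cq, struct ib_wc *wc)
{
        struct my_read_ctx *c =
                container_of(wc->wr_cqe, struct my_read_ctx, cqe);

        rdma_rw_ctx_destroy(&c->rw, c->qp, c->port_num, c->sgl, c->sg_cnt,
                            DMA_FROM_DEVICE);
        /* ... hand the received payload to the upper layer ... */
}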

--
Chuck Lever




^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-07-10 21:29                                                                                                                                                                                       ` Chuck Lever
@ 2017-07-10 21:32                                                                                                                                                                                           ` Jason Gunthorpe
  -1 siblings, 0 replies; 171+ messages in thread
From: Jason Gunthorpe @ 2017-07-10 21:32 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Sagi Grimberg, Leon Romanovsky, Robert LeBlanc, Marta Rybczynska,
	Max Gurtovoy, Christoph Hellwig, Gruher, Joseph R,
	shahar.salzman, Laurence Oberman, Riches Jr, Robert M,
	linux-rdma, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	Liran Liss, Bart Van Assche

On Mon, Jul 10, 2017 at 05:29:53PM -0400, Chuck Lever wrote:
> 
> > On Jul 10, 2017, at 5:24 PM, Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
> > 
> > On Mon, Jul 10, 2017 at 03:03:18PM -0400, Chuck Lever wrote:
> > 
> >>>> Or I could revert all the "map page cache pages" logic and
> >>>> just use memcpy for small NFS WRITEs, and RDMA the rest of
> >>>> the time. That keeps everything simple, but means large
> >>>> inline thresholds can't use send-in-place.
> >>> 
> >>> Don't you have the same problem with RDMA WRITE?
> >> 
> >> The server side initiates RDMA Writes. The final RDMA Write in a WR
> >> chain is signaled, but a subsequent Send completion is used to
> >> determine when the server may release resources used for the Writes.
> >> We're already doing it the slow way there, and there's no ^C hazard
> >> on the server.
> > 
> > Wait, I guess I meant RDMA READ path.
> > 
> > The same constraints apply to RKeys as inline send - you cannot DMA
> > unmap rkey memory until the rkey is invalidated at the HCA.
> > 
> > So posting an invalidate SQE and then immediately unmapping the DMA
> > pages is bad too..
> > 
> > No matter how the data is transferred the unmapping must follow the
> > same HCA synchronous model.. DMA unmap must only be done from the send
> > completion handler (inline send or invalidate rkey), from the recv
> > completion handler (send with invalidate), or from QP error state teardown.
> > 
> > Anything that does DMA memory unmap from another thread is very, very
> > suspect, eg async from a ctrl-c trigger event.
> 
> 4.13 server side is converted to use the rdma_rw API for
> handling RDMA Read. For non-iWARP cases, it's using the
> local DMA key for Read sink buffers. For iWARP it should
> be using Read-with-invalidate (IIRC).

The server sounds fine, how does the client work?

Jason

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-07-10 21:32                                                                                                                                                                                           ` Jason Gunthorpe
@ 2017-07-10 22:04                                                                                                                                                                                               ` Chuck Lever
  -1 siblings, 0 replies; 171+ messages in thread
From: Chuck Lever @ 2017-07-10 22:04 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Sagi Grimberg, Leon Romanovsky, Robert LeBlanc, Marta Rybczynska,
	Max Gurtovoy, Christoph Hellwig, Gruher, Joseph R,
	shahar.salzman, Laurence Oberman, Riches Jr, Robert M,
	linux-rdma, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	Liran Liss, Bart Van Assche


> On Jul 10, 2017, at 5:32 PM, Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
> 
> On Mon, Jul 10, 2017 at 05:29:53PM -0400, Chuck Lever wrote:
>> 
>>> On Jul 10, 2017, at 5:24 PM, Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
>>> 
>>> On Mon, Jul 10, 2017 at 03:03:18PM -0400, Chuck Lever wrote:
>>> 
>>>>>> Or I could revert all the "map page cache pages" logic and
>>>>>> just use memcpy for small NFS WRITEs, and RDMA the rest of
>>>>>> the time. That keeps everything simple, but means large
>>>>>> inline thresholds can't use send-in-place.
>>>>> 
>>>>> Don't you have the same problem with RDMA WRITE?
>>>> 
>>>> The server side initiates RDMA Writes. The final RDMA Write in a WR
>>>> chain is signaled, but a subsequent Send completion is used to
>>>> determine when the server may release resources used for the Writes.
>>>> We're already doing it the slow way there, and there's no ^C hazard
>>>> on the server.
>>> 
>>> Wait, I guess I meant RDMA READ path.
>>> 
>>> The same constraints apply to RKeys as inline send - you cannot DMA
>>> unmap rkey memory until the rkey is invalidated at the HCA.
>>> 
>>> So posting an invalidate SQE and then immediately unmapping the DMA
>>> pages is bad too..
>>> 
>>> No matter how the data is transferred, the unmapping must follow the
>>> same HCA synchronous model.. DMA unmap must only be done from the send
>>> completion handler (inline send or invalidate rkey), from the recv
>>> completion handler (send with invalidate), or from QP error state teardown.
>>> 
>>> Anything that does DMA memory unmap from another thread is very, very
>>> suspect, eg async from a ctrl-c trigger event.
>> 
>> 4.13 server side is converted to use the rdma_rw API for
>> handling RDMA Read. For non-iWARP cases, it's using the
>> local DMA key for Read sink buffers. For iWARP it should
>> be using Read-with-invalidate (IIRC).
> 
> The server sounds fine, how does the client work?

The client does not initiate RDMA Read or Write today.


--
Chuck Lever




^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-07-10 22:04                                                                                                                                                                                               ` Chuck Lever
@ 2017-07-10 22:09                                                                                                                                                                                                   ` Jason Gunthorpe
  -1 siblings, 0 replies; 171+ messages in thread
From: Jason Gunthorpe @ 2017-07-10 22:09 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Sagi Grimberg, Leon Romanovsky, Robert LeBlanc, Marta Rybczynska,
	Max Gurtovoy, Christoph Hellwig, Gruher, Joseph R,
	shahar.salzman, Laurence Oberman, Riches Jr, Robert M,
	linux-rdma, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	Liran Liss, Bart Van Assche

On Mon, Jul 10, 2017 at 06:04:18PM -0400, Chuck Lever wrote:

> > The server sounds fine, how does the client work?
> 
> The client does not initiate RDMA Read or Write today.

Right, but it provides an rkey that the server uses for READ or WRITE.

The invalidate of that rkey at the client must follow the same rules
as inline send.

Jason

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-07-10 22:09                                                                                                                                                                                                   ` Jason Gunthorpe
@ 2017-07-11  3:57                                                                                                                                                                                                       ` Chuck Lever
  -1 siblings, 0 replies; 171+ messages in thread
From: Chuck Lever @ 2017-07-11  3:57 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Sagi Grimberg, Leon Romanovsky, Robert LeBlanc, Marta Rybczynska,
	Max Gurtovoy, Christoph Hellwig, Gruher, Joseph R,
	shahar.salzman, Laurence Oberman, Riches Jr, Robert M,
	linux-rdma, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	Liran Liss, Bart Van Assche


> On Jul 10, 2017, at 6:09 PM, Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
> 
> On Mon, Jul 10, 2017 at 06:04:18PM -0400, Chuck Lever wrote:
> 
>>> The server sounds fine, how does the client work?
>> 
>> The client does not initiate RDMA Read or Write today.
> 
> Right, but it provides an rkey that the server uses for READ or WRITE.
> 
> The invalidate of that rkey at the client must follow the same rules
> as inline send.

Ah, I see.

The RPC reply handler calls frwr_op_unmap_sync to invalidate
any MRs associated with the RPC.

frwr_op_unmap_sync has to sort the rkeys that are remotely
invalidated, and those that have not been.

The first step is to ensure all the rkeys for an RPC are
invalid. The rkey that was remotely invalidated is skipped
here, and a chain of LocalInv WRs is posted to invalidate
any remaining rkeys. The last WR in the chain is signaled.

If one or more LocalInv WRs are posted, this function waits
for LocalInv completion.

The last step is always DMA unmapping. Note that we can't
get a completion for a remotely invalidated rkey, and we
have to wait for LocalInv to complete anyway. So the DMA
unmapping is always handled here instead of in a
completion handler.

When frwr_op_unmap_sync returns to the RPC reply handler,
the handler calls xprt_complete_rqst, and the RPC is
terminated. This guarantees that the MRs are invalid before
control is returned to the RPC consumer.
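
To spell out that ordering, a stripped-down sketch might look like the
following. It is an illustration of the ordering only, not the actual
frwr_op_unmap_sync; mr_entry, inv_ctx and their fields are invented for
the example.

/* Sketch: post a LocalInv chain, wait, then DMA unmap -- in that order. */
#include <linux/completion.h>
#include <linux/string.h>
#include <rdma/ib_verbs.h>

struct mr_entry {
	struct ib_mr		*mr;
	struct scatterlist	*sgl;
	int			nents;
	enum dma_data_direction	dir;
	bool			remotely_invalidated;
	struct ib_send_wr	inv_wr;
	struct mr_entry		*next;
};

struct inv_ctx {
	struct ib_cqe		cqe;
	struct completion	done;
};

static void local_inv_done(struct ib_cq *cq, struct ib_wc *wc)
{
	struct inv_ctx *ctx = container_of(wc->wr_cqe, struct inv_ctx, cqe);

	complete(&ctx->done);	/* only the last LocalInv is signaled */
}

static void invalidate_then_unmap(struct ib_qp *qp, struct mr_entry *list)
{
	struct ib_send_wr *first = NULL, *last = NULL, *bad_wr;
	struct inv_ctx ctx;
	struct mr_entry *m;

	init_completion(&ctx.done);
	ctx.cqe.done = local_inv_done;

	/* 1. Chain a LocalInv for every rkey the target did not
	 *    invalidate remotely; only the last WR carries the cqe.
	 */
	for (m = list; m; m = m->next) {
		if (m->remotely_invalidated)
			continue;
		memset(&m->inv_wr, 0, sizeof(m->inv_wr));
		m->inv_wr.opcode = IB_WR_LOCAL_INV;
		m->inv_wr.ex.invalidate_rkey = m->mr->rkey;
		if (last)
			last->next = &m->inv_wr;
		else
			first = &m->inv_wr;
		last = &m->inv_wr;
	}

	/* 2. Post the chain and wait for the final LocalInv to complete.
	 *    (A post failure would really call for a disconnect; elided.)
	 */
	if (first) {
		last->wr_cqe = &ctx.cqe;
		last->send_flags = IB_SEND_SIGNALED;
		if (!ib_post_send(qp, first, &bad_wr))
			wait_for_completion(&ctx.done);
	}

	/* 3. Only now is the DMA unmap safe: every rkey is dead, either
	 *    by remote invalidation or by the LocalInv that just completed.
	 */
	for (m = list; m; m = m->next)
		ib_dma_unmap_sg(qp->device, m->sgl, m->nents, m->dir);
}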


In the ^C case, frwr_op_unmap_safe is invoked during RPC
termination. The MRs are passed to the background recovery
task, which invokes frwr_op_recover_mr.

frwr_op_recover_mr destroys the fr_mr and DMA unmaps the
memory. (It's also used when registration or invalidation
flushes, which is why it uses a hammer).

So here, we're a little fast/loose: the ordering of
invalidation and unmapping is correct, but the MRs can be
invalidated after the RPC completes. Since RPC termination
can't wait, this is the best I can do for now.


--
Chuck Lever




^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-07-11  3:57                                                                                                                                                                                                       ` Chuck Lever
@ 2017-07-11 13:23                                                                                                                                                                                                           ` Tom Talpey
  -1 siblings, 0 replies; 171+ messages in thread
From: Tom Talpey @ 2017-07-11 13:23 UTC (permalink / raw)
  To: Chuck Lever, Jason Gunthorpe
  Cc: Sagi Grimberg, Leon Romanovsky, Robert LeBlanc, Marta Rybczynska,
	Max Gurtovoy, Christoph Hellwig, Gruher, Joseph R,
	shahar.salzman, Laurence Oberman, Riches Jr, Robert M,
	linux-rdma, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	Liran Liss, Bart Van Assche

On 7/10/2017 11:57 PM, Chuck Lever wrote:
> 
>> On Jul 10, 2017, at 6:09 PM, Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
>>
>> On Mon, Jul 10, 2017 at 06:04:18PM -0400, Chuck Lever wrote:
>>
>>>> The server sounds fine, how does the client work?
>>>
>>> The client does not initiate RDMA Read or Write today.
>>
>> Right, but it provides an rkey that the server uses for READ or WRITE.
>>
>> The invalidate of that rkey at the client must follow the same rules
>> as inline send.
> 
> Ah, I see.
> 
> The RPC reply handler calls frwr_op_unmap_sync to invalidate
> any MRs associated with the RPC.
> 
> frwr_op_unmap_sync has to sort the rkeys that are remotely
> invalidated, and those that have not been.

Does the reply handler consider the possibility that the reply is
being signaled before the send WRs? There are some really interesting
races on shared or multiple CQs when the completion upcalls start
to back up under heavy load that we've seen in Windows SMB Direct.

In the end, we had to put explicit reference counts on each and
every object, and added rundown references to everything before
completing an operation and signaling the upper layer (SMB3, in
our case). This found a surprising number of double completions,
and missing completions from drivers as well.
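
In Linux terms the shape of that would be roughly the sketch below
(invented names, built on kref; none of this is the actual SMB Direct
code). Each posted WR takes a reference, each completion drops one, and
the operation's own reference is run down before the upper layer is
signaled.

/* Sketch: per-operation refcount with a rundown before the upcall. */
#include <linux/kref.h>
#include <linux/completion.h>

struct op {
	struct kref		ref;		/* one per outstanding WR, plus the op's own */
	struct completion	rundown;
	void			(*upcall)(struct op *op);
};

static void op_release(struct kref *ref)
{
	struct op *op = container_of(ref, struct op, ref);

	complete(&op->rundown);
}

static void op_init(struct op *op, void (*upcall)(struct op *op))
{
	kref_init(&op->ref);		/* the operation's own reference */
	init_completion(&op->rundown);
	op->upcall = upcall;
}

/* Take a reference before posting each WR (send, read, write, inv). */
static void op_get(struct op *op)
{
	kref_get(&op->ref);
}

/* Drop it from every completion handler, flush and success alike. */
static void op_put(struct op *op)
{
	kref_put(&op->ref, op_release);
}

/* Called (from a context that may sleep) when the operation is
 * logically done: drop the op's own reference, then wait for every
 * WR completion to have dropped theirs before signaling the upper
 * layer (SMB3, NFS, ...). Double or missing completions show up
 * here as hangs or underflows instead of silent corruption.
 */
static void op_complete(struct op *op)
{
	op_put(op);
	wait_for_completion(&op->rundown);
	op->upcall(op);
}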

> The first step is to ensure all the rkeys for an RPC are
> invalid. The rkey that was remotely invalidated is skipped
> here, and a chain of LocalInv WRs is posted to invalidate
> any remaining rkeys. The last WR in the chain is signaled.
> 
> If one or more LocalInv WRs are posted, this function waits
> for LocalInv completion.
> 
> The last step is always DMA unmapping. Note that we can't
> get a completion for a remotely invalidated rkey, and we
> have to wait for LocalInv to complete anyway. So the DMA
> unmapping is always handled here instead of in a
> completion handler.
> 
> When frwr_op_unmap_sync returns to the RPC reply handler,
> the handler calls xprt_complete_rqst, and the RPC is
> terminated. This guarantees that the MRs are invalid before
> control is returned to the RPC consumer.
> 
> 
> In the ^C case, frwr_op_unmap_safe is invoked during RPC
> termination. The MRs are passed to the background recovery
> task, which invokes frwr_op_recover_mr.

That worries me. How do you know it's going in sequence, and
that it will result in an invalidated MR?

> frwr_op_recover_mr destroys the fr_mr and DMA unmaps the
> memory. (It's also used when registration or invalidation
> flushes, which is why it uses a hammer).
> 
> So here, we're a little fast/loose: the ordering of
> invalidation and unmapping is correct, but the MRs can be
> invalidated after the RPC completes. Since RPC termination
> can't wait, this is the best I can do for now.

That would worry me even more. "fast/loose" isn't a good
situation when storage is concerned. Shouldn't you just be
closing the connection?

Tom.

^ permalink raw reply	[flat|nested] 171+ messages in thread

* Re: Unexpected issues with 2 NVME initiators using the same target
  2017-07-11 13:23                                                                                                                                                                                                           ` Tom Talpey
@ 2017-07-11 14:55                                                                                                                                                                                                               ` Chuck Lever
  -1 siblings, 0 replies; 171+ messages in thread
From: Chuck Lever @ 2017-07-11 14:55 UTC (permalink / raw)
  To: Tom Talpey
  Cc: Jason Gunthorpe, Sagi Grimberg, Leon Romanovsky, Robert LeBlanc,
	Marta Rybczynska, Max Gurtovoy, Christoph Hellwig, Gruher,
	Joseph R, shahar.salzman, Laurence Oberman, Riches Jr, Robert M,
	linux-rdma, linux-nvme-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	Liran Liss, Bart Van Assche

Hi Tom-


> On Jul 11, 2017, at 9:23 AM, Tom Talpey <tom-CLs1Zie5N5HQT0dZR+AlfA@public.gmane.org> wrote:
> 
> On 7/10/2017 11:57 PM, Chuck Lever wrote:
>>> On Jul 10, 2017, at 6:09 PM, Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
>>> 
>>> On Mon, Jul 10, 2017 at 06:04:18PM -0400, Chuck Lever wrote:
>>> 
>>>>> The server sounds fine, how does the client work?
>>>> 
>>>> The client does not initiate RDMA Read or Write today.
>>> 
>>> Right, but it provides an rkey that the server uses for READ or WRITE.
>>> 
>>> The invalidate of that rkey at the client must follow the same rules
>>> as inline send.
>> Ah, I see.
>> The RPC reply handler calls frwr_op_unmap_sync to invalidate
>> any MRs associated with the RPC.
>> frwr_op_unmap_sync has to sort the rkeys that are remotely
>> invalidated, and those that have not been.
> 
> Does the reply handler consider the possibility that the reply is
> being signaled before the send WRs? There are some really interesting
> races on shared or multiple CQs when the completion upcalls start
> to back up under heavy load that we've seen in Windows SMB Direct.

If I understand you correctly, that's exactly what we're discussing.
The Send WR that transmitted the RPC Call can outlive the RPC
transaction in rare cases.

A partial solution is to move the Send SGE array into a data structure
whose lifetime is independent of rpcrdma_req. That allows the client
to guarantee that appropriate DMA unmaps are done only after the Send
is complete.

The other part of this issue is that delayed Send completion also
needs to prevent initiation of another RPC, otherwise the Send Queue
can overflow or the client might exceed the credit grant on this
connection. This part worries me because it could mean that some
serialization (eg. a completion and context switch) is needed for
every Send operation on the client.
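
Roughly, the structure I have in mind looks like the sketch below. The
names are invented (this is not the current rpcrdma code); the point is
that the Send completion both unmaps the Send buffers and returns a
Send Queue credit, and a new RPC would have to take a credit before
posting its Send.

/* Sketch: a Send context whose lifetime is tied to the Send completion,
 * not to the RPC that posted it.
 */
#include <linux/semaphore.h>
#include <linux/slab.h>
#include <rdma/ib_verbs.h>

struct send_ctx {
	struct ib_cqe		cqe;
	struct ib_device	*device;
	struct ib_sge		sge[3];		/* head, pagelist, tail */
	int			num_sge;
	struct semaphore	*sq_credits;	/* taken (down) before posting */
};

static void send_done(struct ib_cq *cq, struct ib_wc *wc)
{
	struct send_ctx *sc = container_of(wc->wr_cqe, struct send_ctx, cqe);
	int i;

	/* The HCA is finished with the Send buffers: unmap them here,
	 * even if the RPC that owned them has long since terminated.
	 */
	for (i = 0; i < sc->num_sge; i++)
		ib_dma_unmap_single(sc->device, sc->sge[i].addr,
				    sc->sge[i].length, DMA_TO_DEVICE);

	/* Return the Send Queue slot so the next RPC may be initiated
	 * without overflowing the SQ or exceeding the credit grant.
	 */
	up(sc->sq_credits);
	kfree(sc);
}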


> In the end, we had to put explicit reference counts on each and
> every object, and added rundown references to everything before
> completing an operation and signaling the upper layer (SMB3, in
> our case). This found a surprising number of double completions,
> and missing completions from drivers as well.
> 
>> The first step is to ensure all the rkeys for an RPC are
>> invalid. The rkey that was remotely invalidated is skipped
>> here, and a chain of LocalInv WRs is posted to invalidate
>> any remaining rkeys. The last WR in the chain is signaled.
>> If one or more LocalInv WRs are posted, this function waits
>> for LocalInv completion.
>> The last step is always DMA unmapping. Note that we can't
>> get a completion for a remotely invalidated rkey, and we
>> have to wait for LocalInv to complete anyway. So the DMA
>> unmapping is always handled here instead of in a
>> completion handler.
>> When frwr_op_unmap_sync returns to the RPC reply handler,
>> the handler calls xprt_complete_rqst, and the RPC is
>> terminated. This guarantees that the MRs are invalid before
>> control is returned to the RPC consumer.
>> In the ^C case, frwr_op_unmap_safe is invoked during RPC
>> termination. The MRs are passed to the background recovery
>> task, which invokes frwr_op_recover_mr.
> 
> That worries me. How do you know it's going in sequence,
> and that it will result in an invalidated MR?

The MR is destroyed synchronously. Isn't that enough to guarantee
that the rkey is invalid and DMA unmapping is safe to proceed?

With FRWR, if a LocalInv flushes with "memory management error"
there's no way to invalidate that MR other than:

- destroy the MR, or

- disconnect

The latter is a mighty sledgehammer that affects all the live MRs
and RPCs on that connection. That's why recover_mr destroys the
MR in all cases.
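
In other words the recovery path boils down to something like the
sketch below (illustrative only, not the literal frwr_op_recover_mr;
the parameters are invented):

#include <rdma/ib_verbs.h>

/* Sketch: destroy the MR, then unmap, in that order. */
static void recover_mr(struct ib_device *device, struct ib_mr *frmr,
		       struct scatterlist *sgl, int nents,
		       enum dma_data_direction dir)
{
	/* ib_dereg_mr() is synchronous: once it returns, the HCA can no
	 * longer use this rkey, whatever state the flushed LocalInv
	 * left it in.
	 */
	if (ib_dereg_mr(frmr))
		pr_warn("dereg failed; leak the pages rather than unmap\n");
	else
		ib_dma_unmap_sg(device, sgl, nents, dir);

	/* A replacement MR would be allocated with ib_alloc_mr() before
	 * this slot is reused.
	 */
}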


>> frwr_op_recover_mr destroys the fr_mr and DMA unmaps the
>> memory. (It's also used when registration or invalidation
>> flushes, which is why it uses a hammer).
>> So here, we're a little fast/loose: the ordering of
>> invalidation and unmapping is correct, but the MRs can be
>> invalidated after the RPC completes. Since RPC termination
>> can't wait, this is the best I can do for now.
> 
> That would worry me even more. "fast/loose" isn't a good
> situation when storage is concerned.

I agree!


> Shouldn't you just be closing the connection?

We'd have to wait for the connection close to complete in a
function that is not allowed to sleep, so it would have to be
done in the background as well. This is no better than handing
the MR to a background process.

And if possible we'd like to avoid connection loss (see above).

I don't like this situation, but it's the best I can do with
the current architecture of the RPC client. I've been staring
at this for a couple of years now wondering how to fix it.
Suggestions happily accepted!


--
Chuck Lever




^ permalink raw reply	[flat|nested] 171+ messages in thread

end of thread, other threads:[~2017-07-11 14:55 UTC | newest]

Thread overview: 171+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-02-21 19:38 Unexpected issues with 2 NVME initiators using the same target shahar.salzman
     [not found] ` <08131a05-1f56-ef61-990a-7fff04eea095-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2017-02-21 22:50   ` Sagi Grimberg
2017-02-21 22:50     ` Sagi Grimberg
     [not found]     ` <de1a559a-bf24-0d73-5fc7-148d6cd4d4e0-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
2017-02-22 16:52       ` Laurence Oberman
2017-02-22 16:52         ` Laurence Oberman
     [not found]         ` <1848296658.37025722.1487782361271.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-02-22 19:39           ` Sagi Grimberg
2017-02-22 19:39             ` Sagi Grimberg
2017-02-26  8:03           ` shahar.salzman
2017-02-26  8:03             ` shahar.salzman
2017-02-26 17:58             ` Gruher, Joseph R
     [not found]               ` <DE927C68B458BE418D582EC97927A92854655137-8oqHQFITsIFcIJlls4ac1rfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2017-02-27 20:33                 ` Sagi Grimberg
2017-02-27 20:33                   ` Sagi Grimberg
2017-02-27 20:57                   ` Gruher, Joseph R
2017-03-05 18:23                   ` Leon Romanovsky
2017-03-06  0:07                   ` Max Gurtovoy
     [not found]                     ` <26912d0c-578f-26e9-490d-94fc95bdf259-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2017-03-06 11:28                       ` Sagi Grimberg
2017-03-06 11:28                         ` Sagi Grimberg
2017-03-07  9:27                         ` Max Gurtovoy
     [not found]                           ` <fbd647dd-3a16-8155-107d-f98e8326cc63-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2017-03-07 13:41                             ` Sagi Grimberg
2017-03-07 13:41                               ` Sagi Grimberg
     [not found]                               ` <c5a6f55b-a633-53ca-4378-7a7eaf7f77bd-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
2017-03-09 12:18                                 ` shahar.salzman
2017-03-09 12:18                                   ` shahar.salzman
2017-03-12 12:33                           ` Vladimir Neyelov
     [not found]                             ` <AM4PR0501MB278621363209E177A738D75FCB220-dp/nxUn679hhbxXPg6FtWcDSnupUy6xnnBOFsp37pqbUKgpGm//BTAC/G2K4zDHf@public.gmane.org>
2017-03-13  9:43                               ` Sagi Grimberg
2017-03-13  9:43                                 ` Sagi Grimberg
2017-03-14  8:55                                 ` Max Gurtovoy
2017-03-14 19:57                     ` Gruher, Joseph R
2017-03-14 23:42                       ` Gruher, Joseph R
2017-03-16  0:03                         ` Gruher, Joseph R
2017-03-17 18:37                     ` Gruher, Joseph R
2017-03-17 19:49                       ` Max Gurtovoy
     [not found]                         ` <DE927C68B458BE418D582EC97927A928550391C2@ORSMSX113.amr.corp.intel.com>
2017-03-24 18:30                           ` Gruher, Joseph R
2017-03-27 14:17                             ` Max Gurtovoy
2017-03-27 15:39                               ` Gruher, Joseph R
2017-03-28  8:38                                 ` Max Gurtovoy
2017-03-28 10:21                                   ` shahar.salzman
     [not found]                             ` <DE927C68B458BE418D582EC97927A928550419FA-8oqHQFITsIFQxe9IK+vIArfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2017-03-28 11:34                               ` Sagi Grimberg
2017-03-28 11:34                                 ` Sagi Grimberg
     [not found]                         ` <809f87ab-b787-9d40-5840-07500d12e81a-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2017-04-10 11:40                           ` Marta Rybczynska
2017-04-10 11:40                             ` Marta Rybczynska
2017-04-10 14:09                             ` Max Gurtovoy
     [not found]                               ` <33e2cc35-f147-d4a4-9a42-8f1245e35842-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2017-04-11 12:47                                 ` Marta Rybczynska
2017-04-11 12:47                                   ` Marta Rybczynska
2017-04-20 10:18                                 ` Sagi Grimberg
2017-04-20 10:18                                   ` Sagi Grimberg
2017-04-26 11:56                                   ` Max Gurtovoy
     [not found]                                     ` <af9044c2-fb7c-e8b3-d8fc-4874cfd1bb67-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2017-04-26 14:45                                       ` Sagi Grimberg
2017-04-26 14:45                                         ` Sagi Grimberg
2017-05-12 19:20                                         ` Gruher, Joseph R
     [not found]                                           ` <DE927C68B458BE418D582EC97927A92855088C6F-8oqHQFITsIFQxe9IK+vIArfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2017-05-15 12:00                                             ` Sagi Grimberg
2017-05-15 12:00                                               ` Sagi Grimberg
     [not found]                                               ` <82dd5b24-5657-ae5e-8a33-646fddd8b75b-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
2017-05-15 13:31                                                 ` Leon Romanovsky
2017-05-15 13:31                                                   ` Leon Romanovsky
     [not found]                                                   ` <20170515133122.GG3616-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
2017-05-15 13:43                                                     ` Sagi Grimberg
2017-05-15 13:43                                                       ` Sagi Grimberg
     [not found]                                                       ` <9465cd0c-83db-b058-7615-5626ef60dbb0-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
2017-05-15 14:36                                                         ` Leon Romanovsky
2017-05-15 14:36                                                           ` Leon Romanovsky
     [not found]                                                           ` <20170515143632.GH3616-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
2017-05-15 14:59                                                             ` Christoph Hellwig
2017-05-15 14:59                                                               ` Christoph Hellwig
     [not found]                                                               ` <20170515145952.GA7871-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
2017-05-15 17:05                                                                 ` Leon Romanovsky
2017-05-15 17:05                                                                   ` Leon Romanovsky
     [not found]                                                                   ` <20170515170506.GK3616-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
2017-05-17 12:56                                                                     ` Marta Rybczynska
2017-05-17 12:56                                                                       ` Marta Rybczynska
     [not found]                                                                       ` <779753075.36035391.1495025796237.JavaMail.zimbra-FNhOzJFKnXGHXe+LvDLADg@public.gmane.org>
2017-05-18 13:34                                                                         ` Leon Romanovsky
2017-05-18 13:34                                                                           ` Leon Romanovsky
     [not found]                                                                           ` <20170518133439.GD3616-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
2017-06-19 17:21                                                                             ` Robert LeBlanc
2017-06-19 17:21                                                                               ` Robert LeBlanc
     [not found]                                                                               ` <CAANLjFrCLpX3nb3q7LpFPpLJKciU+1Hvmt_hxyTovQJM2-zQmg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-06-20  6:39                                                                                 ` Sagi Grimberg
2017-06-20  6:39                                                                                   ` Sagi Grimberg
     [not found]                                                                                   ` <6073e553-e8c2-6d14-ba5d-c2bd5aff15eb-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
2017-06-20  7:46                                                                                     ` Leon Romanovsky
2017-06-20  7:46                                                                                       ` Leon Romanovsky
     [not found]                                                                                       ` <20170620074639.GP17846-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
2017-06-20  7:58                                                                                         ` Sagi Grimberg
2017-06-20  7:58                                                                                           ` Sagi Grimberg
     [not found]                                                                                           ` <1c706958-992e-b104-6bae-4a6616c0a9f9-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
2017-06-20  8:33                                                                                             ` Leon Romanovsky
2017-06-20  8:33                                                                                               ` Leon Romanovsky
     [not found]                                                                                               ` <20170620083309.GQ17846-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
2017-06-20  9:33                                                                                                 ` Sagi Grimberg
2017-06-20  9:33                                                                                                   ` Sagi Grimberg
2017-06-20 10:31                                                                                                   ` Max Gurtovoy
     [not found]                                                                                                     ` <78b2c1db-6ece-0274-c4c9-5ee1f7c88469-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2017-06-20 22:58                                                                                                       ` Robert LeBlanc
2017-06-20 22:58                                                                                                         ` Robert LeBlanc
2017-06-27  7:16                                                                                                         ` Sagi Grimberg
2017-06-27  7:16                                                                                                           ` Sagi Grimberg
     [not found]                                                                                                   ` <bd0b986f-9bed-3dfa-7454-0661559a527b-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
2017-06-20 12:02                                                                                                     ` Sagi Grimberg
2017-06-20 12:02                                                                                                       ` Sagi Grimberg
2017-06-20 13:28                                                                                                       ` Max Gurtovoy
     [not found]                                                                                                       ` <614481c7-22dd-d93b-e97e-52f868727ec3-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
2017-06-20 17:01                                                                                                         ` Chuck Lever
2017-06-20 17:01                                                                                                           ` Chuck Lever
     [not found]                                                                                                           ` <59FF0C04-2BFB-4F66-81BA-A598A9A087FC-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
2017-06-20 17:12                                                                                                             ` Sagi Grimberg
2017-06-20 17:12                                                                                                               ` Sagi Grimberg
2017-06-20 17:35                                                                                                             ` Jason Gunthorpe
2017-06-20 17:35                                                                                                               ` Jason Gunthorpe
     [not found]                                                                                                               ` <20170620173532.GA827-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2017-06-20 18:17                                                                                                                 ` Chuck Lever
2017-06-20 18:17                                                                                                                   ` Chuck Lever
     [not found]                                                                                                                   ` <D3DC49A2-FFC9-4F62-8876-3E6AD5167DE5-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
2017-06-20 19:27                                                                                                                     ` Jason Gunthorpe
2017-06-20 19:27                                                                                                                       ` Jason Gunthorpe
     [not found]                                                                                                                       ` <20170620192742.GB827-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2017-06-20 20:56                                                                                                                         ` Chuck Lever
2017-06-20 20:56                                                                                                                           ` Chuck Lever
     [not found]                                                                                                                           ` <C14B071E-F1B2-466A-82CF-4E20BFAD9DC1-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
2017-06-20 21:19                                                                                                                             ` Jason Gunthorpe
2017-06-20 21:19                                                                                                                               ` Jason Gunthorpe
     [not found]                                                                                                                               ` <20170620211958.GA5574-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2017-06-27  7:37                                                                                                                                 ` Sagi Grimberg
2017-06-27  7:37                                                                                                                                   ` Sagi Grimberg
     [not found]                                                                                                                                   ` <4f0812f1-0067-4e63-e383-b913ee1f319d-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
2017-06-27 14:42                                                                                                                                     ` Chuck Lever
2017-06-27 14:42                                                                                                                                       ` Chuck Lever
     [not found]                                                                                                                                       ` <28F6F58E-B6F4-4114-8DFF-B72353CE814B-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
2017-06-27 16:07                                                                                                                                         ` Sagi Grimberg
2017-06-27 16:07                                                                                                                                           ` Sagi Grimberg
     [not found]                                                                                                                                           ` <52ad3547-efcf-f428-6b39-117efda3379f-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
2017-06-27 16:28                                                                                                                                             ` Jason Gunthorpe
2017-06-27 16:28                                                                                                                                               ` Jason Gunthorpe
     [not found]                                                                                                                                               ` <20170627162800.GA22592-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2017-06-28  7:03                                                                                                                                                 ` Sagi Grimberg
2017-06-28  7:03                                                                                                                                                   ` Sagi Grimberg
2017-06-27 16:28                                                                                                                                             ` Chuck Lever
2017-06-27 16:28                                                                                                                                               ` Chuck Lever
     [not found]                                                                                                                                               ` <9990B5CB-E0FF-481E-9F34-21EACF0E796E-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
2017-06-28  7:08                                                                                                                                                 ` Sagi Grimberg
2017-06-28  7:08                                                                                                                                                   ` Sagi Grimberg
     [not found]                                                                                                                                                   ` <f1f1a68c-90db-e6bf-e35e-55c4b469c339-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
2017-06-28 16:11                                                                                                                                                     ` Chuck Lever
2017-06-28 16:11                                                                                                                                                       ` Chuck Lever
     [not found]                                                                                                                                                       ` <7D1C540B-FEA0-4101-8B58-87BCB7DB5492-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
2017-06-29  5:35                                                                                                                                                         ` Sagi Grimberg
2017-06-29  5:35                                                                                                                                                           ` Sagi Grimberg
     [not found]                                                                                                                                                           ` <66b1b8be-e506-50b8-c01f-fa0e3cea98a4-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
2017-06-29 14:55                                                                                                                                                             ` Chuck Lever
2017-06-29 14:55                                                                                                                                                               ` Chuck Lever
     [not found]                                                                                                                                                               ` <9D8C7BC8-7E18-405A-9017-9DB23A6B5C15-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
2017-07-02  9:45                                                                                                                                                                 ` Sagi Grimberg
2017-07-02  9:45                                                                                                                                                                   ` Sagi Grimberg
     [not found]                                                                                                                                                                   ` <11aa1a24-9f0b-dbb8-18eb-ad357c7727b2-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
2017-07-02 18:17                                                                                                                                                                     ` Chuck Lever
2017-07-02 18:17                                                                                                                                                                       ` Chuck Lever
     [not found]                                                                                                                                                                       ` <9E30754F-464A-4B62-ADE7-F6B2F6D95763-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
2017-07-09 16:47                                                                                                                                                                         ` Jason Gunthorpe
     [not found]                                                                                                                                                                           ` <20170709164755.GB3058-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2017-07-10 19:03                                                                                                                                                                             ` Chuck Lever
     [not found]                                                                                                                                                                               ` <A7C8C159-E916-4060-9FD1-8726D816B3C0-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
2017-07-10 20:05                                                                                                                                                                                 ` Jason Gunthorpe
     [not found]                                                                                                                                                                                   ` <20170710200522.GA19293-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2017-07-10 20:51                                                                                                                                                                                     ` Chuck Lever
     [not found]                                                                                                                                                                                       ` <C04891DF-5B3B-4156-9E04-9E18B238864A-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
2017-07-10 21:14                                                                                                                                                                                         ` Jason Gunthorpe
2017-07-10 21:24                                                                                                                                                                                 ` Jason Gunthorpe
     [not found]                                                                                                                                                                                   ` <20170710212430.GA21721-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2017-07-10 21:29                                                                                                                                                                                     ` Chuck Lever
     [not found]                                                                                                                                                                                       ` <C142385D-3A54-44C9-BA3D-0AABBC5E9E7B-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
2017-07-10 21:32                                                                                                                                                                                         ` Jason Gunthorpe
     [not found]                                                                                                                                                                                           ` <20170710213251.GA21908-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2017-07-10 22:04                                                                                                                                                                                             ` Chuck Lever
     [not found]                                                                                                                                                                                               ` <A342254B-1ECB-4644-8D68-328A52940C52-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
2017-07-10 22:09                                                                                                                                                                                                 ` Jason Gunthorpe
     [not found]                                                                                                                                                                                                   ` <20170710220905.GA22589-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2017-07-11  3:57                                                                                                                                                                                                     ` Chuck Lever
     [not found]                                                                                                                                                                                                       ` <718A099D-3597-4262-9A33-0BA7EFE5461F-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
2017-07-11 13:23                                                                                                                                                                                                         ` Tom Talpey
     [not found]                                                                                                                                                                                                           ` <8dd77b19-3846-96ea-f50f-22182989e941-CLs1Zie5N5HQT0dZR+AlfA@public.gmane.org>
2017-07-11 14:55                                                                                                                                                                                                             ` Chuck Lever
2017-06-27 18:08                                                                                                                                     ` Bart Van Assche
     [not found]                                                                                                                                       ` <1498586933.14963.1.camel-Sjgp3cTcYWE@public.gmane.org>
2017-06-27 18:14                                                                                                                                         ` Jason Gunthorpe
2017-06-28  7:16                                                                                                                                         ` Sagi Grimberg
     [not found]                                                                                                                                           ` <47bbf598-6c82-610f-dc1d-706a1d869b8d-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
2017-06-28  9:43                                                                                                                                             ` Bart Van Assche
2017-06-20 17:08                                                                                                         ` Robert LeBlanc
     [not found]                                                                                                           ` <CAANLjFrpHzapvqgBajUu7QpgNNPvNjThMZrsXGcKt58E+6siMA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-06-20 17:19                                                                                                             ` Sagi Grimberg
     [not found]                                                                                                               ` <3f3830eb-68c5-f862-58c7-7021e6462f6f-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
2017-06-20 17:28                                                                                                                 ` Robert LeBlanc
     [not found]                                                                                                                   ` <CAANLjFo_1E5nPkXMmG0VxtXpQ2m6=WNd0OtvG4rd-__1TD0-VQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2017-06-27  7:22                                                                                                                     ` Sagi Grimberg
2017-06-20 14:43                                                                                                     ` Robert LeBlanc
2017-06-20 14:41                                                                                             ` Robert LeBlanc
     [not found]             ` <1554c1d1-6bf4-9ca2-12d4-a0125d8c5715-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2017-02-27 20:13               ` Sagi Grimberg
