linux-rdma.vger.kernel.org archive mirror
* [PATCH v3 0/1] RDMA/rxe: Bump up default maximum values used via uverbs
@ 2021-07-18 22:59 Rao Shoaib
  2021-07-18 22:59 ` [PATCH v3 1/1] " Rao Shoaib
  2021-07-27 16:15 ` [PATCH v3 0/1] " Shoaib Rao
  0 siblings, 2 replies; 23+ messages in thread
From: Rao Shoaib @ 2021-07-18 22:59 UTC (permalink / raw)
  To: linux-rdma, jgg; +Cc: rao.shoaib

Changes since 1st rev:
	Fixed a build issue reported by the kernel test robot
	Fixed the index not being calculated properly (pointed out by zyjzyj2000@gmail.com)

Rao Shoaib (1):
  RDMA/rxe: Bump up default maximum values used via uverbs

 drivers/infiniband/sw/rxe/rxe_param.h | 30 ++++++++++++++-------------
 1 file changed, 16 insertions(+), 14 deletions(-)

-- 
2.27.0


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH v3 1/1] RDMA/rxe: Bump up default maximum values used via uverbs
  2021-07-18 22:59 [PATCH v3 0/1] RDMA/rxe: Bump up default maximum values used via uverbs Rao Shoaib
@ 2021-07-18 22:59 ` Rao Shoaib
  2021-07-27 16:15 ` [PATCH v3 0/1] " Shoaib Rao
  1 sibling, 0 replies; 23+ messages in thread
From: Rao Shoaib @ 2021-07-18 22:59 UTC (permalink / raw)
  To: linux-rdma, jgg; +Cc: rao.shoaib

From: Rao Shoaib <rshoaib@ca-dev141.us.oracle.com>

In our internal testing we have found that the
current maximum values are too small.
Ideally there would be no limits, but the maximum
values are currently reported via ibv_query_device,
so the limits have to stay. Instead, the defaults
have been raised to a large value.

Resubmitting after fixing an issue reported by test robot.

Reported-by: kernel test robot <lkp@intel.com>

Signed-off-by: Rao Shoaib <rshoaib@ca-dev141.us.oracle.com>
---
 drivers/infiniband/sw/rxe/rxe_param.h | 30 ++++++++++++++-------------
 1 file changed, 16 insertions(+), 14 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_param.h b/drivers/infiniband/sw/rxe/rxe_param.h
index 742e6ec93686..d703a54df4ec 100644
--- a/drivers/infiniband/sw/rxe/rxe_param.h
+++ b/drivers/infiniband/sw/rxe/rxe_param.h
@@ -9,6 +9,8 @@
 
 #include <uapi/rdma/rdma_user_rxe.h>
 
+#define DEFAULT_MAX_VALUE (1 << 20)
+
 static inline enum ib_mtu rxe_mtu_int_to_enum(int mtu)
 {
 	if (mtu < 256)
@@ -37,7 +39,7 @@ static inline enum ib_mtu eth_mtu_int_to_enum(int mtu)
 enum rxe_device_param {
 	RXE_MAX_MR_SIZE			= -1ull,
 	RXE_PAGE_SIZE_CAP		= 0xfffff000,
-	RXE_MAX_QP_WR			= 0x4000,
+	RXE_MAX_QP_WR			= DEFAULT_MAX_VALUE,
 	RXE_DEVICE_CAP_FLAGS		= IB_DEVICE_BAD_PKEY_CNTR
 					| IB_DEVICE_BAD_QKEY_CNTR
 					| IB_DEVICE_AUTO_PATH_MIG
@@ -58,42 +60,42 @@ enum rxe_device_param {
 	RXE_MAX_INLINE_DATA		= RXE_MAX_WQE_SIZE -
 					  sizeof(struct rxe_send_wqe),
 	RXE_MAX_SGE_RD			= 32,
-	RXE_MAX_CQ			= 16384,
+	RXE_MAX_CQ			= DEFAULT_MAX_VALUE,
 	RXE_MAX_LOG_CQE			= 15,
-	RXE_MAX_PD			= 0x7ffc,
+	RXE_MAX_PD			= DEFAULT_MAX_VALUE,
 	RXE_MAX_QP_RD_ATOM		= 128,
 	RXE_MAX_RES_RD_ATOM		= 0x3f000,
 	RXE_MAX_QP_INIT_RD_ATOM		= 128,
 	RXE_MAX_MCAST_GRP		= 8192,
 	RXE_MAX_MCAST_QP_ATTACH		= 56,
 	RXE_MAX_TOT_MCAST_QP_ATTACH	= 0x70000,
-	RXE_MAX_AH			= 100,
-	RXE_MAX_SRQ_WR			= 0x4000,
+	RXE_MAX_AH			= DEFAULT_MAX_VALUE,
+	RXE_MAX_SRQ_WR			= DEFAULT_MAX_VALUE,
 	RXE_MIN_SRQ_WR			= 1,
 	RXE_MAX_SRQ_SGE			= 27,
 	RXE_MIN_SRQ_SGE			= 1,
 	RXE_MAX_FMR_PAGE_LIST_LEN	= 512,
-	RXE_MAX_PKEYS			= 1,
+	RXE_MAX_PKEYS			= 64,
 	RXE_LOCAL_CA_ACK_DELAY		= 15,
 
-	RXE_MAX_UCONTEXT		= 512,
+	RXE_MAX_UCONTEXT		= DEFAULT_MAX_VALUE,
 
 	RXE_NUM_PORT			= 1,
 
-	RXE_MAX_QP			= 0x10000,
 	RXE_MIN_QP_INDEX		= 16,
-	RXE_MAX_QP_INDEX		= 0x00020000,
+	RXE_MAX_QP_INDEX		= DEFAULT_MAX_VALUE,
+	RXE_MAX_QP			= DEFAULT_MAX_VALUE - RXE_MIN_QP_INDEX,
 
-	RXE_MAX_SRQ			= 0x00001000,
 	RXE_MIN_SRQ_INDEX		= 0x00020001,
-	RXE_MAX_SRQ_INDEX		= 0x00040000,
+	RXE_MAX_SRQ_INDEX		= DEFAULT_MAX_VALUE,
+	RXE_MAX_SRQ			= DEFAULT_MAX_VALUE - RXE_MIN_SRQ_INDEX,
 
-	RXE_MAX_MR			= 0x00001000,
-	RXE_MAX_MW			= 0x00001000,
 	RXE_MIN_MR_INDEX		= 0x00000001,
-	RXE_MAX_MR_INDEX		= 0x00010000,
+	RXE_MAX_MR_INDEX		= DEFAULT_MAX_VALUE,
+	RXE_MAX_MR			= DEFAULT_MAX_VALUE - RXE_MIN_MR_INDEX,
 	RXE_MIN_MW_INDEX		= 0x00010001,
 	RXE_MAX_MW_INDEX		= 0x00020000,
+	RXE_MAX_MW			= 0x00001000,
 
 	RXE_MAX_PKT_PER_ACK		= 64,
 
-- 
2.27.0



* Re: [PATCH v3 0/1] RDMA/rxe: Bump up default maximum values used via uverbs
  2021-07-18 22:59 [PATCH v3 0/1] RDMA/rxe: Bump up default maximum values used via uverbs Rao Shoaib
  2021-07-18 22:59 ` [PATCH v3 1/1] " Rao Shoaib
@ 2021-07-27 16:15 ` Shoaib Rao
  2021-07-27 17:41   ` Jason Gunthorpe
  1 sibling, 1 reply; 23+ messages in thread
From: Shoaib Rao @ 2021-07-27 16:15 UTC (permalink / raw)
  To: linux-rdma, jgg

Hi Jason et al,

Can I please get an up or down comment on my patch?

Shoaib

On 7/18/21 3:59 PM, Rao Shoaib wrote:
> Changes since 1st rev:
> 	Fixed an issue reported by kernel robot build
> 	Fixed index not being calculated properly by zyjzyj2000@gmail.com
>
> Rao Shoaib (1):
>    RDMA/rxe: Bump up default maximum values used via uverbs
>
>   drivers/infiniband/sw/rxe/rxe_param.h | 30 ++++++++++++++-------------
>   1 file changed, 16 insertions(+), 14 deletions(-)
>


* Re: [PATCH v3 0/1] RDMA/rxe: Bump up default maximum values used via uverbs
  2021-07-27 16:15 ` [PATCH v3 0/1] " Shoaib Rao
@ 2021-07-27 17:41   ` Jason Gunthorpe
  2021-07-29  6:42     ` Zhu Yanjun
  0 siblings, 1 reply; 23+ messages in thread
From: Jason Gunthorpe @ 2021-07-27 17:41 UTC (permalink / raw)
  To: Shoaib Rao; +Cc: linux-rdma

On Tue, Jul 27, 2021 at 09:15:45AM -0700, Shoaib Rao wrote:
> Hi Jason et al,
> 
> Can I please get an up or down comment on my patch?

Bob and Zhu should check it

Jason


* Re: [PATCH v3 0/1] RDMA/rxe: Bump up default maximum values used via uverbs
  2021-07-27 17:41   ` Jason Gunthorpe
@ 2021-07-29  6:42     ` Zhu Yanjun
  2021-07-29  6:52       ` Shoaib Rao
  0 siblings, 1 reply; 23+ messages in thread
From: Zhu Yanjun @ 2021-07-29  6:42 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Shoaib Rao, RDMA mailing list

On Wed, Jul 28, 2021 at 1:42 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Tue, Jul 27, 2021 at 09:15:45AM -0700, Shoaib Rao wrote:
> > Hi Jason et al,
> >
> > Can I please get an up or down comment on my patch?
>
> Bob and Zhu should check it

In my daily tests, with one host running 5.12-stable and the other
running 5.14-rc3 plus this commit, rping does not work, and sometimes
a crash occurs.

It seems that changing maximum values breaks backward compatibility.

But without this commit, that is, 5.12-stable <-------> 5.14-rc3,
rping works well.

Zhu Yanjun
>
> Jason


* Re: [PATCH v3 0/1] RDMA/rxe: Bump up default maximum values used via uverbs
  2021-07-29  6:42     ` Zhu Yanjun
@ 2021-07-29  6:52       ` Shoaib Rao
  2021-07-29  7:57         ` Zhu Yanjun
  0 siblings, 1 reply; 23+ messages in thread
From: Shoaib Rao @ 2021-07-29  6:52 UTC (permalink / raw)
  To: Zhu Yanjun, Jason Gunthorpe; +Cc: RDMA mailing list


On 7/28/21 11:42 PM, Zhu Yanjun wrote:
> On Wed, Jul 28, 2021 at 1:42 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>> On Tue, Jul 27, 2021 at 09:15:45AM -0700, Shoaib Rao wrote:
>>> Hi Jason et al,
>>>
>>> Can I please get an up or down comment on my patch?
>> Bob and Zhu should check it
> In my daily tests, I found that one host 5.12-stable, the other host
> is 5.14.-rc3 + this commit.
> rping can not work. Sometimes crash will occur.
Can you paste the stack?
>
> It seems that changing maximum values breaks backward compatibility.
>
> But without this commit, that is, 5.12-stable <-------> 5.14-rc3,
> rping can work well.

That is strange, because all the large values do is initialize the pool
with large values, nothing else. So unless large values are actually used,
there should be no issues. Is it possible that the issue is with 5.14-rc3
itself? Do things work between two 5.12-stable systems? Anyway, please post
the stack trace, along with information on the setup and the rping commands
used.

Shoaib

>
> Zhu Yanjun
>> Jason


* Re: [PATCH v3 0/1] RDMA/rxe: Bump up default maximum values used via uverbs
  2021-07-29  6:52       ` Shoaib Rao
@ 2021-07-29  7:57         ` Zhu Yanjun
  2021-07-29 19:33           ` Shoaib Rao
  0 siblings, 1 reply; 23+ messages in thread
From: Zhu Yanjun @ 2021-07-29  7:57 UTC (permalink / raw)
  To: Shoaib Rao; +Cc: Jason Gunthorpe, RDMA mailing list

On Thu, Jul 29, 2021 at 2:52 PM Shoaib Rao <rao.shoaib@oracle.com> wrote:
>
>
> On 7/28/21 11:42 PM, Zhu Yanjun wrote:
> > On Wed, Jul 28, 2021 at 1:42 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >> On Tue, Jul 27, 2021 at 09:15:45AM -0700, Shoaib Rao wrote:
> >>> Hi Jason et al,
> >>>
> >>> Can I please get an up or down comment on my patch?
> >> Bob and Zhu should check it
> > In my daily tests, I found that one host 5.12-stable, the other host
> > is 5.14.-rc3 + this commit.
> > rping can not work. Sometimes crash will occur.
> Can you paste the stack?

[  381.068203] rdma_rxe: qp#17 moved to error state
[  421.464485] BUG: unable to handle page fault for address: ffff9e5de298d180
[  421.464515] #PF: supervisor write access in kernel mode
[  421.464532] #PF: error_code(0x0002) - not-present page
[  421.464549] PGD 100c00067 P4D 100c00067 PUD 100dc1067 PMD 125e78067 PTE 0
[  421.464572] Oops: 0002 [#1] SMP PTI
[  421.464585] CPU: 25 PID: 0 Comm: swapper/25 Kdump: loaded Tainted: G S      W  OE     5.13.1-rxe+ #17
[  421.464613] Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.02.01.0008.031920191559 03/19/2019
[  421.464642] RIP: 0010:rxe_cq_post+0x98/0x210 [rdma_rxe]
[  421.464667] Code: 8b b3 48 01 00 00 4d 8b 48 08 41 8b 48 28 49 8d b9 80 01 00 00 85 f6 0f 84 78 01 00 00 41 8b 50 34 d3 e2 48 01 fa 48 8b 4d 00 <48> 89 0a 48 8b 4d 08 48 89 4a 08 48 8b 4d 10 48 89 4a 10 48 8b 4d
[  421.464718] RSP: 0018:ffff9e5dc6ce0918 EFLAGS: 00010082
[  421.464735] RAX: 0000000000000246 RBX: ffff8b200cabd800 RCX: 0000000000000000
[  421.464756] RDX: ffff9e5de298d180 RSI: 0000000000000001 RDI: ffff9e5dc698b180
[  421.464777] RBP: ffff9e5dc6ce09c0 R08: ffff8b2014d85a80 R09: ffff9e5dc698b000
[  421.464797] R10: ffffffff8bc90940 R11: 0000000000000001 R12: 0000000000000000
[  421.464817] R13: ffff8b200cabd940 R14: ffff8b206e014008 R15: 000000000000001a
[  421.464838] FS:  0000000000000000(0000) GS:ffff8b1fd1040000(0000) knlGS:0000000000000000
[  421.464861] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  421.464879] CR2: ffff9e5de298d180 CR3: 0000000c4df4e006 CR4: 00000000007706e0
[  421.464899] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  421.464920] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  421.464941] PKRU: 55555554
[  421.464950] Call Trace:
[  421.464961]  <IRQ>
[  421.464971]  rxe_responder+0x621/0x2480 [rdma_rxe]
[  421.464993]  ? __fib_validate_source+0x2e9/0x450
[  421.465013]  rxe_do_task+0x89/0x100 [rdma_rxe]
[  421.465033]  rxe_rcv+0x2eb/0x900 [rdma_rxe]
[  421.465050]  ? __udp4_lib_lookup+0x2c8/0x440
[  421.465065]  rxe_udp_encap_recv+0x68/0xc0 [rdma_rxe]
[  421.465085]  ? rxe_enable_task+0x10/0x10 [rdma_rxe]
[  421.465104]  udp_queue_rcv_one_skb+0x1df/0x4e0
[  421.465120]  udp_unicast_rcv_skb.isra.67+0x74/0x90
[  421.465135]  __udp4_lib_rcv+0x555/0xb90
[  421.465150]  ? nf_ct_deliver_cached_events+0xc1/0x120 [nf_conntrack]
[  421.465181]  ip_protocol_deliver_rcu+0xe8/0x1b0
[  421.465199]  ip_local_deliver_finish+0x44/0x50
[  421.465215]  ip_local_deliver+0xf1/0x100
[  421.465229]  ? coalesce_fill_reply+0x2c1/0x480
[  421.465249]  ? ip_protocol_deliver_rcu+0x1b0/0x1b0
[  421.465265]  ip_sublist_rcv_finish+0x75/0x80
[  421.465281]  ip_sublist_rcv+0x196/0x220
[  421.465296]  ? ip_local_deliver+0x100/0x100
[  421.465312]  ip_list_rcv+0x137/0x160
[  421.465325]  __netif_receive_skb_list_core+0x29b/0x2c0
[  421.465344]  netif_receive_skb_list_internal+0x1c3/0x2f0
[  421.465361]  gro_normal_list.part.158+0x19/0x40
[  421.465376]  napi_complete_done+0x67/0x160
[  421.465391]  i40e_napi_poll+0x53b/0x840 [i40e]
[  421.465426]  __napi_poll+0x2b/0x120
[  421.466123]  net_rx_action+0x236/0x300
[  421.466783]  __do_softirq+0xc9/0x285
[  421.467440]  irq_exit_rcu+0xba/0xd0
[  421.468091]  common_interrupt+0x7f/0xa0
[  421.468737]  </IRQ>
[  421.469366]  asm_common_interrupt+0x1e/0x40
[  421.469990] RIP: 0010:cpuidle_enter_state+0xd6/0x350
[  421.470608] Code: 49 89 c4 0f 1f 44 00 00 31 ff e8 45 49 99 ff 45 84 ff 74 12 9c 58 f6 c4 02 0f 85 32 02 00 00 31 ff e8 ae c8 9f ff fb 45 85 f6 <0f> 88 e0 00 00 00 49 63 d6 4c 2b 24 24 48 8d 04 52 48 8d 04 82 49
[  421.471935] RSP: 0018:ffff9e5dc679fe80 EFLAGS: 00000202
[  421.472599] RAX: ffff8b1fd106bc40 RBX: 0000000000000002 RCX: 000000000000001f
[  421.473266] RDX: 00000062213d764d RSI: 000000003351fed6 RDI: 0000000000000000
[  421.473920] RBP: ffffbe51c1040000 R08: 0000000000000002 R09: 000000000002b480
[  421.474558] R10: 0000a82bea904be8 R11: ffff8b1fd106a984 R12: 00000062213d764d
[  421.475172] R13: ffffffff8c6c6d80 R14: 0000000000000002 R15: 0000000000000000
[  421.475763]  cpuidle_enter+0x29/0x40
[  421.476348]  do_idle+0x257/0x2a0
[  421.476926]  cpu_startup_entry+0x19/0x20
[  421.477497]  start_secondary+0x116/0x150
[  421.478067]  secondary_startup_64_no_verify+0xc2/0xcb
[  421.478640] Modules linked in: rdma_rxe(OE) ip6_udp_tunnel udp_tunnel xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 nft_compat nft_counter nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink tun bridge stp llc nls_utf8 isofs cdrom loop rfkill ib_isert iscsi_target_mod ib_srpt ext4 target_core_mod ib_srp scsi_transport_srp mbcache jbd2 rpcrdma sunrpc intel_rapl_msr intel_rapl_common rdma_ucm isst_if_common ib_iser ib_umad rdma_cm ib_ipoib iw_cm skx_edac libiscsi ib_cm nfit libnvdimm scsi_transport_iscsi x86_pkg_temp_thermal intel_powerclamp mlx5_ib coretemp crct10dif_pclmul crc32_pclmul iTCO_wdt iTCO_vendor_support ib_uverbs ghash_clmulni_intel rapl ipmi_ssif intel_cstate ib_core mei_me acpi_ipmi i2c_i801 joydev intel_uncore pcspkr mei i2c_smbus lpc_ich ioatdma ipmi_si intel_pch_thermal dca ipmi_devintf ipmi_msghandler acpi_pad acpi_power_meter ip_tables xfs libcrc32c sd_mod t10_pi sg mlx5_core ast i2c_algo_bit drm_vram_helper
[  421.478702]  drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm_ttm_helper ttm mlxfw ahci libahci pci_hyperv_intf ice drm i40e tls crc32c_intel libata psample wmi dm_mirror dm_region_hash dm_log dm_mod fuse [last unloaded: ip6_udp_tunnel]
[  421.483665] CR2: ffff9e5de298d180


> >
> > It seems that changing maximum values breaks backward compatibility.
> >
> > But without this commit, that is, 5.12-stable <-------> 5.14-rc3,
> > rping can work well.
>
> That is strange because all the large values do is initialize the pool
> with large values. Nothing else. So unless large values are used there
> should be no issues. Is it possible that the issue is with 5.14-rc3. Do
> things work between 5.12-stable systems. Anyways, please post the stack
> trace and also information on the setup and rping commands used.
>
> Shoaib
>
> >
> > Zhu Yanjun
> >> Jason


* Re: [PATCH v3 0/1] RDMA/rxe: Bump up default maximum values used via uverbs
  2021-07-29  7:57         ` Zhu Yanjun
@ 2021-07-29 19:33           ` Shoaib Rao
  2021-07-29 19:50             ` Jason Gunthorpe
  0 siblings, 1 reply; 23+ messages in thread
From: Shoaib Rao @ 2021-07-29 19:33 UTC (permalink / raw)
  To: Zhu Yanjun; +Cc: Jason Gunthorpe, RDMA mailing list

I switched the values back to the old ones and rebuilt the rdma_rxe module. I
still could not get rping to work: first I get CRC errors, and then one node
panics. Both nodes are running 5.14.0-rc1, so the issue you are seeing
is not caused by my changes; rxe is already broken in 5.14.0-rc1.

Jason,

Can we please accept my initial patch, where I bumped up the values of a
few parameters? We have extensively tested with those values. I will try
to resolve the CRC errors and the panic, and make changes to the other
tunables, later.

Regards,

Shoaib

[ 2105.071603] rdma_rxe: bad ICRC from 10.129.135.22
[ 2106.979538] rdma_rxe: bad ICRC from 10.129.135.22
[ 2109.155417] rdma_rxe: bad ICRC from 10.129.135.22
[ 2111.331292] rdma_rxe: bad ICRC from 10.129.135.22
[ 2113.507169] rdma_rxe: bad ICRC from 10.129.135.22
[ 2115.683046] rdma_rxe: bad ICRC from 10.129.135.22
[ 2117.858927] rdma_rxe: bad ICRC from 10.129.135.22
[ 2120.034798] rdma_rxe: bad ICRC from 10.129.135.22
[ 2122.210691] BUG: unable to handle page fault for address: ffffbd8562275180
[ 2122.292744] #PF: supervisor write access in kernel mode
[ 2122.355063] #PF: error_code(0x0002) - not-present page
[ 2122.416342] PGD 100000067 P4D 100000067 PUD 1001c7067 PMD 142a84067 PTE 0
[ 2122.497361] Oops: 0002 [#1] SMP PTI
[ 2122.538913] CPU: 8 PID: 0 Comm: swapper/8 Not tainted 5.14.0-rc1_rxe_values+ #4
[ 2122.626155] Hardware name: Oracle Corporation SUN FIRE X4170 M2 SERVER        /ASSY,MOTHERBOARD,X4170, BIOS 08140115 07/04/2018
[ 2122.763248] RIP: 0010:rxe_cq_post+0x9e/0x220 [rdma_rxe]
[ 2122.825578] Code: 44 8b 8b 48 01 00 00 4c 8b 47 08 8b 4f 28 49 8d b0 80 01 00 00 45 85 c9 0f 84 7d 01 00 00 8b 57 34 d3 e2 48 01 f2 49 8b 0c 24 <48> 89 0a 49 8b 4c 24 08 48 89 4a 08 49 8b 4c 24 10 48 89 4a 10 49
[ 2123.049907] RSP: 0018:ffffbd85464f0800 EFLAGS: 00010082
[ 2123.112225] RAX: 0000000000000246 RBX: ffff9dad4fd3f800 RCX: 0000000000000000
[ 2123.197389] RDX: ffffbd8562275180 RSI: ffffbd8546273180 RDI: ffff9da2c3cee840
[ 2123.282555] RBP: ffffbd85464f0840 R08: ffffbd8546273000 R09: 0000000000000001
[ 2123.367722] R10: 0000000000000001 R11: 0000000000000001 R12: ffffbd85464f08b8
[ 2123.452886] R13: 0000000000000000 R14: ffff9dad4fd3f940 R15: ffff9da3002ac008
[ 2123.538053] FS:  0000000000000000(0000) GS:ffff9db94fa80000(0000) knlGS:0000000000000000
[ 2123.634642] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2123.703190] CR2: ffffbd8562275180 CR3: 00000013f340a005 CR4: 00000000000206e0
[ 2123.788357] Call Trace:
[ 2123.817443]  <IRQ>
[ 2123.841335]  rxe_responder+0x5d9/0x2490 [rdma_rxe]
[ 2123.898467]  ? native_apic_mem_write+0x10/0x10
[ 2123.951445]  ? native_apic_wait_icr_idle+0x22/0x30
[ 2124.008575]  ? arch_irq_work_raise+0x3a/0x40
[ 2124.059476]  ? __irq_work_queue_local+0x48/0x60
[ 2124.113486]  ? fib_table_lookup+0x21e/0x640
[ 2124.163348]  ? wake_up_klogd.part.31+0x34/0x40
[ 2124.216319]  rxe_do_task+0x94/0x110 [rdma_rxe]
[ 2124.269297]  rxe_run_task+0x2a/0x40 [rdma_rxe]
[ 2124.322275]  rxe_resp_queue_pkt+0x44/0x50 [rdma_rxe]
[ 2124.381485]  rxe_rcv+0x2eb/0x900 [rdma_rxe]
[ 2124.431347]  rxe_udp_encap_recv+0x6d/0xd0 [rdma_rxe]
[ 2124.490555]  ? rxe_enable_task+0x10/0x10 [rdma_rxe]
[ 2124.548725]  udp_queue_rcv_one_skb+0x1f2/0x500
[ 2124.601697]  udp_queue_rcv_skb+0x50/0x210
[ 2124.649475]  udp_unicast_rcv_skb.isra.67+0x78/0x90
[ 2124.706600]  __udp4_lib_rcv+0x57c/0xbe0
[ 2124.752303]  udp_rcv+0x1a/0x20

On 7/29/21 12:57 AM, Zhu Yanjun wrote:
> On Thu, Jul 29, 2021 at 2:52 PM Shoaib Rao <rao.shoaib@oracle.com> wrote:
>>
>> On 7/28/21 11:42 PM, Zhu Yanjun wrote:
>>> On Wed, Jul 28, 2021 at 1:42 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>>>> On Tue, Jul 27, 2021 at 09:15:45AM -0700, Shoaib Rao wrote:
>>>>> Hi Jason et al,
>>>>>
>>>>> Can I please get an up or down comment on my patch?
>>>> Bob and Zhu should check it
>>> In my daily tests, I found that one host 5.12-stable, the other host
>>> is 5.14.-rc3 + this commit.
>>> rping can not work. Sometimes crash will occur.
>> Can you paste the stack?
> [...]
>
>
>>> It seems that changing maximum values breaks backward compatibility.
>>>
>>> But without this commit, that is, 5.12-stable <-------> 5.14-rc3,
>>> rping can work well.
>> That is strange because all the large values do is initialize the pool
>> with large values. Nothing else. So unless large values are used there
>> should be no issues. Is it possible that the issue is with 5.14-rc3. Do
>> things work between 5.12-stable systems. Anyways, please post the stack
>> trace and also information on the setup and rping commands used.
>>
>> Shoaib
>>
>>> Zhu Yanjun
>>>> Jason


* Re: [PATCH v3 0/1] RDMA/rxe: Bump up default maximum values used via uverbs
  2021-07-29 19:33           ` Shoaib Rao
@ 2021-07-29 19:50             ` Jason Gunthorpe
  2021-07-29 20:33               ` Shoaib Rao
  2021-07-29 23:08               ` Pearson, Robert B
  0 siblings, 2 replies; 23+ messages in thread
From: Jason Gunthorpe @ 2021-07-29 19:50 UTC (permalink / raw)
  To: Shoaib Rao; +Cc: Zhu Yanjun, RDMA mailing list

On Thu, Jul 29, 2021 at 12:33:14PM -0700, Shoaib Rao wrote:

> Can we please accept my initial patch where I bumped up the values of a few
> parameters. We have extensively tested with those values. I will try to
> resolve CRC errors and panic and make changes to other tuneables later?

I think Bob posted something for the icrc issues already

Please try to work in a sane fashion, rxe shouldn't be left broken
with so many people apparently interested in it??

Jason


* Re: [PATCH v3 0/1] RDMA/rxe: Bump up default maximum values used via uverbs
  2021-07-29 19:50             ` Jason Gunthorpe
@ 2021-07-29 20:33               ` Shoaib Rao
  2021-07-29 23:08               ` Pearson, Robert B
  1 sibling, 0 replies; 23+ messages in thread
From: Shoaib Rao @ 2021-07-29 20:33 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Zhu Yanjun, RDMA mailing list


On 7/29/21 12:50 PM, Jason Gunthorpe wrote:
> On Thu, Jul 29, 2021 at 12:33:14PM -0700, Shoaib Rao wrote:
>
>> Can we please accept my initial patch where I bumped up the values of a few
>> parameters. We have extensively tested with those values. I will try to
>> resolve CRC errors and panic and make changes to other tuneables later?
> I think Bob posted something for the icrc issues already
>
> Please try to work in a sane fashion, rxe shouldn't be left broken
> with so many people apparently interested in it??

I agree with what you are saying and I plan to help address the issue.

For the record, I just tested my changes on a 5.13.0-rc6 kernel and they work.

[root@ca-dev14 ~]# rping -s  -a 10.129.135.23
server DISCONNECT EVENT...
wait for RDMA_READ_ADV state 10
[root@ca-dev14 ~]# uname -a
Linux ca-dev14.us.oracle.com 5.13.0-rc6_rxe_defaults+ #1 SMP Thu Jul 29 12:48:41 PDT 2021 x86_64 x86_64 x86_64 GNU/Linux

[root@ca-dev13 ~]# rping -c -a 10.129.135.23 -C 10 -v
ping data: rdma-ping-0: ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqr
ping data: rdma-ping-1: BCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrs
ping data: rdma-ping-2: CDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrst
ping data: rdma-ping-3: DEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstu
ping data: rdma-ping-4: EFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuv
ping data: rdma-ping-5: FGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvw
ping data: rdma-ping-6: GHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwx
ping data: rdma-ping-7: HIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxy
ping data: rdma-ping-8: IJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz
ping data: rdma-ping-9: JKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzA
[root@ca-dev13 ~]# uname -a
Linux ca-dev13.us.oracle.com 5.13.0-rc6_rxe_defaults+ #1 SMP Thu Jul 29 12:48:41 PDT 2021 x86_64 x86_64 x86_64 GNU/Linux

Shoaib

>
> Jason


* RE: [PATCH v3 0/1] RDMA/rxe: Bump up default maximum values used via uverbs
  2021-07-29 19:50             ` Jason Gunthorpe
  2021-07-29 20:33               ` Shoaib Rao
@ 2021-07-29 23:08               ` Pearson, Robert B
  2021-07-30  0:34                 ` Shoaib Rao
  2021-08-05  6:11                 ` Shoaib Rao
  1 sibling, 2 replies; 23+ messages in thread
From: Pearson, Robert B @ 2021-07-29 23:08 UTC (permalink / raw)
  To: Jason Gunthorpe, Shoaib Rao; +Cc: Zhu Yanjun, RDMA mailing list

I found another rxe bug (for SRQ) and sent three bug fixes in a set including the one you mention. They should all be applied.

-----Original Message-----
From: Jason Gunthorpe <jgg@ziepe.ca> 
Sent: Thursday, July 29, 2021 2:51 PM
To: Shoaib Rao <rao.shoaib@oracle.com>
Cc: Zhu Yanjun <zyjzyj2000@gmail.com>; RDMA mailing list <linux-rdma@vger.kernel.org>
Subject: Re: [PATCH v3 0/1] RDMA/rxe: Bump up default maximum values used via uverbs

On Thu, Jul 29, 2021 at 12:33:14PM -0700, Shoaib Rao wrote:

> Can we please accept my initial patch where I bumped up the values of 
> a few parameters. We have extensively tested with those values. I will 
> try to resolve CRC errors and panic and make changes to other tuneables later?

I think Bob posted something for the icrc issues already

Please try to work in a sane fashion, rxe shouldn't be left broken with so many people apparently interested in it??

Jason


* Re: [PATCH v3 0/1] RDMA/rxe: Bump up default maximum values used via uverbs
  2021-07-29 23:08               ` Pearson, Robert B
@ 2021-07-30  0:34                 ` Shoaib Rao
  2021-08-03 23:53                   ` Shoaib Rao
  2021-08-05  6:11                 ` Shoaib Rao
  1 sibling, 1 reply; 23+ messages in thread
From: Shoaib Rao @ 2021-07-30  0:34 UTC (permalink / raw)
  To: Pearson, Robert B, Jason Gunthorpe; +Cc: Zhu Yanjun, RDMA mailing list

Thanks Bob.

Zhu, can you please apply those patches and test?

Shoaib

On 7/29/21 4:08 PM, Pearson, Robert B wrote:
> I found another rxe bug (for SRQ) and sent three bug fixes in a set including the one you mention. They should all be applied.
>
> -----Original Message-----
> From: Jason Gunthorpe <jgg@ziepe.ca>
> Sent: Thursday, July 29, 2021 2:51 PM
> To: Shoaib Rao <rao.shoaib@oracle.com>
> Cc: Zhu Yanjun <zyjzyj2000@gmail.com>; RDMA mailing list <linux-rdma@vger.kernel.org>
> Subject: Re: [PATCH v3 0/1] RDMA/rxe: Bump up default maximum values used via uverbs
>
> On Thu, Jul 29, 2021 at 12:33:14PM -0700, Shoaib Rao wrote:
>
>> Can we please accept my initial patch where I bumped up the values of
>> a few parameters. We have extensively tested with those values. I will
>> try to resolve CRC errors and panic and make changes to other tuneables later?
> I think Bob posted something for the icrc issues already
>
> Please try to work in a sane fashion, rxe shouldn't be left broken with so many people apparently interested in it??
>
> Jason

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 0/1] RDMA/rxe: Bump up default maximum values used via uverbs
  2021-07-30  0:34                 ` Shoaib Rao
@ 2021-08-03 23:53                   ` Shoaib Rao
  2021-08-04  0:51                     ` Zhu Yanjun
  0 siblings, 1 reply; 23+ messages in thread
From: Shoaib Rao @ 2021-08-03 23:53 UTC (permalink / raw)
  To: Zhu Yanjun; +Cc: RDMA mailing list

Hi Zhu,

Any update on your testing after applying Bob's fixes?

Shoaib

On 7/29/21 5:34 PM, Shoaib Rao wrote:
> Thanks Bob.
>
> Zhu can you please apply those patches and test.
>
> Shoaib
>
> On 7/29/21 4:08 PM, Pearson, Robert B wrote:
>> I found another rxe bug (for SRQ) and sent three bug fixes in a set 
>> including the one you mention. They should all be applied.
>>
>> -----Original Message-----
>> From: Jason Gunthorpe <jgg@ziepe.ca>
>> Sent: Thursday, July 29, 2021 2:51 PM
>> To: Shoaib Rao <rao.shoaib@oracle.com>
>> Cc: Zhu Yanjun <zyjzyj2000@gmail.com>; RDMA mailing list 
>> <linux-rdma@vger.kernel.org>
>> Subject: Re: [PATCH v3 0/1] RDMA/rxe: Bump up default maximum values 
>> used via uverbs
>>
>> On Thu, Jul 29, 2021 at 12:33:14PM -0700, Shoaib Rao wrote:
>>
>>> Can we please accept my initial patch where I bumped up the values of
>>> a few parameters. We have extensively tested with those values. I will
>>> try to resolve CRC errors and panic and make changes to other 
>>> tuneables later?
>> I think Bob posted something for the icrc issues already
>>
>> Please try to work in a sane fashion, rxe shouldn't be left broken 
>> with so many people apparently interested in it??
>>
>> Jason

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 0/1] RDMA/rxe: Bump up default maximum values used via uverbs
  2021-08-03 23:53                   ` Shoaib Rao
@ 2021-08-04  0:51                     ` Zhu Yanjun
  2021-08-04  1:51                       ` Shoaib Rao
  0 siblings, 1 reply; 23+ messages in thread
From: Zhu Yanjun @ 2021-08-04  0:51 UTC (permalink / raw)
  To: Shoaib Rao; +Cc: RDMA mailing list

On Wed, Aug 4, 2021 at 7:53 AM Shoaib Rao <rao.shoaib@oracle.com> wrote:
>
> Hi Zhu,
>
> Any update on your testing after applying Bob's fixes

Did you read my problem report carefully?
Before your commit, rxe works well.
After your commit, rxe does not work well.
Please reproduce this problem on your host and fix it.

Zhu Yanjun

>
> Shoaib
>
> On 7/29/21 5:34 PM, Shoaib Rao wrote:
> > Thanks Bob.
> >
> > Zhu can you please apply those patches and test.
> >
> > Shoaib
> >
> > On 7/29/21 4:08 PM, Pearson, Robert B wrote:
> >> I found another rxe bug (for SRQ) and sent three bug fixes in a set
> >> including the one you mention. They should all be applied.
> >>
> >> -----Original Message-----
> >> From: Jason Gunthorpe <jgg@ziepe.ca>
> >> Sent: Thursday, July 29, 2021 2:51 PM
> >> To: Shoaib Rao <rao.shoaib@oracle.com>
> >> Cc: Zhu Yanjun <zyjzyj2000@gmail.com>; RDMA mailing list
> >> <linux-rdma@vger.kernel.org>
> >> Subject: Re: [PATCH v3 0/1] RDMA/rxe: Bump up default maximum values
> >> used via uverbs
> >>
> >> On Thu, Jul 29, 2021 at 12:33:14PM -0700, Shoaib Rao wrote:
> >>
> >>> Can we please accept my initial patch where I bumped up the values of
> >>> a few parameters. We have extensively tested with those values. I will
> >>> try to resolve CRC errors and panic and make changes to other
> >>> tuneables later?
> >> I think Bob posted something for the icrc issues already
> >>
> >> Please try to work in a sane fashion, rxe shouldn't be left broken
> >> with so many people apparently interested in it??
> >>
> >> Jason

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 0/1] RDMA/rxe: Bump up default maximum values used via uverbs
  2021-08-04  0:51                     ` Zhu Yanjun
@ 2021-08-04  1:51                       ` Shoaib Rao
  2021-08-04  2:21                         ` Zhu Yanjun
  0 siblings, 1 reply; 23+ messages in thread
From: Shoaib Rao @ 2021-08-04  1:51 UTC (permalink / raw)
  To: Zhu Yanjun; +Cc: RDMA mailing list

[-- Attachment #1: Type: text/plain, Size: 2302 bytes --]


On 8/3/21 5:51 PM, Zhu Yanjun wrote:
> On Wed, Aug 4, 2021 at 7:53 AM Shoaib Rao <rao.shoaib@oracle.com> wrote:
>> Hi Zhu,
>>
>> Any update on your testing after applying Bob's fixes
> Do you read my problem carefully?
> I mean that before your commit, the whole rxe can work well.
> After your commit, the rxe can not work well.
> Please reproduce this problem in your host and fix it.
>
> Zhu Yanjun

You posted

> In my daily tests, I found that one host 5.12-stable, the other host
> is 5.14.-rc3 + this commit.
> rping can not work. Sometimes crash will occur.
>
> It seems that changing maximum values breaks backward compatibility.
>
> But without this commit, that is, 5.12-stable <-------> 5.14-rc3,
> rping can work well.
>
> Zhu Yanjun
I am not sure how you got rxe to work, because it did not work for me
or for Bob. Since then, Bob has posted patches for the issue. I also
posted that my changes work on a 5.13.6 kernel. Emails attached.

Even if rxe in 5.14 is somehow working for you, please apply Bob's
patches and then mine, and test.

Thanks,

Shoaib


>
>> Shoaib
>>
>> On 7/29/21 5:34 PM, Shoaib Rao wrote:
>>> Thanks Bob.
>>>
>>> Zhu can you please apply those patches and test.
>>>
>>> Shoaib
>>>
>>> On 7/29/21 4:08 PM, Pearson, Robert B wrote:
>>>> I found another rxe bug (for SRQ) and sent three bug fixes in a set
>>>> including the one you mention. They should all be applied.
>>>>
>>>> -----Original Message-----
>>>> From: Jason Gunthorpe <jgg@ziepe.ca>
>>>> Sent: Thursday, July 29, 2021 2:51 PM
>>>> To: Shoaib Rao <rao.shoaib@oracle.com>
>>>> Cc: Zhu Yanjun <zyjzyj2000@gmail.com>; RDMA mailing list
>>>> <linux-rdma@vger.kernel.org>
>>>> Subject: Re: [PATCH v3 0/1] RDMA/rxe: Bump up default maximum values
>>>> used via uverbs
>>>>
>>>> On Thu, Jul 29, 2021 at 12:33:14PM -0700, Shoaib Rao wrote:
>>>>
>>>>> Can we please accept my initial patch where I bumped up the values of
>>>>> a few parameters. We have extensively tested with those values. I will
>>>>> try to resolve CRC errors and panic and make changes to other
>>>>> tuneables later?
>>>> I think Bob posted something for the icrc issues already
>>>>
>>>> Please try to work in a sane fashion, rxe shouldn't be left broken
>>>> with so many people apparently interested in it??
>>>>
>>>> Jason

[-- Attachment #2: [PATCH v3 0/1] RDMA/rxe: Bump up default maximum values used via uverbs.eml --]
[-- Type: message/rfc822, Size: 11406 bytes --]

From: Shoaib Rao <rao.shoaib@oracle.com>
To: Zhu Yanjun <zyjzyj2000@gmail.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>, RDMA mailing list <linux-rdma@vger.kernel.org>
Subject: Re: [PATCH v3 0/1] RDMA/rxe: Bump up default maximum values used via uverbs
Date: Thu, 29 Jul 2021 12:33:14 -0700
Message-ID: <eb24b781-396f-5bb9-89c7-3ca0f8b83849@oracle.com>

I switched the values to the old values and compiled rdma_rxe module. I 
could not get rping to work. First I get CRC errors and then one node 
panics. Both nodes are running 5.14.0-rc1. So the issue you are seeing 
is not caused by my changes, rxe is already broken in 5.14.0-rc1.

Jason,

Can we please accept my initial patch, where I bumped up the values of a
few parameters? We have extensively tested with those values. I will try
to resolve the CRC errors and the panic and make changes to other
tunables later.

Regards,

Shoaib

[ 2105.071603] rdma_rxe: bad ICRC from 10.129.135.22
[ 2106.979538] rdma_rxe: bad ICRC from 10.129.135.22
[ 2109.155417] rdma_rxe: bad ICRC from 10.129.135.22
[ 2111.331292] rdma_rxe: bad ICRC from 10.129.135.22
[ 2113.507169] rdma_rxe: bad ICRC from 10.129.135.22
[ 2115.683046] rdma_rxe: bad ICRC from 10.129.135.22
[ 2117.858927] rdma_rxe: bad ICRC from 10.129.135.22
[ 2120.034798] rdma_rxe: bad ICRC from 10.129.135.22
[ 2122.210691] BUG: unable to handle page fault for address: ffffbd8562275180
[ 2122.292744] #PF: supervisor write access in kernel mode
[ 2122.355063] #PF: error_code(0x0002) - not-present page
[ 2122.416342] PGD 100000067 P4D 100000067 PUD 1001c7067 PMD 142a84067 PTE 0
[ 2122.497361] Oops: 0002 [#1] SMP PTI
[ 2122.538913] CPU: 8 PID: 0 Comm: swapper/8 Not tainted 5.14.0-rc1_rxe_values+ #4
[ 2122.626155] Hardware name: Oracle Corporation SUN FIRE X4170 M2 SERVER        /ASSY,MOTHERBOARD,X4170, BIOS 08140115 07/04/2018
[ 2122.763248] RIP: 0010:rxe_cq_post+0x9e/0x220 [rdma_rxe]
[ 2122.825578] Code: 44 8b 8b 48 01 00 00 4c 8b 47 08 8b 4f 28 49 8d b0 80 01 00 00 45 85 c9 0f 84 7d 01 00 00 8b 57 34 d3 e2 48 01 f2 49 8b 0c 24 <48> 89 0a 49 8b 4c 24 08 48 89 4a 08 49 8b 4c 24 10 48 89 4a 10 49
[ 2123.049907] RSP: 0018:ffffbd85464f0800 EFLAGS: 00010082
[ 2123.112225] RAX: 0000000000000246 RBX: ffff9dad4fd3f800 RCX: 0000000000000000
[ 2123.197389] RDX: ffffbd8562275180 RSI: ffffbd8546273180 RDI: ffff9da2c3cee840
[ 2123.282555] RBP: ffffbd85464f0840 R08: ffffbd8546273000 R09: 0000000000000001
[ 2123.367722] R10: 0000000000000001 R11: 0000000000000001 R12: ffffbd85464f08b8
[ 2123.452886] R13: 0000000000000000 R14: ffff9dad4fd3f940 R15: ffff9da3002ac008
[ 2123.538053] FS:  0000000000000000(0000) GS:ffff9db94fa80000(0000) knlGS:0000000000000000
[ 2123.634642] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2123.703190] CR2: ffffbd8562275180 CR3: 00000013f340a005 CR4: 00000000000206e0
[ 2123.788357] Call Trace:
[ 2123.817443]  <IRQ>
[ 2123.841335]  rxe_responder+0x5d9/0x2490 [rdma_rxe]
[ 2123.898467]  ? native_apic_mem_write+0x10/0x10
[ 2123.951445]  ? native_apic_wait_icr_idle+0x22/0x30
[ 2124.008575]  ? arch_irq_work_raise+0x3a/0x40
[ 2124.059476]  ? __irq_work_queue_local+0x48/0x60
[ 2124.113486]  ? fib_table_lookup+0x21e/0x640
[ 2124.163348]  ? wake_up_klogd.part.31+0x34/0x40
[ 2124.216319]  rxe_do_task+0x94/0x110 [rdma_rxe]
[ 2124.269297]  rxe_run_task+0x2a/0x40 [rdma_rxe]
[ 2124.322275]  rxe_resp_queue_pkt+0x44/0x50 [rdma_rxe]
[ 2124.381485]  rxe_rcv+0x2eb/0x900 [rdma_rxe]
[ 2124.431347]  rxe_udp_encap_recv+0x6d/0xd0 [rdma_rxe]
[ 2124.490555]  ? rxe_enable_task+0x10/0x10 [rdma_rxe]
[ 2124.548725]  udp_queue_rcv_one_skb+0x1f2/0x500
[ 2124.601697]  udp_queue_rcv_skb+0x50/0x210
[ 2124.649475]  udp_unicast_rcv_skb.isra.67+0x78/0x90
[ 2124.706600]  __udp4_lib_rcv+0x57c/0xbe0
[ 2124.752303]  udp_rcv+0x1a/0x20

On 7/29/21 12:57 AM, Zhu Yanjun wrote:
> On Thu, Jul 29, 2021 at 2:52 PM Shoaib Rao <rao.shoaib@oracle.com> wrote:
>>
>> On 7/28/21 11:42 PM, Zhu Yanjun wrote:
>>> On Wed, Jul 28, 2021 at 1:42 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>>>> On Tue, Jul 27, 2021 at 09:15:45AM -0700, Shoaib Rao wrote:
>>>>> Hi Jason et al,
>>>>>
>>>>> Can I please get an up or down comment on my patch?
>>>> Bob and Zhu should check it
>>> In my daily tests, I found that one host 5.12-stable, the other host
>>> is 5.14.-rc3 + this commit.
>>> rping can not work. Sometimes crash will occur.
>> Can you paste the stack?
> [  381.068203] rdma_rxe: qp#17 moved to error state
> [  421.464485] BUG: unable to handle page fault for address: ffff9e5de298d180
> [  421.464515] #PF: supervisor write access in kernel mode
> [  421.464532] #PF: error_code(0x0002) - not-present page
> [  421.464549] PGD 100c00067 P4D 100c00067 PUD 100dc1067 PMD 125e78067 PTE 0
> [  421.464572] Oops: 0002 [#1] SMP PTI
> [  421.464585] CPU: 25 PID: 0 Comm: swapper/25 Kdump: loaded Tainted: G S      W  OE     5.13.1-rxe+ #17
> [  421.464613] Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.02.01.0008.031920191559 03/19/2019
> [  421.464642] RIP: 0010:rxe_cq_post+0x98/0x210 [rdma_rxe]
> [  421.464667] Code: 8b b3 48 01 00 00 4d 8b 48 08 41 8b 48 28 49 8d b9 80 01 00 00 85 f6 0f 84 78 01 00 00 41 8b 50 34 d3 e2 48 01 fa 48 8b 4d 00 <48> 89 0a 48 8b 4d 08 48 89 4a 08 48 8b 4d 10 48 89 4a 10 48 8b 4d
> [  421.464718] RSP: 0018:ffff9e5dc6ce0918 EFLAGS: 00010082
> [  421.464735] RAX: 0000000000000246 RBX: ffff8b200cabd800 RCX: 0000000000000000
> [  421.464756] RDX: ffff9e5de298d180 RSI: 0000000000000001 RDI: ffff9e5dc698b180
> [  421.464777] RBP: ffff9e5dc6ce09c0 R08: ffff8b2014d85a80 R09: ffff9e5dc698b000
> [  421.464797] R10: ffffffff8bc90940 R11: 0000000000000001 R12: 0000000000000000
> [  421.464817] R13: ffff8b200cabd940 R14: ffff8b206e014008 R15: 000000000000001a
> [  421.464838] FS:  0000000000000000(0000) GS:ffff8b1fd1040000(0000) knlGS:0000000000000000
> [  421.464861] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  421.464879] CR2: ffff9e5de298d180 CR3: 0000000c4df4e006 CR4: 00000000007706e0
> [  421.464899] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  421.464920] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  421.464941] PKRU: 55555554
> [  421.464950] Call Trace:
> [  421.464961]  <IRQ>
> [  421.464971]  rxe_responder+0x621/0x2480 [rdma_rxe]
> [  421.464993]  ? __fib_validate_source+0x2e9/0x450
> [  421.465013]  rxe_do_task+0x89/0x100 [rdma_rxe]
> [  421.465033]  rxe_rcv+0x2eb/0x900 [rdma_rxe]
> [  421.465050]  ? __udp4_lib_lookup+0x2c8/0x440
> [  421.465065]  rxe_udp_encap_recv+0x68/0xc0 [rdma_rxe]
> [  421.465085]  ? rxe_enable_task+0x10/0x10 [rdma_rxe]
> [  421.465104]  udp_queue_rcv_one_skb+0x1df/0x4e0
> [  421.465120]  udp_unicast_rcv_skb.isra.67+0x74/0x90
> [  421.465135]  __udp4_lib_rcv+0x555/0xb90
> [  421.465150]  ? nf_ct_deliver_cached_events+0xc1/0x120 [nf_conntrack]
> [  421.465181]  ip_protocol_deliver_rcu+0xe8/0x1b0
> [  421.465199]  ip_local_deliver_finish+0x44/0x50
> [  421.465215]  ip_local_deliver+0xf1/0x100
> [  421.465229]  ? coalesce_fill_reply+0x2c1/0x480
> [  421.465249]  ? ip_protocol_deliver_rcu+0x1b0/0x1b0
> [  421.465265]  ip_sublist_rcv_finish+0x75/0x80
> [  421.465281]  ip_sublist_rcv+0x196/0x220
> [  421.465296]  ? ip_local_deliver+0x100/0x100
> [  421.465312]  ip_list_rcv+0x137/0x160
> [  421.465325]  __netif_receive_skb_list_core+0x29b/0x2c0
> [  421.465344]  netif_receive_skb_list_internal+0x1c3/0x2f0
> [  421.465361]  gro_normal_list.part.158+0x19/0x40
> [  421.465376]  napi_complete_done+0x67/0x160
> [  421.465391]  i40e_napi_poll+0x53b/0x840 [i40e]
> [  421.465426]  __napi_poll+0x2b/0x120
> [  421.466123]  net_rx_action+0x236/0x300
> [  421.466783]  __do_softirq+0xc9/0x285
> [  421.467440]  irq_exit_rcu+0xba/0xd0
> [  421.468091]  common_interrupt+0x7f/0xa0
> [  421.468737]  </IRQ>
> [  421.469366]  asm_common_interrupt+0x1e/0x40
> [  421.469990] RIP: 0010:cpuidle_enter_state+0xd6/0x350
> [  421.470608] Code: 49 89 c4 0f 1f 44 00 00 31 ff e8 45 49 99 ff 45 84 ff 74 12 9c 58 f6 c4 02 0f 85 32 02 00 00 31 ff e8 ae c8 9f ff fb 45 85 f6 <0f> 88 e0 00 00 00 49 63 d6 4c 2b 24 24 48 8d 04 52 48 8d 04 82 49
> [  421.471935] RSP: 0018:ffff9e5dc679fe80 EFLAGS: 00000202
> [  421.472599] RAX: ffff8b1fd106bc40 RBX: 0000000000000002 RCX: 000000000000001f
> [  421.473266] RDX: 00000062213d764d RSI: 000000003351fed6 RDI: 0000000000000000
> [  421.473920] RBP: ffffbe51c1040000 R08: 0000000000000002 R09: 000000000002b480
> [  421.474558] R10: 0000a82bea904be8 R11: ffff8b1fd106a984 R12: 00000062213d764d
> [  421.475172] R13: ffffffff8c6c6d80 R14: 0000000000000002 R15: 0000000000000000
> [  421.475763]  cpuidle_enter+0x29/0x40
> [  421.476348]  do_idle+0x257/0x2a0
> [  421.476926]  cpu_startup_entry+0x19/0x20
> [  421.477497]  start_secondary+0x116/0x150
> [  421.478067]  secondary_startup_64_no_verify+0xc2/0xcb
> [  421.478640] Modules linked in: rdma_rxe(OE) ip6_udp_tunnel udp_tunnel xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 nft_compat nft_counter nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink tun bridge stp llc nls_utf8 isofs cdrom loop rfkill ib_isert iscsi_target_mod ib_srpt ext4 target_core_mod ib_srp scsi_transport_srp mbcache jbd2 rpcrdma sunrpc intel_rapl_msr intel_rapl_common rdma_ucm isst_if_common ib_iser ib_umad rdma_cm ib_ipoib iw_cm skx_edac libiscsi ib_cm nfit libnvdimm scsi_transport_iscsi x86_pkg_temp_thermal intel_powerclamp mlx5_ib coretemp crct10dif_pclmul crc32_pclmul iTCO_wdt iTCO_vendor_support ib_uverbs ghash_clmulni_intel rapl ipmi_ssif intel_cstate ib_core mei_me acpi_ipmi i2c_i801 joydev intel_uncore pcspkr mei i2c_smbus lpc_ich ioatdma ipmi_si intel_pch_thermal dca ipmi_devintf ipmi_msghandler acpi_pad acpi_power_meter ip_tables xfs libcrc32c sd_mod t10_pi sg mlx5_core ast i2c_algo_bit drm_vram_helper
> [  421.478702]  drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm_ttm_helper ttm mlxfw ahci libahci pci_hyperv_intf ice drm i40e tls crc32c_intel libata psample wmi dm_mirror dm_region_hash dm_log dm_mod fuse [last unloaded: ip6_udp_tunnel]
> [  421.483665] CR2: ffff9e5de298d180
>
>
>>> It seems that changing maximum values breaks backward compatibility.
>>>
>>> But without this commit, that is, 5.12-stable <-------> 5.14-rc3,
>>> rping can work well.
>> That is strange because all the large values do is initialize the pool
>> with large values. Nothing else. So unless large values are used there
>> should be no issues. Is it possible that the issue is with 5.14-rc3. Do
>> things work between 5.12-stable systems. Anyways, please post the stack
>> trace and also information on the setup and rping commands used.
>>
>> Shoaib
>>
>>> Zhu Yanjun
>>>> Jason

[-- Attachment #3: [PATCH v3 0/1] RDMA/rxe: Bump up default maximum values used via uverbs.eml --]
[-- Type: message/rfc822, Size: 2939 bytes --]

From: Shoaib Rao <rao.shoaib@oracle.com>
To: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Zhu Yanjun <zyjzyj2000@gmail.com>, RDMA mailing list <linux-rdma@vger.kernel.org>
Subject: Re: [PATCH v3 0/1] RDMA/rxe: Bump up default maximum values used via uverbs
Date: Thu, 29 Jul 2021 13:33:41 -0700
Message-ID: <6e99e37d-2476-21ab-0584-6f4b12982b9d@oracle.com>


On 7/29/21 12:50 PM, Jason Gunthorpe wrote:
> On Thu, Jul 29, 2021 at 12:33:14PM -0700, Shoaib Rao wrote:
>
>> Can we please accept my initial patch where I bumped up the values of a few
>> parameters. We have extensively tested with those values. I will try to
>> resolve CRC errors and panic and make changes to other tuneables later?
> I think Bob posted something for the icrc issues already
>
> Please try to work in a sane fashion, rxe shouldn't be left broken
> with so many people apparently interested in it??

I agree with what you are saying and I plan to help address the issue.

For the record, I just tested my changes on a 5.13.6 kernel and they work.

[root@ca-dev14 ~]# rping -s  -a 10.129.135.23
server DISCONNECT EVENT...
wait for RDMA_READ_ADV state 10
[root@ca-dev14 ~]# uname -a
Linux ca-dev14.us.oracle.com 5.13.0-rc6_rxe_defaults+ #1 SMP Thu Jul 29 12:48:41 PDT 2021 x86_64 x86_64 x86_64 GNU/Linux

[root@ca-dev13 ~]# rping -c -a 10.129.135.23 -C 10 -v
ping data: rdma-ping-0: ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqr
ping data: rdma-ping-1: BCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrs
ping data: rdma-ping-2: CDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrst
ping data: rdma-ping-3: DEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstu
ping data: rdma-ping-4: EFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuv
ping data: rdma-ping-5: FGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvw
ping data: rdma-ping-6: GHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwx
ping data: rdma-ping-7: HIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxy
ping data: rdma-ping-8: IJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz
ping data: rdma-ping-9: JKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzA
[root@ca-dev13 ~]# uname -a
Linux ca-dev13.us.oracle.com 5.13.0-rc6_rxe_defaults+ #1 SMP Thu Jul 29 12:48:41 PDT 2021 x86_64 x86_64 x86_64 GNU/Linux

Shoaib

>
> Jason

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 0/1] RDMA/rxe: Bump up default maximum values used via uverbs
  2021-08-04  1:51                       ` Shoaib Rao
@ 2021-08-04  2:21                         ` Zhu Yanjun
  2021-08-05  4:10                           ` Shoaib Rao
  0 siblings, 1 reply; 23+ messages in thread
From: Zhu Yanjun @ 2021-08-04  2:21 UTC (permalink / raw)
  To: Shoaib Rao; +Cc: RDMA mailing list

On Wed, Aug 4, 2021 at 10:03 AM Shoaib Rao <rao.shoaib@oracle.com> wrote:
>
>
> On 8/3/21 5:51 PM, Zhu Yanjun wrote:
> > On Wed, Aug 4, 2021 at 7:53 AM Shoaib Rao <rao.shoaib@oracle.com> wrote:
> >> Hi Zhu,
> >>
> >> Any update on your testing after applying Bob's fixes
> > Do you read my problem carefully?
> > I mean that before your commit, the whole rxe can work well.
> > After your commit, the rxe can not work well.
> > Please reproduce this problem in your host and fix it.
> >
> > Zhu Yanjun
>
> You posted
>
> > In my daily tests, I found that one host 5.12-stable, the other host
> > is 5.14.-rc3 + this commit.
> > rping can not work. Sometimes crash will occur.
> >
> > It seems that changing maximum values breaks backward compatibility.
> >
> > But without this commit, that is, 5.12-stable <-------> 5.14-rc3,
> > rping can work well.
> >
> > Zhu Yanjun
> I am not sure how you made rxe to work because it did not work for me
> and neither for Bob. Since then, Bob has posted patches for the issue. I
> also posted that my changes work on 5.13.6 kernel. emails attached.
>
> Even if rxe in 5.14 is working for you some how, please apply Bob's
> patches and then mine and test.

I have already applied this commit
https://patchwork.kernel.org/project/linux-rdma/patch/20210729220039.18549-3-rpearsonhpe@gmail.com/.

And with your commit, rxe can not work well.

Zhu Yanjun

>
> Thanks,
>
> Shoaib
>
>
> >
> >> Shoaib
> >>
> >> On 7/29/21 5:34 PM, Shoaib Rao wrote:
> >>> Thanks Bob.
> >>>
> >>> Zhu can you please apply those patches and test.
> >>>
> >>> Shoaib
> >>>
> >>> On 7/29/21 4:08 PM, Pearson, Robert B wrote:
> >>>> I found another rxe bug (for SRQ) and sent three bug fixes in a set
> >>>> including the one you mention. They should all be applied.
> >>>>
> >>>> -----Original Message-----
> >>>> From: Jason Gunthorpe <jgg@ziepe.ca>
> >>>> Sent: Thursday, July 29, 2021 2:51 PM
> >>>> To: Shoaib Rao <rao.shoaib@oracle.com>
> >>>> Cc: Zhu Yanjun <zyjzyj2000@gmail.com>; RDMA mailing list
> >>>> <linux-rdma@vger.kernel.org>
> >>>> Subject: Re: [PATCH v3 0/1] RDMA/rxe: Bump up default maximum values
> >>>> used via uverbs
> >>>>
> >>>> On Thu, Jul 29, 2021 at 12:33:14PM -0700, Shoaib Rao wrote:
> >>>>
> >>>>> Can we please accept my initial patch where I bumped up the values of
> >>>>> a few parameters. We have extensively tested with those values. I will
> >>>>> try to resolve CRC errors and panic and make changes to other
> >>>>> tuneables later?
> >>>> I think Bob posted something for the icrc issues already
> >>>>
> >>>> Please try to work in a sane fashion, rxe shouldn't be left broken
> >>>> with so many people apparently interested in it??
> >>>>
> >>>> Jason

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 0/1] RDMA/rxe: Bump up default maximum values used via uverbs
  2021-08-04  2:21                         ` Zhu Yanjun
@ 2021-08-05  4:10                           ` Shoaib Rao
  2021-08-05  6:56                             ` Leon Romanovsky
  0 siblings, 1 reply; 23+ messages in thread
From: Shoaib Rao @ 2021-08-05  4:10 UTC (permalink / raw)
  To: Zhu Yanjun; +Cc: RDMA mailing list


On 8/3/21 7:21 PM, Zhu Yanjun wrote:
> On Wed, Aug 4, 2021 at 10:03 AM Shoaib Rao <rao.shoaib@oracle.com> wrote:
>>
>> On 8/3/21 5:51 PM, Zhu Yanjun wrote:
>>> On Wed, Aug 4, 2021 at 7:53 AM Shoaib Rao <rao.shoaib@oracle.com> wrote:
>>>> Hi Zhu,
>>>>
>>>> Any update on your testing after applying Bob's fixes
>>> Do you read my problem carefully?
>>> I mean that before your commit, the whole rxe can work well.
>>> After your commit, the rxe can not work well.
>>> Please reproduce this problem in your host and fix it.
>>>
>>> Zhu Yanjun
>> You posted
>>
>>> In my daily tests, I found that one host 5.12-stable, the other host
>>> is 5.14.-rc3 + this commit.
>>> rping can not work. Sometimes crash will occur.
>>>
>>> It seems that changing maximum values breaks backward compatibility.
>>>
>>> But without this commit, that is, 5.12-stable <-------> 5.14-rc3,
>>> rping can work well.
>>>
>>> Zhu Yanjun
>> I am not sure how you made rxe to work because it did not work for me
>> and neither for Bob. Since then, Bob has posted patches for the issue. I
>> also posted that my changes work on 5.13.6 kernel. emails attached.
>>
>> Even if rxe in 5.14 is working for you some how, please apply Bob's
>> patches and then mine and test.
> I have already applied this commit
> https://patchwork.kernel.org/project/linux-rdma/patch/20210729220039.18549-3-rpearsonhpe@gmail.com/.
>
> And with your commit, rxe can not work well.
>
> Zhu Yanjun

I am not sure how anyone can claim that the code works without my
changes. Rxe in Linux 5.14-rc4 is broken due to the following change made
to rxe_cq_post(), which is guaranteed to cause a panic or corruption.

addr = producer_addr(cq->queue, QUEUE_TYPE_TO_CLIENT);

It should be

addr = producer_addr(cq->queue, QUEUE_TYPE_FROM_CLIENT);

The following function also seems wrong

> static inline void *producer_addr(struct rxe_queue *q, enum queue_type 
> type)
> {
>         u32 prod;
>
>         switch (type) {
>         case QUEUE_TYPE_FROM_CLIENT:
>                 /* protect user space index */
>                 prod = smp_load_acquire(&q->buf->producer_index);
>                 prod &= q->index_mask;
>                 break;
>         case QUEUE_TYPE_TO_CLIENT:
>                 prod = q->index;
>                 break;
>         }
>
>         return q->buf->data + (prod << q->log2_elem_size);
> }
index should be returned as it is.

The code has changed again in v5.14-rc4-22-g251a1524293d, so now I have 
to try again.

Can we please make sure that the code is working after each patch is
applied? Otherwise it is a moving target.

BTW I liked the old code as it distinctly said what was being returned.

Shoaib
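
The address computation under discussion can be modeled in user space. The following is a minimal sketch, not the rxe implementation: `struct ring`, `ring_slot_offset`, `ring_data_size` and `ring_offset_in_bounds` are hypothetical names, and only the fields `index_mask` and `log2_elem_size` and the expression `prod << log2_elem_size` are taken from the function quoted above. It illustrates that the computed offset stays inside the buffer only when the producer index has been masked with `index_mask` before the shift.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified model of a power-of-two ring; not the kernel's struct rxe_queue. */
struct ring {
	uint32_t index_mask;     /* number of slots minus one */
	uint32_t log2_elem_size; /* log2 of the slot size in bytes */
};

/* Byte offset of a slot, mirroring data + (prod << log2_elem_size). */
static size_t ring_slot_offset(const struct ring *r, uint32_t prod)
{
	return (size_t)prod << r->log2_elem_size;
}

/* Total size of the data area in bytes. */
static size_t ring_data_size(const struct ring *r)
{
	return ((size_t)r->index_mask + 1) << r->log2_elem_size;
}

/* True if the slot selected by this index starts inside the data area. */
static bool ring_offset_in_bounds(const struct ring *r, uint32_t prod)
{
	return ring_slot_offset(r, prod) < ring_data_size(r);
}
```

With 8 slots of 64 bytes (`index_mask = 7`, `log2_elem_size = 6`), an unmasked index of 9 yields offset 576, which is past the 512-byte data area, while `9 & index_mask` is 1 and yields offset 64. Whichever queue type is passed, the index that reaches the shift therefore needs to be one that is already in range.
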


>> Thanks,
>>
>> Shoaib
>>
>>
>>>> Shoaib
>>>>
>>>> On 7/29/21 5:34 PM, Shoaib Rao wrote:
>>>>> Thanks Bob.
>>>>>
>>>>> Zhu can you please apply those patches and test.
>>>>>
>>>>> Shoaib
>>>>>
>>>>> On 7/29/21 4:08 PM, Pearson, Robert B wrote:
>>>>>> I found another rxe bug (for SRQ) and sent three bug fixes in a set
>>>>>> including the one you mention. They should all be applied.
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Jason Gunthorpe <jgg@ziepe.ca>
>>>>>> Sent: Thursday, July 29, 2021 2:51 PM
>>>>>> To: Shoaib Rao <rao.shoaib@oracle.com>
>>>>>> Cc: Zhu Yanjun <zyjzyj2000@gmail.com>; RDMA mailing list
>>>>>> <linux-rdma@vger.kernel.org>
>>>>>> Subject: Re: [PATCH v3 0/1] RDMA/rxe: Bump up default maximum values
>>>>>> used via uverbs
>>>>>>
>>>>>> On Thu, Jul 29, 2021 at 12:33:14PM -0700, Shoaib Rao wrote:
>>>>>>
>>>>>>> Can we please accept my initial patch where I bumped up the values of
>>>>>>> a few parameters. We have extensively tested with those values. I will
>>>>>>> try to resolve CRC errors and panic and make changes to other
>>>>>>> tuneables later?
>>>>>> I think Bob posted something for the icrc issues already
>>>>>>
>>>>>> Please try to work in a sane fashion, rxe shouldn't be left broken
>>>>>> with so many people apparently interested in it??
>>>>>>
>>>>>> Jason

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 0/1] RDMA/rxe: Bump up default maximum values used via uverbs
  2021-07-29 23:08               ` Pearson, Robert B
  2021-07-30  0:34                 ` Shoaib Rao
@ 2021-08-05  6:11                 ` Shoaib Rao
  2021-08-06 13:49                   ` Jason Gunthorpe
  1 sibling, 1 reply; 23+ messages in thread
From: Shoaib Rao @ 2021-08-05  6:11 UTC (permalink / raw)
  To: Pearson, Robert B, Jason Gunthorpe; +Cc: Zhu Yanjun, RDMA mailing list

Bob,

Your third patch has an issue.

In rxe_cq_post()


addr = producer_addr(cq->queue, QUEUE_TYPE_TO_CLIENT);

It should be

addr = producer_addr(cq->queue, QUEUE_TYPE_FROM_CLIENT);

After making this change, I have tested my patch and rping works.

Bob, can you please point me to the discussion which led to the current
changes, particularly the need for the user barrier?

Zhu, can you apply Bob's 3 patches + the change above + my patch and
report back? In my testing it works.

Regards,

Shoaib

On 7/29/21 4:08 PM, Pearson, Robert B wrote:
> I found another rxe bug (for SRQ) and sent three bug fixes in a set including the one you mention. They should all be applied.
>
> -----Original Message-----
> From: Jason Gunthorpe <jgg@ziepe.ca>
> Sent: Thursday, July 29, 2021 2:51 PM
> To: Shoaib Rao <rao.shoaib@oracle.com>
> Cc: Zhu Yanjun <zyjzyj2000@gmail.com>; RDMA mailing list <linux-rdma@vger.kernel.org>
> Subject: Re: [PATCH v3 0/1] RDMA/rxe: Bump up default maximum values used via uverbs
>
> On Thu, Jul 29, 2021 at 12:33:14PM -0700, Shoaib Rao wrote:
>
>> Can we please accept my initial patch where I bumped up the values of
>> a few parameters. We have extensively tested with those values. I will
>> try to resolve CRC errors and panic and make changes to other tuneables later?
> I think Bob posted something for the icrc issues already
>
> Please try to work in a sane fashion, rxe shouldn't be left broken with so many people apparently interested in it??
>
> Jason

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 0/1] RDMA/rxe: Bump up default maximum values used via uverbs
  2021-08-05  4:10                           ` Shoaib Rao
@ 2021-08-05  6:56                             ` Leon Romanovsky
  0 siblings, 0 replies; 23+ messages in thread
From: Leon Romanovsky @ 2021-08-05  6:56 UTC (permalink / raw)
  To: Shoaib Rao; +Cc: Zhu Yanjun, RDMA mailing list

On Wed, Aug 04, 2021 at 09:10:22PM -0700, Shoaib Rao wrote:
> 

<...>

> Can we please make sure that the code is working after the application of
> each patch or else it is a moving target.

We will stop accepting new features until RXE is stabilized again.
https://lore.kernel.org/linux-rdma/YQmF9506lsmeaOBZ@unreal

Thanks

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 0/1] RDMA/rxe: Bump up default maximum values used via uverbs
  2021-08-05  6:11                 ` Shoaib Rao
@ 2021-08-06 13:49                   ` Jason Gunthorpe
  2021-09-13  0:50                     ` Shoaib Rao
  0 siblings, 1 reply; 23+ messages in thread
From: Jason Gunthorpe @ 2021-08-06 13:49 UTC (permalink / raw)
  To: Shoaib Rao; +Cc: Pearson, Robert B, Zhu Yanjun, RDMA mailing list

On Wed, Aug 04, 2021 at 11:11:15PM -0700, Shoaib Rao wrote:
> Bob,
> 
> Your third patch has an issue.
> 
> In rxe_cq_post()
> 
> 
> addr = producer_addr(cq->queue, QUEUE_TYPE_TO_CLIENT);
> 
> It should be
> 
> addr = producer_addr(cq->queue, QUEUE_TYPE_FROM_CLIENT);
> 
> After making this change, I have tested my patch and rping works.
> 
> Bob can you please point me to the discussion which lead to the current
> changes, particularly the need for user barrier.
> 
> Zhu can you apply Bob's 3 patches + the change above + my patch and report
> back. In my testing it works.

I'll expect Bob to resend

	[for-next,v2,3/3] RDMA/rxe: Add memory barriers to kernel queues

Jason

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 0/1] RDMA/rxe: Bump up default maximum values used via uverbs
  2021-08-06 13:49                   ` Jason Gunthorpe
@ 2021-09-13  0:50                     ` Shoaib Rao
  2021-09-13  3:34                       ` Pearson, Robert B
  2021-09-14 16:14                       ` Bob Pearson
  0 siblings, 2 replies; 23+ messages in thread
From: Shoaib Rao @ 2021-09-13  0:50 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Pearson, Robert B, Zhu Yanjun, RDMA mailing list


On 8/6/21 6:49 AM, Jason Gunthorpe wrote:
> On Wed, Aug 04, 2021 at 11:11:15PM -0700, Shoaib Rao wrote:
>> Bob,
>>
>> Your third patch has an issue.
>>
>> In rxe_cq_post()
>>
>>
>> addr = producer_addr(cq->queue, QUEUE_TYPE_TO_CLIENT);
>>
>> It should be
>>
>> addr = producer_addr(cq->queue, QUEUE_TYPE_FROM_CLIENT);
>>
>> After making this change, I have tested my patch and rping works.
>>
>> Bob can you please point me to the discussion which lead to the current
>> changes, particularly the need for user barrier.
>>
>> Zhu can you apply Bob's 3 patches + the change above + my patch and report
>> back. In my testing it works.
> I'll expect Bob to resend
>
> 	[for-next,v2,3/3] RDMA/rxe: Add memory barriers to kernel queues
>
> Jason

I have not seen a reply to this email thread. Has the issue been 
resolved and I missed it?

Shoaib


^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [PATCH v3 0/1] RDMA/rxe: Bump up default maximum values used via uverbs
  2021-09-13  0:50                     ` Shoaib Rao
@ 2021-09-13  3:34                       ` Pearson, Robert B
  2021-09-14 16:14                       ` Bob Pearson
  1 sibling, 0 replies; 23+ messages in thread
From: Pearson, Robert B @ 2021-09-13  3:34 UTC (permalink / raw)
  To: Shoaib Rao, Jason Gunthorpe; +Cc: Zhu Yanjun, RDMA mailing list

Sorry. I totally missed this. I will look at it in the morning.

Bob

-----Original Message-----
From: Shoaib Rao <rao.shoaib@oracle.com> 
Sent: Sunday, September 12, 2021 7:50 PM
To: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Pearson, Robert B <robert.pearson2@hpe.com>; Zhu Yanjun <zyjzyj2000@gmail.com>; RDMA mailing list <linux-rdma@vger.kernel.org>
Subject: Re: [PATCH v3 0/1] RDMA/rxe: Bump up default maximum values used via uverbs


On 8/6/21 6:49 AM, Jason Gunthorpe wrote:
> On Wed, Aug 04, 2021 at 11:11:15PM -0700, Shoaib Rao wrote:
>> Bob,
>>
>> Your third patch has an issue.
>>
>> In rxe_cq_post()
>>
>>
>> addr = producer_addr(cq->queue, QUEUE_TYPE_TO_CLIENT);
>>
>> It should be
>>
>> addr = producer_addr(cq->queue, QUEUE_TYPE_FROM_CLIENT);
>>
>> After making this change, I have tested my patch and rping works.
>>
>> Bob can you please point me to the discussion which lead to the 
>> current changes, particularly the need for user barrier.
>>
>> Zhu can you apply Bob's 3 patches + the change above + my patch and 
>> report back. In my testing it works.
> I'll expect Bob to resend
>
> 	[for-next,v2,3/3] RDMA/rxe: Add memory barriers to kernel queues
>
> Jason

I have not seen a reply to this email thread. Has the issue been resolved and I missed it?

Shoaib


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v3 0/1] RDMA/rxe: Bump up default maximum values used via uverbs
  2021-09-13  0:50                     ` Shoaib Rao
  2021-09-13  3:34                       ` Pearson, Robert B
@ 2021-09-14 16:14                       ` Bob Pearson
  1 sibling, 0 replies; 23+ messages in thread
From: Bob Pearson @ 2021-09-14 16:14 UTC (permalink / raw)
  To: Shoaib Rao, Jason Gunthorpe
  Cc: Pearson, Robert B, Zhu Yanjun, RDMA mailing list

On 9/12/21 7:50 PM, Shoaib Rao wrote:
> 
> On 8/6/21 6:49 AM, Jason Gunthorpe wrote:
>> On Wed, Aug 04, 2021 at 11:11:15PM -0700, Shoaib Rao wrote:
>>> Bob,
>>>
>>> Your third patch has an issue.
>>>
>>> In rxe_cq_post()
>>>
>>>
>>> addr = producer_addr(cq->queue, QUEUE_TYPE_TO_CLIENT);
>>>
>>> It should be
>>>
>>> addr = producer_addr(cq->queue, QUEUE_TYPE_FROM_CLIENT);
>>>
>>> After making this change, I have tested my patch and rping works.
>>>
>>> Bob can you please point me to the discussion which lead to the current
>>> changes, particularly the need for user barrier.
>>>
>>> Zhu can you apply Bob's 3 patches + the change above + my patch and report
>>> back. In my testing it works.
>> I'll expect Bob to resend
>>
>>     [for-next,v2,3/3] RDMA/rxe: Add memory barriers to kernel queues
>>
>> Jason
> 
> I have not seen a reply to this email thread. Has the issue been resolved and I missed it?
> 
> Shoaib
> 

Shoaib,

Thanks for this. I think I figured out what was causing the problem you tried to fix by
changing _TO_CLIENT to _FROM_CLIENT. That change isn't the whole solution.

The inline functions in rxe_queue.h all take a type parameter so that the compiler can remove the
switch statement, since the case is known at compile time. The types currently refer to the direction
of data flow in the queues from the point of view of the internals of the rxe driver, i.e.
for WQs data flows from the CLIENT to the DRIVER and for CQs data flows to the CLIENT from
the DRIVER. This lets the routines in rxe_queue.h selectively use smp_load_acquire or
smp_store_release to 'protect' the queue indices owned by the client, and use the private q->index
for the indices owned by the internals of the driver, which is then copied to
q->buf->producer/consumer_index. The reason for this is that the driver can't trust the client in
user space not to touch its data.
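
To make the scheme described above concrete, here is a minimal userspace sketch of it. It is not
the rxe code: every sketch_* name is hypothetical, and C11 acquire/release atomics stand in for
the kernel's smp_load_acquire and smp_store_release. When the type argument is a constant at each
call site, an optimizing compiler can fold the switch away after inlining.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* All names here are hypothetical; this is a userspace sketch, not rxe code. */
enum sketch_type {
	SKETCH_FROM_CLIENT,	/* WQ: client produces, driver consumes */
	SKETCH_TO_CLIENT,	/* CQ: driver produces, client consumes */
};

/* Stands in for the queue buffer that is shared with user space. */
struct sketch_shared {
	_Atomic uint32_t producer_index;
	_Atomic uint32_t consumer_index;
};

struct sketch_queue {
	struct sketch_shared *buf;
	uint32_t index;		/* driver-private copy of the index it owns */
	uint32_t index_mask;	/* slot count minus one, slot count a power of two */
};

/* Producer index as seen by the driver internals. */
static inline uint32_t sketch_get_prod(const struct sketch_queue *q,
				       enum sketch_type type)
{
	switch (type) {
	case SKETCH_FROM_CLIENT:
		/* Owned by the client: acquire-load the shared copy. */
		return atomic_load_explicit(&q->buf->producer_index,
					    memory_order_acquire);
	case SKETCH_TO_CLIENT:
		/* Owned by the driver: trust only the private copy. */
		return q->index;
	}
	return 0;
}

/* Driver publishes one new entry on a client-bound queue. */
static inline void sketch_advance_prod(struct sketch_queue *q,
				       enum sketch_type type)
{
	/* The driver only ever produces into client-bound queues. */
	assert(type == SKETCH_TO_CLIENT);
	q->index = (q->index + 1) & q->index_mask;
	/* Release-store so the entry is visible before the new index. */
	atomic_store_explicit(&q->buf->producer_index, q->index,
			      memory_order_release);
}
```

In this sketch a client that overwrites the shared producer_index of a completion queue cannot
change what the driver believes it has produced, because the driver reads only its private index.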

In rxe_cq.c, where you made the change, data is flowing to the client and the original type is the
correct one. I believe the problem lies in rxe_verbs.c, where verbs code manipulates the 'client'
end of the queues. This occurs in post_one_recv(), post_one_send(), rxe_poll_cq(), and rxe_peek_cq().
This code was using the same APIs as the driver internals, which had two problems. First, it had
the direction wrong in terms of which indices needed protection and, worse, it used the private
copies of the indices that should not be visible to clients. You put memory barriers in rxe_cq.c
that had the same effect as putting the correct barriers in rxe_verbs.c.

I have fixed this by adding two new types for these verbs routines to use that put the correct
memory barriers on the correct indices. The fix is going to be resent as v4 of the patch series shortly.
I would like you to try it on rping and see if it cleans up what you were seeing.
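
As a rough illustration of what two extra types for the client end could look like, here is a
self-contained userspace sketch. It is an assumption about the shape of such a fix, not the v4
patch: all vq_* names, including VQ_CLIENT_PRODUCER and VQ_CLIENT_CONSUMER, are hypothetical, and
C11 atomics stand in for the kernel barriers. The client-end cases use only the shared indices and
never read the driver-private q->index.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* All names are hypothetical; this is a sketch, not the v4 patch. */
enum vq_type {
	VQ_FROM_CLIENT,		/* driver internals consuming a work queue */
	VQ_TO_CLIENT,		/* driver internals filling a completion queue */
	VQ_CLIENT_PRODUCER,	/* verbs code posting to a work queue */
	VQ_CLIENT_CONSUMER,	/* verbs code polling a completion queue */
};

struct vq_shared {
	_Atomic uint32_t producer_index;
	_Atomic uint32_t consumer_index;
};

struct vq {
	struct vq_shared *buf;
	uint32_t index;		/* private to the driver internals */
	uint32_t index_mask;	/* slot count minus one, power of two */
};

static inline uint32_t vq_get_prod(const struct vq *q, enum vq_type type)
{
	switch (type) {
	case VQ_TO_CLIENT:		/* driver owns it: private copy */
		return q->index;
	case VQ_CLIENT_PRODUCER:	/* caller owns it: shared copy, no barrier */
		return atomic_load_explicit(&q->buf->producer_index,
					    memory_order_relaxed);
	case VQ_FROM_CLIENT:		/* other side owns it: acquire */
	case VQ_CLIENT_CONSUMER:
		return atomic_load_explicit(&q->buf->producer_index,
					    memory_order_acquire);
	}
	return 0;
}

static inline uint32_t vq_get_cons(const struct vq *q, enum vq_type type)
{
	switch (type) {
	case VQ_FROM_CLIENT:		/* driver owns it: private copy */
		return q->index;
	case VQ_CLIENT_CONSUMER:	/* caller owns it: shared copy, no barrier */
		return atomic_load_explicit(&q->buf->consumer_index,
					    memory_order_relaxed);
	case VQ_TO_CLIENT:		/* other side owns it: acquire */
	case VQ_CLIENT_PRODUCER:
		return atomic_load_explicit(&q->buf->consumer_index,
					    memory_order_acquire);
	}
	return 0;
}

static inline int vq_empty(const struct vq *q, enum vq_type type)
{
	return ((vq_get_prod(q, type) - vq_get_cons(q, type)) & q->index_mask) == 0;
}

static inline void vq_advance_cons(struct vq *q, enum vq_type type)
{
	uint32_t cons;

	/* Only the two consuming roles may move the consumer index. */
	assert(type == VQ_FROM_CLIENT || type == VQ_CLIENT_CONSUMER);
	cons = (vq_get_cons(q, type) + 1) & q->index_mask;
	if (type == VQ_FROM_CLIENT)
		q->index = cons;	/* the driver keeps its private copy current */
	/* Release-store so the slot is finished with before it is handed back. */
	atomic_store_explicit(&q->buf->consumer_index, cons,
			      memory_order_release);
}
```

A poll-style caller in this sketch would check vq_empty(q, VQ_CLIENT_CONSUMER), copy the entry out,
then call vq_advance_cons(q, VQ_CLIENT_CONSUMER), leaving q->index untouched.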

Bob

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2021-09-14 16:14 UTC | newest]

Thread overview: 23+ messages
2021-07-18 22:59 [PATCH v3 0/1] RDMA/rxe: Bump up default maximum values used via uverbs Rao Shoaib
2021-07-18 22:59 ` [PATCH v3 1/1] " Rao Shoaib
2021-07-27 16:15 ` [PATCH v3 0/1] " Shoaib Rao
2021-07-27 17:41   ` Jason Gunthorpe
2021-07-29  6:42     ` Zhu Yanjun
2021-07-29  6:52       ` Shoaib Rao
2021-07-29  7:57         ` Zhu Yanjun
2021-07-29 19:33           ` Shoaib Rao
2021-07-29 19:50             ` Jason Gunthorpe
2021-07-29 20:33               ` Shoaib Rao
2021-07-29 23:08               ` Pearson, Robert B
2021-07-30  0:34                 ` Shoaib Rao
2021-08-03 23:53                   ` Shoaib Rao
2021-08-04  0:51                     ` Zhu Yanjun
2021-08-04  1:51                       ` Shoaib Rao
2021-08-04  2:21                         ` Zhu Yanjun
2021-08-05  4:10                           ` Shoaib Rao
2021-08-05  6:56                             ` Leon Romanovsky
2021-08-05  6:11                 ` Shoaib Rao
2021-08-06 13:49                   ` Jason Gunthorpe
2021-09-13  0:50                     ` Shoaib Rao
2021-09-13  3:34                       ` Pearson, Robert B
2021-09-14 16:14                       ` Bob Pearson
