linux-rdma.vger.kernel.org archive mirror
From: Shoaib Rao <rao.shoaib@oracle.com>
To: Zhu Yanjun <zyjzyj2000@gmail.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>,
	RDMA mailing list <linux-rdma@vger.kernel.org>
Subject: Re: [PATCH v3 0/1] RDMA/rxe: Bump up default maximum values used via uverbs
Date: Thu, 29 Jul 2021 12:33:14 -0700
Message-ID: <eb24b781-396f-5bb9-89c7-3ca0f8b83849@oracle.com>
In-Reply-To: <CAD=hENcTYfV1LT1=_e=eCNxdjr1Nmi+R3hH_CQn70MGRTKG7LA@mail.gmail.com>

I switched back to the old values and rebuilt the rdma_rxe module. I
still could not get rping to work: first I get ICRC errors, and then one
node panics. Both nodes are running 5.14.0-rc1. So the issue you are
seeing is not caused by my changes; rxe is already broken in 5.14.0-rc1.

Jason,

Can we please accept my initial patch, in which I bumped up the values
of a few parameters? We have tested extensively with those values. I
will try to resolve the CRC errors and the panic, and make changes to
the other tunables, later.

Regards,

Shoaib

[ 2105.071603] rdma_rxe: bad ICRC from 10.129.135.22
[ 2106.979538] rdma_rxe: bad ICRC from 10.129.135.22
[ 2109.155417] rdma_rxe: bad ICRC from 10.129.135.22
[ 2111.331292] rdma_rxe: bad ICRC from 10.129.135.22
[ 2113.507169] rdma_rxe: bad ICRC from 10.129.135.22
[ 2115.683046] rdma_rxe: bad ICRC from 10.129.135.22
[ 2117.858927] rdma_rxe: bad ICRC from 10.129.135.22
[ 2120.034798] rdma_rxe: bad ICRC from 10.129.135.22
[ 2122.210691] BUG: unable to handle page fault for address: ffffbd8562275180
[ 2122.292744] #PF: supervisor write access in kernel mode
[ 2122.355063] #PF: error_code(0x0002) - not-present page
[ 2122.416342] PGD 100000067 P4D 100000067 PUD 1001c7067 PMD 142a84067 PTE 0
[ 2122.497361] Oops: 0002 [#1] SMP PTI
[ 2122.538913] CPU: 8 PID: 0 Comm: swapper/8 Not tainted 5.14.0-rc1_rxe_values+ #4
[ 2122.626155] Hardware name: Oracle Corporation SUN FIRE X4170 M2 SERVER        /ASSY,MOTHERBOARD,X4170, BIOS 08140115 07/04/2018
[ 2122.763248] RIP: 0010:rxe_cq_post+0x9e/0x220 [rdma_rxe]
[ 2122.825578] Code: 44 8b 8b 48 01 00 00 4c 8b 47 08 8b 4f 28 49 8d b0 80 01 00 00 45 85 c9 0f 84 7d 01 00 00 8b 57 34 d3 e2 48 01 f2 49 8b 0c 24 <48> 89 0a 49 8b 4c 24 08 48 89 4a 08 49 8b 4c 24 10 48 89 4a 10 49
[ 2123.049907] RSP: 0018:ffffbd85464f0800 EFLAGS: 00010082
[ 2123.112225] RAX: 0000000000000246 RBX: ffff9dad4fd3f800 RCX: 0000000000000000
[ 2123.197389] RDX: ffffbd8562275180 RSI: ffffbd8546273180 RDI: ffff9da2c3cee840
[ 2123.282555] RBP: ffffbd85464f0840 R08: ffffbd8546273000 R09: 0000000000000001
[ 2123.367722] R10: 0000000000000001 R11: 0000000000000001 R12: ffffbd85464f08b8
[ 2123.452886] R13: 0000000000000000 R14: ffff9dad4fd3f940 R15: ffff9da3002ac008
[ 2123.538053] FS:  0000000000000000(0000) GS:ffff9db94fa80000(0000) knlGS:0000000000000000
[ 2123.634642] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2123.703190] CR2: ffffbd8562275180 CR3: 00000013f340a005 CR4: 00000000000206e0
[ 2123.788357] Call Trace:
[ 2123.817443]  <IRQ>
[ 2123.841335]  rxe_responder+0x5d9/0x2490 [rdma_rxe]
[ 2123.898467]  ? native_apic_mem_write+0x10/0x10
[ 2123.951445]  ? native_apic_wait_icr_idle+0x22/0x30
[ 2124.008575]  ? arch_irq_work_raise+0x3a/0x40
[ 2124.059476]  ? __irq_work_queue_local+0x48/0x60
[ 2124.113486]  ? fib_table_lookup+0x21e/0x640
[ 2124.163348]  ? wake_up_klogd.part.31+0x34/0x40
[ 2124.216319]  rxe_do_task+0x94/0x110 [rdma_rxe]
[ 2124.269297]  rxe_run_task+0x2a/0x40 [rdma_rxe]
[ 2124.322275]  rxe_resp_queue_pkt+0x44/0x50 [rdma_rxe]
[ 2124.381485]  rxe_rcv+0x2eb/0x900 [rdma_rxe]
[ 2124.431347]  rxe_udp_encap_recv+0x6d/0xd0 [rdma_rxe]
[ 2124.490555]  ? rxe_enable_task+0x10/0x10 [rdma_rxe]
[ 2124.548725]  udp_queue_rcv_one_skb+0x1f2/0x500
[ 2124.601697]  udp_queue_rcv_skb+0x50/0x210
[ 2124.649475]  udp_unicast_rcv_skb.isra.67+0x78/0x90
[ 2124.706600]  __udp4_lib_rcv+0x57c/0xbe0
[ 2124.752303]  udp_rcv+0x1a/0x20
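
A rough aid for comparing the trace above with the one quoted below: in
both reports the faulting address (CR2, which equals RDX) can be compared
against the other general-purpose registers to see whether the two crashes
look alike. The sketch below is only an illustration. The helper names
`parse_registers` and `nearby_registers` and the 1 GiB `window` are
assumptions made for this example, the register excerpts are copied from
the two oops reports in this thread, and the result does not by itself
identify a root cause.

```python
import re

# Register excerpts copied from the two oops reports in this thread.
OOPS_5_14_RC1 = """
RDX: ffffbd8562275180 RSI: ffffbd8546273180 RDI: ffff9da2c3cee840
CR2: ffffbd8562275180 CR3: 00000013f340a005
"""
OOPS_5_13_1 = """
RDX: ffff9e5de298d180 RSI: 0000000000000001 RDI: ffff9e5dc698b180
CR2: ffff9e5de298d180 CR3: 0000000c4df4e006
"""

# Matches "NAME: 16-hex-digit value" pairs such as "RDX: ffffbd8562275180".
REG = re.compile(r"\b([A-Z][A-Z0-9]{1,2}):\s+([0-9a-f]{16})\b")


def parse_registers(text):
    """Return {register name: integer value} for an oops excerpt."""
    return {name: int(value, 16) for name, value in REG.findall(text)}


def nearby_registers(text, window=1 << 30):
    """Return registers that lie below CR2 by less than `window` bytes.

    Registers equal to CR2 (offset 0) are skipped, since they only
    restate the faulting address. The window size is an arbitrary
    choice for this sketch.
    """
    regs = parse_registers(text)
    fault = regs["CR2"]
    return {
        name: fault - value
        for name, value in regs.items()
        if name != "CR2" and 0 < fault - value < window
    }


for label, text in (("5.14.0-rc1", OOPS_5_14_RC1), ("5.13.1-rxe+", OOPS_5_13_1)):
    for name, offset in nearby_registers(text).items():
        print(f"{label}: CR2 - {name} = {offset:#x}")
```

With these excerpts, the faulting address sits 0x1c002000 bytes (roughly
448 MiB) above RSI in the 5.14.0-rc1 trace and the same distance above RDI
in the 5.13.1-rxe+ trace, which is at least consistent with both hosts
hitting the same failure in rxe_cq_post.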

On 7/29/21 12:57 AM, Zhu Yanjun wrote:
> On Thu, Jul 29, 2021 at 2:52 PM Shoaib Rao <rao.shoaib@oracle.com> wrote:
>>
>> On 7/28/21 11:42 PM, Zhu Yanjun wrote:
>>> On Wed, Jul 28, 2021 at 1:42 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>>>> On Tue, Jul 27, 2021 at 09:15:45AM -0700, Shoaib Rao wrote:
>>>>> Hi Jason et al,
>>>>>
>>>>> Can I please get an up or down comment on my patch?
>>>> Bob and Zhu should check it
>>> In my daily tests, with one host running 5.12-stable and the other
>>> host running 5.14-rc3 plus this commit, rping does not work.
>>> Sometimes a crash occurs.
>> Can you paste the stack?
> [  381.068203] rdma_rxe: qp#17 moved to error state
> [  421.464485] BUG: unable to handle page fault for address: ffff9e5de298d180
> [  421.464515] #PF: supervisor write access in kernel mode
> [  421.464532] #PF: error_code(0x0002) - not-present page
> [  421.464549] PGD 100c00067 P4D 100c00067 PUD 100dc1067 PMD 125e78067 PTE 0
> [  421.464572] Oops: 0002 [#1] SMP PTI
> [  421.464585] CPU: 25 PID: 0 Comm: swapper/25 Kdump: loaded Tainted: G S      W  OE     5.13.1-rxe+ #17
> [  421.464613] Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.02.01.0008.031920191559 03/19/2019
> [  421.464642] RIP: 0010:rxe_cq_post+0x98/0x210 [rdma_rxe]
> [  421.464667] Code: 8b b3 48 01 00 00 4d 8b 48 08 41 8b 48 28 49 8d b9 80 01 00 00 85 f6 0f 84 78 01 00 00 41 8b 50 34 d3 e2 48 01 fa 48 8b 4d 00 <48> 89 0a 48 8b 4d 08 48 89 4a 08 48 8b 4d 10 48 89 4a 10 48 8b 4d
> [  421.464718] RSP: 0018:ffff9e5dc6ce0918 EFLAGS: 00010082
> [  421.464735] RAX: 0000000000000246 RBX: ffff8b200cabd800 RCX: 0000000000000000
> [  421.464756] RDX: ffff9e5de298d180 RSI: 0000000000000001 RDI: ffff9e5dc698b180
> [  421.464777] RBP: ffff9e5dc6ce09c0 R08: ffff8b2014d85a80 R09: ffff9e5dc698b000
> [  421.464797] R10: ffffffff8bc90940 R11: 0000000000000001 R12: 0000000000000000
> [  421.464817] R13: ffff8b200cabd940 R14: ffff8b206e014008 R15: 000000000000001a
> [  421.464838] FS:  0000000000000000(0000) GS:ffff8b1fd1040000(0000) knlGS:0000000000000000
> [  421.464861] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  421.464879] CR2: ffff9e5de298d180 CR3: 0000000c4df4e006 CR4: 00000000007706e0
> [  421.464899] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  421.464920] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  421.464941] PKRU: 55555554
> [  421.464950] Call Trace:
> [  421.464961]  <IRQ>
> [  421.464971]  rxe_responder+0x621/0x2480 [rdma_rxe]
> [  421.464993]  ? __fib_validate_source+0x2e9/0x450
> [  421.465013]  rxe_do_task+0x89/0x100 [rdma_rxe]
> [  421.465033]  rxe_rcv+0x2eb/0x900 [rdma_rxe]
> [  421.465050]  ? __udp4_lib_lookup+0x2c8/0x440
> [  421.465065]  rxe_udp_encap_recv+0x68/0xc0 [rdma_rxe]
> [  421.465085]  ? rxe_enable_task+0x10/0x10 [rdma_rxe]
> [  421.465104]  udp_queue_rcv_one_skb+0x1df/0x4e0
> [  421.465120]  udp_unicast_rcv_skb.isra.67+0x74/0x90
> [  421.465135]  __udp4_lib_rcv+0x555/0xb90
> [  421.465150]  ? nf_ct_deliver_cached_events+0xc1/0x120 [nf_conntrack]
> [  421.465181]  ip_protocol_deliver_rcu+0xe8/0x1b0
> [  421.465199]  ip_local_deliver_finish+0x44/0x50
> [  421.465215]  ip_local_deliver+0xf1/0x100
> [  421.465229]  ? coalesce_fill_reply+0x2c1/0x480
> [  421.465249]  ? ip_protocol_deliver_rcu+0x1b0/0x1b0
> [  421.465265]  ip_sublist_rcv_finish+0x75/0x80
> [  421.465281]  ip_sublist_rcv+0x196/0x220
> [  421.465296]  ? ip_local_deliver+0x100/0x100
> [  421.465312]  ip_list_rcv+0x137/0x160
> [  421.465325]  __netif_receive_skb_list_core+0x29b/0x2c0
> [  421.465344]  netif_receive_skb_list_internal+0x1c3/0x2f0
> [  421.465361]  gro_normal_list.part.158+0x19/0x40
> [  421.465376]  napi_complete_done+0x67/0x160
> [  421.465391]  i40e_napi_poll+0x53b/0x840 [i40e]
> [  421.465426]  __napi_poll+0x2b/0x120
> [  421.466123]  net_rx_action+0x236/0x300
> [  421.466783]  __do_softirq+0xc9/0x285
> [  421.467440]  irq_exit_rcu+0xba/0xd0
> [  421.468091]  common_interrupt+0x7f/0xa0
> [  421.468737]  </IRQ>
> [  421.469366]  asm_common_interrupt+0x1e/0x40
> [  421.469990] RIP: 0010:cpuidle_enter_state+0xd6/0x350
> [  421.470608] Code: 49 89 c4 0f 1f 44 00 00 31 ff e8 45 49 99 ff 45 84 ff 74 12 9c 58 f6 c4 02 0f 85 32 02 00 00 31 ff e8 ae c8 9f ff fb 45 85 f6 <0f> 88 e0 00 00 00 49 63 d6 4c 2b 24 24 48 8d 04 52 48 8d 04 82 49
> [  421.471935] RSP: 0018:ffff9e5dc679fe80 EFLAGS: 00000202
> [  421.472599] RAX: ffff8b1fd106bc40 RBX: 0000000000000002 RCX: 000000000000001f
> [  421.473266] RDX: 00000062213d764d RSI: 000000003351fed6 RDI: 0000000000000000
> [  421.473920] RBP: ffffbe51c1040000 R08: 0000000000000002 R09: 000000000002b480
> [  421.474558] R10: 0000a82bea904be8 R11: ffff8b1fd106a984 R12: 00000062213d764d
> [  421.475172] R13: ffffffff8c6c6d80 R14: 0000000000000002 R15: 0000000000000000
> [  421.475763]  cpuidle_enter+0x29/0x40
> [  421.476348]  do_idle+0x257/0x2a0
> [  421.476926]  cpu_startup_entry+0x19/0x20
> [  421.477497]  start_secondary+0x116/0x150
> [  421.478067]  secondary_startup_64_no_verify+0xc2/0xcb
> [  421.478640] Modules linked in: rdma_rxe(OE) ip6_udp_tunnel udp_tunnel xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 nft_compat nft_counter nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink tun bridge stp llc nls_utf8 isofs cdrom loop rfkill ib_isert iscsi_target_mod ib_srpt ext4 target_core_mod ib_srp scsi_transport_srp mbcache jbd2 rpcrdma sunrpc intel_rapl_msr intel_rapl_common rdma_ucm isst_if_common ib_iser ib_umad rdma_cm ib_ipoib iw_cm skx_edac libiscsi ib_cm nfit libnvdimm scsi_transport_iscsi x86_pkg_temp_thermal intel_powerclamp mlx5_ib coretemp crct10dif_pclmul crc32_pclmul iTCO_wdt iTCO_vendor_support ib_uverbs ghash_clmulni_intel rapl ipmi_ssif intel_cstate ib_core mei_me acpi_ipmi i2c_i801 joydev intel_uncore pcspkr mei i2c_smbus lpc_ich ioatdma ipmi_si intel_pch_thermal dca ipmi_devintf ipmi_msghandler acpi_pad acpi_power_meter ip_tables xfs libcrc32c sd_mod t10_pi sg mlx5_core ast i2c_algo_bit drm_vram_helper
> [  421.478702]  drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm_ttm_helper ttm mlxfw ahci libahci pci_hyperv_intf ice drm i40e tls crc32c_intel libata psample wmi dm_mirror dm_region_hash dm_log dm_mod fuse [last unloaded: ip6_udp_tunnel]
> [  421.483665] CR2: ffff9e5de298d180
>
>
>>> It seems that changing maximum values breaks backward compatibility.
>>>
>>> But without this commit, that is, 5.12-stable <-------> 5.14-rc3,
>>> rping can work well.
>> That is strange, because all the large values do is initialize the
>> pool with large values. Nothing else. So unless large values are used,
>> there should be no issues. Is it possible that the issue is with
>> 5.14-rc3? Do things work between 5.12-stable systems? Anyway, please
>> post the stack trace, along with information on the setup and the
>> rping commands used.
>>
>> Shoaib
>>
>>> Zhu Yanjun
>>>> Jason
