* Question about RDMA server...
From: Trond Myklebust @ 2022-03-31 15:16 UTC
  To: Chuck.Lever; +Cc: linux-nfs

Hi Chuck,


I'm seeing a very weird stack trace on one of our systems:

[88463.974603] BUG: kernel NULL pointer dereference, address: 0000000000000058
[88463.974778] #PF: supervisor read access in kernel mode
[88463.974916] #PF: error_code(0x0000) - not-present page
[88463.975037] PGD 0 P4D 0
[88463.975164] Oops: 0000 [#1] SMP NOPTI
[88463.975296] CPU: 4 PID: 12597 Comm: nfsd Kdump: loaded Not tainted 5.15.31-200.pd.17802.el7.x86_64 #1
[88463.975452] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 12/12/2018
[88463.975630] RIP: 0010:svc_rdma_sendto+0x26/0x330 [rpcrdma]
[88463.975831] Code: 1f 44 00 00 0f 1f 44 00 00 41 57 41 56 41 55 41 54 55 53 48 83 ec 28 4c 8b 6f 20 4c 8b a7 90 01 00 00 48 89 3c 24 49 8b 45 38 <49> 8b 5c 24 58 a8 40 0f 85 f8 01 00 00 49 8b 45 38 a8 04 0f 85 ec
[88463.976247] RSP: 0018:ffffad54c20b3e80 EFLAGS: 00010282
[88463.976469] RAX: 0000000000004119 RBX: ffff94557ccd8400 RCX: ffff9455daf0c000
[88463.976705] RDX: ffff9455daf0c000 RSI: ffff9455da8601a8 RDI: ffff9455da860000
[88463.976946] RBP: ffff9455da860000 R08: ffff945586b6b388 R09: ffff945586b6b388
[88463.977196] R10: 7f0f1dac00000002 R11: 0000000000000000 R12: 0000000000000000
[88463.977449] R13: ffff945663080800 R14: ffff9455da860000 R15: ffff9455dabb8000
[88463.977709] FS:  0000000000000000(0000) GS:ffff94586fd00000(0000) knlGS:0000000000000000
[88463.977982] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[88463.978254] CR2: 0000000000000058 CR3: 00000001f9282006 CR4: 00000000003706e0
[88463.978583] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[88463.978865] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[88463.979141] Call Trace:
[88463.979419]  <TASK>
[88463.979693]  ? svc_process_common+0xfa/0x6a0 [sunrpc]
[88463.980021]  ? svc_rdma_cma_handler+0x30/0x30 [rpcrdma]
[88463.980320]  ? svc_recv+0x48a/0x8c0 [sunrpc]
[88463.980662]  svc_send+0x49/0x120 [sunrpc]
[88463.981009]  nfsd+0xe8/0x140 [nfsd]
[88463.981346]  ? nfsd_shutdown_threads+0x80/0x80 [nfsd]
[88463.981675]  kthread+0x127/0x150
[88463.981981]  ? set_kthread_struct+0x40/0x40
[88463.982284]  ret_from_fork+0x22/0x30
[88463.982586]  </TASK>
[88463.982886] Modules linked in: nfsv3 bpf_preload xt_nat veth nfs_layout_flexfiles auth_name rpcsec_gss_krb5 nfsv4 dns_resolver nfsidmap nfs nfsd auth_rpcgss nfs_acl lockd grace xt_MASQUERADE nf_conntrack_netlink xt_addrtype br_netfilter bridge stp llc overlay nf_nat_ftp nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_ftp xt_CT xt_sctp ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ip6table_nat ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_nat iptable_mangle iptable_security iptable_raw nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nfnetlink ip6table_filter ip6_tables iptable_filter vsock_loopback vmw_vsock_virtio_transport_common vmw_vsock_vmci_transport vsock bonding ipmi_msghandler vfat fat rpcrdma sunrpc rdma_ucm ib_iser libiscsi scsi_transport_iscsi ib_umad rdma_cm iw_cm ib_ipoib ib_cm intel_rapl_msr intel_rapl_common crct10dif_pclmul mlx5_ib crc32_pclmul ghash_clmulni_intel rapl ib_uverbs ib_core vmw_vmci i2c_piix4
[88463.982930]  dm_multipath ip_tables xfs mlx5_core crc32c_intel ttm vmxnet3 vmw_pvscsi drm_kms_helper mlxfw cec pci_hyperv_intf tls ata_generic drm pata_acpi psample

Decoding the address svc_rdma_sendto+0x26 resolves to line 927 in
net/sunrpc/xprtrdma/svc_rdma_sendto.c

i.e.

927		__be32 *rdma_argp = rctxt->rc_recv_buf;

which shows that somehow, we're getting to svc_rdma_sendto() with a
request that has rqstp->rq_xprt_ctxt == NULL.
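
In other words, the fault is right at the top of the function. Roughly
speaking (paraphrasing from memory; only the quoted line is verbatim,
and the faulting address 0x58 is presumably the offset of rc_recv_buf
within the receive context):

	int svc_rdma_sendto(struct svc_rqst *rqstp)
	{
		struct svc_rdma_recv_ctxt *rctxt = rqstp->rq_xprt_ctxt; /* NULL */
		__be32 *rdma_argp = rctxt->rc_recv_buf;	/* line 927 oopses here */
		...
	}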

I'm having trouble seeing how that can happen, but thought I'd ask in
case you've seen something similar. AFAICS, the code in that area has
not changed since early 2021.

As I understand it, knfsd was in the process of shutting down at the
time, so it is possible there is a connection to that...

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



* Re: Question about RDMA server...
From: Chuck Lever III @ 2022-03-31 15:21 UTC
  To: Trond Myklebust; +Cc: Linux NFS Mailing List



> On Mar 31, 2022, at 11:16 AM, Trond Myklebust <trondmy@hammerspace.com> wrote:
> 
> Hi Chuck,
> 
> 
> I'm seeing a very weird stack trace on one of our systems:
> 
> [... kernel oops and module list snipped; full trace quoted above ...]
> 
> Decoding the address svc_rdma_sendto+0x26 resolves to line 927 in
> net/sunrpc/xprtrdma/svc_rdma_sendto.c
> 
> i.e.
> 
> 927		__be32 *rdma_argp = rctxt->rc_recv_buf;
> 
> which shows that somehow, we're getting to svc_rdma_sendto() with a
> request that has rqstp->rq_xprt_ctxt == NULL.
> 
> I'm having trouble seeing how that can happen, but thought I'd ask in
> case you've seen something similar. AFAICS, the code in that area has
> not changed since early 2021.
> 
> As I understand it, knfsd was in the process of shutting down at the
> time, so it is possible there is a connection to that...

My immediate response is "that's not supposed to happen". I'll give
it some thought, but yeah, I bet the transport sendto code is not
dealing with connection shutdown properly.

And btw, Hammerspace's testing of NFS/RDMA is very much appreciated.

--
Chuck Lever




* Re: Question about RDMA server...
From: Chuck Lever III @ 2022-03-31 15:59 UTC
  To: Trond Myklebust; +Cc: Linux NFS Mailing List



> On Mar 31, 2022, at 11:21 AM, Chuck Lever III <chuck.lever@oracle.com> wrote:
> 
> 
> 
>> On Mar 31, 2022, at 11:16 AM, Trond Myklebust <trondmy@hammerspace.com> wrote:
>> 
>> Hi Chuck,
>> 
>> 
>> I'm seeing a very weird stack trace on one of our systems:
>> 
>> [... kernel oops and module list snipped ...]
>> 
>> Decoding the address svc_rdma_sendto+0x26 resolves to line 927 in
>> net/sunrpc/xprtrdma/svc_rdma_sendto.c
>> 
>> i.e.
>> 
>> 927		__be32 *rdma_argp = rctxt->rc_recv_buf;
>> 
>> which shows that somehow, we're getting to svc_rdma_sendto() with a
>> request that has rqstp->rq_xprt_ctxt == NULL.
>> 
>> I'm having trouble seeing how that can happen, but thought I'd ask in
>> case you've seen something similar. AFAICS, the code in that area has
>> not changed since early 2021.
>> 
>> As I understand it, knfsd was in the process of shutting down at the
>> time, so it is possible there is a connection to that...
> 
> My immediate response is "that's not supposed to happen". I'll give
> it some thought, but yeah, I bet the transport sendto code is not
> dealing with connection shutdown properly.

I still don't see exactly how @rctxt could be NULL here, but
it might be enough to move the deref of @rctxt down past the
svc_xprt_is_dead() test, say just before "*p++ = *rdma_argp;".
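
I.e. something like the following, completely untested, and with
"out_err" standing in for whatever the real error path is:

	struct svc_rdma_recv_ctxt *rctxt = rqstp->rq_xprt_ctxt;
	__be32 *rdma_argp;
	...
	if (svc_xprt_is_dead(xprt))
		goto out_err;

	/* Don't touch the receive context until we know the
	 * transport is still usable. */
	rdma_argp = rctxt->rc_recv_buf;
	...
	*p++ = *rdma_argp;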

In any event, it could be that recent reorganization of generic
transport code might be responsible, or maybe that this is just
a hard race to hit.


--
Chuck Lever




* Re: Question about RDMA server...
From: Trond Myklebust @ 2022-03-31 16:15 UTC
  To: chuck.lever; +Cc: linux-nfs

On Thu, 2022-03-31 at 15:59 +0000, Chuck Lever III wrote:
> 
> 
> > On Mar 31, 2022, at 11:21 AM, Chuck Lever III
> > <chuck.lever@oracle.com> wrote:
> > 
> > 
> > 
> > > On Mar 31, 2022, at 11:16 AM, Trond Myklebust
> > > <trondmy@hammerspace.com> wrote:
> > > 
> > > Hi Chuck,
> > > 
> > > 
> > > I'm seeing a very weird stack trace on one of our systems:
> > > 
> > > [... kernel oops and module list snipped ...]
> > > 
> > > Decoding the address svc_rdma_sendto+0x26 resolves to line 927 in
> > > net/sunrpc/xprtrdma/svc_rdma_sendto.c
> > > 
> > > i.e.
> > > 
> > > 927             __be32 *rdma_argp = rctxt->rc_recv_buf;
> > > 
> > > which shows that somehow, we're getting to svc_rdma_sendto() with
> > > a
> > > request that has rqstp->rq_xprt_ctxt == NULL.
> > > 
> > > I'm having trouble seeing how that can happen, but thought I'd
> > > ask in
> > > case you've seen something similar. AFAICS, the code in that area
> > > has
> > > not changed since early 2021.
> > > 
> > > As I understand it, knfsd was in the process of shutting down at
> > > the
> > > time, so it is possible there is a connection to that...
> > 
> > My immediate response is "that's not supposed to happen". I'll give
> > it some thought, but yeah, I bet the transport sendto code is not
> > dealing with connection shutdown properly.
> 
> I still don't see exactly how @rctxt could be NULL here, but
> it might be enough to move the deref of @rctxt down past the
> svc_xprt_is_dead() test, say just before "*p++ = *rdma_argp;".
> 
> In any event, it could be that recent reorganization of generic
> transport code might be responsible, or maybe that this is just
> a hard race to hit.
> 

Hmm... Here's another thought. What if this were a deferred request
that is being replayed after an upcall to mountd or the idmapper? It
would mean that the synchronous wait in cache_defer_req() failed, so it
is going to be rare, but it could happen on a congested system.

AFAICS, svc_defer() does _not_ save rqstp->rq_xprt_ctxt, so
svc_deferred_recv() won't restore it either.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



* Re: Question about RDMA server...
From: Chuck Lever III @ 2022-03-31 16:20 UTC
  To: Trond Myklebust; +Cc: Linux NFS Mailing List



> On Mar 31, 2022, at 12:15 PM, Trond Myklebust <trondmy@hammerspace.com> wrote:
> 
> On Thu, 2022-03-31 at 15:59 +0000, Chuck Lever III wrote:
>> 
>> 
>>> On Mar 31, 2022, at 11:21 AM, Chuck Lever III
>>> <chuck.lever@oracle.com> wrote:
>>> 
>>> 
>>> 
>>>> On Mar 31, 2022, at 11:16 AM, Trond Myklebust
>>>> <trondmy@hammerspace.com> wrote:
>>>> 
>>>> Hi Chuck,
>>>> 
>>>> 
>>>> I'm seeing a very weird stack trace on one of our systems:
>>>> 
>>>> [... kernel oops and module list snipped ...]
>>>> 
>>>> Decoding the address svc_rdma_sendto+0x26 resolves to line 927 in
>>>> net/sunrpc/xprtrdma/svc_rdma_sendto.c
>>>> 
>>>> i.e.
>>>> 
>>>> 927             __be32 *rdma_argp = rctxt->rc_recv_buf;
>>>> 
>>>> which shows that somehow, we're getting to svc_rdma_sendto() with
>>>> a
>>>> request that has rqstp->rq_xprt_ctxt == NULL.
>>>> 
>>>> I'm having trouble seeing how that can happen, but thought I'd
>>>> ask in
>>>> case you've seen something similar. AFAICS, the code in that area
>>>> has
>>>> not changed since early 2021.
>>>> 
>>>> As I understand it, knfsd was in the process of shutting down at
>>>> the
>>>> time, so it is possible there is a connection to that...
>>> 
>>> My immediate response is "that's not supposed to happen". I'll give
>>> it some thought, but yeah, I bet the transport sendto code is not
>>> dealing with connection shutdown properly.
>> 
>> I still don't see exactly how @rctxt could be NULL here, but
>> it might be enough to move the deref of @rctxt down past the
>> svc_xprt_is_dead() test, say just before "*p++ = *rdma_argp;".
>> 
>> In any event, it could be that recent reorganization of generic
>> transport code might be responsible, or maybe that this is just
>> a hard race to hit.
>> 
> 
> Hmm... Here's another thought. What if this were a deferred request
> that is being replayed after an upcall to mountd or the idmapper? It
> would mean that the synchronous wait in cache_defer_req() failed, so it
> is going to be rare, but it could happen on a congested system.
> 
> AFAICS, svc_defer() does _not_ save rqstp->rq_xprt_ctxt, so
> svc_deferred_recv() won't restore it either.

True, but TCP and UDP both use rq_xprt_ctxt, so wouldn't we have
seen this problem before on a socket transport?

I need to audit code to see if saving rq_xprt_ctxt in svc_defer()
is safe and reasonable to do. Maybe Bruce has a thought.
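
If it does turn out to be safe, I would expect the change to be roughly
along these lines (untested sketch; the new field name is made up):

	/* struct svc_deferred_req grows a pointer: */
	void			*xprt_ctxt;

	/* svc_defer() saves the context: */
	dr->xprt_ctxt = rqstp->rq_xprt_ctxt;

	/* and svc_deferred_recv() restores it: */
	rqstp->rq_xprt_ctxt = dr->xprt_ctxt;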


--
Chuck Lever




* Re: Question about RDMA server...
From: Trond Myklebust @ 2022-03-31 16:24 UTC
  To: chuck.lever; +Cc: linux-nfs

On Thu, 2022-03-31 at 16:20 +0000, Chuck Lever III wrote:
> 
> 
> > On Mar 31, 2022, at 12:15 PM, Trond Myklebust
> > <trondmy@hammerspace.com> wrote:
> > 
> > On Thu, 2022-03-31 at 15:59 +0000, Chuck Lever III wrote:
> > > 
> > > 
> > > > On Mar 31, 2022, at 11:21 AM, Chuck Lever III
> > > > <chuck.lever@oracle.com> wrote:
> > > > 
> > > > 
> > > > 
> > > > > On Mar 31, 2022, at 11:16 AM, Trond Myklebust
> > > > > <trondmy@hammerspace.com> wrote:
> > > > > 
> > > > > Hi Chuck,
> > > > > 
> > > > > 
> > > > > I'm seeing a very weird stack trace on one of our systems:
> > > > > 
> > > > > [... kernel oops and module list snipped ...]
> > > > > 
> > > > > Decoding the address svc_rdma_sendto+0x26 resolves to line
> > > > > 927 in
> > > > > net/sunrpc/xprtrdma/svc_rdma_sendto.c
> > > > > 
> > > > > i.e.
> > > > > 
> > > > > 927             __be32 *rdma_argp = rctxt->rc_recv_buf;
> > > > > 
> > > > > which shows that somehow, we're getting to svc_rdma_sendto()
> > > > > with
> > > > > a
> > > > > request that has rqstp->rq_xprt_ctxt == NULL.
> > > > > 
> > > > > I'm having trouble seeing how that can happen, but thought
> > > > > I'd
> > > > > ask in
> > > > > case you've seen something similar. AFAICS, the code in that
> > > > > area
> > > > > has
> > > > > not changed since early 2021.
> > > > > 
> > > > > As I understand it, knfsd was in the process of shutting down
> > > > > at
> > > > > the
> > > > > time, so it is possible there is a connection to that...
> > > > 
> > > > My immediate response is "that's not supposed to happen". I'll
> > > > give
> > > > it some thought, but yeah, I bet the transport sendto code is
> > > > not
> > > > dealing with connection shutdown properly.
> > > 
> > > I still don't see exactly how @rctxt could be NULL here, but
> > > it might be enough to move the deref of @rctxt down past the
> > > svc_xprt_is_dead() test, say just before "*p++ = *rdma_argp;".
> > > 
> > > In any event, it could be that recent reorganization of generic
> > > transport code might be responsible, or maybe that this is just
> > > a hard race to hit.
> > > 
> > 
> > Hmm... Here's another thought. What if this were a deferred request
> > that is being replayed after an upcall to mountd or the idmapper?
> > It
> > would mean that the synchronous wait in cache_defer_req() failed,
> > so it
> > is going to be rare, but it could happen on a congested system.
> > 
> > AFAICS, svc_defer() does _not_ save rqstp->rq_xprt_ctxt, so
> > svc_deferred_recv() won't restore it either.
> 
> True, but TCP and UDP both use rq_xprt_ctxt, so wouldn't we have
> seen this problem before on a socket transport?

TCP does not set rq_xprt_ctxt, and nobody really uses UDP these days.

> 
> I need to audit code to see if saving rq_xprt_ctxt in svc_defer()
> is safe and reasonable to do. Maybe Bruce has a thought.
> 

It should be safe for the UDP case, AFAICS. I have no opinion as of yet
about how safe it is to do with RDMA.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



* Re: Question about RDMA server...
From: Chuck Lever III @ 2022-03-31 18:53 UTC
  To: Trond Myklebust; +Cc: Linux NFS Mailing List


> On Mar 31, 2022, at 12:24 PM, Trond Myklebust <trondmy@hammerspace.com> wrote:
> 
> On Thu, 2022-03-31 at 16:20 +0000, Chuck Lever III wrote:
>> 
>>> On Mar 31, 2022, at 12:15 PM, Trond Myklebust
>>> <trondmy@hammerspace.com> wrote:
>>> 
>>> Hmm... Here's another thought. What if this were a deferred request
>>> that is being replayed after an upcall to mountd or the idmapper?
>>> It
>>> would mean that the synchronous wait in cache_defer_req() failed,
>>> so it
>>> is going to be rare, but it could happen on a congested system.
>>> 
>>> AFAICS, svc_defer() does _not_ save rqstp->rq_xprt_ctxt, so
>>> svc_deferred_recv() won't restore it either.
>> 
>> True, but TCP and UDP both use rq_xprt_ctxt, so wouldn't we have
>> seen this problem before on a socket transport?
> 
> TCP does not set rq_xprt_ctxt, and nobody really uses UDP these days.
> 
>> I need to audit code to see if saving rq_xprt_ctxt in svc_defer()
>> is safe and reasonable to do. Maybe Bruce has a thought.
> 
> It should be safe for the UDP case, AFAICS. I have no opinion as of yet
> about how safe it is to do with RDMA.

It's plausible that a deferred request could be replayed, but I don't
understand the deferral mechanism enough to know whether the rctxt
would be released before the deferred request could be handled. It
doesn't look like it would, but I could misunderstand something.

There's a longstanding testing gap here: None of my test workloads
appear to force a request deferral. I don't recall Bruce having
such a test either.

It would be nice if we had something that could force the use of the
deferral path, like a command line option for mountd that would cause
it to sleep for several seconds before responding to an upcall. It
might also be done with the kernel's fault injector.


--
Chuck Lever




* Re: Question about RDMA server...
From: Trond Myklebust @ 2022-03-31 20:08 UTC
  To: chuck.lever; +Cc: linux-nfs

On Thu, 2022-03-31 at 18:53 +0000, Chuck Lever III wrote:
> 
> > On Mar 31, 2022, at 12:24 PM, Trond Myklebust
> > <trondmy@hammerspace.com> wrote:
> > 
> > On Thu, 2022-03-31 at 16:20 +0000, Chuck Lever III wrote:
> > > 
> > > > On Mar 31, 2022, at 12:15 PM, Trond Myklebust
> > > > <trondmy@hammerspace.com> wrote:
> > > > 
> > > > Hmm... Here's another thought. What if this were a deferred
> > > > request
> > > > that is being replayed after an upcall to mountd or the
> > > > idmapper?
> > > > It
> > > > would mean that the synchronous wait in cache_defer_req()
> > > > failed,
> > > > so it
> > > > is going to be rare, but it could happen on a congested system.
> > > > 
> > > > AFAICS, svc_defer() does _not_ save rqstp->rq_xprt_ctxt, so
> > > > svc_deferred_recv() won't restore it either.
> > > 
> > > True, but TCP and UDP both use rq_xprt_ctxt, so wouldn't we have
> > > seen this problem before on a socket transport?
> > 
> > TCP does not set rq_xprt_ctxt, and nobody really uses UDP these
> > days.
> > 
> > > I need to audit code to see if saving rq_xprt_ctxt in svc_defer()
> > > is safe and reasonable to do. Maybe Bruce has a thought.
> > 
> > It should be safe for the UDP case, AFAICS. I have no opinion as of
> > yet
> > about how safe it is to do with RDMA.
> 
> It's plausible that a deferred request could be replayed, but I don't
> understand the deferral mechanism enough to know whether the rctxt
> would be released before the deferred request could be handled. It
> doesn't look like it would, but I could misunderstand something.
> 
> There's a longstanding testing gap here: None of my test workloads
> appear to force a request deferral. I don't recall Bruce having
> such a test either.
> 
> It would be nice if we had something that could force the use of the
> deferral path, like a command line option for mountd that would cause
> it to sleep for several seconds before responding to an upcall. It
> might also be done with the kernel's fault injector.
> 

You just need some mechanism that causes svc_get_next_xprt() to set
rqstp->rq_chandle.thread_wait to the value '0'.
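
For testing purposes a quick hack in svc_get_next_xprt() would do, e.g.
something like this (the debug toggle is made up, not an existing knob):

	if (force_deferral)	/* hypothetical debug toggle */
		rqstp->rq_chandle.thread_wait = 0;
	else if (!test_bit(SP_CONGESTED, &pool->sp_flags))
		rqstp->rq_chandle.thread_wait = 5*HZ;
	else
		rqstp->rq_chandle.thread_wait = 1*HZ;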

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



* Re: Question about RDMA server...
From: J. Bruce Fields @ 2022-04-04 15:12 UTC
  To: Trond Myklebust; +Cc: chuck.lever, linux-nfs

On Thu, Mar 31, 2022 at 08:08:59PM +0000, Trond Myklebust wrote:
> On Thu, 2022-03-31 at 18:53 +0000, Chuck Lever III wrote:
> > It's plausible that a deferred request could be replayed, but I don't
> > understand the deferral mechanism enough to know whether the rctxt
> > would be released before the deferred request could be handled. It
> > doesn't look like it would, but I could misunderstand something.
> > 
> > There's a longstanding testing gap here: None of my test workloads
> > appear to force a request deferral. I don't recall Bruce having
> > such a test either.
> > 
> > It would be nice if we had something that could force the use of the
> > deferral path, like a command line option for mountd that would cause
> > it to sleep for several seconds before responding to an upcall. It
> > might also be done with the kernel's fault injector.
> > 
> 
> You just need some mechanism that causes svc_get_next_xprt() to set
> rqstp->rq_chandle.thread_wait to the value '0'.

Yeah, no idea what's going on here, but a way to force the deferral case
sounds like a great idea that'd come in handy some day if not this time.

(Can you hack it by, say, killing mountd, flushing the caches (exportfs
-f), then waiting a few seconds to restart mountd?)

--b.

* Re: Question about RDMA server...
From: Chuck Lever III @ 2022-04-04 15:34 UTC
  To: Trond Myklebust; +Cc: Linux NFS Mailing List



> On Mar 31, 2022, at 4:08 PM, Trond Myklebust <trondmy@hammerspace.com> wrote:
> 
> On Thu, 2022-03-31 at 18:53 +0000, Chuck Lever III wrote:
>> 
>>> On Mar 31, 2022, at 12:24 PM, Trond Myklebust
>>> <trondmy@hammerspace.com> wrote:
>>> 
>>> On Thu, 2022-03-31 at 16:20 +0000, Chuck Lever III wrote:
>>>> 
>>>>> On Mar 31, 2022, at 12:15 PM, Trond Myklebust
>>>>> <trondmy@hammerspace.com> wrote:
>>>>> 
>>>>> Hmm... Here's another thought. What if this were a deferred
>>>>> request
>>>>> that is being replayed after an upcall to mountd or the
>>>>> idmapper?
>>>>> It
>>>>> would mean that the synchronous wait in cache_defer_req()
>>>>> failed,
>>>>> so it
>>>>> is going to be rare, but it could happen on a congested system.
>>>>> 
>>>>> AFAICS, svc_defer() does _not_ save rqstp->rq_xprt_ctxt, so
>>>>> svc_deferred_recv() won't restore it either.
>>>> 
>>>> True, but TCP and UDP both use rq_xprt_ctxt, so wouldn't we have
>>>> seen this problem before on a socket transport?
>>> 
>>> TCP does not set rq_xprt_ctxt, and nobody really uses UDP these
>>> days.
>>> 
>>>> I need to audit code to see if saving rq_xprt_ctxt in svc_defer()
>>>> is safe and reasonable to do. Maybe Bruce has a thought.
>>> 
>>> It should be safe for the UDP case, AFAICS. I have no opinion as of
>>> yet
>>> about how safe it is to do with RDMA.
>> 
>> It's plausible that a deferred request could be replayed, but I don't
>> understand the deferral mechanism enough to know whether the rctxt
>> would be released before the deferred request could be handled. It
>> doesn't look like it would, but I could misunderstand something.
>> 
>> There's a longstanding testing gap here: None of my test workloads
>> appear to force a request deferral. I don't recall Bruce having
>> such a test either.
>> 
>> It would be nice if we had something that could force the use of the
>> deferral path, like a command line option for mountd that would cause
>> it to sleep for several seconds before responding to an upcall. It
>> might also be done with the kernel's fault injector.
> 
> You just need some mechanism that causes svc_get_next_xprt() to set
> rqstp->rq_chandle.thread_wait to the value '0'.

I gotta ask: Why does cache_defer_req() check if thread_wait is zero?
Is there some code in the kernel now that sets it to zero? I see only
these two values:

 784         if (!test_bit(SP_CONGESTED, &pool->sp_flags))
 785                 rqstp->rq_chandle.thread_wait = 5*HZ;
 786         else
 787                 rqstp->rq_chandle.thread_wait = 1*HZ;


--
Chuck Lever




* Re: Question about RDMA server...
From: Trond Myklebust @ 2022-04-04 16:19 UTC
  To: chuck.lever; +Cc: linux-nfs

On Mon, 2022-04-04 at 15:34 +0000, Chuck Lever III wrote:
> 
> 
> > On Mar 31, 2022, at 4:08 PM, Trond Myklebust
> > <trondmy@hammerspace.com> wrote:
> > 
> > On Thu, 2022-03-31 at 18:53 +0000, Chuck Lever III wrote:
> > > 
> > > > On Mar 31, 2022, at 12:24 PM, Trond Myklebust
> > > > <trondmy@hammerspace.com> wrote:
> > > > 
> > > > On Thu, 2022-03-31 at 16:20 +0000, Chuck Lever III wrote:
> > > > > 
> > > > > > On Mar 31, 2022, at 12:15 PM, Trond Myklebust
> > > > > > <trondmy@hammerspace.com> wrote:
> > > > > > 
> > > > > > Hmm... Here's another thought. What if this were a deferred
> > > > > > request
> > > > > > that is being replayed after an upcall to mountd or the
> > > > > > idmapper?
> > > > > > It
> > > > > > would mean that the synchronous wait in cache_defer_req()
> > > > > > failed,
> > > > > > so it
> > > > > > is going to be rare, but it could happen on a congested
> > > > > > system.
> > > > > > 
> > > > > > AFAICS, svc_defer() does _not_ save rqstp->rq_xprt_ctxt, so
> > > > > > svc_deferred_recv() won't restore it either.
> > > > > 
> > > > > True, but TCP and UDP both use rq_xprt_ctxt, so wouldn't we
> > > > > have
> > > > > seen this problem before on a socket transport?
> > > > 
> > > > TCP does not set rq_xprt_ctxt, and nobody really uses UDP these
> > > > days.
> > > > 
> > > > > I need to audit code to see if saving rq_xprt_ctxt in
> > > > > svc_defer()
> > > > > is safe and reasonable to do. Maybe Bruce has a thought.
> > > > 
> > > > It should be safe for the UDP case, AFAICS. I have no opinion
> > > > as of
> > > > yet
> > > > about how safe it is to do with RDMA.
> > > 
> > > It's plausible that a deferred request could be replayed, but I
> > > don't
> > > understand the deferral mechanism enough to know whether the
> > > rctxt
> > > would be released before the deferred request could be handled.
> > > It
> > > doesn't look like it would, but I could misunderstand something.
> > > 
> > > There's a longstanding testing gap here: None of my test
> > > workloads
> > > appear to force a request deferral. I don't recall Bruce having
> > > such a test either.
> > > 
> > > It would be nice if we had something that could force the use of
> > > the
> > > deferral path, like a command line option for mountd that would
> > > cause
> > > it to sleep for several seconds before responding to an upcall.
> > > It
> > > might also be done with the kernel's fault injector.
> > 
> > You just need some mechanism that causes svc_get_next_xprt() to set
> > rqstp->rq_chandle.thread_wait to the value '0'.
> 
> I gotta ask: Why does cache_defer_req() check if thread_wait is zero?
> Is there some code in the kernel now that sets it to zero? I see only
> these two values:
> 
>  784         if (!test_bit(SP_CONGESTED, &pool->sp_flags))
>  785                 rqstp->rq_chandle.thread_wait = 5*HZ;
>  786         else
>  787                 rqstp->rq_chandle.thread_wait = 1*HZ;


That test goes all the way back to commit f16b6e8d838b ("sunrpc/cache:
allow threads to block while waiting for cache update."), so probably
more of a question for Neil.
I don't see any immediate problems with removing it: even if someone
does set a value of 0, the wait code itself should cope.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



* Re: Question about RDMA server...
From: NeilBrown @ 2022-04-05  5:00 UTC
  To: Trond Myklebust; +Cc: chuck.lever, linux-nfs

On Tue, 05 Apr 2022, Trond Myklebust wrote:
> On Mon, 2022-04-04 at 15:34 +0000, Chuck Lever III wrote:
> > 
....
> > 
> > I gotta ask: Why does cache_defer_req() check if thread_wait is zero?
> > Is there some code in the kernel now that sets it to zero? I see only
> > these two values:
> > 
> >  784         if (!test_bit(SP_CONGESTED, &pool->sp_flags))
> >  785                 rqstp->rq_chandle.thread_wait = 5*HZ;
> >  786         else
> >  787                 rqstp->rq_chandle.thread_wait = 1*HZ;
> 
> 
> That test goes all the way back to commit f16b6e8d838b ("sunrpc/cache:
> allow threads to block while waiting for cache update."), so probably
> more of a question for Neil.

12 years ago...  The commit message suggests that the "underlying
transport" might not approve of sleeping, but I cannot think what that
might mean.  Probably it seemed to mean something earlier in the design
process, but I discarded the idea without discarding the code.

> I don't see any immediate problems with removing it: even if someone
> does set a value of 0, the wait code itself should cope.
> 
Agreed.

NeilBrown
