* Question about RDMA server...
From: Trond Myklebust @ 2022-03-31 15:16 UTC
  To: Chuck.Lever; +Cc: linux-nfs

Hi Chuck,


I'm seeing a very weird stack trace on one of our systems:

[88463.974603] BUG: kernel NULL pointer dereference, address: 0000000000000058
[88463.974778] #PF: supervisor read access in kernel mode
[88463.974916] #PF: error_code(0x0000) - not-present page
[88463.975037] PGD 0 P4D 0
[88463.975164] Oops: 0000 [#1] SMP NOPTI
[88463.975296] CPU: 4 PID: 12597 Comm: nfsd Kdump: loaded Not tainted 5.15.31-200.pd.17802.el7.x86_64 #1
[88463.975452] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 12/12/2018
[88463.975630] RIP: 0010:svc_rdma_sendto+0x26/0x330 [rpcrdma]
[88463.975831] Code: 1f 44 00 00 0f 1f 44 00 00 41 57 41 56 41 55 41 54 55 53 48 83 ec 28 4c 8b 6f 20 4c 8b a7 90 01 00 00 48 89 3c 24 49 8b 45 38 <49> 8b 5c 24 58 a8 40 0f 85 f8 01 00 00 49 8b 45 38 a8 04 0f 85 ec
[88463.976247] RSP: 0018:ffffad54c20b3e80 EFLAGS: 00010282
[88463.976469] RAX: 0000000000004119 RBX: ffff94557ccd8400 RCX: ffff9455daf0c000
[88463.976705] RDX: ffff9455daf0c000 RSI: ffff9455da8601a8 RDI: ffff9455da860000
[88463.976946] RBP: ffff9455da860000 R08: ffff945586b6b388 R09: ffff945586b6b388
[88463.977196] R10: 7f0f1dac00000002 R11: 0000000000000000 R12: 0000000000000000
[88463.977449] R13: ffff945663080800 R14: ffff9455da860000 R15: ffff9455dabb8000
[88463.977709] FS:  0000000000000000(0000) GS:ffff94586fd00000(0000) knlGS:0000000000000000
[88463.977982] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[88463.978254] CR2: 0000000000000058 CR3: 00000001f9282006 CR4: 00000000003706e0
[88463.978583] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[88463.978865] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[88463.979141] Call Trace:
[88463.979419]  <TASK>
[88463.979693]  ? svc_process_common+0xfa/0x6a0 [sunrpc]
[88463.980021]  ? svc_rdma_cma_handler+0x30/0x30 [rpcrdma]
[88463.980320]  ? svc_recv+0x48a/0x8c0 [sunrpc]
[88463.980662]  svc_send+0x49/0x120 [sunrpc]
[88463.981009]  nfsd+0xe8/0x140 [nfsd]
[88463.981346]  ? nfsd_shutdown_threads+0x80/0x80 [nfsd]
[88463.981675]  kthread+0x127/0x150
[88463.981981]  ? set_kthread_struct+0x40/0x40
[88463.982284]  ret_from_fork+0x22/0x30
[88463.982586]  </TASK>
[88463.982886] Modules linked in: nfsv3 bpf_preload xt_nat veth nfs_layout_flexfiles auth_name rpcsec_gss_krb5 nfsv4 dns_resolver nfsidmap nfs nfsd auth_rpcgss nfs_acl lockd grace xt_MASQUERADE nf_conntrack_netlink xt_addrtype br_netfilter bridge stp llc overlay nf_nat_ftp nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_ftp xt_CT xt_sctp ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ip6table_nat ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_nat iptable_mangle iptable_security iptable_raw nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nfnetlink ip6table_filter ip6_tables iptable_filter vsock_loopback vmw_vsock_virtio_transport_common vmw_vsock_vmci_transport vsock bonding ipmi_msghandler vfat fat rpcrdma sunrpc rdma_ucm ib_iser libiscsi scsi_transport_iscsi ib_umad rdma_cm iw_cm ib_ipoib ib_cm intel_rapl_msr intel_rapl_common crct10dif_pclmul mlx5_ib crc32_pclmul ghash_clmulni_intel rapl ib_uverbs ib_core vmw_vmci i2c_piix4
[88463.982930]  dm_multipath ip_tables xfs mlx5_core crc32c_intel ttm vmxnet3 vmw_pvscsi drm_kms_helper mlxfw cec pci_hyperv_intf tls ata_generic drm pata_acpi psample

Decoding the address svc_rdma_sendto+0x26 resolves to line 927 in
net/sunrpc/xprtrdma/svc_rdma_sendto.c

i.e.

927		__be32 *rdma_argp = rctxt->rc_recv_buf;

which shows that somehow, we're getting to svc_rdma_sendto() with a
request that has rqstp->rq_xprt_ctxt == NULL.
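
In other words, the fault is right at the top of the function. Roughly
speaking (paraphrasing from memory; only the quoted line is verbatim,
and the faulting address 0x58 is presumably the offset of rc_recv_buf
within the receive context):

	int svc_rdma_sendto(struct svc_rqst *rqstp)
	{
		struct svc_rdma_recv_ctxt *rctxt = rqstp->rq_xprt_ctxt; /* NULL */
		__be32 *rdma_argp = rctxt->rc_recv_buf;	/* line 927 oopses here */
		...
	}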

I'm having trouble seeing how that can happen, but thought I'd ask in
case you've seen something similar. AFAICS, the code in that area has
not changed since early 2021.

As I understand it, knfsd was in the process of shutting down at the
time, so it is possible there is a connection to that...

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



* Re: Question about RDMA server...
From: Chuck Lever III @ 2022-03-31 15:21 UTC
  To: Trond Myklebust; +Cc: Linux NFS Mailing List



> On Mar 31, 2022, at 11:16 AM, Trond Myklebust <trondmy@hammerspace.com> wrote:
> 
> Hi Chuck,
> 
> 
> I'm seeing a very weird stack trace on one of our systems:
> 
> [... kernel oops and module list snipped; full trace quoted above ...]
> 
> Decoding the address svc_rdma_sendto+0x26 resolves to line 927 in
> net/sunrpc/xprtrdma/svc_rdma_sendto.c
> 
> i.e.
> 
> 927		__be32 *rdma_argp = rctxt->rc_recv_buf;
> 
> which shows that somehow, we're getting to svc_rdma_sendto() with a
> request that has rqstp->rq_xprt_ctxt == NULL.
> 
> I'm having trouble seeing how that can happen, but thought I'd ask in
> case you've seen something similar. AFAICS, the code in that area has
> not changed since early 2021.
> 
> As I understand it, knfsd was in the process of shutting down at the
> time, so it is possible there is a connection to that...

My immediate response is "that's not supposed to happen". I'll give
it some thought, but yeah, I bet the transport sendto code is not
dealing with connection shutdown properly.

And btw, Hammerspace's testing of NFS/RDMA is very much appreciated.

--
Chuck Lever




* Re: Question about RDMA server...
From: Chuck Lever III @ 2022-03-31 15:59 UTC
  To: Trond Myklebust; +Cc: Linux NFS Mailing List



> On Mar 31, 2022, at 11:21 AM, Chuck Lever III <chuck.lever@oracle.com> wrote:
> 
> 
> 
>> On Mar 31, 2022, at 11:16 AM, Trond Myklebust <trondmy@hammerspace.com> wrote:
>> 
>> Hi Chuck,
>> 
>> 
>> I'm seeing a very weird stack trace on one of our systems:
>> 
>> [... kernel oops and module list snipped ...]
>> 
>> Decoding the address svc_rdma_sendto+0x26 resolves to line 927 in
>> net/sunrpc/xprtrdma/svc_rdma_sendto.c
>> 
>> i.e.
>> 
>> 927		__be32 *rdma_argp = rctxt->rc_recv_buf;
>> 
>> which shows that somehow, we're getting to svc_rdma_sendto() with a
>> request that has rqstp->rq_xprt_ctxt == NULL.
>> 
>> I'm having trouble seeing how that can happen, but thought I'd ask in
>> case you've seen something similar. AFAICS, the code in that area has
>> not changed since early 2021.
>> 
>> As I understand it, knfsd was in the process of shutting down at the
>> time, so it is possible there is a connection to that...
> 
> My immediate response is "that's not supposed to happen". I'll give
> it some thought, but yeah, I bet the transport sendto code is not
> dealing with connection shutdown properly.

I still don't see exactly how @rctxt could be NULL here, but
it might be enough to move the deref of @rctxt down past the
svc_xprt_is_dead() test, say just before "*p++ = *rdma_argp;".
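
I.e. something like the following, completely untested, and with
"out_err" standing in for whatever the real error path is:

	struct svc_rdma_recv_ctxt *rctxt = rqstp->rq_xprt_ctxt;
	__be32 *rdma_argp;
	...
	if (svc_xprt_is_dead(xprt))
		goto out_err;

	/* Don't touch the receive context until we know the
	 * transport is still usable. */
	rdma_argp = rctxt->rc_recv_buf;
	...
	*p++ = *rdma_argp;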

In any event, it could be that recent reorganization of generic
transport code might be responsible, or maybe that this is just
a hard race to hit.


--
Chuck Lever




* Re: Question about RDMA server...
From: Trond Myklebust @ 2022-03-31 16:15 UTC
  To: chuck.lever; +Cc: linux-nfs

On Thu, 2022-03-31 at 15:59 +0000, Chuck Lever III wrote:
> 
> 
> > On Mar 31, 2022, at 11:21 AM, Chuck Lever III
> > <chuck.lever@oracle.com> wrote:
> > 
> > 
> > 
> > > On Mar 31, 2022, at 11:16 AM, Trond Myklebust
> > > <trondmy@hammerspace.com> wrote:
> > > 
> > > Hi Chuck,
> > > 
> > > 
> > > I'm seeing a very weird stack trace on one of our systems:
> > > 
> > > [... kernel oops and module list snipped ...]
> > > 
> > > Decoding the address svc_rdma_sendto+0x26 resolves to line 927 in
> > > net/sunrpc/xprtrdma/svc_rdma_sendto.c
> > > 
> > > i.e.
> > > 
> > > 927             __be32 *rdma_argp = rctxt->rc_recv_buf;
> > > 
> > > which shows that somehow, we're getting to svc_rdma_sendto() with
> > > a
> > > request that has rqstp->rq_xprt_ctxt == NULL.
> > > 
> > > I'm having trouble seeing how that can happen, but thought I'd
> > > ask in
> > > case you've seen something similar. AFAICS, the code in that area
> > > has
> > > not changed since early 2021.
> > > 
> > > As I understand it, knfsd was in the process of shutting down at
> > > the
> > > time, so it is possible there is a connection to that...
> > 
> > My immediate response is "that's not supposed to happen". I'll give
> > it some thought, but yeah, I bet the transport sendto code is not
> > dealing with connection shutdown properly.
> 
> I still don't see exactly how @rctxt could be NULL here, but
> it might be enough to move the deref of @rctxt down past the
> svc_xprt_is_dead() test, say just before "*p++ = *rdma_argp;".
> 
> In any event, it could be that recent reorganization of generic
> transport code might be responsible, or maybe that this is just
> a hard race to hit.
> 

Hmm... Here's another thought. What if this were a deferred request
that is being replayed after an upcall to mountd or the idmapper? It
would mean that the synchronous wait in cache_defer_req() failed, so it
is going to be rare, but it could happen on a congested system.

AFAICS, svc_defer() does _not_ save rqstp->rq_xprt_ctxt, so
svc_deferred_recv() won't restore it either.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



* Re: Question about RDMA server...
From: Chuck Lever III @ 2022-03-31 16:20 UTC
  To: Trond Myklebust; +Cc: Linux NFS Mailing List



> On Mar 31, 2022, at 12:15 PM, Trond Myklebust <trondmy@hammerspace.com> wrote:
> 
> On Thu, 2022-03-31 at 15:59 +0000, Chuck Lever III wrote:
>> 
>> 
>>> On Mar 31, 2022, at 11:21 AM, Chuck Lever III
>>> <chuck.lever@oracle.com> wrote:
>>> 
>>> 
>>> 
>>>> On Mar 31, 2022, at 11:16 AM, Trond Myklebust
>>>> <trondmy@hammerspace.com> wrote:
>>>> 
>>>> Hi Chuck,
>>>> 
>>>> 
>>>> I'm seeing a very weird stack trace on one of our systems:
>>>> 
>>>> [... kernel oops and module list snipped ...]
>>>> 
>>>> Decoding the address svc_rdma_sendto+0x26 resolves to line 927 in
>>>> net/sunrpc/xprtrdma/svc_rdma_sendto.c
>>>> 
>>>> i.e.
>>>> 
>>>> 927             __be32 *rdma_argp = rctxt->rc_recv_buf;
>>>> 
>>>> which shows that somehow, we're getting to svc_rdma_sendto() with
>>>> a
>>>> request that has rqstp->rq_xprt_ctxt == NULL.
>>>> 
>>>> I'm having trouble seeing how that can happen, but thought I'd
>>>> ask in
>>>> case you've seen something similar. AFAICS, the code in that area
>>>> has
>>>> not changed since early 2021.
>>>> 
>>>> As I understand it, knfsd was in the process of shutting down at
>>>> the
>>>> time, so it is possible there is a connection to that...
>>> 
>>> My immediate response is "that's not supposed to happen". I'll give
>>> it some thought, but yeah, I bet the transport sendto code is not
>>> dealing with connection shutdown properly.
>> 
>> I still don't see exactly how @rctxt could be NULL here, but
>> it might be enough to move the deref of @rctxt down past the
>> svc_xprt_is_dead() test, say just before "*p++ = *rdma_argp;".
>> 
>> In any event, it could be that recent reorganization of generic
>> transport code might be responsible, or maybe that this is just
>> a hard race to hit.
>> 
> 
> Hmm... Here's another thought. What if this were a deferred request
> that is being replayed after an upcall to mountd or the idmapper? It
> would mean that the synchronous wait in cache_defer_req() failed, so it
> is going to be rare, but it could happen on a congested system.
> 
> AFAICS, svc_defer() does _not_ save rqstp->rq_xprt_ctxt, so
> svc_deferred_recv() won't restore it either.

True, but TCP and UDP both use rq_xprt_ctxt, so wouldn't we have
seen this problem before on a socket transport?

I need to audit code to see if saving rq_xprt_ctxt in svc_defer()
is safe and reasonable to do. Maybe Bruce has a thought.
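
If it does turn out to be safe, I would expect the change to be roughly
along these lines (untested sketch; the new field name is made up):

	/* struct svc_deferred_req grows a pointer: */
	void			*xprt_ctxt;

	/* svc_defer() saves the context: */
	dr->xprt_ctxt = rqstp->rq_xprt_ctxt;

	/* and svc_deferred_recv() restores it: */
	rqstp->rq_xprt_ctxt = dr->xprt_ctxt;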


--
Chuck Lever




* Re: Question about RDMA server...
From: Trond Myklebust @ 2022-03-31 16:24 UTC
  To: chuck.lever; +Cc: linux-nfs

On Thu, 2022-03-31 at 16:20 +0000, Chuck Lever III wrote:
> 
> 
> > On Mar 31, 2022, at 12:15 PM, Trond Myklebust
> > <trondmy@hammerspace.com> wrote:
> > 
> > On Thu, 2022-03-31 at 15:59 +0000, Chuck Lever III wrote:
> > > 
> > > 
> > > > On Mar 31, 2022, at 11:21 AM, Chuck Lever III
> > > > <chuck.lever@oracle.com> wrote:
> > > > 
> > > > 
> > > > 
> > > > > On Mar 31, 2022, at 11:16 AM, Trond Myklebust
> > > > > <trondmy@hammerspace.com> wrote:
> > > > > 
> > > > > Hi Chuck,
> > > > > 
> > > > > 
> > > > > I'm seeing a very weird stack trace on one of our systems:
> > > > > 
> > > > > [... kernel oops and module list snipped ...]
> > > > > 
> > > > > Decoding the address svc_rdma_sendto+0x26 resolves to line
> > > > > 927 in
> > > > > net/sunrpc/xprtrdma/svc_rdma_sendto.c
> > > > > 
> > > > > i.e.
> > > > > 
> > > > > 927             __be32 *rdma_argp = rctxt->rc_recv_buf;
> > > > > 
> > > > > which shows that somehow, we're getting to svc_rdma_sendto()
> > > > > with
> > > > > a
> > > > > request that has rqstp->rq_xprt_ctxt == NULL.
> > > > > 
> > > > > I'm having trouble seeing how that can happen, but thought
> > > > > I'd
> > > > > ask in
> > > > > case you've seen something similar. AFAICS, the code in that
> > > > > area
> > > > > has
> > > > > not changed since early 2021.
> > > > > 
> > > > > As I understand it, knfsd was in the process of shutting down
> > > > > at
> > > > > the
> > > > > time, so it is possible there is a connection to that...
> > > > 
> > > > My immediate response is "that's not supposed to happen". I'll
> > > > give
> > > > it some thought, but yeah, I bet the transport sendto code is
> > > > not
> > > > dealing with connection shutdown properly.
> > > 
> > > I still don't see exactly how @rctxt could be NULL here, but
> > > it might be enough to move the deref of @rctxt down past the
> > > svc_xprt_is_dead() test, say just before "*p++ = *rdma_argp;".
> > > 
> > > In any event, it could be that recent reorganization of generic
> > > transport code might be responsible, or maybe that this is just
> > > a hard race to hit.
> > > 
> > 
> > Hmm... Here's another thought. What if this were a deferred request
> > that is being replayed after an upcall to mountd or the idmapper?
> > It
> > would mean that the synchronous wait in cache_defer_req() failed,
> > so it
> > is going to be rare, but it could happen on a congested system.
> > 
> > AFAICS, svc_defer() does _not_ save rqstp->rq_xprt_ctxt, so
> > svc_deferred_recv() won't restore it either.
> 
> True, but TCP and UDP both use rq_xprt_ctxt, so wouldn't we have
> seen this problem before on a socket transport?

TCP does not set rq_xprt_ctxt, and nobody really uses UDP these days.

> 
> I need to audit code to see if saving rq_xprt_ctxt in svc_defer()
> is safe and reasonable to do. Maybe Bruce has a thought.
> 

It should be safe for the UDP case, AFAICS. I have no opinion as of yet
about how safe it is to do with RDMA.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



* Re: Question about RDMA server...
From: Chuck Lever III @ 2022-03-31 18:53 UTC
  To: Trond Myklebust; +Cc: Linux NFS Mailing List


> On Mar 31, 2022, at 12:24 PM, Trond Myklebust <trondmy@hammerspace.com> wrote:
> 
> On Thu, 2022-03-31 at 16:20 +0000, Chuck Lever III wrote:
>> 
>>> On Mar 31, 2022, at 12:15 PM, Trond Myklebust
>>> <trondmy@hammerspace.com> wrote:
>>> 
>>> Hmm... Here's another thought. What if this were a deferred request
>>> that is being replayed after an upcall to mountd or the idmapper?
>>> It
>>> would mean that the synchronous wait in cache_defer_req() failed,
>>> so it
>>> is going to be rare, but it could happen on a congested system.
>>> 
>>> AFAICS, svc_defer() does _not_ save rqstp->rq_xprt_ctxt, so
>>> svc_deferred_recv() won't restore it either.
>> 
>> True, but TCP and UDP both use rq_xprt_ctxt, so wouldn't we have
>> seen this problem before on a socket transport?
> 
> TCP does not set rq_xprt_ctxt, and nobody really uses UDP these days.
> 
>> I need to audit code to see if saving rq_xprt_ctxt in svc_defer()
>> is safe and reasonable to do. Maybe Bruce has a thought.
> 
> It should be safe for the UDP case, AFAICS. I have no opinion as of yet
> about how safe it is to do with RDMA.

It's plausible that a deferred request could be replayed, but I don't
understand the deferral mechanism enough to know whether the rctxt
would be released before the deferred request could be handled. It
doesn't look like it would, but I could misunderstand something.

There's a longstanding testing gap here: None of my test workloads
appear to force a request deferral. I don't recall Bruce having
such a test either.

It would be nice if we had something that could force the use of the
deferral path, like a command line option for mountd that would cause
it to sleep for several seconds before responding to an upcall. It
might also be done with the kernel's fault injector.


--
Chuck Lever




* Re: Question about RDMA server...
From: Trond Myklebust @ 2022-03-31 20:08 UTC
  To: chuck.lever; +Cc: linux-nfs

On Thu, 2022-03-31 at 18:53 +0000, Chuck Lever III wrote:
> 
> > On Mar 31, 2022, at 12:24 PM, Trond Myklebust
> > <trondmy@hammerspace.com> wrote:
> > 
> > On Thu, 2022-03-31 at 16:20 +0000, Chuck Lever III wrote:
> > > 
> > > > On Mar 31, 2022, at 12:15 PM, Trond Myklebust
> > > > <trondmy@hammerspace.com> wrote:
> > > > 
> > > > Hmm... Here's another thought. What if this were a deferred
> > > > request
> > > > that is being replayed after an upcall to mountd or the
> > > > idmapper?
> > > > It
> > > > would mean that the synchronous wait in cache_defer_req()
> > > > failed,
> > > > so it
> > > > is going to be rare, but it could happen on a congested system.
> > > > 
> > > > AFAICS, svc_defer() does _not_ save rqstp->rq_xprt_ctxt, so
> > > > svc_deferred_recv() won't restore it either.
> > > 
> > > True, but TCP and UDP both use rq_xprt_ctxt, so wouldn't we have
> > > seen this problem before on a socket transport?
> > 
> > TCP does not set rq_xprt_ctxt, and nobody really uses UDP these
> > days.
> > 
> > > I need to audit code to see if saving rq_xprt_ctxt in svc_defer()
> > > is safe and reasonable to do. Maybe Bruce has a thought.
> > 
> > It should be safe for the UDP case, AFAICS. I have no opinion as of
> > yet
> > about how safe it is to do with RDMA.
> 
> It's plausible that a deferred request could be replayed, but I don't
> understand the deferral mechanism enough to know whether the rctxt
> would be released before the deferred request could be handled. It
> doesn't look like it would, but I could misunderstand something.
> 
> There's a longstanding testing gap here: None of my test workloads
> appear to force a request deferral. I don't recall Bruce having
> such a test either.
> 
> It would be nice if we had something that could force the use of the
> deferral path, like a command line option for mountd that would cause
> it to sleep for several seconds before responding to an upcall. It
> might also be done with the kernel's fault injector.
> 

You just need some mechanism that causes svc_get_next_xprt() to set
rqstp->rq_chandle.thread_wait to the value '0'.
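
For testing purposes a quick hack in svc_get_next_xprt() would do, e.g.
something like this (the debug toggle is made up, not an existing knob):

	if (force_deferral)	/* hypothetical debug toggle */
		rqstp->rq_chandle.thread_wait = 0;
	else if (!test_bit(SP_CONGESTED, &pool->sp_flags))
		rqstp->rq_chandle.thread_wait = 5*HZ;
	else
		rqstp->rq_chandle.thread_wait = 1*HZ;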

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



* Re: Question about RDMA server...
From: J. Bruce Fields @ 2022-04-04 15:12 UTC
  To: Trond Myklebust; +Cc: chuck.lever, linux-nfs

On Thu, Mar 31, 2022 at 08:08:59PM +0000, Trond Myklebust wrote:
> On Thu, 2022-03-31 at 18:53 +0000, Chuck Lever III wrote:
> > It's plausible that a deferred request could be replayed, but I don't
> > understand the deferral mechanism enough to know whether the rctxt
> > would be released before the deferred request could be handled. It
> > doesn't look like it would, but I could misunderstand something.
> > 
> > There's a longstanding testing gap here: None of my test workloads
> > appear to force a request deferral. I don't recall Bruce having
> > such a test either.
> > 
> > It would be nice if we had something that could force the use of the
> > deferral path, like a command line option for mountd that would cause
> > it to sleep for several seconds before responding to an upcall. It
> > might also be done with the kernel's fault injector.
> > 
> 
> You just need some mechanism that causes svc_get_next_xprt() to set
> rqstp->rq_chandle.thread_wait to the value '0'.

Yeah, no idea what's going on here, but a way to force the deferral case
sounds like a great idea that'd come in handy some day if not this time.

(Can you hack it by, say, killing mountd, flushing the caches (exportfs
-f), then waiting a few seconds to restart mountd?)

--b.

* Re: Question about RDMA server...
From: Chuck Lever III @ 2022-04-04 15:34 UTC
  To: Trond Myklebust; +Cc: Linux NFS Mailing List



> On Mar 31, 2022, at 4:08 PM, Trond Myklebust <trondmy@hammerspace.com> wrote:
> 
> On Thu, 2022-03-31 at 18:53 +0000, Chuck Lever III wrote:
>> 
>>> On Mar 31, 2022, at 12:24 PM, Trond Myklebust
>>> <trondmy@hammerspace.com> wrote:
>>> 
>>> On Thu, 2022-03-31 at 16:20 +0000, Chuck Lever III wrote:
>>>> 
>>>>> On Mar 31, 2022, at 12:15 PM, Trond Myklebust
>>>>> <trondmy@hammerspace.com> wrote:
>>>>> 
>>>>> Hmm... Here's another thought. What if this were a deferred
>>>>> request
>>>>> that is being replayed after an upcall to mountd or the
>>>>> idmapper?
>>>>> It
>>>>> would mean that the synchronous wait in cache_defer_req()
>>>>> failed,
>>>>> so it
>>>>> is going to be rare, but it could happen on a congested system.
>>>>> 
>>>>> AFAICS, svc_defer() does _not_ save rqstp->rq_xprt_ctxt, so
>>>>> svc_deferred_recv() won't restore it either.
>>>> 
>>>> True, but TCP and UDP both use rq_xprt_ctxt, so wouldn't we have
>>>> seen this problem before on a socket transport?
>>> 
>>> TCP does not set rq_xprt_ctxt, and nobody really uses UDP these
>>> days.
>>> 
>>>> I need to audit code to see if saving rq_xprt_ctxt in svc_defer()
>>>> is safe and reasonable to do. Maybe Bruce has a thought.
>>> 
>>> It should be safe for the UDP case, AFAICS. I have no opinion as of
>>> yet
>>> about how safe it is to do with RDMA.
>> 
>> It's plausible that a deferred request could be replayed, but I don't
>> understand the deferral mechanism enough to know whether the rctxt
>> would be released before the deferred request could be handled. It
>> doesn't look like it would, but I could misunderstand something.
>> 
>> There's a longstanding testing gap here: None of my test workloads
>> appear to force a request deferral. I don't recall Bruce having
>> such a test either.
>> 
>> It would be nice if we had something that could force the use of the
>> deferral path, like a command line option for mountd that would cause
>> it to sleep for several seconds before responding to an upcall. It
>> might also be done with the kernel's fault injector.
> 
> You just need some mechanism that causes svc_get_next_xprt() to set
> rqstp->rq_chandle.thread_wait to the value '0'.

I gotta ask: Why does cache_defer_req() check if thread_wait is zero?
Is there some code in the kernel now that sets it to zero? I see only
these two values:

 784         if (!test_bit(SP_CONGESTED, &pool->sp_flags))
 785                 rqstp->rq_chandle.thread_wait = 5*HZ;
 786         else
 787                 rqstp->rq_chandle.thread_wait = 1*HZ;


--
Chuck Lever




* Re: Question about RDMA server...
From: Trond Myklebust @ 2022-04-04 16:19 UTC
  To: chuck.lever; +Cc: linux-nfs

On Mon, 2022-04-04 at 15:34 +0000, Chuck Lever III wrote:
> 
> 
> > On Mar 31, 2022, at 4:08 PM, Trond Myklebust
> > <trondmy@hammerspace.com> wrote:
> > 
> > On Thu, 2022-03-31 at 18:53 +0000, Chuck Lever III wrote:
> > > 
> > > > On Mar 31, 2022, at 12:24 PM, Trond Myklebust
> > > > <trondmy@hammerspace.com> wrote:
> > > > 
> > > > On Thu, 2022-03-31 at 16:20 +0000, Chuck Lever III wrote:
> > > > > 
> > > > > > On Mar 31, 2022, at 12:15 PM, Trond Myklebust
> > > > > > <trondmy@hammerspace.com> wrote:
> > > > > > 
> > > > > > Hmm... Here's another thought. What if this were a deferred
> > > > > > request
> > > > > > that is being replayed after an upcall to mountd or the
> > > > > > idmapper?
> > > > > > It
> > > > > > would mean that the synchronous wait in cache_defer_req()
> > > > > > failed,
> > > > > > so it
> > > > > > is going to be rare, but it could happen on a congested
> > > > > > system.
> > > > > > 
> > > > > > AFAICS, svc_defer() does _not_ save rqstp->rq_xprt_ctxt, so
> > > > > > svc_deferred_recv() won't restore it either.
> > > > > 
> > > > > True, but TCP and UDP both use rq_xprt_ctxt, so wouldn't we
> > > > > have
> > > > > seen this problem before on a socket transport?
> > > > 
> > > > TCP does not set rq_xprt_ctxt, and nobody really uses UDP these
> > > > days.
> > > > 
> > > > > I need to audit code to see if saving rq_xprt_ctxt in
> > > > > svc_defer()
> > > > > is safe and reasonable to do. Maybe Bruce has a thought.
> > > > 
> > > > It should be safe for the UDP case, AFAICS. I have no opinion
> > > > as of
> > > > yet
> > > > about how safe it is to do with RDMA.
> > > 
> > > It's plausible that a deferred request could be replayed, but I
> > > don't
> > > understand the deferral mechanism enough to know whether the
> > > rctxt
> > > would be released before the deferred request could be handled.
> > > It
> > > doesn't look like it would, but I could misunderstand something.
> > > 
> > > There's a longstanding testing gap here: None of my test
> > > workloads
> > > appear to force a request deferral. I don't recall Bruce having
> > > such a test either.
> > > 
> > > It would be nice if we had something that could force the use of
> > > the
> > > deferral path, like a command line option for mountd that would
> > > cause
> > > it to sleep for several seconds before responding to an upcall.
> > > It
> > > might also be done with the kernel's fault injector.
> > 
> > You just need some mechanism that causes svc_get_next_xprt() to set
> > rqstp->rq_chandle.thread_wait to the value '0'.
> 
> I gotta ask: Why does cache_defer_req() check if thread_wait is zero?
> Is there some code in the kernel now that sets it to zero? I see only
> these two values:
> 
>  784         if (!test_bit(SP_CONGESTED, &pool->sp_flags))
>  785                 rqstp->rq_chandle.thread_wait = 5*HZ;
>  786         else
>  787                 rqstp->rq_chandle.thread_wait = 1*HZ;


That test goes all the way back to commit f16b6e8d838b ("sunrpc/cache:
allow threads to block while waiting for cache update."), so probably
more of a question for Neil.
I don't see any immediate problems with removing it: even if someone
does set a value of 0, the wait code itself should cope.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



* Re: Question about RDMA server...
From: NeilBrown @ 2022-04-05  5:00 UTC
  To: Trond Myklebust; +Cc: chuck.lever, linux-nfs

On Tue, 05 Apr 2022, Trond Myklebust wrote:
> On Mon, 2022-04-04 at 15:34 +0000, Chuck Lever III wrote:
> > 
....
> > 
> > I gotta ask: Why does cache_defer_req() check if thread_wait is zero?
> > Is there some code in the kernel now that sets it to zero? I see only
> > these two values:
> > 
> >  784         if (!test_bit(SP_CONGESTED, &pool->sp_flags))
> >  785                 rqstp->rq_chandle.thread_wait = 5*HZ;
> >  786         else
> >  787                 rqstp->rq_chandle.thread_wait = 1*HZ;
> 
> 
> That test goes all the way back to commit f16b6e8d838b ("sunrpc/cache:
> allow threads to block while waiting for cache update."), so probably
> more of a question for Neil.

12 years ago...  The commit message suggests that the "underlying
transport" might not approve of sleeping, but I cannot think what that
might mean.  Probably it seemed to mean something earlier in the design
process, but I discarded the idea without discarding the code.

> I don't see any immediate problems with removing it: even if someone
> does set a value of 0, the wait code itself should cope.
> 
Agreed.

NeilBrown
