linux-nfs.vger.kernel.org archive mirror
* [PATCH RFC] svcrdma: Ignore source port when computing DRC hash
@ 2019-06-05 12:15 Chuck Lever
  2019-06-05 15:57 ` Olga Kornievskaia
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Chuck Lever @ 2019-06-05 12:15 UTC (permalink / raw)
  To: linux-rdma, linux-nfs

The DRC is not working at all after an RPC/RDMA transport reconnect.
The problem is that the new connection uses a different source port,
which defeats DRC hash.

An NFS/RDMA client's source port is meaningless for RDMA transports.
The transport layer typically sets the source port value on the
connection to a random ephemeral port. The server already ignores it
for the "secure port" check. See commit 16e4d93f6de7 ("NFSD: Ignore
client's source port on RDMA transports").

I'm not sure why I never noticed this before.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: stable@vger.kernel.org
---
 net/sunrpc/xprtrdma/svc_rdma_transport.c |    7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index 027a3b0..1b3700b 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -211,9 +211,14 @@ static void handle_connect_req(struct rdma_cm_id *new_cma_id,
 	/* Save client advertised inbound read limit for use later in accept. */
 	newxprt->sc_ord = param->initiator_depth;
 
-	/* Set the local and remote addresses in the transport */
 	sa = (struct sockaddr *)&newxprt->sc_cm_id->route.addr.dst_addr;
 	svc_xprt_set_remote(&newxprt->sc_xprt, sa, svc_addr_len(sa));
+	/* The remote port is arbitrary and not under the control of the
+	 * ULP. Set it to a fixed value so that the DRC continues to work
+	 * after a reconnect.
+	 */
+	rpc_set_port((struct sockaddr *)&newxprt->sc_xprt.xpt_remote, 0);
+
 	sa = (struct sockaddr *)&newxprt->sc_cm_id->route.addr.src_addr;
 	svc_xprt_set_local(&newxprt->sc_xprt, sa, svc_addr_len(sa));
 



* Re: [PATCH RFC] svcrdma: Ignore source port when computing DRC hash
  2019-06-05 12:15 [PATCH RFC] svcrdma: Ignore source port when computing DRC hash Chuck Lever
@ 2019-06-05 15:57 ` Olga Kornievskaia
  2019-06-05 17:28   ` Chuck Lever
  2019-06-05 16:43 ` Tom Talpey
  2019-06-06 13:08 ` Sasha Levin
  2 siblings, 1 reply; 19+ messages in thread
From: Olga Kornievskaia @ 2019-06-05 15:57 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-rdma, linux-nfs

On Wed, Jun 5, 2019 at 8:15 AM Chuck Lever <chuck.lever@oracle.com> wrote:
>
> The DRC is not working at all after an RPC/RDMA transport reconnect.
> The problem is that the new connection uses a different source port,
> which defeats DRC hash.
>
> An NFS/RDMA client's source port is meaningless for RDMA transports.
> The transport layer typically sets the source port value on the
> connection to a random ephemeral port. The server already ignores it
> for the "secure port" check. See commit 16e4d93f6de7 ("NFSD: Ignore
> client's source port on RDMA transports").
>
> I'm not sure why I never noticed this before.

Hi Chuck,

I have a question: is the reason for choosing this fix, as opposed to
fixing the client, that it's the server's responsibility to design the
DRC differently for NFSoRDMA?

>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> Cc: stable@vger.kernel.org
> ---
>  net/sunrpc/xprtrdma/svc_rdma_transport.c |    7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
> index 027a3b0..1b3700b 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
> @@ -211,9 +211,14 @@ static void handle_connect_req(struct rdma_cm_id *new_cma_id,
>         /* Save client advertised inbound read limit for use later in accept. */
>         newxprt->sc_ord = param->initiator_depth;
>
> -       /* Set the local and remote addresses in the transport */
>         sa = (struct sockaddr *)&newxprt->sc_cm_id->route.addr.dst_addr;
>         svc_xprt_set_remote(&newxprt->sc_xprt, sa, svc_addr_len(sa));
> +       /* The remote port is arbitrary and not under the control of the
> +        * ULP. Set it to a fixed value so that the DRC continues to work
> +        * after a reconnect.
> +        */
> +       rpc_set_port((struct sockaddr *)&newxprt->sc_xprt.xpt_remote, 0);
> +
>         sa = (struct sockaddr *)&newxprt->sc_cm_id->route.addr.src_addr;
>         svc_xprt_set_local(&newxprt->sc_xprt, sa, svc_addr_len(sa));
>
>


* Re: [PATCH RFC] svcrdma: Ignore source port when computing DRC hash
  2019-06-05 12:15 [PATCH RFC] svcrdma: Ignore source port when computing DRC hash Chuck Lever
  2019-06-05 15:57 ` Olga Kornievskaia
@ 2019-06-05 16:43 ` Tom Talpey
  2019-06-05 17:25   ` Chuck Lever
  2019-06-06 13:08 ` Sasha Levin
  2 siblings, 1 reply; 19+ messages in thread
From: Tom Talpey @ 2019-06-05 16:43 UTC (permalink / raw)
  To: Chuck Lever, linux-rdma, linux-nfs

On 6/5/2019 8:15 AM, Chuck Lever wrote:
> The DRC is not working at all after an RPC/RDMA transport reconnect.
> The problem is that the new connection uses a different source port,
> which defeats DRC hash.
> 
> An NFS/RDMA client's source port is meaningless for RDMA transports.
> The transport layer typically sets the source port value on the
> connection to a random ephemeral port. The server already ignores it
> for the "secure port" check. See commit 16e4d93f6de7 ("NFSD: Ignore
> client's source port on RDMA transports").

Where does the entropy come from, then, for the server to not
match other requests from other mount points on this same client?
Any time an XID happens to match on a second mount, it will trigger
incorrect server processing, won't it? And since RDMA is capable of
such high IOPS, the likelihood seems rather high. Missing the cache
might actually be safer than hitting, in this case.

Tom.

> I'm not sure why I never noticed this before.
> 
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> Cc: stable@vger.kernel.org
> ---
>   net/sunrpc/xprtrdma/svc_rdma_transport.c |    7 ++++++-
>   1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
> index 027a3b0..1b3700b 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
> @@ -211,9 +211,14 @@ static void handle_connect_req(struct rdma_cm_id *new_cma_id,
>   	/* Save client advertised inbound read limit for use later in accept. */
>   	newxprt->sc_ord = param->initiator_depth;
>   
> -	/* Set the local and remote addresses in the transport */
>   	sa = (struct sockaddr *)&newxprt->sc_cm_id->route.addr.dst_addr;
>   	svc_xprt_set_remote(&newxprt->sc_xprt, sa, svc_addr_len(sa));
> +	/* The remote port is arbitrary and not under the control of the
> +	 * ULP. Set it to a fixed value so that the DRC continues to work
> +	 * after a reconnect.
> +	 */
> +	rpc_set_port((struct sockaddr *)&newxprt->sc_xprt.xpt_remote, 0);
> +
>   	sa = (struct sockaddr *)&newxprt->sc_cm_id->route.addr.src_addr;
>   	svc_xprt_set_local(&newxprt->sc_xprt, sa, svc_addr_len(sa));
>   
> 
> 
> 


* Re: [PATCH RFC] svcrdma: Ignore source port when computing DRC hash
  2019-06-05 16:43 ` Tom Talpey
@ 2019-06-05 17:25   ` Chuck Lever
  2019-06-10 14:50     ` Tom Talpey
  0 siblings, 1 reply; 19+ messages in thread
From: Chuck Lever @ 2019-06-05 17:25 UTC (permalink / raw)
  To: Tom Talpey; +Cc: linux-rdma, Linux NFS Mailing List

Hi Tom-

> On Jun 5, 2019, at 12:43 PM, Tom Talpey <tom@talpey.com> wrote:
> 
> On 6/5/2019 8:15 AM, Chuck Lever wrote:
>> The DRC is not working at all after an RPC/RDMA transport reconnect.
>> The problem is that the new connection uses a different source port,
>> which defeats DRC hash.
>> 
>> An NFS/RDMA client's source port is meaningless for RDMA transports.
>> The transport layer typically sets the source port value on the
>> connection to a random ephemeral port. The server already ignores it
>> for the "secure port" check. See commit 16e4d93f6de7 ("NFSD: Ignore
>> client's source port on RDMA transports").
> 
> Where does the entropy come from, then, for the server to not
> match other requests from other mount points on this same client?

The first ~200 bytes of each RPC Call message.

[ Note that this has some fun ramifications for calls with small
RPC headers that use Read chunks. ]


> Any time an XID happens to match on a second mount, it will trigger
> incorrect server processing, won't it?

Not a risk for clients that use only a single transport per
client-server pair.


> And since RDMA is capable of
> such high IOPS, the likelihood seems rather high.

Only when the server's durable storage is slow enough to cause
some RPC requests to have extremely high latency.

And, most clients use an atomic counter for their XIDs, so they
are also likely to wrap that counter over some long-pending RPC
request.

The only real answer here is NFSv4 sessions.


> Missing the cache
> might actually be safer than hitting, in this case.

Remember that _any_ retransmit on RPC/RDMA (and that includes
NFSv3) requires a fresh connection to reset credit accounting
due to the lost half of the RPC Call/Reply pair.

I can very quickly reproduce bad (non-deterministic) behavior
by running a software build on an NFSv3 on RDMA mount point
with disconnect injection. If the DRC issue is addressed, the
software build runs to completion.

IMO we can't leave things the way they are.


> Tom.
> 
>> I'm not sure why I never noticed this before.
>> 
>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>> Cc: stable@vger.kernel.org
>> ---
>>  net/sunrpc/xprtrdma/svc_rdma_transport.c |    7 ++++++-
>>  1 file changed, 6 insertions(+), 1 deletion(-)
>> 
>> diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
>> index 027a3b0..1b3700b 100644
>> --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
>> +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
>> @@ -211,9 +211,14 @@ static void handle_connect_req(struct rdma_cm_id *new_cma_id,
>>  	/* Save client advertised inbound read limit for use later in accept. */
>>  	newxprt->sc_ord = param->initiator_depth;
>> 
>> -	/* Set the local and remote addresses in the transport */
>>  	sa = (struct sockaddr *)&newxprt->sc_cm_id->route.addr.dst_addr;
>>  	svc_xprt_set_remote(&newxprt->sc_xprt, sa, svc_addr_len(sa));
>> +	/* The remote port is arbitrary and not under the control of the
>> +	 * ULP. Set it to a fixed value so that the DRC continues to work
>> +	 * after a reconnect.
>> +	 */
>> +	rpc_set_port((struct sockaddr *)&newxprt->sc_xprt.xpt_remote, 0);
>> +
>>  	sa = (struct sockaddr *)&newxprt->sc_cm_id->route.addr.src_addr;
>>  	svc_xprt_set_local(&newxprt->sc_xprt, sa, svc_addr_len(sa));
>> 
>> 
>> 
>> 

--
Chuck Lever





* Re: [PATCH RFC] svcrdma: Ignore source port when computing DRC hash
  2019-06-05 15:57 ` Olga Kornievskaia
@ 2019-06-05 17:28   ` Chuck Lever
  2019-06-06 18:13     ` Olga Kornievskaia
  0 siblings, 1 reply; 19+ messages in thread
From: Chuck Lever @ 2019-06-05 17:28 UTC (permalink / raw)
  To: Olga Kornievskaia; +Cc: linux-rdma, Linux NFS Mailing List



> On Jun 5, 2019, at 11:57 AM, Olga Kornievskaia <aglo@umich.edu> wrote:
> 
> On Wed, Jun 5, 2019 at 8:15 AM Chuck Lever <chuck.lever@oracle.com> wrote:
>> 
>> The DRC is not working at all after an RPC/RDMA transport reconnect.
>> The problem is that the new connection uses a different source port,
>> which defeats DRC hash.
>> 
>> An NFS/RDMA client's source port is meaningless for RDMA transports.
>> The transport layer typically sets the source port value on the
>> connection to a random ephemeral port. The server already ignores it
>> for the "secure port" check. See commit 16e4d93f6de7 ("NFSD: Ignore
>> client's source port on RDMA transports").
>> 
>> I'm not sure why I never noticed this before.
> 
> Hi Chuck,
> 
> I have a question: is the reason for choosing this fix, as opposed to
> fixing the client, that it's the server's responsibility to design the
> DRC differently for NFSoRDMA?

The server DRC implementation is not specified in any standard.
The server implementation can use whatever works the best. The
current Linux DRC implementation is useless for NFS/RDMA, because
the clients have to disconnect before they send retransmissions.

If someone knows a way that a client side RDMA consumer can specify
the source port that is presented to a server, then we can make
that change too.


>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>> Cc: stable@vger.kernel.org
>> ---
>> net/sunrpc/xprtrdma/svc_rdma_transport.c |    7 ++++++-
>> 1 file changed, 6 insertions(+), 1 deletion(-)
>> 
>> diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
>> index 027a3b0..1b3700b 100644
>> --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
>> +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
>> @@ -211,9 +211,14 @@ static void handle_connect_req(struct rdma_cm_id *new_cma_id,
>>        /* Save client advertised inbound read limit for use later in accept. */
>>        newxprt->sc_ord = param->initiator_depth;
>> 
>> -       /* Set the local and remote addresses in the transport */
>>        sa = (struct sockaddr *)&newxprt->sc_cm_id->route.addr.dst_addr;
>>        svc_xprt_set_remote(&newxprt->sc_xprt, sa, svc_addr_len(sa));
>> +       /* The remote port is arbitrary and not under the control of the
>> +        * ULP. Set it to a fixed value so that the DRC continues to work
>> +        * after a reconnect.
>> +        */
>> +       rpc_set_port((struct sockaddr *)&newxprt->sc_xprt.xpt_remote, 0);
>> +
>>        sa = (struct sockaddr *)&newxprt->sc_cm_id->route.addr.src_addr;
>>        svc_xprt_set_local(&newxprt->sc_xprt, sa, svc_addr_len(sa));
>> 
>> 

--
Chuck Lever





* Re: [PATCH RFC] svcrdma: Ignore source port when computing DRC hash
  2019-06-05 12:15 [PATCH RFC] svcrdma: Ignore source port when computing DRC hash Chuck Lever
  2019-06-05 15:57 ` Olga Kornievskaia
  2019-06-05 16:43 ` Tom Talpey
@ 2019-06-06 13:08 ` Sasha Levin
  2019-06-06 13:24   ` Chuck Lever
  2 siblings, 1 reply; 19+ messages in thread
From: Sasha Levin @ 2019-06-06 13:08 UTC (permalink / raw)
  To: Sasha Levin, Chuck Lever, linux-rdma, linux-nfs; +Cc: stable, stable


Hi,

[This is an automated email]

This commit has been processed because it contains a -stable tag.
The stable tag indicates that it's relevant for the following trees: all

The bot has tested the following trees: v5.1.7, v5.0.21, v4.19.48, v4.14.123, v4.9.180, v4.4.180.

v5.1.7: Build OK!
v5.0.21: Build OK!
v4.19.48: Build OK!
v4.14.123: Build OK!
v4.9.180: Build failed! Errors:
    net/sunrpc/xprtrdma/svc_rdma_transport.c:712:2: error: implicit declaration of function ‘rpc_set_port’; did you mean ‘rpc_net_ns’? [-Werror=implicit-function-declaration]

v4.4.180: Build failed! Errors:
    net/sunrpc/xprtrdma/svc_rdma_transport.c:635:2: error: implicit declaration of function ‘rpc_set_port’; did you mean ‘rpc_net_ns’? [-Werror=implicit-function-declaration]


How should we proceed with this patch?

--
Thanks,
Sasha


* Re: [PATCH RFC] svcrdma: Ignore source port when computing DRC hash
  2019-06-06 13:08 ` Sasha Levin
@ 2019-06-06 13:24   ` Chuck Lever
  0 siblings, 0 replies; 19+ messages in thread
From: Chuck Lever @ 2019-06-06 13:24 UTC (permalink / raw)
  To: Sasha Levin; +Cc: linux-rdma, Linux NFS Mailing List, stable



> On Jun 6, 2019, at 9:08 AM, Sasha Levin <sashal@kernel.org> wrote:
> 
> Hi,
> 
> [This is an automated email]
> 
> This commit has been processed because it contains a -stable tag.
> The stable tag indicates that it's relevant for the following trees: all
> 
> The bot has tested the following trees: v5.1.7, v5.0.21, v4.19.48, v4.14.123, v4.9.180, v4.4.180.
> 
> v5.1.7: Build OK!
> v5.0.21: Build OK!
> v4.19.48: Build OK!
> v4.14.123: Build OK!
> v4.9.180: Build failed! Errors:
>    net/sunrpc/xprtrdma/svc_rdma_transport.c:712:2: error: implicit declaration of function ‘rpc_set_port’; did you mean ‘rpc_net_ns’? [-Werror=implicit-function-declaration]
> 
> v4.4.180: Build failed! Errors:
>    net/sunrpc/xprtrdma/svc_rdma_transport.c:635:2: error: implicit declaration of function ‘rpc_set_port’; did you mean ‘rpc_net_ns’? [-Werror=implicit-function-declaration]
> 
> 
> How should we proceed with this patch?

If the review completes without objection, I will resubmit
this patch with an updated Cc: . Thanks for testing!

--
Chuck Lever





* Re: [PATCH RFC] svcrdma: Ignore source port when computing DRC hash
  2019-06-05 17:28   ` Chuck Lever
@ 2019-06-06 18:13     ` Olga Kornievskaia
  2019-06-06 18:33       ` Chuck Lever
  0 siblings, 1 reply; 19+ messages in thread
From: Olga Kornievskaia @ 2019-06-06 18:13 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-rdma, Linux NFS Mailing List

On Wed, Jun 5, 2019 at 1:28 PM Chuck Lever <chuck.lever@oracle.com> wrote:
>
>
>
> > On Jun 5, 2019, at 11:57 AM, Olga Kornievskaia <aglo@umich.edu> wrote:
> >
> > On Wed, Jun 5, 2019 at 8:15 AM Chuck Lever <chuck.lever@oracle.com> wrote:
> >>
> >> The DRC is not working at all after an RPC/RDMA transport reconnect.
> >> The problem is that the new connection uses a different source port,
> >> which defeats DRC hash.
> >>
> >> An NFS/RDMA client's source port is meaningless for RDMA transports.
> >> The transport layer typically sets the source port value on the
> >> connection to a random ephemeral port. The server already ignores it
> >> for the "secure port" check. See commit 16e4d93f6de7 ("NFSD: Ignore
> >> client's source port on RDMA transports").
> >>
> >> I'm not sure why I never noticed this before.
> >
> > Hi Chuck,
> >
> > I have a question: is the reason for choosing this fix, as opposed to
> > fixing the client, that it's the server's responsibility to design the
> > DRC differently for NFSoRDMA?
>
> The server DRC implementation is not specified in any standard.
> The server implementation can use whatever works the best. The
> current Linux DRC implementation is useless for NFS/RDMA, because
> the clients have to disconnect before they send retransmissions.
>
> If someone knows a way that a client side RDMA consumer can specify
> the source port that is presented to a server, then we can make
> that change too.

OK, I see your point about this being difficult on the client. Various
implementations take different approaches even on RoCE itself:
1. software RoCE does
        /* pick a source UDP port number for this QP based on
         * the source QPN. this spreads traffic for different QPs
         * across different NIC RX queues (while using a single
         * flow for a given QP to maintain packet order).
         * the port number must be in the Dynamic Ports range
         * (0xc000 - 0xffff).
         */
        qp->src_port = RXE_ROCE_V2_SPORT +
                (hash_32_generic(qp_num(qp), 14) & 0x3fff);

When I allow the connection to be re-established, I can confirm that a
new UDP source port is chosen.

2. bnxt_re (Broadcom NetXtreme) seems to always use the same source port:
qp->qp1_hdr.udp.sport = htons(0x8CD1);

3. mlx4 seems to always use the same source port:
sqp->ud_header.udp.sport = htons(MLX4_ROCEV2_QP1_SPORT);

4. mlx5 also seems to pick the same port:
return cpu_to_be16(MLX5_CAP_ROCE(dev->mdev, r_roce_min_src_udp_port));

I have a CX5 card and I see that the source port is always 49513. So
if you use Mellanox cards or this other card, DRC implementations are
safe?

I also see that nothing in the verbs API, through which we interact
with the RDMA drivers, will allow us to set the port. Unless we can
ask the Linux implementation to augment some structures to allow us
to set and query that port, or is that unreasonable because it's not
in the standard API?

>
>
> >> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> >> Cc: stable@vger.kernel.org
> >> ---
> >> net/sunrpc/xprtrdma/svc_rdma_transport.c |    7 ++++++-
> >> 1 file changed, 6 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
> >> index 027a3b0..1b3700b 100644
> >> --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
> >> +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
> >> @@ -211,9 +211,14 @@ static void handle_connect_req(struct rdma_cm_id *new_cma_id,
> >>        /* Save client advertised inbound read limit for use later in accept. */
> >>        newxprt->sc_ord = param->initiator_depth;
> >>
> >> -       /* Set the local and remote addresses in the transport */
> >>        sa = (struct sockaddr *)&newxprt->sc_cm_id->route.addr.dst_addr;
> >>        svc_xprt_set_remote(&newxprt->sc_xprt, sa, svc_addr_len(sa));
> >> +       /* The remote port is arbitrary and not under the control of the
> >> +        * ULP. Set it to a fixed value so that the DRC continues to work
> >> +        * after a reconnect.
> >> +        */
> >> +       rpc_set_port((struct sockaddr *)&newxprt->sc_xprt.xpt_remote, 0);
> >> +
> >>        sa = (struct sockaddr *)&newxprt->sc_cm_id->route.addr.src_addr;
> >>        svc_xprt_set_local(&newxprt->sc_xprt, sa, svc_addr_len(sa));
> >>
> >>
>
> --
> Chuck Lever
>
>
>


* Re: [PATCH RFC] svcrdma: Ignore source port when computing DRC hash
  2019-06-06 18:13     ` Olga Kornievskaia
@ 2019-06-06 18:33       ` Chuck Lever
  2019-06-07 15:43         ` Olga Kornievskaia
  0 siblings, 1 reply; 19+ messages in thread
From: Chuck Lever @ 2019-06-06 18:33 UTC (permalink / raw)
  To: Olga Kornievskaia; +Cc: linux-rdma, Linux NFS Mailing List



> On Jun 6, 2019, at 2:13 PM, Olga Kornievskaia <aglo@umich.edu> wrote:
> 
> On Wed, Jun 5, 2019 at 1:28 PM Chuck Lever <chuck.lever@oracle.com> wrote:
>> 
>> 
>> 
>>> On Jun 5, 2019, at 11:57 AM, Olga Kornievskaia <aglo@umich.edu> wrote:
>>> 
>>> On Wed, Jun 5, 2019 at 8:15 AM Chuck Lever <chuck.lever@oracle.com> wrote:
>>>> 
>>>> The DRC is not working at all after an RPC/RDMA transport reconnect.
>>>> The problem is that the new connection uses a different source port,
>>>> which defeats DRC hash.
>>>> 
>>>> An NFS/RDMA client's source port is meaningless for RDMA transports.
>>>> The transport layer typically sets the source port value on the
>>>> connection to a random ephemeral port. The server already ignores it
>>>> for the "secure port" check. See commit 16e4d93f6de7 ("NFSD: Ignore
>>>> client's source port on RDMA transports").
>>>> 
>>>> I'm not sure why I never noticed this before.
>>> 
>>> Hi Chuck,
>>> 
>>> I have a question: is the reason for choosing this fix, as opposed to
>>> fixing the client, that it's the server's responsibility to design the
>>> DRC differently for NFSoRDMA?
>> 
>> The server DRC implementation is not specified in any standard.
>> The server implementation can use whatever works the best. The
>> current Linux DRC implementation is useless for NFS/RDMA, because
>> the clients have to disconnect before they send retransmissions.
>> 
>> If someone knows a way that a client side RDMA consumer can specify
>> the source port that is presented to a server, then we can make
>> that change too.
> 
> OK, I see your point about this being difficult on the client. Various
> implementations take different approaches even on RoCE itself:
> 1. software RoCE does
>        /* pick a source UDP port number for this QP based on
>         * the source QPN. this spreads traffic for different QPs
>         * across different NIC RX queues (while using a single
>         * flow for a given QP to maintain packet order).
>         * the port number must be in the Dynamic Ports range
>         * (0xc000 - 0xffff).
>         */
>        qp->src_port = RXE_ROCE_V2_SPORT +
>                (hash_32_generic(qp_num(qp), 14) & 0x3fff);
> 
> When I allow the connection to be re-established, I can confirm that a
> new UDP source port is chosen.
> 
> 2. bnxt_re (Broadcom NetXtreme) seems to always use the same source port:
> qp->qp1_hdr.udp.sport = htons(0x8CD1);
> 
> 3. mlx4 seems to always use the same source port:
> sqp->ud_header.udp.sport = htons(MLX4_ROCEV2_QP1_SPORT);
> 
> 4. mlx5 also seems to pick the same port:
> return cpu_to_be16(MLX5_CAP_ROCE(dev->mdev, r_roce_min_src_udp_port));
> 
> I have a CX5 card and I see that the source port is always 49513. So
> if you use Mellanox cards or this other card, DRC implementations are
> safe?

Thanks for looking into this.

Not sure that's the same source port that is visible to the
NFS server's transport accept logic. That's the one that
matters to the DRC. Certainly IB fabrics wouldn't have an IP
source port like this.

For RoCE (and perhaps iWARP) that port number would be the
same for all transports from the client. Still not useful
for adding entropy to a DRC entry hash.

@@ -211,9 +211,14 @@ static void handle_connect_req(struct rdma_cm_id *new_cma_id,
        /* Save client advertised inbound read limit for use later in accept. */
        newxprt->sc_ord = param->initiator_depth;
 
        /* Set the local and remote addresses in the transport */
        sa = (struct sockaddr *)&newxprt->sc_cm_id->route.addr.dst_addr;

Can you add a printk to your server to show the port value
in cm_id->route.addr.dst_addr?


> I also see that nothing in the verbs API, through which we
> interact with the RDMA drivers, will allow us to set the port.

I suspect this would be part of the RDMA Connection Manager
interface, not part of the RDMA driver code.


> Unless
> we can ask the Linux implementation to augment some structures to
> allow us to set and query that port, or is that unreasonable because
> it's not in the standard API?
> 
>> 
>> 
>>>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>>>> Cc: stable@vger.kernel.org
>>>> ---
>>>> net/sunrpc/xprtrdma/svc_rdma_transport.c |    7 ++++++-
>>>> 1 file changed, 6 insertions(+), 1 deletion(-)
>>>> 
>>>> diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
>>>> index 027a3b0..1b3700b 100644
>>>> --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
>>>> +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
>>>> @@ -211,9 +211,14 @@ static void handle_connect_req(struct rdma_cm_id *new_cma_id,
>>>>       /* Save client advertised inbound read limit for use later in accept. */
>>>>       newxprt->sc_ord = param->initiator_depth;
>>>> 
>>>> -       /* Set the local and remote addresses in the transport */
>>>>       sa = (struct sockaddr *)&newxprt->sc_cm_id->route.addr.dst_addr;
>>>>       svc_xprt_set_remote(&newxprt->sc_xprt, sa, svc_addr_len(sa));
>>>> +       /* The remote port is arbitrary and not under the control of the
>>>> +        * ULP. Set it to a fixed value so that the DRC continues to work
>>>> +        * after a reconnect.
>>>> +        */
>>>> +       rpc_set_port((struct sockaddr *)&newxprt->sc_xprt.xpt_remote, 0);
>>>> +
>>>>       sa = (struct sockaddr *)&newxprt->sc_cm_id->route.addr.src_addr;
>>>>       svc_xprt_set_local(&newxprt->sc_xprt, sa, svc_addr_len(sa));
>>>> 
>>>> 
>> 
>> --
>> Chuck Lever

--
Chuck Lever





* Re: [PATCH RFC] svcrdma: Ignore source port when computing DRC hash
  2019-06-06 18:33       ` Chuck Lever
@ 2019-06-07 15:43         ` Olga Kornievskaia
  2019-06-10 14:38           ` Tom Talpey
  0 siblings, 1 reply; 19+ messages in thread
From: Olga Kornievskaia @ 2019-06-07 15:43 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-rdma, Linux NFS Mailing List

On Thu, Jun 6, 2019 at 2:33 PM Chuck Lever <chuck.lever@oracle.com> wrote:
>
>
>
> > On Jun 6, 2019, at 2:13 PM, Olga Kornievskaia <aglo@umich.edu> wrote:
> >
> > On Wed, Jun 5, 2019 at 1:28 PM Chuck Lever <chuck.lever@oracle.com> wrote:
> >>
> >>
> >>
> >>> On Jun 5, 2019, at 11:57 AM, Olga Kornievskaia <aglo@umich.edu> wrote:
> >>>
> >>> On Wed, Jun 5, 2019 at 8:15 AM Chuck Lever <chuck.lever@oracle.com> wrote:
> >>>>
> >>>> The DRC is not working at all after an RPC/RDMA transport reconnect.
> >>>> The problem is that the new connection uses a different source port,
> >>>> which defeats DRC hash.
> >>>>
> >>>> An NFS/RDMA client's source port is meaningless for RDMA transports.
> >>>> The transport layer typically sets the source port value on the
> >>>> connection to a random ephemeral port. The server already ignores it
> >>>> for the "secure port" check. See commit 16e4d93f6de7 ("NFSD: Ignore
> >>>> client's source port on RDMA transports").
> >>>>
> >>>> I'm not sure why I never noticed this before.
> >>>
> >>> Hi Chuck,
> >>>
> >>> I have a question: is the reason for choosing this fix, as opposed to
> >>> fixing the client, that it's the server's responsibility to design the
> >>> DRC differently for NFSoRDMA?
> >>
> >> The server DRC implementation is not specified in any standard.
> >> The server implementation can use whatever works the best. The
> >> current Linux DRC implementation is useless for NFS/RDMA, because
> >> the clients have to disconnect before they send retransmissions.
> >>
> >> If someone knows a way that a client side RDMA consumer can specify
> >> the source port that is presented to a server, then we can make
> >> that change too.
> >
> > OK, I see your point about this being difficult on the client. Various
> > implementations take different approaches even on RoCE itself:
> > 1. software RoCE does
> >        /* pick a source UDP port number for this QP based on
> >         * the source QPN. this spreads traffic for different QPs
> >         * across different NIC RX queues (while using a single
> >         * flow for a given QP to maintain packet order).
> >         * the port number must be in the Dynamic Ports range
> >         * (0xc000 - 0xffff).
> >         */
> >        qp->src_port = RXE_ROCE_V2_SPORT +
> >                (hash_32_generic(qp_num(qp), 14) & 0x3fff);
> >
> > When I allow the connection to be re-established, I can confirm that a
> > new UDP source port is chosen.
> >
> > 2. bnxt_re (Broadcom NetXtreme) seems to always use the same source port:
> > qp->qp1_hdr.udp.sport = htons(0x8CD1);
> >
> > 3. mlx4 seems to always use the same source port:
> > sqp->ud_header.udp.sport = htons(MLX4_ROCEV2_QP1_SPORT);
> >
> > 4. mlx5 also seems to pick the same port:
> > return cpu_to_be16(MLX5_CAP_ROCE(dev->mdev, r_roce_min_src_udp_port));
> >
> > I have a CX5 card and I see that the source port is always 49513. So
> > if you use Mellanox cards or this other card, DRC implementations are
> > safe?
>
> Thanks for looking into this.
>
> Not sure that's the same source port that is visible to the
> NFS server's transport accept logic. That's the one that
> matters to the DRC. Certainly IB fabrics wouldn't have an IP
> source port like this.

Right, the existence of NFSoRDMA over IB means that the DRC
implementation for the RDMA transport should be based on something
other than ports and IPs. (Or we should restrict NFSoRDMA to 4.1+ :-))

> For RoCE (and perhaps iWARP) that port number would be the
> same for all transports from the client. Still not useful
> for adding entropy to a DRC entry hash.

If that port stayed the same (which it doesn't), why wouldn't it be
sufficient for the DRC? It would be exactly the same as if proto=udp.

> @@ -211,9 +211,14 @@ static void handle_connect_req(struct rdma_cm_id *new_cma_id,
>         /* Save client advertised inbound read limit for use later in accept. */
>         newxprt->sc_ord = param->initiator_depth;
>
>         /* Set the local and remote addresses in the transport */
>         sa = (struct sockaddr *)&newxprt->sc_cm_id->route.addr.dst_addr;
>
> Can you add a printk to your server to show the port value
> in cm_id->route.addr.dst_addr?

What I see printed here isn't what I see in the network trace in
Wireshark for the UDP port. It's also confusing: the connection
manager communication (in my case) is between source port 55410
(which stays the same on remounts) and destination port 4791. Then NFS
traffic is between source port 63494 (which changes on remounts) and
destination port 4791. Yet NFS reports that the source port was 40403
and the destination port was 20049(!).

The code I looked at (as you pointed out) was for the connection
manager, and that port stays constant; but the NFSoRDMA traffic after
that comes from a new port that changes all the time (even for the
Mellanox card).

Looks like there is no way to get the "real" port through the RDMA
layers on either side.


> > I also see that nothing in the verbs API, through which we
> > interact with the RDMA drivers, allows us to set the port.
>
> I suspect this would be part of the RDMA Connection Manager
> interface, not part of the RDMA driver code.
>
>
> > Unless
> > we can ask the Linux implementation to augment some structures to
> > allow us to set and query that port, or is that unreasonable because
> > it's not in the standard API?
> >
> >>
> >>
> >>>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> >>>> Cc: stable@vger.kernel.org
> >>>> ---
> >>>> net/sunrpc/xprtrdma/svc_rdma_transport.c |    7 ++++++-
> >>>> 1 file changed, 6 insertions(+), 1 deletion(-)
> >>>>
> >>>> diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
> >>>> index 027a3b0..1b3700b 100644
> >>>> --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
> >>>> +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
> >>>> @@ -211,9 +211,14 @@ static void handle_connect_req(struct rdma_cm_id *new_cma_id,
> >>>>       /* Save client advertised inbound read limit for use later in accept. */
> >>>>       newxprt->sc_ord = param->initiator_depth;
> >>>>
> >>>> -       /* Set the local and remote addresses in the transport */
> >>>>       sa = (struct sockaddr *)&newxprt->sc_cm_id->route.addr.dst_addr;
> >>>>       svc_xprt_set_remote(&newxprt->sc_xprt, sa, svc_addr_len(sa));
> >>>> +       /* The remote port is arbitrary and not under the control of the
> >>>> +        * ULP. Set it to a fixed value so that the DRC continues to work
> >>>> +        * after a reconnect.
> >>>> +        */
> >>>> +       rpc_set_port((struct sockaddr *)&newxprt->sc_xprt.xpt_remote, 0);
> >>>> +
> >>>>       sa = (struct sockaddr *)&newxprt->sc_cm_id->route.addr.src_addr;
> >>>>       svc_xprt_set_local(&newxprt->sc_xprt, sa, svc_addr_len(sa));
> >>>>
> >>>>
> >>
> >> --
> >> Chuck Lever
>
> --
> Chuck Lever
>
>
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC] svcrdma: Ignore source port when computing DRC hash
  2019-06-07 15:43         ` Olga Kornievskaia
@ 2019-06-10 14:38           ` Tom Talpey
  0 siblings, 0 replies; 19+ messages in thread
From: Tom Talpey @ 2019-06-10 14:38 UTC (permalink / raw)
  To: Olga Kornievskaia, Chuck Lever; +Cc: linux-rdma, Linux NFS Mailing List

On 6/7/2019 11:43 AM, Olga Kornievskaia wrote:
> On Thu, Jun 6, 2019 at 2:33 PM Chuck Lever <chuck.lever@oracle.com> wrote:
>> ...
>> @@ -211,9 +211,14 @@ static void handle_connect_req(struct rdma_cm_id *new_cma_id,
>>          /* Save client advertised inbound read limit for use later in accept. */
>>          newxprt->sc_ord = param->initiator_depth;
>>
>>          /* Set the local and remote addresses in the transport */
>>          sa = (struct sockaddr *)&newxprt->sc_cm_id->route.addr.dst_addr;
>>
>> Can you add a printk to your server to show the port value
>> in cm_id->route.addr.dst_addr?
> 
> What I see printed here isn't what I see in the network trace in
> Wireshark for the UDP port. It's also confusing: the connection
> manager communication (in my case) is between source port 55410
> (which stays the same on remounts) and destination port 4791. Then NFS
> traffic is between source port 63494 (which changes on remounts) and
> destination port 4791. Yet NFS reports that the source port was 40403
> and the destination port was 20049(!).

This is expected. 4791 is the RoCEv2 protocol destination port, and
it's a constant for all RoCEv2 connections. Think of it like a tunnel,
similar to VXLAN.

RoCEv2 then applies its own mapping to demultiplex incoming packets to
the proper queue pair. This is based on the RDMA CM connection service
IDs, which are derived from the upper layer's requested port.

In other words, you need to apply tracing at the proper layer. Wire
level traces aren't going to be very useful.

Btw, iWARP connections won't remap ports like this. But IB and RoCE
will, since they don't natively have an actual 5-tuple defined.

Tom.

> The code I looked at (as you pointed out) was for the connection
> manager, and that port stays constant; but the NFSoRDMA traffic after
> that comes from a new port that changes all the time (even for the
> Mellanox card).
> 
> Looks like there is no way to get the "real" port through the RDMA
> layers on either side.
> 
> 
>>> I also see that nothing in the verbs API, through which we
>>> interact with the RDMA drivers, allows us to set the port.
>>
>> I suspect this would be part of the RDMA Connection Manager
>> interface, not part of the RDMA driver code.
>>
>>
>>> Unless
>>> we can ask the Linux implementation to augment some structures to
>>> allow us to set and query that port, or is that unreasonable because
>>> it's not in the standard API?
>>>
>>>>
>>>>
>>>>>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>>>>>> Cc: stable@vger.kernel.org
>>>>>> ---
>>>>>> net/sunrpc/xprtrdma/svc_rdma_transport.c |    7 ++++++-
>>>>>> 1 file changed, 6 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
>>>>>> index 027a3b0..1b3700b 100644
>>>>>> --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
>>>>>> +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
>>>>>> @@ -211,9 +211,14 @@ static void handle_connect_req(struct rdma_cm_id *new_cma_id,
>>>>>>        /* Save client advertised inbound read limit for use later in accept. */
>>>>>>        newxprt->sc_ord = param->initiator_depth;
>>>>>>
>>>>>> -       /* Set the local and remote addresses in the transport */
>>>>>>        sa = (struct sockaddr *)&newxprt->sc_cm_id->route.addr.dst_addr;
>>>>>>        svc_xprt_set_remote(&newxprt->sc_xprt, sa, svc_addr_len(sa));
>>>>>> +       /* The remote port is arbitrary and not under the control of the
>>>>>> +        * ULP. Set it to a fixed value so that the DRC continues to work
>>>>>> +        * after a reconnect.
>>>>>> +        */
>>>>>> +       rpc_set_port((struct sockaddr *)&newxprt->sc_xprt.xpt_remote, 0);
>>>>>> +
>>>>>>        sa = (struct sockaddr *)&newxprt->sc_cm_id->route.addr.src_addr;
>>>>>>        svc_xprt_set_local(&newxprt->sc_xprt, sa, svc_addr_len(sa));
>>>>>>
>>>>>>
>>>>
>>>> --
>>>> Chuck Lever
>>
>> --
>> Chuck Lever
>>
>>
>>
> 
> 


* Re: [PATCH RFC] svcrdma: Ignore source port when computing DRC hash
  2019-06-05 17:25   ` Chuck Lever
@ 2019-06-10 14:50     ` Tom Talpey
  2019-06-10 17:50       ` Chuck Lever
  0 siblings, 1 reply; 19+ messages in thread
From: Tom Talpey @ 2019-06-10 14:50 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-rdma, Linux NFS Mailing List

On 6/5/2019 1:25 PM, Chuck Lever wrote:
> Hi Tom-
> 
>> On Jun 5, 2019, at 12:43 PM, Tom Talpey <tom@talpey.com> wrote:
>>
>> On 6/5/2019 8:15 AM, Chuck Lever wrote:
>>> The DRC is not working at all after an RPC/RDMA transport reconnect.
>>> The problem is that the new connection uses a different source port,
>>> which defeats DRC hash.
>>>
>>> An NFS/RDMA client's source port is meaningless for RDMA transports.
>>> The transport layer typically sets the source port value on the
>>> connection to a random ephemeral port. The server already ignores it
>>> for the "secure port" check. See commit 16e4d93f6de7 ("NFSD: Ignore
>>> client's source port on RDMA transports").
>>
>> Where does the entropy come from, then, for the server to not
>> match other requests from other mount points on this same client?
> 
> The first ~200 bytes of each RPC Call message.
> 
> [ Note that this has some fun ramifications for calls with small
> RPC headers that use Read chunks. ]

Ok, good to know. I forgot that the Linux server implemented this.
I have some concerns about it, honestly, and it's important to remember
that it's not the same on all servers. But for the problem you're
fixing, it's ok I guess and certainly better than today. Still, the
errors are going to be completely silent, and can lead to data being
corrupted. Well, welcome to the world of NFSv3.

>> Any time an XID happens to match on a second mount, it will trigger
>> incorrect server processing, won't it?
> 
> Not a risk for clients that use only a single transport per
> client-server pair.

I just want to interject here that this is completely irrelevant.
The server can't know what these clients are doing, or expecting.
One case that might work is not any kind of evidence, and is not
a workaround.

>> And since RDMA is capable of
>> such high IOPS, the likelihood seems rather high.
> 
> Only when the server's durable storage is slow enough to cause
> some RPC requests to have extremely high latency.
> 
> And, most clients use an atomic counter for their XIDs, so they
> are also likely to wrap that counter over some long-pending RPC
> request.
> 
> The only real answer here is NFSv4 sessions.
> 
> 
>> Missing the cache
>> might actually be safer than hitting, in this case.
> 
> Remember that _any_ retransmit on RPC/RDMA requires a fresh
> connection, that includes NFSv3, to reset credit accounting
> due to the lost half of the RPC Call/Reply pair.
> 
> I can very quickly reproduce bad (non-deterministic) behavior
> by running a software build on an NFSv3 on RDMA mount point
> with disconnect injection. If the DRC issue is addressed, the
> software build runs to completion.

Ok, good. But I have a better test.

In the Connectathon suite, there's a "Special" test called "nfsidem".
I wrote this test in, like, 1989 so I remember it :-)

This test performs all the non-idempotent NFSv3 operations in a loop,
and each loop element depends on the previous one, so if there's
any failure, the test immediately bombs.

Nobody seems to understand it; usually when it gets run, people run
it without injecting errors, and it "passes", so they decide
everything is ok.

So my suggestion is to run your flakeway packet-drop harness while
running nfsidem in a huge loop (nfsidem 10000). The test is slow,
owing to the expensive operations it performs, so you'll need to
run it for a long time.

You'll almost definitely get a failure or two, since the NFSv3
protocol is flawed by design. But you can compare the behaviors,
and even compute a likelihood. I'd love to see some actual numbers.

> IMO we can't leave things the way they are.

Agreed!

Tom.


>>> I'm not sure why I never noticed this before.
>>>
>>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>>> Cc: stable@vger.kernel.org
>>> ---
>>>   net/sunrpc/xprtrdma/svc_rdma_transport.c |    7 ++++++-
>>>   1 file changed, 6 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
>>> index 027a3b0..1b3700b 100644
>>> --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
>>> +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
>>> @@ -211,9 +211,14 @@ static void handle_connect_req(struct rdma_cm_id *new_cma_id,
>>>   	/* Save client advertised inbound read limit for use later in accept. */
>>>   	newxprt->sc_ord = param->initiator_depth;
>>>
>>> -	/* Set the local and remote addresses in the transport */
>>>   	sa = (struct sockaddr *)&newxprt->sc_cm_id->route.addr.dst_addr;
>>>   	svc_xprt_set_remote(&newxprt->sc_xprt, sa, svc_addr_len(sa));
>>> +	/* The remote port is arbitrary and not under the control of the
>>> +	 * ULP. Set it to a fixed value so that the DRC continues to work
>>> +	 * after a reconnect.
>>> +	 */
>>> +	rpc_set_port((struct sockaddr *)&newxprt->sc_xprt.xpt_remote, 0);
>>> +
>>>   	sa = (struct sockaddr *)&newxprt->sc_cm_id->route.addr.src_addr;
>>>   	svc_xprt_set_local(&newxprt->sc_xprt, sa, svc_addr_len(sa));
>>>
>>>
>>>
>>>
> 
> --
> Chuck Lever
> 
> 
> 
> 
> 


* Re: [PATCH RFC] svcrdma: Ignore source port when computing DRC hash
  2019-06-10 14:50     ` Tom Talpey
@ 2019-06-10 17:50       ` Chuck Lever
  2019-06-10 19:14         ` Tom Talpey
  0 siblings, 1 reply; 19+ messages in thread
From: Chuck Lever @ 2019-06-10 17:50 UTC (permalink / raw)
  To: Tom Talpey; +Cc: linux-rdma, Linux NFS Mailing List

Hi Tom-

> On Jun 10, 2019, at 10:50 AM, Tom Talpey <tom@talpey.com> wrote:
> 
> On 6/5/2019 1:25 PM, Chuck Lever wrote:
>> Hi Tom-
>>> On Jun 5, 2019, at 12:43 PM, Tom Talpey <tom@talpey.com> wrote:
>>> 
>>> On 6/5/2019 8:15 AM, Chuck Lever wrote:
>>>> The DRC is not working at all after an RPC/RDMA transport reconnect.
>>>> The problem is that the new connection uses a different source port,
>>>> which defeats DRC hash.
>>>> 
>>>> An NFS/RDMA client's source port is meaningless for RDMA transports.
>>>> The transport layer typically sets the source port value on the
>>>> connection to a random ephemeral port. The server already ignores it
>>>> for the "secure port" check. See commit 16e4d93f6de7 ("NFSD: Ignore
>>>> client's source port on RDMA transports").
>>> 
>>> Where does the entropy come from, then, for the server to not
>>> match other requests from other mount points on this same client?
>> The first ~200 bytes of each RPC Call message.
>> [ Note that this has some fun ramifications for calls with small
>> RPC headers that use Read chunks. ]
> 
> Ok, good to know. I forgot that the Linux server implemented this.
> I have some concerns about it, honestly, and it's important to remember
> that it's not the same on all servers. But for the problem you're
> fixing, it's ok I guess and certainly better than today. Still, the
> errors are going to be completely silent, and can lead to data being
> corrupted. Well, welcome to the world of NFSv3.

I don't see another option.

Some regard this checksum as more robust than using the client's
IP source port. After all, the same argument can be made that
the server cannot depend on clients to reuse their source port.
That is simply a convention that many clients adopted before
servers used a stronger DRC hash mechanism.


>>> And since RDMA is capable of
>>> such high IOPS, the likelihood seems rather high.
>> Only when the server's durable storage is slow enough to cause
>> some RPC requests to have extremely high latency.
>> And, most clients use an atomic counter for their XIDs, so they
>> are also likely to wrap that counter over some long-pending RPC
>> request.
>> The only real answer here is NFSv4 sessions.
>>> Missing the cache
>>> might actually be safer than hitting, in this case.
>> Remember that _any_ retransmit on RPC/RDMA requires a fresh
>> connection, that includes NFSv3, to reset credit accounting
>> due to the lost half of the RPC Call/Reply pair.
>> I can very quickly reproduce bad (non-deterministic) behavior
>> by running a software build on an NFSv3 on RDMA mount point
>> with disconnect injection. If the DRC issue is addressed, the
>> software build runs to completion.
> 
> Ok, good. But I have a better test.
> 
> In the Connectathon suite, there's a "Special" test called "nfsidem".
> I wrote this test in, like, 1989 so I remember it :-)
> 
> This test performs all the non-idempotent NFSv3 operations in a loop,
> and each loop element depends on the previous one, so if there's
> any failure, the test immediately bombs.
> 
> Nobody seems to understand it, usually when it gets run people will
> run it without injecting errors, and it "passes" so they decide
> everything is ok.
> 
> So my suggestion is to run your flakeway packet-drop harness while
> running nfsidem in a huge loop (nfsidem 10000). The test is slow,
> owing to the expensive operations it performs, so you'll need to
> run it for a long time.
> 
> You'll almost definitely get a failure or two, since the NFSv3
> protocol is flawed by design. But you can compare the behaviors,
> and even compute a likelihood. I'd love to see some actual numbers.

I configured the client to disconnect after 23711 RPCs have completed.
(I can re-run these with more frequent disconnects if you think that
would be useful).

Here's a run with the DRC modification:

[cel@manet ~]$ sudo mount -o vers=3,proto=rdma,sec=sys klimt.ib:/export/tmp /mnt
[cel@manet ~]$ (cd /mnt; ~/src/cthon04/special/nfsidem 100000)
testing 100000 idempotencies in directory "./TEST"
[cel@manet ~]$ sudo umount /mnt


Here's a run with the stock v5.1 Linux server:

[cel@manet ~]$ sudo mount -o vers=3,proto=rdma,sec=sys klimt.ib:/export/tmp /mnt
[cel@manet ~]$ (cd /mnt; ~/src/cthon04/special/nfsidem 100000)
testing 100000 idempotencies in directory "./TEST"
[cel@manet ~]$

This test reported no errors in either case. We can see that the
disconnects did trigger retransmits:

RPC statistics:
  1888819 RPC requests sent, 1888581 RPC replies received (0 XIDs not found)
  average backlog queue length: 119

ACCESS:
        300001 ops (15%)        44 retrans (0%)         0 major timeouts
        avg bytes sent per op: 132	avg bytes received per op: 120
        backlog wait: 0.591118  RTT: 0.017463   total execute time: 0.614795 (milliseconds)
REMOVE:
       	300000 ops (15%)        40 retrans (0%)         0 major timeouts
        avg bytes sent per op: 136	avg bytes received per op: 144
        backlog wait: 0.531667  RTT: 0.018973   total execute time: 0.556927 (milliseconds)
MKDIR:
     	200000 ops (10%)        26 retrans (0%)         0 major timeouts
        avg bytes sent per op: 158      avg bytes received per op: 272
        backlog wait: 0.518940  RTT: 0.019755   total execute time: 0.545230 (milliseconds)
RMDIR:
	200000 ops (10%)        24 retrans (0%)         0 major timeouts
        avg bytes sent per op: 130	avg bytes received per op: 144
        backlog wait: 0.512320  RTT: 0.018580   total execute time: 0.537095 (milliseconds)
LOOKUP:
       	188533 ops (9%)         21 retrans (0%)         0 major timeouts
        avg bytes sent per op: 136	avg bytes received per op: 174
        backlog wait: 0.455925  RTT: 0.017721   total execute time: 0.480011 (milliseconds)
SETATTR:
        100000 ops (5%)         11 retrans (0%)         0 major timeouts
        avg bytes sent per op: 160	avg bytes received per op: 144
        backlog wait: 0.371960  RTT: 0.019470   total execute time: 0.398330 (milliseconds)
WRITE:
      	100000 ops (5%)         9 retrans (0%)  0 major timeouts
        avg bytes sent per op: 180	avg bytes received per op: 136
        backlog wait: 0.399190  RTT: 0.022860   total execute time: 0.436610 (milliseconds)
CREATE:
       	100000 ops (5%)         9 retrans (0%)  0 major timeouts
        avg bytes sent per op: 168	avg bytes received per op: 272
        backlog wait: 0.365290  RTT: 0.019560   total execute time: 0.391140 (milliseconds)
SYMLINK:
     	100000 ops (5%)         18 retrans (0%)         0 major timeouts
        avg bytes sent per op: 188	avg bytes received per op: 272
        backlog wait: 0.750470  RTT: 0.020150   total execute time: 0.786410 (milliseconds)
RENAME:
     	100000 ops (5%)         14 retrans (0%)         0 major timeouts
        avg bytes sent per op: 180	avg bytes received per op: 260
        backlog wait: 0.461650  RTT: 0.020710   total execute time: 0.489670 (milliseconds)


--
Chuck Lever





* Re: [PATCH RFC] svcrdma: Ignore source port when computing DRC hash
  2019-06-10 17:50       ` Chuck Lever
@ 2019-06-10 19:14         ` Tom Talpey
  2019-06-10 21:57           ` Chuck Lever
  0 siblings, 1 reply; 19+ messages in thread
From: Tom Talpey @ 2019-06-10 19:14 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-rdma, Linux NFS Mailing List

On 6/10/2019 1:50 PM, Chuck Lever wrote:
> Hi Tom-
> 
>> On Jun 10, 2019, at 10:50 AM, Tom Talpey <tom@talpey.com> wrote:
>>
>> On 6/5/2019 1:25 PM, Chuck Lever wrote:
>>> Hi Tom-
>>>> On Jun 5, 2019, at 12:43 PM, Tom Talpey <tom@talpey.com> wrote:
>>>>
>>>> On 6/5/2019 8:15 AM, Chuck Lever wrote:
>>>>> The DRC is not working at all after an RPC/RDMA transport reconnect.
>>>>> The problem is that the new connection uses a different source port,
>>>>> which defeats DRC hash.
>>>>>
>>>>> An NFS/RDMA client's source port is meaningless for RDMA transports.
>>>>> The transport layer typically sets the source port value on the
>>>>> connection to a random ephemeral port. The server already ignores it
>>>>> for the "secure port" check. See commit 16e4d93f6de7 ("NFSD: Ignore
>>>>> client's source port on RDMA transports").
>>>>
>>>> Where does the entropy come from, then, for the server to not
>>>> match other requests from other mount points on this same client?
>>> The first ~200 bytes of each RPC Call message.
>>> [ Note that this has some fun ramifications for calls with small
>>> RPC headers that use Read chunks. ]
>>
>> Ok, good to know. I forgot that the Linux server implemented this.
>> I have some concerns about it, honestly, and it's important to remember
>> that it's not the same on all servers. But for the problem you're
>> fixing, it's ok I guess and certainly better than today. Still, the
>> errors are going to be completely silent, and can lead to data being
>> corrupted. Well, welcome to the world of NFSv3.
> 
> I don't see another option.
> 
> Some regard this checksum as more robust than using the client's
> IP source port. After all, the same argument can be made that
> the server cannot depend on clients to reuse their source port.
> That is simply a convention that many clients adopted before
> servers used a stronger DRC hash mechanism.
> 
> 
>>>> And since RDMA is capable of
>>>> such high IOPS, the likelihood seems rather high.
>>> Only when the server's durable storage is slow enough to cause
>>> some RPC requests to have extremely high latency.
>>> And, most clients use an atomic counter for their XIDs, so they
>>> are also likely to wrap that counter over some long-pending RPC
>>> request.
>>> The only real answer here is NFSv4 sessions.
>>>> Missing the cache
>>>> might actually be safer than hitting, in this case.
>>> Remember that _any_ retransmit on RPC/RDMA requires a fresh
>>> connection, that includes NFSv3, to reset credit accounting
>>> due to the lost half of the RPC Call/Reply pair.
>>> I can very quickly reproduce bad (non-deterministic) behavior
>>> by running a software build on an NFSv3 on RDMA mount point
>>> with disconnect injection. If the DRC issue is addressed, the
>>> software build runs to completion.
>>
>> Ok, good. But I have a better test.
>>
>> In the Connectathon suite, there's a "Special" test called "nfsidem".
>> I wrote this test in, like, 1989 so I remember it :-)
>>
>> This test performs all the non-idempotent NFSv3 operations in a loop,
>> and each loop element depends on the previous one, so if there's
>> any failure, the test immediately bombs.
>>
>> Nobody seems to understand it, usually when it gets run people will
>> run it without injecting errors, and it "passes" so they decide
>> everything is ok.
>>
>> So my suggestion is to run your flakeway packet-drop harness while
>> running nfsidem in a huge loop (nfsidem 10000). The test is slow,
>> owing to the expensive operations it performs, so you'll need to
>> run it for a long time.
>>
>> You'll almost definitely get a failure or two, since the NFSv3
>> protocol is flawed by design. But you can compare the behaviors,
>> and even compute a likelihood. I'd love to see some actual numbers.
> 
> I configured the client to disconnect after 23711 RPCs have completed.
> (I can re-run these with more frequent disconnects if you think that
> would be useful).
> 
> Here's a run with the DRC modification:
> 
> [cel@manet ~]$ sudo mount -o vers=3,proto=rdma,sec=sys klimt.ib:/export/tmp /mnt
> [cel@manet ~]$ (cd /mnt; ~/src/cthon04/special/nfsidem 100000)
> testing 100000 idempotencies in directory "./TEST"
> [cel@manet ~]$ sudo umount /mnt
> 
> 
> Here's a run with the stock v5.1 Linux server:
> 
> [cel@manet ~]$ sudo mount -o vers=3,proto=rdma,sec=sys klimt.ib:/export/tmp /mnt
> [cel@manet ~]$ (cd /mnt; ~/src/cthon04/special/nfsidem 100000)
> testing 100000 idempotencies in directory "./TEST"
> [cel@manet ~]$
> 
> This test reported no errors in either case. We can see that the
> disconnects did trigger retransmits:
> 
> RPC statistics:
>    1888819 RPC requests sent, 1888581 RPC replies received (0 XIDs not found)
>    average backlog queue length: 119

Ok, well, that's roughly a 0.01% error rate, which IMO could be cranked up much
higher for testing purposes. I'd also be sure the server was serving
other workloads during the same time, putting at least some pressure
on the DRC. The op rate of a single nfsidem test is pretty low so I
doubt it's ever evicting anything.

Ideally, it would be best to
1) increase the error probability
2) run several concurrent nfsidem tests, on different connections
3) apply some other load to the server, e.g. several cthon basics

The idea being to actually get the needle off of zero and measure some
kind of difference. Otherwise it really isn't giving any information
apart from a slight didn't-break-it confidence. Honestly, I'm surprised
you couldn't evince a failure from stock. On paper, these results don't
actually tell us the patch is doing anything.

Tom.

> 
> ACCESS:
>          300001 ops (15%)        44 retrans (0%)         0 major timeouts
>          avg bytes sent per op: 132	avg bytes received per op: 120
>          backlog wait: 0.591118  RTT: 0.017463   total execute time: 0.614795 (milliseconds)
> REMOVE:
>         	300000 ops (15%)        40 retrans (0%)         0 major timeouts
>          avg bytes sent per op: 136	avg bytes received per op: 144
>          backlog wait: 0.531667  RTT: 0.018973   total execute time: 0.556927 (milliseconds)
> MKDIR:
>       	200000 ops (10%)        26 retrans (0%)         0 major timeouts
>          avg bytes sent per op: 158      avg bytes received per op: 272
>          backlog wait: 0.518940  RTT: 0.019755   total execute time: 0.545230 (milliseconds)
> RMDIR:
> 	200000 ops (10%)        24 retrans (0%)         0 major timeouts
>          avg bytes sent per op: 130	avg bytes received per op: 144
>          backlog wait: 0.512320  RTT: 0.018580   total execute time: 0.537095 (milliseconds)
> LOOKUP:
>         	188533 ops (9%)         21 retrans (0%)         0 major timeouts
>          avg bytes sent per op: 136	avg bytes received per op: 174
>          backlog wait: 0.455925  RTT: 0.017721   total execute time: 0.480011 (milliseconds)
> SETATTR:
>          100000 ops (5%)         11 retrans (0%)         0 major timeouts
>          avg bytes sent per op: 160	avg bytes received per op: 144
>          backlog wait: 0.371960  RTT: 0.019470   total execute time: 0.398330 (milliseconds)
> WRITE:
>        	100000 ops (5%)         9 retrans (0%)  0 major timeouts
>          avg bytes sent per op: 180	avg bytes received per op: 136
>          backlog wait: 0.399190  RTT: 0.022860   total execute time: 0.436610 (milliseconds)
> CREATE:
>         	100000 ops (5%)         9 retrans (0%)  0 major timeouts
>          avg bytes sent per op: 168	avg bytes received per op: 272
>          backlog wait: 0.365290  RTT: 0.019560   total execute time: 0.391140 (milliseconds)
> SYMLINK:
>       	100000 ops (5%)         18 retrans (0%)         0 major timeouts
>          avg bytes sent per op: 188	avg bytes received per op: 272
>          backlog wait: 0.750470  RTT: 0.020150   total execute time: 0.786410 (milliseconds)
> RENAME:
>       	100000 ops (5%)         14 retrans (0%)         0 major timeouts
>          avg bytes sent per op: 180	avg bytes received per op: 260
>          backlog wait: 0.461650  RTT: 0.020710   total execute time: 0.489670 (milliseconds)
> 
> 
> --
> Chuck Lever
> 
> 
> 
> 
> 


* Re: [PATCH RFC] svcrdma: Ignore source port when computing DRC hash
  2019-06-10 19:14         ` Tom Talpey
@ 2019-06-10 21:57           ` Chuck Lever
  2019-06-10 22:13             ` Tom Talpey
  0 siblings, 1 reply; 19+ messages in thread
From: Chuck Lever @ 2019-06-10 21:57 UTC (permalink / raw)
  To: Tom Talpey; +Cc: linux-rdma, Linux NFS Mailing List



> On Jun 10, 2019, at 3:14 PM, Tom Talpey <tom@talpey.com> wrote:
> 
> On 6/10/2019 1:50 PM, Chuck Lever wrote:
>> Hi Tom-
>>> On Jun 10, 2019, at 10:50 AM, Tom Talpey <tom@talpey.com> wrote:
>>> 
>>> On 6/5/2019 1:25 PM, Chuck Lever wrote:
>>>> Hi Tom-
>>>>> On Jun 5, 2019, at 12:43 PM, Tom Talpey <tom@talpey.com> wrote:
>>>>> 
>>>>> On 6/5/2019 8:15 AM, Chuck Lever wrote:
>>>>>> The DRC is not working at all after an RPC/RDMA transport reconnect.
>>>>>> The problem is that the new connection uses a different source port,
>>>>>> which defeats DRC hash.
>>>>>> 
>>>>>> An NFS/RDMA client's source port is meaningless for RDMA transports.
>>>>>> The transport layer typically sets the source port value on the
>>>>>> connection to a random ephemeral port. The server already ignores it
>>>>>> for the "secure port" check. See commit 16e4d93f6de7 ("NFSD: Ignore
>>>>>> client's source port on RDMA transports").
>>>>> 
>>>>> Where does the entropy come from, then, for the server to not
>>>>> match other requests from other mount points on this same client?
>>>> The first ~200 bytes of each RPC Call message.
>>>> [ Note that this has some fun ramifications for calls with small
>>>> RPC headers that use Read chunks. ]
>>> 
>>> Ok, good to know. I forgot that the Linux server implemented this.
>>> I have some concerns about it, honestly, and it's important to remember
>>> that it's not the same on all servers. But for the problem you're
>>> fixing, it's ok I guess and certainly better than today. Still, the
>>> errors are going to be completely silent, and can lead to data being
>>> corrupted. Well, welcome to the world of NFSv3.
>> I don't see another option.
>> Some regard this checksum as more robust than using the client's
>> IP source port. After all, the same argument can be made that
>> the server cannot depend on clients to reuse their source port.
>> That is simply a convention that many clients adopted before
>> servers used a stronger DRC hash mechanism.
>>>>> And since RDMA is capable of
>>>>> such high IOPS, the likelihood seems rather high.
>>>> Only when the server's durable storage is slow enough to cause
>>>> some RPC requests to have extremely high latency.
>>>> And, most clients use an atomic counter for their XIDs, so they
>>>> are also likely to wrap that counter over some long-pending RPC
>>>> request.
>>>> The only real answer here is NFSv4 sessions.
>>>>> Missing the cache
>>>>> might actually be safer than hitting, in this case.
>>>> Remember that _any_ retransmit on RPC/RDMA requires a fresh
>>>> connection, that includes NFSv3, to reset credit accounting
>>>> due to the lost half of the RPC Call/Reply pair.
>>>> I can very quickly reproduce bad (non-deterministic) behavior
>>>> by running a software build on an NFSv3 on RDMA mount point
>>>> with disconnect injection. If the DRC issue is addressed, the
>>>> software build runs to completion.
>>> 
>>> Ok, good. But I have a better test.
>>> 
>>> In the Connectathon suite, there's a "Special" test called "nfsidem".
>>> I wrote this test in, like, 1989 so I remember it :-)
>>> 
>>> This test performs all the non-idempotent NFSv3 operations in a loop,
>>> and each loop element depends on the previous one, so if there's
>>> any failure, the test immediately bombs.
>>> 
>>> Nobody seems to understand it, usually when it gets run people will
>>> run it without injecting errors, and it "passes" so they decide
>>> everything is ok.
>>> 
>>> So my suggestion is to run your flakeway packet-drop harness while
>>> running nfsidem in a huge loop (nfsidem 10000). The test is slow,
>>> owing to the expensive operations it performs, so you'll need to
>>> run it for a long time.
>>> 
>>> You'll almost definitely get a failure or two, since the NFSv3
>>> protocol is flawed by design. But you can compare the behaviors,
>>> and even compute a likelihood. I'd love to see some actual numbers.
>> I configured the client to disconnect after 23711 RPCs have completed.
>> (I can re-run these with more frequent disconnects if you think that
>> would be useful).
>> Here's a run with the DRC modification:
>> [cel@manet ~]$ sudo mount -o vers=3,proto=rdma,sec=sys klimt.ib:/export/tmp /mnt
>> [cel@manet ~]$ (cd /mnt; ~/src/cthon04/special/nfsidem 100000)
>> testing 100000 idempotencies in directory "./TEST"
>> [cel@manet ~]$ sudo umount /mnt
>> Here's a run with the stock v5.1 Linux server:
>> [cel@manet ~]$ sudo mount -o vers=3,proto=rdma,sec=sys klimt.ib:/export/tmp /mnt
>> [cel@manet ~]$ (cd /mnt; ~/src/cthon04/special/nfsidem 100000)
>> testing 100000 idempotencies in directory "./TEST"
>> [cel@manet ~]$
>> This test reported no errors in either case. We can see that the
>> disconnects did trigger retransmits:
>> RPC statistics:
>>   1888819 RPC requests sent, 1888581 RPC replies received (0 XIDs not found)
>>   average backlog queue length: 119
> 
> Ok, well, that's 1.2% error rate, which IMO could be cranked up much
> higher for testing purposes. I'd also be sure the server was serving
> other workloads during the same time, putting at least some pressure
> on the DRC. The op rate of a single nfsidem test is pretty low so I
> doubt it's ever evicting anything.
> 
> Ideally, it would be best to
> 1) increase the error probability
> 2) run several concurrent nfsidem tests, on different connections
> 3) apply some other load to the server, e.g. several cthon basics
> 
> The idea being to actually get the needle off of zero and measure some
> kind of difference. Otherwise it really isn't giving any information
> apart from a slight didn't-break-it confidence. Honestly, I'm surprised
> you couldn't evince a failure from stock. On paper, these results don't
> actually tell us the patch is doing anything.

I boosted the disconnect injection rate to once every 11353 RPCs,
and mounted a second share with "nosharecache", running nfsidem on
both mounts. Both mounts are subject to disconnect injection.

With the current v5.2-rc Linux server, both nfsidem jobs fail within
30 seconds.

With my DRC fix applied, both jobs run to completion with no errors.
It takes more than an hour.


Here are the op metrics for one of the mounts during a run that
completes successfully:

RPC statistics:
  4091380 RPC requests sent, 4088143 RPC replies received (0 XIDs not found)
  average backlog queue length: 1800

ACCESS:
       	300395 ops (7%)         301 retrans (0%)        0 major timeouts
        avg bytes sent per op: 132	avg bytes received per op: 120
        backlog wait: 4.289199  RTT: 0.019821   total execute time: 4.315092 (milliseconds)
REMOVE:
       	300390 ops (7%)         168 retrans (0%)        0 major timeouts
        avg bytes sent per op: 136	avg bytes received per op: 144
        backlog wait: 2.368664  RTT: 0.070106   total execute time: 2.445148 (milliseconds)
MKDIR:
      	200262 ops (4%)         193 retrans (0%)        0 major timeouts
        avg bytes sent per op: 158	avg bytes received per op: 271
        backlog wait: 4.207034  RTT: 0.075421   total execute time: 4.289101 (milliseconds)
RMDIR:
      	200262 ops (4%)         100 retrans (0%)        0 major timeouts
        avg bytes sent per op: 130	avg bytes received per op: 144
        backlog wait: 2.050749  RTT: 0.071676   total execute time: 2.128801 (milliseconds)
LOOKUP:
       	194664 ops (4%)         233 retrans (0%)        0 major timeouts
        avg bytes sent per op: 136	avg bytes received per op: 176
        backlog wait: 5.365984  RTT: 0.020615   total execute time: 5.392769 (milliseconds)
SETATTR:
       	100130 ops (2%)         86 retrans (0%)         0 major timeouts
        avg bytes sent per op: 160	avg bytes received per op: 144
        backlog wait: 3.520603  RTT: 0.066863   total execute time: 3.594327 (milliseconds)
WRITE:
       	100130 ops (2%)         82 retrans (0%)         0 major timeouts
        avg bytes sent per op: 180	avg bytes received per op: 136
        backlog wait: 3.331249  RTT: 0.118316   total execute time: 3.459563 (milliseconds)
CREATE:
       	100130 ops (2%)         95 retrans (0%)         0 major timeouts
        avg bytes sent per op: 168	avg bytes received per op: 272
        backlog wait: 4.099451  RTT: 0.071437   total execute time: 4.177479 (milliseconds)
SYMLINK:
       	100130 ops (2%)         83 retrans (0%)         0 major timeouts
        avg bytes sent per op: 188	avg bytes received per op: 271
        backlog wait: 3.727704  RTT: 0.073534   total execute time: 3.807700 (milliseconds)
RENAME:
       	100130 ops (2%)         68 retrans (0%)         0 major timeouts
        avg bytes sent per op: 180	avg bytes received per op: 260
        backlog wait: 2.659982  RTT: 0.070518   total execute time: 2.738979 (milliseconds)
LINK:
	100130 ops (2%) 	85 retrans (0%) 	0 major timeouts
	avg bytes sent per op: 172	avg bytes received per op: 232
	backlog wait: 3.676680 	RTT: 0.066773 	total execute time: 3.749785 (milliseconds)
GETATTR:
	230 ops (0%) 	81 retrans (35%) 	0 major timeouts
	avg bytes sent per op: 170	avg bytes received per op: 112
	backlog wait: 1584.026087 	RTT: 0.043478 	total execute time: 1584.082609 (milliseconds)
READDIRPLUS:
	10 ops (0%) 
	avg bytes sent per op: 149	avg bytes received per op: 1726
	backlog wait: 0.000000 	RTT: 0.300000 	total execute time: 0.400000 (milliseconds)
FSINFO:
	2 ops (0%) 
	avg bytes sent per op: 112	avg bytes received per op: 80
	backlog wait: 0.000000 	RTT: 0.000000 	total execute time: 0.000000 (milliseconds)
NULL:
	1 ops (0%) 
	avg bytes sent per op: 40	avg bytes received per op: 24
	backlog wait: 2.000000 	RTT: 0.000000 	total execute time: 3.000000 (milliseconds)
READLINK:
	1 ops (0%) 
	avg bytes sent per op: 128	avg bytes received per op: 1140
	backlog wait: 0.000000 	RTT: 0.000000 	total execute time: 0.000000 (milliseconds)
PATHCONF:
	1 ops (0%) 
	avg bytes sent per op: 112	avg bytes received per op: 56
	backlog wait: 0.000000 	RTT: 0.000000 	total execute time: 0.000000 (milliseconds)


> Tom.
> 
>> ACCESS:
>>         300001 ops (15%)        44 retrans (0%)         0 major timeouts
>>         avg bytes sent per op: 132	avg bytes received per op: 120
>>         backlog wait: 0.591118  RTT: 0.017463   total execute time: 0.614795 (milliseconds)
>> REMOVE:
>>        	300000 ops (15%)        40 retrans (0%)         0 major timeouts
>>         avg bytes sent per op: 136	avg bytes received per op: 144
>>         backlog wait: 0.531667  RTT: 0.018973   total execute time: 0.556927 (milliseconds)
>> MKDIR:
>>      	200000 ops (10%)        26 retrans (0%)         0 major timeouts
>>         avg bytes sent per op: 158      avg bytes received per op: 272
>>         backlog wait: 0.518940  RTT: 0.019755   total execute time: 0.545230 (milliseconds)
>> RMDIR:
>> 	200000 ops (10%)        24 retrans (0%)         0 major timeouts
>>         avg bytes sent per op: 130	avg bytes received per op: 144
>>         backlog wait: 0.512320  RTT: 0.018580   total execute time: 0.537095 (milliseconds)
>> LOOKUP:
>>        	188533 ops (9%)         21 retrans (0%)         0 major timeouts
>>         avg bytes sent per op: 136	avg bytes received per op: 174
>>         backlog wait: 0.455925  RTT: 0.017721   total execute time: 0.480011 (milliseconds)
>> SETATTR:
>>         100000 ops (5%)         11 retrans (0%)         0 major timeouts
>>         avg bytes sent per op: 160	avg bytes received per op: 144
>>         backlog wait: 0.371960  RTT: 0.019470   total execute time: 0.398330 (milliseconds)
>> WRITE:
>>       	100000 ops (5%)         9 retrans (0%)  0 major timeouts
>>         avg bytes sent per op: 180	avg bytes received per op: 136
>>         backlog wait: 0.399190  RTT: 0.022860   total execute time: 0.436610 (milliseconds)
>> CREATE:
>>        	100000 ops (5%)         9 retrans (0%)  0 major timeouts
>>         avg bytes sent per op: 168	avg bytes received per op: 272
>>         backlog wait: 0.365290  RTT: 0.019560   total execute time: 0.391140 (milliseconds)
>> SYMLINK:
>>      	100000 ops (5%)         18 retrans (0%)         0 major timeouts
>>         avg bytes sent per op: 188	avg bytes received per op: 272
>>         backlog wait: 0.750470  RTT: 0.020150   total execute time: 0.786410 (milliseconds)
>> RENAME:
>>      	100000 ops (5%)         14 retrans (0%)         0 major timeouts
>>         avg bytes sent per op: 180	avg bytes received per op: 260
>>         backlog wait: 0.461650  RTT: 0.020710   total execute time: 0.489670 (milliseconds)
>> --
>> Chuck Lever

--
Chuck Lever




^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC] svcrdma: Ignore source port when computing DRC hash
  2019-06-10 21:57           ` Chuck Lever
@ 2019-06-10 22:13             ` Tom Talpey
  2019-06-11  0:07               ` Tom Talpey
  2019-06-11 14:23               ` Chuck Lever
  0 siblings, 2 replies; 19+ messages in thread
From: Tom Talpey @ 2019-06-10 22:13 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-rdma, Linux NFS Mailing List

On 6/10/2019 5:57 PM, Chuck Lever wrote:
> 
> 
>> On Jun 10, 2019, at 3:14 PM, Tom Talpey <tom@talpey.com> wrote:
>>
>> On 6/10/2019 1:50 PM, Chuck Lever wrote:
>>> Hi Tom-
>>>> On Jun 10, 2019, at 10:50 AM, Tom Talpey <tom@talpey.com> wrote:
>>>>
>>>> On 6/5/2019 1:25 PM, Chuck Lever wrote:
>>>>> Hi Tom-
>>>>>> On Jun 5, 2019, at 12:43 PM, Tom Talpey <tom@talpey.com> wrote:
>>>>>>
>>>>>> On 6/5/2019 8:15 AM, Chuck Lever wrote:
>>>>>>> The DRC is not working at all after an RPC/RDMA transport reconnect.
>>>>>>> The problem is that the new connection uses a different source port,
>>>>>>> which defeats DRC hash.
>>>>>>>
>>>>>>> An NFS/RDMA client's source port is meaningless for RDMA transports.
>>>>>>> The transport layer typically sets the source port value on the
>>>>>>> connection to a random ephemeral port. The server already ignores it
>>>>>>> for the "secure port" check. See commit 16e4d93f6de7 ("NFSD: Ignore
>>>>>>> client's source port on RDMA transports").
>>>>>>
>>>>>> Where does the entropy come from, then, for the server to not
>>>>>> match other requests from other mount points on this same client?
>>>>> The first ~200 bytes of each RPC Call message.
>>>>> [ Note that this has some fun ramifications for calls with small
>>>>> RPC headers that use Read chunks. ]
>>>>
>>>> Ok, good to know. I forgot that the Linux server implemented this.
>>>> I have some concerns about it, honestly, and it's important to remember
>>>> that it's not the same on all servers. But for the problem you're
>>>> fixing, it's ok I guess and certainly better than today. Still, the
>>>> errors are going to be completely silent, and can lead to data being
>>>> corrupted. Well, welcome to the world of NFSv3.
>>> I don't see another option.
>>> Some regard this checksum as more robust than using the client's
>>> IP source port. After all, the same argument can be made that
>>> the server cannot depend on clients to reuse their source port.
>>> That is simply a convention that many clients adopted before
>>> servers used a stronger DRC hash mechanism.
>>>>>> And since RDMA is capable of
>>>>>> such high IOPS, the likelihood seems rather high.
>>>>> Only when the server's durable storage is slow enough to cause
>>>>> some RPC requests to have extremely high latency.
>>>>> And, most clients use an atomic counter for their XIDs, so they
>>>>> are also likely to wrap that counter over some long-pending RPC
>>>>> request.
>>>>> The only real answer here is NFSv4 sessions.
>>>>>> Missing the cache
>>>>>> might actually be safer than hitting, in this case.
>>>>> Remember that _any_ retransmit on RPC/RDMA requires a fresh
>>>>> connection, that includes NFSv3, to reset credit accounting
>>>>> due to the lost half of the RPC Call/Reply pair.
>>>>> I can very quickly reproduce bad (non-deterministic) behavior
>>>>> by running a software build on an NFSv3 on RDMA mount point
>>>>> with disconnect injection. If the DRC issue is addressed, the
>>>>> software build runs to completion.
>>>>
>>>> Ok, good. But I have a better test.
>>>>
>>>> In the Connectathon suite, there's a "Special" test called "nfsidem".
>>>> I wrote this test in, like, 1989 so I remember it :-)
>>>>
>>>> This test performs all the non-idempotent NFSv3 operations in a loop,
>>>> and each loop element depends on the previous one, so if there's
>>>> any failure, the test immediately bombs.
>>>>
>>>> Nobody seems to understand it, usually when it gets run people will
>>>> run it without injecting errors, and it "passes" so they decide
>>>> everything is ok.
>>>>
>>>> So my suggestion is to run your flakeway packet-drop harness while
>>>> running nfsidem in a huge loop (nfsidem 10000). The test is slow,
>>>> owing to the expensive operations it performs, so you'll need to
>>>> run it for a long time.
>>>>
>>>> You'll almost definitely get a failure or two, since the NFSv3
>>>> protocol is flawed by design. But you can compare the behaviors,
>>>> and even compute a likelihood. I'd love to see some actual numbers.
>>> I configured the client to disconnect after 23711 RPCs have completed.
>>> (I can re-run these with more frequent disconnects if you think that
>>> would be useful).
>>> Here's a run with the DRC modification:
>>> [cel@manet ~]$ sudo mount -o vers=3,proto=rdma,sec=sys klimt.ib:/export/tmp /mnt
>>> [cel@manet ~]$ (cd /mnt; ~/src/cthon04/special/nfsidem 100000)
>>> testing 100000 idempotencies in directory "./TEST"
>>> [cel@manet ~]$ sudo umount /mnt
>>> Here's a run with the stock v5.1 Linux server:
>>> [cel@manet ~]$ sudo mount -o vers=3,proto=rdma,sec=sys klimt.ib:/export/tmp /mnt
>>> [cel@manet ~]$ (cd /mnt; ~/src/cthon04/special/nfsidem 100000)
>>> testing 100000 idempotencies in directory "./TEST"
>>> [cel@manet ~]$
>>> This test reported no errors in either case. We can see that the
>>> disconnects did trigger retransmits:
>>> RPC statistics:
>>>    1888819 RPC requests sent, 1888581 RPC replies received (0 XIDs not found)
>>>    average backlog queue length: 119
>>
>> Ok, well, that's 1.2% error rate, which IMO could be cranked up much
>> higher for testing purposes. I'd also be sure the server was serving
>> other workloads during the same time, putting at least some pressure
>> on the DRC. The op rate of a single nfsidem test is pretty low so I
>> doubt it's ever evicting anything.
>>
>> Ideally, it would be best to
>> 1) increase the error probability
>> 2) run several concurrent nfsidem tests, on different connections
>> 3) apply some other load to the server, e.g. several cthon basics
>>
>> The idea being to actually get the needle off of zero and measure some
>> kind of difference. Otherwise it really isn't giving any information
>> apart from a slight didn't-break-it confidence. Honestly, I'm surprised
>> you couldn't evince a failure from stock. On paper, these results don't
>> actually tell us the patch is doing anything.
> 
> I boosted the disconnect injection rate to once every 11353 RPCs,
> and mounted a second share with "nosharecache", running nfsidem on
> both mounts. Both mounts are subject to disconnect injection.
> 
> With the current v5.2-rc Linux server, both nfsidem jobs fail within
> 30 seconds.
> 
> With my DRC fix applied, both jobs run to completion with no errors.
> It takes more than an hour.

Great, that definitely proves it. It's relatively low risk, and
should go in.

I think it's worth thinking through the best way to address this
across multiple transports, especially RDMA, where the port and
address behaviors may differ from TCP/UDP. Perhaps the whole idea of
using the 5-tuple should simply be set aside, after all these years.
NFSv4.1 already has!

Thanks for humoring me with the extra testing, by the way :-)

Tom.

> Here are the op metrics for one of the mounts during a run that
> completes successfully:
> 
> RPC statistics:
>    4091380 RPC requests sent, 4088143 RPC replies received (0 XIDs not found)
>    average backlog queue length: 1800
> 
> ACCESS:
>         	300395 ops (7%)         301 retrans (0%)        0 major timeouts
>          avg bytes sent per op: 132	avg bytes received per op: 120
>          backlog wait: 4.289199  RTT: 0.019821   total execute time: 4.315092 (milliseconds)
> REMOVE:
>         	300390 ops (7%)         168 retrans (0%)        0 major timeouts
>          avg bytes sent per op: 136	avg bytes received per op: 144
>          backlog wait: 2.368664  RTT: 0.070106   total execute time: 2.445148 (milliseconds)
> MKDIR:
>        	200262 ops (4%)         193 retrans (0%)        0 major timeouts
>          avg bytes sent per op: 158	avg bytes received per op: 271
>          backlog wait: 4.207034  RTT: 0.075421   total execute time: 4.289101 (milliseconds)
> RMDIR:
>        	200262 ops (4%)         100 retrans (0%)        0 major timeouts
>          avg bytes sent per op: 130	avg bytes received per op: 144
>          backlog wait: 2.050749  RTT: 0.071676   total execute time: 2.128801 (milliseconds)
> LOOKUP:
>         	194664 ops (4%)         233 retrans (0%)        0 major timeouts
>          avg bytes sent per op: 136	avg bytes received per op: 176
>          backlog wait: 5.365984  RTT: 0.020615   total execute time: 5.392769 (milliseconds)
> SETATTR:
>         	100130 ops (2%)         86 retrans (0%)         0 major timeouts
>          avg bytes sent per op: 160	avg bytes received per op: 144
>          backlog wait: 3.520603  RTT: 0.066863   total execute time: 3.594327 (milliseconds)
> WRITE:
>         	100130 ops (2%)         82 retrans (0%)         0 major timeouts
>          avg bytes sent per op: 180	avg bytes received per op: 136
>          backlog wait: 3.331249  RTT: 0.118316   total execute time: 3.459563 (milliseconds)
> CREATE:
>         	100130 ops (2%)         95 retrans (0%)         0 major timeouts
>          avg bytes sent per op: 168	avg bytes received per op: 272
>          backlog wait: 4.099451  RTT: 0.071437   total execute time: 4.177479 (milliseconds)
> SYMLINK:
>         	100130 ops (2%)         83 retrans (0%)         0 major timeouts
>          avg bytes sent per op: 188	avg bytes received per op: 271
>          backlog wait: 3.727704  RTT: 0.073534   total execute time: 3.807700 (milliseconds)
> RENAME:
>         	100130 ops (2%)         68 retrans (0%)         0 major timeouts
>          avg bytes sent per op: 180	avg bytes received per op: 260
>          backlog wait: 2.659982  RTT: 0.070518   total execute time: 2.738979 (milliseconds)
> LINK:
> 	100130 ops (2%) 	85 retrans (0%) 	0 major timeouts
> 	avg bytes sent per op: 172	avg bytes received per op: 232
> 	backlog wait: 3.676680 	RTT: 0.066773 	total execute time: 3.749785 (milliseconds)
> GETATTR:
> 	230 ops (0%) 	81 retrans (35%) 	0 major timeouts
> 	avg bytes sent per op: 170	avg bytes received per op: 112
> 	backlog wait: 1584.026087 	RTT: 0.043478 	total execute time: 1584.082609 (milliseconds)
> READDIRPLUS:
> 	10 ops (0%)
> 	avg bytes sent per op: 149	avg bytes received per op: 1726
> 	backlog wait: 0.000000 	RTT: 0.300000 	total execute time: 0.400000 (milliseconds)
> FSINFO:
> 	2 ops (0%)
> 	avg bytes sent per op: 112	avg bytes received per op: 80
> 	backlog wait: 0.000000 	RTT: 0.000000 	total execute time: 0.000000 (milliseconds)
> NULL:
> 	1 ops (0%)
> 	avg bytes sent per op: 40	avg bytes received per op: 24
> 	backlog wait: 2.000000 	RTT: 0.000000 	total execute time: 3.000000 (milliseconds)
> READLINK:
> 	1 ops (0%)
> 	avg bytes sent per op: 128	avg bytes received per op: 1140
> 	backlog wait: 0.000000 	RTT: 0.000000 	total execute time: 0.000000 (milliseconds)
> PATHCONF:
> 	1 ops (0%)
> 	avg bytes sent per op: 112	avg bytes received per op: 56
> 	backlog wait: 0.000000 	RTT: 0.000000 	total execute time: 0.000000 (milliseconds)
> 
> 
>> Tom.
>>
>>> ACCESS:
>>>          300001 ops (15%)        44 retrans (0%)         0 major timeouts
>>>          avg bytes sent per op: 132	avg bytes received per op: 120
>>>          backlog wait: 0.591118  RTT: 0.017463   total execute time: 0.614795 (milliseconds)
>>> REMOVE:
>>>         	300000 ops (15%)        40 retrans (0%)         0 major timeouts
>>>          avg bytes sent per op: 136	avg bytes received per op: 144
>>>          backlog wait: 0.531667  RTT: 0.018973   total execute time: 0.556927 (milliseconds)
>>> MKDIR:
>>>       	200000 ops (10%)        26 retrans (0%)         0 major timeouts
>>>          avg bytes sent per op: 158      avg bytes received per op: 272
>>>          backlog wait: 0.518940  RTT: 0.019755   total execute time: 0.545230 (milliseconds)
>>> RMDIR:
>>> 	200000 ops (10%)        24 retrans (0%)         0 major timeouts
>>>          avg bytes sent per op: 130	avg bytes received per op: 144
>>>          backlog wait: 0.512320  RTT: 0.018580   total execute time: 0.537095 (milliseconds)
>>> LOOKUP:
>>>         	188533 ops (9%)         21 retrans (0%)         0 major timeouts
>>>          avg bytes sent per op: 136	avg bytes received per op: 174
>>>          backlog wait: 0.455925  RTT: 0.017721   total execute time: 0.480011 (milliseconds)
>>> SETATTR:
>>>          100000 ops (5%)         11 retrans (0%)         0 major timeouts
>>>          avg bytes sent per op: 160	avg bytes received per op: 144
>>>          backlog wait: 0.371960  RTT: 0.019470   total execute time: 0.398330 (milliseconds)
>>> WRITE:
>>>        	100000 ops (5%)         9 retrans (0%)  0 major timeouts
>>>          avg bytes sent per op: 180	avg bytes received per op: 136
>>>          backlog wait: 0.399190  RTT: 0.022860   total execute time: 0.436610 (milliseconds)
>>> CREATE:
>>>         	100000 ops (5%)         9 retrans (0%)  0 major timeouts
>>>          avg bytes sent per op: 168	avg bytes received per op: 272
>>>          backlog wait: 0.365290  RTT: 0.019560   total execute time: 0.391140 (milliseconds)
>>> SYMLINK:
>>>       	100000 ops (5%)         18 retrans (0%)         0 major timeouts
>>>          avg bytes sent per op: 188	avg bytes received per op: 272
>>>          backlog wait: 0.750470  RTT: 0.020150   total execute time: 0.786410 (milliseconds)
>>> RENAME:
>>>       	100000 ops (5%)         14 retrans (0%)         0 major timeouts
>>>          avg bytes sent per op: 180	avg bytes received per op: 260
>>>          backlog wait: 0.461650  RTT: 0.020710   total execute time: 0.489670 (milliseconds)
>>> --
>>> Chuck Lever
> 
> --
> Chuck Lever
> 
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC] svcrdma: Ignore source port when computing DRC hash
  2019-06-10 22:13             ` Tom Talpey
@ 2019-06-11  0:07               ` Tom Talpey
  2019-06-11 14:25                 ` Chuck Lever
  2019-06-11 14:23               ` Chuck Lever
  1 sibling, 1 reply; 19+ messages in thread
From: Tom Talpey @ 2019-06-11  0:07 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-rdma, Linux NFS Mailing List

On 6/10/2019 6:13 PM, Tom Talpey wrote:
> On 6/10/2019 5:57 PM, Chuck Lever wrote:
>>
>>
>>> On Jun 10, 2019, at 3:14 PM, Tom Talpey <tom@talpey.com> wrote:
>>>
>>> On 6/10/2019 1:50 PM, Chuck Lever wrote:
>>>> Hi Tom-
>>>>> On Jun 10, 2019, at 10:50 AM, Tom Talpey <tom@talpey.com> wrote:
>>>>>
>>>>> On 6/5/2019 1:25 PM, Chuck Lever wrote:
>>>>>> Hi Tom-
>>>>>>> On Jun 5, 2019, at 12:43 PM, Tom Talpey <tom@talpey.com> wrote:
>>>>>>>
>>>>>>> On 6/5/2019 8:15 AM, Chuck Lever wrote:
>>>>>>>> The DRC is not working at all after an RPC/RDMA transport 
>>>>>>>> reconnect.
>>>>>>>> The problem is that the new connection uses a different source 
>>>>>>>> port,
>>>>>>>> which defeats DRC hash.
>>>>>>>>
>>>>>>>> An NFS/RDMA client's source port is meaningless for RDMA 
>>>>>>>> transports.
>>>>>>>> The transport layer typically sets the source port value on the
>>>>>>>> connection to a random ephemeral port. The server already 
>>>>>>>> ignores it
>>>>>>>> for the "secure port" check. See commit 16e4d93f6de7 ("NFSD: Ignore
>>>>>>>> client's source port on RDMA transports").
>>>>>>>
>>>>>>> Where does the entropy come from, then, for the server to not
>>>>>>> match other requests from other mount points on this same client?
>>>>>> The first ~200 bytes of each RPC Call message.
>>>>>> [ Note that this has some fun ramifications for calls with small
>>>>>> RPC headers that use Read chunks. ]
>>>>>
>>>>> Ok, good to know. I forgot that the Linux server implemented this.
>>>>> I have some concerns about it, honestly, and it's important to remember
>>>>> that it's not the same on all servers. But for the problem you're
>>>>> fixing, it's ok I guess and certainly better than today. Still, the
>>>>> errors are going to be completely silent, and can lead to data being
>>>>> corrupted. Well, welcome to the world of NFSv3.
>>>> I don't see another option.
>>>> Some regard this checksum as more robust than using the client's
>>>> IP source port. After all, the same argument can be made that
>>>> the server cannot depend on clients to reuse their source port.
>>>> That is simply a convention that many clients adopted before
>>>> servers used a stronger DRC hash mechanism.
>>>>>>> And since RDMA is capable of
>>>>>>> such high IOPS, the likelihood seems rather high.
>>>>>> Only when the server's durable storage is slow enough to cause
>>>>>> some RPC requests to have extremely high latency.
>>>>>> And, most clients use an atomic counter for their XIDs, so they
>>>>>> are also likely to wrap that counter over some long-pending RPC
>>>>>> request.
>>>>>> The only real answer here is NFSv4 sessions.
>>>>>>> Missing the cache
>>>>>>> might actually be safer than hitting, in this case.
>>>>>> Remember that _any_ retransmit on RPC/RDMA requires a fresh
>>>>>> connection, that includes NFSv3, to reset credit accounting
>>>>>> due to the lost half of the RPC Call/Reply pair.
>>>>>> I can very quickly reproduce bad (non-deterministic) behavior
>>>>>> by running a software build on an NFSv3 on RDMA mount point
>>>>>> with disconnect injection. If the DRC issue is addressed, the
>>>>>> software build runs to completion.
>>>>>
>>>>> Ok, good. But I have a better test.
>>>>>
>>>>> In the Connectathon suite, there's a "Special" test called "nfsidem".
>>>>> I wrote this test in, like, 1989 so I remember it :-)
>>>>>
>>>>> This test performs all the non-idempotent NFSv3 operations in a loop,
>>>>> and each loop element depends on the previous one, so if there's
>>>>> any failure, the test immediately bombs.
>>>>>
>>>>> Nobody seems to understand it, usually when it gets run people will
>>>>> run it without injecting errors, and it "passes" so they decide
>>>>> everything is ok.
>>>>>
>>>>> So my suggestion is to run your flakeway packet-drop harness while
>>>>> running nfsidem in a huge loop (nfsidem 10000). The test is slow,
>>>>> owing to the expensive operations it performs, so you'll need to
>>>>> run it for a long time.
>>>>>
>>>>> You'll almost definitely get a failure or two, since the NFSv3
>>>>> protocol is flawed by design. But you can compare the behaviors,
>>>>> and even compute a likelihood. I'd love to see some actual numbers.
>>>> I configured the client to disconnect after 23711 RPCs have completed.
>>>> (I can re-run these with more frequent disconnects if you think that
>>>> would be useful).
>>>> Here's a run with the DRC modification:
>>>> [cel@manet ~]$ sudo mount -o vers=3,proto=rdma,sec=sys 
>>>> klimt.ib:/export/tmp /mnt
>>>> [cel@manet ~]$ (cd /mnt; ~/src/cthon04/special/nfsidem 100000)
>>>> testing 100000 idempotencies in directory "./TEST"
>>>> [cel@manet ~]$ sudo umount /mnt
>>>> Here's a run with the stock v5.1 Linux server:
>>>> [cel@manet ~]$ sudo mount -o vers=3,proto=rdma,sec=sys 
>>>> klimt.ib:/export/tmp /mnt
>>>> [cel@manet ~]$ (cd /mnt; ~/src/cthon04/special/nfsidem 100000)
>>>> testing 100000 idempotencies in directory "./TEST"
>>>> [cel@manet ~]$
>>>> This test reported no errors in either case. We can see that the
>>>> disconnects did trigger retransmits:
>>>> RPC statistics:
>>>>    1888819 RPC requests sent, 1888581 RPC replies received (0 XIDs 
>>>> not found)
>>>>    average backlog queue length: 119
>>>
>>> Ok, well, that's 1.2% error rate, which IMO could be cranked up much
>>> higher for testing purposes. I'd also be sure the server was serving
>>> other workloads during the same time, putting at least some pressure
>>> on the DRC. The op rate of a single nfsidem test is pretty low so I
>>> doubt it's ever evicting anything.
>>>
>>> Ideally, it would be best to
>>> 1) increase the error probability
>>> 2) run several concurrent nfsidem tests, on different connections
>>> 3) apply some other load to the server, e.g. several cthon basics
>>>
>>> The idea being to actually get the needle off of zero and measure some
>>> kind of difference. Otherwise it really isn't giving any information
>>> apart from a slight didn't-break-it confidence. Honestly, I'm surprised
>>> you couldn't evince a failure from stock. On paper, these results don't
>>> actually tell us the patch is doing anything.
>>
>> I boosted the disconnect injection rate to once every 11353 RPCs,
>> and mounted a second share with "nosharecache", running nfsidem on
>> both mounts. Both mounts are subject to disconnect injection.
>>
>> With the current v5.2-rc Linux server, both nfsidem jobs fail within
>> 30 seconds.
>>
>> With my DRC fix applied, both jobs run to completion with no errors.
>> It takes more than an hour.
> 
> Great, that definitely proves it. It's relatively low risk, and
> should go in.
> 
> I think it's worth thinking through the best way to address this
> across multiple transports, especially RDMA, where the port and
> address behaviors may differ from TCP/UDP. Perhaps the whole idea of
> using the 5-tuple should simply be set aside, after all these years.
> NFSv4.1 already has!
> 
> Thanks for humoring me with the extra testing, by the way :-)
> 
> Tom.
> 
>> Here are the op metrics for one of the mounts during a run that
>> completes successfully:
>>
>> RPC statistics:
>>    4091380 RPC requests sent, 4088143 RPC replies received (0 XIDs not found)
>>    average backlog queue length: 1800
>>
>> ACCESS:
>>         300395 ops (7%)         301 retrans (0%)        0 major timeouts
>>         avg bytes sent per op: 132    avg bytes received per op: 120
>>         backlog wait: 4.289199  RTT: 0.019821   total execute time: 4.315092 (milliseconds)
>> REMOVE:
>>         300390 ops (7%)         168 retrans (0%)        0 major timeouts
>>         avg bytes sent per op: 136    avg bytes received per op: 144
>>         backlog wait: 2.368664  RTT: 0.070106   total execute time: 2.445148 (milliseconds)
>> MKDIR:
>>         200262 ops (4%)         193 retrans (0%)        0 major timeouts
>>         avg bytes sent per op: 158    avg bytes received per op: 271
>>         backlog wait: 4.207034  RTT: 0.075421   total execute time: 4.289101 (milliseconds)
>> RMDIR:
>>         200262 ops (4%)         100 retrans (0%)        0 major timeouts
>>         avg bytes sent per op: 130    avg bytes received per op: 144
>>         backlog wait: 2.050749  RTT: 0.071676   total execute time: 2.128801 (milliseconds)
>> LOOKUP:
>>         194664 ops (4%)         233 retrans (0%)        0 major timeouts
>>         avg bytes sent per op: 136    avg bytes received per op: 176
>>         backlog wait: 5.365984  RTT: 0.020615   total execute time: 5.392769 (milliseconds)
>> SETATTR:
>>         100130 ops (2%)         86 retrans (0%)         0 major timeouts
>>         avg bytes sent per op: 160    avg bytes received per op: 144
>>         backlog wait: 3.520603  RTT: 0.066863   total execute time: 3.594327 (milliseconds)
>> WRITE:
>>         100130 ops (2%)         82 retrans (0%)         0 major timeouts
>>         avg bytes sent per op: 180    avg bytes received per op: 136
>>         backlog wait: 3.331249  RTT: 0.118316   total execute time: 3.459563 (milliseconds)
>> CREATE:
>>         100130 ops (2%)         95 retrans (0%)         0 major timeouts
>>         avg bytes sent per op: 168    avg bytes received per op: 272
>>         backlog wait: 4.099451  RTT: 0.071437   total execute time: 4.177479 (milliseconds)
>> SYMLINK:
>>         100130 ops (2%)         83 retrans (0%)         0 major timeouts
>>         avg bytes sent per op: 188    avg bytes received per op: 271
>>         backlog wait: 3.727704  RTT: 0.073534   total execute time: 3.807700 (milliseconds)
>> RENAME:
>>         100130 ops (2%)         68 retrans (0%)         0 major timeouts
>>         avg bytes sent per op: 180    avg bytes received per op: 260
>>         backlog wait: 2.659982  RTT: 0.070518   total execute time: 2.738979 (milliseconds)
>> LINK:
>>         100130 ops (2%)         85 retrans (0%)         0 major timeouts
>>         avg bytes sent per op: 172    avg bytes received per op: 232
>>         backlog wait: 3.676680  RTT: 0.066773   total execute time: 3.749785 (milliseconds)
>> GETATTR:
>>         230 ops (0%)            81 retrans (35%)        0 major timeouts
>>         avg bytes sent per op: 170    avg bytes received per op: 112
>>         backlog wait: 1584.026087  RTT: 0.043478   total execute time: 1584.082609 (milliseconds)

By the way, that's a *nasty* tail latency on at least one getattr.
What's up with that??

Tom.

>> READDIRPLUS:
>>         10 ops (0%)
>>         avg bytes sent per op: 149    avg bytes received per op: 1726
>>         backlog wait: 0.000000  RTT: 0.300000   total execute time: 0.400000 (milliseconds)
>> FSINFO:
>>         2 ops (0%)
>>         avg bytes sent per op: 112    avg bytes received per op: 80
>>         backlog wait: 0.000000  RTT: 0.000000   total execute time: 0.000000 (milliseconds)
>> NULL:
>>         1 ops (0%)
>>         avg bytes sent per op: 40    avg bytes received per op: 24
>>         backlog wait: 2.000000  RTT: 0.000000   total execute time: 3.000000 (milliseconds)
>> READLINK:
>>         1 ops (0%)
>>         avg bytes sent per op: 128    avg bytes received per op: 1140
>>         backlog wait: 0.000000  RTT: 0.000000   total execute time: 0.000000 (milliseconds)
>> PATHCONF:
>>         1 ops (0%)
>>         avg bytes sent per op: 112    avg bytes received per op: 56
>>         backlog wait: 0.000000  RTT: 0.000000   total execute time: 0.000000 (milliseconds)
>>
>>
>>> Tom.
>>>
>>>> ACCESS:
>>>>         300001 ops (15%)        44 retrans (0%)         0 major timeouts
>>>>         avg bytes sent per op: 132    avg bytes received per op: 120
>>>>         backlog wait: 0.591118  RTT: 0.017463   total execute time: 0.614795 (milliseconds)
>>>> REMOVE:
>>>>         300000 ops (15%)        40 retrans (0%)         0 major timeouts
>>>>         avg bytes sent per op: 136    avg bytes received per op: 144
>>>>         backlog wait: 0.531667  RTT: 0.018973   total execute time: 0.556927 (milliseconds)
>>>> MKDIR:
>>>>         200000 ops (10%)        26 retrans (0%)         0 major timeouts
>>>>         avg bytes sent per op: 158    avg bytes received per op: 272
>>>>         backlog wait: 0.518940  RTT: 0.019755   total execute time: 0.545230 (milliseconds)
>>>> RMDIR:
>>>>         200000 ops (10%)        24 retrans (0%)         0 major timeouts
>>>>         avg bytes sent per op: 130    avg bytes received per op: 144
>>>>         backlog wait: 0.512320  RTT: 0.018580   total execute time: 0.537095 (milliseconds)
>>>> LOOKUP:
>>>>         188533 ops (9%)         21 retrans (0%)         0 major timeouts
>>>>         avg bytes sent per op: 136    avg bytes received per op: 174
>>>>         backlog wait: 0.455925  RTT: 0.017721   total execute time: 0.480011 (milliseconds)
>>>> SETATTR:
>>>>         100000 ops (5%)         11 retrans (0%)         0 major timeouts
>>>>         avg bytes sent per op: 160    avg bytes received per op: 144
>>>>         backlog wait: 0.371960  RTT: 0.019470   total execute time: 0.398330 (milliseconds)
>>>> WRITE:
>>>>         100000 ops (5%)         9 retrans (0%)          0 major timeouts
>>>>         avg bytes sent per op: 180    avg bytes received per op: 136
>>>>         backlog wait: 0.399190  RTT: 0.022860   total execute time: 0.436610 (milliseconds)
>>>> CREATE:
>>>>         100000 ops (5%)         9 retrans (0%)          0 major timeouts
>>>>         avg bytes sent per op: 168    avg bytes received per op: 272
>>>>         backlog wait: 0.365290  RTT: 0.019560   total execute time: 0.391140 (milliseconds)
>>>> SYMLINK:
>>>>         100000 ops (5%)         18 retrans (0%)         0 major timeouts
>>>>         avg bytes sent per op: 188    avg bytes received per op: 272
>>>>         backlog wait: 0.750470  RTT: 0.020150   total execute time: 0.786410 (milliseconds)
>>>> RENAME:
>>>>         100000 ops (5%)         14 retrans (0%)         0 major timeouts
>>>>         avg bytes sent per op: 180    avg bytes received per op: 260
>>>>         backlog wait: 0.461650  RTT: 0.020710   total execute time: 0.489670 (milliseconds)
>>>> -- 
>>>> Chuck Lever
>>
>> -- 
>> Chuck Lever
>>
>>
>>
>>
>>
> 
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC] svcrdma: Ignore source port when computing DRC hash
  2019-06-10 22:13             ` Tom Talpey
  2019-06-11  0:07               ` Tom Talpey
@ 2019-06-11 14:23               ` Chuck Lever
  1 sibling, 0 replies; 19+ messages in thread
From: Chuck Lever @ 2019-06-11 14:23 UTC (permalink / raw)
  To: Tom Talpey; +Cc: linux-rdma, Linux NFS Mailing List



> On Jun 10, 2019, at 6:13 PM, Tom Talpey <tom@talpey.com> wrote:
> 
> On 6/10/2019 5:57 PM, Chuck Lever wrote:
>>> On Jun 10, 2019, at 3:14 PM, Tom Talpey <tom@talpey.com> wrote:
>>> 
>>> On 6/10/2019 1:50 PM, Chuck Lever wrote:
>>>> Hi Tom-
>>>>> On Jun 10, 2019, at 10:50 AM, Tom Talpey <tom@talpey.com> wrote:
>>>>> 
>>>>> On 6/5/2019 1:25 PM, Chuck Lever wrote:
>>>>>> Hi Tom-
>>>>>>> On Jun 5, 2019, at 12:43 PM, Tom Talpey <tom@talpey.com> wrote:
>>>>>>> 
>>>>>>> On 6/5/2019 8:15 AM, Chuck Lever wrote:
>>>>>>>> The DRC is not working at all after an RPC/RDMA transport reconnect.
>>>>>>>> The problem is that the new connection uses a different source port,
>>>>>>>> which defeats DRC hash.
>>>>>>>> 
>>>>>>>> An NFS/RDMA client's source port is meaningless for RDMA transports.
>>>>>>>> The transport layer typically sets the source port value on the
>>>>>>>> connection to a random ephemeral port. The server already ignores it
>>>>>>>> for the "secure port" check. See commit 16e4d93f6de7 ("NFSD: Ignore
>>>>>>>> client's source port on RDMA transports").
>>>>>>> 
>>>>>>> Where does the entropy come from, then, for the server to not
>>>>>>> match other requests from other mount points on this same client?
>>>>>> The first ~200 bytes of each RPC Call message.
>>>>>> [ Note that this has some fun ramifications for calls with small
>>>>>> RPC headers that use Read chunks. ]
>>>>> 
>>>>> Ok, good to know. I forgot that the Linux server implemented this.
>>>>> I have some concerns about it, honestly, and it's important to remember
>>>>> that it's not the same on all servers. But for the problem you're
>>>>> fixing, it's ok I guess and certainly better than today. Still, the
>>>>> errors are going to be completely silent, and can lead to data being
>>>>> corrupted. Well, welcome to the world of NFSv3.
>>>> I don't see another option.
>>>> Some regard this checksum as more robust than using the client's
>>>> IP source port. After all, the same argument can be made that
>>>> the server cannot depend on clients to reuse their source port.
>>>> That is simply a convention that many clients adopted before
>>>> servers used a stronger DRC hash mechanism.
>>>>>>> And since RDMA is capable of
>>>>>>> such high IOPS, the likelihood seems rather high.
>>>>>> Only when the server's durable storage is slow enough to cause
>>>>>> some RPC requests to have extremely high latency.
>>>>>> And, most clients use an atomic counter for their XIDs, so they
>>>>>> are also likely to wrap that counter over some long-pending RPC
>>>>>> request.
>>>>>> The only real answer here is NFSv4 sessions.
>>>>>>> Missing the cache
>>>>>>> might actually be safer than hitting, in this case.
>>>>>> Remember that _any_ retransmit on RPC/RDMA requires a fresh
>>>>>> connection, that includes NFSv3, to reset credit accounting
>>>>>> due to the lost half of the RPC Call/Reply pair.
>>>>>> I can very quickly reproduce bad (non-deterministic) behavior
>>>>>> by running a software build on an NFSv3 on RDMA mount point
>>>>>> with disconnect injection. If the DRC issue is addressed, the
>>>>>> software build runs to completion.
>>>>> 
>>>>> Ok, good. But I have a better test.
>>>>> 
>>>>> In the Connectathon suite, there's a "Special" test called "nfsidem".
>>>>> I wrote this test in, like, 1989 so I remember it :-)
>>>>> 
>>>>> This test performs all the non-idempotent NFSv3 operations in a loop,
>>>>> and each loop element depends on the previous one, so if there's
>>>>> any failure, the test immediately bombs.
>>>>> 
>>>>> Nobody seems to understand it; usually when it gets run, people
>>>>> run it without injecting errors, and it "passes", so they decide
>>>>> everything is ok.
>>>>> 
>>>>> So my suggestion is to run your flakeway packet-drop harness while
>>>>> running nfsidem in a huge loop (nfsidem 10000). The test is slow,
>>>>> owing to the expensive operations it performs, so you'll need to
>>>>> run it for a long time.
>>>>> 
>>>>> You'll almost definitely get a failure or two, since the NFSv3
>>>>> protocol is flawed by design. But you can compare the behaviors,
>>>>> and even compute a likelihood. I'd love to see some actual numbers.
>>>> I configured the client to disconnect after 23711 RPCs have completed.
>>>> (I can re-run these with more frequent disconnects if you think that
>>>> would be useful).
>>>> Here's a run with the DRC modification:
>>>> [cel@manet ~]$ sudo mount -o vers=3,proto=rdma,sec=sys klimt.ib:/export/tmp /mnt
>>>> [cel@manet ~]$ (cd /mnt; ~/src/cthon04/special/nfsidem 100000)
>>>> testing 100000 idempotencies in directory "./TEST"
>>>> [cel@manet ~]$ sudo umount /mnt
>>>> Here's a run with the stock v5.1 Linux server:
>>>> [cel@manet ~]$ sudo mount -o vers=3,proto=rdma,sec=sys klimt.ib:/export/tmp /mnt
>>>> [cel@manet ~]$ (cd /mnt; ~/src/cthon04/special/nfsidem 100000)
>>>> testing 100000 idempotencies in directory "./TEST"
>>>> [cel@manet ~]$
>>>> This test reported no errors in either case. We can see that the
>>>> disconnects did trigger retransmits:
>>>> RPC statistics:
>>>>   1888819 RPC requests sent, 1888581 RPC replies received (0 XIDs not found)
>>>>   average backlog queue length: 119
>>> 
>>> Ok, well, that's 1.2% error rate, which IMO could be cranked up much
>>> higher for testing purposes. I'd also be sure the server was serving
>>> other workloads during the same time, putting at least some pressure
>>> on the DRC. The op rate of a single nfsidem test is pretty low so I
>>> doubt it's ever evicting anything.
>>> 
>>> Ideally, it would be best to
>>> 1) increase the error probability
>>> 2) run several concurrent nfsidem tests, on different connections
>>> 3) apply some other load to the server, e.g. several cthon basics
>>> 
>>> The idea being to actually get the needle off of zero and measure some
>>> kind of difference. Otherwise it really isn't giving any information
>>> apart from a slight didn't-break-it confidence. Honestly, I'm surprised
>>> you couldn't evince a failure from stock. On paper, these results don't
>>> actually tell us the patch is doing anything.
>> I boosted the disconnect injection rate to once every 11353 RPCs,
>> and mounted a second share with "nosharecache", running nfsidem on
>> both mounts. Both mounts are subject to disconnect injection.
>> With the current v5.2-rc Linux server, both nfsidem jobs fail within
>> 30 seconds.
>> With my DRC fix applied, both jobs run to completion with no errors.
>> It takes more than an hour.
> 
> Great, that definitely proves it. It's relatively low risk, and
> should go in.
> 
> I think it's worth thinking through the best way to address this
> across multiple transports, especially RDMA, where the port and
> address behaviors may differ from TCP/UDP. Perhaps the whole idea
> of using the 5-tuple should simply be set aside, after all these
> years. NFSv4.1 already has!
> 
> Thanks for humoring me with the extra testing, by the way :-)

Sho 'nuf. Worth it to ensure we are all on the same page now.

I'll send this to Bruce with some clarifications to the description.


> Tom.
> 
>> Here are the op metrics for one of the mounts during a run that
>> completes successfully:
>> RPC statistics:
>>   4091380 RPC requests sent, 4088143 RPC replies received (0 XIDs not found)
>>   average backlog queue length: 1800
>> ACCESS:
>>        	300395 ops (7%)         301 retrans (0%)        0 major timeouts
>>         avg bytes sent per op: 132	avg bytes received per op: 120
>>         backlog wait: 4.289199  RTT: 0.019821   total execute time: 4.315092 (milliseconds)
>> REMOVE:
>>        	300390 ops (7%)         168 retrans (0%)        0 major timeouts
>>         avg bytes sent per op: 136	avg bytes received per op: 144
>>         backlog wait: 2.368664  RTT: 0.070106   total execute time: 2.445148 (milliseconds)
>> MKDIR:
>>       	200262 ops (4%)         193 retrans (0%)        0 major timeouts
>>         avg bytes sent per op: 158	avg bytes received per op: 271
>>         backlog wait: 4.207034  RTT: 0.075421   total execute time: 4.289101 (milliseconds)
>> RMDIR:
>>       	200262 ops (4%)         100 retrans (0%)        0 major timeouts
>>         avg bytes sent per op: 130	avg bytes received per op: 144
>>         backlog wait: 2.050749  RTT: 0.071676   total execute time: 2.128801 (milliseconds)
>> LOOKUP:
>>        	194664 ops (4%)         233 retrans (0%)        0 major timeouts
>>         avg bytes sent per op: 136	avg bytes received per op: 176
>>         backlog wait: 5.365984  RTT: 0.020615   total execute time: 5.392769 (milliseconds)
>> SETATTR:
>>        	100130 ops (2%)         86 retrans (0%)         0 major timeouts
>>         avg bytes sent per op: 160	avg bytes received per op: 144
>>         backlog wait: 3.520603  RTT: 0.066863   total execute time: 3.594327 (milliseconds)
>> WRITE:
>>        	100130 ops (2%)         82 retrans (0%)         0 major timeouts
>>         avg bytes sent per op: 180	avg bytes received per op: 136
>>         backlog wait: 3.331249  RTT: 0.118316   total execute time: 3.459563 (milliseconds)
>> CREATE:
>>        	100130 ops (2%)         95 retrans (0%)         0 major timeouts
>>         avg bytes sent per op: 168	avg bytes received per op: 272
>>         backlog wait: 4.099451  RTT: 0.071437   total execute time: 4.177479 (milliseconds)
>> SYMLINK:
>>        	100130 ops (2%)         83 retrans (0%)         0 major timeouts
>>         avg bytes sent per op: 188	avg bytes received per op: 271
>>         backlog wait: 3.727704  RTT: 0.073534   total execute time: 3.807700 (milliseconds)
>> RENAME:
>>        	100130 ops (2%)         68 retrans (0%)         0 major timeouts
>>         avg bytes sent per op: 180	avg bytes received per op: 260
>>         backlog wait: 2.659982  RTT: 0.070518   total execute time: 2.738979 (milliseconds)
>> LINK:
>> 	100130 ops (2%) 	85 retrans (0%) 	0 major timeouts
>> 	avg bytes sent per op: 172	avg bytes received per op: 232
>> 	backlog wait: 3.676680 	RTT: 0.066773 	total execute time: 3.749785 (milliseconds)
>> GETATTR:
>> 	230 ops (0%) 	81 retrans (35%) 	0 major timeouts
>> 	avg bytes sent per op: 170	avg bytes received per op: 112
>> 	backlog wait: 1584.026087 	RTT: 0.043478 	total execute time: 1584.082609 (milliseconds)
>> READDIRPLUS:
>> 	10 ops (0%)
>> 	avg bytes sent per op: 149	avg bytes received per op: 1726
>> 	backlog wait: 0.000000 	RTT: 0.300000 	total execute time: 0.400000 (milliseconds)
>> FSINFO:
>> 	2 ops (0%)
>> 	avg bytes sent per op: 112	avg bytes received per op: 80
>> 	backlog wait: 0.000000 	RTT: 0.000000 	total execute time: 0.000000 (milliseconds)
>> NULL:
>> 	1 ops (0%)
>> 	avg bytes sent per op: 40	avg bytes received per op: 24
>> 	backlog wait: 2.000000 	RTT: 0.000000 	total execute time: 3.000000 (milliseconds)
>> READLINK:
>> 	1 ops (0%)
>> 	avg bytes sent per op: 128	avg bytes received per op: 1140
>> 	backlog wait: 0.000000 	RTT: 0.000000 	total execute time: 0.000000 (milliseconds)
>> PATHCONF:
>> 	1 ops (0%)
>> 	avg bytes sent per op: 112	avg bytes received per op: 56
>> 	backlog wait: 0.000000 	RTT: 0.000000 	total execute time: 0.000000 (milliseconds)
>>> Tom.
>>> 
>>>> ACCESS:
>>>>         300001 ops (15%)        44 retrans (0%)         0 major timeouts
>>>>         avg bytes sent per op: 132	avg bytes received per op: 120
>>>>         backlog wait: 0.591118  RTT: 0.017463   total execute time: 0.614795 (milliseconds)
>>>> REMOVE:
>>>>        	300000 ops (15%)        40 retrans (0%)         0 major timeouts
>>>>         avg bytes sent per op: 136	avg bytes received per op: 144
>>>>         backlog wait: 0.531667  RTT: 0.018973   total execute time: 0.556927 (milliseconds)
>>>> MKDIR:
>>>>      	200000 ops (10%)        26 retrans (0%)         0 major timeouts
>>>>         avg bytes sent per op: 158      avg bytes received per op: 272
>>>>         backlog wait: 0.518940  RTT: 0.019755   total execute time: 0.545230 (milliseconds)
>>>> RMDIR:
>>>> 	200000 ops (10%)        24 retrans (0%)         0 major timeouts
>>>>         avg bytes sent per op: 130	avg bytes received per op: 144
>>>>         backlog wait: 0.512320  RTT: 0.018580   total execute time: 0.537095 (milliseconds)
>>>> LOOKUP:
>>>>        	188533 ops (9%)         21 retrans (0%)         0 major timeouts
>>>>         avg bytes sent per op: 136	avg bytes received per op: 174
>>>>         backlog wait: 0.455925  RTT: 0.017721   total execute time: 0.480011 (milliseconds)
>>>> SETATTR:
>>>>         100000 ops (5%)         11 retrans (0%)         0 major timeouts
>>>>         avg bytes sent per op: 160	avg bytes received per op: 144
>>>>         backlog wait: 0.371960  RTT: 0.019470   total execute time: 0.398330 (milliseconds)
>>>> WRITE:
>>>>       	100000 ops (5%)         9 retrans (0%)  0 major timeouts
>>>>         avg bytes sent per op: 180	avg bytes received per op: 136
>>>>         backlog wait: 0.399190  RTT: 0.022860   total execute time: 0.436610 (milliseconds)
>>>> CREATE:
>>>>        	100000 ops (5%)         9 retrans (0%)  0 major timeouts
>>>>         avg bytes sent per op: 168	avg bytes received per op: 272
>>>>         backlog wait: 0.365290  RTT: 0.019560   total execute time: 0.391140 (milliseconds)
>>>> SYMLINK:
>>>>      	100000 ops (5%)         18 retrans (0%)         0 major timeouts
>>>>         avg bytes sent per op: 188	avg bytes received per op: 272
>>>>         backlog wait: 0.750470  RTT: 0.020150   total execute time: 0.786410 (milliseconds)
>>>> RENAME:
>>>>      	100000 ops (5%)         14 retrans (0%)         0 major timeouts
>>>>         avg bytes sent per op: 180	avg bytes received per op: 260
>>>>         backlog wait: 0.461650  RTT: 0.020710   total execute time: 0.489670 (milliseconds)
>>>> --
>>>> Chuck Lever
>> --
>> Chuck Lever

--
Chuck Lever




^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC] svcrdma: Ignore source port when computing DRC hash
  2019-06-11  0:07               ` Tom Talpey
@ 2019-06-11 14:25                 ` Chuck Lever
  0 siblings, 0 replies; 19+ messages in thread
From: Chuck Lever @ 2019-06-11 14:25 UTC (permalink / raw)
  To: Tom Talpey; +Cc: linux-rdma, Linux NFS Mailing List



> On Jun 10, 2019, at 8:07 PM, Tom Talpey <tom@talpey.com> wrote:
> 
> On 6/10/2019 6:13 PM, Tom Talpey wrote:
>> On 6/10/2019 5:57 PM, Chuck Lever wrote:
>>> 
>>> 
>>>> On Jun 10, 2019, at 3:14 PM, Tom Talpey <tom@talpey.com> wrote:
>>>> 
>>>> On 6/10/2019 1:50 PM, Chuck Lever wrote:
>>>>> Hi Tom-
>>>>>> On Jun 10, 2019, at 10:50 AM, Tom Talpey <tom@talpey.com> wrote:
>>>>>> 
>>>>>> On 6/5/2019 1:25 PM, Chuck Lever wrote:
>>>>>>> Hi Tom-
>>>>>>>> On Jun 5, 2019, at 12:43 PM, Tom Talpey <tom@talpey.com> wrote:
>>>>>>>> 
>>>>>>>> On 6/5/2019 8:15 AM, Chuck Lever wrote:
>>>>>>>>> The DRC is not working at all after an RPC/RDMA transport reconnect.
>>>>>>>>> The problem is that the new connection uses a different source port,
>>>>>>>>> which defeats DRC hash.
>>>>>>>>> 
>>>>>>>>> An NFS/RDMA client's source port is meaningless for RDMA transports.
>>>>>>>>> The transport layer typically sets the source port value on the
>>>>>>>>> connection to a random ephemeral port. The server already ignores it
>>>>>>>>> for the "secure port" check. See commit 16e4d93f6de7 ("NFSD: Ignore
>>>>>>>>> client's source port on RDMA transports").
>>>>>>>> 
>>>>>>>> Where does the entropy come from, then, for the server to not
>>>>>>>> match other requests from other mount points on this same client?
>>>>>>> The first ~200 bytes of each RPC Call message.
>>>>>>> [ Note that this has some fun ramifications for calls with small
>>>>>>> RPC headers that use Read chunks. ]
>>>>>> 
>>>>>> Ok, good to know. I forgot that the Linux server implemented this.
>>>>>> I have some concerns about it, honestly, and it's important to remember
>>>>>> that it's not the same on all servers. But for the problem you're
>>>>>> fixing, it's ok I guess and certainly better than today. Still, the
>>>>>> errors are going to be completely silent, and can lead to data being
>>>>>> corrupted. Well, welcome to the world of NFSv3.
>>>>> I don't see another option.
>>>>> Some regard this checksum as more robust than using the client's
>>>>> IP source port. After all, the same argument can be made that
>>>>> the server cannot depend on clients to reuse their source port.
>>>>> That is simply a convention that many clients adopted before
>>>>> servers used a stronger DRC hash mechanism.
>>>>>>>> And since RDMA is capable of
>>>>>>>> such high IOPS, the likelihood seems rather high.
>>>>>>> Only when the server's durable storage is slow enough to cause
>>>>>>> some RPC requests to have extremely high latency.
>>>>>>> And, most clients use an atomic counter for their XIDs, so they
>>>>>>> are also likely to wrap that counter over some long-pending RPC
>>>>>>> request.
>>>>>>> The only real answer here is NFSv4 sessions.
>>>>>>>> Missing the cache
>>>>>>>> might actually be safer than hitting, in this case.
>>>>>>> Remember that _any_ retransmit on RPC/RDMA requires a fresh
>>>>>>> connection, that includes NFSv3, to reset credit accounting
>>>>>>> due to the lost half of the RPC Call/Reply pair.
>>>>>>> I can very quickly reproduce bad (non-deterministic) behavior
>>>>>>> by running a software build on an NFSv3 on RDMA mount point
>>>>>>> with disconnect injection. If the DRC issue is addressed, the
>>>>>>> software build runs to completion.
>>>>>> 
>>>>>> Ok, good. But I have a better test.
>>>>>> 
>>>>>> In the Connectathon suite, there's a "Special" test called "nfsidem".
>>>>>> I wrote this test in, like, 1989 so I remember it :-)
>>>>>> 
>>>>>> This test performs all the non-idempotent NFSv3 operations in a loop,
>>>>>> and each loop element depends on the previous one, so if there's
>>>>>> any failure, the test immediately bombs.
>>>>>> 
>>>>>> Nobody seems to understand it; usually when it gets run, people
>>>>>> run it without injecting errors, and it "passes", so they decide
>>>>>> everything is ok.
>>>>>> 
>>>>>> So my suggestion is to run your flakeway packet-drop harness while
>>>>>> running nfsidem in a huge loop (nfsidem 10000). The test is slow,
>>>>>> owing to the expensive operations it performs, so you'll need to
>>>>>> run it for a long time.
>>>>>> 
>>>>>> You'll almost definitely get a failure or two, since the NFSv3
>>>>>> protocol is flawed by design. But you can compare the behaviors,
>>>>>> and even compute a likelihood. I'd love to see some actual numbers.
>>>>> I configured the client to disconnect after 23711 RPCs have completed.
>>>>> (I can re-run these with more frequent disconnects if you think that
>>>>> would be useful).
>>>>> Here's a run with the DRC modification:
>>>>> [cel@manet ~]$ sudo mount -o vers=3,proto=rdma,sec=sys klimt.ib:/export/tmp /mnt
>>>>> [cel@manet ~]$ (cd /mnt; ~/src/cthon04/special/nfsidem 100000)
>>>>> testing 100000 idempotencies in directory "./TEST"
>>>>> [cel@manet ~]$ sudo umount /mnt
>>>>> Here's a run with the stock v5.1 Linux server:
>>>>> [cel@manet ~]$ sudo mount -o vers=3,proto=rdma,sec=sys klimt.ib:/export/tmp /mnt
>>>>> [cel@manet ~]$ (cd /mnt; ~/src/cthon04/special/nfsidem 100000)
>>>>> testing 100000 idempotencies in directory "./TEST"
>>>>> [cel@manet ~]$
>>>>> This test reported no errors in either case. We can see that the
>>>>> disconnects did trigger retransmits:
>>>>> RPC statistics:
>>>>>    1888819 RPC requests sent, 1888581 RPC replies received (0 XIDs not found)
>>>>>    average backlog queue length: 119
>>>> 
>>>> Ok, well, that's 1.2% error rate, which IMO could be cranked up much
>>>> higher for testing purposes. I'd also be sure the server was serving
>>>> other workloads during the same time, putting at least some pressure
>>>> on the DRC. The op rate of a single nfsidem test is pretty low so I
>>>> doubt it's ever evicting anything.
>>>> 
>>>> Ideally, it would be best to
>>>> 1) increase the error probability
>>>> 2) run several concurrent nfsidem tests, on different connections
>>>> 3) apply some other load to the server, e.g. several cthon basics
>>>> 
>>>> The idea being to actually get the needle off of zero and measure some
>>>> kind of difference. Otherwise it really isn't giving any information
>>>> apart from a slight didn't-break-it confidence. Honestly, I'm surprised
>>>> you couldn't evince a failure from stock. On paper, these results don't
>>>> actually tell us the patch is doing anything.
>>> 
>>> I boosted the disconnect injection rate to once every 11353 RPCs,
>>> and mounted a second share with "nosharecache", running nfsidem on
>>> both mounts. Both mounts are subject to disconnect injection.
>>> 
>>> With the current v5.2-rc Linux server, both nfsidem jobs fail within
>>> 30 seconds.
>>> 
>>> With my DRC fix applied, both jobs run to completion with no errors.
>>> It takes more than an hour.
>> Great, that definitely proves it. It's relatively low risk, and
>> should go in.
>> I think it's worth thinking through the best way to address this
>> across multiple transports, especially RDMA, where the port and
>> address behaviors may differ from TCP/UDP. Perhaps the whole idea
>> of using the 5-tuple should simply be set aside, after all these
>> years. NFSv4.1 already has!
>> Thanks for humoring me with the extra testing, by the way :-)
>> Tom.
>>> Here are the op metrics for one of the mounts during a run that
>>> completes successfully:
>>> 
>>> RPC statistics:
>>>    4091380 RPC requests sent, 4088143 RPC replies received (0 XIDs not found)
>>>    average backlog queue length: 1800
>>> 
>>> ACCESS:
>>>             300395 ops (7%)         301 retrans (0%)        0 major timeouts
>>>          avg bytes sent per op: 132    avg bytes received per op: 120
>>>          backlog wait: 4.289199  RTT: 0.019821   total execute time: 4.315092 (milliseconds)
>>> REMOVE:
>>>             300390 ops (7%)         168 retrans (0%)        0 major timeouts
>>>          avg bytes sent per op: 136    avg bytes received per op: 144
>>>          backlog wait: 2.368664  RTT: 0.070106   total execute time: 2.445148 (milliseconds)
>>> MKDIR:
>>>            200262 ops (4%)         193 retrans (0%)        0 major timeouts
>>>          avg bytes sent per op: 158    avg bytes received per op: 271
>>>          backlog wait: 4.207034  RTT: 0.075421   total execute time: 4.289101 (milliseconds)
>>> RMDIR:
>>>            200262 ops (4%)         100 retrans (0%)        0 major timeouts
>>>          avg bytes sent per op: 130    avg bytes received per op: 144
>>>          backlog wait: 2.050749  RTT: 0.071676   total execute time: 2.128801 (milliseconds)
>>> LOOKUP:
>>>             194664 ops (4%)         233 retrans (0%)        0 major timeouts
>>>          avg bytes sent per op: 136    avg bytes received per op: 176
>>>          backlog wait: 5.365984  RTT: 0.020615   total execute time: 5.392769 (milliseconds)
>>> SETATTR:
>>>             100130 ops (2%)         86 retrans (0%)         0 major timeouts
>>>          avg bytes sent per op: 160    avg bytes received per op: 144
>>>          backlog wait: 3.520603  RTT: 0.066863   total execute time: 3.594327 (milliseconds)
>>> WRITE:
>>>             100130 ops (2%)         82 retrans (0%)         0 major timeouts
>>>          avg bytes sent per op: 180    avg bytes received per op: 136
>>>          backlog wait: 3.331249  RTT: 0.118316   total execute time: 3.459563 (milliseconds)
>>> CREATE:
>>>             100130 ops (2%)         95 retrans (0%)         0 major timeouts
>>>          avg bytes sent per op: 168    avg bytes received per op: 272
>>>          backlog wait: 4.099451  RTT: 0.071437   total execute time: 4.177479 (milliseconds)
>>> SYMLINK:
>>>             100130 ops (2%)         83 retrans (0%)         0 major timeouts
>>>          avg bytes sent per op: 188    avg bytes received per op: 271
>>>          backlog wait: 3.727704  RTT: 0.073534   total execute time: 3.807700 (milliseconds)
>>> RENAME:
>>>             100130 ops (2%)         68 retrans (0%)         0 major timeouts
>>>          avg bytes sent per op: 180    avg bytes received per op: 260
>>>          backlog wait: 2.659982  RTT: 0.070518   total execute time: 2.738979 (milliseconds)
>>> LINK:
>>>     100130 ops (2%)     85 retrans (0%)     0 major timeouts
>>>     avg bytes sent per op: 172    avg bytes received per op: 232
>>>     backlog wait: 3.676680     RTT: 0.066773     total execute time: 3.749785 (milliseconds)
>>> GETATTR:
>>>     230 ops (0%)     81 retrans (35%)     0 major timeouts
>>>     avg bytes sent per op: 170    avg bytes received per op: 112
>>>     backlog wait: 1584.026087     RTT: 0.043478     total execute time: 1584.082609 (milliseconds)
> 
> By the way, that's a *nasty* tail latency on at least one getattr.
> What's up with that??

The average round trip with the server is 43 microseconds. The long
total execution time is due to the backlog wait: it is likely that
some of the GETATTRs are waiting behind other operations. An ftrace
capture would confirm it.


> Tom.
> 
>>> READDIRPLUS:
>>>     10 ops (0%)
>>>     avg bytes sent per op: 149    avg bytes received per op: 1726
>>>     backlog wait: 0.000000     RTT: 0.300000     total execute time: 0.400000 (milliseconds)
>>> FSINFO:
>>>     2 ops (0%)
>>>     avg bytes sent per op: 112    avg bytes received per op: 80
>>>     backlog wait: 0.000000     RTT: 0.000000     total execute time: 0.000000 (milliseconds)
>>> NULL:
>>>     1 ops (0%)
>>>     avg bytes sent per op: 40    avg bytes received per op: 24
>>>     backlog wait: 2.000000     RTT: 0.000000     total execute time: 3.000000 (milliseconds)
>>> READLINK:
>>>     1 ops (0%)
>>>     avg bytes sent per op: 128    avg bytes received per op: 1140
>>>     backlog wait: 0.000000     RTT: 0.000000     total execute time: 0.000000 (milliseconds)
>>> PATHCONF:
>>>     1 ops (0%)
>>>     avg bytes sent per op: 112    avg bytes received per op: 56
>>>     backlog wait: 0.000000     RTT: 0.000000     total execute time: 0.000000 (milliseconds)
>>> 
>>> 
>>>> Tom.
>>>> 
>>>>> ACCESS:
>>>>>          300001 ops (15%)        44 retrans (0%)         0 major timeouts
>>>>>          avg bytes sent per op: 132    avg bytes received per op: 120
>>>>>          backlog wait: 0.591118  RTT: 0.017463   total execute time: 0.614795 (milliseconds)
>>>>> REMOVE:
>>>>>          300000 ops (15%)        40 retrans (0%)         0 major timeouts
>>>>>          avg bytes sent per op: 136    avg bytes received per op: 144
>>>>>          backlog wait: 0.531667  RTT: 0.018973   total execute time: 0.556927 (milliseconds)
>>>>> MKDIR:
>>>>>          200000 ops (10%)        26 retrans (0%)         0 major timeouts
>>>>>          avg bytes sent per op: 158    avg bytes received per op: 272
>>>>>          backlog wait: 0.518940  RTT: 0.019755   total execute time: 0.545230 (milliseconds)
>>>>> RMDIR:
>>>>>          200000 ops (10%)        24 retrans (0%)         0 major timeouts
>>>>>          avg bytes sent per op: 130    avg bytes received per op: 144
>>>>>          backlog wait: 0.512320  RTT: 0.018580   total execute time: 0.537095 (milliseconds)
>>>>> LOOKUP:
>>>>>          188533 ops (9%)         21 retrans (0%)         0 major timeouts
>>>>>          avg bytes sent per op: 136    avg bytes received per op: 174
>>>>>          backlog wait: 0.455925  RTT: 0.017721   total execute time: 0.480011 (milliseconds)
>>>>> SETATTR:
>>>>>          100000 ops (5%)         11 retrans (0%)         0 major timeouts
>>>>>          avg bytes sent per op: 160    avg bytes received per op: 144
>>>>>          backlog wait: 0.371960  RTT: 0.019470   total execute time: 0.398330 (milliseconds)
>>>>> WRITE:
>>>>>          100000 ops (5%)          9 retrans (0%)         0 major timeouts
>>>>>          avg bytes sent per op: 180    avg bytes received per op: 136
>>>>>          backlog wait: 0.399190  RTT: 0.022860   total execute time: 0.436610 (milliseconds)
>>>>> CREATE:
>>>>>          100000 ops (5%)          9 retrans (0%)         0 major timeouts
>>>>>          avg bytes sent per op: 168    avg bytes received per op: 272
>>>>>          backlog wait: 0.365290  RTT: 0.019560   total execute time: 0.391140 (milliseconds)
>>>>> SYMLINK:
>>>>>          100000 ops (5%)         18 retrans (0%)         0 major timeouts
>>>>>          avg bytes sent per op: 188    avg bytes received per op: 272
>>>>>          backlog wait: 0.750470  RTT: 0.020150   total execute time: 0.786410 (milliseconds)
>>>>> RENAME:
>>>>>          100000 ops (5%)         14 retrans (0%)         0 major timeouts
>>>>>          avg bytes sent per op: 180    avg bytes received per op: 260
>>>>>          backlog wait: 0.461650  RTT: 0.020710   total execute time: 0.489670 (milliseconds)
>>>>> -- 
>>>>> Chuck Lever
>>> 
>>> -- 
>>> Chuck Lever

--
Chuck Lever






Thread overview: 19+ messages
2019-06-05 12:15 [PATCH RFC] svcrdma: Ignore source port when computing DRC hash Chuck Lever
2019-06-05 15:57 ` Olga Kornievskaia
2019-06-05 17:28   ` Chuck Lever
2019-06-06 18:13     ` Olga Kornievskaia
2019-06-06 18:33       ` Chuck Lever
2019-06-07 15:43         ` Olga Kornievskaia
2019-06-10 14:38           ` Tom Talpey
2019-06-05 16:43 ` Tom Talpey
2019-06-05 17:25   ` Chuck Lever
2019-06-10 14:50     ` Tom Talpey
2019-06-10 17:50       ` Chuck Lever
2019-06-10 19:14         ` Tom Talpey
2019-06-10 21:57           ` Chuck Lever
2019-06-10 22:13             ` Tom Talpey
2019-06-11  0:07               ` Tom Talpey
2019-06-11 14:25                 ` Chuck Lever
2019-06-11 14:23               ` Chuck Lever
2019-06-06 13:08 ` Sasha Levin
2019-06-06 13:24   ` Chuck Lever
