* [RFC 0/2] revisit RMDA XDR padding management
@ 2021-08-30 16:53 Olga Kornievskaia
  2021-08-30 16:53 ` [RFC 1/2] xprtrdma: xdr pad optimization revisted again Olga Kornievskaia
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Olga Kornievskaia @ 2021-08-30 16:53 UTC (permalink / raw)
  To: trond.myklebust, anna.schumaker, chuck.level; +Cc: linux-nfs

From: Olga Kornievskaia <kolga@netapp.com>

We have previously revisited how XDR padding management was done
for the RDMA read chunks.

This patch series attempts to do the same for the RDMA read chunks
and altogether remove the option of doing an implicit roundup.

According to RFC 8166, the client "SHOULD NOT" include additional
space for XDR roundup padding. Furthermore, the server MUST NOT write
XDR padding into a write chunk. Given those two statements and
the patch "NFS: Always provide aligned buffers to the RPC read layers",
there is no reason for the client to look at the tail and assume
there is any padding data for which it needs to allocate space.

The only operation that this patch series affects is an NFS read.
All non-read ops that might use a write chunk are set up to use a
reply chunk, not a write chunk, if the reply is larger than the
inline threshold.
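
To make the round-up arithmetic concrete, here is a minimal standalone
sketch (illustrative only; it mirrors what xdr_align_size() and
xdr_pad_size() compute in the sunrpc XDR headers):

	#include <stdio.h>

	/* XDR rounds every variable-length item up to a 4-byte boundary. */
	static unsigned int xdr_align(unsigned int len)
	{
		return (len + 3) & ~3u;
	}

	static unsigned int xdr_pad(unsigned int len)
	{
		return xdr_align(len) - len;
	}

	int main(void)
	{
		printf("pad for 4096 bytes: %u\n", xdr_pad(4096)); /* 0 */
		printf("pad for 4097 bytes: %u\n", xdr_pad(4097)); /* 3 */
		return 0;
	}

With the read buffers always XDR-aligned, the pad is always zero, which
is why the client no longer needs to reserve tail space for it.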



*** SUBJECT HERE ***

*** BLURB HERE ***

Olga Kornievskaia (2):
  xprtrdma: xdr pad optimization revisted again
  xprtrdma: remove re_implicit_roundup xprt_rdma_pad_optimize

 net/sunrpc/xprtrdma/rpc_rdma.c  | 15 ---------------
 net/sunrpc/xprtrdma/transport.c |  8 --------
 net/sunrpc/xprtrdma/verbs.c     |  2 --
 net/sunrpc/xprtrdma/xprt_rdma.h |  6 ------
 4 files changed, 31 deletions(-)

-- 
2.27.0



* [RFC 1/2] xprtrdma: xdr pad optimization revisted again
  2021-08-30 16:53 [RFC 0/2] revisit RMDA XDR padding management Olga Kornievskaia
@ 2021-08-30 16:53 ` Olga Kornievskaia
  2021-08-30 17:04   ` Chuck Lever III
  2021-08-30 16:53 ` [RFC 2/2] xprtrdma: remove re_implicit_roundup xprt_rdma_pad_optimize Olga Kornievskaia
  2021-08-30 16:55 ` [RFC 0/2] revisit RMDA XDR padding management Olga Kornievskaia
  2 siblings, 1 reply; 17+ messages in thread
From: Olga Kornievskaia @ 2021-08-30 16:53 UTC (permalink / raw)
  To: trond.myklebust, anna.schumaker, chuck.level; +Cc: linux-nfs

From: Olga Kornievskaia <kolga@netapp.com>

Given the patch "Always provide aligned buffers to the RPC read layers",
RPC over RDMA doesn't need to look at the tail and add that space
to the write chunk.

An RFC 8166-compliant server must not write XDR padding into the
write chunk (even if space was provided for it). Historically
(before RFC 8166), the Solaris RDMA server required the client to
provide space for the XDR padding, which is why this client code has
existed.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
---
 net/sunrpc/xprtrdma/rpc_rdma.c | 15 ---------------
 1 file changed, 15 deletions(-)

diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
index c335c1361564..2c4146bcf2a8 100644
--- a/net/sunrpc/xprtrdma/rpc_rdma.c
+++ b/net/sunrpc/xprtrdma/rpc_rdma.c
@@ -255,21 +255,6 @@ rpcrdma_convert_iovs(struct rpcrdma_xprt *r_xprt, struct xdr_buf *xdrbuf,
 		page_base = 0;
 	}
 
-	if (type == rpcrdma_readch)
-		goto out;
-
-	/* When encoding a Write chunk, some servers need to see an
-	 * extra segment for non-XDR-aligned Write chunks. The upper
-	 * layer provides space in the tail iovec that may be used
-	 * for this purpose.
-	 */
-	if (type == rpcrdma_writech && r_xprt->rx_ep->re_implicit_roundup)
-		goto out;
-
-	if (xdrbuf->tail[0].iov_len)
-		rpcrdma_convert_kvec(&xdrbuf->tail[0], seg, &n);
-
-out:
 	if (unlikely(n > RPCRDMA_MAX_SEGS))
 		return -EIO;
 	return n;
-- 
2.27.0



* [RFC 2/2] xprtrdma: remove re_implicit_roundup xprt_rdma_pad_optimize
  2021-08-30 16:53 [RFC 0/2] revisit RMDA XDR padding management Olga Kornievskaia
  2021-08-30 16:53 ` [RFC 1/2] xprtrdma: xdr pad optimization revisted again Olga Kornievskaia
@ 2021-08-30 16:53 ` Olga Kornievskaia
  2021-08-30 16:57   ` Chuck Lever III
  2021-08-30 16:55 ` [RFC 0/2] revisit RMDA XDR padding management Olga Kornievskaia
  2 siblings, 1 reply; 17+ messages in thread
From: Olga Kornievskaia @ 2021-08-30 16:53 UTC (permalink / raw)
  To: trond.myklebust, anna.schumaker, chuck.level; +Cc: linux-nfs

From: Olga Kornievskaia <kolga@netapp.com>

Since the RPC-over-RDMA layer no longer needs to manage XDR padding,
remove the existing code that controlled whether or not the client
included extra space for the XDR padding.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
---
 net/sunrpc/xprtrdma/transport.c | 8 --------
 net/sunrpc/xprtrdma/verbs.c     | 2 --
 net/sunrpc/xprtrdma/xprt_rdma.h | 6 ------
 3 files changed, 16 deletions(-)

diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 16e5696314a4..e7b9d88f4483 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -72,7 +72,6 @@ static unsigned int xprt_rdma_slot_table_entries = RPCRDMA_DEF_SLOT_TABLE;
 unsigned int xprt_rdma_max_inline_read = RPCRDMA_DEF_INLINE;
 unsigned int xprt_rdma_max_inline_write = RPCRDMA_DEF_INLINE;
 unsigned int xprt_rdma_memreg_strategy		= RPCRDMA_FRWR;
-int xprt_rdma_pad_optimize;
 static struct xprt_class xprt_rdma;
 
 #if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
@@ -134,13 +133,6 @@ static struct ctl_table xr_tunables_table[] = {
 		.extra1		= &min_memreg,
 		.extra2		= &max_memreg,
 	},
-	{
-		.procname	= "rdma_pad_optimize",
-		.data		= &xprt_rdma_pad_optimize,
-		.maxlen		= sizeof(unsigned int),
-		.mode		= 0644,
-		.proc_handler	= proc_dointvec,
-	},
 	{ },
 };
 
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index aaec3c9be8db..d8650a3563ef 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -205,14 +205,12 @@ static void rpcrdma_update_cm_private(struct rpcrdma_ep *ep,
 	unsigned int rsize, wsize;
 
 	/* Default settings for RPC-over-RDMA Version One */
-	ep->re_implicit_roundup = xprt_rdma_pad_optimize;
 	rsize = RPCRDMA_V1_DEF_INLINE_SIZE;
 	wsize = RPCRDMA_V1_DEF_INLINE_SIZE;
 
 	if (pmsg &&
 	    pmsg->cp_magic == rpcrdma_cmp_magic &&
 	    pmsg->cp_version == RPCRDMA_CMP_VERSION) {
-		ep->re_implicit_roundup = true;
 		rsize = rpcrdma_decode_buffer_size(pmsg->cp_send_size);
 		wsize = rpcrdma_decode_buffer_size(pmsg->cp_recv_size);
 	}
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index d91f54eae00b..137866a83a3a 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -74,7 +74,6 @@ struct rpcrdma_ep {
 	struct ib_pd		*re_pd;
 	unsigned int		re_max_rdma_segs;
 	unsigned int		re_max_fr_depth;
-	bool			re_implicit_roundup;
 	enum ib_mr_type		re_mrtype;
 	struct completion	re_done;
 	unsigned int		re_send_count;
@@ -441,11 +440,6 @@ rpcrdma_portstr(const struct rpcrdma_xprt *r_xprt)
 	return r_xprt->rx_xprt.address_strings[RPC_DISPLAY_PORT];
 }
 
-/* Setting this to 0 ensures interoperability with early servers.
- * Setting this to 1 enhances certain unaligned read/write performance.
- * Default is 0, see sysctl entry and rpc_rdma.c rpcrdma_convert_iovs() */
-extern int xprt_rdma_pad_optimize;
-
 /* This setting controls the hunt for a supported memory
  * registration strategy.
  */
-- 
2.27.0



* Re: [RFC 0/2] revisit RMDA XDR padding management
  2021-08-30 16:53 [RFC 0/2] revisit RMDA XDR padding management Olga Kornievskaia
  2021-08-30 16:53 ` [RFC 1/2] xprtrdma: xdr pad optimization revisted again Olga Kornievskaia
  2021-08-30 16:53 ` [RFC 2/2] xprtrdma: remove re_implicit_roundup xprt_rdma_pad_optimize Olga Kornievskaia
@ 2021-08-30 16:55 ` Olga Kornievskaia
  2 siblings, 0 replies; 17+ messages in thread
From: Olga Kornievskaia @ 2021-08-30 16:55 UTC (permalink / raw)
  To: Trond Myklebust, Anna Schumaker, Chuck Lever; +Cc: linux-nfs

On Mon, Aug 30, 2021 at 12:53 PM Olga Kornievskaia
<olga.kornievskaia@gmail.com> wrote:
>
> From: Olga Kornievskaia <kolga@netapp.com>
>
> We have previously revisited how XDR padding management was done
> for the RDMA read chunks.
>
> This patch series attempts to do the same for the RDMA read chunks
> and altogether remove the options of doing an implicit roundup.

Apologies for the typos, but this is about RDMA write chunks.

> According to the RFC 8166 client "SHOULD NOT" include additional
> space for XDR roundup padding. Furthermore, server MUST NOT write
> XDR padding into the a write chunk. Given those two statement and
> a patch "NFS: Always provide aligned buffers to the RPC read layers",
> there is no reason for the client to look at the tail and assume
> there is any padding data for which it needs to allocate space for.
>
> The only operation that this patch series effects is an NFS read.
> All non-read ops that might use a write chunk are setup to use
> reply chunk if reply is larger than inline threshold, not a write
> chunk.
>
>
>
> *** SUBJECT HERE ***
>
> *** BLURB HERE ***
>
> Olga Kornievskaia (2):
>   xprtrdma: xdr pad optimization revisted again
>   xprtrdma: remove re_implicit_roundup xprt_rdma_pad_optimize
>
>  net/sunrpc/xprtrdma/rpc_rdma.c  | 15 ---------------
>  net/sunrpc/xprtrdma/transport.c |  8 --------
>  net/sunrpc/xprtrdma/verbs.c     |  2 --
>  net/sunrpc/xprtrdma/xprt_rdma.h |  6 ------
>  4 files changed, 31 deletions(-)
>
> --
> 2.27.0
>


* Re: [RFC 2/2] xprtrdma: remove re_implicit_roundup xprt_rdma_pad_optimize
  2021-08-30 16:53 ` [RFC 2/2] xprtrdma: remove re_implicit_roundup xprt_rdma_pad_optimize Olga Kornievskaia
@ 2021-08-30 16:57   ` Chuck Lever III
  0 siblings, 0 replies; 17+ messages in thread
From: Chuck Lever III @ 2021-08-30 16:57 UTC (permalink / raw)
  To: Olga Kornievskaia
  Cc: Trond Myklebust, Anna Schumaker, chuck.level, Linux NFS Mailing List



> On Aug 30, 2021, at 12:53 PM, Olga Kornievskaia <olga.kornievskaia@gmail.com> wrote:
> 
> From: Olga Kornievskaia <kolga@netapp.com>
> 
> Since RPC over RDMA layer no longer needs to manage XDR padding,
> remove existing code that managed whether or not client included
> extra space for the XDR padding.
> 
> Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
> ---
> net/sunrpc/xprtrdma/transport.c | 8 --------
> net/sunrpc/xprtrdma/verbs.c     | 2 --
> net/sunrpc/xprtrdma/xprt_rdma.h | 6 ------
> 3 files changed, 16 deletions(-)
> 
> diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
> index 16e5696314a4..e7b9d88f4483 100644
> --- a/net/sunrpc/xprtrdma/transport.c
> +++ b/net/sunrpc/xprtrdma/transport.c
> @@ -72,7 +72,6 @@ static unsigned int xprt_rdma_slot_table_entries = RPCRDMA_DEF_SLOT_TABLE;
> unsigned int xprt_rdma_max_inline_read = RPCRDMA_DEF_INLINE;
> unsigned int xprt_rdma_max_inline_write = RPCRDMA_DEF_INLINE;
> unsigned int xprt_rdma_memreg_strategy		= RPCRDMA_FRWR;
> -int xprt_rdma_pad_optimize;
> static struct xprt_class xprt_rdma;
> 
> #if IS_ENABLED(CONFIG_SUNRPC_DEBUG)
> @@ -134,13 +133,6 @@ static struct ctl_table xr_tunables_table[] = {
> 		.extra1		= &min_memreg,
> 		.extra2		= &max_memreg,
> 	},
> -	{
> -		.procname	= "rdma_pad_optimize",
> -		.data		= &xprt_rdma_pad_optimize,
> -		.maxlen		= sizeof(unsigned int),
> -		.mode		= 0644,
> -		.proc_handler	= proc_dointvec,
> -	},
> 	{ },
> };

I have to NACK at least this hunk: my understanding is we
can't just remove /proc interfaces.


> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
> index aaec3c9be8db..d8650a3563ef 100644
> --- a/net/sunrpc/xprtrdma/verbs.c
> +++ b/net/sunrpc/xprtrdma/verbs.c
> @@ -205,14 +205,12 @@ static void rpcrdma_update_cm_private(struct rpcrdma_ep *ep,
> 	unsigned int rsize, wsize;
> 
> 	/* Default settings for RPC-over-RDMA Version One */
> -	ep->re_implicit_roundup = xprt_rdma_pad_optimize;
> 	rsize = RPCRDMA_V1_DEF_INLINE_SIZE;
> 	wsize = RPCRDMA_V1_DEF_INLINE_SIZE;
> 
> 	if (pmsg &&
> 	    pmsg->cp_magic == rpcrdma_cmp_magic &&
> 	    pmsg->cp_version == RPCRDMA_CMP_VERSION) {
> -		ep->re_implicit_roundup = true;
> 		rsize = rpcrdma_decode_buffer_size(pmsg->cp_send_size);
> 		wsize = rpcrdma_decode_buffer_size(pmsg->cp_recv_size);
> 	}
> diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
> index d91f54eae00b..137866a83a3a 100644
> --- a/net/sunrpc/xprtrdma/xprt_rdma.h
> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
> @@ -74,7 +74,6 @@ struct rpcrdma_ep {
> 	struct ib_pd		*re_pd;
> 	unsigned int		re_max_rdma_segs;
> 	unsigned int		re_max_fr_depth;
> -	bool			re_implicit_roundup;
> 	enum ib_mr_type		re_mrtype;
> 	struct completion	re_done;
> 	unsigned int		re_send_count;
> @@ -441,11 +440,6 @@ rpcrdma_portstr(const struct rpcrdma_xprt *r_xprt)
> 	return r_xprt->rx_xprt.address_strings[RPC_DISPLAY_PORT];
> }
> 
> -/* Setting this to 0 ensures interoperability with early servers.
> - * Setting this to 1 enhances certain unaligned read/write performance.
> - * Default is 0, see sysctl entry and rpc_rdma.c rpcrdma_convert_iovs() */
> -extern int xprt_rdma_pad_optimize;
> -
> /* This setting controls the hunt for a supported memory
>  * registration strategy.
>  */
> -- 
> 2.27.0
> 

--
Chuck Lever





* Re: [RFC 1/2] xprtrdma: xdr pad optimization revisted again
  2021-08-30 16:53 ` [RFC 1/2] xprtrdma: xdr pad optimization revisted again Olga Kornievskaia
@ 2021-08-30 17:04   ` Chuck Lever III
  2021-08-30 17:24     ` Olga Kornievskaia
  0 siblings, 1 reply; 17+ messages in thread
From: Chuck Lever III @ 2021-08-30 17:04 UTC (permalink / raw)
  To: Olga Kornievskaia; +Cc: Trond Myklebust, Anna Schumaker, Linux NFS Mailing List

Hi Olga-

> On Aug 30, 2021, at 12:53 PM, Olga Kornievskaia <olga.kornievskaia@gmail.com> wrote:
> 
> From: Olga Kornievskaia <kolga@netapp.com>
> 
> Given the patch "Always provide aligned buffers to the RPC read layers",
> RPC over RDMA doesn't need to look at the tail page and add that space
> to the write chunk.
> 
> For the RFC 8166 compliant server, it must not write an XDR padding
> into the write chunk (even if space was provided). Historically
> (before RFC 8166) Solaris RDMA server has been requiring the client
> to provide space for the XDR padding and thus this client code has
> existed.

I don't understand this change.

So, the upper layer doesn't provide XDR padding any more. That doesn't
mean Solaris servers won't still want to write into it. The
client still has to provide this padding from somewhere.

This suggests that "Always provide aligned buffers to the RPC read
layers" breaks our interop with Solaris servers. Does it?


> Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
> ---
> net/sunrpc/xprtrdma/rpc_rdma.c | 15 ---------------
> 1 file changed, 15 deletions(-)
> 
> diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
> index c335c1361564..2c4146bcf2a8 100644
> --- a/net/sunrpc/xprtrdma/rpc_rdma.c
> +++ b/net/sunrpc/xprtrdma/rpc_rdma.c
> @@ -255,21 +255,6 @@ rpcrdma_convert_iovs(struct rpcrdma_xprt *r_xprt, struct xdr_buf *xdrbuf,
> 		page_base = 0;
> 	}
> 
> -	if (type == rpcrdma_readch)
> -		goto out;
> -
> -	/* When encoding a Write chunk, some servers need to see an
> -	 * extra segment for non-XDR-aligned Write chunks. The upper
> -	 * layer provides space in the tail iovec that may be used
> -	 * for this purpose.
> -	 */
> -	if (type == rpcrdma_writech && r_xprt->rx_ep->re_implicit_roundup)
> -		goto out;
> -
> -	if (xdrbuf->tail[0].iov_len)

Instead of checking for a tail, we could check

	if (xdr_pad_size(xdrbuf->page_len))

and provide some tail space in that case.
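
Roughly, against the hunk quoted above, that might look like the
following (an illustrative sketch only: it reuses the existing
rpcrdma_convert_kvec() helper and the xdr_pad_size() helper, and it
still assumes the upper layer left room in the tail; having the
transport supply its own pad space would be a further step):

	if (type == rpcrdma_readch)
		goto out;

	/* For a Write chunk, reserve tail space only when the data
	 * payload itself is not XDR-aligned; a compliant server will
	 * not write the pad, but a legacy one may.
	 */
	if (type == rpcrdma_writech && !xdr_pad_size(xdrbuf->page_len))
		goto out;

	if (xdrbuf->tail[0].iov_len)
		rpcrdma_convert_kvec(&xdrbuf->tail[0], seg, &n);

out:
	if (unlikely(n > RPCRDMA_MAX_SEGS))
		return -EIO;
	return n;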

> -		rpcrdma_convert_kvec(&xdrbuf->tail[0], seg, &n);
> -
> -out:
> 	if (unlikely(n > RPCRDMA_MAX_SEGS))
> 		return -EIO;
> 	return n;
> -- 
> 2.27.0
> 

--
Chuck Lever





* Re: [RFC 1/2] xprtrdma: xdr pad optimization revisted again
  2021-08-30 17:04   ` Chuck Lever III
@ 2021-08-30 17:24     ` Olga Kornievskaia
  2021-08-30 17:34       ` Trond Myklebust
  2021-08-30 17:35       ` Chuck Lever III
  0 siblings, 2 replies; 17+ messages in thread
From: Olga Kornievskaia @ 2021-08-30 17:24 UTC (permalink / raw)
  To: Chuck Lever III; +Cc: Trond Myklebust, Anna Schumaker, Linux NFS Mailing List

On Mon, Aug 30, 2021 at 1:04 PM Chuck Lever III <chuck.lever@oracle.com> wrote:
>
> Hi Olga-
>
> > On Aug 30, 2021, at 12:53 PM, Olga Kornievskaia <olga.kornievskaia@gmail.com> wrote:
> >
> > From: Olga Kornievskaia <kolga@netapp.com>
> >
> > Given the patch "Always provide aligned buffers to the RPC read layers",
> > RPC over RDMA doesn't need to look at the tail page and add that space
> > to the write chunk.
> >
> > For the RFC 8166 compliant server, it must not write an XDR padding
> > into the write chunk (even if space was provided). Historically
> > (before RFC 8166) Solaris RDMA server has been requiring the client
> > to provide space for the XDR padding and thus this client code has
> > existed.
>
> I don't understand this change.
>
> So, the upper layer doesn't provide XDR padding any more. That doesn't
> mean Solaris servers still aren't going to want to write into it. The
> client still has to provide this padding from somewhere.
>
> This suggests that "Always provide aligned buffers to the RPC read
> layers" breaks our interop with Solaris servers. Does it?

No, I don't believe "Always provide aligned buffers to the RPC read
layers" breaks interoperability. THIS patch would break the
interop.

If we are not willing to break interoperability and support only
servers that comply with RFC 8166, then this patch is not needed.

> > Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
> > ---
> > net/sunrpc/xprtrdma/rpc_rdma.c | 15 ---------------
> > 1 file changed, 15 deletions(-)
> >
> > diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
> > index c335c1361564..2c4146bcf2a8 100644
> > --- a/net/sunrpc/xprtrdma/rpc_rdma.c
> > +++ b/net/sunrpc/xprtrdma/rpc_rdma.c
> > @@ -255,21 +255,6 @@ rpcrdma_convert_iovs(struct rpcrdma_xprt *r_xprt, struct xdr_buf *xdrbuf,
> >               page_base = 0;
> >       }
> >
> > -     if (type == rpcrdma_readch)
> > -             goto out;
> > -
> > -     /* When encoding a Write chunk, some servers need to see an
> > -      * extra segment for non-XDR-aligned Write chunks. The upper
> > -      * layer provides space in the tail iovec that may be used
> > -      * for this purpose.
> > -      */
> > -     if (type == rpcrdma_writech && r_xprt->rx_ep->re_implicit_roundup)
> > -             goto out;
> > -
> > -     if (xdrbuf->tail[0].iov_len)
>
> Instead of checking for a tail, we could check
>
>         if (xdr_pad_size(xdrbuf->page_len))
>
> and provide some tail space in that case.

I don't believe this is any different from what we have now. If the
page data is not 4-byte aligned, we would still allocate space for
the padding, which "SHOULD NOT" be there. But yes, it is allowed to be
there.

The problem, as you know from our offline discussion, is allocating
the tail page and including it in the write chunk in the Nvidia
environment, where Nvidia doesn't support using data (user) pages and
NFS kernel-allocated pages in the same segment.

Alternatively, my ask is then to change rpcrdma_convert_iovs() to
return two segments instead of one: one for the pages and another for
the tail.

>
> > -             rpcrdma_convert_kvec(&xdrbuf->tail[0], seg, &n);
> > -
> > -out:
> >       if (unlikely(n > RPCRDMA_MAX_SEGS))
> >               return -EIO;
> >       return n;
> > --
> > 2.27.0
> >
>
> --
> Chuck Lever
>
>
>


* Re: [RFC 1/2] xprtrdma: xdr pad optimization revisted again
  2021-08-30 17:24     ` Olga Kornievskaia
@ 2021-08-30 17:34       ` Trond Myklebust
  2021-08-30 18:02         ` Chuck Lever III
  2021-08-30 17:35       ` Chuck Lever III
  1 sibling, 1 reply; 17+ messages in thread
From: Trond Myklebust @ 2021-08-30 17:34 UTC (permalink / raw)
  To: olga.kornievskaia, chuck.lever; +Cc: linux-nfs, anna.schumaker

On Mon, 2021-08-30 at 13:24 -0400, Olga Kornievskaia wrote:
> On Mon, Aug 30, 2021 at 1:04 PM Chuck Lever III
> <chuck.lever@oracle.com> wrote:
> > 
> > Hi Olga-
> > 
> > > On Aug 30, 2021, at 12:53 PM, Olga Kornievskaia
> > > <olga.kornievskaia@gmail.com> wrote:
> > > 
> > > From: Olga Kornievskaia <kolga@netapp.com>
> > > 
> > > Given the patch "Always provide aligned buffers to the RPC read
> > > layers",
> > > RPC over RDMA doesn't need to look at the tail page and add that
> > > space
> > > to the write chunk.
> > > 
> > > For the RFC 8166 compliant server, it must not write an XDR
> > > padding
> > > into the write chunk (even if space was provided). Historically
> > > (before RFC 8166) Solaris RDMA server has been requiring the
> > > client
> > > to provide space for the XDR padding and thus this client code
> > > has
> > > existed.
> > 
> > I don't understand this change.
> > 
> > So, the upper layer doesn't provide XDR padding any more. That
> > doesn't
> > mean Solaris servers still aren't going to want to write into it.
> > The
> > client still has to provide this padding from somewhere.
> > 
> > This suggests that "Always provide aligned buffers to the RPC read
> > layers" breaks our interop with Solaris servers. Does it?
> 
> No, I don't believe "Always provide aligned buffers to the RPC read
> layers" breaks the interoperability. THIS patch would break the
> interop.
> 
> If we are not willing to break the interoperability and support only
> servers that comply with RFC 8166, this patch is not needed.

Why? The intention of the first patch is to ensure that we do not have
buffers that are not word aligned. If Solaris wants to write padding
after the end of the file, then there is space in the page buffer for
it to do so. There should be no need for an extra tail in which to
write the padding.

This means that the RDMA and TCP cases should end up doing the same
thing for the case of the Solaris server: the padding is written into
the page buffer. There is nothing written to the tail in either case.

> 
> > > Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
> > > ---
> > > net/sunrpc/xprtrdma/rpc_rdma.c | 15 ---------------
> > > 1 file changed, 15 deletions(-)
> > > 
> > > diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c
> > > b/net/sunrpc/xprtrdma/rpc_rdma.c
> > > index c335c1361564..2c4146bcf2a8 100644
> > > --- a/net/sunrpc/xprtrdma/rpc_rdma.c
> > > +++ b/net/sunrpc/xprtrdma/rpc_rdma.c
> > > @@ -255,21 +255,6 @@ rpcrdma_convert_iovs(struct rpcrdma_xprt
> > > *r_xprt, struct xdr_buf *xdrbuf,
> > >               page_base = 0;
> > >       }
> > > 
> > > -     if (type == rpcrdma_readch)
> > > -             goto out;
> > > -
> > > -     /* When encoding a Write chunk, some servers need to see an
> > > -      * extra segment for non-XDR-aligned Write chunks. The
> > > upper
> > > -      * layer provides space in the tail iovec that may be used
> > > -      * for this purpose.
> > > -      */
> > > -     if (type == rpcrdma_writech && r_xprt->rx_ep-
> > > >re_implicit_roundup)
> > > -             goto out;
> > > -
> > > -     if (xdrbuf->tail[0].iov_len)
> > 
> > Instead of checking for a tail, we could check
> > 
> >         if (xdr_pad_size(xdrbuf->page_len))
> > 
> > and provide some tail space in that case.
> 
> I don't believe this is any different than what we have now. If the
> page size is non-4byte aligned then, we would still allocate size for
> the padding which "SHOULD NOT" be there. But yes it is allowed to be
> there.
> 
> The problem, as you know from our offline discussion, is allocating
> the tail page and including it in the write chunk for the Nvidia
> environment where Nvidia doesn't support use of data (user) pages and
> nfs kernel allocated pages in the same segment.
> 
> Alternatively, my ask is then to change rpcrdma_convert_iovs() to
> return 2 segs instead of one: one for the pages and another for the
> tail.
> 
> > 
> > > -             rpcrdma_convert_kvec(&xdrbuf->tail[0], seg, &n);
> > > -
> > > -out:
> > >       if (unlikely(n > RPCRDMA_MAX_SEGS))
> > >               return -EIO;
> > >       return n;
> > > --
> > > 2.27.0
> > > 
> > 
> > --
> > Chuck Lever
> > 
> > 
> > 

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com




* Re: [RFC 1/2] xprtrdma: xdr pad optimization revisted again
  2021-08-30 17:24     ` Olga Kornievskaia
  2021-08-30 17:34       ` Trond Myklebust
@ 2021-08-30 17:35       ` Chuck Lever III
  1 sibling, 0 replies; 17+ messages in thread
From: Chuck Lever III @ 2021-08-30 17:35 UTC (permalink / raw)
  To: Olga Kornievskaia; +Cc: Trond Myklebust, Anna Schumaker, Linux NFS Mailing List



> On Aug 30, 2021, at 1:24 PM, Olga Kornievskaia <olga.kornievskaia@gmail.com> wrote:
> 
> On Mon, Aug 30, 2021 at 1:04 PM Chuck Lever III <chuck.lever@oracle.com> wrote:
>> 
>> Hi Olga-
>> 
>>> On Aug 30, 2021, at 12:53 PM, Olga Kornievskaia <olga.kornievskaia@gmail.com> wrote:
>>> 
>>> From: Olga Kornievskaia <kolga@netapp.com>
>>> 
>>> Given the patch "Always provide aligned buffers to the RPC read layers",
>>> RPC over RDMA doesn't need to look at the tail page and add that space
>>> to the write chunk.
>>> 
>>> For the RFC 8166 compliant server, it must not write an XDR padding
>>> into the write chunk (even if space was provided). Historically
>>> (before RFC 8166) Solaris RDMA server has been requiring the client
>>> to provide space for the XDR padding and thus this client code has
>>> existed.
>> 
>> I don't understand this change.
>> 
>> So, the upper layer doesn't provide XDR padding any more. That doesn't
>> mean Solaris servers still aren't going to want to write into it. The
>> client still has to provide this padding from somewhere.
>> 
>> This suggests that "Always provide aligned buffers to the RPC read
>> layers" breaks our interop with Solaris servers. Does it?
> 
> No, I don't believe "Always provide aligned buffers to the RPC read
> layers" breaks the interoperability. THIS patch would break the
> interop.
> 
> If we are not willing to break the interoperability and support only
> servers that comply with RFC 8166, this patch is not needed.
> 
>>> Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
>>> ---
>>> net/sunrpc/xprtrdma/rpc_rdma.c | 15 ---------------
>>> 1 file changed, 15 deletions(-)
>>> 
>>> diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c b/net/sunrpc/xprtrdma/rpc_rdma.c
>>> index c335c1361564..2c4146bcf2a8 100644
>>> --- a/net/sunrpc/xprtrdma/rpc_rdma.c
>>> +++ b/net/sunrpc/xprtrdma/rpc_rdma.c
>>> @@ -255,21 +255,6 @@ rpcrdma_convert_iovs(struct rpcrdma_xprt *r_xprt, struct xdr_buf *xdrbuf,
>>>              page_base = 0;
>>>      }
>>> 
>>> -     if (type == rpcrdma_readch)
>>> -             goto out;
>>> -
>>> -     /* When encoding a Write chunk, some servers need to see an
>>> -      * extra segment for non-XDR-aligned Write chunks. The upper
>>> -      * layer provides space in the tail iovec that may be used
>>> -      * for this purpose.
>>> -      */
>>> -     if (type == rpcrdma_writech && r_xprt->rx_ep->re_implicit_roundup)
>>> -             goto out;
>>> -
>>> -     if (xdrbuf->tail[0].iov_len)
>> 
>> Instead of checking for a tail, we could check
>> 
>>        if (xdr_pad_size(xdrbuf->page_len))
>> 
>> and provide some tail space in that case.
> 
> I don't believe this is any different than what we have now. If the
> page size is non-4byte aligned then, we would still allocate size for
> the padding which "SHOULD NOT" be there. But yes it is allowed to be
> there.
> 
> The problem, as you know from our offline discussion, is allocating
> the tail page and including it in the write chunk for the Nvidia
> environment where Nvidia doesn't support use of data (user) pages and
> nfs kernel allocated pages in the same segment.
> 
> Alternatively, my ask is then to change rpcrdma_convert_iovs() to
> return 2 segs instead of one: one for the pages and another for the
> tail.

At this point, it's clear that the Nvidia proof-of-concept is using
features that are not available in the stock upstream RDMA stack.
I'm not sympathetic to making code changes to accommodate them unless
or until the upstream RDMA ecosystem can deal properly with their use
case. (they are of course free to modify the code themselves, it's
just not likely to get merged).

We'd have to poll the set of deployed Solaris servers to understand
whether it is safe to remove this logic.


>>> -             rpcrdma_convert_kvec(&xdrbuf->tail[0], seg, &n);
>>> -
>>> -out:
>>>      if (unlikely(n > RPCRDMA_MAX_SEGS))
>>>              return -EIO;
>>>      return n;
>>> --
>>> 2.27.0
>>> 
>> 
>> --
>> Chuck Lever

--
Chuck Lever





* Re: [RFC 1/2] xprtrdma: xdr pad optimization revisted again
  2021-08-30 17:34       ` Trond Myklebust
@ 2021-08-30 18:02         ` Chuck Lever III
  2021-08-30 18:18           ` Trond Myklebust
  0 siblings, 1 reply; 17+ messages in thread
From: Chuck Lever III @ 2021-08-30 18:02 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Olga Kornievskaia, Linux NFS Mailing List, Anna Schumaker



> On Aug 30, 2021, at 1:34 PM, Trond Myklebust <trondmy@hammerspace.com> wrote:
> 
> On Mon, 2021-08-30 at 13:24 -0400, Olga Kornievskaia wrote:
>> On Mon, Aug 30, 2021 at 1:04 PM Chuck Lever III
>> <chuck.lever@oracle.com> wrote:
>>> 
>>> Hi Olga-
>>> 
>>>> On Aug 30, 2021, at 12:53 PM, Olga Kornievskaia
>>>> <olga.kornievskaia@gmail.com> wrote:
>>>> 
>>>> From: Olga Kornievskaia <kolga@netapp.com>
>>>> 
>>>> Given the patch "Always provide aligned buffers to the RPC read
>>>> layers",
>>>> RPC over RDMA doesn't need to look at the tail page and add that
>>>> space
>>>> to the write chunk.
>>>> 
>>>> For the RFC 8166 compliant server, it must not write an XDR
>>>> padding
>>>> into the write chunk (even if space was provided). Historically
>>>> (before RFC 8166) Solaris RDMA server has been requiring the
>>>> client
>>>> to provide space for the XDR padding and thus this client code
>>>> has
>>>> existed.
>>> 
>>> I don't understand this change.
>>> 
>>> So, the upper layer doesn't provide XDR padding any more. That
>>> doesn't
>>> mean Solaris servers still aren't going to want to write into it.
>>> The
>>> client still has to provide this padding from somewhere.
>>> 
>>> This suggests that "Always provide aligned buffers to the RPC read
>>> layers" breaks our interop with Solaris servers. Does it?
>> 
>> No, I don't believe "Always provide aligned buffers to the RPC read
>> layers" breaks the interoperability. THIS patch would break the
>> interop.
>> 
>> If we are not willing to break the interoperability and support only
>> servers that comply with RFC 8166, this patch is not needed.
> 
> Why? The intention of the first patch is to ensure that we do not have
> buffers that are not word aligned. If Solaris wants to write padding
> after the end of the file, then there is space in the page buffer for
> it to do so. There should be no need for an extra tail in which to
> write the padding.

The RPC/RDMA protocol is designed for hardware-offloaded direct data
placement. That means the padding, which isn't data, must be directed
to another buffer.

This is a problem with RPC/RDMA v1 implementations. RFC 5666 was
ambiguous, so there are implementations that write XDR padding into
Write chunks. This is why RFC 8166 says SHOULD NOT instead of MUST
NOT.

I believe rpcrdma-version-two makes it a requirement not to use XDR
padding in either Read or Write data payload chunks.


> This means that the RDMA and TCP cases should end up doing the same
> thing for the case of the Solaris server: the padding is written into
> the page buffer. There is nothing written to the tail in either case.

"Always provide" can guarantee that the NFS client makes aligned
requests for buffered I/O, but what about NFS direct I/O from user
space? The NIC will place the data payload in the application
buffer, but there's no guarantee that the NFS READ request will be
aligned or that the buffer will be able to sink the extra padding
bytes.

We would definitely consider it an error if an unaligned RDMA Read
leaked the link-layer's 4-byte padding into a sink buffer.

So, "Always provide" is nice for the in-kernel NFS client, but I
don't believe it allows the way xprtrdma behaves to be changed.


>>>> Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
>>>> ---
>>>> net/sunrpc/xprtrdma/rpc_rdma.c | 15 ---------------
>>>> 1 file changed, 15 deletions(-)
>>>> 
>>>> diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c
>>>> b/net/sunrpc/xprtrdma/rpc_rdma.c
>>>> index c335c1361564..2c4146bcf2a8 100644
>>>> --- a/net/sunrpc/xprtrdma/rpc_rdma.c
>>>> +++ b/net/sunrpc/xprtrdma/rpc_rdma.c
>>>> @@ -255,21 +255,6 @@ rpcrdma_convert_iovs(struct rpcrdma_xprt
>>>> *r_xprt, struct xdr_buf *xdrbuf,
>>>>               page_base = 0;
>>>>       }
>>>> 
>>>> -     if (type == rpcrdma_readch)
>>>> -             goto out;
>>>> -
>>>> -     /* When encoding a Write chunk, some servers need to see an
>>>> -      * extra segment for non-XDR-aligned Write chunks. The
>>>> upper
>>>> -      * layer provides space in the tail iovec that may be used
>>>> -      * for this purpose.
>>>> -      */
>>>> -     if (type == rpcrdma_writech && r_xprt->rx_ep-
>>>>> re_implicit_roundup)
>>>> -             goto out;
>>>> -
>>>> -     if (xdrbuf->tail[0].iov_len)
>>> 
>>> Instead of checking for a tail, we could check
>>> 
>>>         if (xdr_pad_size(xdrbuf->page_len))
>>> 
>>> and provide some tail space in that case.
>> 
>> I don't believe this is any different than what we have now. If the
>> page size is non-4byte aligned then, we would still allocate size for
>> the padding which "SHOULD NOT" be there. But yes it is allowed to be
>> there.
>> 
>> The problem, as you know from our offline discussion, is allocating
>> the tail page and including it in the write chunk for the Nvidia
>> environment where Nvidia doesn't support use of data (user) pages and
>> nfs kernel allocated pages in the same segment.
>> 
>> Alternatively, my ask is then to change rpcrdma_convert_iovs() to
>> return 2 segs instead of one: one for the pages and another for the
>> tail.
>> 
>>> 
>>>> -             rpcrdma_convert_kvec(&xdrbuf->tail[0], seg, &n);
>>>> -
>>>> -out:
>>>>       if (unlikely(n > RPCRDMA_MAX_SEGS))
>>>>               return -EIO;
>>>>       return n;
>>>> --
>>>> 2.27.0
>>>> 
>>> 
>>> --
>>> Chuck Lever
>>> 
>>> 
>>> 
> 
> -- 
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.myklebust@hammerspace.com

--
Chuck Lever





* Re: [RFC 1/2] xprtrdma: xdr pad optimization revisted again
  2021-08-30 18:02         ` Chuck Lever III
@ 2021-08-30 18:18           ` Trond Myklebust
  2021-08-30 20:37             ` Chuck Lever III
  0 siblings, 1 reply; 17+ messages in thread
From: Trond Myklebust @ 2021-08-30 18:18 UTC (permalink / raw)
  To: chuck.lever; +Cc: linux-nfs, anna.schumaker, olga.kornievskaia

On Mon, 2021-08-30 at 18:02 +0000, Chuck Lever III wrote:
> 
> 
> > On Aug 30, 2021, at 1:34 PM, Trond Myklebust
> > <trondmy@hammerspace.com> wrote:
> > 
> > On Mon, 2021-08-30 at 13:24 -0400, Olga Kornievskaia wrote:
> > > On Mon, Aug 30, 2021 at 1:04 PM Chuck Lever III
> > > <chuck.lever@oracle.com> wrote:
> > > > 
> > > > Hi Olga-
> > > > 
> > > > > On Aug 30, 2021, at 12:53 PM, Olga Kornievskaia
> > > > > <olga.kornievskaia@gmail.com> wrote:
> > > > > 
> > > > > From: Olga Kornievskaia <kolga@netapp.com>
> > > > > 
> > > > > Given the patch "Always provide aligned buffers to the RPC
> > > > > read
> > > > > layers",
> > > > > RPC over RDMA doesn't need to look at the tail page and add
> > > > > that
> > > > > space
> > > > > to the write chunk.
> > > > > 
> > > > > For the RFC 8166 compliant server, it must not write an XDR
> > > > > padding
> > > > > into the write chunk (even if space was provided).
> > > > > Historically
> > > > > (before RFC 8166) Solaris RDMA server has been requiring the
> > > > > client
> > > > > to provide space for the XDR padding and thus this client
> > > > > code
> > > > > has
> > > > > existed.
> > > > 
> > > > I don't understand this change.
> > > > 
> > > > So, the upper layer doesn't provide XDR padding any more. That
> > > > doesn't
> > > > mean Solaris servers still aren't going to want to write into
> > > > it.
> > > > The
> > > > client still has to provide this padding from somewhere.
> > > > 
> > > > This suggests that "Always provide aligned buffers to the RPC
> > > > read
> > > > layers" breaks our interop with Solaris servers. Does it?
> > > 
> > > No, I don't believe "Always provide aligned buffers to the RPC
> > > read
> > > layers" breaks the interoperability. THIS patch would break the
> > > interop.
> > > 
> > > If we are not willing to break the interoperability and support
> > > only
> > > servers that comply with RFC 8166, this patch is not needed.
> > 
> > Why? The intention of the first patch is to ensure that we do not
> > have
> > buffers that are not word aligned. If Solaris wants to write
> > padding
> > after the end of the file, then there is space in the page buffer
> > for
> > it to do so. There should be no need for an extra tail in which to
> > write the padding.
> 
> The RPC/RDMA protocol is designed for hardware-offloaded direct data
> placement. That means the padding, which isn't data, must be directed
> to another buffer.
> 
> This is a problem with RPC/RDMA v1 implementations. RFC 5666 was
> ambiguous, so there are implementations that write XDR padding into
> Write chunks. This is why RFC 8166 says SHOULD NOT instead of MUST
> NOT.
> 
> I believe rpcrdma-version-two makes it a requirement not to use XDR
> padding in either Read or Write data payload chunks.
> 
> 
Correct, but in order to satisfy the needs of the Solaris server,
you've hijacked the tail for use as a data buffer. AFAICS it is not
being used as a SEND buffer target, but is instead being turned into a
write chunk target. That is not acceptable!

It means that we are now limited to creating COMPOUNDs where no more
operations follow the READ op, because if we do add one, we end up
with a situation where the RDMA behaviour breaks.

> > This means that the RDMA and TCP cases should end up doing the same
> > thing for the case of the Solaris server: the padding is written
> > into
> > the page buffer. There is nothing written to the tail in either
> > case.
> 
> "Always provide" can guarantee that the NFS client makes aligned
> requests for buffered I/O, but what about NFS direct I/O from user
> space? The NIC will place the data payload in the application
> buffer, but there's no guarantee that the NFS READ request will be
> aligned or that the buffer will be able to sink the extra padding
> bytes.
> 
> We would definitely consider it an error if an unaligned RDMA Read
> leaked the link-layer's 4-byte padding into a sink buffer.
> 
> So, "Always provide" is nice for the in-kernel NFS client, but I
> don't believe it allows the way xprtrdma behaves to be changed.
> 

If you're doing an unaligned READ from user space, then you are already
in a situation where you're doing something that is incompatible with
block device requirements.
If there really are any applications that contain O_DIRECT code
specifically for use with NFS, then we can artificially force the
buffers to be aligned by reducing the size of the buffer to align to a
4-byte boundary. NFS supports returning short reads.

> 
> > > > > Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
> > > > > ---
> > > > > net/sunrpc/xprtrdma/rpc_rdma.c | 15 ---------------
> > > > > 1 file changed, 15 deletions(-)
> > > > > 
> > > > > diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c
> > > > > b/net/sunrpc/xprtrdma/rpc_rdma.c
> > > > > index c335c1361564..2c4146bcf2a8 100644
> > > > > --- a/net/sunrpc/xprtrdma/rpc_rdma.c
> > > > > +++ b/net/sunrpc/xprtrdma/rpc_rdma.c
> > > > > @@ -255,21 +255,6 @@ rpcrdma_convert_iovs(struct rpcrdma_xprt
> > > > > *r_xprt, struct xdr_buf *xdrbuf,
> > > > >               page_base = 0;
> > > > >       }
> > > > > 
> > > > > -     if (type == rpcrdma_readch)
> > > > > -             goto out;
> > > > > -
> > > > > -     /* When encoding a Write chunk, some servers need to
> > > > > see an
> > > > > -      * extra segment for non-XDR-aligned Write chunks. The
> > > > > upper
> > > > > -      * layer provides space in the tail iovec that may be
> > > > > used
> > > > > -      * for this purpose.
> > > > > -      */
> > > > > -     if (type == rpcrdma_writech && r_xprt->rx_ep-
> > > > > > re_implicit_roundup)
> > > > > -             goto out;
> > > > > -
> > > > > -     if (xdrbuf->tail[0].iov_len)
> > > > 
> > > > Instead of checking for a tail, we could check
> > > > 
> > > >         if (xdr_pad_size(xdrbuf->page_len))
> > > > 
> > > > and provide some tail space in that case.
> > > 
> > > I don't believe this is any different than what we have now. If
> > > the
> > > page size is non-4byte aligned then, we would still allocate size
> > > for
> > > the padding which "SHOULD NOT" be there. But yes it is allowed to
> > > be
> > > there.
> > > 
> > > The problem, as you know from our offline discussion, is
> > > allocating
> > > the tail page and including it in the write chunk for the Nvidia
> > > environment where Nvidia doesn't support use of data (user) pages
> > > and
> > > nfs kernel allocated pages in the same segment.
> > > 
> > > Alternatively, my ask is then to change rpcrdma_convert_iovs() to
> > > return 2 segs instead of one: one for the pages and another for
> > > the
> > > tail.
> > > 
> > > > 
> > > > > -             rpcrdma_convert_kvec(&xdrbuf->tail[0], seg,
> > > > > &n);
> > > > > -
> > > > > -out:
> > > > >       if (unlikely(n > RPCRDMA_MAX_SEGS))
> > > > >               return -EIO;
> > > > >       return n;
> > > > > --
> > > > > 2.27.0
> > > > > 
> > > > 
> > > > --
> > > > Chuck Lever
> > > > 
> > > > 
> > > > 
> > 
> > -- 
> > Trond Myklebust
> > Linux NFS client maintainer, Hammerspace
> > trond.myklebust@hammerspace.com
> 
> --
> Chuck Lever
> 
> 
> 

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com




* Re: [RFC 1/2] xprtrdma: xdr pad optimization revisted again
  2021-08-30 18:18           ` Trond Myklebust
@ 2021-08-30 20:37             ` Chuck Lever III
  2021-08-31 14:33               ` Chuck Lever III
  2021-08-31 15:54               ` Olga Kornievskaia
  0 siblings, 2 replies; 17+ messages in thread
From: Chuck Lever III @ 2021-08-30 20:37 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Linux NFS Mailing List, Anna Schumaker, Olga Kornievskaia



> On Aug 30, 2021, at 2:18 PM, Trond Myklebust <trondmy@hammerspace.com> wrote:
> 
> On Mon, 2021-08-30 at 18:02 +0000, Chuck Lever III wrote:
>> 
>> 
>>> On Aug 30, 2021, at 1:34 PM, Trond Myklebust
>>> <trondmy@hammerspace.com> wrote:
>>> 
>>> On Mon, 2021-08-30 at 13:24 -0400, Olga Kornievskaia wrote:
>>>> On Mon, Aug 30, 2021 at 1:04 PM Chuck Lever III
>>>> <chuck.lever@oracle.com> wrote:
>>>>> 
>>>>> Hi Olga-
>>>>> 
>>>>>> On Aug 30, 2021, at 12:53 PM, Olga Kornievskaia
>>>>>> <olga.kornievskaia@gmail.com> wrote:
>>>>>> 
>>>>>> From: Olga Kornievskaia <kolga@netapp.com>
>>>>>> 
>>>>>> Given the patch "Always provide aligned buffers to the RPC
>>>>>> read
>>>>>> layers",
>>>>>> RPC over RDMA doesn't need to look at the tail page and add
>>>>>> that
>>>>>> space
>>>>>> to the write chunk.
>>>>>> 
>>>>>> For the RFC 8166 compliant server, it must not write an XDR
>>>>>> padding
>>>>>> into the write chunk (even if space was provided).
>>>>>> Historically
>>>>>> (before RFC 8166) Solaris RDMA server has been requiring the
>>>>>> client
>>>>>> to provide space for the XDR padding and thus this client
>>>>>> code
>>>>>> has
>>>>>> existed.
>>>>> 
>>>>> I don't understand this change.
>>>>> 
>>>>> So, the upper layer doesn't provide XDR padding any more. That
>>>>> doesn't
>>>>> mean Solaris servers still aren't going to want to write into
>>>>> it.
>>>>> The
>>>>> client still has to provide this padding from somewhere.
>>>>> 
>>>>> This suggests that "Always provide aligned buffers to the RPC
>>>>> read
>>>>> layers" breaks our interop with Solaris servers. Does it?
>>>> 
>>>> No, I don't believe "Always provide aligned buffers to the RPC
>>>> read
>>>> layers" breaks the interoperability. THIS patch would break the
>>>> interop.
>>>> 
>>>> If we are not willing to break the interoperability and support
>>>> only
>>>> servers that comply with RFC 8166, this patch is not needed.
>>> 
>>> Why? The intention of the first patch is to ensure that we do not
>>> have
>>> buffers that are not word aligned. If Solaris wants to write
>>> padding
>>> after the end of the file, then there is space in the page buffer
>>> for
>>> it to do so. There should be no need for an extra tail in which to
>>> write the padding.
>> 
>> The RPC/RDMA protocol is designed for hardware-offloaded direct data
>> placement. That means the padding, which isn't data, must be directed
>> to another buffer.
>> 
>> This is a problem with RPC/RDMA v1 implementations. RFC 5666 was
>> ambiguous, so there are implementations that write XDR padding into
>> Write chunks. This is why RFC 8166 says SHOULD NOT instead of MUST
>> NOT.
>> 
>> I believe rpcrdma-version-two makes it a requirement not to use XDR
>> padding in either Read or Write data payload chunks.
>> 
>> 
> Correct, but in order to satisfy the needs of the Solaris server,
> you've hijacked the tail for use as a data buffer. AFAICS it is not
> being used as a SEND buffer target, but is instead being turned into a
> write chunk target. That is not acceptable!

The buffer is being used as both. Proper function depends on the
order of RDMA Writes and Receives on the client.

rpcrdma_encode_write_list() registers up to an extra 3 bytes in
rq_rcv_buf.tail as part of the Write chunk. The R_keys for the
segments in the Write chunk are then sent to the server as part
of the RPC Call.

As part of Replying, the server RDMA Writes data into the chunk,
and possibly also RDMA Writes padding. It then does an RDMA Send
containing the RPC Reply.

The Send data always lands in the Receive buffer _after_ the Write
data. The Receive completion guarantees that previous RDMA Writes
are complete. Receive handling invalidates and unmaps the memory,
and then it is made available to the RPC client.


> It means that we now are limited to creating COMPOUNDs where there are
> no more operations following the READ op because if we do so, we end up
> with a situation where the RDMA behaviour breaks.

I haven't heard reports of a problem like this.

However, if there is a problem, it would be simple to create a
persistently-registered memory region that is not part of any RPC
buffer that can be used to catch unused Write chunk XDR padding.
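
For what it's worth, a very rough sketch of that idea (everything
below is hypothetical, including the names; it is not existing
xprtrdma code):

	/* A small scratch region, registered once per endpoint, whose
	 * only job is to absorb Write chunk XDR padding from servers
	 * that still insist on writing it.
	 */
	struct rpcrdma_pad_sink {
		u8	ps_bytes[4];	/* at most 3 pad bytes land here */
		u32	ps_rkey;	/* from the one-time registration */
		u64	ps_addr;	/* registered DMA address */
	};

	/* When encoding a Write chunk whose payload is not XDR-aligned,
	 * append one extra segment of xdr_pad_size(xdrbuf->page_len)
	 * bytes that points at the shared pad sink, so no part of
	 * rq_rcv_buf.tail has to be exposed to the server.
	 */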


>>> This means that the RDMA and TCP cases should end up doing the same
>>> thing for the case of the Solaris server: the padding is written
>>> into
>>> the page buffer. There is nothing written to the tail in either
>>> case.
>> 
>> "Always provide" can guarantee that the NFS client makes aligned
>> requests for buffered I/O, but what about NFS direct I/O from user
>> space? The NIC will place the data payload in the application
>> buffer, but there's no guarantee that the NFS READ request will be
>> aligned or that the buffer will be able to sink the extra padding
>> bytes.
>> 
>> We would definitely consider it an error if an unaligned RDMA Read
>> leaked the link-layer's 4-byte padding into a sink buffer.
>> 
>> So, "Always provide" is nice for the in-kernel NFS client, but I
>> don't believe it allows the way xprtrdma behaves to be changed.
>> 
> 
> If you're doing an unaligned READ from user space then you are already
> in a situation where you're doing something that is incompatible with
> block device requirements.
> If there really are any applications that contain O_DIRECT code
> specifically for use with NFS, then we can artificially force the
> buffers to be aligned by reducing the size of the buffer to align to a
> 4 byte boundary. NFS supports returning short reads.

Or xprtrdma can use the scheme I describe above. I think that
would be simpler and would avoid layering violations.

That would also possibly address the Nvidia problem, since a
pre-registered MR that handles Write padding would always be a
separate RDMA segment.

Again, I doubt this is necessary to fix any operational problem
with _supported_ use cases, but let me know if you'd like me to
make this change.


>>>>>> Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
>>>>>> ---
>>>>>> net/sunrpc/xprtrdma/rpc_rdma.c | 15 ---------------
>>>>>> 1 file changed, 15 deletions(-)
>>>>>> 
>>>>>> diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c
>>>>>> b/net/sunrpc/xprtrdma/rpc_rdma.c
>>>>>> index c335c1361564..2c4146bcf2a8 100644
>>>>>> --- a/net/sunrpc/xprtrdma/rpc_rdma.c
>>>>>> +++ b/net/sunrpc/xprtrdma/rpc_rdma.c
>>>>>> @@ -255,21 +255,6 @@ rpcrdma_convert_iovs(struct rpcrdma_xprt
>>>>>> *r_xprt, struct xdr_buf *xdrbuf,
>>>>>>               page_base = 0;
>>>>>>       }
>>>>>> 
>>>>>> -     if (type == rpcrdma_readch)
>>>>>> -             goto out;
>>>>>> -
>>>>>> -     /* When encoding a Write chunk, some servers need to
>>>>>> see an
>>>>>> -      * extra segment for non-XDR-aligned Write chunks. The
>>>>>> upper
>>>>>> -      * layer provides space in the tail iovec that may be
>>>>>> used
>>>>>> -      * for this purpose.
>>>>>> -      */
>>>>>> -     if (type == rpcrdma_writech && r_xprt->rx_ep-
>>>>>>> re_implicit_roundup)
>>>>>> -             goto out;
>>>>>> -
>>>>>> -     if (xdrbuf->tail[0].iov_len)
>>>>> 
>>>>> Instead of checking for a tail, we could check
>>>>> 
>>>>>         if (xdr_pad_size(xdrbuf->page_len))
>>>>> 
>>>>> and provide some tail space in that case.
>>>> 
>>>> I don't believe this is any different than what we have now. If
>>>> the
>>>> page size is non-4byte aligned then, we would still allocate size
>>>> for
>>>> the padding which "SHOULD NOT" be there. But yes it is allowed to
>>>> be
>>>> there.
>>>> 
>>>> The problem, as you know from our offline discussion, is
>>>> allocating
>>>> the tail page and including it in the write chunk for the Nvidia
>>>> environment where Nvidia doesn't support use of data (user) pages
>>>> and
>>>> nfs kernel allocated pages in the same segment.
>>>> 
>>>> Alternatively, my ask is then to change rpcrdma_convert_iovs() to
>>>> return 2 segs instead of one: one for the pages and another for
>>>> the
>>>> tail.
>>>> 
>>>>> 
>>>>>> -             rpcrdma_convert_kvec(&xdrbuf->tail[0], seg,
>>>>>> &n);
>>>>>> -
>>>>>> -out:
>>>>>>       if (unlikely(n > RPCRDMA_MAX_SEGS))
>>>>>>               return -EIO;
>>>>>>       return n;
>>>>>> --
>>>>>> 2.27.0
>>>>>> 
>>>>> 
>>>>> --
>>>>> Chuck Lever
>>>>> 
>>>>> 
>>>>> 
>>> 
>>> -- 
>>> Trond Myklebust
>>> Linux NFS client maintainer, Hammerspace
>>> trond.myklebust@hammerspace.com
>> 
>> --
>> Chuck Lever
>> 
>> 
>> 
> 
> -- 
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.myklebust@hammerspace.com

--
Chuck Lever





* Re: [RFC 1/2] xprtrdma: xdr pad optimization revisted again
  2021-08-30 20:37             ` Chuck Lever III
@ 2021-08-31 14:33               ` Chuck Lever III
  2021-08-31 15:58                 ` Olga Kornievskaia
  2021-08-31 15:54               ` Olga Kornievskaia
  1 sibling, 1 reply; 17+ messages in thread
From: Chuck Lever III @ 2021-08-31 14:33 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Linux NFS Mailing List, Anna Schumaker, Olga Kornievskaia


> On Aug 30, 2021, at 4:37 PM, Chuck Lever III <chuck.lever@oracle.com> wrote:
> 
>> On Aug 30, 2021, at 2:18 PM, Trond Myklebust <trondmy@hammerspace.com> wrote:
>> 
>> On Mon, 2021-08-30 at 18:02 +0000, Chuck Lever III wrote:
>>> 
>>>> On Aug 30, 2021, at 1:34 PM, Trond Myklebust
>>>> <trondmy@hammerspace.com> wrote:
>>>> 
>>>> On Mon, 2021-08-30 at 13:24 -0400, Olga Kornievskaia wrote:
>>>>> On Mon, Aug 30, 2021 at 1:04 PM Chuck Lever III
>>>>> <chuck.lever@oracle.com> wrote:
>>>>>> 
>>>>>> Hi Olga-
>>>>>> 
>>>>>>> On Aug 30, 2021, at 12:53 PM, Olga Kornievskaia
>>>>>>> <olga.kornievskaia@gmail.com> wrote:
>>>>>>> 
>>>>>>> From: Olga Kornievskaia <kolga@netapp.com>
>>>>>>> 
>>>>>>> Given the patch "Always provide aligned buffers to the RPC
>>>>>>> read
>>>>>>> layers",
>>>>>>> RPC over RDMA doesn't need to look at the tail page and add
>>>>>>> that
>>>>>>> space
>>>>>>> to the write chunk.
>>>>>>> 
>>>>>>> For the RFC 8166 compliant server, it must not write an XDR
>>>>>>> padding
>>>>>>> into the write chunk (even if space was provided).
>>>>>>> Historically
>>>>>>> (before RFC 8166) Solaris RDMA server has been requiring the
>>>>>>> client
>>>>>>> to provide space for the XDR padding and thus this client
>>>>>>> code
>>>>>>> has
>>>>>>> existed.
>>>>>> 
>>>>>> I don't understand this change.
>>>>>> 
>>>>>> So, the upper layer doesn't provide XDR padding any more. That
>>>>>> doesn't
>>>>>> mean Solaris servers still aren't going to want to write into
>>>>>> it.
>>>>>> The
>>>>>> client still has to provide this padding from somewhere.
>>>>>> 
>>>>>> This suggests that "Always provide aligned buffers to the RPC
>>>>>> read
>>>>>> layers" breaks our interop with Solaris servers. Does it?
>>>>> 
>>>>> No, I don't believe "Always provide aligned buffers to the RPC
>>>>> read
>>>>> layers" breaks the interoperability. THIS patch would break the
>>>>> interop.
>>>>> 
>>>>> If we are not willing to break the interoperability and support
>>>>> only
>>>>> servers that comply with RFC 8166, this patch is not needed.
>>>> 
>>>> Why? The intention of the first patch is to ensure that we do not
>>>> have
>>>> buffers that are not word aligned. If Solaris wants to write
>>>> padding
>>>> after the end of the file, then there is space in the page buffer
>>>> for
>>>> it to do so. There should be no need for an extra tail in which to
>>>> write the padding.
>>> 
>>> The RPC/RDMA protocol is designed for hardware-offloaded direct data
>>> placement. That means the padding, which isn't data, must be directed
>>> to another buffer.
>>> 
>>> This is a problem with RPC/RDMA v1 implementations. RFC 5666 was
>>> ambiguous, so there are implementations that write XDR padding into
>>> Write chunks. This is why RFC 8166 says SHOULD NOT instead of MUST
>>> NOT.
>>> 
>>> I believe rpcrdma-version-two makes it a requirement not to use XDR
>>> padding in either Read or Write data payload chunks.
>>> 
>>> 
>> Correct, but in order to satisfy the needs of the Solaris server,
>> you've hijacked the tail for use as a data buffer. AFAICS it is not
>> being used as a SEND buffer target, but is instead being turned into a
>> write chunk target. That is not acceptable!
> 
> The buffer is being used as both. Proper function depends on the
> order of RDMA Writes and Receives on the client.

I've refreshed my memory. The above statement oversimplifies
a bit.

- The memory pointed to by rqst->rq_buffer is persistently
  registered to allow RDMA Send from it. It can also be
  registered on the fly for use in Read chunks.

- The memory pointed to by rqst->rq_rbuffer is NOT
  persistently registered and is never used for RDMA Send
  or Receive. It can be registered on the fly for use in a
  Write or Reply chunk. This is when the tail portion of
  rq_rcv_buf might be used to receive a pad.

- A third set of persistently registered buffers is used
  ONLY to catch RDMA Receive completions.

When a Receive completes, the reply handler decides how to
construct the received RPC message. If there was a Reply
chunk, it leaves rq_rcv_buf pointing to rq_rbuffer, as that's
where the server wrote the Reply chunk. If there wasn't a
Reply chunk, the handler points rq_rcv_buf to the RDMA
Receive buffer.

The only case where there might be overlap is when the client
has provisioned both a Write chunk and a Reply chunk. In that
case, yes, I can see that there might be an opportunity for
the server to RDMA Write the Write chunk's pad and the Reply
chunk into the same client memory buffer. However:

- Most servers don't write a pad. Modern Solaris servers
  behave "correctly" in this regard, I'm now told. Linux
  never writes the pad, and I'm told OnTAP also does not.

- Server implementations I'm aware of Write the Write chunk
  first and then the Reply chunk. The Reply chunk would
  overwrite the pad.

- The client hardly ever uses Write + Reply. It's most
  often one or the other.

So again, there is little to zero chance there is an existing
operational issue here. But see below.


> rpcrdma_encode_write_list() registers up to an extra 3 bytes in
> rq_rcv_buf.tail as part of the Write chunk. The R_keys for the
> segments in the Write chunk are then sent to the server as part
> of the RPC Call.
> 
> As part of Replying, the server RDMA Writes data into the chunk,
> and possibly also RDMA Writes padding. It then does an RDMA Send
> containing the RPC Reply.
> 
> The Send data always lands in the Receive buffer _after_ the Write
> data. The Receive completion guarantees that previous RDMA Writes
> are complete. Receive handling invalidates and unmaps the memory,
> and then it is made available to the RPC client.
> 
> 
>> It means that we now are limited to creating COMPOUNDs where there are
>> no more operations following the READ op because if we do so, we end up
>> with a situation where the RDMA behaviour breaks.
> 
> I haven't heard reports of a problem like this.
> 
> However, if there is a problem, it would be simple to create a
> persistently-registered memory region that is not part of any RPC
> buffer that can be used to catch unused Write chunk XDR padding.
> 
> 
>>>> This means that the RDMA and TCP cases should end up doing the same
>>>> thing for the case of the Solaris server: the padding is written
>>>> into
>>>> the page buffer. There is nothing written to the tail in either
>>>> case.
>>> 
>>> "Always provide" can guarantee that the NFS client makes aligned
>>> requests for buffered I/O, but what about NFS direct I/O from user
>>> space? The NIC will place the data payload in the application
>>> buffer, but there's no guarantee that the NFS READ request will be
>>> aligned or that the buffer will be able to sink the extra padding
>>> bytes.
>>> 
>>> We would definitely consider it an error if an unaligned RDMA Read
>>> leaked the link-layer's 4-byte padding into a sink buffer.
>>> 
>>> So, "Always provide" is nice for the in-kernel NFS client, but I
>>> don't believe it allows the way xprtrdma behaves to be changed.
>>> 
>> 
>> If you're doing an unaligned READ from user space then you are already
>> in a situation where you're doing something that is incompatible with
>> block device requirements.
>> If there really are any applications that contain O_DIRECT code
>> specifically for use with NFS, then we can artificially force the
>> buffers to be aligned by reducing the size of the buffer to align to a
>> 4 byte boundary. NFS supports returning short reads.
> 
> Or xprtrdma can use the scheme I describe above. I think that
> would be simpler and would avoid layering violations.
> 
> That would also possibly address the Nvidia problem, since a
> pre-registered MR that handles Write padding would always be a
> separate RDMA segment.
> 
> Again, I doubt this is necessary to fix any operational problem
> with _supported_ use cases, but let me know if you'd like me to
> make this change.

Since RFC 8666 says "SHOULD NOT", the spec mandates that the client
has to expect there might be a server that will RDMA Write the XDR
pad, so the Linux client really should always provide one. Having
a persistently registered spot to use for that means we have a
potential no-cost mechanism for providing that extra segment. The
whole "pad optimization" thing was because the pad segment is
currently registered on the fly, so it has a per-I/O cost.

So I'm leaning toward implementing this simple solution, at least
as proof-of-concept.
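
Roughly, the proof-of-concept could look something like the sketch
below. Only xdr_pad_size() and the rpcrdma_writech machinery are
existing names; re_write_pad, struct rpcrdma_pad_mr, and
encode_pad_segment() are made up for illustration, and the real hook
point in rpcrdma_encode_write_list() would differ:

  /* Sketch: one small buffer per endpoint, registered once at
   * connect time, that can absorb a server's Write chunk XDR pad.
   */
  static int rpcrdma_encode_write_pad(struct rpcrdma_xprt *r_xprt,
                                      struct xdr_stream *xdr,
                                      const struct xdr_buf *xdrbuf)
  {
          struct rpcrdma_pad_mr *pad = r_xprt->rx_ep->re_write_pad;
          u32 padlen = xdr_pad_size(xdrbuf->page_len);

          if (!padlen)
                  return 0;
          /* No per-I/O registration or invalidation cost: the pad
           * MR's rkey and address never change.
           */
          return encode_pad_segment(xdr, pad->rkey, padlen, pad->addr);
  }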


--
Chuck Lever




^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC 1/2] xprtrdma: xdr pad optimization revisted again
  2021-08-30 20:37             ` Chuck Lever III
  2021-08-31 14:33               ` Chuck Lever III
@ 2021-08-31 15:54               ` Olga Kornievskaia
  2021-08-31 16:02                 ` Chuck Lever III
  1 sibling, 1 reply; 17+ messages in thread
From: Olga Kornievskaia @ 2021-08-31 15:54 UTC (permalink / raw)
  To: Chuck Lever III; +Cc: Trond Myklebust, Linux NFS Mailing List, Anna Schumaker

On Mon, Aug 30, 2021 at 4:38 PM Chuck Lever III <chuck.lever@oracle.com> wrote:
>
>
>
> > On Aug 30, 2021, at 2:18 PM, Trond Myklebust <trondmy@hammerspace.com> wrote:
> >
> > On Mon, 2021-08-30 at 18:02 +0000, Chuck Lever III wrote:
> >>
> >>
> >>> On Aug 30, 2021, at 1:34 PM, Trond Myklebust
> >>> <trondmy@hammerspace.com> wrote:
> >>>
> >>> On Mon, 2021-08-30 at 13:24 -0400, Olga Kornievskaia wrote:
> >>>> On Mon, Aug 30, 2021 at 1:04 PM Chuck Lever III
> >>>> <chuck.lever@oracle.com> wrote:
> >>>>>
> >>>>> Hi Olga-
> >>>>>
> >>>>>> On Aug 30, 2021, at 12:53 PM, Olga Kornievskaia
> >>>>>> <olga.kornievskaia@gmail.com> wrote:
> >>>>>>
> >>>>>> From: Olga Kornievskaia <kolga@netapp.com>
> >>>>>>
> >>>>>> Given the patch "Always provide aligned buffers to the RPC
> >>>>>> read
> >>>>>> layers",
> >>>>>> RPC over RDMA doesn't need to look at the tail page and add
> >>>>>> that
> >>>>>> space
> >>>>>> to the write chunk.
> >>>>>>
> >>>>>> For the RFC 8166 compliant server, it must not write an XDR
> >>>>>> padding
> >>>>>> into the write chunk (even if space was provided).
> >>>>>> Historically
> >>>>>> (before RFC 8166) Solaris RDMA server has been requiring the
> >>>>>> client
> >>>>>> to provide space for the XDR padding and thus this client
> >>>>>> code
> >>>>>> has
> >>>>>> existed.
> >>>>>
> >>>>> I don't understand this change.
> >>>>>
> >>>>> So, the upper layer doesn't provide XDR padding any more. That
> >>>>> doesn't
> >>>>> mean Solaris servers still aren't going to want to write into
> >>>>> it.
> >>>>> The
> >>>>> client still has to provide this padding from somewhere.
> >>>>>
> >>>>> This suggests that "Always provide aligned buffers to the RPC
> >>>>> read
> >>>>> layers" breaks our interop with Solaris servers. Does it?
> >>>>
> >>>> No, I don't believe "Always provide aligned buffers to the RPC
> >>>> read
> >>>> layers" breaks the interoperability. THIS patch would break the
> >>>> interop.
> >>>>
> >>>> If we are not willing to break the interoperability and support
> >>>> only
> >>>> servers that comply with RFC 8166, this patch is not needed.
> >>>
> >>> Why? The intention of the first patch is to ensure that we do not
> >>> have
> >>> buffers that are not word aligned. If Solaris wants to write
> >>> padding
> >>> after the end of the file, then there is space in the page buffer
> >>> for
> >>> it to do so. There should be no need for an extra tail in which to
> >>> write the padding.
> >>
> >> The RPC/RDMA protocol is designed for hardware-offloaded direct data
> >> placement. That means the padding, which isn't data, must be directed
> >> to another buffer.
> >>
> >> This is a problem with RPC/RDMA v1 implementations. RFC 5666 was
> >> ambiguous, so there are implementations that write XDR padding into
> >> Write chunks. This is why RFC 8166 says SHOULD NOT instead of MUST
> >> NOT.
> >>
> >> I believe rpcrdma-version-two makes it a requirement not to use XDR
> >> padding in either Read or Write data payload chunks.
> >>
> >>
> > Correct, but in order to satisfy the needs of the Solaris server,
> > you've hijacked the tail for use as a data buffer. AFAICS it is not
> > being used as a SEND buffer target, but is instead being turned into a
> > write chunk target. That is not acceptable!
>
> The buffer is being used as both. Proper function depends on the
> order of RDMA Writes and Receives on the client.
>
> rpcrdma_encode_write_list() registers up to an extra 3 bytes in
> rq_rcv_buf.tail as part of the Write chunk. The R_keys for the
> segments in the Write chunk are then sent to the server as part
> of the RPC Call.

Just clarifying: nothing in the code limits the registration to up to
3 bytes. It allocates/registers the full tail.iov_len, which is
typically much larger than 4 bytes (40+ bytes).

> As part of Replying, the server RDMA Writes data into the chunk,
> and possibly also RDMA Writes padding. It then does an RDMA Send
> containing the RPC Reply.
>
> The Send data always lands in the Receive buffer _after_ the Write
> data. The Receive completion guarantees that previous RDMA Writes
> are complete. Receive handling invalidates and unmaps the memory,
> and then it is made available to the RPC client.
>
>
> > It means that we now are limited to creating COMPOUNDs where there are
> > no more operations following the READ op because if we do so, we end up
> > with a situation where the RDMA behaviour breaks.
>
> I haven't heard reports of a problem like this.

I think what might be referred to here is this: *if* an NFS READ
compound also had a VERIFY added after it (which might be interesting
for checking validity), the XDR encoding would leave non-empty bytes
in the tail for the VERIFY, not just the padding. In RDMA, the code
would register a write chunk, then look at the non-empty tail and
allocate/register all of those bytes (which are the VERIFY) into the
write chunk (basically unused bytes hanging off the write chunk). I
think the actual VERIFY call bytes would go into the RDMA Send call
buffer together with the READ header. The code might also then create
a reply chunk for the VERIFY reply (if it estimates the reply is
larger than the inline threshold).
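
To make that concrete, here is a rough, purely illustrative sketch of
how the reply buffer would be laid out for such a compound (the ops
and sizes are hypothetical, not taken from the actual encoder):

  /* Illustrative rq_rcv_buf layout for a hypothetical
   * PUTFH+READ+VERIFY reply:
   *
   *   head[0]  - reply header up through the READ opcode/count
   *   pages    - READ data payload (the Write chunk target)
   *   tail[0]  - 0-3 bytes of XDR pad for the READ data, followed
   *              by the encoded VERIFY result
   *
   * So tail[0].iov_len covers more than just the pad, and
   * registering the whole tail pulls the VERIFY bytes into the
   * Write chunk as well.
   */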

> However, if there is a problem, it would be simple to create a
> persistently-registered memory region that is not part of any RPC
> buffer that can be used to catch unused Write chunk XDR padding.

When I think of XDR padding, I'm thinking of a number of bytes less
than 4. Perhaps I'm confused in my understanding.  Tail.iov_len is
never a value less than 4. Tail.iov_len can contain bytes after an
opaque element for which a read or write chunk was used (eg
READ+VERIFY or WRITE+GETATTR). Thus when RDMA looks at tail.iov_len
and allocates that much memory, that allocation does not represent
just XDR padding. Since RDMA can't distinguish between
padding+operation vs padding, can this even be solved?

> >>> This means that the RDMA and TCP cases should end up doing the same
> >>> thing for the case of the Solaris server: the padding is written
> >>> into
> >>> the page buffer. There is nothing written to the tail in either
> >>> case.
> >>
> >> "Always provide" can guarantee that the NFS client makes aligned
> >> requests for buffered I/O, but what about NFS direct I/O from user
> >> space? The NIC will place the data payload in the application

To clarify, is direct I/O (with an application that uses unaligned
buffers) only a problem for a (non-modern) Solaris server that insists
on writing XDR padding into the write chunk?

> >> buffer, but there's no guarantee that the NFS READ request will be
> >> aligned or that the buffer will be able to sink the extra padding
> >> bytes.
> >>
> >> We would definitely consider it an error if an unaligned RDMA Read
> >> leaked the link-layer's 4-byte padding into a sink buffer.
> >>
> >> So, "Always provide" is nice for the in-kernel NFS client, but I
> >> don't believe it allows the way xprtrdma behaves to be changed.
> >>
> >
> > If you're doing an unaligned READ from user space then you are already
> > in a situation where you're doing something that is incompatible with
> > block device requirements.
> > If there really are any applications that contain O_DIRECT code
> > specifically for use with NFS, then we can artificially force the
> > buffers to be aligned by reducing the size of the buffer to align to a
> > 4 byte boundary. NFS supports returning short reads.
>
> Or xprtrdma can use the scheme I describe above. I think that
> would be simpler and would avoid layering violations.
>
> That would also possibly address the Nvidia problem, since a
> pre-registered MR that handles Write padding would always be a
> separate RDMA segment.

If we have separate segment(s) that don't mix page data with
NFS-allocated pages used for whatever else, that would indeed address
the Nvidia problem.

> Again, I doubt this is necessary to fix any operational problem
> with _supported_ use cases, but let me know if you'd like me to
> make this change.
>
>
> >>>>>> Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
> >>>>>> ---
> >>>>>> net/sunrpc/xprtrdma/rpc_rdma.c | 15 ---------------
> >>>>>> 1 file changed, 15 deletions(-)
> >>>>>>
> >>>>>> diff --git a/net/sunrpc/xprtrdma/rpc_rdma.c
> >>>>>> b/net/sunrpc/xprtrdma/rpc_rdma.c
> >>>>>> index c335c1361564..2c4146bcf2a8 100644
> >>>>>> --- a/net/sunrpc/xprtrdma/rpc_rdma.c
> >>>>>> +++ b/net/sunrpc/xprtrdma/rpc_rdma.c
> >>>>>> @@ -255,21 +255,6 @@ rpcrdma_convert_iovs(struct rpcrdma_xprt
> >>>>>> *r_xprt, struct xdr_buf *xdrbuf,
> >>>>>>               page_base = 0;
> >>>>>>       }
> >>>>>>
> >>>>>> -     if (type == rpcrdma_readch)
> >>>>>> -             goto out;
> >>>>>> -
> >>>>>> -     /* When encoding a Write chunk, some servers need to
> >>>>>> see an
> >>>>>> -      * extra segment for non-XDR-aligned Write chunks. The
> >>>>>> upper
> >>>>>> -      * layer provides space in the tail iovec that may be
> >>>>>> used
> >>>>>> -      * for this purpose.
> >>>>>> -      */
> >>>>>> -     if (type == rpcrdma_writech && r_xprt->rx_ep-
> >>>>>>> re_implicit_roundup)
> >>>>>> -             goto out;
> >>>>>> -
> >>>>>> -     if (xdrbuf->tail[0].iov_len)
> >>>>>
> >>>>> Instead of checking for a tail, we could check
> >>>>>
> >>>>>         if (xdr_pad_size(xdrbuf->page_len))
> >>>>>
> >>>>> and provide some tail space in that case.
> >>>>
> >>>> I don't believe this is any different than what we have now. If
> >>>> the
> >>>> page size is non-4byte aligned then, we would still allocate size
> >>>> for
> >>>> the padding which "SHOULD NOT" be there. But yes it is allowed to
> >>>> be
> >>>> there.
> >>>>
> >>>> The problem, as you know from our offline discussion, is
> >>>> allocating
> >>>> the tail page and including it in the write chunk for the Nvidia
> >>>> environment where Nvidia doesn't support use of data (user) pages
> >>>> and
> >>>> nfs kernel allocated pages in the same segment.
> >>>>
> >>>> Alternatively, my ask is then to change rpcrdma_convert_iovs() to
> >>>> return 2 segs instead of one: one for the pages and another for
> >>>> the
> >>>> tail.
> >>>>
> >>>>>
> >>>>>> -             rpcrdma_convert_kvec(&xdrbuf->tail[0], seg,
> >>>>>> &n);
> >>>>>> -
> >>>>>> -out:
> >>>>>>       if (unlikely(n > RPCRDMA_MAX_SEGS))
> >>>>>>               return -EIO;
> >>>>>>       return n;
> >>>>>> --
> >>>>>> 2.27.0
> >>>>>>
> >>>>>
> >>>>> --
> >>>>> Chuck Lever
> >>>>>
> >>>>>
> >>>>>
> >>>
> >>> --
> >>> Trond Myklebust
> >>> Linux NFS client maintainer, Hammerspace
> >>> trond.myklebust@hammerspace.com
> >>
> >> --
> >> Chuck Lever
> >>
> >>
> >>
> >
> > --
> > Trond Myklebust
> > Linux NFS client maintainer, Hammerspace
> > trond.myklebust@hammerspace.com
>
> --
> Chuck Lever
>
>
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC 1/2] xprtrdma: xdr pad optimization revisted again
  2021-08-31 14:33               ` Chuck Lever III
@ 2021-08-31 15:58                 ` Olga Kornievskaia
  2021-08-31 16:11                   ` Chuck Lever III
  0 siblings, 1 reply; 17+ messages in thread
From: Olga Kornievskaia @ 2021-08-31 15:58 UTC (permalink / raw)
  To: Chuck Lever III; +Cc: Trond Myklebust, Linux NFS Mailing List, Anna Schumaker

On Tue, Aug 31, 2021 at 10:33 AM Chuck Lever III <chuck.lever@oracle.com> wrote:
>
>
> > On Aug 30, 2021, at 4:37 PM, Chuck Lever III <chuck.lever@oracle.com> wrote:
> >
> >> On Aug 30, 2021, at 2:18 PM, Trond Myklebust <trondmy@hammerspace.com> wrote:
> >>
> >> On Mon, 2021-08-30 at 18:02 +0000, Chuck Lever III wrote:
> >>>
> >>>> On Aug 30, 2021, at 1:34 PM, Trond Myklebust
> >>>> <trondmy@hammerspace.com> wrote:
> >>>>
> >>>> On Mon, 2021-08-30 at 13:24 -0400, Olga Kornievskaia wrote:
> >>>>> On Mon, Aug 30, 2021 at 1:04 PM Chuck Lever III
> >>>>> <chuck.lever@oracle.com> wrote:
> >>>>>>
> >>>>>> Hi Olga-
> >>>>>>
> >>>>>>> On Aug 30, 2021, at 12:53 PM, Olga Kornievskaia
> >>>>>>> <olga.kornievskaia@gmail.com> wrote:
> >>>>>>>
> >>>>>>> From: Olga Kornievskaia <kolga@netapp.com>
> >>>>>>>
> >>>>>>> Given the patch "Always provide aligned buffers to the RPC
> >>>>>>> read
> >>>>>>> layers",
> >>>>>>> RPC over RDMA doesn't need to look at the tail page and add
> >>>>>>> that
> >>>>>>> space
> >>>>>>> to the write chunk.
> >>>>>>>
> >>>>>>> For the RFC 8166 compliant server, it must not write an XDR
> >>>>>>> padding
> >>>>>>> into the write chunk (even if space was provided).
> >>>>>>> Historically
> >>>>>>> (before RFC 8166) Solaris RDMA server has been requiring the
> >>>>>>> client
> >>>>>>> to provide space for the XDR padding and thus this client
> >>>>>>> code
> >>>>>>> has
> >>>>>>> existed.
> >>>>>>
> >>>>>> I don't understand this change.
> >>>>>>
> >>>>>> So, the upper layer doesn't provide XDR padding any more. That
> >>>>>> doesn't
> >>>>>> mean Solaris servers still aren't going to want to write into
> >>>>>> it.
> >>>>>> The
> >>>>>> client still has to provide this padding from somewhere.
> >>>>>>
> >>>>>> This suggests that "Always provide aligned buffers to the RPC
> >>>>>> read
> >>>>>> layers" breaks our interop with Solaris servers. Does it?
> >>>>>
> >>>>> No, I don't believe "Always provide aligned buffers to the RPC
> >>>>> read
> >>>>> layers" breaks the interoperability. THIS patch would break the
> >>>>> interop.
> >>>>>
> >>>>> If we are not willing to break the interoperability and support
> >>>>> only
> >>>>> servers that comply with RFC 8166, this patch is not needed.
> >>>>
> >>>> Why? The intention of the first patch is to ensure that we do not
> >>>> have
> >>>> buffers that are not word aligned. If Solaris wants to write
> >>>> padding
> >>>> after the end of the file, then there is space in the page buffer
> >>>> for
> >>>> it to do so. There should be no need for an extra tail in which to
> >>>> write the padding.
> >>>
> >>> The RPC/RDMA protocol is designed for hardware-offloaded direct data
> >>> placement. That means the padding, which isn't data, must be directed
> >>> to another buffer.
> >>>
> >>> This is a problem with RPC/RDMA v1 implementations. RFC 5666 was
> >>> ambiguous, so there are implementations that write XDR padding into
> >>> Write chunks. This is why RFC 8166 says SHOULD NOT instead of MUST
> >>> NOT.
> >>>
> >>> I believe rpcrdma-version-two makes it a requirement not to use XDR
> >>> padding in either Read or Write data payload chunks.
> >>>
> >>>
> >> Correct, but in order to satisfy the needs of the Solaris server,
> >> you've hijacked the tail for use as a data buffer. AFAICS it is not
> >> being used as a SEND buffer target, but is instead being turned into a
> >> write chunk target. That is not acceptable!
> >
> > The buffer is being used as both. Proper function depends on the
> > order of RDMA Writes and Receives on the client.
>
> I've refreshed my memory. The above statement oversimplifies
> a bit.
>
> - The memory pointed to by rqst->rq_buffer is persistently
>   registered to allow RDMA Send from it. It can also be
>   registered on the fly for use in Read chunks.
>
> - The memory pointed to by rqst->rq_rbuffer is NOT
>   persistently registered and is never used for RDMA Send
>   or Receive. It can be registered on the fly for use in a
>   Write or Reply chunk. This is when the tail portion of
>   rq_rcv_buf might be used to receive a pad.
>
> - A third set of persistently registered buffers is used
>   ONLY to catch RDMA Receive completions.
>
> When a Receive completes, the reply handler decides how to
> construct the received RPC message. If there was a Reply
> chunk, it leaves rq_rcv_buf pointing to rq_rbuffer, as that's
> where the server wrote the Reply chunk. If there wasn't a
> Reply chunk, the handler points rq_rcv_buf to the RDMA
> Receive buffer.
>
> The only case where there might be overlap is when the client
> has provisioned both a Write chunk and a Reply chunk. In that
> case, yes, I can see that there might be an opportunity for
> the server to RDMA Write the Write chunk's pad and the Reply
> chunk into the same client memory buffer. However:
>
> - Most servers don't write a pad. Modern Solaris servers
>   behave "correctly" in this regard, I'm now told. Linux
>   never writes the pad, and I'm told OnTAP also does not.
>
> - Server implementations I'm aware of Write the Write chunk
>   first and then the Reply chunk. The Reply chunk would
>   overwrite the pad.
>
> - The client hardly ever uses Write + Reply. It's most
>   often one or the other.
>
> So again, there is little to zero chance there is an existing
> operational issue here. But see below.
>
>
> > rpcrdma_encode_write_list() registers up to an extra 3 bytes in
> > rq_rcv_buf.tail as part of the Write chunk. The R_keys for the
> > segments in the Write chunk are then sent to the server as part
> > of the RPC Call.
> >
> > As part of Replying, the server RDMA Writes data into the chunk,
> > and possibly also RDMA Writes padding. It then does an RDMA Send
> > containing the RPC Reply.
> >
> > The Send data always lands in the Receive buffer _after_ the Write
> > data. The Receive completion guarantees that previous RDMA Writes
> > are complete. Receive handling invalidates and unmaps the memory,
> > and then it is made available to the RPC client.
> >
> >
> >> It means that we now are limited to creating COMPOUNDs where there are
> >> no more operations following the READ op because if we do so, we end up
> >> with a situation where the RDMA behaviour breaks.
> >
> > I haven't heard reports of a problem like this.
> >
> > However, if there is a problem, it would be simple to create a
> > persistently-registered memory region that is not part of any RPC
> > buffer that can be used to catch unused Write chunk XDR padding.
> >
> >
> >>>> This means that the RDMA and TCP cases should end up doing the same
> >>>> thing for the case of the Solaris server: the padding is written
> >>>> into
> >>>> the page buffer. There is nothing written to the tail in either
> >>>> case.
> >>>
> >>> "Always provide" can guarantee that the NFS client makes aligned
> >>> requests for buffered I/O, but what about NFS direct I/O from user
> >>> space? The NIC will place the data payload in the application
> >>> buffer, but there's no guarantee that the NFS READ request will be
> >>> aligned or that the buffer will be able to sink the extra padding
> >>> bytes.
> >>>
> >>> We would definitely consider it an error if an unaligned RDMA Read
> >>> leaked the link-layer's 4-byte padding into a sink buffer.
> >>>
> >>> So, "Always provide" is nice for the in-kernel NFS client, but I
> >>> don't believe it allows the way xprtrdma behaves to be changed.
> >>>
> >>
> >> If you're doing an unaligned READ from user space then you are already
> >> in a situation where you're doing something that is incompatible with
> >> block device requirements.
> >> If there really are any applications that contain O_DIRECT code
> >> specifically for use with NFS, then we can artificially force the
> >> buffers to be aligned by reducing the size of the buffer to align to a
> >> 4 byte boundary. NFS supports returning short reads.
> >
> > Or xprtrdma can use the scheme I describe above. I think that
> > would be simpler and would avoid layering violations.
> >
> > That would also possibly address the Nvidia problem, since a
> > pre-registered MR that handles Write padding would always be a
> > separate RDMA segment.
> >
> > Again, I doubt this is necessary to fix any operational problem
> > with _supported_ use cases, but let me know if you'd like me to
> > make this change.
>
> Since RFC 8666 says "SHOULD NOT", the spec mandates that the client
> has to expect there might be a server that will RDMA Write the XDR
> pad, so the Linux client really should always provide one. Having
> a persistently registered spot to use for that means we have a
> potential no-cost mechanism for providing that extra segment. The
> whole "pad optimization" thing was because the pad segment is
> currently registered on the fly, so it has a per-I/O cost.

I just can't reconcile in my head how rfc 8666 saying the client
"SHOULD NOT" allocate space and rfc 8166 saying the server "MUST NOT"
write the XDR padding can mean that the client should allocate memory.

How can we assess that there are any old Solaris server out there for
which we are keeping this around? Are we expected to keep this
forever? Sorry the answers are probably not what I want to hear and
I'm just more expressing frustration.

> So I'm leaning toward implementing this simple solution, at least
> as proof-of-concept.
>
>
> --
> Chuck Lever
>
>
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC 1/2] xprtrdma: xdr pad optimization revisted again
  2021-08-31 15:54               ` Olga Kornievskaia
@ 2021-08-31 16:02                 ` Chuck Lever III
  0 siblings, 0 replies; 17+ messages in thread
From: Chuck Lever III @ 2021-08-31 16:02 UTC (permalink / raw)
  To: Olga Kornievskaia; +Cc: Trond Myklebust, Linux NFS Mailing List, Anna Schumaker


> On Aug 31, 2021, at 11:54 AM, Olga Kornievskaia <olga.kornievskaia@gmail.com> wrote:
> 
>> However, if there is a problem, it would be simple to create a
>> persistently-registered memory region that is not part of any RPC
>> buffer that can be used to catch unused Write chunk XDR padding.
> 
> When I think of XDR padding, I'm thinking of a number of bytes less
> than 4. Perhaps I'm confused in my understanding.  Tail.iov_len is
> never a value less than 4. Tail.iov_len can contain bytes after an
> opaque element for which a read or write chunk was used (eg
> READ+VERIFY or WRITE+GETATTR). Thus when RDMA looks at tail.iov_len
> and allocates that much memory, that allocation does not represent
> just XDR padding. Since RDMA can't distinguish between
> padding+operation vs padding, can this even be solved?

Yes, I think it can be solved. xprtrdma was relying on the upper
layers to tell it how big the XDR pad was. Instead, it can either
provide a 4-byte pad segment unconditionally, or it can look at
the value of rq_rcv_buf::page_len.

The reason for all the current cleverness in there is because
adding the pad segment has a cost. If we make it no-cost, it can
be added unconditionally, and all the cleverness can be tossed
out.
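
As a sketch of that second option (xdr_pad_size() is the existing
helper from the sunrpc XDR code; add_pad_seg() is a hypothetical
stand-in for however the extra segment ends up being provided):

  /* Derive the Write chunk pad from the data payload length alone,
   * instead of trusting whatever the upper layer put in the tail.
   * xdr_pad_size(len) is 0-3: the bytes needed to round len up to
   * a 4-byte XDR boundary.
   */
  if (type == rpcrdma_writech && xdr_pad_size(xdrbuf->page_len))
          n = add_pad_seg(seg, n, xdr_pad_size(xdrbuf->page_len));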

--
Chuck Lever




^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC 1/2] xprtrdma: xdr pad optimization revisted again
  2021-08-31 15:58                 ` Olga Kornievskaia
@ 2021-08-31 16:11                   ` Chuck Lever III
  0 siblings, 0 replies; 17+ messages in thread
From: Chuck Lever III @ 2021-08-31 16:11 UTC (permalink / raw)
  To: Olga Kornievskaia; +Cc: Trond Myklebust, Linux NFS Mailing List, Anna Schumaker


> On Aug 31, 2021, at 11:58 AM, Olga Kornievskaia <olga.kornievskaia@gmail.com> wrote:
> 
> On Tue, Aug 31, 2021 at 10:33 AM Chuck Lever III <chuck.lever@oracle.com> wrote:
>> 
>> Since RFC 8666 says "SHOULD NOT", the spec mandates that the client

This should say "RFC 8166."


>> has to expect there might be a server that will RDMA Write the XDR
>> pad, so the Linux client really should always provide one. Having
>> a persistently registered spot to use for that means we have a
>> potential no-cost mechanism for providing that extra segment. The
>> whole "pad optimization" thing was because the pad segment is
>> currently registered on the fly, so it has a per-I/O cost.
> 
> I just can't reconcile in my head how rfc 8666 saying the client
> "SHOULD NOT" allocate space and rfc 8166 saying the server "MUST NOT"
> write the XDR padding can mean that the client should allocate memory.

 3.4.6.2.  Write Chunk Roundup

   When provisioning a Write chunk for a variable-length result data
   item, the Requester SHOULD NOT include additional space for XDR
   roundup padding.  A Responder MUST NOT write XDR roundup padding into
   a Write chunk, even if the Requester made space available for it.
   Therefore, when returning a single variable-length result data item,
   a returned Write chunk’s total length MUST be the same as the encoded
   length of the result data item.

RFC 8166-compliant servers MUST NOT write padding. Pre-8166 servers
sometimes do write it. I would like the Linux client to interoperate
with both because not interoperating with pre-8166 servers would be
a regression.


> How can we assess that there are any old Solaris server out there for
> which we are keeping this around? Are we expected to keep this
> forever? Sorry the answers are probably not what I want to hear and
> I'm just more expressing frustration.

There's no way to know how many pre-RFC 8166 servers there are out
there. But that's now moot, since we have a solution that makes it
basically free to support them.

--
Chuck Lever




^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2021-08-31 16:12 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-30 16:53 [RFC 0/2] revisit RMDA XDR padding management Olga Kornievskaia
2021-08-30 16:53 ` [RFC 1/2] xprtrdma: xdr pad optimization revisted again Olga Kornievskaia
2021-08-30 17:04   ` Chuck Lever III
2021-08-30 17:24     ` Olga Kornievskaia
2021-08-30 17:34       ` Trond Myklebust
2021-08-30 18:02         ` Chuck Lever III
2021-08-30 18:18           ` Trond Myklebust
2021-08-30 20:37             ` Chuck Lever III
2021-08-31 14:33               ` Chuck Lever III
2021-08-31 15:58                 ` Olga Kornievskaia
2021-08-31 16:11                   ` Chuck Lever III
2021-08-31 15:54               ` Olga Kornievskaia
2021-08-31 16:02                 ` Chuck Lever III
2021-08-30 17:35       ` Chuck Lever III
2021-08-30 16:53 ` [RFC 2/2] xprtrdma: remove re_implicit_roundup xprt_rdma_pad_optimize Olga Kornievskaia
2021-08-30 16:57   ` Chuck Lever III
2021-08-30 16:55 ` [RFC 0/2] revisit RMDA XDR padding management Olga Kornievskaia
