linux-nfs.vger.kernel.org archive mirror
* [PATCH RFC] SUNRPC: Use zero-copy to perform socket send operations
@ 2020-11-09 16:03 Chuck Lever
  2020-11-09 17:08 ` Trond Myklebust
  0 siblings, 1 reply; 11+ messages in thread
From: Chuck Lever @ 2020-11-09 16:03 UTC (permalink / raw)
  To: netdev, linux-nfs

Daire Byrne reports a ~50% aggregate throughput regression on his
Linux NFS server after commit da1661b93bf4 ("SUNRPC: Teach server to
use xprt_sock_sendmsg for socket sends"), which replaced
kernel_sendpage() calls in NFSD's socket send path with calls to
sock_sendmsg() using iov_iter.

Investigation showed that tcp_sendmsg() was not using zero-copy to
send the xdr_buf's bvec pages, but was instead copying them with
memcpy.

Set up the socket and each msghdr that bears bvec pages to use the
zero-copy mechanism in tcp_sendmsg().

Reported-by: Daire Byrne <daire@dneg.com>
BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=209439
Fixes: da1661b93bf4 ("SUNRPC: Teach server to use xprt_sock_sendmsg for socket sends")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/socklib.c  |    5 ++++-
 net/sunrpc/svcsock.c  |    1 +
 net/sunrpc/xprtsock.c |    1 +
 3 files changed, 6 insertions(+), 1 deletion(-)

This patch does not fully resolve the issue. Daire reports high
softIRQ activity after the patch is applied, and this activity
seems to prevent full restoration of previous performance.


diff --git a/net/sunrpc/socklib.c b/net/sunrpc/socklib.c
index d52313af82bc..af47596a7bdd 100644
--- a/net/sunrpc/socklib.c
+++ b/net/sunrpc/socklib.c
@@ -226,9 +226,12 @@ static int xprt_send_pagedata(struct socket *sock, struct msghdr *msg,
 	if (err < 0)
 		return err;
 
+	msg->msg_flags |= MSG_ZEROCOPY;
 	iov_iter_bvec(&msg->msg_iter, WRITE, xdr->bvec, xdr_buf_pagecount(xdr),
 		      xdr->page_len + xdr->page_base);
-	return xprt_sendmsg(sock, msg, base + xdr->page_base);
+	err = xprt_sendmsg(sock, msg, base + xdr->page_base);
+	msg->msg_flags &= ~MSG_ZEROCOPY;
+	return err;
 }
 
 /* Common case:
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index c2752e2b9ce3..c814b4953b15 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -1176,6 +1176,7 @@ static void svc_tcp_init(struct svc_sock *svsk, struct svc_serv *serv)
 		svsk->sk_datalen = 0;
 		memset(&svsk->sk_pages[0], 0, sizeof(svsk->sk_pages));
 
+		sock_set_flag(sk, SOCK_ZEROCOPY);
 		tcp_sk(sk)->nonagle |= TCP_NAGLE_OFF;
 
 		set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 7090bbee0ec5..343c6396b297 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -2175,6 +2175,7 @@ static int xs_tcp_finish_connecting(struct rpc_xprt *xprt, struct socket *sock)
 
 		/* socket options */
 		sock_reset_flag(sk, SOCK_LINGER);
+		sock_set_flag(sk, SOCK_ZEROCOPY);
 		tcp_sk(sk)->nonagle |= TCP_NAGLE_OFF;
 
 		xprt_clear_connected(xprt);




* Re: [PATCH RFC] SUNRPC: Use zero-copy to perform socket send operations
  2020-11-09 16:03 [PATCH RFC] SUNRPC: Use zero-copy to perform socket send operations Chuck Lever
@ 2020-11-09 17:08 ` Trond Myklebust
  2020-11-09 17:12   ` Chuck Lever
  0 siblings, 1 reply; 11+ messages in thread
From: Trond Myklebust @ 2020-11-09 17:08 UTC (permalink / raw)
  To: linux-nfs, netdev, chuck.lever

On Mon, 2020-11-09 at 11:03 -0500, Chuck Lever wrote:
> Daire Byrne reports a ~50% aggregrate throughput regression on his
> Linux NFS server after commit da1661b93bf4 ("SUNRPC: Teach server to
> use xprt_sock_sendmsg for socket sends"), which replaced
> kernel_send_page() calls in NFSD's socket send path with calls to
> sock_sendmsg() using iov_iter.
> 
> Investigation showed that tcp_sendmsg() was not using zero-copy to
> send the xdr_buf's bvec pages, but instead was relying on memcpy.
> 
> Set up the socket and each msghdr that bears bvec pages to use the
> zero-copy mechanism in tcp_sendmsg.
> 
> Reported-by: Daire Byrne <daire@dneg.com>
> BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=209439
> Fixes: da1661b93bf4 ("SUNRPC: Teach server to use xprt_sock_sendmsg
> for socket sends")
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>  net/sunrpc/socklib.c  |    5 ++++-
>  net/sunrpc/svcsock.c  |    1 +
>  net/sunrpc/xprtsock.c |    1 +
>  3 files changed, 6 insertions(+), 1 deletion(-)
> 
> This patch does not fully resolve the issue. Daire reports high
> softIRQ activity after the patch is applied, and this activity
> seems to prevent full restoration of previous performance.
> 
> 
> diff --git a/net/sunrpc/socklib.c b/net/sunrpc/socklib.c
> index d52313af82bc..af47596a7bdd 100644
> --- a/net/sunrpc/socklib.c
> +++ b/net/sunrpc/socklib.c
> @@ -226,9 +226,12 @@ static int xprt_send_pagedata(struct socket
> *sock, struct msghdr *msg,
>         if (err < 0)
>                 return err;
>  
> +       msg->msg_flags |= MSG_ZEROCOPY;
>         iov_iter_bvec(&msg->msg_iter, WRITE, xdr->bvec,
> xdr_buf_pagecount(xdr),
>                       xdr->page_len + xdr->page_base);
> -       return xprt_sendmsg(sock, msg, base + xdr->page_base);
> +       err = xprt_sendmsg(sock, msg, base + xdr->page_base);
> +       msg->msg_flags &= ~MSG_ZEROCOPY;
> +       return err;
>  }
>  
>  /* Common case:
> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> index c2752e2b9ce3..c814b4953b15 100644
> --- a/net/sunrpc/svcsock.c
> +++ b/net/sunrpc/svcsock.c
> @@ -1176,6 +1176,7 @@ static void svc_tcp_init(struct svc_sock *svsk,
> struct svc_serv *serv)
>                 svsk->sk_datalen = 0;
>                 memset(&svsk->sk_pages[0], 0, sizeof(svsk-
> >sk_pages));
>  
> +               sock_set_flag(sk, SOCK_ZEROCOPY);
>                 tcp_sk(sk)->nonagle |= TCP_NAGLE_OFF;
>  
>                 set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
> diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
> index 7090bbee0ec5..343c6396b297 100644
> --- a/net/sunrpc/xprtsock.c
> +++ b/net/sunrpc/xprtsock.c
> @@ -2175,6 +2175,7 @@ static int xs_tcp_finish_connecting(struct
> rpc_xprt *xprt, struct socket *sock)
>  
>                 /* socket options */
>                 sock_reset_flag(sk, SOCK_LINGER);
> +               sock_set_flag(sk, SOCK_ZEROCOPY);
>                 tcp_sk(sk)->nonagle |= TCP_NAGLE_OFF;
>  
>                 xprt_clear_connected(xprt);
> 
> 
I'm thinking we are not really allowed to do that here. The pages we
pass in to the RPC layer are not guaranteed to contain stable data
since they include unlocked page cache pages as well as O_DIRECT pages.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com




* Re: [PATCH RFC] SUNRPC: Use zero-copy to perform socket send operations
  2020-11-09 17:08 ` Trond Myklebust
@ 2020-11-09 17:12   ` Chuck Lever
  2020-11-09 17:32     ` Trond Myklebust
  0 siblings, 1 reply; 11+ messages in thread
From: Chuck Lever @ 2020-11-09 17:12 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Linux NFS Mailing List, netdev



> On Nov 9, 2020, at 12:08 PM, Trond Myklebust <trondmy@hammerspace.com> wrote:
> 
> On Mon, 2020-11-09 at 11:03 -0500, Chuck Lever wrote:
>> Daire Byrne reports a ~50% aggregrate throughput regression on his
>> Linux NFS server after commit da1661b93bf4 ("SUNRPC: Teach server to
>> use xprt_sock_sendmsg for socket sends"), which replaced
>> kernel_send_page() calls in NFSD's socket send path with calls to
>> sock_sendmsg() using iov_iter.
>> 
>> Investigation showed that tcp_sendmsg() was not using zero-copy to
>> send the xdr_buf's bvec pages, but instead was relying on memcpy.
>> 
>> Set up the socket and each msghdr that bears bvec pages to use the
>> zero-copy mechanism in tcp_sendmsg.
>> 
>> Reported-by: Daire Byrne <daire@dneg.com>
>> BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=209439
>> Fixes: da1661b93bf4 ("SUNRPC: Teach server to use xprt_sock_sendmsg
>> for socket sends")
>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>> ---
>>  net/sunrpc/socklib.c  |    5 ++++-
>>  net/sunrpc/svcsock.c  |    1 +
>>  net/sunrpc/xprtsock.c |    1 +
>>  3 files changed, 6 insertions(+), 1 deletion(-)
>> 
>> This patch does not fully resolve the issue. Daire reports high
>> softIRQ activity after the patch is applied, and this activity
>> seems to prevent full restoration of previous performance.
>> 
>> 
>> diff --git a/net/sunrpc/socklib.c b/net/sunrpc/socklib.c
>> index d52313af82bc..af47596a7bdd 100644
>> --- a/net/sunrpc/socklib.c
>> +++ b/net/sunrpc/socklib.c
>> @@ -226,9 +226,12 @@ static int xprt_send_pagedata(struct socket
>> *sock, struct msghdr *msg,
>>         if (err < 0)
>>                 return err;
>>  
>> +       msg->msg_flags |= MSG_ZEROCOPY;
>>         iov_iter_bvec(&msg->msg_iter, WRITE, xdr->bvec,
>> xdr_buf_pagecount(xdr),
>>                       xdr->page_len + xdr->page_base);
>> -       return xprt_sendmsg(sock, msg, base + xdr->page_base);
>> +       err = xprt_sendmsg(sock, msg, base + xdr->page_base);
>> +       msg->msg_flags &= ~MSG_ZEROCOPY;
>> +       return err;
>>  }
>>  
>>  /* Common case:
>> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
>> index c2752e2b9ce3..c814b4953b15 100644
>> --- a/net/sunrpc/svcsock.c
>> +++ b/net/sunrpc/svcsock.c
>> @@ -1176,6 +1176,7 @@ static void svc_tcp_init(struct svc_sock *svsk,
>> struct svc_serv *serv)
>>                 svsk->sk_datalen = 0;
>>                 memset(&svsk->sk_pages[0], 0, sizeof(svsk-
>>> sk_pages));
>>  
>> +               sock_set_flag(sk, SOCK_ZEROCOPY);
>>                 tcp_sk(sk)->nonagle |= TCP_NAGLE_OFF;
>>  
>>                 set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
>> diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
>> index 7090bbee0ec5..343c6396b297 100644
>> --- a/net/sunrpc/xprtsock.c
>> +++ b/net/sunrpc/xprtsock.c
>> @@ -2175,6 +2175,7 @@ static int xs_tcp_finish_connecting(struct
>> rpc_xprt *xprt, struct socket *sock)
>>  
>>                 /* socket options */
>>                 sock_reset_flag(sk, SOCK_LINGER);
>> +               sock_set_flag(sk, SOCK_ZEROCOPY);
>>                 tcp_sk(sk)->nonagle |= TCP_NAGLE_OFF;
>>  
>>                 xprt_clear_connected(xprt);
>> 
>> 
> I'm thinking we are not really allowed to do that here. The pages we
> pass in to the RPC layer are not guaranteed to contain stable data
> since they include unlocked page cache pages as well as O_DIRECT pages.

I assume you mean the client side only. Those issues aren't a factor
on the server. Not setting SOCK_ZEROCOPY here should be enough to
prevent the use of zero-copy on the client.

However, the client loses the benefits of sending a page at a time.
Is there a desire to remedy that somehow?


--
Chuck Lever





* Re: [PATCH RFC] SUNRPC: Use zero-copy to perform socket send operations
  2020-11-09 17:12   ` Chuck Lever
@ 2020-11-09 17:32     ` Trond Myklebust
  2020-11-09 17:36       ` Chuck Lever
  0 siblings, 1 reply; 11+ messages in thread
From: Trond Myklebust @ 2020-11-09 17:32 UTC (permalink / raw)
  To: chuck.lever; +Cc: linux-nfs, netdev

On Mon, 2020-11-09 at 12:12 -0500, Chuck Lever wrote:
> 
> 
> > On Nov 9, 2020, at 12:08 PM, Trond Myklebust
> > <trondmy@hammerspace.com> wrote:
> > 
> > On Mon, 2020-11-09 at 11:03 -0500, Chuck Lever wrote:
> > > Daire Byrne reports a ~50% aggregrate throughput regression on
> > > his
> > > Linux NFS server after commit da1661b93bf4 ("SUNRPC: Teach server
> > > to
> > > use xprt_sock_sendmsg for socket sends"), which replaced
> > > kernel_send_page() calls in NFSD's socket send path with calls to
> > > sock_sendmsg() using iov_iter.
> > > 
> > > Investigation showed that tcp_sendmsg() was not using zero-copy
> > > to
> > > send the xdr_buf's bvec pages, but instead was relying on memcpy.
> > > 
> > > Set up the socket and each msghdr that bears bvec pages to use
> > > the
> > > zero-copy mechanism in tcp_sendmsg.
> > > 
> > > Reported-by: Daire Byrne <daire@dneg.com>
> > > BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=209439
> > > Fixes: da1661b93bf4 ("SUNRPC: Teach server to use
> > > xprt_sock_sendmsg
> > > for socket sends")
> > > Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> > > ---
> > >  net/sunrpc/socklib.c  |    5 ++++-
> > >  net/sunrpc/svcsock.c  |    1 +
> > >  net/sunrpc/xprtsock.c |    1 +
> > >  3 files changed, 6 insertions(+), 1 deletion(-)
> > > 
> > > This patch does not fully resolve the issue. Daire reports high
> > > softIRQ activity after the patch is applied, and this activity
> > > seems to prevent full restoration of previous performance.
> > > 
> > > 
> > > diff --git a/net/sunrpc/socklib.c b/net/sunrpc/socklib.c
> > > index d52313af82bc..af47596a7bdd 100644
> > > --- a/net/sunrpc/socklib.c
> > > +++ b/net/sunrpc/socklib.c
> > > @@ -226,9 +226,12 @@ static int xprt_send_pagedata(struct socket
> > > *sock, struct msghdr *msg,
> > >         if (err < 0)
> > >                 return err;
> > >  
> > > +       msg->msg_flags |= MSG_ZEROCOPY;
> > >         iov_iter_bvec(&msg->msg_iter, WRITE, xdr->bvec,
> > > xdr_buf_pagecount(xdr),
> > >                       xdr->page_len + xdr->page_base);
> > > -       return xprt_sendmsg(sock, msg, base + xdr->page_base);
> > > +       err = xprt_sendmsg(sock, msg, base + xdr->page_base);
> > > +       msg->msg_flags &= ~MSG_ZEROCOPY;
> > > +       return err;
> > >  }
> > >  
> > >  /* Common case:
> > > diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> > > index c2752e2b9ce3..c814b4953b15 100644
> > > --- a/net/sunrpc/svcsock.c
> > > +++ b/net/sunrpc/svcsock.c
> > > @@ -1176,6 +1176,7 @@ static void svc_tcp_init(struct svc_sock
> > > *svsk,
> > > struct svc_serv *serv)
> > >                 svsk->sk_datalen = 0;
> > >                 memset(&svsk->sk_pages[0], 0, sizeof(svsk-
> > > > sk_pages));
> > >  
> > > +               sock_set_flag(sk, SOCK_ZEROCOPY);
> > >                 tcp_sk(sk)->nonagle |= TCP_NAGLE_OFF;
> > >  
> > >                 set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
> > > diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
> > > index 7090bbee0ec5..343c6396b297 100644
> > > --- a/net/sunrpc/xprtsock.c
> > > +++ b/net/sunrpc/xprtsock.c
> > > @@ -2175,6 +2175,7 @@ static int xs_tcp_finish_connecting(struct
> > > rpc_xprt *xprt, struct socket *sock)
> > >  
> > >                 /* socket options */
> > >                 sock_reset_flag(sk, SOCK_LINGER);
> > > +               sock_set_flag(sk, SOCK_ZEROCOPY);
> > >                 tcp_sk(sk)->nonagle |= TCP_NAGLE_OFF;
> > >  
> > >                 xprt_clear_connected(xprt);
> > > 
> > > 
> > I'm thinking we are not really allowed to do that here. The pages
> > we
> > pass in to the RPC layer are not guaranteed to contain stable data
> > since they include unlocked page cache pages as well as O_DIRECT
> > pages.
> 
> I assume you mean the client side only. Those issues aren't a factor
> on the server. Not setting SOCK_ZEROCOPY here should be enough to
> prevent the use of zero-copy on the client.
> 
> However, the client loses the benefits of sending a page at a time.
> Is there a desire to remedy that somehow?

What about splice reads on the server side?


-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com




* Re: [PATCH RFC] SUNRPC: Use zero-copy to perform socket send operations
  2020-11-09 17:32     ` Trond Myklebust
@ 2020-11-09 17:36       ` Chuck Lever
  2020-11-09 17:55         ` J. Bruce Fields
  2020-11-09 18:16         ` Trond Myklebust
  0 siblings, 2 replies; 11+ messages in thread
From: Chuck Lever @ 2020-11-09 17:36 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Linux NFS Mailing List, netdev



> On Nov 9, 2020, at 12:32 PM, Trond Myklebust <trondmy@hammerspace.com> wrote:
> 
> On Mon, 2020-11-09 at 12:12 -0500, Chuck Lever wrote:
>> 
>> 
>>> On Nov 9, 2020, at 12:08 PM, Trond Myklebust
>>> <trondmy@hammerspace.com> wrote:
>>> 
>>> On Mon, 2020-11-09 at 11:03 -0500, Chuck Lever wrote:
>>>> Daire Byrne reports a ~50% aggregrate throughput regression on
>>>> his
>>>> Linux NFS server after commit da1661b93bf4 ("SUNRPC: Teach server
>>>> to
>>>> use xprt_sock_sendmsg for socket sends"), which replaced
>>>> kernel_send_page() calls in NFSD's socket send path with calls to
>>>> sock_sendmsg() using iov_iter.
>>>> 
>>>> Investigation showed that tcp_sendmsg() was not using zero-copy
>>>> to
>>>> send the xdr_buf's bvec pages, but instead was relying on memcpy.
>>>> 
>>>> Set up the socket and each msghdr that bears bvec pages to use
>>>> the
>>>> zero-copy mechanism in tcp_sendmsg.
>>>> 
>>>> Reported-by: Daire Byrne <daire@dneg.com>
>>>> BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=209439
>>>> Fixes: da1661b93bf4 ("SUNRPC: Teach server to use
>>>> xprt_sock_sendmsg
>>>> for socket sends")
>>>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>>>> ---
>>>>  net/sunrpc/socklib.c  |    5 ++++-
>>>>  net/sunrpc/svcsock.c  |    1 +
>>>>  net/sunrpc/xprtsock.c |    1 +
>>>>  3 files changed, 6 insertions(+), 1 deletion(-)
>>>> 
>>>> This patch does not fully resolve the issue. Daire reports high
>>>> softIRQ activity after the patch is applied, and this activity
>>>> seems to prevent full restoration of previous performance.
>>>> 
>>>> 
>>>> diff --git a/net/sunrpc/socklib.c b/net/sunrpc/socklib.c
>>>> index d52313af82bc..af47596a7bdd 100644
>>>> --- a/net/sunrpc/socklib.c
>>>> +++ b/net/sunrpc/socklib.c
>>>> @@ -226,9 +226,12 @@ static int xprt_send_pagedata(struct socket
>>>> *sock, struct msghdr *msg,
>>>>         if (err < 0)
>>>>                 return err;
>>>>  
>>>> +       msg->msg_flags |= MSG_ZEROCOPY;
>>>>         iov_iter_bvec(&msg->msg_iter, WRITE, xdr->bvec,
>>>> xdr_buf_pagecount(xdr),
>>>>                       xdr->page_len + xdr->page_base);
>>>> -       return xprt_sendmsg(sock, msg, base + xdr->page_base);
>>>> +       err = xprt_sendmsg(sock, msg, base + xdr->page_base);
>>>> +       msg->msg_flags &= ~MSG_ZEROCOPY;
>>>> +       return err;
>>>>  }
>>>>  
>>>>  /* Common case:
>>>> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
>>>> index c2752e2b9ce3..c814b4953b15 100644
>>>> --- a/net/sunrpc/svcsock.c
>>>> +++ b/net/sunrpc/svcsock.c
>>>> @@ -1176,6 +1176,7 @@ static void svc_tcp_init(struct svc_sock
>>>> *svsk,
>>>> struct svc_serv *serv)
>>>>                 svsk->sk_datalen = 0;
>>>>                 memset(&svsk->sk_pages[0], 0, sizeof(svsk-
>>>>> sk_pages));
>>>>  
>>>> +               sock_set_flag(sk, SOCK_ZEROCOPY);
>>>>                 tcp_sk(sk)->nonagle |= TCP_NAGLE_OFF;
>>>>  
>>>>                 set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
>>>> diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
>>>> index 7090bbee0ec5..343c6396b297 100644
>>>> --- a/net/sunrpc/xprtsock.c
>>>> +++ b/net/sunrpc/xprtsock.c
>>>> @@ -2175,6 +2175,7 @@ static int xs_tcp_finish_connecting(struct
>>>> rpc_xprt *xprt, struct socket *sock)
>>>>  
>>>>                 /* socket options */
>>>>                 sock_reset_flag(sk, SOCK_LINGER);
>>>> +               sock_set_flag(sk, SOCK_ZEROCOPY);
>>>>                 tcp_sk(sk)->nonagle |= TCP_NAGLE_OFF;
>>>>  
>>>>                 xprt_clear_connected(xprt);
>>>> 
>>>> 
>>> I'm thinking we are not really allowed to do that here. The pages
>>> we
>>> pass in to the RPC layer are not guaranteed to contain stable data
>>> since they include unlocked page cache pages as well as O_DIRECT
>>> pages.
>> 
>> I assume you mean the client side only. Those issues aren't a factor
>> on the server. Not setting SOCK_ZEROCOPY here should be enough to
>> prevent the use of zero-copy on the client.
>> 
>> However, the client loses the benefits of sending a page at a time.
>> Is there a desire to remedy that somehow?
> 
> What about splice reads on the server side?

On the server, this path formerly used kernel_sendpage(), which I
assumed was similar to the sendmsg zero-copy mechanism. How does
kernel_sendpage() guard against page instability?


--
Chuck Lever





* Re: [PATCH RFC] SUNRPC: Use zero-copy to perform socket send operations
  2020-11-09 17:36       ` Chuck Lever
@ 2020-11-09 17:55         ` J. Bruce Fields
  2020-11-09 18:16         ` Trond Myklebust
  1 sibling, 0 replies; 11+ messages in thread
From: J. Bruce Fields @ 2020-11-09 17:55 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Trond Myklebust, Linux NFS Mailing List, netdev

On Mon, Nov 09, 2020 at 12:36:15PM -0500, Chuck Lever wrote:
> > On Nov 9, 2020, at 12:32 PM, Trond Myklebust <trondmy@hammerspace.com> wrote:
> > On Mon, 2020-11-09 at 12:12 -0500, Chuck Lever wrote:
> >> I assume you mean the client side only. Those issues aren't a factor
> >> on the server. Not setting SOCK_ZEROCOPY here should be enough to
> >> prevent the use of zero-copy on the client.
> >> 
> >> However, the client loses the benefits of sending a page at a time.
> >> Is there a desire to remedy that somehow?
> > 
> > What about splice reads on the server side?
> 
> On the server, this path formerly used kernel_sendpages(), which I
> assumed is similar to the sendmsg zero-copy mechanism. How does
> kernel_sendpages() mitigate against page instability?

We turn it off when the gss integrity or privacy services are in use,
to prevent spurious checksum failures (grep for RQ_SPLICE_OK).

But maybe that's not the only problematic case; I don't know.
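
A rough sketch of the gate being described (the helper names here are
hypothetical and local to this example; only RQ_SPLICE_OK, rq_flags and
the bit helpers are real kernel identifiers, and exact call sites vary
by kernel version):

static bool request_may_splice(struct svc_rqst *rqstp)
{
	/* svcauth_gss clears this bit when the integrity or privacy
	 * service is negotiated, so checksummed reply data is never
	 * built from pages that might change in flight. */
	return test_bit(RQ_SPLICE_OK, &rqstp->rq_flags);
}

static void gss_disable_splice(struct svc_rqst *rqstp)
{
	/* force the read path to copy into request-private pages */
	clear_bit(RQ_SPLICE_OK, &rqstp->rq_flags);
}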

--b.


* Re: [PATCH RFC] SUNRPC: Use zero-copy to perform socket send operations
  2020-11-09 17:36       ` Chuck Lever
  2020-11-09 17:55         ` J. Bruce Fields
@ 2020-11-09 18:16         ` Trond Myklebust
  2020-11-09 19:31           ` Chuck Lever
  1 sibling, 1 reply; 11+ messages in thread
From: Trond Myklebust @ 2020-11-09 18:16 UTC (permalink / raw)
  To: chuck.lever; +Cc: linux-nfs, netdev

On Mon, 2020-11-09 at 12:36 -0500, Chuck Lever wrote:
> 
> 
> > On Nov 9, 2020, at 12:32 PM, Trond Myklebust <
> > trondmy@hammerspace.com> wrote:
> > 
> > On Mon, 2020-11-09 at 12:12 -0500, Chuck Lever wrote:
> > > 
> > > 
> > > > On Nov 9, 2020, at 12:08 PM, Trond Myklebust
> > > > <trondmy@hammerspace.com> wrote:
> > > > 
> > > > On Mon, 2020-11-09 at 11:03 -0500, Chuck Lever wrote:
> > > > > Daire Byrne reports a ~50% aggregrate throughput regression
> > > > > on
> > > > > his
> > > > > Linux NFS server after commit da1661b93bf4 ("SUNRPC: Teach
> > > > > server
> > > > > to
> > > > > use xprt_sock_sendmsg for socket sends"), which replaced
> > > > > kernel_send_page() calls in NFSD's socket send path with
> > > > > calls to
> > > > > sock_sendmsg() using iov_iter.
> > > > > 
> > > > > Investigation showed that tcp_sendmsg() was not using zero-
> > > > > copy
> > > > > to
> > > > > send the xdr_buf's bvec pages, but instead was relying on
> > > > > memcpy.
> > > > > 
> > > > > Set up the socket and each msghdr that bears bvec pages to
> > > > > use
> > > > > the
> > > > > zero-copy mechanism in tcp_sendmsg.
> > > > > 
> > > > > Reported-by: Daire Byrne <daire@dneg.com>
> > > > > BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=209439
> > > > > Fixes: da1661b93bf4 ("SUNRPC: Teach server to use
> > > > > xprt_sock_sendmsg
> > > > > for socket sends")
> > > > > Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> > > > > ---
> > > > >  net/sunrpc/socklib.c  |    5 ++++-
> > > > >  net/sunrpc/svcsock.c  |    1 +
> > > > >  net/sunrpc/xprtsock.c |    1 +
> > > > >  3 files changed, 6 insertions(+), 1 deletion(-)
> > > > > 
> > > > > This patch does not fully resolve the issue. Daire reports
> > > > > high
> > > > > softIRQ activity after the patch is applied, and this
> > > > > activity
> > > > > seems to prevent full restoration of previous performance.
> > > > > 
> > > > > 
> > > > > diff --git a/net/sunrpc/socklib.c b/net/sunrpc/socklib.c
> > > > > index d52313af82bc..af47596a7bdd 100644
> > > > > --- a/net/sunrpc/socklib.c
> > > > > +++ b/net/sunrpc/socklib.c
> > > > > @@ -226,9 +226,12 @@ static int xprt_send_pagedata(struct
> > > > > socket
> > > > > *sock, struct msghdr *msg,
> > > > >         if (err < 0)
> > > > >                 return err;
> > > > >  
> > > > > +       msg->msg_flags |= MSG_ZEROCOPY;
> > > > >         iov_iter_bvec(&msg->msg_iter, WRITE, xdr->bvec,
> > > > > xdr_buf_pagecount(xdr),
> > > > >                       xdr->page_len + xdr->page_base);
> > > > > -       return xprt_sendmsg(sock, msg, base + xdr-
> > > > > >page_base);
> > > > > +       err = xprt_sendmsg(sock, msg, base + xdr->page_base);
> > > > > +       msg->msg_flags &= ~MSG_ZEROCOPY;
> > > > > +       return err;
> > > > >  }
> > > > >  
> > > > >  /* Common case:
> > > > > diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> > > > > index c2752e2b9ce3..c814b4953b15 100644
> > > > > --- a/net/sunrpc/svcsock.c
> > > > > +++ b/net/sunrpc/svcsock.c
> > > > > @@ -1176,6 +1176,7 @@ static void svc_tcp_init(struct
> > > > > svc_sock
> > > > > *svsk,
> > > > > struct svc_serv *serv)
> > > > >                 svsk->sk_datalen = 0;
> > > > >                 memset(&svsk->sk_pages[0], 0, sizeof(svsk-
> > > > > > sk_pages));
> > > > >  
> > > > > +               sock_set_flag(sk, SOCK_ZEROCOPY);
> > > > >                 tcp_sk(sk)->nonagle |= TCP_NAGLE_OFF;
> > > > >  
> > > > >                 set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
> > > > > diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
> > > > > index 7090bbee0ec5..343c6396b297 100644
> > > > > --- a/net/sunrpc/xprtsock.c
> > > > > +++ b/net/sunrpc/xprtsock.c
> > > > > @@ -2175,6 +2175,7 @@ static int
> > > > > xs_tcp_finish_connecting(struct
> > > > > rpc_xprt *xprt, struct socket *sock)
> > > > >  
> > > > >                 /* socket options */
> > > > >                 sock_reset_flag(sk, SOCK_LINGER);
> > > > > +               sock_set_flag(sk, SOCK_ZEROCOPY);
> > > > >                 tcp_sk(sk)->nonagle |= TCP_NAGLE_OFF;
> > > > >  
> > > > >                 xprt_clear_connected(xprt);
> > > > > 
> > > > > 
> > > > I'm thinking we are not really allowed to do that here. The
> > > > pages
> > > > we
> > > > pass in to the RPC layer are not guaranteed to contain stable
> > > > data
> > > > since they include unlocked page cache pages as well as
> > > > O_DIRECT
> > > > pages.
> > > 
> > > I assume you mean the client side only. Those issues aren't a
> > > factor
> > > on the server. Not setting SOCK_ZEROCOPY here should be enough to
> > > prevent the use of zero-copy on the client.
> > > 
> > > However, the client loses the benefits of sending a page at a
> > > time.
> > > Is there a desire to remedy that somehow?
> > 
> > What about splice reads on the server side?
> 
> On the server, this path formerly used kernel_sendpages(), which I
> assumed is similar to the sendmsg zero-copy mechanism. How does
> kernel_sendpages() mitigate against page instability?
> 

It copies the data. 🙂

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com




* Re: [PATCH RFC] SUNRPC: Use zero-copy to perform socket send operations
  2020-11-09 18:16         ` Trond Myklebust
@ 2020-11-09 19:31           ` Chuck Lever
  2020-11-09 20:10             ` Eric Dumazet
  0 siblings, 1 reply; 11+ messages in thread
From: Chuck Lever @ 2020-11-09 19:31 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Linux NFS Mailing List, netdev



> On Nov 9, 2020, at 1:16 PM, Trond Myklebust <trondmy@hammerspace.com> wrote:
> 
> On Mon, 2020-11-09 at 12:36 -0500, Chuck Lever wrote:
>> 
>> 
>>> On Nov 9, 2020, at 12:32 PM, Trond Myklebust <
>>> trondmy@hammerspace.com> wrote:
>>> 
>>> On Mon, 2020-11-09 at 12:12 -0500, Chuck Lever wrote:
>>>> 
>>>> 
>>>>> On Nov 9, 2020, at 12:08 PM, Trond Myklebust
>>>>> <trondmy@hammerspace.com> wrote:
>>>>> 
>>>>> On Mon, 2020-11-09 at 11:03 -0500, Chuck Lever wrote:
>>>>>> Daire Byrne reports a ~50% aggregrate throughput regression
>>>>>> on
>>>>>> his
>>>>>> Linux NFS server after commit da1661b93bf4 ("SUNRPC: Teach
>>>>>> server
>>>>>> to
>>>>>> use xprt_sock_sendmsg for socket sends"), which replaced
>>>>>> kernel_send_page() calls in NFSD's socket send path with
>>>>>> calls to
>>>>>> sock_sendmsg() using iov_iter.
>>>>>> 
>>>>>> Investigation showed that tcp_sendmsg() was not using zero-
>>>>>> copy
>>>>>> to
>>>>>> send the xdr_buf's bvec pages, but instead was relying on
>>>>>> memcpy.
>>>>>> 
>>>>>> Set up the socket and each msghdr that bears bvec pages to
>>>>>> use
>>>>>> the
>>>>>> zero-copy mechanism in tcp_sendmsg.
>>>>>> 
>>>>>> Reported-by: Daire Byrne <daire@dneg.com>
>>>>>> BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=209439
>>>>>> Fixes: da1661b93bf4 ("SUNRPC: Teach server to use
>>>>>> xprt_sock_sendmsg
>>>>>> for socket sends")
>>>>>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>>>>>> ---
>>>>>>  net/sunrpc/socklib.c  |    5 ++++-
>>>>>>  net/sunrpc/svcsock.c  |    1 +
>>>>>>  net/sunrpc/xprtsock.c |    1 +
>>>>>>  3 files changed, 6 insertions(+), 1 deletion(-)
>>>>>> 
>>>>>> This patch does not fully resolve the issue. Daire reports
>>>>>> high
>>>>>> softIRQ activity after the patch is applied, and this
>>>>>> activity
>>>>>> seems to prevent full restoration of previous performance.
>>>>>> 
>>>>>> 
>>>>>> diff --git a/net/sunrpc/socklib.c b/net/sunrpc/socklib.c
>>>>>> index d52313af82bc..af47596a7bdd 100644
>>>>>> --- a/net/sunrpc/socklib.c
>>>>>> +++ b/net/sunrpc/socklib.c
>>>>>> @@ -226,9 +226,12 @@ static int xprt_send_pagedata(struct
>>>>>> socket
>>>>>> *sock, struct msghdr *msg,
>>>>>>         if (err < 0)
>>>>>>                 return err;
>>>>>>  
>>>>>> +       msg->msg_flags |= MSG_ZEROCOPY;
>>>>>>         iov_iter_bvec(&msg->msg_iter, WRITE, xdr->bvec,
>>>>>> xdr_buf_pagecount(xdr),
>>>>>>                       xdr->page_len + xdr->page_base);
>>>>>> -       return xprt_sendmsg(sock, msg, base + xdr-
>>>>>>> page_base);
>>>>>> +       err = xprt_sendmsg(sock, msg, base + xdr->page_base);
>>>>>> +       msg->msg_flags &= ~MSG_ZEROCOPY;
>>>>>> +       return err;
>>>>>>  }
>>>>>>  
>>>>>>  /* Common case:
>>>>>> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
>>>>>> index c2752e2b9ce3..c814b4953b15 100644
>>>>>> --- a/net/sunrpc/svcsock.c
>>>>>> +++ b/net/sunrpc/svcsock.c
>>>>>> @@ -1176,6 +1176,7 @@ static void svc_tcp_init(struct
>>>>>> svc_sock
>>>>>> *svsk,
>>>>>> struct svc_serv *serv)
>>>>>>                 svsk->sk_datalen = 0;
>>>>>>                 memset(&svsk->sk_pages[0], 0, sizeof(svsk-
>>>>>>> sk_pages));
>>>>>>  
>>>>>> +               sock_set_flag(sk, SOCK_ZEROCOPY);
>>>>>>                 tcp_sk(sk)->nonagle |= TCP_NAGLE_OFF;
>>>>>>  
>>>>>>                 set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
>>>>>> diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
>>>>>> index 7090bbee0ec5..343c6396b297 100644
>>>>>> --- a/net/sunrpc/xprtsock.c
>>>>>> +++ b/net/sunrpc/xprtsock.c
>>>>>> @@ -2175,6 +2175,7 @@ static int
>>>>>> xs_tcp_finish_connecting(struct
>>>>>> rpc_xprt *xprt, struct socket *sock)
>>>>>>  
>>>>>>                 /* socket options */
>>>>>>                 sock_reset_flag(sk, SOCK_LINGER);
>>>>>> +               sock_set_flag(sk, SOCK_ZEROCOPY);
>>>>>>                 tcp_sk(sk)->nonagle |= TCP_NAGLE_OFF;
>>>>>>  
>>>>>>                 xprt_clear_connected(xprt);
>>>>>> 
>>>>>> 
>>>>> I'm thinking we are not really allowed to do that here. The
>>>>> pages
>>>>> we
>>>>> pass in to the RPC layer are not guaranteed to contain stable
>>>>> data
>>>>> since they include unlocked page cache pages as well as
>>>>> O_DIRECT
>>>>> pages.
>>>> 
>>>> I assume you mean the client side only. Those issues aren't a
>>>> factor
>>>> on the server. Not setting SOCK_ZEROCOPY here should be enough to
>>>> prevent the use of zero-copy on the client.
>>>> 
>>>> However, the client loses the benefits of sending a page at a
>>>> time.
>>>> Is there a desire to remedy that somehow?
>>> 
>>> What about splice reads on the server side?
>> 
>> On the server, this path formerly used kernel_sendpages(), which I
>> assumed is similar to the sendmsg zero-copy mechanism. How does
>> kernel_sendpages() mitigate against page instability?
> 
> It copies the data. 🙂

tcp_sendmsg_locked() invokes skb_copy_to_page_nocache(), which is
where Daire's performance-robbing memcpy occurs.

do_tcp_sendpages() has no such call site, so the legacy
sendpage-based path has at least one fewer data copy operation.

What is the appropriate way to make tcp_sendmsg() treat a bvec-bearing
msghdr like an array of struct page pointers passed to kernel_sendpage()?
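
For reference, a minimal sketch of the sendpage-style loop that commit
da1661b93bf4 replaced (simplified and reconstructed from memory, not the
literal pre-commit svc_send_common() code; the function name is made up
and error handling is trimmed). Each page is handed to the TCP stack by
reference, so no bulk memcpy happens on this path:

static int send_xdr_pages_legacy(struct socket *sock, struct xdr_buf *xdr)
{
	unsigned int pglen = xdr->page_len;
	unsigned int base = xdr->page_base;
	struct page **pages = xdr->pages;
	int sent = 0;

	while (pglen > 0) {
		unsigned int len = min_t(unsigned int, PAGE_SIZE - base, pglen);
		/* MSG_MORE: more page fragments (or the tail) will follow */
		int flags = (len < pglen) ? MSG_MORE : 0;
		int ret;

		ret = kernel_sendpage(sock, *pages, base, len, flags);
		if (ret < 0)
			return ret;
		sent += ret;
		if (ret != len)		/* partial send: caller retries later */
			break;
		pglen -= len;
		base = 0;
		pages++;
	}
	return sent;
}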


--
Chuck Lever





* Re: [PATCH RFC] SUNRPC: Use zero-copy to perform socket send operations
  2020-11-09 19:31           ` Chuck Lever
@ 2020-11-09 20:10             ` Eric Dumazet
  2020-11-09 20:11               ` Chuck Lever
  2020-11-10 14:49               ` Chuck Lever
  0 siblings, 2 replies; 11+ messages in thread
From: Eric Dumazet @ 2020-11-09 20:10 UTC (permalink / raw)
  To: Chuck Lever, Trond Myklebust; +Cc: Linux NFS Mailing List, netdev



On 11/9/20 8:31 PM, Chuck Lever wrote:
> 
> 
>> On Nov 9, 2020, at 1:16 PM, Trond Myklebust <trondmy@hammerspace.com> wrote:
>>
>> On Mon, 2020-11-09 at 12:36 -0500, Chuck Lever wrote:
>>>
>>>
>>>> On Nov 9, 2020, at 12:32 PM, Trond Myklebust <
>>>> trondmy@hammerspace.com> wrote:
>>>>
>>>> On Mon, 2020-11-09 at 12:12 -0500, Chuck Lever wrote:
>>>>>
>>>>>
>>>>>> On Nov 9, 2020, at 12:08 PM, Trond Myklebust
>>>>>> <trondmy@hammerspace.com> wrote:
>>>>>>
>>>>>> On Mon, 2020-11-09 at 11:03 -0500, Chuck Lever wrote:
>>>>>>> Daire Byrne reports a ~50% aggregrate throughput regression
>>>>>>> on
>>>>>>> his
>>>>>>> Linux NFS server after commit da1661b93bf4 ("SUNRPC: Teach
>>>>>>> server
>>>>>>> to
>>>>>>> use xprt_sock_sendmsg for socket sends"), which replaced
>>>>>>> kernel_send_page() calls in NFSD's socket send path with
>>>>>>> calls to
>>>>>>> sock_sendmsg() using iov_iter.
>>>>>>>
>>>>>>> Investigation showed that tcp_sendmsg() was not using zero-
>>>>>>> copy
>>>>>>> to
>>>>>>> send the xdr_buf's bvec pages, but instead was relying on
>>>>>>> memcpy.
>>>>>>>
>>>>>>> Set up the socket and each msghdr that bears bvec pages to
>>>>>>> use
>>>>>>> the
>>>>>>> zero-copy mechanism in tcp_sendmsg.
>>>>>>>
>>>>>>> Reported-by: Daire Byrne <daire@dneg.com>
>>>>>>> BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=209439
>>>>>>> Fixes: da1661b93bf4 ("SUNRPC: Teach server to use
>>>>>>> xprt_sock_sendmsg
>>>>>>> for socket sends")
>>>>>>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>>>>>>> ---
>>>>>>>  net/sunrpc/socklib.c  |    5 ++++-
>>>>>>>  net/sunrpc/svcsock.c  |    1 +
>>>>>>>  net/sunrpc/xprtsock.c |    1 +
>>>>>>>  3 files changed, 6 insertions(+), 1 deletion(-)
>>>>>>>
>>>>>>> This patch does not fully resolve the issue. Daire reports
>>>>>>> high
>>>>>>> softIRQ activity after the patch is applied, and this
>>>>>>> activity
>>>>>>> seems to prevent full restoration of previous performance.
>>>>>>>
>>>>>>>
>>>>>>> diff --git a/net/sunrpc/socklib.c b/net/sunrpc/socklib.c
>>>>>>> index d52313af82bc..af47596a7bdd 100644
>>>>>>> --- a/net/sunrpc/socklib.c
>>>>>>> +++ b/net/sunrpc/socklib.c
>>>>>>> @@ -226,9 +226,12 @@ static int xprt_send_pagedata(struct
>>>>>>> socket
>>>>>>> *sock, struct msghdr *msg,
>>>>>>>         if (err < 0)
>>>>>>>                 return err;
>>>>>>>  
>>>>>>> +       msg->msg_flags |= MSG_ZEROCOPY;
>>>>>>>         iov_iter_bvec(&msg->msg_iter, WRITE, xdr->bvec,
>>>>>>> xdr_buf_pagecount(xdr),
>>>>>>>                       xdr->page_len + xdr->page_base);
>>>>>>> -       return xprt_sendmsg(sock, msg, base + xdr-
>>>>>>>> page_base);
>>>>>>> +       err = xprt_sendmsg(sock, msg, base + xdr->page_base);
>>>>>>> +       msg->msg_flags &= ~MSG_ZEROCOPY;
>>>>>>> +       return err;
>>>>>>>  }
>>>>>>>  
>>>>>>>  /* Common case:
>>>>>>> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
>>>>>>> index c2752e2b9ce3..c814b4953b15 100644
>>>>>>> --- a/net/sunrpc/svcsock.c
>>>>>>> +++ b/net/sunrpc/svcsock.c
>>>>>>> @@ -1176,6 +1176,7 @@ static void svc_tcp_init(struct
>>>>>>> svc_sock
>>>>>>> *svsk,
>>>>>>> struct svc_serv *serv)
>>>>>>>                 svsk->sk_datalen = 0;
>>>>>>>                 memset(&svsk->sk_pages[0], 0, sizeof(svsk-
>>>>>>>> sk_pages));
>>>>>>>  
>>>>>>> +               sock_set_flag(sk, SOCK_ZEROCOPY);
>>>>>>>                 tcp_sk(sk)->nonagle |= TCP_NAGLE_OFF;
>>>>>>>  
>>>>>>>                 set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
>>>>>>> diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
>>>>>>> index 7090bbee0ec5..343c6396b297 100644
>>>>>>> --- a/net/sunrpc/xprtsock.c
>>>>>>> +++ b/net/sunrpc/xprtsock.c
>>>>>>> @@ -2175,6 +2175,7 @@ static int
>>>>>>> xs_tcp_finish_connecting(struct
>>>>>>> rpc_xprt *xprt, struct socket *sock)
>>>>>>>  
>>>>>>>                 /* socket options */
>>>>>>>                 sock_reset_flag(sk, SOCK_LINGER);
>>>>>>> +               sock_set_flag(sk, SOCK_ZEROCOPY);
>>>>>>>                 tcp_sk(sk)->nonagle |= TCP_NAGLE_OFF;
>>>>>>>  
>>>>>>>                 xprt_clear_connected(xprt);
>>>>>>>
>>>>>>>
>>>>>> I'm thinking we are not really allowed to do that here. The
>>>>>> pages
>>>>>> we
>>>>>> pass in to the RPC layer are not guaranteed to contain stable
>>>>>> data
>>>>>> since they include unlocked page cache pages as well as
>>>>>> O_DIRECT
>>>>>> pages.
>>>>>
>>>>> I assume you mean the client side only. Those issues aren't a
>>>>> factor
>>>>> on the server. Not setting SOCK_ZEROCOPY here should be enough to
>>>>> prevent the use of zero-copy on the client.
>>>>>
>>>>> However, the client loses the benefits of sending a page at a
>>>>> time.
>>>>> Is there a desire to remedy that somehow?
>>>>
>>>> What about splice reads on the server side?
>>>
>>> On the server, this path formerly used kernel_sendpages(), which I
>>> assumed is similar to the sendmsg zero-copy mechanism. How does
>>> kernel_sendpages() mitigate against page instability?
>>
>> It copies the data. 🙂
> 
> tcp_sendmsg_locked() invokes skb_copy_to_page_nocache(), which is
> where Daire's performance-robbing memcpy occurs.
> 
> do_tcp_sendpages() has no such call site. Therefore the legacy
> sendpage-based path has at least one fewer data copy operations.
> 
> What is the appropriate way to make tcp_sendmsg() treat a bvec-bearing
> msghdr like an array of struct page pointers passed to kernel_sendpage() ?
> 


MSG_ZEROCOPY is only accepted if sock_flag(sk, SOCK_ZEROCOPY) is true,
i.e. if the SO_ZEROCOPY socket option has been set earlier.
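
As a concrete illustration of that requirement, here is a minimal
user-space sketch of the documented SO_ZEROCOPY/MSG_ZEROCOPY pattern
(see Documentation/networking/msg_zerocopy.rst); the fallback #defines
and the helper name are local to this example:

#include <errno.h>
#include <linux/errqueue.h>
#include <stdio.h>
#include <sys/socket.h>

#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY 60
#endif
#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000
#endif

static int send_zerocopy(int fd, const void *buf, size_t len)
{
	int one = 1;

	/* MSG_ZEROCOPY is honoured only after SO_ZEROCOPY is enabled;
	 * without it the send either copies or fails. */
	if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)))
		return -errno;

	if (send(fd, buf, len, MSG_ZEROCOPY) < 0)
		return -errno;

	/* The pages behind buf must stay untouched until the kernel
	 * signals completion on the socket's error queue. */
	for (;;) {
		char control[128];
		struct msghdr msg = {
			.msg_control = control,
			.msg_controllen = sizeof(control),
		};
		struct sock_extended_err *serr;
		struct cmsghdr *cm;

		if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0) {
			if (errno == EAGAIN)
				continue;	/* real code would poll() */
			return -errno;
		}
		cm = CMSG_FIRSTHDR(&msg);
		serr = (struct sock_extended_err *)CMSG_DATA(cm);
		if (serr->ee_origin == SO_EE_ORIGIN_ZEROCOPY) {
			/* ee_info..ee_data: range of completed sends */
			printf("zerocopy completion %u..%u\n",
			       serr->ee_info, serr->ee_data);
			return 0;
		}
	}
}

In-kernel callers do not use setsockopt(); the sock_set_flag(sk,
SOCK_ZEROCOPY) calls in the patch above are the kernel-internal
equivalent of enabling SO_ZEROCOPY.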




* Re: [PATCH RFC] SUNRPC: Use zero-copy to perform socket send operations
  2020-11-09 20:10             ` Eric Dumazet
@ 2020-11-09 20:11               ` Chuck Lever
  2020-11-10 14:49               ` Chuck Lever
  1 sibling, 0 replies; 11+ messages in thread
From: Chuck Lever @ 2020-11-09 20:11 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Trond Myklebust, Linux NFS Mailing List, netdev



> On Nov 9, 2020, at 3:10 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> 
> 
> On 11/9/20 8:31 PM, Chuck Lever wrote:
>> 
>> 
>>> On Nov 9, 2020, at 1:16 PM, Trond Myklebust <trondmy@hammerspace.com> wrote:
>>> 
>>> On Mon, 2020-11-09 at 12:36 -0500, Chuck Lever wrote:
>>>> 
>>>> 
>>>>> On Nov 9, 2020, at 12:32 PM, Trond Myklebust <
>>>>> trondmy@hammerspace.com> wrote:
>>>>> 
>>>>> On Mon, 2020-11-09 at 12:12 -0500, Chuck Lever wrote:
>>>>>> 
>>>>>> 
>>>>>>> On Nov 9, 2020, at 12:08 PM, Trond Myklebust
>>>>>>> <trondmy@hammerspace.com> wrote:
>>>>>>> 
>>>>>>> On Mon, 2020-11-09 at 11:03 -0500, Chuck Lever wrote:
>>>>>>>> Daire Byrne reports a ~50% aggregrate throughput regression
>>>>>>>> on
>>>>>>>> his
>>>>>>>> Linux NFS server after commit da1661b93bf4 ("SUNRPC: Teach
>>>>>>>> server
>>>>>>>> to
>>>>>>>> use xprt_sock_sendmsg for socket sends"), which replaced
>>>>>>>> kernel_send_page() calls in NFSD's socket send path with
>>>>>>>> calls to
>>>>>>>> sock_sendmsg() using iov_iter.
>>>>>>>> 
>>>>>>>> Investigation showed that tcp_sendmsg() was not using zero-
>>>>>>>> copy
>>>>>>>> to
>>>>>>>> send the xdr_buf's bvec pages, but instead was relying on
>>>>>>>> memcpy.
>>>>>>>> 
>>>>>>>> Set up the socket and each msghdr that bears bvec pages to
>>>>>>>> use
>>>>>>>> the
>>>>>>>> zero-copy mechanism in tcp_sendmsg.
>>>>>>>> 
>>>>>>>> Reported-by: Daire Byrne <daire@dneg.com>
>>>>>>>> BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=209439
>>>>>>>> Fixes: da1661b93bf4 ("SUNRPC: Teach server to use
>>>>>>>> xprt_sock_sendmsg
>>>>>>>> for socket sends")
>>>>>>>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>>>>>>>> ---
>>>>>>>> net/sunrpc/socklib.c  |    5 ++++-
>>>>>>>> net/sunrpc/svcsock.c  |    1 +
>>>>>>>> net/sunrpc/xprtsock.c |    1 +
>>>>>>>> 3 files changed, 6 insertions(+), 1 deletion(-)
>>>>>>>> 
>>>>>>>> This patch does not fully resolve the issue. Daire reports
>>>>>>>> high
>>>>>>>> softIRQ activity after the patch is applied, and this
>>>>>>>> activity
>>>>>>>> seems to prevent full restoration of previous performance.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> diff --git a/net/sunrpc/socklib.c b/net/sunrpc/socklib.c
>>>>>>>> index d52313af82bc..af47596a7bdd 100644
>>>>>>>> --- a/net/sunrpc/socklib.c
>>>>>>>> +++ b/net/sunrpc/socklib.c
>>>>>>>> @@ -226,9 +226,12 @@ static int xprt_send_pagedata(struct
>>>>>>>> socket
>>>>>>>> *sock, struct msghdr *msg,
>>>>>>>>        if (err < 0)
>>>>>>>>                return err;
>>>>>>>> 
>>>>>>>> +       msg->msg_flags |= MSG_ZEROCOPY;
>>>>>>>>        iov_iter_bvec(&msg->msg_iter, WRITE, xdr->bvec,
>>>>>>>> xdr_buf_pagecount(xdr),
>>>>>>>>                      xdr->page_len + xdr->page_base);
>>>>>>>> -       return xprt_sendmsg(sock, msg, base + xdr-
>>>>>>>>> page_base);
>>>>>>>> +       err = xprt_sendmsg(sock, msg, base + xdr->page_base);
>>>>>>>> +       msg->msg_flags &= ~MSG_ZEROCOPY;
>>>>>>>> +       return err;
>>>>>>>> }
>>>>>>>> 
>>>>>>>> /* Common case:
>>>>>>>> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
>>>>>>>> index c2752e2b9ce3..c814b4953b15 100644
>>>>>>>> --- a/net/sunrpc/svcsock.c
>>>>>>>> +++ b/net/sunrpc/svcsock.c
>>>>>>>> @@ -1176,6 +1176,7 @@ static void svc_tcp_init(struct
>>>>>>>> svc_sock
>>>>>>>> *svsk,
>>>>>>>> struct svc_serv *serv)
>>>>>>>>                svsk->sk_datalen = 0;
>>>>>>>>                memset(&svsk->sk_pages[0], 0, sizeof(svsk-
>>>>>>>>> sk_pages));
>>>>>>>> 
>>>>>>>> +               sock_set_flag(sk, SOCK_ZEROCOPY);
>>>>>>>>                tcp_sk(sk)->nonagle |= TCP_NAGLE_OFF;
>>>>>>>> 
>>>>>>>>                set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
>>>>>>>> diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
>>>>>>>> index 7090bbee0ec5..343c6396b297 100644
>>>>>>>> --- a/net/sunrpc/xprtsock.c
>>>>>>>> +++ b/net/sunrpc/xprtsock.c
>>>>>>>> @@ -2175,6 +2175,7 @@ static int
>>>>>>>> xs_tcp_finish_connecting(struct
>>>>>>>> rpc_xprt *xprt, struct socket *sock)
>>>>>>>> 
>>>>>>>>                /* socket options */
>>>>>>>>                sock_reset_flag(sk, SOCK_LINGER);
>>>>>>>> +               sock_set_flag(sk, SOCK_ZEROCOPY);
>>>>>>>>                tcp_sk(sk)->nonagle |= TCP_NAGLE_OFF;
>>>>>>>> 
>>>>>>>>                xprt_clear_connected(xprt);
>>>>>>>> 
>>>>>>>> 
>>>>>>> I'm thinking we are not really allowed to do that here. The
>>>>>>> pages
>>>>>>> we
>>>>>>> pass in to the RPC layer are not guaranteed to contain stable
>>>>>>> data
>>>>>>> since they include unlocked page cache pages as well as
>>>>>>> O_DIRECT
>>>>>>> pages.
>>>>>> 
>>>>>> I assume you mean the client side only. Those issues aren't a
>>>>>> factor
>>>>>> on the server. Not setting SOCK_ZEROCOPY here should be enough to
>>>>>> prevent the use of zero-copy on the client.
>>>>>> 
>>>>>> However, the client loses the benefits of sending a page at a
>>>>>> time.
>>>>>> Is there a desire to remedy that somehow?
>>>>> 
>>>>> What about splice reads on the server side?
>>>> 
>>>> On the server, this path formerly used kernel_sendpages(), which I
>>>> assumed is similar to the sendmsg zero-copy mechanism. How does
>>>> kernel_sendpages() mitigate against page instability?
>>> 
>>> It copies the data. 🙂
>> 
>> tcp_sendmsg_locked() invokes skb_copy_to_page_nocache(), which is
>> where Daire's performance-robbing memcpy occurs.
>> 
>> do_tcp_sendpages() has no such call site. Therefore the legacy
>> sendpage-based path has at least one fewer data copy operations.
>> 
>> What is the appropriate way to make tcp_sendmsg() treat a bvec-bearing
>> msghdr like an array of struct page pointers passed to kernel_sendpage() ?
>> 
> 
> 
> MSG_ZEROCOPY is only accepted if sock_flag(sk, SOCK_ZEROCOPY) is true,
> ie if SO_ZEROCOPY socket option has been set earlier.

The patch does set both SO_ZEROCOPY and MSG_ZEROCOPY when appropriate.


--
Chuck Lever





* Re: [PATCH RFC] SUNRPC: Use zero-copy to perform socket send operations
  2020-11-09 20:10             ` Eric Dumazet
  2020-11-09 20:11               ` Chuck Lever
@ 2020-11-10 14:49               ` Chuck Lever
  1 sibling, 0 replies; 11+ messages in thread
From: Chuck Lever @ 2020-11-10 14:49 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Trond Myklebust, Linux NFS Mailing List, netdev



> On Nov 9, 2020, at 3:10 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> 
> 
> On 11/9/20 8:31 PM, Chuck Lever wrote:
>> 
>> 
>>> On Nov 9, 2020, at 1:16 PM, Trond Myklebust <trondmy@hammerspace.com> wrote:
>>> 
>>> On Mon, 2020-11-09 at 12:36 -0500, Chuck Lever wrote:
>>>> 
>>>> 
>>>>> On Nov 9, 2020, at 12:32 PM, Trond Myklebust <
>>>>> trondmy@hammerspace.com> wrote:
>>>>> 
>>>>> On Mon, 2020-11-09 at 12:12 -0500, Chuck Lever wrote:
>>>>>> 
>>>>>> 
>>>>>>> On Nov 9, 2020, at 12:08 PM, Trond Myklebust
>>>>>>> <trondmy@hammerspace.com> wrote:
>>>>>>> 
>>>>>>> On Mon, 2020-11-09 at 11:03 -0500, Chuck Lever wrote:
>>>>>>>> Daire Byrne reports a ~50% aggregrate throughput regression
>>>>>>>> on
>>>>>>>> his
>>>>>>>> Linux NFS server after commit da1661b93bf4 ("SUNRPC: Teach
>>>>>>>> server
>>>>>>>> to
>>>>>>>> use xprt_sock_sendmsg for socket sends"), which replaced
>>>>>>>> kernel_send_page() calls in NFSD's socket send path with
>>>>>>>> calls to
>>>>>>>> sock_sendmsg() using iov_iter.
>>>>>>>> 
>>>>>>>> Investigation showed that tcp_sendmsg() was not using zero-
>>>>>>>> copy
>>>>>>>> to
>>>>>>>> send the xdr_buf's bvec pages, but instead was relying on
>>>>>>>> memcpy.
>>>>>>>> 
>>>>>>>> Set up the socket and each msghdr that bears bvec pages to
>>>>>>>> use
>>>>>>>> the
>>>>>>>> zero-copy mechanism in tcp_sendmsg.
>>>>>>>> 
>>>>>>>> Reported-by: Daire Byrne <daire@dneg.com>
>>>>>>>> BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=209439
>>>>>>>> Fixes: da1661b93bf4 ("SUNRPC: Teach server to use
>>>>>>>> xprt_sock_sendmsg
>>>>>>>> for socket sends")
>>>>>>>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>>>>>>>> ---
>>>>>>>> net/sunrpc/socklib.c  |    5 ++++-
>>>>>>>> net/sunrpc/svcsock.c  |    1 +
>>>>>>>> net/sunrpc/xprtsock.c |    1 +
>>>>>>>> 3 files changed, 6 insertions(+), 1 deletion(-)
>>>>>>>> 
>>>>>>>> This patch does not fully resolve the issue. Daire reports
>>>>>>>> high
>>>>>>>> softIRQ activity after the patch is applied, and this
>>>>>>>> activity
>>>>>>>> seems to prevent full restoration of previous performance.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> diff --git a/net/sunrpc/socklib.c b/net/sunrpc/socklib.c
>>>>>>>> index d52313af82bc..af47596a7bdd 100644
>>>>>>>> --- a/net/sunrpc/socklib.c
>>>>>>>> +++ b/net/sunrpc/socklib.c
>>>>>>>> @@ -226,9 +226,12 @@ static int xprt_send_pagedata(struct
>>>>>>>> socket
>>>>>>>> *sock, struct msghdr *msg,
>>>>>>>>        if (err < 0)
>>>>>>>>                return err;
>>>>>>>> 
>>>>>>>> +       msg->msg_flags |= MSG_ZEROCOPY;
>>>>>>>>        iov_iter_bvec(&msg->msg_iter, WRITE, xdr->bvec,
>>>>>>>> xdr_buf_pagecount(xdr),
>>>>>>>>                      xdr->page_len + xdr->page_base);
>>>>>>>> -       return xprt_sendmsg(sock, msg, base + xdr-
>>>>>>>>> page_base);
>>>>>>>> +       err = xprt_sendmsg(sock, msg, base + xdr->page_base);
>>>>>>>> +       msg->msg_flags &= ~MSG_ZEROCOPY;
>>>>>>>> +       return err;
>>>>>>>> }
>>>>>>>> 
>>>>>>>> /* Common case:
>>>>>>>> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
>>>>>>>> index c2752e2b9ce3..c814b4953b15 100644
>>>>>>>> --- a/net/sunrpc/svcsock.c
>>>>>>>> +++ b/net/sunrpc/svcsock.c
>>>>>>>> @@ -1176,6 +1176,7 @@ static void svc_tcp_init(struct
>>>>>>>> svc_sock
>>>>>>>> *svsk,
>>>>>>>> struct svc_serv *serv)
>>>>>>>>                svsk->sk_datalen = 0;
>>>>>>>>                memset(&svsk->sk_pages[0], 0, sizeof(svsk-
>>>>>>>>> sk_pages));
>>>>>>>> 
>>>>>>>> +               sock_set_flag(sk, SOCK_ZEROCOPY);
>>>>>>>>                tcp_sk(sk)->nonagle |= TCP_NAGLE_OFF;
>>>>>>>> 
>>>>>>>>                set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
>>>>>>>> diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
>>>>>>>> index 7090bbee0ec5..343c6396b297 100644
>>>>>>>> --- a/net/sunrpc/xprtsock.c
>>>>>>>> +++ b/net/sunrpc/xprtsock.c
>>>>>>>> @@ -2175,6 +2175,7 @@ static int
>>>>>>>> xs_tcp_finish_connecting(struct
>>>>>>>> rpc_xprt *xprt, struct socket *sock)
>>>>>>>> 
>>>>>>>>                /* socket options */
>>>>>>>>                sock_reset_flag(sk, SOCK_LINGER);
>>>>>>>> +               sock_set_flag(sk, SOCK_ZEROCOPY);
>>>>>>>>                tcp_sk(sk)->nonagle |= TCP_NAGLE_OFF;
>>>>>>>> 
>>>>>>>>                xprt_clear_connected(xprt);
>>>>>>>> 
>>>>>>>> 
>>>>>>> I'm thinking we are not really allowed to do that here. The
>>>>>>> pages
>>>>>>> we
>>>>>>> pass in to the RPC layer are not guaranteed to contain stable
>>>>>>> data
>>>>>>> since they include unlocked page cache pages as well as
>>>>>>> O_DIRECT
>>>>>>> pages.
>>>>>> 
>>>>>> I assume you mean the client side only. Those issues aren't a
>>>>>> factor
>>>>>> on the server. Not setting SOCK_ZEROCOPY here should be enough to
>>>>>> prevent the use of zero-copy on the client.
>>>>>> 
>>>>>> However, the client loses the benefits of sending a page at a
>>>>>> time.
>>>>>> Is there a desire to remedy that somehow?
>>>>> 
>>>>> What about splice reads on the server side?
>>>> 
>>>> On the server, this path formerly used kernel_sendpages(), which I
>>>> assumed is similar to the sendmsg zero-copy mechanism. How does
>>>> kernel_sendpages() mitigate against page instability?
>>> 
>>> It copies the data. 🙂
>> 
>> tcp_sendmsg_locked() invokes skb_copy_to_page_nocache(), which is
>> where Daire's performance-robbing memcpy occurs.
>> 
>> do_tcp_sendpages() has no such call site. Therefore the legacy
>> sendpage-based path has at least one fewer data copy operations.
>> 
>> What is the appropriate way to make tcp_sendmsg() treat a bvec-bearing
>> msghdr like an array of struct page pointers passed to kernel_sendpage() ?
>> 
> 
> 
> MSG_ZEROCOPY is only accepted if sock_flag(sk, SOCK_ZEROCOPY) is true,
> ie if SO_ZEROCOPY socket option has been set earlier.

Eric, are you suggesting that MSG_ZEROCOPY is the mechanism that socket
consumers should be using with sock_sendmsg() to get the same behavior
as kernel_sendpage()?

If no, what is the preferred approach?

If yes, can you comment on the added soft IRQ workload when NFSD
sets these flags?


--
Chuck Lever





end of thread

Thread overview: 11+ messages
2020-11-09 16:03 [PATCH RFC] SUNRPC: Use zero-copy to perform socket send operations Chuck Lever
2020-11-09 17:08 ` Trond Myklebust
2020-11-09 17:12   ` Chuck Lever
2020-11-09 17:32     ` Trond Myklebust
2020-11-09 17:36       ` Chuck Lever
2020-11-09 17:55         ` J. Bruce Fields
2020-11-09 18:16         ` Trond Myklebust
2020-11-09 19:31           ` Chuck Lever
2020-11-09 20:10             ` Eric Dumazet
2020-11-09 20:11               ` Chuck Lever
2020-11-10 14:49               ` Chuck Lever
