linux-nfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Trond Myklebust <trondmy@hammerspace.com>
To: "chuck.lever@oracle.com" <chuck.lever@oracle.com>
Cc: "linux-nfs@vger.kernel.org" <linux-nfs@vger.kernel.org>,
	"bfields@redhat.com" <bfields@redhat.com>
Subject: Re: [PATCH] SUNRPC: Use TCP_CORK to optimise send performance on the server
Date: Sun, 14 Feb 2021 18:18:50 +0000	[thread overview]
Message-ID: <531d2cc4836f95c07b2a41c4bb1eb7d5306208d1.camel@hammerspace.com> (raw)
In-Reply-To: <BE094630-AA8E-4C7E-ACF8-B153AECA2EDA@oracle.com>

On Sun, 2021-02-14 at 17:58 +0000, Chuck Lever wrote:
> 
> 
> > On Feb 14, 2021, at 12:41 PM, Trond Myklebust < 
> > trondmy@hammerspace.com> wrote:
> > 
> > On Sun, 2021-02-14 at 17:27 +0000, Chuck Lever wrote:
> > > 
> > > 
> > > > On Feb 14, 2021, at 11:21 AM, Trond Myklebust
> > > > <trondmy@hammerspace.com> wrote:
> > > > 
> > > > On Sat, 2021-02-13 at 23:30 +0000, Chuck Lever wrote:
> > > > > 
> > > > > 
> > > > > > On Feb 13, 2021, at 5:10 PM, Trond Myklebust <
> > > > > > trondmy@hammerspace.com> wrote:
> > > > > > 
> > > > > > On Sat, 2021-02-13 at 21:53 +0000, Chuck Lever wrote:
> > > > > > > Hi Trond-
> > > > > > > 
> > > > > > > > On Feb 13, 2021, at 3:25 PM, trondmy@kernel.org wrote:
> > > > > > > > 
> > > > > > > > From: Trond Myklebust <trond.myklebust@hammerspace.com>
> > > > > > > > 
> > > > > > > > Use a counter to keep track of how many requests are
> > > > > > > > queued
> > > > > > > > behind
> > > > > > > > the
> > > > > > > > xprt->xpt_mutex, and keep TCP_CORK set until the queue
> > > > > > > > is
> > > > > > > > empty.
> > > > > > > 
> > > > > > > I'm intrigued, but IMO, the patch description needs to
> > > > > > > explain
> > > > > > > why this change should be made. Why abandon Nagle?
> > > > > > > 
> > > > > > 
> > > > > > This doesn't change the Nagle/TCP_NODELAY settings. It just
> > > > > > switches to
> > > > > > using the new documented kernel interface.
> > > > > > 
> > > > > > The only change is to use TCP_CORK so that we don't send
> > > > > > out
> > > > > > partially
> > > > > > filled TCP frames, when we can see that there are other RPC
> > > > > > replies
> > > > > > that are queued up for transmission.
> > > > > > 
> > > > > > Note the combination TCP_CORK+TCP_NODELAY is common, and
> > > > > > the
> > > > > > main
> > > > > > effect of the latter is that when we turn off the TCP_CORK,
> > > > > > then
> > > > > > there
> > > > > > is an immediate forced push of the TCP queue.
> > > > > 
> > > > > The description above suggests the patch is just a
> > > > > clean-up, but a forced push has potential to change
> > > > > the server's behavior.
> > > > 
> > > > Well, yes. That's very much the point.
> > > > 
> > > > Right now, the TCP_NODELAY/Nagle setting means that we're doing
> > > > that
> > > > forced push at the end of _every_ RPC reply, whether or not
> > > > there
> > > > is
> > > > more stuff that can be queued up in the socket. The MSG_MORE is
> > > > the
> > > > only thing that keeps us from doing the forced push on every
> > > > sendpage()
> > > > call.
> > > > So the TCP_CORK is there to _further delay_ that forced push
> > > > until
> > > > we
> > > > think the queue is empty.
> > > 
> > > My concern is that waiting for the queue to empty before pushing
> > > could
> > > improve throughput at the cost of increased average round-trip
> > > latency.
> > > That concern is based on experience I've had attempting to batch
> > > sends
> > > in the RDMA transport.
> > > 
> > > 
> > > > IOW: it attempts to optimise the scheduling of that push until
> > > > we're
> > > > actually done pushing more stuff into the socket.
> > > 
> > > Yep, clear, thanks. It would help a lot if the above were
> > > included in
> > > the patch description.
> > > 
> > > And, I presume that the TCP layer will push anyway if it needs to
> > > reclaim resources to handle more queued sends.
> > > 
> > > Let's also consider starvation; ie, that the server will continue
> > > queuing replies such that it never uncorks. The logic in the
> > > patch
> > > appears to depend on the client stopping at some point to wait
> > > for
> > > the
> > > server to catch up. There probably should be a trap door that
> > > uncorks
> > > after a few requests (say, 8) or certain number of bytes are
> > > pending
> > > without a break.
> > 
> > So, firstly, the TCP layer will still push every time a frame is
> > full,
> > so traffic does not stop altogether while TCP_CORK is set.
> 
> OK.
> 
> 
> > Secondly, TCP will also always push on send errors (e.g. when out
> > of free socket
> > buffer).
> 
> As I presumed. OK.
> 
> 
> > Thirdly, it will always push after hitting the 200ms ceiling,
> > as described in the tcp(7) manpage.
> 
> That's the trap door we need to ensure no starvation or deadlock,
> assuming there are no bugs in the TCP layer's logic.
> 
> 200ms seems a long time to wait, though, when average round trip
> latencies are usually under 1ms on typical Ethernet. It would be
> good to know how often sending has to wait this long.

If it does wait that long, then it would be because the system is under
such load that scheduling the next task waiting for the xpt_mutex is
taking more than 200ms. I don't see how this would make things any
worse.

> > IOW: The TCP_CORK feature is not designed to block the socket
> > forever.
> > It is there to allow the application to hint to the TCP layer what
> > it
> > needs to do in exactly the same way that MSG_MORE does.
> 
> As long as it is only a hint, then we're good.
> 
> Out of interest, why not use only MSG_MORE, or remove the use
> of MSG_MORE in favor of only cork? If these are essentially the
> same mechanism, seems like we need only one or the other.
> 

The advantage of TCP_CORK is that you can perform the queue length
evaluation _after_ the sendmsg/sendpage/writev call. Since all of those
can block due to memory allocation, socket buffer shortage, etc, then
it is quite likely under many workloads that other RPC calls may have
time to finish processing and get queued up.

I agree that we probably could remove MSG_MORE for tcp once TCP_CORK is
implemented. However those flags continue to be useful for udp.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@hammerspace.com



  reply	other threads:[~2021-02-14 18:20 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-02-13 20:25 [PATCH] SUNRPC: Use TCP_CORK to optimise send performance on the server trondmy
2021-02-13 21:53 ` Chuck Lever
2021-02-13 22:10   ` Trond Myklebust
2021-02-13 23:30     ` Chuck Lever
2021-02-14 14:13       ` Daire Byrne
2021-02-14 16:24         ` Trond Myklebust
2021-02-14 16:59         ` Chuck Lever
2021-02-15 14:08           ` Daire Byrne
2021-02-15 18:22             ` Chuck Lever
2021-02-14 16:21       ` Trond Myklebust
2021-02-14 17:27         ` Chuck Lever
2021-02-14 17:41           ` Trond Myklebust
2021-02-14 17:58             ` Chuck Lever
2021-02-14 18:18               ` Trond Myklebust [this message]
2021-02-14 19:43                 ` Chuck Lever
2021-02-16 16:27 ` Chuck Lever
2021-02-16 17:10   ` Trond Myklebust
2021-02-16 17:12     ` Chuck Lever

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=531d2cc4836f95c07b2a41c4bb1eb7d5306208d1.camel@hammerspace.com \
    --to=trondmy@hammerspace.com \
    --cc=bfields@redhat.com \
    --cc=chuck.lever@oracle.com \
    --cc=linux-nfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).