From: Christoph Paasch <cpaasch@apple.com>
To: Eric Dumazet <eric.dumazet@gmail.com>
Cc: netdev@vger.kernel.org, Ian Swett <ianswett@google.com>,
	Leif Hedstrom <lhedstrom@apple.com>,
	Jana Iyengar <jri.ietf@gmail.com>
Subject: Re: [RFC 0/2] Delayed binding of UDP sockets for Quic per-connection sockets
Date: Wed, 31 Oct 2018 20:50:50 -0700
Message-ID: <20181101035050.GO80792@MacBook-Pro-19.local>
In-Reply-To: <0ce864f0-38b9-59cc-18ea-e071afca347d@gmail.com>

On 31/10/18 - 17:53:22, Eric Dumazet wrote:
> On 10/31/2018 04:26 PM, Christoph Paasch wrote:
> > Implementations of Quic might want to create a separate socket for each
> > Quic-connection by creating a connected UDP-socket.
> > 
> 
> Nice proposal, but I doubt a QUIC server can afford having one UDP socket per connection ?
> 
> It would add a huge overhead in term of memory usage in the kernel,
> and lots of epoll events to manage (say a QUIC server with one million flows, receiving
> very few packets per second per flow)
> 
> Maybe you could elaborate on the need of having one UDP socket per connection.

I'll let Leif chime in on that, as the ask came from him. Leif and his team
are implementing Quic in Apache Traffic Server.


One advantage I can see is that it would let us benefit from pacing in the
fq qdisc, as one could simply set sk_pacing_rate on the socket. That way
there is no need to implement pacing in user space anymore.
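
A minimal sketch of what that could look like from user space. It assumes
the knob in question is the SO_MAX_PACING_RATE socket option, which caps
sk_pacing_rate and takes a value in bytes per second, and that fq is the
qdisc on the egress device. The helper names pacing_bytes_per_sec and
set_max_pacing_rate are hypothetical.

```c
#include <stdint.h>
#include <sys/socket.h>

/* Convert bits per second to the bytes-per-second unit used by
 * SO_MAX_PACING_RATE, clamping to what fits in 32 bits.
 * Note: a value of ~0U is treated by the kernel as "no limit". */
uint32_t pacing_bytes_per_sec(uint64_t bits_per_sec)
{
    uint64_t bytes = bits_per_sec / 8;
    return bytes > UINT32_MAX ? UINT32_MAX : (uint32_t)bytes;
}

/* Cap the pacing rate of a socket (hypothetical helper).
 * Returns 0 on success and -1 on failure, including when the
 * platform does not define SO_MAX_PACING_RATE. */
int set_max_pacing_rate(int fd, uint64_t bits_per_sec)
{
#ifdef SO_MAX_PACING_RATE
    uint32_t rate = pacing_bytes_per_sec(bits_per_sec);
    return setsockopt(fd, SOL_SOCKET, SO_MAX_PACING_RATE,
                      &rate, sizeof(rate));
#else
    (void)fd;
    (void)bits_per_sec;
    return -1;
#endif
}
```

With one connected UDP socket per Quic connection, each connection could
get its own rate, e.g. set_max_pacing_rate(fd, 80000000) to request
roughly 10 MB/s on that socket.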


> > To achieve that on the server-side, a "master-socket" needs to wait for
> > incoming new connections and then creates a new socket that will be a
> > connected UDP-socket. To create that latter one, the server needs to
> > first bind() and then connect(). However, after the bind() the server
> > might already receive traffic on that new socket that is unrelated to the
> > Quic-connection at hand. Only after the connect() a full 4-tuple match
> > is happening. So, one can't really create this kind of a server that has
> > a connected UDP-socket per Quic connection.
> > 
> > So, what is needed is an "atomic bind & connect" that basically
> > prevents any incoming traffic until the connect() call has been issued
> > at which point the full 4-tuple is known.
> > 
> > 
> > This patchset implements this functionality and exposes a socket-option
> > to do this.
> > 
> > Usage would be:
> > 
> >         int fd = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
> > 
> >         int val = 1;
> >         setsockopt(fd, SOL_SOCKET, SO_DELAYED_BIND, &val, sizeof(val));
> > 
> >         bind(fd, (struct sockaddr *)&src, sizeof(src));
> > 
> >         /* At this point, incoming traffic will never match on this socket */
> > 
> >         connect(fd, (struct sockaddr *)&dst, sizeof(dst));
> > 
> >         /* Only now incoming traffic will reach the socket */
> > 
> > 
> > 
> > There is literally an infinite number of ways on how to implement it,
> > which is why I first send it out as an RFC. With this approach here I
> > chose the least invasive one, just preventing the match on the incoming
> > path.
> > 
> > 
> > The reason for choosing a SOL_SOCKET socket-option and not at the
> > SOL_UDP-level is because that functionality actually could be useful for
> > other protocols as well. E.g., TCP wants to better use the full 4-tuple space
> > by binding to the source-IP and the destination-IP at the same time.
> 
> Passive TCP flows can not benefit from this idea.
> 
> Active TCP flows can already do that, I do not really understand what you are suggesting.

What we had here was a server that needed to initiate more than 64K
connections *while* also binding to a source-IP.
With TCP, the bind() would pick a source-port right away, before the
destination is known, and we ended up hitting the 64K limit. If we could do
an atomic "bind + connect", source-port selection could ensure that the
4-tuple is unique.

Or has something changed recently that allows 4-tuple matching to be used
when doing this with TCP?
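
A sketch of the bind-then-connect sequence in question, assuming a Linux
4.2 or later kernel where the IP_BIND_ADDRESS_NO_PORT socket option (not
mentioned in this message) is available. With that option set, bind()
with port 0 records only the source address, and the source port is chosen
at connect() time, when the full 4-tuple is known. The helper names
make_addr and connect_from are hypothetical.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Fill an IPv4 socket address. Returns 0 on success, -1 on a bad IP. */
int make_addr(const char *ip, uint16_t port, struct sockaddr_in *out)
{
    memset(out, 0, sizeof(*out));
    out->sin_family = AF_INET;
    out->sin_port = htons(port);
    return inet_pton(AF_INET, ip, &out->sin_addr) == 1 ? 0 : -1;
}

/* Open a TCP connection from a fixed source IP. The port in src should
 * be 0 so that the kernel picks it at connect() time.
 * Returns the connected fd, or -1 on failure or if the option is
 * not defined on this platform. */
int connect_from(const struct sockaddr_in *src,
                 const struct sockaddr_in *dst)
{
#ifdef IP_BIND_ADDRESS_NO_PORT
    int one = 1;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;
    /* Ask bind() not to reserve an ephemeral port yet. */
    if (setsockopt(fd, IPPROTO_IP, IP_BIND_ADDRESS_NO_PORT,
                   &one, sizeof(one)) < 0 ||
        bind(fd, (const struct sockaddr *)src, sizeof(*src)) < 0 ||
        connect(fd, (const struct sockaddr *)dst, sizeof(*dst)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
#else
    (void)src;
    (void)dst;
    return -1;
#endif
}
```

Whether this covers the more-than-64K case described above depends on the
kernel version and on the destinations differing, so treat it as a starting
point rather than a confirmed answer.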


Christoph