From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
To: Jakub Sitnicki <jakub@cloudflare.com>
Cc: Network Development <netdev@vger.kernel.org>,
	kernel-team <kernel-team@cloudflare.com>,
	Marek Majkowski <marek@cloudflare.com>
Subject: Re: Delayed source port allocation for connected UDP sockets
Date: Mon, 2 Dec 2019 11:03:12 -0500
Message-ID: <CA+FuTSfA9o=yQk5EjR2hMuhwRDLXCAwYQ+eGqx2YSh=hx03c8g@mail.gmail.com>
In-Reply-To: <877e3fniep.fsf@cloudflare.com>

On Mon, Dec 2, 2019 at 5:15 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>
> On Wed, Nov 27, 2019 at 03:07 PM CET, Marek Majkowski wrote:
> > In my applications I need something like a connectx()[1] syscall. On
> > Linux I can get quite far with using bind-before-connect and
> > IP_BIND_ADDRESS_NO_PORT. One corner case is missing though.
> >
> > For various UDP applications I'm establishing connected sockets from
> > specific 2-tuple. This is working fine with bind-before-connect, but
> > in UDP it creates a slight race condition. It's possible the socket
> > will receive packet from arbitrary source after bind():
> >
> > s = socket(SOCK_DGRAM)
> > s.bind((192.0.2.1, 1703))
> > # here be dragons
> > s.connect((198.18.0.1, 58910))
> >
> > For the short amount of time after bind() and before connect(), the
> > socket may receive packets from any peer. For situations when I don't
> > need to specify source port, IP_BIND_ADDRESS_NO_PORT flag solves the
> > issue. This code is fine:
> >
> > s = socket(SOCK_DGRAM)
> > s.setsockopt(IP_BIND_ADDRESS_NO_PORT)
> > s.bind((192.0.2.1, 0))
> > s.connect((198.18.0.1, 58910))
> >
> > But the IP_BIND_ADDRESS_NO_PORT doesn't work when the source port is
> > selected. It seems natural to expand the scope of
> > IP_BIND_ADDRESS_NO_PORT flag. Perhaps this could be made to work:
> >
> > s = socket(SOCK_DGRAM)
> > s.setsockopt(IP_BIND_ADDRESS_NO_PORT)
> > s.bind((192.0.2.1, 1703))
> > s.connect((198.18.0.1, 58910))
> >
> > I would like such code to delay the binding to port 1703 up until the
> > connect(). IP_BIND_ADDRESS_NO_PORT only makes sense for connected
> > sockets anyway. This raises a couple of questions though:
> >
> >  - IP_BIND_ADDRESS_NO_PORT name is confusing - we specify the port
> > number in the bind!
> >
> >  - Where to store the source port in __inet_bind. Neither
> > inet->inet_sport nor inet->inet_num seem like correct places to store
> > the user-passed source port hint. The alternative is to introduce
> > yet-another field onto inet_sock struct, but that is wasteful.
>
> We've been talking with Marek about it some more. I'll summarize for the
> sake of keeping the discussion open.
>
> 1. inet->inet_sport as storage for port hint
>
>    It seems inet->inet_sport could be used to hold the port passed to
>    bind() when we're delaying port allocation with
>    IP_BIND_ADDRESS_NO_PORT. As long as local port, inet->inet_num, is
>    not set, connect() and sendmsg() will know the socket needs to be
>    bound to a port first.

So bind might succeed, but connect could fail later if the port gets
bound by another socket in between?
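
To make the concern concrete: today a port conflict is reported by
bind() itself, so the caller learns about it before connect() is ever
attempted. With delayed allocation the same conflict could presumably
only surface at connect() time. A minimal Python sketch of the current
behaviour follows; the loopback address and the helper name
bind_conflict_errno are illustrative assumptions, not anything from
the proposal.

```python
import errno
import socket


def bind_conflict_errno(addr="127.0.0.1"):
    """Bind two UDP sockets to the same address and port.

    Returns the errno raised by the second bind(), or 0 if it succeeded.
    Hypothetical helper, for illustration only.
    """
    first = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        first.bind((addr, 0))          # let the kernel pick a free port
        port = first.getsockname()[1]
        try:
            # Same local 2-tuple, no SO_REUSEADDR / SO_REUSEPORT set.
            second.bind((addr, port))
        except OSError as exc:
            return exc.errno           # conflict is reported at bind() time
        return 0
    finally:
        first.close()
        second.close()


if __name__ == "__main__":
    err = bind_conflict_errno()
    print(errno.errorcode.get(err, err))
```

If the port were only claimed at connect(), an application would need
to handle this error on the connect() path instead.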

Related, I have toyed with unhashed sockets with inet_sport set in the
past for a different use case: transmit-only sockets. If all receive
processing happens on a small set (say, per CPU) of unconnected
listening sockets, then unhashed transmit-only connected sockets can
transmit without a route lookup. But the route caching did not
warrant the cost of maintaining a socket per connection at scale.

>
>    We didn't do a detailed audit of all access sites to
>    inet->inet_sport. Potentially we missed something.
>
>

> 4. Why connected UDP sockets?
>
>    We know that it's better to stick to receiving UDP sockets and
>    demultiplex the client requests/sessions in user-space. Being hashed
>    just by local address & port, connected UDP sockets don't scale well.
>
>    We think there is one useful application, though. Service draining
>    during restarts.
>
>    When a service is being restarted, we would like the dying process to
>    handle the ongoing L7 sessions until they come to an end. New UDP
>    flows should go to a fresh service instance.

Service hand-off is a prime use case of reuseport BPF. With UDP it is
trickier than with TCP: it likely requires a map to store
session-to-process affinity.
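
As a rough illustration of what such a map would have to do, here is a
user-space model in Python. It is not BPF and does not touch the
kernel; the class SessionAffinity, its methods, and the integer
instance ids are all hypothetical names chosen for the sketch. The
idea is only that known flows stay pinned to the instance that first
saw them, while new flows go to whichever instance is current.

```python
from typing import Dict, Tuple

# (source address, source port, destination address, destination port)
Flow = Tuple[str, int, str, int]


class SessionAffinity:
    """Toy model of a session-to-process affinity map (hypothetical)."""

    def __init__(self, current_instance: int) -> None:
        self.current_instance = current_instance
        self.sessions: Dict[Flow, int] = {}

    def select(self, flow: Flow) -> int:
        # Known flow: keep its instance. New flow: pin it to the current one.
        return self.sessions.setdefault(flow, self.current_instance)

    def restart(self, new_instance: int) -> None:
        # A fresh service instance takes over new flows only.
        self.current_instance = new_instance

    def end_session(self, flow: Flow) -> None:
        # Once a session ends, its entry can be dropped.
        self.sessions.pop(flow, None)


if __name__ == "__main__":
    aff = SessionAffinity(current_instance=0)
    old_flow = ("198.18.0.1", 58910, "192.0.2.1", 1703)
    new_flow = ("198.18.0.2", 40000, "192.0.2.1", 1703)
    aff.select(old_flow)
    aff.restart(new_instance=1)
    print(aff.select(old_flow), aff.select(new_flow))
```

The hard part in practice is the bookkeeping this model glosses over:
deciding when a UDP session has ended so that its entry can be removed.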

>    To achieve that, for each ongoing session we would open a connected
>    UDP socket. This way socket lookup logic would deliver just the flows
>    we care about to the old process.
>
> 5. reuseport BPF with SOCKARRAY to the rescue?
>
>    Since we're talking about opening connected UDP sockets that share
>    the local port with other receiving UDP sockets (owned by another
>    process), we would need to opt for port sharing with REUSEPORT [3].
>
>    If we don't want the connected UDP sockets to receive any traffic
>    during the short window of opportunity when the socket is bound but
>    not connected, we could exclude it from the reuseport group by
>    controlling the socket set with BPF & SOCKARRAY.
>
> Comments and thoughts more than welcome.

If CAP_NET_RAW is no issue, Maciej's suggestion of temporarily binding
to a dummy device (or even lo) might be the simplest approach?
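
For reference, the window that any of these approaches has to close is
easy to reproduce from user space. The sketch below uses loopback
addresses and a hypothetical helper name, stray_datagram_survives_connect;
it sends a datagram from an unrelated socket after bind() but before
connect(), then reads it back from the now-connected socket.

```python
import socket


def stray_datagram_survives_connect(timeout=1.0):
    """Show that a datagram queued between bind() and connect() is still
    delivered after connect(). Hypothetical helper, loopback only.

    Returns (payload, came_from_stranger).
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    peer = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    stranger = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        peer.bind(("127.0.0.1", 0))
        stranger.bind(("127.0.0.1", 0))

        s.bind(("127.0.0.1", 0))
        # "here be dragons": s is bound but not yet connected,
        # so it accepts datagrams from any source.
        stranger.sendto(b"stray", s.getsockname())
        s.connect(peer.getsockname())

        # The datagram queued before connect() is still in the receive queue.
        s.settimeout(timeout)
        data, src = s.recvfrom(64)
        return data, src == stranger.getsockname()
    finally:
        s.close()
        peer.close()
        stranger.close()


if __name__ == "__main__":
    print(stray_datagram_survives_connect())
```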

>
> -Jakub
>
> [0] Unless we call it IP_BIND_ADDRESS_NO_PORT_FOR_REAL... ;-)
> [1] https://www.unix.com/man-page/mojave/2/connectx/
> [2] Or REUSEADDR which semantics allow it for unicast UDP.

