From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
To: Jakub Sitnicki <jakub@cloudflare.com>
Cc: Network Development <netdev@vger.kernel.org>,
	kernel-team <kernel-team@cloudflare.com>,
	Marek Majkowski <marek@cloudflare.com>
Subject: Re: Delayed source port allocation for connected UDP sockets
Date: Mon, 2 Dec 2019 11:03:12 -0500
Message-ID: <CA+FuTSfA9o=yQk5EjR2hMuhwRDLXCAwYQ+eGqx2YSh=hx03c8g@mail.gmail.com>
In-Reply-To: <877e3fniep.fsf@cloudflare.com>

On Mon, Dec 2, 2019 at 5:15 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>
> On Wed, Nov 27, 2019 at 03:07 PM CET, Marek Majkowski wrote:
> > In my applications I need something like a connectx()[1] syscall. On
> > Linux I can get quite far with using bind-before-connect and
> > IP_BIND_ADDRESS_NO_PORT. One corner case is missing though.
> >
> > For various UDP applications I'm establishing connected sockets from
> > specific 2-tuple. This is working fine with bind-before-connect, but
> > in UDP it creates a slight race condition. It's possible the socket
> > will receive packet from arbitrary source after bind():
> >
> > s = socket(SOCK_DGRAM)
> > s.bind((192.0.2.1, 1703))
> > # here be dragons
> > s.connect((198.18.0.1, 58910))
> >
> > For the short amount of time after bind() and before connect(), the
> > socket may receive packets from any peer. For situations when I don't
> > need to specify source port, IP_BIND_ADDRESS_NO_PORT flag solves the
> > issue. This code is fine:
> >
> > s = socket(SOCK_DGRAM)
> > s.setsockopt(IP_BIND_ADDRESS_NO_PORT)
> > s.bind((192.0.2.1, 0))
> > s.connect((198.18.0.1, 58910))
> >
> > But the IP_BIND_ADDRESS_NO_PORT doesn't work when the source port is
> > selected. It seems natural to expand the scope of
> > IP_BIND_ADDRESS_NO_PORT flag. Perhaps this could be made to work:
> >
> > s = socket(SOCK_DGRAM)
> > s.setsockopt(IP_BIND_ADDRESS_NO_PORT)
> > s.bind((192.0.2.1, 1703))
> > s.connect((198.18.0.1, 58910))
> >
> > I would like such code to delay the binding to port 1703 up until the
> > connect(). IP_BIND_ADDRESS_NO_PORT only makes sense for connected
> > sockets anyway. This raises a couple of questions though:
> >
> >  - IP_BIND_ADDRESS_NO_PORT name is confusing - we specify the port
> > number in the bind!
> >
> >  - Where to store the source port in __inet_bind. Neither
> > inet->inet_sport nor inet->inet_num seem like correct places to store
> > the user-passed source port hint. The alternative is to introduce
> > yet-another field onto inet_sock struct, but that is wasteful.
>
> We've been talking with Marek about it some more. I'll summarize for the
> sake of keeping the discussion open.
>
> 1. inet->inet_sport as storage for port hint
>
>    It seems inet->inet_sport could be used to hold the port passed to
>    bind() when we're delaying port allocation with
>    IP_BIND_ADDRESS_NO_PORT. As long as local port, inet->inet_num, is
>    not set, connect() and sendmsg() will know the socket needs to be
>    bound to a port first.

So bind() might succeed, but connect() could fail later if another
socket binds the port in between?
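
For comparison, a minimal sketch (Python, loopback only, not taken from
any patch) of where that conflict surfaces today: with an explicit port
and neither SO_REUSEADDR nor SO_REUSEPORT set, the second bind() itself
fails with EADDRINUSE. With delayed allocation the same conflict would
presumably only show up at connect(). The helper name second_bind_errno
is made up for illustration.

```python
import errno
import socket


def second_bind_errno():
    """Bind two UDP sockets to the same address/port; return the errno
    raised by the second bind(), or 0 if it unexpectedly succeeds."""
    first = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # Let the kernel pick a free port, then reuse that exact 2-tuple.
        first.bind(("127.0.0.1", 0))
        try:
            second.bind(first.getsockname())
        except OSError as exc:
            # Today the port conflict is reported here, at bind() time.
            return exc.errno
        return 0
    finally:
        first.close()
        second.close()


if __name__ == "__main__":
    print(errno.errorcode.get(second_bind_errno(), "no error"))
```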

Related: I have toyed in the past with unhashed sockets that have
inet_sport set, for a different use case: transmit-only sockets. All
receive processing would happen on a small set (say, one per CPU) of
unconnected listening sockets, while unhashed, connected, transmit-only
sockets would send without a route lookup. But the route caching did
not warrant the cost of maintaining a socket per connection at scale.

>
>    We didn't do a detailed audit of all access sites to
>    inet->inet_sport. Potentially we missed something.
>
>

> 4. Why connected UDP sockets?
>
>    We know that it's better to stick to receiving UDP sockets and
>    demultiplex the client requests/sessions in user-space. Being hashed
>    just by local address & port, connected UDP sockets don't scale well.
>
>    We think there is one useful application, though. Service draining
>    during restarts.
>
>    When a service is being restarted, we would like the dying process to
>    handle the ongoing L7 sessions until they come to an end. New UDP
>    flows should go to a fresh service instance.

Service hand-off is a prime use case for reuseport BPF. With UDP it is
trickier than with TCP: it likely requires a map to store
session-to-process affinity.

>    To achieve that, for each ongoing session we would open a connected
>    UDP socket. This way socket lookup logic would deliver just the flows
>    we care about to the old process.
>
> 5. reuseport BPF with SOCKARRAY to the rescue?
>
>    Since we're talking about opening connected UDP sockets that share
>    the local port with other receiving UDP sockets (owned by another
>    process), we would need to opt for port sharing with REUSEPORT [3].
>
>    If we don't want the connected UDP sockets to receive any traffic
>    during the short window of opportunity when the socket is bound but
>    not connected, we could exclude it from the reuseport group by
>    controlling the socket set with BPF & SOCKARRAY.
>
> Comments and thoughts more than welcome.

If CAP_NET_RAW is no issue, Maciej's suggestion of temporarily binding
to a dummy device (or even lo) might be the simplest approach?
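
For reference, the window that any such workaround has to close can be
reproduced with plain sockets. This is only a sketch on loopback with
kernel-chosen ports, not the addresses from the original report; the
function and variable names are made up for illustration.

```python
import socket


def demo_bind_connect_window():
    """Return (early, late): the datagram read while the socket is bound
    but not yet connected, and the first one read after connect()."""
    rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    peer = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    stranger = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        for s in (rx, peer, stranger):
            s.bind(("127.0.0.1", 0))
        rx.settimeout(2.0)

        # Bound but not connected: any source is accepted.
        stranger.sendto(b"early", rx.getsockname())
        early, _ = rx.recvfrom(64)

        # Once connected, only datagrams from the peer are delivered.
        rx.connect(peer.getsockname())
        stranger.sendto(b"late", rx.getsockname())   # not delivered to rx
        peer.sendto(b"hello", rx.getsockname())
        late, _ = rx.recvfrom(64)
        return early, late
    finally:
        for s in (rx, peer, stranger):
            s.close()


if __name__ == "__main__":
    print(demo_bind_connect_window())
```

The first read returns the stranger's datagram; after connect() the
stranger's second datagram is not delivered and the read returns the
peer's.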

>
> -Jakub
>
> [0] Unless we call it IP_BIND_ADDRESS_NO_PORT_FOR_REAL... ;-)
> [1] https://www.unix.com/man-page/mojave/2/connectx/
> [2] Or REUSEADDR which semantics allow it for unicast UDP.

Thread overview: 7+ messages
2019-11-27 14:07 Delayed source port allocation for connected UDP sockets Marek Majkowski
2019-11-27 16:09 ` Maciej Żenczykowski
2019-11-27 16:18   ` Maciej Żenczykowski
2019-11-27 17:15     ` Marek Majkowski
2019-12-02 10:14 ` Jakub Sitnicki
2019-12-02 16:03   ` Willem de Bruijn [this message]
2019-12-03 14:59     ` Marek Majkowski
