From: Lorenz Bauer <lmb@cloudflare.com>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	kernel-team <kernel-team@cloudflare.com>,
	Networking <netdev@vger.kernel.org>, bpf <bpf@vger.kernel.org>
Subject: Re: [PATCH 0/5] Return fds from privileged sockhash/sockmap lookup
Date: Thu, 12 Mar 2020 09:16:34 +0000
Message-ID: <CACAyw9-Ui5FECjAaehP8raRjcRJVx2nQAj5=XPu=zXME2acMhg@mail.gmail.com>
In-Reply-To: <20200312015822.bhu6ptkx5jpabkr6@ast-mbp.dhcp.thefacebook.com>

On Thu, 12 Mar 2020 at 01:58, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> we do store the socket FD into a sockmap, but returning new FD to that socket
> feels weird. The user space suppose to hold those sockets. If it was bpf prog
> that stored a socket then what does user space want to do with that foreign
> socket? It likely belongs to some other process. Stealing it from other process
> doesn't feel right.

For our BPF socket dispatch control plane this is true by design: all sockets
belong to another process. The privileged user space is the steward of these,
and needs to make sure traffic is steered to them. I agree that stealing them is
weird, but after all this is CAP_NET_ADMIN only. pidfd_getfd allows you to
really steal an fd from another process, so that cat is out of the bag ;)

Marek wrote a PoC control plane: https://github.com/majek/inet-tool
It is a CLI tool and not a service, so it can't hold on to any sockets.

You can argue that we should turn it into a service, but that leads to another
problem: there is no way of recovering these fds if the service crashes for
some reason. The only solution would be to restart all services, which in
our setup is effectively the same as rebooting the machine.

> Sounds like the use case is to take sockets one by one from one map, allocate
> another map and store them there? The whole process has plenty of races.

It doesn't have to race. Our user space can do the appropriate locking to ensure
that operations are atomic with respect to dispatching to sockets:

- lock
- read sockets from sockmap
- write sockets into new sockmap
- create new instance of BPF socket dispatch program
- attach BPF socket dispatch program
- remove old map
- unlock
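
As a rough sketch of the sequence above (a toy model, not our actual control
plane): plain dicts stand in for the sockmaps, a `Dispatcher` object stands in
for the attached BPF socket dispatch program, and flock(2) on a lock file is
just one possible locking scheme. Every name here is hypothetical.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def control_plane_lock(path):
    """Hold an exclusive advisory lock on `path` for the duration of the block."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)   # lock
        yield
    finally:
        os.close(fd)                     # unlock: closing the fd drops the flock


class Dispatcher:
    """Hypothetical stand-in for the attached dispatch program and its map."""

    def __init__(self, sockmap):
        self.sockmap = sockmap


def migrate(dispatcher, lock_path):
    """Move every socket into a fresh map while holding the control plane lock."""
    with control_plane_lock(lock_path):
        old = dispatcher.sockmap
        new = {}
        for key, sock in old.items():    # read sockets from sockmap
            new[key] = sock              # write sockets into new sockmap
        dispatcher.sockmap = new         # create and attach new program instance
        old.clear()                      # remove old map
    return new
```

Any other control plane operation would take the same lock, so it never
observes the half-populated new map.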

> I think it's better to tackle the problem from resize perspective. imo making it
> something like sk_local_storage (which is already resizable pseudo map of
> sockets) is a better way forward.

Resizing is only one aspect. We may also need to shuffle services around,
think "defragmentation", and I think there will be other cases as we gain more
experience with the control plane. Being able to recover fds from the sockmap
will make it more resilient. Adding a special API for every one of these cases
seems cumbersome.


Lorenz Bauer  |  Systems Engineer
6th Floor, County Hall/The Riverside Building, SE1 7PB, UK


Thread overview: 28+ messages
2020-03-10 17:47 [PATCH 0/5] Return fds from privileged sockhash/sockmap lookup Lorenz Bauer
2020-03-10 17:47 ` [PATCH 1/5] bpf: add map_copy_value hook Lorenz Bauer
2020-03-10 17:47 ` [PATCH 2/5] bpf: convert queue and stack map to map_copy_value Lorenz Bauer
2020-03-11 14:00   ` Jakub Sitnicki
2020-03-11 22:31     ` John Fastabend
2020-03-10 17:47 ` [PATCH 3/5] bpf: convert sock map and hash " Lorenz Bauer
2020-03-11 13:55   ` Jakub Sitnicki
2020-03-10 17:47 ` [PATCH 4/5] bpf: sockmap, sockhash: return file descriptors from privileged lookup Lorenz Bauer
2020-03-11 23:27   ` John Fastabend
2020-03-17 10:17     ` Lorenz Bauer
2020-03-17 15:18   ` Jakub Sitnicki
2020-03-17 18:16     ` John Fastabend
2020-03-10 17:47 ` [PATCH 5/5] bpf: sockmap, sockhash: test looking up fds Lorenz Bauer
2020-03-11 13:52   ` Jakub Sitnicki
2020-03-11 17:24     ` Lorenz Bauer
2020-03-11 13:44 ` [PATCH 0/5] Return fds from privileged sockhash/sockmap lookup Jakub Sitnicki
2020-03-11 22:40   ` John Fastabend
2020-03-12  1:58 ` Alexei Starovoitov
2020-03-12  9:16   ` Lorenz Bauer [this message]
2020-03-12 17:58     ` Alexei Starovoitov
2020-03-12 19:32       ` John Fastabend
2020-03-13 11:03         ` Lorenz Bauer
2020-03-13 10:48       ` Lorenz Bauer
2020-03-14  2:58         ` Alexei Starovoitov
2020-03-17  9:55           ` Lorenz Bauer
2020-03-17 19:05             ` John Fastabend
2020-03-20 15:12               ` Lorenz Bauer
2020-04-07  3:08                 ` Alexei Starovoitov
