bpf.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Toke Høiland-Jørgensen" <toke@redhat.com>
To: Daniel Borkmann <daniel@iogearbox.net>,
	Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Alexei Starovoitov <ast@fb.com>,
	Andrii Nakryiko <andrii.nakryiko@gmail.com>,
	Andrii Nakryiko <andriin@fb.com>, bpf <bpf@vger.kernel.org>,
	Networking <netdev@vger.kernel.org>,
	Kernel Team <kernel-team@fb.com>
Subject: Re: [PATCH bpf-next 0/3] Introduce pinnable bpf_link kernel abstraction
Date: Fri, 06 Mar 2020 11:42:31 +0100	[thread overview]
Message-ID: <871rq53gtk.fsf@toke.dk> (raw)
In-Reply-To: <374e23b6-572a-8dac-88cb-855069535917@iogearbox.net>

Daniel Borkmann <daniel@iogearbox.net> writes:

> On 3/6/20 9:31 AM, Toke Høiland-Jørgensen wrote:
>> Daniel Borkmann <daniel@iogearbox.net> writes:
>>> On 3/5/20 11:50 PM, Alexei Starovoitov wrote:
>>>> On Thu, Mar 05, 2020 at 11:34:18PM +0100, Daniel Borkmann wrote:
>>>>> On 3/5/20 5:34 PM, Alexei Starovoitov wrote:
>>>>>> On Thu, Mar 05, 2020 at 11:37:11AM +0100, Toke Høiland-Jørgensen wrote:
>>>>>>> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
>>>>>>>> On Wed, Mar 04, 2020 at 08:47:44AM +0100, Toke Høiland-Jørgensen wrote:
>>>>> [...]
>>>>>>> Anyway, what I was trying to express:
>>>>>>>
>>>>>>>> Still that doesn't mean that pinned link is 'immutable'.
>>>>>>>
>>>>>>> I don't mean 'immutable' in the sense that it cannot be removed ever.
>>>>>>> Just that we may end up in a situation where an application can see a
>>>>>>> netdev with an XDP program attached, has the right privileges to modify
>>>>>>> it, but can't because it can't find the pinned bpf_link. Right? Or am I
>>>>>>> misunderstanding your proposal?
>>>>>>>
>>>>>>> Amending my example from before, this could happen by:
>>>>>>>
>>>>>>> 1. Someone attaches a program to eth0, and pins the bpf_link to
>>>>>>>       /sys/fs/bpf/myprog
>>>>>>>
>>>>>>> 2. eth0 is moved to a different namespace which mounts a new sysfs at
>>>>>>>       /sys
>>>>>>>
>>>>>>> 3. Inside that namespace, /sys/fs/bpf/myprog is no longer accessible, so
>>>>>>>       xdp-loader can't get access to the original bpf_link; but the XDP
>>>>>>>       program is still attached to eth0.
>>>>>>
>>>>>> The key to decide is whether moving netdev across netns should be allowed
>>>>>> when xdp attached. I think it should be denied. Even when legacy xdp
>>>>>> program is attached, since it will confuse user space managing part.
>>>>>
>>>>> There are perfectly valid use cases where this is done already today (minus
>>>>> bpf_link), for example, consider an orchestrator that is setting up the BPF
>>>>> program on the device, moving to the newly created application pod during
>>>>> the CNI call in k8s, such that the new pod does not have the /sys/fs/bpf/
>>>>> mount instance and if unprivileged cannot remove the BPF prog from the dev
>>>>> either. We do something like this in case of ipvlan, meaning, we attach a
>>>>> rootlet prog that calls into single slot of a tail call map, move it to the
>>>>> application pod, and only out of Cilium's own pod and it's pod-local bpf fs
>>>>> instance we manage the pinned tail call map to update the main programs in
>>>>> that single slot w/o having to switch any netns later on.
>>>>
>>>> Right. You mentioned this use case before, but I managed to forget about it.
>>>> Totally makes sense for prog to stay attached to netdev when it's moved.
>>>> I think pod manager would also prefer that pod is not able to replace
>>>> xdp prog from inside the container. It sounds to me that steps 1,2,3 above
>>>> is exactly the desired behavior. Otherwise what stops some application
>>>> that started in a pod to override it?
>>>
>>> Generally, yes, and it shouldn't need to care nor see what is happening in
>>> /sys/fs/bpf/ from the orchestrator at least (or could potentially have its
>>> own private mount under /sys/fs/bpf/ or elsewhere). Ideally, the behavior
>>> should be that orchestrator does all the setup out of its own namespace,
>>> then moves everything over to the newly created target namespace and e.g.
>>> only if the pod has the capable(cap_sys_admin) permissions, it could mess
>>> around with anything attached there, or via similar model as done in [0]
>>> when there is a master device.
>> 
>> Yup, I can see how this can be a reasonable use case where you *would*
>> want the locking. However, my concern is that there should be a way for
>> an admin to recover from this (say, if it happens by mistake, or a
>> misbehaving application). Otherwise, I fear we'll end up with support
>> cases where the only answer is "try rebooting", which is obviously not
>> ideal.
>
> I'm not quite sure I follow the concern, if you're an admin and have
> the right permissions, then you should be able to introspect and
> change settings like with anything else there is today.

Well, that's what I want to make sure of :)

However, I don't think such introspection is possible today? Or at least
there's no API exposed to do this, you'll have to go write drgn scripts
or something. But I expect an admin will want a command like 'xdp unload
eth0 --yes-i-really-really-mean-this', which would override any locking
done by bpf_link. So how to implement that? It's not enough to learn
'this bpf_link is pinned at links/id-123 on bpffs', you'll also need to
learn on *which* bpffs, and where to find that mountpoint. So how do you
do that?

Whereas an API that says 'return the bpf_link currently attached to
ifindex X' would sidestep this issue; but then, that is basically the
netlink API we have already, except it doesn't have the bpf_link
abstraction... So why do we need bpf_link?

>>> Last time I looked, there is a down/up cycle on the dev upon netns
>>> migration and it flushes e.g. attached qdiscs afaik, so there are
>>> limitations that still need to be addressed. Not sure atm if same is
>>> happening to XDP right now.
>> 
>> XDP programs will stay attached. devmaps (for redirect) have a notifier
>> that will remove devices when they move out of a namespace. Not sure if
>> there are any other issues with netns moves somewhere.
>> 
>>> In this regards veth devices are a bit nicer to work with since
>>> everything can be attached on hostns ingress w/o needing to worry on
>>> the peer dev in the pod's netns.
>> 
>> Presumably the XDP EGRESS hook that David Ahern is working on will make
>> this doable for XDP on veth as well?
>
> I'm not sure I see a use-case for XDP egress for Cilium yet, but maybe
> I'm still lacking a clear picture on why one should use it. We
> currently use various layers where we orchestrate our BPF programs
> from the agent. XDP/rx on the phys nic on the one end, BPF sock progs
> attached to cgroups on the other end of the spectrum. The processing
> in between on virtual devices is mainly tc/BPF-based since everything
> is skb based anyway and more flexible also with interaction with the
> rest of the stack. There is also not this pain of having to linearize
> all the skbs, but at least there's a path to tackle that.

I agree that there's not really any reason to use XDP on veth as long as
you'll end up with an skb eventually anyway. The only reason to do
something different I can think of is if you have an application inside
a container using AF_XDP, and you want to carry XDP frames all the way
through to that without ever building an skb. Not really sure this is a
good deployment model, but I kinda suspect that the NFV guys will want
to do this eventually...

-Toke


  reply	other threads:[~2020-03-09  9:42 UTC|newest]

Thread overview: 50+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-02-28 22:39 [PATCH bpf-next 0/3] Introduce pinnable bpf_link kernel abstraction Andrii Nakryiko
2020-02-28 22:39 ` [PATCH bpf-next 1/3] bpf: introduce pinnable bpf_link abstraction Andrii Nakryiko
2020-03-02 10:13   ` Toke Høiland-Jørgensen
2020-03-02 18:06     ` Andrii Nakryiko
2020-03-02 21:40       ` Toke Høiland-Jørgensen
2020-03-02 23:37         ` Andrii Nakryiko
2020-03-03  2:50   ` Alexei Starovoitov
2020-03-03  4:18     ` Andrii Nakryiko
2020-02-28 22:39 ` [PATCH bpf-next 2/3] libbpf: add bpf_link pinning/unpinning Andrii Nakryiko
2020-03-02 10:16   ` Toke Høiland-Jørgensen
2020-03-02 18:09     ` Andrii Nakryiko
2020-03-02 21:45       ` Toke Høiland-Jørgensen
2020-02-28 22:39 ` [PATCH bpf-next 3/3] selftests/bpf: add link pinning selftests Andrii Nakryiko
2020-03-02 10:11 ` [PATCH bpf-next 0/3] Introduce pinnable bpf_link kernel abstraction Toke Høiland-Jørgensen
2020-03-02 18:05   ` Andrii Nakryiko
2020-03-02 22:24     ` Toke Høiland-Jørgensen
2020-03-02 23:35       ` Andrii Nakryiko
2020-03-03  8:12         ` Toke Høiland-Jørgensen
2020-03-03  8:12       ` Daniel Borkmann
2020-03-03 15:46         ` Alexei Starovoitov
2020-03-03 19:23           ` Daniel Borkmann
2020-03-03 19:46             ` Andrii Nakryiko
2020-03-03 20:24               ` Toke Høiland-Jørgensen
2020-03-03 20:53                 ` Daniel Borkmann
2020-03-03 22:01                   ` Alexei Starovoitov
2020-03-03 22:27                     ` Toke Høiland-Jørgensen
2020-03-04  4:36                       ` Alexei Starovoitov
2020-03-04  7:47                         ` Toke Høiland-Jørgensen
2020-03-04 15:47                           ` Alexei Starovoitov
2020-03-05 10:37                             ` Toke Høiland-Jørgensen
2020-03-05 16:34                               ` Alexei Starovoitov
2020-03-05 22:34                                 ` Daniel Borkmann
2020-03-05 22:50                                   ` Alexei Starovoitov
2020-03-05 23:42                                     ` Daniel Borkmann
2020-03-06  8:31                                       ` Toke Høiland-Jørgensen
2020-03-06 10:25                                         ` Daniel Borkmann
2020-03-06 10:42                                           ` Toke Høiland-Jørgensen [this message]
2020-03-06 18:09                                           ` David Ahern
2020-03-04 19:41                         ` Jakub Kicinski
2020-03-04 20:45                           ` Alexei Starovoitov
2020-03-04 21:24                             ` Jakub Kicinski
2020-03-05  1:07                               ` Alexei Starovoitov
2020-03-05  8:16                                 ` Jakub Kicinski
2020-03-05 11:05                                   ` Toke Høiland-Jørgensen
2020-03-05 18:13                                     ` Jakub Kicinski
2020-03-09 11:41                                       ` Toke Høiland-Jørgensen
2020-03-09 18:50                                         ` Jakub Kicinski
2020-03-10 12:22                                           ` Toke Høiland-Jørgensen
2020-03-05 16:39                                   ` Alexei Starovoitov
2020-03-03 22:40                 ` Jakub Kicinski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=871rq53gtk.fsf@toke.dk \
    --to=toke@redhat.com \
    --cc=alexei.starovoitov@gmail.com \
    --cc=andrii.nakryiko@gmail.com \
    --cc=andriin@fb.com \
    --cc=ast@fb.com \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=kernel-team@fb.com \
    --cc=netdev@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).