Re: [PATCH bpf-next 11/13] bpf: libbpf: Add STRUCT_OPS support

From: Andrii Nakryiko <andrii.nakryiko@gmail.com>
To: "Toke Høiland-Jørgensen" <toke@redhat.com>
Cc: Martin Lau <kafai@fb.com>, bpf <bpf@vger.kernel.org>,
	Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	David Miller <davem@davemloft.net>,
	Kernel Team <Kernel-team@fb.com>,
	Networking <netdev@vger.kernel.org>
Subject: Re: [PATCH bpf-next 11/13] bpf: libbpf: Add STRUCT_OPS support
Date: Fri, 20 Dec 2019 09:34:42 -0800
Message-ID: <CAEf4Bzaa1p9z_N3=fVfba_Ukck6ch4gUhjR7JN8KuN3A7r_0xw@mail.gmail.com>
In-Reply-To: <87pngj2tf0.fsf@toke.dk>

On Fri, Dec 20, 2019 at 2:16 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>
> > On Thu, Dec 19, 2019 at 12:54 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >>
> >> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
> >>
> >> > On Wed, Dec 18, 2019 at 9:34 AM Martin Lau <kafai@fb.com> wrote:
> >> >>
> >> >> On Wed, Dec 18, 2019 at 08:34:25AM -0800, Andrii Nakryiko wrote:
> >> >> > On Tue, Dec 17, 2019 at 11:03 PM Martin Lau <kafai@fb.com> wrote:
> >> >> > >
> >> >> > > On Tue, Dec 17, 2019 at 07:07:23PM -0800, Andrii Nakryiko wrote:
> >> >> > > > On Fri, Dec 13, 2019 at 4:48 PM Martin KaFai Lau <kafai@fb.com> wrote:
> >> >> > > > >
> >> >> > > > > This patch adds BPF STRUCT_OPS support to libbpf.
> >> >> > > > >
> >> >> > > > > The only sec_name convention is SEC("struct_ops") to identify the
> >> >> > > > > struct ops implemented in BPF, e.g.
> >> >> > > > > SEC("struct_ops")
> >> >> > > > > struct tcp_congestion_ops dctcp = {
> >> >> > > > >         .init           = (void *)dctcp_init,  /* <-- a bpf_prog */
> >> >> > > > >         /* ... some more func ptrs ... */
> >> >> > > > >         .name           = "bpf_dctcp",
> >> >> > > > > };
> >> >> > > > >
> >> >> > > > > In the bpf_object__open phase, libbpf will look for the "struct_ops"
> >> >> > > > > elf section and find out what is the btf-type the "struct_ops" is
> >> >> > > > > implementing.  Note that the btf-type here is referring to
> >> >> > > > > a type in the bpf_prog.o's btf.  It will then collect (through SHT_REL)
> >> >> > > > > where are the bpf progs that the func ptrs are referring to.
> >> >> > > > >
> >> >> > > > > In the bpf_object__load phase, the prepare_struct_ops() will load
> >> >> > > > > the btf_vmlinux and obtain the corresponding kernel's btf-type.
> >> >> > > > > With the kernel's btf-type, it can then set the prog->type,
> >> >> > > > > prog->attach_btf_id and the prog->expected_attach_type.  Thus,
> >> >> > > > > the prog's properties do not rely on its section name.
> >> >> > > > >
> >> >> > > > > Currently, the bpf_prog's btf-type ==> btf_vmlinux's btf-type matching
> >> >> > > > > process is as simple as: member-name match + btf-kind match + size match.
> >> >> > > > > If these matching conditions fail, libbpf will reject.
> >> >> > > > > The current targeting support is "struct tcp_congestion_ops" which
> >> >> > > > > most of its members are function pointers.
> >> >> > > > > The member ordering of the bpf_prog's btf-type can be different from
> >> >> > > > > the btf_vmlinux's btf-type.
> >> >> > > > >
> >> >> > > > > Once the prog's properties are all set,
> >> >> > > > > the libbpf will proceed to load all the progs.
> >> >> > > > >
> >> >> > > > > After that, register_struct_ops() will create a map, finalize the
> >> >> > > > > map-value by populating it with the prog-fd, and then register this
> >> >> > > > > "struct_ops" to the kernel by updating the map-value to the map.
> >> >> > > > >
> >> >> > > > > By default, libbpf does not unregister the struct_ops from the kernel
> >> >> > > > > during bpf_object__close().  It can be changed by setting the new
> >> >> > > > > "unreg_st_ops" in bpf_object_open_opts.
> >> >> > > > >
> >> >> > > > > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> >> >> > > > > ---
> >> >> > > >
> >> >> > > > This looks pretty good to me. The big two things is exposing structops
> >> >> > > > as real struct bpf_map, so that users can interact with it using
> >> >> > > > libbpf APIs, as well as splitting struct_ops map creation and
> >> >> > > > registration. bpf_object__load() should only make sure all maps are
> >> >> > > > created, progs are loaded/verified, but none of BPF program can yet be
> >> >> > > > called. Then attach is the phase where registration happens.
> >> >> > > Thanks for the review.
> >> >> > >
> >> >> > > [ ... ]
> >> >> > >
> >> >> > > > >  static inline __u64 ptr_to_u64(const void *ptr)
> >> >> > > > >  {
> >> >> > > > >         return (__u64) (unsigned long) ptr;
> >> >> > > > > @@ -233,6 +239,32 @@ struct bpf_map {
> >> >> > > > >         bool reused;
> >> >> > > > >  };
> >> >> > > > >
> >> >> > > > > +struct bpf_struct_ops {
> >> >> > > > > +       const char *var_name;
> >> >> > > > > +       const char *tname;
> >> >> > > > > +       const struct btf_type *type;
> >> >> > > > > +       struct bpf_program **progs;
> >> >> > > > > +       __u32 *kern_func_off;
> >> >> > > > > +       /* e.g. struct tcp_congestion_ops in bpf_prog's btf format */
> >> >> > > > > +       void *data;
> >> >> > > > > +       /* e.g. struct __bpf_tcp_congestion_ops in btf_vmlinux's btf
> >> >> > > >
> >> >> > > > Using __bpf_ prefix for this struct_ops-specific types is a bit too
> >> >> > > > generic (e.g., for raw_tp stuff Alexei used btf_trace_). So maybe make
> >> >> > > > it btf_ops_ or btf_structops_?
> >> >> > > Is it a concern on name collision?
> >> >> > >
> >> >> > > The prefix pick is to use a more representative name.
> >> >> > > struct_ops use many bpf pieces and btf is one of them.
> >> >> > > Very soon, all new codes will depend on BTF and btf_ prefix
> >> >> > > could become generic also.
> >> >> > >
> >> >> > > Unlike tracepoint, there is no non-btf version of struct_ops.
> >> >> >
> >> >> > Not so much name collision, as being able to immediately recognize
> >> >> > that it's used to provide type information for struct_ops. Think about
> >> >> > some automated tooling parsing vmlinux BTF and trying to create some
> >> >> > derivative types for those btf_trace_xxx and __bpf_xxx types. Having
> >> >> > unique prefix that identifies what kind of type-providing struct it is
> >> >> > is very useful to do generic tool like that. While __bpf_ isn't
> >> >> > specifying in any ways that it's for struct_ops.
> >> >> >
> >> >> > >
> >> >> > > >
> >> >> > > >
> >> >> > > > > +        * format.
> >> >> > > > > +        * struct __bpf_tcp_congestion_ops {
> >> >> > > > > +        *      [... some other kernel fields ...]
> >> >> > > > > +        *      struct tcp_congestion_ops data;
> >> >> > > > > +        * }
> >> >> > > > > +        * kern_vdata in the sizeof(struct __bpf_tcp_congestion_ops).
> >> >> > > >
> >> >> > > > Comment isn't very clear.. do you mean that data pointed to by
> >> >> > > > kern_vdata is of sizeof(...) bytes?
> >> >> > > >
> >> >> > > > > +        * prepare_struct_ops() will populate the "data" into
> >> >> > > > > +        * "kern_vdata".
> >> >> > > > > +        */
> >> >> > > > > +       void *kern_vdata;
> >> >> > > > > +       __u32 type_id;
> >> >> > > > > +       __u32 kern_vtype_id;
> >> >> > > > > +       __u32 kern_vtype_size;
> >> >> > > > > +       int fd;
> >> >> > > > > +       bool unreg;
> >> >> > > >
> >> >> > > > This unreg flag (and default behavior to not unregister) is bothering
> >> >> > > > me a bit.. Shouldn't this be controlled by map's lifetime, at least.
> >> >> > > > E.g., if no one pins that map - then struct_ops should be unregistered
> >> >> > > > on map destruction. If application wants to keep BPF programs
> >> >> > > > attached, it should make sure to pin map, before userspace part exits?
> >> >> > > > Is this problematic in any way?
> >> >> > > I don't think it should in the struct_ops case.  I think of the
> >> >> > > struct_ops map is a set of progs "attach" to a subsystem (tcp_cong
> >> >> > > in this case) and this map-progs stay (or keep attaching) until it is
> >> >> > > detached.  Like other attached bpf_prog keeps running without
> >> >> > > caring if the bpf_prog is pinned or not.
> >> >> >
> >> >> > I'll let someone else comment on how this behaves for cgroup, xdp,
> >> >> > etc,
> >> >> > but for tracing, for example, we have FD-based BPF links, which
> >> >> > will detach program automatically when FD is closed. I think the idea
> >> >> > is to extend this to other types of BPF programs as well, so there is
> >> >> > no risk of leaving some stray BPF program running after unintended
> >> >> Like xdp_prog, struct_ops does not have another fd-based-link.
> >> >> This link can be created for struct_ops, xdp_prog and others later.
> >> >> I don't see a conflict here.
> >> >
> >> > My point was that default behavior should be conservative: free up
> >> > resources automatically on process exit, unless specifically pinned by
> >> > user.
> >> > But this discussion made me realize that we miss one thing from
> >> > general bpf_link framework. See below.
> >> >
> >> >>
> >> >> > crash of userspace program. When application explicitly needs BPF
> >> >> > program to outlive its userspace control app, then this can be
> >> >> > achieved by pinning map/program in BPFFS.
> >> >> If the concern is about not leaving struct_ops behind,
> >> >> lets assume there is no "detach" and only depends on the very
> >> >> last userspace's handles (FD/pinned) of a map goes away,
> >> >> what may be an easy way to remove bpf_cubic from the system:
> >> >
> >> > Yeah, I think this "last map FD close frees up resources/detaches" is
> >> > a good behavior.
> >> >
> >> > Where we do have problem is with bpf_link__destroy() unconditionally
> >> > also detaching whatever was attached (tracepoint, kprobe, or whatever
> >> > was done to create bpf_link in the first place). Now,
> >> > bpf_link__destroy() has to be called by user (or skeleton) to at least
> >> > free up malloc()'ed structs. But it appears that it's not always
> >> > desirable that upon bpf_link destruction underlying BPF program gets
> >> > detached. I think this will be the case for xdp and others as well.
> >>
> >> For XDP the model has thus far been "once attached, the program stays
> >> until explicitly detached". Changing that would certainly be surprising,
> >> so I agree that splitting the API is best (not that I'm sure how many
> >> XDP programs will end up using that API, but that's a different
> >> concern)...
> >
> > This would be a new FD-based API for XDP, I don't think we can change
> > existing API. But I think default behavior should still be to
> > auto-detach, unless explicitly "pinned" in whatever way. That would
> > prevent surprising "leakage" of BPF programs for unsuspecting users.
>
> But why do we need a new API for attaching XDP programs? Also, what are
> the use cases where it makes sense to have this kind of "transient" XDP
> program? The only one I can think about is something like xdpdump, which

During development, for instance, when your buggy userspace program
crashes? I think by default all those attached BPF programs should be
auto-detachable, if possible. That's the direction that worked out
really well with kprobes/tracepoints/perf_events. Previously, using
old APIs, you'd attach kprobe and if userspace doesn't clean up, that
kprobe would stay attached in the system, consuming resources without
users noticing this (which is especially critical in production).
Switching to auto-detachable FD-based interface greatly improved that
experience. I think this is a good model going forward.
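
To make the lifetime rule concrete, here is a toy model of it. This is
only a sketch: the toy_* names are invented for illustration and are
not kernel or libbpf APIs. The attachment stays live while at least one
FD or a pin references it, and a crashing process is modeled as the
kernel closing its FDs.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model (hypothetical): an attachment lives while it is referenced. */
struct toy_attachment {
	int open_fds;	/* FDs held by userspace processes */
	bool pinned;	/* true if pinned, as one would pin in BPFFS */
	bool attached;	/* is the BPF program still hooked up? */
};

/* Detach once neither an FD nor a pin keeps the attachment alive. */
static void toy_release_if_unreferenced(struct toy_attachment *a)
{
	if (a->open_fds == 0 && !a->pinned)
		a->attached = false;
}

/* Called on explicit close, or implicitly when the owning process dies. */
static void toy_fd_close(struct toy_attachment *a)
{
	assert(a->open_fds > 0);
	a->open_fds--;
	toy_release_if_unreferenced(a);
}

/* Removing the pin drops the last reference if no FD is left. */
static void toy_unpin(struct toy_attachment *a)
{
	a->pinned = false;
	toy_release_if_unreferenced(a);
}
```

With one FD and no pin, toy_fd_close() detaches. With a pin, the
attachment survives the close and goes away only on toy_unpin().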

In practice, for production use cases, it will be just a trivial piece
of code to keep it attached:

struct bpf_link *xdp_link = bpf_program__attach_xdp(...);

/* now if userspace program crashes, xdp BPF program will stay connected */
bpf_link__disconnect(xdp_link);
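
The destroy-vs-disconnect split can be sketched with a toy model. The
toy_* types and functions below are made up for illustration and are
not the libbpf implementation, and only the explicit destroy path is
modeled. Destroy always frees the userspace struct, but detaches only
if the link was not disconnected first.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Toy stand-in for a kernel attachment point (e.g. an XDP hook). */
struct toy_hook {
	bool attached;
};

/* Toy stand-in for a link object; not the libbpf struct bpf_link. */
struct toy_link {
	struct toy_hook *hook;
	bool disconnected;
};

/* Attach to the hook and hand back a heap-allocated link. */
static struct toy_link *toy_attach(struct toy_hook *hook)
{
	struct toy_link *link = calloc(1, sizeof(*link));

	if (!link)
		return NULL;
	hook->attached = true;
	link->hook = hook;
	return link;
}

/* Sever the link from the attachment so destroy will not detach it. */
static void toy_link_disconnect(struct toy_link *link)
{
	assert(link);
	link->disconnected = true;
}

/* Always frees the struct; detaches only if still connected. */
static void toy_link_destroy(struct toy_link *link)
{
	if (!link)
		return;
	if (!link->disconnected)
		link->hook->attached = false;
	free(link);
}
```

So the default stays conservative (destroy detaches), and keeping the
program attached is a single explicit call before destroy.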

> moves packets to userspace (and should stop doing that when the
> userspace listener goes away). But with bpf-to-bpf tracing, xdpdump
> won't actually be an XDP program, so what's left? The system firewall
> rules don't go away when the program that installed them exits either;
> why should an XDP program?

See above, I'm not saying that it shouldn't be possible to keep it
attached. I'm just arguing it's not a good default, because it can
catch developers off guard and cause problems, especially in
production environments. In the end, it is a resource leak, unless you
want and expect it.

>
> -Toke
>