netdev.vger.kernel.org archive mirror
From: Andrii Nakryiko <andrii.nakryiko@gmail.com>
To: Martin Lau <kafai@fb.com>
Cc: bpf <bpf@vger.kernel.org>, Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	David Miller <davem@davemloft.net>,
	Kernel Team <Kernel-team@fb.com>,
	Networking <netdev@vger.kernel.org>
Subject: Re: [PATCH bpf-next 11/13] bpf: libbpf: Add STRUCT_OPS support
Date: Wed, 18 Dec 2019 10:14:04 -0800
Message-ID: <CAEf4BzboyRio_KaQtd2eOqmH+x0FPfYp_CDfnUzv4H698j_wsQ@mail.gmail.com>
In-Reply-To: <20191218173350.nll5766abgkptjac@kafai-mbp.dhcp.thefacebook.com>

On Wed, Dec 18, 2019 at 9:34 AM Martin Lau <kafai@fb.com> wrote:
>
> On Wed, Dec 18, 2019 at 08:34:25AM -0800, Andrii Nakryiko wrote:
> > On Tue, Dec 17, 2019 at 11:03 PM Martin Lau <kafai@fb.com> wrote:
> > >
> > > On Tue, Dec 17, 2019 at 07:07:23PM -0800, Andrii Nakryiko wrote:
> > > > On Fri, Dec 13, 2019 at 4:48 PM Martin KaFai Lau <kafai@fb.com> wrote:
> > > > >
> > > > > This patch adds BPF STRUCT_OPS support to libbpf.
> > > > >
> > > > > The only sec_name convention is SEC("struct_ops") to identify the
> > > > > struct ops implemented in BPF, e.g.
> > > > > SEC("struct_ops")
> > > > > struct tcp_congestion_ops dctcp = {
> > > > >         .init           = (void *)dctcp_init,  /* <-- a bpf_prog */
> > > > >         /* ... some more func prts ... */
> > > > >         .name           = "bpf_dctcp",
> > > > > };
> > > > >
> > > > > In the bpf_object__open phase, libbpf will look for the "struct_ops"
> > > > > elf section and find out what is the btf-type the "struct_ops" is
> > > > > implementing.  Note that the btf-type here is referring to
> > > > > a type in the bpf_prog.o's btf.  It will then collect (through SHT_REL)
> > > > > where are the bpf progs that the func ptrs are referring to.
> > > > >
> > > > > In the bpf_object__load phase, the prepare_struct_ops() will load
> > > > > the btf_vmlinux and obtain the corresponding kernel's btf-type.
> > > > > With the kernel's btf-type, it can then set the prog->type,
> > > > > prog->attach_btf_id and the prog->expected_attach_type.  Thus,
> > > > > the prog's properties do not rely on its section name.
> > > > >
> > > > > Currently, the bpf_prog's btf-type ==> btf_vmlinux's btf-type matching
> > > > > process is as simple as: member-name match + btf-kind match + size match.
> > > > > If these matching conditions fail, libbpf will reject.
> > > > > The current targeting support is "struct tcp_congestion_ops" which
> > > > > most of its members are function pointers.
> > > > > The member ordering of the bpf_prog's btf-type can be different from
> > > > > the btf_vmlinux's btf-type.
> > > > >
> > > > > Once the prog's properties are all set,
> > > > > the libbpf will proceed to load all the progs.
> > > > >
> > > > > After that, register_struct_ops() will create a map, finalize the
> > > > > map-value by populating it with the prog-fd, and then register this
> > > > > "struct_ops" to the kernel by updating the map-value to the map.
> > > > >
> > > > > By default, libbpf does not unregister the struct_ops from the kernel
> > > > > during bpf_object__close().  It can be changed by setting the new
> > > > > "unreg_st_ops" in bpf_object_open_opts.
> > > > >
> > > > > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
> > > > > ---
> > > >
> > > > This looks pretty good to me. The big two things is exposing structops
> > > > as real struct bpf_map, so that users can interact with it using
> > > > libbpf APIs, as well as splitting struct_ops map creation and
> > > > registration. bpf_object__load() should only make sure all maps are
> > > > created, progs are loaded/verified, but none of BPF program can yet be
> > > > called. Then attach is the phase where registration happens.
> > > Thanks for the review.
> > >
> > > [ ... ]
> > >
> > > > >  static inline __u64 ptr_to_u64(const void *ptr)
> > > > >  {
> > > > >         return (__u64) (unsigned long) ptr;
> > > > > @@ -233,6 +239,32 @@ struct bpf_map {
> > > > >         bool reused;
> > > > >  };
> > > > >
> > > > > +struct bpf_struct_ops {
> > > > > +       const char *var_name;
> > > > > +       const char *tname;
> > > > > +       const struct btf_type *type;
> > > > > +       struct bpf_program **progs;
> > > > > +       __u32 *kern_func_off;
> > > > > +       /* e.g. struct tcp_congestion_ops in bpf_prog's btf format */
> > > > > +       void *data;
> > > > > +       /* e.g. struct __bpf_tcp_congestion_ops in btf_vmlinux's btf
> > > >
> > > > Using __bpf_ prefix for this struct_ops-specific types is a bit too
> > > > generic (e.g., for raw_tp stuff Alexei used btf_trace_). So maybe make
> > > > it btf_ops_ or btf_structops_?
> > > Is it a concern on name collision?
> > >
> > > The prefix pick is to use a more representative name.
> > > struct_ops use many bpf pieces and btf is one of them.
> > > Very soon, all new codes will depend on BTF and btf_ prefix
> > > could become generic also.
> > >
> > > Unlike tracepoint, there is no non-btf version of struct_ops.
> >
> > Not so much name collision, as being able to immediately recognize
> > that it's used to provide type information for struct_ops. Think about
> > some automated tooling parsing vmlinux BTF and trying to create some
> > derivative types for those btf_trace_xxx and __bpf_xxx types. Having
> > unique prefix that identifies what kind of type-providing struct it is
> > is very useful to do generic tool like that. While __bpf_ isn't
> > specifying in any ways that it's for struct_ops.
> >
> > >
> > > >
> > > >
> > > > > +        * format.
> > > > > +        * struct __bpf_tcp_congestion_ops {
> > > > > +        *      [... some other kernel fields ...]
> > > > > +        *      struct tcp_congestion_ops data;
> > > > > +        * }
> > > > > +        * kern_vdata in the sizeof(struct __bpf_tcp_congestion_ops).
> > > >
> > > > Comment isn't very clear.. do you mean that data pointed to by
> > > > kern_vdata is of sizeof(...) bytes?
> > > >
> > > > > +        * prepare_struct_ops() will populate the "data" into
> > > > > +        * "kern_vdata".
> > > > > +        */
> > > > > +       void *kern_vdata;
> > > > > +       __u32 type_id;
> > > > > +       __u32 kern_vtype_id;
> > > > > +       __u32 kern_vtype_size;
> > > > > +       int fd;
> > > > > +       bool unreg;
> > > >
> > > > This unreg flag (and default behavior to not unregister) is bothering
> > > > me a bit.. Shouldn't this be controlled by map's lifetime, at least.
> > > > E.g., if no one pins that map - then struct_ops should be unregistered
> > > > on map destruction. If application wants to keep BPF programs
> > > > attached, it should make sure to pin map, before userspace part exits?
> > > > Is this problematic in any way?
> > > I don't think it should in the struct_ops case.  I think of the
> > > struct_ops map is a set of progs "attach" to a subsystem (tcp_cong
> > > in this case) and this map-progs stay (or keep attaching) until it is
> > > detached.  Like other attached bpf_prog keeps running without
> > > caring if the bpf_prog is pinned or not.
> >
> > I'll let someone else comment on how this behaves for cgroup, xdp,
> > etc,
> > but for tracing, for example, we have FD-based BPF links, which
> > will detach program automatically when FD is closed. I think the idea
> > is to extend this to other types of BPF programs as well, so there is
> > no risk of leaving some stray BPF program running after unintended
> Like xdp_prog, struct_ops does not have another fd-based-link.
> This link can be created for struct_ops, xdp_prog and others later.
> I don't see a conflict here.

My point was that the default behavior should be conservative: free up
resources automatically on process exit, unless specifically pinned by
the user.
But this discussion made me realize that we are missing one thing in
the general bpf_link framework. See below.

>
> > crash of userspace program. When application explicitly needs BPF
> > program to outlive its userspace control app, then this can be
> > achieved by pinning map/program in BPFFS.
> If the concern is about not leaving struct_ops behind,
> lets assume there is no "detach" and only depends on the very
> last userspace's handles (FD/pinned) of a map goes away,
> what may be an easy way to remove bpf_cubic from the system:

Yeah, I think this "last map FD close frees up resources/detaches" is
good behavior.

Where we do have a problem is with bpf_link__destroy() unconditionally
also detaching whatever was attached (tracepoint, kprobe, or whatever
was done to create the bpf_link in the first place). Now,
bpf_link__destroy() has to be called by the user (or skeleton) to at
least free up malloc()'ed structs. But it appears that it's not always
desirable for the underlying BPF program to get detached upon bpf_link
destruction. I think this will be the case for xdp and others as well.

I think a good and generic way to go about this is to have a general
concept of destroying the link without detaching BPF programs.
E.g., what if we have a new API call `void bpf_link__unlink()`, which
marks the link as not requiring detachment of the underlying BPF
program? When bpf_link__destroy() is called later, it will just free
the resources allocated to maintain the bpf_link itself, but won't
detach any BPF programs/resources.
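
To make the idea concrete, here is a rough sketch in plain C. It does
not use libbpf's real types: struct demo_link, its fields, and all of
the demo_* functions are made-up names for illustration only, and the
"attached" int merely stands in for kernel-side state. The only part
that mirrors the proposal is the flow: unlink marks the link, and
destroy then skips the detach step but still frees the memory.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a link object; NOT libbpf's struct bpf_link. */
struct demo_link {
	int (*detach)(struct demo_link *link); /* undoes the attachment */
	bool unlinked;                         /* set by demo_link__unlink() */
	int *attached;                         /* stand-in for kernel-side state */
};

/* Pretend detach: clears the flag that models "program is attached". */
static int demo_detach(struct demo_link *link)
{
	*link->attached = 0;
	return 0;
}

/* Pretend attach: allocates a link and marks the program as attached. */
static struct demo_link *demo_link__new(int *attached)
{
	struct demo_link *link = calloc(1, sizeof(*link));

	if (!link)
		return NULL;
	link->detach = demo_detach;
	link->attached = attached;
	*attached = 1;
	return link;
}

/* Proposed semantics: mark the link so that destroy will not detach. */
static void demo_link__unlink(struct demo_link *link)
{
	link->unlinked = true;
}

/* Always frees the link; detaches only if the link was not unlinked. */
static int demo_link__destroy(struct demo_link *link)
{
	int err = 0;

	if (!link)
		return 0;
	if (!link->unlinked && link->detach)
		err = link->detach(link);
	free(link);
	return err;
}
```

With this shape, a caller that wants the program to stay attached calls
demo_link__unlink() before demo_link__destroy(); everyone else keeps
today's detach-on-destroy behavior without any change.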

With this, the user will have to explicitly specify that they don't
want to detach even when the skeleton/link is destroyed. If we get
consensus on this, I can add support for this to all the existing
bpf_links and you can build on that?

>
> [root@arch-fb-vm1 bpf]# sysctl -a | egrep congestion
>     net.ipv4.tcp_allowed_congestion_control = reno cubic bpf_cubic
>     net.ipv4.tcp_available_congestion_control = reno bic cubic bpf_cubic
>     net.ipv4.tcp_congestion_control = bpf_cubic
>
> >
> > >
> > > About the "bool unreg;", the default can be changed to true if
> > > it makes more sense.
> > >
