bpf.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Daniel Borkmann <daniel@iogearbox.net>
To: Martin Lau <kafai@fb.com>
Cc: "bpf@vger.kernel.org" <bpf@vger.kernel.org>,
	Alexei Starovoitov <ast@kernel.org>,
	David Miller <davem@davemloft.net>,
	Kernel Team <Kernel-team@fb.com>,
	"netdev@vger.kernel.org" <netdev@vger.kernel.org>
Subject: Re: [PATCH bpf-next v3 06/11] bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS
Date: Wed, 8 Jan 2020 17:53:50 +0100	[thread overview]
Message-ID: <20200108165350.GA7014@linux-3.fritz.box> (raw)
In-Reply-To: <20200108015223.sdecaqnjeconwpgq@kafai-mbp>

On Wed, Jan 08, 2020 at 01:52:26AM +0000, Martin Lau wrote:
> On Wed, Jan 08, 2020 at 01:21:39AM +0100, Daniel Borkmann wrote:
> > On 12/31/19 7:20 AM, Martin KaFai Lau wrote:
> > > The patch introduces BPF_MAP_TYPE_STRUCT_OPS.  The map value
> > > is a kernel struct with its func ptr implemented in bpf prog.
> > > This new map is the interface to register/unregister/introspect
> > > a bpf implemented kernel struct.
> > > 
> > > The kernel struct is actually embedded inside another new struct
> > > (or called the "value" struct in the code).  For example,
> > > "struct tcp_congestion_ops" is embbeded in:
> > > struct bpf_struct_ops_tcp_congestion_ops {
> > > 	refcount_t refcnt;
> > > 	enum bpf_struct_ops_state state;
> > > 	struct tcp_congestion_ops data;  /* <-- kernel subsystem struct here */
> > > }
> > > The map value is "struct bpf_struct_ops_tcp_congestion_ops".
> > > The "bpftool map dump" will then be able to show the
> > > state ("inuse"/"tobefree") and the number of subsystem's refcnt (e.g.
> > > number of tcp_sock in the tcp_congestion_ops case).  This "value" struct
> > > is created automatically by a macro.  Having a separate "value" struct
> > > will also make extending "struct bpf_struct_ops_XYZ" easier (e.g. adding
> > > "void (*init)(void)" to "struct bpf_struct_ops_XYZ" to do some
> > > initialization works before registering the struct_ops to the kernel
> > > subsystem).  The libbpf will take care of finding and populating the
> > > "struct bpf_struct_ops_XYZ" from "struct XYZ".
> > > 
> > > Register a struct_ops to a kernel subsystem:
> > > 1. Load all needed BPF_PROG_TYPE_STRUCT_OPS prog(s)
> > > 2. Create a BPF_MAP_TYPE_STRUCT_OPS with attr->btf_vmlinux_value_type_id
> > >     set to the btf id "struct bpf_struct_ops_tcp_congestion_ops" of the
> > >     running kernel.
> > >     Instead of reusing the attr->btf_value_type_id,
> > >     btf_vmlinux_value_type_id s added such that attr->btf_fd can still be
> > >     used as the "user" btf which could store other useful sysadmin/debug
> > >     info that may be introduced in the furture,
> > >     e.g. creation-date/compiler-details/map-creator...etc.
> > > 3. Create a "struct bpf_struct_ops_tcp_congestion_ops" object as described
> > >     in the running kernel btf.  Populate the value of this object.
> > >     The function ptr should be populated with the prog fds.
> > > 4. Call BPF_MAP_UPDATE with the object created in (3) as
> > >     the map value.  The key is always "0".
> > > 
> > > During BPF_MAP_UPDATE, the code that saves the kernel-func-ptr's
> > > args as an array of u64 is generated.  BPF_MAP_UPDATE also allows
> > > the specific struct_ops to do some final checks in "st_ops->init_member()"
> > > (e.g. ensure all mandatory func ptrs are implemented).
> > > If everything looks good, it will register this kernel struct
> > > to the kernel subsystem.  The map will not allow further update
> > > from this point.
> > 
> > Btw, did you have any thoughts on whether it would have made sense to add
> > a new core construct for BPF aside from progs or maps, e.g. BPF modules
> > which then resemble a collection of progs/ops (given this would not be limited
> > to tcp congestion control only). Given the possibilities, having a bit of second
> > thoughts on abusing BPF map interface this way which is not overly pretty. It's
> > not a map anymore at this point anyway, we're just reusing the syscall interface
> > since it's convenient though cannot be linked to any prog is just a single slot
> > etc, but technically some sort of BPF module registration would be nicer. Also in
> > terms of 'bpftool modules' then listing all such currently loaded modules which
> > need to be cleaned up this way through explicit removal (similar to insmod/
> > lsmod/rmmod); at least feels more natural conceptually than BPF maps and the way
> > you refcount them, and would perhaps also be a fit for BPF lib helpers for dynamic
> > linking to load that way. So essentially similar but more lightweight infrastructure
> > as with kernel modules. Thoughts?
> Inventing a new bpf obj type (vs adding new map type like in this patch) was
> one considered (and briefly-tried) option.
> 
> Once BTF was introduced to bpf map,  I see bpf map as an introspectible
> bpf obj that can store any blob described by BTF.  I don't think
> creating a new bpf obj type worth it while both of them are basically
> storing a value described by BTF.
> 
> I did try to create register/unregister interface and new bpf-cmd.
> At the end, it ends up very similar to update_elem() which is basically
> updating a blob of a struct described by BTF.  Hence, I tossed that and
> came back to the current approach.
> 
> Put aside the new bpf obj type needs kernel support like another idr,
> likely pin-able, fd, get_info...etc,  I suspect most users have already
> been used to do 'bpftool map dump' to introspect bpf obj that is storing
> a 'struct'.
> 
> The map type is enough to distinguish the map usage instead of creating
> another bpf obj type.  The 'bpftool modules' will work on the struct_ops
> map only.

Right, but under long-term I'd expect more users of this interface and given
we abuse the map only to keep other entities (here: bpf tcp congctl module)
'alive', but cannot do anything else with this map (as in: usage in the BPF
program), it feels that this begs for a better interface. Given we need an
explicit delete operation of the map slot in order to eventually unregister
the congctl module once no application is using it anymore, how are users
supposed to operate this considering the loader performs either only a load
or crashes before the map delete happens? If you had 'bpftool modules' like
cmdline interface with similar insmod/lsmod/rmmod type operation as we have
for kernel modules, it's pretty obvious and intuitive. Here, you'd need a
'bpftool map dump' to get to the concrete ops map, and then perform an
explicit delete operation for releasing the ops refcount and thus to unload
the set of progs. Such extension for bpftool should be done regardless, even
if we end up to keep abusing the map interface for this, but API wise feels
way cleaner to have a dedicated register/unregister interface.

Thanks,
Daniel

  reply	other threads:[~2020-01-08 16:54 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-12-31  6:20 [PATCH bpf-next v3 00/11] Introduce BPF STRUCT_OPS Martin KaFai Lau
2019-12-31  6:20 ` [PATCH bpf-next v3 01/11] bpf: Save PTR_TO_BTF_ID register state when spilling to stack Martin KaFai Lau
2019-12-31  6:20 ` [PATCH bpf-next v3 02/11] bpf: Avoid storing modifier to info->btf_id Martin KaFai Lau
2019-12-31  6:20 ` [PATCH bpf-next v3 03/11] bpf: Add enum support to btf_ctx_access() Martin KaFai Lau
2019-12-31  6:20 ` [PATCH bpf-next v3 04/11] bpf: Support bitfield read access in btf_struct_access Martin KaFai Lau
2020-01-02 18:04   ` Yonghong Song
2019-12-31  6:20 ` [PATCH bpf-next v3 05/11] bpf: Introduce BPF_PROG_TYPE_STRUCT_OPS Martin KaFai Lau
2019-12-31  6:20 ` [PATCH bpf-next v3 06/11] bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS Martin KaFai Lau
2020-01-02 18:33   ` Yonghong Song
2020-01-07 16:00   ` Daniel Borkmann
2020-01-07 18:55     ` Martin Lau
2020-01-07 21:44       ` Daniel Borkmann
2020-01-08  1:53         ` Martin Lau
2020-01-07 20:50   ` Daniel Borkmann
2020-01-07 20:51     ` Daniel Borkmann
2020-01-08  0:21   ` Daniel Borkmann
2020-01-08  1:52     ` Martin Lau
2020-01-08 16:53       ` Daniel Borkmann [this message]
2020-01-08 18:41         ` Martin Lau
2019-12-31  6:20 ` [PATCH bpf-next v3 07/11] bpf: tcp: Support tcp_congestion_ops in bpf Martin KaFai Lau
2019-12-31  6:20 ` [PATCH bpf-next v3 08/11] bpf: Add BPF_FUNC_tcp_send_ack helper Martin KaFai Lau
2019-12-31  6:20 ` [PATCH bpf-next v3 09/11] bpf: Synch uapi bpf.h to tools/ Martin KaFai Lau
2019-12-31  6:20 ` [PATCH bpf-next v3 10/11] bpf: libbpf: Add STRUCT_OPS support Martin KaFai Lau
2019-12-31  6:21 ` [PATCH bpf-next v3 11/11] bpf: Add bpf_dctcp example Martin KaFai Lau

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200108165350.GA7014@linux-3.fritz.box \
    --to=daniel@iogearbox.net \
    --cc=Kernel-team@fb.com \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=davem@davemloft.net \
    --cc=kafai@fb.com \
    --cc=netdev@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).