Re: [PATCH bpf-next 06/13] bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS

From: Martin Lau <kafai@fb.com>
To: Yonghong Song <yhs@fb.com>
Cc: "bpf@vger.kernel.org" <bpf@vger.kernel.org>,
	Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	David Miller <davem@davemloft.net>,
	Kernel Team <Kernel-team@fb.com>,
	"netdev@vger.kernel.org" <netdev@vger.kernel.org>
Subject: Re: [PATCH bpf-next 06/13] bpf: Introduce  BPF_MAP_TYPE_STRUCT_OPS
Date: Fri, 20 Dec 2019 07:22:17 +0000	[thread overview]
Message-ID: <20191220072150.flfxsix4s6jndswn@kafai-mbp> (raw)
In-Reply-To: <6ca4e58d-4bb3-bb4b-0fe2-e10a29a53d1f@fb.com>

On Mon, Dec 16, 2019 at 11:48:18PM -0800, Yonghong Song wrote:
> 
> 
> On 12/13/19 4:47 PM, Martin KaFai Lau wrote:
> > The patch introduces BPF_MAP_TYPE_STRUCT_OPS.  The map value
> > is a kernel struct with its func ptr implemented in bpf prog.
> > This new map is the interface to register/unregister/introspect
> > a bpf implemented kernel struct.
> > 
> > The kernel struct is actually embedded inside another new struct
> > (or called the "value" struct in the code).  For example,
> > "struct tcp_congestion_ops" is embbeded in:
> > struct __bpf_tcp_congestion_ops {
> > 	refcount_t refcnt;
> > 	enum bpf_struct_ops_state state;
> > 	struct tcp_congestion_ops data;  /* <-- kernel subsystem struct here */
> > }
> > The map value is "struct __bpf_tcp_congestion_ops".  The "bpftool map dump"
> > will then be able to show the state ("inuse"/"tobefree") and the number of
> > subsystem's refcnt (e.g. number of tcp_sock in the tcp_congestion_ops case).
> > This "value" struct is created automatically by a macro.  Having a separate
> > "value" struct will also make extending "struct __bpf_XYZ" easier (e.g. adding
> > "void (*init)(void)" to "struct __bpf_XYZ" to do some initialization
> > works before registering the struct_ops to the kernel subsystem).
> > The libbpf will take care of finding and populating the "struct __bpf_XYZ"
> > from "struct XYZ".
> > 
> > Register a struct_ops to a kernel subsystem:
> > 1. Load all needed BPF_PROG_TYPE_STRUCT_OPS prog(s)
> > 2. Create a BPF_MAP_TYPE_STRUCT_OPS with attr->btf_vmlinux_value_type_id
> >     set to the btf id "struct __bpf_tcp_congestion_ops" of the running
> >     kernel.
> >     Instead of reusing the attr->btf_value_type_id, btf_vmlinux_value_type_id
> >     is added such that attr->btf_fd can still be used as the "user" btf
> >     which could store other useful sysadmin/debug info that may be
> >     introduced in the furture,
> >     e.g. creation-date/compiler-details/map-creator...etc.
> > 3. Create a "struct __bpf_tcp_congestion_ops" object as described in
> >     the running kernel btf.  Populate the value of this object.
> >     The function ptr should be populated with the prog fds.
> > 4. Call BPF_MAP_UPDATE with the object created in (3) as
> >     the map value.  The key is always "0".
> 
> This is really a special one element map. The key "0" should work.
> Not sure whether we should generalize this and maps for global variables
> to a kind of key-less map. Just some thought.
key-less.  I think it mostly means, no key is passed or pass NULL
as a key.  Not sure if it worths the uapi and userspace disruption,
e.g. think about "bpftool map dump".
I did try to add new bpf cmd to do register/unregister
which do not need the key.  I stopped in the middle because
it does not worth it when considering the lookup side also.

Also, like the global value map, the attr->btf_key_type_id is 0
which is a "void" btf-type and I think it is an as good way as
saying it is keyless.  The bpftool is already ready for this keyless
signal.  The difference between passing 0 or NULL to represent
"void" key is also arguably less.  
In struct_ops case, only btf_vmlinux_value_type_id is added but
not for the key.  

> 
> > 
> > During BPF_MAP_UPDATE, the code that saves the kernel-func-ptr's
> > args as an array of u64 is generated.  BPF_MAP_UPDATE also allows
> > the specific struct_ops to do some final checks in "st_ops->init_member()"
> > (e.g. ensure all mandatory func ptrs are implemented).
> > If everything looks good, it will register this kernel struct
> > to the kernel subsystem.  The map will not allow further update
> > from this point.
> > 
> > Unregister a struct_ops from the kernel subsystem:
> > BPF_MAP_DELETE with key "0".
> > 
> > Introspect a struct_ops:
> > BPF_MAP_LOOKUP_ELEM with key "0".  The map value returned will
> > have the prog _id_ populated as the func ptr.
> > 
> > The map value state (enum bpf_struct_ops_state) will transit from:
> > INIT (map created) =>
> > INUSE (map updated, i.e. reg) =>
> > TOBEFREE (map value deleted, i.e. unreg)
> > 
> > Note that the above state is not exposed to the uapi/bpf.h.
> > It will be obtained from the btf of the running kernel.
> 
> It is not really from btf, right? It is from kernel internal
> data structure which will be copied to user space.
> 
> Since such information is available to bpftool dump and is common
> to all st_ops maps. I am wondering whether we should expose
> this through uapi.
The data is from the kernel-map's value.

The enum's type and its values' "string", meaning "INIT", "INUSE",
and "TOBEFREE" are from the kernel BTF.  These do not need to be
exposed in uapi.  kernel BTF is the way to get them.

[ ... ]

> > +/* __bpf_##_name (e.g. __bpf_tcp_congestion_ops) is the map's value
> > + * exposed to the userspace and its btf-type-id is stored
> > + * at the map->btf_vmlinux_value_type_id.
> > + *
> > + * The *_name##_dummy is to ensure the BTF type is emitted.
> > + */
> > +
> >   #define BPF_STRUCT_OPS_TYPE(_name)				\
> > -extern struct bpf_struct_ops bpf_##_name;
> > +extern struct bpf_struct_ops bpf_##_name;			\
> > +								\
> > +static struct __bpf_##_name {					\
> > +	BPF_STRUCT_OPS_COMMON_VALUE;				\
> > +	struct _name data ____cacheline_aligned_in_smp;		\
> > +} *_name##_dummy;
> 
> There are other ways to retain types in debug info without
> creating new variables. For example, you can use it in a cast
> like
>      (void *)(struct __bpf_##_name *)v
hmm... What is v?

> Not sure whether we could easily find a place for such casting or not.
> 
> >   #include "bpf_struct_ops_types.h"
> >   #undef BPF_STRUCT_OPS_TYPE
> >   
> > @@ -37,19 +97,46 @@ const struct bpf_verifier_ops bpf_struct_ops_verifier_ops = {
> >   const struct bpf_prog_ops bpf_struct_ops_prog_ops = {
> >   };
> >   
> > +static const struct btf_type *module_type;
> > +
> >   void bpf_struct_ops_init(struct btf *_btf_vmlinux)
> >   {
> > +	char value_name[128] = VALUE_PREFIX;
> > +	s32 type_id, value_id, module_id;
> >   	const struct btf_member *member;
> >   	struct bpf_struct_ops *st_ops;
> >   	struct bpf_verifier_log log = {};
> >   	const struct btf_type *t;
> >   	const char *mname;
> > -	s32 type_id;
> >   	u32 i, j;
> >   
> > +	/* Avoid unused var compiler warning */
> > +#define BPF_STRUCT_OPS_TYPE(_name) (void)(_name##_dummy);
> > +#include "bpf_struct_ops_types.h"
> > +#undef BPF_STRUCT_OPS_TYPE
> > +
> > +	module_id = btf_find_by_name_kind(_btf_vmlinux, "module",
> > +					  BTF_KIND_STRUCT);
> > +	if (module_id < 0) {
> > +		pr_warn("Cannot find struct module in btf_vmlinux\n");
> > +		return;
> > +	}
> > +	module_type = btf_type_by_id(_btf_vmlinux, module_id);
> > +
> >   	for (i = 0; i < ARRAY_SIZE(bpf_struct_ops); i++) {
> >   		st_ops = bpf_struct_ops[i];
> >   
> > +		value_name[VALUE_PREFIX_LEN] = '\0';
> > +		strncat(value_name + VALUE_PREFIX_LEN, st_ops->name,
> > +			sizeof(value_name) - VALUE_PREFIX_LEN - 1);
> 
> Do we have restrictions on the length of st_ops->name?
> We probably do not want truncation, right?
It is unlikely the following btf_find_by_name_kind() would succeed.

I will add a check here to ensure no truncation.

> 
> > +		value_id = btf_find_by_name_kind(_btf_vmlinux, value_name,
> > +						 BTF_KIND_STRUCT);
> > +		if (value_id < 0) {
> > +			pr_warn("Cannot find struct %s in btf_vmlinux\n",
> > +				value_name);
> > +			continue;
> > +		}
> > +
> >   		type_id = btf_find_by_name_kind(_btf_vmlinux, st_ops->name,
> >   						BTF_KIND_STRUCT);
> >   		if (type_id < 0) {

[ ... ]

> > +static int bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key,
> > +					  void *value, u64 flags)
> > +{
> > +	struct bpf_struct_ops_map *st_map = (struct bpf_struct_ops_map *)map;
> > +	const struct bpf_struct_ops *st_ops = st_map->st_ops;
> > +	struct bpf_struct_ops_value *uvalue, *kvalue;
> > +	const struct btf_member *member;
> > +	const struct btf_type *t = st_ops->type;
> > +	void *udata, *kdata;
> > +	int prog_fd, err = 0;
> > +	void *image;
> > +	u32 i;
> > +
> > +	if (flags)
> > +		return -EINVAL;
> > +
> > +	if (*(u32 *)key != 0)
> > +		return -E2BIG;
> > +
> > +	uvalue = (struct bpf_struct_ops_value *)value;
> > +	if (uvalue->state || refcount_read(&uvalue->refcnt))
> > +		return -EINVAL;
> > +
> > +	uvalue = (struct bpf_struct_ops_value *)st_map->uvalue;
> > +	kvalue = (struct bpf_struct_ops_value *)&st_map->kvalue;
> > +
> > +	spin_lock(&st_map->lock);
> > +
> > +	if (kvalue->state != BPF_STRUCT_OPS_STATE_INIT) {
> > +		err = -EBUSY;
> > +		goto unlock;
> > +	}
> > +
> > +	memcpy(uvalue, value, map->value_size);
> > +
> > +	udata = &uvalue->data;
> > +	kdata = &kvalue->data;
> > +	image = st_map->image;
> > +
> > +	for_each_member(i, t, member) {
> > +		const struct btf_type *mtype, *ptype;
> > +		struct bpf_prog *prog;
> > +		u32 moff;
> > +
> > +		moff = btf_member_bit_offset(t, member) / 8;
> > +		mtype = btf_type_by_id(btf_vmlinux, member->type);
> > +		ptype = btf_type_resolve_ptr(btf_vmlinux, member->type, NULL);
> > +		if (ptype == module_type) {
> > +			*(void **)(kdata + moff) = BPF_MODULE_OWNER;
> > +			continue;
> > +		}
> > +
> > +		err = st_ops->init_member(t, member, kdata, udata);
> > +		if (err < 0)
> > +			goto reset_unlock;
> > +
> > +		/* The ->init_member() has handled this member */
> > +		if (err > 0)
> > +			continue;
> > +
> > +		/* If st_ops->init_member does not handle it,
> > +		 * we will only handle func ptrs and zero-ed members
> > +		 * here.  Reject everything else.
> > +		 */
> > +
> > +		/* All non func ptr member must be 0 */
> > +		if (!btf_type_resolve_func_ptr(btf_vmlinux, member->type,
> > +					       NULL)) {
> > +			u32 msize;
> > +
> > +			mtype = btf_resolve_size(btf_vmlinux, mtype,
> > +						 &msize, NULL, NULL);
> > +			if (IS_ERR(mtype)) {
> > +				err = PTR_ERR(mtype);
> > +				goto reset_unlock;
> > +			}
> > +
> > +			if (memchr_inv(udata + moff, 0, msize)) {
> > +				err = -EINVAL;
> > +				goto reset_unlock;
> > +			}
> > +
> > +			continue;
> > +		}
> > +
> > +		prog_fd = (int)(*(unsigned long *)(udata + moff));
> > +		/* Similar check as the attr->attach_prog_fd */
> > +		if (!prog_fd)
> > +			continue;
> > +
> > +		prog = bpf_prog_get(prog_fd);
> > +		if (IS_ERR(prog)) {
> > +			err = PTR_ERR(prog);
> > +			goto reset_unlock;
> > +		}
> > +		st_map->progs[i] = prog;
> > +
> > +		if (prog->type != BPF_PROG_TYPE_STRUCT_OPS ||
> > +		    prog->aux->attach_btf_id != st_ops->type_id ||
> > +		    prog->expected_attach_type != i) {
> > +			err = -EINVAL;
> > +			goto reset_unlock;
> > +		}
> > +
> > +		err = arch_prepare_bpf_trampoline(image,
> > +						  &st_ops->func_models[i], 0,
> > +						  &prog, 1, NULL, 0, NULL);
> > +		if (err < 0)
> > +			goto reset_unlock;
> > +
> > +		*(void **)(kdata + moff) = image;
> > +		image += err;
> 
> Do we still want to check whether image out of page boundary or not?
It should never happen.  It would be too late to check here also.

The BPF_STRUCT_OPS_MAX_NR_MEMBERS (which is 64) is picked
based on each trampoline will take less than 64 bytes.
Thus, PAGE_SIZE / 64(bytes) => 64 members

I can add a BUILD_BUG_ON() to ensure the future BPF_STRUCT_OPS_MAX_NR_MEMBERS
change won't violate this.

> 
> > +
> > +		/* put prog_id to udata */
> > +		*(unsigned long *)(udata + moff) = prog->aux->id;
> > +	}
> > +
> > +	refcount_set(&kvalue->refcnt, 1);
> > +	bpf_map_inc(map);
> > +
> > +	err = st_ops->reg(kdata);
> > +	if (!err) {
> > +		smp_store_release(&kvalue->state, BPF_STRUCT_OPS_STATE_INUSE);
> > +		goto unlock;
> > +	}
> > +
> > +	/* Error during st_ops->reg() */
> > +	bpf_map_put(map);
> > +
> > +reset_unlock:
> > +	bpf_struct_ops_map_put_progs(st_map);
> > +	memset(uvalue, 0, map->value_size);
> > +	memset(kvalue, 0, map->value_size);
> > +
> > +unlock:
> > +	spin_unlock(&st_map->lock);
> > +	return err;
> > +}
> > +
> > +static int bpf_struct_ops_map_delete_elem(struct bpf_map *map, void *key)
> > +{
> > +	enum bpf_struct_ops_state prev_state;
> > +	struct bpf_struct_ops_map *st_map;
> > +
> > +	st_map = (struct bpf_struct_ops_map *)map;
> > +	prev_state = cmpxchg(&st_map->kvalue.state,
> > +			     BPF_STRUCT_OPS_STATE_INUSE,
> > +			     BPF_STRUCT_OPS_STATE_TOBEFREE);
> > +	if (prev_state == BPF_STRUCT_OPS_STATE_INUSE) {
> > +		st_map->st_ops->unreg(&st_map->kvalue.data);
> > +		if (refcount_dec_and_test(&st_map->kvalue.refcnt))
> > +			bpf_map_put(map);
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> [...]