Re: [PATCH v2 bpf-next 1/3] bpf: Consolidate inner-map-compatible properties into bpf_types.h

From: Martin KaFai Lau <kafai@fb.com>
To: Andrii Nakryiko <andrii.nakryiko@gmail.com>,
	Daniel Borkmann <daniel@iogearbox.net>
Cc: bpf <bpf@vger.kernel.org>, Alexei Starovoitov <ast@kernel.org>,
	Kernel Team <kernel-team@fb.com>,
	Networking <netdev@vger.kernel.org>, Andrey Ignatov <rdna@fb.com>
Subject: Re: [PATCH v2 bpf-next 1/3] bpf: Consolidate inner-map-compatible properties into bpf_types.h
Date: Thu, 28 May 2020 23:30:54 -0700	[thread overview]
Message-ID: <20200529063054.edz7qfiqgfgjzj43@kafai-mbp> (raw)
In-Reply-To: <CAEf4Bza===GwERi_x0Evf_Wjm+8=wBHnG4VHPNtZ=GPPZ+twiQ@mail.gmail.com>

On Tue, May 26, 2020 at 10:54:26AM -0700, Andrii Nakryiko wrote:
> On Fri, May 22, 2020 at 6:01 PM Martin KaFai Lau <kafai@fb.com> wrote:
> >
> > On Sat, May 23, 2020 at 12:22:48AM +0200, Daniel Borkmann wrote:
> > > On 5/22/20 4:23 AM, Martin KaFai Lau wrote:
> > > [...]
> > > >   };
> > > > +/* Cannot be used as an inner map */
> > > > +#define BPF_MAP_NO_INNER_MAP (1 << 0)
> > > > +
> > > >   struct bpf_map {
> > > >     /* The first two cachelines with read-mostly members of which some
> > > >      * are also accessed in fast-path (e.g. ops, max_entries).
> > > > @@ -120,6 +123,7 @@ struct bpf_map {
> > > >     struct bpf_map_memory memory;
> > > >     char name[BPF_OBJ_NAME_LEN];
> > > >     u32 btf_vmlinux_value_type_id;
> > > > +   u32 properties;
> > > >     bool bypass_spec_v1;
> > > >     bool frozen; /* write-once; write-protected by freeze_mutex */
> > > >     /* 22 bytes hole */
> > > > @@ -1037,12 +1041,12 @@ extern const struct file_operations bpf_iter_fops;
> > > >   #define BPF_PROG_TYPE(_id, _name, prog_ctx_type, kern_ctx_type) \
> > > >     extern const struct bpf_prog_ops _name ## _prog_ops; \
> > > >     extern const struct bpf_verifier_ops _name ## _verifier_ops;
> > > > -#define BPF_MAP_TYPE(_id, _ops) \
> > > > +#define BPF_MAP_TYPE_FL(_id, _ops, properties) \
> > > >     extern const struct bpf_map_ops _ops;
> > > >   #define BPF_LINK_TYPE(_id, _name)
> > > >   #include <linux/bpf_types.h>
> > > >   #undef BPF_PROG_TYPE
> > > > -#undef BPF_MAP_TYPE
> > > > +#undef BPF_MAP_TYPE_FL
> > > >   #undef BPF_LINK_TYPE
> > > >   extern const struct bpf_prog_ops bpf_offload_prog_ops;
> > > > diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> > > > index 29d22752fc87..3f32702c9bf4 100644
> > > > --- a/include/linux/bpf_types.h
> > > > +++ b/include/linux/bpf_types.h
> > > > @@ -76,16 +76,25 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_LSM, lsm,
> > > >   #endif /* CONFIG_BPF_LSM */
> > > >   #endif
> > > > +#define BPF_MAP_TYPE(x, y) BPF_MAP_TYPE_FL(x, y, 0)
> > > > +
> > > >   BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY, array_map_ops)
> > > >   BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_ARRAY, percpu_array_map_ops)
> > > > -BPF_MAP_TYPE(BPF_MAP_TYPE_PROG_ARRAY, prog_array_map_ops)
> > > > +/* prog_array->aux->{type,jited} is a runtime binding.
> > > > + * Doing static check alone in the verifier is not enough,
> > > > + * so BPF_MAP_NO_INNTER_MAP is needed.
> > >
> > > typo: INNTER
> > Good catch.
> >
> > >
> > > > + */
> > > > +BPF_MAP_TYPE_FL(BPF_MAP_TYPE_PROG_ARRAY, prog_array_map_ops,
> > > > +           BPF_MAP_NO_INNER_MAP)
> > >
> > > Probably nit, but what is "FL"? flags? We do have map_flags already, but here the
> > > BPF_MAP_NO_INNER_MAP ends up in 'properties' instead. To avoid confusion, it would
> > > probably be better to name it 'map_flags_fixed' since this is what it really means;
> > > fixed flags that cannot be changed/controlled when creating a map.
> > ok. may be BPF_MAP_TYPE_FIXED_FL?
> >
> > >
> > > >   BPF_MAP_TYPE(BPF_MAP_TYPE_PERF_EVENT_ARRAY, perf_event_array_map_ops)
> > > >   #ifdef CONFIG_CGROUPS
> > > >   BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_ARRAY, cgroup_array_map_ops)
> > > >   #endif
> > > >   #ifdef CONFIG_CGROUP_BPF
> > > > -BPF_MAP_TYPE(BPF_MAP_TYPE_CGROUP_STORAGE, cgroup_storage_map_ops)
> > > > -BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE, cgroup_storage_map_ops)
> > > > +BPF_MAP_TYPE_FL(BPF_MAP_TYPE_CGROUP_STORAGE, cgroup_storage_map_ops,
> > > > +           BPF_MAP_NO_INNER_MAP)
> > > > +BPF_MAP_TYPE_FL(BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE, cgroup_storage_map_ops,
> > > > +           BPF_MAP_NO_INNER_MAP)
> > > >   #endif
> > > >   BPF_MAP_TYPE(BPF_MAP_TYPE_HASH, htab_map_ops)
> > > >   BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_HASH, htab_percpu_map_ops)
> > > > @@ -116,8 +125,10 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, reuseport_array_ops)
> > > >   BPF_MAP_TYPE(BPF_MAP_TYPE_QUEUE, queue_map_ops)
> > > >   BPF_MAP_TYPE(BPF_MAP_TYPE_STACK, stack_map_ops)
> > > >   #if defined(CONFIG_BPF_JIT)
> > > > -BPF_MAP_TYPE(BPF_MAP_TYPE_STRUCT_OPS, bpf_struct_ops_map_ops)
> > > > +BPF_MAP_TYPE_FL(BPF_MAP_TYPE_STRUCT_OPS, bpf_struct_ops_map_ops,
> > > > +           BPF_MAP_NO_INNER_MAP)
> > > >   #endif
> > > > +#undef BPF_MAP_TYPE
> > > >   BPF_LINK_TYPE(BPF_LINK_TYPE_RAW_TRACEPOINT, raw_tracepoint)
> > > >   BPF_LINK_TYPE(BPF_LINK_TYPE_TRACING, tracing)
> > > [...]
> > > > diff --git a/kernel/bpf/map_in_map.c b/kernel/bpf/map_in_map.c
> > > > index 17738c93bec8..d965a1d328a9 100644
> > > > --- a/kernel/bpf/map_in_map.c
> > > > +++ b/kernel/bpf/map_in_map.c
> > > > @@ -17,13 +17,7 @@ struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd)
> > > >     if (IS_ERR(inner_map))
> > > >             return inner_map;
> > > > -   /* prog_array->aux->{type,jited} is a runtime binding.
> > > > -    * Doing static check alone in the verifier is not enough.
> > > > -    */
> > > > -   if (inner_map->map_type == BPF_MAP_TYPE_PROG_ARRAY ||
> > > > -       inner_map->map_type == BPF_MAP_TYPE_CGROUP_STORAGE ||
> > > > -       inner_map->map_type == BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE ||
> > > > -       inner_map->map_type == BPF_MAP_TYPE_STRUCT_OPS) {
> > > > +   if (inner_map->properties & BPF_MAP_NO_INNER_MAP) {
> > > >             fdput(f);
> > > >             return ERR_PTR(-ENOTSUPP);
> > > >     }
> > >
> > > This whole check here is currently very fragile. For example, given we forbid cgroup
> > > local storage here, why do we not forbid socket local storage? What about other maps
> > > like stackmap? It's quite unclear if it even works as expected and if there's also a
> > > use-case we are aware of. Why not making this an explicit opt-in?
> > Re: "cgroup-local-storage", my understanding is,
> > cgroup-local-storage is local to the bpf's cgroup that it is running under,
> > so it is not ok for a cgroup's bpf to be able to access other cgroup's local
> > storage through map-in-map, so they are excluded here.
> >
> > sk-local-storage does not have this restriction.  For other maps, if there is
> > no known safety issue, why restricting it and create unnecessary API
> > discrepancy?
> >
> > I think we cannot restrict the existing map either unless there is a
> > known safety issue.
> >
> > >
> > > Like explicit annotating via struct bpf_map_ops where everything is visible in one
> > > single place where the map is defined:
> > >
> > > const struct bpf_map_ops array_map_ops = {
> > >         .map_alloc_check = array_map_alloc_check,
> > >         [...]
> > >         .map_flags_fixed = BPF_MAP_IN_MAP_OK,
> > > };
> > I am not sure about adding it to bpf_map_ops instead of bpf_types.h.
> > It will be easier to figure out what map types do not support MAP_IN_MAP (and
> > other future map's fixed properties) in one place "bpf_types.h" instead of
> > having to dig into each map src file.
> 
> I'm 100% with Daniel here. If we are consolidating such things, I'd
> rather have them in one place where differences between maps are
> defined, which is ops. Despite an "ops" name, this seems like a
> perfect place for specifying all those per-map-type properties and
> behaviors. Adding flags into bpf_types.h just splits everything into
> two places: bpf_types.h specifies some differences, while ops specify
> all the other ones.
> 
> Figuring out map-in-map support is just one of many questions one
> might ask about differences between map types, I don't think that
> justifies adding them to bpf_types.h. Grepping for struct bpf_map_ops
> with search context (i.e., -A15 or something like that) should be
> enough to get a quick glance at all possible maps and what they
> define/override.
> 
> It also feels like adding this as bool field for each aspect instead
> of a collection of bits is cleaner and a bit more scalable. If we need
> to add another property with some parameter/constant, or just enum,
> defining one of few possible behaviors, it would be easier to just add
> another field, instead of trying to cram that into u32. It also solves
> your problem of "at the glance" view of map-in-map support features.
> Just name that field unique enough to grep by it :)
How about another way.  What patch 2 want is each map could have its own
bpf_map_meta_equal().  Instead of adding 2 flags, add the bpf_map_meta_equal()
as a ops to bpf_map_ops.  Each map supports to be used as an inner_map
needs to set this ops.  Then it will be an opt-in.
A default implementation can be provided for most maps' use.
The maps (e.g. arraymap and other future maps) that has different requirement
can implement its own.  For the existing maps, when we address those
limitations (e.g. arraymap's gen_lookup) later,  we can then change its
bpf_map_meta_equal.

Thoughts?

> 
> >
> > If the objective is to have the future map "consciously" opt-in, how about
> > keeping the "BPF_MAP_TYPE" name as is but add a fixed_flags param as the
> > earlier v1 and flip it from NO to OK flag.  It will be clear that,
> > it is a decision that the new map needs to make instead of a quiet 0
> > in "struct bpf_map_ops".
> >
> > For example,
> > BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY, array_map_ops, BPF_MAP_IN_MAP_OK)
> > BPF_MAP_TYPE(BPF_MAP_TYPE_PROG_ARRAY, prog_array_map_ops, 0)
> > BPF_MAP_TYPE(BPF_MAP_TYPE_HASH, htab_map_ops, BPF_MAP_IN_MAP_OK | BPF_MAP_IN_MAP_DYNAMIC_SIZE_OK)
> >
> > >
> > > That way, if someone forgets to add .map_flags_fixed to a new map type, it's okay since
> > > it's _safe_ to forget to add these flags (and okay to add in future uapi-wise) as opposed
> > > to the other way round where one can easily miss the opt-out case and potentially crash
> > > the machine worst case.