From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Petar Penkov <peterpenkov96@gmail.com>,
	Network Development <netdev@vger.kernel.org>,
	David Miller <davem@davemloft.net>,
	Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	simon.horman@netronome.com, ecree@solarflare.com,
	songliubraving@fb.com, Tom Herbert <tom@herbertland.com>,
	Petar Penkov <ppenkov@google.com>,
	Willem de Bruijn <willemb@google.com>
Subject: Re: [bpf-next, v2 1/3] flow_dissector: implements flow dissector BPF hook
Date: Wed, 12 Sep 2018 14:43:37 -0400	[thread overview]
Message-ID: <CAF=yD-LOvyj9t1O7MpqfXE7eYKpFZn96Gx144vmkFgppZe2UbQ@mail.gmail.com> (raw)
In-Reply-To: <20180912034632.bkxpktzvze242rvr@ast-mbp>

On Tue, Sep 11, 2018 at 11:47 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Sep 07, 2018 at 05:11:08PM -0700, Petar Penkov wrote:
> > From: Petar Penkov <ppenkov@google.com>
> >
> > Adds a hook for programs of type BPF_PROG_TYPE_FLOW_DISSECTOR and
> > attach type BPF_FLOW_DISSECTOR that is executed in the flow dissector
> > path. The BPF program is per-network namespace.
> >
> > Signed-off-by: Petar Penkov <ppenkov@google.com>
> > Signed-off-by: Willem de Bruijn <willemb@google.com>
> > ---
> >  include/linux/bpf.h            |   1 +
> >  include/linux/bpf_types.h      |   1 +
> >  include/linux/skbuff.h         |   7 ++
> >  include/net/net_namespace.h    |   3 +
> >  include/net/sch_generic.h      |  12 ++-
> >  include/uapi/linux/bpf.h       |  25 ++++++
> >  kernel/bpf/syscall.c           |   8 ++
> >  kernel/bpf/verifier.c          |  32 ++++++++
> >  net/core/filter.c              |  67 ++++++++++++++++
> >  net/core/flow_dissector.c      | 136 +++++++++++++++++++++++++++++++++
> >  tools/bpf/bpftool/prog.c       |   1 +
> >  tools/include/uapi/linux/bpf.h |  25 ++++++
> >  tools/lib/bpf/libbpf.c         |   2 +
>
> please split the update to tools/include/uapi/linux/bpf.h into a separate patch 2.
> We often have conflicts in there, so it's best to keep it separate.
> Also please split the tools/lib and tools/bpf changes into patch 3.

Will do in v3.

> >  struct qdisc_skb_cb {
> > -     unsigned int            pkt_len;
> > -     u16                     slave_dev_queue_mapping;
> > -     u16                     tc_classid;
> > +     union {
> > +             struct {
> > +                     unsigned int            pkt_len;
> > +                     u16                     slave_dev_queue_mapping;
> > +                     u16                     tc_classid;
> > +             };
> > +             struct bpf_flow_keys *flow_keys;
> > +     };
>
> is this magic really necessary? flow_dissector runs very early in recv path.
> There is no qdisc or conflicts with tcp/ip use of cb.
> I think the whole cb block can be used.

The flow dissector also runs in the context of TC, from flower.
But that is not a reason to use this struct.

We need both (a) data shared with the BPF application and between
applications using tail-calls, to pass along the parse offset (nhoff),
and (b) data not accessible by the program, to store the flow_keys
pointer.

qdisc_skb_cb already has this split, exposing only the 20B .data
field to the application. Flow dissection currently reuses the existing
bpf_convert_ctx_access logic for this field.

We could create a separate flowdissect_skb_cb struct with the
same split setup, but a second constraint is that relevant internal
BPF interfaces already expect qdisc_skb_cb, notably
bpf_skb_data_end. So the union was easier.

There is another way to pass nhoff besides cb[] (see below). But
I don't immediately see another place to store the flow_keys ptr.

At least, that holds as long as we pass the skb as context. A larger
change would be to introduce a separate ctx struct, similar to
sk_reuseport_(kern|md).

> > @@ -2333,6 +2335,7 @@ struct __sk_buff {
> >       /* ... here. */
> >
> >       __u32 data_meta;
> > +     __u32 flow_keys;
>
> please use
> struct bpf_flow_keys *flow_keys;
> instead.
>
> See what we did in 'struct sk_msg_md' and in 'struct sk_reuseport_md'.
> There is no need to hide pointers in u32.
>

Will do in v3.

> > @@ -658,6 +754,46 @@ bool __skb_flow_dissect(const struct sk_buff *skb,
> >                                             FLOW_DISSECTOR_KEY_BASIC,
> >                                             target_container);
> >
> > +     rcu_read_lock();
> > +     attached = skb ? rcu_dereference(dev_net(skb->dev)->flow_dissector_prog)
> > +                    : NULL;
> > +     if (attached) {
> > +             /* Note that even though the const qualifier is discarded
> > +              * throughout the execution of the BPF program, all changes
> > +              * (to the control block) are reverted after the BPF program
> > +              * returns. Therefore, __skb_flow_dissect does not alter the skb.
> > +              */
> > +             struct bpf_flow_keys flow_keys = {};
> > +             struct qdisc_skb_cb cb_saved;
> > +             struct qdisc_skb_cb *cb;
> > +             u16 *pseudo_cb;
> > +             u32 result;
> > +
> > +             cb = qdisc_skb_cb(skb);
> > +             pseudo_cb = (u16 *)bpf_skb_cb((struct sk_buff *)skb);
> > +
> > +             /* Save Control Block */
> > +             memcpy(&cb_saved, cb, sizeof(cb_saved));
> > +             memset(cb, 0, sizeof(cb_saved));
> > +
> > +             /* Pass parameters to the BPF program */
> > +             cb->flow_keys = &flow_keys;
> > +             *pseudo_cb = nhoff;
>
> I don't understand this bit.
> What is this pseudo_cb and why nhoff goes in there?
> Some odd way to pass it into the prog?

Yes. nhoff passes the offset to the program to start parsing from.
Applications also pass this during tail calls.

Alternatively we can just add a new field to struct bpf_flow_keys.

Thread overview: 8+ messages
2018-09-08  0:11 [bpf-next, v2 0/3] Introduce eBPF flow dissector Petar Penkov
2018-09-08  0:11 ` [bpf-next, v2 1/3] flow_dissector: implements flow dissector BPF hook Petar Penkov
2018-09-12  3:46   ` Alexei Starovoitov
2018-09-12 18:43     ` Willem de Bruijn [this message]
2018-09-12 22:25       ` Alexei Starovoitov
2018-09-12 22:53         ` Willem de Bruijn
2018-09-08  0:11 ` [bpf-next, v2 2/3] flow_dissector: implements eBPF parser Petar Penkov
2018-09-08  0:11 ` [bpf-next, v2 3/3] selftests/bpf: test bpf flow dissection Petar Penkov
