All of lore.kernel.org
 help / color / mirror / Atom feed
From: Andrii Nakryiko <andrii.nakryiko@gmail.com>
To: John Fastabend <john.fastabend@gmail.com>
Cc: "Jesper Dangaard Brouer" <brouer@redhat.com>,
	BPF-dev-list <bpf@vger.kernel.org>,
	"Alexander Lobakin" <alexandr.lobakin@intel.com>,
	"Karlsson, Magnus" <magnus.karlsson@intel.com>,
	"Magnus Karlsson" <magnus.karlsson@gmail.com>,
	"David Ahern" <dsahern@kernel.org>,
	"Björn Töpel" <bjorn@kernel.org>,
	"Saeed Mahameed" <saeed@kernel.org>,
	"kurt@linutronix.de" <kurt@linutronix.de>,
	"Raczynski, Piotr" <piotr.raczynski@intel.com>,
	"Zhang, Jessica" <jessica.zhang@intel.com>,
	"Maloor, Kishen" <kishen.maloor@intel.com>,
	"Gomes, Vinicius" <vinicius.gomes@intel.com>,
	"Brandeburg, Jesse" <jesse.brandeburg@intel.com>,
	"Swiatkowski, Michal" <michal.swiatkowski@intel.com>,
	"Plantykow, Marta A" <marta.a.plantykow@intel.com>,
	"Desouza, Ederson" <ederson.desouza@intel.com>,
	"Song, Yoong Siang" <yoong.siang.song@intel.com>,
	"Czapnik, Lukasz" <lukasz.czapnik@intel.com>,
	"Joseph, Jithu" <jithu.joseph@intel.com>,
	"William Tu" <u9012063@gmail.com>,
	"Ong Boon Leong" <boon.leong.ong@intel.com>
Subject: Re: XDP-hints: Howto support multiple BTF types per packet basis?
Date: Thu, 27 May 2021 10:44:00 -0700	[thread overview]
Message-ID: <CAEf4Bzb1OZHpHYagbVs7s9tMSk4wrbxzGeBCCBHQ-qCOgdu6EQ@mail.gmail.com> (raw)
In-Reply-To: <60aeeb5252147_19a622085a@john-XPS-13-9370.notmuch>

On Wed, May 26, 2021 at 5:44 PM John Fastabend <john.fastabend@gmail.com> wrote:
>
> [...]
>
> > > Best to start with the simplest possible usable thing and get more
> > > complex over time.
> > >
> > > For a C definition I would expect drivers to do something like this,
> > >
> > >  struct mynic_rx_descriptor {
> > >         __u64 len;
> > >         __u64 head;
> > >         __u64 tail;
> > >         __u64 foobar;
> > >  }
> > >
> > >  struct mynic_metadata {
> > >         __u64 timestamp;
> > >         __u64 hash;
> > >         __u64 pkt_type;
> > >         struct mynic_rx_descriptor *ptr_to_rx;
> > >         /* other things */
> > >  }
> > >
> > > It doesn't really matter how the driver folks generate their metadata
> > > though. They might use some non-C thing that is more natural for
> > > writing parser/action/tcam codes.
> > >
> > > Anyways given some C block like above we generate BTF from above
> > > using normal method, quick hack just `pahole -J` the thing. Now we
> > > have a BTF file.
> > >
> > > Next up write some XDP program to do something with it,
> > >
> > >  void myxdp_prog(struct xdp_md *ctx) {
> > >         struct mynic_metadata m = (struct mynic_metadata *)ctx->data_meta;
> > >
> > >         // now I can get data using normal CO-RE
> > >         // I usually have this _(&) to put CO-RE attributes in I
> > >         // believe that is standard? Or use the other macros
> > >         __u64 pkt_type = _(&m->pkt_type)
> >
> > add __attribute__((preserve_access_index)) to the struct
> > mynic_metadata above (when compiling your BPF program) and you don't
> > need _() ugliness:
>
> +1. Although sometimes I like the ugliness so I can keep track
> of whats in CO-RE and not.

Oh, I'm just against using underscore as an identifier, I'd use
something a bit more explicit.

>
> >
> > __u64 pkt_type = m->pkt_type; /* it's CO-RE relocatable already */
> >
> > we have preserve_access_index as a code block (some selftests do this)
> > for cases when you can't annotate types
> >
> > >
> > >         // we can even walk into structs if we have probe read
> > >         // around.
> > >         struct mynic_rx_descriptor *rxdesc = _(&m->ptr_to_rx)
> > >
> > >         // now do whatever I like with above metadata
> > >  }
> > >
> > > Run above program through normal CO-RE pass and as long as it has
> > > access to the BTF from above it will work. I have some logic
> > > sitting around to stitch two BTF blocks together but we have
> > > that now done properly for linking.
> >
> > "stitching BTF blocks together" sort of jumped out of nowhere, what is
> > this needed for? And not sure what "BTF block" means exactly, it's a
> > new terminology.
>
> I didn't know what the correct terminology here would be.

I just wasn't sure if "BTF block" is a single BTF type or it's a
collection of types built on top of vmlinux BTF (what we call split
BTF). Seems like it's the latter.

>
> What I meant is I think what you have here,
>
> "
>  BTW, not that I encourage such abuse, but for the experiment's sake,
>  you can (ab)use module BTFs mechanism today to allow dynamically
>  adding/removing split BTFs built on top of kernel (vmlinux) BTF
> "
>
> So if vendor/driver writer has a BTF file for whatever the current
> hardware is doing we can use the split BTF build mechanism to
> include it. This can be used to get Jespers dynamic reprogram
> hardware example. We just need someway to get the BTF of the
> current running hardware. What I'm suggesting to get going we
> can just take that out of band, libbpf/kernel don't have
> to care where it comes from as long as libbpf can consume the
> split BTFs before doing CO-RE.
>
> With this model I can have a single XDP program and it will
> run on multiple hardware or the same hardware across updates
> when I can use the normal CO-RE macros to access the metadata.
> When I update my hardware I just need to get ahold of the
> BTF when I do that update and my programs will continue to
> work.
>
> Once we show the value of above we can talk about a driver
> mechanism to expose the BTF over some interface, maybe in
> /sys/fs. But that would still look like a split BTF from libbpf
> side. The advantage is it should work today.

Right, except I don't think we have libbpf APIs to specify this, but
that's solvable.

>
> I called the process of taking two BTF files, vmlinux BTF and
> user provided NIC metadata BTF, and using those for CO-RE
> logic "stitching BTF blocks together".
>
> >
> > >
> > > probe_read from XDP should be added regardless of above. I've
> > > found it super handy in skmsg programs to dig out kernel info
> > > inline. With probe_read we can also start to walk net_device
> > > struct for more detailed info as needed. Or into sock structs
> >
> > yes, libbpf provides BPF_CORE_READ() macro that allows to walk across
> > struct referenced by pointers, e.g.,:
> >
> > int my_data = BPF_CORE_READ(m, ptr_to_rx, rx_field);
> >
> > is logical equivalent of
> >
> > int my_data = m->ptr_to_rx->rx_field;
>
> The only complication here is ptr_to_rx is outside XDP data
> so we need XDP program to support probe_read(). So depending
> on current capabilities a BPF program might be limited to
> just its own data block or with higher caps able to use
> more of the features.
>

Right.

> >
> > > for process level conntrack (other thread). Even without
> > > probe_read above would be useful but fields would need to fit
> > > into the metadata where we know we can read/write data.
> > >
> > > Having drivers export their BTF over a /sys/fs/ interface
> > > so that BTF can change with fimware/parser updates is possible
> > > as well, but I would want to see above working in real world
> > > before committing to a /sys/fs interface. Anyways the
> > > interface is just a convienence.
> >
> > it's important enough to discuss because libbpf has to get it somehow
> > (or be directly provided as an extra option or something).
>
> I believe to start with directly providing it is the easiest
> approach. Then as a second step we can pull it from a /sys/fs
> interface.
>
> >
> > >
> > > >
> > > > As for BTF on a per-packet basis. This means that BTF itself is not
> > > > known at the BPF program verification time, so there will be some sort
> > > > of if/else if/else conditions to handle all recognized BTF IDs? Is
> > > > that right? Fake but specific code would help (at least me) to
> > > > actually join the discussion. Thanks.
> > >
> > > I don't think we actually want per-packet data that sounds a bit
> > > clumsy for me. Lets use a union and define it so that we have a
> > > single BTF.
> >
> > union and independent set of BTFs are two different things, I'll let
> > you guys figure out which one you need, but I replied how it could
> > look like in CO-RE world
>
> I think a union is sufficient and more aligned with how the
> hardware would actually work.

Sure. And I think those are two orthogonal concerns. You can start
with a single struct mynic_metadata with union inside it, and later
add the ability to swap mynic_metadata with another
mynic_metadata___v2 that will have a similar union but with a
different layout.

>
> >
> > >
> > >  struct mynic_metadata {
> > >   __u64 pkt_type
> > >   union {
> > >       struct ipv6_meta meta;
> > >       struct ipv4_meta meta;
> > >       struct arp_meta meta;
> >
> > obviously fields can't be named the same, so you'll have meta_ipv6,
>
> Sure just typing a quick example.
>
> > meta_ipv4, meta_arp fields, but I get the idea. This works if BTF
> > layout is set in stone. What Jesper proposes would allow to adds new
> > BTF layouts at runtime and still be able to handle that (as in detect
> > and ignore) with already running BPF programs.
>
> Same answer as above. As long as the BTF can be split into two
> files I don't think libbpf should care if its always the same
> NIC.btf + vmlinux.btf or diffent correct?
>
> >
> > CO-RE is sufficiently sophisticated to handle both today, so I don't care :)
>
> +1
>
> >
> > >   }
> > >  };
> > >
> > > Then program has to swivel on pkt_type but that is most natural
> > > C thing to do IMO.
> > >
> > > Honestly we have about 90% of the necessary bits to do this now.
> > > Typed that up a bit fast hope its legible. Got a lot going on today.
> > >
> > > Andrii, make sense?
> >
> > Yes, thanks! The logistics of getting that BTF to libbpf is the most
> > fuzzy area and not worked out completely. The low-level details of
> > relocations are already in place if libbpf can be pointed to the right
> > set of BTF types.
>
> Per above, getting that BTF to libbpf should be a user problem for
> a bit. Once how those programs look is worked out I think drivers
> can push them out via /sys/kernel/btf/mynic
>
> >
> > BTW, not that I encourage such abuse, but for the experiment's sake,
> > you can (ab)use module BTFs mechanism today to allow dynamically
> > adding/removing split BTFs built on top of kernel (vmlinux) BTF. I
> > suggest looking into how module BTFs are handled both inside the
> > kernel and in libbpf.
> >
>
> Exactly the abuse I was thinking ;)

  reply	other threads:[~2021-05-27 17:44 UTC|newest]

Thread overview: 43+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-05-26 10:58 XDP-hints: Howto support multiple BTF types per packet basis? Jesper Dangaard Brouer
2021-05-26 19:12 ` Andrii Nakryiko
2021-05-26 20:20   ` Jesper Dangaard Brouer
2021-05-26 20:37     ` John Fastabend
2021-05-26 22:39     ` Andrii Nakryiko
2021-05-28 11:16       ` Jesper Dangaard Brouer
2021-05-30  3:24         ` Andrii Nakryiko
2021-05-26 20:31   ` John Fastabend
2021-05-26 22:54     ` Andrii Nakryiko
2021-05-27  0:44       ` John Fastabend
2021-05-27 17:44         ` Andrii Nakryiko [this message]
2021-05-28  5:48           ` John Fastabend
2021-05-28  9:16             ` Toke Høiland-Jørgensen
2021-05-28 10:38               ` Alexander Lobakin
2021-05-28 14:35               ` John Fastabend
2021-05-28 15:33                 ` Toke Høiland-Jørgensen
2021-05-28 16:02                 ` Jesper Dangaard Brouer
2021-05-28 17:29                   ` John Fastabend
2021-05-30  3:27                     ` Andrii Nakryiko
2021-05-31 11:03                     ` Toke Høiland-Jørgensen
2021-05-31 13:17                       ` Jesper Dangaard Brouer
2021-06-02  0:22                       ` John Fastabend
2021-06-02 16:18                         ` Jakub Kicinski
2021-06-22  7:44                           ` Michal Swiatkowski
2021-06-22 11:53                             ` Toke Høiland-Jørgensen
2021-06-23  8:32                               ` Michal Swiatkowski
2021-06-24 12:23                                 ` Toke Høiland-Jørgensen
2021-06-24 13:07                                   ` Magnus Karlsson
2021-06-24 14:58                                     ` Alexei Starovoitov
2021-06-24 15:11                                   ` Zvi Effron
2021-06-24 16:04                                     ` Toke Høiland-Jørgensen
2021-06-24 16:32                                       ` Zvi Effron
2021-06-24 16:45                                       ` Jesper Dangaard Brouer
2021-07-08  8:32                                   ` Michal Swiatkowski
2021-07-09 10:57                                     ` Toke Høiland-Jørgensen
2021-09-02  2:49                                       ` Michal Swiatkowski
2021-09-02  9:17                                         ` Jesper Dangaard Brouer
2021-09-07  6:27                                           ` Michal Swiatkowski
2021-09-08 13:28                                             ` Jesper Dangaard Brouer
2021-09-09 18:19                                               ` Andrii Nakryiko
2021-09-10 11:16                                                 ` Jesper Dangaard Brouer
2021-07-28  9:54 ` XDP-hints: how to inform driver about hints Michal Swiatkowski
2021-07-28 18:40   ` John Fastabend

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAEf4Bzb1OZHpHYagbVs7s9tMSk4wrbxzGeBCCBHQ-qCOgdu6EQ@mail.gmail.com \
    --to=andrii.nakryiko@gmail.com \
    --cc=alexandr.lobakin@intel.com \
    --cc=bjorn@kernel.org \
    --cc=boon.leong.ong@intel.com \
    --cc=bpf@vger.kernel.org \
    --cc=brouer@redhat.com \
    --cc=dsahern@kernel.org \
    --cc=ederson.desouza@intel.com \
    --cc=jesse.brandeburg@intel.com \
    --cc=jessica.zhang@intel.com \
    --cc=jithu.joseph@intel.com \
    --cc=john.fastabend@gmail.com \
    --cc=kishen.maloor@intel.com \
    --cc=kurt@linutronix.de \
    --cc=lukasz.czapnik@intel.com \
    --cc=magnus.karlsson@gmail.com \
    --cc=magnus.karlsson@intel.com \
    --cc=marta.a.plantykow@intel.com \
    --cc=michal.swiatkowski@intel.com \
    --cc=piotr.raczynski@intel.com \
    --cc=saeed@kernel.org \
    --cc=u9012063@gmail.com \
    --cc=vinicius.gomes@intel.com \
    --cc=yoong.siang.song@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.