From mboxrd@z Thu Jan 1 00:00:00 1970 From: Alexei Starovoitov Subject: Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure Date: Fri, 6 Jul 2018 17:53:18 -0700 Message-ID: <20180707005317.y255ymd7m5bw3zof@ast-mbp.dhcp.thefacebook.com> References: <20180627024615.17856-1-saeedm@mellanox.com> <20180627024615.17856-3-saeedm@mellanox.com> <20180703230137.hdoy2fujz3x4oeij@ast-mbp.dhcp.thefacebook.com> <13f973a9937834ae8c10bfcc7d90909e94c543f1.camel@mellanox.com> <65b964eb-9ee1-9fd8-d54a-9290377eb1e4@iogearbox.net> <20180705101800.3c5d6af0@cakuba.netronome.com> <20180706163041.xstyfednmgho23m3@ast-mbp.dhcp.thefacebook.com> <20180706143358.0240ff66@cakuba.netronome.com> <20180706234249.dchucouomzwilytx@ast-mbp.dhcp.thefacebook.com> <20180706170847.3b2c7809@cakuba.netronome.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Daniel Borkmann , Saeed Mahameed , "saeedm@dev.mellanox.co.il" , "alexander.h.duyck@intel.com" , "netdev@vger.kernel.org" , Tariq Toukan , "john.fastabend@gmail.com" , "brouer@redhat.com" , "borkmann@iogearbox.net" , "peter.waskiewicz.jr@intel.com" To: Jakub Kicinski Return-path: Received: from mail-pl0-f65.google.com ([209.85.160.65]:43985 "EHLO mail-pl0-f65.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753450AbeGGAxW (ORCPT ); Fri, 6 Jul 2018 20:53:22 -0400 Received: by mail-pl0-f65.google.com with SMTP id c41-v6so3553449plj.10 for ; Fri, 06 Jul 2018 17:53:22 -0700 (PDT) Content-Disposition: inline In-Reply-To: <20180706170847.3b2c7809@cakuba.netronome.com> Sender: netdev-owner@vger.kernel.org List-ID: On Fri, Jul 06, 2018 at 05:08:47PM -0700, Jakub Kicinski wrote: > On Fri, 6 Jul 2018 16:42:51 -0700, Alexei Starovoitov wrote: > > On Fri, Jul 06, 2018 at 02:33:58PM -0700, Jakub Kicinski wrote: > > > On Fri, 6 Jul 2018 09:30:42 -0700, Alexei Starovoitov wrote: > > > > On Thu, Jul 05, 2018 at 10:18:23AM -0700, Jakub Kicinski wrote: > > > > > > > > > > I'm also not 100% on board with the argument that "future" FW can > > > > > reshuffle things whatever way it wants to. Is the assumption that > > > > > future ASICs/FW will be designed to always use the "blessed" BTF > > > > > format? Or will it be reconfigurable at runtime? > > > > > > > > let's table configuration of metadata aside for a second. > > > > > > > > Describing metedata layout in BTF allows NICs to disclose everything > > > > NIC has to users in a standard and generic way. > > > > Whether firmware is reconfigurable on the fly or has to reflashed > > > > and hw powercycled to have new md layout (and corresponding BTF description) > > > > is a separate discussion. > > > > Saeed's proposal introduces the concept of 'offset' inside 'struct xdp_md_info' > > > > to reach 'hash' value in metadata. > > > > Essentially it's a run-time way to access 'hash' instead of build-time. > > > > So bpf program would need two loads to read csum or hash field instead of one. > > > > With BTF the layout of metadata is known to the program at build-time. > > > > > > > > To reiterate the proposal: > > > > - driver+firmware keep layout of the metadata in BTF format (either in the driver > > > > or driver can read it from firmware) > > > > - 'bpftool read-metadata-desc eth0 > md_desc.h' command will query the driver and > > > > generate normal C header file based on BTF in the given NIC > > > > - user does #include "md_desc.h" and bpf program can access md->csum or md->hash > > > > with direct single load out of metadata area in front of the packet > > > > - llvm compiles bpf program and records how program is doing this md->csum accesses > > > > in BTF format as well (the compiler will be keeping such records > > > > for __sk_buff and all other structs too, but that's separate discussion) > > > > - during sys_bpf(prog_load) the kernel checks (via supplied BTF) that the way the program > > > > accesses metadata (and other structs) matches BTF from the driver, > > > > so no surprises if driver+firmware got updated, but program is not recompiled > > > > - every NIC can have their own layout of metadata and its own meaning of the fields, > > > > but would be good to standardize at least a few common fields like hash > > > > > > Can I expose HW descriptors this way, though, or is the proposal to > > > copy this data into the packet buffer? > > > > That crossed my mind too. We can use BTF to describe HW descriptors too, > > but I don't think it would buy us anything. AF_XDP approach is better. > > > > > > Once this is working we can do more cool things with BTF. > > > > Like doing offset rewriting at program load time similar to what we plan > > > > to do for tracing. Tracing programs will be doing 'task->pid' access > > > > and the kernel will adjust offsetof(struct task_struct, pid) during program load > > > > depending on BTF for the kernel. > > > > The same trick we can do for networking metadata. > > > > The program will contain load instruction md->hash that will get automatically > > > > adjusted to proper offset depending on BTF of 'hash' field in the NIC. > > > > For now I'm proposing _not_ to go that far with offset rewriting and start > > > > with simple steps described above. > > > > > > Why? :( Could we please go with the rewrite/driver helpers instead of > > > impacting fast paths of the drivers yet again? This rewrite should be > > > easier than task->pid, because we have the synthetic user space struct > > > xdp_md definition. > > > > I don't understand 'impact fast path yet again' concern. > > If NIC has certain metadata today, just derscribe what it has in BTF > > and supply the actual per-packet md to xdp program as-is. > > No changes for fast path at all. > > Future rewritting will be done by the bpf/xdp core with zero > > changes to the driver. All driver has to do is to provide BTF. > > I'm confused. AFAIK most *existing* NICs have the metadata in the > "descriptor", i.e. not in the packet buffer. So if the NIC just > describes what it has, and there is no data shuffling/copying > (performance) then we have to expose the descriptor, no? which piece of sw put that data into desciptor ? I bet it's firmware. It could have stored it into pre-packet data, no? I'd like to avoid _all_ copies. Right now xdp program can only see a pointer to packet and pre-packet. If we need another pointer to a piece of the packet descriptor, that's also fine. Both pre-packet metadata and pieces of descriptor can be described in BTF. I'd like to push back on firmware folks that should be listening to feedback from driver folks and kernel stack instead of saying 'here is hw spec that firmware provides'. Firmware is software. It can change and should be open to change by the community with proper maintainership.