From mboxrd@z Thu Jan  1 00:00:00 1970
From: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Subject: Re: [RFC bpf-next 2/6] net: xdp: RX meta data infrastructure
Date: Fri, 6 Jul 2018 16:38:21 -0700
Message-ID: <20180706233819.hqpxvoht7ijb5zqe@ast-mbp.dhcp.thefacebook.com>
References: <20180627024615.17856-1-saeedm@mellanox.com>
 <20180627024615.17856-3-saeedm@mellanox.com>
 <20180703230137.hdoy2fujz3x4oeij@ast-mbp.dhcp.thefacebook.com>
 <13f973a9937834ae8c10bfcc7d90909e94c543f1.camel@mellanox.com>
 <65b964eb-9ee1-9fd8-d54a-9290377eb1e4@iogearbox.net>
 <20180705101800.3c5d6af0@cakuba.netronome.com>
 <20180706163041.xstyfednmgho23m3@ast-mbp.dhcp.thefacebook.com>
 <1530909863.4291.12.camel@intel.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: "jakub.kicinski@netronome.com" <jakub.kicinski@netronome.com>,
        "Duyck, Alexander H" <alexander.h.duyck@intel.com>,
        "daniel@iogearbox.net" <daniel@iogearbox.net>,
        "saeedm@mellanox.com" <saeedm@mellanox.com>,
        "brouer@redhat.com" <brouer@redhat.com>,
        "borkmann@iogearbox.net" <borkmann@iogearbox.net>,
        "tariqt@mellanox.com" <tariqt@mellanox.com>,
        "john.fastabend@gmail.com" <john.fastabend@gmail.com>,
        "netdev@vger.kernel.org" <netdev@vger.kernel.org>,
        "saeedm@dev.mellanox.co.il" <saeedm@dev.mellanox.co.il>
To: "Waskiewicz Jr, Peter" <peter.waskiewicz.jr@intel.com>
Return-path: <netdev-owner@vger.kernel.org>
Received: from mail-pf0-f196.google.com ([209.85.192.196]:43757 "EHLO
        mail-pf0-f196.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S932774AbeGFXiZ (ORCPT
        <rfc822;netdev@vger.kernel.org>); Fri, 6 Jul 2018 19:38:25 -0400
Received: by mail-pf0-f196.google.com with SMTP id y8-v6so9547335pfm.10
        for <netdev@vger.kernel.org>; Fri, 06 Jul 2018 16:38:25 -0700 (PDT)
Content-Disposition: inline
In-Reply-To: <1530909863.4291.12.camel@intel.com>
Sender: netdev-owner@vger.kernel.org
List-ID: <netdev.vger.kernel.org>

On Fri, Jul 06, 2018 at 08:44:24PM +0000, Waskiewicz Jr, Peter wrote:
> On Fri, 2018-07-06 at 09:30 -0700, Alexei Starovoitov wrote:
> > On Thu, Jul 05, 2018 at 10:18:23AM -0700, Jakub Kicinski wrote:
> > > 
> > > I'm also not 100% on board with the argument that "future" FW can
> > > reshuffle things whatever way it wants to.  Is the assumption that
> > > future ASICs/FW will be designed to always use the "blessed" BTF
> > > format?  Or will it be reconfigurable at runtime?
> > 
> > let's table configuration of metadata aside for a second.
> 
> I agree that this should/could be NIC-specific and shouldn't weigh on
> the metadata interface between the drivers and XDP layer.
> 
> > Describing metedata layout in BTF allows NICs to disclose everything
> > NIC has to users in a standard and generic way.
> > Whether firmware is reconfigurable on the fly or has to reflashed
> > and hw powercycled to have new md layout (and corresponding BTF
> > description)
> > is a separate discussion.
> > Saeed's proposal introduces the concept of 'offset' inside 'struct
> > xdp_md_info'
> > to reach 'hash' value in metadata.
> > Essentially it's a run-time way to access 'hash' instead of build-
> > time.
> > So bpf program would need two loads to read csum or hash field
> > instead of one.
> > With BTF the layout of metadata is known to the program at build-
> > time.
> > 
> > To reiterate the proposal:
> > - driver+firmware keep layout of the metadata in BTF format (either
> > in the driver
> >   or driver can read it from firmware)
> > - 'bpftool read-metadata-desc eth0 > md_desc.h' command will query
> > the driver and
> >   generate normal C header file based on BTF in the given NIC
> > - user does #include "md_desc.h" and bpf program can access md->csum
> > or md->hash
> >   with direct single load out of metadata area in front of the packet
> 
> This piece is where I'd like to discuss more.  When we discussed this
> in Seoul, the initial proposal was a static struct that we'd try to
> hammer out a common layout between the interested parties.  That
> obviously wasn't going to scale, and we wanted to pursue something more
> dynamic.  But I thought the goal was the XDP/eBPF program wouldn't want
> to care what the underlying device is, and could just ask for metadata
> that it's interested in.  With this approach, your eBPF program is now
> bound/tied to the NIC/driver, and if you switched to a differen
> NIC/driver combo, then you'd have to rewrite part of your eBPF program
> to comprehend that.  I thought we were trying to avoid that.

It looks to me that NICs have a lot more differences instead of common pieces.
If we focus the architecture on making common things as generic
as possible we miss the bigger picture of covering distinct
and unique features that hw provides.

In the last email when I said "would be good to standardize at least a few common fields"
I meant that it make sense only at the later phases.
I don't think we should put things into uapi upfront.
Otherwise we will end up with likely useless fields.
Like vlan field. Surely a lot of nics can store it into metadata, but why?
bpf program can easily read it from the packet.
What's the benefit of adding it to md?
This metadata concept we're discussing in the context of program acceleration.
If metadata doesn't give perf benefits it should not be done by
firmware, it should not be supported by the driver and certainly
should not be part of uapi.
Hence I'd like to see pure BTF description without any common/standard
fields across the drivers and figure out what (if anything) is useful
to accelerate xdp programs (and af_xdp in the future).

> Our proposed approach (still working on something ready to RFC) is to
> provide a method for the eBPF program to send a struct of requested
> hints down to the driver on load.  If the driver can provide the hints,
> then that'd be how they'd be laid out in the metadata.  If it can't
> provide them, we'd probably reject the program loading, or discuss
> providing a software fallback (I know this is an area of contention).

I don't think that approach can work.
My .02 is the only useful acceleration feature out if intel nics is
an ability to parse the packet and report it to the xdp program
as a node id in the parse graph. As far as I know that is pretty unique
to intel. Configuration of that is a different matter.
Other things like hash are interesting, but we will quickly get
into rss spec definition if we go standardization route now instead
of later phases.
Therefore I'd rather let firmware+driver define its own BTF description
that says here is 'hash' of the packet hw can provide and it's fine
that this hash may be over different tuples depending on the nic
and even version of the firmware in that nic.
We need to see performance numbers and benefits for real world
xdp programs before drilling into specific semantics of the hash.

> I suppose we could get there with the rewriting mechanism described
> below, but that'd be a tough sell to set a bit of ABI for metadata,
> then change it to be potentially dynamic at runtime in the future.

Hmm. I don't see how future offset rewriting will break abi.
It looks to me as natural extension that wouldn't break any apps
that would be written for specific nic with given BTF description.
With rewritting in place some progs would become portable. That's all.