All of lore.kernel.org
 help / color / mirror / Atom feed
* Re: AF_XDP metadata/hints
       [not found]                         ` <DM4PR11MB54224769926B06EE76635A6484299@DM4PR11MB5422.namprd11.prod.outlook.com>
@ 2021-05-21 13:31                           ` Jesper Dangaard Brouer
  2021-05-21 17:53                             ` Saeed Mahameed
  0 siblings, 1 reply; 18+ messages in thread
From: Jesper Dangaard Brouer @ 2021-05-21 13:31 UTC (permalink / raw)
  To: Lobakin, Alexandr
  Cc: Saeed Mahameed, Raczynski, Piotr, Zhang, Jessica, Kubiak, Marcin,
	Joseph, Jithu, kurt, Maloor, Kishen, Gomes, Vinicius, Brandeburg,
	Jesse, Swiatkowski, Michal, Plantykow, Marta A, Ong, Boon Leong,
	Desouza, Ederson, Song, Yoong Siang, Czapnik, Lukasz, brouer,
	netdev

On Fri, 21 May 2021 10:53:40 +0000
"Lobakin, Alexandr" <alexandr.lobakin@intel.com> wrote:

> I've opened two discussions at https://github.com/alobakin/linux,
> feel free to join them and/or create new ones to share your thoughts
> and concerns.

Thanks Alexandr for keeping the thread/subject alive.
 
I guess this is a new GitHub features "Discussions".  I've never used
that in a project before, lets see how this goes.  The usual approach
is discussions over email on netdev (Cc. netdev@vger.kernel.org).

Lets make it a bit easier to find these discussion threads:
 https://github.com/alobakin/linux/discussions

#1: Approach for generating metadata from HW descriptors #1
 https://github.com/alobakin/linux/discussions/1

#2: The idea of obtaining BTF directly from /sys/kernel/btf #2
 https://github.com/alobakin/linux/discussions/2

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: AF_XDP metadata/hints
  2021-05-21 13:31                           ` AF_XDP metadata/hints Jesper Dangaard Brouer
@ 2021-05-21 17:53                             ` Saeed Mahameed
  2021-05-23 11:54                               ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 18+ messages in thread
From: Saeed Mahameed @ 2021-05-21 17:53 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Lobakin, Alexandr
  Cc: Raczynski, Piotr, Zhang, Jessica, Kubiak, Marcin, Joseph, Jithu,
	kurt, Maloor, Kishen, Gomes, Vinicius, Brandeburg, Jesse,
	Swiatkowski, Michal, Plantykow, Marta A, Ong, Boon Leong,
	Desouza, Ederson, Song, Yoong Siang, Czapnik, Lukasz, netdev

On Fri, 2021-05-21 at 15:31 +0200, Jesper Dangaard Brouer wrote:
> On Fri, 21 May 2021 10:53:40 +0000
> "Lobakin, Alexandr" <alexandr.lobakin@intel.com> wrote:
> 
> > I've opened two discussions at https://github.com/alobakin/linux,
> > feel free to join them and/or create new ones to share your thoughts
> > and concerns.
> 
> Thanks Alexandr for keeping the thread/subject alive.
>  
> I guess this is a new GitHub features "Discussions".  I've never used
> that in a project before, lets see how this goes.  The usual approach
> is discussions over email on netdev (Cc. netdev@vger.kernel.org).

I agree we need full visibility and transparency, i actually recommend:
bpf@vger.kernel.org

so we wont spam the netdev ML.

> 
> Lets make it a bit easier to find these discussion threads:
>  https://github.com/alobakin/linux/discussions
> 
> #1: Approach for generating metadata from HW descriptors #1
>  https://github.com/alobakin/linux/discussions/1
> 
> #2: The idea of obtaining BTF directly from /sys/kernel/btf #2
>  https://github.com/alobakin/linux/discussions/2
> 




^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: AF_XDP metadata/hints
  2021-05-21 17:53                             ` Saeed Mahameed
@ 2021-05-23 11:54                               ` Toke Høiland-Jørgensen
  2021-05-25 14:20                                 ` Alexander Lobakin
  0 siblings, 1 reply; 18+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-05-23 11:54 UTC (permalink / raw)
  To: Saeed Mahameed, Jesper Dangaard Brouer, Lobakin, Alexandr
  Cc: Raczynski, Piotr, Zhang, Jessica, Kubiak, Marcin, Joseph, Jithu,
	kurt, Maloor, Kishen, Gomes, Vinicius, Brandeburg, Jesse,
	Swiatkowski, Michal, Plantykow, Marta A, Ong, Boon Leong,
	Desouza, Ederson, Song, Yoong Siang, Czapnik, Lukasz, netdev

Saeed Mahameed <saeed@kernel.org> writes:

> On Fri, 2021-05-21 at 15:31 +0200, Jesper Dangaard Brouer wrote:
>> On Fri, 21 May 2021 10:53:40 +0000
>> "Lobakin, Alexandr" <alexandr.lobakin@intel.com> wrote:
>> 
>> > I've opened two discussions at https://github.com/alobakin/linux,
>> > feel free to join them and/or create new ones to share your thoughts
>> > and concerns.
>> 
>> Thanks Alexandr for keeping the thread/subject alive.
>>  
>> I guess this is a new GitHub features "Discussions".  I've never used
>> that in a project before, lets see how this goes.  The usual approach
>> is discussions over email on netdev (Cc. netdev@vger.kernel.org).
>
> I agree we need full visibility and transparency, i actually recommend:
> bpf@vger.kernel.org

+1, please keep this on the list :)

-Toke


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: AF_XDP metadata/hints
  2021-05-23 11:54                               ` Toke Høiland-Jørgensen
@ 2021-05-25 14:20                                 ` Alexander Lobakin
  2021-05-26  4:51                                   ` John Fastabend
  2021-05-26 14:49                                   ` Jesper Dangaard Brouer
  0 siblings, 2 replies; 18+ messages in thread
From: Alexander Lobakin @ 2021-05-25 14:20 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Saeed Mahameed
  Cc: Alexander Lobakin, Jesper Dangaard Brouer, Raczynski, Piotr,
	Zhang, Jessica, Kubiak, Marcin, Joseph, Jithu, kurt, Maloor,
	Kishen, Gomes, Vinicius, Brandeburg, Jesse, Swiatkowski, Michal,
	Plantykow, Marta A, Ong, Boon Leong, Desouza, Ederson, Song,
	Yoong Siang, Czapnik, Lukasz, bpf

From: Toke Høiland-Jørgensen <toke@redhat.com>
Date: Sun, 23 May 2021 13:54:47 +0200

> Saeed Mahameed <saeed@kernel.org> writes:
> 
> > On Fri, 2021-05-21 at 15:31 +0200, Jesper Dangaard Brouer wrote:
> >> On Fri, 21 May 2021 10:53:40 +0000
> >> "Lobakin, Alexandr" <alexandr.lobakin@intel.com> wrote:
> >>
> >> > I've opened two discussions at https://github.com/alobakin/linux,
> >> > feel free to join them and/or create new ones to share your thoughts
> >> > and concerns.
> >>
> >> Thanks Alexandr for keeping the thread/subject alive.
> >>
> >> I guess this is a new GitHub features "Discussions".  I've never used
> >> that in a project before, lets see how this goes.  The usual approach
> >> is discussions over email on netdev (Cc. netdev@vger.kernel.org).
> >
> > I agree we need full visibility and transparency, i actually recommend:
> > bpf@vger.kernel.org
> 
> +1, please keep this on the list :)

Sure, let's keep it the classic way.
I removed the netdev ML from the CCs and added bpf there.

Regarding the comments from GitHub discussions:

alobakin:

> Since 5.11, it's now possible to obtain a BTF not only for vmlinux,
> but also for modules.
> This will eliminate a need for manually composing and registering a
> BTF inside the driver code, which is 100+ locs for ice for example.
> 
> That's obviously not the most straightforward and trivial way, but
> could help a lot.

saeedtx:

> the point of registering BTF directly from the driver is to allow
> "Flex metadata" meaning that meta data format can be constructed on
> the fly according to user demand.
> BTF for modules is constructed only at compilation time and
> registered only on module load. so there is no way to implement flex
> metadata with vmlinux BTF. we still need a dynamic registration API
> for current and future HW where the HW will provide the BTF
> dynamically.
> 
> I am sure we can find mutliple ways to reduce the 100+ LOC, but the
> goal is to have the dynamic btf_register/unregister API

We initially planned to register just one (or several) predefined
BTF(s) per module/netdevice that would provide a full list of
supported fields. The flexibility of metadata then is in that BPF
core calls for netdevice's ndo_bpf() on BPF program setup and
provides a metadata layout requested by that BPF prog to the driver,
so it could configure its hotpath (current NICs) or a hardware
(future NICs) to build metadata accordingly.
Driver can declare several BTFs (e.g. a "generic" one with things
like hashes and csums one and a custom one) and it would work either
through dynamic registering or through /sys approach.

This is all discussable anyways, we're happy to hear different
opinions and thoughts to collectively choose the optimal way.

> -Toke

Thanks,
Al

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: AF_XDP metadata/hints
  2021-05-25 14:20                                 ` Alexander Lobakin
@ 2021-05-26  4:51                                   ` John Fastabend
  2021-05-26 11:49                                     ` Jesper Dangaard Brouer
  2021-05-26 14:49                                   ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 18+ messages in thread
From: John Fastabend @ 2021-05-26  4:51 UTC (permalink / raw)
  To: Alexander Lobakin, Toke Høiland-Jørgensen, Saeed Mahameed
  Cc: Alexander Lobakin, Jesper Dangaard Brouer, Raczynski, Piotr,
	Zhang, Jessica, Kubiak, Marcin, Joseph, Jithu, kurt, Maloor,
	Kishen, Gomes, Vinicius, Brandeburg, Jesse, Swiatkowski, Michal,
	Plantykow, Marta A, Ong, Boon Leong, Desouza, Ederson, Song,
	Yoong Siang, Czapnik, Lukasz, bpf

Alexander Lobakin wrote:
> From: Toke Høiland-Jørgensen <toke@redhat.com>
> Date: Sun, 23 May 2021 13:54:47 +0200
> 
> > Saeed Mahameed <saeed@kernel.org> writes:
> > 
> > > On Fri, 2021-05-21 at 15:31 +0200, Jesper Dangaard Brouer wrote:
> > >> On Fri, 21 May 2021 10:53:40 +0000
> > >> "Lobakin, Alexandr" <alexandr.lobakin@intel.com> wrote:
> > >>
> > >> > I've opened two discussions at https://github.com/alobakin/linux,
> > >> > feel free to join them and/or create new ones to share your thoughts
> > >> > and concerns.
> > >>
> > >> Thanks Alexandr for keeping the thread/subject alive.
> > >>
> > >> I guess this is a new GitHub features "Discussions".  I've never used
> > >> that in a project before, lets see how this goes.  The usual approach
> > >> is discussions over email on netdev (Cc. netdev@vger.kernel.org).
> > >
> > > I agree we need full visibility and transparency, i actually recommend:
> > > bpf@vger.kernel.org
> > 
> > +1, please keep this on the list :)
> 
> Sure, let's keep it the classic way.
> I removed the netdev ML from the CCs and added bpf there.
> 
> Regarding the comments from GitHub discussions:
> 
> alobakin:
> 
> > Since 5.11, it's now possible to obtain a BTF not only for vmlinux,
> > but also for modules.
> > This will eliminate a need for manually composing and registering a
> > BTF inside the driver code, which is 100+ locs for ice for example.
> > 
> > That's obviously not the most straightforward and trivial way, but
> > could help a lot.
> 
> saeedtx:
> 
> > the point of registering BTF directly from the driver is to allow

There is no paticular reason the BTF has to come from the driver it
could also be generated in userspace or elsewhere. The driver is
handy because at least the driver should always have correct BTF so
you avoid versioning to some extent.

> > "Flex metadata" meaning that meta data format can be constructed on
> > the fly according to user demand.

How is flex metadata configured? I believe this is going to need
some user tooling and a hard reset (ucode load?) in the driver to
transition the hardware state.

My original vision was use P4 (or whatever language) to build
your necessary microcode/firmware/blob. Compile that to your
specific hardware backend NIC. That process should give you
two objects. The BTF and the blob to throw at the hardware.
Letting the driver expose the BTF over /sys/fs/btf/driver.btf
makes a lot of sense as well, but is not strictly necessary
as long as you have some way to get the BTF.

Anyways from a design side IMO hardware configuration should be
done independent of any BPF/BTF operations.

> > BTF for modules is constructed only at compilation time and
> > registered only on module load. so there is no way to implement flex
> > metadata with vmlinux BTF. we still need a dynamic registration API
> > for current and future HW where the HW will provide the BTF
> > dynamically.

+1 can we expose it in /sys/fs/btf/ seems like the reasonable
thing to me.

> > 
> > I am sure we can find mutliple ways to reduce the 100+ LOC, but the
> > goal is to have the dynamic btf_register/unregister API
> 
> We initially planned to register just one (or several) predefined
> BTF(s) per module/netdevice that would provide a full list of
> supported fields. The flexibility of metadata then is in that BPF
> core calls for netdevice's ndo_bpf() on BPF program setup and
> provides a metadata layout requested by that BPF prog to the driver,

I don't think this is the right direction. The driver should be
telling us whats supported or we should "just" know because we
configured it. Overloading ndo_bpf with the config step
seems unnecessarily complex. CO-RE is going to happen way before
we even get to the ndo_bpf() so trying to decide layout this
late is likely not going to work. How would you even know what
to do with a load op?

> so it could configure its hotpath (current NICs) or a hardware
> (future NICs) to build metadata accordingly.
> Driver can declare several BTFs (e.g. a "generic" one with things
> like hashes and csums one and a custom one) and it would work either
> through dynamic registering or through /sys approach.

IMO driver needs to expose one single BTF image of what CO-RE ops
need to be done on a object.

Separate the config of hardware from the BPF infrastructure these
are two separate things.

> 
> This is all discussable anyways, we're happy to hear different
> opinions and thoughts to collectively choose the optimal way.
> 
> > -Toke
> 
> Thanks,
> Al

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: AF_XDP metadata/hints
  2021-05-26  4:51                                   ` John Fastabend
@ 2021-05-26 11:49                                     ` Jesper Dangaard Brouer
  2021-05-26 13:06                                       ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 18+ messages in thread
From: Jesper Dangaard Brouer @ 2021-05-26 11:49 UTC (permalink / raw)
  To: John Fastabend
  Cc: Alexander Lobakin, Toke Høiland-Jørgensen,
	Saeed Mahameed, Raczynski, Piotr, Zhang, Jessica, Kubiak, Marcin,
	Joseph, Jithu, kurt, Maloor, Kishen, Gomes, Vinicius, Brandeburg,
	Jesse, Swiatkowski, Michal, Plantykow, Marta A, Ong, Boon Leong,
	Desouza, Ederson, Song, Yoong Siang, Czapnik, Lukasz, bpf,
	brouer

On Tue, 25 May 2021 21:51:22 -0700
John Fastabend <john.fastabend@gmail.com> wrote:

> Separate the config of hardware from the BPF infrastructure these
> are two separate things.

I fully agree.

How should we handle existing config interfaces?

Let me give some concrete examples. Today there are multiple existing
interfaces to enable/disable NIC hardware features that change what is
available to put in our BTF-layout.

E.g. changing if VLAN is in descriptor:
 # ethtool -K ixgbe1 rx-vlan-offload off
 # ethtool -k ixgbe1 | grep vlan-offload
 rx-vlan-offload: off
 tx-vlan-offload: on

The timestamping features can be listed by ethtool -T (see below
signature), but it is a socket option that enable[1] these
(see SO_TIMESTAMPNS or SOF_TIMESTAMPING_RX_HARDWARE).

Or tuning RSS hash fields:
 [2] https://github.com/stackpath/rxtxcpu/blob/master/Documentation/case-studies/observing-rss-on-ixgbe-advanced-rss-configuration-rss-hash-fields.md

I assume we need to stay compatible and respect the existing config
interfaces, right?

Should we simple leverage existing interfaces?

E.g. tcpdump --time-stamp-type=adapter_unsynced could simple enable the
BTF-layout that contains the RX-timestamp.  This would make it avail to
XDP/AF_XDP and the xdp_frame can also create a SKB with the timestamp.


[1] https://www.kernel.org/doc/html/latest/networking/timestamping.html
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


# ethtool -T ixgbe1
Time stamping parameters for ixgbe1:
Capabilities:
	hardware-transmit
	software-transmit
	hardware-receive
	software-receive
	software-system-clock
	hardware-raw-clock
PTP Hardware Clock: 7
Hardware Transmit Timestamp Modes:
	off
	on
Hardware Receive Filter Modes:
	none
	ptpv1-l4-sync
	ptpv1-l4-delay-req
	ptpv2-event


# ethtool -T igc1
Time stamping parameters for igc1:
Capabilities:
	hardware-transmit
	software-transmit
	hardware-receive
	software-receive
	software-system-clock
	hardware-raw-clock
PTP Hardware Clock: 1
Hardware Transmit Timestamp Modes:
	off
	on
Hardware Receive Filter Modes:
	none
	all


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: AF_XDP metadata/hints
  2021-05-26 11:49                                     ` Jesper Dangaard Brouer
@ 2021-05-26 13:06                                       ` Toke Høiland-Jørgensen
  2021-05-26 15:35                                         ` John Fastabend
  0 siblings, 1 reply; 18+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-05-26 13:06 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, John Fastabend
  Cc: Alexander Lobakin, Saeed Mahameed, Raczynski, Piotr, Zhang,
	Jessica, Kubiak, Marcin, Joseph, Jithu, kurt, Maloor, Kishen,
	Gomes, Vinicius, Brandeburg, Jesse, Swiatkowski, Michal,
	Plantykow, Marta A, Ong, Boon Leong, Desouza, Ederson, Song,
	Yoong Siang, Czapnik, Lukasz, bpf, brouer

Jesper Dangaard Brouer <brouer@redhat.com> writes:

> On Tue, 25 May 2021 21:51:22 -0700
> John Fastabend <john.fastabend@gmail.com> wrote:
>
>> Separate the config of hardware from the BPF infrastructure these
>> are two separate things.
>
> I fully agree.

+1. Another reason why is the case of multiple XDP programs on a single
interface: When attaching these (using freplace as libxdp does it), the
kernel can just check the dest interface when verifying the freplace
program and any rewriting of the bytecode from the BTF format can happen
at that point. Whereas if the BPF attach needs to have side effects,
suddenly we have to copy over all the features to the dispatcher program
and do some kind of set union operation; and what happens if an freplace
program is attached after the fact (same thing with tail calls)?

So in my mind there's no doubt this needs to be:

driver is config'ed -> it changes its exposed BTF metadata -> program is
attached -> verifier rewrites program to access metadata correctly

> How should we handle existing config interfaces?
>
> Let me give some concrete examples. Today there are multiple existing
> interfaces to enable/disable NIC hardware features that change what is
> available to put in our BTF-layout.
>
> E.g. changing if VLAN is in descriptor:
>  # ethtool -K ixgbe1 rx-vlan-offload off
>  # ethtool -k ixgbe1 | grep vlan-offload
>  rx-vlan-offload: off
>  tx-vlan-offload: on
>
> The timestamping features can be listed by ethtool -T (see below
> signature), but it is a socket option that enable[1] these
> (see SO_TIMESTAMPNS or SOF_TIMESTAMPING_RX_HARDWARE).
>
> Or tuning RSS hash fields:
>  [2] https://github.com/stackpath/rxtxcpu/blob/master/Documentation/case-studies/observing-rss-on-ixgbe-advanced-rss-configuration-rss-hash-fields.md
>
> I assume we need to stay compatible and respect the existing config
> interfaces, right?
>
> Should we simple leverage existing interfaces?

Now that ethtool has moved to netlink it should be quite
straight-forward to add a separate subset of commands for configuring
metadata fields; and internally the kernel can map those to the existing
config knobs, no?

E.g., if you tell the kernel you'd like to have the VLAN field as a
metadata field that kinda implies that rx-vlan-offload should be turned
on; etc. Any reason this would break down?

-Toke


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: AF_XDP metadata/hints
  2021-05-25 14:20                                 ` Alexander Lobakin
  2021-05-26  4:51                                   ` John Fastabend
@ 2021-05-26 14:49                                   ` Jesper Dangaard Brouer
  1 sibling, 0 replies; 18+ messages in thread
From: Jesper Dangaard Brouer @ 2021-05-26 14:49 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: Toke Høiland-Jørgensen, Saeed Mahameed, Raczynski,
	Piotr, Zhang, Jessica, Kubiak, Marcin, Joseph, Jithu, kurt,
	Maloor, Kishen, Gomes, Vinicius, Brandeburg, Jesse, Swiatkowski,
	Michal, Plantykow, Marta A, Ong, Boon Leong, Desouza, Ederson,
	Song, Yoong Siang, Czapnik, Lukasz, bpf, brouer

On Tue, 25 May 2021 16:20:27 +0200 Alexander Lobakin <alexandr.lobakin@intel.com> wrote:

> From: Toke Høiland-Jørgensen <toke@redhat.com>
> 
> > Saeed Mahameed <saeed@kernel.org> writes:
> >   
> > > On Fri, 2021-05-21 at 15:31 +0200, Jesper Dangaard Brouer wrote:  
[...]
> > > I agree we need full visibility and transparency, i actually recommend:
> > > bpf@vger.kernel.org  
> > 
> > +1, please keep this on the list :)  
> 
> Sure, let's keep it the classic way.
> I removed the netdev ML from the CCs and added bpf there.

I'm very happy with the momentum, that multiple people and companies are
interested in this XDP-hints subject.  I doubt everyone is subscribed
to bpf@vger.kernel.org and I'm having a hard time remembering who to
Cc.  I see this as a working group effort.

Is anybody against that we create xdp-hints@xdp-project.net (and I/toke
will add people to this list).  This is only to reduce the Cc list, and
we should still keep bpf@vger.kernel.org in Cc as well as our upstream
discussion list.  If nobody object then we will do this in a couple of
days...

(Do let me know offlist, if you want to be removed from this Cc list
prior to creating this... else CC list, you are opting in :-P ).
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: AF_XDP metadata/hints
  2021-05-26 13:06                                       ` Toke Høiland-Jørgensen
@ 2021-05-26 15:35                                         ` John Fastabend
  2021-05-26 15:41                                           ` John Fastabend
                                                             ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: John Fastabend @ 2021-05-26 15:35 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Jesper Dangaard Brouer, John Fastabend
  Cc: Alexander Lobakin, Saeed Mahameed, Raczynski, Piotr, Zhang,
	Jessica, Kubiak, Marcin, Joseph, Jithu, kurt, Maloor, Kishen,
	Gomes, Vinicius, Brandeburg, Jesse, Swiatkowski, Michal,
	Plantykow, Marta A, Ong, Boon Leong, Desouza, Ederson, Song,
	Yoong Siang, Czapnik, Lukasz, bpf, brouer

Toke Høiland-Jørgensen wrote:
> Jesper Dangaard Brouer <brouer@redhat.com> writes:
> 
> > On Tue, 25 May 2021 21:51:22 -0700
> > John Fastabend <john.fastabend@gmail.com> wrote:
> >
> >> Separate the config of hardware from the BPF infrastructure these
> >> are two separate things.
> >
> > I fully agree.
> 
> +1. Another reason why is the case of multiple XDP programs on a single
> interface: When attaching these (using freplace as libxdp does it), the
> kernel can just check the dest interface when verifying the freplace
> program and any rewriting of the bytecode from the BTF format can happen
> at that point. Whereas if the BPF attach needs to have side effects,
> suddenly we have to copy over all the features to the dispatcher program
> and do some kind of set union operation; and what happens if an freplace
> program is attached after the fact (same thing with tail calls)?
> 
> So in my mind there's no doubt this needs to be:
> 
> driver is config'ed -> it changes its exposed BTF metadata -> program is
> attached -> verifier rewrites program to access metadata correctly

Well likely libbpf would do the rewrite I think.

> 
> > How should we handle existing config interfaces?
> >
> > Let me give some concrete examples. Today there are multiple existing
> > interfaces to enable/disable NIC hardware features that change what is
> > available to put in our BTF-layout.
> >
> > E.g. changing if VLAN is in descriptor:
> >  # ethtool -K ixgbe1 rx-vlan-offload off
> >  # ethtool -k ixgbe1 | grep vlan-offload
> >  rx-vlan-offload: off
> >  tx-vlan-offload: on
> >
> > The timestamping features can be listed by ethtool -T (see below
> > signature), but it is a socket option that enable[1] these
> > (see SO_TIMESTAMPNS or SOF_TIMESTAMPING_RX_HARDWARE).
> >
> > Or tuning RSS hash fields:
> >  [2] https://github.com/stackpath/rxtxcpu/blob/master/Documentation/case-studies/observing-rss-on-ixgbe-advanced-rss-configuration-rss-hash-fields.md
> >
> > I assume we need to stay compatible and respect the existing config
> > interfaces, right?

I'm not convinced its a strict requirement, rather its a nice to
have. These are low level ethtool hooks into the hardware its
fine IMO if the hardware just reports off and uses a more robust
configuration channel. In general we should try to get away from
this model where kernel devs are acting as the gate keepers for
all hardware offloads and we explicit add checkboxs that driver
writers can use. The result is the current state of things where
we have very flexible hardware that are not usable from Linux.

> >
> > Should we simple leverage existing interfaces?
> 
> Now that ethtool has moved to netlink it should be quite
> straight-forward to add a separate subset of commands for configuring
> metadata fields; and internally the kernel can map those to the existing
> config knobs, no?

Its unclear to me how you simple expose knobs to reconfigure hardware.
It looks to me that you need to push a blob down to the hardware to
reconfigure it for new parsers, new actions, etc. But, maybe the
folks working on current hardware can speak up.

> 
> E.g., if you tell the kernel you'd like to have the VLAN field as a
> metadata field that kinda implies that rx-vlan-offload should be turned
> on; etc. Any reason this would break down?
> 
> -Toke
> 

Agree driver should be able to map these back onto 'legacy' feature
sets.

I'll still have a basic question though. I've never invested much time
into the hints because its still not clear to me what the use case is?
What would we put in the hints and do we have any data to show it would be
a performance win.

If its a simple hash of the headers then how would we use it? The
map_lookup/updates use IP addrs for keys in Cilium. So I think the
suggestion is to offload the jhash operation? But that requires some
program changes to work. Could someone convince me?

Maybe packet timestamp?

Thanks,
John

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: AF_XDP metadata/hints
  2021-05-26 15:35                                         ` John Fastabend
@ 2021-05-26 15:41                                           ` John Fastabend
  2021-05-26 15:54                                           ` Alexander Lobakin
  2021-05-26 17:38                                           ` Jesper Dangaard Brouer
  2 siblings, 0 replies; 18+ messages in thread
From: John Fastabend @ 2021-05-26 15:41 UTC (permalink / raw)
  To: John Fastabend, Toke Høiland-Jørgensen,
	Jesper Dangaard Brouer, John Fastabend
  Cc: Alexander Lobakin, Saeed Mahameed, Raczynski, Piotr, Zhang,
	Jessica, Kubiak, Marcin, Joseph, Jithu, kurt, Maloor, Kishen,
	Gomes, Vinicius, Brandeburg, Jesse, Swiatkowski, Michal,
	Plantykow, Marta A, Ong, Boon Leong, Desouza, Ederson, Song,
	Yoong Siang, Czapnik, Lukasz, bpf, brouer

John Fastabend wrote:
> Toke Høiland-Jørgensen wrote:
> > Jesper Dangaard Brouer <brouer@redhat.com> writes:
> > 
> > > On Tue, 25 May 2021 21:51:22 -0700
> > > John Fastabend <john.fastabend@gmail.com> wrote:
> > >
> > >> Separate the config of hardware from the BPF infrastructure these
> > >> are two separate things.
> > >
> > > I fully agree.
> > 
> > +1. Another reason why is the case of multiple XDP programs on a single
> > interface: When attaching these (using freplace as libxdp does it), the
> > kernel can just check the dest interface when verifying the freplace
> > program and any rewriting of the bytecode from the BTF format can happen
> > at that point. Whereas if the BPF attach needs to have side effects,
> > suddenly we have to copy over all the features to the dispatcher program
> > and do some kind of set union operation; and what happens if an freplace
> > program is attached after the fact (same thing with tail calls)?
> > 
> > So in my mind there's no doubt this needs to be:
> > 
> > driver is config'ed -> it changes its exposed BTF metadata -> program is
> > attached -> verifier rewrites program to access metadata correctly
> 
> Well likely libbpf would do the rewrite I think.
> 
> > 
> > > How should we handle existing config interfaces?
> > >
> > > Let me give some concrete examples. Today there are multiple existing
> > > interfaces to enable/disable NIC hardware features that change what is
> > > available to put in our BTF-layout.
> > >
> > > E.g. changing if VLAN is in descriptor:
> > >  # ethtool -K ixgbe1 rx-vlan-offload off
> > >  # ethtool -k ixgbe1 | grep vlan-offload
> > >  rx-vlan-offload: off
> > >  tx-vlan-offload: on
> > >
> > > The timestamping features can be listed by ethtool -T (see below
> > > signature), but it is a socket option that enable[1] these
> > > (see SO_TIMESTAMPNS or SOF_TIMESTAMPING_RX_HARDWARE).
> > >
> > > Or tuning RSS hash fields:
> > >  [2] https://github.com/stackpath/rxtxcpu/blob/master/Documentation/case-studies/observing-rss-on-ixgbe-advanced-rss-configuration-rss-hash-fields.md
> > >
> > > I assume we need to stay compatible and respect the existing config
> > > interfaces, right?
> 
> I'm not convinced its a strict requirement, rather its a nice to
> have. These are low level ethtool hooks into the hardware its
> fine IMO if the hardware just reports off and uses a more robust
> configuration channel. In general we should try to get away from
> this model where kernel devs are acting as the gate keepers for
> all hardware offloads and we explicit add checkboxs that driver
> writers can use. The result is the current state of things where
> we have very flexible hardware that are not usable from Linux.
> 
> > >
> > > Should we simple leverage existing interfaces?
> > 
> > Now that ethtool has moved to netlink it should be quite
> > straight-forward to add a separate subset of commands for configuring
> > metadata fields; and internally the kernel can map those to the existing
> > config knobs, no?
> 
> Its unclear to me how you simple expose knobs to reconfigure hardware.
> It looks to me that you need to push a blob down to the hardware to
> reconfigure it for new parsers, new actions, etc. But, maybe the
> folks working on current hardware can speak up.
> 
> > 
> > E.g., if you tell the kernel you'd like to have the VLAN field as a
> > metadata field that kinda implies that rx-vlan-offload should be turned
> > on; etc. Any reason this would break down?
> > 
> > -Toke
> > 
> 
> Agree driver should be able to map these back onto 'legacy' feature
> sets.
> 
> I'll still have a basic question though. I've never invested much time
> into the hints because its still not clear to me what the use case is?
> What would we put in the hints and do we have any data to show it would be
> a performance win.
> 
> If its a simple hash of the headers then how would we use it? The
> map_lookup/updates use IP addrs for keys in Cilium. So I think the
> suggestion is to offload the jhash operation? But that requires some
> program changes to work. Could someone convince me?
> 
> Maybe packet timestamp?
> 
> Thanks,
> John

Also I'll add before we add any code to the kernel for this we should have
a fully working hardware workflow. We don't actually need any kernel bits
to do above today.

I driver simply needs to generate a BTF file along with the hardware
configuration (firmware update is fine for hard update) and users could
start to use hints. libbpf knows how to work with BTF and exposing the
BTF through /sys/fs/ is just a nice to have.

.John

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: AF_XDP metadata/hints
  2021-05-26 15:35                                         ` John Fastabend
  2021-05-26 15:41                                           ` John Fastabend
@ 2021-05-26 15:54                                           ` Alexander Lobakin
  2021-05-26 16:33                                             ` John Fastabend
  2021-05-26 16:41                                             ` Alexei Starovoitov
  2021-05-26 17:38                                           ` Jesper Dangaard Brouer
  2 siblings, 2 replies; 18+ messages in thread
From: Alexander Lobakin @ 2021-05-26 15:54 UTC (permalink / raw)
  To: John Fastabend
  Cc: Alexander Lobakin, Toke Høiland-Jørgensen,
	Jesper Dangaard Brouer, Saeed Mahameed, Raczynski, Piotr, Zhang,
	Jessica, Kubiak, Marcin, Joseph, Jithu, kurt, Maloor, Kishen,
	Gomes, Vinicius, Brandeburg, Jesse, Swiatkowski, Michal,
	Plantykow, Marta A, Ong, Boon Leong, Desouza, Ederson, Song,
	Yoong Siang, Czapnik, Lukasz, bpf, xdp-hints

From: John Fastabend <john.fastabend@gmail.com>
Date: Wed, 26 May 2021 08:35:49 -0700

Hi all,

>Toke Høiland-Jørgensen wrote:
>> Jesper Dangaard Brouer <brouer@redhat.com> writes:
>> 
>> > On Tue, 25 May 2021 21:51:22 -0700
>> > John Fastabend <john.fastabend@gmail.com> wrote:
>> >
>> >> Separate the config of hardware from the BPF infrastructure these
>> >> are two separate things.
>> >
>> > I fully agree.

It was not about hardware feature reprogramming like csumming and
stuff, only about composing the medatada according to the prog's
request.

>> +1. Another reason why is the case of multiple XDP programs on a single
>> interface: When attaching these (using freplace as libxdp does it), the
>> kernel can just check the dest interface when verifying the freplace
>> program and any rewriting of the bytecode from the BTF format can happen
>> at that point. Whereas if the BPF attach needs to have side effects,
>> suddenly we have to copy over all the features to the dispatcher program
>> and do some kind of set union operation; and what happens if an freplace
>> program is attached after the fact (same thing with tail calls)?
>> 
>> So in my mind there's no doubt this needs to be:
>> 
>> driver is config'ed -> it changes its exposed BTF metadata -> program is
>> attached -> verifier rewrites program to access metadata correctly
>
>Well likely libbpf would do the rewrite I think.

So your proposal is to not compose metadata according to the prog's
request, but rather reprogram the prog itself to access metadata
accordingly? Sounds very nice.

If follow this path, is it something like this?

1. Driver exposes the fields layout (e.g. Rx/Tx descriptor fields)
via BTF to the BPF layer.
2. When an XDP prog is attached, BPF reprograms it to look for the
required fields at the right offset.

>> 
>> > How should we handle existing config interfaces?
>> >
>> > Let me give some concrete examples. Today there are multiple existing
>> > interfaces to enable/disable NIC hardware features that change what is
>> > available to put in our BTF-layout.
>> >
>> > E.g. changing if VLAN is in descriptor:
>> >  # ethtool -K ixgbe1 rx-vlan-offload off
>> >  # ethtool -k ixgbe1 | grep vlan-offload
>> >  rx-vlan-offload: off
>> >  tx-vlan-offload: on
>> >
>> > The timestamping features can be listed by ethtool -T (see below
>> > signature), but it is a socket option that enable[1] these
>> > (see SO_TIMESTAMPNS or SOF_TIMESTAMPING_RX_HARDWARE).
>> >
>> > Or tuning RSS hash fields:
>> >  [2] https://github.com/stackpath/rxtxcpu/blob/master/Documentation/case-studies/observing-rss-on-ixgbe-advanced-rss-configuration-rss-hash-fields.md
>> >
>> > I assume we need to stay compatible and respect the existing config
>> > interfaces, right?

Again, XDP Hints won't change any netdev features and stuff, only
compose provide the hardware provided fields that are currently
inaccessible by the XDP prog and say cpumap code, but that are
highly needed (cpumap builds skbs without csums -> GRO layer
consumes CPU time to calculate it manually, without RSS hash ->
Flow Dissector consumes CPU time to calculate it manually +
possible NAPI bucket misses etc.).

So, neither Ethtool (that doesn't belong to XDP at all) nor anything
else that changes hardware behaviour is involved.

>I'm not convinced its a strict requirement, rather its a nice to
>have. These are low level ethtool hooks into the hardware its
>fine IMO if the hardware just reports off and uses a more robust
>configuration channel. In general we should try to get away from
>this model where kernel devs are acting as the gate keepers for
>all hardware offloads and we explicit add checkboxs that driver
>writers can use. The result is the current state of things where
>we have very flexible hardware that are not usable from Linux.
>
>> >
>> > Should we simple leverage existing interfaces?
>> 
>> Now that ethtool has moved to netlink it should be quite
>> straight-forward to add a separate subset of commands for configuring
>> metadata fields; and internally the kernel can map those to the existing
>> config knobs, no?
>
>Its unclear to me how you simple expose knobs to reconfigure hardware.
>It looks to me that you need to push a blob down to the hardware to
>reconfigure it for new parsers, new actions, etc. But, maybe the
>folks working on current hardware can speak up.
>
>> 
>> E.g., if you tell the kernel you'd like to have the VLAN field as a
>> metadata field that kinda implies that rx-vlan-offload should be turned
>> on; etc. Any reason this would break down?
>> 
>> -Toke
>> 

That's right. If you want to have a VLAN tag in the medatdata, make
sure Rx offloading is enabled. Same with csums and the rest. No
explicit switches on prog insertion/removing.

>Agree driver should be able to map these back onto 'legacy' feature
>sets.
>
>I'll still have a basic question though. I've never invested much time
>into the hints because its still not clear to me what the use case is?
>What would we put in the hints and do we have any data to show it would be
>a performance win.

As I said, currently both XDP Rx and XDP Tx paths suffer from that
they lack some essential info like Rx csums and hashes and e.g.
Tx csum status. XDP Hints are about to place these fields into
the metadata area that is located right before the frame, so either
BPF prog or mentioned cpumap could access it.

>If its a simple hash of the headers then how would we use it? The
>map_lookup/updates use IP addrs for keys in Cilium. So I think the
>suggestion is to offload the jhash operation? But that requires some
>program changes to work. Could someone convince me?
>
>Maybe packet timestamp?
>
>Thanks,
>John

Thanks,
Al

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: AF_XDP metadata/hints
  2021-05-26 15:54                                           ` Alexander Lobakin
@ 2021-05-26 16:33                                             ` John Fastabend
  2021-05-26 18:44                                               ` Jesper Dangaard Brouer
  2021-05-26 16:41                                             ` Alexei Starovoitov
  1 sibling, 1 reply; 18+ messages in thread
From: John Fastabend @ 2021-05-26 16:33 UTC (permalink / raw)
  To: Alexander Lobakin, John Fastabend
  Cc: Alexander Lobakin, Toke Høiland-Jørgensen,
	Jesper Dangaard Brouer, Saeed Mahameed, Raczynski, Piotr, Zhang,
	Jessica, Kubiak, Marcin, Joseph, Jithu, kurt, Maloor, Kishen,
	Gomes, Vinicius, Brandeburg, Jesse, Swiatkowski, Michal,
	Plantykow, Marta A, Ong, Boon Leong, Desouza, Ederson, Song,
	Yoong Siang, Czapnik, Lukasz, bpf, xdp-hints

Alexander Lobakin wrote:
> From: John Fastabend <john.fastabend@gmail.com>
> Date: Wed, 26 May 2021 08:35:49 -0700
> 
> Hi all,
> 
> >Toke Høiland-Jørgensen wrote:
> >> Jesper Dangaard Brouer <brouer@redhat.com> writes:
> >> 
> >> > On Tue, 25 May 2021 21:51:22 -0700
> >> > John Fastabend <john.fastabend@gmail.com> wrote:
> >> >
> >> >> Separate the config of hardware from the BPF infrastructure these
> >> >> are two separate things.
> >> >
> >> > I fully agree.
> 
> It was not about hardware feature reprogramming like csumming and
> stuff, only about composing the medatada according to the prog's
> request.
> 
> >> +1. Another reason why is the case of multiple XDP programs on a single
> >> interface: When attaching these (using freplace as libxdp does it), the
> >> kernel can just check the dest interface when verifying the freplace
> >> program and any rewriting of the bytecode from the BTF format can happen
> >> at that point. Whereas if the BPF attach needs to have side effects,
> >> suddenly we have to copy over all the features to the dispatcher program
> >> and do some kind of set union operation; and what happens if an freplace
> >> program is attached after the fact (same thing with tail calls)?
> >> 
> >> So in my mind there's no doubt this needs to be:
> >> 
> >> driver is config'ed -> it changes its exposed BTF metadata -> program is
> >> attached -> verifier rewrites program to access metadata correctly
> >
> >Well likely libbpf would do the rewrite I think.
> 
> So your proposal is to not compose metadata according to the prog's
> request, but rather reprogram the prog itself to access metadata
> accordingly? Sounds very nice.

Correct, otherwise you end up trying to parse programs and infer what
its trying to access and I don't want that logic in the driver. Or you
have to go into the BPF core and try to get it to pass the driver metadata.
And I don't want to add complexity to the BPF core bits.

> 
> If follow this path, is it something like this?
> 
> 1. Driver exposes the fields layout (e.g. Rx/Tx descriptor fields)
> via BTF to the BPF layer.

Not to the kernel BPF layer but to userspace so we can do CO-RE in
userspace before loading the program. While we learn how to use
this I expect we can just pass around a BTF file its not needed
to have the driver interface expose it as a first step.

Presumably you can generate the BTF when you configure the hardware.
I assume hardware config is close to a firmware update.

> 2. When an XDP prog is attached, BPF reprograms it to look for the
> required fields at the right offset.

User space can rewrite the fields using the existing infrastructure.
Dumb snippit,

  struct my_metadata m = (struct my_metadata *) data->metadata

  my_foo = m->foo;

Then CO-RE layer will know how to rewrite that m->foo to the right
offset into the metadata if we give it the normal CO-RE annotations.

> 
> >> 
> >> > How should we handle existing config interfaces?
> >> >
> >> > Let me give some concrete examples. Today there are multiple existing
> >> > interfaces to enable/disable NIC hardware features that change what is
> >> > available to put in our BTF-layout.
> >> >
> >> > E.g. changing if VLAN is in descriptor:
> >> >  # ethtool -K ixgbe1 rx-vlan-offload off
> >> >  # ethtool -k ixgbe1 | grep vlan-offload
> >> >  rx-vlan-offload: off
> >> >  tx-vlan-offload: on
> >> >
> >> > The timestamping features can be listed by ethtool -T (see below
> >> > signature), but it is a socket option that enable[1] these
> >> > (see SO_TIMESTAMPNS or SOF_TIMESTAMPING_RX_HARDWARE).
> >> >
> >> > Or tuning RSS hash fields:
> >> >  [2] https://github.com/stackpath/rxtxcpu/blob/master/Documentation/case-studies/observing-rss-on-ixgbe-advanced-rss-configuration-rss-hash-fields.md
> >> >
> >> > I assume we need to stay compatible and respect the existing config
> >> > interfaces, right?
> 
> Again, XDP Hints won't change any netdev features and stuff, only
> compose provide the hardware provided fields that are currently
> inaccessible by the XDP prog and say cpumap code, but that are
> highly needed (cpumap builds skbs without csums -> GRO layer
> consumes CPU time to calculate it manually, without RSS hash ->
> Flow Dissector consumes CPU time to calculate it manually +
> possible NAPI bucket misses etc.).

Thats a specific cpumap problem correct? In general checksums work
as expected?

> 
> So, neither Ethtool (that doesn't belong to XDP at all) nor anything
> else that changes hardware behaviour is involved.

Good.

> 
> >I'm not convinced its a strict requirement, rather its a nice to
> >have. These are low level ethtool hooks into the hardware its
> >fine IMO if the hardware just reports off and uses a more robust
> >configuration channel. In general we should try to get away from
> >this model where kernel devs are acting as the gate keepers for
> >all hardware offloads and we explicit add checkboxs that driver
> >writers can use. The result is the current state of things where
> >we have very flexible hardware that are not usable from Linux.
> >
> >> >
> >> > Should we simple leverage existing interfaces?
> >> 
> >> Now that ethtool has moved to netlink it should be quite
> >> straight-forward to add a separate subset of commands for configuring
> >> metadata fields; and internally the kernel can map those to the existing
> >> config knobs, no?
> >
> >Its unclear to me how you simple expose knobs to reconfigure hardware.
> >It looks to me that you need to push a blob down to the hardware to
> >reconfigure it for new parsers, new actions, etc. But, maybe the
> >folks working on current hardware can speak up.
> >
> >> 
> >> E.g., if you tell the kernel you'd like to have the VLAN field as a
> >> metadata field that kinda implies that rx-vlan-offload should be turned
> >> on; etc. Any reason this would break down?
> >> 
> >> -Toke
> >> 
> 
> That's right. If you want to have a VLAN tag in the medatdata, make
> sure Rx offloading is enabled. Same with csums and the rest. No
> explicit switches on prog insertion/removing.
> 
> >Agree driver should be able to map these back onto 'legacy' feature
> >sets.
> >
> >I'll still have a basic question though. I've never invested much time
> >into the hints because its still not clear to me what the use case is?
> >What would we put in the hints and do we have any data to show it would be
> >a performance win.
> 
> As I said, currently both XDP Rx and XDP Tx paths suffer from that
> they lack some essential info like Rx csums and hashes and e.g.
> Tx csum status. XDP Hints are about to place these fields into
> the metadata area that is located right before the frame, so either
> BPF prog or mentioned cpumap could access it.

I'm not convinced hashes and csum are so interesting but show me some
data. Also my admittedly rough understanding of cpumap is that it helps
the case where hardware RSS is not sufficient. Seeing your coming from
the Intel hardware side why not fix the RSS root problem instead
of using cpumap at all? I think your hardware is flexible enough.

I would really prefer to see example use cases that are more generic
than the cpumap case.

> 
> >If its a simple hash of the headers then how would we use it? The
> >map_lookup/updates use IP addrs for keys in Cilium. So I think the
> >suggestion is to offload the jhash operation? But that requires some
> >program changes to work. Could someone convince me?
> >
> >Maybe packet timestamp?
> >
> >Thanks,
> >John
> 
> Thanks,
> Al



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: AF_XDP metadata/hints
  2021-05-26 15:54                                           ` Alexander Lobakin
  2021-05-26 16:33                                             ` John Fastabend
@ 2021-05-26 16:41                                             ` Alexei Starovoitov
  2021-05-26 17:01                                               ` John Fastabend
  1 sibling, 1 reply; 18+ messages in thread
From: Alexei Starovoitov @ 2021-05-26 16:41 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: John Fastabend, Toke Høiland-Jørgensen,
	Jesper Dangaard Brouer, Saeed Mahameed, Raczynski, Piotr, Zhang,
	Jessica, Kubiak, Marcin, Joseph, Jithu, kurt, Maloor, Kishen,
	Gomes, Vinicius, Brandeburg, Jesse, Swiatkowski, Michal,
	Plantykow, Marta A, Ong, Boon Leong, Desouza, Ederson, Song,
	Yoong Siang, Czapnik, Lukasz, bpf, xdp-hints

On Wed, May 26, 2021 at 8:57 AM Alexander Lobakin
<alexandr.lobakin@intel.com> wrote:
> >
> >Well likely libbpf would do the rewrite I think.
>
> So your proposal is to not compose metadata according to the prog's
> request, but rather reprogram the prog itself to access metadata
> accordingly? Sounds very nice.
>
> If follow this path, is it something like this?
>
> 1. Driver exposes the fields layout (e.g. Rx/Tx descriptor fields)
> via BTF to the BPF layer.
> 2. When an XDP prog is attached, BPF reprograms it to look for the
> required fields at the right offset.

The driver doesn't need to expose it directly via ndo.
There is already generic support for BTF in modules
and support for encoding btf_id for further use inside verifier
and other components.
I think the driver can simply do:
BTF_ID_LIST(known_packet_fields)
and the bpf core will pick it from there.
While libbpf will do a CO-RE style re-write when driver layout changes.
Ideally bpf core doesn't need to be involved and it's done completely in libbpf.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: AF_XDP metadata/hints
  2021-05-26 16:41                                             ` Alexei Starovoitov
@ 2021-05-26 17:01                                               ` John Fastabend
  0 siblings, 0 replies; 18+ messages in thread
From: John Fastabend @ 2021-05-26 17:01 UTC (permalink / raw)
  To: Alexei Starovoitov, Alexander Lobakin
  Cc: John Fastabend, Toke Høiland-Jørgensen,
	Jesper Dangaard Brouer, Saeed Mahameed, Raczynski, Piotr, Zhang,
	Jessica, Kubiak, Marcin, Joseph, Jithu, kurt, Maloor, Kishen,
	Gomes, Vinicius, Brandeburg, Jesse, Swiatkowski, Michal,
	Plantykow, Marta A, Ong, Boon Leong, Desouza, Ederson, Song,
	Yoong Siang, Czapnik, Lukasz, bpf, xdp-hints

Alexei Starovoitov wrote:
> On Wed, May 26, 2021 at 8:57 AM Alexander Lobakin
> <alexandr.lobakin@intel.com> wrote:
> > >
> > >Well likely libbpf would do the rewrite I think.
> >
> > So your proposal is to not compose metadata according to the prog's
> > request, but rather reprogram the prog itself to access metadata
> > accordingly? Sounds very nice.
> >
> > If follow this path, is it something like this?
> >
> > 1. Driver exposes the fields layout (e.g. Rx/Tx descriptor fields)
> > via BTF to the BPF layer.
> > 2. When an XDP prog is attached, BPF reprograms it to look for the
> > required fields at the right offset.
> 
> The driver doesn't need to expose it directly via ndo.

+1

> There is already generic support for BTF in modules
> and support for encoding btf_id for further use inside verifier
> and other components.
> I think the driver can simply do:
> BTF_ID_LIST(known_packet_fields)
> and the bpf core will pick it from there.
> While libbpf will do a CO-RE style re-write when driver layout changes.
> Ideally bpf core doesn't need to be involved and it's done completely in libbpf.

Agree, I don't see any reason bpf core is needed for any of the above.

The only downside of BTF_ID_LIST is its not dynamic, a ucode update
might move the parser and fields around. But that can be handled
in userspace, by publishing a supplemental BTF file, to start with.
Once we have known value and understand the use case we can discuss
the dynamic case exposed from kernel side. Even the static case
with BTF_ID_LIST would be a first step.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: AF_XDP metadata/hints
  2021-05-26 15:35                                         ` John Fastabend
  2021-05-26 15:41                                           ` John Fastabend
  2021-05-26 15:54                                           ` Alexander Lobakin
@ 2021-05-26 17:38                                           ` Jesper Dangaard Brouer
  2 siblings, 0 replies; 18+ messages in thread
From: Jesper Dangaard Brouer @ 2021-05-26 17:38 UTC (permalink / raw)
  To: John Fastabend
  Cc: Toke Høiland-Jørgensen, Alexander Lobakin,
	Saeed Mahameed, Raczynski, Piotr, Zhang, Jessica, Kubiak, Marcin,
	Joseph, Jithu, kurt, Maloor, Kishen, Gomes, Vinicius, Brandeburg,
	Jesse, Swiatkowski, Michal, Plantykow, Marta A, Ong, Boon Leong,
	Desouza, Ederson, Song, Yoong Siang, Czapnik, Lukasz, bpf,
	brouer

On Wed, 26 May 2021 08:35:49 -0700
John Fastabend <john.fastabend@gmail.com> wrote:

> I'll still have a basic question though. I've never invested much time
> into the hints because its still not clear to me what the use case is?
> What would we put in the hints and do we have any data to show it would be
> a performance win.

I've documented and measured[1] the performance overhead of the missing
checksum for UDP packets when XDP-redirecting into veth (that does
XDP_PASS).  Full delivery into a socket we can save 8% (54.28 ns /
+109Kpps).  Lorenzo is working patches outside XDP-hints for this, but
it only handle CHECKSUM_UNNECESSARY, and if we need CHECKSUM_COMPLETE
then we also need XDP-hints/metadata (for storing skb->csum).

[1] https://github.com/xdp-project/xdp-project/blob/master/areas/core/xdp_frame01_checksum.org
 
> If its a simple hash of the headers then how would we use it?

Even with a simple/smaller hash you can tune the RSS-hash on parts of
the packet you like, see[2]. That would be valuable for doing lookups
in BPF-maps.

[2] https://github.com/stackpath/rxtxcpu/tree/master/Documentation/case-studies

For cpumap redirect I would like to spread packets with this RX-hash,
as I can avoid parsing packets headers on RX-CPU.

The mlx5 NIC support 64-bit unique flow hash, that you could use as a
lookup key in your (Cilium) conntrack table, or a container/sockmap
redirect-tracking table.

> The map_lookup/updates use IP addrs for keys in Cilium. So I think the
> suggestion is to offload the jhash operation? But that requires some
> program changes to work. Could someone convince me?

Is my explanations enough, or are you still not convinced? 
 
> Maybe packet timestamp?

I have a concrete use-case that needs packet timestamps for AF_XDP. It
is the control system inside a wind-turbine that use a time-triggered
Real-Time protocol.  I actually both need hardware RX-timestamps and
TX-timestamps.  A lot it lacking on the TX-side to allow AF_XDP to send
down a transmission timestamp, but I have a real-use-case that needs
this (before end-of year).  I have hardware i210 and i225 chips in my
testlab so I can get this working.

Thanks you John for engaging and challenging us in our design of
XDP-hints, I truly appreciate your feedback! :-)
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: AF_XDP metadata/hints
  2021-05-26 16:33                                             ` John Fastabend
@ 2021-05-26 18:44                                               ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 18+ messages in thread
From: Jesper Dangaard Brouer @ 2021-05-26 18:44 UTC (permalink / raw)
  To: John Fastabend
  Cc: Alexander Lobakin, Toke Høiland-Jørgensen,
	Saeed Mahameed, Raczynski, Piotr, Zhang, Jessica, Kubiak, Marcin,
	Joseph, Jithu, kurt, Maloor, Kishen, Gomes, Vinicius, Brandeburg,
	Jesse, Swiatkowski, Michal, Plantykow, Marta A, Ong, Boon Leong,
	Desouza, Ederson, Song, Yoong Siang, Czapnik, Lukasz, bpf,
	xdp-hints, brouer, David Ahern

On Wed, 26 May 2021 09:33:11 -0700
John Fastabend <john.fastabend@gmail.com> wrote:

> Alexander Lobakin wrote:
> > From: John Fastabend <john.fastabend@gmail.com>
> > Date: Wed, 26 May 2021 08:35:49 -0700
> > 
[...]
> > >> >
> > >> > I assume we need to stay compatible and respect the existing config
> > >> > interfaces, right?  
> > 
> > Again, XDP Hints won't change any netdev features and stuff, only
> > compose provide the hardware provided fields that are currently
> > inaccessible by the XDP prog and say cpumap code, but that are
> > highly needed (cpumap builds skbs without csums -> GRO layer
> > consumes CPU time to calculate it manually, without RSS hash ->
> > Flow Dissector consumes CPU time to calculate it manually +
> > possible NAPI bucket misses etc.).  
> 
> Thats a specific cpumap problem correct?

No, it is not a specific cpumap problem.  It is actually a general
XDP_REDIRECT problem.  The veth container use-case is also hit by this
slowdown due to lacking HW-csum and RSS-hash, as describe by Alexander.

It also exists for redirect into Virtual Machines, which is David
Ahern's use-case actually.

> In general checksums work as expected?

Nope, the checksums are non-existing for XDP_REDIRECT'ed packets.
 
> [...] 
> I'm not convinced hashes and csum are so interesting but show me some
> data. 

Checksum overhead measurements for veth container use-case see here[1].
 [1] https://github.com/xdp-project/xdp-project/blob/master/areas/core/xdp_frame01_checksum.org
 

> Also my admittedly rough understanding of cpumap is that it helps
> the case where hardware RSS is not sufficient. 

I feel the need to explain what I'm using cpumap for, so bear with me.

In xdp-cpumap-tc[2] the XDP cpumap redirect solves the TC Qdisc locking
problem.  This runs in production at an ISP that uses MQ+HTB shaping.
It makes sure customer assigned IP-addresses (can be multiple) are
redirected to the same CPU. Thus, the customer specific HTB shaper
works correctly. (Multiple HTB qdisc are attached under MQ [6]).

 [2] https://github.com/xdp-project/xdp-cpumap-tc
 [6] https://github.com/xdp-project/xdp-cpumap-tc/blob/master/bin/tc_mq_htb_setup_example.sh

In traffic-pacing-edt[3] the cpumap code[4] maps VLAN tagged traffic to
the same CPU.  This allows the TC-BPF code[5] to be "concurrency
correct" as it updates the VLAN based EDT-rate-limit BPF-map without any
atomic operations.  This runs in production at another ISP, that need
to shape (traffic pace) 1Gbit/s customer on a 10Gbit/s link due to
crappy 1G GPON switches closer to the end-customer.  It would be useful
to get the offloaded VLAN info in XDP-metadata.

 [3] https://github.com/xdp-project/bpf-examples/tree/master/traffic-pacing-edt
 [4] https://github.com/xdp-project/bpf-examples/blob/master/traffic-pacing-edt/xdp_cpumap_qinq.c
 [5] https://github.com/xdp-project/bpf-examples/blob/master/traffic-pacing-edt/edt_pacer_vlan.c

> Seeing your coming from the Intel hardware side why not fix the RSS
> root problem instead of using cpumap at all? I think your hardware is
> flexible enough.

Yes, please fix i40e hardware/firmware to support Q-in-Q packets.  We
are actaully hitting this at a customer site.  But my above cpumap
use-cases were not due to bad RSS-hashing.

 
> I would really prefer to see example use cases that are more generic
> than the cpumap case.


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: AF_XDP metadata/hints
       [not found]         ` <DM4PR11MB5422FE9618B3692D48FCE4EA84549@DM4PR11MB5422.namprd11.prod.outlook.com>
       [not found]           ` <20210510185029.1ca6f872@carbon>
@ 2021-06-05  0:32           ` Desouza, Ederson
  2021-06-11 19:25             ` Alexander Lobakin
  1 sibling, 1 reply; 18+ messages in thread
From: Desouza, Ederson @ 2021-06-05  0:32 UTC (permalink / raw)
  To: Lobakin, Alexandr, brouer
  Cc: bpf, Raczynski, Piotr, Zhang, Jessica, Kubiak, Marcin, Joseph,
	Jithu, xdp-hints, Maloor, Kishen, Gomes, Vinicius, Brandeburg,
	Jesse, Swiatkowski, Michal, Plantykow, Marta A, Ong, Boon Leong,
	Czapnik, Lukasz, Song, Yoong Siang

On Mon, 2021-05-10 at 15:49 +0000, Lobakin, Alexandr wrote:
> Hi,
> 
> So here it is: https://github.com/alobakin/linux/branches

Quick question: are you planning any update here (like incorporating
the CO-RE feedback)? 

I'm thinking of igc, and wondering if you have something in your
pipeline already that I could reuse for some tests =D

> 
> Default branch is just merged with v5.13-rc1 without any testing, don't expect much from it. I'll fix it a bit later.
> 
> Al
> 
> -----Original Message-----
> From: Jesper Dangaard Brouer <brouer@redhat.com> 
> Sent: Friday, May 7, 2021 1:11 PM
> To: Lobakin, Alexandr <alexandr.lobakin@intel.com>
> Cc: Kubiak, Marcin <marcin.kubiak@intel.com>; Ong, Boon Leong <boon.leong.ong@intel.com>; Brandeburg, Jesse <jesse.brandeburg@intel.com>; Desouza, Ederson <ederson.desouza@intel.com>; Swiatkowski, Michal <michal.swiatkowski@intel.com>; Gomes, Vinicius <vinicius.gomes@intel.com>; Maloor, Kishen <kishen.maloor@intel.com>; Zhang, Jessica <jessica.zhang@intel.com>; Joseph, Jithu <jithu.joseph@intel.com>; Plantykow, Marta A <marta.a.plantykow@intel.com>; Czapnik, Lukasz <lukasz.czapnik@intel.com>; Raczynski, Piotr <piotr.raczynski@intel.com>; Song, Yoong Siang <yoong.siang.song@intel.com>; brouer@redhat.com
> Subject: Re: AF_XDP metadata/hints
> 
> (Answer inlined below)
> 
> On Fri, 7 May 2021 10:08:51 +0000
> "Lobakin, Alexandr" <alexandr.lobakin@intel.com> wrote:
> 
> > + Jesper.
> 
> Thanks for including me!  Just see me as a resource that can help out on this project, both coding and (performance) testing.  I have a customer use-case related to LaunchTime mode using AF_XDP, which is one of the more tricky cases (due to XDP lacking a proper TX layer, that can pushback if TX-queue is full/paused).
> 
> 
> > So long story short: driver advertises the XDP hints it supports (Rx:
> > RSS hash, csum status or complete csum, C/S-VLAN tag if stripped etc., 
> > Tx: csum offload etc.) on netdev probing so BPF prog could request for 
> > them.
> 
> Yes exactly and veth and devmap also want to consume these e.g. RSS hash + csum status when they create an SKB based on an xdp_frame.
> 
> (Note David Ahern and I have patches placing the csum status bits in xdp_frame to transfer that info (xdp_frame is located in memory top).
> note performance analysis here[1])
> [1] https://github.com/xdp-project/xdp-project/blob/master/areas/core/xdp_frame01_checksum.org
> 
> 
> > There's a plan to provide 2 types of hints: generic ones (that almost 
> > every NIC is capable to provide) and custom ones (up to 
> > vendor/developers). Drivers that don't want to support hints at all 
> > could opt out from the generic ones just by setting their 
> > xdp.data_meta to xdp.data + 1, just how it's now done in the mainline.
> 
> I agree the NIC need to support different types.  And different types per packet as e.g. PTP timestamps might not be in every packet.
> 
> > XDP Hints will be obviously stored in metadata,
> 
> I have considered[2] storing XDP hints in xdp_frame area (top of memory), but I'm more and more convinced is should be stored in metadata area... mostly because this will allow future hardware to write this data for us.
> 
> [2] https://people.netfilter.org/hawk/presentations/KernelRecipes2019/xdp-netstack-concert.pdf
> 
> I think there is a small performance problem when writing into metadata area.  Because this is a cacheline that might be (semi)cold.  Hint we prefetch xdp_frame area and we could do the same for metadata, but I still see a slowdown when converting xdp_buff to xdp_frame.
> 
> XDP is extremely performance sensitive.  Even if metadata area is in L1 cache it will still cost a couple of nanosec to update the fields.
> This will be measureable in an XDP_DROP test.  The easiest workaround I see is: that we allow the config interface for XDP-hints to allow disabling this feature.
> 
> 
> > the layout is not written in stone yet.
> 
> Exactly, the main reason/motivation for this work is to allow NIC vendors to invent new hardware hints without having to wait for the Linux kernel to adopt these.  Instead they are instant available via BPF and BTF-info that describe this layout.
> 
>  
> > I'll be preparing an open repo with our drafts today, let you know 
> > once it will be ready and available so you could take a look, review, 
> > fork and go forth with playing with it.
> 
> Great!!! :-)))
> 
> > Any of you can also share any bits of code or thoughts that you have 
> > and may want to share.
> > 
> > Thanks,
> > Al
> > 
> > -----Original Message-----
> > From: Kubiak, Marcin <marcin.kubiak@intel.com>
> > Sent: Friday, May 7, 2021 9:35 AM
> > To: Ong, Boon Leong <boon.leong.ong@intel.com>; Brandeburg, Jesse 
> > <jesse.brandeburg@intel.com>; Desouza, Ederson 
> > <ederson.desouza@intel.com>; Lobakin, Alexandr 
> > <alexandr.lobakin@intel.com>; Swiatkowski, Michal 
> > <michal.swiatkowski@intel.com>
> > Cc: Gomes, Vinicius <vinicius.gomes@intel.com>; Maloor, Kishen 
> > <kishen.maloor@intel.com>; Zhang, Jessica <jessica.zhang@intel.com>; 
> > Joseph, Jithu <jithu.joseph@intel.com>; Plantykow, Marta A 
> > <marta.a.plantykow@intel.com>; Lobakin, Alexandr 
> > <alexandr.lobakin@intel.com>; Czapnik, Lukasz 
> > <lukasz.czapnik@intel.com>; Raczynski, Piotr 
> > <piotr.raczynski@intel.com>; Song, Yoong Siang 
> > <yoong.siang.song@intel.com>
> > Subject: RE: AF_XDP metadata/hints
> > 
> > + @Swiatkowski, Michal
> > 
> > -----Original Message-----
> > From: Ong, Boon Leong <boon.leong.ong@intel.com>
> > Sent: Friday, May 7, 2021 3:13 AM
> > To: Brandeburg, Jesse <jesse.brandeburg@intel.com>; Desouza, Ederson 
> > <ederson.desouza@intel.com>; Lobakin, Alexandr 
> > <alexandr.lobakin@intel.com>
> > Cc: Gomes, Vinicius <vinicius.gomes@intel.com>; Maloor, Kishen 
> > <kishen.maloor@intel.com>; Zhang, Jessica <jessica.zhang@intel.com>; 
> > Joseph, Jithu <jithu.joseph@intel.com>; Plantykow, Marta A 
> > <marta.a.plantykow@intel.com>; Lobakin, Alexandr 
> > <alexandr.lobakin@intel.com>; Czapnik, Lukasz 
> > <lukasz.czapnik@intel.com>; Raczynski, Piotr 
> > <piotr.raczynski@intel.com>; Kubiak, Marcin <marcin.kubiak@intel.com>; 
> > Song, Yoong Siang <yoong.siang.song@intel.com>
> > Subject: RE: AF_XDP metadata/hints
> > 
> > + Yoong Siang
> > 
> > > -----Original Message-----
> > > From: Brandeburg, Jesse <jesse.brandeburg@intel.com>
> > > Sent: Friday, May 7, 2021 8:32 AM
> > > To: Desouza, Ederson <ederson.desouza@intel.com>; Lobakin, Alexandr 
> > > <alexandr.lobakin@intel.com>; Ong, Boon Leong 
> > > <boon.leong.ong@intel.com>
> > > Cc: Gomes, Vinicius <vinicius.gomes@intel.com>; Maloor, Kishen 
> > > <kishen.maloor@intel.com>; Zhang, Jessica <jessica.zhang@intel.com>; 
> > > Joseph, Jithu <jithu.joseph@intel.com>; Plantykow, Marta A 
> > > <marta.a.plantykow@intel.com>; Lobakin, Alexandr 
> > > <alexandr.lobakin@intel.com>; Czapnik, Lukasz 
> > > <lukasz.czapnik@intel.com>; Raczynski, Piotr 
> > > <piotr.raczynski@intel.com>; Kubiak, Marcin <marcin.kubiak@intel.com>
> > > Subject: RE: AF_XDP metadata/hints
> > > 
> > > I think our XDP team is on this, and I'll let them answer in detail. 
> > > Yes, I'm a goof for top posting using outlook....
> > >  
> > > > So, I'd like to ask some questions:
> > > >  - Are you guys also interested in AF_XDP support, or do you care 
> > > > about XDP only?
> > > 
> > > Yes! Both.  
> > 
> > For us (BL and Siang) in Penang, Since, we have 3 time-zones and I don't think it is effective for us to run 3 time zone discussion. So, please carry on without us. 
> > 
> > However, we would like to follow the RFC and POC so that we can help to implement the stmmac driver part and help test.
> > 
> > BL
> > >  
> > > >  - Did you already start the work? Could you describe how you 
> > > > handle the metadata? Can we see some preview? We're planning to 
> > > > start next week, so not much to show[*].
> > > >  - I believe there may be opportunities to collaborate here - how 
> > > > can we help?
> > > 
> > > Please folks, get together and talk. Ederson, this team is in Poland 
> > > so is +9 hours from PST. Keep me informed on what's up, but I don't 
> > > need to be in the day to day, but you're welcome to consult me anytime.
> > > 
> > > -Jesse
> > > 
> > > -----Original Message-----
> > > From: Desouza, Ederson <ederson.desouza@intel.com>
> > > Sent: Thursday, May 6, 2021 5:08 PM
> > > To: Lobakin, Alexandr <alexandr.lobakin@intel.com>; Ong, Boon Leong 
> > > <boon.leong.ong@intel.com>; Brandeburg, Jesse 
> > > <jesse.brandeburg@intel.com>
> > > Cc: Gomes, Vinicius <vinicius.gomes@intel.com>; Maloor, Kishen 
> > > <kishen.maloor@intel.com>; Zhang, Jessica <jessica.zhang@intel.com>; 
> > > Joseph, Jithu <jithu.joseph@intel.com>
> > > Subject: Re: AF_XDP metadata/hints
> > > 
> > > +Jessica
> > > 
> > > Gentle ping to check if anyone's interested =D
> > > 
> > > On Fri, 2021-04-30 at 19:24 -0700, Ederson de Souza wrote:  
> > > > Hi folks,
> > > > 
> > > > I've noticed that you are some Intel people that mentioned (on a 
> > > > thread with the community) interest and some development regarding 
> > > > XDP metadata (for RX/TX timestamp and SO_TXTIME support). And my 
> > > > team, on IAGS/SSE, is also working on that for the i225 igc driver.
> > > > 
> > > > As there's some overlap in this work, I'd like to figure it out 
> > > > what is the state of these developments and align what we're doing.
> > > > 
> > > > I'll talk about our ideas and ask some questions to help in this 
> > > > alignment.
> > > > 
> > > > We want the "SO_TXTIME" support for AF_XDP ZC. Right now, we're 
> > > > thinking about working on top of Saeed's work that enables BTF for XDP.
> > > > Our idea is to use BTF and the XDP metadata area (that headroom 
> > > > area just before the frame data starts) to store the timestamp.
> > > > 
> > > > For this to work, we're thinking of adding some new API to 
> > > > `tools/lib/bpf/xsk.h` to allow userspace application to manipulate 
> > > > that area and add the metadata (without explicitly subtracting 
> > > > pointers and such).
> > > > 
> > > > Then, add support on both igc driver to extract this metadata and 
> > > > set the launch time and the generic AF_XDP socket (for copy mode
> > > > compatibility) to extract and add it to the skb sent down the stack.
> > > > 
> > > > Of course, the issue here is how to make this generic enough. 
> > > > Saeed's work kinda expects a metadata struct per driver, IIUC.
> > > > 
> > > > So, I'd like to ask some questions:
> > > >  - Are you guys also interested in AF_XDP support, or do you care 
> > > > about XDP only?
> > > >  - Did you already start the work? Could you describe how you 
> > > > handle the metadata? Can we see some preview? We're planning to 
> > > > start next week, so not much to show[*].
> > > >  - I believe there may be opportunities to collaborate here - how 
> > > > can we help?
> > > > 
> > > > Cheers!
> > > > 
> > > > [*] In fact, we're using BTF metadata to get HW RX/TX timestamps 
> > > > for some internal latency measurements we're doing. It's fairly 
> > > > hackish (and ugly) at this point, but if you are interested, we 
> > > > could share what we've done.
> > 
> > 
> > > >  
> > > 
> > >  
> > 
> 
> 
> 
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer
> 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: AF_XDP metadata/hints
  2021-06-05  0:32           ` Desouza, Ederson
@ 2021-06-11 19:25             ` Alexander Lobakin
  0 siblings, 0 replies; 18+ messages in thread
From: Alexander Lobakin @ 2021-06-11 19:25 UTC (permalink / raw)
  To: Desouza, Ederson
  Cc: Alexander Lobakin, brouer, bpf, Raczynski, Piotr, Zhang, Jessica,
	Kubiak, Marcin, Joseph, Jithu, xdp-hints, Maloor, Kishen, Gomes,
	Vinicius, Brandeburg, Jesse, Swiatkowski, Michal, Plantykow,
	Marta A, Ong, Boon Leong, Czapnik, Lukasz, Song, Yoong Siang

From: Ederson Desouza <ederson.desouza@intel.com>
Date: Sat, 5 Jun 2021 00:32:45 +0000

Hi there,

> On Mon, 2021-05-10 at 15:49 +0000, Lobakin, Alexandr wrote:
> > Hi,
> > 
> > So here it is: https://github.com/alobakin/linux/branches
> 
> Quick question: are you planning any update here (like incorporating
> the CO-RE feedback)? 

Sure, the repo is not dead, neither our work is. I was OOO last
week due to some urgent events and didn't have time to get my
hands on XDP Hints.
And yes, we are planning to rewrite the entire thing upon
the CO-RE. Stay tuned.

> I'm thinking of igc, and wondering if you have something in your
> pipeline already that I could reuse for some tests =D

Not yet, but I'll be keeping my GH mirror in sync with our drafts
anyways.
I mean, you could try to reuse the current implementation from the
ice driver, as it works just fine. But the most juicy things are
to come.

> > 
> > Default branch is just merged with v5.13-rc1 without any testing, don't expect much from it. I'll fix it a bit later.
> > 
> > Al
> > 
> > -----Original Message-----
> > From: Jesper Dangaard Brouer <brouer@redhat.com> 
> > Sent: Friday, May 7, 2021 1:11 PM
> > To: Lobakin, Alexandr <alexandr.lobakin@intel.com>
> > Cc: Kubiak, Marcin <marcin.kubiak@intel.com>; Ong, Boon Leong <boon.leong.ong@intel.com>; Brandeburg, Jesse <jesse.brandeburg@intel.com>; Desouza, Ederson <ederson.desouza@intel.com>; Swiatkowski, Michal <michal.swiatkowski@intel.com>; Gomes, Vinicius <vinicius.gomes@intel.com>; Maloor, Kishen <kishen.maloor@intel.com>; Zhang, Jessica <jessica.zhang@intel.com>; Joseph, Jithu <jithu.joseph@intel.com>; Plantykow, Marta A <marta.a.plantykow@intel.com>; Czapnik, Lukasz <lukasz.czapnik@intel.com>; Raczynski, Piotr <piotr.raczynski@intel.com>; Song, Yoong Siang <yoong.siang.song@intel.com>; brouer@redhat.com
> > Subject: Re: AF_XDP metadata/hints
> > 
> > (Answer inlined below)
> > 
> > On Fri, 7 May 2021 10:08:51 +0000
> > "Lobakin, Alexandr" <alexandr.lobakin@intel.com> wrote:
> > 
> > > + Jesper.
> > 
> > Thanks for including me!  Just see me as a resource that can help out on this project, both coding and (performance) testing.  I have a customer use-case related to LaunchTime mode using AF_XDP, which is one of the more tricky cases (due to XDP lacking a proper TX layer, that can pushback if TX-queue is full/paused).
> > 
> > 
> > > So long story short: driver advertises the XDP hints it supports (Rx:
> > > RSS hash, csum status or complete csum, C/S-VLAN tag if stripped etc., 
> > > Tx: csum offload etc.) on netdev probing so BPF prog could request for 
> > > them.
> > 
> > Yes exactly and veth and devmap also want to consume these e.g. RSS hash + csum status when they create an SKB based on an xdp_frame.
> > 
> > (Note David Ahern and I have patches placing the csum status bits in xdp_frame to transfer that info (xdp_frame is located in memory top).
> > note performance analysis here[1])
> > [1] https://github.com/xdp-project/xdp-project/blob/master/areas/core/xdp_frame01_checksum.org
> > 
> > 
> > > There's a plan to provide 2 types of hints: generic ones (that almost 
> > > every NIC is capable to provide) and custom ones (up to 
> > > vendor/developers). Drivers that don't want to support hints at all 
> > > could opt out from the generic ones just by setting their 
> > > xdp.data_meta to xdp.data + 1, just how it's now done in the mainline.
> > 
> > I agree the NIC need to support different types.  And different types per packet as e.g. PTP timestamps might not be in every packet.
> > 
> > > XDP Hints will be obviously stored in metadata,
> > 
> > I have considered[2] storing XDP hints in xdp_frame area (top of memory), but I'm more and more convinced is should be stored in metadata area... mostly because this will allow future hardware to write this data for us.
> > 
> > [2] https://people.netfilter.org/hawk/presentations/KernelRecipes2019/xdp-netstack-concert.pdf
> > 
> > I think there is a small performance problem when writing into metadata area.  Because this is a cacheline that might be (semi)cold.  Hint we prefetch xdp_frame area and we could do the same for metadata, but I still see a slowdown when converting xdp_buff to xdp_frame.
> > 
> > XDP is extremely performance sensitive.  Even if metadata area is in L1 cache it will still cost a couple of nanosec to update the fields.
> > This will be measureable in an XDP_DROP test.  The easiest workaround I see is: that we allow the config interface for XDP-hints to allow disabling this feature.
> > 
> > 
> > > the layout is not written in stone yet.
> > 
> > Exactly, the main reason/motivation for this work is to allow NIC vendors to invent new hardware hints without having to wait for the Linux kernel to adopt these.  Instead they are instant available via BPF and BTF-info that describe this layout.
> > 
> >  
> > > I'll be preparing an open repo with our drafts today, let you know 
> > > once it will be ready and available so you could take a look, review, 
> > > fork and go forth with playing with it.
> > 
> > Great!!! :-)))
> > 
> > > Any of you can also share any bits of code or thoughts that you have 
> > > and may want to share.
> > > 
> > > Thanks,
> > > Al
> > > 
> > > -----Original Message-----
> > > From: Kubiak, Marcin <marcin.kubiak@intel.com>
> > > Sent: Friday, May 7, 2021 9:35 AM
> > > To: Ong, Boon Leong <boon.leong.ong@intel.com>; Brandeburg, Jesse 
> > > <jesse.brandeburg@intel.com>; Desouza, Ederson 
> > > <ederson.desouza@intel.com>; Lobakin, Alexandr 
> > > <alexandr.lobakin@intel.com>; Swiatkowski, Michal 
> > > <michal.swiatkowski@intel.com>
> > > Cc: Gomes, Vinicius <vinicius.gomes@intel.com>; Maloor, Kishen 
> > > <kishen.maloor@intel.com>; Zhang, Jessica <jessica.zhang@intel.com>; 
> > > Joseph, Jithu <jithu.joseph@intel.com>; Plantykow, Marta A 
> > > <marta.a.plantykow@intel.com>; Lobakin, Alexandr 
> > > <alexandr.lobakin@intel.com>; Czapnik, Lukasz 
> > > <lukasz.czapnik@intel.com>; Raczynski, Piotr 
> > > <piotr.raczynski@intel.com>; Song, Yoong Siang 
> > > <yoong.siang.song@intel.com>
> > > Subject: RE: AF_XDP metadata/hints
> > > 
> > > + @Swiatkowski, Michal
> > > 
> > > -----Original Message-----
> > > From: Ong, Boon Leong <boon.leong.ong@intel.com>
> > > Sent: Friday, May 7, 2021 3:13 AM
> > > To: Brandeburg, Jesse <jesse.brandeburg@intel.com>; Desouza, Ederson 
> > > <ederson.desouza@intel.com>; Lobakin, Alexandr 
> > > <alexandr.lobakin@intel.com>
> > > Cc: Gomes, Vinicius <vinicius.gomes@intel.com>; Maloor, Kishen 
> > > <kishen.maloor@intel.com>; Zhang, Jessica <jessica.zhang@intel.com>; 
> > > Joseph, Jithu <jithu.joseph@intel.com>; Plantykow, Marta A 
> > > <marta.a.plantykow@intel.com>; Lobakin, Alexandr 
> > > <alexandr.lobakin@intel.com>; Czapnik, Lukasz 
> > > <lukasz.czapnik@intel.com>; Raczynski, Piotr 
> > > <piotr.raczynski@intel.com>; Kubiak, Marcin <marcin.kubiak@intel.com>; 
> > > Song, Yoong Siang <yoong.siang.song@intel.com>
> > > Subject: RE: AF_XDP metadata/hints
> > > 
> > > + Yoong Siang
> > > 
> > > > -----Original Message-----
> > > > From: Brandeburg, Jesse <jesse.brandeburg@intel.com>
> > > > Sent: Friday, May 7, 2021 8:32 AM
> > > > To: Desouza, Ederson <ederson.desouza@intel.com>; Lobakin, Alexandr 
> > > > <alexandr.lobakin@intel.com>; Ong, Boon Leong 
> > > > <boon.leong.ong@intel.com>
> > > > Cc: Gomes, Vinicius <vinicius.gomes@intel.com>; Maloor, Kishen 
> > > > <kishen.maloor@intel.com>; Zhang, Jessica <jessica.zhang@intel.com>; 
> > > > Joseph, Jithu <jithu.joseph@intel.com>; Plantykow, Marta A 
> > > > <marta.a.plantykow@intel.com>; Lobakin, Alexandr 
> > > > <alexandr.lobakin@intel.com>; Czapnik, Lukasz 
> > > > <lukasz.czapnik@intel.com>; Raczynski, Piotr 
> > > > <piotr.raczynski@intel.com>; Kubiak, Marcin <marcin.kubiak@intel.com>
> > > > Subject: RE: AF_XDP metadata/hints
> > > > 
> > > > I think our XDP team is on this, and I'll let them answer in detail. 
> > > > Yes, I'm a goof for top posting using outlook....
> > > >  
> > > > > So, I'd like to ask some questions:
> > > > >  - Are you guys also interested in AF_XDP support, or do you care 
> > > > > about XDP only?
> > > > 
> > > > Yes! Both.  
> > > 
> > > For us (BL and Siang) in Penang, Since, we have 3 time-zones and I don't think it is effective for us to run 3 time zone discussion. So, please carry on without us. 
> > > 
> > > However, we would like to follow the RFC and POC so that we can help to implement the stmmac driver part and help test.
> > > 
> > > BL
> > > >  
> > > > >  - Did you already start the work? Could you describe how you 
> > > > > handle the metadata? Can we see some preview? We're planning to 
> > > > > start next week, so not much to show[*].
> > > > >  - I believe there may be opportunities to collaborate here - how 
> > > > > can we help?
> > > > 
> > > > Please folks, get together and talk. Ederson, this team is in Poland 
> > > > so is +9 hours from PST. Keep me informed on what's up, but I don't 
> > > > need to be in the day to day, but you're welcome to consult me anytime.
> > > > 
> > > > -Jesse
> > > > 
> > > > -----Original Message-----
> > > > From: Desouza, Ederson <ederson.desouza@intel.com>
> > > > Sent: Thursday, May 6, 2021 5:08 PM
> > > > To: Lobakin, Alexandr <alexandr.lobakin@intel.com>; Ong, Boon Leong 
> > > > <boon.leong.ong@intel.com>; Brandeburg, Jesse 
> > > > <jesse.brandeburg@intel.com>
> > > > Cc: Gomes, Vinicius <vinicius.gomes@intel.com>; Maloor, Kishen 
> > > > <kishen.maloor@intel.com>; Zhang, Jessica <jessica.zhang@intel.com>; 
> > > > Joseph, Jithu <jithu.joseph@intel.com>
> > > > Subject: Re: AF_XDP metadata/hints
> > > > 
> > > > +Jessica
> > > > 
> > > > Gentle ping to check if anyone's interested =D
> > > > 
> > > > On Fri, 2021-04-30 at 19:24 -0700, Ederson de Souza wrote:  
> > > > > Hi folks,
> > > > > 
> > > > > I've noticed that you are some Intel people that mentioned (on a 
> > > > > thread with the community) interest and some development regarding 
> > > > > XDP metadata (for RX/TX timestamp and SO_TXTIME support). And my 
> > > > > team, on IAGS/SSE, is also working on that for the i225 igc driver.
> > > > > 
> > > > > As there's some overlap in this work, I'd like to figure it out 
> > > > > what is the state of these developments and align what we're doing.
> > > > > 
> > > > > I'll talk about our ideas and ask some questions to help in this 
> > > > > alignment.
> > > > > 
> > > > > We want the "SO_TXTIME" support for AF_XDP ZC. Right now, we're 
> > > > > thinking about working on top of Saeed's work that enables BTF for XDP.
> > > > > Our idea is to use BTF and the XDP metadata area (that headroom 
> > > > > area just before the frame data starts) to store the timestamp.
> > > > > 
> > > > > For this to work, we're thinking of adding some new API to 
> > > > > `tools/lib/bpf/xsk.h` to allow userspace application to manipulate 
> > > > > that area and add the metadata (without explicitly subtracting 
> > > > > pointers and such).
> > > > > 
> > > > > Then, add support on both igc driver to extract this metadata and 
> > > > > set the launch time and the generic AF_XDP socket (for copy mode
> > > > > compatibility) to extract and add it to the skb sent down the stack.
> > > > > 
> > > > > Of course, the issue here is how to make this generic enough. 
> > > > > Saeed's work kinda expects a metadata struct per driver, IIUC.
> > > > > 
> > > > > So, I'd like to ask some questions:
> > > > >  - Are you guys also interested in AF_XDP support, or do you care 
> > > > > about XDP only?
> > > > >  - Did you already start the work? Could you describe how you 
> > > > > handle the metadata? Can we see some preview? We're planning to 
> > > > > start next week, so not much to show[*].
> > > > >  - I believe there may be opportunities to collaborate here - how 
> > > > > can we help?
> > > > > 
> > > > > Cheers!
> > > > > 
> > > > > [*] In fact, we're using BTF metadata to get HW RX/TX timestamps 
> > > > > for some internal latency measurements we're doing. It's fairly 
> > > > > hackish (and ugly) at this point, but if you are interested, we 
> > > > > could share what we've done.
> > > 
> > > 
> > > > >  
> > > > 
> > > >  
> > > 
> > 
> > 
> > 
> > --
> > Best regards,
> >   Jesper Dangaard Brouer
> >   MSc.CS, Principal Kernel Engineer at Red Hat
> >   LinkedIn: http://www.linkedin.com/in/brouer

Thanks,
Al

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2021-06-11 19:27 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <dc2c38cdccfa5eca925cfc9d59b0674e208c9c9d.camel@intel.com>
     [not found] ` <2226aeaab7a4ca8e4f26413514bf54ab2c81ea36.camel@intel.com>
     [not found] ` <DM6PR11MB2780A8C5410ECB3C9700EAB5CA579@DM6PR11MB2780.namprd11.prod.outlook.com>
     [not found]   ` <PH0PR11MB487034313697F395BB5BA3C5E4579@PH0PR11MB4870.namprd11.prod.outlook.com>
     [not found]     ` <DM4PR11MB5422733A87913EFF8904C17184579@DM4PR11MB5422.namprd11.prod.outlook.com>
     [not found]       ` <20210507131034.5a62ce56@carbon>
     [not found]         ` <DM4PR11MB5422FE9618B3692D48FCE4EA84549@DM4PR11MB5422.namprd11.prod.outlook.com>
     [not found]           ` <20210510185029.1ca6f872@carbon>
     [not found]             ` <DM4PR11MB54227C25DFD4E882CB03BD3884539@DM4PR11MB5422.namprd11.prod.outlook.com>
     [not found]               ` <20210512102546.5c098483@carbon>
     [not found]                 ` <DM4PR11MB542273C9D8BF63505DC6E21784519@DM4PR11MB5422.namprd11.prod.outlook.com>
     [not found]                   ` <7b347a985e590e2a422f837971b30bd83f9c7ac3.camel@nvidia.com>
     [not found]                     ` <DM4PR11MB5422762E82C0531B92BDF09A842B9@DM4PR11MB5422.namprd11.prod.outlook.com>
     [not found]                       ` <DM4PR11MB5422269F6113268172B9E26A842A9@DM4PR11MB5422.namprd11.prod.outlook.com>
     [not found]                         ` <DM4PR11MB54224769926B06EE76635A6484299@DM4PR11MB5422.namprd11.prod.outlook.com>
2021-05-21 13:31                           ` AF_XDP metadata/hints Jesper Dangaard Brouer
2021-05-21 17:53                             ` Saeed Mahameed
2021-05-23 11:54                               ` Toke Høiland-Jørgensen
2021-05-25 14:20                                 ` Alexander Lobakin
2021-05-26  4:51                                   ` John Fastabend
2021-05-26 11:49                                     ` Jesper Dangaard Brouer
2021-05-26 13:06                                       ` Toke Høiland-Jørgensen
2021-05-26 15:35                                         ` John Fastabend
2021-05-26 15:41                                           ` John Fastabend
2021-05-26 15:54                                           ` Alexander Lobakin
2021-05-26 16:33                                             ` John Fastabend
2021-05-26 18:44                                               ` Jesper Dangaard Brouer
2021-05-26 16:41                                             ` Alexei Starovoitov
2021-05-26 17:01                                               ` John Fastabend
2021-05-26 17:38                                           ` Jesper Dangaard Brouer
2021-05-26 14:49                                   ` Jesper Dangaard Brouer
2021-06-05  0:32           ` Desouza, Ederson
2021-06-11 19:25             ` Alexander Lobakin

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.