netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: Per-queue XDP programs, thoughts
       [not found]   ` <64259723-f0d8-8ade-467e-ad865add4908@intel.com>
@ 2019-04-15 16:32     ` Jesper Dangaard Brouer
  2019-04-15 17:08       ` Toke Høiland-Jørgensen
                         ` (6 more replies)
  0 siblings, 7 replies; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2019-04-15 16:32 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Björn Töpel, ilias.apalodimas, toke, magnus.karlsson,
	maciej.fijalkowski, brouer, Jason Wang, Alexei Starovoitov,
	Daniel Borkmann, Jakub Kicinski, John Fastabend, David Miller,
	Andy Gospodarek, netdev, bpf, Thomas Graf, Thomas Monjalon


On Mon, 15 Apr 2019 13:59:03 +0200 Björn Töpel <bjorn.topel@intel.com> wrote:

> Hi,
> 
> As you probably can derive from the amount of time this is taking, I'm
> not really satisfied with the design of per-queue XDP program. (That,
> plus I'm a terribly slow hacker... ;-)) I'll try to expand my thinking
> in this mail!
> 
> Beware, it's kind of a long post, and it's all over the place.

Cc'ing all the XDP-maintainers (and netdev).

> There are a number of ways of setting up flows in the kernel, e.g.
> 
> * Connecting/accepting a TCP socket (in-band)
> * Using tc-flower (out-of-band)
> * ethtool (out-of-band)
> * ...
> 
> The first acts on sockets, the second on netdevs. Then there's ethtool
> to configure RSS, and the RSS-on-steriods rxhash/ntuple that can steer
> to queues. Most users care about sockets and netdevices. Queues is
> more of an implementation detail of Rx or for QoS on the Tx side.

Let me first acknowledge that the current Linux tools to administrator
HW filters is lacking (well sucks).  We know the hardware is capable,
as DPDK have an full API for this called rte_flow[1]. If nothing else
you/we can use the DPDK API to create a program to configure the
hardware, examples here[2]

 [1] https://doc.dpdk.org/guides/prog_guide/rte_flow.html
 [2] https://doc.dpdk.org/guides/howto/rte_flow.html

> XDP is something that we can attach to a netdevice. Again, very
> natural from a user perspective. As for XDP sockets, the current
> mechanism is that we attach to an existing netdevice queue. Ideally
> what we'd like is to *remove* the queue concept. A better approach
> would be creating the socket and set it up -- but not binding it to a
> queue. Instead just binding it to a netdevice (or crazier just
> creating a socket without a netdevice).

Let me just remind everybody that the AF_XDP performance gains comes
from binding the resource, which allow for lock-free semantics, as
explained here[3].

[3] https://github.com/xdp-project/xdp-tutorial/tree/master/advanced03-AF_XDP#where-does-af_xdp-performance-come-from


> The socket is an endpoint, where I'd like data to end up (or get sent
> from). If the kernel can attach the socket to a hardware queue,
> there's zerocopy if not, copy-mode. Dito for Tx.

Well XDP programs per RXQ is just a building block to achieve this.

As Van Jacobson explain[4], sockets or applications "register" a
"transport signature", and gets back a "channel".   In our case, the
netdev-global XDP program is our way to register/program these transport
signatures and redirect (e.g. into the AF_XDP socket).
This requires some work in software to parse and match transport
signatures to sockets.  The XDP programs per RXQ is a way to get
hardware to perform this filtering for us.

 [4] http://www.lemis.com/grog/Documentation/vj/lca06vj.pdf


> Does a user (control plane) want/need to care about queues? Just
> create a flow to a socket (out-of-band or inband) or to a netdevice
> (out-of-band).

A userspace "control-plane" program, could hide the setup and use what
the system/hardware can provide of optimizations.  VJ[4] e.g. suggest
that the "listen" socket first register the transport signature (with
the driver) on "accept()".   If the HW supports DPDK-rte_flow API we
can register a 5-tuple (or create TC-HW rules) and load our
"transport-signature" XDP prog on the queue number we choose.  If not,
when our netdev-global XDP prog need a hash-table with 5-tuple and do
5-tuple parsing.

Creating netdevices via HW filter into queues is an interesting idea.
DPDK have an example here[5], on how to per flow (via ethtool filter
setup even!) send packets to queues, that endup in SRIOV devices.

 [5] https://doc.dpdk.org/guides/howto/flow_bifurcation.html


> Do we envison any other uses for per-queue XDP other than AF_XDP? If
> not, it would make *more* sense to attach the XDP program to the
> socket (e.g. if the endpoint would like to use kernel data structures
> via XDP).

As demonstrated in [5] you can use (ethtool) hardware filters to
redirect packets into VFs (Virtual Functions).

I also want us to extend XDP to allow for redirect from a PF (Physical
Function) into a VF (Virtual Function).  First the netdev-global
XDP-prog need to support this (maybe extend xdp_rxq_info with PF + VF
info).  Next configure HW filter to queue# and load XDP prog on that
queue# that only "redirect" to a single VF.  Now if driver+HW supports
it, it can "eliminate" the per-queue XDP-prog and do everything in HW. 


> If we'd like to slice a netdevice into multiple queues. Isn't macvlan
> or similar *virtual* netdevices a better path, instead of introducing
> yet another abstraction?

XDP redirect a more generic abstraction that allow us to implement
macvlan.  Except macvlan driver is missing ndo_xdp_xmit. Again first I
write this as global-netdev XDP-prog, that does a lookup in a BPF-map.
Next I configure HW filters that match the MAC-addr into a queue# and
attach simpler XDP-prog to queue#, that redirect into macvlan device.

 
> Further, is queue/socket a good abstraction for all devices? Wifi? By
> just viewing sockets as an endpoint, we leave it up to the kernel to
> figure out the best way. "Here's an endpoint. Give me data **here**."
> 
> The OpenFlow protocol does however support the concept of queues per
> port, but do we want to introduce that into the kernel?
> 
> So, if per-queue XDP programs is only for AF_XDP, I think it's better
> to stick the program to the socket. For me per-queue is sort of a
> leaky abstraction...
>
> More thoughts. If we go the route of per-queue XDP programs. Would it
> be better to leave the setup to XDP -- i.e. the XDP program is
> controlling the per-queue programs (think tail-calls, but a map with
> per-q programs). Instead of the netlink layer. This is part of a
> bigger discussion, namely should XDP really implement the control
> plane?
> 
> I really like that a software switch/router can be implemented
> effectively with XDP, but ideally I'd like it to be offloaded by
> hardware -- using the same control/configuration plane. Can we do it
> in hardware, do that. If not, emulate via XDP.

That is actually the reason I want XDP per-queue, as it is a way to
offload the filtering to the hardware.  And if the per-queue XDP-prog
becomes simple enough, the hardware can eliminate and do everything in
hardware (hopefully).


> The control plane should IMO be outside of the XDP program.
> 
> Ok, please convince me! :-D

I tried to above...

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


More use-cases for per-queue XDP-prog:

XDP for containers
------------------
XDP can redirect into veth, used for containers.  So, I want to
implement container protection/isolation in XDP layer.  E.g. I only
want my container to talk to 1 external src-IP to my container dst-IP
on port 80.  I can implement that check in netdev-global XDP BPF-code.
But I can also hardware "offload" this simple filter (via ethtool or
rte_flow) and simplify the per-queue XDP-prog.  Given the queue now
only receives traffic that match my desc, I have now protected/isolated
the traffic to my container.


DPDK using per-queue AF_XDP
---------------------------
AFAIK an AF_XDP PMD driver have been merged in DPDK (but I've not
looked at the code).

It would be very natural for DPDK to load per-queue XDP-progs for
interfacing with AF_XDP, as they already have rte_flow API (see
[1]+[2]) for configuring HW filters.  And loading per-queue XDP-progs
would also avoid disturbing other users of XDP on same machine (if we
choose the semantics defined in [6]).

[6] https://github.com/xdp-project/xdp-project/blob/master/areas/core/xdp_per_rxq01.org#proposal-rxq-prog-takes-precedence

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Per-queue XDP programs, thoughts
  2019-04-15 16:32     ` Per-queue XDP programs, thoughts Jesper Dangaard Brouer
@ 2019-04-15 17:08       ` Toke Høiland-Jørgensen
  2019-04-15 17:58       ` Jonathan Lemon
                         ` (5 subsequent siblings)
  6 siblings, 0 replies; 19+ messages in thread
From: Toke Høiland-Jørgensen @ 2019-04-15 17:08 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Björn Töpel
  Cc: Björn Töpel, ilias.apalodimas, magnus.karlsson,
	maciej.fijalkowski, brouer, Jason Wang, Alexei Starovoitov,
	Daniel Borkmann, Jakub Kicinski, John Fastabend, David Miller,
	Andy Gospodarek, netdev, bpf, Thomas Graf, Thomas Monjalon

Jesper Dangaard Brouer <brouer@redhat.com> writes:

> On Mon, 15 Apr 2019 13:59:03 +0200 Björn Töpel <bjorn.topel@intel.com> wrote:
>
>> Hi,
>> 
>> As you probably can derive from the amount of time this is taking, I'm
>> not really satisfied with the design of per-queue XDP program. (That,
>> plus I'm a terribly slow hacker... ;-)) I'll try to expand my thinking
>> in this mail!
>> 
>> Beware, it's kind of a long post, and it's all over the place.
>
> Cc'ing all the XDP-maintainers (and netdev).
>
>> There are a number of ways of setting up flows in the kernel, e.g.
>> 
>> * Connecting/accepting a TCP socket (in-band)
>> * Using tc-flower (out-of-band)
>> * ethtool (out-of-band)
>> * ...
>> 
>> The first acts on sockets, the second on netdevs. Then there's ethtool
>> to configure RSS, and the RSS-on-steriods rxhash/ntuple that can steer
>> to queues. Most users care about sockets and netdevices. Queues is
>> more of an implementation detail of Rx or for QoS on the Tx side.
>
> Let me first acknowledge that the current Linux tools to administrator
> HW filters is lacking (well sucks).  We know the hardware is capable,
> as DPDK have an full API for this called rte_flow[1]. If nothing else
> you/we can use the DPDK API to create a program to configure the
> hardware, examples here[2]
>
>  [1] https://doc.dpdk.org/guides/prog_guide/rte_flow.html
>  [2] https://doc.dpdk.org/guides/howto/rte_flow.html
>
>> XDP is something that we can attach to a netdevice. Again, very
>> natural from a user perspective. As for XDP sockets, the current
>> mechanism is that we attach to an existing netdevice queue. Ideally
>> what we'd like is to *remove* the queue concept. A better approach
>> would be creating the socket and set it up -- but not binding it to a
>> queue. Instead just binding it to a netdevice (or crazier just
>> creating a socket without a netdevice).
>
> Let me just remind everybody that the AF_XDP performance gains comes
> from binding the resource, which allow for lock-free semantics, as
> explained here[3].
>
> [3] https://github.com/xdp-project/xdp-tutorial/tree/master/advanced03-AF_XDP#where-does-af_xdp-performance-come-from
>
>
>> The socket is an endpoint, where I'd like data to end up (or get sent
>> from). If the kernel can attach the socket to a hardware queue,
>> there's zerocopy if not, copy-mode. Dito for Tx.
>
> Well XDP programs per RXQ is just a building block to achieve this.
>
> As Van Jacobson explain[4], sockets or applications "register" a
> "transport signature", and gets back a "channel".   In our case, the
> netdev-global XDP program is our way to register/program these transport
> signatures and redirect (e.g. into the AF_XDP socket).
> This requires some work in software to parse and match transport
> signatures to sockets.  The XDP programs per RXQ is a way to get
> hardware to perform this filtering for us.
>
>  [4] http://www.lemis.com/grog/Documentation/vj/lca06vj.pdf
>
>
>> Does a user (control plane) want/need to care about queues? Just
>> create a flow to a socket (out-of-band or inband) or to a netdevice
>> (out-of-band).
>
> A userspace "control-plane" program, could hide the setup and use what
> the system/hardware can provide of optimizations.  VJ[4] e.g. suggest
> that the "listen" socket first register the transport signature (with
> the driver) on "accept()".   If the HW supports DPDK-rte_flow API we
> can register a 5-tuple (or create TC-HW rules) and load our
> "transport-signature" XDP prog on the queue number we choose.  If not,
> when our netdev-global XDP prog need a hash-table with 5-tuple and do
> 5-tuple parsing.

I agree with the "per-queue XDP is a building block" sentiment, but I
think we really need to hash out the "control plane" part. Is it good
enough to make this userspace's problem, or should there be some kind of
kernel support? And if it's going to be userspace only, who is going to
write the demuxer that runs as the root program? Should we start a
separate project for this, should it be part of libbpf, or something
entirely different?

-Toke

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Per-queue XDP programs, thoughts
  2019-04-15 16:32     ` Per-queue XDP programs, thoughts Jesper Dangaard Brouer
  2019-04-15 17:08       ` Toke Høiland-Jørgensen
@ 2019-04-15 17:58       ` Jonathan Lemon
  2019-04-16 14:48         ` Jesper Dangaard Brouer
  2019-04-15 22:49       ` Jakub Kicinski
                         ` (4 subsequent siblings)
  6 siblings, 1 reply; 19+ messages in thread
From: Jonathan Lemon @ 2019-04-15 17:58 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Björn Töpel, Björn Töpel, ilias.apalodimas,
	toke, magnus.karlsson, maciej.fijalkowski, Jason Wang,
	Alexei Starovoitov, Daniel Borkmann, Jakub Kicinski,
	John Fastabend, David Miller, Andy Gospodarek, netdev, bpf,
	Thomas Graf, Thomas Monjalon


On 15 Apr 2019, at 9:32, Jesper Dangaard Brouer wrote:

> On Mon, 15 Apr 2019 13:59:03 +0200 Björn Töpel 
> <bjorn.topel@intel.com> wrote:
>
>> Hi,
>>
>> As you probably can derive from the amount of time this is taking, 
>> I'm
>> not really satisfied with the design of per-queue XDP program. (That,
>> plus I'm a terribly slow hacker... ;-)) I'll try to expand my 
>> thinking
>> in this mail!
>>
>> Beware, it's kind of a long post, and it's all over the place.
>
> Cc'ing all the XDP-maintainers (and netdev).
>
>> There are a number of ways of setting up flows in the kernel, e.g.
>>
>> * Connecting/accepting a TCP socket (in-band)
>> * Using tc-flower (out-of-band)
>> * ethtool (out-of-band)
>> * ...
>>
>> The first acts on sockets, the second on netdevs. Then there's 
>> ethtool
>> to configure RSS, and the RSS-on-steriods rxhash/ntuple that can 
>> steer
>> to queues. Most users care about sockets and netdevices. Queues is
>> more of an implementation detail of Rx or for QoS on the Tx side.
>
> Let me first acknowledge that the current Linux tools to administrator
> HW filters is lacking (well sucks).  We know the hardware is capable,
> as DPDK have an full API for this called rte_flow[1]. If nothing else
> you/we can use the DPDK API to create a program to configure the
> hardware, examples here[2]
>
>  [1] https://doc.dpdk.org/guides/prog_guide/rte_flow.html
>  [2] https://doc.dpdk.org/guides/howto/rte_flow.html
>
>> XDP is something that we can attach to a netdevice. Again, very
>> natural from a user perspective. As for XDP sockets, the current
>> mechanism is that we attach to an existing netdevice queue. Ideally
>> what we'd like is to *remove* the queue concept. A better approach
>> would be creating the socket and set it up -- but not binding it to a
>> queue. Instead just binding it to a netdevice (or crazier just
>> creating a socket without a netdevice).
>
> Let me just remind everybody that the AF_XDP performance gains comes
> from binding the resource, which allow for lock-free semantics, as
> explained here[3].
>
> [3] 
> https://github.com/xdp-project/xdp-tutorial/tree/master/advanced03-AF_XDP#where-does-af_xdp-performance-come-from
>
>
>> The socket is an endpoint, where I'd like data to end up (or get sent
>> from). If the kernel can attach the socket to a hardware queue,
>> there's zerocopy if not, copy-mode. Dito for Tx.
>
> Well XDP programs per RXQ is just a building block to achieve this.
>
> As Van Jacobson explain[4], sockets or applications "register" a
> "transport signature", and gets back a "channel".   In our case, the
> netdev-global XDP program is our way to register/program these 
> transport
> signatures and redirect (e.g. into the AF_XDP socket).
> This requires some work in software to parse and match transport
> signatures to sockets.  The XDP programs per RXQ is a way to get
> hardware to perform this filtering for us.
>
>  [4] http://www.lemis.com/grog/Documentation/vj/lca06vj.pdf
>
>
>> Does a user (control plane) want/need to care about queues? Just
>> create a flow to a socket (out-of-band or inband) or to a netdevice
>> (out-of-band).
>
> A userspace "control-plane" program, could hide the setup and use what
> the system/hardware can provide of optimizations.  VJ[4] e.g. suggest
> that the "listen" socket first register the transport signature (with
> the driver) on "accept()".   If the HW supports DPDK-rte_flow API we
> can register a 5-tuple (or create TC-HW rules) and load our
> "transport-signature" XDP prog on the queue number we choose.  If not,
> when our netdev-global XDP prog need a hash-table with 5-tuple and do
> 5-tuple parsing.
>
> Creating netdevices via HW filter into queues is an interesting idea.
> DPDK have an example here[5], on how to per flow (via ethtool filter
> setup even!) send packets to queues, that endup in SRIOV devices.
>
>  [5] https://doc.dpdk.org/guides/howto/flow_bifurcation.html
>
>
>> Do we envison any other uses for per-queue XDP other than AF_XDP? If
>> not, it would make *more* sense to attach the XDP program to the
>> socket (e.g. if the endpoint would like to use kernel data structures
>> via XDP).
>
> As demonstrated in [5] you can use (ethtool) hardware filters to
> redirect packets into VFs (Virtual Functions).
>
> I also want us to extend XDP to allow for redirect from a PF (Physical
> Function) into a VF (Virtual Function).  First the netdev-global
> XDP-prog need to support this (maybe extend xdp_rxq_info with PF + VF
> info).  Next configure HW filter to queue# and load XDP prog on that
> queue# that only "redirect" to a single VF.  Now if driver+HW supports
> it, it can "eliminate" the per-queue XDP-prog and do everything in HW.

One thing I'd like to see is have RSS distribute incoming traffic
across a set of queues.  The application would open a set of xsk's which
are bound to those queues.

I'm not seeing how a transport signature would achieve this.  The 
current
tooling seems to treat the queue as the basic building block, which 
seems
generally appropriate.

Whittling things down (receiving packets only for a specific flow) could
be achieved by creating a queue which only contains those packets which
atched via some form of classification (or perhaps steered to a VF 
device),
aka [5] above.   Exposing multiple queues allows load distribution for
those apps which care about it.
-- 
Jonathan

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Per-queue XDP programs, thoughts
  2019-04-15 16:32     ` Per-queue XDP programs, thoughts Jesper Dangaard Brouer
  2019-04-15 17:08       ` Toke Høiland-Jørgensen
  2019-04-15 17:58       ` Jonathan Lemon
@ 2019-04-15 22:49       ` Jakub Kicinski
  2019-04-16  7:45         ` Björn Töpel
  2019-04-16 13:55         ` Jesper Dangaard Brouer
  2019-04-16  7:44       ` Björn Töpel
                         ` (3 subsequent siblings)
  6 siblings, 2 replies; 19+ messages in thread
From: Jakub Kicinski @ 2019-04-15 22:49 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Björn Töpel, Björn Töpel, ilias.apalodimas,
	toke, magnus.karlsson, maciej.fijalkowski, Jason Wang,
	Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	David Miller, Andy Gospodarek, netdev, bpf, Thomas Graf,
	Thomas Monjalon

On Mon, 15 Apr 2019 18:32:58 +0200, Jesper Dangaard Brouer wrote:
> On Mon, 15 Apr 2019 13:59:03 +0200 Björn Töpel <bjorn.topel@intel.com> wrote:
> > Hi,
> > 
> > As you probably can derive from the amount of time this is taking, I'm
> > not really satisfied with the design of per-queue XDP program. (That,
> > plus I'm a terribly slow hacker... ;-)) I'll try to expand my thinking
> > in this mail!

Jesper was advocating per-queue progs since very early days of XDP.
If it was easy to implement cleanly we would've already gotten it ;)

> > Beware, it's kind of a long post, and it's all over the place.  
> 
> Cc'ing all the XDP-maintainers (and netdev).
> 
> > There are a number of ways of setting up flows in the kernel, e.g.
> > 
> > * Connecting/accepting a TCP socket (in-band)
> > * Using tc-flower (out-of-band)
> > * ethtool (out-of-band)
> > * ...
> > 
> > The first acts on sockets, the second on netdevs. Then there's ethtool
> > to configure RSS, and the RSS-on-steriods rxhash/ntuple that can steer
> > to queues. Most users care about sockets and netdevices. Queues is
> > more of an implementation detail of Rx or for QoS on the Tx side.  
> 
> Let me first acknowledge that the current Linux tools to administrator
> HW filters is lacking (well sucks).  We know the hardware is capable,
> as DPDK have an full API for this called rte_flow[1]. If nothing else
> you/we can use the DPDK API to create a program to configure the
> hardware, examples here[2]
> 
>  [1] https://doc.dpdk.org/guides/prog_guide/rte_flow.html
>  [2] https://doc.dpdk.org/guides/howto/rte_flow.html
> 
> > XDP is something that we can attach to a netdevice. Again, very
> > natural from a user perspective. As for XDP sockets, the current
> > mechanism is that we attach to an existing netdevice queue. Ideally
> > what we'd like is to *remove* the queue concept. A better approach
> > would be creating the socket and set it up -- but not binding it to a
> > queue. Instead just binding it to a netdevice (or crazier just
> > creating a socket without a netdevice).  

You can remove the concept of a queue from the AF_XDP ABI (well, extend
it to not require the queue being explicitly specified..), but you can't
avoid the user space knowing there is a queue.  Because if you do you
can no longer track and configure that queue (things like IRQ
moderation, descriptor count etc.)

Currently the term "queue" refers mostly to the queues that stack uses.
Which leaves e.g. the XDP TX queues in a strange grey zone (from
ethtool channel ABI perspective, RPS, XPS etc.)  So it would be nice to
have the HW queue ID somewhat detached from stack queue ID.  Or at least
it'd be nice to introduce queue types?  I've been pondering this for a
while, I don't see any silver bullet here..

> Let me just remind everybody that the AF_XDP performance gains comes
> from binding the resource, which allow for lock-free semantics, as
> explained here[3].
> 
> [3] https://github.com/xdp-project/xdp-tutorial/tree/master/advanced03-AF_XDP#where-does-af_xdp-performance-come-from
> 
> 
> > The socket is an endpoint, where I'd like data to end up (or get sent
> > from). If the kernel can attach the socket to a hardware queue,
> > there's zerocopy if not, copy-mode. Dito for Tx.  
> 
> Well XDP programs per RXQ is just a building block to achieve this.
> 
> As Van Jacobson explain[4], sockets or applications "register" a
> "transport signature", and gets back a "channel".   In our case, the
> netdev-global XDP program is our way to register/program these transport
> signatures and redirect (e.g. into the AF_XDP socket).
> This requires some work in software to parse and match transport
> signatures to sockets.  The XDP programs per RXQ is a way to get
> hardware to perform this filtering for us.
> 
>  [4] http://www.lemis.com/grog/Documentation/vj/lca06vj.pdf
> 
> 
> > Does a user (control plane) want/need to care about queues? Just
> > create a flow to a socket (out-of-band or inband) or to a netdevice
> > (out-of-band).  
> 
> A userspace "control-plane" program, could hide the setup and use what
> the system/hardware can provide of optimizations.  VJ[4] e.g. suggest
> that the "listen" socket first register the transport signature (with
> the driver) on "accept()".   If the HW supports DPDK-rte_flow API we
> can register a 5-tuple (or create TC-HW rules) and load our
> "transport-signature" XDP prog on the queue number we choose.  If not,
> when our netdev-global XDP prog need a hash-table with 5-tuple and do
> 5-tuple parsing.

But we do want the ability to configure the queue, and get stats for
that queue.. so we can't hide the queue completely, right?

> Creating netdevices via HW filter into queues is an interesting idea.
> DPDK have an example here[5], on how to per flow (via ethtool filter
> setup even!) send packets to queues, that endup in SRIOV devices.
> 
>  [5] https://doc.dpdk.org/guides/howto/flow_bifurcation.html

I wish I had the courage to nack the ethtool redirect to VF Intel
added :)

> > Do we envison any other uses for per-queue XDP other than AF_XDP? If
> > not, it would make *more* sense to attach the XDP program to the
> > socket (e.g. if the endpoint would like to use kernel data structures
> > via XDP).  
> 
> As demonstrated in [5] you can use (ethtool) hardware filters to
> redirect packets into VFs (Virtual Functions).
> 
> I also want us to extend XDP to allow for redirect from a PF (Physical
> Function) into a VF (Virtual Function).  First the netdev-global
> XDP-prog need to support this (maybe extend xdp_rxq_info with PF + VF
> info).  Next configure HW filter to queue# and load XDP prog on that
> queue# that only "redirect" to a single VF.  Now if driver+HW supports
> it, it can "eliminate" the per-queue XDP-prog and do everything in HW. 

That sounds slightly contrived.  If the program is not doing anything,
why involve XDP at all?  As stated above we already have too many ways
to do flow config and/or VF redirect.

> > If we'd like to slice a netdevice into multiple queues. Isn't macvlan
> > or similar *virtual* netdevices a better path, instead of introducing
> > yet another abstraction?  

Yes, the question of use cases is extremely important.  It seems
Mellanox is working on "spawning devlink ports" IOW slicing a device
into subdevices.  Which is a great way to run bifurcated DPDK/netdev
applications :/  If that gets merged I think we have to recalculate
what purpose AF_XDP is going to serve, if any.

In my view we have different "levels" of slicing:

 (1) full HW device;
 (2) software device (mdev?);
 (3) separate netdev;
 (4) separate "RSS instance";
 (5) dedicated application queues.

1 - is SR-IOV VFs
2 - is software device slicing with mdev (Mellanox)
3 - is (I think) Intel's VSI debugfs... "thing"..
4 - is just ethtool RSS contexts (Solarflare)
5 - is currently AF-XDP (Intel)

(2) or lower is required to have raw register access allowing vfio/DPDK
to run "natively".

(3) or lower allows for full reuse of all networking APIs, with very
natural RSS configuration, TC/QoS configuration on TX etc.

(5) is sufficient for zero copy.

So back to the use case.. seems like AF_XDP is evolving into allowing
"level 3" to pass all frames directly to the application?  With
optional XDP filtering?  It's not a trick question - I'm just trying to
place it somewhere on my mental map :)

> XDP redirect a more generic abstraction that allow us to implement
> macvlan.  Except macvlan driver is missing ndo_xdp_xmit. Again first I
> write this as global-netdev XDP-prog, that does a lookup in a BPF-map.
> Next I configure HW filters that match the MAC-addr into a queue# and
> attach simpler XDP-prog to queue#, that redirect into macvlan device.
> 
> > Further, is queue/socket a good abstraction for all devices? Wifi? 

Right, queue is no abstraction whatsoever.  Queue is a low level
primitive.

> > By just viewing sockets as an endpoint, we leave it up to the kernel to
> > figure out the best way. "Here's an endpoint. Give me data **here**."
> > 
> > The OpenFlow protocol does however support the concept of queues per
> > port, but do we want to introduce that into the kernel?

Switch queues != host queues.  Switch/HW queues are for QoS, host queues
are for RSS.  Those two concepts are similar yet different.  In Linux
if you offload basic TX TC (mq)prio (the old work John has done for
Intel) the actual number of HW queues becomes "channel count" x "num TC
prios".  What would queue ID mean for AF_XDP in that setup, I wonder.

> > So, if per-queue XDP programs is only for AF_XDP, I think it's better
> > to stick the program to the socket. For me per-queue is sort of a
> > leaky abstraction...
> >
> > More thoughts. If we go the route of per-queue XDP programs. Would it
> > be better to leave the setup to XDP -- i.e. the XDP program is
> > controlling the per-queue programs (think tail-calls, but a map with
> > per-q programs). Instead of the netlink layer. This is part of a
> > bigger discussion, namely should XDP really implement the control
> > plane?
> >
> > I really like that a software switch/router can be implemented
> > effectively with XDP, but ideally I'd like it to be offloaded by
> > hardware -- using the same control/configuration plane. Can we do it
> > in hardware, do that. If not, emulate via XDP.  

There is already a number of proposals in the "device application
slicing", it would be really great if we could make sure we don't
repeat the mistakes of flow configuration APIs, and try to prevent
having too many of them..

Which is very challenging unless we have strong use cases..

> That is actually the reason I want XDP per-queue, as it is a way to
> offload the filtering to the hardware.  And if the per-queue XDP-prog
> becomes simple enough, the hardware can eliminate and do everything in
> hardware (hopefully).
> 
> > The control plane should IMO be outside of the XDP program.

ENOCOMPUTE :)  XDP program is the BPF byte code, it's never control
plance.  Do you mean application should not control the "context/
channel/subdev" creation?  You're not saying "it's not the XDP program
which should be making the classification", no?  XDP program
controlling the classification was _the_ reason why we liked AF_XDP :)

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Per-queue XDP programs, thoughts
  2019-04-15 16:32     ` Per-queue XDP programs, thoughts Jesper Dangaard Brouer
                         ` (2 preceding siblings ...)
  2019-04-15 22:49       ` Jakub Kicinski
@ 2019-04-16  7:44       ` Björn Töpel
  2019-04-16  9:36         ` Toke Høiland-Jørgensen
  2019-04-16 10:15       ` Jason Wang
                         ` (2 subsequent siblings)
  6 siblings, 1 reply; 19+ messages in thread
From: Björn Töpel @ 2019-04-16  7:44 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Björn Töpel, Ilias Apalodimas,
	Toke Høiland-Jørgensen, Karlsson, Magnus,
	maciej.fijalkowski, Jason Wang, Alexei Starovoitov,
	Daniel Borkmann, Jakub Kicinski, John Fastabend, David Miller,
	Andy Gospodarek, netdev, bpf, Thomas Graf, Thomas Monjalon,
	Jonathan Lemon

On Mon, 15 Apr 2019 at 18:33, Jesper Dangaard Brouer <brouer@redhat.com> wrote:
>
>
> On Mon, 15 Apr 2019 13:59:03 +0200 Björn Töpel <bjorn.topel@intel.com> wrote:
>
> > Hi,
> >
> > As you probably can derive from the amount of time this is taking, I'm
> > not really satisfied with the design of per-queue XDP program. (That,
> > plus I'm a terribly slow hacker... ;-)) I'll try to expand my thinking
> > in this mail!
> >
> > Beware, it's kind of a long post, and it's all over the place.
>
> Cc'ing all the XDP-maintainers (and netdev).
>
> > There are a number of ways of setting up flows in the kernel, e.g.
> >
> > * Connecting/accepting a TCP socket (in-band)
> > * Using tc-flower (out-of-band)
> > * ethtool (out-of-band)
> > * ...
> >
> > The first acts on sockets, the second on netdevs. Then there's ethtool
> > to configure RSS, and the RSS-on-steriods rxhash/ntuple that can steer
> > to queues. Most users care about sockets and netdevices. Queues is
> > more of an implementation detail of Rx or for QoS on the Tx side.
>
> Let me first acknowledge that the current Linux tools to administrator
> HW filters is lacking (well sucks).  We know the hardware is capable,
> as DPDK have an full API for this called rte_flow[1]. If nothing else
> you/we can use the DPDK API to create a program to configure the
> hardware, examples here[2]
>
>  [1] https://doc.dpdk.org/guides/prog_guide/rte_flow.html
>  [2] https://doc.dpdk.org/guides/howto/rte_flow.html
>
> > XDP is something that we can attach to a netdevice. Again, very
> > natural from a user perspective. As for XDP sockets, the current
> > mechanism is that we attach to an existing netdevice queue. Ideally
> > what we'd like is to *remove* the queue concept. A better approach
> > would be creating the socket and set it up -- but not binding it to a
> > queue. Instead just binding it to a netdevice (or crazier just
> > creating a socket without a netdevice).
>
> Let me just remind everybody that the AF_XDP performance gains comes
> from binding the resource, which allow for lock-free semantics, as
> explained here[3].
>
> [3] https://github.com/xdp-project/xdp-tutorial/tree/master/advanced03-AF_XDP#where-does-af_xdp-performance-come-from
>

Yes, but leaving the "binding to queue" to the kernel wouldn't really
change much. It would mostly be that the *user* doesn't need to care
about hardware details. My concern is about "what is a good
abstraction".

>
> > The socket is an endpoint, where I'd like data to end up (or get sent
> > from). If the kernel can attach the socket to a hardware queue,
> > there's zerocopy if not, copy-mode. Dito for Tx.
>
> Well XDP programs per RXQ is just a building block to achieve this.
>
> As Van Jacobson explain[4], sockets or applications "register" a
> "transport signature", and gets back a "channel".   In our case, the
> netdev-global XDP program is our way to register/program these transport
> signatures and redirect (e.g. into the AF_XDP socket).
> This requires some work in software to parse and match transport
> signatures to sockets.  The XDP programs per RXQ is a way to get
> hardware to perform this filtering for us.
>
>  [4] http://www.lemis.com/grog/Documentation/vj/lca06vj.pdf
>

There are a lot of things that are missing to build what you're
describing above. Yes, we need a better way to program the HW from
Linux userland (old topic); What I fail to see is how per-queue XDP is
a way to get hardware to perform filtering. Could you give a
longer/complete example (obviously with non-existing features :-)), so
I get a better view what you're aiming for?


>
> > Does a user (control plane) want/need to care about queues? Just
> > create a flow to a socket (out-of-band or inband) or to a netdevice
> > (out-of-band).
>
> A userspace "control-plane" program, could hide the setup and use what
> the system/hardware can provide of optimizations.  VJ[4] e.g. suggest
> that the "listen" socket first register the transport signature (with
> the driver) on "accept()".   If the HW supports DPDK-rte_flow API we
> can register a 5-tuple (or create TC-HW rules) and load our
> "transport-signature" XDP prog on the queue number we choose.  If not,
> when our netdev-global XDP prog need a hash-table with 5-tuple and do
> 5-tuple parsing.
>
> Creating netdevices via HW filter into queues is an interesting idea.
> DPDK have an example here[5], on how to per flow (via ethtool filter
> setup even!) send packets to queues, that endup in SRIOV devices.
>
>  [5] https://doc.dpdk.org/guides/howto/flow_bifurcation.html
>
>
> > Do we envison any other uses for per-queue XDP other than AF_XDP? If
> > not, it would make *more* sense to attach the XDP program to the
> > socket (e.g. if the endpoint would like to use kernel data structures
> > via XDP).
>
> As demonstrated in [5] you can use (ethtool) hardware filters to
> redirect packets into VFs (Virtual Functions).
>
> I also want us to extend XDP to allow for redirect from a PF (Physical
> Function) into a VF (Virtual Function).  First the netdev-global
> XDP-prog need to support this (maybe extend xdp_rxq_info with PF + VF
> info).  Next configure HW filter to queue# and load XDP prog on that
> queue# that only "redirect" to a single VF.  Now if driver+HW supports
> it, it can "eliminate" the per-queue XDP-prog and do everything in HW.
>

Again, let's try to be more concrete! So, one (non-existing) mechanism
to program filtering to HW queues, and then attaching a per-queue
program to that HW queue, which can in some cases be elided? I'm not
opposing the idea of per-queue, I'm just trying to figure out
*exactly* what we're aiming for.

My concern is, again, mainly that is a queue abstraction something
we'd like to introduce to userland. It's not there (well, no really
:-)) today. And from an AF_XDP userland perspective that's painful.
"Oh, you need to fix your RSS hashing/flow." E.g. if I read what
Jonathan is looking for, it's more of something like what Jiri Pirko
suggested in [1] (slide 9, 10).

Hey, maybe I just need to see the fuller picture. :-) AF_XDP is too
tricky to use from XDP IMO. Per-queue XDP program would *optimize*
AF_XDP, but not solving the filtering. Maybe starting in the
filtering/metadata offload path end of things, and then see what we're
missing.

>
> > If we'd like to slice a netdevice into multiple queues. Isn't macvlan
> > or similar *virtual* netdevices a better path, instead of introducing
> > yet another abstraction?
>
> XDP redirect a more generic abstraction that allow us to implement
> macvlan.  Except macvlan driver is missing ndo_xdp_xmit. Again first I
> write this as global-netdev XDP-prog, that does a lookup in a BPF-map.
> Next I configure HW filters that match the MAC-addr into a queue# and
> attach simpler XDP-prog to queue#, that redirect into macvlan device.
>

Just for context; I was thinking something like macvlan with
ndo_dfwd_add/del_station functionality. "A virtual interface that is
simply is a view of a physical". A per-queue program would then mean
"create a netdev for that queue".

>
> > Further, is queue/socket a good abstraction for all devices? Wifi? By
> > just viewing sockets as an endpoint, we leave it up to the kernel to
> > figure out the best way. "Here's an endpoint. Give me data **here**."
> >
> > The OpenFlow protocol does however support the concept of queues per
> > port, but do we want to introduce that into the kernel?
> >
> > So, if per-queue XDP programs is only for AF_XDP, I think it's better
> > to stick the program to the socket. For me per-queue is sort of a
> > leaky abstraction...
> >
> > More thoughts. If we go the route of per-queue XDP programs. Would it
> > be better to leave the setup to XDP -- i.e. the XDP program is
> > controlling the per-queue programs (think tail-calls, but a map with
> > per-q programs). Instead of the netlink layer. This is part of a
> > bigger discussion, namely should XDP really implement the control
> > plane?
> >
> > I really like that a software switch/router can be implemented
> > effectively with XDP, but ideally I'd like it to be offloaded by
> > hardware -- using the same control/configuration plane. Can we do it
> > in hardware, do that. If not, emulate via XDP.
>
> That is actually the reason I want XDP per-queue, as it is a way to
> offload the filtering to the hardware.  And if the per-queue XDP-prog
> becomes simple enough, the hardware can eliminate and do everything in
> hardware (hopefully).
>
>
> > The control plane should IMO be outside of the XDP program.
> >
> > Ok, please convince me! :-D
>
> I tried to above...
>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer
>
>
> More use-cases for per-queue XDP-prog:
>
> XDP for containers
> ------------------
> XDP can redirect into veth, used for containers.  So, I want to
> implement container protection/isolation in XDP layer.  E.g. I only
> want my container to talk to 1 external src-IP to my container dst-IP
> on port 80.  I can implement that check in netdev-global XDP BPF-code.
> But I can also hardware "offload" this simple filter (via ethtool or
> rte_flow) and simplify the per-queue XDP-prog.  Given the queue now
> only receives traffic that match my desc, I have now protected/isolated
> the traffic to my container.
>

And are you sure that you'd like this at a queue granularity, and not
netdevice or socket?

>
> DPDK using per-queue AF_XDP
> ---------------------------
> AFAIK an AF_XDP PMD driver have been merged in DPDK (but I've not
> looked at the code).
>
> It would be very natural for DPDK to load per-queue XDP-progs for
> interfacing with AF_XDP, as they already have rte_flow API (see
> [1]+[2]) for configuring HW filters.  And loading per-queue XDP-progs
> would also avoid disturbing other users of XDP on same machine (if we
> choose the semantics defined in [6]).
>

Yes, here it would definitely help the PMD, but having a socket
without per-queue (bound directly w/o XDP ala the "built-in" path)
would help even more. I guess this part of the "does per-queue XDP
programs make sense for anyone else but AF_XDP".

> [6] https://github.com/xdp-project/xdp-project/blob/master/areas/core/xdp_per_rxq01.org#proposal-rxq-prog-takes-precedence


Björn

[1] https://www.netdevconf.org/0.1/docs/pirko-ovstc-slides.pdf

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Per-queue XDP programs, thoughts
  2019-04-15 22:49       ` Jakub Kicinski
@ 2019-04-16  7:45         ` Björn Töpel
  2019-04-16 21:17           ` Jakub Kicinski
  2019-04-16 13:55         ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 19+ messages in thread
From: Björn Töpel @ 2019-04-16  7:45 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Jesper Dangaard Brouer, Björn Töpel, Ilias Apalodimas,
	Toke Høiland-Jørgensen, Karlsson, Magnus,
	maciej.fijalkowski, Jason Wang, Alexei Starovoitov,
	Daniel Borkmann, John Fastabend, David Miller, Andy Gospodarek,
	netdev, bpf, Thomas Graf, Thomas Monjalon, Jonathan Lemon

On Tue, 16 Apr 2019 at 00:49, Jakub Kicinski
<jakub.kicinski@netronome.com> wrote:
>
> On Mon, 15 Apr 2019 18:32:58 +0200, Jesper Dangaard Brouer wrote:
> > On Mon, 15 Apr 2019 13:59:03 +0200 Björn Töpel <bjorn.topel@intel.com> wrote:
> > > Hi,
> > >
> > > As you probably can derive from the amount of time this is taking, I'm
> > > not really satisfied with the design of per-queue XDP program. (That,
> > > plus I'm a terribly slow hacker... ;-)) I'll try to expand my thinking
> > > in this mail!
>
> Jesper was advocating per-queue progs since very early days of XDP.
> If it was easy to implement cleanly we would've already gotten it ;)
>

I don't think it's a matter of implementing cleanly (again I know very
little about the XDP HW offloads, so that aside! :-D), it's a matter a
"what's the use-case". Mine is "Are queues something we'd like to
expose to the users?".

> > > Beware, it's kind of a long post, and it's all over the place.
> >
> > Cc'ing all the XDP-maintainers (and netdev).
> >
> > > There are a number of ways of setting up flows in the kernel, e.g.
> > >
> > > * Connecting/accepting a TCP socket (in-band)
> > > * Using tc-flower (out-of-band)
> > > * ethtool (out-of-band)
> > > * ...
> > >
> > > The first acts on sockets, the second on netdevs. Then there's ethtool
> > > to configure RSS, and the RSS-on-steriods rxhash/ntuple that can steer
> > > to queues. Most users care about sockets and netdevices. Queues is
> > > more of an implementation detail of Rx or for QoS on the Tx side.
> >
> > Let me first acknowledge that the current Linux tools to administrator
> > HW filters is lacking (well sucks).  We know the hardware is capable,
> > as DPDK have an full API for this called rte_flow[1]. If nothing else
> > you/we can use the DPDK API to create a program to configure the
> > hardware, examples here[2]
> >
> >  [1] https://doc.dpdk.org/guides/prog_guide/rte_flow.html
> >  [2] https://doc.dpdk.org/guides/howto/rte_flow.html
> >
> > > XDP is something that we can attach to a netdevice. Again, very
> > > natural from a user perspective. As for XDP sockets, the current
> > > mechanism is that we attach to an existing netdevice queue. Ideally
> > > what we'd like is to *remove* the queue concept. A better approach
> > > would be creating the socket and set it up -- but not binding it to a
> > > queue. Instead just binding it to a netdevice (or crazier just
> > > creating a socket without a netdevice).
>
> You can remove the concept of a queue from the AF_XDP ABI (well, extend
> it to not require the queue being explicitly specified..), but you can't
> avoid the user space knowing there is a queue.  Because if you do you
> can no longer track and configure that queue (things like IRQ
> moderation, descriptor count etc.)
>

Exactly. There *would* be an underlying queue (if the socket is backed
by a HW queue). Copy-mode is really just a software mechanism, but we
rely on queue-id for performance and conformance with zc mode.

For a user that "just want to toss packets to userspace", they care
about *sockets*. Maybe this is where my argument breaks down. :-) I'd
prefer if a user would just need to care about netdevs and sockets
(including INET sockets). And if the hardware can back the socket
implementation by a hardware queue and get better performance, that's
great, but does the user really need to care?

> Currently the term "queue" refers mostly to the queues that stack uses.
> Which leaves e.g. the XDP TX queues in a strange grey zone (from
> ethtool channel ABI perspective, RPS, XPS etc.)  So it would be nice to
> have the HW queue ID somewhat detached from stack queue ID.  Or at least
> it'd be nice to introduce queue types?  I've been pondering this for a
> while, I don't see any silver bullet here..
>

Yup! And for AF_XDP we'd like to create a *new* hw queue (ideally) for
Rx and Tx, which would also land in the XDP Tx queue gray zone (from a
configuration perspective).

> > Let me just remind everybody that the AF_XDP performance gains comes
> > from binding the resource, which allow for lock-free semantics, as
> > explained here[3].
> >
> > [3] https://github.com/xdp-project/xdp-tutorial/tree/master/advanced03-AF_XDP#where-does-af_xdp-performance-come-from
> >
> >
> > > The socket is an endpoint, where I'd like data to end up (or get sent
> > > from). If the kernel can attach the socket to a hardware queue,
> > > there's zerocopy if not, copy-mode. Dito for Tx.
> >
> > Well XDP programs per RXQ is just a building block to achieve this.
> >
> > As Van Jacobson explain[4], sockets or applications "register" a
> > "transport signature", and gets back a "channel".   In our case, the
> > netdev-global XDP program is our way to register/program these transport
> > signatures and redirect (e.g. into the AF_XDP socket).
> > This requires some work in software to parse and match transport
> > signatures to sockets.  The XDP programs per RXQ is a way to get
> > hardware to perform this filtering for us.
> >
> >  [4] http://www.lemis.com/grog/Documentation/vj/lca06vj.pdf
> >
> >
> > > Does a user (control plane) want/need to care about queues? Just
> > > create a flow to a socket (out-of-band or inband) or to a netdevice
> > > (out-of-band).
> >
> > A userspace "control-plane" program, could hide the setup and use what
> > the system/hardware can provide of optimizations.  VJ[4] e.g. suggest
> > that the "listen" socket first register the transport signature (with
> > the driver) on "accept()".   If the HW supports DPDK-rte_flow API we
> > can register a 5-tuple (or create TC-HW rules) and load our
> > "transport-signature" XDP prog on the queue number we choose.  If not,
> > when our netdev-global XDP prog need a hash-table with 5-tuple and do
> > 5-tuple parsing.
>
> But we do want the ability to configure the queue, and get stats for
> that queue.. so we can't hide the queue completely, right?
>

Hmm... right.

> > Creating netdevices via HW filter into queues is an interesting idea.
> > DPDK have an example here[5], on how to per flow (via ethtool filter
> > setup even!) send packets to queues, that endup in SRIOV devices.
> >
> >  [5] https://doc.dpdk.org/guides/howto/flow_bifurcation.html
>
> I wish I had the courage to nack the ethtool redirect to VF Intel
> added :)
>
> > > Do we envison any other uses for per-queue XDP other than AF_XDP? If
> > > not, it would make *more* sense to attach the XDP program to the
> > > socket (e.g. if the endpoint would like to use kernel data structures
> > > via XDP).
> >
> > As demonstrated in [5] you can use (ethtool) hardware filters to
> > redirect packets into VFs (Virtual Functions).
> >
> > I also want us to extend XDP to allow for redirect from a PF (Physical
> > Function) into a VF (Virtual Function).  First the netdev-global
> > XDP-prog need to support this (maybe extend xdp_rxq_info with PF + VF
> > info).  Next configure HW filter to queue# and load XDP prog on that
> > queue# that only "redirect" to a single VF.  Now if driver+HW supports
> > it, it can "eliminate" the per-queue XDP-prog and do everything in HW.
>
> That sounds slightly contrived.  If the program is not doing anything,
> why involve XDP at all?  As stated above we already have too many ways
> to do flow config and/or VF redirect.
>
> > > If we'd like to slice a netdevice into multiple queues. Isn't macvlan
> > > or similar *virtual* netdevices a better path, instead of introducing
> > > yet another abstraction?
>
> Yes, the question of use cases is extremely important.  It seems
> Mellanox is working on "spawning devlink ports" IOW slicing a device
> into subdevices.  Which is a great way to run bifurcated DPDK/netdev
> applications :/  If that gets merged I think we have to recalculate
> what purpose AF_XDP is going to serve, if any.
>

I really like the subdevice-think, but let's have the drivers in the
kernel. I don't see how the XDP view (including AF_XDP) changes with
subdevices. My view on AF_XDP is that it's a socket that can
receive/send data efficiently from/to the kernel. What subdevice
*might* change is the requirement for a per-queue XDP program.

> In my view we have different "levels" of slicing:
>
>  (1) full HW device;
>  (2) software device (mdev?);
>  (3) separate netdev;
>  (4) separate "RSS instance";
>  (5) dedicated application queues.
>
> 1 - is SR-IOV VFs
> 2 - is software device slicing with mdev (Mellanox)
> 3 - is (I think) Intel's VSI debugfs... "thing"..

:-) The VMDq netdevice?

> 4 - is just ethtool RSS contexts (Solarflare)
> 5 - is currently AF-XDP (Intel)
>
> (2) or lower is required to have raw register access allowing vfio/DPDK
> to run "natively".
>
> (3) or lower allows for full reuse of all networking APIs, with very
> natural RSS configuration, TC/QoS configuration on TX etc.
>
> (5) is sufficient for zero copy.
>
> So back to the use case.. seems like AF_XDP is evolving into allowing
> "level 3" to pass all frames directly to the application?  With
> optional XDP filtering?  It's not a trick question - I'm just trying to
> place it somewhere on my mental map :)
>

I wouldn't say evolving into. That is one use-case that we'd like to
support efficiently. Being able to toss some packets out to userspace
from an XDP program is also useful. Some applications just want all
packets to userspace. Some applications want to use
kernel-infrastructure via XDP prior tossing it to userland. Some
application want the XDP packet metadata in userland...


> > XDP redirect a more generic abstraction that allow us to implement
> > macvlan.  Except macvlan driver is missing ndo_xdp_xmit. Again first I
> > write this as global-netdev XDP-prog, that does a lookup in a BPF-map.
> > Next I configure HW filters that match the MAC-addr into a queue# and
> > attach simpler XDP-prog to queue#, that redirect into macvlan device.
> >
> > > Further, is queue/socket a good abstraction for all devices? Wifi?
>
> Right, queue is no abstraction whatsoever.  Queue is a low level
> primitive.
>

...and do we want to introduce that as a proper kernel object?

> > > By just viewing sockets as an endpoint, we leave it up to the kernel to
> > > figure out the best way. "Here's an endpoint. Give me data **here**."
> > >
> > > The OpenFlow protocol does however support the concept of queues per
> > > port, but do we want to introduce that into the kernel?
>
> Switch queues != host queues.  Switch/HW queues are for QoS, host queues
> are for RSS.  Those two concepts are similar yet different.  In Linux
> if you offload basic TX TC (mq)prio (the old work John has done for
> Intel) the actual number of HW queues becomes "channel count" x "num TC
> prios".  What would queue ID mean for AF_XDP in that setup, I wonder.
>
> > > So, if per-queue XDP programs is only for AF_XDP, I think it's better
> > > to stick the program to the socket. For me per-queue is sort of a
> > > leaky abstraction...
> > >
> > > More thoughts. If we go the route of per-queue XDP programs. Would it
> > > be better to leave the setup to XDP -- i.e. the XDP program is
> > > controlling the per-queue programs (think tail-calls, but a map with
> > > per-q programs). Instead of the netlink layer. This is part of a
> > > bigger discussion, namely should XDP really implement the control
> > > plane?
> > >
> > > I really like that a software switch/router can be implemented
> > > effectively with XDP, but ideally I'd like it to be offloaded by
> > > hardware -- using the same control/configuration plane. Can we do it
> > > in hardware, do that. If not, emulate via XDP.
>
> There is already a number of proposals in the "device application
> slicing", it would be really great if we could make sure we don't
> repeat the mistakes of flow configuration APIs, and try to prevent
> having too many of them..
>
> Which is very challenging unless we have strong use cases..
>

Agree!

> > That is actually the reason I want XDP per-queue, as it is a way to
> > offload the filtering to the hardware.  And if the per-queue XDP-prog
> > becomes simple enough, the hardware can eliminate and do everything in
> > hardware (hopefully).
> >
> > > The control plane should IMO be outside of the XDP program.
>
> ENOCOMPUTE :)  XDP program is the BPF byte code, it's never control
> plance.  Do you mean application should not control the "context/
> channel/subdev" creation?

Yes, but I'm not sure. I'd like to hear more opinions.

Let me try to think out loud here. Say that per-queue XDP programs
exist. The main XDP program receives all packets and makes the
decision that a certain flow should end up in say queue X, and that
the hardware supports offloading that. Should the knobs to program the
hardware be in via BPF or by some other mechanism (perf ring to
userland daemon)? Further, setting the XDP program per queue. Should
that be done via XDP (the main XDP program has knowledge of all
programs) or via say netlink (as XDP is today). One could argue that
the per-queue setup should be a map (like tail-calls).

> You're not saying "it's not the XDP program
> which should be making the classification", no?  XDP program
> controlling the classification was _the_ reason why we liked AF_XDP :)

XDP program not doing classification would be weird. But if there's a
scenario where *everything for a certain HW filter* end up in an
AF_XDP queue, should we require an XDP program. I've been going back
and forth here... :-)


Björn

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Per-queue XDP programs, thoughts
  2019-04-16  7:44       ` Björn Töpel
@ 2019-04-16  9:36         ` Toke Høiland-Jørgensen
  2019-04-16 12:07           ` Björn Töpel
  0 siblings, 1 reply; 19+ messages in thread
From: Toke Høiland-Jørgensen @ 2019-04-16  9:36 UTC (permalink / raw)
  To: Björn Töpel, Jesper Dangaard Brouer
  Cc: Björn Töpel, Ilias Apalodimas, Karlsson, Magnus,
	maciej.fijalkowski, Jason Wang, Alexei Starovoitov,
	Daniel Borkmann, Jakub Kicinski, John Fastabend, David Miller,
	Andy Gospodarek, netdev, bpf, Thomas Graf, Thomas Monjalon,
	Jonathan Lemon

Björn Töpel <bjorn.topel@gmail.com> writes:

> On Mon, 15 Apr 2019 at 18:33, Jesper Dangaard Brouer <brouer@redhat.com> wrote:
>>
>>
>> On Mon, 15 Apr 2019 13:59:03 +0200 Björn Töpel <bjorn.topel@intel.com> wrote:
>>
>> > Hi,
>> >
>> > As you probably can derive from the amount of time this is taking, I'm
>> > not really satisfied with the design of per-queue XDP program. (That,
>> > plus I'm a terribly slow hacker... ;-)) I'll try to expand my thinking
>> > in this mail!
>> >
>> > Beware, it's kind of a long post, and it's all over the place.
>>
>> Cc'ing all the XDP-maintainers (and netdev).
>>
>> > There are a number of ways of setting up flows in the kernel, e.g.
>> >
>> > * Connecting/accepting a TCP socket (in-band)
>> > * Using tc-flower (out-of-band)
>> > * ethtool (out-of-band)
>> > * ...
>> >
>> > The first acts on sockets, the second on netdevs. Then there's ethtool
>> > to configure RSS, and the RSS-on-steriods rxhash/ntuple that can steer
>> > to queues. Most users care about sockets and netdevices. Queues is
>> > more of an implementation detail of Rx or for QoS on the Tx side.
>>
>> Let me first acknowledge that the current Linux tools to administrator
>> HW filters is lacking (well sucks).  We know the hardware is capable,
>> as DPDK have an full API for this called rte_flow[1]. If nothing else
>> you/we can use the DPDK API to create a program to configure the
>> hardware, examples here[2]
>>
>>  [1] https://doc.dpdk.org/guides/prog_guide/rte_flow.html
>>  [2] https://doc.dpdk.org/guides/howto/rte_flow.html
>>
>> > XDP is something that we can attach to a netdevice. Again, very
>> > natural from a user perspective. As for XDP sockets, the current
>> > mechanism is that we attach to an existing netdevice queue. Ideally
>> > what we'd like is to *remove* the queue concept. A better approach
>> > would be creating the socket and set it up -- but not binding it to a
>> > queue. Instead just binding it to a netdevice (or crazier just
>> > creating a socket without a netdevice).
>>
>> Let me just remind everybody that the AF_XDP performance gains comes
>> from binding the resource, which allow for lock-free semantics, as
>> explained here[3].
>>
>> [3] https://github.com/xdp-project/xdp-tutorial/tree/master/advanced03-AF_XDP#where-does-af_xdp-performance-come-from
>>
>
> Yes, but leaving the "binding to queue" to the kernel wouldn't really
> change much. It would mostly be that the *user* doesn't need to care
> about hardware details. My concern is about "what is a good
> abstraction".

Can we really guarantee that we will make the right decision from inside
the kernel, though?

>>
>> > The socket is an endpoint, where I'd like data to end up (or get sent
>> > from). If the kernel can attach the socket to a hardware queue,
>> > there's zerocopy if not, copy-mode. Dito for Tx.
>>
>> Well XDP programs per RXQ is just a building block to achieve this.
>>
>> As Van Jacobson explain[4], sockets or applications "register" a
>> "transport signature", and gets back a "channel".   In our case, the
>> netdev-global XDP program is our way to register/program these transport
>> signatures and redirect (e.g. into the AF_XDP socket).
>> This requires some work in software to parse and match transport
>> signatures to sockets.  The XDP programs per RXQ is a way to get
>> hardware to perform this filtering for us.
>>
>>  [4] http://www.lemis.com/grog/Documentation/vj/lca06vj.pdf
>>
>
> There are a lot of things that are missing to build what you're
> describing above. Yes, we need a better way to program the HW from
> Linux userland (old topic); What I fail to see is how per-queue XDP is
> a way to get hardware to perform filtering. Could you give a
> longer/complete example (obviously with non-existing features :-)), so
> I get a better view what you're aiming for?
>
>
>>
>> > Does a user (control plane) want/need to care about queues? Just
>> > create a flow to a socket (out-of-band or inband) or to a netdevice
>> > (out-of-band).
>>
>> A userspace "control-plane" program, could hide the setup and use what
>> the system/hardware can provide of optimizations.  VJ[4] e.g. suggest
>> that the "listen" socket first register the transport signature (with
>> the driver) on "accept()".   If the HW supports DPDK-rte_flow API we
>> can register a 5-tuple (or create TC-HW rules) and load our
>> "transport-signature" XDP prog on the queue number we choose.  If not,
>> when our netdev-global XDP prog need a hash-table with 5-tuple and do
>> 5-tuple parsing.
>>
>> Creating netdevices via HW filter into queues is an interesting idea.
>> DPDK have an example here[5], on how to per flow (via ethtool filter
>> setup even!) send packets to queues, that endup in SRIOV devices.
>>
>>  [5] https://doc.dpdk.org/guides/howto/flow_bifurcation.html
>>
>>
>> > Do we envison any other uses for per-queue XDP other than AF_XDP? If
>> > not, it would make *more* sense to attach the XDP program to the
>> > socket (e.g. if the endpoint would like to use kernel data structures
>> > via XDP).
>>
>> As demonstrated in [5] you can use (ethtool) hardware filters to
>> redirect packets into VFs (Virtual Functions).
>>
>> I also want us to extend XDP to allow for redirect from a PF (Physical
>> Function) into a VF (Virtual Function).  First the netdev-global
>> XDP-prog need to support this (maybe extend xdp_rxq_info with PF + VF
>> info).  Next configure HW filter to queue# and load XDP prog on that
>> queue# that only "redirect" to a single VF.  Now if driver+HW supports
>> it, it can "eliminate" the per-queue XDP-prog and do everything in HW.
>>
>
> Again, let's try to be more concrete! So, one (non-existing) mechanism
> to program filtering to HW queues, and then attaching a per-queue
> program to that HW queue, which can in some cases be elided? I'm not
> opposing the idea of per-queue, I'm just trying to figure out
> *exactly* what we're aiming for.
>
> My concern is, again, mainly that is a queue abstraction something
> we'd like to introduce to userland. It's not there (well, no really
> :-)) today. And from an AF_XDP userland perspective that's painful.
> "Oh, you need to fix your RSS hashing/flow." E.g. if I read what
> Jonathan is looking for, it's more of something like what Jiri Pirko
> suggested in [1] (slide 9, 10).
>
> Hey, maybe I just need to see the fuller picture. :-) AF_XDP is too
> tricky to use from XDP IMO. Per-queue XDP program would *optimize*
> AF_XDP, but not solving the filtering. Maybe starting in the
> filtering/metadata offload path end of things, and then see what we're
> missing.
>
>>
>> > If we'd like to slice a netdevice into multiple queues. Isn't macvlan
>> > or similar *virtual* netdevices a better path, instead of introducing
>> > yet another abstraction?
>>
>> XDP redirect a more generic abstraction that allow us to implement
>> macvlan.  Except macvlan driver is missing ndo_xdp_xmit. Again first I
>> write this as global-netdev XDP-prog, that does a lookup in a BPF-map.
>> Next I configure HW filters that match the MAC-addr into a queue# and
>> attach simpler XDP-prog to queue#, that redirect into macvlan device.
>>
>
> Just for context; I was thinking something like macvlan with
> ndo_dfwd_add/del_station functionality. "A virtual interface that is
> simply is a view of a physical". A per-queue program would then mean
> "create a netdev for that queue".

My immediate reaction is that I kinda like this model from an API PoV;
not sure what it would take to get there, though? When you say
'something like macvlan', you do mean we'd have to add something
completely new, right?

-Toke

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Per-queue XDP programs, thoughts
  2019-04-15 16:32     ` Per-queue XDP programs, thoughts Jesper Dangaard Brouer
                         ` (3 preceding siblings ...)
  2019-04-16  7:44       ` Björn Töpel
@ 2019-04-16 10:15       ` Jason Wang
  2019-04-16 10:41       ` Jason Wang
  2019-04-17 16:46       ` Björn Töpel
  6 siblings, 0 replies; 19+ messages in thread
From: Jason Wang @ 2019-04-16 10:15 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Björn Töpel
  Cc: Björn Töpel, ilias.apalodimas, toke, magnus.karlsson,
	maciej.fijalkowski, Alexei Starovoitov, Daniel Borkmann,
	Jakub Kicinski, John Fastabend, David Miller, Andy Gospodarek,
	netdev, bpf, Thomas Graf, Thomas Monjalon


On 2019/4/16 上午12:32, Jesper Dangaard Brouer wrote:
>> If we'd like to slice a netdevice into multiple queues. Isn't macvlan
>> or similar*virtual*  netdevices a better path, instead of introducing
>> yet another abstraction?
> XDP redirect a more generic abstraction that allow us to implement
> macvlan.  Except macvlan driver is missing ndo_xdp_xmit. Again first I
> write this as global-netdev XDP-prog, that does a lookup in a BPF-map.
> Next I configure HW filters that match the MAC-addr into a queue# and
> attach simpler XDP-prog to queue#, that redirect into macvlan device.


I'm afraid what we want is a full XDP support for macvlan (RX) not only 
a XDP TX support? This could not be done through XDP_REDIRECT. If we 
want to use  XDP_REDIRECT, we should implement XDP support for ifb then 
redirect packet there.

Thanks


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Per-queue XDP programs, thoughts
  2019-04-15 16:32     ` Per-queue XDP programs, thoughts Jesper Dangaard Brouer
                         ` (4 preceding siblings ...)
  2019-04-16 10:15       ` Jason Wang
@ 2019-04-16 10:41       ` Jason Wang
  2019-04-17 16:46       ` Björn Töpel
  6 siblings, 0 replies; 19+ messages in thread
From: Jason Wang @ 2019-04-16 10:41 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Björn Töpel
  Cc: Björn Töpel, ilias.apalodimas, toke, magnus.karlsson,
	maciej.fijalkowski, Alexei Starovoitov, Daniel Borkmann,
	Jakub Kicinski, John Fastabend, David Miller, Andy Gospodarek,
	netdev, bpf, Thomas Graf, Thomas Monjalon, Michael S. Tsirkin


On 2019/4/16 上午12:32, Jesper Dangaard Brouer wrote:
>> XDP is something that we can attach to a netdevice. Again, very
>> natural from a user perspective. As for XDP sockets, the current
>> mechanism is that we attach to an existing netdevice queue. Ideally
>> what we'd like is to*remove*  the queue concept. A better approach
>> would be creating the socket and set it up -- but not binding it to a
>> queue. Instead just binding it to a netdevice (or crazier just
>> creating a socket without a netdevice).


Isn't XDP support for TUN/TAP just a good example of this. It hides the 
all details and depends on XDP_REDIRECT to work. This allows the eBPF 
program or other steering tool to do anything it want on host. You can 
implement AF_XDP ring layout mmap for TUN/TAP or just use 
vhost_net(virtio ring layout) instead.

Thanks


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Per-queue XDP programs, thoughts
  2019-04-16  9:36         ` Toke Høiland-Jørgensen
@ 2019-04-16 12:07           ` Björn Töpel
  2019-04-16 13:25             ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 19+ messages in thread
From: Björn Töpel @ 2019-04-16 12:07 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Björn Töpel,
	Jesper Dangaard Brouer
  Cc: Ilias Apalodimas, Karlsson, Magnus, maciej.fijalkowski,
	Jason Wang, Alexei Starovoitov, Daniel Borkmann, Jakub Kicinski,
	John Fastabend, David Miller, Andy Gospodarek, netdev, bpf,
	Thomas Graf, Thomas Monjalon, Jonathan Lemon

On 2019-04-16 11:36, Toke Høiland-Jørgensen wrote:
> Björn Töpel <bjorn.topel@gmail.com> writes:
> 
>> On Mon, 15 Apr 2019 at 18:33, Jesper Dangaard Brouer <brouer@redhat.com> wrote:
>>>
>>>
>>> On Mon, 15 Apr 2019 13:59:03 +0200 Björn Töpel <bjorn.topel@intel.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> As you probably can derive from the amount of time this is taking, I'm
>>>> not really satisfied with the design of per-queue XDP program. (That,
>>>> plus I'm a terribly slow hacker... ;-)) I'll try to expand my thinking
>>>> in this mail!
>>>>
>>>> Beware, it's kind of a long post, and it's all over the place.
>>>
>>> Cc'ing all the XDP-maintainers (and netdev).
>>>
>>>> There are a number of ways of setting up flows in the kernel, e.g.
>>>>
>>>> * Connecting/accepting a TCP socket (in-band)
>>>> * Using tc-flower (out-of-band)
>>>> * ethtool (out-of-band)
>>>> * ...
>>>>
>>>> The first acts on sockets, the second on netdevs. Then there's ethtool
>>>> to configure RSS, and the RSS-on-steriods rxhash/ntuple that can steer
>>>> to queues. Most users care about sockets and netdevices. Queues is
>>>> more of an implementation detail of Rx or for QoS on the Tx side.
>>>
>>> Let me first acknowledge that the current Linux tools to administrator
>>> HW filters is lacking (well sucks).  We know the hardware is capable,
>>> as DPDK have an full API for this called rte_flow[1]. If nothing else
>>> you/we can use the DPDK API to create a program to configure the
>>> hardware, examples here[2]
>>>
>>>   [1] https://doc.dpdk.org/guides/prog_guide/rte_flow.html
>>>   [2] https://doc.dpdk.org/guides/howto/rte_flow.html
>>>
>>>> XDP is something that we can attach to a netdevice. Again, very
>>>> natural from a user perspective. As for XDP sockets, the current
>>>> mechanism is that we attach to an existing netdevice queue. Ideally
>>>> what we'd like is to *remove* the queue concept. A better approach
>>>> would be creating the socket and set it up -- but not binding it to a
>>>> queue. Instead just binding it to a netdevice (or crazier just
>>>> creating a socket without a netdevice).
>>>
>>> Let me just remind everybody that the AF_XDP performance gains comes
>>> from binding the resource, which allow for lock-free semantics, as
>>> explained here[3].
>>>
>>> [3] https://github.com/xdp-project/xdp-tutorial/tree/master/advanced03-AF_XDP#where-does-af_xdp-performance-come-from
>>>
>>
>> Yes, but leaving the "binding to queue" to the kernel wouldn't really
>> change much. It would mostly be that the *user* doesn't need to care
>> about hardware details. My concern is about "what is a good
>> abstraction".
> 
> Can we really guarantee that we will make the right decision from inside
> the kernel, though?
>

Uhm, what do you mean here?


>>>
>>>> The socket is an endpoint, where I'd like data to end up (or get sent
>>>> from). If the kernel can attach the socket to a hardware queue,
>>>> there's zerocopy if not, copy-mode. Dito for Tx.
>>>
>>> Well XDP programs per RXQ is just a building block to achieve this.
>>>
>>> As Van Jacobson explain[4], sockets or applications "register" a
>>> "transport signature", and gets back a "channel".   In our case, the
>>> netdev-global XDP program is our way to register/program these transport
>>> signatures and redirect (e.g. into the AF_XDP socket).
>>> This requires some work in software to parse and match transport
>>> signatures to sockets.  The XDP programs per RXQ is a way to get
>>> hardware to perform this filtering for us.
>>>
>>>   [4] http://www.lemis.com/grog/Documentation/vj/lca06vj.pdf
>>>
>>
>> There are a lot of things that are missing to build what you're
>> describing above. Yes, we need a better way to program the HW from
>> Linux userland (old topic); What I fail to see is how per-queue XDP is
>> a way to get hardware to perform filtering. Could you give a
>> longer/complete example (obviously with non-existing features :-)), so
>> I get a better view what you're aiming for?
>>
>>
>>>
>>>> Does a user (control plane) want/need to care about queues? Just
>>>> create a flow to a socket (out-of-band or inband) or to a netdevice
>>>> (out-of-band).
>>>
>>> A userspace "control-plane" program, could hide the setup and use what
>>> the system/hardware can provide of optimizations.  VJ[4] e.g. suggest
>>> that the "listen" socket first register the transport signature (with
>>> the driver) on "accept()".   If the HW supports DPDK-rte_flow API we
>>> can register a 5-tuple (or create TC-HW rules) and load our
>>> "transport-signature" XDP prog on the queue number we choose.  If not,
>>> when our netdev-global XDP prog need a hash-table with 5-tuple and do
>>> 5-tuple parsing.
>>>
>>> Creating netdevices via HW filter into queues is an interesting idea.
>>> DPDK have an example here[5], on how to per flow (via ethtool filter
>>> setup even!) send packets to queues, that endup in SRIOV devices.
>>>
>>>   [5] https://doc.dpdk.org/guides/howto/flow_bifurcation.html
>>>
>>>
>>>> Do we envison any other uses for per-queue XDP other than AF_XDP? If
>>>> not, it would make *more* sense to attach the XDP program to the
>>>> socket (e.g. if the endpoint would like to use kernel data structures
>>>> via XDP).
>>>
>>> As demonstrated in [5] you can use (ethtool) hardware filters to
>>> redirect packets into VFs (Virtual Functions).
>>>
>>> I also want us to extend XDP to allow for redirect from a PF (Physical
>>> Function) into a VF (Virtual Function).  First the netdev-global
>>> XDP-prog need to support this (maybe extend xdp_rxq_info with PF + VF
>>> info).  Next configure HW filter to queue# and load XDP prog on that
>>> queue# that only "redirect" to a single VF.  Now if driver+HW supports
>>> it, it can "eliminate" the per-queue XDP-prog and do everything in HW.
>>>
>>
>> Again, let's try to be more concrete! So, one (non-existing) mechanism
>> to program filtering to HW queues, and then attaching a per-queue
>> program to that HW queue, which can in some cases be elided? I'm not
>> opposing the idea of per-queue, I'm just trying to figure out
>> *exactly* what we're aiming for.
>>
>> My concern is, again, mainly that is a queue abstraction something
>> we'd like to introduce to userland. It's not there (well, no really
>> :-)) today. And from an AF_XDP userland perspective that's painful.
>> "Oh, you need to fix your RSS hashing/flow." E.g. if I read what
>> Jonathan is looking for, it's more of something like what Jiri Pirko
>> suggested in [1] (slide 9, 10).
>>
>> Hey, maybe I just need to see the fuller picture. :-) AF_XDP is too
>> tricky to use from XDP IMO. Per-queue XDP program would *optimize*
>> AF_XDP, but not solving the filtering. Maybe starting in the
>> filtering/metadata offload path end of things, and then see what we're
>> missing.
>>
>>>
>>>> If we'd like to slice a netdevice into multiple queues. Isn't macvlan
>>>> or similar *virtual* netdevices a better path, instead of introducing
>>>> yet another abstraction?
>>>
>>> XDP redirect a more generic abstraction that allow us to implement
>>> macvlan.  Except macvlan driver is missing ndo_xdp_xmit. Again first I
>>> write this as global-netdev XDP-prog, that does a lookup in a BPF-map.
>>> Next I configure HW filters that match the MAC-addr into a queue# and
>>> attach simpler XDP-prog to queue#, that redirect into macvlan device.
>>>
>>
>> Just for context; I was thinking something like macvlan with
>> ndo_dfwd_add/del_station functionality. "A virtual interface that is
>> simply is a view of a physical". A per-queue program would then mean
>> "create a netdev for that queue".
> 
> My immediate reaction is that I kinda like this model from an API PoV;
> not sure what it would take to get there, though? When you say
> 'something like macvlan', you do mean we'd have to add something
> completely new, right?
>

Macvlan with that can be hw-offloaded is there today. XDP support is not 
in place.

The Mellanox subdev-work [1] (I just started to dig into the details)
looks like this (i.e. slicing a physical device). Personally I really
like this approach, but I need to dig into the details more.


Björn

[1] 
https://lore.kernel.org/netdev/1551418672-12822-1-git-send-email-parav@mellanox.com/

> -Toke
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Per-queue XDP programs, thoughts
  2019-04-16 12:07           ` Björn Töpel
@ 2019-04-16 13:25             ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 19+ messages in thread
From: Toke Høiland-Jørgensen @ 2019-04-16 13:25 UTC (permalink / raw)
  To: Björn Töpel, Björn Töpel, Jesper Dangaard Brouer
  Cc: Ilias Apalodimas, Karlsson, Magnus, maciej.fijalkowski,
	Jason Wang, Alexei Starovoitov, Daniel Borkmann, Jakub Kicinski,
	John Fastabend, David Miller, Andy Gospodarek, netdev, bpf,
	Thomas Graf, Thomas Monjalon, Jonathan Lemon

Björn Töpel <bjorn.topel@intel.com> writes:

> On 2019-04-16 11:36, Toke Høiland-Jørgensen wrote:
>> Björn Töpel <bjorn.topel@gmail.com> writes:
>> 
>>> On Mon, 15 Apr 2019 at 18:33, Jesper Dangaard Brouer <brouer@redhat.com> wrote:
>>>>
>>>>
>>>> On Mon, 15 Apr 2019 13:59:03 +0200 Björn Töpel <bjorn.topel@intel.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> As you probably can derive from the amount of time this is taking, I'm
>>>>> not really satisfied with the design of per-queue XDP program. (That,
>>>>> plus I'm a terribly slow hacker... ;-)) I'll try to expand my thinking
>>>>> in this mail!
>>>>>
>>>>> Beware, it's kind of a long post, and it's all over the place.
>>>>
>>>> Cc'ing all the XDP-maintainers (and netdev).
>>>>
>>>>> There are a number of ways of setting up flows in the kernel, e.g.
>>>>>
>>>>> * Connecting/accepting a TCP socket (in-band)
>>>>> * Using tc-flower (out-of-band)
>>>>> * ethtool (out-of-band)
>>>>> * ...
>>>>>
>>>>> The first acts on sockets, the second on netdevs. Then there's ethtool
>>>>> to configure RSS, and the RSS-on-steriods rxhash/ntuple that can steer
>>>>> to queues. Most users care about sockets and netdevices. Queues is
>>>>> more of an implementation detail of Rx or for QoS on the Tx side.
>>>>
>>>> Let me first acknowledge that the current Linux tools to administrator
>>>> HW filters is lacking (well sucks).  We know the hardware is capable,
>>>> as DPDK have an full API for this called rte_flow[1]. If nothing else
>>>> you/we can use the DPDK API to create a program to configure the
>>>> hardware, examples here[2]
>>>>
>>>>   [1] https://doc.dpdk.org/guides/prog_guide/rte_flow.html
>>>>   [2] https://doc.dpdk.org/guides/howto/rte_flow.html
>>>>
>>>>> XDP is something that we can attach to a netdevice. Again, very
>>>>> natural from a user perspective. As for XDP sockets, the current
>>>>> mechanism is that we attach to an existing netdevice queue. Ideally
>>>>> what we'd like is to *remove* the queue concept. A better approach
>>>>> would be creating the socket and set it up -- but not binding it to a
>>>>> queue. Instead just binding it to a netdevice (or crazier just
>>>>> creating a socket without a netdevice).
>>>>
>>>> Let me just remind everybody that the AF_XDP performance gains comes
>>>> from binding the resource, which allow for lock-free semantics, as
>>>> explained here[3].
>>>>
>>>> [3] https://github.com/xdp-project/xdp-tutorial/tree/master/advanced03-AF_XDP#where-does-af_xdp-performance-come-from
>>>>
>>>
>>> Yes, but leaving the "binding to queue" to the kernel wouldn't really
>>> change much. It would mostly be that the *user* doesn't need to care
>>> about hardware details. My concern is about "what is a good
>>> abstraction".
>> 
>> Can we really guarantee that we will make the right decision from inside
>> the kernel, though?
>>
>
> Uhm, what do you mean here?

I took your 'leaving the "binding to queue" to the kernel' statement to
imply that this mechanism would be entirely hidden from userspace, in
which case the kernel would have to infer automatically how to do the
binding, right?

>>>>
>>>>> The socket is an endpoint, where I'd like data to end up (or get sent
>>>>> from). If the kernel can attach the socket to a hardware queue,
>>>>> there's zerocopy if not, copy-mode. Dito for Tx.
>>>>
>>>> Well XDP programs per RXQ is just a building block to achieve this.
>>>>
>>>> As Van Jacobson explain[4], sockets or applications "register" a
>>>> "transport signature", and gets back a "channel".   In our case, the
>>>> netdev-global XDP program is our way to register/program these transport
>>>> signatures and redirect (e.g. into the AF_XDP socket).
>>>> This requires some work in software to parse and match transport
>>>> signatures to sockets.  The XDP programs per RXQ is a way to get
>>>> hardware to perform this filtering for us.
>>>>
>>>>   [4] http://www.lemis.com/grog/Documentation/vj/lca06vj.pdf
>>>>
>>>
>>> There are a lot of things that are missing to build what you're
>>> describing above. Yes, we need a better way to program the HW from
>>> Linux userland (old topic); What I fail to see is how per-queue XDP is
>>> a way to get hardware to perform filtering. Could you give a
>>> longer/complete example (obviously with non-existing features :-)), so
>>> I get a better view what you're aiming for?
>>>
>>>
>>>>
>>>>> Does a user (control plane) want/need to care about queues? Just
>>>>> create a flow to a socket (out-of-band or inband) or to a netdevice
>>>>> (out-of-band).
>>>>
>>>> A userspace "control-plane" program, could hide the setup and use what
>>>> the system/hardware can provide of optimizations.  VJ[4] e.g. suggest
>>>> that the "listen" socket first register the transport signature (with
>>>> the driver) on "accept()".   If the HW supports DPDK-rte_flow API we
>>>> can register a 5-tuple (or create TC-HW rules) and load our
>>>> "transport-signature" XDP prog on the queue number we choose.  If not,
>>>> when our netdev-global XDP prog need a hash-table with 5-tuple and do
>>>> 5-tuple parsing.
>>>>
>>>> Creating netdevices via HW filter into queues is an interesting idea.
>>>> DPDK have an example here[5], on how to per flow (via ethtool filter
>>>> setup even!) send packets to queues, that endup in SRIOV devices.
>>>>
>>>>   [5] https://doc.dpdk.org/guides/howto/flow_bifurcation.html
>>>>
>>>>
>>>>> Do we envison any other uses for per-queue XDP other than AF_XDP? If
>>>>> not, it would make *more* sense to attach the XDP program to the
>>>>> socket (e.g. if the endpoint would like to use kernel data structures
>>>>> via XDP).
>>>>
>>>> As demonstrated in [5] you can use (ethtool) hardware filters to
>>>> redirect packets into VFs (Virtual Functions).
>>>>
>>>> I also want us to extend XDP to allow for redirect from a PF (Physical
>>>> Function) into a VF (Virtual Function).  First the netdev-global
>>>> XDP-prog need to support this (maybe extend xdp_rxq_info with PF + VF
>>>> info).  Next configure HW filter to queue# and load XDP prog on that
>>>> queue# that only "redirect" to a single VF.  Now if driver+HW supports
>>>> it, it can "eliminate" the per-queue XDP-prog and do everything in HW.
>>>>
>>>
>>> Again, let's try to be more concrete! So, one (non-existing) mechanism
>>> to program filtering to HW queues, and then attaching a per-queue
>>> program to that HW queue, which can in some cases be elided? I'm not
>>> opposing the idea of per-queue, I'm just trying to figure out
>>> *exactly* what we're aiming for.
>>>
>>> My concern is, again, mainly that is a queue abstraction something
>>> we'd like to introduce to userland. It's not there (well, no really
>>> :-)) today. And from an AF_XDP userland perspective that's painful.
>>> "Oh, you need to fix your RSS hashing/flow." E.g. if I read what
>>> Jonathan is looking for, it's more of something like what Jiri Pirko
>>> suggested in [1] (slide 9, 10).
>>>
>>> Hey, maybe I just need to see the fuller picture. :-) AF_XDP is too
>>> tricky to use from XDP IMO. Per-queue XDP program would *optimize*
>>> AF_XDP, but not solving the filtering. Maybe starting in the
>>> filtering/metadata offload path end of things, and then see what we're
>>> missing.
>>>
>>>>
>>>>> If we'd like to slice a netdevice into multiple queues. Isn't macvlan
>>>>> or similar *virtual* netdevices a better path, instead of introducing
>>>>> yet another abstraction?
>>>>
>>>> XDP redirect a more generic abstraction that allow us to implement
>>>> macvlan.  Except macvlan driver is missing ndo_xdp_xmit. Again first I
>>>> write this as global-netdev XDP-prog, that does a lookup in a BPF-map.
>>>> Next I configure HW filters that match the MAC-addr into a queue# and
>>>> attach simpler XDP-prog to queue#, that redirect into macvlan device.
>>>>
>>>
>>> Just for context; I was thinking something like macvlan with
>>> ndo_dfwd_add/del_station functionality. "A virtual interface that is
>>> simply is a view of a physical". A per-queue program would then mean
>>> "create a netdev for that queue".
>> 
>> My immediate reaction is that I kinda like this model from an API PoV;
>> not sure what it would take to get there, though? When you say
>> 'something like macvlan', you do mean we'd have to add something
>> completely new, right?
>>
>
> Macvlan with that can be hw-offloaded is there today. XDP support is
> not in place.
>
> The Mellanox subdev-work [1] (I just started to dig into the details)
> looks like this (i.e. slicing a physical device). Personally I really
> like this approach, but I need to dig into the details more.

Right; I'll go have a look at that :)

-Toke

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Per-queue XDP programs, thoughts
  2019-04-15 22:49       ` Jakub Kicinski
  2019-04-16  7:45         ` Björn Töpel
@ 2019-04-16 13:55         ` Jesper Dangaard Brouer
  2019-04-16 16:53           ` Jonathan Lemon
  2019-04-16 21:28           ` Jakub Kicinski
  1 sibling, 2 replies; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2019-04-16 13:55 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Björn Töpel, Björn Töpel, ilias.apalodimas,
	toke, magnus.karlsson, maciej.fijalkowski, Jason Wang,
	Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	David Miller, Andy Gospodarek, netdev, bpf, Thomas Graf,
	Thomas Monjalon, brouer, Jonathan Lemon

On Mon, 15 Apr 2019 15:49:32 -0700
Jakub Kicinski <jakub.kicinski@netronome.com> wrote:

> On Mon, 15 Apr 2019 18:32:58 +0200, Jesper Dangaard Brouer wrote:
> > On Mon, 15 Apr 2019 13:59:03 +0200 Björn Töpel <bjorn.topel@intel.com> wrote:  
> > > Hi,
> > > 
> > > As you probably can derive from the amount of time this is taking, I'm
> > > not really satisfied with the design of per-queue XDP program. (That,
> > > plus I'm a terribly slow hacker... ;-)) I'll try to expand my thinking
> > > in this mail!  
> 
> Jesper was advocating per-queue progs since very early days of XDP.
> If it was easy to implement cleanly we would've already gotten it ;)

(I cannot help to feel offended here...  IMHO that statement is BS,
that is not how upstream development work, and sure, I am to blame as
I've simply been to lazy or busy with other stuff to implement it.  It
is not that hard to send down a queue# together with the XDP attach
command.) 

I've been advocating for per-queue progs from day-1, since this is an
obvious performance advantage, given the programmer can specialize the
BPF/XDP-prog to the filtered traffic.  I hope/assume we are on the same
pages here, that per-queue progs is a performance optimization.

I guess the rest of the discussion in this thread is (1) if we can
convince each-other that someone will actually use this optimization,
and (2) if we can abstract this away from the user.


> > > Beware, it's kind of a long post, and it's all over the place.    
> > 
> > Cc'ing all the XDP-maintainers (and netdev).
> >   
> > > There are a number of ways of setting up flows in the kernel, e.g.
> > > 
> > > * Connecting/accepting a TCP socket (in-band)
> > > * Using tc-flower (out-of-band)
> > > * ethtool (out-of-band)
> > > * ...
> > > 
> > > The first acts on sockets, the second on netdevs. Then there's ethtool
> > > to configure RSS, and the RSS-on-steriods rxhash/ntuple that can steer
> > > to queues. Most users care about sockets and netdevices. Queues is
> > > more of an implementation detail of Rx or for QoS on the Tx side.    
> > 
> > Let me first acknowledge that the current Linux tools to administrator
> > HW filters is lacking (well sucks).  We know the hardware is capable,
> > as DPDK have an full API for this called rte_flow[1]. If nothing else
> > you/we can use the DPDK API to create a program to configure the
> > hardware, examples here[2]
> > 
> >  [1] https://doc.dpdk.org/guides/prog_guide/rte_flow.html
> >  [2] https://doc.dpdk.org/guides/howto/rte_flow.html
> >   
> > > XDP is something that we can attach to a netdevice. Again, very
> > > natural from a user perspective. As for XDP sockets, the current
> > > mechanism is that we attach to an existing netdevice queue. Ideally
> > > what we'd like is to *remove* the queue concept. A better approach
> > > would be creating the socket and set it up -- but not binding it to a
> > > queue. Instead just binding it to a netdevice (or crazier just
> > > creating a socket without a netdevice).    
> 
> You can remove the concept of a queue from the AF_XDP ABI (well, extend
> it to not require the queue being explicitly specified..), but you can't
> avoid the user space knowing there is a queue.  Because if you do you
> can no longer track and configure that queue (things like IRQ
> moderation, descriptor count etc.)

Yes exactly.  Bjørn you mentioned leaky abstractions, and by removing
the concept of a queue# from the AF_XDP ABI, then you have basically
created a leaky abstraction, as the sysadm would need to tune/configure
the "hidden" abstracted queue# (IRQ moderation, desc count etc.).

> Currently the term "queue" refers mostly to the queues that stack uses.
> Which leaves e.g. the XDP TX queues in a strange grey zone (from
> ethtool channel ABI perspective, RPS, XPS etc.)  So it would be nice to
> have the HW queue ID somewhat detached from stack queue ID.  Or at least
> it'd be nice to introduce queue types?  I've been pondering this for a
> while, I don't see any silver bullet here..

Yes! - I also worry about the term "queue".  This is very interesting
to discuss.

I do find it very natural that your HW (e.g. Netronome) have several HW
RX-queues that feed/send to a single software NAPI RX-queue.  (I assume
this is how you HW already works, but software/Linux cannot know this
internal HW queue id).  How we expose and use this is interesting.

I do want to be-able to create new RX-queues, semi-dynamically
"setup"/load time.  But still a limited number of RX-queues, for
performance and memory reasons (TX-queue are cheaper).  Memory as we
prealloc memory RX-queue (and give it to HW).  Performance as with too
many queues, there is less chance to have a (RX) bulk of packets in
queue.

For example I would not create an RX-queue per TCP-flow.  But why do I
still want per-queue XDP-progs and HW-filters for this TCP-flow
use-case... let me explain:

  E.g. I want to implement an XDP TCP socket load-balancer (same host
delivery, between XDP and network stack).  And my goal is to avoid
touching packet payload on XDP RX-CPU.  First I configure ethtool
filter to redirect all TCP port 80 to a specific RX-queue (could also
be N-queues), then I don't need to parse TCP-port-80 in my per-queue
BPF-prog, and I have higher chance of bulk-RX.  Next I need HW to
provide some flow-identifier, e.g. RSS-hash, flow-id or internal
HW-queue-id, which I can use to redirect on (e.g. via CPUMAP to
N-CPUs).  This way I don't touch packet payload on RX-CPU (my bench
shows one RX-CPU can handle between 14-20Mpps).

 
> > Let me just remind everybody that the AF_XDP performance gains comes
> > from binding the resource, which allow for lock-free semantics, as
> > explained here[3].
> > 
> > [3] https://github.com/xdp-project/xdp-tutorial/tree/master/advanced03-AF_XDP#where-does-af_xdp-performance-come-from
> > 
> >   
> > > The socket is an endpoint, where I'd like data to end up (or get sent
> > > from). If the kernel can attach the socket to a hardware queue,
> > > there's zerocopy if not, copy-mode. Dito for Tx.    
> > 
> > Well XDP programs per RXQ is just a building block to achieve this.
> > 
> > As Van Jacobson explain[4], sockets or applications "register" a
> > "transport signature", and gets back a "channel".   In our case, the
> > netdev-global XDP program is our way to register/program these transport
> > signatures and redirect (e.g. into the AF_XDP socket).
> > This requires some work in software to parse and match transport
> > signatures to sockets.  The XDP programs per RXQ is a way to get
> > hardware to perform this filtering for us.
> > 
> >  [4] http://www.lemis.com/grog/Documentation/vj/lca06vj.pdf
> > 
> >   
> > > Does a user (control plane) want/need to care about queues? Just
> > > create a flow to a socket (out-of-band or inband) or to a netdevice
> > > (out-of-band).    
> > 
> > A userspace "control-plane" program, could hide the setup and use what
> > the system/hardware can provide of optimizations.  VJ[4] e.g. suggest
> > that the "listen" socket first register the transport signature (with
> > the driver) on "accept()".   If the HW supports DPDK-rte_flow API we
> > can register a 5-tuple (or create TC-HW rules) and load our
> > "transport-signature" XDP prog on the queue number we choose.  If not,
> > when our netdev-global XDP prog need a hash-table with 5-tuple and do
> > 5-tuple parsing.  
> 
> But we do want the ability to configure the queue, and get stats for
> that queue.. so we can't hide the queue completely, right?

Yes, that is yet another example that the queue id "leak".

 
> > Creating netdevices via HW filter into queues is an interesting idea.
> > DPDK have an example here[5], on how to per flow (via ethtool filter
> > setup even!) send packets to queues, that endup in SRIOV devices.
> > 
> >  [5] https://doc.dpdk.org/guides/howto/flow_bifurcation.html  
> 
> I wish I had the courage to nack the ethtool redirect to VF Intel
> added :)
> 
> > > Do we envison any other uses for per-queue XDP other than AF_XDP? If
> > > not, it would make *more* sense to attach the XDP program to the
> > > socket (e.g. if the endpoint would like to use kernel data structures
> > > via XDP).    
> > 
> > As demonstrated in [5] you can use (ethtool) hardware filters to
> > redirect packets into VFs (Virtual Functions).
> > 
> > I also want us to extend XDP to allow for redirect from a PF (Physical
> > Function) into a VF (Virtual Function).  First the netdev-global
> > XDP-prog need to support this (maybe extend xdp_rxq_info with PF + VF
> > info).  Next configure HW filter to queue# and load XDP prog on that
> > queue# that only "redirect" to a single VF.  Now if driver+HW supports
> > it, it can "eliminate" the per-queue XDP-prog and do everything in HW.   
> 
> That sounds slightly contrived.  If the program is not doing anything,
> why involve XDP at all?

If the HW doesn't support this then the XDP software will do the work.
If the HW supports this, then you can still list the XDP-prog via
bpftool, and see that you have a XDP prog that does this action (and
maybe expose a offloaded-to-HW bit if you like to expose this info).


>  As stated above we already have too many ways
> to do flow config and/or VF redirect.
> 
> > > If we'd like to slice a netdevice into multiple queues. Isn't macvlan
> > > or similar *virtual* netdevices a better path, instead of introducing
> > > yet another abstraction?    
> 
> Yes, the question of use cases is extremely important.  It seems
> Mellanox is working on "spawning devlink ports" IOW slicing a device
> into subdevices.  Which is a great way to run bifurcated DPDK/netdev
> applications :/  If that gets merged I think we have to recalculate
> what purpose AF_XDP is going to serve, if any.
> 
> In my view we have different "levels" of slicing:

I do appreciate this overview of NIC slicing, as HW-filters +
per-queue-XDP can be seen as a way to slice up the NIC.

>  (1) full HW device;
>  (2) software device (mdev?);
>  (3) separate netdev;
>  (4) separate "RSS instance";
>  (5) dedicated application queues.
> 
> 1 - is SR-IOV VFs
> 2 - is software device slicing with mdev (Mellanox)
> 3 - is (I think) Intel's VSI debugfs... "thing"..
> 4 - is just ethtool RSS contexts (Solarflare)
> 5 - is currently AF-XDP (Intel)
> 
> (2) or lower is required to have raw register access allowing vfio/DPDK
> to run "natively".
> 
> (3) or lower allows for full reuse of all networking APIs, with very
> natural RSS configuration, TC/QoS configuration on TX etc.
> 
> (5) is sufficient for zero copy.
> 
> So back to the use case.. seems like AF_XDP is evolving into allowing
> "level 3" to pass all frames directly to the application?  With
> optional XDP filtering?  It's not a trick question - I'm just trying to
> place it somewhere on my mental map :)

 
> > XDP redirect a more generic abstraction that allow us to implement
> > macvlan.  Except macvlan driver is missing ndo_xdp_xmit. Again first I
> > write this as global-netdev XDP-prog, that does a lookup in a BPF-map.
> > Next I configure HW filters that match the MAC-addr into a queue# and
> > attach simpler XDP-prog to queue#, that redirect into macvlan device.
> >   
> > > Further, is queue/socket a good abstraction for all devices? Wifi?   
> 
> Right, queue is no abstraction whatsoever.  Queue is a low level
> primitive.

I agree, queue is a low level primitive.

This the basically interface that the NIC hardware gave us... it is
fairly limited as it can only express a queue id and a IRQ line that we
can try to utilize to scale the system.   Today, we have not really
tapped into the potential of using this... instead we simply RSS-hash
balance across all RX-queues and hope this makes the system scale...


> > > By just viewing sockets as an endpoint, we leave it up to the kernel to
> > > figure out the best way. "Here's an endpoint. Give me data **here**."
> > > 
> > > The OpenFlow protocol does however support the concept of queues per
> > > port, but do we want to introduce that into the kernel?  
> 
> Switch queues != host queues.  Switch/HW queues are for QoS, host queues
> are for RSS.  Those two concepts are similar yet different.  In Linux
> if you offload basic TX TC (mq)prio (the old work John has done for
> Intel) the actual number of HW queues becomes "channel count" x "num TC
> prios".  What would queue ID mean for AF_XDP in that setup, I wonder.

Thanks for explaining that. I must admit I never really understood the
mqprio concept and these "prios" (when reading the code and playing
with it).


> > > So, if per-queue XDP programs is only for AF_XDP, I think it's better
> > > to stick the program to the socket. For me per-queue is sort of a
> > > leaky abstraction...
> > >
> > > More thoughts. If we go the route of per-queue XDP programs. Would it
> > > be better to leave the setup to XDP -- i.e. the XDP program is
> > > controlling the per-queue programs (think tail-calls, but a map with
> > > per-q programs). Instead of the netlink layer. This is part of a
> > > bigger discussion, namely should XDP really implement the control
> > > plane?
> > >
> > > I really like that a software switch/router can be implemented
> > > effectively with XDP, but ideally I'd like it to be offloaded by
> > > hardware -- using the same control/configuration plane. Can we do it
> > > in hardware, do that. If not, emulate via XDP.    
> 
> There is already a number of proposals in the "device application
> slicing", it would be really great if we could make sure we don't
> repeat the mistakes of flow configuration APIs, and try to prevent
> having too many of them..
> 
> Which is very challenging unless we have strong use cases..
> 
> > That is actually the reason I want XDP per-queue, as it is a way to
> > offload the filtering to the hardware.  And if the per-queue XDP-prog
> > becomes simple enough, the hardware can eliminate and do everything in
> > hardware (hopefully).
> >   
> > > The control plane should IMO be outside of the XDP program.  
> 
> ENOCOMPUTE :)  XDP program is the BPF byte code, it's never control
> plane.  Do you mean application should not control the "context/
> channel/subdev" creation?  You're not saying "it's not the XDP program
> which should be making the classification", no?  XDP program
> controlling the classification was _the_ reason why we liked AF_XDP :)


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Per-queue XDP programs, thoughts
  2019-04-15 17:58       ` Jonathan Lemon
@ 2019-04-16 14:48         ` Jesper Dangaard Brouer
  2019-04-17 20:17           ` Tom Herbert
  0 siblings, 1 reply; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2019-04-16 14:48 UTC (permalink / raw)
  To: Jonathan Lemon
  Cc: Björn Töpel,  Björn Töpel, ilias.apalodimas,
	toke, magnus.karlsson, maciej.fijalkowski, Jason Wang,
	Alexei Starovoitov, Daniel Borkmann, Jakub Kicinski,
	John Fastabend, David Miller, Andy Gospodarek, netdev, bpf,
	Thomas Graf, Thomas Monjalon, brouer

On Mon, 15 Apr 2019 10:58:07 -0700
"Jonathan Lemon" <jonathan.lemon@gmail.com> wrote:

> On 15 Apr 2019, at 9:32, Jesper Dangaard Brouer wrote:
> 
> > On Mon, 15 Apr 2019 13:59:03 +0200 Björn Töpel 
> > <bjorn.topel@intel.com> wrote:
> >  
> >> Hi,
> >>
> >> As you probably can derive from the amount of time this is taking, 
> >> I'm
> >> not really satisfied with the design of per-queue XDP program. (That,
> >> plus I'm a terribly slow hacker... ;-)) I'll try to expand my 
> >> thinking
> >> in this mail!
> >>
> >> Beware, it's kind of a long post, and it's all over the place.  
> >
> > Cc'ing all the XDP-maintainers (and netdev).
> >  
> >> There are a number of ways of setting up flows in the kernel, e.g.
> >>
> >> * Connecting/accepting a TCP socket (in-band)
> >> * Using tc-flower (out-of-band)
> >> * ethtool (out-of-band)
> >> * ...
> >>
> >> The first acts on sockets, the second on netdevs. Then there's 
> >> ethtool
> >> to configure RSS, and the RSS-on-steriods rxhash/ntuple that can 
> >> steer
> >> to queues. Most users care about sockets and netdevices. Queues is
> >> more of an implementation detail of Rx or for QoS on the Tx side.  
> >
> > Let me first acknowledge that the current Linux tools to administrator
> > HW filters is lacking (well sucks).  We know the hardware is capable,
> > as DPDK have an full API for this called rte_flow[1]. If nothing else
> > you/we can use the DPDK API to create a program to configure the
> > hardware, examples here[2]
> >
> >  [1] https://doc.dpdk.org/guides/prog_guide/rte_flow.html
> >  [2] https://doc.dpdk.org/guides/howto/rte_flow.html
> >  
> >> XDP is something that we can attach to a netdevice. Again, very
> >> natural from a user perspective. As for XDP sockets, the current
> >> mechanism is that we attach to an existing netdevice queue. Ideally
> >> what we'd like is to *remove* the queue concept. A better approach
> >> would be creating the socket and set it up -- but not binding it to a
> >> queue. Instead just binding it to a netdevice (or crazier just
> >> creating a socket without a netdevice).  
> >
> > Let me just remind everybody that the AF_XDP performance gains comes
> > from binding the resource, which allow for lock-free semantics, as
> > explained here[3].
> >
> > [3] 
> > https://github.com/xdp-project/xdp-tutorial/tree/master/advanced03-AF_XDP#where-does-af_xdp-performance-come-from
> >
> >  
> >> The socket is an endpoint, where I'd like data to end up (or get sent
> >> from). If the kernel can attach the socket to a hardware queue,
> >> there's zerocopy if not, copy-mode. Dito for Tx.  
> >
> > Well XDP programs per RXQ is just a building block to achieve this.
> >
> > As Van Jacobson explain[4], sockets or applications "register" a
> > "transport signature", and gets back a "channel".   In our case, the
> > netdev-global XDP program is our way to register/program these 
> > transport
> > signatures and redirect (e.g. into the AF_XDP socket).
> > This requires some work in software to parse and match transport
> > signatures to sockets.  The XDP programs per RXQ is a way to get
> > hardware to perform this filtering for us.
> >
> >  [4] http://www.lemis.com/grog/Documentation/vj/lca06vj.pdf
> >
> >  
> >> Does a user (control plane) want/need to care about queues? Just
> >> create a flow to a socket (out-of-band or inband) or to a netdevice
> >> (out-of-band).  
> >
> > A userspace "control-plane" program, could hide the setup and use what
> > the system/hardware can provide of optimizations.  VJ[4] e.g. suggest
> > that the "listen" socket first register the transport signature (with
> > the driver) on "accept()".   If the HW supports DPDK-rte_flow API we
> > can register a 5-tuple (or create TC-HW rules) and load our
> > "transport-signature" XDP prog on the queue number we choose.  If not,
> > when our netdev-global XDP prog need a hash-table with 5-tuple and do
> > 5-tuple parsing.
> >
> > Creating netdevices via HW filter into queues is an interesting idea.
> > DPDK have an example here[5], on how to per flow (via ethtool filter
> > setup even!) send packets to queues, that endup in SRIOV devices.
> >
> >  [5] https://doc.dpdk.org/guides/howto/flow_bifurcation.html
> >
> >  
> >> Do we envison any other uses for per-queue XDP other than AF_XDP? If
> >> not, it would make *more* sense to attach the XDP program to the
> >> socket (e.g. if the endpoint would like to use kernel data structures
> >> via XDP).  
> >
> > As demonstrated in [5] you can use (ethtool) hardware filters to
> > redirect packets into VFs (Virtual Functions).
> >
> > I also want us to extend XDP to allow for redirect from a PF (Physical
> > Function) into a VF (Virtual Function).  First the netdev-global
> > XDP-prog need to support this (maybe extend xdp_rxq_info with PF + VF
> > info).  Next configure HW filter to queue# and load XDP prog on that
> > queue# that only "redirect" to a single VF.  Now if driver+HW supports
> > it, it can "eliminate" the per-queue XDP-prog and do everything in HW.  
> 
> One thing I'd like to see is have RSS distribute incoming traffic
> across a set of queues.  The application would open a set of xsk's
> which are bound to those queues.

Yes. (Some) NIC hardware does support this RSS distribute incoming
traffic across a set of queues.  As you can see in [5] they have an
example of this:

 testpmd> flow isolate 0 true
 testpmd> flow create 0 ingress pattern eth / ipv4 / udp / vxlan vni is 42 / end \
          actions rss queues 0 1 2 3 end / end


> I'm not seeing how a transport signature would achieve this.  The
> current tooling seems to treat the queue as the basic building block,
> which seems generally appropriate.

After creating N-queue that your RSS-hash distribute over, I imagine
that you load your per-queue XDP program on each of these N-queues.  I
don't necessarily see a need for the kernel API to expose to userspace
a API/facility to load an XDP-prog on N-queue in-one-go (you can just
iterate over them).

 
> Whittling things down (receiving packets only for a specific flow)
> could be achieved by creating a queue which only contains those
> packets which atched via some form of classification (or perhaps
> steered to a VF device), aka [5] above.   Exposing multiple queues
> allows load distribution for those apps which care about it.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Per-queue XDP programs, thoughts
  2019-04-16 13:55         ` Jesper Dangaard Brouer
@ 2019-04-16 16:53           ` Jonathan Lemon
  2019-04-16 18:23             ` Björn Töpel
  2019-04-16 21:28           ` Jakub Kicinski
  1 sibling, 1 reply; 19+ messages in thread
From: Jonathan Lemon @ 2019-04-16 16:53 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Jakub Kicinski, Björn Töpel, Björn Töpel,
	ilias.apalodimas, toke, magnus.karlsson, maciej.fijalkowski,
	Jason Wang, Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	David Miller, Andy Gospodarek, netdev, bpf, Thomas Graf,
	Thomas Monjalon

On 16 Apr 2019, at 6:55, Jesper Dangaard Brouer wrote:

> On Mon, 15 Apr 2019 15:49:32 -0700
> Jakub Kicinski <jakub.kicinski@netronome.com> wrote:
>
>> On Mon, 15 Apr 2019 18:32:58 +0200, Jesper Dangaard Brouer wrote:
>>> On Mon, 15 Apr 2019 13:59:03 +0200 Björn Töpel 
>>> <bjorn.topel@intel.com> wrote:
>>>> Hi,
>>>>
>>>> As you probably can derive from the amount of time this is taking, 
>>>> I'm
>>>> not really satisfied with the design of per-queue XDP program. 
>>>> (That,
>>>> plus I'm a terribly slow hacker... ;-)) I'll try to expand my 
>>>> thinking
>>>> in this mail!
>>
>> Jesper was advocating per-queue progs since very early days of XDP.
>> If it was easy to implement cleanly we would've already gotten it ;)
>
> (I cannot help to feel offended here...  IMHO that statement is BS,
> that is not how upstream development work, and sure, I am to blame as
> I've simply been to lazy or busy with other stuff to implement it.  It
> is not that hard to send down a queue# together with the XDP attach
> command.)
>
> I've been advocating for per-queue progs from day-1, since this is an
> obvious performance advantage, given the programmer can specialize the
> BPF/XDP-prog to the filtered traffic.  I hope/assume we are on the 
> same
> pages here, that per-queue progs is a performance optimization.
>
> I guess the rest of the discussion in this thread is (1) if we can
> convince each-other that someone will actually use this optimization,
> and (2) if we can abstract this away from the user.
>
>
>>>> Beware, it's kind of a long post, and it's all over the place.
>>>
>>> Cc'ing all the XDP-maintainers (and netdev).
>>>
>>>> There are a number of ways of setting up flows in the kernel, e.g.
>>>>
>>>> * Connecting/accepting a TCP socket (in-band)
>>>> * Using tc-flower (out-of-band)
>>>> * ethtool (out-of-band)
>>>> * ...
>>>>
>>>> The first acts on sockets, the second on netdevs. Then there's 
>>>> ethtool
>>>> to configure RSS, and the RSS-on-steriods rxhash/ntuple that can 
>>>> steer
>>>> to queues. Most users care about sockets and netdevices. Queues is
>>>> more of an implementation detail of Rx or for QoS on the Tx side.
>>>
>>> Let me first acknowledge that the current Linux tools to 
>>> administrator
>>> HW filters is lacking (well sucks).  We know the hardware is 
>>> capable,
>>> as DPDK have an full API for this called rte_flow[1]. If nothing 
>>> else
>>> you/we can use the DPDK API to create a program to configure the
>>> hardware, examples here[2]
>>>
>>>  [1] https://doc.dpdk.org/guides/prog_guide/rte_flow.html
>>>  [2] https://doc.dpdk.org/guides/howto/rte_flow.html
>>>
>>>> XDP is something that we can attach to a netdevice. Again, very
>>>> natural from a user perspective. As for XDP sockets, the current
>>>> mechanism is that we attach to an existing netdevice queue. Ideally
>>>> what we'd like is to *remove* the queue concept. A better approach
>>>> would be creating the socket and set it up -- but not binding it to 
>>>> a
>>>> queue. Instead just binding it to a netdevice (or crazier just
>>>> creating a socket without a netdevice).
>>
>> You can remove the concept of a queue from the AF_XDP ABI (well, 
>> extend
>> it to not require the queue being explicitly specified..), but you 
>> can't
>> avoid the user space knowing there is a queue.  Because if you do you
>> can no longer track and configure that queue (things like IRQ
>> moderation, descriptor count etc.)
>
> Yes exactly.  Bjørn you mentioned leaky abstractions, and by removing
> the concept of a queue# from the AF_XDP ABI, then you have basically
> created a leaky abstraction, as the sysadm would need to 
> tune/configure
> the "hidden" abstracted queue# (IRQ moderation, desc count etc.).
>
>> Currently the term "queue" refers mostly to the queues that stack 
>> uses.
>> Which leaves e.g. the XDP TX queues in a strange grey zone (from
>> ethtool channel ABI perspective, RPS, XPS etc.)  So it would be nice 
>> to
>> have the HW queue ID somewhat detached from stack queue ID.  Or at 
>> least
>> it'd be nice to introduce queue types?  I've been pondering this for 
>> a
>> while, I don't see any silver bullet here..
>
> Yes! - I also worry about the term "queue".  This is very interesting
> to discuss.
>
> I do find it very natural that your HW (e.g. Netronome) have several 
> HW
> RX-queues that feed/send to a single software NAPI RX-queue.  (I 
> assume
> this is how you HW already works, but software/Linux cannot know this
> internal HW queue id).  How we expose and use this is interesting.
>
> I do want to be-able to create new RX-queues, semi-dynamically
> "setup"/load time.  But still a limited number of RX-queues, for
> performance and memory reasons (TX-queue are cheaper).  Memory as we
> prealloc memory RX-queue (and give it to HW).  Performance as with too
> many queues, there is less chance to have a (RX) bulk of packets in
> queue.

How would these be identified?  Suppose there's a set of existing RX
queues for the device which handle the normal system traffic - then I
add an AF_XDP socket which gets its own dedicated RX queue.  Does this
create a new queue id for the device?  Create a new namespace with its
own queue id?

The entire reason the user even cares about the queue id at all is
because it needs to use ethtool/netlink/tc for configuration, or the
net device's XDP program needs to differentiate between the queues
for specific treatment.


>
> For example I would not create an RX-queue per TCP-flow.  But why do I
> still want per-queue XDP-progs and HW-filters for this TCP-flow
> use-case... let me explain:
>
>   E.g. I want to implement an XDP TCP socket load-balancer (same host
> delivery, between XDP and network stack).  And my goal is to avoid
> touching packet payload on XDP RX-CPU.  First I configure ethtool
> filter to redirect all TCP port 80 to a specific RX-queue (could also
> be N-queues), then I don't need to parse TCP-port-80 in my per-queue
> BPF-prog, and I have higher chance of bulk-RX.  Next I need HW to
> provide some flow-identifier, e.g. RSS-hash, flow-id or internal
> HW-queue-id, which I can use to redirect on (e.g. via CPUMAP to
> N-CPUs).  This way I don't touch packet payload on RX-CPU (my bench
> shows one RX-CPU can handle between 14-20Mpps).
>
>
>>> Let me just remind everybody that the AF_XDP performance gains comes
>>> from binding the resource, which allow for lock-free semantics, as
>>> explained here[3].
>>>
>>> [3] 
>>> https://github.com/xdp-project/xdp-tutorial/tree/master/advanced03-AF_XDP#where-does-af_xdp-performance-come-from
>>>
>>>
>>>> The socket is an endpoint, where I'd like data to end up (or get 
>>>> sent
>>>> from). If the kernel can attach the socket to a hardware queue,
>>>> there's zerocopy if not, copy-mode. Dito for Tx.
>>>
>>> Well XDP programs per RXQ is just a building block to achieve this.
>>>
>>> As Van Jacobson explain[4], sockets or applications "register" a
>>> "transport signature", and gets back a "channel".   In our case, the
>>> netdev-global XDP program is our way to register/program these 
>>> transport
>>> signatures and redirect (e.g. into the AF_XDP socket).
>>> This requires some work in software to parse and match transport
>>> signatures to sockets.  The XDP programs per RXQ is a way to get
>>> hardware to perform this filtering for us.
>>>
>>>  [4] http://www.lemis.com/grog/Documentation/vj/lca06vj.pdf
>>>
>>>
>>>> Does a user (control plane) want/need to care about queues? Just
>>>> create a flow to a socket (out-of-band or inband) or to a netdevice
>>>> (out-of-band).
>>>
>>> A userspace "control-plane" program, could hide the setup and use 
>>> what
>>> the system/hardware can provide of optimizations.  VJ[4] e.g. 
>>> suggest
>>> that the "listen" socket first register the transport signature 
>>> (with
>>> the driver) on "accept()".   If the HW supports DPDK-rte_flow API we
>>> can register a 5-tuple (or create TC-HW rules) and load our
>>> "transport-signature" XDP prog on the queue number we choose.  If 
>>> not,
>>> when our netdev-global XDP prog need a hash-table with 5-tuple and 
>>> do
>>> 5-tuple parsing.
>>
>> But we do want the ability to configure the queue, and get stats for
>> that queue.. so we can't hide the queue completely, right?
>
> Yes, that is yet another example that the queue id "leak".
>
>
>>> Creating netdevices via HW filter into queues is an interesting 
>>> idea.
>>> DPDK have an example here[5], on how to per flow (via ethtool filter
>>> setup even!) send packets to queues, that endup in SRIOV devices.
>>>
>>>  [5] https://doc.dpdk.org/guides/howto/flow_bifurcation.html
>>
>> I wish I had the courage to nack the ethtool redirect to VF Intel
>> added :)
>>
>>>> Do we envison any other uses for per-queue XDP other than AF_XDP? 
>>>> If
>>>> not, it would make *more* sense to attach the XDP program to the
>>>> socket (e.g. if the endpoint would like to use kernel data 
>>>> structures
>>>> via XDP).
>>>
>>> As demonstrated in [5] you can use (ethtool) hardware filters to
>>> redirect packets into VFs (Virtual Functions).
>>>
>>> I also want us to extend XDP to allow for redirect from a PF 
>>> (Physical
>>> Function) into a VF (Virtual Function).  First the netdev-global
>>> XDP-prog need to support this (maybe extend xdp_rxq_info with PF + 
>>> VF
>>> info).  Next configure HW filter to queue# and load XDP prog on that
>>> queue# that only "redirect" to a single VF.  Now if driver+HW 
>>> supports
>>> it, it can "eliminate" the per-queue XDP-prog and do everything in 
>>> HW.
>>
>> That sounds slightly contrived.  If the program is not doing 
>> anything,
>> why involve XDP at all?
>
> If the HW doesn't support this then the XDP software will do the work.
> If the HW supports this, then you can still list the XDP-prog via
> bpftool, and see that you have a XDP prog that does this action (and
> maybe expose a offloaded-to-HW bit if you like to expose this info).
>
>
>>  As stated above we already have too many ways
>> to do flow config and/or VF redirect.
>>
>>>> If we'd like to slice a netdevice into multiple queues. Isn't 
>>>> macvlan
>>>> or similar *virtual* netdevices a better path, instead of 
>>>> introducing
>>>> yet another abstraction?
>>
>> Yes, the question of use cases is extremely important.  It seems
>> Mellanox is working on "spawning devlink ports" IOW slicing a device
>> into subdevices.  Which is a great way to run bifurcated DPDK/netdev
>> applications :/  If that gets merged I think we have to recalculate
>> what purpose AF_XDP is going to serve, if any.
>>
>> In my view we have different "levels" of slicing:
>
> I do appreciate this overview of NIC slicing, as HW-filters +
> per-queue-XDP can be seen as a way to slice up the NIC.
>
>>  (1) full HW device;
>>  (2) software device (mdev?);
>>  (3) separate netdev;
>>  (4) separate "RSS instance";
>>  (5) dedicated application queues.
>>
>> 1 - is SR-IOV VFs
>> 2 - is software device slicing with mdev (Mellanox)
>> 3 - is (I think) Intel's VSI debugfs... "thing"..
>> 4 - is just ethtool RSS contexts (Solarflare)
>> 5 - is currently AF-XDP (Intel)
>>
>> (2) or lower is required to have raw register access allowing 
>> vfio/DPDK
>> to run "natively".
>>
>> (3) or lower allows for full reuse of all networking APIs, with very
>> natural RSS configuration, TC/QoS configuration on TX etc.
>>
>> (5) is sufficient for zero copy.
>>
>> So back to the use case.. seems like AF_XDP is evolving into allowing
>> "level 3" to pass all frames directly to the application?  With
>> optional XDP filtering?  It's not a trick question - I'm just trying 
>> to
>> place it somewhere on my mental map :)
>
>
>>> XDP redirect a more generic abstraction that allow us to implement
>>> macvlan.  Except macvlan driver is missing ndo_xdp_xmit. Again first 
>>> I
>>> write this as global-netdev XDP-prog, that does a lookup in a 
>>> BPF-map.
>>> Next I configure HW filters that match the MAC-addr into a queue# 
>>> and
>>> attach simpler XDP-prog to queue#, that redirect into macvlan 
>>> device.
>>>
>>>> Further, is queue/socket a good abstraction for all devices? Wifi?
>>
>> Right, queue is no abstraction whatsoever.  Queue is a low level
>> primitive.
>
> I agree, queue is a low level primitive.
>
> This the basically interface that the NIC hardware gave us... it is
> fairly limited as it can only express a queue id and a IRQ line that 
> we
> can try to utilize to scale the system.   Today, we have not really
> tapped into the potential of using this... instead we simply RSS-hash
> balance across all RX-queues and hope this makes the system scale...
>
>
>>>> By just viewing sockets as an endpoint, we leave it up to the 
>>>> kernel to
>>>> figure out the best way. "Here's an endpoint. Give me data 
>>>> **here**."
>>>>
>>>> The OpenFlow protocol does however support the concept of queues 
>>>> per
>>>> port, but do we want to introduce that into the kernel?
>>
>> Switch queues != host queues.  Switch/HW queues are for QoS, host 
>> queues
>> are for RSS.  Those two concepts are similar yet different.  In Linux
>> if you offload basic TX TC (mq)prio (the old work John has done for
>> Intel) the actual number of HW queues becomes "channel count" x "num 
>> TC
>> prios".  What would queue ID mean for AF_XDP in that setup, I wonder.
>
> Thanks for explaining that. I must admit I never really understood the
> mqprio concept and these "prios" (when reading the code and playing
> with it).
>
>
>>>> So, if per-queue XDP programs is only for AF_XDP, I think it's 
>>>> better
>>>> to stick the program to the socket. For me per-queue is sort of a
>>>> leaky abstraction...
>>>>
>>>> More thoughts. If we go the route of per-queue XDP programs. Would 
>>>> it
>>>> be better to leave the setup to XDP -- i.e. the XDP program is
>>>> controlling the per-queue programs (think tail-calls, but a map 
>>>> with
>>>> per-q programs). Instead of the netlink layer. This is part of a
>>>> bigger discussion, namely should XDP really implement the control
>>>> plane?
>>>>
>>>> I really like that a software switch/router can be implemented
>>>> effectively with XDP, but ideally I'd like it to be offloaded by
>>>> hardware -- using the same control/configuration plane. Can we do 
>>>> it
>>>> in hardware, do that. If not, emulate via XDP.
>>
>> There is already a number of proposals in the "device application
>> slicing", it would be really great if we could make sure we don't
>> repeat the mistakes of flow configuration APIs, and try to prevent
>> having too many of them..
>>
>> Which is very challenging unless we have strong use cases..
>>
>>> That is actually the reason I want XDP per-queue, as it is a way to
>>> offload the filtering to the hardware.  And if the per-queue 
>>> XDP-prog
>>> becomes simple enough, the hardware can eliminate and do everything 
>>> in
>>> hardware (hopefully).
>>>
>>>> The control plane should IMO be outside of the XDP program.
>>
>> ENOCOMPUTE :)  XDP program is the BPF byte code, it's never control
>> plane.  Do you mean application should not control the "context/
>> channel/subdev" creation?  You're not saying "it's not the XDP 
>> program
>> which should be making the classification", no?  XDP program
>> controlling the classification was _the_ reason why we liked AF_XDP 
>> :)
>
>
> -- 
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Per-queue XDP programs, thoughts
  2019-04-16 16:53           ` Jonathan Lemon
@ 2019-04-16 18:23             ` Björn Töpel
  0 siblings, 0 replies; 19+ messages in thread
From: Björn Töpel @ 2019-04-16 18:23 UTC (permalink / raw)
  To: Jonathan Lemon
  Cc: Jesper Dangaard Brouer, Jakub Kicinski, Björn Töpel,
	Ilias Apalodimas, Toke Høiland-Jørgensen, Karlsson,
	Magnus, maciej.fijalkowski, Jason Wang, Alexei Starovoitov,
	Daniel Borkmann, John Fastabend, David Miller, Andy Gospodarek,
	Netdev, bpf, Thomas Graf, Thomas Monjalon

On Tue, 16 Apr 2019 at 18:53, Jonathan Lemon <jonathan.lemon@gmail.com> wrote:
>
> On 16 Apr 2019, at 6:55, Jesper Dangaard Brouer wrote:
>
> > On Mon, 15 Apr 2019 15:49:32 -0700
> > Jakub Kicinski <jakub.kicinski@netronome.com> wrote:
> >
> >> On Mon, 15 Apr 2019 18:32:58 +0200, Jesper Dangaard Brouer wrote:
> >>> On Mon, 15 Apr 2019 13:59:03 +0200 Björn Töpel
> >>> <bjorn.topel@intel.com> wrote:
> >>>> Hi,
> >>>>
> >>>> As you probably can derive from the amount of time this is taking,
> >>>> I'm
> >>>> not really satisfied with the design of per-queue XDP program.
> >>>> (That,
> >>>> plus I'm a terribly slow hacker... ;-)) I'll try to expand my
> >>>> thinking
> >>>> in this mail!
> >>
> >> Jesper was advocating per-queue progs since very early days of XDP.
> >> If it was easy to implement cleanly we would've already gotten it ;)
> >
> > (I cannot help to feel offended here...  IMHO that statement is BS,
> > that is not how upstream development work, and sure, I am to blame as
> > I've simply been to lazy or busy with other stuff to implement it.  It
> > is not that hard to send down a queue# together with the XDP attach
> > command.)
> >
> > I've been advocating for per-queue progs from day-1, since this is an
> > obvious performance advantage, given the programmer can specialize the
> > BPF/XDP-prog to the filtered traffic.  I hope/assume we are on the
> > same
> > pages here, that per-queue progs is a performance optimization.
> >
> > I guess the rest of the discussion in this thread is (1) if we can
> > convince each-other that someone will actually use this optimization,
> > and (2) if we can abstract this away from the user.
> >
> >
> >>>> Beware, it's kind of a long post, and it's all over the place.
> >>>
> >>> Cc'ing all the XDP-maintainers (and netdev).
> >>>
> >>>> There are a number of ways of setting up flows in the kernel, e.g.
> >>>>
> >>>> * Connecting/accepting a TCP socket (in-band)
> >>>> * Using tc-flower (out-of-band)
> >>>> * ethtool (out-of-band)
> >>>> * ...
> >>>>
> >>>> The first acts on sockets, the second on netdevs. Then there's
> >>>> ethtool
> >>>> to configure RSS, and the RSS-on-steriods rxhash/ntuple that can
> >>>> steer
> >>>> to queues. Most users care about sockets and netdevices. Queues is
> >>>> more of an implementation detail of Rx or for QoS on the Tx side.
> >>>
> >>> Let me first acknowledge that the current Linux tools to
> >>> administrator
> >>> HW filters is lacking (well sucks).  We know the hardware is
> >>> capable,
> >>> as DPDK have an full API for this called rte_flow[1]. If nothing
> >>> else
> >>> you/we can use the DPDK API to create a program to configure the
> >>> hardware, examples here[2]
> >>>
> >>>  [1] https://doc.dpdk.org/guides/prog_guide/rte_flow.html
> >>>  [2] https://doc.dpdk.org/guides/howto/rte_flow.html
> >>>
> >>>> XDP is something that we can attach to a netdevice. Again, very
> >>>> natural from a user perspective. As for XDP sockets, the current
> >>>> mechanism is that we attach to an existing netdevice queue. Ideally
> >>>> what we'd like is to *remove* the queue concept. A better approach
> >>>> would be creating the socket and set it up -- but not binding it to
> >>>> a
> >>>> queue. Instead just binding it to a netdevice (or crazier just
> >>>> creating a socket without a netdevice).
> >>
> >> You can remove the concept of a queue from the AF_XDP ABI (well,
> >> extend
> >> it to not require the queue being explicitly specified..), but you
> >> can't
> >> avoid the user space knowing there is a queue.  Because if you do you
> >> can no longer track and configure that queue (things like IRQ
> >> moderation, descriptor count etc.)
> >
> > Yes exactly.  Bjørn you mentioned leaky abstractions, and by removing
> > the concept of a queue# from the AF_XDP ABI, then you have basically
> > created a leaky abstraction, as the sysadm would need to
> > tune/configure
> > the "hidden" abstracted queue# (IRQ moderation, desc count etc.).
> >
> >> Currently the term "queue" refers mostly to the queues that stack
> >> uses.
> >> Which leaves e.g. the XDP TX queues in a strange grey zone (from
> >> ethtool channel ABI perspective, RPS, XPS etc.)  So it would be nice
> >> to
> >> have the HW queue ID somewhat detached from stack queue ID.  Or at
> >> least
> >> it'd be nice to introduce queue types?  I've been pondering this for
> >> a
> >> while, I don't see any silver bullet here..
> >
> > Yes! - I also worry about the term "queue".  This is very interesting
> > to discuss.
> >
> > I do find it very natural that your HW (e.g. Netronome) have several
> > HW
> > RX-queues that feed/send to a single software NAPI RX-queue.  (I
> > assume
> > this is how you HW already works, but software/Linux cannot know this
> > internal HW queue id).  How we expose and use this is interesting.
> >
> > I do want to be-able to create new RX-queues, semi-dynamically
> > "setup"/load time.  But still a limited number of RX-queues, for
> > performance and memory reasons (TX-queue are cheaper).  Memory as we
> > prealloc memory RX-queue (and give it to HW).  Performance as with too
> > many queues, there is less chance to have a (RX) bulk of packets in
> > queue.
>
> How would these be identified?  Suppose there's a set of existing RX
> queues for the device which handle the normal system traffic - then I
> add an AF_XDP socket which gets its own dedicated RX queue.  Does this
> create a new queue id for the device?  Create a new namespace with its
> own queue id?
>
> The entire reason the user even cares about the queue id at all is
> because it needs to use ethtool/netlink/tc for configuration, or the
> net device's XDP program needs to differentiate between the queues
> for specific treatment.
>

Exactly!

I've been thinking along these lines as well -- I'd like to go the
torwards a "AF_XDP with dedicated queues" model (in addition to the
attach one). Then again, as Jesper and Jakub reminded me, XDP Tx is
yet another inaccessible (from a configuration standpoint) set of
queues. Maybe there is a need for proper "queues". Some are attached
to the kernel stack, some to XDP Tx and some to AF_XDP sockets.

That said, I like the good old netdevice and socket model for user
applications. "There's a bunch of pipes. Some have HW backing, but I
don't care much." :-P


Björn

> >
> > For example I would not create an RX-queue per TCP-flow.  But why do I
> > still want per-queue XDP-progs and HW-filters for this TCP-flow
> > use-case... let me explain:
> >
> >   E.g. I want to implement an XDP TCP socket load-balancer (same host
> > delivery, between XDP and network stack).  And my goal is to avoid
> > touching packet payload on XDP RX-CPU.  First I configure ethtool
> > filter to redirect all TCP port 80 to a specific RX-queue (could also
> > be N-queues), then I don't need to parse TCP-port-80 in my per-queue
> > BPF-prog, and I have higher chance of bulk-RX.  Next I need HW to
> > provide some flow-identifier, e.g. RSS-hash, flow-id or internal
> > HW-queue-id, which I can use to redirect on (e.g. via CPUMAP to
> > N-CPUs).  This way I don't touch packet payload on RX-CPU (my bench
> > shows one RX-CPU can handle between 14-20Mpps).
> >
> >
> >>> Let me just remind everybody that the AF_XDP performance gains comes
> >>> from binding the resource, which allow for lock-free semantics, as
> >>> explained here[3].
> >>>
> >>> [3]
> >>> https://github.com/xdp-project/xdp-tutorial/tree/master/advanced03-AF_XDP#where-does-af_xdp-performance-come-from
> >>>
> >>>
> >>>> The socket is an endpoint, where I'd like data to end up (or get
> >>>> sent
> >>>> from). If the kernel can attach the socket to a hardware queue,
> >>>> there's zerocopy if not, copy-mode. Dito for Tx.
> >>>
> >>> Well XDP programs per RXQ is just a building block to achieve this.
> >>>
> >>> As Van Jacobson explain[4], sockets or applications "register" a
> >>> "transport signature", and gets back a "channel".   In our case, the
> >>> netdev-global XDP program is our way to register/program these
> >>> transport
> >>> signatures and redirect (e.g. into the AF_XDP socket).
> >>> This requires some work in software to parse and match transport
> >>> signatures to sockets.  The XDP programs per RXQ is a way to get
> >>> hardware to perform this filtering for us.
> >>>
> >>>  [4] http://www.lemis.com/grog/Documentation/vj/lca06vj.pdf
> >>>
> >>>
> >>>> Does a user (control plane) want/need to care about queues? Just
> >>>> create a flow to a socket (out-of-band or inband) or to a netdevice
> >>>> (out-of-band).
> >>>
> >>> A userspace "control-plane" program, could hide the setup and use
> >>> what
> >>> the system/hardware can provide of optimizations.  VJ[4] e.g.
> >>> suggest
> >>> that the "listen" socket first register the transport signature
> >>> (with
> >>> the driver) on "accept()".   If the HW supports DPDK-rte_flow API we
> >>> can register a 5-tuple (or create TC-HW rules) and load our
> >>> "transport-signature" XDP prog on the queue number we choose.  If
> >>> not,
> >>> when our netdev-global XDP prog need a hash-table with 5-tuple and
> >>> do
> >>> 5-tuple parsing.
> >>
> >> But we do want the ability to configure the queue, and get stats for
> >> that queue.. so we can't hide the queue completely, right?
> >
> > Yes, that is yet another example that the queue id "leak".
> >
> >
> >>> Creating netdevices via HW filter into queues is an interesting
> >>> idea.
> >>> DPDK have an example here[5], on how to per flow (via ethtool filter
> >>> setup even!) send packets to queues, that endup in SRIOV devices.
> >>>
> >>>  [5] https://doc.dpdk.org/guides/howto/flow_bifurcation.html
> >>
> >> I wish I had the courage to nack the ethtool redirect to VF Intel
> >> added :)
> >>
> >>>> Do we envison any other uses for per-queue XDP other than AF_XDP?
> >>>> If
> >>>> not, it would make *more* sense to attach the XDP program to the
> >>>> socket (e.g. if the endpoint would like to use kernel data
> >>>> structures
> >>>> via XDP).
> >>>
> >>> As demonstrated in [5] you can use (ethtool) hardware filters to
> >>> redirect packets into VFs (Virtual Functions).
> >>>
> >>> I also want us to extend XDP to allow for redirect from a PF
> >>> (Physical
> >>> Function) into a VF (Virtual Function).  First the netdev-global
> >>> XDP-prog need to support this (maybe extend xdp_rxq_info with PF +
> >>> VF
> >>> info).  Next configure HW filter to queue# and load XDP prog on that
> >>> queue# that only "redirect" to a single VF.  Now if driver+HW
> >>> supports
> >>> it, it can "eliminate" the per-queue XDP-prog and do everything in
> >>> HW.
> >>
> >> That sounds slightly contrived.  If the program is not doing
> >> anything,
> >> why involve XDP at all?
> >
> > If the HW doesn't support this then the XDP software will do the work.
> > If the HW supports this, then you can still list the XDP-prog via
> > bpftool, and see that you have a XDP prog that does this action (and
> > maybe expose a offloaded-to-HW bit if you like to expose this info).
> >
> >
> >>  As stated above we already have too many ways
> >> to do flow config and/or VF redirect.
> >>
> >>>> If we'd like to slice a netdevice into multiple queues. Isn't
> >>>> macvlan
> >>>> or similar *virtual* netdevices a better path, instead of
> >>>> introducing
> >>>> yet another abstraction?
> >>
> >> Yes, the question of use cases is extremely important.  It seems
> >> Mellanox is working on "spawning devlink ports" IOW slicing a device
> >> into subdevices.  Which is a great way to run bifurcated DPDK/netdev
> >> applications :/  If that gets merged I think we have to recalculate
> >> what purpose AF_XDP is going to serve, if any.
> >>
> >> In my view we have different "levels" of slicing:
> >
> > I do appreciate this overview of NIC slicing, as HW-filters +
> > per-queue-XDP can be seen as a way to slice up the NIC.
> >
> >>  (1) full HW device;
> >>  (2) software device (mdev?);
> >>  (3) separate netdev;
> >>  (4) separate "RSS instance";
> >>  (5) dedicated application queues.
> >>
> >> 1 - is SR-IOV VFs
> >> 2 - is software device slicing with mdev (Mellanox)
> >> 3 - is (I think) Intel's VSI debugfs... "thing"..
> >> 4 - is just ethtool RSS contexts (Solarflare)
> >> 5 - is currently AF-XDP (Intel)
> >>
> >> (2) or lower is required to have raw register access allowing
> >> vfio/DPDK
> >> to run "natively".
> >>
> >> (3) or lower allows for full reuse of all networking APIs, with very
> >> natural RSS configuration, TC/QoS configuration on TX etc.
> >>
> >> (5) is sufficient for zero copy.
> >>
> >> So back to the use case.. seems like AF_XDP is evolving into allowing
> >> "level 3" to pass all frames directly to the application?  With
> >> optional XDP filtering?  It's not a trick question - I'm just trying
> >> to
> >> place it somewhere on my mental map :)
> >
> >
> >>> XDP redirect a more generic abstraction that allow us to implement
> >>> macvlan.  Except macvlan driver is missing ndo_xdp_xmit. Again first
> >>> I
> >>> write this as global-netdev XDP-prog, that does a lookup in a
> >>> BPF-map.
> >>> Next I configure HW filters that match the MAC-addr into a queue#
> >>> and
> >>> attach simpler XDP-prog to queue#, that redirect into macvlan
> >>> device.
> >>>
> >>>> Further, is queue/socket a good abstraction for all devices? Wifi?
> >>
> >> Right, queue is no abstraction whatsoever.  Queue is a low level
> >> primitive.
> >
> > I agree, queue is a low level primitive.
> >
> > This the basically interface that the NIC hardware gave us... it is
> > fairly limited as it can only express a queue id and a IRQ line that
> > we
> > can try to utilize to scale the system.   Today, we have not really
> > tapped into the potential of using this... instead we simply RSS-hash
> > balance across all RX-queues and hope this makes the system scale...
> >
> >
> >>>> By just viewing sockets as an endpoint, we leave it up to the
> >>>> kernel to
> >>>> figure out the best way. "Here's an endpoint. Give me data
> >>>> **here**."
> >>>>
> >>>> The OpenFlow protocol does however support the concept of queues
> >>>> per
> >>>> port, but do we want to introduce that into the kernel?
> >>
> >> Switch queues != host queues.  Switch/HW queues are for QoS, host
> >> queues
> >> are for RSS.  Those two concepts are similar yet different.  In Linux
> >> if you offload basic TX TC (mq)prio (the old work John has done for
> >> Intel) the actual number of HW queues becomes "channel count" x "num
> >> TC
> >> prios".  What would queue ID mean for AF_XDP in that setup, I wonder.
> >
> > Thanks for explaining that. I must admit I never really understood the
> > mqprio concept and these "prios" (when reading the code and playing
> > with it).
> >
> >
> >>>> So, if per-queue XDP programs is only for AF_XDP, I think it's
> >>>> better
> >>>> to stick the program to the socket. For me per-queue is sort of a
> >>>> leaky abstraction...
> >>>>
> >>>> More thoughts. If we go the route of per-queue XDP programs. Would
> >>>> it
> >>>> be better to leave the setup to XDP -- i.e. the XDP program is
> >>>> controlling the per-queue programs (think tail-calls, but a map
> >>>> with
> >>>> per-q programs). Instead of the netlink layer. This is part of a
> >>>> bigger discussion, namely should XDP really implement the control
> >>>> plane?
> >>>>
> >>>> I really like that a software switch/router can be implemented
> >>>> effectively with XDP, but ideally I'd like it to be offloaded by
> >>>> hardware -- using the same control/configuration plane. Can we do
> >>>> it
> >>>> in hardware, do that. If not, emulate via XDP.
> >>
> >> There is already a number of proposals in the "device application
> >> slicing", it would be really great if we could make sure we don't
> >> repeat the mistakes of flow configuration APIs, and try to prevent
> >> having too many of them..
> >>
> >> Which is very challenging unless we have strong use cases..
> >>
> >>> That is actually the reason I want XDP per-queue, as it is a way to
> >>> offload the filtering to the hardware.  And if the per-queue
> >>> XDP-prog
> >>> becomes simple enough, the hardware can eliminate and do everything
> >>> in
> >>> hardware (hopefully).
> >>>
> >>>> The control plane should IMO be outside of the XDP program.
> >>
> >> ENOCOMPUTE :)  XDP program is the BPF byte code, it's never control
> >> plane.  Do you mean application should not control the "context/
> >> channel/subdev" creation?  You're not saying "it's not the XDP
> >> program
> >> which should be making the classification", no?  XDP program
> >> controlling the classification was _the_ reason why we liked AF_XDP
> >> :)
> >
> >
> > --
> > Best regards,
> >   Jesper Dangaard Brouer
> >   MSc.CS, Principal Kernel Engineer at Red Hat
> >   LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Per-queue XDP programs, thoughts
  2019-04-16  7:45         ` Björn Töpel
@ 2019-04-16 21:17           ` Jakub Kicinski
  0 siblings, 0 replies; 19+ messages in thread
From: Jakub Kicinski @ 2019-04-16 21:17 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Jesper Dangaard Brouer, Björn Töpel, Ilias Apalodimas,
	Toke Høiland-Jørgensen, Karlsson, Magnus,
	maciej.fijalkowski, Jason Wang, Alexei Starovoitov,
	Daniel Borkmann, John Fastabend, David Miller, Andy Gospodarek,
	netdev, bpf, Thomas Graf, Thomas Monjalon, Jonathan Lemon

On Tue, 16 Apr 2019 09:45:24 +0200, Björn Töpel wrote:
> > > > If we'd like to slice a netdevice into multiple queues. Isn't macvlan
> > > > or similar *virtual* netdevices a better path, instead of introducing
> > > > yet another abstraction?  
> >
> > Yes, the question of use cases is extremely important.  It seems
> > Mellanox is working on "spawning devlink ports" IOW slicing a device
> > into subdevices.  Which is a great way to run bifurcated DPDK/netdev
> > applications :/  If that gets merged I think we have to recalculate
> > what purpose AF_XDP is going to serve, if any.
> 
> I really like the subdevice-think, but let's have the drivers in the
> kernel. I don't see how the XDP view (including AF_XDP) changes with
> subdevices. My view on AF_XDP is that it's a socket that can
> receive/send data efficiently from/to the kernel. What subdevice
> *might* change is the requirement for a per-queue XDP program.

My worry is that the sockets are not expressive enough.  You can't have
a flower rule that forwards to a socket.  You can't have a flower rule
which forwards to an RSS context (AFAIK).  We have a model for doing
those things with port netdevs (A(incorrectly)KA representors).

> > > That is actually the reason I want XDP per-queue, as it is a way to
> > > offload the filtering to the hardware.  And if the per-queue XDP-prog
> > > becomes simple enough, the hardware can eliminate and do everything in
> > > hardware (hopefully).
> > >  
> > > > The control plane should IMO be outside of the XDP program.  
> >
> > ENOCOMPUTE :)  XDP program is the BPF byte code, it's never control
> > plance.  Do you mean application should not control the "context/
> > channel/subdev" creation?  
> 
> Yes, but I'm not sure. I'd like to hear more opinions.
> 
> Let me try to think out loud here. Say that per-queue XDP programs
> exist. The main XDP program receives all packets and makes the
> decision that a certain flow should end up in say queue X, and that
> the hardware supports offloading that. Should the knobs to program the
> hardware be in via BPF or by some other mechanism (perf ring to
> userland daemon)? Further, setting the XDP program per queue. Should
> that be done via XDP (the main XDP program has knowledge of all
> programs) or via say netlink (as XDP is today). One could argue that
> the per-queue setup should be a map (like tail-calls).

This is a philosophical discussion reminiscent of Saeed's control map
proposal.

I don't like the idea of purposefully shoehorning the networking
configuration into special maps.  It should probably be judged on
patch-by-patch basis, tho.

> > You're not saying "it's not the XDP program
> > which should be making the classification", no?  XDP program
> > controlling the classification was _the_ reason why we liked AF_XDP :)  
> 
> XDP program not doing classification would be weird. But if there's a
> scenario where *everything for a certain HW filter* end up in an
> AF_XDP queue, should we require an XDP program. I've been going back
> and forth here... :-)

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Per-queue XDP programs, thoughts
  2019-04-16 13:55         ` Jesper Dangaard Brouer
  2019-04-16 16:53           ` Jonathan Lemon
@ 2019-04-16 21:28           ` Jakub Kicinski
  1 sibling, 0 replies; 19+ messages in thread
From: Jakub Kicinski @ 2019-04-16 21:28 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Björn Töpel, Björn Töpel, ilias.apalodimas,
	toke, magnus.karlsson, maciej.fijalkowski, Jason Wang,
	Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	David Miller, Andy Gospodarek, netdev, bpf, Thomas Graf,
	Thomas Monjalon, Jonathan Lemon

On Tue, 16 Apr 2019 15:55:23 +0200, Jesper Dangaard Brouer wrote:
> On Mon, 15 Apr 2019 15:49:32 -0700
> Jakub Kicinski <jakub.kicinski@netronome.com> wrote:
> 
> > On Mon, 15 Apr 2019 18:32:58 +0200, Jesper Dangaard Brouer wrote:  
> > > On Mon, 15 Apr 2019 13:59:03 +0200 Björn Töpel <bjorn.topel@intel.com> wrote:    
> > > > Hi,
> > > > 
> > > > As you probably can derive from the amount of time this is taking, I'm
> > > > not really satisfied with the design of per-queue XDP program. (That,
> > > > plus I'm a terribly slow hacker... ;-)) I'll try to expand my thinking
> > > > in this mail!    
> > 
> > Jesper was advocating per-queue progs since very early days of XDP.
> > If it was easy to implement cleanly we would've already gotten it ;)  
> 
> (I cannot help to feel offended here...  IMHO that statement is BS,
> that is not how upstream development work, and sure, I am to blame as
> I've simply been to lazy or busy with other stuff to implement it.

Sincere apologies, definitely not what I was trying to say.

> It is not that hard to send down a queue# together with the XDP attach
> command.) 

That part is not hard, agreed.

> I've been advocating for per-queue progs from day-1, since this is an
> obvious performance advantage, given the programmer can specialize the
> BPF/XDP-prog to the filtered traffic.  I hope/assume we are on the same
> pages here, that per-queue progs is a performance optimization.
> 
> I guess the rest of the discussion in this thread is (1) if we can
> convince each-other that someone will actually use this optimization,
> and (2) if we can abstract this away from the user.

Yes, agreed.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Per-queue XDP programs, thoughts
  2019-04-15 16:32     ` Per-queue XDP programs, thoughts Jesper Dangaard Brouer
                         ` (5 preceding siblings ...)
  2019-04-16 10:41       ` Jason Wang
@ 2019-04-17 16:46       ` Björn Töpel
  6 siblings, 0 replies; 19+ messages in thread
From: Björn Töpel @ 2019-04-17 16:46 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Björn Töpel, Ilias Apalodimas,
	Toke Høiland-Jørgensen, Karlsson, Magnus,
	maciej.fijalkowski, Jason Wang, Alexei Starovoitov,
	Daniel Borkmann, Jakub Kicinski, John Fastabend, David Miller,
	Andy Gospodarek, netdev, bpf, Thomas Graf, Thomas Monjalon,
	Jonathan Lemon

On Mon, 15 Apr 2019 at 18:33, Jesper Dangaard Brouer <brouer@redhat.com> wrote:
>
[...]
> >
> > Ok, please convince me! :-D
>
> I tried to above...
>

I think you (and Jakub) did. :-) Looks like a "queue" is a good
(necessary) abstraction, but I need to think more about how to e.g.
access "dedicated/isolated" AF_XDP queues or XDP Tx queues in a
convenient way.

I'll still aiming for making it dead-easy (like Jonathan describes) to
use AF_XDP sockets...

Thanks a lot for all the input/help!


Björn

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: Per-queue XDP programs, thoughts
  2019-04-16 14:48         ` Jesper Dangaard Brouer
@ 2019-04-17 20:17           ` Tom Herbert
  0 siblings, 0 replies; 19+ messages in thread
From: Tom Herbert @ 2019-04-17 20:17 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Jonathan Lemon, Björn Töpel, Björn Töpel,
	Ilias Apalodimas, Toke Høiland-Jørgensen,
	magnus.karlsson, maciej.fijalkowski, Jason Wang,
	Alexei Starovoitov, Daniel Borkmann, Jakub Kicinski,
	John Fastabend, David Miller, Andy Gospodarek,
	Linux Kernel Network Developers, bpf, Thomas Graf,
	Thomas Monjalon

On Tue, Apr 16, 2019 at 7:48 AM Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>
> On Mon, 15 Apr 2019 10:58:07 -0700
> "Jonathan Lemon" <jonathan.lemon@gmail.com> wrote:
>
> > On 15 Apr 2019, at 9:32, Jesper Dangaard Brouer wrote:
> >
> > > On Mon, 15 Apr 2019 13:59:03 +0200 Björn Töpel
> > > <bjorn.topel@intel.com> wrote:
> > >
> > >> Hi,
> > >>
> > >> As you probably can derive from the amount of time this is taking,
> > >> I'm
> > >> not really satisfied with the design of per-queue XDP program. (That,
> > >> plus I'm a terribly slow hacker... ;-)) I'll try to expand my
> > >> thinking
> > >> in this mail!
> > >>
> > >> Beware, it's kind of a long post, and it's all over the place.
> > >
> > > Cc'ing all the XDP-maintainers (and netdev).
> > >
> > >> There are a number of ways of setting up flows in the kernel, e.g.
> > >>
> > >> * Connecting/accepting a TCP socket (in-band)
> > >> * Using tc-flower (out-of-band)
> > >> * ethtool (out-of-band)
> > >> * ...
> > >>
> > >> The first acts on sockets, the second on netdevs. Then there's
> > >> ethtool
> > >> to configure RSS, and the RSS-on-steriods rxhash/ntuple that can
> > >> steer
> > >> to queues. Most users care about sockets and netdevices. Queues is
> > >> more of an implementation detail of Rx or for QoS on the Tx side.
> > >
> > > Let me first acknowledge that the current Linux tools to administrator
> > > HW filters is lacking (well sucks).  We know the hardware is capable,
> > > as DPDK have an full API for this called rte_flow[1]. If nothing else
> > > you/we can use the DPDK API to create a program to configure the
> > > hardware, examples here[2]
> > >
> > >  [1] https://doc.dpdk.org/guides/prog_guide/rte_flow.html
> > >  [2] https://doc.dpdk.org/guides/howto/rte_flow.html
> > >
> > >> XDP is something that we can attach to a netdevice. Again, very
> > >> natural from a user perspective. As for XDP sockets, the current
> > >> mechanism is that we attach to an existing netdevice queue. Ideally
> > >> what we'd like is to *remove* the queue concept. A better approach
> > >> would be creating the socket and set it up -- but not binding it to a
> > >> queue. Instead just binding it to a netdevice (or crazier just
> > >> creating a socket without a netdevice).
> > >
> > > Let me just remind everybody that the AF_XDP performance gains comes
> > > from binding the resource, which allow for lock-free semantics, as
> > > explained here[3].
> > >
> > > [3]
> > > https://github.com/xdp-project/xdp-tutorial/tree/master/advanced03-AF_XDP#where-does-af_xdp-performance-come-from
> > >
> > >
> > >> The socket is an endpoint, where I'd like data to end up (or get sent
> > >> from). If the kernel can attach the socket to a hardware queue,
> > >> there's zerocopy if not, copy-mode. Dito for Tx.
> > >
> > > Well XDP programs per RXQ is just a building block to achieve this.
> > >
> > > As Van Jacobson explain[4], sockets or applications "register" a
> > > "transport signature", and gets back a "channel".   In our case, the
> > > netdev-global XDP program is our way to register/program these
> > > transport
> > > signatures and redirect (e.g. into the AF_XDP socket).
> > > This requires some work in software to parse and match transport
> > > signatures to sockets.  The XDP programs per RXQ is a way to get
> > > hardware to perform this filtering for us.
> > >
> > >  [4] http://www.lemis.com/grog/Documentation/vj/lca06vj.pdf
> > >
> > >
> > >> Does a user (control plane) want/need to care about queues? Just
> > >> create a flow to a socket (out-of-band or inband) or to a netdevice
> > >> (out-of-band).
> > >
> > > A userspace "control-plane" program, could hide the setup and use what
> > > the system/hardware can provide of optimizations.  VJ[4] e.g. suggest
> > > that the "listen" socket first register the transport signature (with
> > > the driver) on "accept()".   If the HW supports DPDK-rte_flow API we
> > > can register a 5-tuple (or create TC-HW rules) and load our
> > > "transport-signature" XDP prog on the queue number we choose.  If not,
> > > when our netdev-global XDP prog need a hash-table with 5-tuple and do
> > > 5-tuple parsing.
> > >
> > > Creating netdevices via HW filter into queues is an interesting idea.
> > > DPDK have an example here[5], on how to per flow (via ethtool filter
> > > setup even!) send packets to queues, that endup in SRIOV devices.
> > >
> > >  [5] https://doc.dpdk.org/guides/howto/flow_bifurcation.html
> > >
> > >
> > >> Do we envison any other uses for per-queue XDP other than AF_XDP? If
> > >> not, it would make *more* sense to attach the XDP program to the
> > >> socket (e.g. if the endpoint would like to use kernel data structures
> > >> via XDP).
> > >
> > > As demonstrated in [5] you can use (ethtool) hardware filters to
> > > redirect packets into VFs (Virtual Functions).
> > >
> > > I also want us to extend XDP to allow for redirect from a PF (Physical
> > > Function) into a VF (Virtual Function).  First the netdev-global
> > > XDP-prog need to support this (maybe extend xdp_rxq_info with PF + VF
> > > info).  Next configure HW filter to queue# and load XDP prog on that
> > > queue# that only "redirect" to a single VF.  Now if driver+HW supports
> > > it, it can "eliminate" the per-queue XDP-prog and do everything in HW.
> >
> > One thing I'd like to see is have RSS distribute incoming traffic
> > across a set of queues.  The application would open a set of xsk's
> > which are bound to those queues.
>
> Yes. (Some) NIC hardware does support this RSS distribute incoming
> traffic across a set of queues.  As you can see in [5] they have an
> example of this:
>
>  testpmd> flow isolate 0 true
>  testpmd> flow create 0 ingress pattern eth / ipv4 / udp / vxlan vni is 42 / end \
>           actions rss queues 0 1 2 3 end / end
>
>
> > I'm not seeing how a transport signature would achieve this.  The
> > current tooling seems to treat the queue as the basic building block,
> > which seems generally appropriate.
>
> After creating N-queue that your RSS-hash distribute over, I imagine
> that you load your per-queue XDP program on each of these N-queues.  I
> don't necessarily see a need for the kernel API to expose to userspace
> a API/facility to load an XDP-prog on N-queue in-one-go (you can just
> iterate over them).

Accelerated RFS does this. The idea is that hardware packet steering
can be programmed to steer packets to a certain queue based in 5-tuple
(or probably a hash of 4-tuple). Right now the implementation ties
this to canonical RFS which endeavours to steer packets to an
appropriate queue where application process is running, but the
hardware interface should be amenable to arbitrarily programming flows
to steer to certain queues. There is a question of whether the
steering needed to be perfect (exact 5-tuple is matched) or imperfect
(hash indexed into a table is okay), that depends on the isolation
model needed.

Tom

>
>
> > Whittling things down (receiving packets only for a specific flow)
> > could be achieved by creating a queue which only contains those
> > packets which atched via some form of classification (or perhaps
> > steered to a VF device), aka [5] above.   Exposing multiple queues
> > allows load distribution for those apps which care about it.
>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2019-04-17 20:17 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20190405131745.24727-1-bjorn.topel@gmail.com>
     [not found] ` <20190405131745.24727-2-bjorn.topel@gmail.com>
     [not found]   ` <64259723-f0d8-8ade-467e-ad865add4908@intel.com>
2019-04-15 16:32     ` Per-queue XDP programs, thoughts Jesper Dangaard Brouer
2019-04-15 17:08       ` Toke Høiland-Jørgensen
2019-04-15 17:58       ` Jonathan Lemon
2019-04-16 14:48         ` Jesper Dangaard Brouer
2019-04-17 20:17           ` Tom Herbert
2019-04-15 22:49       ` Jakub Kicinski
2019-04-16  7:45         ` Björn Töpel
2019-04-16 21:17           ` Jakub Kicinski
2019-04-16 13:55         ` Jesper Dangaard Brouer
2019-04-16 16:53           ` Jonathan Lemon
2019-04-16 18:23             ` Björn Töpel
2019-04-16 21:28           ` Jakub Kicinski
2019-04-16  7:44       ` Björn Töpel
2019-04-16  9:36         ` Toke Høiland-Jørgensen
2019-04-16 12:07           ` Björn Töpel
2019-04-16 13:25             ` Toke Høiland-Jørgensen
2019-04-16 10:15       ` Jason Wang
2019-04-16 10:41       ` Jason Wang
2019-04-17 16:46       ` Björn Töpel

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).