Re: Per-queue XDP programs, thoughts

* Re: Per-queue XDP programs, thoughts
       [not found]   ` <64259723-f0d8-8ade-467e-ad865add4908@intel.com>
@ 2019-04-15 16:32     ` Jesper Dangaard Brouer
  2019-04-15 17:08       ` Toke Høiland-Jørgensen
                         ` (6 more replies)
  0 siblings, 7 replies; 19+ messages in thread
From: Jesper Dangaard Brouer @ 2019-04-15 16:32 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Björn Töpel, ilias.apalodimas, toke, magnus.karlsson,
	maciej.fijalkowski, brouer, Jason Wang, Alexei Starovoitov,
	Daniel Borkmann, Jakub Kicinski, John Fastabend, David Miller,
	Andy Gospodarek, netdev, bpf, Thomas Graf, Thomas Monjalon

On Mon, 15 Apr 2019 13:59:03 +0200 Björn Töpel <bjorn.topel@intel.com> wrote:

> Hi,
> 
> As you probably can derive from the amount of time this is taking, I'm
> not really satisfied with the design of per-queue XDP program. (That,
> plus I'm a terribly slow hacker... ;-)) I'll try to expand my thinking
> in this mail!
> 
> Beware, it's kind of a long post, and it's all over the place.

Cc'ing all the XDP-maintainers (and netdev).

> There are a number of ways of setting up flows in the kernel, e.g.
> 
> * Connecting/accepting a TCP socket (in-band)
> * Using tc-flower (out-of-band)
> * ethtool (out-of-band)
> * ...
> 
> The first acts on sockets, the second on netdevs. Then there's ethtool
> to configure RSS, and the RSS-on-steriods rxhash/ntuple that can steer
> to queues. Most users care about sockets and netdevices. Queues is
> more of an implementation detail of Rx or for QoS on the Tx side.

Let me first acknowledge that the current Linux tools to administrator
HW filters is lacking (well sucks).  We know the hardware is capable,
as DPDK have an full API for this called rte_flow[1]. If nothing else
you/we can use the DPDK API to create a program to configure the
hardware, examples here[2]

 [1] https://doc.dpdk.org/guides/prog_guide/rte_flow.html
 [2] https://doc.dpdk.org/guides/howto/rte_flow.html

> XDP is something that we can attach to a netdevice. Again, very
> natural from a user perspective. As for XDP sockets, the current
> mechanism is that we attach to an existing netdevice queue. Ideally
> what we'd like is to *remove* the queue concept. A better approach
> would be creating the socket and set it up -- but not binding it to a
> queue. Instead just binding it to a netdevice (or crazier just
> creating a socket without a netdevice).

Let me just remind everybody that the AF_XDP performance gains comes
from binding the resource, which allow for lock-free semantics, as
explained here[3].

[3] https://github.com/xdp-project/xdp-tutorial/tree/master/advanced03-AF_XDP#where-does-af_xdp-performance-come-from

> The socket is an endpoint, where I'd like data to end up (or get sent
> from). If the kernel can attach the socket to a hardware queue,
> there's zerocopy if not, copy-mode. Dito for Tx.

Well XDP programs per RXQ is just a building block to achieve this.

As Van Jacobson explain[4], sockets or applications "register" a
"transport signature", and gets back a "channel".   In our case, the
netdev-global XDP program is our way to register/program these transport
signatures and redirect (e.g. into the AF_XDP socket).
This requires some work in software to parse and match transport
signatures to sockets.  The XDP programs per RXQ is a way to get
hardware to perform this filtering for us.

 [4] http://www.lemis.com/grog/Documentation/vj/lca06vj.pdf

> Does a user (control plane) want/need to care about queues? Just
> create a flow to a socket (out-of-band or inband) or to a netdevice
> (out-of-band).

A userspace "control-plane" program, could hide the setup and use what
the system/hardware can provide of optimizations.  VJ[4] e.g. suggest
that the "listen" socket first register the transport signature (with
the driver) on "accept()".   If the HW supports DPDK-rte_flow API we
can register a 5-tuple (or create TC-HW rules) and load our
"transport-signature" XDP prog on the queue number we choose.  If not,
when our netdev-global XDP prog need a hash-table with 5-tuple and do
5-tuple parsing.

Creating netdevices via HW filter into queues is an interesting idea.
DPDK have an example here[5], on how to per flow (via ethtool filter
setup even!) send packets to queues, that endup in SRIOV devices.

 [5] https://doc.dpdk.org/guides/howto/flow_bifurcation.html

> Do we envison any other uses for per-queue XDP other than AF_XDP? If
> not, it would make *more* sense to attach the XDP program to the
> socket (e.g. if the endpoint would like to use kernel data structures
> via XDP).

As demonstrated in [5] you can use (ethtool) hardware filters to
redirect packets into VFs (Virtual Functions).

I also want us to extend XDP to allow for redirect from a PF (Physical
Function) into a VF (Virtual Function).  First the netdev-global
XDP-prog need to support this (maybe extend xdp_rxq_info with PF + VF
info).  Next configure HW filter to queue# and load XDP prog on that
queue# that only "redirect" to a single VF.  Now if driver+HW supports
it, it can "eliminate" the per-queue XDP-prog and do everything in HW. 

> If we'd like to slice a netdevice into multiple queues. Isn't macvlan
> or similar *virtual* netdevices a better path, instead of introducing
> yet another abstraction?

XDP redirect a more generic abstraction that allow us to implement
macvlan.  Except macvlan driver is missing ndo_xdp_xmit. Again first I
write this as global-netdev XDP-prog, that does a lookup in a BPF-map.
Next I configure HW filters that match the MAC-addr into a queue# and
attach simpler XDP-prog to queue#, that redirect into macvlan device.

> Further, is queue/socket a good abstraction for all devices? Wifi? By
> just viewing sockets as an endpoint, we leave it up to the kernel to
> figure out the best way. "Here's an endpoint. Give me data **here**."
> 
> The OpenFlow protocol does however support the concept of queues per
> port, but do we want to introduce that into the kernel?
> 
> So, if per-queue XDP programs is only for AF_XDP, I think it's better
> to stick the program to the socket. For me per-queue is sort of a
> leaky abstraction...
>
> More thoughts. If we go the route of per-queue XDP programs. Would it
> be better to leave the setup to XDP -- i.e. the XDP program is
> controlling the per-queue programs (think tail-calls, but a map with
> per-q programs). Instead of the netlink layer. This is part of a
> bigger discussion, namely should XDP really implement the control
> plane?
> 
> I really like that a software switch/router can be implemented
> effectively with XDP, but ideally I'd like it to be offloaded by
> hardware -- using the same control/configuration plane. Can we do it
> in hardware, do that. If not, emulate via XDP.

That is actually the reason I want XDP per-queue, as it is a way to
offload the filtering to the hardware.  And if the per-queue XDP-prog
becomes simple enough, the hardware can eliminate and do everything in
hardware (hopefully).

> The control plane should IMO be outside of the XDP program.
> 
> Ok, please convince me! :-D

I tried to above...

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

More use-cases for per-queue XDP-prog:

XDP for containers
------------------
XDP can redirect into veth, used for containers.  So, I want to
implement container protection/isolation in XDP layer.  E.g. I only
want my container to talk to 1 external src-IP to my container dst-IP
on port 80.  I can implement that check in netdev-global XDP BPF-code.
But I can also hardware "offload" this simple filter (via ethtool or
rte_flow) and simplify the per-queue XDP-prog.  Given the queue now
only receives traffic that match my desc, I have now protected/isolated
the traffic to my container.

DPDK using per-queue AF_XDP
---------------------------
AFAIK an AF_XDP PMD driver have been merged in DPDK (but I've not
looked at the code).

It would be very natural for DPDK to load per-queue XDP-progs for
interfacing with AF_XDP, as they already have rte_flow API (see
[1]+[2]) for configuring HW filters.  And loading per-queue XDP-progs
would also avoid disturbing other users of XDP on same machine (if we
choose the semantics defined in [6]).

[6] https://github.com/xdp-project/xdp-project/blob/master/areas/core/xdp_per_rxq01.org#proposal-rxq-prog-takes-precedence

^ permalink raw reply	[flat|nested] 19+ messages in thread