bpf.vger.kernel.org archive mirror
* [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP
@ 2019-10-18  4:07 Toshiaki Makita
  2019-10-18  4:07 ` [RFC PATCH v2 bpf-next 01/15] xdp_flow: Add skeleton of XDP based flow offload driver Toshiaki Makita
                   ` (16 more replies)
  0 siblings, 17 replies; 58+ messages in thread
From: Toshiaki Makita @ 2019-10-18  4:07 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, Pablo Neira Ayuso, Jozsef Kadlecsik,
	Florian Westphal, Pravin B Shelar
  Cc: Toshiaki Makita, netdev, bpf, William Tu, Stanislav Fomichev

This is a PoC of an idea to offload flows, i.e. TC flower and nftables
rules, to XDP.

* Motivation

The purpose is to speed up flow-based network features like TC flower and
nftables by making use of XDP.

I chose the flow feature because my current interest is in OVS. OVS uses
TC flower to offload flow tables to hardware, so if TC can offload flows
to XDP, OVS can be offloaded to XDP as well.

When a TC flower filter is offloaded to XDP, received packets are
handled by XDP first, and if their protocol or another attribute is not
supported by the eBPF program, the program returns XDP_PASS and the
packets are passed to the upper layer, TC.

The packet processing flow will be like this when this mechanism,
xdp_flow, is used with OVS.

 +-------------+
 | openvswitch |
 |    kmod     |
 +-------------+
        ^
        | if no match in filters (flow key or action not supported by TC)
 +-------------+
 |  TC flower  |
 +-------------+
        ^
        | if no match in flow tables (flow key or action not supported by XDP)
 +-------------+
 |  XDP prog   |
 +-------------+
        ^
        | incoming packets
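
The miss-and-fall-back behavior in the diagram above can be sketched in
plain C. This is a hypothetical, simplified stand-in for the real eBPF
program: the linear "flow table", the struct layout, and the function
names are illustrative only (the actual implementation uses BPF maps
populated by the UMH).

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-ins for the real XDP verdict values. */
enum { XDP_DROP = 1, XDP_PASS = 2, XDP_REDIRECT = 4 };

struct flow_entry {
	uint16_t eth_type;	/* match key: EtherType (host order here) */
	int action;		/* verdict to apply on a match */
};

/* Hypothetical lookup over a tiny linear "flow table". */
static int flow_lookup(const struct flow_entry *tbl, int n,
		       uint16_t eth_type)
{
	for (int i = 0; i < n; i++)
		if (tbl[i].eth_type == eth_type)
			return tbl[i].action;

	/* Miss: fall back to the stack (TC flower, then the ovs kmod). */
	return XDP_PASS;
}
```

A miss here is exactly the "if no match" arrow in the diagram: the packet
is not dropped, it simply continues up to TC.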

Of course TC flower can also be used directly, without OVS, to speed up TC.

This is especially useful when the device does not support HW offload,
as is the case for virtual interfaces like veth.


* How to use

Only ingress flow blocks are supported at this point.
Enable the feature via ethtool before binding a device to a flow block.

 $ ethtool -K eth0 flow-offload-xdp on

Then bind a device to a flow block using TC or nftables. Example
commands for TC look like this.

 $ tc qdisc add dev eth0 clsact
 $ tc filter add dev eth0 ingress protocol ip flower ...

Alternatively, when using OVS, the qdisc and filters are added
automatically by setting hw-offload.

 $ ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
 $ systemctl stop openvswitch
 $ tc qdisc del dev eth0 ingress # or reboot
 $ ethtool -K eth0 flow-offload-xdp on
 $ systemctl start openvswitch

NOTE: I have not tested nftables offload. Theoretically it should work.

* Performance

I measured the drop rate at a veth interface with a redirect action from
a physical interface (i40e 25G NIC, XXV710) to the veth. The CPU is a
Xeon Silver 4114 (2.20 GHz).
                                                                 XDP_DROP
                    +------+                        +-------+    +-------+
 pktgen -- wire --> | eth0 | -- TC/OVS redirect --> | veth0 |----| veth1 |
                    +------+   (offloaded to XDP)   +-------+    +-------+

The redirect setup is done through OVS like this.

 $ ovs-vsctl add-br ovsbr0
 $ ovs-vsctl add-port ovsbr0 eth0
 $ ovs-vsctl add-port ovsbr0 veth0
 $ ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
 $ systemctl stop openvswitch
 $ tc qdisc del dev eth0 ingress
 $ tc qdisc del dev veth0 ingress
 $ ethtool -K eth0 flow-offload-xdp on
 $ ethtool -K veth0 flow-offload-xdp on
 $ systemctl start openvswitch

I tested a single core/single flow with three configurations
(spectre_v2 disabled):
- xdp_flow: hw-offload=true, flow-offload-xdp on
- TC:       hw-offload=true, flow-offload-xdp off (software TC)
- ovs kmod: hw-offload=false

 xdp_flow  TC        ovs kmod
 --------  --------  --------
 5.2 Mpps  1.2 Mpps  1.1 Mpps

So the xdp_flow drop rate is roughly 4x-5x higher than software TC or the
ovs kmod.

OTOH the time to add a flow increases with xdp_flow.

Ping latency of the first packet when veth1 does XDP_PASS instead of
XDP_DROP:

 xdp_flow  TC        ovs kmod
 --------  --------  --------
 22ms      6ms       0.6ms

xdp_flow does a lot of work to emulate TC behavior, including the UMH
transaction and multiple bpf map updates from the UMH, which I think
increases the latency.


* Implementation

xdp_flow makes use of a UMH to load an eBPF program for XDP, similar to
bpfilter. The difference is that xdp_flow does not generate the eBPF
program dynamically; instead a prebuilt program is embedded in the UMH.
This is mainly because flow insertion is considerably frequent. If we
generated and loaded an eBPF program on each flow insertion, the
first-packet ping latency in the above test would increase, which I want
to avoid.

                         +----------------------+
                         |    xdp_flow_umh      | load eBPF prog for XDP
                         | (eBPF prog embedded) | update maps for flow tables
                         +----------------------+
                                   ^ |
                           request | v eBPF prog id
 +-----------+  offload  +-----------------------+
 | TC flower | --------> |    xdp_flow kmod      | attach the prog to XDP
 +-----------+           | (flow offload driver) |
                         +-----------------------+

- When an ingress/clsact qdisc is created, i.e. a device is bound to a
  flow block, the xdp_flow kmod requests xdp_flow_umh to load the eBPF
  prog. xdp_flow_umh returns the prog id and the xdp_flow kmod attaches
  the prog to XDP (the prog is attached from the kmod because rtnl_lock
  is held there).

- When a flower filter is added, the xdp_flow kmod requests xdp_flow_umh
  to update maps for flow tables.
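
The kmod/UMH transactions above boil down to fixed-size request/reply
structs exchanged over a pipe pair. The following userland sketch mimics
that shape; the struct layouts, command values, and the fake prog id are
illustrative only, not the actual msgfmt.h protocol.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical miniature of the mailbox protocol. */
struct mini_req { int ifindex; uint8_t cmd; };
struct mini_rep { int status; uint32_t id; };

enum { CMD_NOOP = 0, CMD_LOAD = 1 };

/* "UMH" side: read one request, answer with a fake prog id. */
static void umh_serve(int rfd, int wfd)
{
	struct mini_req req;
	struct mini_rep rep = { 0 };

	if (read(rfd, &req, sizeof(req)) != (ssize_t)sizeof(req))
		return;
	if (req.cmd == CMD_LOAD)
		rep.id = 42;	/* stand-in for the loaded prog id */
	if (write(wfd, &rep, sizeof(rep)) != (ssize_t)sizeof(rep))
		return;
}

/* "kmod" side: one synchronous transaction, like transact_umh(). */
static int transact(int wfd, int rfd, struct mini_req *req, uint32_t *id)
{
	struct mini_rep rep;

	if (write(wfd, req, sizeof(*req)) != (ssize_t)sizeof(*req))
		return -1;
	if (read(rfd, &rep, sizeof(rep)) != (ssize_t)sizeof(rep))
		return -1;
	if (id)
		*id = rep.id;
	return rep.status;
}
```

In the real code the "UMH" end is forked from the embedded blob with
fork_usermode_blob() and the pipes come from struct umh_info, but the
request/reply discipline is the same: one write, one blocking read.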


* Patches

- patch 1
 Basic framework for xdp_flow kmod and UMH.

- patch 2
 Add prebuilt eBPF program embedded in UMH.

- patch 3, 4, 5
 Attach the prog to XDP in the kmod using the prog id returned from the
 UMH.

- patch 6, 7
 Add maps for flow tables and flow table manipulation logic in UMH.

- patch 8
 Implement flow lookup and basic actions in eBPF prog.

- patch 9
 Implement flow manipulation logic, serialize flow key and actions from
 TC flower and make requests to UMH in kmod.

- patch 10
 Add flow-offload-xdp netdev feature and register indr flow block to call
 xdp_flow kmod.

- patch 11, 12
 Add example actions, redirect and vlan_push.

- patch 13
 Add a testcase for xdp_flow.

- patch 14, 15
 These are unrelated patches that just improve the XDP program's
 performance. They are included to demonstrate to what extent xdp_flow
 performance can increase. Without them, the drop rate goes down from
 5.2 Mpps to 4.2 Mpps. The plan is to send these patches separately
 before dropping the RFC tag.


* About OVS AF_XDP netdev

Recently OVS has added AF_XDP netdev type support. This also makes use
of XDP, but in some ways different from this patch set.

- The AF_XDP work originally started in order to bring BPF's flexibility
  to OVS, which enables us to upgrade the datapath without updating the
  kernel. The AF_XDP solution uses a userland datapath, so it achieved
  that goal. xdp_flow will not replace the OVS datapath completely, but
  offloads it partially just for speedup.

- OVS AF_XDP requires a PMD for the best performance, so it consumes 100%
  of a CPU as well as another core for softirq.

- OVS AF_XDP needs packet copy when forwarding packets.

- xdp_flow can be used not only for OVS. It works for direct use of TC
  flower and nftables.


* About alternative userland (ovs-vswitchd etc.) implementation

Maybe similar logic could be implemented in the ovs-vswitchd offload
mechanism instead of adding code to the kernel. I just thought offloading
TC is more generic and allows wider usage through direct TC commands.

For example, since OVS inserts a flow into the kernel only when a flow
miss happens in the kernel, we could add offloaded flows via tc filter in
advance to avoid flow insertion latency for certain sensitive flows.
Using TC flower without OVS is also possible.

Also as written above nftables can be offloaded to XDP with this
mechanism as well.

Another way to achieve this from userland is to add notifications to the
flow_offload kernel code to inform userspace of flow addition and
deletion events, and have a daemon listen for them, which in turn loads
eBPF programs, attaches them to XDP, and modifies eBPF maps. Although
this may open up more use cases, I don't think this is the best solution
because it requires emulating kernel behavior as an offload engine, while
flow-related code changes heavily and is difficult to track from out of
tree.

* Note

This patch set is based on top of commit 5bc60de50dfe ("selftests: bpf:
Don't try to read files without read permission") on bpf-next, but you
need to backport commit 98beb3edeb97 ("samples/bpf: Add a workaround for
asm_inline") from the bpf tree to successfully build the module.

* Changes

RFC v2:
 - Use indr block instead of modifying TC core, feedback from Jakub
   Kicinski.
 - Rename tc-offload-xdp to flow-offload-xdp since this works not only
   for TC but also for nftables, as now I use indr flow block.
 - Factor out XDP program validation code in net/core and use it to
   attach a program to XDP from xdp_flow.
 - Use /dev/kmsg instead of syslog.

Any feedback is welcome.
Thanks!

Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>

Toshiaki Makita (15):
  xdp_flow: Add skeleton of XDP based flow offload driver
  xdp_flow: Add skeleton bpf program for XDP
  bpf: Add API to get program from id
  xdp: Export dev_check_xdp and dev_change_xdp
  xdp_flow: Attach bpf prog to XDP in kernel after UMH loaded program
  xdp_flow: Prepare flow tables in bpf
  xdp_flow: Add flow entry insertion/deletion logic in UMH
  xdp_flow: Add flow handling and basic actions in bpf prog
  xdp_flow: Implement flow replacement/deletion logic in xdp_flow kmod
  xdp_flow: Add netdev feature for enabling flow offload to XDP
  xdp_flow: Implement redirect action
  xdp_flow: Implement vlan_push action
  bpf, selftest: Add test for xdp_flow
  i40e: prefetch xdp->data before running XDP prog
  bpf, hashtab: Compare keys in long

 drivers/net/ethernet/intel/i40e/i40e_txrx.c  |    1 +
 include/linux/bpf.h                          |    8 +
 include/linux/netdev_features.h              |    2 +
 include/linux/netdevice.h                    |    4 +
 kernel/bpf/hashtab.c                         |   27 +-
 kernel/bpf/syscall.c                         |   42 +-
 net/Kconfig                                  |    1 +
 net/Makefile                                 |    1 +
 net/core/dev.c                               |  113 ++-
 net/core/ethtool.c                           |    1 +
 net/xdp_flow/.gitignore                      |    1 +
 net/xdp_flow/Kconfig                         |   16 +
 net/xdp_flow/Makefile                        |  112 +++
 net/xdp_flow/msgfmt.h                        |  102 +++
 net/xdp_flow/umh_bpf.h                       |   34 +
 net/xdp_flow/xdp_flow.h                      |   28 +
 net/xdp_flow/xdp_flow_core.c                 |  180 +++++
 net/xdp_flow/xdp_flow_kern_bpf.c             |  358 +++++++++
 net/xdp_flow/xdp_flow_kern_bpf_blob.S        |    7 +
 net/xdp_flow/xdp_flow_kern_mod.c             |  699 +++++++++++++++++
 net/xdp_flow/xdp_flow_umh.c                  | 1043 ++++++++++++++++++++++++++
 net/xdp_flow/xdp_flow_umh_blob.S             |    7 +
 tools/testing/selftests/bpf/Makefile         |    1 +
 tools/testing/selftests/bpf/test_xdp_flow.sh |  106 +++
 24 files changed, 2864 insertions(+), 30 deletions(-)
 create mode 100644 net/xdp_flow/.gitignore
 create mode 100644 net/xdp_flow/Kconfig
 create mode 100644 net/xdp_flow/Makefile
 create mode 100644 net/xdp_flow/msgfmt.h
 create mode 100644 net/xdp_flow/umh_bpf.h
 create mode 100644 net/xdp_flow/xdp_flow.h
 create mode 100644 net/xdp_flow/xdp_flow_core.c
 create mode 100644 net/xdp_flow/xdp_flow_kern_bpf.c
 create mode 100644 net/xdp_flow/xdp_flow_kern_bpf_blob.S
 create mode 100644 net/xdp_flow/xdp_flow_kern_mod.c
 create mode 100644 net/xdp_flow/xdp_flow_umh.c
 create mode 100644 net/xdp_flow/xdp_flow_umh_blob.S
 create mode 100755 tools/testing/selftests/bpf/test_xdp_flow.sh

-- 
1.8.3.1



* [RFC PATCH v2 bpf-next 01/15] xdp_flow: Add skeleton of XDP based flow offload driver
  2019-10-18  4:07 [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP Toshiaki Makita
@ 2019-10-18  4:07 ` Toshiaki Makita
  2019-10-18  4:07 ` [RFC PATCH v2 bpf-next 02/15] xdp_flow: Add skeleton bpf program for XDP Toshiaki Makita
                   ` (15 subsequent siblings)
  16 siblings, 0 replies; 58+ messages in thread
From: Toshiaki Makita @ 2019-10-18  4:07 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, Pablo Neira Ayuso, Jozsef Kadlecsik,
	Florian Westphal, Pravin B Shelar
  Cc: Toshiaki Makita, netdev, bpf, William Tu, Stanislav Fomichev

Add the flow offload driver, xdp_flow_core.c, and the skeleton of the
UMH handling mechanism. The driver is not called from anywhere yet.

xdp_flow_setup_block() in xdp_flow_core.c is meant to be called when
a net device is bound to a flow block, e.g. when an ingress qdisc is
added. It loads the xdp_flow kernel module, and the kmod provides
callbacks for the setup phase and the flow insertion phase.

xdp_flow_setup() in the kmod will be called from xdp_flow_setup_block()
when ingress qdisc is added, and xdp_flow_setup_block_cb() will be
called when a tc flower filter is added.

The former will request the UMH to load the eBPF program and the latter
will request the UMH to populate maps for flow tables. No actual
processing is implemented in this patch; the following commits implement
it.

The overall UMH handling mechanism is modeled after bpfilter.

Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
---
 net/Kconfig                      |   1 +
 net/Makefile                     |   1 +
 net/xdp_flow/.gitignore          |   1 +
 net/xdp_flow/Kconfig             |  16 +++
 net/xdp_flow/Makefile            |  31 +++++
 net/xdp_flow/msgfmt.h            | 102 ++++++++++++++++
 net/xdp_flow/xdp_flow.h          |  23 ++++
 net/xdp_flow/xdp_flow_core.c     | 127 ++++++++++++++++++++
 net/xdp_flow/xdp_flow_kern_mod.c | 250 +++++++++++++++++++++++++++++++++++++++
 net/xdp_flow/xdp_flow_umh.c      | 116 ++++++++++++++++++
 net/xdp_flow/xdp_flow_umh_blob.S |   7 ++
 11 files changed, 675 insertions(+)
 create mode 100644 net/xdp_flow/.gitignore
 create mode 100644 net/xdp_flow/Kconfig
 create mode 100644 net/xdp_flow/Makefile
 create mode 100644 net/xdp_flow/msgfmt.h
 create mode 100644 net/xdp_flow/xdp_flow.h
 create mode 100644 net/xdp_flow/xdp_flow_core.c
 create mode 100644 net/xdp_flow/xdp_flow_kern_mod.c
 create mode 100644 net/xdp_flow/xdp_flow_umh.c
 create mode 100644 net/xdp_flow/xdp_flow_umh_blob.S

diff --git a/net/Kconfig b/net/Kconfig
index 3101bfcb..369ecd0 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -206,6 +206,7 @@ source "net/bridge/netfilter/Kconfig"
 endif
 
 source "net/bpfilter/Kconfig"
+source "net/xdp_flow/Kconfig"
 
 source "net/dccp/Kconfig"
 source "net/sctp/Kconfig"
diff --git a/net/Makefile b/net/Makefile
index 449fc0b..b78d1ef 100644
--- a/net/Makefile
+++ b/net/Makefile
@@ -87,3 +87,4 @@ endif
 obj-$(CONFIG_QRTR)		+= qrtr/
 obj-$(CONFIG_NET_NCSI)		+= ncsi/
 obj-$(CONFIG_XDP_SOCKETS)	+= xdp/
+obj-$(CONFIG_XDP_FLOW)		+= xdp_flow/
diff --git a/net/xdp_flow/.gitignore b/net/xdp_flow/.gitignore
new file mode 100644
index 0000000..8cad817
--- /dev/null
+++ b/net/xdp_flow/.gitignore
@@ -0,0 +1 @@
+xdp_flow_umh
diff --git a/net/xdp_flow/Kconfig b/net/xdp_flow/Kconfig
new file mode 100644
index 0000000..a4d79fa
--- /dev/null
+++ b/net/xdp_flow/Kconfig
@@ -0,0 +1,16 @@
+# SPDX-License-Identifier: GPL-2.0-only
+menuconfig XDP_FLOW
+	bool "XDP based flow offload engine (XDP_FLOW)"
+	depends on NET && BPF_SYSCALL && MEMFD_CREATE
+	help
+	  This builds experimental xdp_flow framework that is aiming to
+	  provide flow software offload functionality via XDP
+
+if XDP_FLOW
+config XDP_FLOW_UMH
+	tristate "xdp_flow kernel module with user mode helper"
+	depends on $(success,$(srctree)/scripts/cc-can-link.sh $(CC))
+	default m
+	help
+	  This builds xdp_flow kernel module with embedded user mode helper
+endif
diff --git a/net/xdp_flow/Makefile b/net/xdp_flow/Makefile
new file mode 100644
index 0000000..f6138c2
--- /dev/null
+++ b/net/xdp_flow/Makefile
@@ -0,0 +1,31 @@
+# SPDX-License-Identifier: GPL-2.0
+
+obj-$(CONFIG_XDP_FLOW) += xdp_flow_core.o
+
+ifeq ($(CONFIG_XDP_FLOW_UMH), y)
+# builtin xdp_flow_umh should be compiled with -static
+# since rootfs isn't mounted at the time of __init
+# function is called and do_execv won't find elf interpreter
+STATIC := -static
+endif
+
+quiet_cmd_cc_user = CC      $@
+      cmd_cc_user = $(CC) -Wall -Wmissing-prototypes -O2 -std=gnu89 \
+		    -I$(srctree)/tools/include/ \
+		    -c -o $@ $<
+
+quiet_cmd_ld_user = LD      $@
+      cmd_ld_user = $(CC) $(STATIC) -o $@ $^
+
+$(obj)/xdp_flow_umh.o: $(src)/xdp_flow_umh.c FORCE
+	$(call if_changed,cc_user)
+
+$(obj)/xdp_flow_umh: $(obj)/xdp_flow_umh.o
+	$(call if_changed,ld_user)
+
+clean-files := xdp_flow_umh
+
+$(obj)/xdp_flow_umh_blob.o: $(obj)/xdp_flow_umh
+
+obj-$(CONFIG_XDP_FLOW_UMH) += xdp_flow.o
+xdp_flow-objs += xdp_flow_kern_mod.o xdp_flow_umh_blob.o
diff --git a/net/xdp_flow/msgfmt.h b/net/xdp_flow/msgfmt.h
new file mode 100644
index 0000000..97d8490
--- /dev/null
+++ b/net/xdp_flow/msgfmt.h
@@ -0,0 +1,102 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _NET_XDP_FLOW_MSGFMT_H
+#define _NET_XDP_FLOW_MSGFMT_H
+
+#include <linux/types.h>
+#include <linux/compiler.h>
+#include <linux/if_ether.h>
+#include <linux/in6.h>
+
+#define MAX_XDP_FLOW_ACTIONS 32
+
+enum xdp_flow_action_id {
+	/* ABORT if 0, i.e. uninitialized */
+	XDP_FLOW_ACTION_ACCEPT	= 1,
+	XDP_FLOW_ACTION_DROP,
+	XDP_FLOW_ACTION_REDIRECT,
+	XDP_FLOW_ACTION_VLAN_PUSH,
+	XDP_FLOW_ACTION_VLAN_POP,
+	XDP_FLOW_ACTION_VLAN_MANGLE,
+	XDP_FLOW_ACTION_MANGLE,
+	XDP_FLOW_ACTION_CSUM,
+	NR_XDP_FLOW_ACTION,
+};
+
+struct xdp_flow_action {
+	enum xdp_flow_action_id	id;
+	union {
+		int	ifindex;	/* REDIRECT */
+		struct {		/* VLAN */
+			__be16	proto;
+			__be16	tci;
+		} vlan;
+	};
+};
+
+struct xdp_flow_actions {
+	unsigned int num_actions;
+	struct xdp_flow_action actions[MAX_XDP_FLOW_ACTIONS];
+};
+
+struct xdp_flow_key {
+	struct {
+		__u8	dst[ETH_ALEN] __aligned(2);
+		__u8	src[ETH_ALEN] __aligned(2);
+		__be16	type;
+	} eth;
+	struct {
+		__be16	tpid;
+		__be16	tci;
+	} vlan;
+	struct {
+		__u8	proto;
+		__u8	ttl;
+		__u8	tos;
+		__u8	frag;
+	} ip;
+	union {
+		struct {
+			__be32	src;
+			__be32	dst;
+		} ipv4;
+		struct {
+			struct in6_addr	src;
+			struct in6_addr	dst;
+		} ipv6;
+	};
+	struct {
+		__be16	src;
+		__be16	dst;
+	} l4port;
+	struct {
+		__be16	flags;
+	} tcp;
+} __aligned(BITS_PER_LONG / 8);
+
+struct xdp_flow {
+	struct xdp_flow_key key;
+	struct xdp_flow_key mask;
+	struct xdp_flow_actions actions;
+	__u16 priority;
+};
+
+enum xdp_flow_cmd {
+	XDP_FLOW_CMD_NOOP		= 0,
+	XDP_FLOW_CMD_LOAD,
+	XDP_FLOW_CMD_UNLOAD,
+	XDP_FLOW_CMD_REPLACE,
+	XDP_FLOW_CMD_DELETE,
+};
+
+struct mbox_request {
+	int ifindex;
+	__u8 cmd;
+	struct xdp_flow flow;
+};
+
+struct mbox_reply {
+	int status;
+	__u32 id;
+};
+
+#endif
diff --git a/net/xdp_flow/xdp_flow.h b/net/xdp_flow/xdp_flow.h
new file mode 100644
index 0000000..656ceab
--- /dev/null
+++ b/net/xdp_flow/xdp_flow.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_XDP_FLOW_H
+#define _LINUX_XDP_FLOW_H
+
+#include <linux/netdevice.h>
+#include <linux/umh.h>
+#include <net/flow_offload.h>
+
+struct xdp_flow_umh_ops {
+	struct umh_info info;
+	/* serialize access to this object and UMH */
+	struct mutex lock;
+	flow_setup_cb_t *setup_cb;
+	int (*setup)(struct net_device *dev, bool do_bind,
+		     struct netlink_ext_ack *extack);
+	int (*start)(void);
+	bool stop;
+	struct module *module;
+};
+
+extern struct xdp_flow_umh_ops xdp_flow_ops;
+
+#endif
diff --git a/net/xdp_flow/xdp_flow_core.c b/net/xdp_flow/xdp_flow_core.c
new file mode 100644
index 0000000..8265aef
--- /dev/null
+++ b/net/xdp_flow/xdp_flow_core.c
@@ -0,0 +1,127 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/kmod.h>
+#include <linux/module.h>
+#include <linux/netdevice.h>
+#include "xdp_flow.h"
+
+struct xdp_flow_umh_ops xdp_flow_ops;
+EXPORT_SYMBOL_GPL(xdp_flow_ops);
+
+static LIST_HEAD(xdp_block_cb_list);
+
+static void xdp_flow_block_release(void *cb_priv)
+{
+	struct net_device *dev = cb_priv;
+	struct netlink_ext_ack extack;
+
+	mutex_lock(&xdp_flow_ops.lock);
+	xdp_flow_ops.setup(dev, false, &extack);
+	module_put(xdp_flow_ops.module);
+	mutex_unlock(&xdp_flow_ops.lock);
+}
+
+int xdp_flow_setup_block(struct net_device *dev, struct flow_block_offload *f)
+{
+	struct flow_block_cb *block_cb;
+	int err = 0;
+
+	/* TODO: Remove this limitation */
+	if (!net_eq(current->nsproxy->net_ns, &init_net))
+		return -EOPNOTSUPP;
+
+	if (f->binder_type != FLOW_BLOCK_BINDER_TYPE_CLSACT_INGRESS)
+		return -EOPNOTSUPP;
+
+	mutex_lock(&xdp_flow_ops.lock);
+	if (!xdp_flow_ops.module) {
+		mutex_unlock(&xdp_flow_ops.lock);
+		if (f->command == FLOW_BLOCK_UNBIND)
+			return -ENOENT;
+		err = request_module("xdp_flow");
+		if (err)
+			return err;
+		mutex_lock(&xdp_flow_ops.lock);
+		if (!xdp_flow_ops.module) {
+			err = -ECHILD;
+			goto out;
+		}
+	}
+	if (xdp_flow_ops.stop) {
+		err = xdp_flow_ops.start();
+		if (err)
+			goto out;
+	}
+
+	f->driver_block_list = &xdp_block_cb_list;
+
+	switch (f->command) {
+	case FLOW_BLOCK_BIND:
+		if (flow_block_cb_is_busy(xdp_flow_ops.setup_cb, dev,
+					  &xdp_block_cb_list)) {
+			err = -EBUSY;
+			goto out;
+		}
+
+		if (!try_module_get(xdp_flow_ops.module)) {
+			err = -ECHILD;
+			goto out;
+		}
+
+		err = xdp_flow_ops.setup(dev, true, f->extack);
+		if (err) {
+			module_put(xdp_flow_ops.module);
+			goto out;
+		}
+
+		block_cb = flow_block_cb_alloc(xdp_flow_ops.setup_cb, dev, dev,
+					       xdp_flow_block_release);
+		if (IS_ERR(block_cb)) {
+			xdp_flow_ops.setup(dev, false, f->extack);
+			module_put(xdp_flow_ops.module);
+			err = PTR_ERR(block_cb);
+			goto out;
+		}
+
+		flow_block_cb_add(block_cb, f);
+		list_add_tail(&block_cb->driver_list, &xdp_block_cb_list);
+		break;
+	case FLOW_BLOCK_UNBIND:
+		block_cb = flow_block_cb_lookup(f->block, xdp_flow_ops.setup_cb,
+						dev);
+		if (!block_cb) {
+			err = -ENOENT;
+			goto out;
+		}
+
+		flow_block_cb_remove(block_cb, f);
+		list_del(&block_cb->driver_list);
+		break;
+	default:
+		err = -EOPNOTSUPP;
+	}
+out:
+	mutex_unlock(&xdp_flow_ops.lock);
+
+	return err;
+}
+
+static void xdp_flow_umh_cleanup(struct umh_info *info)
+{
+	mutex_lock(&xdp_flow_ops.lock);
+	xdp_flow_ops.stop = true;
+	fput(info->pipe_to_umh);
+	fput(info->pipe_from_umh);
+	info->pid = 0;
+	mutex_unlock(&xdp_flow_ops.lock);
+}
+
+static int __init xdp_flow_init(void)
+{
+	mutex_init(&xdp_flow_ops.lock);
+	xdp_flow_ops.stop = true;
+	xdp_flow_ops.info.cmdline = "xdp_flow_umh";
+	xdp_flow_ops.info.cleanup = &xdp_flow_umh_cleanup;
+
+	return 0;
+}
+device_initcall(xdp_flow_init);
diff --git a/net/xdp_flow/xdp_flow_kern_mod.c b/net/xdp_flow/xdp_flow_kern_mod.c
new file mode 100644
index 0000000..14e06ee
--- /dev/null
+++ b/net/xdp_flow/xdp_flow_kern_mod.c
@@ -0,0 +1,250 @@
+// SPDX-License-Identifier: GPL-2.0
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include <linux/module.h>
+#include <linux/umh.h>
+#include <linux/sched/signal.h>
+#include <linux/rtnetlink.h>
+#include "xdp_flow.h"
+#include "msgfmt.h"
+
+extern char xdp_flow_umh_start;
+extern char xdp_flow_umh_end;
+
+static void shutdown_umh(void)
+{
+	struct task_struct *tsk;
+
+	if (xdp_flow_ops.stop)
+		return;
+
+	tsk = get_pid_task(find_vpid(xdp_flow_ops.info.pid), PIDTYPE_PID);
+	if (tsk) {
+		send_sig(SIGKILL, tsk, 1);
+		put_task_struct(tsk);
+	}
+}
+
+static int transact_umh(struct mbox_request *req, u32 *id)
+{
+	struct mbox_reply reply;
+	int ret = -EFAULT;
+	loff_t pos;
+	ssize_t n;
+
+	if (!xdp_flow_ops.info.pid)
+		goto out;
+
+	n = __kernel_write(xdp_flow_ops.info.pipe_to_umh, req, sizeof(*req),
+			   &pos);
+	if (n != sizeof(*req)) {
+		pr_err("write fail %zd\n", n);
+		shutdown_umh();
+		goto out;
+	}
+
+	pos = 0;
+	n = kernel_read(xdp_flow_ops.info.pipe_from_umh, &reply,
+			sizeof(reply), &pos);
+	if (n != sizeof(reply)) {
+		pr_err("read fail %zd\n", n);
+		shutdown_umh();
+		goto out;
+	}
+
+	ret = reply.status;
+	if (id)
+		*id = reply.id;
+out:
+	return ret;
+}
+
+static int xdp_flow_replace(struct net_device *dev, struct flow_cls_offload *f)
+{
+	return -EOPNOTSUPP;
+}
+
+static int xdp_flow_destroy(struct net_device *dev, struct flow_cls_offload *f)
+{
+	return -EOPNOTSUPP;
+}
+
+static int xdp_flow_setup_flower(struct net_device *dev,
+				 struct flow_cls_offload *f)
+{
+	switch (f->command) {
+	case FLOW_CLS_REPLACE:
+		return xdp_flow_replace(dev, f);
+	case FLOW_CLS_DESTROY:
+		return xdp_flow_destroy(dev, f);
+	case FLOW_CLS_STATS:
+	case FLOW_CLS_TMPLT_CREATE:
+	case FLOW_CLS_TMPLT_DESTROY:
+	default:
+		return -EOPNOTSUPP;
+	}
+}
+
+static int xdp_flow_setup_block_cb(enum tc_setup_type type, void *type_data,
+				   void *cb_priv)
+{
+	struct flow_cls_common_offload *common = type_data;
+	struct net_device *dev = cb_priv;
+	int err = 0;
+
+	if (common->chain_index) {
+		NL_SET_ERR_MSG_MOD(common->extack,
+				   "Supports only offload of chain 0");
+		return -EOPNOTSUPP;
+	}
+
+	if (type != TC_SETUP_CLSFLOWER)
+		return -EOPNOTSUPP;
+
+	mutex_lock(&xdp_flow_ops.lock);
+	if (xdp_flow_ops.stop) {
+		err = xdp_flow_ops.start();
+		if (err)
+			goto out;
+	}
+
+	err = xdp_flow_setup_flower(dev, type_data);
+out:
+	mutex_unlock(&xdp_flow_ops.lock);
+	return err;
+}
+
+static int xdp_flow_setup_bind(struct net_device *dev,
+			       struct netlink_ext_ack *extack)
+{
+	struct mbox_request *req;
+	u32 id = 0;
+	int err;
+
+	req = kzalloc(sizeof(*req), GFP_KERNEL);
+	if (!req)
+		return -ENOMEM;
+
+	req->cmd = XDP_FLOW_CMD_LOAD;
+	req->ifindex = dev->ifindex;
+
+	/* Load bpf in UMH and get prog id */
+	err = transact_umh(req, &id);
+
+	/* TODO: id will be used to attach bpf prog to XDP
+	 * As we have rtnl_lock, UMH cannot attach prog to XDP
+	 */
+
+	kfree(req);
+
+	return err;
+}
+
+static int xdp_flow_setup_unbind(struct net_device *dev,
+				 struct netlink_ext_ack *extack)
+{
+	struct mbox_request *req;
+	int err;
+
+	req = kzalloc(sizeof(*req), GFP_KERNEL);
+	if (!req)
+		return -ENOMEM;
+
+	req->cmd = XDP_FLOW_CMD_UNLOAD;
+	req->ifindex = dev->ifindex;
+
+	err = transact_umh(req, NULL);
+
+	kfree(req);
+
+	return err;
+}
+
+static int xdp_flow_setup(struct net_device *dev, bool do_bind,
+			  struct netlink_ext_ack *extack)
+{
+	ASSERT_RTNL();
+
+	if (!net_eq(dev_net(dev), &init_net))
+		return -EINVAL;
+
+	return do_bind ?
+		xdp_flow_setup_bind(dev, extack) :
+		xdp_flow_setup_unbind(dev, extack);
+}
+
+static int xdp_flow_test(void)
+{
+	struct mbox_request *req;
+	int err;
+
+	req = kzalloc(sizeof(*req), GFP_KERNEL);
+	if (!req)
+		return -ENOMEM;
+
+	req->cmd = XDP_FLOW_CMD_NOOP;
+	err = transact_umh(req, NULL);
+
+	kfree(req);
+
+	return err;
+}
+
+static int start_umh(void)
+{
+	int err;
+
+	/* fork usermode process */
+	err = fork_usermode_blob(&xdp_flow_umh_start,
+				 &xdp_flow_umh_end - &xdp_flow_umh_start,
+				 &xdp_flow_ops.info);
+	if (err)
+		return err;
+
+	xdp_flow_ops.stop = false;
+	pr_info("Loaded xdp_flow_umh pid %d\n", xdp_flow_ops.info.pid);
+
+	/* health check that usermode process started correctly */
+	if (xdp_flow_test()) {
+		shutdown_umh();
+		return -EFAULT;
+	}
+
+	return 0;
+}
+
+static int __init load_umh(void)
+{
+	int err = 0;
+
+	mutex_lock(&xdp_flow_ops.lock);
+	if (!xdp_flow_ops.stop) {
+		err = -EFAULT;
+		goto err;
+	}
+
+	err = start_umh();
+	if (err)
+		goto err;
+
+	xdp_flow_ops.setup_cb = &xdp_flow_setup_block_cb;
+	xdp_flow_ops.setup = &xdp_flow_setup;
+	xdp_flow_ops.start = &start_umh;
+	xdp_flow_ops.module = THIS_MODULE;
+err:
+	mutex_unlock(&xdp_flow_ops.lock);
+	return err;
+}
+
+static void __exit fini_umh(void)
+{
+	mutex_lock(&xdp_flow_ops.lock);
+	shutdown_umh();
+	xdp_flow_ops.module = NULL;
+	xdp_flow_ops.start = NULL;
+	xdp_flow_ops.setup = NULL;
+	xdp_flow_ops.setup_cb = NULL;
+	mutex_unlock(&xdp_flow_ops.lock);
+}
+module_init(load_umh);
+module_exit(fini_umh);
+MODULE_LICENSE("GPL");
diff --git a/net/xdp_flow/xdp_flow_umh.c b/net/xdp_flow/xdp_flow_umh.c
new file mode 100644
index 0000000..c642b5b
--- /dev/null
+++ b/net/xdp_flow/xdp_flow_umh.c
@@ -0,0 +1,116 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include <errno.h>
+#include <stdio.h>
+#include <string.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <syslog.h>
+#include "msgfmt.h"
+
+FILE *kmsg;
+
+#define pr_log(fmt, prio, ...) fprintf(kmsg, "<%d>xdp_flow_umh: " fmt, \
+				       LOG_DAEMON | (prio), ##__VA_ARGS__)
+#ifdef DEBUG
+#define pr_debug(fmt, ...) pr_log(fmt, LOG_DEBUG, ##__VA_ARGS__)
+#else
+#define pr_debug(fmt, ...) do {} while (0)
+#endif
+#define pr_info(fmt, ...) pr_log(fmt, LOG_INFO, ##__VA_ARGS__)
+#define pr_warn(fmt, ...) pr_log(fmt, LOG_WARNING, ##__VA_ARGS__)
+#define pr_err(fmt, ...) pr_log(fmt, LOG_ERR, ##__VA_ARGS__)
+
+static int handle_load(const struct mbox_request *req, __u32 *prog_id)
+{
+	*prog_id = 0;
+
+	return 0;
+}
+
+static int handle_unload(const struct mbox_request *req)
+{
+	return 0;
+}
+
+static int handle_replace(struct mbox_request *req)
+{
+	return -EOPNOTSUPP;
+}
+
+static int handle_delete(const struct mbox_request *req)
+{
+	return -EOPNOTSUPP;
+}
+
+static void loop(void)
+{
+	struct mbox_request *req;
+
+	req = malloc(sizeof(struct mbox_request));
+	if (!req) {
+		pr_err("Memory allocation for mbox_request failed\n");
+		return;
+	}
+
+	while (1) {
+		struct mbox_reply reply;
+		int n;
+
+		n = read(0, req, sizeof(*req));
+		if (n < 0) {
+			pr_err("read for mbox_request failed: %s\n",
+			       strerror(errno));
+			break;
+		}
+		if (n != sizeof(*req)) {
+			pr_err("Invalid request size %d\n", n);
+			break;
+		}
+
+		switch (req->cmd) {
+		case XDP_FLOW_CMD_NOOP:
+			reply.status = 0;
+			break;
+		case XDP_FLOW_CMD_LOAD:
+			reply.status = handle_load(req, &reply.id);
+			break;
+		case XDP_FLOW_CMD_UNLOAD:
+			reply.status = handle_unload(req);
+			break;
+		case XDP_FLOW_CMD_REPLACE:
+			reply.status = handle_replace(req);
+			break;
+		case XDP_FLOW_CMD_DELETE:
+			reply.status = handle_delete(req);
+			break;
+		default:
+			pr_err("Invalid command %d\n", req->cmd);
+			reply.status = -EOPNOTSUPP;
+		}
+
+		n = write(1, &reply, sizeof(reply));
+		if (n < 0) {
+			pr_err("write for mbox_reply failed: %s\n",
+			       strerror(errno));
+			break;
+		}
+		if (n != sizeof(reply)) {
+			pr_err("reply written too short: %d\n", n);
+			break;
+		}
+	}
+
+	free(req);
+}
+
+int main(void)
+{
+	kmsg = fopen("/dev/kmsg", "a");
+	setvbuf(kmsg, NULL, _IONBF, 0);
+	pr_info("Started xdp_flow\n");
+	loop();
+	fclose(kmsg);
+
+	return 0;
+}
diff --git a/net/xdp_flow/xdp_flow_umh_blob.S b/net/xdp_flow/xdp_flow_umh_blob.S
new file mode 100644
index 0000000..6edcb0e
--- /dev/null
+++ b/net/xdp_flow/xdp_flow_umh_blob.S
@@ -0,0 +1,7 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+	.section .rodata, "a"
+	.global xdp_flow_umh_start
+xdp_flow_umh_start:
+	.incbin "net/xdp_flow/xdp_flow_umh"
+	.global xdp_flow_umh_end
+xdp_flow_umh_end:
-- 
1.8.3.1



* [RFC PATCH v2 bpf-next 02/15] xdp_flow: Add skeleton bpf program for XDP
  2019-10-18  4:07 [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP Toshiaki Makita
  2019-10-18  4:07 ` [RFC PATCH v2 bpf-next 01/15] xdp_flow: Add skeleton of XDP based flow offload driver Toshiaki Makita
@ 2019-10-18  4:07 ` Toshiaki Makita
  2019-10-18  4:07 ` [RFC PATCH v2 bpf-next 03/15] bpf: Add API to get program from id Toshiaki Makita
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 58+ messages in thread
From: Toshiaki Makita @ 2019-10-18  4:07 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, Pablo Neira Ayuso, Jozsef Kadlecsik,
	Florian Westphal, Pravin B Shelar
  Cc: Toshiaki Makita, netdev, bpf, William Tu, Stanislav Fomichev

The program is meant to be loaded when a device is bound to an ingress
flow block, and should be attached to XDP on the device. Typically this
happens when a TC ingress or clsact qdisc is added, or when an offloaded
nftables chain is added.

The program is prebuilt and embedded in the UMH instead of being
generated dynamically. This is because TC filters change frequently when
used by OVS, and the latency of each filter change would directly affect
datapath latency.

Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
---
 net/xdp_flow/Makefile                 |  87 +++++++++++-
 net/xdp_flow/xdp_flow_kern_bpf.c      |  12 ++
 net/xdp_flow/xdp_flow_kern_bpf_blob.S |   7 +
 net/xdp_flow/xdp_flow_umh.c           | 243 +++++++++++++++++++++++++++++++++-
 4 files changed, 345 insertions(+), 4 deletions(-)
 create mode 100644 net/xdp_flow/xdp_flow_kern_bpf.c
 create mode 100644 net/xdp_flow/xdp_flow_kern_bpf_blob.S

diff --git a/net/xdp_flow/Makefile b/net/xdp_flow/Makefile
index f6138c2..057cc6a 100644
--- a/net/xdp_flow/Makefile
+++ b/net/xdp_flow/Makefile
@@ -2,25 +2,106 @@
 
 obj-$(CONFIG_XDP_FLOW) += xdp_flow_core.o
 
+XDP_FLOW_PATH ?= $(abspath $(srctree)/$(src))
+TOOLS_PATH := $(XDP_FLOW_PATH)/../../tools
+
+# Libbpf dependencies
+LIBBPF = $(TOOLS_PATH)/lib/bpf/libbpf.a
+
+LLC ?= llc
+CLANG ?= clang
+LLVM_OBJCOPY ?= llvm-objcopy
+BTF_PAHOLE ?= pahole
+
+ifdef CROSS_COMPILE
+CLANG_ARCH_ARGS = -target $(ARCH)
+endif
+
+BTF_LLC_PROBE := $(shell $(LLC) -march=bpf -mattr=help 2>&1 | grep dwarfris)
+BTF_PAHOLE_PROBE := $(shell $(BTF_PAHOLE) --help 2>&1 | grep BTF)
+BTF_OBJCOPY_PROBE := $(shell $(LLVM_OBJCOPY) --help 2>&1 | grep -i 'usage.*llvm')
+BTF_LLVM_PROBE := $(shell echo "int main() { return 0; }" | \
+			  $(CLANG) -target bpf -O2 -g -c -x c - -o ./llvm_btf_verify.o; \
+			  readelf -S ./llvm_btf_verify.o | grep BTF; \
+			  /bin/rm -f ./llvm_btf_verify.o)
+
+ifneq ($(BTF_LLVM_PROBE),)
+	EXTRA_CFLAGS += -g
+else
+ifneq ($(and $(BTF_LLC_PROBE),$(BTF_PAHOLE_PROBE),$(BTF_OBJCOPY_PROBE)),)
+	EXTRA_CFLAGS += -g
+	LLC_FLAGS += -mattr=dwarfris
+	DWARF2BTF = y
+endif
+endif
+
+$(LIBBPF): FORCE
+# Fix up variables inherited from Kbuild that tools/ build system won't like
+	$(MAKE) -C $(dir $@) RM='rm -rf' LDFLAGS= srctree=$(XDP_FLOW_PATH)/../../ O=
+
+# Verify LLVM compiler tools are available and bpf target is supported by llc
+.PHONY: verify_cmds verify_target_bpf $(CLANG) $(LLC)
+
+verify_cmds: $(CLANG) $(LLC)
+	@for TOOL in $^ ; do \
+		if ! (which -- "$${TOOL}" > /dev/null 2>&1); then \
+			echo "*** ERROR: Cannot find LLVM tool $${TOOL}" ;\
+			exit 1; \
+		else true; fi; \
+	done
+
+verify_target_bpf: verify_cmds
+	@if ! (${LLC} -march=bpf -mattr=help > /dev/null 2>&1); then \
+		echo "*** ERROR: LLVM (${LLC}) does not support 'bpf' target" ;\
+		echo "   NOTICE: LLVM version >= 3.7.1 required" ;\
+		exit 2; \
+	else true; fi
+
+$(src)/xdp_flow_kern_bpf.c: verify_target_bpf
+
+$(obj)/xdp_flow_kern_bpf.o: $(src)/xdp_flow_kern_bpf.c FORCE
+	@echo "  CLANG-bpf " $@
+	$(Q)$(CLANG) $(NOSTDINC_FLAGS) $(LINUXINCLUDE) $(EXTRA_CFLAGS) -I$(obj) \
+		-I$(srctree)/tools/lib/bpf/ \
+		-D__KERNEL__ -D__BPF_TRACING__ -Wno-unused-value -Wno-pointer-sign \
+		-D__TARGET_ARCH_$(SRCARCH) -Wno-compare-distinct-pointer-types \
+		-Wno-gnu-variable-sized-type-not-at-end \
+		-Wno-address-of-packed-member -Wno-tautological-compare \
+		-Wno-unknown-warning-option $(CLANG_ARCH_ARGS) \
+		-I$(srctree)/samples/bpf/ -include asm_goto_workaround.h \
+		-O2 -emit-llvm -c $< -o -| $(LLC) -march=bpf $(LLC_FLAGS) -filetype=obj -o $@
+ifeq ($(DWARF2BTF),y)
+	$(BTF_PAHOLE) -J $@
+endif
+
 ifeq ($(CONFIG_XDP_FLOW_UMH), y)
 # builtin xdp_flow_umh should be compiled with -static
 # since rootfs isn't mounted at the time of __init
 # function is called and do_execv won't find elf interpreter
 STATIC := -static
+STATICLDLIBS := -lz
 endif
 
+quiet_cmd_as_user = AS      $@
+      cmd_as_user = $(AS) -c -o $@ $<
+
 quiet_cmd_cc_user = CC      $@
       cmd_cc_user = $(CC) -Wall -Wmissing-prototypes -O2 -std=gnu89 \
-		    -I$(srctree)/tools/include/ \
+		    -I$(srctree)/tools/lib/ -I$(srctree)/tools/include/ \
 		    -c -o $@ $<
 
 quiet_cmd_ld_user = LD      $@
-      cmd_ld_user = $(CC) $(STATIC) -o $@ $^
+      cmd_ld_user = $(CC) $(STATIC) -o $@ $^ $(LIBBPF) -lelf $(STATICLDLIBS)
+
+$(obj)/xdp_flow_kern_bpf_blob.o: $(src)/xdp_flow_kern_bpf_blob.S \
+				 $(obj)/xdp_flow_kern_bpf.o
+	$(call if_changed,as_user)
 
 $(obj)/xdp_flow_umh.o: $(src)/xdp_flow_umh.c FORCE
 	$(call if_changed,cc_user)
 
-$(obj)/xdp_flow_umh: $(obj)/xdp_flow_umh.o
+$(obj)/xdp_flow_umh: $(obj)/xdp_flow_umh.o $(LIBBPF) \
+		     $(obj)/xdp_flow_kern_bpf_blob.o
 	$(call if_changed,ld_user)
 
 clean-files := xdp_flow_umh
diff --git a/net/xdp_flow/xdp_flow_kern_bpf.c b/net/xdp_flow/xdp_flow_kern_bpf.c
new file mode 100644
index 0000000..74cdb1d
--- /dev/null
+++ b/net/xdp_flow/xdp_flow_kern_bpf.c
@@ -0,0 +1,12 @@
+// SPDX-License-Identifier: GPL-2.0
+#define KBUILD_MODNAME "foo"
+#include <uapi/linux/bpf.h>
+#include <bpf_helpers.h>
+
+SEC("xdp_flow")
+int xdp_flow_prog(struct xdp_md *ctx)
+{
+	return XDP_PASS;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/net/xdp_flow/xdp_flow_kern_bpf_blob.S b/net/xdp_flow/xdp_flow_kern_bpf_blob.S
new file mode 100644
index 0000000..d180c1b
--- /dev/null
+++ b/net/xdp_flow/xdp_flow_kern_bpf_blob.S
@@ -0,0 +1,7 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+	.section .rodata, "a"
+	.global xdp_flow_bpf_start
+xdp_flow_bpf_start:
+	.incbin "net/xdp_flow/xdp_flow_kern_bpf.o"
+	.global xdp_flow_bpf_end
+xdp_flow_bpf_end:
diff --git a/net/xdp_flow/xdp_flow_umh.c b/net/xdp_flow/xdp_flow_umh.c
index c642b5b..85c5c7b 100644
--- a/net/xdp_flow/xdp_flow_umh.c
+++ b/net/xdp_flow/xdp_flow_umh.c
@@ -6,8 +6,18 @@
 #include <stdlib.h>
 #include <unistd.h>
 #include <syslog.h>
+#include <bpf/libbpf.h>
+#include <bpf/bpf.h>
+#include <sys/mman.h>
+#include <sys/types.h>
+#include <sys/resource.h>
+#include <linux/hashtable.h>
+#include <linux/err.h>
 #include "msgfmt.h"
 
+extern char xdp_flow_bpf_start;
+extern char xdp_flow_bpf_end;
+int progfile_fd;
 FILE *kmsg;
 
 #define pr_log(fmt, prio, ...) fprintf(kmsg, "<%d>xdp_flow_umh: " fmt, \
@@ -21,15 +31,241 @@
 #define pr_warn(fmt, ...) pr_log(fmt, LOG_WARNING, ##__VA_ARGS__)
 #define pr_err(fmt, ...) pr_log(fmt, LOG_ERR, ##__VA_ARGS__)
 
+#define ERRBUF_SIZE 64
+
+/* This key represents a net device */
+struct netdev_info_key {
+	int ifindex;
+};
+
+struct netdev_info {
+	struct netdev_info_key key;
+	struct hlist_node node;
+	struct bpf_object *obj;
+};
+
+DEFINE_HASHTABLE(netdev_info_table, 16);
+
+static int libbpf_err(int err, char *errbuf)
+{
+	libbpf_strerror(err, errbuf, ERRBUF_SIZE);
+
+	if (-err < __LIBBPF_ERRNO__START)
+		return err;
+
+	return -EINVAL;
+}
+
+static int setup(void)
+{
+	size_t size = &xdp_flow_bpf_end - &xdp_flow_bpf_start;
+	struct rlimit r = { RLIM_INFINITY, RLIM_INFINITY };
+	ssize_t len;
+	int err;
+
+	if (setrlimit(RLIMIT_MEMLOCK, &r)) {
+		err = -errno;
+		pr_err("setrlimit MEMLOCK failed: %s\n", strerror(errno));
+		return err;
+	}
+
+	progfile_fd = memfd_create("xdp_flow_kern_bpf.o", 0);
+	if (progfile_fd < 0) {
+		err = -errno;
+		pr_err("memfd_create failed: %s\n", strerror(errno));
+		return err;
+	}
+
+	len = write(progfile_fd, &xdp_flow_bpf_start, size);
+	if (len < 0) {
+		err = -errno;
+		pr_err("Failed to write bpf prog: %s\n", strerror(errno));
+		goto err;
+	}
+
+	if (len < size) {
+		pr_err("bpf prog written too short: expected %zu, actual %zd\n",
+		       size, len);
+		err = -EIO;
+		goto err;
+	}
+
+	return 0;
+err:
+	close(progfile_fd);
+
+	return err;
+}
+
+static int load_bpf(int ifindex, struct bpf_object **objp)
+{
+	struct bpf_object_open_attr attr = {};
+	char path[256], errbuf[ERRBUF_SIZE];
+	struct bpf_program *prog;
+	struct bpf_object *obj;
+	int prog_fd, err;
+	ssize_t len;
+
+	len = snprintf(path, 256, "/proc/self/fd/%d", progfile_fd);
+	if (len < 0) {
+		err = -errno;
+		pr_err("Failed to setup prog fd path string: %s\n",
+		       strerror(errno));
+		return err;
+	}
+
+	attr.file = path;
+	attr.prog_type = BPF_PROG_TYPE_XDP;
+	obj = bpf_object__open_xattr(&attr);
+	if (IS_ERR_OR_NULL(obj)) {
+		if (IS_ERR(obj)) {
+			err = libbpf_err((int)PTR_ERR(obj), errbuf);
+		} else {
+			err = -ENOENT;
+			strerror_r(-err, errbuf, sizeof(errbuf));
+		}
+		pr_err("Cannot open bpf prog: %s\n", errbuf);
+		return err;
+	}
+
+	bpf_object__for_each_program(prog, obj)
+		bpf_program__set_type(prog, attr.prog_type);
+
+	err = bpf_object__load(obj);
+	if (err) {
+		err = libbpf_err(err, errbuf);
+		pr_err("Failed to load bpf prog: %s\n", errbuf);
+		goto err;
+	}
+
+	prog = bpf_object__find_program_by_title(obj, "xdp_flow");
+	if (!prog) {
+		pr_err("Cannot find xdp_flow program\n");
+		err = -ENOENT;
+		goto err;
+	}
+
+	prog_fd = bpf_program__fd(prog);
+	if (prog_fd < 0) {
+		err = libbpf_err(prog_fd, errbuf);
+		pr_err("Invalid program fd: %s\n", errbuf);
+		goto err;
+	}
+
+	*objp = obj;
+
+	return prog_fd;
+err:
+	bpf_object__close(obj);
+	return err;
+}
+
+static int get_netdev_info_keyval(const struct netdev_info_key *key)
+{
+	return key->ifindex;
+}
+
+static struct netdev_info *find_netdev_info(const struct netdev_info_key *key)
+{
+	int keyval = get_netdev_info_keyval(key);
+	struct netdev_info *netdev_info;
+
+	hash_for_each_possible(netdev_info_table, netdev_info, node, keyval) {
+		if (netdev_info->key.ifindex == key->ifindex)
+			return netdev_info;
+	}
+
+	return NULL;
+}
+
+static int get_netdev_info_key(const struct mbox_request *req,
+			       struct netdev_info_key *key)
+{
+	key->ifindex = req->ifindex;
+
+	return 0;
+}
+
+static struct netdev_info *get_netdev_info(const struct mbox_request *req)
+{
+	struct netdev_info *netdev_info;
+	struct netdev_info_key key;
+	int err;
+
+	err = get_netdev_info_key(req, &key);
+	if (err)
+		return ERR_PTR(err);
+
+	netdev_info = find_netdev_info(&key);
+	if (!netdev_info) {
+		pr_err("BUG: netdev_info for if %d not found.\n",
+		       key.ifindex);
+		return ERR_PTR(-ENOENT);
+	}
+
+	return netdev_info;
+}
+
 static int handle_load(const struct mbox_request *req, __u32 *prog_id)
 {
-	*prog_id = 0;
+	struct netdev_info *netdev_info;
+	struct bpf_prog_info info = {};
+	struct netdev_info_key key;
+	__u32 len = sizeof(info);
+	int err, prog_fd;
+
+	err = get_netdev_info_key(req, &key);
+	if (err)
+		return err;
+
+	netdev_info = find_netdev_info(&key);
+	if (netdev_info)
+		return 0;
+
+	netdev_info = malloc(sizeof(*netdev_info));
+	if (!netdev_info) {
+		pr_err("malloc for netdev_info failed.\n");
+		return -ENOMEM;
+	}
+	netdev_info->key.ifindex = key.ifindex;
+
+	prog_fd = load_bpf(req->ifindex, &netdev_info->obj);
+	if (prog_fd < 0) {
+		err = prog_fd;
+		goto err_netdev_info;
+	}
+
+	err = bpf_obj_get_info_by_fd(prog_fd, &info, &len);
+	if (err)
+		goto err_obj;
+
+	*prog_id = info.id;
+	hash_add(netdev_info_table, &netdev_info->node,
+		 get_netdev_info_keyval(&netdev_info->key));
+	pr_debug("XDP program for if %d was loaded\n", req->ifindex);
 
 	return 0;
+err_obj:
+	bpf_object__close(netdev_info->obj);
+err_netdev_info:
+	free(netdev_info);
+
+	return err;
 }
 
 static int handle_unload(const struct mbox_request *req)
 {
+	struct netdev_info *netdev_info;
+
+	netdev_info = get_netdev_info(req);
+	if (IS_ERR(netdev_info))
+		return PTR_ERR(netdev_info);
+
+	hash_del(&netdev_info->node);
+	bpf_object__close(netdev_info->obj);
+	free(netdev_info);
+	pr_debug("XDP program for if %d was closed\n", req->ifindex);
+
 	return 0;
 }
 
@@ -109,7 +345,12 @@ int main(void)
 	kmsg = fopen("/dev/kmsg", "a");
 	setvbuf(kmsg, NULL, _IONBF, 0);
 	pr_info("Started xdp_flow\n");
+	if (setup()) {
+		fclose(kmsg);
+		return -1;
+	}
 	loop();
+	close(progfile_fd);
 	fclose(kmsg);
 
 	return 0;
-- 
1.8.3.1


* [RFC PATCH v2 bpf-next 03/15] bpf: Add API to get program from id
  2019-10-18  4:07 [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP Toshiaki Makita
  2019-10-18  4:07 ` [RFC PATCH v2 bpf-next 01/15] xdp_flow: Add skeleton of XDP based flow offload driver Toshiaki Makita
  2019-10-18  4:07 ` [RFC PATCH v2 bpf-next 02/15] xdp_flow: Add skeleton bpf program for XDP Toshiaki Makita
@ 2019-10-18  4:07 ` Toshiaki Makita
  2019-10-18  4:07 ` [RFC PATCH v2 bpf-next 04/15] xdp: Export dev_check_xdp and dev_change_xdp Toshiaki Makita
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 58+ messages in thread
From: Toshiaki Makita @ 2019-10-18  4:07 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, Pablo Neira Ayuso, Jozsef Kadlecsik,
	Florian Westphal, Pravin B Shelar
  Cc: Toshiaki Makita, netdev, bpf, William Tu, Stanislav Fomichev

Factor out the logic in bpf_prog_get_fd_by_id() and add
bpf_prog_get_by_id()/bpf_prog_get_type_dev_by_id().
Also export bpf_prog_get_type_dev_by_id(), which the following commit
will use to get a bpf prog from its id.

Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
---
 include/linux/bpf.h  |  8 ++++++++
 kernel/bpf/syscall.c | 42 ++++++++++++++++++++++++++++++++++--------
 2 files changed, 42 insertions(+), 8 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 282e28b..78fe7ef 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -636,6 +636,8 @@ int bpf_prog_array_copy(struct bpf_prog_array *old_array,
 struct bpf_prog *bpf_prog_get(u32 ufd);
 struct bpf_prog *bpf_prog_get_type_dev(u32 ufd, enum bpf_prog_type type,
 				       bool attach_drv);
+struct bpf_prog *bpf_prog_get_type_dev_by_id(u32 id, enum bpf_prog_type type,
+					     bool attach_drv);
 struct bpf_prog * __must_check bpf_prog_add(struct bpf_prog *prog, int i);
 void bpf_prog_sub(struct bpf_prog *prog, int i);
 struct bpf_prog * __must_check bpf_prog_inc(struct bpf_prog *prog);
@@ -760,6 +762,12 @@ static inline struct bpf_prog *bpf_prog_get_type_dev(u32 ufd,
 	return ERR_PTR(-EOPNOTSUPP);
 }
 
+static inline struct bpf_prog *
+bpf_prog_get_type_dev_by_id(u32 id, enum bpf_prog_type type, bool attach_drv)
+{
+	return ERR_PTR(-EOPNOTSUPP);
+}
+
 static inline struct bpf_prog * __must_check bpf_prog_add(struct bpf_prog *prog,
 							  int i)
 {
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 82eabd4..2dd6cfc 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2139,6 +2139,39 @@ static int bpf_obj_get_next_id(const union bpf_attr *attr,
 	return err;
 }
 
+static struct bpf_prog *bpf_prog_get_by_id(u32 id)
+{
+	struct bpf_prog *prog;
+
+	spin_lock_bh(&prog_idr_lock);
+	prog = idr_find(&prog_idr, id);
+	if (prog)
+		prog = bpf_prog_inc_not_zero(prog);
+	else
+		prog = ERR_PTR(-ENOENT);
+	spin_unlock_bh(&prog_idr_lock);
+
+	return prog;
+}
+
+struct bpf_prog *bpf_prog_get_type_dev_by_id(u32 id, enum bpf_prog_type type,
+					     bool attach_drv)
+{
+	struct bpf_prog *prog;
+
+	prog = bpf_prog_get_by_id(id);
+	if (IS_ERR(prog))
+		return prog;
+
+	if (!bpf_prog_get_ok(prog, &type, attach_drv)) {
+		bpf_prog_put(prog);
+		return ERR_PTR(-EINVAL);
+	}
+
+	return prog;
+}
+EXPORT_SYMBOL_GPL(bpf_prog_get_type_dev_by_id);
+
 #define BPF_PROG_GET_FD_BY_ID_LAST_FIELD prog_id
 
 static int bpf_prog_get_fd_by_id(const union bpf_attr *attr)
@@ -2153,14 +2186,7 @@ static int bpf_prog_get_fd_by_id(const union bpf_attr *attr)
 	if (!capable(CAP_SYS_ADMIN))
 		return -EPERM;
 
-	spin_lock_bh(&prog_idr_lock);
-	prog = idr_find(&prog_idr, id);
-	if (prog)
-		prog = bpf_prog_inc_not_zero(prog);
-	else
-		prog = ERR_PTR(-ENOENT);
-	spin_unlock_bh(&prog_idr_lock);
-
+	prog = bpf_prog_get_by_id(id);
 	if (IS_ERR(prog))
 		return PTR_ERR(prog);
 
-- 
1.8.3.1


* [RFC PATCH v2 bpf-next 04/15] xdp: Export dev_check_xdp and dev_change_xdp
  2019-10-18  4:07 [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP Toshiaki Makita
                   ` (2 preceding siblings ...)
  2019-10-18  4:07 ` [RFC PATCH v2 bpf-next 03/15] bpf: Add API to get program from id Toshiaki Makita
@ 2019-10-18  4:07 ` Toshiaki Makita
  2019-10-18  4:07 ` [RFC PATCH v2 bpf-next 05/15] xdp_flow: Attach bpf prog to XDP in kernel after UMH loaded program Toshiaki Makita
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 58+ messages in thread
From: Toshiaki Makita @ 2019-10-18  4:07 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, Pablo Neira Ayuso, Jozsef Kadlecsik,
	Florian Westphal, Pravin B Shelar
  Cc: Toshiaki Makita, netdev, bpf, William Tu, Stanislav Fomichev

Factor out the check and change logic from dev_change_xdp_fd(),
and export them for the following commit.

Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
---
 include/linux/netdevice.h |   4 ++
 net/core/dev.c            | 111 +++++++++++++++++++++++++++++++++++++---------
 2 files changed, 95 insertions(+), 20 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 3207e0b..c338a73 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3707,6 +3707,10 @@ struct sk_buff *dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
 				    struct netdev_queue *txq, int *ret);
 
 typedef int (*bpf_op_t)(struct net_device *dev, struct netdev_bpf *bpf);
+int dev_check_xdp(struct net_device *dev, struct netlink_ext_ack *extack,
+		  bool do_install, u32 *prog_id_p, u32 flags);
+int dev_change_xdp(struct net_device *dev, struct netlink_ext_ack *extack,
+		   struct bpf_prog *prog, u32 flags);
 int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack,
 		      int fd, u32 flags);
 u32 __dev_xdp_query(struct net_device *dev, bpf_op_t xdp_op,
diff --git a/net/core/dev.c b/net/core/dev.c
index 8bc3dce..9965675 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -8317,23 +8317,24 @@ static void dev_xdp_uninstall(struct net_device *dev)
 }
 
 /**
- *	dev_change_xdp_fd - set or clear a bpf program for a device rx path
+ *	dev_check_xdp - check if xdp prog can be [un]installed
  *	@dev: device
  *	@extack: netlink extended ack
- *	@fd: new program fd or negative value to clear
+ *	@install: flag to install or uninstall
+ *	@prog_id_p: pointer to a storage for program id
  *	@flags: xdp-related flags
  *
- *	Set or clear a bpf program for a device
+ *	Check if xdp prog can be [un]installed
+ *	If a program is already loaded, store the prog id to prog_id_p
  */
-int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack,
-		      int fd, u32 flags)
+int dev_check_xdp(struct net_device *dev, struct netlink_ext_ack *extack,
+		  bool install, u32 *prog_id_p, u32 flags)
 {
 	const struct net_device_ops *ops = dev->netdev_ops;
 	enum bpf_netdev_command query;
-	struct bpf_prog *prog = NULL;
 	bpf_op_t bpf_op, bpf_chk;
 	bool offload;
-	int err;
+	u32 prog_id;
 
 	ASSERT_RTNL();
 
@@ -8350,28 +8351,64 @@ int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack,
 	if (bpf_op == bpf_chk)
 		bpf_chk = generic_xdp_install;
 
-	if (fd >= 0) {
-		u32 prog_id;
-
+	if (install) {
 		if (!offload && __dev_xdp_query(dev, bpf_chk, XDP_QUERY_PROG)) {
 			NL_SET_ERR_MSG(extack, "native and generic XDP can't be active at the same time");
 			return -EEXIST;
 		}
 
 		prog_id = __dev_xdp_query(dev, bpf_op, query);
+		if (prog_id_p)
+			*prog_id_p = prog_id;
 		if ((flags & XDP_FLAGS_UPDATE_IF_NOEXIST) && prog_id) {
 			NL_SET_ERR_MSG(extack, "XDP program already attached");
 			return -EBUSY;
 		}
+	} else {
+		prog_id = __dev_xdp_query(dev, bpf_op, query);
+		if (prog_id_p)
+			*prog_id_p = prog_id;
+		if (!prog_id)
+			return -ENOENT;
+	}
 
-		prog = bpf_prog_get_type_dev(fd, BPF_PROG_TYPE_XDP,
-					     bpf_op == ops->ndo_bpf);
-		if (IS_ERR(prog))
-			return PTR_ERR(prog);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(dev_check_xdp);
+
+/**
+ *	dev_change_xdp - set or clear a bpf program for a device rx path
+ *	@dev: device
+ *	@extack: netlink extended ack
+ *	@prog: bpf progam
+ *	@flags: xdp-related flags
+ *
+ *	Set or clear a bpf program for a device.
+ *	Caller must call dev_check_xdp before calling this function to
+ *	check if xdp prog can be [un]installed.
+ */
+int dev_change_xdp(struct net_device *dev, struct netlink_ext_ack *extack,
+		   struct bpf_prog *prog, u32 flags)
+{
+	const struct net_device_ops *ops = dev->netdev_ops;
+	enum bpf_netdev_command query;
+	bpf_op_t bpf_op;
+	bool offload;
+
+	ASSERT_RTNL();
+
+	offload = flags & XDP_FLAGS_HW_MODE;
+	query = offload ? XDP_QUERY_PROG_HW : XDP_QUERY_PROG;
+
+	bpf_op = ops->ndo_bpf;
+	if (!bpf_op || (flags & XDP_FLAGS_SKB_MODE))
+		bpf_op = generic_xdp_install;
+
+	if (prog) {
+		u32 prog_id = __dev_xdp_query(dev, bpf_op, query);
 
 		if (!offload && bpf_prog_is_dev_bound(prog->aux)) {
 			NL_SET_ERR_MSG(extack, "using device-bound program without HW_MODE flag is not supported");
-			bpf_prog_put(prog);
 			return -EINVAL;
 		}
 
@@ -8379,13 +8416,47 @@ int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack,
 			bpf_prog_put(prog);
 			return 0;
 		}
-	} else {
-		if (!__dev_xdp_query(dev, bpf_op, query))
-			return 0;
 	}
 
-	err = dev_xdp_install(dev, bpf_op, extack, flags, prog);
-	if (err < 0 && prog)
+	return dev_xdp_install(dev, bpf_op, extack, flags, prog);
+}
+EXPORT_SYMBOL_GPL(dev_change_xdp);
+
+/**
+ *	dev_change_xdp_fd - set or clear a bpf program for a device rx path
+ *	@dev: device
+ *	@extack: netlink extended ack
+ *	@fd: new program fd or negative value to clear
+ *	@flags: xdp-related flags
+ *
+ *	Set or clear a bpf program for a device
+ */
+int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack,
+		      int fd, u32 flags)
+{
+	struct bpf_prog *prog = NULL;
+	bool install = fd >= 0;
+	int err;
+
+	err = dev_check_xdp(dev, extack, install, NULL, flags);
+	if (err) {
+		if (!install && err == -ENOENT)
+			err = 0;
+		return err;
+	}
+
+	if (install) {
+		bool attach_drv;
+
+		attach_drv = dev->netdev_ops->ndo_bpf &&
+			     !(flags & XDP_FLAGS_SKB_MODE);
+		prog = bpf_prog_get_type_dev(fd, BPF_PROG_TYPE_XDP, attach_drv);
+		if (IS_ERR(prog))
+			return PTR_ERR(prog);
+	}
+
+	err = dev_change_xdp(dev, extack, prog, flags);
+	if (err && prog)
 		bpf_prog_put(prog);
 
 	return err;
-- 
1.8.3.1


* [RFC PATCH v2 bpf-next 05/15] xdp_flow: Attach bpf prog to XDP in kernel after UMH loaded program
  2019-10-18  4:07 [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP Toshiaki Makita
                   ` (3 preceding siblings ...)
  2019-10-18  4:07 ` [RFC PATCH v2 bpf-next 04/15] xdp: Export dev_check_xdp and dev_change_xdp Toshiaki Makita
@ 2019-10-18  4:07 ` Toshiaki Makita
  2019-10-18  4:07 ` [RFC PATCH v2 bpf-next 06/15] xdp_flow: Prepare flow tables in bpf Toshiaki Makita
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 58+ messages in thread
From: Toshiaki Makita @ 2019-10-18  4:07 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, Pablo Neira Ayuso, Jozsef Kadlecsik,
	Florian Westphal, Pravin B Shelar
  Cc: Toshiaki Makita, netdev, bpf, William Tu, Stanislav Fomichev

As the UMH runs under RTNL, it cannot attach the XDP program from
userspace. Thus the kernel side, the xdp_flow module, installs the XDP
program instead.

Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
---
 net/xdp_flow/xdp_flow_kern_mod.c | 109 ++++++++++++++++++++++++++++++++++++---
 1 file changed, 103 insertions(+), 6 deletions(-)

diff --git a/net/xdp_flow/xdp_flow_kern_mod.c b/net/xdp_flow/xdp_flow_kern_mod.c
index 14e06ee..2c80590 100644
--- a/net/xdp_flow/xdp_flow_kern_mod.c
+++ b/net/xdp_flow/xdp_flow_kern_mod.c
@@ -3,10 +3,27 @@
 #include <linux/module.h>
 #include <linux/umh.h>
 #include <linux/sched/signal.h>
+#include <linux/rhashtable.h>
 #include <linux/rtnetlink.h>
+#include <linux/filter.h>
 #include "xdp_flow.h"
 #include "msgfmt.h"
 
+struct xdp_flow_prog {
+	struct rhash_head ht_node;
+	struct net_device *dev;
+	struct bpf_prog *prog;
+};
+
+static const struct rhashtable_params progs_params = {
+	.key_len = sizeof(struct net_device *),
+	.key_offset = offsetof(struct xdp_flow_prog, dev),
+	.head_offset = offsetof(struct xdp_flow_prog, ht_node),
+	.automatic_shrinking = true,
+};
+
+static struct rhashtable progs;
+
 extern char xdp_flow_umh_start;
 extern char xdp_flow_umh_end;
 
@@ -116,10 +133,17 @@ static int xdp_flow_setup_block_cb(enum tc_setup_type type, void *type_data,
 static int xdp_flow_setup_bind(struct net_device *dev,
 			       struct netlink_ext_ack *extack)
 {
+	u32 flags = XDP_FLAGS_DRV_MODE | XDP_FLAGS_UPDATE_IF_NOEXIST;
+	struct xdp_flow_prog *prog_node;
 	struct mbox_request *req;
+	struct bpf_prog *prog;
 	u32 id = 0;
 	int err;
 
+	err = dev_check_xdp(dev, extack, true, NULL, flags);
+	if (err)
+		return err;
+
 	req = kzalloc(sizeof(*req), GFP_KERNEL);
 	if (!req)
 		return -ENOMEM;
@@ -129,21 +153,83 @@ static int xdp_flow_setup_bind(struct net_device *dev,
 
 	/* Load bpf in UMH and get prog id */
 	err = transact_umh(req, &id);
+	if (err)
+		goto out;
+
+	prog = bpf_prog_get_type_dev_by_id(id, BPF_PROG_TYPE_XDP, true);
+	if (IS_ERR(prog)) {
+		err = PTR_ERR(prog);
+		goto err_umh;
+	}
 
-	/* TODO: id will be used to attach bpf prog to XDP
-	 * As we have rtnl_lock, UMH cannot attach prog to XDP
-	 */
+	err = dev_change_xdp(dev, extack, prog, flags);
+	if (err)
+		goto err_prog;
 
+	prog_node = kzalloc(sizeof(*prog_node), GFP_KERNEL);
+	if (!prog_node) {
+		err = -ENOMEM;
+		goto err_xdp;
+	}
+
+	prog_node->dev = dev;
+	prog_node->prog = prog;
+	err = rhashtable_insert_fast(&progs, &prog_node->ht_node, progs_params);
+	if (err)
+		goto err_pnode;
+
+	prog = bpf_prog_inc(prog);
+	if (IS_ERR(prog)) {
+		err = PTR_ERR(prog);
+		goto err_rht;
+	}
+out:
 	kfree(req);
 
 	return err;
+err_rht:
+	rhashtable_remove_fast(&progs, &prog_node->ht_node, progs_params);
+err_pnode:
+	kfree(prog_node);
+err_xdp:
+	dev_change_xdp(dev, extack, NULL, flags);
+err_prog:
+	bpf_prog_put(prog);
+err_umh:
+	req->cmd = XDP_FLOW_CMD_UNLOAD;
+	transact_umh(req, NULL);
+
+	goto out;
 }
 
 static int xdp_flow_setup_unbind(struct net_device *dev,
 				 struct netlink_ext_ack *extack)
 {
+	struct xdp_flow_prog *prog_node;
+	u32 flags = XDP_FLAGS_DRV_MODE;
 	struct mbox_request *req;
-	int err;
+	int err, ret = 0;
+	u32 prog_id = 0;
+
+	prog_node = rhashtable_lookup_fast(&progs, &dev, progs_params);
+	if (!prog_node) {
+		pr_warn_once("%s: xdp_flow unbind was requested before bind\n",
+			     dev->name);
+		return -ENOENT;
+	}
+
+	err = dev_check_xdp(dev, extack, false, &prog_id, flags);
+	if (!err && prog_id == prog_node->prog->aux->id) {
+		err = dev_change_xdp(dev, extack, NULL, flags);
+		if (err) {
+			pr_warn("Failed to uninstall XDP prog: %d\n", err);
+			ret = err;
+		}
+	}
+
+	bpf_prog_put(prog_node->prog);
+	rhashtable_remove_fast(&progs, &prog_node->ht_node, progs_params);
+	kfree(prog_node);
 
 	req = kzalloc(sizeof(*req), GFP_KERNEL);
 	if (!req)
@@ -153,10 +239,12 @@ static int xdp_flow_setup_unbind(struct net_device *dev,
 	req->ifindex = dev->ifindex;
 
 	err = transact_umh(req, NULL);
+	if (err)
+		ret = err;
 
 	kfree(req);
 
-	return err;
+	return ret;
 }
 
 static int xdp_flow_setup(struct net_device *dev, bool do_bind,
@@ -214,7 +302,11 @@ static int start_umh(void)
 
 static int __init load_umh(void)
 {
-	int err = 0;
+	int err;
+
+	err = rhashtable_init(&progs, &progs_params);
+	if (err)
+		return err;
 
 	mutex_lock(&xdp_flow_ops.lock);
 	if (!xdp_flow_ops.stop) {
@@ -230,8 +322,12 @@ static int __init load_umh(void)
 	xdp_flow_ops.setup = &xdp_flow_setup;
 	xdp_flow_ops.start = &start_umh;
 	xdp_flow_ops.module = THIS_MODULE;
+
+	mutex_unlock(&xdp_flow_ops.lock);
+	return 0;
 err:
 	mutex_unlock(&xdp_flow_ops.lock);
+	rhashtable_destroy(&progs);
 	return err;
 }
 
@@ -244,6 +340,7 @@ static void __exit fini_umh(void)
 	xdp_flow_ops.setup = NULL;
 	xdp_flow_ops.setup_cb = NULL;
 	mutex_unlock(&xdp_flow_ops.lock);
+	rhashtable_destroy(&progs);
 }
 module_init(load_umh);
 module_exit(fini_umh);
-- 
1.8.3.1


* [RFC PATCH v2 bpf-next 06/15] xdp_flow: Prepare flow tables in bpf
  2019-10-18  4:07 [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP Toshiaki Makita
                   ` (4 preceding siblings ...)
  2019-10-18  4:07 ` [RFC PATCH v2 bpf-next 05/15] xdp_flow: Attach bpf prog to XDP in kernel after UMH loaded program Toshiaki Makita
@ 2019-10-18  4:07 ` Toshiaki Makita
  2019-10-18  4:07 ` [RFC PATCH v2 bpf-next 07/15] xdp_flow: Add flow entry insertion/deletion logic in UMH Toshiaki Makita
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 58+ messages in thread
From: Toshiaki Makita @ 2019-10-18  4:07 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, Pablo Neira Ayuso, Jozsef Kadlecsik,
	Florian Westphal, Pravin B Shelar
  Cc: Toshiaki Makita, netdev, bpf, William Tu, Stanislav Fomichev

Add maps for flow tables in bpf. TC flower keeps a hash table for each
flow mask, ordered by priority. To do the same thing, prepare a
hashmap-in-arraymap. As bpf does not provide an ordered list, we emulate
one with an array: each array entry has a next-index field linking it to
the next entry, and a separate one-element array holds the head index of
the list.

Because of the limitation of bpf maps, the outer array is implemented
using two array maps. "flow_masks" is the array to emulate the list and
its entries have the priority and mask of each flow table. For each
priority/mask, the same index entry of another map "flow_tables", which
is the hashmap-in-arraymap, points to the actual flow table.

The flow insertion logic in UMH and lookup logic in BPF will be
implemented in the following commits.

NOTE: This array-based list emulation could instead be realized by
adding an ordered-list map type. In that case we would also need a map
iteration API for bpf progs.

Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
---
 net/xdp_flow/umh_bpf.h           | 18 +++++++++++
 net/xdp_flow/xdp_flow_kern_bpf.c | 22 +++++++++++++
 net/xdp_flow/xdp_flow_umh.c      | 70 ++++++++++++++++++++++++++++++++++++++--
 3 files changed, 108 insertions(+), 2 deletions(-)
 create mode 100644 net/xdp_flow/umh_bpf.h

diff --git a/net/xdp_flow/umh_bpf.h b/net/xdp_flow/umh_bpf.h
new file mode 100644
index 0000000..b4fe0c6
--- /dev/null
+++ b/net/xdp_flow/umh_bpf.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _NET_XDP_FLOW_UMH_BPF_H
+#define _NET_XDP_FLOW_UMH_BPF_H
+
+#include "msgfmt.h"
+
+#define MAX_FLOWS 1024
+#define MAX_FLOW_MASKS 255
+#define FLOW_MASKS_TAIL 255
+
+struct xdp_flow_mask_entry {
+	struct xdp_flow_key mask;
+	__u16 priority;
+	short count;
+	int next;
+};
+
+#endif
diff --git a/net/xdp_flow/xdp_flow_kern_bpf.c b/net/xdp_flow/xdp_flow_kern_bpf.c
index 74cdb1d..c101156 100644
--- a/net/xdp_flow/xdp_flow_kern_bpf.c
+++ b/net/xdp_flow/xdp_flow_kern_bpf.c
@@ -2,6 +2,28 @@
 #define KBUILD_MODNAME "foo"
 #include <uapi/linux/bpf.h>
 #include <bpf_helpers.h>
+#include "umh_bpf.h"
+
+struct bpf_map_def SEC("maps") flow_masks_head = {
+	.type = BPF_MAP_TYPE_ARRAY,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(int),
+	.max_entries = 1,
+};
+
+struct bpf_map_def SEC("maps") flow_masks = {
+	.type = BPF_MAP_TYPE_ARRAY,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(struct xdp_flow_mask_entry),
+	.max_entries = MAX_FLOW_MASKS,
+};
+
+struct bpf_map_def SEC("maps") flow_tables = {
+	.type = BPF_MAP_TYPE_ARRAY_OF_MAPS,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(u32),
+	.max_entries = MAX_FLOW_MASKS,
+};
 
 SEC("xdp_flow")
 int xdp_flow_prog(struct xdp_md *ctx)
diff --git a/net/xdp_flow/xdp_flow_umh.c b/net/xdp_flow/xdp_flow_umh.c
index 85c5c7b..515c2fd 100644
--- a/net/xdp_flow/xdp_flow_umh.c
+++ b/net/xdp_flow/xdp_flow_umh.c
@@ -13,7 +13,7 @@
 #include <sys/resource.h>
 #include <linux/hashtable.h>
 #include <linux/err.h>
-#include "msgfmt.h"
+#include "umh_bpf.h"
 
 extern char xdp_flow_bpf_start;
 extern char xdp_flow_bpf_end;
@@ -99,11 +99,13 @@ static int setup(void)
 
 static int load_bpf(int ifindex, struct bpf_object **objp)
 {
+	int prog_fd, flow_tables_fd, flow_meta_fd, flow_masks_head_fd, err;
+	struct bpf_map *flow_tables, *flow_masks_head;
+	int zero = 0, flow_masks_tail = FLOW_MASKS_TAIL;
 	struct bpf_object_open_attr attr = {};
 	char path[256], errbuf[ERRBUF_SIZE];
 	struct bpf_program *prog;
 	struct bpf_object *obj;
-	int prog_fd, err;
 	ssize_t len;
 
 	len = snprintf(path, 256, "/proc/self/fd/%d", progfile_fd);
@@ -131,6 +133,48 @@ static int load_bpf(int ifindex, struct bpf_object **objp)
 	bpf_object__for_each_program(prog, obj)
 		bpf_program__set_type(prog, attr.prog_type);
 
+	flow_meta_fd = bpf_create_map(BPF_MAP_TYPE_HASH,
+				      sizeof(struct xdp_flow_key),
+				      sizeof(struct xdp_flow_actions),
+				      MAX_FLOWS, 0);
+	if (flow_meta_fd < 0) {
+		err = -errno;
+		pr_err("map creation for flow_tables meta failed: %s\n",
+		       strerror(errno));
+		goto err;
+	}
+
+	flow_tables_fd = bpf_create_map_in_map(BPF_MAP_TYPE_ARRAY_OF_MAPS,
+					       "flow_tables", sizeof(__u32),
+					       flow_meta_fd, MAX_FLOW_MASKS, 0);
+	if (flow_tables_fd < 0) {
+		err = -errno;
+		pr_err("map creation for flow_tables failed: %s\n",
+		       strerror(errno));
+		close(flow_meta_fd);
+		goto err;
+	}
+
+	close(flow_meta_fd);
+
+	flow_tables = bpf_object__find_map_by_name(obj, "flow_tables");
+	if (!flow_tables) {
+		pr_err("Cannot find flow_tables\n");
+		err = -ENOENT;
+		close(flow_tables_fd);
+		goto err;
+	}
+
+	err = bpf_map__reuse_fd(flow_tables, flow_tables_fd);
+	if (err) {
+		err = libbpf_err(err, errbuf);
+		pr_err("Failed to reuse flow_tables fd: %s\n", errbuf);
+		close(flow_tables_fd);
+		goto err;
+	}
+
+	close(flow_tables_fd);
+
 	err = bpf_object__load(obj);
 	if (err) {
 		err = libbpf_err(err, errbuf);
@@ -138,6 +182,28 @@ static int load_bpf(int ifindex, struct bpf_object **objp)
 		goto err;
 	}
 
+	flow_masks_head = bpf_object__find_map_by_name(obj, "flow_masks_head");
+	if (!flow_masks_head) {
+		pr_err("Cannot find flow_masks_head map\n");
+		err = -ENOENT;
+		goto err;
+	}
+
+	flow_masks_head_fd = bpf_map__fd(flow_masks_head);
+	if (flow_masks_head_fd < 0) {
+		err = libbpf_err(flow_masks_head_fd, errbuf);
+		pr_err("Invalid flow_masks_head fd: %s\n", errbuf);
+		goto err;
+	}
+
+	if (bpf_map_update_elem(flow_masks_head_fd, &zero, &flow_masks_tail,
+				0)) {
+		err = -errno;
+		pr_err("Failed to initialize flow_masks_head: %s\n",
+		       strerror(errno));
+		goto err;
+	}
+
 	prog = bpf_object__find_program_by_title(obj, "xdp_flow");
 	if (!prog) {
 		pr_err("Cannot find xdp_flow program\n");
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 58+ messages in thread

* [RFC PATCH v2 bpf-next 07/15] xdp_flow: Add flow entry insertion/deletion logic in UMH
  2019-10-18  4:07 [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP Toshiaki Makita
                   ` (5 preceding siblings ...)
  2019-10-18  4:07 ` [RFC PATCH v2 bpf-next 06/15] xdp_flow: Prepare flow tables in bpf Toshiaki Makita
@ 2019-10-18  4:07 ` Toshiaki Makita
  2019-10-18  4:07 ` [RFC PATCH v2 bpf-next 08/15] xdp_flow: Add flow handling and basic actions in bpf prog Toshiaki Makita
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 58+ messages in thread
From: Toshiaki Makita @ 2019-10-18  4:07 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, Pablo Neira Ayuso, Jozsef Kadlecsik,
	Florian Westphal, Pravin B Shelar
  Cc: Toshiaki Makita, netdev, bpf, William Tu, Stanislav Fomichev

This logic will be used when the xdp_flow kmod requests flow
insertion/deletion.

On insertion, find a free entry, populate it, then update the
next-index pointer of the previous entry. On deletion, set the
next-index pointer of the previous entry to the next index of the entry
being deleted.
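The UMH also tracks which flow_masks array indices are unused with a
simple stack of free slots (cf. init_flow_masks_free_slot() and friends
in this patch). A standalone sketch of that bookkeeping, with
illustrative names and a reduced size:

```c
#include <assert.h>
#include <errno.h>

#define MAX_SLOTS 4

static int free_slots[MAX_SLOTS];
static int free_slot_top;

/* Fill the stack so that slot 0 is handed out first. */
static void init_free_slots(void)
{
	int i;

	for (i = 0; i < MAX_SLOTS; i++)
		free_slots[MAX_SLOTS - 1 - i] = i;
	free_slot_top = MAX_SLOTS - 1;
}

/* Peek at the next free slot without consuming it. */
static int peek_free_slot(void)
{
	if (free_slot_top < 0)
		return -ENOBUFS;
	return free_slots[free_slot_top];
}

/* Consume the slot returned by peek_free_slot() once it is in use. */
static void take_free_slot(void)
{
	free_slot_top--;
}

/* Return a slot to the stack when its mask entry is deleted. */
static void put_free_slot(int slot)
{
	free_slots[++free_slot_top] = slot;
}
```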

Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
---
 net/xdp_flow/umh_bpf.h      |  15 ++
 net/xdp_flow/xdp_flow_umh.c | 470 +++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 483 insertions(+), 2 deletions(-)

diff --git a/net/xdp_flow/umh_bpf.h b/net/xdp_flow/umh_bpf.h
index b4fe0c6..4e4633f 100644
--- a/net/xdp_flow/umh_bpf.h
+++ b/net/xdp_flow/umh_bpf.h
@@ -15,4 +15,19 @@ struct xdp_flow_mask_entry {
 	int next;
 };
 
+static inline bool flow_equal(const struct xdp_flow_key *key1,
+			      const struct xdp_flow_key *key2)
+{
+	long *lkey1 = (long *)key1;
+	long *lkey2 = (long *)key2;
+	int i;
+
+	for (i = 0; i < sizeof(*key1); i += sizeof(long)) {
+		if (*lkey1++ != *lkey2++)
+			return false;
+	}
+
+	return true;
+}
+
 #endif
diff --git a/net/xdp_flow/xdp_flow_umh.c b/net/xdp_flow/xdp_flow_umh.c
index 515c2fd..0588a36 100644
--- a/net/xdp_flow/xdp_flow_umh.c
+++ b/net/xdp_flow/xdp_flow_umh.c
@@ -20,6 +20,8 @@
 int progfile_fd;
 FILE *kmsg;
 
+#define zalloc(size) calloc(1, (size))
+
 #define pr_log(fmt, prio, ...) fprintf(kmsg, "<%d>xdp_flow_umh: " fmt, \
 				       LOG_DAEMON | (prio), ##__VA_ARGS__)
 #ifdef DEBUG
@@ -42,6 +44,8 @@ struct netdev_info {
 	struct netdev_info_key key;
 	struct hlist_node node;
 	struct bpf_object *obj;
+	int free_slot_top;
+	int free_slots[MAX_FLOW_MASKS];
 };
 
 DEFINE_HASHTABLE(netdev_info_table, 16);
@@ -272,6 +276,57 @@ static struct netdev_info *get_netdev_info(const struct mbox_request *req)
 	return netdev_info;
 }
 
+static void init_flow_masks_free_slot(struct netdev_info *netdev_info)
+{
+	int i;
+
+	for (i = 0; i < MAX_FLOW_MASKS; i++)
+		netdev_info->free_slots[MAX_FLOW_MASKS - 1 - i] = i;
+	netdev_info->free_slot_top = MAX_FLOW_MASKS - 1;
+}
+
+static int get_flow_masks_free_slot(const struct netdev_info *netdev_info)
+{
+	if (netdev_info->free_slot_top < 0)
+		return -ENOBUFS;
+
+	return netdev_info->free_slots[netdev_info->free_slot_top];
+}
+
+static int add_flow_masks_free_slot(struct netdev_info *netdev_info, int slot)
+{
+	if (unlikely(netdev_info->free_slot_top >= MAX_FLOW_MASKS - 1)) {
+		pr_warn("BUG: free_slot overflow: top=%d, slot=%d\n",
+			netdev_info->free_slot_top, slot);
+		return -EOVERFLOW;
+	}
+
+	netdev_info->free_slots[++netdev_info->free_slot_top] = slot;
+
+	return 0;
+}
+
+static void delete_flow_masks_free_slot(struct netdev_info *netdev_info,
+					int slot)
+{
+	int top_slot;
+
+	if (unlikely(netdev_info->free_slot_top < 0)) {
+		pr_warn("BUG: free_slot underflow: top=%d, slot=%d\n",
+			netdev_info->free_slot_top, slot);
+		return;
+	}
+
+	top_slot = netdev_info->free_slots[netdev_info->free_slot_top];
+	if (unlikely(top_slot != slot)) {
+		pr_warn("BUG: inconsistent free_slot top: top_slot=%d, slot=%d\n",
+			top_slot, slot);
+		return;
+	}
+
+	netdev_info->free_slot_top--;
+}
+
 static int handle_load(const struct mbox_request *req, __u32 *prog_id)
 {
 	struct netdev_info *netdev_info;
@@ -295,6 +350,8 @@ static int handle_load(const struct mbox_request *req, __u32 *prog_id)
 	}
 	netdev_info->key.ifindex = key.ifindex;
 
+	init_flow_masks_free_slot(netdev_info);
+
 	prog_fd = load_bpf(req->ifindex, &netdev_info->obj);
 	if (prog_fd < 0) {
 		err = prog_fd;
@@ -335,14 +392,423 @@ static int handle_unload(const struct mbox_request *req)
 	return 0;
 }
 
+static int get_table_fd(const struct netdev_info *netdev_info,
+			const char *table_name)
+{
+	char errbuf[ERRBUF_SIZE];
+	struct bpf_map *map;
+	int map_fd;
+	int err;
+
+	map = bpf_object__find_map_by_name(netdev_info->obj, table_name);
+	if (!map) {
+		pr_err("BUG: %s map not found.\n", table_name);
+		return -ENOENT;
+	}
+
+	map_fd = bpf_map__fd(map);
+	if (map_fd < 0) {
+		err = libbpf_err(map_fd, errbuf);
+		pr_err("Invalid map fd: %s\n", errbuf);
+		return err;
+	}
+
+	return map_fd;
+}
+
+static int get_flow_masks_head_fd(const struct netdev_info *netdev_info)
+{
+	return get_table_fd(netdev_info, "flow_masks_head");
+}
+
+static int get_flow_masks_head(int head_fd, int *head)
+{
+	int err, zero = 0;
+
+	if (bpf_map_lookup_elem(head_fd, &zero, head)) {
+		err = -errno;
+		pr_err("Cannot get flow_masks_head: %s\n", strerror(errno));
+		return err;
+	}
+
+	return 0;
+}
+
+static int update_flow_masks_head(int head_fd, int head)
+{
+	int err, zero = 0;
+
+	if (bpf_map_update_elem(head_fd, &zero, &head, 0)) {
+		err = -errno;
+		pr_err("Cannot update flow_masks_head: %s\n", strerror(errno));
+		return err;
+	}
+
+	return 0;
+}
+
+static int get_flow_masks_fd(const struct netdev_info *netdev_info)
+{
+	return get_table_fd(netdev_info, "flow_masks");
+}
+
+static int get_flow_tables_fd(const struct netdev_info *netdev_info)
+{
+	return get_table_fd(netdev_info, "flow_tables");
+}
+
+static int __flow_table_insert_elem(int flow_table_fd,
+				    const struct xdp_flow *flow)
+{
+	int err = 0;
+
+	if (bpf_map_update_elem(flow_table_fd, &flow->key, &flow->actions, 0)) {
+		err = -errno;
+		pr_err("Cannot insert flow entry: %s\n",
+		       strerror(errno));
+	}
+
+	return err;
+}
+
+static void __flow_table_delete_elem(int flow_table_fd,
+				     const struct xdp_flow *flow)
+{
+	bpf_map_delete_elem(flow_table_fd, &flow->key);
+}
+
+static int flow_table_insert_elem(struct netdev_info *netdev_info,
+				  const struct xdp_flow *flow)
+{
+	int masks_fd, head_fd, flow_tables_fd, flow_table_fd, free_slot, head;
+	struct xdp_flow_mask_entry *entry, *pentry;
+	int err, cnt, idx, pidx;
+
+	masks_fd = get_flow_masks_fd(netdev_info);
+	if (masks_fd < 0)
+		return masks_fd;
+
+	head_fd = get_flow_masks_head_fd(netdev_info);
+	if (head_fd < 0)
+		return head_fd;
+
+	err = get_flow_masks_head(head_fd, &head);
+	if (err)
+		return err;
+
+	flow_tables_fd = get_flow_tables_fd(netdev_info);
+	if (flow_tables_fd < 0)
+		return flow_tables_fd;
+
+	entry = zalloc(sizeof(*entry));
+	if (!entry) {
+		pr_err("Memory allocation for flow_masks entry failed\n");
+		return -ENOMEM;
+	}
+
+	pentry = zalloc(sizeof(*pentry));
+	if (!pentry) {
+		flow_table_fd = -ENOMEM;
+		pr_err("Memory allocation for flow_masks prev entry failed\n");
+		goto err_entry;
+	}
+
+	idx = head;
+	for (cnt = 0; cnt < MAX_FLOW_MASKS; cnt++) {
+		if (idx == FLOW_MASKS_TAIL)
+			break;
+
+		if (bpf_map_lookup_elem(masks_fd, &idx, entry)) {
+			err = -errno;
+			pr_err("Cannot lookup flow_masks: %s\n",
+			       strerror(errno));
+			goto err;
+		}
+
+		if (entry->priority == flow->priority &&
+		    flow_equal(&entry->mask, &flow->mask)) {
+			__u32 id;
+
+			if (bpf_map_lookup_elem(flow_tables_fd, &idx, &id)) {
+				err = -errno;
+				pr_err("Cannot lookup flow_tables: %s\n",
+				       strerror(errno));
+				goto err;
+			}
+
+			flow_table_fd = bpf_map_get_fd_by_id(id);
+			if (flow_table_fd < 0) {
+				err = -errno;
+				pr_err("Cannot get flow_table fd by id: %s\n",
+				       strerror(errno));
+				goto err;
+			}
+
+			err = __flow_table_insert_elem(flow_table_fd, flow);
+			if (err)
+				goto out;
+
+			entry->count++;
+			if (bpf_map_update_elem(masks_fd, &idx, entry, 0)) {
+				err = -errno;
+				pr_err("Cannot update flow_masks count: %s\n",
+				       strerror(errno));
+				__flow_table_delete_elem(flow_table_fd, flow);
+				goto out;
+			}
+
+			goto out;
+		}
+
+		if (entry->priority > flow->priority)
+			break;
+
+		*pentry = *entry;
+		pidx = idx;
+		idx = entry->next;
+	}
+
+	if (unlikely(cnt == MAX_FLOW_MASKS && idx != FLOW_MASKS_TAIL)) {
+		err = -EINVAL;
+		pr_err("Cannot lookup flow_masks: Broken flow_masks list\n");
+		goto out;
+	}
+
+	/* Flow mask was not found. Create a new one */
+
+	free_slot = get_flow_masks_free_slot(netdev_info);
+	if (free_slot < 0) {
+		err = free_slot;
+		goto err;
+	}
+
+	entry->mask = flow->mask;
+	entry->priority = flow->priority;
+	entry->count = 1;
+	entry->next = idx;
+	if (bpf_map_update_elem(masks_fd, &free_slot, entry, 0)) {
+		err = -errno;
+		pr_err("Cannot update flow_masks: %s\n", strerror(errno));
+		goto err;
+	}
+
+	flow_table_fd = bpf_create_map(BPF_MAP_TYPE_HASH,
+				       sizeof(struct xdp_flow_key),
+				       sizeof(struct xdp_flow_actions),
+				       MAX_FLOWS, 0);
+	if (flow_table_fd < 0) {
+		err = -errno;
+		pr_err("map creation for flow_table failed: %s\n",
+		       strerror(errno));
+		goto err;
+	}
+
+	err = __flow_table_insert_elem(flow_table_fd, flow);
+	if (err)
+		goto out;
+
+	if (bpf_map_update_elem(flow_tables_fd, &free_slot, &flow_table_fd, 0)) {
+		err = -errno;
+		pr_err("Failed to insert flow_table into flow_tables: %s\n",
+		       strerror(errno));
+		goto out;
+	}
+
+	if (cnt == 0) {
+		err = update_flow_masks_head(head_fd, free_slot);
+		if (err)
+			goto err_flow_table;
+	} else {
+		pentry->next = free_slot;
+		/* This effectively only updates one byte of entry->next */
+		if (bpf_map_update_elem(masks_fd, &pidx, pentry, 0)) {
+			err = -errno;
+			pr_err("Cannot update flow_masks prev entry: %s\n",
+			       strerror(errno));
+			goto err_flow_table;
+		}
+	}
+	delete_flow_masks_free_slot(netdev_info, free_slot);
+out:
+	close(flow_table_fd);
+err:
+	free(pentry);
+err_entry:
+	free(entry);
+
+	return err;
+
+err_flow_table:
+	bpf_map_delete_elem(flow_tables_fd, &free_slot);
+
+	goto out;
+}
+
+static int flow_table_delete_elem(struct netdev_info *netdev_info,
+				  const struct xdp_flow *flow)
+{
+	int masks_fd, head_fd, flow_tables_fd, flow_table_fd, head;
+	struct xdp_flow_mask_entry *entry, *pentry;
+	int err, cnt, idx, pidx;
+	__u32 id;
+
+	masks_fd = get_flow_masks_fd(netdev_info);
+	if (masks_fd < 0)
+		return masks_fd;
+
+	head_fd = get_flow_masks_head_fd(netdev_info);
+	if (head_fd < 0)
+		return head_fd;
+
+	err = get_flow_masks_head(head_fd, &head);
+	if (err)
+		return err;
+
+	flow_tables_fd = get_flow_tables_fd(netdev_info);
+	if (flow_tables_fd < 0)
+		return flow_tables_fd;
+
+	entry = zalloc(sizeof(*entry));
+	if (!entry) {
+		pr_err("Memory allocation for flow_masks entry failed\n");
+		return -ENOMEM;
+	}
+
+	pentry = zalloc(sizeof(*pentry));
+	if (!pentry) {
+		err = -ENOMEM;
+		pr_err("Memory allocation for flow_masks prev entry failed\n");
+		goto err_pentry;
+	}
+
+	idx = head;
+	for (cnt = 0; cnt < MAX_FLOW_MASKS; cnt++) {
+		if (idx == FLOW_MASKS_TAIL) {
+			err = -ENOENT;
+			pr_err("Cannot lookup flow_masks: %s\n",
+			       strerror(-err));
+			goto out;
+		}
+
+		if (bpf_map_lookup_elem(masks_fd, &idx, entry)) {
+			err = -errno;
+			pr_err("Cannot lookup flow_masks: %s\n",
+			       strerror(errno));
+			goto out;
+		}
+
+		if (entry->priority > flow->priority) {
+			err = -ENOENT;
+			pr_err("Cannot lookup flow_masks: %s\n",
+			       strerror(-err));
+			goto out;
+		}
+
+		if (entry->priority == flow->priority &&
+		    flow_equal(&entry->mask, &flow->mask))
+			break;
+
+		*pentry = *entry;
+		pidx = idx;
+		idx = entry->next;
+	}
+
+	if (unlikely(cnt == MAX_FLOW_MASKS)) {
+		err = -ENOENT;
+		pr_err("Cannot lookup flow_masks: Broken flow_masks list\n");
+		goto out;
+	}
+
+	if (bpf_map_lookup_elem(flow_tables_fd, &idx, &id)) {
+		err = -errno;
+		pr_err("Cannot lookup flow_tables: %s\n",
+		       strerror(errno));
+		goto out;
+	}
+
+	flow_table_fd = bpf_map_get_fd_by_id(id);
+	if (flow_table_fd < 0) {
+		err = -errno;
+		pr_err("Cannot get flow_table fd by id: %s\n",
+		       strerror(errno));
+		goto out;
+	}
+
+	__flow_table_delete_elem(flow_table_fd, flow);
+	close(flow_table_fd);
+
+	if (--entry->count > 0) {
+		if (bpf_map_update_elem(masks_fd, &idx, entry, 0)) {
+			err = -errno;
+			pr_err("Cannot update flow_masks count: %s\n",
+			       strerror(errno));
+		}
+
+		goto out;
+	}
+
+	if (unlikely(entry->count < 0)) {
+		pr_warn("flow_masks has negative count: %d\n",
+			entry->count);
+	}
+
+	if (cnt == 0) {
+		err = update_flow_masks_head(head_fd, entry->next);
+		if (err)
+			goto out;
+	} else {
+		pentry->next = entry->next;
+		/* This effectively only updates one byte of entry->next */
+		if (bpf_map_update_elem(masks_fd, &pidx, pentry, 0)) {
+			err = -errno;
+			pr_err("Cannot update flow_masks prev entry: %s\n",
+			       strerror(errno));
+			goto out;
+		}
+	}
+
+	bpf_map_delete_elem(flow_tables_fd, &idx);
+	err = add_flow_masks_free_slot(netdev_info, idx);
+	if (err)
+		pr_err("Cannot add flow_masks free slot: %s\n", strerror(-err));
+out:
+	free(pentry);
+err_pentry:
+	free(entry);
+
+	return err;
+}
+
 static int handle_replace(struct mbox_request *req)
 {
-	return -EOPNOTSUPP;
+	struct netdev_info *netdev_info;
+	int err;
+
+	netdev_info = get_netdev_info(req);
+	if (IS_ERR(netdev_info))
+		return PTR_ERR(netdev_info);
+
+	err = flow_table_insert_elem(netdev_info, &req->flow);
+	if (err)
+		return err;
+
+	return 0;
 }
 
 static int handle_delete(const struct mbox_request *req)
 {
-	return -EOPNOTSUPP;
+	struct netdev_info *netdev_info;
+	int err;
+
+	netdev_info = get_netdev_info(req);
+	if (IS_ERR(netdev_info))
+		return PTR_ERR(netdev_info);
+
+	err = flow_table_delete_elem(netdev_info, &req->flow);
+	if (err)
+		return err;
+
+	return 0;
 }
 
 static void loop(void)
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 58+ messages in thread

* [RFC PATCH v2 bpf-next 08/15] xdp_flow: Add flow handling and basic actions in bpf prog
  2019-10-18  4:07 [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP Toshiaki Makita
                   ` (6 preceding siblings ...)
  2019-10-18  4:07 ` [RFC PATCH v2 bpf-next 07/15] xdp_flow: Add flow entry insertion/deletion logic in UMH Toshiaki Makita
@ 2019-10-18  4:07 ` Toshiaki Makita
  2019-10-18  4:07 ` [RFC PATCH v2 bpf-next 09/15] xdp_flow: Implement flow replacement/deletion logic in xdp_flow kmod Toshiaki Makita
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 58+ messages in thread
From: Toshiaki Makita @ 2019-10-18  4:07 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, Pablo Neira Ayuso, Jozsef Kadlecsik,
	Florian Westphal, Pravin B Shelar
  Cc: Toshiaki Makita, netdev, bpf, William Tu, Stanislav Fomichev

The BPF prog for XDP parses the packet and extracts the flow key, then
looks up an entry in the flow tables.
Only the "accept" and "drop" actions are implemented at this point.
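The lookup follows the classic tuple-space search: for each mask in
priority order, AND the extracted key with the mask and probe that
mask's hash table; the first match wins. A simplified userspace model
of flow_mask() and the lookup loop, with the key reduced to a single
u64 for brevity (illustrative only):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define N_MASKS 2

struct flow_table {
	uint64_t mask;       /* which key bits this table matches on */
	uint64_t match_key;  /* the (already masked) key stored in it */
	int action;          /* e.g. 0 = drop, 1 = accept */
};

static struct flow_table tables[N_MASKS] = {
	{ .mask = 0xff00, .match_key = 0x1200, .action = 1 },
	{ .mask = 0xffff, .match_key = 0x3456, .action = 0 },
};

/* Mask the key, then look it up; first matching table wins. */
static bool lookup(uint64_t key, int *action)
{
	int i;

	for (i = 0; i < N_MASKS; i++) {
		uint64_t mkey = key & tables[i].mask;  /* cf. flow_mask() */

		if (mkey == tables[i].match_key) {
			*action = tables[i].action;
			return true;
		}
	}
	return false;  /* no match: the real prog returns XDP_PASS */
}
```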

Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
---
 net/xdp_flow/xdp_flow_kern_bpf.c | 297 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 296 insertions(+), 1 deletion(-)

diff --git a/net/xdp_flow/xdp_flow_kern_bpf.c b/net/xdp_flow/xdp_flow_kern_bpf.c
index c101156..f4a6346 100644
--- a/net/xdp_flow/xdp_flow_kern_bpf.c
+++ b/net/xdp_flow/xdp_flow_kern_bpf.c
@@ -1,9 +1,27 @@
 // SPDX-License-Identifier: GPL-2.0
 #define KBUILD_MODNAME "foo"
 #include <uapi/linux/bpf.h>
+#include <linux/in.h>
+#include <linux/if_ether.h>
+#include <linux/if_packet.h>
+#include <linux/if_vlan.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+#include <net/ipv6.h>
+#include <net/dsfield.h>
 #include <bpf_helpers.h>
 #include "umh_bpf.h"
 
+/* Used when the action only modifies the packet */
+#define _XDP_CONTINUE -1
+
+struct bpf_map_def SEC("maps") debug_stats = {
+	.type = BPF_MAP_TYPE_PERCPU_ARRAY,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(long),
+	.max_entries = 256,
+};
+
 struct bpf_map_def SEC("maps") flow_masks_head = {
 	.type = BPF_MAP_TYPE_ARRAY,
 	.key_size = sizeof(u32),
@@ -25,10 +43,287 @@ struct bpf_map_def SEC("maps") flow_tables = {
 	.max_entries = MAX_FLOW_MASKS,
 };
 
+static inline void account_debug(int idx)
+{
+	long *cnt;
+
+	cnt = bpf_map_lookup_elem(&debug_stats, &idx);
+	if (cnt)
+		*cnt += 1;
+}
+
+static inline void account_action(int act)
+{
+	account_debug(act + 1);
+}
+
+static inline int action_accept(void)
+{
+	account_action(XDP_FLOW_ACTION_ACCEPT);
+	return XDP_PASS;
+}
+
+static inline int action_drop(void)
+{
+	account_action(XDP_FLOW_ACTION_DROP);
+	return XDP_DROP;
+}
+
+static inline int action_redirect(struct xdp_flow_action *action)
+{
+	account_action(XDP_FLOW_ACTION_REDIRECT);
+
+	// TODO: implement this
+	return XDP_ABORTED;
+}
+
+static inline int action_vlan_push(struct xdp_md *ctx,
+				   struct xdp_flow_action *action)
+{
+	account_action(XDP_FLOW_ACTION_VLAN_PUSH);
+
+	// TODO: implement this
+	return XDP_ABORTED;
+}
+
+static inline int action_vlan_pop(struct xdp_md *ctx,
+				  struct xdp_flow_action *action)
+{
+	account_action(XDP_FLOW_ACTION_VLAN_POP);
+
+	// TODO: implement this
+	return XDP_ABORTED;
+}
+
+static inline int action_vlan_mangle(struct xdp_md *ctx,
+				     struct xdp_flow_action *action)
+{
+	account_action(XDP_FLOW_ACTION_VLAN_MANGLE);
+
+	// TODO: implement this
+	return XDP_ABORTED;
+}
+
+static inline int action_mangle(struct xdp_md *ctx,
+				struct xdp_flow_action *action)
+{
+	account_action(XDP_FLOW_ACTION_MANGLE);
+
+	// TODO: implement this
+	return XDP_ABORTED;
+}
+
+static inline int action_csum(struct xdp_md *ctx,
+			      struct xdp_flow_action *action)
+{
+	account_action(XDP_FLOW_ACTION_CSUM);
+
+	// TODO: implement this
+	return XDP_ABORTED;
+}
+
+static inline void __ether_addr_copy(u8 *dst, const u8 *src)
+{
+	u16 *a = (u16 *)dst;
+	const u16 *b = (const u16 *)src;
+
+	a[0] = b[0];
+	a[1] = b[1];
+	a[2] = b[2];
+}
+
+static inline int parse_ipv4(void *data, u64 *nh_off, void *data_end,
+			     struct xdp_flow_key *key)
+{
+	struct iphdr *iph = data + *nh_off;
+
+	if (iph + 1 > data_end)
+		return -1;
+
+	key->ipv4.src = iph->saddr;
+	key->ipv4.dst = iph->daddr;
+	key->ip.ttl = iph->ttl;
+	key->ip.tos = iph->tos;
+	*nh_off += iph->ihl * 4;
+
+	return iph->protocol;
+}
+
+static inline int parse_ipv6(void *data, u64 *nh_off, void *data_end,
+			     struct xdp_flow_key *key)
+{
+	struct ipv6hdr *ip6h = data + *nh_off;
+
+	if (ip6h + 1 > data_end)
+		return -1;
+
+	key->ipv6.src = ip6h->saddr;
+	key->ipv6.dst = ip6h->daddr;
+	key->ip.ttl = ip6h->hop_limit;
+	key->ip.tos = ipv6_get_dsfield(ip6h);
+	*nh_off += sizeof(*ip6h);
+
+	if (ip6h->nexthdr == NEXTHDR_HOP ||
+	    ip6h->nexthdr == NEXTHDR_ROUTING ||
+	    ip6h->nexthdr == NEXTHDR_FRAGMENT ||
+	    ip6h->nexthdr == NEXTHDR_AUTH ||
+	    ip6h->nexthdr == NEXTHDR_NONE ||
+	    ip6h->nexthdr == NEXTHDR_DEST)
+		return 0;
+
+	return ip6h->nexthdr;
+}
+
+#define for_each_flow_mask(entry, head, idx, cnt) \
+	for (entry = bpf_map_lookup_elem(&flow_masks, (head)), \
+	     idx = *(head), cnt = 0; \
+	     entry != NULL && cnt < MAX_FLOW_MASKS; \
+	     idx = entry->next, \
+	     entry = bpf_map_lookup_elem(&flow_masks, &idx), cnt++)
+
+static inline void flow_mask(struct xdp_flow_key *mkey,
+			     const struct xdp_flow_key *key,
+			     const struct xdp_flow_key *mask)
+{
+	long *lmkey = (long *)mkey;
+	long *lmask = (long *)mask;
+	long *lkey = (long *)key;
+	int i;
+
+	for (i = 0; i < sizeof(*mkey); i += sizeof(long))
+		*lmkey++ = *lkey++ & *lmask++;
+}
+
 SEC("xdp_flow")
 int xdp_flow_prog(struct xdp_md *ctx)
 {
-	return XDP_PASS;
+	void *data_end = (void *)(long)ctx->data_end;
+	struct xdp_flow_actions *actions = NULL;
+	void *data = (void *)(long)ctx->data;
+	int cnt, idx, action_idx, zero = 0;
+	struct xdp_flow_mask_entry *entry;
+	struct ethhdr *eth = data;
+	struct xdp_flow_key key;
+	int rc = XDP_DROP;
+	long *value;
+	u16 h_proto;
+	int ipproto;
+	u64 nh_off;
+	int *head;
+
+	account_debug(0);
+
+	nh_off = sizeof(*eth);
+	if (data + nh_off > data_end)
+		return XDP_DROP;
+
+	__builtin_memset(&key, 0, sizeof(key));
+	h_proto = eth->h_proto;
+	__ether_addr_copy(key.eth.dst, eth->h_dest);
+	__ether_addr_copy(key.eth.src, eth->h_source);
+
+	if (eth_type_vlan(h_proto)) {
+		struct vlan_hdr *vhdr;
+
+		vhdr = data + nh_off;
+		nh_off += sizeof(*vhdr);
+		if (data + nh_off > data_end)
+			return XDP_DROP;
+		key.vlan.tpid = h_proto;
+		key.vlan.tci = vhdr->h_vlan_TCI;
+		h_proto = vhdr->h_vlan_encapsulated_proto;
+	}
+	key.eth.type = h_proto;
+
+	if (h_proto == htons(ETH_P_IP))
+		ipproto = parse_ipv4(data, &nh_off, data_end, &key);
+	else if (h_proto == htons(ETH_P_IPV6))
+		ipproto = parse_ipv6(data, &nh_off, data_end, &key);
+	else
+		ipproto = 0;
+	if (ipproto < 0)
+		return XDP_DROP;
+	key.ip.proto = ipproto;
+
+	if (ipproto == IPPROTO_TCP) {
+		struct tcphdr *th = data + nh_off;
+
+		if (th + 1 > data_end)
+			return XDP_DROP;
+
+		key.l4port.src = th->source;
+		key.l4port.dst = th->dest;
+		key.tcp.flags = (*(__be16 *)&tcp_flag_word(th) & htons(0x0FFF));
+	} else if (ipproto == IPPROTO_UDP) {
+		struct udphdr *uh = data + nh_off;
+
+		if (uh + 1 > data_end)
+			return XDP_DROP;
+
+		key.l4port.src = uh->source;
+		key.l4port.dst = uh->dest;
+	}
+
+	head = bpf_map_lookup_elem(&flow_masks_head, &zero);
+	if (!head)
+		return XDP_PASS;
+
+	for_each_flow_mask(entry, head, idx, cnt) {
+		struct xdp_flow_key mkey;
+		void *flow_table;
+
+		flow_table = bpf_map_lookup_elem(&flow_tables, &idx);
+		if (!flow_table)
+			return XDP_ABORTED;
+
+		flow_mask(&mkey, &key, &entry->mask);
+		actions = bpf_map_lookup_elem(flow_table, &mkey);
+		if (actions)
+			break;
+	}
+
+	if (!actions)
+		return XDP_PASS;
+
+	for (action_idx = 0;
+	     action_idx < actions->num_actions &&
+	     action_idx < MAX_XDP_FLOW_ACTIONS;
+	     action_idx++) {
+		struct xdp_flow_action *action;
+		int act;
+
+		action = &actions->actions[action_idx];
+
+		switch (action->id) {
+		case XDP_FLOW_ACTION_ACCEPT:
+			return action_accept();
+		case XDP_FLOW_ACTION_DROP:
+			return action_drop();
+		case XDP_FLOW_ACTION_REDIRECT:
+			return action_redirect(action);
+		case XDP_FLOW_ACTION_VLAN_PUSH:
+			act = action_vlan_push(ctx, action);
+			break;
+		case XDP_FLOW_ACTION_VLAN_POP:
+			act = action_vlan_pop(ctx, action);
+			break;
+		case XDP_FLOW_ACTION_VLAN_MANGLE:
+			act = action_vlan_mangle(ctx, action);
+			break;
+		case XDP_FLOW_ACTION_MANGLE:
+			act = action_mangle(ctx, action);
+			break;
+		case XDP_FLOW_ACTION_CSUM:
+			act = action_csum(ctx, action);
+			break;
+		default:
+			return XDP_ABORTED;
+		}
+		if (act != _XDP_CONTINUE)
+			return act;
+	}
+
+	return XDP_ABORTED;
 }
 
 char _license[] SEC("license") = "GPL";
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 58+ messages in thread

* [RFC PATCH v2 bpf-next 09/15] xdp_flow: Implement flow replacement/deletion logic in xdp_flow kmod
  2019-10-18  4:07 [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP Toshiaki Makita
                   ` (7 preceding siblings ...)
  2019-10-18  4:07 ` [RFC PATCH v2 bpf-next 08/15] xdp_flow: Add flow handling and basic actions in bpf prog Toshiaki Makita
@ 2019-10-18  4:07 ` Toshiaki Makita
  2019-10-18  4:07 ` [RFC PATCH v2 bpf-next 10/15] xdp_flow: Add netdev feature for enabling flow offload to XDP Toshiaki Makita
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 58+ messages in thread
From: Toshiaki Makita @ 2019-10-18  4:07 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, Pablo Neira Ayuso, Jozsef Kadlecsik,
	Florian Westphal, Pravin B Shelar
  Cc: Toshiaki Makita, netdev, bpf, William Tu, Stanislav Fomichev

As struct flow_rule has discrete storage for the flow_dissector and the
key/mask containers, we need to serialize them in some way to pass them
to the UMH.

Convert the flow_rule into the flow key form used in the xdp_flow bpf
prog and pass that.
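The conversion is a field-by-field copy from the dissector's separate
match containers into one flat key/mask pair sharing a single layout,
in the spirit of xdp_flow_parse_ports() below. A reduced sketch with
hypothetical types:

```c
#include <assert.h>
#include <stdint.h>

/* Flat key form handed to the UMH: key and mask share one layout. */
struct flat_key {
	uint16_t sport;
	uint16_t dport;
};

/* Stand-in for the dissector's separate key/mask containers. */
struct ports_match {
	const struct flat_key *key;
	const struct flat_key *mask;
};

/* Serialize one match container into the flat key/mask pair. */
static void parse_ports(struct flat_key *key, struct flat_key *mask,
			const struct ports_match *m)
{
	key->sport  = m->key->sport;
	key->dport  = m->key->dport;
	mask->sport = m->mask->sport;
	mask->dport = m->mask->dport;
}
```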

Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
---
 net/xdp_flow/xdp_flow_kern_mod.c | 331 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 329 insertions(+), 2 deletions(-)

diff --git a/net/xdp_flow/xdp_flow_kern_mod.c b/net/xdp_flow/xdp_flow_kern_mod.c
index 2c80590..e70a86a 100644
--- a/net/xdp_flow/xdp_flow_kern_mod.c
+++ b/net/xdp_flow/xdp_flow_kern_mod.c
@@ -3,8 +3,10 @@
 #include <linux/module.h>
 #include <linux/umh.h>
 #include <linux/sched/signal.h>
+#include <linux/etherdevice.h>
 #include <linux/rhashtable.h>
 #include <linux/rtnetlink.h>
+#include <linux/if_vlan.h>
 #include <linux/filter.h>
 #include "xdp_flow.h"
 #include "msgfmt.h"
@@ -24,9 +26,261 @@ struct xdp_flow_prog {
 
 static struct rhashtable progs;
 
+struct xdp_flow_rule {
+	struct rhash_head ht_node;
+	unsigned long cookie;
+	struct xdp_flow_key key;
+	struct xdp_flow_key mask;
+};
+
+static const struct rhashtable_params rules_params = {
+	.key_len = sizeof(unsigned long),
+	.key_offset = offsetof(struct xdp_flow_rule, cookie),
+	.head_offset = offsetof(struct xdp_flow_rule, ht_node),
+	.automatic_shrinking = true,
+};
+
+static struct rhashtable rules;
+
 extern char xdp_flow_umh_start;
 extern char xdp_flow_umh_end;
 
+static int xdp_flow_parse_actions(struct xdp_flow_actions *actions,
+				  struct flow_action *flow_action,
+				  struct netlink_ext_ack *extack)
+{
+	const struct flow_action_entry *act;
+	int i;
+
+	if (!flow_action_has_entries(flow_action))
+		return 0;
+
+	if (flow_action->num_entries > MAX_XDP_FLOW_ACTIONS)
+		return -ENOBUFS;
+
+	flow_action_for_each(i, act, flow_action) {
+		struct xdp_flow_action *action = &actions->actions[i];
+
+		switch (act->id) {
+		case FLOW_ACTION_ACCEPT:
+			action->id = XDP_FLOW_ACTION_ACCEPT;
+			break;
+		case FLOW_ACTION_DROP:
+			action->id = XDP_FLOW_ACTION_DROP;
+			break;
+		case FLOW_ACTION_REDIRECT:
+		case FLOW_ACTION_VLAN_PUSH:
+		case FLOW_ACTION_VLAN_POP:
+		case FLOW_ACTION_VLAN_MANGLE:
+		case FLOW_ACTION_MANGLE:
+		case FLOW_ACTION_CSUM:
+			/* TODO: implement these */
+			/* fall through */
+		default:
+			NL_SET_ERR_MSG_MOD(extack, "Unsupported action");
+			return -EOPNOTSUPP;
+		}
+	}
+	actions->num_actions = flow_action->num_entries;
+
+	return 0;
+}
+
+static int xdp_flow_parse_ports(struct xdp_flow_key *key,
+				struct xdp_flow_key *mask,
+				struct flow_cls_offload *f, u8 ip_proto)
+{
+	const struct flow_rule *rule = flow_cls_offload_flow_rule(f);
+	struct flow_match_ports match;
+
+	if (!flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_PORTS))
+		return 0;
+
+	if (ip_proto != IPPROTO_TCP && ip_proto != IPPROTO_UDP) {
+		NL_SET_ERR_MSG_MOD(f->common.extack,
+				   "Only UDP and TCP keys are supported");
+		return -EINVAL;
+	}
+
+	flow_rule_match_ports(rule, &match);
+
+	key->l4port.src = match.key->src;
+	mask->l4port.src = match.mask->src;
+	key->l4port.dst = match.key->dst;
+	mask->l4port.dst = match.mask->dst;
+
+	return 0;
+}
+
+static int xdp_flow_parse_tcp(struct xdp_flow_key *key,
+			      struct xdp_flow_key *mask,
+			      struct flow_cls_offload *f, u8 ip_proto)
+{
+	const struct flow_rule *rule = flow_cls_offload_flow_rule(f);
+	struct flow_match_tcp match;
+
+	if (!flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_TCP))
+		return 0;
+
+	if (ip_proto != IPPROTO_TCP) {
+		NL_SET_ERR_MSG_MOD(f->common.extack,
+				   "TCP keys supported only for TCP");
+		return -EINVAL;
+	}
+
+	flow_rule_match_tcp(rule, &match);
+
+	key->tcp.flags = match.key->flags;
+	mask->tcp.flags = match.mask->flags;
+
+	return 0;
+}
+
+static int xdp_flow_parse_ip(struct xdp_flow_key *key,
+			     struct xdp_flow_key *mask,
+			     struct flow_cls_offload *f, __be16 n_proto)
+{
+	const struct flow_rule *rule = flow_cls_offload_flow_rule(f);
+	struct flow_match_ip match;
+
+	if (!flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_IP))
+		return 0;
+
+	if (n_proto != htons(ETH_P_IP) && n_proto != htons(ETH_P_IPV6)) {
+		NL_SET_ERR_MSG_MOD(f->common.extack,
+				   "IP keys supported only for IPv4/6");
+		return -EINVAL;
+	}
+
+	flow_rule_match_ip(rule, &match);
+
+	key->ip.ttl = match.key->ttl;
+	mask->ip.ttl = match.mask->ttl;
+	key->ip.tos = match.key->tos;
+	mask->ip.tos = match.mask->tos;
+
+	return 0;
+}
+
+static int xdp_flow_parse(struct xdp_flow_key *key, struct xdp_flow_key *mask,
+			  struct xdp_flow_actions *actions,
+			  struct flow_cls_offload *f)
+{
+	struct flow_rule *rule = flow_cls_offload_flow_rule(f);
+	struct flow_dissector *dissector = rule->match.dissector;
+	__be16 n_proto = 0, n_proto_mask = 0;
+	u16 addr_type = 0;
+	u8 ip_proto = 0;
+	int err;
+
+	if (dissector->used_keys &
+	    ~(BIT(FLOW_DISSECTOR_KEY_CONTROL) |
+	      BIT(FLOW_DISSECTOR_KEY_BASIC) |
+	      BIT(FLOW_DISSECTOR_KEY_ETH_ADDRS) |
+	      BIT(FLOW_DISSECTOR_KEY_IPV4_ADDRS) |
+	      BIT(FLOW_DISSECTOR_KEY_IPV6_ADDRS) |
+	      BIT(FLOW_DISSECTOR_KEY_PORTS) |
+	      BIT(FLOW_DISSECTOR_KEY_TCP) |
+	      BIT(FLOW_DISSECTOR_KEY_IP) |
+	      BIT(FLOW_DISSECTOR_KEY_VLAN))) {
+		NL_SET_ERR_MSG_MOD(f->common.extack, "Unsupported key");
+		return -EOPNOTSUPP;
+	}
+
+	if (flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_CONTROL)) {
+		struct flow_match_control match;
+
+		flow_rule_match_control(rule, &match);
+		addr_type = match.key->addr_type;
+	}
+
+	if (flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_BASIC)) {
+		struct flow_match_basic match;
+
+		flow_rule_match_basic(rule, &match);
+
+		n_proto = match.key->n_proto;
+		n_proto_mask = match.mask->n_proto;
+		if (n_proto == htons(ETH_P_ALL)) {
+			n_proto = 0;
+			n_proto_mask = 0;
+		}
+
+		key->eth.type = n_proto;
+		mask->eth.type = n_proto_mask;
+
+		if (match.mask->ip_proto) {
+			ip_proto = match.key->ip_proto;
+			key->ip.proto = ip_proto;
+			mask->ip.proto = match.mask->ip_proto;
+		}
+	}
+
+	if (flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_ETH_ADDRS)) {
+		struct flow_match_eth_addrs match;
+
+		flow_rule_match_eth_addrs(rule, &match);
+
+		ether_addr_copy(key->eth.dst, match.key->dst);
+		ether_addr_copy(mask->eth.dst, match.mask->dst);
+		ether_addr_copy(key->eth.src, match.key->src);
+		ether_addr_copy(mask->eth.src, match.mask->src);
+	}
+
+	if (flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_VLAN)) {
+		struct flow_match_vlan match;
+
+		flow_rule_match_vlan(rule, &match);
+
+		key->vlan.tpid = match.key->vlan_tpid;
+		mask->vlan.tpid = match.mask->vlan_tpid;
+		key->vlan.tci = htons(match.key->vlan_id |
+				      (match.key->vlan_priority <<
+				       VLAN_PRIO_SHIFT));
+		mask->vlan.tci = htons(match.mask->vlan_id |
+				       (match.mask->vlan_priority <<
+					VLAN_PRIO_SHIFT));
+	}
+
+	if (addr_type == FLOW_DISSECTOR_KEY_IPV4_ADDRS) {
+		struct flow_match_ipv4_addrs match;
+
+		flow_rule_match_ipv4_addrs(rule, &match);
+
+		key->ipv4.src = match.key->src;
+		mask->ipv4.src = match.mask->src;
+		key->ipv4.dst = match.key->dst;
+		mask->ipv4.dst = match.mask->dst;
+	}
+
+	if (addr_type == FLOW_DISSECTOR_KEY_IPV6_ADDRS) {
+		struct flow_match_ipv6_addrs match;
+
+		flow_rule_match_ipv6_addrs(rule, &match);
+
+		key->ipv6.src = match.key->src;
+		mask->ipv6.src = match.mask->src;
+		key->ipv6.dst = match.key->dst;
+		mask->ipv6.dst = match.mask->dst;
+	}
+
+	err = xdp_flow_parse_ports(key, mask, f, ip_proto);
+	if (err)
+		return err;
+	err = xdp_flow_parse_tcp(key, mask, f, ip_proto);
+	if (err)
+		return err;
+
+	err = xdp_flow_parse_ip(key, mask, f, n_proto);
+	if (err)
+		return err;
+
+	/* TODO: encapsulation related tasks */
+
+	return xdp_flow_parse_actions(actions, &rule->action,
+				      f->common.extack);
+}
+
 static void shutdown_umh(void)
 {
 	struct task_struct *tsk;
@@ -77,12 +331,78 @@ static int transact_umh(struct mbox_request *req, u32 *id)
 
 static int xdp_flow_replace(struct net_device *dev, struct flow_cls_offload *f)
 {
-	return -EOPNOTSUPP;
+	struct xdp_flow_rule *rule;
+	struct mbox_request *req;
+	int err;
+
+	req = kzalloc(sizeof(*req), GFP_KERNEL);
+	if (!req)
+		return -ENOMEM;
+
+	rule = kzalloc(sizeof(*rule), GFP_KERNEL);
+	if (!rule) {
+		err = -ENOMEM;
+		goto out;
+	}
+
+	req->flow.priority = f->common.prio >> 16;
+	err = xdp_flow_parse(&req->flow.key, &req->flow.mask,
+			     &req->flow.actions, f);
+	if (err)
+		goto err_rule;
+
+	rule->cookie = f->cookie;
+	rule->key = req->flow.key;
+	rule->mask = req->flow.mask;
+	err = rhashtable_insert_fast(&rules, &rule->ht_node, rules_params);
+	if (err)
+		goto err_rule;
+
+	req->cmd = XDP_FLOW_CMD_REPLACE;
+	req->ifindex = dev->ifindex;
+	err = transact_umh(req, NULL);
+	if (err)
+		goto err_rht;
+out:
+	kfree(req);
+
+	return err;
+err_rht:
+	rhashtable_remove_fast(&rules, &rule->ht_node, rules_params);
+err_rule:
+	kfree(rule);
+	goto out;
 }
 
 static int xdp_flow_destroy(struct net_device *dev, struct flow_cls_offload *f)
 {
-	return -EOPNOTSUPP;
+	struct xdp_flow_rule *rule;
+	struct mbox_request *req;
+	int err;
+
+	rule = rhashtable_lookup_fast(&rules, &f->cookie, rules_params);
+	if (!rule)
+		return 0;
+
+	req = kzalloc(sizeof(*req), GFP_KERNEL);
+	if (!req)
+		return -ENOMEM;
+
+	req->flow.priority = f->common.prio >> 16;
+	req->flow.key = rule->key;
+	req->flow.mask = rule->mask;
+	req->cmd = XDP_FLOW_CMD_DELETE;
+	req->ifindex = dev->ifindex;
+	err = transact_umh(req, NULL);
+
+	kfree(req);
+
+	if (!err) {
+		rhashtable_remove_fast(&rules, &rule->ht_node, rules_params);
+		kfree(rule);
+	}
+
+	return err;
 }
 
 static int xdp_flow_setup_flower(struct net_device *dev,
@@ -308,6 +628,10 @@ static int __init load_umh(void)
 	if (err)
 		return err;
 
+	err = rhashtable_init(&rules, &rules_params);
+	if (err)
+		goto err_progs;
+
 	mutex_lock(&xdp_flow_ops.lock);
 	if (!xdp_flow_ops.stop) {
 		err = -EFAULT;
@@ -327,6 +651,8 @@ static int __init load_umh(void)
 	return 0;
 err:
 	mutex_unlock(&xdp_flow_ops.lock);
+	rhashtable_destroy(&rules);
+err_progs:
 	rhashtable_destroy(&progs);
 	return err;
 }
@@ -340,6 +666,7 @@ static void __exit fini_umh(void)
 	xdp_flow_ops.setup = NULL;
 	xdp_flow_ops.setup_cb = NULL;
 	mutex_unlock(&xdp_flow_ops.lock);
+	rhashtable_destroy(&rules);
 	rhashtable_destroy(&progs);
 }
 module_init(load_umh);
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 58+ messages in thread
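
[Editor's note] The separate key and mask built by xdp_flow_parse() in the
patch above follow the usual masked-match convention: a packet matches a flow
entry when the packet's key agrees with the flow's key on every bit set in the
mask. A minimal userspace sketch of that check (struct demo_key and
flow_matches() are illustrative stand-ins, not the real struct xdp_flow_key):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical two-field key; the real struct xdp_flow_key carries
 * eth/vlan/ip/port members. Zero-initialize instances so padding
 * bytes compare cleanly. */
struct demo_key {
	uint16_t eth_type;
	uint8_t  ip_proto;
};

/* A packet matches a flow when every byte agrees under the mask. */
static int flow_matches(const struct demo_key *pkt,
			const struct demo_key *key,
			const struct demo_key *mask)
{
	const uint8_t *p = (const uint8_t *)pkt;
	const uint8_t *k = (const uint8_t *)key;
	const uint8_t *m = (const uint8_t *)mask;
	size_t i;

	for (i = 0; i < sizeof(struct demo_key); i++)
		if ((p[i] & m[i]) != (k[i] & m[i]))
			return 0;
	return 1;
}
```

The BPF-side lookup in this series walks a list of masks and probes one hash
table per mask (megaflow style), but each probe reduces to this masked
comparison.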

* [RFC PATCH v2 bpf-next 10/15] xdp_flow: Add netdev feature for enabling flow offload to XDP
  2019-10-18  4:07 [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP Toshiaki Makita
                   ` (8 preceding siblings ...)
  2019-10-18  4:07 ` [RFC PATCH v2 bpf-next 09/15] xdp_flow: Implement flow replacement/deletion logic in xdp_flow kmod Toshiaki Makita
@ 2019-10-18  4:07 ` Toshiaki Makita
  2019-10-18  4:07 ` [RFC PATCH v2 bpf-next 11/15] xdp_flow: Implement redirect action Toshiaki Makita
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 58+ messages in thread
From: Toshiaki Makita @ 2019-10-18  4:07 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, Pablo Neira Ayuso, Jozsef Kadlecsik,
	Florian Westphal, Pravin B Shelar
  Cc: Toshiaki Makita, netdev, bpf, William Tu, Stanislav Fomichev

The usage would be like this:

 $ ethtool -K eth0 flow-offload-xdp on
 $ tc qdisc add dev eth0 clsact
 $ tc filter add dev eth0 ingress protocol ip flower ...

Then the filters offloaded to XDP are marked as "in_hw".

xdp_flow uses the indirect block mechanism to handle the newly added
feature.

Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
---
 include/linux/netdev_features.h  |  2 ++
 net/core/dev.c                   |  2 ++
 net/core/ethtool.c               |  1 +
 net/xdp_flow/xdp_flow.h          |  5 ++++
 net/xdp_flow/xdp_flow_core.c     | 55 +++++++++++++++++++++++++++++++++++++++-
 net/xdp_flow/xdp_flow_kern_mod.c |  6 +++++
 6 files changed, 70 insertions(+), 1 deletion(-)

diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
index 4b19c54..1063511 100644
--- a/include/linux/netdev_features.h
+++ b/include/linux/netdev_features.h
@@ -80,6 +80,7 @@ enum {
 
 	NETIF_F_GRO_HW_BIT,		/* Hardware Generic receive offload */
 	NETIF_F_HW_TLS_RECORD_BIT,	/* Offload TLS record */
+	NETIF_F_XDP_FLOW_BIT,		/* Offload flow to XDP */
 
 	/*
 	 * Add your fresh new feature above and remember to update
@@ -150,6 +151,7 @@ enum {
 #define NETIF_F_GSO_UDP_L4	__NETIF_F(GSO_UDP_L4)
 #define NETIF_F_HW_TLS_TX	__NETIF_F(HW_TLS_TX)
 #define NETIF_F_HW_TLS_RX	__NETIF_F(HW_TLS_RX)
+#define NETIF_F_XDP_FLOW	__NETIF_F(XDP_FLOW)
 
 /* Finds the next feature with the highest number of the range of start till 0.
  */
diff --git a/net/core/dev.c b/net/core/dev.c
index 9965675..62e0469 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -9035,6 +9035,8 @@ int register_netdevice(struct net_device *dev)
 	 * software offloads (GSO and GRO).
 	 */
 	dev->hw_features |= NETIF_F_SOFT_FEATURES;
+	if (IS_ENABLED(CONFIG_XDP_FLOW) && dev->netdev_ops->ndo_bpf)
+		dev->hw_features |= NETIF_F_XDP_FLOW;
 	dev->features |= NETIF_F_SOFT_FEATURES;
 
 	if (dev->netdev_ops->ndo_udp_tunnel_add) {
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index c763106..200aa96 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -111,6 +111,7 @@ int ethtool_op_get_ts_info(struct net_device *dev, struct ethtool_ts_info *info)
 	[NETIF_F_HW_TLS_RECORD_BIT] =	"tls-hw-record",
 	[NETIF_F_HW_TLS_TX_BIT] =	 "tls-hw-tx-offload",
 	[NETIF_F_HW_TLS_RX_BIT] =	 "tls-hw-rx-offload",
+	[NETIF_F_XDP_FLOW_BIT] =	 "flow-offload-xdp",
 };
 
 static const char
diff --git a/net/xdp_flow/xdp_flow.h b/net/xdp_flow/xdp_flow.h
index 656ceab..58f8a229 100644
--- a/net/xdp_flow/xdp_flow.h
+++ b/net/xdp_flow/xdp_flow.h
@@ -20,4 +20,9 @@ struct xdp_flow_umh_ops {
 
 extern struct xdp_flow_umh_ops xdp_flow_ops;
 
+static inline bool xdp_flow_enabled(const struct net_device *dev)
+{
+	return dev->features & NETIF_F_XDP_FLOW;
+}
+
 #endif
diff --git a/net/xdp_flow/xdp_flow_core.c b/net/xdp_flow/xdp_flow_core.c
index 8265aef..f402427 100644
--- a/net/xdp_flow/xdp_flow_core.c
+++ b/net/xdp_flow/xdp_flow_core.c
@@ -20,7 +20,8 @@ static void xdp_flow_block_release(void *cb_priv)
 	mutex_unlock(&xdp_flow_ops.lock);
 }
 
-int xdp_flow_setup_block(struct net_device *dev, struct flow_block_offload *f)
+static int xdp_flow_setup_block(struct net_device *dev,
+				struct flow_block_offload *f)
 {
 	struct flow_block_cb *block_cb;
 	int err = 0;
@@ -32,6 +33,9 @@ int xdp_flow_setup_block(struct net_device *dev, struct flow_block_offload *f)
 	if (f->binder_type != FLOW_BLOCK_BINDER_TYPE_CLSACT_INGRESS)
 		return -EOPNOTSUPP;
 
+	if (f->command == FLOW_BLOCK_BIND && !xdp_flow_enabled(dev))
+		return -EOPNOTSUPP;
+
 	mutex_lock(&xdp_flow_ops.lock);
 	if (!xdp_flow_ops.module) {
 		mutex_unlock(&xdp_flow_ops.lock);
@@ -105,6 +109,50 @@ int xdp_flow_setup_block(struct net_device *dev, struct flow_block_offload *f)
 	return err;
 }
 
+static int xdp_flow_indr_setup_cb(struct net_device *dev, void *cb_priv,
+				  enum tc_setup_type type, void *type_data)
+{
+	switch (type) {
+	case TC_SETUP_BLOCK:
+		return xdp_flow_setup_block(dev, type_data);
+	default:
+		return -EOPNOTSUPP;
+	}
+}
+
+static int xdp_flow_netdevice_event(struct notifier_block *nb,
+				    unsigned long event, void *ptr)
+{
+	struct net_device *dev = netdev_notifier_info_to_dev(ptr);
+	int err;
+
+	if (!dev->netdev_ops->ndo_bpf)
+		return NOTIFY_DONE;
+
+	switch (event) {
+	case NETDEV_REGISTER:
+		err = __flow_indr_block_cb_register(dev, NULL,
+						    xdp_flow_indr_setup_cb,
+						    dev);
+		if (err) {
+			netdev_err(dev,
+				   "Failed to register indirect block setup callback: %d\n",
+				   err);
+		}
+		break;
+	case NETDEV_UNREGISTER:
+		__flow_indr_block_cb_unregister(dev, xdp_flow_indr_setup_cb,
+						dev);
+		break;
+	}
+
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block xdp_flow_notifier_block __read_mostly = {
+	.notifier_call = xdp_flow_netdevice_event,
+};
+
 static void xdp_flow_umh_cleanup(struct umh_info *info)
 {
 	mutex_lock(&xdp_flow_ops.lock);
@@ -117,6 +165,11 @@ static void xdp_flow_umh_cleanup(struct umh_info *info)
 
 static int __init xdp_flow_init(void)
 {
+	int err = register_netdevice_notifier(&xdp_flow_notifier_block);
+
+	if (err)
+		return err;
+
 	mutex_init(&xdp_flow_ops.lock);
 	xdp_flow_ops.stop = true;
 	xdp_flow_ops.info.cmdline = "xdp_flow_umh";
diff --git a/net/xdp_flow/xdp_flow_kern_mod.c b/net/xdp_flow/xdp_flow_kern_mod.c
index e70a86a..ce8a75b 100644
--- a/net/xdp_flow/xdp_flow_kern_mod.c
+++ b/net/xdp_flow/xdp_flow_kern_mod.c
@@ -335,6 +335,12 @@ static int xdp_flow_replace(struct net_device *dev, struct flow_cls_offload *f)
 	struct mbox_request *req;
 	int err;
 
+	if (!xdp_flow_enabled(dev)) {
+		NL_SET_ERR_MSG_MOD(f->common.extack,
+				   "flow-offload-xdp is disabled on net device");
+		return -EOPNOTSUPP;
+	}
+
 	req = kzalloc(sizeof(*req), GFP_KERNEL);
 	if (!req)
 		return -ENOMEM;
-- 
1.8.3.1



* [RFC PATCH v2 bpf-next 11/15] xdp_flow: Implement redirect action
  2019-10-18  4:07 [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP Toshiaki Makita
                   ` (9 preceding siblings ...)
  2019-10-18  4:07 ` [RFC PATCH v2 bpf-next 10/15] xdp_flow: Add netdev feature for enabling flow offload to XDP Toshiaki Makita
@ 2019-10-18  4:07 ` Toshiaki Makita
  2019-10-18  4:07 ` [RFC PATCH v2 bpf-next 12/15] xdp_flow: Implement vlan_push action Toshiaki Makita
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 58+ messages in thread
From: Toshiaki Makita @ 2019-10-18  4:07 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, Pablo Neira Ayuso, Jozsef Kadlecsik,
	Florian Westphal, Pravin B Shelar
  Cc: Toshiaki Makita, netdev, bpf, William Tu, Stanislav Fomichev

Add a devmap for XDP_REDIRECT and use it for the redirect action.
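
[Editor's note] In this patch, free devmap indices are handed out by a
round-robin scan that resumes from the last allocated position. A simplified,
array-backed userspace model of that allocator (the demo_* names are
illustrative; the real code in xdp_flow_umh.c tracks used indices in a
hashtable over MAX_PORTS entries):

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_MAX_PORTS 4	/* the real code uses MAX_PORTS (65536) */

static bool demo_used[DEMO_MAX_PORTS];
static int demo_next;		/* mirrors max_devmap_idx */

/* Scan at most DEMO_MAX_PORTS slots starting from the last position;
 * return a free index, or -1 when every slot is taken. */
static int demo_get_new_idx(void)
{
	int offset;

	for (offset = 0; offset < DEMO_MAX_PORTS; offset++) {
		int idx = demo_next++;

		if (demo_next >= DEMO_MAX_PORTS)
			demo_next -= DEMO_MAX_PORTS;

		if (!demo_used[idx]) {
			demo_used[idx] = true;
			return idx;
		}
	}
	return -1;
}

static void demo_put_idx(int idx)
{
	if (idx >= 0 && idx < DEMO_MAX_PORTS)
		demo_used[idx] = false;
}
```

Resuming the scan from the last position, rather than always starting at
index 0, keeps a just-freed slot out of circulation for as long as possible,
which plausibly lowers the chance of an in-flight redirect hitting a reused
index.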

Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
---
 net/xdp_flow/umh_bpf.h           |   1 +
 net/xdp_flow/xdp_flow_kern_bpf.c |  14 +++-
 net/xdp_flow/xdp_flow_kern_mod.c |  14 ++++
 net/xdp_flow/xdp_flow_umh.c      | 164 +++++++++++++++++++++++++++++++++++++--
 4 files changed, 186 insertions(+), 7 deletions(-)

diff --git a/net/xdp_flow/umh_bpf.h b/net/xdp_flow/umh_bpf.h
index 4e4633f..a279d0a1 100644
--- a/net/xdp_flow/umh_bpf.h
+++ b/net/xdp_flow/umh_bpf.h
@@ -4,6 +4,7 @@
 
 #include "msgfmt.h"
 
+#define MAX_PORTS 65536
 #define MAX_FLOWS 1024
 #define MAX_FLOW_MASKS 255
 #define FLOW_MASKS_TAIL 255
diff --git a/net/xdp_flow/xdp_flow_kern_bpf.c b/net/xdp_flow/xdp_flow_kern_bpf.c
index f4a6346..381d67e 100644
--- a/net/xdp_flow/xdp_flow_kern_bpf.c
+++ b/net/xdp_flow/xdp_flow_kern_bpf.c
@@ -22,6 +22,13 @@ struct bpf_map_def SEC("maps") debug_stats = {
 	.max_entries = 256,
 };
 
+struct bpf_map_def SEC("maps") output_map = {
+	.type = BPF_MAP_TYPE_DEVMAP,
+	.key_size = sizeof(int),
+	.value_size = sizeof(int),
+	.max_entries = MAX_PORTS,
+};
+
 struct bpf_map_def SEC("maps") flow_masks_head = {
 	.type = BPF_MAP_TYPE_ARRAY,
 	.key_size = sizeof(u32),
@@ -71,10 +78,13 @@ static inline int action_drop(void)
 
 static inline int action_redirect(struct xdp_flow_action *action)
 {
+	int tx_port;
+
 	account_action(XDP_FLOW_ACTION_REDIRECT);
 
-	// TODO: implement this
-	return XDP_ABORTED;
+	tx_port = action->ifindex;
+
+	return bpf_redirect_map(&output_map, tx_port, 0);
 }
 
 static inline int action_vlan_push(struct xdp_md *ctx,
diff --git a/net/xdp_flow/xdp_flow_kern_mod.c b/net/xdp_flow/xdp_flow_kern_mod.c
index ce8a75b..2581b81 100644
--- a/net/xdp_flow/xdp_flow_kern_mod.c
+++ b/net/xdp_flow/xdp_flow_kern_mod.c
@@ -69,6 +69,20 @@ static int xdp_flow_parse_actions(struct xdp_flow_actions *actions,
 			action->id = XDP_FLOW_ACTION_DROP;
 			break;
 		case FLOW_ACTION_REDIRECT:
+			if (!act->dev->netdev_ops->ndo_xdp_xmit) {
+				NL_SET_ERR_MSG_MOD(extack,
+						   "Redirect target interface does not support ndo_xdp_xmit");
+				return -EOPNOTSUPP;
+			}
+			if (!rhashtable_lookup_fast(&progs, &act->dev,
+						    progs_params)) {
+				NL_SET_ERR_MSG_MOD(extack,
+						   "xdp_flow must be set up on the redirect target interface in advance");
+				return -EINVAL;
+			}
+			action->id = XDP_FLOW_ACTION_REDIRECT;
+			action->ifindex = act->dev->ifindex;
+			break;
 		case FLOW_ACTION_VLAN_PUSH:
 		case FLOW_ACTION_VLAN_POP:
 		case FLOW_ACTION_VLAN_MANGLE:
diff --git a/net/xdp_flow/xdp_flow_umh.c b/net/xdp_flow/xdp_flow_umh.c
index 0588a36..54a7f10 100644
--- a/net/xdp_flow/xdp_flow_umh.c
+++ b/net/xdp_flow/xdp_flow_umh.c
@@ -18,6 +18,7 @@
 extern char xdp_flow_bpf_start;
 extern char xdp_flow_bpf_end;
 int progfile_fd;
+int output_map_fd;
 FILE *kmsg;
 
 #define zalloc(size) calloc(1, (size))
@@ -44,12 +45,22 @@ struct netdev_info {
 	struct netdev_info_key key;
 	struct hlist_node node;
 	struct bpf_object *obj;
+	int devmap_idx;
 	int free_slot_top;
 	int free_slots[MAX_FLOW_MASKS];
 };
 
 DEFINE_HASHTABLE(netdev_info_table, 16);
 
+struct devmap_idx_node {
+	int devmap_idx;
+	struct hlist_node node;
+};
+
+DEFINE_HASHTABLE(devmap_idx_table, 16);
+
+int max_devmap_idx;
+
 static int libbpf_err(int err, char *errbuf)
 {
 	libbpf_strerror(err, errbuf, ERRBUF_SIZE);
@@ -94,6 +105,15 @@ static int setup(void)
 		goto err;
 	}
 
+	output_map_fd = bpf_create_map(BPF_MAP_TYPE_DEVMAP, sizeof(int),
+				       sizeof(int), MAX_PORTS, 0);
+	if (output_map_fd < 0) {
+		err = -errno;
+		pr_err("map creation for output_map failed: %s\n",
+		       strerror(errno));
+		goto err;
+	}
+
 	return 0;
 err:
 	close(progfile_fd);
@@ -101,10 +121,23 @@ static int setup(void)
 	return err;
 }
 
-static int load_bpf(int ifindex, struct bpf_object **objp)
+static void delete_output_map_elem(int idx)
+{
+	char errbuf[ERRBUF_SIZE];
+	int err;
+
+	err = bpf_map_delete_elem(output_map_fd, &idx);
+	if (err) {
+		libbpf_err(err, errbuf);
+		pr_warn("Failed to delete idx %d from output_map: %s\n",
+			idx, errbuf);
+	}
+}
+
+static int load_bpf(int ifindex, int devmap_idx, struct bpf_object **objp)
 {
 	int prog_fd, flow_tables_fd, flow_meta_fd, flow_masks_head_fd, err;
-	struct bpf_map *flow_tables, *flow_masks_head;
+	struct bpf_map *output_map, *flow_tables, *flow_masks_head;
 	int zero = 0, flow_masks_tail = FLOW_MASKS_TAIL;
 	struct bpf_object_open_attr attr = {};
 	char path[256], errbuf[ERRBUF_SIZE];
@@ -137,6 +170,27 @@ static int load_bpf(int ifindex, struct bpf_object **objp)
 	bpf_object__for_each_program(prog, obj)
 		bpf_program__set_type(prog, attr.prog_type);
 
+	output_map = bpf_object__find_map_by_name(obj, "output_map");
+	if (!output_map) {
+		pr_err("Cannot find output_map\n");
+		err = -ENOENT;
+		goto err_obj;
+	}
+
+	err = bpf_map__reuse_fd(output_map, output_map_fd);
+	if (err) {
+		err = libbpf_err(err, errbuf);
+		pr_err("Failed to reuse output_map fd: %s\n", errbuf);
+		goto err_obj;
+	}
+
+	if (bpf_map_update_elem(output_map_fd, &devmap_idx, &ifindex, 0)) {
+		err = -errno;
+		pr_err("Failed to insert idx %d if %d into output_map: %s\n",
+		       devmap_idx, ifindex, strerror(errno));
+		goto err_obj;
+	}
+
 	flow_meta_fd = bpf_create_map(BPF_MAP_TYPE_HASH,
 				      sizeof(struct xdp_flow_key),
 				      sizeof(struct xdp_flow_actions),
@@ -226,6 +280,8 @@ static int load_bpf(int ifindex, struct bpf_object **objp)
 
 	return prog_fd;
 err:
+	delete_output_map_elem(devmap_idx);
+err_obj:
 	bpf_object__close(obj);
 	return err;
 }
@@ -276,6 +332,56 @@ static struct netdev_info *get_netdev_info(const struct mbox_request *req)
 	return netdev_info;
 }
 
+static struct devmap_idx_node *find_devmap_idx(int devmap_idx)
+{
+	struct devmap_idx_node *node;
+
+	hash_for_each_possible(devmap_idx_table, node, node, devmap_idx) {
+		if (node->devmap_idx == devmap_idx)
+			return node;
+	}
+
+	return NULL;
+}
+
+static int get_new_devmap_idx(void)
+{
+	int offset;
+
+	for (offset = 0; offset < MAX_PORTS; offset++) {
+		int devmap_idx = max_devmap_idx++;
+
+		if (max_devmap_idx >= MAX_PORTS)
+			max_devmap_idx -= MAX_PORTS;
+
+		if (!find_devmap_idx(devmap_idx)) {
+			struct devmap_idx_node *node;
+
+			node = malloc(sizeof(*node));
+			if (!node) {
+				pr_err("malloc for devmap_idx failed\n");
+				return -ENOMEM;
+			}
+			node->devmap_idx = devmap_idx;
+			hash_add(devmap_idx_table, &node->node, devmap_idx);
+
+			return devmap_idx;
+		}
+	}
+
+	return -ENOSPC;
+}
+
+static void delete_devmap_idx(int devmap_idx)
+{
+	struct devmap_idx_node *node = find_devmap_idx(devmap_idx);
+
+	if (node) {
+		hash_del(&node->node);
+		free(node);
+	}
+}
+
 static void init_flow_masks_free_slot(struct netdev_info *netdev_info)
 {
 	int i;
@@ -329,11 +435,11 @@ static void delete_flow_masks_free_slot(struct netdev_info *netdev_info,
 
 static int handle_load(const struct mbox_request *req, __u32 *prog_id)
 {
+	int err, prog_fd, devmap_idx = -1;
 	struct netdev_info *netdev_info;
 	struct bpf_prog_info info = {};
 	struct netdev_info_key key;
 	__u32 len = sizeof(info);
-	int err, prog_fd;
 
 	err = get_netdev_info_key(req, &key);
 	if (err)
@@ -350,12 +456,19 @@ static int handle_load(const struct mbox_request *req, __u32 *prog_id)
 	}
 	netdev_info->key.ifindex = key.ifindex;
 
+	devmap_idx = get_new_devmap_idx();
+	if (devmap_idx < 0) {
+		err = devmap_idx;
+		goto err_netdev_info;
+	}
+	netdev_info->devmap_idx = devmap_idx;
+
 	init_flow_masks_free_slot(netdev_info);
 
-	prog_fd = load_bpf(req->ifindex, &netdev_info->obj);
+	prog_fd = load_bpf(req->ifindex, devmap_idx, &netdev_info->obj);
 	if (prog_fd < 0) {
 		err = prog_fd;
-		goto err_netdev_info;
+		goto err_devmap_idx;
 	}
 
 	err = bpf_obj_get_info_by_fd(prog_fd, &info, &len);
@@ -370,6 +483,8 @@ static int handle_load(const struct mbox_request *req, __u32 *prog_id)
 	return 0;
 err_obj:
 	bpf_object__close(netdev_info->obj);
+err_devmap_idx:
+	delete_devmap_idx(devmap_idx);
 err_netdev_info:
 	free(netdev_info);
 
@@ -386,12 +501,45 @@ static int handle_unload(const struct mbox_request *req)
 
 	hash_del(&netdev_info->node);
 	bpf_object__close(netdev_info->obj);
+	delete_output_map_elem(netdev_info->devmap_idx);
+	delete_devmap_idx(netdev_info->devmap_idx);
 	free(netdev_info);
 	pr_debug("XDP program for if %d was closed\n", req->ifindex);
 
 	return 0;
 }
 
+static int convert_ifindex_to_devmap_idx(struct mbox_request *req)
+{
+	int i;
+
+	for (i = 0; i < req->flow.actions.num_actions; i++) {
+		struct xdp_flow_action *action = &req->flow.actions.actions[i];
+
+		if (action->id == XDP_FLOW_ACTION_REDIRECT) {
+			struct netdev_info *netdev_info;
+			struct netdev_info_key key;
+			int err;
+
+			err = get_netdev_info_key(req, &key);
+			if (err)
+				return err;
+			key.ifindex = action->ifindex;
+
+			netdev_info = find_netdev_info(&key);
+			if (!netdev_info) {
+				pr_err("BUG: Interface %d is not ready to be a redirect target.\n",
+				       key.ifindex);
+				return -EINVAL;
+			}
+
+			action->ifindex = netdev_info->devmap_idx;
+		}
+	}
+
+	return 0;
+}
+
 static int get_table_fd(const struct netdev_info *netdev_info,
 			const char *table_name)
 {
@@ -788,6 +936,11 @@ static int handle_replace(struct mbox_request *req)
 	if (IS_ERR(netdev_info))
 		return PTR_ERR(netdev_info);
 
+	/* TODO: Use XDP_TX for redirect action when possible */
+	err = convert_ifindex_to_devmap_idx(req);
+	if (err)
+		return err;
+
 	err = flow_table_insert_elem(netdev_info, &req->flow);
 	if (err)
 		return err;
@@ -883,6 +1036,7 @@ int main(void)
 	}
 	loop();
 	close(progfile_fd);
+	close(output_map_fd);
 	fclose(kmsg);
 
 	return 0;
-- 
1.8.3.1



* [RFC PATCH v2 bpf-next 12/15] xdp_flow: Implement vlan_push action
  2019-10-18  4:07 [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP Toshiaki Makita
                   ` (10 preceding siblings ...)
  2019-10-18  4:07 ` [RFC PATCH v2 bpf-next 11/15] xdp_flow: Implement redirect action Toshiaki Makita
@ 2019-10-18  4:07 ` Toshiaki Makita
  2019-10-18  4:07 ` [RFC PATCH v2 bpf-next 13/15] bpf, selftest: Add test for xdp_flow Toshiaki Makita
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 58+ messages in thread
From: Toshiaki Makita @ 2019-10-18  4:07 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, Pablo Neira Ayuso, Jozsef Kadlecsik,
	Florian Westphal, Pravin B Shelar
  Cc: Toshiaki Makita, netdev, bpf, William Tu, Stanislav Fomichev

This implements another example action, vlan_push.
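
[Editor's note] The vlan_push BPF action below makes head room with
bpf_xdp_adjust_head(), moves the two MAC addresses into the new head room, and
writes the 802.1Q tag behind them. The same transformation can be sketched on
a plain byte buffer (demo_vlan_push() is illustrative; it opens the gap by
shifting the tail instead of the head, which yields an equivalent layout):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_ETH_ALEN	6
#define DEMO_VLAN_HLEN	4

/* Insert a 4-byte 802.1Q tag after the MAC addresses of the frame in
 * buf. buf must have DEMO_VLAN_HLEN bytes of spare room past len;
 * tag[] holds TPID and TCI in wire order. Returns the new length. */
static size_t demo_vlan_push(uint8_t *buf, size_t len,
			     const uint8_t tag[DEMO_VLAN_HLEN])
{
	/* Shift everything after the MACs down to open a 4-byte gap. */
	memmove(buf + 2 * DEMO_ETH_ALEN + DEMO_VLAN_HLEN,
		buf + 2 * DEMO_ETH_ALEN,
		len - 2 * DEMO_ETH_ALEN);
	memcpy(buf + 2 * DEMO_ETH_ALEN, tag, DEMO_VLAN_HLEN);
	return len + DEMO_VLAN_HLEN;
}
```

Here tag[] carries TPID and TCI already in network byte order, matching the
vehdr->h_vlan_proto and vehdr->h_vlan_TCI stores in the BPF program.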

Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
---
 net/xdp_flow/xdp_flow_kern_bpf.c | 23 +++++++++++++++++++++--
 net/xdp_flow/xdp_flow_kern_mod.c |  5 +++++
 2 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/net/xdp_flow/xdp_flow_kern_bpf.c b/net/xdp_flow/xdp_flow_kern_bpf.c
index 381d67e..7930349 100644
--- a/net/xdp_flow/xdp_flow_kern_bpf.c
+++ b/net/xdp_flow/xdp_flow_kern_bpf.c
@@ -90,10 +90,29 @@ static inline int action_redirect(struct xdp_flow_action *action)
 static inline int action_vlan_push(struct xdp_md *ctx,
 				   struct xdp_flow_action *action)
 {
+	struct vlan_ethhdr *vehdr;
+	void *data, *data_end;
+	__be16 proto, tci;
+
 	account_action(XDP_FLOW_ACTION_VLAN_PUSH);
 
-	// TODO: implement this
-	return XDP_ABORTED;
+	proto = action->vlan.proto;
+	tci = action->vlan.tci;
+
+	if (bpf_xdp_adjust_head(ctx, -VLAN_HLEN))
+		return XDP_DROP;
+
+	data_end = (void *)(long)ctx->data_end;
+	data = (void *)(long)ctx->data;
+	if (data + VLAN_ETH_HLEN > data_end)
+		return XDP_DROP;
+
+	__builtin_memmove(data, data + VLAN_HLEN, ETH_ALEN * 2);
+	vehdr = data;
+	vehdr->h_vlan_proto = proto;
+	vehdr->h_vlan_TCI = tci;
+
+	return _XDP_CONTINUE;
 }
 
 static inline int action_vlan_pop(struct xdp_md *ctx,
diff --git a/net/xdp_flow/xdp_flow_kern_mod.c b/net/xdp_flow/xdp_flow_kern_mod.c
index 2581b81..7ce1733 100644
--- a/net/xdp_flow/xdp_flow_kern_mod.c
+++ b/net/xdp_flow/xdp_flow_kern_mod.c
@@ -84,6 +84,11 @@ static int xdp_flow_parse_actions(struct xdp_flow_actions *actions,
 			action->ifindex = act->dev->ifindex;
 			break;
 		case FLOW_ACTION_VLAN_PUSH:
+			action->id = XDP_FLOW_ACTION_VLAN_PUSH;
+			action->vlan.tci = act->vlan.vid |
+					   (act->vlan.prio << VLAN_PRIO_SHIFT);
+			action->vlan.proto = act->vlan.proto;
+			break;
 		case FLOW_ACTION_VLAN_POP:
 		case FLOW_ACTION_VLAN_MANGLE:
 		case FLOW_ACTION_MANGLE:
-- 
1.8.3.1



* [RFC PATCH v2 bpf-next 13/15] bpf, selftest: Add test for xdp_flow
  2019-10-18  4:07 [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP Toshiaki Makita
                   ` (11 preceding siblings ...)
  2019-10-18  4:07 ` [RFC PATCH v2 bpf-next 12/15] xdp_flow: Implement vlan_push action Toshiaki Makita
@ 2019-10-18  4:07 ` Toshiaki Makita
  2019-10-18  4:07 ` [RFC PATCH v2 bpf-next 14/15] i40e: prefetch xdp->data before running XDP prog Toshiaki Makita
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 58+ messages in thread
From: Toshiaki Makita @ 2019-10-18  4:07 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, Pablo Neira Ayuso, Jozsef Kadlecsik,
	Florian Westphal, Pravin B Shelar
  Cc: Toshiaki Makita, netdev, bpf, William Tu, Stanislav Fomichev

Check if TC flow offloading to XDP works.

Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
---
 tools/testing/selftests/bpf/Makefile         |   1 +
 tools/testing/selftests/bpf/test_xdp_flow.sh | 106 +++++++++++++++++++++++++++
 2 files changed, 107 insertions(+)
 create mode 100755 tools/testing/selftests/bpf/test_xdp_flow.sh

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 00d05c5..3db9819 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -55,6 +55,7 @@ TEST_PROGS := test_kmod.sh \
 	test_xdp_redirect.sh \
 	test_xdp_meta.sh \
 	test_xdp_veth.sh \
+	test_xdp_flow.sh \
 	test_offload.py \
 	test_sock_addr.sh \
 	test_tunnel.sh \
diff --git a/tools/testing/selftests/bpf/test_xdp_flow.sh b/tools/testing/selftests/bpf/test_xdp_flow.sh
new file mode 100755
index 0000000..6937454
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_xdp_flow.sh
@@ -0,0 +1,106 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+#
+# Create 2 namespaces with 2 veth peers, and
+# forward packets in-between using xdp_flow
+#
+# NS1(veth11)        NS2(veth22)
+#      |                  |
+#      |                  |
+#   (veth1)            (veth2)
+#      ^                  ^
+#      |     xdp_flow     |
+#      --------------------
+
+# Kselftest framework requirement - SKIP code is 4.
+ksft_skip=4
+
+TESTNAME=xdp_flow
+
+_cleanup()
+{
+	set +e
+	ip link del veth1 2> /dev/null
+	ip link del veth2 2> /dev/null
+	ip netns del ns1 2> /dev/null
+	ip netns del ns2 2> /dev/null
+}
+
+cleanup_skip()
+{
+	echo "selftests: $TESTNAME [SKIP]"
+	_cleanup
+
+	exit $ksft_skip
+}
+
+cleanup()
+{
+	if [ "$?" = 0 ]; then
+		echo "selftests: $TESTNAME [PASS]"
+	else
+		echo "selftests: $TESTNAME [FAILED]"
+	fi
+	_cleanup
+}
+
+if [ $(id -u) -ne 0 ]; then
+	echo "selftests: $TESTNAME [SKIP] Need root privileges"
+	exit $ksft_skip
+fi
+
+if ! ip link set dev lo xdp off > /dev/null 2>&1; then
+	echo "selftests: $TESTNAME [SKIP] Could not run test without the ip xdp support"
+	exit $ksft_skip
+fi
+
+set -e
+
+trap cleanup_skip EXIT
+
+ip netns add ns1
+ip netns add ns2
+
+ip link add veth1 type veth peer name veth11 netns ns1
+ip link add veth2 type veth peer name veth22 netns ns2
+
+ip link set veth1 up
+ip link set veth2 up
+
+ip -n ns1 addr add 10.1.1.11/24 dev veth11
+ip -n ns2 addr add 10.1.1.22/24 dev veth22
+
+ip -n ns1 link set dev veth11 up
+ip -n ns2 link set dev veth22 up
+
+ip -n ns1 link set dev veth11 xdp obj xdp_dummy.o sec xdp_dummy
+ip -n ns2 link set dev veth22 xdp obj xdp_dummy.o sec xdp_dummy
+
+ethtool -K veth1 flow-offload-xdp on
+ethtool -K veth2 flow-offload-xdp on
+
+trap cleanup EXIT
+
+# Adding clsact or ingress will trigger loading bpf prog in UMH
+tc qdisc add dev veth1 clsact
+tc qdisc add dev veth2 clsact
+
+# Adding filter will have UMH populate flow table map
+tc filter add dev veth1 ingress protocol ip flower \
+	dst_ip 10.1.1.0/24 action mirred egress redirect dev veth2
+tc filter add dev veth2 ingress protocol ip flower \
+	dst_ip 10.1.1.0/24 action mirred egress redirect dev veth1
+
+# flows should be offloaded when 'flow-offload-xdp' is enabled on veth
+tc filter show dev veth1 ingress | grep -q not_in_hw && false
+tc filter show dev veth2 ingress | grep -q not_in_hw && false
+
+# ARP is not supported so add filters after in_hw check
+tc filter add dev veth1 ingress protocol arp flower \
+	arp_tip 10.1.1.0/24 action mirred egress redirect dev veth2
+tc filter add dev veth2 ingress protocol arp flower \
+	arp_sip 10.1.1.0/24 action mirred egress redirect dev veth1
+
+ip netns exec ns1 ping -c 1 -W 1 10.1.1.22
+
+exit 0
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 58+ messages in thread

* [RFC PATCH v2 bpf-next 14/15] i40e: prefetch xdp->data before running XDP prog
  2019-10-18  4:07 [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP Toshiaki Makita
                   ` (12 preceding siblings ...)
  2019-10-18  4:07 ` [RFC PATCH v2 bpf-next 13/15] bpf, selftest: Add test for xdp_flow Toshiaki Makita
@ 2019-10-18  4:07 ` Toshiaki Makita
  2019-10-18  4:07 ` [RFC PATCH v2 bpf-next 15/15] bpf, hashtab: Compare keys in long Toshiaki Makita
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 58+ messages in thread
From: Toshiaki Makita @ 2019-10-18  4:07 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, Pablo Neira Ayuso, Jozsef Kadlecsik,
	Florian Westphal, Pravin B Shelar
  Cc: Toshiaki Makita, netdev, bpf, William Tu, Stanislav Fomichev

XDP progs are likely to read/write xdp->data, so issue a write prefetch
for it before running the prog.
This improves the performance of xdp_flow.

Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index e3f29dc..a85a4ae 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -2207,6 +2207,7 @@ static struct sk_buff *i40e_run_xdp(struct i40e_ring *rx_ring,
 	if (!xdp_prog)
 		goto xdp_out;
 
+	prefetchw(xdp->data);
 	prefetchw(xdp->data_hard_start); /* xdp_frame write */
 
 	act = bpf_prog_run_xdp(xdp_prog, xdp);
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 58+ messages in thread

* [RFC PATCH v2 bpf-next 15/15] bpf, hashtab: Compare keys in long
  2019-10-18  4:07 [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP Toshiaki Makita
                   ` (13 preceding siblings ...)
  2019-10-18  4:07 ` [RFC PATCH v2 bpf-next 14/15] i40e: prefetch xdp->data before running XDP prog Toshiaki Makita
@ 2019-10-18  4:07 ` Toshiaki Makita
  2019-10-18 15:22 ` [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP John Fastabend
  2019-10-21 11:23 ` Björn Töpel
  16 siblings, 0 replies; 58+ messages in thread
From: Toshiaki Makita @ 2019-10-18  4:07 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, Pablo Neira Ayuso, Jozsef Kadlecsik,
	Florian Westphal, Pravin B Shelar
  Cc: Toshiaki Makita, netdev, bpf, William Tu, Stanislav Fomichev

memcmp() is generally slow. Compare keys in long-sized words when the
key pointer and size allow it.
This improves xdp_flow performance.

Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
---
 kernel/bpf/hashtab.c | 27 +++++++++++++++++++++++++--
 1 file changed, 25 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 22066a6..8b5ffd4 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -417,6 +417,29 @@ static inline struct hlist_nulls_head *select_bucket(struct bpf_htab *htab, u32
 	return &__select_bucket(htab, hash)->head;
 }
 
+/* key1 must be aligned to sizeof(long) */
+static bool key_equal(void *key1, void *key2, u32 size)
+{
+	/* keys embedded in struct htab_elem are long-aligned; assert it for key1 */
+	BUILD_BUG_ON(!IS_ALIGNED(offsetof(struct htab_elem, key),
+				 sizeof(long)));
+
+	if (IS_ALIGNED((unsigned long)key2 | (unsigned long)size,
+		       sizeof(long))) {
+		unsigned long *lkey1, *lkey2;
+
+		for (lkey1 = key1, lkey2 = key2; size > 0;
+		     lkey1++, lkey2++, size -= sizeof(long)) {
+			if (*lkey1 != *lkey2)
+				return false;
+		}
+
+		return true;
+	}
+
+	return !memcmp(key1, key2, size);
+}
+
 /* this lookup function can only be called with bucket lock taken */
 static struct htab_elem *lookup_elem_raw(struct hlist_nulls_head *head, u32 hash,
 					 void *key, u32 key_size)
@@ -425,7 +448,7 @@ static struct htab_elem *lookup_elem_raw(struct hlist_nulls_head *head, u32 hash
 	struct htab_elem *l;
 
 	hlist_nulls_for_each_entry_rcu(l, n, head, hash_node)
-		if (l->hash == hash && !memcmp(&l->key, key, key_size))
+		if (l->hash == hash && key_equal(&l->key, key, key_size))
 			return l;
 
 	return NULL;
@@ -444,7 +467,7 @@ static struct htab_elem *lookup_nulls_elem_raw(struct hlist_nulls_head *head,
 
 again:
 	hlist_nulls_for_each_entry_rcu(l, n, head, hash_node)
-		if (l->hash == hash && !memcmp(&l->key, key, key_size))
+		if (l->hash == hash && key_equal(&l->key, key, key_size))
 			return l;
 
 	if (unlikely(get_nulls_value(n) != (hash & (n_buckets - 1))))
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP
  2019-10-18  4:07 [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP Toshiaki Makita
                   ` (14 preceding siblings ...)
  2019-10-18  4:07 ` [RFC PATCH v2 bpf-next 15/15] bpf, hashtab: Compare keys in long Toshiaki Makita
@ 2019-10-18 15:22 ` John Fastabend
  2019-10-21  7:31   ` Toshiaki Makita
  2019-10-21 11:23 ` Björn Töpel
  16 siblings, 1 reply; 58+ messages in thread
From: John Fastabend @ 2019-10-18 15:22 UTC (permalink / raw)
  To: Toshiaki Makita, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, David S. Miller,
	Jakub Kicinski, Jesper Dangaard Brouer, John Fastabend,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, Pablo Neira Ayuso,
	Jozsef Kadlecsik, Florian Westphal, Pravin B Shelar
  Cc: Toshiaki Makita, netdev, bpf, William Tu, Stanislav Fomichev

Toshiaki Makita wrote:
> This is a PoC for an idea to offload flow, i.e. TC flower and nftables,
> to XDP.
> 

I've only read the cover letter so far but...

> * Motivation
> 
> The purpose is to speed up flow based network features like TC flower and
> nftables by making use of XDP.
> 
> I chose flow feature because my current interest is in OVS. OVS uses TC
> flower to offload flow tables to hardware, so if TC can offload flows to
> XDP, OVS also can be offloaded to XDP.

This adds a non-trivial amount of code and complexity so I'm
critical of the usefulness of being able to offload TC flower to
XDP when userspace can simply load an XDP program.

Why does OVS use tc flower at all if XDP is about 5x faster using
your measurements below? Rather than spend energy adding code to
a use case that as far as I can tell is narrowly focused on offload
support can we enumerate what is missing on XDP side that blocks
OVS from using it directly? Additionally for hardware that can
do XDP/BPF offload you will get the hardware offload for free.

Yes, I know XDP is bytecode and you can't "offload" bytecode into
a flow-based interface likely backed by a TCAM, but IMO that doesn't
mean we should leak complexity into the kernel network stack to
fix this. Use the tc-flower for offload only (it has support for
this) if you must and use the best (in terms of Mpps) software
interface for your software bits. And if you want auto-magic
offload support build hardware with BPF offload support.

In addition by using XDP natively any extra latency overhead from
bouncing calls through multiple layers would be removed.

> 
> When TC flower filter is offloaded to XDP, the received packets are
> handled by XDP first, and if their protocol or something is not
> supported by the eBPF program, the program returns XDP_PASS and packets
> are passed to upper layer TC.
> 
> The packet processing flow will be like this when this mechanism,
> xdp_flow, is used with OVS.

Same as above, just cross out the 'TC flower' box and add support
for your missing features to the 'XDP prog' box. Now you have less
code to maintain and fewer bugs, and aren't pushing packets through
multiple hops in a call chain.

> 
>  +-------------+
>  | openvswitch |
>  |    kmod     |
>  +-------------+
>         ^
>         | if not match in filters (flow key or action not supported by TC)
>  +-------------+
>  |  TC flower  |
>  +-------------+
>         ^
>         | if not match in flow tables (flow key or action not supported by XDP)
>  +-------------+
>  |  XDP prog   |
>  +-------------+
>         ^
>         | incoming packets
> 
> Of course we can directly use TC flower without OVS to speed up TC.

huh? TC flower is part of TC so I'm not sure what 'speed up TC' means. I
guess this means offloading tc flower to an xdp prog would speed up
general tc flower usage as well?

But again if we are concerned about Mpps metrics just write the XDP
program directly.

> 
> This is useful especially when the device does not support HW-offload.
> Such interfaces include virtual interfaces like veth.

I disagree, use XDP directly.

> 
> 
> * How to use

[...]

> * Performance

[...]

> Tested single core/single flow with 3 kinds of configurations.
> (spectre_v2 disabled)
> - xdp_flow: hw-offload=true, flow-offload-xdp on
> - TC:       hw-offload=true, flow-offload-xdp off (software TC)
> - ovs kmod: hw-offload=false
> 
>  xdp_flow  TC        ovs kmod
>  --------  --------  --------
>  5.2 Mpps  1.2 Mpps  1.1 Mpps
> 
> So xdp_flow drop rate is roughly 4x-5x faster than software TC or ovs kmod.

+1 yep the main point of using XDP ;)

> 
> OTOH the time to add a flow increases with xdp_flow.
> 
> ping latency of first packet when veth1 does XDP_PASS instead of DROP:
> 
>  xdp_flow  TC        ovs kmod
>  --------  --------  --------
>  22ms      6ms       0.6ms
> 
> xdp_flow does a lot of work to emulate TC behavior including UMH
> transaction and multiple bpf map update from UMH which I think increases
> the latency.

And this, IMO, sinks the case for adopting this. A native XDP solution
would, as far as I can tell, not suffer this latency. So in short: we add
lots of code that needs to be maintained, in my opinion it adds complexity,
and finally I can't see what XDP is missing today (with the code we
already have upstream!) that blocks doing an implementation without any
kernel changes.

> 
> 
> * Implementation
> 
 
[...]

> 
> * About OVS AF_XDP netdev

[...]
 
> * About alternative userland (ovs-vswitchd etc.) implementation
> 
> Maybe a similar logic can be implemented in ovs-vswitchd offload
> mechanism, instead of adding code to kernel. I just thought offloading
> TC is more generic and allows wider usage with direct TC command.
> 
> For example, considering that OVS inserts a flow to kernel only when
> flow miss happens in kernel, we can in advance add offloaded flows via
> tc filter to avoid flow insertion latency for certain sensitive flows.
> TC flower usage without using OVS is also possible.

I argue to cut tc filter out entirely and then I think none of this
is needed.

> 
> Also as written above nftables can be offloaded to XDP with this
> mechanism as well.

Or same argument use XDP directly.

> 
> Another way to achieve this from userland is to add notifications in
> flow_offload kernel code to inform userspace of flow addition and
> deletion events, and listen to them with a daemon which in turn loads eBPF
> programs, attaches them to XDP, and modifies eBPF maps. Although this may
> open up more use cases, I don't think this is the best solution
> because it requires emulating kernel behavior as an offload engine,
> and the flow-related code is changing heavily, which is difficult to
> track from out of tree.

So if everything was already in XDP why would we need these
notifications? I think a way to poll on a map from user space would
be a great idea, e.g. every time my XDP program adds a flow to my
hash map, wake up my userspace agent with some ctx on what was added or
deleted so I can do some control-plane logic.

[...]

Lots of code churn...

>  24 files changed, 2864 insertions(+), 30 deletions(-)

Thanks,
John

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP
  2019-10-18 15:22 ` [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP John Fastabend
@ 2019-10-21  7:31   ` Toshiaki Makita
  2019-10-22 16:54     ` John Fastabend
  0 siblings, 1 reply; 58+ messages in thread
From: Toshiaki Makita @ 2019-10-21  7:31 UTC (permalink / raw)
  To: John Fastabend, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, David S. Miller,
	Jakub Kicinski, Jesper Dangaard Brouer, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, Pablo Neira Ayuso, Jozsef Kadlecsik,
	Florian Westphal, Pravin B Shelar
  Cc: netdev, bpf, William Tu, Stanislav Fomichev

On 2019/10/19 0:22, John Fastabend wrote:
> Toshiaki Makita wrote:
>> This is a PoC for an idea to offload flow, i.e. TC flower and nftables,
>> to XDP.
>>
> 
> I've only read the cover letter so far but...

Thank you for reading this long cover letter.

> 
>> * Motivation
>>
>> The purpose is to speed up flow based network features like TC flower and
>> nftables by making use of XDP.
>>
>> I chose flow feature because my current interest is in OVS. OVS uses TC
>> flower to offload flow tables to hardware, so if TC can offload flows to
>> XDP, OVS also can be offloaded to XDP.
> 
> This adds a non-trivial amount of code and complexity so I'm
> critical of the usefulness of being able to offload TC flower to
> XDP when userspace can simply load an XDP program.
> 
> Why does OVS use tc flower at all if XDP is about 5x faster using
> your measurements below? Rather than spend energy adding code to
> a use case that as far as I can tell is narrowly focused on offload
> support can we enumerate what is missing on XDP side that blocks
> OVS from using it directly?

I think nothing is missing for direct XDP use, as long as the XDP
datapath, like xdp_flow, only needs partial support for the OVS flow
parser/actions. The point is to avoid duplicated effort when someone
wants to use XDP through TC flower or nftables transparently.

> Additionally for hardware that can
> do XDP/BPF offload you will get the hardware offload for free.

This is not necessary as OVS already uses TC flower to offload flows.

> Yes I know XDP is bytecode and you can't "offload" bytecode into
> a flow based interface likely backed by a tcam but IMO that doesn't
> mean we should leak complexity into the kernel network stack to
> fix this. Use the tc-flower for offload only (it has support for
> this) if you must and use the best (in terms of Mpps) software
> interface for your software bits. And if you want auto-magic
> offload support build hardware with BPF offload support.
> 
> In addition by using XDP natively any extra latency overhead from
> bouncing calls through multiple layers would be removed.

To some extent yes, but not completely. Flow insertion from userspace
triggered by datapath upcall is necessary regardless of whether we use
TC or not.

>> When TC flower filter is offloaded to XDP, the received packets are
>> handled by XDP first, and if their protocol or something is not
>> supported by the eBPF program, the program returns XDP_PASS and packets
>> are passed to upper layer TC.
>>
>> The packet processing flow will be like this when this mechanism,
>> xdp_flow, is used with OVS.
> 
> Same as above just cross out the 'TC flower' box and add support
> for your missing features to 'XDP prog' box. Now you have less
> code to maintain and less bugs and aren't pushing packets through
> multiple hops in a call chain.

If we cross out TC then we would need similar code in OVS userspace.
In total I don't think it would be less code to maintain.

> 
>>
>>   +-------------+
>>   | openvswitch |
>>   |    kmod     |
>>   +-------------+
>>          ^
>>          | if not match in filters (flow key or action not supported by TC)
>>   +-------------+
>>   |  TC flower  |
>>   +-------------+
>>          ^
>>          | if not match in flow tables (flow key or action not supported by XDP)
>>   +-------------+
>>   |  XDP prog   |
>>   +-------------+
>>          ^
>>          | incoming packets
>>
>> Of course we can directly use TC flower without OVS to speed up TC.
> 
> huh? TC flower is part of TC so not sure what 'speed up TC' means. I
> guess this means using tc flower offload to xdp prog would speed up
> general tc flower usage as well?

Yes.

> 
> But again if we are concerned about Mpps metrics just write the XDP
> program directly.

I guess you mean any Linux user who wants TC-like flow handling should develop
their own XDP programs? (Sorry if I misunderstand you.)
I want to avoid such a situation. The flexibility of eBPF/XDP is nice, and it's
good that each user can have whatever program they want, but not every sysadmin
can write low-level, high-performance programs like us. For typical use cases
like flow handling, easy use of XDP through an existing kernel interface (here,
TC) is useful IMO.

> 
...
>> * About alternative userland (ovs-vswitchd etc.) implementation
>>
>> Maybe a similar logic can be implemented in ovs-vswitchd offload
>> mechanism, instead of adding code to kernel. I just thought offloading
>> TC is more generic and allows wider usage with direct TC command.
>>
>> For example, considering that OVS inserts a flow to kernel only when
>> flow miss happens in kernel, we can in advance add offloaded flows via
>> tc filter to avoid flow insertion latency for certain sensitive flows.
>> TC flower usage without using OVS is also possible.
> 
> I argue to cut tc filter out entirely and then I think none of this
> is needed.

Not correct. Even with native XDP use, multiple map lookups/modifications
from userspace are necessary for flow-miss handling, which will lead to
some latency.

And there are other use-cases for direct TC use, like packet drop or
redirection for certain flows.

> 
>>
>> Also as written above nftables can be offloaded to XDP with this
>> mechanism as well.
> 
> Or same argument use XDP directly.

I'm thinking it's useful for sysadmins to be able to use XDP through
existing kernel interfaces.

> 
>>
>> Another way to achieve this from userland is to add notifications in
>> flow_offload kernel code to inform userspace of flow addition and
>> deletion events, and listen to them with a daemon which in turn loads eBPF
>> programs, attaches them to XDP, and modifies eBPF maps. Although this may
>> open up more use cases, I don't think this is the best solution
>> because it requires emulating kernel behavior as an offload engine,
>> and the flow-related code is changing heavily, which is difficult to
>> track from out of tree.
> 
> So if everything was already in XDP why would we need these
> notifications? I think a way to poll on a map from user space would
> be a great idea, e.g. every time my XDP program adds a flow to my
> hash map, wake up my userspace agent with some ctx on what was added or
> deleted so I can do some control plane logic.

I was talking about TC emulation above, so map notification is not related
to this problem, although it may be a nice feature.

> 
> [...]
> 
> Lots of code churn...

Note that most of it is the TC offload driver implementation, so it
should add little complexity to the network/XDP/TC core.

> 
>>   24 files changed, 2864 insertions(+), 30 deletions(-)
> 
> Thanks,
> John
> 

Toshiaki Makita

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP
  2019-10-18  4:07 [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP Toshiaki Makita
                   ` (15 preceding siblings ...)
  2019-10-18 15:22 ` [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP John Fastabend
@ 2019-10-21 11:23 ` Björn Töpel
  2019-10-21 11:47   ` Toshiaki Makita
  16 siblings, 1 reply; 58+ messages in thread
From: Björn Töpel @ 2019-10-21 11:23 UTC (permalink / raw)
  To: Toshiaki Makita
  Cc: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, Pablo Neira Ayuso, Jozsef Kadlecsik,
	Florian Westphal, Pravin B Shelar, Netdev, bpf, William Tu,
	Stanislav Fomichev, Karlsson, Magnus

On Sat, 19 Oct 2019 at 00:31, Toshiaki Makita
<toshiaki.makita1@gmail.com> wrote:
>
[...]
>
> * About OVS AF_XDP netdev
>
> Recently OVS has added AF_XDP netdev type support. This also makes use
> of XDP, but in some ways different from this patch set.
>
> - AF_XDP work originally started in order to bring BPF's flexibility to
>   OVS, which enables us to upgrade datapath without updating kernel.
>   AF_XDP solution uses userland datapath so it achieved its goal.
>   xdp_flow will not replace OVS datapath completely, but offload it
>   partially just for speed up.
>
> - OVS AF_XDP requires PMD for the best performance so consumes 100% CPU
>   as well as using another core for softirq.
>

Disclaimer; I haven't studied the OVS AF_XDP code, so this is about
AF_XDP in general.

One of the nice things about AF_XDP is that it *doesn't* force a user
to busy-poll (burn CPUs) like a regular userland pull-mode driver.
Yes, you can do that if you're extremely latency sensitive, but for
most users (and I think some OVS deployments might fit into this
category) not pinning cores/interrupts and using poll() syscalls (with
the need_wakeup patch [1]) is the way to go. The scenario you're describing
with ksoftirqd spinning on one core, and user application on another
is not something I'd recommend, rather run your packet processing
application on one core together with the softirq processing.

Björn
[1] https://lore.kernel.org/bpf/1565767643-4908-1-git-send-email-magnus.karlsson@intel.com/#t



> - OVS AF_XDP needs packet copy when forwarding packets.
>
> - xdp_flow can be used not only for OVS. It works for direct use of TC
>   flower and nftables.
>
>
> * About alternative userland (ovs-vswitchd etc.) implementation
>
> Maybe a similar logic can be implemented in ovs-vswitchd offload
> mechanism, instead of adding code to kernel. I just thought offloading
> TC is more generic and allows wider usage with direct TC command.
>
> For example, considering that OVS inserts a flow to kernel only when
> flow miss happens in kernel, we can in advance add offloaded flows via
> tc filter to avoid flow insertion latency for certain sensitive flows.
> TC flower usage without using OVS is also possible.
>
> Also as written above nftables can be offloaded to XDP with this
> mechanism as well.
>
> Another way to achieve this from userland is to add notifications in
> flow_offload kernel code to inform userspace of flow addition and
> deletion events, and listen to them with a daemon which in turn loads eBPF
> programs, attaches them to XDP, and modifies eBPF maps. Although this may
> open up more use cases, I don't think this is the best solution
> because it requires emulating kernel behavior as an offload engine,
> and the flow-related code is changing heavily, which is difficult to
> track from out of tree.
>
> * Note
>
> This patch set is based on top of commit 5bc60de50dfe ("selftests: bpf:
> Don't try to read files without read permission") on bpf-next, but need
> to backport commit 98beb3edeb97 ("samples/bpf: Add a workaround for
> asm_inline") from bpf tree to successfully build the module.
>
> * Changes
>
> RFC v2:
>  - Use indr block instead of modifying TC core, feedback from Jakub
>    Kicinski.
>  - Rename tc-offload-xdp to flow-offload-xdp since this works not only
>    for TC but also for nftables, as now I use indr flow block.
>  - Factor out XDP program validation code in net/core and use it to
>    attach a program to XDP from xdp_flow.
>  - Use /dev/kmsg instead of syslog.
>
> Any feedback is welcome.
> Thanks!
>
> Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
>
> Toshiaki Makita (15):
>   xdp_flow: Add skeleton of XDP based flow offload driver
>   xdp_flow: Add skeleton bpf program for XDP
>   bpf: Add API to get program from id
>   xdp: Export dev_check_xdp and dev_change_xdp
>   xdp_flow: Attach bpf prog to XDP in kernel after UMH loaded program
>   xdp_flow: Prepare flow tables in bpf
>   xdp_flow: Add flow entry insertion/deletion logic in UMH
>   xdp_flow: Add flow handling and basic actions in bpf prog
>   xdp_flow: Implement flow replacement/deletion logic in xdp_flow kmod
>   xdp_flow: Add netdev feature for enabling flow offload to XDP
>   xdp_flow: Implement redirect action
>   xdp_flow: Implement vlan_push action
>   bpf, selftest: Add test for xdp_flow
>   i40e: prefetch xdp->data before running XDP prog
>   bpf, hashtab: Compare keys in long
>
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c  |    1 +
>  include/linux/bpf.h                          |    8 +
>  include/linux/netdev_features.h              |    2 +
>  include/linux/netdevice.h                    |    4 +
>  kernel/bpf/hashtab.c                         |   27 +-
>  kernel/bpf/syscall.c                         |   42 +-
>  net/Kconfig                                  |    1 +
>  net/Makefile                                 |    1 +
>  net/core/dev.c                               |  113 ++-
>  net/core/ethtool.c                           |    1 +
>  net/xdp_flow/.gitignore                      |    1 +
>  net/xdp_flow/Kconfig                         |   16 +
>  net/xdp_flow/Makefile                        |  112 +++
>  net/xdp_flow/msgfmt.h                        |  102 +++
>  net/xdp_flow/umh_bpf.h                       |   34 +
>  net/xdp_flow/xdp_flow.h                      |   28 +
>  net/xdp_flow/xdp_flow_core.c                 |  180 +++++
>  net/xdp_flow/xdp_flow_kern_bpf.c             |  358 +++++++++
>  net/xdp_flow/xdp_flow_kern_bpf_blob.S        |    7 +
>  net/xdp_flow/xdp_flow_kern_mod.c             |  699 +++++++++++++++++
>  net/xdp_flow/xdp_flow_umh.c                  | 1043 ++++++++++++++++++++++++++
>  net/xdp_flow/xdp_flow_umh_blob.S             |    7 +
>  tools/testing/selftests/bpf/Makefile         |    1 +
>  tools/testing/selftests/bpf/test_xdp_flow.sh |  106 +++
>  24 files changed, 2864 insertions(+), 30 deletions(-)
>  create mode 100644 net/xdp_flow/.gitignore
>  create mode 100644 net/xdp_flow/Kconfig
>  create mode 100644 net/xdp_flow/Makefile
>  create mode 100644 net/xdp_flow/msgfmt.h
>  create mode 100644 net/xdp_flow/umh_bpf.h
>  create mode 100644 net/xdp_flow/xdp_flow.h
>  create mode 100644 net/xdp_flow/xdp_flow_core.c
>  create mode 100644 net/xdp_flow/xdp_flow_kern_bpf.c
>  create mode 100644 net/xdp_flow/xdp_flow_kern_bpf_blob.S
>  create mode 100644 net/xdp_flow/xdp_flow_kern_mod.c
>  create mode 100644 net/xdp_flow/xdp_flow_umh.c
>  create mode 100644 net/xdp_flow/xdp_flow_umh_blob.S
>  create mode 100755 tools/testing/selftests/bpf/test_xdp_flow.sh
>
> --
> 1.8.3.1
>

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP
  2019-10-21 11:23 ` Björn Töpel
@ 2019-10-21 11:47   ` Toshiaki Makita
  0 siblings, 0 replies; 58+ messages in thread
From: Toshiaki Makita @ 2019-10-21 11:47 UTC (permalink / raw)
  To: Björn Töpel, William Tu
  Cc: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, Pablo Neira Ayuso, Jozsef Kadlecsik,
	Florian Westphal, Pravin B Shelar, Netdev, bpf,
	Stanislav Fomichev, Karlsson, Magnus

On 2019/10/21 20:23, Björn Töpel wrote:
> On Sat, 19 Oct 2019 at 00:31, Toshiaki Makita
> <toshiaki.makita1@gmail.com> wrote:
>>
> [...]
>>
>> * About OVS AF_XDP netdev
>>
>> Recently OVS has added AF_XDP netdev type support. This also makes use
>> of XDP, but in some ways different from this patch set.
>>
>> - AF_XDP work originally started in order to bring BPF's flexibility to
>>    OVS, which enables us to upgrade datapath without updating kernel.
>>    AF_XDP solution uses userland datapath so it achieved its goal.
>>    xdp_flow will not replace OVS datapath completely, but offload it
>>    partially just for speed up.
>>
>> - OVS AF_XDP requires PMD for the best performance so consumes 100% CPU
>>    as well as using another core for softirq.
>>
> 
> Disclaimer; I haven't studied the OVS AF_XDP code, so this is about
> AF_XDP in general.
> 
> One of the nice things about AF_XDP is that it *doesn't* force a user
> to busy-poll (burn CPUs) like a regular userland pull-mode driver.
> Yes, you can do that if you're extremely latency sensitive, but for
> most users (and I think some OVS deployments might fit into this
> category) not pinning cores/interrupts and using poll() syscalls (need
> wakeup patch [1]) is the way to go. The scenario you're describing
> with ksoftirqd spinning on one core, and user application on another
> is not something I'd recommend, rather run your packet processing
> application on one core together with the softirq processing.

Thank you for the information.
I want to evaluate the AF_XDP solution more appropriately.

William, please correct me if I'm saying something wrong here.
Or guide me if a more appropriate configuration to achieve the best performance is possible.

> 
> Björn
> [1] https://lore.kernel.org/bpf/1565767643-4908-1-git-send-email-magnus.karlsson@intel.com/#t

Toshiaki Makita

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP
  2019-10-21  7:31   ` Toshiaki Makita
@ 2019-10-22 16:54     ` John Fastabend
  2019-10-22 17:45       ` Toke Høiland-Jørgensen
                         ` (2 more replies)
  0 siblings, 3 replies; 58+ messages in thread
From: John Fastabend @ 2019-10-22 16:54 UTC (permalink / raw)
  To: Toshiaki Makita, John Fastabend, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	David S. Miller, Jakub Kicinski, Jesper Dangaard Brouer,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, Pablo Neira Ayuso,
	Jozsef Kadlecsik, Florian Westphal, Pravin B Shelar
  Cc: netdev, bpf, William Tu, Stanislav Fomichev

Toshiaki Makita wrote:
> On 2019/10/19 0:22, John Fastabend wrote:
> > Toshiaki Makita wrote:
> >> This is a PoC for an idea to offload flow, i.e. TC flower and nftables,
> >> to XDP.
> >>
> > 
> > I've only read the cover letter so far but...
> 
> Thank you for reading this long cover letter.
> 
> > 
> >> * Motivation
> >>
> >> The purpose is to speed up flow based network features like TC flower and
> >> nftables by making use of XDP.
> >>
> >> I chose flow feature because my current interest is in OVS. OVS uses TC
> >> flower to offload flow tables to hardware, so if TC can offload flows to
> >> XDP, OVS also can be offloaded to XDP.
> > 
> > This adds a non-trivial amount of code and complexity so I'm
> > critical of the usefulness of being able to offload TC flower to
> > XDP when userspace can simply load an XDP program.
> > 
> > Why does OVS use tc flower at all if XDP is about 5x faster using
> > your measurements below? Rather than spend energy adding code to
> > a use case that as far as I can tell is narrowly focused on offload
> > support can we enumerate what is missing on XDP side that blocks
> > OVS from using it directly?
> 
> I think nothing is missing for direct XDP use, as long as XDP datapath
> only partially supports OVS flow parser/actions like xdp_flow.
> The point is to avoid duplicate effort when someone wants to use XDP
> through TC flower or nftables transparently.

I don't know who this "someone" is that wants to use XDP through TC
flower or nftables transparently. TC, at least, is not known for a
great uapi. It seems to me that it would be a relatively small project
to write a uapi that runs on top of a canned XDP program to add
flow rules. This could match the tc CLI if you wanted, but why not take
the opportunity to write a UAPI that does flow management well?

Are there users of tc_flower in deployment somewhere? I don't have
lots of visibility, but my impression is these were mainly OVS
and other switch offload interfaces.

OVS should be sufficiently important that developers can write a
native solution, in my opinion, avoiding needless layers of abstraction
in the process. The nice thing about XDP here is you can write
an XDP program and set of maps that matches OVS perfectly so no
need to try and fit things in places.

> 
> > Additionally for hardware that can
> > do XDP/BPF offload you will get the hardware offload for free.
> 
> This is not necessary as OVS already uses TC flower to offload flows.

But... if you have BPF offload hardware that doesn't support TCAM-style
flow-based offloads, direct XDP offload would seem easier, IMO.
Which is better, flow-table-based offloads or BPF offloads, likely
depends on your use case in my experience.

> 
> > Yes I know XDP is bytecode and you can't "offload" bytecode into
> > a flow based interface likely backed by a tcam but IMO that doesn't
> > mean we should leak complexity into the kernel network stack to
> > fix this. Use the tc-flower for offload only (it has support for
> > this) if you must and use the best (in terms of Mpps) software
> > interface for your software bits. And if you want auto-magic
> > offload support build hardware with BPF offload support.
> > 
> > In addition by using XDP natively any extra latency overhead from
> > bouncing calls through multiple layers would be removed.
> 
> To some extent yes, but not completely. Flow insertion from userspace
> triggered by datapath upcall is necessary regardless of whether we use
> TC or not.

Right, but these are latencies involved with the OVS architecture, not
kernel implementation artifacts. Actually, an interesting
metric would be the latency of a native xdp implementation.

I don't think we should add another implementation to the kernel
that is worse than what we have.


 xdp_flow  TC        ovs kmod
 --------  --------  --------
 22ms      6ms       0.6ms

TC is already an order of magnitude off it seems :(

If ovs_kmod is .6ms why am I going to use something that is 6ms or
22ms? I expect a native xdp implementation using a hash map to be
in line with the ovs kmod if not better. So if we already have
an implementation in kernel that is 5x faster and better at flow
insertion, another implementation that doesn't meet that threshold
should probably not go in the kernel.

Additionally, for the OVS use case I would argue the XDP native
solution is straightforward to implement. I will defer
to the OVS datapath experts here, but above you noted nothing is
missing on the feature side?

> 
> >> When TC flower filter is offloaded to XDP, the received packets are
> >> handled by XDP first, and if their protocol or something is not
> >> supported by the eBPF program, the program returns XDP_PASS and packets
> >> are passed to upper layer TC.
> >>
> >> The packet processing flow will be like this when this mechanism,
> >> xdp_flow, is used with OVS.
> > 
> > Same as obove just cross out the 'TC flower' box and add support
> > for your missing features to 'XDP prog' box. Now you have less
> > code to maintain and less bugs and aren't pushing packets through
> > multiple hops in a call chain.
> 
> If we cross out TC then we would need similar code in OVS userspace.
> In total I don't think it would be less code to maintain.

Yes, but I think minimizing kernel code and complexity is more important
than minimizing code in a specific userspace application/use-case.
Just think about the cost of a bug in kernel vs. user space. In
user space you have the ability to fix and release your own code; on the
kernel side you will have to fix upstream, manage backports, get
distributions involved, etc.

I have no problem adding code if it's a good use case, but in this case
I'm still not seeing it.

> 
> > 
> >>
> >>   +-------------+
> >>   | openvswitch |
> >>   |    kmod     |
> >>   +-------------+
> >>          ^
> >>          | if not match in filters (flow key or action not supported by TC)
> >>   +-------------+
> >>   |  TC flower  |
> >>   +-------------+
> >>          ^
> >>          | if not match in flow tables (flow key or action not supported by XDP)
> >>   +-------------+
> >>   |  XDP prog   |
> >>   +-------------+
> >>          ^
> >>          | incoming packets
> >>
> >> Of course we can directly use TC flower without OVS to speed up TC.
> > 
> > huh? TC flower is part of TC so not sure what 'speed up TC' means. I
> > guess this means using tc flower offload to xdp prog would speed up
> > general tc flower usage as well?
> 
> Yes.
> 
> > 
> > But again if we are concerned about Mpps metrics just write the XDP
> > program directly.
> 
> I guess you mean any Linux users who want TC-like flow handling should develop
> their own XDP programs? (sorry if I misunderstand you.)
> I want to avoid such a situation. The flexibility of eBPF/XDP is nice and it's
> good to have any program each user wants, but not every sysadmin can write low
> level good performance programs like us. For typical use-cases like flow handling
> easy use of XDP through existing kernel interface (here TC) is useful IMO.

For the initial OVS use case I suggest writing an XDP program tailored
to OVS and optimized for this specific use case.

If you want a general flow-based XDP program, write one, convince someone
to deploy it, and build a user space application to manage it. No sysadmin
has to touch this. Toke and others at Red Hat appear to have this exact
use case in mind.

> 
> > 
> ...
> >> * About alternative userland (ovs-vswitchd etc.) implementation
> >>
> >> Maybe a similar logic can be implemented in ovs-vswitchd offload
> >> mechanism, instead of adding code to kernel. I just thought offloading
> >> TC is more generic and allows wider usage with direct TC command.
> >>
> >> For example, considering that OVS inserts a flow to kernel only when
> >> flow miss happens in kernel, we can in advance add offloaded flows via
> >> tc filter to avoid flow insertion latency for certain sensitive flows.
> >> TC flower usage without using OVS is also possible.
> > 
> > I argue to cut tc filter out entirely and then I think non of this
> > is needed.
> 
> Not correct. Even with native XDP use, multiple map lookup/modification
> from userspace is necessary for flow miss handling, which will lead to
> some latency.

I have not got the data, but I suspect the latency will be much
closer to the ovs kmod's .6ms than to the TC or xdp_flow latency.

> 
> And there are other use-cases for direct TC use, like packet drop or
> redirection for certain flows.

But these can be implemented in XDP, correct?

> 
> > 
> >>
> >> Also as written above nftables can be offloaded to XDP with this
> >> mechanism as well.
> > 
> > Or same argument use XDP directly.
> 
> I'm thinking it's useful for sysadmins to be able to use XDP through
> existing kernel interfaces.

I agree it's perhaps friendly to do so, but for OVS it's not necessary,
and if sysadmins want a generic XDP flow interface someone can write one.
Tell admins that using your new tool gives a 5x Mpps improvement and orders
of magnitude latency reduction and I suspect converting them over
should be easy. Or, if needed, write an application in userspace that
converts tc commands to native XDP map commands.

I think for the general sysadmin (not OVS) use case I would work
with Jesper and Toke. They seem to be working on this specific problem.

> 
> > 
> >>
> >> Another way to achieve this from userland is to add notifications in
> >> flow_offload kernel code to inform userspace of flow addition and
> >> deletion events, and listen them by a deamon which in turn loads eBPF
> >> programs, attach them to XDP, and modify eBPF maps. Although this may
> >> open up more use cases, I'm not thinking this is the best solution
> >> because it requires emulation of kernel behavior as an offload engine
> >> but flow related code is heavily changing which is difficult to follow
> >> from out of tree.
> > 
> > So if everything was already in XDP why would we need these
> > notifications? I think a way to poll on a map from user space would
> > be a great idea e.g. everytime my XDP program adds a flow to my
> > hash map wake up my userspace agent with some ctx on what was added or
> > deleted so I can do some control plane logic.
> 
> I was talking about TC emulation above, so map notification is not related
> to this problem, although it may be a nice feature.

OK

> 
> > 
> > [...]
> > 
> > Lots of code churn...
> 
> Note that most of it is TC offload driver implementation. So it should add
> little complexity to network/XDP/TC core.

Maybe, but I would still like the TC offload driver implementation to
be as straightforward as possible.

> 
> > 
> >>   24 files changed, 2864 insertions(+), 30 deletions(-)
> > 
> > Thanks,
> > John
> > 
> 
> Toshiaki Makita



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP
  2019-10-22 16:54     ` John Fastabend
@ 2019-10-22 17:45       ` Toke Høiland-Jørgensen
  2019-10-24  4:27         ` John Fastabend
  2019-10-27 13:13         ` Toshiaki Makita
  2019-10-23 14:11       ` Jamal Hadi Salim
  2019-10-27 13:06       ` Toshiaki Makita
  2 siblings, 2 replies; 58+ messages in thread
From: Toke Høiland-Jørgensen @ 2019-10-22 17:45 UTC (permalink / raw)
  To: John Fastabend, Toshiaki Makita, John Fastabend,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
	Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal,
	Pravin B Shelar
  Cc: netdev, bpf, William Tu, Stanislav Fomichev

John Fastabend <john.fastabend@gmail.com> writes:

> I think for the general sysadmin (not OVS) use case I would work
> with Jesper and Toke. They seem to be working on this specific
> problem.

We're definitely thinking about how we can make "XDP magically speeds up
my network stack" a reality, if that's what you mean. Not that we have
arrived at anything specific yet...

And yeah, I'd also be happy to discuss what it would take to make a
native XDP implementation of the OVS datapath; including what (if
anything) is missing from the current XDP feature set to make this
feasible. I must admit that I'm not quite clear on why that wasn't the
approach picked for the first attempt to speed up OVS using XDP...

-Toke


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP
  2019-10-22 16:54     ` John Fastabend
  2019-10-22 17:45       ` Toke Høiland-Jørgensen
@ 2019-10-23 14:11       ` Jamal Hadi Salim
  2019-10-24  4:38         ` John Fastabend
  2019-10-27 13:27         ` Toshiaki Makita
  2019-10-27 13:06       ` Toshiaki Makita
  2 siblings, 2 replies; 58+ messages in thread
From: Jamal Hadi Salim @ 2019-10-23 14:11 UTC (permalink / raw)
  To: John Fastabend, Toshiaki Makita, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	David S. Miller, Jakub Kicinski, Jesper Dangaard Brouer,
	Cong Wang, Jiri Pirko, Pablo Neira Ayuso, Jozsef Kadlecsik,
	Florian Westphal, Pravin B Shelar
  Cc: netdev, bpf, William Tu, Stanislav Fomichev


Sorry, I didn't read every detail of this thread so I may
be missing something.

On 2019-10-22 12:54 p.m., John Fastabend wrote:
> Toshiaki Makita wrote:
>> On 2019/10/19 0:22, John Fastabend wrote:
>>> Toshiaki Makita wrote:
>>>> This is a PoC for an idea to offload flow, i.e. TC flower and nftables,
>>>> to XDP.
>>>>

> 
> I don't know who this "someone" is that wants to use XDP through TC
> flower or nftables transparently. TC at least is not known for a
> great uapi. 


The uapi is netlink. You may be talking about lack of a friendly
application library that abstracts out concepts?

> It seems to me that it would be a relatively small project
> to write a uapi that ran on top of a canned XDP program to add
> flow rules. This could match tc cli if you wanted but why not take
> the opportunity to write a UAPI that does flow management well.
> 

Disagreement:
Unfortunately legacy utilities and apps can't just be magically wished
away. There's a lot of value in transparently making them work with
new infrastructure. My usual exaggerated pitch: 1000 books have been
written on this stuff, 100K people have RH certificates which entitle
them to be "experts"; dinosaur kernels exist in data centres and
(/giggle) "enterprise". You can't just ignore all that.

Summary: there is value in what Toshiaki is doing.

I am disappointed that given a flexible canvas like XDP, we are still
going after something like flower... if someone was using u32 as the
abstraction it will justify it a lot more in my mind.
Tying it to OVS as well is not doing it justice.

Agreement:
Having said that, I don't think that flower/OVS should be the interface
that XDP should be aware of. Neither do I agree that kernel "real
estate" should belong to Oneway(TM) of doing things (we are still stuck
with netfilter planting the Columbus flag on all networking hooks).
Let 1000 flowers bloom.
So: couldn't Toshiaki's requirement be met with writing a user space
daemon that trampolines flower to "XDP format" flow transforms? That way
in the future someone could add a u32->XDP format flow definition and we
are not doomed to forever just use flower.

>> To some extent yes, but not completely. Flow insertion from userspace
>> triggered by datapath upcall is necessary regardless of whether we use
>> TC or not.
> 
> > Right, but these are latencies involved with the OVS architecture, not
> > kernel implementation artifacts. Actually, an interesting
> > metric would be the latency of a native xdp implementation.
> 
> I don't think we should add another implementation to the kernel
> that is worse than what we have.
> 
> 
>   xdp_flow  TC        ovs kmod
>   --------  --------  --------
>   22ms      6ms       0.6ms
> 
> TC is already an order of magnitude off it seems :(
> 
> If ovs_kmod is .6ms why am I going to use something that is 6ms or
> 22ms?

I am speculating, having not read Toshiaki's code.
The obvious case for the layering is policy management.
As you go upwards, hw->xdp->tc->userspace->remote control,
your policies get richer and the resolved policies pushed down
are more resolved. I am guessing the numbers we see above are
for that first packet, which is used as a control packet.
An autonomous system like this is of course susceptible to
attacks.

The workaround would be to preload the rules, but even then
you will need to deal with resource constraints. Comparison
would be like hierarchies of cache to RAM: L1/2/3 before RAM.
To illustrate: Very limited fastest L1 (aka NIC offload),
Limited faster L2 (XDP algorithms), L3 being tc and RAM being
the user space resolution.

> I expect a native xdp implementation using a hash map to be
> in line with ovs kmod if not better.

Hashes are good for datapath use cases, but not when you consider
holistic access where you have to worry about the control aspect.

cheers,
jamal

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP
  2019-10-22 17:45       ` Toke Høiland-Jørgensen
@ 2019-10-24  4:27         ` John Fastabend
  2019-10-24 10:13           ` Toke Høiland-Jørgensen
  2019-10-27 13:13         ` Toshiaki Makita
  1 sibling, 1 reply; 58+ messages in thread
From: John Fastabend @ 2019-10-24  4:27 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, John Fastabend,
	Toshiaki Makita, John Fastabend, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	David S. Miller, Jakub Kicinski, Jesper Dangaard Brouer,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, Pablo Neira Ayuso,
	Jozsef Kadlecsik, Florian Westphal, Pravin B Shelar
  Cc: netdev, bpf, William Tu, Stanislav Fomichev

Toke Høiland-Jørgensen wrote:
> John Fastabend <john.fastabend@gmail.com> writes:
> 
> > I think for the general sysadmin (not OVS) use case I would work
> > with Jesper and Toke. They seem to be working on this specific
> > problem.
> 
> We're definitely thinking about how we can make "XDP magically speeds up
> my network stack" a reality, if that's what you mean. Not that we have
> arrived at anything specific yet...

There seemed to be two thoughts in the cover letter: one, how to make
the OVS tc flower path faster via XDP, and the other, how to make other
users of the tc flower software stack faster.

For the OVS case it seems to me that OVS should create its own XDP
datapath if it's 5x faster than the tc flower datapath. A comparison
against the ovs kmod was missing from the data, though; that would also
be interesting. This way OVS could customize things and create only what
they need.

But in the other case, for a transparent tc flower XDP, a set of user
tools could let users start using XDP for this use case without having
to write their own BPF code. Anyway, I had the impression that might be
something you and Jesper are thinking about: general usability for
users that are not necessarily writing their own networking code.

> 
> And yeah, I'd also be happy to discuss what it would take to make a
> native XDP implementation of the OVS datapath; including what (if
> anything) is missing from the current XDP feature set to make this
> feasible. I must admit that I'm not quite clear on why that wasn't the
> approach picked for the first attempt to speed up OVS using XDP...
> 
> -Toke
> 



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP
  2019-10-23 14:11       ` Jamal Hadi Salim
@ 2019-10-24  4:38         ` John Fastabend
  2019-10-24 17:05           ` Jamal Hadi Salim
  2019-10-27 13:27         ` Toshiaki Makita
  1 sibling, 1 reply; 58+ messages in thread
From: John Fastabend @ 2019-10-24  4:38 UTC (permalink / raw)
  To: Jamal Hadi Salim, John Fastabend, Toshiaki Makita,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, Cong Wang, Jiri Pirko, Pablo Neira Ayuso,
	Jozsef Kadlecsik, Florian Westphal, Pravin B Shelar
  Cc: netdev, bpf, William Tu, Stanislav Fomichev

Jamal Hadi Salim wrote:
> 
> Sorry, I didn't read every detail of this thread so I may
> be missing something.
> 
> On 2019-10-22 12:54 p.m., John Fastabend wrote:
> > Toshiaki Makita wrote:
> >> On 2019/10/19 0:22, John Fastabend wrote:
> >>> Toshiaki Makita wrote:
> >>>> This is a PoC for an idea to offload flow, i.e. TC flower and nftables,
> >>>> to XDP.
> >>>>
> 
> > 
> > I don't know who this "someone" is that wants to use XDP through TC
> > flower or nftables transparently. TC at least is not known for a
> > great uapi. 
> 
> 
> The uapi is netlink. You may be talking about lack of a friendly
> application library that abstracts out concepts?

Correct, sorry, I was not entirely precise. I've written tooling on top of
the netlink API to do what is needed and it worked out just fine.

I think it would be interesting (in this context of flower vs XDP vs
u32, etc.) to build a flow API that abstracts tc vs XDP away and leverages
the correct lower level mechanics as needed. Easier said than done
of course.

> 
> > It seems to me that it would be a relatively small project
> > to write a uapi that ran on top of a canned XDP program to add
> > flow rules. This could match tc cli if you wanted but why not take
> > the opportunity to write a UAPI that does flow management well.
> > 
> 
> Disagreement:
> Unfortunately legacy utilities and apps can't just be magically wished
> away. There's a lot of value in transparently making them work with
> new infrastructure. My usual exaggerated pitch: 1000 books have been
> written on this stuff, 100K people have RH certificates which entitle
> them to be "experts"; dinosaur kernels exist in data centres and
> (/giggle) "enterprise". You can't just ignore all that.

But flower itself is not so old.

> 
> Summary: there is value in what Toshiaki is doing.
> 
> I am disappointed that given a flexible canvas like XDP, we are still
> going after something like flower... if someone was using u32 as the
> abstraction it will justify it a lot more in my mind.
> Tying it to OVS as well is not doing it justice.

William Tu worked on doing OVS natively in XDP at one point and
could provide more input on the pain points. But it seems easier to just
modify OVS vs. adding kernel shim code to translate tc to XDP, IMO.

> 
> Agreement:
> Having said that, I don't think that flower/OVS should be the interface
> that XDP should be aware of. Neither do I agree that kernel "real
> estate" should belong to Oneway(TM) of doing things (we are still stuck
> with netfilter planting the Columbus flag on all networking hooks).
> Let 1000 flowers bloom.
> So: couldn't Toshiaki's requirement be met with writing a user space
> daemon that trampolines flower to "XDP format" flow transforms? That way
> in the future someone could add a u32->XDP format flow definition and we
> are not doomed to forever just use flower.

A user space daemon I agree would work.

> 
> >> To some extent yes, but not completely. Flow insertion from userspace
> >> triggered by datapath upcall is necessary regardless of whether we use
> >> TC or not.
> > 
> > Right, but these are latencies involved with the OVS architecture, not
> > kernel implementation artifacts. Actually, an interesting
> > metric would be the latency of a native xdp implementation.
> > 
> > I don't think we should add another implementation to the kernel
> > that is worse than what we have.
> > 
> > 
> >   xdp_flow  TC        ovs kmod
> >   --------  --------  --------
> >   22ms      6ms       0.6ms
> > 
> > TC is already an order of magnitude off it seems :(
> > 
> > If ovs_kmod is .6ms why am I going to use something that is 6ms or
> > 22ms?
> 
> I am speculating, having not read Toshiaki's code.
> The obvious case for the layering is policy management.
> As you go upwards, hw->xdp->tc->userspace->remote control,
> your policies get richer and the resolved policies pushed down
> are more resolved. I am guessing the numbers we see above are
> for that first packet, which is used as a control packet.
> An autonomous system like this is of course susceptible to
> attacks.

Agreed, but first packets still happen, and introducing latency spikes
when we have a better solution around should be avoided.

> 
> The workaround would be to preload the rules, but even then
> you will need to deal with resource constraints. Comparison
> would be like hierarchies of cache to RAM: L1/2/3 before RAM.
> To illustrate: Very limited fastest L1 (aka NIC offload),
> Limited faster L2 (XDP algorithms), L3 being tc and RAM being
> the user space resolution.

Of course.

> 
> > I expect a native xdp implementation using a hash map to be
> > in line with ovs kmod if not better.
> 
> Hashes are good for datapath use cases, but not when you consider
> holistic access where you have to worry about the control aspect.

What's the "right" data structure? We can build it in XDP if
it's useful/generic. tc flower doesn't implement the same data
structures as the ovs kmod as far as I know.

Thanks!

> 
> cheers,
> jamal



^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP
  2019-10-24  4:27         ` John Fastabend
@ 2019-10-24 10:13           ` Toke Høiland-Jørgensen
  2019-10-27 13:19             ` Toshiaki Makita
  0 siblings, 1 reply; 58+ messages in thread
From: Toke Høiland-Jørgensen @ 2019-10-24 10:13 UTC (permalink / raw)
  To: John Fastabend, John Fastabend, Toshiaki Makita, John Fastabend,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
	Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal,
	Pravin B Shelar
  Cc: netdev, bpf, William Tu, Stanislav Fomichev

John Fastabend <john.fastabend@gmail.com> writes:

> Toke Høiland-Jørgensen wrote:
>> John Fastabend <john.fastabend@gmail.com> writes:
>> 
>> > I think for the general sysadmin (not OVS) use case I would work
>> > with Jesper and Toke. They seem to be working on this specific
>> > problem.
>> 
>> We're definitely thinking about how we can make "XDP magically speeds up
>> my network stack" a reality, if that's what you mean. Not that we have
>> arrived at anything specific yet...
>
> There seemed to be two thoughts in the cover letter: one, how to make
> the OVS tc flower path faster via XDP, and the other, how to make other
> users of the tc flower software stack faster.
>
> For the OVS case it seems to me that OVS should create its own XDP
> datapath if it's 5x faster than the tc flower datapath. A comparison
> against the ovs kmod was missing from the data, though; that would
> also be interesting. This way OVS could customize
> things and create only what they need.
>
> But in the other case, for a transparent tc flower XDP, a set of user
> tools could let users start using XDP for this use case without having
> to write their own BPF code. Anyway, I had the impression that might be
> something you and Jesper are thinking about: general usability for
> users that are not necessarily writing their own networking code.

Yeah, you are right that it's something we're thinking about. I'm not
sure we'll actually have the bandwidth to implement a complete solution
ourselves, but we are very much interested in helping others do this,
including smoothing out any rough edges (or adding missing features) in
the core XDP feature set that is needed to achieve this :)

-Toke


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP
  2019-10-24  4:38         ` John Fastabend
@ 2019-10-24 17:05           ` Jamal Hadi Salim
  0 siblings, 0 replies; 58+ messages in thread
From: Jamal Hadi Salim @ 2019-10-24 17:05 UTC (permalink / raw)
  To: John Fastabend, Toshiaki Makita, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	David S. Miller, Jakub Kicinski, Jesper Dangaard Brouer,
	Cong Wang, Jiri Pirko, Pablo Neira Ayuso, Jozsef Kadlecsik,
	Florian Westphal, Pravin B Shelar
  Cc: netdev, bpf, William Tu, Stanislav Fomichev

On 2019-10-24 12:38 a.m., John Fastabend wrote:
> Jamal Hadi Salim wrote:
>>

[..]
> Correct, sorry, I was not entirely precise. I've written tooling on top of
> the netlink API to do what is needed and it worked out just fine.
> 
> I think it would be interesting (in this context of flower vs XDP vs
> u32, etc.) to build a flow API that abstracts tc vs XDP away and leverages
> the correct lower level mechanics as needed. Easier said than done
> of course.


So, IMO, the choice is usability vs performance vs expressability.
Pick 2 of 3.
Some context..

Usability:
Flower is intended for humans, so usability is higher priority.
Somewhere along that journey we lost track of reality: now all the
freaking drivers are exposing very highly performant interfaces
abstracted as flower. I was worried this is where this XDP interface
was heading when I saw this.

Expressability:
Flower: you want to add another tuple? Sure, change kernel
code, user code, driver code. Wait 3 years before the rest
of the world catches up.
u32: none of the above. In fact I can express flower using
u32.

Performance:
I think flower does well on egress when the flow cache
is already collected; on ingress those memcmps are
not cheap.
u32: you can organize your tables to make it performant
for your traffic patterns.

Back to your comment:
XDP should make choices that prioritize expressability and performance.

u32 would be a good choice because of its use of hierarchies of
tables for expression (tables being close relatives of eBPF maps).
The embedded parse/match in u32 could use some refinements. Maybe on
modern machines we should work on 64-bit words instead of 32, etc.
Note: it doesn't have to be u32 _as long as the two requirements
are met_.
A human-friendly "wrapper" API (if you want your 15 tuples, by all means)
can be made on top. For machines, give them the power to do more.

The third requirement I would have is to allow for other ways of
doing these classifications/actions; sort of what tc does, allowing
many different implementations of different classifiers to coexist.
It may be u64 today, but for some other use case you may need a different
classifier (and yes, OVS can move theirs down there too).

> But flower itself is not so old.

It is out in the wild already.

>>
>> Summary: there is value in what Toshiaki is doing.
>>
>> I am disappointed that given a flexible canvas like XDP, we are still
>> going after something like flower... if someone was using u32 as the
>> abstraction it will justify it a lot more in my mind.
>> Tying it to OVS as well is not doing it justice.
> 
> William Tu worked on doing OVS natively in XDP at one point and
> could provide more input on the pain points. But seems easier to just
> modify OVS vs adding kernel shim code to take tc to xdp IMO.
> 

It will be good to hear William's pain points (there may be a paper out
there).

I dont think any of this should be done to cater for OVS. We need
a low level interface that is both expressive and performant.
OVS can ride on top of it. Human friendly interfaces can be
written on top.

Note also ebpf maps can be shared between tc and XDP.

> Agreed, but first packets still happen, and introducing latency spikes
> when we have a better solution around should be avoided.
> 

Certainly susceptible to attacks (re: old route cache)

But:
If you want to allow people choice, then we can't put
obstacles in front of people who want to do silly things. Just don't
force everyone else to use your shit.

>> Hashes are good for datapath use cases but not when you consider
>> a holistic access where you have to worry about control aspect.
> 
> What's the "right" data structure?

From a datapath perspective, hash tables are fine. You can shard
them by having hierarchies, give them more buckets, use some clever
traffic-specific keying algorithm, etc.
From a control path perspective, there are challenges. If I want
to (for example) dump based on a partial key filter, that interface
becomes a linked list (i.e. I iterate the whole hash table matching
things). A trie would be better in that case.
In my world, when you have hundreds of thousands or millions of
flow entries that you need to retrieve for whatever reasons
every few seconds, this is a big deal.

> We can build it in XDP if
> it's useful/generic. tc flower doesn't implement the same data
> structures as the ovs kmod as far as I know.

Generic is key. Freedom is key. OVS is not that. If someone wants
to do a performant 2 tuple hardcoded classifier, let it be.
Let 1000 flowers (garden variety, not tc variety) bloom.

cheers,
jamal

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP
  2019-10-22 16:54     ` John Fastabend
  2019-10-22 17:45       ` Toke Høiland-Jørgensen
  2019-10-23 14:11       ` Jamal Hadi Salim
@ 2019-10-27 13:06       ` Toshiaki Makita
  2 siblings, 0 replies; 58+ messages in thread
From: Toshiaki Makita @ 2019-10-27 13:06 UTC (permalink / raw)
  To: John Fastabend, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, David S. Miller,
	Jakub Kicinski, Jesper Dangaard Brouer, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, Pablo Neira Ayuso, Jozsef Kadlecsik,
	Florian Westphal, Pravin B Shelar
  Cc: netdev, bpf, William Tu, Stanislav Fomichev

On 19/10/23 (水) 1:54:57, John Fastabend wrote:
> Toshiaki Makita wrote:
>> On 2019/10/19 0:22, John Fastabend wrote:
>>> Toshiaki Makita wrote:
>>>> This is a PoC for an idea to offload flow, i.e. TC flower and nftables,
>>>> to XDP.
>>>>
>>>
>>> I've only read the cover letter so far but...
>>
>> Thank you for reading this long cover letter.
>>
>>>
>>>> * Motivation
>>>>
>>>> The purpose is to speed up flow based network features like TC flower and
>>>> nftables by making use of XDP.
>>>>
>>>> I chose flow feature because my current interest is in OVS. OVS uses TC
>>>> flower to offload flow tables to hardware, so if TC can offload flows to
>>>> XDP, OVS also can be offloaded to XDP.
>>>
>>> This adds a non-trivial amount of code and complexity so I'm
>>> critical of the usefulness of being able to offload TC flower to
>>> XDP when userspace can simply load an XDP program.
>>>
>>> Why does OVS use tc flower at all if XDP is about 5x faster using
>>> your measurements below? Rather than spend energy adding code to
>>> a use case that as far as I can tell is narrowly focused on offload
>>> support can we enumerate what is missing on XDP side that blocks
>>> OVS from using it directly?
>>
>> I think nothing is missing for direct XDP use, as long as XDP datapath
>> only partially supports OVS flow parser/actions like xdp_flow.
>> The point is to avoid duplicate effort when someone wants to use XDP
>> through TC flower or nftables transparently.
> 
> I don't know who this "someone" is that wants to use XDP through TC
> flower or nftables transparently. TC at least is not known for a
> great uapi. It seems to me that it would be a relatively small project
> to write a uapi that ran on top of a canned XDP program to add
> flow rules. This could match tc cli if you wanted but why not take
> the opportunity to write a UAPI that does flow management well.

I'm not sure why TC is not a great uapi (or do you mean the cli?).
I don't think it's a good idea to add another API when there is already
an API that does the same thing.

> Are there users of tc_flower in deployment somewhere?

Yes, I know of at least a few examples in my company...
To me flower is easier to use than u32, as it can automatically
dissect flows, so I usually use it when some filtering or redirection
is necessary at the device level.

> I don't have
> lots of visibility but my impression is these were mainly OVS
> and other switch offload interface.
> 
> OVS should be sufficiently important that developers can write a
> native solution in my opinion. Avoiding needless layers of abstraction
> in the process. The nice thing about XDP here is you can write
> an XDP program and set of maps that matches OVS perfectly so no
> need to try and fit things in places.

I think the current XDP program for TC matches OVS as well, so I don't
see much merit there. TC flower and the ovs kmod are similar in how they
handle flows (both use a mask list and look up a hash table per mask).

> 
>>
>>> Additionally for hardware that can
>>> do XDP/BPF offload you will get the hardware offload for free.
>>
>> This is not necessary as OVS already uses TC flower to offload flows.
> 
> But... if you have BPF offload hardware that doesn't support TCAM
> style flow based offloads direct XDP offload would seem easier IMO.
> Which is better? Flow table based offloads vs BPF offloads likely
> depends on your use case in my experience.

I thought having a single control path for offload is simple and
intuitive. If XDP is allowed to be offloaded, OVS needs more knobs
to enable offload of XDP programs.
I have no idea about the difficulty of the offload drivers'
implementation, though.

> 
>>
>>> Yes I know XDP is bytecode and you can't "offload" bytecode into
>>> a flow based interface likely backed by a tcam but IMO that doesn't
>>> mean we should leak complexity into the kernel network stack to
>>> fix this. Use the tc-flower for offload only (it has support for
>>> this) if you must and use the best (in terms of Mpps) software
>>> interface for your software bits. And if you want auto-magic
>>> offload support build hardware with BPF offload support.
>>>
>>> In addition by using XDP natively any extra latency overhead from
>>> bouncing calls through multiple layers would be removed.
>>
>> To some extent yes, but not completely. Flow insertion from userspace
>> triggered by datapath upcall is necessary regardless of whether we use
>> TC or not.
> 
> Right but these are latencies inherent in the OVS architecture, not
> kernel implementation artifacts. Actually what would be an interesting
> metric would be to see latency of a native xdp implementation.
> 
> I don't think we should add another implementation to the kernel
> that is worse than what we have.
> 
> 
>   xdp_flow  TC        ovs kmod
>   --------  --------  --------
>   22ms      6ms       0.6ms
> 
> TC is already an order of magnitude off it seems :(

Yes, but I'm unsure why it takes so much time. I need to investigate that...

> 
> If ovs_kmod is .6ms why am I going to use something that is 6ms or
> 22ms. I expect a native xdp implementation using a hash map to be
> in line with ovs kmod if not better. So if we already have
> an implementation in kernel that is 5x faster and better at flow
> insertion, another implementation that doesn't meet that threshold
> should probably not go in kernel.

Note that a simple hash table (microflow) implementation is hard.
It requires support for all key parameters, including conntrack
parameters, which does not look feasible now.
I only support megaflows, using multiple hash tables, each of which has
its own mask. This allows supporting a partial set of keys, which is
feasible with XDP.

With this implementation I needed to emulate a list of masks using
arrays, which I think is the main source of the latency, as it requires
several syscalls to modify and look up the BPF maps (I think this can be
improved later, though). This stays the same even if we cross out TC.

Anyway, if a flow is really latency-sensitive, users should add the
flow entry in advance. Going through userspace itself has non-negligible
latency.

> 
> Additionally, for the OVS use case I would argue the XDP native
> solution is straightforward to implement. I will defer
> to OVS datapath experts here, but above you noted nothing is
> missing on the feature side?

Yes, as long as it is partial flow key/action support.
Full datapath support is far from feasible, as William experienced...

> 
>>
>>>> When TC flower filter is offloaded to XDP, the received packets are
>>>> handled by XDP first, and if their protocol or something is not
>>>> supported by the eBPF program, the program returns XDP_PASS and packets
>>>> are passed to upper layer TC.
>>>>
>>>> The packet processing flow will be like this when this mechanism,
>>>> xdp_flow, is used with OVS.
>>>
>>> Same as above, just cross out the 'TC flower' box and add support
>>> for your missing features to 'XDP prog' box. Now you have less
>>> code to maintain and less bugs and aren't pushing packets through
>>> multiple hops in a call chain.
>>
>> If we cross out TC then we would need similar code in OVS userspace.
>> In total I don't think it would be less code to maintain.
> 
> Yes but I think minimizing kernel code and complexity is more important
> than minimizing code in a specific userspace application/use-case.
> Just think about the cost of a bug on the kernel vs the user space
> side. In user space you have the ability to fix and release your own
> code; on the kernel side you will have to fix it upstream, manage
> backports, get distributions involved, etc.
> 
> I have no problem adding code if its a good use case but in this case
> I'm still not seeing it.

I can understand that the kernel code is more important, but if we
decide not to have this code in the kernel, we probably need separate
code in each userland tool: OVS has XDP, TC (iproute2 or some new
userland tool) has XDP, nft has XDP, and they will all provide the same
functionality, which is duplicated effort.

> 
>>
>>>
>>>>
>>>>    +-------------+
>>>>    | openvswitch |
>>>>    |    kmod     |
>>>>    +-------------+
>>>>           ^
>>>>           | if not match in filters (flow key or action not supported by TC)
>>>>    +-------------+
>>>>    |  TC flower  |
>>>>    +-------------+
>>>>           ^
>>>>           | if not match in flow tables (flow key or action not supported by XDP)
>>>>    +-------------+
>>>>    |  XDP prog   |
>>>>    +-------------+
>>>>           ^
>>>>           | incoming packets
>>>>
>>>> Of course we can directly use TC flower without OVS to speed up TC.
>>>
>>> huh? TC flower is part of TC so not sure what 'speed up TC' means. I
>>> guess this means using tc flower offload to xdp prog would speed up
>>> general tc flower usage as well?
>>
>> Yes.
>>
>>>
>>> But again if we are concerned about Mpps metrics just write the XDP
>>> program directly.
>>
>> I guess you mean any Linux user who wants TC-like flow handling should develop
>> their own XDP programs? (Sorry if I misunderstand you.)
>> I want to avoid such a situation. The flexibility of eBPF/XDP is nice, and it's
>> good that each user can have any program they want, but not every sysadmin can
>> write good-performance low-level programs like us. For typical use-cases like
>> flow handling, easy use of XDP through an existing kernel interface (here TC)
>> is useful IMO.
> 
> For OVS the initial use case I suggest write a XDP program tailored and
> optimized for OVS. Optimize it for this specific use case.

I think the current code is already optimized for OVS as well as for TC.

> 
> If you want a general flow based XDP program write one, convince someone
> to deploy and build a user space application to manage it. No sysadmin
> has to touch this. Toke and others at RedHat appear to have this exact
> use case in mind.
> 
>>
>>>
>> ...
>>>> * About alternative userland (ovs-vswitchd etc.) implementation
>>>>
>>>> Maybe a similar logic can be implemented in ovs-vswitchd offload
>>>> mechanism, instead of adding code to kernel. I just thought offloading
>>>> TC is more generic and allows wider usage with direct TC command.
>>>>
>>>> For example, considering that OVS inserts a flow to kernel only when
>>>> flow miss happens in kernel, we can in advance add offloaded flows via
>>>> tc filter to avoid flow insertion latency for certain sensitive flows.
>>>> TC flower usage without using OVS is also possible.
>>>
>>> I argue to cut tc filter out entirely, and then I think none of this
>>> is needed.
>>
>> Not correct. Even with native XDP use, multiple map lookups/modifications
>> from userspace are necessary for flow-miss handling, which will lead to
>> some latency.
> 
> I have not got the data yet but I suspect the latency will be much
> closer to the ovs kmod .6ms than the TC or xdp_flow latency.

It should not, as I explained above, although I should confirm it.

> 
>>
>> And there are other use-cases for direct TC use, like packet drop or
>> redirection for certain flows.
> 
> But these can be implemented in XDP correct?

So I want to avoid duplicate work...

> 
>>
>>>
>>>>
>>>> Also as written above nftables can be offloaded to XDP with this
>>>> mechanism as well.
>>>
>>> Or same argument use XDP directly.
>>
>> I'm thinking it's useful for sysadmins to be able to use XDP through
>> existing kernel interfaces.
> 
> I agree its perhaps friendly to do so but for OVS not necessary and
> if sysadmins want a generic XDP flow interface someone can write one.
> Tell admins using your new tool gives 5x mpps improvement and orders
> of magnitude latency reduction and I suspect converting them over
> should be easy. Or if needed write an application in userspace that
> converts tc commands to native XDP map commands.
> 
> I think for sysadmins in general (not OVS) use case I would work
> with Jesper and Toke. They seem to be working on this specific problem.

I talked about this at Netdev 0x13, and IIRC Jesper was somewhat
positive about this idea...
Jesper, any thoughts?

Toshiaki Makita

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP
  2019-10-22 17:45       ` Toke Høiland-Jørgensen
  2019-10-24  4:27         ` John Fastabend
@ 2019-10-27 13:13         ` Toshiaki Makita
  2019-10-27 15:24           ` Toke Høiland-Jørgensen
  1 sibling, 1 reply; 58+ messages in thread
From: Toshiaki Makita @ 2019-10-27 13:13 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, John Fastabend,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
	Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal,
	Pravin B Shelar
  Cc: netdev, bpf, William Tu, Stanislav Fomichev

On 19/10/23 (Wed) 2:45:05, Toke Høiland-Jørgensen wrote:
> John Fastabend <john.fastabend@gmail.com> writes:
> 
>> I think for sysadmins in general (not OVS) use case I would work
>> with Jesper and Toke. They seem to be working on this specific
>> problem.
> 
> We're definitely thinking about how we can make "XDP magically speeds up
> my network stack" a reality, if that's what you mean. Not that we have
> arrived at anything specific yet...
> 
> And yeah, I'd also be happy to discuss what it would take to make a
> native XDP implementation of the OVS datapath; including what (if
> anything) is missing from the current XDP feature set to make this
> feasible. I must admit that I'm not quite clear on why that wasn't the
> approach picked for the first attempt to speed up OVS using XDP...

Here's some history from William Tu et al.
https://linuxplumbersconf.org/event/2/contributions/107/

Although his aim was not to speed up OVS but to add a kernel-independent
datapath, his experience shows that full OVS support in eBPF is very
difficult. Later I discussed this xdp_flow approach with William, and we
found performance-wise value in this kind of partial offloading to XDP.
TC is one way to achieve the partial offloading.

Toshiaki Makita

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP
  2019-10-24 10:13           ` Toke Høiland-Jørgensen
@ 2019-10-27 13:19             ` Toshiaki Makita
  2019-10-27 15:21               ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 58+ messages in thread
From: Toshiaki Makita @ 2019-10-27 13:19 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, John Fastabend,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
	Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal,
	Pravin B Shelar
  Cc: netdev, bpf, William Tu, Stanislav Fomichev

On 19/10/24 (Thu) 19:13:09, Toke Høiland-Jørgensen wrote:
> John Fastabend <john.fastabend@gmail.com> writes:
> 
>> Toke Høiland-Jørgensen wrote:
>>> John Fastabend <john.fastabend@gmail.com> writes:
>>>
>>>> I think for sysadmins in general (not OVS) use case I would work
>>>> with Jesper and Toke. They seem to be working on this specific
>>>> problem.
>>>
>>> We're definitely thinking about how we can make "XDP magically speeds up
>>> my network stack" a reality, if that's what you mean. Not that we have
>>> arrived at anything specific yet...
>>
>> There seemed to be two thoughts in the cover letter: one, how to make
>> the OVS tc flow path faster via XDP; and the other, how to make other
>> users of the tc flower software stack faster.
>>
>> For the OVS case it seems to me that OVS should create its own XDP
>> datapath if it's 5x faster than the tc flower datapath. Although
>> missing from the data was comparing against the ovs kmod, so that

In the cover letter there is

  xdp_flow  TC        ovs kmod
  --------  --------  --------
  5.2 Mpps  1.2 Mpps  1.1 Mpps

Or are you talking about something different?

>> comparison would also be interesting. This way OVS could customize
>> things and create only what they need.
>>
>> But in the other case, for a transparent tc flower XDP, a set of user
>> tools could let users start using XDP for this use case without having
>> to write their own BPF code. Anyway, I had the impression that might be
>> something you and Jesper are thinking about: general usability for
>> users that are not necessarily writing their own networking code.
> 
> Yeah, you are right that it's something we're thinking about. I'm not
> sure we'll actually have the bandwidth to implement a complete solution
> ourselves, but we are very much interested in helping others do this,
> including smoothing out any rough edges (or adding missing features) in
> the core XDP feature set that is needed to achieve this :)

I'm very interested in general usability solutions.
I'd appreciate it if you could join the discussion.

The basic idea of my approach is to reuse the HW-offload infrastructure
in the kernel.
Typical networking features in the kernel have an offload mechanism (TC
flower, nftables, bridge, routing, and so on).
In general these are what users want to accelerate, so easy XDP use
should also support these features IMO. With this in mind, reusing the
existing HW-offload mechanism is a natural way to me: OVS uses TC to
offload flows, so use TC for XDP as well...
Of course, as John suggested, there are other ways to do that. Probably
we should compare them more thoroughly to discuss this further?

Toshiaki Makita

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP
  2019-10-23 14:11       ` Jamal Hadi Salim
  2019-10-24  4:38         ` John Fastabend
@ 2019-10-27 13:27         ` Toshiaki Makita
  1 sibling, 0 replies; 58+ messages in thread
From: Toshiaki Makita @ 2019-10-27 13:27 UTC (permalink / raw)
  To: Jamal Hadi Salim, John Fastabend, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	David S. Miller, Jakub Kicinski, Jesper Dangaard Brouer,
	Cong Wang, Jiri Pirko, Pablo Neira Ayuso, Jozsef Kadlecsik,
	Florian Westphal, Pravin B Shelar
  Cc: netdev, bpf, William Tu, Stanislav Fomichev

On 19/10/23 (Wed) 23:11:25, Jamal Hadi Salim wrote:
> 
> Sorry - didnt read every detail of this thread so i may
> be missing something.
> 
> On 2019-10-22 12:54 p.m., John Fastabend wrote:
>> Toshiaki Makita wrote:
>>> On 2019/10/19 0:22, John Fastabend wrote:
>>>> Toshiaki Makita wrote:
>>>>> This is a PoC for an idea to offload flow, i.e. TC flower and 
>>>>> nftables,
>>>>> to XDP.
>>>>>
> 
>>
>> I don't know who this "someone" is that wants to use XDP through TC
>> flower or nftables transparently. TC at least is not known for a
>> great uapi. 
> 
> 
> The uapi is netlink. You may be talking about the lack of a friendly
> application library that abstracts out the concepts?
> 
>> It seems to me that it would be a relatively small project
>> to write a uapi that ran on top of a canned XDP program to add
>> flow rules. This could match tc cli if you wanted but why not take
>> the opportunity to write a UAPI that does flow management well.
>>
> 
> Disagreement:
> Unfortunately legacy utilities and apps can't just be magically wished
> away. There's a lot of value in transparently making them work with
> new infrastructure. My usual exaggerated pitch: 1000 books have been
> written on this stuff, 100K people have RH certificates which entitle
> them to be "experts"; dinosaur kernels exist in data centres and
> (/giggle) "enterprise". You can't just ignore all that.
> 
> Summary: there is value in what Toshiaki is doing.
> 
> I am disappointed that given a flexible canvas like XDP, we are still
> going after something like flower... if someone was using u32 as the
> abstraction it would justify it a lot more in my mind.
> Tying it to OVS as well is not doing it justice.

Flexibility is good for cases where very complicated or unusual flow
handling is needed. OTOH, great flexibility often sacrifices usability
IMO; configuration tends to be difficult.

What I want to do here is make XDP easy for sysadmins for typical
use-cases. u32 is good for flexibility, but to me flower is easier to
use. Using flower fits my intention better.

> 
> Agreement:
> Having said that, I don't think that flower/OVS should be the interface
> that XDP should be aware of. Neither do I agree that kernel "real
> estate" should belong to Oneway(TM) of doing things (we are still stuck
> with netfilter planting the Columbus flag on all networking hooks).
> Let 1000 flowers bloom.
> So: couldn't Toshiaki's requirement be met with writing a user space
> daemon that trampolines flower to "XDP format" flow transforms? That way
> in the future someone could add a u32->XDP format flow definition and we
> are not doomed to forever just use flower.

A userspace daemon is possible. Do you mean adding notification points
in TC and listening for filter modification events from userspace? If
so, I'm not so positive about this, as it seems difficult to emulate
TC/kernel behavior from userspace.
Note that I think u32 offload can be added in the future even with the
current xdp_flow implementation.

Toshiaki Makita

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP
  2019-10-27 13:19             ` Toshiaki Makita
@ 2019-10-27 15:21               ` Toke Høiland-Jørgensen
  2019-10-28  3:16                 ` David Ahern
  2019-10-31  0:18                 ` Toshiaki Makita
  0 siblings, 2 replies; 58+ messages in thread
From: Toke Høiland-Jørgensen @ 2019-10-27 15:21 UTC (permalink / raw)
  To: Toshiaki Makita, John Fastabend, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	David S. Miller, Jakub Kicinski, Jesper Dangaard Brouer,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, Pablo Neira Ayuso,
	Jozsef Kadlecsik, Florian Westphal, Pravin B Shelar
  Cc: netdev, bpf, William Tu, Stanislav Fomichev

Toshiaki Makita <toshiaki.makita1@gmail.com> writes:

>> Yeah, you are right that it's something we're thinking about. I'm not
>> sure we'll actually have the bandwidth to implement a complete solution
>> ourselves, but we are very much interested in helping others do this,
>> including smoothing out any rough edges (or adding missing features) in
>> the core XDP feature set that is needed to achieve this :)
>
> I'm very interested in general usability solutions.
> I'd appreciate if you could join the discussion.
>
> Here the basic idea of my approach is to reuse HW-offload infrastructure 
> in kernel.
> Typical networking features in kernel have offload mechanism (TC flower, 
> nftables, bridge, routing, and so on).
> In general these are what users want to accelerate, so easy XDP use also 
> should support these features IMO. With this idea, reusing existing 
> HW-offload mechanism is a natural way to me. OVS uses TC to offload 
> flows, then use TC for XDP as well...

I agree that XDP should be able to accelerate existing kernel
functionality. However, this does not necessarily mean that the kernel
has to generate an XDP program and install it, like your patch does.
Rather, what we should be doing is exposing the functionality through
helpers so XDP can hook into the data structures already present in the
kernel and make decisions based on what is contained there. We already
have that for routing; L2 bridging, and some kind of connection
tracking, are obvious contenders for similar additions.

-Toke


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP
  2019-10-27 13:13         ` Toshiaki Makita
@ 2019-10-27 15:24           ` Toke Høiland-Jørgensen
  2019-10-27 19:17             ` David Miller
  2019-11-12 17:38             ` William Tu
  0 siblings, 2 replies; 58+ messages in thread
From: Toke Høiland-Jørgensen @ 2019-10-27 15:24 UTC (permalink / raw)
  To: Toshiaki Makita, John Fastabend, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	David S. Miller, Jakub Kicinski, Jesper Dangaard Brouer,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, Pablo Neira Ayuso,
	Jozsef Kadlecsik, Florian Westphal, Pravin B Shelar
  Cc: netdev, bpf, William Tu, Stanislav Fomichev

Toshiaki Makita <toshiaki.makita1@gmail.com> writes:

> On 19/10/23 (Wed) 2:45:05, Toke Høiland-Jørgensen wrote:
>> John Fastabend <john.fastabend@gmail.com> writes:
>> 
>>> I think for sysadmins in general (not OVS) use case I would work
>>> with Jesper and Toke. They seem to be working on this specific
>>> problem.
>> 
>> We're definitely thinking about how we can make "XDP magically speeds up
>> my network stack" a reality, if that's what you mean. Not that we have
>> arrived at anything specific yet...
>> 
>> And yeah, I'd also be happy to discuss what it would take to make a
>> native XDP implementation of the OVS datapath; including what (if
>> anything) is missing from the current XDP feature set to make this
>> feasible. I must admit that I'm not quite clear on why that wasn't the
>> approach picked for the first attempt to speed up OVS using XDP...
>
> Here's some history from William Tu et al.
> https://linuxplumbersconf.org/event/2/contributions/107/
>
> Although his aim was not to speed up OVS but to add kernel-independent 
> datapath, his experience shows full OVS support by eBPF is very
> difficult.

Yeah, I remember seeing that presentation; it still isn't clear to me
what exactly the issue was with implementing the OVS datapath in eBPF.
As far as I can tell from glancing through the paper, it only lists
program size and lack of loops as limitations, both of which have been
lifted now.

The results in the paper also shows somewhat disappointing performance
for the eBPF implementation, but that is not too surprising given that
it's implemented as a TC eBPF hook, not an XDP program. I seem to recall
that this was also one of the things puzzling to me back when this was
presented...

-Toke


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP
  2019-10-27 15:24           ` Toke Høiland-Jørgensen
@ 2019-10-27 19:17             ` David Miller
  2019-10-31  0:32               ` Toshiaki Makita
  2019-11-12 17:38             ` William Tu
  1 sibling, 1 reply; 58+ messages in thread
From: David Miller @ 2019-10-27 19:17 UTC (permalink / raw)
  To: toke
  Cc: toshiaki.makita1, john.fastabend, ast, daniel, kafai,
	songliubraving, yhs, jakub.kicinski, hawk, jhs, xiyou.wangcong,
	jiri, pablo, kadlec, fw, pshelar, netdev, bpf, u9012063, sdf

From: Toke Høiland-Jørgensen <toke@redhat.com>
Date: Sun, 27 Oct 2019 16:24:24 +0100

> The results in the paper also shows somewhat disappointing performance
> for the eBPF implementation, but that is not too surprising given that
> it's implemented as a TC eBPF hook, not an XDP program. I seem to recall
> that this was also one of the things puzzling to me back when this was
> presented...

Also, no attempt was made to dynamically optimize the data structures
and code generated in response to features actually used.

That's the big error.

The full OVS key is huge, OVS is really quite a monster.

But people don't use the entire key, nor do they use the totality of
the data paths.

So just doing a 1-to-1 translation of the OVS datapath into BPF makes
absolutely no sense whatsoever and it is guaranteed to have worse
performance.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP
  2019-10-27 15:21               ` Toke Høiland-Jørgensen
@ 2019-10-28  3:16                 ` David Ahern
  2019-10-28  8:36                   ` Toke Høiland-Jørgensen
  2019-10-31  0:18                 ` Toshiaki Makita
  1 sibling, 1 reply; 58+ messages in thread
From: David Ahern @ 2019-10-28  3:16 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Toshiaki Makita,
	John Fastabend, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, David S. Miller,
	Jakub Kicinski, Jesper Dangaard Brouer, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, Pablo Neira Ayuso, Jozsef Kadlecsik,
	Florian Westphal, Pravin B Shelar
  Cc: netdev, bpf, William Tu, Stanislav Fomichev

On 10/27/19 9:21 AM, Toke Høiland-Jørgensen wrote:
> Rather, what we should be doing is exposing the functionality through
> helpers so XDP can hook into the data structures already present in the
> kernel and make decisions based on what is contained there. We already
> have that for routing; L2 bridging, and some kind of connection
> tracking, are obvious contenders for similar additions.

Given the way OVS is coded and expected to flow (ovs_vport_receive ->
ovs_dp_process_packet -> ovs_execute_actions -> do_execute_actions), I
do not see any way to refactor it to expose a hook to XDP. But if the
use case is not doing anything big with OVS (e.g., just ACLs and
forwarding), that is easy to replicate in XDP - but then that means
duplicated data and code.

Linux bridge on the other hand seems fairly straightforward to refactor.
One helper is needed to convert ingress <port,mac,vlan> to an L2 device
(and needs to consider stacked devices) and then a second one to access
the fdb for that device.

Either way, bypassing the bridge has mixed results: latency improves but
throughput takes a hit (no GRO).

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP
  2019-10-28  3:16                 ` David Ahern
@ 2019-10-28  8:36                   ` Toke Høiland-Jørgensen
  2019-10-28 10:08                     ` Jesper Dangaard Brouer
  2019-10-28 19:05                     ` David Ahern
  0 siblings, 2 replies; 58+ messages in thread
From: Toke Høiland-Jørgensen @ 2019-10-28  8:36 UTC (permalink / raw)
  To: David Ahern, Toshiaki Makita, John Fastabend, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	David S. Miller, Jakub Kicinski, Jesper Dangaard Brouer,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, Pablo Neira Ayuso,
	Jozsef Kadlecsik, Florian Westphal, Pravin B Shelar
  Cc: netdev, bpf, William Tu, Stanislav Fomichev

David Ahern <dsahern@gmail.com> writes:

> On 10/27/19 9:21 AM, Toke Høiland-Jørgensen wrote:
>> Rather, what we should be doing is exposing the functionality through
>> helpers so XDP can hook into the data structures already present in the
>> kernel and make decisions based on what is contained there. We already
>> have that for routing; L2 bridging, and some kind of connection
>> tracking, are obvious contenders for similar additions.
>
> The way OVS is coded and expected to flow (ovs_vport_receive ->
> ovs_dp_process_packet -> ovs_execute_actions -> do_execute_actions) I do
> not see any way to refactor it to expose a hook to XDP. But, if the use
> case is not doing anything big with OVS (e.g., just ACLs and forwarding)
> that is easy to replicate in XDP - but then that means duplicate data
> and code.

Yeah, I didn't mean that part for OVS, that was a general comment for
reusing kernel functionality.

> Linux bridge on the other hand seems fairly straightforward to
> refactor. One helper is needed to convert ingress <port,mac,vlan> to
> an L2 device (and needs to consider stacked devices) and then a second
> one to access the fdb for that device.

Why not just a single lookup, like what you did for routing? I'm not too
familiar with the routing code...

> Either way, bypassing the bridge has mixed results: latency improves
> but throughput takes a hit (no GRO).

Well, for some traffic mixes XDP should be able to keep up without GRO.
And longer term, we probably want to support GRO with XDP anyway
(I believe Jesper has plans for supporting bigger XDP frames)...

-Toke


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP
  2019-10-28  8:36                   ` Toke Høiland-Jørgensen
@ 2019-10-28 10:08                     ` Jesper Dangaard Brouer
  2019-10-28 19:07                       ` David Ahern
  2019-10-28 19:05                     ` David Ahern
  1 sibling, 1 reply; 58+ messages in thread
From: Jesper Dangaard Brouer @ 2019-10-28 10:08 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: David Ahern, Toshiaki Makita, John Fastabend, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	David S. Miller, Jakub Kicinski, Jesper Dangaard Brouer,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, Pablo Neira Ayuso,
	Jozsef Kadlecsik, Florian Westphal, Pravin B Shelar, netdev, bpf,
	William Tu, Stanislav Fomichev

On Mon, 28 Oct 2019 09:36:12 +0100
Toke Høiland-Jørgensen <toke@redhat.com> wrote:

> David Ahern <dsahern@gmail.com> writes:
> 
> > On 10/27/19 9:21 AM, Toke Høiland-Jørgensen wrote:  
> >> Rather, what we should be doing is exposing the functionality through
> >> helpers so XDP can hook into the data structures already present in the
> >> kernel and make decisions based on what is contained there. We already
> >> have that for routing; L2 bridging, and some kind of connection
> >> tracking, are obvious contenders for similar additions.  
> >
> > The way OVS is coded and expected to flow (ovs_vport_receive ->
> > ovs_dp_process_packet -> ovs_execute_actions -> do_execute_actions) I do
> > not see any way to refactor it to expose a hook to XDP. But, if the use
> > case is not doing anything big with OVS (e.g., just ACLs and forwarding)
> > that is easy to replicate in XDP - but then that means duplicate data
> > and code.  
> 
> Yeah, I didn't mean that part for OVS, that was a general comment for
> reusing kernel functionality.
> 
> > Linux bridge on the other hand seems fairly straightforward to
> > refactor. One helper is needed to convert ingress <port,mac,vlan> to
> > an L2 device (and needs to consider stacked devices) and then a second
> > one to access the fdb for that device.  
> 
> Why not just a single lookup like what you did for routing? Not too
> familiar with the routing code...

I'm also very interested in hearing more about how we can create an XDP
bridge lookup BPF-helper...


> > Either way, bypassing the bridge has mixed results: latency improves
> > but throughput takes a hit (no GRO).  
> 
> Well, for some traffic mixes XDP should be able to keep up without GRO.
> And longer term, we probably want to support GRO with XDP anyway

Do you have any numbers to back up your expected throughput decrease,
due to lack of GRO?  Or is it a theory?

GRO mainly gains performance due to the bulking effect.  XDP redirect
also has bulking.  For bridging, I would claim that XDP redirect
bulking works better, because it does bulking based on the egress
net_device (even for intermixed packets within one NAPI budget).  You
might worry that XDP will do a bridge lookup per frame, but as that code
likely fits in the CPU I-cache, this will have very little effect.


> (I believe Jesper has plans for supporting bigger XDP frames)...

Yes [1], but it's orthogonal and mostly there to support HW features,
like TSO, jumbo frames, and packet header split.

 [1] https://github.com/xdp-project/xdp-project/blob/master/areas/core/xdp-multi-buffer01-design.org
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP
  2019-10-28  8:36                   ` Toke Høiland-Jørgensen
  2019-10-28 10:08                     ` Jesper Dangaard Brouer
@ 2019-10-28 19:05                     ` David Ahern
  1 sibling, 0 replies; 58+ messages in thread
From: David Ahern @ 2019-10-28 19:05 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Toshiaki Makita,
	John Fastabend, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, David S. Miller,
	Jakub Kicinski, Jesper Dangaard Brouer, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, Pablo Neira Ayuso, Jozsef Kadlecsik,
	Florian Westphal, Pravin B Shelar
  Cc: netdev, bpf, William Tu, Stanislav Fomichev

On 10/28/19 2:36 AM, Toke Høiland-Jørgensen wrote:
> 
>> Linux bridge on the other hand seems fairly straightforward to
>> refactor. One helper is needed to convert ingress <port,mac,vlan> to
>> an L2 device (and needs to consider stacked devices) and then a second
>> one to access the fdb for that device.
> 
> Why not just a single lookup like what you did for routing? Not too
> familiar with the routing code...

The current code for routing only works for forwarding across ports
without vlans or other upper level devices. That is a very limited use
case and needs to be extended for VLANs and bonds (I have a POC for both).

The API is setup for the extra layers:

struct bpf_fib_lookup {
    ...
    /* input: L3 device index for lookup
     * output: device index from FIB lookup
     */
    __u32   ifindex;
   ...

For bridging, certainly step 1 is the same - define a bpf_fdb_lookup
struct and helper that takes an L2 device index and returns a
<port,vlan> pair.

However, this thread is about bridging with VMs / containers. A viable
solution for this use case MUST handle both vlans and bonds.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP
  2019-10-28 10:08                     ` Jesper Dangaard Brouer
@ 2019-10-28 19:07                       ` David Ahern
  0 siblings, 0 replies; 58+ messages in thread
From: David Ahern @ 2019-10-28 19:07 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Toke Høiland-Jørgensen
  Cc: Toshiaki Makita, John Fastabend, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	David S. Miller, Jakub Kicinski, Jesper Dangaard Brouer,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, Pablo Neira Ayuso,
	Jozsef Kadlecsik, Florian Westphal, Pravin B Shelar, netdev, bpf,
	William Tu, Stanislav Fomichev

On 10/28/19 4:08 AM, Jesper Dangaard Brouer wrote:
>>> Either way, bypassing the bridge has mixed results: latency improves
>>> but throughput takes a hit (no GRO).  
>>
>> Well, for some traffic mixes XDP should be able to keep up without GRO.
>> And longer term, we probably want to support GRO with XDP anyway
> 
> Do you have any numbers to back up your expected throughput decrease,
> due to lack of GRO?  Or is it a theory?
> 

Of course. I'll start a new thread about this rather than go too far
down this tangent relative to the current patches.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP
  2019-10-27 15:21               ` Toke Høiland-Jørgensen
  2019-10-28  3:16                 ` David Ahern
@ 2019-10-31  0:18                 ` Toshiaki Makita
  2019-10-31 12:12                   ` Toke Høiland-Jørgensen
  1 sibling, 1 reply; 58+ messages in thread
From: Toshiaki Makita @ 2019-10-31  0:18 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, John Fastabend,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
	Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal,
	Pravin B Shelar
  Cc: netdev, bpf, William Tu, Stanislav Fomichev

On 2019/10/28 0:21, Toke Høiland-Jørgensen wrote:
> Toshiaki Makita <toshiaki.makita1@gmail.com> writes:
>>> Yeah, you are right that it's something we're thinking about. I'm not
>>> sure we'll actually have the bandwidth to implement a complete solution
>>> ourselves, but we are very much interested in helping others do this,
>>> including smoothing out any rough edges (or adding missing features) in
>>> the core XDP feature set that is needed to achieve this :)
>>
>> I'm very interested in general usability solutions.
>> I'd appreciate if you could join the discussion.
>>
>> Here the basic idea of my approach is to reuse HW-offload infrastructure
>> in kernel.
>> Typical networking features in kernel have offload mechanism (TC flower,
>> nftables, bridge, routing, and so on).
>> In general these are what users want to accelerate, so easy XDP use also
>> should support these features IMO. With this idea, reusing existing
>> HW-offload mechanism is a natural way to me. OVS uses TC to offload
>> flows, then use TC for XDP as well...
> 
> I agree that XDP should be able to accelerate existing kernel
> functionality. However, this does not necessarily mean that the kernel
> has to generate an XDP program and install it, like your patch does.
> Rather, what we should be doing is exposing the functionality through
> helpers so XDP can hook into the data structures already present in the
> kernel and make decisions based on what is contained there. We already
> have that for routing; L2 bridging, and some kind of connection
> tracking, are obvious contenders for similar additions.

Thanks, adding helpers should be good in itself, but how does this let users
start using XDP without requiring them to write their own BPF code?

Toshiaki Makita

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP
  2019-10-27 19:17             ` David Miller
@ 2019-10-31  0:32               ` Toshiaki Makita
  2019-11-12 17:50                 ` William Tu
  0 siblings, 1 reply; 58+ messages in thread
From: Toshiaki Makita @ 2019-10-31  0:32 UTC (permalink / raw)
  To: David Miller, toke
  Cc: john.fastabend, ast, daniel, kafai, songliubraving, yhs,
	jakub.kicinski, hawk, jhs, xiyou.wangcong, jiri, pablo, kadlec,
	fw, pshelar, netdev, bpf, u9012063, sdf

On 2019/10/28 4:17, David Miller wrote:
> From: Toke Høiland-Jørgensen <toke@redhat.com>
> Date: Sun, 27 Oct 2019 16:24:24 +0100
> 
>> The results in the paper also shows somewhat disappointing performance
>> for the eBPF implementation, but that is not too surprising given that
>> it's implemented as a TC eBPF hook, not an XDP program. I seem to recall
>> that this was also one of the things puzzling to me back when this was
>> presented...
> 
> Also, no attempt was made to dynamically optimize the data structures
> and code generated in response to features actually used.
> 
> That's the big error.
> 
> The full OVS key is huge, OVS is really quite a monster.
> 
> But people don't use the entire key, nor do they use the totality of
> the data paths.
> 
> So just doing a 1-to-1 translation of the OVS datapath into BPF makes
> absolutely no sense whatsoever and it is guaranteed to have worse
> performance.

Agree that a 1-to-1 translation would result in worse performance.
What I'm doing now is just supporting a subset of keys, only very basic ones.
This does not accelerate all usages, so dynamic program generation certainly
has value. What is difficult is that flow insertion is basically triggered
by packet reception on the datapath, which causes latency spikes. Going
through the BPF verifier on each new-flow packet received on the datapath
does not look feasible, so we need to come up with something to avoid this.

Toshiaki Makita

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP
  2019-10-31  0:18                 ` Toshiaki Makita
@ 2019-10-31 12:12                   ` Toke Høiland-Jørgensen
  2019-11-11  7:32                     ` Toshiaki Makita
  0 siblings, 1 reply; 58+ messages in thread
From: Toke Høiland-Jørgensen @ 2019-10-31 12:12 UTC (permalink / raw)
  To: Toshiaki Makita, John Fastabend, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	David S. Miller, Jakub Kicinski, Jesper Dangaard Brouer,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, Pablo Neira Ayuso,
	Jozsef Kadlecsik, Florian Westphal, Pravin B Shelar
  Cc: netdev, bpf, William Tu, Stanislav Fomichev

Toshiaki Makita <toshiaki.makita1@gmail.com> writes:

> On 2019/10/28 0:21, Toke Høiland-Jørgensen wrote:
>> Toshiaki Makita <toshiaki.makita1@gmail.com> writes:
>>>> Yeah, you are right that it's something we're thinking about. I'm not
>>>> sure we'll actually have the bandwidth to implement a complete solution
>>>> ourselves, but we are very much interested in helping others do this,
>>>> including smoothing out any rough edges (or adding missing features) in
>>>> the core XDP feature set that is needed to achieve this :)
>>>
>>> I'm very interested in general usability solutions.
>>> I'd appreciate if you could join the discussion.
>>>
>>> Here the basic idea of my approach is to reuse HW-offload infrastructure
>>> in kernel.
>>> Typical networking features in kernel have offload mechanism (TC flower,
>>> nftables, bridge, routing, and so on).
>>> In general these are what users want to accelerate, so easy XDP use also
>>> should support these features IMO. With this idea, reusing existing
>>> HW-offload mechanism is a natural way to me. OVS uses TC to offload
>>> flows, then use TC for XDP as well...
>> 
>> I agree that XDP should be able to accelerate existing kernel
>> functionality. However, this does not necessarily mean that the kernel
>> has to generate an XDP program and install it, like your patch does.
>> Rather, what we should be doing is exposing the functionality through
>> helpers so XDP can hook into the data structures already present in the
>> kernel and make decisions based on what is contained there. We already
>> have that for routing; L2 bridging, and some kind of connection
>> tracking, are obvious contenders for similar additions.
>
> Thanks, adding helpers itself should be good, but how does this let users
> start using XDP without having them write their own BPF code?

It wouldn't in itself. But it would make it possible to write XDP
programs that could provide the same functionality; people would then
need to run those programs to actually opt-in to this.

For some cases this would be a simple "on/off switch", e.g.,
"xdp-route-accel --load <dev>", which would install an XDP program that
uses the regular kernel routing table (and the same with bridging). We
are planning to collect such utilities in the xdp-tools repo - I am
currently working on a simple packet filter:
https://github.com/xdp-project/xdp-tools/tree/xdp-filter

For more advanced use cases (such as OVS), the application packages will
need to integrate and load their own XDP support. We should encourage
that, and help smooth out any rough edges (such as missing features)
needed for this to happen.

-Toke


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP
  2019-10-31 12:12                   ` Toke Høiland-Jørgensen
@ 2019-11-11  7:32                     ` Toshiaki Makita
  2019-11-12 16:53                       ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 58+ messages in thread
From: Toshiaki Makita @ 2019-11-11  7:32 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, John Fastabend,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
	Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal,
	Pravin B Shelar
  Cc: netdev, bpf, William Tu, Stanislav Fomichev

Hi Toke,

Sorry for the delay.

On 2019/10/31 21:12, Toke Høiland-Jørgensen wrote:
> Toshiaki Makita <toshiaki.makita1@gmail.com> writes:
> 
>> On 2019/10/28 0:21, Toke Høiland-Jørgensen wrote:
>>> Toshiaki Makita <toshiaki.makita1@gmail.com> writes:
>>>>> Yeah, you are right that it's something we're thinking about. I'm not
>>>>> sure we'll actually have the bandwidth to implement a complete solution
>>>>> ourselves, but we are very much interested in helping others do this,
>>>>> including smoothing out any rough edges (or adding missing features) in
>>>>> the core XDP feature set that is needed to achieve this :)
>>>>
>>>> I'm very interested in general usability solutions.
>>>> I'd appreciate if you could join the discussion.
>>>>
>>>> Here the basic idea of my approach is to reuse HW-offload infrastructure
>>>> in kernel.
>>>> Typical networking features in kernel have offload mechanism (TC flower,
>>>> nftables, bridge, routing, and so on).
>>>> In general these are what users want to accelerate, so easy XDP use also
>>>> should support these features IMO. With this idea, reusing existing
>>>> HW-offload mechanism is a natural way to me. OVS uses TC to offload
>>>> flows, then use TC for XDP as well...
>>>
>>> I agree that XDP should be able to accelerate existing kernel
>>> functionality. However, this does not necessarily mean that the kernel
>>> has to generate an XDP program and install it, like your patch does.
>>> Rather, what we should be doing is exposing the functionality through
>>> helpers so XDP can hook into the data structures already present in the
>>> kernel and make decisions based on what is contained there. We already
>>> have that for routing; L2 bridging, and some kind of connection
>>> tracking, are obvious contenders for similar additions.
>>
>> Thanks, adding helpers itself should be good, but how does this let users
>> start using XDP without having them write their own BPF code?
> 
> It wouldn't in itself. But it would make it possible to write XDP
> programs that could provide the same functionality; people would then
> need to run those programs to actually opt-in to this.
> 
> For some cases this would be a simple "on/off switch", e.g.,
> "xdp-route-accel --load <dev>", which would install an XDP program that
> uses the regular kernel routing table (and the same with bridging). We
> are planning to collect such utilities in the xdp-tools repo - I am
> currently working on a simple packet filter:
> https://github.com/xdp-project/xdp-tools/tree/xdp-filter

Let me confirm how this tool adds filter rules.
Is this adding another command-line tool for the firewall?

If so, that is different from my goal.
Introducing another command-line tool will require people to learn more.

My proposal is to reuse the kernel interface to minimize such need for learning.

Toshiaki Makita

> For more advanced use cases (such as OVS), the application packages will
> need to integrate and load their own XDP support. We should encourage
> that, and help smooth out any rough edges (such as missing features)
> needed for this to happen.
> 
> -Toke
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP
  2019-11-11  7:32                     ` Toshiaki Makita
@ 2019-11-12 16:53                       ` Toke Høiland-Jørgensen
  2019-11-14 10:11                         ` Toshiaki Makita
  0 siblings, 1 reply; 58+ messages in thread
From: Toke Høiland-Jørgensen @ 2019-11-12 16:53 UTC (permalink / raw)
  To: Toshiaki Makita, John Fastabend, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	David S. Miller, Jakub Kicinski, Jesper Dangaard Brouer,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, Pablo Neira Ayuso,
	Jozsef Kadlecsik, Florian Westphal, Pravin B Shelar
  Cc: netdev, bpf, William Tu, Stanislav Fomichev

Toshiaki Makita <toshiaki.makita1@gmail.com> writes:
>
> Hi Toke,
>
> Sorry for the delay.
>
> On 2019/10/31 21:12, Toke Høiland-Jørgensen wrote:
>> Toshiaki Makita <toshiaki.makita1@gmail.com> writes:
>> 
>>> On 2019/10/28 0:21, Toke Høiland-Jørgensen wrote:
>>>> Toshiaki Makita <toshiaki.makita1@gmail.com> writes:
>>>>>> Yeah, you are right that it's something we're thinking about. I'm not
>>>>>> sure we'll actually have the bandwidth to implement a complete solution
>>>>>> ourselves, but we are very much interested in helping others do this,
>>>>>> including smoothing out any rough edges (or adding missing features) in
>>>>>> the core XDP feature set that is needed to achieve this :)
>>>>>
>>>>> I'm very interested in general usability solutions.
>>>>> I'd appreciate if you could join the discussion.
>>>>>
>>>>> Here the basic idea of my approach is to reuse HW-offload infrastructure
>>>>> in kernel.
>>>>> Typical networking features in kernel have offload mechanism (TC flower,
>>>>> nftables, bridge, routing, and so on).
>>>>> In general these are what users want to accelerate, so easy XDP use also
>>>>> should support these features IMO. With this idea, reusing existing
>>>>> HW-offload mechanism is a natural way to me. OVS uses TC to offload
>>>>> flows, then use TC for XDP as well...
>>>>
>>>> I agree that XDP should be able to accelerate existing kernel
>>>> functionality. However, this does not necessarily mean that the kernel
>>>> has to generate an XDP program and install it, like your patch does.
>>>> Rather, what we should be doing is exposing the functionality through
>>>> helpers so XDP can hook into the data structures already present in the
>>>> kernel and make decisions based on what is contained there. We already
>>>> have that for routing; L2 bridging, and some kind of connection
>>>> tracking, are obvious contenders for similar additions.
>>>
>>> Thanks, adding helpers itself should be good, but how does this let users
>>> start using XDP without having them write their own BPF code?
>> 
>> It wouldn't in itself. But it would make it possible to write XDP
>> programs that could provide the same functionality; people would then
>> need to run those programs to actually opt-in to this.
>> 
>> For some cases this would be a simple "on/off switch", e.g.,
>> "xdp-route-accel --load <dev>", which would install an XDP program that
>> uses the regular kernel routing table (and the same with bridging). We
>> are planning to collect such utilities in the xdp-tools repo - I am
>> currently working on a simple packet filter:
>> https://github.com/xdp-project/xdp-tools/tree/xdp-filter
>
> Let me confirm how this tool adds filter rules.
> Is this adding another commandline tool for firewall?
>
> If so, that is different from my goal.
> Introducing another commandline tool will require people to learn
> more.
>
> My proposal is to reuse kernel interface to minimize such need for
> learning.

I wasn't proposing that this particular tool should be a replacement for
the kernel packet filter; it's deliberately fairly limited in
functionality. My point was that we could create other such tools for
specific use cases which could be more or less drop-in (similar to how
nftables has a command line tool that is compatible with the iptables
syntax).

I'm all for exposing more of the existing kernel capabilities to XDP.
However, I think it's the wrong approach to do this by reimplementing
the functionality in an eBPF program and replicating the state in maps;
instead, it's better to refactor the existing kernel functionality so it
can be called directly from an eBPF helper function. And then ship a
tool as part of xdp-tools that installs an XDP program to make use of
these helpers to accelerate the functionality.

Take your example of TC rules: You were proposing a flow like this:

Userspace TC rule -> kernel rule table -> eBPF map -> generated XDP
program

Whereas what I mean is that we could do this instead:

Userspace TC rule -> kernel rule table

and separately

XDP program -> bpf helper -> lookup in kernel rule table


-Toke


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP
  2019-10-27 15:24           ` Toke Høiland-Jørgensen
  2019-10-27 19:17             ` David Miller
@ 2019-11-12 17:38             ` William Tu
  1 sibling, 0 replies; 58+ messages in thread
From: William Tu @ 2019-11-12 17:38 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Toshiaki Makita, John Fastabend, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	David S. Miller, Jakub Kicinski, Jesper Dangaard Brouer,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, Pablo Neira Ayuso,
	Jozsef Kadlecsik, Florian Westphal, Pravin B Shelar,
	Linux Kernel Network Developers, bpf, Stanislav Fomichev

On Sun, Oct 27, 2019 at 8:24 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Toshiaki Makita <toshiaki.makita1@gmail.com> writes:
>
> > On 19/10/23 (水) 2:45:05, Toke Høiland-Jørgensen wrote:
> >> John Fastabend <john.fastabend@gmail.com> writes:
> >>
> >>> I think for sysadmins in general (not OVS) use case I would work
> >>> with Jesper and Toke. They seem to be working on this specific
> >>> problem.
> >>
> >> We're definitely thinking about how we can make "XDP magically speeds up
> >> my network stack" a reality, if that's what you mean. Not that we have
> >> arrived at anything specific yet...
> >>
> >> And yeah, I'd also be happy to discuss what it would take to make a
> >> native XDP implementation of the OVS datapath; including what (if
> >> anything) is missing from the current XDP feature set to make this
> >> feasible. I must admit that I'm not quite clear on why that wasn't the
> >> approach picked for the first attempt to speed up OVS using XDP...
> >
> > Here's some history from William Tu et al.
> > https://linuxplumbersconf.org/event/2/contributions/107/
> >
> > Although his aim was not to speed up OVS but to add kernel-independent
> > datapath, his experience shows full OVS support by eBPF is very
> > difficult.
>
> Yeah, I remember seeing that presentation; it still isn't clear to me
> what exactly the issue was with implementing the OVS datapath in eBPF.
> As far as I can tell from glancing through the paper, it only lists
> program size and lack of loops as limitations, both of which have been
> lifted now.
>
Sorry, it's not very clear in the presentation and paper.
Some of the limitations have been resolved since then; let me list my experiences.

This is from OVS's feature requirements.
What's missing in eBPF:
- limited stack size (resolved now)
- limited program size (resolved now)
- dynamic loop support for OVS actions applied to a packet
  (bounded loops are now supported)
- no connection tracking/ALG support (people suggest looking at Cilium)
- no packet fragment/defragment support
- no wildcard table/map type support
I think it would be good to restart the project using
existing eBPF features.

What's missing in XDP:
- clone a packet: this is a very basic feature for a switch to
  broadcast/multicast. I understand it's hard to implement.
  A workaround is to XDP_PASS and let tc do the clone, but that is slow.

Because of the missing packet cloning support, I didn't try implementing
the OVS datapath in XDP.

> The results in the paper also shows somewhat disappointing performance
> for the eBPF implementation, but that is not too surprising given that
> it's implemented as a TC eBPF hook, not an XDP program. I seem to recall
> that this was also one of the things puzzling to me back when this was
> presented...

Right, the point of that project is not performance improvement.
But sort of to see how existing eBPF feature can be used to implement
all features needed by OVS datapath.

Regards,
William

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP
  2019-10-31  0:32               ` Toshiaki Makita
@ 2019-11-12 17:50                 ` William Tu
  2019-11-14 10:06                   ` Toshiaki Makita
  0 siblings, 1 reply; 58+ messages in thread
From: William Tu @ 2019-11-12 17:50 UTC (permalink / raw)
  To: Toshiaki Makita
  Cc: David Miller, Toke Høiland-Jørgensen, John Fastabend,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, Jakub Kicinski, Jesper Dangaard Brouer,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, Pablo Neira Ayuso,
	Jozsef Kadlecsik, Florian Westphal, pravin shelar,
	Linux Kernel Network Developers, bpf, Stanislav Fomichev

On Wed, Oct 30, 2019 at 5:32 PM Toshiaki Makita
<toshiaki.makita1@gmail.com> wrote:
>
> On 2019/10/28 4:17, David Miller wrote:
> > From: Toke Høiland-Jørgensen <toke@redhat.com>
> > Date: Sun, 27 Oct 2019 16:24:24 +0100
> >
> >> The results in the paper also shows somewhat disappointing performance
> >> for the eBPF implementation, but that is not too surprising given that
> >> it's implemented as a TC eBPF hook, not an XDP program. I seem to recall
> >> that this was also one of the things puzzling to me back when this was
> >> presented...
> >
> > Also, no attempt was made to dynamically optimize the data structures
> > and code generated in response to features actually used.
> >
> > That's the big error.
> >
> > The full OVS key is huge, OVS is really quite a monster.
> >
> > But people don't use the entire key, nor do they use the totality of
> > the data paths.
> >
> > So just doing a 1-to-1 translation of the OVS datapath into BPF makes
> > absolutely no sense whatsoever and it is guaranteed to have worse
> > performance.

A 1-to-1 translation has nothing to do with performance.

eBPF/XDP is faster only when you can bypass/shortcut some code.
If the number of features required is the same, then an eBPF
implementation should be less than or equal to a kernel module's
performance: "less than" because eBPF usually has some limitations,
so you have to redesign the data structures.

It's possible that after redesigning your data structures for eBPF,
they become faster, but there is no such case in my experience.

Regards,
William

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP
  2019-11-12 17:50                 ` William Tu
@ 2019-11-14 10:06                   ` Toshiaki Makita
  2019-11-14 17:09                     ` William Tu
  0 siblings, 1 reply; 58+ messages in thread
From: Toshiaki Makita @ 2019-11-14 10:06 UTC (permalink / raw)
  To: William Tu
  Cc: David Miller, Toke Høiland-Jørgensen, John Fastabend,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, Jakub Kicinski, Jesper Dangaard Brouer,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, Pablo Neira Ayuso,
	Jozsef Kadlecsik, Florian Westphal, pravin shelar,
	Linux Kernel Network Developers, bpf, Stanislav Fomichev

On 2019/11/13 2:50, William Tu wrote:
> On Wed, Oct 30, 2019 at 5:32 PM Toshiaki Makita
> <toshiaki.makita1@gmail.com> wrote:
>>
>> On 2019/10/28 4:17, David Miller wrote:
>>> From: Toke Høiland-Jørgensen <toke@redhat.com>
>>> Date: Sun, 27 Oct 2019 16:24:24 +0100
>>>
>>>> The results in the paper also shows somewhat disappointing performance
>>>> for the eBPF implementation, but that is not too surprising given that
>>>> it's implemented as a TC eBPF hook, not an XDP program. I seem to recall
>>>> that this was also one of the things puzzling to me back when this was
>>>> presented...
>>>
>>> Also, no attempt was made to dynamically optimize the data structures
>>> and code generated in response to features actually used.
>>>
>>> That's the big error.
>>>
>>> The full OVS key is huge, OVS is really quite a monster.
>>>
>>> But people don't use the entire key, nor do they use the totality of
>>> the data paths.
>>>
>>> So just doing a 1-to-1 translation of the OVS datapath into BPF makes
>>> absolutely no sense whatsoever and it is guaranteed to have worse
>>> performance.
> 
> 1-to-1 translation has nothing to do with performance.

I think at least key size matters.
One of the big hot spots in the xdp_flow BPF program is hash table lookup.
In particular, hash calculation by jhash and key comparison are heavy,
and their computational cost heavily depends on key size.

If the umh can determine that some keys won't be used (not sure if that's
practical though), it can load an XDP program which uses a smaller key.
It can also remove unnecessary key parser routines.
If that's possible, the performance will increase.

Toshiaki Makita

> 
> eBPF/XDP is faster only when you can by-pass/shortcut some code.
> If the number of features required are the same, then an eBPF
> implementation should be less than or equal to a kernel module's
> performance. "less than" because eBPF usually has some limitations
> so you have to redesign the data structure.
> 
> It's possible that after redesigning your data structure to eBPF,
> it becomes faster. But there is no such case in my experience.
> 
> Regards,
> William
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP
  2019-11-12 16:53                       ` Toke Høiland-Jørgensen
@ 2019-11-14 10:11                         ` Toshiaki Makita
  2019-11-14 12:41                           ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 58+ messages in thread
From: Toshiaki Makita @ 2019-11-14 10:11 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, John Fastabend,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
	Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal,
	Pravin B Shelar
  Cc: netdev, bpf, William Tu, Stanislav Fomichev

On 2019/11/13 1:53, Toke Høiland-Jørgensen wrote:
> Toshiaki Makita <toshiaki.makita1@gmail.com> writes:
>>
>> Hi Toke,
>>
>> Sorry for the delay.
>>
>> On 2019/10/31 21:12, Toke Høiland-Jørgensen wrote:
>>> Toshiaki Makita <toshiaki.makita1@gmail.com> writes:
>>>
>>>> On 2019/10/28 0:21, Toke Høiland-Jørgensen wrote:
>>>>> Toshiaki Makita <toshiaki.makita1@gmail.com> writes:
>>>>>>> Yeah, you are right that it's something we're thinking about. I'm not
>>>>>>> sure we'll actually have the bandwidth to implement a complete solution
>>>>>>> ourselves, but we are very much interested in helping others do this,
>>>>>>> including smoothing out any rough edges (or adding missing features) in
>>>>>>> the core XDP feature set that is needed to achieve this :)
>>>>>>
>>>>>> I'm very interested in general usability solutions.
>>>>>> I'd appreciate if you could join the discussion.
>>>>>>
>>>>>> Here the basic idea of my approach is to reuse HW-offload infrastructure
>>>>>> in kernel.
>>>>>> Typical networking features in kernel have offload mechanism (TC flower,
>>>>>> nftables, bridge, routing, and so on).
>>>>>> In general these are what users want to accelerate, so easy XDP use also
>>>>>> should support these features IMO. With this idea, reusing existing
>>>>>> HW-offload mechanism is a natural way to me. OVS uses TC to offload
>>>>>> flows, then use TC for XDP as well...
>>>>>
>>>>> I agree that XDP should be able to accelerate existing kernel
>>>>> functionality. However, this does not necessarily mean that the kernel
>>>>> has to generate an XDP program and install it, like your patch does.
>>>>> Rather, what we should be doing is exposing the functionality through
>>>>> helpers so XDP can hook into the data structures already present in the
>>>>> kernel and make decisions based on what is contained there. We already
>>>>> have that for routing; L2 bridging, and some kind of connection
>>>>> tracking, are obvious contenders for similar additions.
>>>>
>>>> Thanks, adding helpers itself should be good, but how does this let users
>>>> start using XDP without having them write their own BPF code?
>>>
>>> It wouldn't in itself. But it would make it possible to write XDP
>>> programs that could provide the same functionality; people would then
>>> need to run those programs to actually opt-in to this.
>>>
>>> For some cases this would be a simple "on/off switch", e.g.,
>>> "xdp-route-accel --load <dev>", which would install an XDP program that
>>> uses the regular kernel routing table (and the same with bridging). We
>>> are planning to collect such utilities in the xdp-tools repo - I am
>>> currently working on a simple packet filter:
>>> https://github.com/xdp-project/xdp-tools/tree/xdp-filter
>>
>> Let me confirm how this tool adds filter rules.
>> Is this adding another commandline tool for firewall?
>>
>> If so, that is different from my goal.
>> Introducing another commandline tool will require people to learn
>> more.
>>
>> My proposal is to reuse kernel interface to minimize such need for
>> learning.
> 
> I wasn't proposing that this particular tool should be a replacement for
> the kernel packet filter; it's deliberately fairly limited in
> functionality. My point was that we could create other such tools for
> specific use cases which could be more or less drop-in (similar to how
> nftables has a command line tool that is compatible with the iptables
> syntax).
> 
> I'm all for exposing more of the existing kernel capabilities to XDP.
> However, I think it's the wrong approach to do this by reimplementing
> the functionality in an eBPF program and replicating the state in maps;
> instead, it's better to refactor the existing kernel functionality so it
> can be called directly from an eBPF helper function. And then ship a
> tool as part of xdp-tools that installs an XDP program to make use of
> these helpers to accelerate the functionality.
> 
> Take your example of TC rules: You were proposing a flow like this:
> 
> Userspace TC rule -> kernel rule table -> eBPF map -> generated XDP
> program
> 
> Whereas what I mean is that we could do this instead:
> 
> Userspace TC rule -> kernel rule table
> 
> and separately
> 
> XDP program -> bpf helper -> lookup in kernel rule table

Thanks, now I see what you mean.
You expect an XDP program like this, right?

int xdp_tc(struct xdp_md *ctx)
{
	int act = bpf_xdp_tc_filter(ctx);
	return act;
}

But doesn't this approach lose the chance to reduce/minimize the program
so that it only uses the features necessary for this device?

Toshiaki Makita

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP
  2019-11-14 10:11                         ` Toshiaki Makita
@ 2019-11-14 12:41                           ` Toke Høiland-Jørgensen
  2019-11-18  6:41                             ` Toshiaki Makita
  0 siblings, 1 reply; 58+ messages in thread
From: Toke Høiland-Jørgensen @ 2019-11-14 12:41 UTC (permalink / raw)
  To: Toshiaki Makita, John Fastabend, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	David S. Miller, Jakub Kicinski, Jesper Dangaard Brouer,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, Pablo Neira Ayuso,
	Jozsef Kadlecsik, Florian Westphal, Pravin B Shelar
  Cc: netdev, bpf, William Tu, Stanislav Fomichev

Toshiaki Makita <toshiaki.makita1@gmail.com> writes:

> On 2019/11/13 1:53, Toke Høiland-Jørgensen wrote:
>> Toshiaki Makita <toshiaki.makita1@gmail.com> writes:
>>>
>>> Hi Toke,
>>>
>>> Sorry for the delay.
>>>
>>> On 2019/10/31 21:12, Toke Høiland-Jørgensen wrote:
>>>> Toshiaki Makita <toshiaki.makita1@gmail.com> writes:
>>>>
>>>>> On 2019/10/28 0:21, Toke Høiland-Jørgensen wrote:
>>>>>> Toshiaki Makita <toshiaki.makita1@gmail.com> writes:
>>>>>>>> Yeah, you are right that it's something we're thinking about. I'm not
>>>>>>>> sure we'll actually have the bandwidth to implement a complete solution
>>>>>>>> ourselves, but we are very much interested in helping others do this,
>>>>>>>> including smoothing out any rough edges (or adding missing features) in
>>>>>>>> the core XDP feature set that is needed to achieve this :)
>>>>>>>
>>>>>>> I'm very interested in general usability solutions.
>>>>>>> I'd appreciate if you could join the discussion.
>>>>>>>
>>>>>>> Here the basic idea of my approach is to reuse HW-offload infrastructure
>>>>>>> in kernel.
>>>>>>> Typical networking features in kernel have offload mechanism (TC flower,
>>>>>>> nftables, bridge, routing, and so on).
>>>>>>> In general these are what users want to accelerate, so easy XDP use also
>>>>>>> should support these features IMO. With this idea, reusing existing
>>>>>>> HW-offload mechanism is a natural way to me. OVS uses TC to offload
>>>>>>> flows, then use TC for XDP as well...
>>>>>>
>>>>>> I agree that XDP should be able to accelerate existing kernel
>>>>>> functionality. However, this does not necessarily mean that the kernel
>>>>>> has to generate an XDP program and install it, like your patch does.
>>>>>> Rather, what we should be doing is exposing the functionality through
>>>>>> helpers so XDP can hook into the data structures already present in the
>>>>>> kernel and make decisions based on what is contained there. We already
>>>>>> have that for routing; L2 bridging, and some kind of connection
>>>>>> tracking, are obvious contenders for similar additions.
>>>>>
>>>>> Thanks, adding helpers itself should be good, but how does this let users
>>>>> start using XDP without having them write their own BPF code?
>>>>
>>>> It wouldn't in itself. But it would make it possible to write XDP
>>>> programs that could provide the same functionality; people would then
>>>> need to run those programs to actually opt-in to this.
>>>>
>>>> For some cases this would be a simple "on/off switch", e.g.,
>>>> "xdp-route-accel --load <dev>", which would install an XDP program that
>>>> uses the regular kernel routing table (and the same with bridging). We
>>>> are planning to collect such utilities in the xdp-tools repo - I am
>>>> currently working on a simple packet filter:
>>>> https://github.com/xdp-project/xdp-tools/tree/xdp-filter
>>>
>>> Let me confirm how this tool adds filter rules.
>>> Is this adding another commandline tool for firewall?
>>>
>>> If so, that is different from my goal.
>>> Introducing another commandline tool will require people to learn
>>> more.
>>>
>>> My proposal is to reuse kernel interface to minimize such need for
>>> learning.
>> 
>> I wasn't proposing that this particular tool should be a replacement for
>> the kernel packet filter; it's deliberately fairly limited in
>> functionality. My point was that we could create other such tools for
>> specific use cases which could be more or less drop-in (similar to how
>> nftables has a command line tool that is compatible with the iptables
>> syntax).
>> 
>> I'm all for exposing more of the existing kernel capabilities to XDP.
>> However, I think it's the wrong approach to do this by reimplementing
>> the functionality in an eBPF program and replicating the state in maps;
>> instead, it's better to refactor the existing kernel functionality so it
>> can be called directly from an eBPF helper function. And then ship a
>> tool as part of xdp-tools that installs an XDP program to make use of
>> these helpers to accelerate the functionality.
>> 
>> Take your example of TC rules: You were proposing a flow like this:
>> 
>> Userspace TC rule -> kernel rule table -> eBPF map -> generated XDP
>> program
>> 
>> Whereas what I mean is that we could do this instead:
>> 
>> Userspace TC rule -> kernel rule table
>> 
>> and separately
>> 
>> XDP program -> bpf helper -> lookup in kernel rule table
>
> Thanks, now I see what you mean.
> You expect an XDP program like this, right?
>
> int xdp_tc(struct xdp_md *ctx)
> {
> 	int act = bpf_xdp_tc_filter(ctx);
> 	return act;
> }

Yes, basically, except that the XDP program would need to parse the
packet first, and bpf_xdp_tc_filter() would take a parameter struct with
the parsed values. See the usage of bpf_fib_lookup() in
samples/bpf/xdp_fwd_kern.c

> But doesn't this way lose a chance to reduce/minimize the program to
> only use necessary features for this device?

Not necessarily. Since the BPF program does the packet parsing and fills
in the TC filter lookup data structure, it can limit what features are
used that way (e.g., if I only want to do IPv6, I just parse the v6
header, ignore TCP/UDP, and drop everything that's not IPv6). The lookup
helper could also have a flag argument to disable some of the lookup
features.
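Fleshed out in that style, the program could look roughly like this. This is
a sketch only: bpf_xdp_tc_filter() and struct bpf_tc_filter_params do not
exist today; both names, and the struct's fields, are hypothetical, modeled
on how bpf_fib_lookup() is used in samples/bpf/xdp_fwd_kern.c:

```c
/* Sketch only -- helper and param struct are hypothetical. */
SEC("xdp")
int xdp_tc(struct xdp_md *ctx)
{
	void *data_end = (void *)(long)ctx->data_end;
	void *data = (void *)(long)ctx->data;
	struct ethhdr *eth = data;
	struct ipv6hdr *ip6h;
	struct bpf_tc_filter_params params = {};	/* hypothetical */

	if ((void *)(eth + 1) > data_end)
		return XDP_DROP;
	if (eth->h_proto != bpf_htons(ETH_P_IPV6))
		return XDP_PASS;	/* this variant only accelerates IPv6 */

	ip6h = (void *)(eth + 1);
	if ((void *)(ip6h + 1) > data_end)
		return XDP_DROP;

	/* Fill in only the fields this deployment actually matches on. */
	params.ifindex = ctx->ingress_ifindex;
	params.ipv6_src = ip6h->saddr;	/* field names hypothetical */
	params.ipv6_dst = ip6h->daddr;

	/* The helper does the lookup in the kernel's own TC filter tables
	 * and returns an XDP action (or "not handled", mapped to XDP_PASS). */
	return bpf_xdp_tc_filter(ctx, &params, sizeof(params), 0 /* flags */);
}
```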

It would probably require a bit of refactoring in the kernel data
structures so they can be used without being tied to an skb. David Ahern
did something similar for the fib. For the routing table case, that
resulted in a significant speedup: About 2.5x-3x the performance when
using it via XDP (depending on the number of routes in the table).

-Toke


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP
  2019-11-14 10:06                   ` Toshiaki Makita
@ 2019-11-14 17:09                     ` William Tu
  2019-11-15 13:16                       ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 58+ messages in thread
From: William Tu @ 2019-11-14 17:09 UTC (permalink / raw)
  To: Toshiaki Makita
  Cc: David Miller, Toke Høiland-Jørgensen, John Fastabend,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, Jakub Kicinski, Jesper Dangaard Brouer,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, Pablo Neira Ayuso,
	Jozsef Kadlecsik, Florian Westphal, pravin shelar,
	Linux Kernel Network Developers, bpf, Stanislav Fomichev

On Thu, Nov 14, 2019 at 2:06 AM Toshiaki Makita
<toshiaki.makita1@gmail.com> wrote:
>
> On 2019/11/13 2:50, William Tu wrote:
> > On Wed, Oct 30, 2019 at 5:32 PM Toshiaki Makita
> > <toshiaki.makita1@gmail.com> wrote:
> >>
> >> On 2019/10/28 4:17, David Miller wrote:
> >>> From: Toke Høiland-Jørgensen <toke@redhat.com>
> >>> Date: Sun, 27 Oct 2019 16:24:24 +0100
> >>>
> >>>> The results in the paper also shows somewhat disappointing performance
> >>>> for the eBPF implementation, but that is not too surprising given that
> >>>> it's implemented as a TC eBPF hook, not an XDP program. I seem to recall
> >>>> that this was also one of the things puzzling to me back when this was
> >>>> presented...
> >>>
> >>> Also, no attempt was made to dynamically optimize the data structures
> >>> and code generated in response to features actually used.
> >>>
> >>> That's the big error.
> >>>
> >>> The full OVS key is huge, OVS is really quite a monster.
> >>>
> >>> But people don't use the entire key, nor do they use the totality of
> >>> the data paths.
> >>>
> >>> So just doing a 1-to-1 translation of the OVS datapath into BPF makes
> >>> absolutely no sense whatsoever and it is guaranteed to have worse
> >>> performance.
> >
> > 1-to-1 translation has nothing to do with performance.
>
> I think at least key size matters.
> One big hot spot in the xdp_flow bpf program is the hash table lookup.
> In particular, the hash calculation by jhash and the key comparison are
> heavy, and their computational cost depends heavily on the key size.
>
> If the umh can somehow determine that some keys won't be used (not sure
> if that's practical though), it can load an XDP program which uses a
> smaller key. It can also remove unnecessary key parser routines.
> If that's possible, the performance will increase.
>
Yes, that's a good point.
In other meetings people have also given me this suggestion.

Basically it's "on-demand flow key parsing using eBPF".
The key parsing consists of multiple eBPF programs; based on the
existing rules, we load the program that parses only the minimum
fields those rules require. This will definitely
have better performance.

I didn't try it because most of our use cases use overlay
tunnels and connection tracking, so there is little chance of rules
using only L2 or L3 fields. Another option is flow key compression,
something like miniflow in the OVS userspace datapath.

Regards,
William

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP
  2019-11-14 17:09                     ` William Tu
@ 2019-11-15 13:16                       ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 58+ messages in thread
From: Toke Høiland-Jørgensen @ 2019-11-15 13:16 UTC (permalink / raw)
  To: William Tu, Toshiaki Makita
  Cc: David Miller, John Fastabend, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	Jakub Kicinski, Jesper Dangaard Brouer, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, Pablo Neira Ayuso, Jozsef Kadlecsik,
	Florian Westphal, pravin shelar, Linux Kernel Network Developers,
	bpf, Stanislav Fomichev

William Tu <u9012063@gmail.com> writes:

> On Thu, Nov 14, 2019 at 2:06 AM Toshiaki Makita
> <toshiaki.makita1@gmail.com> wrote:
>>
>> On 2019/11/13 2:50, William Tu wrote:
>> > On Wed, Oct 30, 2019 at 5:32 PM Toshiaki Makita
>> > <toshiaki.makita1@gmail.com> wrote:
>> >>
>> >> On 2019/10/28 4:17, David Miller wrote:
>> >>> From: Toke Høiland-Jørgensen <toke@redhat.com>
>> >>> Date: Sun, 27 Oct 2019 16:24:24 +0100
>> >>>
>> >>>> The results in the paper also shows somewhat disappointing performance
>> >>>> for the eBPF implementation, but that is not too surprising given that
>> >>>> it's implemented as a TC eBPF hook, not an XDP program. I seem to recall
>> >>>> that this was also one of the things puzzling to me back when this was
>> >>>> presented...
>> >>>
>> >>> Also, no attempt was made to dynamically optimize the data structures
>> >>> and code generated in response to features actually used.
>> >>>
>> >>> That's the big error.
>> >>>
>> >>> The full OVS key is huge, OVS is really quite a monster.
>> >>>
>> >>> But people don't use the entire key, nor do they use the totality of
>> >>> the data paths.
>> >>>
>> >>> So just doing a 1-to-1 translation of the OVS datapath into BPF makes
>> >>> absolutely no sense whatsoever and it is guaranteed to have worse
>> >>> performance.
>> >
>> > 1-to-1 translation has nothing to do with performance.
>>
>> I think at least key size matters.
>> One big hot spot in the xdp_flow bpf program is the hash table lookup.
>> In particular, the hash calculation by jhash and the key comparison are
>> heavy, and their computational cost depends heavily on the key size.
>>
>> If the umh can somehow determine that some keys won't be used (not sure
>> if that's practical though), it can load an XDP program which uses a
>> smaller key. It can also remove unnecessary key parser routines.
>> If that's possible, the performance will increase.
>>
> Yes, that's a good point.
> In other meetings people have also given me this suggestion.
>
> Basically it's "on-demand flow key parsing using eBPF".
> The key parsing consists of multiple eBPF programs; based on the
> existing rules, we load the program that parses only the minimum
> fields those rules require. This will definitely
> have better performance.

See the xdp-filter program[0] for a simple example of how to do this
with pre-compiled BPF programs. Basically, what it does is generate
different versions of the same program with different subsets of
functionality included (through ifdefs). The feature set of each program
is saved as a feature bitmap, and the loader will dynamically select
which program to load based on which features the user enables at
runtime.

The nice thing about this is that it doesn't require dynamic program
generation, and everything can be compiled ahead of time. The drawback
is that you'll end up with a combinatorial explosion of program variants
if you want full granularity in your feature selection.

-Toke

[0] https://github.com/xdp-project/xdp-tools/tree/master/xdp-filter


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP
  2019-11-14 12:41                           ` Toke Høiland-Jørgensen
@ 2019-11-18  6:41                             ` Toshiaki Makita
  2019-11-18 10:20                               ` Toke Høiland-Jørgensen
  2019-11-18 10:28                               ` Toke Høiland-Jørgensen
  0 siblings, 2 replies; 58+ messages in thread
From: Toshiaki Makita @ 2019-11-18  6:41 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, John Fastabend,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
	Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal,
	Pravin B Shelar
  Cc: netdev, bpf, William Tu, Stanislav Fomichev

On 2019/11/14 21:41, Toke Høiland-Jørgensen wrote:
> Toshiaki Makita <toshiaki.makita1@gmail.com> writes:
> 
>> On 2019/11/13 1:53, Toke Høiland-Jørgensen wrote:
>>> Toshiaki Makita <toshiaki.makita1@gmail.com> writes:
>>>>
>>>> Hi Toke,
>>>>
>>>> Sorry for the delay.
>>>>
>>>> On 2019/10/31 21:12, Toke Høiland-Jørgensen wrote:
>>>>> Toshiaki Makita <toshiaki.makita1@gmail.com> writes:
>>>>>
>>>>>> On 2019/10/28 0:21, Toke Høiland-Jørgensen wrote:
>>>>>>> Toshiaki Makita <toshiaki.makita1@gmail.com> writes:
>>>>>>>>> Yeah, you are right that it's something we're thinking about. I'm not
>>>>>>>>> sure we'll actually have the bandwidth to implement a complete solution
>>>>>>>>> ourselves, but we are very much interested in helping others do this,
>>>>>>>>> including smoothing out any rough edges (or adding missing features) in
>>>>>>>>> the core XDP feature set that is needed to achieve this :)
>>>>>>>>
>>>>>>>> I'm very interested in general usability solutions.
>>>>>>>> I'd appreciate if you could join the discussion.
>>>>>>>>
>>>>>>>> Here the basic idea of my approach is to reuse HW-offload infrastructure
>>>>>>>> in kernel.
>>>>>>>> Typical networking features in kernel have offload mechanism (TC flower,
>>>>>>>> nftables, bridge, routing, and so on).
>>>>>>>> In general these are what users want to accelerate, so easy XDP use also
>>>>>>>> should support these features IMO. With this idea, reusing existing
>>>>>>>> HW-offload mechanism is a natural way to me. OVS uses TC to offload
>>>>>>>> flows, then use TC for XDP as well...
>>>>>>>
>>>>>>> I agree that XDP should be able to accelerate existing kernel
>>>>>>> functionality. However, this does not necessarily mean that the kernel
>>>>>>> has to generate an XDP program and install it, like your patch does.
>>>>>>> Rather, what we should be doing is exposing the functionality through
>>>>>>> helpers so XDP can hook into the data structures already present in the
>>>>>>> kernel and make decisions based on what is contained there. We already
>>>>>>> have that for routing; L2 bridging, and some kind of connection
>>>>>>> tracking, are obvious contenders for similar additions.
>>>>>>
>>>>>> Thanks, adding helpers itself should be good, but how does this let users
>>>>>> start using XDP without having them write their own BPF code?
>>>>>
>>>>> It wouldn't in itself. But it would make it possible to write XDP
>>>>> programs that could provide the same functionality; people would then
>>>>> need to run those programs to actually opt-in to this.
>>>>>
>>>>> For some cases this would be a simple "on/off switch", e.g.,
>>>>> "xdp-route-accel --load <dev>", which would install an XDP program that
>>>>> uses the regular kernel routing table (and the same with bridging). We
>>>>> are planning to collect such utilities in the xdp-tools repo - I am
>>>>> currently working on a simple packet filter:
>>>>> https://github.com/xdp-project/xdp-tools/tree/xdp-filter
>>>>
>>>> Let me confirm how this tool adds filter rules.
>>>> Is this adding another commandline tool for firewall?
>>>>
>>>> If so, that is different from my goal.
>>>> Introducing another commandline tool will require people to learn
>>>> more.
>>>>
>>>> My proposal is to reuse kernel interface to minimize such need for
>>>> learning.
>>>
>>> I wasn't proposing that this particular tool should be a replacement for
>>> the kernel packet filter; it's deliberately fairly limited in
>>> functionality. My point was that we could create other such tools for
>>> specific use cases which could be more or less drop-in (similar to how
>>> nftables has a command line tool that is compatible with the iptables
>>> syntax).
>>>
>>> I'm all for exposing more of the existing kernel capabilities to XDP.
>>> However, I think it's the wrong approach to do this by reimplementing
>>> the functionality in an eBPF program and replicating the state in maps;
>>> instead, it's better to refactor the existing kernel functionality so it
>>> can be called directly from an eBPF helper function. And then ship a
>>> tool as part of xdp-tools that installs an XDP program to make use of
>>> these helpers to accelerate the functionality.
>>>
>>> Take your example of TC rules: You were proposing a flow like this:
>>>
>>> Userspace TC rule -> kernel rule table -> eBPF map -> generated XDP
>>> program
>>>
>>> Whereas what I mean is that we could do this instead:
>>>
>>> Userspace TC rule -> kernel rule table
>>>
>>> and separately
>>>
>>> XDP program -> bpf helper -> lookup in kernel rule table
>>
>> Thanks, now I see what you mean.
>> You expect an XDP program like this, right?
>>
>> int xdp_tc(struct xdp_md *ctx)
>> {
>> 	int act = bpf_xdp_tc_filter(ctx);
>> 	return act;
>> }
> 
> Yes, basically, except that the XDP program would need to parse the
> packet first, and bpf_xdp_tc_filter() would take a parameter struct with
> the parsed values. See the usage of bpf_fib_lookup() in
> bpf/samples/xdp_fwd_kern.c
> 
>> But doesn't this way lose a chance to reduce/minimize the program to
>> only use necessary features for this device?
> 
> Not necessarily. Since the BPF program does the packet parsing and fills
> in the TC filter lookup data structure, it can limit what features are
> used that way (e.g., if I only want to do IPv6, I just parse the v6
> header, ignore TCP/UDP, and drop everything that's not IPv6). The lookup
> helper could also have a flag argument to disable some of the lookup
> features.

It's unclear to me how to configure that.
Use options when attaching the program? Something like
$ xdp_tc attach eth0 --only-with ipv6
But can users always determine the features they need in advance?
Frequent manual reconfiguration whenever the TC rules change does not sound nice.
Or, add a hook to the kernel so that some daemon can listen for TC filter events
and automatically reload the attached program?

Another concern is key size. If we use the TC core, TC will use its hash table
with a fixed key size, so we cannot decrease the size of the hash table key this way?

> 
> It would probably require a bit of refactoring in the kernel data
> structures so they can be used without being tied to an skb. David Ahern
> did something similar for the fib. For the routing table case, that
> resulted in a significant speedup: About 2.5x-3x the performance when
> using it via XDP (depending on the number of routes in the table).

I'm curious how much the helper function can improve performance compared to
XDP programs which emulate the kernel feature without using such helpers.
2.5x-3x sounds a bit slow for XDP to me, but it may be a routing-specific problem.

Toshiaki Makita

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP
  2019-11-18  6:41                             ` Toshiaki Makita
@ 2019-11-18 10:20                               ` Toke Høiland-Jørgensen
  2019-11-22  5:42                                 ` Toshiaki Makita
  2019-11-18 10:28                               ` Toke Høiland-Jørgensen
  1 sibling, 1 reply; 58+ messages in thread
From: Toke Høiland-Jørgensen @ 2019-11-18 10:20 UTC (permalink / raw)
  To: Toshiaki Makita, John Fastabend, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	David S. Miller, Jakub Kicinski, Jesper Dangaard Brouer,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, Pablo Neira Ayuso,
	Jozsef Kadlecsik, Florian Westphal, Pravin B Shelar
  Cc: netdev, bpf, William Tu, Stanislav Fomichev

Toshiaki Makita <toshiaki.makita1@gmail.com> writes:

[... trimming the context a bit ...]

>>>> Take your example of TC rules: You were proposing a flow like this:
>>>>
>>>> Userspace TC rule -> kernel rule table -> eBPF map -> generated XDP
>>>> program
>>>>
>>>> Whereas what I mean is that we could do this instead:
>>>>
>>>> Userspace TC rule -> kernel rule table
>>>>
>>>> and separately
>>>>
>>>> XDP program -> bpf helper -> lookup in kernel rule table
>>>
>>> Thanks, now I see what you mean.
>>> You expect an XDP program like this, right?
>>>
>>> int xdp_tc(struct xdp_md *ctx)
>>> {
>>> 	int act = bpf_xdp_tc_filter(ctx);
>>> 	return act;
>>> }
>> 
>> Yes, basically, except that the XDP program would need to parse the
>> packet first, and bpf_xdp_tc_filter() would take a parameter struct with
>> the parsed values. See the usage of bpf_fib_lookup() in
>> bpf/samples/xdp_fwd_kern.c
>> 
>>> But doesn't this way lose a chance to reduce/minimize the program to
>>> only use necessary features for this device?
>> 
>> Not necessarily. Since the BPF program does the packet parsing and fills
>> in the TC filter lookup data structure, it can limit what features are
>> used that way (e.g., if I only want to do IPv6, I just parse the v6
>> header, ignore TCP/UDP, and drop everything that's not IPv6). The lookup
>> helper could also have a flag argument to disable some of the lookup
>> features.
>
> It's unclear to me how to configure that.
> Use options when attaching the program? Something like
> $ xdp_tc attach eth0 --only-with ipv6
> But can users always determine the features they need in advance?

That's what I'm doing with xdp-filter now. But the answer to your second
question is likely to be 'probably not', so it would be good to not have
to do this :)

> Frequent manual reconfiguration whenever the TC rules change does
> not sound nice. Or, add a hook to the kernel so that some daemon can
> listen for TC filter events and automatically reload the attached program?

Doesn't have to be a kernel hook; we could enhance the userspace tooling
to do it. Say we integrate it into 'tc':

- Add a new command 'tc xdp_accel enable <iface> --features [ipv6,etc]'
- When adding new rules, add the following logic:
  - Check if XDP acceleration is enabled
  - If it is, check whether the rule being added fits into the current
    'feature set' loaded on that interface.
    - If the rule needs more features, reload the XDP program to one
      with the needed additional features.
    - Or, alternatively, just warn the user and let them manually
      replace it?

> Another concern is key size. If we use the TC core, TC will use
> its hash table with a fixed key size, so we cannot decrease the
> size of the hash table key this way?

Here I must admit that I'm not too familiar with the tc internals.
Wouldn't it be possible to refactor the code to either dynamically size
the hash tables, or to split them up into parts based on whatever
'feature set' is required? That might even speed up rule evaluation
without XDP acceleration?

-Toke


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP
  2019-11-18  6:41                             ` Toshiaki Makita
  2019-11-18 10:20                               ` Toke Høiland-Jørgensen
@ 2019-11-18 10:28                               ` Toke Høiland-Jørgensen
  1 sibling, 0 replies; 58+ messages in thread
From: Toke Høiland-Jørgensen @ 2019-11-18 10:28 UTC (permalink / raw)
  To: Toshiaki Makita, John Fastabend, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	David S. Miller, Jakub Kicinski, Jesper Dangaard Brouer,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, Pablo Neira Ayuso,
	Jozsef Kadlecsik, Florian Westphal, Pravin B Shelar
  Cc: netdev, bpf, William Tu, Stanislav Fomichev

Forgot to answer this part...

>> It would probably require a bit of refactoring in the kernel data
>> structures so they can be used without being tied to an skb. David Ahern
>> did something similar for the fib. For the routing table case, that
>> resulted in a significant speedup: About 2.5x-3x the performance when
>> using it via XDP (depending on the number of routes in the table).
>
> I'm curious how much the helper function can improve
> performance compared to XDP programs which emulate the kernel feature
> without using such helpers. 2.5x-3x sounds a bit slow for XDP to me,
> but it may be a routing-specific problem.

That's specific to routing; the numbers we got were roughly consistent
with the routing table lookup performance reported here:
https://vincent.bernat.ch/en/blog/2017-ipv4-route-lookup-linux

I.e., a fib lookup takes something on the order of 30-50 ns, which
eats up quite a bit of the time budget for forwarding...

-Toke


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP
  2019-11-18 10:20                               ` Toke Høiland-Jørgensen
@ 2019-11-22  5:42                                 ` Toshiaki Makita
  2019-11-22 11:54                                   ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 58+ messages in thread
From: Toshiaki Makita @ 2019-11-22  5:42 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, John Fastabend,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
	Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal,
	Pravin B Shelar
  Cc: netdev, bpf, William Tu, Stanislav Fomichev

On 2019/11/18 19:20, Toke Høiland-Jørgensen wrote:
> Toshiaki Makita <toshiaki.makita1@gmail.com> writes:
> 
> [... trimming the context a bit ...]
> 
>>>>> Take your example of TC rules: You were proposing a flow like this:
>>>>>
>>>>> Userspace TC rule -> kernel rule table -> eBPF map -> generated XDP
>>>>> program
>>>>>
>>>>> Whereas what I mean is that we could do this instead:
>>>>>
>>>>> Userspace TC rule -> kernel rule table
>>>>>
>>>>> and separately
>>>>>
>>>>> XDP program -> bpf helper -> lookup in kernel rule table
>>>>
>>>> Thanks, now I see what you mean.
>>>> You expect an XDP program like this, right?
>>>>
>>>> int xdp_tc(struct xdp_md *ctx)
>>>> {
>>>> 	int act = bpf_xdp_tc_filter(ctx);
>>>> 	return act;
>>>> }
>>>
>>> Yes, basically, except that the XDP program would need to parse the
>>> packet first, and bpf_xdp_tc_filter() would take a parameter struct with
>>> the parsed values. See the usage of bpf_fib_lookup() in
>>> bpf/samples/xdp_fwd_kern.c
>>>
>>>> But doesn't this way lose a chance to reduce/minimize the program to
>>>> only use necessary features for this device?
>>>
>>> Not necessarily. Since the BPF program does the packet parsing and fills
>>> in the TC filter lookup data structure, it can limit what features are
>>> used that way (e.g., if I only want to do IPv6, I just parse the v6
>>> header, ignore TCP/UDP, and drop everything that's not IPv6). The lookup
>>> helper could also have a flag argument to disable some of the lookup
>>> features.
>>
>> It's unclear to me how to configure that.
>> Use options when attaching the program? Something like
>> $ xdp_tc attach eth0 --only-with ipv6
>> But can users always determine their necessary features in advance?
> 
> That's what I'm doing with xdp-filter now. But the answer to your second
> question is likely to be 'probably not', so it would be good to not have
> to do this :)
> 
>> Frequent manual reconfiguration when TC rules change frequently does
>> not sound nice. Or should we add a hook to the kernel so that some daemon
>> can listen for TC filter events and automatically reload the attached program?
> 
> Doesn't have to be a kernel hook; we could enhance the userspace tooling
> to do it. Say we integrate it into 'tc':
> 
> - Add a new command 'tc xdp_accel enable <iface> --features [ipv6,etc]'
> - When adding new rules, add the following logic:
>    - Check if XDP acceleration is enabled
>    - If it is, check whether the rule being added fits into the current
>      'feature set' loaded on that interface.
>      - If the rule needs more features, reload the XDP program to one
>        with the needed additional features.
>      - Or, alternatively, just warn the user and let them manually
>        replace it?

Ok, but there are other userspace tools in the wild that configure TC.
Python and golang have their own netlink library projects, and OVS embeds
its own TC netlink handling code. There may be more tools like this.
If we want to reload the program from userspace tooling, I think we at
least need an rtnl notification for TC changes and a daemon that monitors it.

Toshiaki Makita


* Re: [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP
  2019-11-22  5:42                                 ` Toshiaki Makita
@ 2019-11-22 11:54                                   ` Toke Høiland-Jørgensen
  2019-11-25 10:18                                     ` Toshiaki Makita
  0 siblings, 1 reply; 58+ messages in thread
From: Toke Høiland-Jørgensen @ 2019-11-22 11:54 UTC (permalink / raw)
  To: Toshiaki Makita, John Fastabend, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	David S. Miller, Jakub Kicinski, Jesper Dangaard Brouer,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, Pablo Neira Ayuso,
	Jozsef Kadlecsik, Florian Westphal, Pravin B Shelar
  Cc: netdev, bpf, William Tu, Stanislav Fomichev

Toshiaki Makita <toshiaki.makita1@gmail.com> writes:

> On 2019/11/18 19:20, Toke Høiland-Jørgensen wrote:
>> Toshiaki Makita <toshiaki.makita1@gmail.com> writes:
>> 
>> [... trimming the context a bit ...]
>> 
>>>>>> Take your example of TC rules: You were proposing a flow like this:
>>>>>>
>>>>>> Userspace TC rule -> kernel rule table -> eBPF map -> generated XDP
>>>>>> program
>>>>>>
>>>>>> Whereas what I mean is that we could do this instead:
>>>>>>
>>>>>> Userspace TC rule -> kernel rule table
>>>>>>
>>>>>> and separately
>>>>>>
>>>>>> XDP program -> bpf helper -> lookup in kernel rule table
>>>>>
>>>>> Thanks, now I see what you mean.
>>>>> You expect an XDP program like this, right?
>>>>>
>>>>> int xdp_tc(struct xdp_md *ctx)
>>>>> {
>>>>> 	int act = bpf_xdp_tc_filter(ctx);
>>>>> 	return act;
>>>>> }
>>>>
>>>> Yes, basically, except that the XDP program would need to parse the
>>>> packet first, and bpf_xdp_tc_filter() would take a parameter struct with
>>>> the parsed values. See the usage of bpf_fib_lookup() in
>>>> bpf/samples/xdp_fwd_kern.c
>>>>
>>>>> But doesn't this way lose a chance to reduce/minimize the program to
>>>>> only use necessary features for this device?
>>>>
>>>> Not necessarily. Since the BPF program does the packet parsing and fills
>>>> in the TC filter lookup data structure, it can limit what features are
>>>> used that way (e.g., if I only want to do IPv6, I just parse the v6
>>>> header, ignore TCP/UDP, and drop everything that's not IPv6). The lookup
>>>> helper could also have a flag argument to disable some of the lookup
>>>> features.
>>>
>>> It's unclear to me how to configure that.
>>> Use options when attaching the program? Something like
>>> $ xdp_tc attach eth0 --only-with ipv6
>>> But can users always determine their necessary features in advance?
>> 
>> That's what I'm doing with xdp-filter now. But the answer to your second
>> question is likely to be 'probably not', so it would be good to not have
>> to do this :)
>> 
>>> Frequent manual reconfiguration when TC rules change frequently does
>>> not sound nice. Or should we add a hook to the kernel so that some daemon
>>> can listen for TC filter events and automatically reload the attached program?
>> 
>> Doesn't have to be a kernel hook; we could enhance the userspace tooling
>> to do it. Say we integrate it into 'tc':
>> 
>> - Add a new command 'tc xdp_accel enable <iface> --features [ipv6,etc]'
>> - When adding new rules, add the following logic:
>>    - Check if XDP acceleration is enabled
>>    - If it is, check whether the rule being added fits into the current
>>      'feature set' loaded on that interface.
>>      - If the rule needs more features, reload the XDP program to one
>>        with the needed additional features.
>>      - Or, alternatively, just warn the user and let them manually
>>        replace it?
>
> Ok, but there are other userspace tools in the wild that configure TC.
> Python and golang have their own netlink library projects, and OVS embeds
> its own TC netlink handling code. There may be more tools like this.
> If we want to reload the program from userspace tooling, I think we at
> least need an rtnl notification for TC changes and a daemon that monitors it.

A daemon would be one way to do this in cases where it needs to be
completely dynamic. My guess is that there are lots of environments
where that is not required, and where a user/administrator could
realistically specify ahead of time which feature set they want to
enable XDP acceleration for. So in my mind the way to go about this is
to implement the latter first, then add dynamic reconfiguration of it on
top when (or if) it turns out to be necessary...

-Toke



* Re: [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP
  2019-11-22 11:54                                   ` Toke Høiland-Jørgensen
@ 2019-11-25 10:18                                     ` Toshiaki Makita
  2019-11-25 13:03                                       ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 58+ messages in thread
From: Toshiaki Makita @ 2019-11-25 10:18 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, John Fastabend,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
	Pablo Neira Ayuso, Jozsef Kadlecsik, Florian Westphal,
	Pravin B Shelar
  Cc: netdev, bpf, William Tu, Stanislav Fomichev

On 2019/11/22 20:54, Toke Høiland-Jørgensen wrote:
> Toshiaki Makita <toshiaki.makita1@gmail.com> writes:
> 
>> On 2019/11/18 19:20, Toke Høiland-Jørgensen wrote:
>>> Toshiaki Makita <toshiaki.makita1@gmail.com> writes:
>>>
>>> [... trimming the context a bit ...]
>>>
>>>>>>> Take your example of TC rules: You were proposing a flow like this:
>>>>>>>
>>>>>>> Userspace TC rule -> kernel rule table -> eBPF map -> generated XDP
>>>>>>> program
>>>>>>>
>>>>>>> Whereas what I mean is that we could do this instead:
>>>>>>>
>>>>>>> Userspace TC rule -> kernel rule table
>>>>>>>
>>>>>>> and separately
>>>>>>>
>>>>>>> XDP program -> bpf helper -> lookup in kernel rule table
>>>>>>
>>>>>> Thanks, now I see what you mean.
>>>>>> You expect an XDP program like this, right?
>>>>>>
>>>>>> int xdp_tc(struct xdp_md *ctx)
>>>>>> {
>>>>>> 	int act = bpf_xdp_tc_filter(ctx);
>>>>>> 	return act;
>>>>>> }
>>>>>
>>>>> Yes, basically, except that the XDP program would need to parse the
>>>>> packet first, and bpf_xdp_tc_filter() would take a parameter struct with
>>>>> the parsed values. See the usage of bpf_fib_lookup() in
>>>>> bpf/samples/xdp_fwd_kern.c
>>>>>
>>>>>> But doesn't this way lose a chance to reduce/minimize the program to
>>>>>> only use necessary features for this device?
>>>>>
>>>>> Not necessarily. Since the BPF program does the packet parsing and fills
>>>>> in the TC filter lookup data structure, it can limit what features are
>>>>> used that way (e.g., if I only want to do IPv6, I just parse the v6
>>>>> header, ignore TCP/UDP, and drop everything that's not IPv6). The lookup
>>>>> helper could also have a flag argument to disable some of the lookup
>>>>> features.
>>>>
>>>> It's unclear to me how to configure that.
>>>> Use options when attaching the program? Something like
>>>> $ xdp_tc attach eth0 --only-with ipv6
>>>> But can users always determine their necessary features in advance?
>>>
>>> That's what I'm doing with xdp-filter now. But the answer to your second
>>> question is likely to be 'probably not', so it would be good to not have
>>> to do this :)
>>>
>>>> Frequent manual reconfiguration when TC rules change frequently does
>>>> not sound nice. Or should we add a hook to the kernel so that some daemon
>>>> can listen for TC filter events and automatically reload the attached program?
>>>
>>> Doesn't have to be a kernel hook; we could enhance the userspace tooling
>>> to do it. Say we integrate it into 'tc':
>>>
>>> - Add a new command 'tc xdp_accel enable <iface> --features [ipv6,etc]'
>>> - When adding new rules, add the following logic:
>>>     - Check if XDP acceleration is enabled
>>>     - If it is, check whether the rule being added fits into the current
>>>       'feature set' loaded on that interface.
>>>       - If the rule needs more features, reload the XDP program to one
>>>         with the needed additional features.
>>>       - Or, alternatively, just warn the user and let them manually
>>>         replace it?
>>
>> Ok, but there are other userspace tools in the wild that configure TC.
>> Python and golang have their own netlink library projects, and OVS embeds
>> its own TC netlink handling code. There may be more tools like this.
>> If we want to reload the program from userspace tooling, I think we at
>> least need an rtnl notification for TC changes and a daemon that monitors it.
> 
> A daemon would be one way to do this in cases where it needs to be
> completely dynamic. My guess is that there are lots of environments
> where that is not required, and where a user/administrator could
> realistically specify ahead of time which feature set they want to
> enable XDP acceleration for. So in my mind the way to go about this is
> to implement the latter first, then add dynamic reconfiguration of it on
> top when (or if) it turns out to be necessary...

Hmm, but I think there is a big difference between a daemon and a CLI tool.
Shouldn't we settle on the design with future usage in mind?

Toshiaki Makita


* Re: [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP
  2019-11-25 10:18                                     ` Toshiaki Makita
@ 2019-11-25 13:03                                       ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 58+ messages in thread
From: Toke Høiland-Jørgensen @ 2019-11-25 13:03 UTC (permalink / raw)
  To: Toshiaki Makita, John Fastabend, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Song Liu, Yonghong Song,
	David S. Miller, Jakub Kicinski, Jesper Dangaard Brouer,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, Pablo Neira Ayuso,
	Jozsef Kadlecsik, Florian Westphal, Pravin B Shelar
  Cc: netdev, bpf, William Tu, Stanislav Fomichev

Toshiaki Makita <toshiaki.makita1@gmail.com> writes:

> On 2019/11/22 20:54, Toke Høiland-Jørgensen wrote:
>> Toshiaki Makita <toshiaki.makita1@gmail.com> writes:
>> 
>>> On 2019/11/18 19:20, Toke Høiland-Jørgensen wrote:
>>>> Toshiaki Makita <toshiaki.makita1@gmail.com> writes:
>>>>
>>>> [... trimming the context a bit ...]
>>>>
>>>>>>>> Take your example of TC rules: You were proposing a flow like this:
>>>>>>>>
>>>>>>>> Userspace TC rule -> kernel rule table -> eBPF map -> generated XDP
>>>>>>>> program
>>>>>>>>
>>>>>>>> Whereas what I mean is that we could do this instead:
>>>>>>>>
>>>>>>>> Userspace TC rule -> kernel rule table
>>>>>>>>
>>>>>>>> and separately
>>>>>>>>
>>>>>>>> XDP program -> bpf helper -> lookup in kernel rule table
>>>>>>>
>>>>>>> Thanks, now I see what you mean.
>>>>>>> You expect an XDP program like this, right?
>>>>>>>
>>>>>>> int xdp_tc(struct xdp_md *ctx)
>>>>>>> {
>>>>>>> 	int act = bpf_xdp_tc_filter(ctx);
>>>>>>> 	return act;
>>>>>>> }
>>>>>>
>>>>>> Yes, basically, except that the XDP program would need to parse the
>>>>>> packet first, and bpf_xdp_tc_filter() would take a parameter struct with
>>>>>> the parsed values. See the usage of bpf_fib_lookup() in
>>>>>> bpf/samples/xdp_fwd_kern.c
>>>>>>
>>>>>>> But doesn't this way lose a chance to reduce/minimize the program to
>>>>>>> only use necessary features for this device?
>>>>>>
>>>>>> Not necessarily. Since the BPF program does the packet parsing and fills
>>>>>> in the TC filter lookup data structure, it can limit what features are
>>>>>> used that way (e.g., if I only want to do IPv6, I just parse the v6
>>>>>> header, ignore TCP/UDP, and drop everything that's not IPv6). The lookup
>>>>>> helper could also have a flag argument to disable some of the lookup
>>>>>> features.
>>>>>
>>>>> It's unclear to me how to configure that.
>>>>> Use options when attaching the program? Something like
>>>>> $ xdp_tc attach eth0 --only-with ipv6
>>>>> But can users always determine their necessary features in advance?
>>>>
>>>> That's what I'm doing with xdp-filter now. But the answer to your second
>>>> question is likely to be 'probably not', so it would be good to not have
>>>> to do this :)
>>>>
>>>>> Frequent manual reconfiguration when TC rules change frequently does
>>>>> not sound nice. Or should we add a hook to the kernel so that some daemon
>>>>> can listen for TC filter events and automatically reload the attached program?
>>>>
>>>> Doesn't have to be a kernel hook; we could enhance the userspace tooling
>>>> to do it. Say we integrate it into 'tc':
>>>>
>>>> - Add a new command 'tc xdp_accel enable <iface> --features [ipv6,etc]'
>>>> - When adding new rules, add the following logic:
>>>>     - Check if XDP acceleration is enabled
>>>>     - If it is, check whether the rule being added fits into the current
>>>>       'feature set' loaded on that interface.
>>>>       - If the rule needs more features, reload the XDP program to one
>>>>         with the needed additional features.
>>>>       - Or, alternatively, just warn the user and let them manually
>>>>         replace it?
>>>
>>> Ok, but there are other userspace tools in the wild that configure TC.
>>> Python and golang have their own netlink library projects, and OVS embeds
>>> its own TC netlink handling code. There may be more tools like this.
>>> If we want to reload the program from userspace tooling, I think we at
>>> least need an rtnl notification for TC changes and a daemon that monitors it.
>> 
>> A daemon would be one way to do this in cases where it needs to be
>> completely dynamic. My guess is that there are lots of environments
>> where that is not required, and where a user/administrator could
>> realistically specify ahead of time which feature set they want to
>> enable XDP acceleration for. So in my mind the way to go about this is
>> to implement the latter first, then add dynamic reconfiguration of it on
>> top when (or if) it turns out to be necessary...
>
> Hmm, but I think there is a big difference between a daemon and a CLI tool.
> Shouldn't we settle on the design with future usage in mind?

Sure, we should make sure the design doesn't exclude either option. But
we also shouldn't end up in a "the perfect is the enemy of the good"
type of situation. And the kernel-side changes are likely to be somewhat
independent of what the userspace management ends up looking like...

-Toke



end of thread, other threads:[~2019-11-25 13:03 UTC | newest]

Thread overview: 58+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-18  4:07 [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP Toshiaki Makita
2019-10-18  4:07 ` [RFC PATCH v2 bpf-next 01/15] xdp_flow: Add skeleton of XDP based flow offload driver Toshiaki Makita
2019-10-18  4:07 ` [RFC PATCH v2 bpf-next 02/15] xdp_flow: Add skeleton bpf program for XDP Toshiaki Makita
2019-10-18  4:07 ` [RFC PATCH v2 bpf-next 03/15] bpf: Add API to get program from id Toshiaki Makita
2019-10-18  4:07 ` [RFC PATCH v2 bpf-next 04/15] xdp: Export dev_check_xdp and dev_change_xdp Toshiaki Makita
2019-10-18  4:07 ` [RFC PATCH v2 bpf-next 05/15] xdp_flow: Attach bpf prog to XDP in kernel after UMH loaded program Toshiaki Makita
2019-10-18  4:07 ` [RFC PATCH v2 bpf-next 06/15] xdp_flow: Prepare flow tables in bpf Toshiaki Makita
2019-10-18  4:07 ` [RFC PATCH v2 bpf-next 07/15] xdp_flow: Add flow entry insertion/deletion logic in UMH Toshiaki Makita
2019-10-18  4:07 ` [RFC PATCH v2 bpf-next 08/15] xdp_flow: Add flow handling and basic actions in bpf prog Toshiaki Makita
2019-10-18  4:07 ` [RFC PATCH v2 bpf-next 09/15] xdp_flow: Implement flow replacement/deletion logic in xdp_flow kmod Toshiaki Makita
2019-10-18  4:07 ` [RFC PATCH v2 bpf-next 10/15] xdp_flow: Add netdev feature for enabling flow offload to XDP Toshiaki Makita
2019-10-18  4:07 ` [RFC PATCH v2 bpf-next 11/15] xdp_flow: Implement redirect action Toshiaki Makita
2019-10-18  4:07 ` [RFC PATCH v2 bpf-next 12/15] xdp_flow: Implement vlan_push action Toshiaki Makita
2019-10-18  4:07 ` [RFC PATCH v2 bpf-next 13/15] bpf, selftest: Add test for xdp_flow Toshiaki Makita
2019-10-18  4:07 ` [RFC PATCH v2 bpf-next 14/15] i40e: prefetch xdp->data before running XDP prog Toshiaki Makita
2019-10-18  4:07 ` [RFC PATCH v2 bpf-next 15/15] bpf, hashtab: Compare keys in long Toshiaki Makita
2019-10-18 15:22 ` [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP John Fastabend
2019-10-21  7:31   ` Toshiaki Makita
2019-10-22 16:54     ` John Fastabend
2019-10-22 17:45       ` Toke Høiland-Jørgensen
2019-10-24  4:27         ` John Fastabend
2019-10-24 10:13           ` Toke Høiland-Jørgensen
2019-10-27 13:19             ` Toshiaki Makita
2019-10-27 15:21               ` Toke Høiland-Jørgensen
2019-10-28  3:16                 ` David Ahern
2019-10-28  8:36                   ` Toke Høiland-Jørgensen
2019-10-28 10:08                     ` Jesper Dangaard Brouer
2019-10-28 19:07                       ` David Ahern
2019-10-28 19:05                     ` David Ahern
2019-10-31  0:18                 ` Toshiaki Makita
2019-10-31 12:12                   ` Toke Høiland-Jørgensen
2019-11-11  7:32                     ` Toshiaki Makita
2019-11-12 16:53                       ` Toke Høiland-Jørgensen
2019-11-14 10:11                         ` Toshiaki Makita
2019-11-14 12:41                           ` Toke Høiland-Jørgensen
2019-11-18  6:41                             ` Toshiaki Makita
2019-11-18 10:20                               ` Toke Høiland-Jørgensen
2019-11-22  5:42                                 ` Toshiaki Makita
2019-11-22 11:54                                   ` Toke Høiland-Jørgensen
2019-11-25 10:18                                     ` Toshiaki Makita
2019-11-25 13:03                                       ` Toke Høiland-Jørgensen
2019-11-18 10:28                               ` Toke Høiland-Jørgensen
2019-10-27 13:13         ` Toshiaki Makita
2019-10-27 15:24           ` Toke Høiland-Jørgensen
2019-10-27 19:17             ` David Miller
2019-10-31  0:32               ` Toshiaki Makita
2019-11-12 17:50                 ` William Tu
2019-11-14 10:06                   ` Toshiaki Makita
2019-11-14 17:09                     ` William Tu
2019-11-15 13:16                       ` Toke Høiland-Jørgensen
2019-11-12 17:38             ` William Tu
2019-10-23 14:11       ` Jamal Hadi Salim
2019-10-24  4:38         ` John Fastabend
2019-10-24 17:05           ` Jamal Hadi Salim
2019-10-27 13:27         ` Toshiaki Makita
2019-10-27 13:06       ` Toshiaki Makita
2019-10-21 11:23 ` Björn Töpel
2019-10-21 11:47   ` Toshiaki Makita

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).