BPF Archive on lore.kernel.org
 help / color / Atom feed
* [RFC PATCH bpf-next 00/14] xdp_flow: Flow offload to XDP
@ 2019-08-13 12:05 Toshiaki Makita
  2019-08-13 12:05 ` [RFC PATCH bpf-next 01/14] xdp_flow: Add skeleton of XDP based TC offload driver Toshiaki Makita
                   ` (16 more replies)
  0 siblings, 17 replies; 35+ messages in thread
From: Toshiaki Makita @ 2019-08-13 12:05 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko
  Cc: Toshiaki Makita, netdev, bpf, William Tu

This is a rough PoC for an idea to offload TC flower to XDP.


* Motivation

The purpose is to speed up software TC flower by using XDP.

I chose TC flower because my current interest is in OVS. OVS uses TC to
offload flow tables to hardware, so if TC can offload flows to XDP, OVS
also can be offloaded to XDP.

When TC flower filter is offloaded to XDP, the received packets are
handled by XDP first, and if their protocol or something is not
supported by the eBPF program, the program returns XDP_PASS and packets
are passed to upper layer TC.

The packet processing flow will be like this when this mechanism,
xdp_flow, is used with OVS.

 +-------------+
 | openvswitch |
 |    kmod     |
 +-------------+
        ^
        | if not match in filters (flow key or action not supported by TC)
 +-------------+
 |  TC flower  |
 +-------------+
        ^
        | if not match in flow tables (flow key or action not supported by XDP)
 +-------------+
 |  XDP prog   |
 +-------------+
        ^
        | incoming packets

Of course we can directly use TC flower without OVS to speed up TC.

This is useful especially when the device does not support HW-offload.
Such interfaces include virtual interfaces like veth.


* How to use

It only supports ingress (clsact) flower filter at this point.
Enable the feature via ethtool before adding ingress/clsact qdisc.

 $ ethtool -K eth0 tc-offload-xdp on

Then add qdisc/filters as normal.

 $ tc qdisc add dev eth0 clsact
 $ tc filter add dev eth0 ingress protocol ip flower skip_sw ...

Alternatively, when using OVS, adding qdisc and filters will be
automatically done by setting hw-offload.

 $ ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
 $ systemctl stop openvswitch
 $ tc qdisc del dev eth0 ingress # or reboot
 $ ethtool -K eth0 tc-offload-xdp on
 $ systemctl start openvswitch


* Performance

I measured drop rate at veth interface with redirect action from physical
interface (i40e 25G NIC, XXV 710) to veth. The CPU is Xeon Silver 4114
(2.20 GHz).
                                                                 XDP_DROP
                    +------+                        +-------+    +-------+
 pktgen -- wire --> | eth0 | -- TC/OVS redirect --> | veth0 |----| veth1 |
                    +------+   (offloaded to XDP)   +-------+    +-------+

The setup for redirect is done by OVS like this.

 $ ovs-vsctl add-br ovsbr0
 $ ovs-vsctl add-port ovsbr0 eth0
 $ ovs-vsctl add-port ovsbr0 veth0
 $ ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
 $ systemctl stop openvswitch
 $ tc qdisc del dev eth0 ingress
 $ tc qdisc del dev veth0 ingress
 $ ethtool -K eth0 tc-offload-xdp on
 $ ethtool -K veth0 tc-offload-xdp on
 $ systemctl start openvswitch

Tested single core/single flow with 3 configurations.
- xdp_flow: hw-offload=true, tc-offload-xdp on
- TC:       hw-offload=true, tc-offload-xdp off (software TC)
- ovs kmod: hw-offload=false

 xdp_flow  TC        ovs kmod
 --------  --------  --------
 4.0 Mpps  1.1 Mpps  1.1 Mpps

So xdp_flow drop rate is roughly 4x faster than software TC or ovs kmod.

OTOH the time to add a flow increases with xdp_flow.

ping latency of first packet when veth1 does XDP_PASS instead of DROP:

 xdp_flow  TC        ovs kmod
 --------  --------  --------
 25ms      12ms      0.6ms

xdp_flow does a lot of work to emulate TC behavior including UMH
transaction and multiple bpf map update from UMH which I think increases
the latency.


* Implementation

xdp_flow makes use of UMH to load an eBPF program for XDP, similar to
bpfilter. The difference is that xdp_flow does not generate the eBPF
program dynamically but a prebuilt program is embedded in UMH. This is
mainly because flow insertion is considerably frequent. If we generate
and load an eBPF program on each insertion of a flow, the latency of the
first packet of ping in above test will incease, which I want to avoid.

                         +----------------------+
                         |    xdp_flow_umh      | load eBPF prog for XDP
                         | (eBPF prog embedded) | update maps for flow tables
                         +----------------------+
                                   ^ |
                           request | v eBPF prog id
 +-----------+  offload  +-----------------------+
 | TC flower | --------> |    xdp_flow kmod      | attach the prog to XDP
 +-----------+           | (flow offload driver) |
                         +-----------------------+

- When ingress/clsact qdisc is created, i.e. a device is bound to a flow
  block, xdp_flow kmod requests xdp_flow_umh to load eBPF prog.
  xdp_flow_umh returns prog id and xdp_flow kmod attach the prog to XDP
  (the reason of attaching XDP from kmod is that rtnl_lock is held here).

- When flower filter is added, xdp_flow kmod requests xdp_flow_umh to
  update maps for flow tables.


* Patches

- patch 1
 Basic framework for xdp_flow kmod and UMH.

- patch 2
 Add prebuilt eBPF program embedded in UMH.

- patch 3, 4
 Attach the prog to XDP in kmod after using the prog id returned from
 UMH.

- patch 5, 6
 Add maps for flow tables and flow table manipulation logic in UMH.

- patch 7
 Implement flow lookup and basic actions in eBPF prog.

- patch 8
 Implement flow manipulation logic, serialize flow key and actions from
 TC flower and make requests to UMH in kmod.

- patch 9
 Add tc-offload-xdp netdev feature and hooks to call xdp_flow kmod in
 TC flower offload code.

- patch 10, 11
 Add example actions, redirect and vlan_push.

- patch 12
 Add testcase for xdp_flow.

- patch 13, 14
 These are unrelated patches. They just improves XDP program's
 performance. They are included to demonstrate to what extent xdp_flow
 performance can increase. Without them, drop rate goes down from 4Mpps
 to 3Mpps.


* About OVS AF_XDP netdev

Recently OVS has added AF_XDP netdev type support. This also makes use
of XDP, but in some ways different from this patch set.

- AF_XDP work originally started in order to bring BPF's flexibility to
  OVS, which enables us to upgrade datapath without updating kernel.
  AF_XDP solution uses userland datapath so it achieved its goal.
  xdp_flow will not replace OVS datapath completely, but offload it
  partially just for speed up.

- OVS AF_XDP requires PMD for the best performance so consumes 100% CPU.

- OVS AF_XDP needs packet copy when forwarding packets.

- xdp_flow can be used not only for OVS. It works for direct use of TC
  flower. nftables also can be offloaded by the same mechanism in the
  future.


* About alternative userland (ovs-vswitchd etc.) implementation

Maybe a similar logic can be implemented in ovs-vswitchd offload
mechanism, instead of adding code to kernel. I just thought offloading
TC is more generic and allows wider usage with direct TC command.

For example, considering that OVS inserts a flow to kernel only when
flow miss happens in kernel, we can in advance add offloaded flows via
tc filter to avoid flow insertion latency for certain sensitive flows.
TC flower usage without using OVS is also possible.

Also as written above nftables can be offloaded to XDP with this
mechanism as well.


* Note

This patch set is based on top of commit a664a834579a ("tools: bpftool:
fix reading from /proc/config.gz").

Any feedback is welcome.
Thanks!

Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>

Toshiaki Makita (14):
  xdp_flow: Add skeleton of XDP based TC offload driver
  xdp_flow: Add skeleton bpf program for XDP
  bpf: Add API to get program from id
  xdp_flow: Attach bpf prog to XDP in kernel after UMH loaded program
  xdp_flow: Prepare flow tables in bpf
  xdp_flow: Add flow entry insertion/deletion logic in UMH
  xdp_flow: Add flow handling and basic actions in bpf prog
  xdp_flow: Implement flow replacement/deletion logic in xdp_flow kmod
  xdp_flow: Add netdev feature for enabling TC flower offload to XDP
  xdp_flow: Implement redirect action
  xdp_flow: Implement vlan_push action
  bpf, selftest: Add test for xdp_flow
  i40e: prefetch xdp->data before running XDP prog
  bpf, hashtab: Compare keys in long

 drivers/net/ethernet/intel/i40e/i40e_txrx.c  |    1 +
 include/linux/bpf.h                          |    6 +
 include/linux/netdev_features.h              |    2 +
 include/linux/netdevice.h                    |    4 +
 include/net/flow_offload_xdp.h               |   33 +
 include/net/pkt_cls.h                        |    5 +
 include/net/sch_generic.h                    |    1 +
 kernel/bpf/hashtab.c                         |   27 +-
 kernel/bpf/syscall.c                         |   26 +-
 net/Kconfig                                  |    1 +
 net/Makefile                                 |    1 +
 net/core/dev.c                               |   13 +-
 net/core/ethtool.c                           |    1 +
 net/sched/cls_api.c                          |   67 +-
 net/xdp_flow/.gitignore                      |    1 +
 net/xdp_flow/Kconfig                         |   16 +
 net/xdp_flow/Makefile                        |  112 +++
 net/xdp_flow/msgfmt.h                        |  102 +++
 net/xdp_flow/umh_bpf.h                       |   34 +
 net/xdp_flow/xdp_flow_core.c                 |  126 ++++
 net/xdp_flow/xdp_flow_kern_bpf.c             |  358 +++++++++
 net/xdp_flow/xdp_flow_kern_bpf_blob.S        |    7 +
 net/xdp_flow/xdp_flow_kern_mod.c             |  645 ++++++++++++++++
 net/xdp_flow/xdp_flow_umh.c                  | 1034 ++++++++++++++++++++++++++
 net/xdp_flow/xdp_flow_umh_blob.S             |    7 +
 tools/testing/selftests/bpf/Makefile         |    1 +
 tools/testing/selftests/bpf/test_xdp_flow.sh |  103 +++
 27 files changed, 2716 insertions(+), 18 deletions(-)
 create mode 100644 include/net/flow_offload_xdp.h
 create mode 100644 net/xdp_flow/.gitignore
 create mode 100644 net/xdp_flow/Kconfig
 create mode 100644 net/xdp_flow/Makefile
 create mode 100644 net/xdp_flow/msgfmt.h
 create mode 100644 net/xdp_flow/umh_bpf.h
 create mode 100644 net/xdp_flow/xdp_flow_core.c
 create mode 100644 net/xdp_flow/xdp_flow_kern_bpf.c
 create mode 100644 net/xdp_flow/xdp_flow_kern_bpf_blob.S
 create mode 100644 net/xdp_flow/xdp_flow_kern_mod.c
 create mode 100644 net/xdp_flow/xdp_flow_umh.c
 create mode 100644 net/xdp_flow/xdp_flow_umh_blob.S
 create mode 100755 tools/testing/selftests/bpf/test_xdp_flow.sh

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC PATCH bpf-next 01/14] xdp_flow: Add skeleton of XDP based TC offload driver
  2019-08-13 12:05 [RFC PATCH bpf-next 00/14] xdp_flow: Flow offload to XDP Toshiaki Makita
@ 2019-08-13 12:05 ` Toshiaki Makita
  2019-08-13 12:05 ` [RFC PATCH bpf-next 02/14] xdp_flow: Add skeleton bpf program for XDP Toshiaki Makita
                   ` (15 subsequent siblings)
  16 siblings, 0 replies; 35+ messages in thread
From: Toshiaki Makita @ 2019-08-13 12:05 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko
  Cc: Toshiaki Makita, netdev, bpf, William Tu

Add TC offload driver, xdp_flow_core.c, and skeleton of UMH handling
mechanism. The driver is not called from anywhere yet.

xdp_flow_setup_block() in xdp_flow_core.c is meant to be called when
ingress qdisc is added. It loads xdp_flow kernel module and the kmod
provides some callbacks for setup phase and flow insertion phase.

xdp_flow_setup() in the kmod will be called from xdp_flow_setup_block()
when ingress qdisc is added, and xdp_flow_setup_block_cb() will be
called when a tc flower filter is added.

The former will request the UMH to load the eBPF program and the latter
will request the UMH to populate maps for flow tables. In this patch
no actual processing is implemented and the following commits implement
them.

The overall mechanism of UMH handling is written referring to bpfilter.

Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
---
 include/net/flow_offload_xdp.h   |  33 ++++++
 net/Kconfig                      |   1 +
 net/Makefile                     |   1 +
 net/xdp_flow/.gitignore          |   1 +
 net/xdp_flow/Kconfig             |  16 +++
 net/xdp_flow/Makefile            |  31 +++++
 net/xdp_flow/msgfmt.h            | 102 ++++++++++++++++
 net/xdp_flow/xdp_flow_core.c     | 126 ++++++++++++++++++++
 net/xdp_flow/xdp_flow_kern_mod.c | 250 +++++++++++++++++++++++++++++++++++++++
 net/xdp_flow/xdp_flow_umh.c      | 109 +++++++++++++++++
 net/xdp_flow/xdp_flow_umh_blob.S |   7 ++
 11 files changed, 677 insertions(+)
 create mode 100644 include/net/flow_offload_xdp.h
 create mode 100644 net/xdp_flow/.gitignore
 create mode 100644 net/xdp_flow/Kconfig
 create mode 100644 net/xdp_flow/Makefile
 create mode 100644 net/xdp_flow/msgfmt.h
 create mode 100644 net/xdp_flow/xdp_flow_core.c
 create mode 100644 net/xdp_flow/xdp_flow_kern_mod.c
 create mode 100644 net/xdp_flow/xdp_flow_umh.c
 create mode 100644 net/xdp_flow/xdp_flow_umh_blob.S

diff --git a/include/net/flow_offload_xdp.h b/include/net/flow_offload_xdp.h
new file mode 100644
index 0000000..d04a73d
--- /dev/null
+++ b/include/net/flow_offload_xdp.h
@@ -0,0 +1,33 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_FLOW_OFFLOAD_XDP_H
+#define _LINUX_FLOW_OFFLOAD_XDP_H
+
+#include <linux/netdevice.h>
+#include <linux/umh.h>
+#include <net/flow_offload.h>
+
+struct xdp_flow_umh_ops {
+	struct umh_info info;
+	/* serialize access to this object and UMH */
+	struct mutex lock;
+	flow_setup_cb_t *setup_cb;
+	int (*setup)(struct net_device *dev, bool do_bind,
+		     struct netlink_ext_ack *extack);
+	int (*start)(void);
+	bool stop;
+	struct module *module;
+};
+
+extern struct xdp_flow_umh_ops xdp_flow_ops;
+
+#ifdef CONFIG_XDP_FLOW
+int xdp_flow_setup_block(struct net_device *dev, struct flow_block_offload *f);
+#else
+static inline int xdp_flow_setup_block(struct net_device *dev,
+				       struct flow_block_offload *f)
+{
+	return -EOPNOTSUPP;
+}
+#endif
+
+#endif
diff --git a/net/Kconfig b/net/Kconfig
index 57f51a2..08d36444 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -206,6 +206,7 @@ source "net/bridge/netfilter/Kconfig"
 endif
 
 source "net/bpfilter/Kconfig"
+source "net/xdp_flow/Kconfig"
 
 source "net/dccp/Kconfig"
 source "net/sctp/Kconfig"
diff --git a/net/Makefile b/net/Makefile
index 449fc0b..b78d1ef 100644
--- a/net/Makefile
+++ b/net/Makefile
@@ -87,3 +87,4 @@ endif
 obj-$(CONFIG_QRTR)		+= qrtr/
 obj-$(CONFIG_NET_NCSI)		+= ncsi/
 obj-$(CONFIG_XDP_SOCKETS)	+= xdp/
+obj-$(CONFIG_XDP_FLOW)		+= xdp_flow/
diff --git a/net/xdp_flow/.gitignore b/net/xdp_flow/.gitignore
new file mode 100644
index 0000000..8cad817
--- /dev/null
+++ b/net/xdp_flow/.gitignore
@@ -0,0 +1 @@
+xdp_flow_umh
diff --git a/net/xdp_flow/Kconfig b/net/xdp_flow/Kconfig
new file mode 100644
index 0000000..82e7bf3
--- /dev/null
+++ b/net/xdp_flow/Kconfig
@@ -0,0 +1,16 @@
+# SPDX-License-Identifier: GPL-2.0-only
+menuconfig XDP_FLOW
+	bool "XDP based flow offload engine (XDP_FLOW)"
+	depends on NET && BPF_SYSCALL && NET_CLS_FLOWER && MEMFD_CREATE
+	help
+	  This builds experimental xdp_flow framework that is aiming to
+	  provide flow software offload functionality via XDP
+
+if XDP_FLOW
+config XDP_FLOW_UMH
+	tristate "xdp_flow kernel module with user mode helper"
+	depends on $(success,$(srctree)/scripts/cc-can-link.sh $(CC))
+	default m
+	help
+	  This builds xdp_flow kernel module with embedded user mode helper
+endif
diff --git a/net/xdp_flow/Makefile b/net/xdp_flow/Makefile
new file mode 100644
index 0000000..f6138c2
--- /dev/null
+++ b/net/xdp_flow/Makefile
@@ -0,0 +1,31 @@
+# SPDX-License-Identifier: GPL-2.0
+
+obj-$(CONFIG_XDP_FLOW) += xdp_flow_core.o
+
+ifeq ($(CONFIG_XDP_FLOW_UMH), y)
+# builtin xdp_flow_umh should be compiled with -static
+# since rootfs isn't mounted at the time of __init
+# function is called and do_execv won't find elf interpreter
+STATIC := -static
+endif
+
+quiet_cmd_cc_user = CC      $@
+      cmd_cc_user = $(CC) -Wall -Wmissing-prototypes -O2 -std=gnu89 \
+		    -I$(srctree)/tools/include/ \
+		    -c -o $@ $<
+
+quiet_cmd_ld_user = LD      $@
+      cmd_ld_user = $(CC) $(STATIC) -o $@ $^
+
+$(obj)/xdp_flow_umh.o: $(src)/xdp_flow_umh.c FORCE
+	$(call if_changed,cc_user)
+
+$(obj)/xdp_flow_umh: $(obj)/xdp_flow_umh.o
+	$(call if_changed,ld_user)
+
+clean-files := xdp_flow_umh
+
+$(obj)/xdp_flow_umh_blob.o: $(obj)/xdp_flow_umh
+
+obj-$(CONFIG_XDP_FLOW_UMH) += xdp_flow.o
+xdp_flow-objs += xdp_flow_kern_mod.o xdp_flow_umh_blob.o
diff --git a/net/xdp_flow/msgfmt.h b/net/xdp_flow/msgfmt.h
new file mode 100644
index 0000000..97d8490
--- /dev/null
+++ b/net/xdp_flow/msgfmt.h
@@ -0,0 +1,102 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _NET_XDP_FLOW_MSGFMT_H
+#define _NET_XDP_FLOW_MSGFMT_H
+
+#include <linux/types.h>
+#include <linux/compiler.h>
+#include <linux/if_ether.h>
+#include <linux/in6.h>
+
+#define MAX_XDP_FLOW_ACTIONS 32
+
+enum xdp_flow_action_id {
+	/* ABORT if 0, i.e. uninitialized */
+	XDP_FLOW_ACTION_ACCEPT	= 1,
+	XDP_FLOW_ACTION_DROP,
+	XDP_FLOW_ACTION_REDIRECT,
+	XDP_FLOW_ACTION_VLAN_PUSH,
+	XDP_FLOW_ACTION_VLAN_POP,
+	XDP_FLOW_ACTION_VLAN_MANGLE,
+	XDP_FLOW_ACTION_MANGLE,
+	XDP_FLOW_ACTION_CSUM,
+	NR_XDP_FLOW_ACTION,
+};
+
+struct xdp_flow_action {
+	enum xdp_flow_action_id	id;
+	union {
+		int	ifindex;	/* REDIRECT */
+		struct {		/* VLAN */
+			__be16	proto;
+			__be16	tci;
+		} vlan;
+	};
+};
+
+struct xdp_flow_actions {
+	unsigned int num_actions;
+	struct xdp_flow_action actions[MAX_XDP_FLOW_ACTIONS];
+};
+
+struct xdp_flow_key {
+	struct {
+		__u8	dst[ETH_ALEN] __aligned(2);
+		__u8	src[ETH_ALEN] __aligned(2);
+		__be16	type;
+	} eth;
+	struct {
+		__be16	tpid;
+		__be16	tci;
+	} vlan;
+	struct {
+		__u8	proto;
+		__u8	ttl;
+		__u8	tos;
+		__u8	frag;
+	} ip;
+	union {
+		struct {
+			__be32	src;
+			__be32	dst;
+		} ipv4;
+		struct {
+			struct in6_addr	src;
+			struct in6_addr	dst;
+		} ipv6;
+	};
+	struct {
+		__be16	src;
+		__be16	dst;
+	} l4port;
+	struct {
+		__be16	flags;
+	} tcp;
+} __aligned(BITS_PER_LONG / 8);
+
+struct xdp_flow {
+	struct xdp_flow_key key;
+	struct xdp_flow_key mask;
+	struct xdp_flow_actions actions;
+	__u16 priority;
+};
+
+enum xdp_flow_cmd {
+	XDP_FLOW_CMD_NOOP		= 0,
+	XDP_FLOW_CMD_LOAD,
+	XDP_FLOW_CMD_UNLOAD,
+	XDP_FLOW_CMD_REPLACE,
+	XDP_FLOW_CMD_DELETE,
+};
+
+struct mbox_request {
+	int ifindex;
+	__u8 cmd;
+	struct xdp_flow flow;
+};
+
+struct mbox_reply {
+	int status;
+	__u32 id;
+};
+
+#endif
diff --git a/net/xdp_flow/xdp_flow_core.c b/net/xdp_flow/xdp_flow_core.c
new file mode 100644
index 0000000..ab84863
--- /dev/null
+++ b/net/xdp_flow/xdp_flow_core.c
@@ -0,0 +1,126 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <net/flow_offload_xdp.h>
+#include <linux/kmod.h>
+#include <linux/module.h>
+#include <net/pkt_cls.h>
+#include <linux/netdevice.h>
+
+struct xdp_flow_umh_ops xdp_flow_ops;
+EXPORT_SYMBOL_GPL(xdp_flow_ops);
+
+static LIST_HEAD(xdp_block_cb_list);
+
+static void xdp_flow_block_release(void *cb_priv)
+{
+	struct net_device *dev = cb_priv;
+	struct netlink_ext_ack extack;
+
+	mutex_lock(&xdp_flow_ops.lock);
+	xdp_flow_ops.setup(dev, false, &extack);
+	module_put(xdp_flow_ops.module);
+	mutex_unlock(&xdp_flow_ops.lock);
+}
+
+int xdp_flow_setup_block(struct net_device *dev, struct flow_block_offload *f)
+{
+	struct flow_block_cb *block_cb;
+	int err = 0;
+
+	/* TODO: Remove this limitation */
+	if (!net_eq(current->nsproxy->net_ns, &init_net))
+		return -EOPNOTSUPP;
+
+	if (f->binder_type != FLOW_BLOCK_BINDER_TYPE_CLSACT_INGRESS)
+		return -EOPNOTSUPP;
+
+	mutex_lock(&xdp_flow_ops.lock);
+	if (!xdp_flow_ops.module) {
+		mutex_unlock(&xdp_flow_ops.lock);
+		err = request_module("xdp_flow");
+		if (err)
+			return err;
+		mutex_lock(&xdp_flow_ops.lock);
+		if (!xdp_flow_ops.module) {
+			err = -ECHILD;
+			goto out;
+		}
+	}
+	if (xdp_flow_ops.stop) {
+		err = xdp_flow_ops.start();
+		if (err)
+			goto out;
+	}
+
+	f->driver_block_list = &xdp_block_cb_list;
+
+	switch (f->command) {
+	case FLOW_BLOCK_BIND:
+		if (flow_block_cb_is_busy(xdp_flow_ops.setup_cb, dev,
+					  &xdp_block_cb_list)) {
+			err = -EBUSY;
+			goto out;
+		}
+
+		if (!try_module_get(xdp_flow_ops.module)) {
+			err = -ECHILD;
+			goto out;
+		}
+
+		err = xdp_flow_ops.setup(dev, true, f->extack);
+		if (err) {
+			module_put(xdp_flow_ops.module);
+			goto out;
+		}
+
+		block_cb = flow_block_cb_alloc(xdp_flow_ops.setup_cb, dev, dev,
+					       xdp_flow_block_release);
+		if (IS_ERR(block_cb)) {
+			xdp_flow_ops.setup(dev, false, f->extack);
+			module_put(xdp_flow_ops.module);
+			err = PTR_ERR(block_cb);
+			goto out;
+		}
+
+		flow_block_cb_add(block_cb, f);
+		list_add_tail(&block_cb->driver_list, &xdp_block_cb_list);
+		break;
+	case FLOW_BLOCK_UNBIND:
+		block_cb = flow_block_cb_lookup(f->block, xdp_flow_ops.setup_cb,
+						dev);
+		if (!block_cb) {
+			err = -ENOENT;
+			goto out;
+		}
+
+		flow_block_cb_remove(block_cb, f);
+		list_del(&block_cb->driver_list);
+		break;
+	default:
+		err = -EOPNOTSUPP;
+	}
+out:
+	mutex_unlock(&xdp_flow_ops.lock);
+
+	return err;
+}
+
+static void xdp_flow_umh_cleanup(struct umh_info *info)
+{
+	mutex_lock(&xdp_flow_ops.lock);
+	xdp_flow_ops.stop = true;
+	fput(info->pipe_to_umh);
+	fput(info->pipe_from_umh);
+	info->pid = 0;
+	mutex_unlock(&xdp_flow_ops.lock);
+}
+
+static int __init xdp_flow_init(void)
+{
+	mutex_init(&xdp_flow_ops.lock);
+	xdp_flow_ops.stop = true;
+	xdp_flow_ops.info.cmdline = "xdp_flow_umh";
+	xdp_flow_ops.info.cleanup = &xdp_flow_umh_cleanup;
+
+	return 0;
+}
+device_initcall(xdp_flow_init);
diff --git a/net/xdp_flow/xdp_flow_kern_mod.c b/net/xdp_flow/xdp_flow_kern_mod.c
new file mode 100644
index 0000000..823ab65
--- /dev/null
+++ b/net/xdp_flow/xdp_flow_kern_mod.c
@@ -0,0 +1,250 @@
+// SPDX-License-Identifier: GPL-2.0
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include <linux/module.h>
+#include <linux/umh.h>
+#include <linux/sched/signal.h>
+#include <net/pkt_cls.h>
+#include <net/flow_offload_xdp.h>
+#include "msgfmt.h"
+
+extern char xdp_flow_umh_start;
+extern char xdp_flow_umh_end;
+
+static void shutdown_umh(void)
+{
+	struct task_struct *tsk;
+
+	if (xdp_flow_ops.stop)
+		return;
+
+	tsk = get_pid_task(find_vpid(xdp_flow_ops.info.pid), PIDTYPE_PID);
+	if (tsk) {
+		send_sig(SIGKILL, tsk, 1);
+		put_task_struct(tsk);
+	}
+}
+
+static int transact_umh(struct mbox_request *req, u32 *id)
+{
+	struct mbox_reply reply;
+	int ret = -EFAULT;
+	loff_t pos;
+	ssize_t n;
+
+	if (!xdp_flow_ops.info.pid)
+		goto out;
+
+	n = __kernel_write(xdp_flow_ops.info.pipe_to_umh, req, sizeof(*req),
+			   &pos);
+	if (n != sizeof(*req)) {
+		pr_err("write fail %zd\n", n);
+		shutdown_umh();
+		goto out;
+	}
+
+	pos = 0;
+	n = kernel_read(xdp_flow_ops.info.pipe_from_umh, &reply,
+			sizeof(reply), &pos);
+	if (n != sizeof(reply)) {
+		pr_err("read fail %zd\n", n);
+		shutdown_umh();
+		goto out;
+	}
+
+	ret = reply.status;
+	if (id)
+		*id = reply.id;
+out:
+	return ret;
+}
+
+static int xdp_flow_replace(struct net_device *dev, struct flow_cls_offload *f)
+{
+	return -EOPNOTSUPP;
+}
+
+int xdp_flow_destroy(struct net_device *dev, struct flow_cls_offload *f)
+{
+	return -EOPNOTSUPP;
+}
+
+static int xdp_flow_setup_flower(struct net_device *dev,
+				 struct flow_cls_offload *f)
+{
+	switch (f->command) {
+	case FLOW_CLS_REPLACE:
+		return xdp_flow_replace(dev, f);
+	case FLOW_CLS_DESTROY:
+		return xdp_flow_destroy(dev, f);
+	case FLOW_CLS_STATS:
+	case FLOW_CLS_TMPLT_CREATE:
+	case FLOW_CLS_TMPLT_DESTROY:
+	default:
+		return -EOPNOTSUPP;
+	}
+}
+
+static int xdp_flow_setup_block_cb(enum tc_setup_type type, void *type_data,
+				   void *cb_priv)
+{
+	struct flow_cls_common_offload *common = type_data;
+	struct net_device *dev = cb_priv;
+	int err = 0;
+
+	if (common->chain_index) {
+		NL_SET_ERR_MSG(common->extack,
+			       "xdp_flow supports only offload of chain 0");
+		return -EOPNOTSUPP;
+	}
+
+	if (type != TC_SETUP_CLSFLOWER)
+		return -EOPNOTSUPP;
+
+	mutex_lock(&xdp_flow_ops.lock);
+	if (xdp_flow_ops.stop) {
+		err = xdp_flow_ops.start();
+		if (err)
+			goto out;
+	}
+
+	err = xdp_flow_setup_flower(dev, type_data);
+out:
+	mutex_unlock(&xdp_flow_ops.lock);
+	return err;
+}
+
+static int xdp_flow_setup_bind(struct net_device *dev,
+			       struct netlink_ext_ack *extack)
+{
+	struct mbox_request *req;
+	u32 id = 0;
+	int err;
+
+	req = kzalloc(sizeof(*req), GFP_KERNEL);
+	if (!req)
+		return -ENOMEM;
+
+	req->cmd = XDP_FLOW_CMD_LOAD;
+	req->ifindex = dev->ifindex;
+
+	/* Load bpf in UMH and get prog id */
+	err = transact_umh(req, &id);
+
+	/* TODO: id will be used to attach bpf prog to XDP
+	 * As we have rtnl_lock, UMH cannot attach prog to XDP
+	 */
+
+	kfree(req);
+
+	return err;
+}
+
+static int xdp_flow_setup_unbind(struct net_device *dev,
+				 struct netlink_ext_ack *extack)
+{
+	struct mbox_request *req;
+	int err;
+
+	req = kzalloc(sizeof(*req), GFP_KERNEL);
+	if (!req)
+		return -ENOMEM;
+
+	req->cmd = XDP_FLOW_CMD_UNLOAD;
+	req->ifindex = dev->ifindex;
+
+	err = transact_umh(req, NULL);
+
+	kfree(req);
+
+	return err;
+}
+
+static int xdp_flow_setup(struct net_device *dev, bool do_bind,
+			  struct netlink_ext_ack *extack)
+{
+	ASSERT_RTNL();
+
+	if (!net_eq(dev_net(dev), &init_net))
+		return -EINVAL;
+
+	return do_bind ?
+		xdp_flow_setup_bind(dev, extack) :
+		xdp_flow_setup_unbind(dev, extack);
+}
+
+static int xdp_flow_test(void)
+{
+	struct mbox_request *req;
+	int err;
+
+	req = kzalloc(sizeof(*req), GFP_KERNEL);
+	if (!req)
+		return -ENOMEM;
+
+	req->cmd = XDP_FLOW_CMD_NOOP;
+	err = transact_umh(req, NULL);
+
+	kfree(req);
+
+	return err;
+}
+
+static int start_umh(void)
+{
+	int err;
+
+	/* fork usermode process */
+	err = fork_usermode_blob(&xdp_flow_umh_start,
+				 &xdp_flow_umh_end - &xdp_flow_umh_start,
+				 &xdp_flow_ops.info);
+	if (err)
+		return err;
+
+	xdp_flow_ops.stop = false;
+	pr_info("Loaded xdp_flow_umh pid %d\n", xdp_flow_ops.info.pid);
+
+	/* health check that usermode process started correctly */
+	if (xdp_flow_test()) {
+		shutdown_umh();
+		return -EFAULT;
+	}
+
+	return 0;
+}
+
+static int __init load_umh(void)
+{
+	int err = 0;
+
+	mutex_lock(&xdp_flow_ops.lock);
+	if (!xdp_flow_ops.stop) {
+		err = -EFAULT;
+		goto err;
+	}
+
+	err = start_umh();
+	if (err)
+		goto err;
+
+	xdp_flow_ops.setup_cb = &xdp_flow_setup_block_cb;
+	xdp_flow_ops.setup = &xdp_flow_setup;
+	xdp_flow_ops.start = &start_umh;
+	xdp_flow_ops.module = THIS_MODULE;
+err:
+	mutex_unlock(&xdp_flow_ops.lock);
+	return err;
+}
+
+static void __exit fini_umh(void)
+{
+	mutex_lock(&xdp_flow_ops.lock);
+	shutdown_umh();
+	xdp_flow_ops.module = NULL;
+	xdp_flow_ops.start = NULL;
+	xdp_flow_ops.setup = NULL;
+	xdp_flow_ops.setup_cb = NULL;
+	mutex_unlock(&xdp_flow_ops.lock);
+}
+module_init(load_umh);
+module_exit(fini_umh);
+MODULE_LICENSE("GPL");
diff --git a/net/xdp_flow/xdp_flow_umh.c b/net/xdp_flow/xdp_flow_umh.c
new file mode 100644
index 0000000..6729bdf
--- /dev/null
+++ b/net/xdp_flow/xdp_flow_umh.c
@@ -0,0 +1,109 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include <errno.h>
+#include <string.h>
+#include <stdlib.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <syslog.h>
+#include <sys/types.h>
+#include "msgfmt.h"
+
+/* FIXME: syslog is used for easy debugging. As writing /dev/log can be stuck
+ * due to reader side, should use another log mechanism like kmsg.
+ */
+#define pr_debug(fmt, ...) syslog(LOG_DAEMON | LOG_DEBUG, fmt, ##__VA_ARGS__)
+#define pr_info(fmt, ...) syslog(LOG_DAEMON | LOG_INFO, fmt, ##__VA_ARGS__)
+#define pr_warn(fmt, ...) syslog(LOG_DAEMON | LOG_WARNING, fmt, ##__VA_ARGS__)
+#define pr_err(fmt, ...) syslog(LOG_DAEMON | LOG_ERR, fmt, ##__VA_ARGS__)
+
+static int handle_load(const struct mbox_request *req, __u32 *prog_id)
+{
+	*prog_id = 0;
+
+	return 0;
+}
+
+static int handle_unload(const struct mbox_request *req)
+{
+	return 0;
+}
+
+static int handle_replace(struct mbox_request *req)
+{
+	return -EOPNOTSUPP;
+}
+
+static int handle_delete(const struct mbox_request *req)
+{
+	return -EOPNOTSUPP;
+}
+
+static void loop(void)
+{
+	struct mbox_request *req;
+
+	req = malloc(sizeof(struct mbox_request));
+	if (!req) {
+		pr_err("Memory allocation for mbox_request failed\n");
+		return;
+	}
+
+	while (1) {
+		struct mbox_reply reply;
+		int n;
+
+		n = read(0, req, sizeof(*req));
+		if (n < 0) {
+			pr_err("read for mbox_request failed: %s\n",
+			       strerror(errno));
+			break;
+		}
+		if (n != sizeof(*req)) {
+			pr_err("Invalid request size %d\n", n);
+			break;
+		}
+
+		switch (req->cmd) {
+		case XDP_FLOW_CMD_NOOP:
+			reply.status = 0;
+			break;
+		case XDP_FLOW_CMD_LOAD:
+			reply.status = handle_load(req, &reply.id);
+			break;
+		case XDP_FLOW_CMD_UNLOAD:
+			reply.status = handle_unload(req);
+			break;
+		case XDP_FLOW_CMD_REPLACE:
+			reply.status = handle_replace(req);
+			break;
+		case XDP_FLOW_CMD_DELETE:
+			reply.status = handle_delete(req);
+			break;
+		default:
+			pr_err("Invalid command %d\n", req->cmd);
+			reply.status = -EOPNOTSUPP;
+		}
+
+		n = write(1, &reply, sizeof(reply));
+		if (n < 0) {
+			pr_err("write for mbox_reply failed: %s\n",
+			       strerror(errno));
+			break;
+		}
+		if (n != sizeof(reply)) {
+			pr_err("reply written too short: %d\n", n);
+			break;
+		}
+	}
+
+	free(req);
+}
+
+int main(void)
+{
+	pr_info("Started xdp_flow\n");
+	loop();
+
+	return 0;
+}
diff --git a/net/xdp_flow/xdp_flow_umh_blob.S b/net/xdp_flow/xdp_flow_umh_blob.S
new file mode 100644
index 0000000..6edcb0e
--- /dev/null
+++ b/net/xdp_flow/xdp_flow_umh_blob.S
@@ -0,0 +1,7 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+	.section .rodata, "a"
+	.global xdp_flow_umh_start
+xdp_flow_umh_start:
+	.incbin "net/xdp_flow/xdp_flow_umh"
+	.global xdp_flow_umh_end
+xdp_flow_umh_end:
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC PATCH bpf-next 02/14] xdp_flow: Add skeleton bpf program for XDP
  2019-08-13 12:05 [RFC PATCH bpf-next 00/14] xdp_flow: Flow offload to XDP Toshiaki Makita
  2019-08-13 12:05 ` [RFC PATCH bpf-next 01/14] xdp_flow: Add skeleton of XDP based TC offload driver Toshiaki Makita
@ 2019-08-13 12:05 ` Toshiaki Makita
  2019-08-13 12:05 ` [RFC PATCH bpf-next 03/14] bpf: Add API to get program from id Toshiaki Makita
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 35+ messages in thread
From: Toshiaki Makita @ 2019-08-13 12:05 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko
  Cc: Toshiaki Makita, netdev, bpf, William Tu

The program is meant to be loaded when a device is bound to an ingress
TC block and should be attached to XDP on the device.
Typically it should be loaded when TC ingress or clsact qdisc is added.

The program is prebuilt and embedded in the UMH, instead of generated
dynamically. This is because TC filter is frequently changed when it is
used by OVS, and the latency of TC filter change will affect the latency
of datapath.

Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
---
 net/xdp_flow/Makefile                 |  87 +++++++++++-
 net/xdp_flow/xdp_flow_kern_bpf.c      |  12 ++
 net/xdp_flow/xdp_flow_kern_bpf_blob.S |   7 +
 net/xdp_flow/xdp_flow_umh.c           | 241 +++++++++++++++++++++++++++++++++-
 4 files changed, 343 insertions(+), 4 deletions(-)
 create mode 100644 net/xdp_flow/xdp_flow_kern_bpf.c
 create mode 100644 net/xdp_flow/xdp_flow_kern_bpf_blob.S

diff --git a/net/xdp_flow/Makefile b/net/xdp_flow/Makefile
index f6138c2..b3a0416 100644
--- a/net/xdp_flow/Makefile
+++ b/net/xdp_flow/Makefile
@@ -2,25 +2,106 @@
 
 obj-$(CONFIG_XDP_FLOW) += xdp_flow_core.o
 
+XDP_FLOW_PATH ?= $(abspath $(srctree)/$(src))
+TOOLS_PATH := $(XDP_FLOW_PATH)/../../tools
+
+# Libbpf dependencies
+LIBBPF = $(TOOLS_PATH)/lib/bpf/libbpf.a
+
+LLC ?= llc
+CLANG ?= clang
+LLVM_OBJCOPY ?= llvm-objcopy
+BTF_PAHOLE ?= pahole
+
+ifdef CROSS_COMPILE
+CLANG_ARCH_ARGS = -target $(ARCH)
+endif
+
+BTF_LLC_PROBE := $(shell $(LLC) -march=bpf -mattr=help 2>&1 | grep dwarfris)
+BTF_PAHOLE_PROBE := $(shell $(BTF_PAHOLE) --help 2>&1 | grep BTF)
+BTF_OBJCOPY_PROBE := $(shell $(LLVM_OBJCOPY) --help 2>&1 | grep -i 'usage.*llvm')
+BTF_LLVM_PROBE := $(shell echo "int main() { return 0; }" | \
+			  $(CLANG) -target bpf -O2 -g -c -x c - -o ./llvm_btf_verify.o; \
+			  readelf -S ./llvm_btf_verify.o | grep BTF; \
+			  /bin/rm -f ./llvm_btf_verify.o)
+
+ifneq ($(BTF_LLVM_PROBE),)
+	EXTRA_CFLAGS += -g
+else
+ifneq ($(and $(BTF_LLC_PROBE),$(BTF_PAHOLE_PROBE),$(BTF_OBJCOPY_PROBE)),)
+	EXTRA_CFLAGS += -g
+	LLC_FLAGS += -mattr=dwarfris
+	DWARF2BTF = y
+endif
+endif
+
+$(LIBBPF): FORCE
+# Fix up variables inherited from Kbuild that tools/ build system won't like
+	$(MAKE) -C $(dir $@) RM='rm -rf' LDFLAGS= srctree=$(XDP_FLOW_PATH)/../../ O=
+
+# Verify LLVM compiler tools are available and bpf target is supported by llc
+.PHONY: verify_cmds verify_target_bpf $(CLANG) $(LLC)
+
+verify_cmds: $(CLANG) $(LLC)
+	@for TOOL in $^ ; do \
+		if ! (which -- "$${TOOL}" > /dev/null 2>&1); then \
+			echo "*** ERROR: Cannot find LLVM tool $${TOOL}" ;\
+			exit 1; \
+		else true; fi; \
+	done
+
+verify_target_bpf: verify_cmds
+	@if ! (${LLC} -march=bpf -mattr=help > /dev/null 2>&1); then \
+		echo "*** ERROR: LLVM (${LLC}) does not support 'bpf' target" ;\
+		echo "   NOTICE: LLVM version >= 3.7.1 required" ;\
+		exit 2; \
+	else true; fi
+
+$(src)/xdp_flow_kern_bpf.c: verify_target_bpf
+
+$(obj)/xdp_flow_kern_bpf.o: $(src)/xdp_flow_kern_bpf.c FORCE
+	@echo "  CLANG-bpf " $@
+	$(Q)$(CLANG) $(NOSTDINC_FLAGS) $(LINUXINCLUDE) $(EXTRA_CFLAGS) -I$(obj) \
+		-I$(srctree)/tools/testing/selftests/bpf/ \
+		-D__KERNEL__ -D__BPF_TRACING__ -Wno-unused-value -Wno-pointer-sign \
+		-D__TARGET_ARCH_$(SRCARCH) -Wno-compare-distinct-pointer-types \
+		-Wno-gnu-variable-sized-type-not-at-end \
+		-Wno-address-of-packed-member -Wno-tautological-compare \
+		-Wno-unknown-warning-option $(CLANG_ARCH_ARGS) \
+		-I$(srctree)/samples/bpf/ -include asm_goto_workaround.h \
+		-O2 -emit-llvm -c $< -o -| $(LLC) -march=bpf $(LLC_FLAGS) -filetype=obj -o $@
+ifeq ($(DWARF2BTF),y)
+	$(BTF_PAHOLE) -J $@
+endif
+
 ifeq ($(CONFIG_XDP_FLOW_UMH), y)
 # builtin xdp_flow_umh should be compiled with -static
 # since rootfs isn't mounted at the time of __init
 # function is called and do_execv won't find elf interpreter
 STATIC := -static
+STATICLDLIBS := -lz
 endif
 
+quiet_cmd_as_user = AS      $@
+      cmd_as_user = $(AS) -c -o $@ $<
+
 quiet_cmd_cc_user = CC      $@
       cmd_cc_user = $(CC) -Wall -Wmissing-prototypes -O2 -std=gnu89 \
-		    -I$(srctree)/tools/include/ \
+		    -I$(srctree)/tools/lib/ -I$(srctree)/tools/include/ \
 		    -c -o $@ $<
 
 quiet_cmd_ld_user = LD      $@
-      cmd_ld_user = $(CC) $(STATIC) -o $@ $^
+      cmd_ld_user = $(CC) $(STATIC) -o $@ $^ $(LIBBPF) -lelf $(STATICLDLIBS)
+
+$(obj)/xdp_flow_kern_bpf_blob.o: $(src)/xdp_flow_kern_bpf_blob.S \
+				 $(obj)/xdp_flow_kern_bpf.o
+	$(call if_changed,as_user)
 
 $(obj)/xdp_flow_umh.o: $(src)/xdp_flow_umh.c FORCE
 	$(call if_changed,cc_user)
 
-$(obj)/xdp_flow_umh: $(obj)/xdp_flow_umh.o
+$(obj)/xdp_flow_umh: $(obj)/xdp_flow_umh.o $(LIBBPF) \
+		     $(obj)/xdp_flow_kern_bpf_blob.o
 	$(call if_changed,ld_user)
 
 clean-files := xdp_flow_umh
diff --git a/net/xdp_flow/xdp_flow_kern_bpf.c b/net/xdp_flow/xdp_flow_kern_bpf.c
new file mode 100644
index 0000000..74cdb1d
--- /dev/null
+++ b/net/xdp_flow/xdp_flow_kern_bpf.c
@@ -0,0 +1,12 @@
+// SPDX-License-Identifier: GPL-2.0
+#define KBUILD_MODNAME "foo"
+#include <uapi/linux/bpf.h>
+#include <bpf_helpers.h>
+
+SEC("xdp_flow")
+int xdp_flow_prog(struct xdp_md *ctx)
+{
+	return XDP_PASS;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/net/xdp_flow/xdp_flow_kern_bpf_blob.S b/net/xdp_flow/xdp_flow_kern_bpf_blob.S
new file mode 100644
index 0000000..d180c1b
--- /dev/null
+++ b/net/xdp_flow/xdp_flow_kern_bpf_blob.S
@@ -0,0 +1,7 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+	.section .rodata, "a"
+	.global xdp_flow_bpf_start
+xdp_flow_bpf_start:
+	.incbin "net/xdp_flow/xdp_flow_kern_bpf.o"
+	.global xdp_flow_bpf_end
+xdp_flow_bpf_end:
diff --git a/net/xdp_flow/xdp_flow_umh.c b/net/xdp_flow/xdp_flow_umh.c
index 6729bdf..734db00 100644
--- a/net/xdp_flow/xdp_flow_umh.c
+++ b/net/xdp_flow/xdp_flow_umh.c
@@ -6,9 +6,19 @@
 #include <fcntl.h>
 #include <unistd.h>
 #include <syslog.h>
+#include <bpf/libbpf.h>
+#include <bpf/bpf.h>
+#include <sys/mman.h>
 #include <sys/types.h>
+#include <sys/resource.h>
+#include <linux/hashtable.h>
+#include <linux/err.h>
 #include "msgfmt.h"
 
+extern char xdp_flow_bpf_start;
+extern char xdp_flow_bpf_end;
+int progfile_fd;
+
 /* FIXME: syslog is used for easy debugging. As writing /dev/log can be stuck
  * due to reader side, should use another log mechanism like kmsg.
  */
@@ -17,15 +27,241 @@
 #define pr_warn(fmt, ...) syslog(LOG_DAEMON | LOG_WARNING, fmt, ##__VA_ARGS__)
 #define pr_err(fmt, ...) syslog(LOG_DAEMON | LOG_ERR, fmt, ##__VA_ARGS__)
 
+#define ERRBUF_SIZE 64
+
+/* This key represents a net device */
+struct netdev_info_key {
+	int ifindex;
+};
+
+struct netdev_info {
+	struct netdev_info_key key;
+	struct hlist_node node;
+	struct bpf_object *obj;
+};
+
+DEFINE_HASHTABLE(netdev_info_table, 16);
+
+static int libbpf_err(int err, char *errbuf)
+{
+	libbpf_strerror(err, errbuf, ERRBUF_SIZE);
+
+	if (-err < __LIBBPF_ERRNO__START)
+		return err;
+
+	return -EINVAL;
+}
+
+static int setup(void)
+{
+	size_t size = &xdp_flow_bpf_end - &xdp_flow_bpf_start;
+	struct rlimit r = { RLIM_INFINITY, RLIM_INFINITY };
+	ssize_t len;
+	int err;
+
+	if (setrlimit(RLIMIT_MEMLOCK, &r)) {
+		err = -errno;
+		pr_err("setrlimit MEMLOCK failed: %s\n", strerror(errno));
+		return err;
+	}
+
+	progfile_fd = memfd_create("xdp_flow_kern_bpf.o", 0);
+	if (progfile_fd < 0) {
+		err = -errno;
+		pr_err("memfd_create failed: %s\n", strerror(errno));
+		return err;
+	}
+
+	len = write(progfile_fd, &xdp_flow_bpf_start, size);
+	if (len < 0) {
+		err = -errno;
+		pr_err("Failed to write bpf prog: %s\n", strerror(errno));
+		goto err;
+	}
+
+	if (len < size) {
+		pr_err("bpf prog written too short: expected %ld, actual %ld\n",
+		       size, len);
+		err = -EIO;
+		goto err;
+	}
+
+	return 0;
+err:
+	close(progfile_fd);
+
+	return err;
+}
+
+static int load_bpf(int ifindex, struct bpf_object **objp)
+{
+	struct bpf_object_open_attr attr = {};
+	char path[256], errbuf[ERRBUF_SIZE];
+	struct bpf_program *prog;
+	struct bpf_object *obj;
+	int prog_fd, err;
+	ssize_t len;
+
+	len = snprintf(path, 256, "/proc/self/fd/%d", progfile_fd);
+	if (len < 0) {
+		err = -errno;
+		pr_err("Failed to setup prog fd path string: %s\n",
+		       strerror(errno));
+		return err;
+	}
+
+	attr.file = path;
+	attr.prog_type = BPF_PROG_TYPE_XDP;
+	obj = bpf_object__open_xattr(&attr);
+	if (IS_ERR_OR_NULL(obj)) {
+		if (IS_ERR(obj)) {
+			err = libbpf_err((int)PTR_ERR(obj), errbuf);
+		} else {
+			err = -ENOENT;
+			strerror_r(-err, errbuf, sizeof(errbuf));
+		}
+		pr_err("Cannot open bpf prog: %s\n", errbuf);
+		return err;
+	}
+
+	bpf_object__for_each_program(prog, obj)
+		bpf_program__set_type(prog, attr.prog_type);
+
+	err = bpf_object__load(obj);
+	if (err) {
+		err = libbpf_err(err, errbuf);
+		pr_err("Failed to load bpf prog: %s\n", errbuf);
+		goto err;
+	}
+
+	prog = bpf_object__find_program_by_title(obj, "xdp_flow");
+	if (!prog) {
+		pr_err("Cannot find xdp_flow program\n");
+		err = -ENOENT;
+		goto err;
+	}
+
+	prog_fd = bpf_program__fd(prog);
+	if (prog_fd < 0) {
+		err = libbpf_err(prog_fd, errbuf);
+		pr_err("Invalid program fd: %s\n", errbuf);
+		goto err;
+	}
+
+	*objp = obj;
+
+	return prog_fd;
+err:
+	bpf_object__close(obj);
+	return err;
+}
+
+static int get_netdev_info_keyval(const struct netdev_info_key *key)
+{
+	return key->ifindex;
+}
+
+static struct netdev_info *find_netdev_info(const struct netdev_info_key *key)
+{
+	int keyval = get_netdev_info_keyval(key);
+	struct netdev_info *netdev_info;
+
+	hash_for_each_possible(netdev_info_table, netdev_info, node, keyval) {
+		if (netdev_info->key.ifindex == key->ifindex)
+			return netdev_info;
+	}
+
+	return NULL;
+}
+
+static int get_netdev_info_key(const struct mbox_request *req,
+			       struct netdev_info_key *key)
+{
+	key->ifindex = req->ifindex;
+
+	return 0;
+}
+
+static struct netdev_info *get_netdev_info(const struct mbox_request *req)
+{
+	struct netdev_info *netdev_info;
+	struct netdev_info_key key;
+	int err;
+
+	err = get_netdev_info_key(req, &key);
+	if (err)
+		return ERR_PTR(err);
+
+	netdev_info = find_netdev_info(&key);
+	if (!netdev_info) {
+		pr_err("BUG: netdev_info for if %d not found.\n",
+		       key.ifindex);
+		return ERR_PTR(-ENOENT);
+	}
+
+	return netdev_info;
+}
+
 static int handle_load(const struct mbox_request *req, __u32 *prog_id)
 {
-	*prog_id = 0;
+	struct netdev_info *netdev_info;
+	struct bpf_prog_info info = {};
+	struct netdev_info_key key;
+	__u32 len = sizeof(info);
+	int err, prog_fd;
+
+	err = get_netdev_info_key(req, &key);
+	if (err)
+		return err;
+
+	netdev_info = find_netdev_info(&key);
+	if (netdev_info)
+		return 0;
+
+	netdev_info = malloc(sizeof(*netdev_info));
+	if (!netdev_info) {
+		pr_err("malloc for netdev_info failed.\n");
+		return -ENOMEM;
+	}
+	netdev_info->key.ifindex = key.ifindex;
+
+	prog_fd = load_bpf(req->ifindex, &netdev_info->obj);
+	if (prog_fd < 0) {
+		err = prog_fd;
+		goto err_netdev_info;
+	}
+
+	err = bpf_obj_get_info_by_fd(prog_fd, &info, &len);
+	if (err)
+		goto err_obj;
+
+	*prog_id = info.id;
+	hash_add(netdev_info_table, &netdev_info->node,
+		 get_netdev_info_keyval(&netdev_info->key));
+	pr_debug("XDP program for if %d was loaded\n", req->ifindex);
 
 	return 0;
+err_obj:
+	bpf_object__close(netdev_info->obj);
+err_netdev_info:
+	free(netdev_info);
+
+	return err;
 }
 
 static int handle_unload(const struct mbox_request *req)
 {
+	struct netdev_info *netdev_info;
+
+	netdev_info = get_netdev_info(req);
+	if (IS_ERR(netdev_info))
+		return PTR_ERR(netdev_info);
+
+	hash_del(&netdev_info->node);
+	bpf_object__close(netdev_info->obj);
+	free(netdev_info);
+	pr_debug("XDP program for if %d was closed\n", req->ifindex);
+
 	return 0;
 }
 
@@ -103,7 +339,10 @@ static void loop(void)
 int main(void)
 {
 	pr_info("Started xdp_flow\n");
+	if (setup())
+		return -1;
 	loop();
+	close(progfile_fd);
 
 	return 0;
 }
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC PATCH bpf-next 03/14] bpf: Add API to get program from id
  2019-08-13 12:05 [RFC PATCH bpf-next 00/14] xdp_flow: Flow offload to XDP Toshiaki Makita
  2019-08-13 12:05 ` [RFC PATCH bpf-next 01/14] xdp_flow: Add skeleton of XDP based TC offload driver Toshiaki Makita
  2019-08-13 12:05 ` [RFC PATCH bpf-next 02/14] xdp_flow: Add skeleton bpf program for XDP Toshiaki Makita
@ 2019-08-13 12:05 ` Toshiaki Makita
  2019-08-13 12:05 ` [RFC PATCH bpf-next 04/14] xdp_flow: Attach bpf prog to XDP in kernel after UMH loaded program Toshiaki Makita
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 35+ messages in thread
From: Toshiaki Makita @ 2019-08-13 12:05 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko
  Cc: Toshiaki Makita, netdev, bpf, William Tu

Factor out the logic in bpf_prog_get_fd_by_id() and add
bpf_prog_get_by_id(). Also export bpf_prog_get_ok().
They are used by the next commit to get bpf prog from its id.

Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
---
 include/linux/bpf.h  |  6 ++++++
 kernel/bpf/syscall.c | 26 ++++++++++++++++++--------
 2 files changed, 24 insertions(+), 8 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index f9a5061..d8ad865 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -633,6 +633,7 @@ int bpf_prog_array_copy(struct bpf_prog_array *old_array,
 struct bpf_prog *bpf_prog_get(u32 ufd);
 struct bpf_prog *bpf_prog_get_type_dev(u32 ufd, enum bpf_prog_type type,
 				       bool attach_drv);
+struct bpf_prog *bpf_prog_get_by_id(u32 id);
 struct bpf_prog * __must_check bpf_prog_add(struct bpf_prog *prog, int i);
 void bpf_prog_sub(struct bpf_prog *prog, int i);
 struct bpf_prog * __must_check bpf_prog_inc(struct bpf_prog *prog);
@@ -755,6 +756,11 @@ static inline struct bpf_prog *bpf_prog_get_type_dev(u32 ufd,
 	return ERR_PTR(-EOPNOTSUPP);
 }
 
+static inline struct bpf_prog *bpf_prog_get_by_id(u32 id)
+{
+	return ERR_PTR(-EOPNOTSUPP);
+}
+
 static inline struct bpf_prog * __must_check bpf_prog_add(struct bpf_prog *prog,
 							  int i)
 {
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 5d141f1..cb5ecc4 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1495,6 +1495,7 @@ bool bpf_prog_get_ok(struct bpf_prog *prog,
 
 	return true;
 }
+EXPORT_SYMBOL_GPL(bpf_prog_get_ok);
 
 static struct bpf_prog *__bpf_prog_get(u32 ufd, enum bpf_prog_type *attach_type,
 				       bool attach_drv)
@@ -2122,6 +2123,22 @@ static int bpf_obj_get_next_id(const union bpf_attr *attr,
 	return err;
 }
 
+struct bpf_prog *bpf_prog_get_by_id(u32 id)
+{
+	struct bpf_prog *prog;
+
+	spin_lock_bh(&prog_idr_lock);
+	prog = idr_find(&prog_idr, id);
+	if (prog)
+		prog = bpf_prog_inc_not_zero(prog);
+	else
+		prog = ERR_PTR(-ENOENT);
+	spin_unlock_bh(&prog_idr_lock);
+
+	return prog;
+}
+EXPORT_SYMBOL_GPL(bpf_prog_get_by_id);
+
 #define BPF_PROG_GET_FD_BY_ID_LAST_FIELD prog_id
 
 static int bpf_prog_get_fd_by_id(const union bpf_attr *attr)
@@ -2136,14 +2153,7 @@ static int bpf_prog_get_fd_by_id(const union bpf_attr *attr)
 	if (!capable(CAP_SYS_ADMIN))
 		return -EPERM;
 
-	spin_lock_bh(&prog_idr_lock);
-	prog = idr_find(&prog_idr, id);
-	if (prog)
-		prog = bpf_prog_inc_not_zero(prog);
-	else
-		prog = ERR_PTR(-ENOENT);
-	spin_unlock_bh(&prog_idr_lock);
-
+	prog = bpf_prog_get_by_id(id);
 	if (IS_ERR(prog))
 		return PTR_ERR(prog);
 
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC PATCH bpf-next 04/14] xdp_flow: Attach bpf prog to XDP in kernel after UMH loaded program
  2019-08-13 12:05 [RFC PATCH bpf-next 00/14] xdp_flow: Flow offload to XDP Toshiaki Makita
                   ` (2 preceding siblings ...)
  2019-08-13 12:05 ` [RFC PATCH bpf-next 03/14] bpf: Add API to get program from id Toshiaki Makita
@ 2019-08-13 12:05 ` Toshiaki Makita
  2019-08-13 12:05 ` [RFC PATCH bpf-next 05/14] xdp_flow: Prepare flow tables in bpf Toshiaki Makita
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 35+ messages in thread
From: Toshiaki Makita @ 2019-08-13 12:05 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko
  Cc: Toshiaki Makita, netdev, bpf, William Tu

As UMH runs under RTNL, it cannot attach XDP from userspace. Thus the
kernel, xdp_flow module, installs the XDP program.

NOTE: As an RFC, XDP-related logic is emulating dev_change_xdp_fd().
I'm thinking I should factor out the logic from dev_change_xdp_fd() and
export it instead.

Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
---
 include/linux/netdevice.h        |  4 +++
 net/core/dev.c                   | 11 ++++---
 net/xdp_flow/xdp_flow_kern_mod.c | 63 ++++++++++++++++++++++++++++++++++++----
 3 files changed, 69 insertions(+), 9 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 8829295..c99e022 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3678,6 +3678,10 @@ struct sk_buff *dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
 				    struct netdev_queue *txq, int *ret);
 
 typedef int (*bpf_op_t)(struct net_device *dev, struct netdev_bpf *bpf);
+int dev_xdp_install(struct net_device *dev, bpf_op_t bpf_op,
+		    struct netlink_ext_ack *extack, u32 flags,
+		    struct bpf_prog *prog);
+int generic_xdp_install(struct net_device *dev, struct netdev_bpf *xdp);
 int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack,
 		      int fd, u32 flags);
 u32 __dev_xdp_query(struct net_device *dev, bpf_op_t xdp_op,
diff --git a/net/core/dev.c b/net/core/dev.c
index fc676b2..a45d2e4 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5145,7 +5145,7 @@ static void __netif_receive_skb_list(struct list_head *head)
 		memalloc_noreclaim_restore(noreclaim_flag);
 }
 
-static int generic_xdp_install(struct net_device *dev, struct netdev_bpf *xdp)
+int generic_xdp_install(struct net_device *dev, struct netdev_bpf *xdp)
 {
 	struct bpf_prog *old = rtnl_dereference(dev->xdp_prog);
 	struct bpf_prog *new = xdp->prog;
@@ -5177,6 +5177,7 @@ static int generic_xdp_install(struct net_device *dev, struct netdev_bpf *xdp)
 
 	return ret;
 }
+EXPORT_SYMBOL_GPL(generic_xdp_install);
 
 static int netif_receive_skb_internal(struct sk_buff *skb)
 {
@@ -8001,10 +8002,11 @@ u32 __dev_xdp_query(struct net_device *dev, bpf_op_t bpf_op,
 
 	return xdp.prog_id;
 }
+EXPORT_SYMBOL_GPL(__dev_xdp_query);
 
-static int dev_xdp_install(struct net_device *dev, bpf_op_t bpf_op,
-			   struct netlink_ext_ack *extack, u32 flags,
-			   struct bpf_prog *prog)
+int dev_xdp_install(struct net_device *dev, bpf_op_t bpf_op,
+		    struct netlink_ext_ack *extack, u32 flags,
+		    struct bpf_prog *prog)
 {
 	struct netdev_bpf xdp;
 
@@ -8019,6 +8021,7 @@ static int dev_xdp_install(struct net_device *dev, bpf_op_t bpf_op,
 
 	return bpf_op(dev, &xdp);
 }
+EXPORT_SYMBOL_GPL(dev_xdp_install);
 
 static void dev_xdp_uninstall(struct net_device *dev)
 {
diff --git a/net/xdp_flow/xdp_flow_kern_mod.c b/net/xdp_flow/xdp_flow_kern_mod.c
index 823ab65..9cf527d 100644
--- a/net/xdp_flow/xdp_flow_kern_mod.c
+++ b/net/xdp_flow/xdp_flow_kern_mod.c
@@ -116,10 +116,26 @@ static int xdp_flow_setup_block_cb(enum tc_setup_type type, void *type_data,
 static int xdp_flow_setup_bind(struct net_device *dev,
 			       struct netlink_ext_ack *extack)
 {
+	enum bpf_prog_type attach_type = BPF_PROG_TYPE_XDP;
 	struct mbox_request *req;
+	bpf_op_t bpf_op, bpf_chk;
+	struct bpf_prog *prog;
 	u32 id = 0;
 	int err;
 
+	bpf_op = bpf_chk = dev->netdev_ops->ndo_bpf;
+	if (!bpf_op)
+		bpf_op = generic_xdp_install;
+	else
+		bpf_chk = generic_xdp_install;
+
+	/* TODO: These checks should be unified with net core */
+	if (__dev_xdp_query(dev, bpf_chk, XDP_QUERY_PROG))
+		return -EEXIST;
+
+	if (__dev_xdp_query(dev, bpf_op, XDP_QUERY_PROG))
+		return -EBUSY;
+
 	req = kzalloc(sizeof(*req), GFP_KERNEL);
 	if (!req)
 		return -ENOMEM;
@@ -129,21 +145,56 @@ static int xdp_flow_setup_bind(struct net_device *dev,
 
 	/* Load bpf in UMH and get prog id */
 	err = transact_umh(req, &id);
+	if (err)
+		goto out;
+
+	prog = bpf_prog_get_by_id(id);
+	if (IS_ERR(prog)) {
+		err = PTR_ERR(prog);
+		goto err_umh;
+	}
+
+	if (!bpf_prog_get_ok(prog, &attach_type, false)) {
+		err = -EINVAL;
+		goto err_prog;
+	}
 
-	/* TODO: id will be used to attach bpf prog to XDP
-	 * As we have rtnl_lock, UMH cannot attach prog to XDP
-	 */
+	/* As we have rtnl_lock, install XDP in kernel */
+	err = dev_xdp_install(dev, bpf_op, extack, 0, prog);
+	if (err)
+		goto err_prog;
 
+	/* TODO: Should get prog once more and save it for later check */
+out:
 	kfree(req);
 
 	return err;
+err_prog:
+	bpf_prog_put(prog);
+err_umh:
+	req->cmd = XDP_FLOW_CMD_UNLOAD;
+	transact_umh(req, NULL);
+
+	goto out;
 }
 
 static int xdp_flow_setup_unbind(struct net_device *dev,
 				 struct netlink_ext_ack *extack)
 {
 	struct mbox_request *req;
-	int err;
+	int err, ret = 0;
+	bpf_op_t bpf_op;
+
+	bpf_op = dev->netdev_ops->ndo_bpf;
+	if (!bpf_op)
+		bpf_op = generic_xdp_install;
+
+	/* TODO: Should check if prog is not changed */
+	err = dev_xdp_install(dev, bpf_op, extack, 0, NULL);
+	if (err) {
+		pr_warn("Failed to uninstall XDP prog: %d\n", err);
+		ret = err;
+	}
 
 	req = kzalloc(sizeof(*req), GFP_KERNEL);
 	if (!req)
@@ -153,10 +204,12 @@ static int xdp_flow_setup_unbind(struct net_device *dev,
 	req->ifindex = dev->ifindex;
 
 	err = transact_umh(req, NULL);
+	if (err)
+		ret = err;
 
 	kfree(req);
 
-	return err;
+	return ret;
 }
 
 static int xdp_flow_setup(struct net_device *dev, bool do_bind,
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC PATCH bpf-next 05/14] xdp_flow: Prepare flow tables in bpf
  2019-08-13 12:05 [RFC PATCH bpf-next 00/14] xdp_flow: Flow offload to XDP Toshiaki Makita
                   ` (3 preceding siblings ...)
  2019-08-13 12:05 ` [RFC PATCH bpf-next 04/14] xdp_flow: Attach bpf prog to XDP in kernel after UMH loaded program Toshiaki Makita
@ 2019-08-13 12:05 ` Toshiaki Makita
  2019-08-13 12:05 ` [RFC PATCH bpf-next 06/14] xdp_flow: Add flow entry insertion/deletion logic in UMH Toshiaki Makita
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 35+ messages in thread
From: Toshiaki Makita @ 2019-08-13 12:05 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko
  Cc: Toshiaki Makita, netdev, bpf, William Tu

Add maps for flow tables in bpf. TC flower has hash tables for each flow
mask ordered by priority. To do the same thing, prepare
hashmap-in-arraymap. As bpf does not provide ordered list, we emulate it
by an array. Each array entry has one-byte next index field to implement
a list. Also prepare a one-element array to point to the head index of
the list.

Because of the limitation of bpf maps, the outer array is implemented
using two array maps. "flow_masks" is the array to emulate the list and
its entries have the priority and mask of each flow table. For each
priority/mask, the same index entry of another map "flow_tables", which
is the hashmap-in-arraymap, points to the actual flow table.

The flow insertion logic in UMH and lookup logic in BPF will be
implemented in the following commits.

NOTE: This list emulation by array may be able to be realized by adding
ordered-list type map. In that case we also need map iteration API for
bpf progs.

Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
---
 net/xdp_flow/umh_bpf.h           | 18 +++++++++++
 net/xdp_flow/xdp_flow_kern_bpf.c | 22 +++++++++++++
 net/xdp_flow/xdp_flow_umh.c      | 70 ++++++++++++++++++++++++++++++++++++++--
 3 files changed, 108 insertions(+), 2 deletions(-)
 create mode 100644 net/xdp_flow/umh_bpf.h

diff --git a/net/xdp_flow/umh_bpf.h b/net/xdp_flow/umh_bpf.h
new file mode 100644
index 0000000..b4fe0c6
--- /dev/null
+++ b/net/xdp_flow/umh_bpf.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _NET_XDP_FLOW_UMH_BPF_H
+#define _NET_XDP_FLOW_UMH_BPF_H
+
+#include "msgfmt.h"
+
+#define MAX_FLOWS 1024
+#define MAX_FLOW_MASKS 255
+#define FLOW_MASKS_TAIL 255
+
+struct xdp_flow_mask_entry {
+	struct xdp_flow_key mask;
+	__u16 priority;
+	short count;
+	int next;
+};
+
+#endif
diff --git a/net/xdp_flow/xdp_flow_kern_bpf.c b/net/xdp_flow/xdp_flow_kern_bpf.c
index 74cdb1d..c101156 100644
--- a/net/xdp_flow/xdp_flow_kern_bpf.c
+++ b/net/xdp_flow/xdp_flow_kern_bpf.c
@@ -2,6 +2,28 @@
 #define KBUILD_MODNAME "foo"
 #include <uapi/linux/bpf.h>
 #include <bpf_helpers.h>
+#include "umh_bpf.h"
+
+struct bpf_map_def SEC("maps") flow_masks_head = {
+	.type = BPF_MAP_TYPE_ARRAY,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(int),
+	.max_entries = 1,
+};
+
+struct bpf_map_def SEC("maps") flow_masks = {
+	.type = BPF_MAP_TYPE_ARRAY,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(struct xdp_flow_mask_entry),
+	.max_entries = MAX_FLOW_MASKS,
+};
+
+struct bpf_map_def SEC("maps") flow_tables = {
+	.type = BPF_MAP_TYPE_ARRAY_OF_MAPS,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(u32),
+	.max_entries = MAX_FLOW_MASKS,
+};
 
 SEC("xdp_flow")
 int xdp_flow_prog(struct xdp_md *ctx)
diff --git a/net/xdp_flow/xdp_flow_umh.c b/net/xdp_flow/xdp_flow_umh.c
index 734db00..e35666a 100644
--- a/net/xdp_flow/xdp_flow_umh.c
+++ b/net/xdp_flow/xdp_flow_umh.c
@@ -13,7 +13,7 @@
 #include <sys/resource.h>
 #include <linux/hashtable.h>
 #include <linux/err.h>
-#include "msgfmt.h"
+#include "umh_bpf.h"
 
 extern char xdp_flow_bpf_start;
 extern char xdp_flow_bpf_end;
@@ -95,11 +95,13 @@ static int setup(void)
 
 static int load_bpf(int ifindex, struct bpf_object **objp)
 {
+	int prog_fd, flow_tables_fd, flow_meta_fd, flow_masks_head_fd, err;
+	struct bpf_map *flow_tables, *flow_masks_head;
+	int zero = 0, flow_masks_tail = FLOW_MASKS_TAIL;
 	struct bpf_object_open_attr attr = {};
 	char path[256], errbuf[ERRBUF_SIZE];
 	struct bpf_program *prog;
 	struct bpf_object *obj;
-	int prog_fd, err;
 	ssize_t len;
 
 	len = snprintf(path, 256, "/proc/self/fd/%d", progfile_fd);
@@ -127,6 +129,48 @@ static int load_bpf(int ifindex, struct bpf_object **objp)
 	bpf_object__for_each_program(prog, obj)
 		bpf_program__set_type(prog, attr.prog_type);
 
+	flow_meta_fd = bpf_create_map(BPF_MAP_TYPE_HASH,
+				      sizeof(struct xdp_flow_key),
+				      sizeof(struct xdp_flow_actions),
+				      MAX_FLOWS, 0);
+	if (flow_meta_fd < 0) {
+		err = -errno;
+		pr_err("map creation for flow_tables meta failed: %s\n",
+		       strerror(errno));
+		goto err;
+	}
+
+	flow_tables_fd = bpf_create_map_in_map(BPF_MAP_TYPE_ARRAY_OF_MAPS,
+					       "flow_tables", sizeof(__u32),
+					       flow_meta_fd, MAX_FLOW_MASKS, 0);
+	if (flow_tables_fd < 0) {
+		err = -errno;
+		pr_err("map creation for flow_tables failed: %s\n",
+		       strerror(errno));
+		close(flow_meta_fd);
+		goto err;
+	}
+
+	close(flow_meta_fd);
+
+	flow_tables = bpf_object__find_map_by_name(obj, "flow_tables");
+	if (!flow_tables) {
+		pr_err("Cannot find flow_tables\n");
+		err = -ENOENT;
+		close(flow_tables_fd);
+		goto err;
+	}
+
+	err = bpf_map__reuse_fd(flow_tables, flow_tables_fd);
+	if (err) {
+		err = libbpf_err(err, errbuf);
+		pr_err("Failed to reuse flow_tables fd: %s\n", errbuf);
+		close(flow_tables_fd);
+		goto err;
+	}
+
+	close(flow_tables_fd);
+
 	err = bpf_object__load(obj);
 	if (err) {
 		err = libbpf_err(err, errbuf);
@@ -134,6 +178,28 @@ static int load_bpf(int ifindex, struct bpf_object **objp)
 		goto err;
 	}
 
+	flow_masks_head = bpf_object__find_map_by_name(obj, "flow_masks_head");
+	if (!flow_masks_head) {
+		pr_err("Cannot find flow_masks_head map\n");
+		err = -ENOENT;
+		goto err;
+	}
+
+	flow_masks_head_fd = bpf_map__fd(flow_masks_head);
+	if (flow_masks_head_fd < 0) {
+		err = libbpf_err(flow_masks_head_fd, errbuf);
+		pr_err("Invalid flow_masks_head fd: %s\n", errbuf);
+		goto err;
+	}
+
+	if (bpf_map_update_elem(flow_masks_head_fd, &zero, &flow_masks_tail,
+				0)) {
+		err = -errno;
+		pr_err("Failed to initialize flow_masks_head: %s\n",
+		       strerror(errno));
+		goto err;
+	}
+
 	prog = bpf_object__find_program_by_title(obj, "xdp_flow");
 	if (!prog) {
 		pr_err("Cannot find xdp_flow program\n");
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC PATCH bpf-next 06/14] xdp_flow: Add flow entry insertion/deletion logic in UMH
  2019-08-13 12:05 [RFC PATCH bpf-next 00/14] xdp_flow: Flow offload to XDP Toshiaki Makita
                   ` (4 preceding siblings ...)
  2019-08-13 12:05 ` [RFC PATCH bpf-next 05/14] xdp_flow: Prepare flow tables in bpf Toshiaki Makita
@ 2019-08-13 12:05 ` Toshiaki Makita
  2019-08-13 12:05 ` [RFC PATCH bpf-next 07/14] xdp_flow: Add flow handling and basic actions in bpf prog Toshiaki Makita
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 35+ messages in thread
From: Toshiaki Makita @ 2019-08-13 12:05 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko
  Cc: Toshiaki Makita, netdev, bpf, William Tu

This logic will be used when xdp_flow kmod requests flow
insertion/deleteion.

On insertion, find a free entry and populate it, then update next index
pointer of its previous entry. On deletion, set the next index pointer
of the prev entry to the next index of the entry to be deleted.

Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
---
 net/xdp_flow/umh_bpf.h      |  15 ++
 net/xdp_flow/xdp_flow_umh.c | 470 +++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 483 insertions(+), 2 deletions(-)

diff --git a/net/xdp_flow/umh_bpf.h b/net/xdp_flow/umh_bpf.h
index b4fe0c6..4e4633f 100644
--- a/net/xdp_flow/umh_bpf.h
+++ b/net/xdp_flow/umh_bpf.h
@@ -15,4 +15,19 @@ struct xdp_flow_mask_entry {
 	int next;
 };
 
+static inline bool flow_equal(const struct xdp_flow_key *key1,
+			      const struct xdp_flow_key *key2)
+{
+	long *lkey1 = (long *)key1;
+	long *lkey2 = (long *)key2;
+	int i;
+
+	for (i = 0; i < sizeof(*key1); i += sizeof(long)) {
+		if (*lkey1++ != *lkey2++)
+			return false;
+	}
+
+	return true;
+}
+
 #endif
diff --git a/net/xdp_flow/xdp_flow_umh.c b/net/xdp_flow/xdp_flow_umh.c
index e35666a..9a4769b 100644
--- a/net/xdp_flow/xdp_flow_umh.c
+++ b/net/xdp_flow/xdp_flow_umh.c
@@ -19,6 +19,8 @@
 extern char xdp_flow_bpf_end;
 int progfile_fd;
 
+#define zalloc(size) calloc(1, (size))
+
 /* FIXME: syslog is used for easy debugging. As writing /dev/log can be stuck
  * due to reader side, should use another log mechanism like kmsg.
  */
@@ -38,6 +40,8 @@ struct netdev_info {
 	struct netdev_info_key key;
 	struct hlist_node node;
 	struct bpf_object *obj;
+	int free_slot_top;
+	int free_slots[MAX_FLOW_MASKS];
 };
 
 DEFINE_HASHTABLE(netdev_info_table, 16);
@@ -268,6 +272,57 @@ static struct netdev_info *get_netdev_info(const struct mbox_request *req)
 	return netdev_info;
 }
 
+static void init_flow_masks_free_slot(struct netdev_info *netdev_info)
+{
+	int i;
+
+	for (i = 0; i < MAX_FLOW_MASKS; i++)
+		netdev_info->free_slots[MAX_FLOW_MASKS - 1 - i] = i;
+	netdev_info->free_slot_top = MAX_FLOW_MASKS - 1;
+}
+
+static int get_flow_masks_free_slot(const struct netdev_info *netdev_info)
+{
+	if (netdev_info->free_slot_top < 0)
+		return -ENOBUFS;
+
+	return netdev_info->free_slots[netdev_info->free_slot_top];
+}
+
+static int add_flow_masks_free_slot(struct netdev_info *netdev_info, int slot)
+{
+	if (unlikely(netdev_info->free_slot_top >= MAX_FLOW_MASKS - 1)) {
+		pr_warn("BUG: free_slot overflow: top=%d, slot=%d\n",
+			netdev_info->free_slot_top, slot);
+		return -EOVERFLOW;
+	}
+
+	netdev_info->free_slots[++netdev_info->free_slot_top] = slot;
+
+	return 0;
+}
+
+static void delete_flow_masks_free_slot(struct netdev_info *netdev_info,
+					int slot)
+{
+	int top_slot;
+
+	if (unlikely(netdev_info->free_slot_top < 0)) {
+		pr_warn("BUG: free_slot underflow: top=%d, slot=%d\n",
+			netdev_info->free_slot_top, slot);
+		return;
+	}
+
+	top_slot = netdev_info->free_slots[netdev_info->free_slot_top];
+	if (unlikely(top_slot != slot)) {
+		pr_warn("BUG: inconsistent free_slot top: top_slot=%d, slot=%d\n",
+			top_slot, slot);
+		return;
+	}
+
+	netdev_info->free_slot_top--;
+}
+
 static int handle_load(const struct mbox_request *req, __u32 *prog_id)
 {
 	struct netdev_info *netdev_info;
@@ -291,6 +346,8 @@ static int handle_load(const struct mbox_request *req, __u32 *prog_id)
 	}
 	netdev_info->key.ifindex = key.ifindex;
 
+	init_flow_masks_free_slot(netdev_info);
+
 	prog_fd = load_bpf(req->ifindex, &netdev_info->obj);
 	if (prog_fd < 0) {
 		err = prog_fd;
@@ -331,14 +388,423 @@ static int handle_unload(const struct mbox_request *req)
 	return 0;
 }
 
+static int get_table_fd(const struct netdev_info *netdev_info,
+			const char *table_name)
+{
+	char errbuf[ERRBUF_SIZE];
+	struct bpf_map *map;
+	int map_fd;
+	int err;
+
+	map = bpf_object__find_map_by_name(netdev_info->obj, table_name);
+	if (!map) {
+		pr_err("BUG: %s map not found.\n", table_name);
+		return -ENOENT;
+	}
+
+	map_fd = bpf_map__fd(map);
+	if (map_fd < 0) {
+		err = libbpf_err(map_fd, errbuf);
+		pr_err("Invalid map fd: %s\n", errbuf);
+		return err;
+	}
+
+	return map_fd;
+}
+
+static int get_flow_masks_head_fd(const struct netdev_info *netdev_info)
+{
+	return get_table_fd(netdev_info, "flow_masks_head");
+}
+
+static int get_flow_masks_head(int head_fd, int *head)
+{
+	int err, zero = 0;
+
+	if (bpf_map_lookup_elem(head_fd, &zero, head)) {
+		err = -errno;
+		pr_err("Cannot get flow_masks_head: %s\n", strerror(errno));
+		return err;
+	}
+
+	return 0;
+}
+
+static int update_flow_masks_head(int head_fd, int head)
+{
+	int err, zero = 0;
+
+	if (bpf_map_update_elem(head_fd, &zero, &head, 0)) {
+		err = -errno;
+		pr_err("Cannot update flow_masks_head: %s\n", strerror(errno));
+		return err;
+	}
+
+	return 0;
+}
+
+static int get_flow_masks_fd(const struct netdev_info *netdev_info)
+{
+	return get_table_fd(netdev_info, "flow_masks");
+}
+
+static int get_flow_tables_fd(const struct netdev_info *netdev_info)
+{
+	return get_table_fd(netdev_info, "flow_tables");
+}
+
+static int __flow_table_insert_elem(int flow_table_fd,
+				    const struct xdp_flow *flow)
+{
+	int err = 0;
+
+	if (bpf_map_update_elem(flow_table_fd, &flow->key, &flow->actions, 0)) {
+		err = -errno;
+		pr_err("Cannot insert flow entry: %s\n",
+		       strerror(errno));
+	}
+
+	return err;
+}
+
+static void __flow_table_delete_elem(int flow_table_fd,
+				     const struct xdp_flow *flow)
+{
+	bpf_map_delete_elem(flow_table_fd, &flow->key);
+}
+
+static int flow_table_insert_elem(struct netdev_info *netdev_info,
+				  const struct xdp_flow *flow)
+{
+	int masks_fd, head_fd, flow_tables_fd, flow_table_fd, free_slot, head;
+	struct xdp_flow_mask_entry *entry, *pentry;
+	int err, cnt, idx, pidx;
+
+	masks_fd = get_flow_masks_fd(netdev_info);
+	if (masks_fd < 0)
+		return masks_fd;
+
+	head_fd = get_flow_masks_head_fd(netdev_info);
+	if (head_fd < 0)
+		return head_fd;
+
+	err = get_flow_masks_head(head_fd, &head);
+	if (err)
+		return err;
+
+	flow_tables_fd = get_flow_tables_fd(netdev_info);
+	if (flow_tables_fd < 0)
+		return flow_tables_fd;
+
+	entry = zalloc(sizeof(*entry));
+	if (!entry) {
+		pr_err("Memory allocation for flow_masks entry failed\n");
+		return -ENOMEM;
+	}
+
+	pentry = zalloc(sizeof(*pentry));
+	if (!pentry) {
+		flow_table_fd = -ENOMEM;
+		pr_err("Memory allocation for flow_masks prev entry failed\n");
+		goto err_entry;
+	}
+
+	idx = head;
+	for (cnt = 0; cnt < MAX_FLOW_MASKS; cnt++) {
+		if (idx == FLOW_MASKS_TAIL)
+			break;
+
+		if (bpf_map_lookup_elem(masks_fd, &idx, entry)) {
+			err = -errno;
+			pr_err("Cannot lookup flow_masks: %s\n",
+			       strerror(errno));
+			goto err;
+		}
+
+		if (entry->priority == flow->priority &&
+		    flow_equal(&entry->mask, &flow->mask)) {
+			__u32 id;
+
+			if (bpf_map_lookup_elem(flow_tables_fd, &idx, &id)) {
+				err = -errno;
+				pr_err("Cannot lookup flow_tables: %s\n",
+				       strerror(errno));
+				goto err;
+			}
+
+			flow_table_fd = bpf_map_get_fd_by_id(id);
+			if (flow_table_fd < 0) {
+				err = -errno;
+				pr_err("Cannot get flow_table fd by id: %s\n",
+				       strerror(errno));
+				goto err;
+			}
+
+			err = __flow_table_insert_elem(flow_table_fd, flow);
+			if (err)
+				goto out;
+
+			entry->count++;
+			if (bpf_map_update_elem(masks_fd, &idx, entry, 0)) {
+				err = -errno;
+				pr_err("Cannot update flow_masks count: %s\n",
+				       strerror(errno));
+				__flow_table_delete_elem(flow_table_fd, flow);
+				goto out;
+			}
+
+			goto out;
+		}
+
+		if (entry->priority > flow->priority)
+			break;
+
+		*pentry = *entry;
+		pidx = idx;
+		idx = entry->next;
+	}
+
+	if (unlikely(cnt == MAX_FLOW_MASKS && idx != FLOW_MASKS_TAIL)) {
+		err = -EINVAL;
+		pr_err("Cannot lookup flow_masks: Broken flow_masks list\n");
+		goto out;
+	}
+
+	/* Flow mask was not found. Create a new one */
+
+	free_slot = get_flow_masks_free_slot(netdev_info);
+	if (free_slot < 0) {
+		err = free_slot;
+		goto err;
+	}
+
+	entry->mask = flow->mask;
+	entry->priority = flow->priority;
+	entry->count = 1;
+	entry->next = idx;
+	if (bpf_map_update_elem(masks_fd, &free_slot, entry, 0)) {
+		err = -errno;
+		pr_err("Cannot update flow_masks: %s\n", strerror(errno));
+		goto err;
+	}
+
+	flow_table_fd = bpf_create_map(BPF_MAP_TYPE_HASH,
+				       sizeof(struct xdp_flow_key),
+				       sizeof(struct xdp_flow_actions),
+				       MAX_FLOWS, 0);
+	if (flow_table_fd < 0) {
+		err = -errno;
+		pr_err("map creation for flow_table failed: %s\n",
+		       strerror(errno));
+		goto err;
+	}
+
+	err = __flow_table_insert_elem(flow_table_fd, flow);
+	if (err)
+		goto out;
+
+	if (bpf_map_update_elem(flow_tables_fd, &free_slot, &flow_table_fd, 0)) {
+		err = -errno;
+		pr_err("Failed to insert flow_table into flow_tables: %s\n",
+		       strerror(errno));
+		goto out;
+	}
+
+	if (cnt == 0) {
+		err = update_flow_masks_head(head_fd, free_slot);
+		if (err)
+			goto err_flow_table;
+	} else {
+		pentry->next = free_slot;
+		/* This effectively only updates one byte of entry->next */
+		if (bpf_map_update_elem(masks_fd, &pidx, pentry, 0)) {
+			err = -errno;
+			pr_err("Cannot update flow_masks prev entry: %s\n",
+			       strerror(errno));
+			goto err_flow_table;
+		}
+	}
+	delete_flow_masks_free_slot(netdev_info, free_slot);
+out:
+	close(flow_table_fd);
+err:
+	free(pentry);
+err_entry:
+	free(entry);
+
+	return err;
+
+err_flow_table:
+	bpf_map_delete_elem(flow_tables_fd, &free_slot);
+
+	goto out;
+}
+
+static int flow_table_delete_elem(struct netdev_info *netdev_info,
+				  const struct xdp_flow *flow)
+{
+	int masks_fd, head_fd, flow_tables_fd, flow_table_fd, head;
+	struct xdp_flow_mask_entry *entry, *pentry;
+	int err, cnt, idx, pidx;
+	__u32 id;
+
+	masks_fd = get_flow_masks_fd(netdev_info);
+	if (masks_fd < 0)
+		return masks_fd;
+
+	head_fd = get_flow_masks_head_fd(netdev_info);
+	if (head_fd < 0)
+		return head_fd;
+
+	err = get_flow_masks_head(head_fd, &head);
+	if (err)
+		return err;
+
+	flow_tables_fd = get_flow_tables_fd(netdev_info);
+	if (flow_tables_fd < 0)
+		return flow_tables_fd;
+
+	entry = zalloc(sizeof(*entry));
+	if (!entry) {
+		pr_err("Memory allocation for flow_masks entry failed\n");
+		return -ENOMEM;
+	}
+
+	pentry = zalloc(sizeof(*pentry));
+	if (!pentry) {
+		err = -ENOMEM;
+		pr_err("Memory allocation for flow_masks prev entry failed\n");
+		goto err_pentry;
+	}
+
+	idx = head;
+	for (cnt = 0; cnt < MAX_FLOW_MASKS; cnt++) {
+		if (idx == FLOW_MASKS_TAIL) {
+			err = -ENOENT;
+			pr_err("Cannot lookup flow_masks: %s\n",
+			       strerror(-err));
+			goto out;
+		}
+
+		if (bpf_map_lookup_elem(masks_fd, &idx, entry)) {
+			err = -errno;
+			pr_err("Cannot lookup flow_masks: %s\n",
+			       strerror(errno));
+			goto out;
+		}
+
+		if (entry->priority > flow->priority) {
+			err = -ENOENT;
+			pr_err("Cannot lookup flow_masks: %s\n",
+			       strerror(-err));
+			goto out;
+		}
+
+		if (entry->priority == flow->priority &&
+		    flow_equal(&entry->mask, &flow->mask))
+			break;
+
+		*pentry = *entry;
+		pidx = idx;
+		idx = entry->next;
+	}
+
+	if (unlikely(cnt == MAX_FLOW_MASKS)) {
+		err = -ENOENT;
+		pr_err("Cannot lookup flow_masks: Broken flow_masks list\n");
+		goto out;
+	}
+
+	if (bpf_map_lookup_elem(flow_tables_fd, &idx, &id)) {
+		err = -errno;
+		pr_err("Cannot lookup flow_tables: %s\n",
+		       strerror(errno));
+		goto out;
+	}
+
+	flow_table_fd = bpf_map_get_fd_by_id(id);
+	if (flow_table_fd < 0) {
+		err = -errno;
+		pr_err("Cannot get flow_table fd by id: %s\n",
+		       strerror(errno));
+		goto out;
+	}
+
+	__flow_table_delete_elem(flow_table_fd, flow);
+	close(flow_table_fd);
+
+	if (--entry->count > 0) {
+		if (bpf_map_update_elem(masks_fd, &idx, entry, 0)) {
+			err = -errno;
+			pr_err("Cannot update flow_masks count: %s\n",
+			       strerror(errno));
+		}
+
+		goto out;
+	}
+
+	if (unlikely(entry->count < 0)) {
+		pr_warn("flow_masks has negative count: %d\n",
+			entry->count);
+	}
+
+	if (cnt == 0) {
+		err = update_flow_masks_head(head_fd, entry->next);
+		if (err)
+			goto out;
+	} else {
+		pentry->next = entry->next;
+		/* This effectively only updates one byte of entry->next */
+		if (bpf_map_update_elem(masks_fd, &pidx, pentry, 0)) {
+			err = -errno;
+			pr_err("Cannot update flow_masks prev entry: %s\n",
+			       strerror(errno));
+			goto out;
+		}
+	}
+
+	bpf_map_delete_elem(flow_tables_fd, &idx);
+	err = add_flow_masks_free_slot(netdev_info, idx);
+	if (err)
+		pr_err("Cannot add flow_masks free slot: %s\n", strerror(-err));
+out:
+	free(pentry);
+err_pentry:
+	free(entry);
+
+	return err;
+}
+
 static int handle_replace(struct mbox_request *req)
 {
-	return -EOPNOTSUPP;
+	struct netdev_info *netdev_info;
+	int err;
+
+	netdev_info = get_netdev_info(req);
+	if (IS_ERR(netdev_info))
+		return PTR_ERR(netdev_info);
+
+	err = flow_table_insert_elem(netdev_info, &req->flow);
+	if (err)
+		return err;
+
+	return 0;
 }
 
 static int handle_delete(const struct mbox_request *req)
 {
-	return -EOPNOTSUPP;
+	struct netdev_info *netdev_info;
+	int err;
+
+	netdev_info = get_netdev_info(req);
+	if (IS_ERR(netdev_info))
+		return PTR_ERR(netdev_info);
+
+	err = flow_table_delete_elem(netdev_info, &req->flow);
+	if (err)
+		return err;
+
+	return 0;
 }
 
 static void loop(void)
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC PATCH bpf-next 07/14] xdp_flow: Add flow handling and basic actions in bpf prog
  2019-08-13 12:05 [RFC PATCH bpf-next 00/14] xdp_flow: Flow offload to XDP Toshiaki Makita
                   ` (5 preceding siblings ...)
  2019-08-13 12:05 ` [RFC PATCH bpf-next 06/14] xdp_flow: Add flow entry insertion/deletion logic in UMH Toshiaki Makita
@ 2019-08-13 12:05 ` Toshiaki Makita
  2019-08-13 12:05 ` [RFC PATCH bpf-next 08/14] xdp_flow: Implement flow replacement/deletion logic in xdp_flow kmod Toshiaki Makita
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 35+ messages in thread
From: Toshiaki Makita @ 2019-08-13 12:05 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko
  Cc: Toshiaki Makita, netdev, bpf, William Tu

BPF prog for XDP parses the packet and extracts the flow key. Then find
an entry from flow tables.
Only "accept" and "drop" actions are implemented at this point.

Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
---
 net/xdp_flow/xdp_flow_kern_bpf.c | 297 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 296 insertions(+), 1 deletion(-)

diff --git a/net/xdp_flow/xdp_flow_kern_bpf.c b/net/xdp_flow/xdp_flow_kern_bpf.c
index c101156..ceb8a92 100644
--- a/net/xdp_flow/xdp_flow_kern_bpf.c
+++ b/net/xdp_flow/xdp_flow_kern_bpf.c
@@ -1,9 +1,27 @@
 // SPDX-License-Identifier: GPL-2.0
 #define KBUILD_MODNAME "foo"
 #include <uapi/linux/bpf.h>
+#include <linux/in.h>
+#include <linux/if_ether.h>
+#include <linux/if_packet.h>
+#include <linux/if_vlan.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+#include <net/ipv6.h>
+#include <net/dsfield.h>
 #include <bpf_helpers.h>
 #include "umh_bpf.h"
 
+/* Used when the action only modifies the packet */
+#define _XDP_CONTINUE -1
+
+struct bpf_map_def SEC("maps") debug_stats = {
+	.type = BPF_MAP_TYPE_PERCPU_ARRAY,
+	.key_size = sizeof(u32),
+	.value_size = sizeof(long),
+	.max_entries = 256,
+};
+
 struct bpf_map_def SEC("maps") flow_masks_head = {
 	.type = BPF_MAP_TYPE_ARRAY,
 	.key_size = sizeof(u32),
@@ -25,10 +43,287 @@ struct bpf_map_def SEC("maps") flow_tables = {
 	.max_entries = MAX_FLOW_MASKS,
 };
 
+static inline void account_debug(int idx)
+{
+	long *cnt;
+
+	cnt = bpf_map_lookup_elem(&debug_stats, &idx);
+	if (cnt)
+		*cnt += 1;
+}
+
+static inline void account_action(int act)
+{
+	account_debug(act + 1);
+}
+
+static inline int action_accept(void)
+{
+	account_action(XDP_FLOW_ACTION_ACCEPT);
+	return XDP_PASS;
+}
+
+static inline int action_drop(void)
+{
+	account_action(XDP_FLOW_ACTION_DROP);
+	return XDP_DROP;
+}
+
+static inline int action_redirect(struct xdp_flow_action *action)
+{
+	account_action(XDP_FLOW_ACTION_REDIRECT);
+
+	// TODO: implement this
+	return XDP_ABORTED;
+}
+
+static inline int action_vlan_push(struct xdp_md *ctx,
+				   struct xdp_flow_action *action)
+{
+	account_action(XDP_FLOW_ACTION_VLAN_PUSH);
+
+	// TODO: implement this
+	return XDP_ABORTED;
+}
+
+static inline int action_vlan_pop(struct xdp_md *ctx,
+				  struct xdp_flow_action *action)
+{
+	account_action(XDP_FLOW_ACTION_VLAN_POP);
+
+	// TODO: implement this
+	return XDP_ABORTED;
+}
+
+static inline int action_vlan_mangle(struct xdp_md *ctx,
+				     struct xdp_flow_action *action)
+{
+	account_action(XDP_FLOW_ACTION_VLAN_MANGLE);
+
+	// TODO: implement this
+	return XDP_ABORTED;
+}
+
+static inline int action_mangle(struct xdp_md *ctx,
+				struct xdp_flow_action *action)
+{
+	account_action(XDP_FLOW_ACTION_MANGLE);
+
+	// TODO: implement this
+	return XDP_ABORTED;
+}
+
+static inline int action_csum(struct xdp_md *ctx,
+			      struct xdp_flow_action *action)
+{
+	account_action(XDP_FLOW_ACTION_CSUM);
+
+	// TODO: implement this
+	return XDP_ABORTED;
+}
+
+static inline void __ether_addr_copy(u8 *dst, const u8 *src)
+{
+	u16 *a = (u16 *)dst;
+	const u16 *b = (const u16 *)src;
+
+	a[0] = b[0];
+	a[1] = b[1];
+	a[2] = b[2];
+}
+
+static inline int parse_ipv4(void *data, u64 *nh_off, void *data_end,
+			     struct xdp_flow_key *key)
+{
+	struct iphdr *iph = data + *nh_off;
+
+	if (iph + 1 > data_end)
+		return -1;
+
+	key->ipv4.src = iph->saddr;
+	key->ipv4.dst = iph->daddr;
+	key->ip.ttl = iph->ttl;
+	key->ip.tos = iph->tos;
+	*nh_off += iph->ihl * 4;
+
+	return iph->protocol;
+}
+
+static inline int parse_ipv6(void *data, u64 *nh_off, void *data_end,
+			     struct xdp_flow_key *key)
+{
+	struct ipv6hdr *ip6h = data + *nh_off;
+
+	if (ip6h + 1 > data_end)
+		return -1;
+
+	key->ipv6.src = ip6h->saddr;
+	key->ipv6.dst = ip6h->daddr;
+	key->ip.ttl = ip6h->hop_limit;
+	key->ip.tos = ipv6_get_dsfield(ip6h);
+	*nh_off += sizeof(*ip6h);
+
+	if (ip6h->nexthdr == NEXTHDR_HOP ||
+	    ip6h->nexthdr == NEXTHDR_ROUTING ||
+	    ip6h->nexthdr == NEXTHDR_FRAGMENT ||
+	    ip6h->nexthdr == NEXTHDR_AUTH ||
+	    ip6h->nexthdr == NEXTHDR_NONE ||
+	    ip6h->nexthdr == NEXTHDR_DEST)
+		return 0;
+
+	return ip6h->nexthdr;
+}
+
+#define for_each_flow_mask(entry, head, idx, cnt) \
+	for (entry = bpf_map_lookup_elem(&flow_masks, (head)), \
+	     idx = *(head), cnt = 0; \
+	     entry != NULL && cnt < MAX_FLOW_MASKS; \
+	     idx = entry->next, \
+	     entry = bpf_map_lookup_elem(&flow_masks, &idx), cnt++)
+
+static inline void flow_mask(struct xdp_flow_key *mkey,
+			     const struct xdp_flow_key *key,
+			     const struct xdp_flow_key *mask)
+{
+	long *lmkey = (long *)mkey;
+	long *lmask = (long *)mask;
+	long *lkey = (long *)key;
+	int i;
+
+	for (i = 0; i < sizeof(*mkey); i += sizeof(long))
+		*lmkey++ = *lkey++ & *lmask++;
+}
+
 SEC("xdp_flow")
 int xdp_flow_prog(struct xdp_md *ctx)
 {
-	return XDP_PASS;
+	void *data_end = (void *)(long)ctx->data_end;
+	struct xdp_flow_actions *actions = NULL;
+	void *data = (void *)(long)ctx->data;
+	int cnt, idx, action_idx, zero = 0;
+	struct xdp_flow_mask_entry *entry;
+	struct ethhdr *eth = data;
+	struct xdp_flow_key key;
+	int rc = XDP_DROP;
+	long *value;
+	u16 h_proto;
+	u32 ipproto;
+	u64 nh_off;
+	int *head;
+
+	account_debug(0);
+
+	nh_off = sizeof(*eth);
+	if (data + nh_off > data_end)
+		return XDP_DROP;
+
+	__builtin_memset(&key, 0, sizeof(key));
+	h_proto = eth->h_proto;
+	__ether_addr_copy(key.eth.dst, eth->h_dest);
+	__ether_addr_copy(key.eth.src, eth->h_source);
+
+	if (eth_type_vlan(h_proto)) {
+		struct vlan_hdr *vhdr;
+
+		vhdr = data + nh_off;
+		nh_off += sizeof(*vhdr);
+		if (data + nh_off > data_end)
+			return XDP_DROP;
+		key.vlan.tpid = h_proto;
+		key.vlan.tci = vhdr->h_vlan_TCI;
+		h_proto = vhdr->h_vlan_encapsulated_proto;
+	}
+	key.eth.type = h_proto;
+
+	if (h_proto == htons(ETH_P_IP))
+		ipproto = parse_ipv4(data, &nh_off, data_end, &key);
+	else if (h_proto == htons(ETH_P_IPV6))
+		ipproto = parse_ipv6(data, &nh_off, data_end, &key);
+	else
+		ipproto = 0;
+	if (ipproto < 0)
+		return XDP_DROP;
+	key.ip.proto = ipproto;
+
+	if (ipproto == IPPROTO_TCP) {
+		struct tcphdr *th = data + nh_off;
+
+		if (th + 1 > data_end)
+			return XDP_DROP;
+
+		key.l4port.src = th->source;
+		key.l4port.dst = th->dest;
+		key.tcp.flags = (*(__be16 *)&tcp_flag_word(th) & htons(0x0FFF));
+	} else if (ipproto == IPPROTO_UDP) {
+		struct udphdr *uh = data + nh_off;
+
+		if (uh + 1 > data_end)
+			return XDP_DROP;
+
+		key.l4port.src = uh->source;
+		key.l4port.dst = uh->dest;
+	}
+
+	head = bpf_map_lookup_elem(&flow_masks_head, &zero);
+	if (!head)
+		return XDP_PASS;
+
+	for_each_flow_mask(entry, head, idx, cnt) {
+		struct xdp_flow_key mkey;
+		void *flow_table;
+
+		flow_table = bpf_map_lookup_elem(&flow_tables, &idx);
+		if (!flow_table)
+			return XDP_ABORTED;
+
+		flow_mask(&mkey, &key, &entry->mask);
+		actions = bpf_map_lookup_elem(flow_table, &mkey);
+		if (actions)
+			break;
+	}
+
+	if (!actions)
+		return XDP_PASS;
+
+	for (action_idx = 0;
+	     action_idx < actions->num_actions &&
+	     action_idx < MAX_XDP_FLOW_ACTIONS;
+	     action_idx++) {
+		struct xdp_flow_action *action;
+		int act;
+
+		action = &actions->actions[action_idx];
+
+		switch (action->id) {
+		case XDP_FLOW_ACTION_ACCEPT:
+			return action_accept();
+		case XDP_FLOW_ACTION_DROP:
+			return action_drop();
+		case XDP_FLOW_ACTION_REDIRECT:
+			return action_redirect(action);
+		case XDP_FLOW_ACTION_VLAN_PUSH:
+			act = action_vlan_push(ctx, action);
+			break;
+		case XDP_FLOW_ACTION_VLAN_POP:
+			act = action_vlan_pop(ctx, action);
+			break;
+		case XDP_FLOW_ACTION_VLAN_MANGLE:
+			act = action_vlan_mangle(ctx, action);
+			break;
+		case XDP_FLOW_ACTION_MANGLE:
+			act = action_mangle(ctx, action);
+			break;
+		case XDP_FLOW_ACTION_CSUM:
+			act = action_csum(ctx, action);
+			break;
+		default:
+			return XDP_ABORTED;
+		}
+		if (act != _XDP_CONTINUE)
+			return act;
+	}
+
+	return XDP_ABORTED;
 }
 
 char _license[] SEC("license") = "GPL";
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC PATCH bpf-next 08/14] xdp_flow: Implement flow replacement/deletion logic in xdp_flow kmod
  2019-08-13 12:05 [RFC PATCH bpf-next 00/14] xdp_flow: Flow offload to XDP Toshiaki Makita
                   ` (6 preceding siblings ...)
  2019-08-13 12:05 ` [RFC PATCH bpf-next 07/14] xdp_flow: Add flow handling and basic actions in bpf prog Toshiaki Makita
@ 2019-08-13 12:05 ` Toshiaki Makita
  2019-08-13 12:05 ` [RFC PATCH bpf-next 09/14] xdp_flow: Add netdev feature for enabling TC flower offload to XDP Toshiaki Makita
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 35+ messages in thread
From: Toshiaki Makita @ 2019-08-13 12:05 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko
  Cc: Toshiaki Makita, netdev, bpf, William Tu

As struct flow_rule has descrete storages for flow_dissector and
key/mask containers, we need to serialize them in some way to pass them
to UMH.

Convert flow_rule into flow key form used in xdp_flow bpf prog and
pass it.

Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
---
 net/xdp_flow/xdp_flow_kern_mod.c | 334 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 331 insertions(+), 3 deletions(-)

diff --git a/net/xdp_flow/xdp_flow_kern_mod.c b/net/xdp_flow/xdp_flow_kern_mod.c
index 9cf527d..fe925db 100644
--- a/net/xdp_flow/xdp_flow_kern_mod.c
+++ b/net/xdp_flow/xdp_flow_kern_mod.c
@@ -3,13 +3,266 @@
 #include <linux/module.h>
 #include <linux/umh.h>
 #include <linux/sched/signal.h>
+#include <linux/rhashtable.h>
 #include <net/pkt_cls.h>
 #include <net/flow_offload_xdp.h>
 #include "msgfmt.h"
 
+struct xdp_flow_rule {
+	struct rhash_head ht_node;
+	unsigned long cookie;
+	struct xdp_flow_key key;
+	struct xdp_flow_key mask;
+};
+
+static const struct rhashtable_params rules_params = {
+	.key_len = sizeof(unsigned long),
+	.key_offset = offsetof(struct xdp_flow_rule, cookie),
+	.head_offset = offsetof(struct xdp_flow_rule, ht_node),
+	.automatic_shrinking = true,
+};
+
+static struct rhashtable rules;
+
 extern char xdp_flow_umh_start;
 extern char xdp_flow_umh_end;
 
+static int xdp_flow_parse_actions(struct xdp_flow_actions *actions,
+				  struct flow_action *flow_action,
+				  struct netlink_ext_ack *extack)
+{
+	const struct flow_action_entry *act;
+	int i;
+
+	if (!flow_action_has_entries(flow_action))
+		return 0;
+
+	if (flow_action->num_entries > MAX_XDP_FLOW_ACTIONS)
+		return -ENOBUFS;
+
+	flow_action_for_each(i, act, flow_action) {
+		struct xdp_flow_action *action = &actions->actions[i];
+
+		switch (act->id) {
+		case FLOW_ACTION_ACCEPT:
+			action->id = XDP_FLOW_ACTION_ACCEPT;
+			break;
+		case FLOW_ACTION_DROP:
+			action->id = XDP_FLOW_ACTION_DROP;
+			break;
+		case FLOW_ACTION_REDIRECT:
+		case FLOW_ACTION_VLAN_PUSH:
+		case FLOW_ACTION_VLAN_POP:
+		case FLOW_ACTION_VLAN_MANGLE:
+		case FLOW_ACTION_MANGLE:
+		case FLOW_ACTION_CSUM:
+			/* TODO: implement these */
+			/* fall through */
+		default:
+			NL_SET_ERR_MSG_MOD(extack, "Unsupported action");
+			return -EOPNOTSUPP;
+		}
+	}
+	actions->num_actions = flow_action->num_entries;
+
+	return 0;
+}
+
+static int xdp_flow_parse_ports(struct xdp_flow_key *key,
+				struct xdp_flow_key *mask,
+				struct flow_cls_offload *f, u8 ip_proto)
+{
+	const struct flow_rule *rule = flow_cls_offload_flow_rule(f);
+	struct flow_match_ports match;
+
+	if (!flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_PORTS))
+		return 0;
+
+	if (ip_proto != IPPROTO_TCP && ip_proto != IPPROTO_UDP) {
+		NL_SET_ERR_MSG_MOD(f->common.extack,
+				   "Only UDP and TCP keys are supported");
+		return -EINVAL;
+	}
+
+	flow_rule_match_ports(rule, &match);
+
+	key->l4port.src = match.key->src;
+	mask->l4port.src = match.mask->src;
+	key->l4port.dst = match.key->dst;
+	mask->l4port.dst = match.mask->dst;
+
+	return 0;
+}
+
+static int xdp_flow_parse_tcp(struct xdp_flow_key *key,
+			      struct xdp_flow_key *mask,
+			      struct flow_cls_offload *f, u8 ip_proto)
+{
+	const struct flow_rule *rule = flow_cls_offload_flow_rule(f);
+	struct flow_match_tcp match;
+
+	if (!flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_TCP))
+		return 0;
+
+	if (ip_proto != IPPROTO_TCP) {
+		NL_SET_ERR_MSG_MOD(f->common.extack,
+				   "TCP keys supported only for TCP");
+		return -EINVAL;
+	}
+
+	flow_rule_match_tcp(rule, &match);
+
+	key->tcp.flags = match.key->flags;
+	mask->tcp.flags = match.mask->flags;
+
+	return 0;
+}
+
+static int xdp_flow_parse_ip(struct xdp_flow_key *key,
+			     struct xdp_flow_key *mask,
+			     struct flow_cls_offload *f, __be16 n_proto)
+{
+	const struct flow_rule *rule = flow_cls_offload_flow_rule(f);
+	struct flow_match_ip match;
+
+	if (!flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_IP))
+		return 0;
+
+	if (n_proto != htons(ETH_P_IP) && n_proto != htons(ETH_P_IPV6)) {
+		NL_SET_ERR_MSG_MOD(f->common.extack,
+				   "IP keys supported only for IPv4/6");
+		return -EINVAL;
+	}
+
+	flow_rule_match_ip(rule, &match);
+
+	key->ip.ttl = match.key->ttl;
+	mask->ip.ttl = match.mask->ttl;
+	key->ip.tos = match.key->tos;
+	mask->ip.tos = match.mask->tos;
+
+	return 0;
+}
+
+static int xdp_flow_parse(struct xdp_flow_key *key, struct xdp_flow_key *mask,
+			  struct xdp_flow_actions *actions,
+			  struct flow_cls_offload *f)
+{
+	struct flow_rule *rule = flow_cls_offload_flow_rule(f);
+	struct flow_dissector *dissector = rule->match.dissector;
+	__be16 n_proto = 0, n_proto_mask = 0;
+	u16 addr_type = 0;
+	u8 ip_proto = 0;
+	int err;
+
+	if (dissector->used_keys &
+	    ~(BIT(FLOW_DISSECTOR_KEY_CONTROL) |
+	      BIT(FLOW_DISSECTOR_KEY_BASIC) |
+	      BIT(FLOW_DISSECTOR_KEY_ETH_ADDRS) |
+	      BIT(FLOW_DISSECTOR_KEY_IPV4_ADDRS) |
+	      BIT(FLOW_DISSECTOR_KEY_IPV6_ADDRS) |
+	      BIT(FLOW_DISSECTOR_KEY_PORTS) |
+	      BIT(FLOW_DISSECTOR_KEY_TCP) |
+	      BIT(FLOW_DISSECTOR_KEY_IP) |
+	      BIT(FLOW_DISSECTOR_KEY_VLAN))) {
+		NL_SET_ERR_MSG_MOD(f->common.extack, "Unsupported key");
+		return -EOPNOTSUPP;
+	}
+
+	if (flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_CONTROL)) {
+		struct flow_match_control match;
+
+		flow_rule_match_control(rule, &match);
+		addr_type = match.key->addr_type;
+	}
+
+	if (flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_BASIC)) {
+		struct flow_match_basic match;
+
+		flow_rule_match_basic(rule, &match);
+
+		n_proto = match.key->n_proto;
+		n_proto_mask = match.mask->n_proto;
+		if (n_proto == htons(ETH_P_ALL)) {
+			n_proto = 0;
+			n_proto_mask = 0;
+		}
+
+		key->eth.type = n_proto;
+		mask->eth.type = n_proto_mask;
+
+		if (match.mask->ip_proto) {
+			ip_proto = match.key->ip_proto;
+			key->ip.proto = ip_proto;
+			mask->ip.proto = match.mask->ip_proto;
+		}
+	}
+
+	if (flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_ETH_ADDRS)) {
+		struct flow_match_eth_addrs match;
+
+		flow_rule_match_eth_addrs(rule, &match);
+
+		ether_addr_copy(key->eth.dst, match.key->dst);
+		ether_addr_copy(mask->eth.dst, match.mask->dst);
+		ether_addr_copy(key->eth.src, match.key->src);
+		ether_addr_copy(mask->eth.src, match.mask->src);
+	}
+
+	if (flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_VLAN)) {
+		struct flow_match_vlan match;
+
+		flow_rule_match_vlan(rule, &match);
+
+		key->vlan.tpid = match.key->vlan_tpid;
+		mask->vlan.tpid = match.mask->vlan_tpid;
+		key->vlan.tci = htons(match.key->vlan_id |
+				      (match.key->vlan_priority <<
+				       VLAN_PRIO_SHIFT));
+		mask->vlan.tci = htons(match.mask->vlan_id |
+				       (match.mask->vlan_priority <<
+					VLAN_PRIO_SHIFT));
+	}
+
+	if (addr_type == FLOW_DISSECTOR_KEY_IPV4_ADDRS) {
+		struct flow_match_ipv4_addrs match;
+
+		flow_rule_match_ipv4_addrs(rule, &match);
+
+		key->ipv4.src = match.key->src;
+		mask->ipv4.src = match.mask->src;
+		key->ipv4.dst = match.key->dst;
+		mask->ipv4.dst = match.mask->dst;
+	}
+
+	if (addr_type == FLOW_DISSECTOR_KEY_IPV6_ADDRS) {
+		struct flow_match_ipv6_addrs match;
+
+		flow_rule_match_ipv6_addrs(rule, &match);
+
+		key->ipv6.src = match.key->src;
+		mask->ipv6.src = match.mask->src;
+		key->ipv6.dst = match.key->dst;
+		mask->ipv6.dst = match.mask->dst;
+	}
+
+	err = xdp_flow_parse_ports(key, mask, f, ip_proto);
+	if (err)
+		return err;
+	err = xdp_flow_parse_tcp(key, mask, f, ip_proto);
+	if (err)
+		return err;
+
+	err = xdp_flow_parse_ip(key, mask, f, n_proto);
+	if (err)
+		return err;
+
+	// TODO: encapsulation related tasks
+
+	return xdp_flow_parse_actions(actions, &rule->action,
+					   f->common.extack);
+}
+
 static void shutdown_umh(void)
 {
 	struct task_struct *tsk;
@@ -60,12 +313,78 @@ static int transact_umh(struct mbox_request *req, u32 *id)
 
 static int xdp_flow_replace(struct net_device *dev, struct flow_cls_offload *f)
 {
-	return -EOPNOTSUPP;
+	struct xdp_flow_rule *rule;
+	struct mbox_request *req;
+	int err;
+
+	req = kzalloc(sizeof(*req), GFP_KERNEL);
+	if (!req)
+		return -ENOMEM;
+
+	rule = kzalloc(sizeof(*rule), GFP_KERNEL);
+	if (!rule) {
+		err = -ENOMEM;
+		goto out;
+	}
+
+	req->flow.priority = f->common.prio >> 16;
+	err = xdp_flow_parse(&req->flow.key, &req->flow.mask,
+			     &req->flow.actions, f);
+	if (err)
+		goto err_parse;
+
+	rule->cookie = f->cookie;
+	rule->key = req->flow.key;
+	rule->mask = req->flow.mask;
+	err = rhashtable_insert_fast(&rules, &rule->ht_node, rules_params);
+	if (err)
+		goto err_parse;
+
+	req->cmd = XDP_FLOW_CMD_REPLACE;
+	req->ifindex = dev->ifindex;
+	err = transact_umh(req, NULL);
+	if (err)
+		goto err_umh;
+out:
+	kfree(req);
+
+	return err;
+err_umh:
+	rhashtable_remove_fast(&rules, &rule->ht_node, rules_params);
+err_parse:
+	kfree(rule);
+	goto out;
 }
 
 int xdp_flow_destroy(struct net_device *dev, struct flow_cls_offload *f)
 {
-	return -EOPNOTSUPP;
+	struct mbox_request *req;
+	struct xdp_flow_rule *rule;
+	int err;
+
+	rule = rhashtable_lookup_fast(&rules, &f->cookie, rules_params);
+	if (!rule)
+		return 0;
+
+	req = kzalloc(sizeof(*req), GFP_KERNEL);
+	if (!req)
+		return -ENOMEM;
+
+	req->flow.priority = f->common.prio >> 16;
+	req->flow.key = rule->key;
+	req->flow.mask = rule->mask;
+	req->cmd = XDP_FLOW_CMD_DELETE;
+	req->ifindex = dev->ifindex;
+	err = transact_umh(req, NULL);
+
+	kfree(req);
+
+	if (!err) {
+		rhashtable_remove_fast(&rules, &rule->ht_node, rules_params);
+		kfree(rule);
+	}
+
+	return err;
 }
 
 static int xdp_flow_setup_flower(struct net_device *dev,
@@ -267,7 +586,11 @@ static int start_umh(void)
 
 static int __init load_umh(void)
 {
-	int err = 0;
+	int err;
+
+	err = rhashtable_init(&rules, &rules_params);
+	if (err)
+		return err;
 
 	mutex_lock(&xdp_flow_ops.lock);
 	if (!xdp_flow_ops.stop) {
@@ -283,8 +606,12 @@ static int __init load_umh(void)
 	xdp_flow_ops.setup = &xdp_flow_setup;
 	xdp_flow_ops.start = &start_umh;
 	xdp_flow_ops.module = THIS_MODULE;
+
+	mutex_unlock(&xdp_flow_ops.lock);
+	return 0;
 err:
 	mutex_unlock(&xdp_flow_ops.lock);
+	rhashtable_destroy(&rules);
 	return err;
 }
 
@@ -297,6 +624,7 @@ static void __exit fini_umh(void)
 	xdp_flow_ops.setup = NULL;
 	xdp_flow_ops.setup_cb = NULL;
 	mutex_unlock(&xdp_flow_ops.lock);
+	rhashtable_destroy(&rules);
 }
 module_init(load_umh);
 module_exit(fini_umh);
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC PATCH bpf-next 09/14] xdp_flow: Add netdev feature for enabling TC flower offload to XDP
  2019-08-13 12:05 [RFC PATCH bpf-next 00/14] xdp_flow: Flow offload to XDP Toshiaki Makita
                   ` (7 preceding siblings ...)
  2019-08-13 12:05 ` [RFC PATCH bpf-next 08/14] xdp_flow: Implement flow replacement/deletion logic in xdp_flow kmod Toshiaki Makita
@ 2019-08-13 12:05 ` Toshiaki Makita
  2019-08-13 12:05 ` [RFC PATCH bpf-next 10/14] xdp_flow: Implement redirect action Toshiaki Makita
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 35+ messages in thread
From: Toshiaki Makita @ 2019-08-13 12:05 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko
  Cc: Toshiaki Makita, netdev, bpf, William Tu

The usage would be like this:

 $ ethtool -K eth0 tc-offload-xdp on
 $ tc qdisc add dev eth0 clsact
 $ tc filter add dev eth0 ingress protocol ip flower skip_sw ...

Then the filters offloaded to XDP are marked as "in_hw".

If the tc flow block is created when tc-offload-xdp is enabled on the
device, the block is internally marked as xdp and only can be offloaded
to XDP.
The reason not to allow HW-offload and XDP-offload at the same time is
to avoid the situation where offloading to only one of them succeeds.
If we allow offloading to both, users cannot know which offload
succeeded.

NOTE: This makes flows offloaded to XDP look as if they are HW
offloaded, since they will be marked as "in_hw". This could be confusing.
Maybe we can add another status "in_xdp"? Then we can allow both of HW-
and XDP-offload at the same time.

Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
---
 include/linux/netdev_features.h  |  2 ++
 include/net/pkt_cls.h            |  5 +++
 include/net/sch_generic.h        |  1 +
 net/core/dev.c                   |  2 ++
 net/core/ethtool.c               |  1 +
 net/sched/cls_api.c              | 67 +++++++++++++++++++++++++++++++++++++---
 net/xdp_flow/xdp_flow_kern_mod.c |  6 ++++
 7 files changed, 80 insertions(+), 4 deletions(-)

diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
index 4b19c54..ddd201e 100644
--- a/include/linux/netdev_features.h
+++ b/include/linux/netdev_features.h
@@ -80,6 +80,7 @@ enum {
 
 	NETIF_F_GRO_HW_BIT,		/* Hardware Generic receive offload */
 	NETIF_F_HW_TLS_RECORD_BIT,	/* Offload TLS record */
+	NETIF_F_XDP_TC_BIT,		/* Offload TC to XDP */
 
 	/*
 	 * Add your fresh new feature above and remember to update
@@ -150,6 +151,7 @@ enum {
 #define NETIF_F_GSO_UDP_L4	__NETIF_F(GSO_UDP_L4)
 #define NETIF_F_HW_TLS_TX	__NETIF_F(HW_TLS_TX)
 #define NETIF_F_HW_TLS_RX	__NETIF_F(HW_TLS_RX)
+#define NETIF_F_XDP_TC		__NETIF_F(XDP_TC)
 
 /* Finds the next feature with the highest number of the range of start till 0.
  */
diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index e429809..d190aae 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -610,6 +610,11 @@ static inline bool tc_can_offload_extack(const struct net_device *dev,
 	return true;
 }
 
+static inline bool tc_xdp_offload_enabled(const struct net_device *dev)
+{
+	return dev->features & NETIF_F_XDP_TC;
+}
+
 static inline bool tc_skip_hw(u32 flags)
 {
 	return (flags & TCA_CLS_FLAGS_SKIP_HW) ? true : false;
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 6b6b012..a4d90b5 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -402,6 +402,7 @@ struct tcf_block {
 	struct flow_block flow_block;
 	struct list_head owner_list;
 	bool keep_dst;
+	bool xdp;
 	unsigned int offloadcnt; /* Number of oddloaded filters */
 	unsigned int nooffloaddevcnt; /* Number of devs unable to do offload */
 	struct {
diff --git a/net/core/dev.c b/net/core/dev.c
index a45d2e4..d1f980d 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -8680,6 +8680,8 @@ int register_netdevice(struct net_device *dev)
 	 * software offloads (GSO and GRO).
 	 */
 	dev->hw_features |= NETIF_F_SOFT_FEATURES;
+	if (IS_ENABLED(CONFIG_XDP_FLOW))
+		dev->hw_features |= NETIF_F_XDP_TC;
 	dev->features |= NETIF_F_SOFT_FEATURES;
 
 	if (dev->netdev_ops->ndo_udp_tunnel_add) {
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index 6288e69..c7e61cf 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -111,6 +111,7 @@ int ethtool_op_get_ts_info(struct net_device *dev, struct ethtool_ts_info *info)
 	[NETIF_F_HW_TLS_RECORD_BIT] =	"tls-hw-record",
 	[NETIF_F_HW_TLS_TX_BIT] =	 "tls-hw-tx-offload",
 	[NETIF_F_HW_TLS_RX_BIT] =	 "tls-hw-rx-offload",
+	[NETIF_F_XDP_TC_BIT] =		 "tc-offload-xdp",
 };
 
 static const char
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 3565d9a..4c89bab 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -37,6 +37,7 @@
 #include <net/tc_act/tc_skbedit.h>
 #include <net/tc_act/tc_ct.h>
 #include <net/tc_act/tc_mpls.h>
+#include <net/flow_offload_xdp.h>
 
 extern const struct nla_policy rtm_tca_policy[TCA_MAX + 1];
 
@@ -806,7 +807,7 @@ static int tcf_block_offload_cmd(struct tcf_block *block,
 				 struct net_device *dev,
 				 struct tcf_block_ext_info *ei,
 				 enum flow_block_command command,
-				 struct netlink_ext_ack *extack)
+				 bool xdp, struct netlink_ext_ack *extack)
 {
 	struct flow_block_offload bo = {};
 	int err;
@@ -819,13 +820,39 @@ static int tcf_block_offload_cmd(struct tcf_block *block,
 	bo.extack = extack;
 	INIT_LIST_HEAD(&bo.cb_list);
 
-	err = dev->netdev_ops->ndo_setup_tc(dev, TC_SETUP_BLOCK, &bo);
+	if (xdp)
+		err = xdp_flow_setup_block(dev, &bo);
+	else
+		err = dev->netdev_ops->ndo_setup_tc(dev, TC_SETUP_BLOCK, &bo);
 	if (err < 0)
 		return err;
 
 	return tcf_block_setup(block, &bo);
 }
 
+static int tcf_block_offload_bind_xdp(struct tcf_block *block, struct Qdisc *q,
+				      struct tcf_block_ext_info *ei,
+				      struct netlink_ext_ack *extack)
+{
+	struct net_device *dev = q->dev_queue->dev;
+	int err;
+
+	if (!tc_xdp_offload_enabled(dev) && tcf_block_offload_in_use(block)) {
+		NL_SET_ERR_MSG(extack,
+			       "Bind to offloaded block failed as dev has tc-offload-xdp disabled");
+		return -EOPNOTSUPP;
+	}
+
+	err = tcf_block_offload_cmd(block, dev, ei, FLOW_BLOCK_BIND, true,
+				    extack);
+	if (err == -EOPNOTSUPP) {
+		block->nooffloaddevcnt++;
+		err = 0;
+	}
+
+	return err;
+}
+
 static int tcf_block_offload_bind(struct tcf_block *block, struct Qdisc *q,
 				  struct tcf_block_ext_info *ei,
 				  struct netlink_ext_ack *extack)
@@ -833,6 +860,15 @@ static int tcf_block_offload_bind(struct tcf_block *block, struct Qdisc *q,
 	struct net_device *dev = q->dev_queue->dev;
 	int err;
 
+	if (block->xdp)
+		return tcf_block_offload_bind_xdp(block, q, ei, extack);
+
+	if (tc_xdp_offload_enabled(dev)) {
+		NL_SET_ERR_MSG(extack,
+			       "Cannot bind to block created with tc-offload-xdp disabled");
+		return -EOPNOTSUPP;
+	}
+
 	if (!dev->netdev_ops->ndo_setup_tc)
 		goto no_offload_dev_inc;
 
@@ -844,7 +880,8 @@ static int tcf_block_offload_bind(struct tcf_block *block, struct Qdisc *q,
 		return -EOPNOTSUPP;
 	}
 
-	err = tcf_block_offload_cmd(block, dev, ei, FLOW_BLOCK_BIND, extack);
+	err = tcf_block_offload_cmd(block, dev, ei, FLOW_BLOCK_BIND, false,
+				    extack);
 	if (err == -EOPNOTSUPP)
 		goto no_offload_dev_inc;
 	if (err)
@@ -861,17 +898,35 @@ static int tcf_block_offload_bind(struct tcf_block *block, struct Qdisc *q,
 	return 0;
 }
 
+static void tcf_block_offload_unbind_xdp(struct tcf_block *block,
+					 struct net_device *dev,
+					 struct tcf_block_ext_info *ei)
+{
+	int err;
+
+	err = tcf_block_offload_cmd(block, dev, ei, FLOW_BLOCK_UNBIND, true,
+				    NULL);
+	if (err == -EOPNOTSUPP)
+		WARN_ON(block->nooffloaddevcnt-- == 0);
+}
+
 static void tcf_block_offload_unbind(struct tcf_block *block, struct Qdisc *q,
 				     struct tcf_block_ext_info *ei)
 {
 	struct net_device *dev = q->dev_queue->dev;
 	int err;
 
+	if (block->xdp) {
+		tcf_block_offload_unbind_xdp(block, dev, ei);
+		return;
+	}
+
 	tc_indr_block_call(block, dev, ei, FLOW_BLOCK_UNBIND, NULL);
 
 	if (!dev->netdev_ops->ndo_setup_tc)
 		goto no_offload_dev_dec;
-	err = tcf_block_offload_cmd(block, dev, ei, FLOW_BLOCK_UNBIND, NULL);
+	err = tcf_block_offload_cmd(block, dev, ei, FLOW_BLOCK_UNBIND, false,
+				    NULL);
 	if (err == -EOPNOTSUPP)
 		goto no_offload_dev_dec;
 	return;
@@ -1004,6 +1059,10 @@ static struct tcf_block *tcf_block_create(struct net *net, struct Qdisc *q,
 	/* Don't store q pointer for blocks which are shared */
 	if (!tcf_block_shared(block))
 		block->q = q;
+
+	if (tc_xdp_offload_enabled(q->dev_queue->dev))
+		block->xdp = true;
+
 	return block;
 }
 
diff --git a/net/xdp_flow/xdp_flow_kern_mod.c b/net/xdp_flow/xdp_flow_kern_mod.c
index fe925db..891b18c 100644
--- a/net/xdp_flow/xdp_flow_kern_mod.c
+++ b/net/xdp_flow/xdp_flow_kern_mod.c
@@ -410,6 +410,12 @@ static int xdp_flow_setup_block_cb(enum tc_setup_type type, void *type_data,
 	struct net_device *dev = cb_priv;
 	int err = 0;
 
+	if (!tc_xdp_offload_enabled(dev)) {
+		NL_SET_ERR_MSG(common->extack,
+			       "tc-offload-xdp is disabled on net device");
+		return -EOPNOTSUPP;
+	}
+
 	if (common->chain_index) {
 		NL_SET_ERR_MSG(common->extack,
 			       "xdp_flow supports only offload of chain 0");
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC PATCH bpf-next 10/14] xdp_flow: Implement redirect action
  2019-08-13 12:05 [RFC PATCH bpf-next 00/14] xdp_flow: Flow offload to XDP Toshiaki Makita
                   ` (8 preceding siblings ...)
  2019-08-13 12:05 ` [RFC PATCH bpf-next 09/14] xdp_flow: Add netdev feature for enabling TC flower offload to XDP Toshiaki Makita
@ 2019-08-13 12:05 ` Toshiaki Makita
  2019-08-13 12:05 ` [RFC PATCH bpf-next 11/14] xdp_flow: Implement vlan_push action Toshiaki Makita
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 35+ messages in thread
From: Toshiaki Makita @ 2019-08-13 12:05 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko
  Cc: Toshiaki Makita, netdev, bpf, William Tu

Add a devmap for XDP_REDIRECT and use it for redirect action.

Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
---
 net/xdp_flow/umh_bpf.h           |   1 +
 net/xdp_flow/xdp_flow_kern_bpf.c |  14 +++-
 net/xdp_flow/xdp_flow_kern_mod.c |   3 +
 net/xdp_flow/xdp_flow_umh.c      | 164 +++++++++++++++++++++++++++++++++++++--
 4 files changed, 175 insertions(+), 7 deletions(-)

diff --git a/net/xdp_flow/umh_bpf.h b/net/xdp_flow/umh_bpf.h
index 4e4633f..a279d0a1 100644
--- a/net/xdp_flow/umh_bpf.h
+++ b/net/xdp_flow/umh_bpf.h
@@ -4,6 +4,7 @@
 
 #include "msgfmt.h"
 
+#define MAX_PORTS 65536
 #define MAX_FLOWS 1024
 #define MAX_FLOW_MASKS 255
 #define FLOW_MASKS_TAIL 255
diff --git a/net/xdp_flow/xdp_flow_kern_bpf.c b/net/xdp_flow/xdp_flow_kern_bpf.c
index ceb8a92..8f3d359 100644
--- a/net/xdp_flow/xdp_flow_kern_bpf.c
+++ b/net/xdp_flow/xdp_flow_kern_bpf.c
@@ -22,6 +22,13 @@ struct bpf_map_def SEC("maps") debug_stats = {
 	.max_entries = 256,
 };
 
+struct bpf_map_def SEC("maps") output_map = {
+	.type = BPF_MAP_TYPE_DEVMAP,
+	.key_size = sizeof(int),
+	.value_size = sizeof(int),
+	.max_entries = MAX_PORTS,
+};
+
 struct bpf_map_def SEC("maps") flow_masks_head = {
 	.type = BPF_MAP_TYPE_ARRAY,
 	.key_size = sizeof(u32),
@@ -71,10 +78,13 @@ static inline int action_drop(void)
 
 static inline int action_redirect(struct xdp_flow_action *action)
 {
+	int tx_port;
+
 	account_action(XDP_FLOW_ACTION_REDIRECT);
 
-	// TODO: implement this
-	return XDP_ABORTED;
+	tx_port = action->ifindex;
+
+	return bpf_redirect_map(&output_map, tx_port, 0);
 }
 
 static inline int action_vlan_push(struct xdp_md *ctx,
diff --git a/net/xdp_flow/xdp_flow_kern_mod.c b/net/xdp_flow/xdp_flow_kern_mod.c
index 891b18c..caa4968 100644
--- a/net/xdp_flow/xdp_flow_kern_mod.c
+++ b/net/xdp_flow/xdp_flow_kern_mod.c
@@ -51,6 +51,9 @@ static int xdp_flow_parse_actions(struct xdp_flow_actions *actions,
 			action->id = XDP_FLOW_ACTION_DROP;
 			break;
 		case FLOW_ACTION_REDIRECT:
+			action->id = XDP_FLOW_ACTION_REDIRECT;
+			action->ifindex = act->dev->ifindex;
+			break;
 		case FLOW_ACTION_VLAN_PUSH:
 		case FLOW_ACTION_VLAN_POP:
 		case FLOW_ACTION_VLAN_MANGLE:
diff --git a/net/xdp_flow/xdp_flow_umh.c b/net/xdp_flow/xdp_flow_umh.c
index 9a4769b..cbb766a 100644
--- a/net/xdp_flow/xdp_flow_umh.c
+++ b/net/xdp_flow/xdp_flow_umh.c
@@ -18,6 +18,7 @@
 extern char xdp_flow_bpf_start;
 extern char xdp_flow_bpf_end;
 int progfile_fd;
+int output_map_fd;
 
 #define zalloc(size) calloc(1, (size))
 
@@ -40,12 +41,22 @@ struct netdev_info {
 	struct netdev_info_key key;
 	struct hlist_node node;
 	struct bpf_object *obj;
+	int devmap_idx;
 	int free_slot_top;
 	int free_slots[MAX_FLOW_MASKS];
 };
 
 DEFINE_HASHTABLE(netdev_info_table, 16);
 
+struct devmap_idx_node {
+	int devmap_idx;
+	struct hlist_node node;
+};
+
+DEFINE_HASHTABLE(devmap_idx_table, 16);
+
+int max_devmap_idx;
+
 static int libbpf_err(int err, char *errbuf)
 {
 	libbpf_strerror(err, errbuf, ERRBUF_SIZE);
@@ -90,6 +101,15 @@ static int setup(void)
 		goto err;
 	}
 
+	output_map_fd = bpf_create_map(BPF_MAP_TYPE_DEVMAP, sizeof(int),
+				       sizeof(int), MAX_PORTS, 0);
+	if (output_map_fd < 0) {
+		err = -errno;
+		pr_err("map creation for output_map failed: %s\n",
+		       strerror(errno));
+		goto err;
+	}
+
 	return 0;
 err:
 	close(progfile_fd);
@@ -97,10 +117,23 @@ static int setup(void)
 	return err;
 }
 
-static int load_bpf(int ifindex, struct bpf_object **objp)
+static void delete_output_map_elem(int idx)
+{
+	char errbuf[ERRBUF_SIZE];
+	int err;
+
+	err = bpf_map_delete_elem(output_map_fd, &idx);
+	if (err) {
+		libbpf_err(err, errbuf);
+		pr_warn("Failed to delete idx %d from output_map: %s\n",
+			idx, errbuf);
+	}
+}
+
+static int load_bpf(int ifindex, int devmap_idx, struct bpf_object **objp)
 {
 	int prog_fd, flow_tables_fd, flow_meta_fd, flow_masks_head_fd, err;
-	struct bpf_map *flow_tables, *flow_masks_head;
+	struct bpf_map *output_map, *flow_tables, *flow_masks_head;
 	int zero = 0, flow_masks_tail = FLOW_MASKS_TAIL;
 	struct bpf_object_open_attr attr = {};
 	char path[256], errbuf[ERRBUF_SIZE];
@@ -133,6 +166,27 @@ static int load_bpf(int ifindex, struct bpf_object **objp)
 	bpf_object__for_each_program(prog, obj)
 		bpf_program__set_type(prog, attr.prog_type);
 
+	output_map = bpf_object__find_map_by_name(obj, "output_map");
+	if (!output_map) {
+		pr_err("Cannot find output_map\n");
+		err = -ENOENT;
+		goto err_obj;
+	}
+
+	err = bpf_map__reuse_fd(output_map, output_map_fd);
+	if (err) {
+		err = libbpf_err(err, errbuf);
+		pr_err("Failed to reuse output_map fd: %s\n", errbuf);
+		goto err_obj;
+	}
+
+	if (bpf_map_update_elem(output_map_fd, &devmap_idx, &ifindex, 0)) {
+		err = -errno;
+		pr_err("Failed to insert idx %d if %d into output_map: %s\n",
+		       devmap_idx, ifindex, strerror(errno));
+		goto err_obj;
+	}
+
 	flow_meta_fd = bpf_create_map(BPF_MAP_TYPE_HASH,
 				      sizeof(struct xdp_flow_key),
 				      sizeof(struct xdp_flow_actions),
@@ -222,6 +276,8 @@ static int load_bpf(int ifindex, struct bpf_object **objp)
 
 	return prog_fd;
 err:
+	delete_output_map_elem(devmap_idx);
+err_obj:
 	bpf_object__close(obj);
 	return err;
 }
@@ -272,6 +328,56 @@ static struct netdev_info *get_netdev_info(const struct mbox_request *req)
 	return netdev_info;
 }
 
+static struct devmap_idx_node *find_devmap_idx(int devmap_idx)
+{
+	struct devmap_idx_node *node;
+
+	hash_for_each_possible(devmap_idx_table, node, node, devmap_idx) {
+		if (node->devmap_idx == devmap_idx)
+			return node;
+	}
+
+	return NULL;
+}
+
+static int get_new_devmap_idx(void)
+{
+	int offset;
+
+	for (offset = 0; offset < MAX_PORTS; offset++) {
+		int devmap_idx = max_devmap_idx++;
+
+		if (max_devmap_idx >= MAX_PORTS)
+			max_devmap_idx -= MAX_PORTS;
+
+		if (!find_devmap_idx(devmap_idx)) {
+			struct devmap_idx_node *node;
+
+			node = malloc(sizeof(*node));
+			if (!node) {
+				pr_err("malloc for devmap_idx failed\n");
+				return -ENOMEM;
+			}
+			node->devmap_idx = devmap_idx;
+			hash_add(devmap_idx_table, &node->node, devmap_idx);
+
+			return devmap_idx;
+		}
+	}
+
+	return -ENOSPC;
+}
+
+static void delete_devmap_idx(int devmap_idx)
+{
+	struct devmap_idx_node *node = find_devmap_idx(devmap_idx);
+
+	if (node) {
+		hash_del(&node->node);
+		free(node);
+	}
+}
+
 static void init_flow_masks_free_slot(struct netdev_info *netdev_info)
 {
 	int i;
@@ -325,11 +431,11 @@ static void delete_flow_masks_free_slot(struct netdev_info *netdev_info,
 
 static int handle_load(const struct mbox_request *req, __u32 *prog_id)
 {
+	int err, prog_fd, devmap_idx = -1;
 	struct netdev_info *netdev_info;
 	struct bpf_prog_info info = {};
 	struct netdev_info_key key;
 	__u32 len = sizeof(info);
-	int err, prog_fd;
 
 	err = get_netdev_info_key(req, &key);
 	if (err)
@@ -346,12 +452,19 @@ static int handle_load(const struct mbox_request *req, __u32 *prog_id)
 	}
 	netdev_info->key.ifindex = key.ifindex;
 
+	devmap_idx = get_new_devmap_idx();
+	if (devmap_idx < 0) {
+		err = devmap_idx;
+		goto err_netdev_info;
+	}
+	netdev_info->devmap_idx = devmap_idx;
+
 	init_flow_masks_free_slot(netdev_info);
 
-	prog_fd = load_bpf(req->ifindex, &netdev_info->obj);
+	prog_fd = load_bpf(req->ifindex, devmap_idx, &netdev_info->obj);
 	if (prog_fd < 0) {
 		err = prog_fd;
-		goto err_netdev_info;
+		goto err_devmap_idx;
 	}
 
 	err = bpf_obj_get_info_by_fd(prog_fd, &info, &len);
@@ -366,6 +479,8 @@ static int handle_load(const struct mbox_request *req, __u32 *prog_id)
 	return 0;
 err_obj:
 	bpf_object__close(netdev_info->obj);
+err_devmap_idx:
+	delete_devmap_idx(devmap_idx);
 err_netdev_info:
 	free(netdev_info);
 
@@ -382,12 +497,45 @@ static int handle_unload(const struct mbox_request *req)
 
 	hash_del(&netdev_info->node);
 	bpf_object__close(netdev_info->obj);
+	delete_output_map_elem(netdev_info->devmap_idx);
+	delete_devmap_idx(netdev_info->devmap_idx);
 	free(netdev_info);
 	pr_debug("XDP program for if %d was closed\n", req->ifindex);
 
 	return 0;
 }
 
+static int convert_ifindex_to_devmap_idx(struct mbox_request *req)
+{
+	int i;
+
+	for (i = 0; i < req->flow.actions.num_actions; i++) {
+		struct xdp_flow_action *action = &req->flow.actions.actions[i];
+
+		if (action->id == XDP_FLOW_ACTION_REDIRECT) {
+			struct netdev_info *netdev_info;
+			struct netdev_info_key key;
+			int err;
+
+			err = get_netdev_info_key(req, &key);
+			if (err)
+				return err;
+			key.ifindex = action->ifindex;
+
+			netdev_info = find_netdev_info(&key);
+			if (!netdev_info) {
+				pr_err("Cannot redirect to ifindex %d. Please setup xdp_flow on ifindex %d in advance.\n",
+				       key.ifindex, key.ifindex);
+				return -ENOENT;
+			}
+
+			action->ifindex = netdev_info->devmap_idx;
+		}
+	}
+
+	return 0;
+}
+
 static int get_table_fd(const struct netdev_info *netdev_info,
 			const char *table_name)
 {
@@ -784,6 +932,11 @@ static int handle_replace(struct mbox_request *req)
 	if (IS_ERR(netdev_info))
 		return PTR_ERR(netdev_info);
 
+	/* TODO: Use XDP_TX for redirect action when possible */
+	err = convert_ifindex_to_devmap_idx(req);
+	if (err)
+		return err;
+
 	err = flow_table_insert_elem(netdev_info, &req->flow);
 	if (err)
 		return err;
@@ -875,6 +1028,7 @@ int main(void)
 		return -1;
 	loop();
 	close(progfile_fd);
+	close(output_map_fd);
 
 	return 0;
 }
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC PATCH bpf-next 11/14] xdp_flow: Implement vlan_push action
  2019-08-13 12:05 [RFC PATCH bpf-next 00/14] xdp_flow: Flow offload to XDP Toshiaki Makita
                   ` (9 preceding siblings ...)
  2019-08-13 12:05 ` [RFC PATCH bpf-next 10/14] xdp_flow: Implement redirect action Toshiaki Makita
@ 2019-08-13 12:05 ` Toshiaki Makita
  2019-08-13 12:05 ` [RFC PATCH bpf-next 12/14] bpf, selftest: Add test for xdp_flow Toshiaki Makita
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 35+ messages in thread
From: Toshiaki Makita @ 2019-08-13 12:05 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko
  Cc: Toshiaki Makita, netdev, bpf, William Tu

This is another example action.

Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
---
 net/xdp_flow/xdp_flow_kern_bpf.c | 23 +++++++++++++++++++++--
 net/xdp_flow/xdp_flow_kern_mod.c |  5 +++++
 2 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/net/xdp_flow/xdp_flow_kern_bpf.c b/net/xdp_flow/xdp_flow_kern_bpf.c
index 8f3d359..51e181b 100644
--- a/net/xdp_flow/xdp_flow_kern_bpf.c
+++ b/net/xdp_flow/xdp_flow_kern_bpf.c
@@ -90,10 +90,29 @@ static inline int action_redirect(struct xdp_flow_action *action)
 static inline int action_vlan_push(struct xdp_md *ctx,
 				   struct xdp_flow_action *action)
 {
+	struct vlan_ethhdr *vehdr;
+	void *data, *data_end;
+	__be16 proto, tci;
+
 	account_action(XDP_FLOW_ACTION_VLAN_PUSH);
 
-	// TODO: implement this
-	return XDP_ABORTED;
+	proto = action->vlan.proto;
+	tci = action->vlan.tci;
+
+	if (bpf_xdp_adjust_head(ctx, -VLAN_HLEN))
+		return XDP_DROP;
+
+	data_end = (void *)(long)ctx->data_end;
+	data = (void *)(long)ctx->data;
+	if (data + VLAN_ETH_HLEN > data_end)
+		return XDP_DROP;
+
+	__builtin_memmove(data, data + VLAN_HLEN, ETH_ALEN * 2);
+	vehdr = data;
+	vehdr->h_vlan_proto = proto;
+	vehdr->h_vlan_TCI = tci;
+
+	return _XDP_CONTINUE;
 }
 
 static inline int action_vlan_pop(struct xdp_md *ctx,
diff --git a/net/xdp_flow/xdp_flow_kern_mod.c b/net/xdp_flow/xdp_flow_kern_mod.c
index caa4968..52dc64e 100644
--- a/net/xdp_flow/xdp_flow_kern_mod.c
+++ b/net/xdp_flow/xdp_flow_kern_mod.c
@@ -55,6 +55,11 @@ static int xdp_flow_parse_actions(struct xdp_flow_actions *actions,
 			action->ifindex = act->dev->ifindex;
 			break;
 		case FLOW_ACTION_VLAN_PUSH:
+			action->id = XDP_FLOW_ACTION_VLAN_PUSH;
+			action->vlan.tci = act->vlan.vid |
+					   (act->vlan.prio << VLAN_PRIO_SHIFT);
+			action->vlan.proto = act->vlan.proto;
+			break;
 		case FLOW_ACTION_VLAN_POP:
 		case FLOW_ACTION_VLAN_MANGLE:
 		case FLOW_ACTION_MANGLE:
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC PATCH bpf-next 12/14] bpf, selftest: Add test for xdp_flow
  2019-08-13 12:05 [RFC PATCH bpf-next 00/14] xdp_flow: Flow offload to XDP Toshiaki Makita
                   ` (10 preceding siblings ...)
  2019-08-13 12:05 ` [RFC PATCH bpf-next 11/14] xdp_flow: Implement vlan_push action Toshiaki Makita
@ 2019-08-13 12:05 ` Toshiaki Makita
  2019-08-13 12:05 ` [RFC PATCH bpf-next 13/14] i40e: prefetch xdp->data before running XDP prog Toshiaki Makita
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 35+ messages in thread
From: Toshiaki Makita @ 2019-08-13 12:05 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko
  Cc: Toshiaki Makita, netdev, bpf, William Tu

Check if TC flower offloading to XDP works.

Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
---
 tools/testing/selftests/bpf/Makefile         |   1 +
 tools/testing/selftests/bpf/test_xdp_flow.sh | 103 +++++++++++++++++++++++++++
 2 files changed, 104 insertions(+)
 create mode 100755 tools/testing/selftests/bpf/test_xdp_flow.sh

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 3bd0f4a..886702a 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -50,6 +50,7 @@ TEST_PROGS := test_kmod.sh \
 	test_xdp_redirect.sh \
 	test_xdp_meta.sh \
 	test_xdp_veth.sh \
+	test_xdp_flow.sh \
 	test_offload.py \
 	test_sock_addr.sh \
 	test_tunnel.sh \
diff --git a/tools/testing/selftests/bpf/test_xdp_flow.sh b/tools/testing/selftests/bpf/test_xdp_flow.sh
new file mode 100755
index 0000000..cb06f3e
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_xdp_flow.sh
@@ -0,0 +1,103 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+#
+# Create 2 namespaces with 2 veth peers, and
+# forward packets in-between using xdp_flow
+#
+# NS1(veth11)        NS2(veth22)
+#      |                  |
+#      |                  |
+#   (veth1)            (veth2)
+#      ^                  ^
+#      |     xdp_flow     |
+#      --------------------
+
+# Kselftest framework requirement - SKIP code is 4.
+ksft_skip=4
+
+TESTNAME=xdp_flow
+
+_cleanup()
+{
+	set +e
+	ip link del veth1 2> /dev/null
+	ip link del veth2 2> /dev/null
+	ip netns del ns1 2> /dev/null
+	ip netns del ns2 2> /dev/null
+}
+
+cleanup_skip()
+{
+	echo "selftests: $TESTNAME [SKIP]"
+	_cleanup
+
+	exit $ksft_skip
+}
+
+cleanup()
+{
+	if [ "$?" = 0 ]; then
+		echo "selftests: $TESTNAME [PASS]"
+	else
+		echo "selftests: $TESTNAME [FAILED]"
+	fi
+	_cleanup
+}
+
+if [ $(id -u) -ne 0 ]; then
+	echo "selftests: $TESTNAME [SKIP] Need root privileges"
+	exit $ksft_skip
+fi
+
+if ! ip link set dev lo xdp off > /dev/null 2>&1; then
+	echo "selftests: $TESTNAME [SKIP] Could not run test without the ip xdp support"
+	exit $ksft_skip
+fi
+
+set -e
+
+trap cleanup_skip EXIT
+
+ip netns add ns1
+ip netns add ns2
+
+ip link add veth1 type veth peer name veth11 netns ns1
+ip link add veth2 type veth peer name veth22 netns ns2
+
+ip link set veth1 up
+ip link set veth2 up
+
+ip -n ns1 addr add 10.1.1.11/24 dev veth11
+ip -n ns2 addr add 10.1.1.22/24 dev veth22
+
+ip -n ns1 link set dev veth11 up
+ip -n ns2 link set dev veth22 up
+
+ip -n ns1 link set dev veth11 xdp obj xdp_dummy.o sec xdp_dummy
+ip -n ns2 link set dev veth22 xdp obj xdp_dummy.o sec xdp_dummy
+
+ethtool -K veth1 tc-offload-xdp on
+ethtool -K veth2 tc-offload-xdp on
+
+trap cleanup EXIT
+
+# Adding clsact or ingress will trigger loading bpf prog in UMH
+tc qdisc add dev veth1 clsact
+tc qdisc add dev veth2 clsact
+
+# Adding filter will have UMH populate flow table map
+# 'skip_sw' can be accepted only when 'tc-offload-xdp' is enabled on veth
+tc filter add dev veth1 ingress protocol ip flower skip_sw \
+	dst_ip 10.1.1.0/24 action mirred egress redirect dev veth2
+tc filter add dev veth2 ingress protocol ip flower skip_sw \
+	dst_ip 10.1.1.0/24 action mirred egress redirect dev veth1
+
+# ARP is not supported so don't add 'skip_sw'
+tc filter add dev veth1 ingress protocol arp flower \
+	arp_tip 10.1.1.0/24 action mirred egress redirect dev veth2
+tc filter add dev veth2 ingress protocol arp flower \
+	arp_sip 10.1.1.0/24 action mirred egress redirect dev veth1
+
+ip netns exec ns1 ping -c 1 -W 1 10.1.1.22
+
+exit 0
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC PATCH bpf-next 13/14] i40e: prefetch xdp->data before running XDP prog
  2019-08-13 12:05 [RFC PATCH bpf-next 00/14] xdp_flow: Flow offload to XDP Toshiaki Makita
                   ` (11 preceding siblings ...)
  2019-08-13 12:05 ` [RFC PATCH bpf-next 12/14] bpf, selftest: Add test for xdp_flow Toshiaki Makita
@ 2019-08-13 12:05 ` Toshiaki Makita
  2019-08-13 12:05 ` [RFC PATCH bpf-next 14/14] bpf, hashtab: Compare keys in long Toshiaki Makita
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 35+ messages in thread
From: Toshiaki Makita @ 2019-08-13 12:05 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko
  Cc: Toshiaki Makita, netdev, bpf, William Tu

XDP progs are likely to read/write xdp->data.
This improves the performance of xdp_flow.
This is included in this series just to demonstrate to what extent
xdp_flow performance can increase.

Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index f162252..ea775ae 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -2207,6 +2207,7 @@ static struct sk_buff *i40e_run_xdp(struct i40e_ring *rx_ring,
 	if (!xdp_prog)
 		goto xdp_out;
 
+	prefetchw(xdp->data);
 	prefetchw(xdp->data_hard_start); /* xdp_frame write */
 
 	act = bpf_prog_run_xdp(xdp_prog, xdp);
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC PATCH bpf-next 14/14] bpf, hashtab: Compare keys in long
  2019-08-13 12:05 [RFC PATCH bpf-next 00/14] xdp_flow: Flow offload to XDP Toshiaki Makita
                   ` (12 preceding siblings ...)
  2019-08-13 12:05 ` [RFC PATCH bpf-next 13/14] i40e: prefetch xdp->data before running XDP prog Toshiaki Makita
@ 2019-08-13 12:05 ` Toshiaki Makita
  2019-08-14  1:44 ` [RFC PATCH bpf-next 00/14] xdp_flow: Flow offload to XDP Alexei Starovoitov
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 35+ messages in thread
From: Toshiaki Makita @ 2019-08-13 12:05 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko
  Cc: Toshiaki Makita, netdev, bpf, William Tu

memcmp() is generally slow. Compare keys in long if possible.
This improves xdp_flow performance.
This is included in this series just to demonstrate to what extent
xdp_flow performance can increase.

Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
---
 kernel/bpf/hashtab.c | 27 +++++++++++++++++++++++++--
 1 file changed, 25 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 22066a6..8b5ffd4 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -417,6 +417,29 @@ static inline struct hlist_nulls_head *select_bucket(struct bpf_htab *htab, u32
 	return &__select_bucket(htab, hash)->head;
 }
 
+/* key1 must be aligned to sizeof long */
+static bool key_equal(void *key1, void *key2, u32 size)
+{
+	/* Check for key1 */
+	BUILD_BUG_ON(!IS_ALIGNED(offsetof(struct htab_elem, key),
+				 sizeof(long)));
+
+	if (IS_ALIGNED((unsigned long)key2 | (unsigned long)size,
+		       sizeof(long))) {
+		unsigned long *lkey1, *lkey2;
+
+		for (lkey1 = key1, lkey2 = key2; size > 0;
+		     lkey1++, lkey2++, size -= sizeof(long)) {
+			if (*lkey1 != *lkey2)
+				return false;
+		}
+
+		return true;
+	}
+
+	return !memcmp(key1, key2, size);
+}
+
 /* this lookup function can only be called with bucket lock taken */
 static struct htab_elem *lookup_elem_raw(struct hlist_nulls_head *head, u32 hash,
 					 void *key, u32 key_size)
@@ -425,7 +448,7 @@ static struct htab_elem *lookup_elem_raw(struct hlist_nulls_head *head, u32 hash
 	struct htab_elem *l;
 
 	hlist_nulls_for_each_entry_rcu(l, n, head, hash_node)
-		if (l->hash == hash && !memcmp(&l->key, key, key_size))
+		if (l->hash == hash && key_equal(&l->key, key, key_size))
 			return l;
 
 	return NULL;
@@ -444,7 +467,7 @@ static struct htab_elem *lookup_nulls_elem_raw(struct hlist_nulls_head *head,
 
 again:
 	hlist_nulls_for_each_entry_rcu(l, n, head, hash_node)
-		if (l->hash == hash && !memcmp(&l->key, key, key_size))
+		if (l->hash == hash && key_equal(&l->key, key, key_size))
 			return l;
 
 	if (unlikely(get_nulls_value(n) != (hash & (n_buckets - 1))))
-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH bpf-next 00/14] xdp_flow: Flow offload to XDP
  2019-08-13 12:05 [RFC PATCH bpf-next 00/14] xdp_flow: Flow offload to XDP Toshiaki Makita
                   ` (13 preceding siblings ...)
  2019-08-13 12:05 ` [RFC PATCH bpf-next 14/14] bpf, hashtab: Compare keys in long Toshiaki Makita
@ 2019-08-14  1:44 ` Alexei Starovoitov
  2019-08-14  7:33   ` Toshiaki Makita
  2019-08-14 17:07 ` Stanislav Fomichev
  2019-08-15 15:46 ` William Tu
  16 siblings, 1 reply; 35+ messages in thread
From: Alexei Starovoitov @ 2019-08-14  1:44 UTC (permalink / raw)
  To: Toshiaki Makita
  Cc: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, netdev, bpf, William Tu

On Tue, Aug 13, 2019 at 09:05:44PM +0900, Toshiaki Makita wrote:
> This is a rough PoC for an idea to offload TC flower to XDP.
...
>  xdp_flow  TC        ovs kmod
>  --------  --------  --------
>  4.0 Mpps  1.1 Mpps  1.1 Mpps

Is xdp_flow limited to 4 Mpps due to veth or something else?

> 
> So xdp_flow drop rate is roughly 4x faster than software TC or ovs kmod.
> 
> OTOH the time to add a flow increases with xdp_flow.
> 
> ping latency of first packet when veth1 does XDP_PASS instead of DROP:
> 
>  xdp_flow  TC        ovs kmod
>  --------  --------  --------
>  25ms      12ms      0.6ms
> 
> xdp_flow does a lot of work to emulate TC behavior including UMH
> transaction and multiple bpf map update from UMH which I think increases
> the latency.

make sense, but why vanilla TC is so slow ?

> * Implementation
> 
> xdp_flow makes use of UMH to load an eBPF program for XDP, similar to
> bpfilter. The difference is that xdp_flow does not generate the eBPF
> program dynamically but a prebuilt program is embedded in UMH. This is
> mainly because flow insertion is considerably frequent. If we generate
> and load an eBPF program on each insertion of a flow, the latency of the
> first packet of ping in above test will incease, which I want to avoid.

I think UMH approach is a good fit for this.
Clearly the same algorithm can be done as kernel code or kernel module, but
bpfilter-like UMH is a safer approach.

> - patch 9
>  Add tc-offload-xdp netdev feature and hooks to call xdp_flow kmod in
>  TC flower offload code.

The hook into UMH from TC looks simple. Do you expect the same interface to be
reused from OVS ?

> * About alternative userland (ovs-vswitchd etc.) implementation
> 
> Maybe a similar logic can be implemented in ovs-vswitchd offload
> mechanism, instead of adding code to kernel. I just thought offloading
> TC is more generic and allows wider usage with direct TC command.
> 
> For example, considering that OVS inserts a flow to kernel only when
> flow miss happens in kernel, we can in advance add offloaded flows via
> tc filter to avoid flow insertion latency for certain sensitive flows.
> TC flower usage without using OVS is also possible.
> 
> Also as written above nftables can be offloaded to XDP with this
> mechanism as well.

Makes sense to me.

>   bpf, hashtab: Compare keys in long

3Mpps vs 4Mpps just from this patch ?
or combined with i40 prefech patch ?

>  drivers/net/ethernet/intel/i40e/i40e_txrx.c  |    1 +

Could you share "perf report" for just hash tab optimization
and for i40 ?
I haven't seen memcmp to be bottle neck in hash tab.
What is the the of the key?


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH bpf-next 00/14] xdp_flow: Flow offload to XDP
  2019-08-14  1:44 ` [RFC PATCH bpf-next 00/14] xdp_flow: Flow offload to XDP Alexei Starovoitov
@ 2019-08-14  7:33   ` Toshiaki Makita
  2019-08-15 10:59     ` Toshiaki Makita
  0 siblings, 1 reply; 35+ messages in thread
From: Toshiaki Makita @ 2019-08-14  7:33 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, netdev, bpf, William Tu

Hi Alexei, thank you for taking a look!

On 2019/08/14 10:44, Alexei Starovoitov wrote:
> On Tue, Aug 13, 2019 at 09:05:44PM +0900, Toshiaki Makita wrote:
>> This is a rough PoC for an idea to offload TC flower to XDP.
> ...
>>   xdp_flow  TC        ovs kmod
>>   --------  --------  --------
>>   4.0 Mpps  1.1 Mpps  1.1 Mpps
> 
> Is xdp_flow limited to 4 Mpps due to veth or something else?

Looking at perf, accumulation of each layer's overhead resulted in the number.
With XDP prog which only redirects packets and does not touch the data,
the drop rate is 10 Mpps. In this case the main overhead is XDP's redirect processing
and handling of 2 XDP progs (in veth and i40e).
In the xdp_flow test the overhead additionally includes flow key parse in XDP prog
and hash table lookup (including jhash calculation) which resulted in 4 Mpps.

>> So xdp_flow drop rate is roughly 4x faster than software TC or ovs kmod.
>>
>> OTOH the time to add a flow increases with xdp_flow.
>>
>> ping latency of first packet when veth1 does XDP_PASS instead of DROP:
>>
>>   xdp_flow  TC        ovs kmod
>>   --------  --------  --------
>>   25ms      12ms      0.6ms
>>
>> xdp_flow does a lot of work to emulate TC behavior including UMH
>> transaction and multiple bpf map update from UMH which I think increases
>> the latency.
> 
> make sense, but why vanilla TC is so slow ?

No ideas. At least TC requires additional syscall to insert a flow compared to ovs kmod,
but 12 ms looks too long for that.

>> * Implementation
>>
>> xdp_flow makes use of UMH to load an eBPF program for XDP, similar to
>> bpfilter. The difference is that xdp_flow does not generate the eBPF
>> program dynamically but a prebuilt program is embedded in UMH. This is
>> mainly because flow insertion is considerably frequent. If we generate
>> and load an eBPF program on each insertion of a flow, the latency of the
>> first packet of ping in above test will incease, which I want to avoid.
> 
> I think UMH approach is a good fit for this.
> Clearly the same algorithm can be done as kernel code or kernel module, but
> bpfilter-like UMH is a safer approach.
> 
>> - patch 9
>>   Add tc-offload-xdp netdev feature and hooks to call xdp_flow kmod in
>>   TC flower offload code.
> 
> The hook into UMH from TC looks simple. Do you expect the same interface to be
> reused from OVS ?

Do you mean openvswitch kernel module by OVS?
If so, no, at this point. TC hook is simple because I reused flow offload mechanism.
OVS kmod does not have offload interface and ovs-vswitchd is using TC for offload.
I wanted to reuse this mechanism for offloading to XDP, so using TC.

>> * About alternative userland (ovs-vswitchd etc.) implementation
>>
>> Maybe a similar logic can be implemented in ovs-vswitchd offload
>> mechanism, instead of adding code to kernel. I just thought offloading
>> TC is more generic and allows wider usage with direct TC command.
>>
>> For example, considering that OVS inserts a flow to kernel only when
>> flow miss happens in kernel, we can in advance add offloaded flows via
>> tc filter to avoid flow insertion latency for certain sensitive flows.
>> TC flower usage without using OVS is also possible.
>>
>> Also as written above nftables can be offloaded to XDP with this
>> mechanism as well.
> 
> Makes sense to me.
> 
>>    bpf, hashtab: Compare keys in long
> 
> 3Mpps vs 4Mpps just from this patch ?
> or combined with i40 prefech patch ?

Combined.

>>   drivers/net/ethernet/intel/i40e/i40e_txrx.c  |    1 +
> 
> Could you share "perf report" for just hash tab optimization
> and for i40 ?

Sure, I'll get some more data and post them.

> I haven't seen memcmp to be bottle neck in hash tab.
> What is the the of the key?

typo of "size of the key"? IIRC 64 bytes.

Toshiaki Makita

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH bpf-next 00/14] xdp_flow: Flow offload to XDP
  2019-08-13 12:05 [RFC PATCH bpf-next 00/14] xdp_flow: Flow offload to XDP Toshiaki Makita
                   ` (14 preceding siblings ...)
  2019-08-14  1:44 ` [RFC PATCH bpf-next 00/14] xdp_flow: Flow offload to XDP Alexei Starovoitov
@ 2019-08-14 17:07 ` Stanislav Fomichev
  2019-08-15 10:26   ` Toshiaki Makita
  2019-08-15 15:46 ` William Tu
  16 siblings, 1 reply; 35+ messages in thread
From: Stanislav Fomichev @ 2019-08-14 17:07 UTC (permalink / raw)
  To: Toshiaki Makita
  Cc: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, netdev, bpf, William Tu

On 08/13, Toshiaki Makita wrote:
> * Implementation
> 
> xdp_flow makes use of UMH to load an eBPF program for XDP, similar to
> bpfilter. The difference is that xdp_flow does not generate the eBPF
> program dynamically but a prebuilt program is embedded in UMH. This is
> mainly because flow insertion is considerably frequent. If we generate
> and load an eBPF program on each insertion of a flow, the latency of the
> first packet of ping in above test will incease, which I want to avoid.
Can this be instead implemented with a new hook that will be called
for TC events? This hook can write to perf event buffer and control
plane will insert/remove/modify flow tables in the BPF maps (contol
plane will also install xdp program).

Why do we need UMH? What am I missing?

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH bpf-next 00/14] xdp_flow: Flow offload to XDP
  2019-08-14 17:07 ` Stanislav Fomichev
@ 2019-08-15 10:26   ` Toshiaki Makita
  2019-08-15 15:21     ` Stanislav Fomichev
  0 siblings, 1 reply; 35+ messages in thread
From: Toshiaki Makita @ 2019-08-15 10:26 UTC (permalink / raw)
  To: Stanislav Fomichev, Alexei Starovoitov, Daniel Borkmann
  Cc: Martin KaFai Lau, Song Liu, Yonghong Song, David S. Miller,
	Jakub Kicinski, Jesper Dangaard Brouer, John Fastabend,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, netdev, bpf, William Tu

On 2019/08/15 2:07, Stanislav Fomichev wrote:
> On 08/13, Toshiaki Makita wrote:
>> * Implementation
>>
>> xdp_flow makes use of UMH to load an eBPF program for XDP, similar to
>> bpfilter. The difference is that xdp_flow does not generate the eBPF
>> program dynamically but a prebuilt program is embedded in UMH. This is
>> mainly because flow insertion is considerably frequent. If we generate
>> and load an eBPF program on each insertion of a flow, the latency of the
>> first packet of ping in above test will incease, which I want to avoid.
> Can this be instead implemented with a new hook that will be called
> for TC events? This hook can write to perf event buffer and control
> plane will insert/remove/modify flow tables in the BPF maps (contol
> plane will also install xdp program).
> 
> Why do we need UMH? What am I missing?

So you suggest doing everything in xdp_flow kmod?
I also thought about that. There are two phases so let's think about them separately.

1) TC block (qdisc) creation / eBPF load

I saw eBPF maintainers repeatedly saying eBPF program loading needs to be
done from userland, not from kernel, to run the verifier for safety.
However xdp_flow eBPF program is prebuilt and embedded in kernel so we may
allow such programs to be loaded from kernel? I currently don't have the will
to make such an API as loading can be done with current UMH mechanism.

2) flow insertion / eBPF map update

Not sure if this needs to be done from userland. One concern is that eBPF maps can
be modified by unrelated processes and we need to handle all unexpected state of maps.
Such handling tends to be difficult and may cause unexpected kernel behavior.
OTOH updating maps from kmod may reduces the latency of flow insertion drastically.

Alexei, Daniel, what do you think?

Toshiaki Makita

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH bpf-next 00/14] xdp_flow: Flow offload to XDP
  2019-08-14  7:33   ` Toshiaki Makita
@ 2019-08-15 10:59     ` Toshiaki Makita
  0 siblings, 0 replies; 35+ messages in thread
From: Toshiaki Makita @ 2019-08-15 10:59 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, netdev, bpf, William Tu

On 2019/08/14 16:33, Toshiaki Makita wrote:
>>>    bpf, hashtab: Compare keys in long
>>
>> 3Mpps vs 4Mpps just from this patch ?
>> or combined with i40 prefech patch ?
> 
> Combined.
> 
>>>   drivers/net/ethernet/intel/i40e/i40e_txrx.c  |    1 +
>>
>> Could you share "perf report" for just hash tab optimization
>> and for i40 ?
> 
> Sure, I'll get some more data and post them.

Here are perf report and performance numbers.
This time for some reason the performance is better than before.
Something in my env may have changed but could not identify that.

I cut and paste top 10 functions from perf report with drop rate for each case.
perf report is run with --no-child option, so does not include child functions load.
It looks like the hottest function is always xdp_flow BPF program for XDP,
but the shown function name is some meaningless one, like __this_module+0x800000007446.

- No prefetch, no long-compare

   3.3 Mpps

     25.22%  ksoftirqd/4      [kernel.kallsyms]             [k] __this_module+0x800000007446
     21.64%  ksoftirqd/4      [kernel.kallsyms]             [k] __htab_map_lookup_elem
     14.93%  ksoftirqd/4      [kernel.kallsyms]             [k] memcmp
      7.07%  ksoftirqd/4      [kernel.kallsyms]             [k] i40e_clean_rx_irq
      4.57%  ksoftirqd/4      [kernel.kallsyms]             [k] dev_map_enqueue
      3.60%  ksoftirqd/4      [kernel.kallsyms]             [k] lookup_nulls_elem_raw
      3.44%  ksoftirqd/4      [kernel.kallsyms]             [k] page_frag_free
      2.69%  ksoftirqd/4      [kernel.kallsyms]             [k] veth_xdp_rcv
      2.29%  ksoftirqd/4      [kernel.kallsyms]             [k] xdp_do_redirect
      1.51%  ksoftirqd/4      [kernel.kallsyms]             [k] veth_xdp_xmit

- With prefetch, no long-compare

   3.7 Mpps

     25.02%  ksoftirqd/4      [kernel.kallsyms]             [k] mirred_list_lock+0x800000008052
     21.52%  ksoftirqd/4      [kernel.kallsyms]             [k] __htab_map_lookup_elem
     13.20%  ksoftirqd/4      [kernel.kallsyms]             [k] memcmp
      7.38%  ksoftirqd/4      [kernel.kallsyms]             [k] i40e_clean_rx_irq
      4.09%  ksoftirqd/4      [kernel.kallsyms]             [k] lookup_nulls_elem_raw
      3.57%  ksoftirqd/4      [kernel.kallsyms]             [k] dev_map_enqueue
      3.50%  ksoftirqd/4      [kernel.kallsyms]             [k] page_frag_free
      2.86%  ksoftirqd/4      [kernel.kallsyms]             [k] xdp_do_redirect
      2.84%  ksoftirqd/4      [kernel.kallsyms]             [k] veth_xdp_rcv
      1.79%  ksoftirqd/4      [kernel.kallsyms]             [k] veth_xdp_xmit

- No prefetch, with long-compare

   4.2 Mpps

     24.64%  ksoftirqd/4      [kernel.kallsyms]             [k] __this_module+0x800000008f47
     24.42%  ksoftirqd/4      [kernel.kallsyms]             [k] __htab_map_lookup_elem
      6.91%  ksoftirqd/4      [kernel.kallsyms]             [k] i40e_clean_rx_irq
      4.04%  ksoftirqd/4      [kernel.kallsyms]             [k] page_frag_free
      3.53%  ksoftirqd/4      [kernel.kallsyms]             [k] lookup_nulls_elem_raw
      3.14%  ksoftirqd/4      [kernel.kallsyms]             [k] veth_xdp_rcv
      3.13%  ksoftirqd/4      [kernel.kallsyms]             [k] dev_map_enqueue
      2.34%  ksoftirqd/4      [kernel.kallsyms]             [k] xdp_do_redirect
      1.76%  ksoftirqd/4      [kernel.kallsyms]             [k] key_equal
      1.37%  ksoftirqd/4      [kernel.kallsyms]             [k] zero_key+0x800000010e93

   NOTE: key_equal is called in place of memcmp.

- With prefetch, with long-compare

   4.6 Mpps

     26.68%  ksoftirqd/4      [kernel.kallsyms]             [k] mirred_list_lock+0x80000000a109
     22.37%  ksoftirqd/4      [kernel.kallsyms]             [k] __htab_map_lookup_elem
     10.79%  ksoftirqd/4      [kernel.kallsyms]             [k] i40e_clean_rx_irq
      4.74%  ksoftirqd/4      [kernel.kallsyms]             [k] page_frag_free
      4.09%  ksoftirqd/4      [kernel.kallsyms]             [k] veth_xdp_rcv
      3.97%  ksoftirqd/4      [kernel.kallsyms]             [k] dev_map_enqueue
      3.79%  ksoftirqd/4      [kernel.kallsyms]             [k] lookup_nulls_elem_raw
      3.09%  ksoftirqd/4      [kernel.kallsyms]             [k] xdp_do_redirect
      2.45%  ksoftirqd/4      [kernel.kallsyms]             [k] key_equal
      1.91%  ksoftirqd/4      [kernel.kallsyms]             [k] veth_xdp_xmit

Toshiaki Makita

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH bpf-next 00/14] xdp_flow: Flow offload to XDP
  2019-08-15 10:26   ` Toshiaki Makita
@ 2019-08-15 15:21     ` Stanislav Fomichev
  2019-08-15 19:22       ` Jakub Kicinski
  2019-08-16  1:09       ` Toshiaki Makita
  0 siblings, 2 replies; 35+ messages in thread
From: Stanislav Fomichev @ 2019-08-15 15:21 UTC (permalink / raw)
  To: Toshiaki Makita
  Cc: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, netdev, bpf, William Tu

On 08/15, Toshiaki Makita wrote:
> On 2019/08/15 2:07, Stanislav Fomichev wrote:
> > On 08/13, Toshiaki Makita wrote:
> > > * Implementation
> > > 
> > > xdp_flow makes use of UMH to load an eBPF program for XDP, similar to
> > > bpfilter. The difference is that xdp_flow does not generate the eBPF
> > > program dynamically but a prebuilt program is embedded in UMH. This is
> > > mainly because flow insertion is considerably frequent. If we generate
> > > and load an eBPF program on each insertion of a flow, the latency of the
> > > first packet of ping in above test will incease, which I want to avoid.
> > Can this be instead implemented with a new hook that will be called
> > for TC events? This hook can write to perf event buffer and control
> > plane will insert/remove/modify flow tables in the BPF maps (contol
> > plane will also install xdp program).
> > 
> > Why do we need UMH? What am I missing?
> 
> So you suggest doing everything in xdp_flow kmod?
You probably don't even need xdp_flow kmod. Add new tc "offload" mode
(bypass) that dumps every command via netlink (or calls the BPF hook
where you can dump it into perf event buffer) and then read that info
from userspace and install xdp programs and modify flow tables.
I don't think you need any kernel changes besides that stream
of data from the kernel about qdisc/tc flow creation/removal/etc.

But, I haven't looked at the series deeply, so I might be missing
something :-)

> I also thought about that. There are two phases so let's think about them separately.
> 
> 1) TC block (qdisc) creation / eBPF load
> 
> I saw eBPF maintainers repeatedly saying eBPF program loading needs to be
> done from userland, not from kernel, to run the verifier for safety.
> However xdp_flow eBPF program is prebuilt and embedded in kernel so we may
> allow such programs to be loaded from kernel? I currently don't have the will
> to make such an API as loading can be done with current UMH mechanism.
> 
> 2) flow insertion / eBPF map update
> 
> Not sure if this needs to be done from userland. One concern is that eBPF maps can
> be modified by unrelated processes and we need to handle all unexpected state of maps.
> Such handling tends to be difficult and may cause unexpected kernel behavior.
> OTOH updating maps from kmod may reduces the latency of flow insertion drastically.
Latency from the moment I type 'tc filter add ...' to the moment the rule
is installed into the maps? Does it really matter?

Do I understand correctly that both of those events (qdisc creation and
flow insertion) are triggered from tcf_block_offload_cmd (or similar)?

> Alexei, Daniel, what do you think?
> 
> Toshiaki Makita

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH bpf-next 00/14] xdp_flow: Flow offload to XDP
  2019-08-13 12:05 [RFC PATCH bpf-next 00/14] xdp_flow: Flow offload to XDP Toshiaki Makita
                   ` (15 preceding siblings ...)
  2019-08-14 17:07 ` Stanislav Fomichev
@ 2019-08-15 15:46 ` William Tu
  2019-08-16  1:38   ` Toshiaki Makita
  16 siblings, 1 reply; 35+ messages in thread
From: William Tu @ 2019-08-15 15:46 UTC (permalink / raw)
  To: Toshiaki Makita
  Cc: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, Linux Kernel Network Developers, bpf

On Tue, Aug 13, 2019 at 5:07 AM Toshiaki Makita
<toshiaki.makita1@gmail.com> wrote:
>
> This is a rough PoC for an idea to offload TC flower to XDP.
>
>
> * Motivation
>
> The purpose is to speed up software TC flower by using XDP.
>
> I chose TC flower because my current interest is in OVS. OVS uses TC to
> offload flow tables to hardware, so if TC can offload flows to XDP, OVS
> also can be offloaded to XDP.
>
> When TC flower filter is offloaded to XDP, the received packets are
> handled by XDP first, and if their protocol or something is not
> supported by the eBPF program, the program returns XDP_PASS and packets
> are passed to upper layer TC.
>
> The packet processing flow will be like this when this mechanism,
> xdp_flow, is used with OVS.
>
>  +-------------+
>  | openvswitch |
>  |    kmod     |
>  +-------------+
>         ^
>         | if not match in filters (flow key or action not supported by TC)
>  +-------------+
>  |  TC flower  |
>  +-------------+
>         ^
>         | if not match in flow tables (flow key or action not supported by XDP)
>  +-------------+
>  |  XDP prog   |
>  +-------------+
>         ^
>         | incoming packets
>
I like this idea, some comments about the OVS AF_XDP work.

Another way when using OVS AF_XDP is to serve as slow path of TC flow
HW offload.
For example:

 Userspace OVS datapath (The one used by OVS-DPDK)
     ^
      |
  +------------------------------+
  |  OVS AF_XDP netdev |
  +------------------------------+
         ^
         | if not supported or not match in flow tables
  +---------------------+
  |  TC HW flower  |
  +---------------------+
         ^
         | incoming packets

So in this case it's either TC HW flower offload, or the userspace PMD OVS.
Both cases should be pretty fast.

I think xdp_flow can also be used by OVS AF_XDP netdev, sitting between
TC HW flower and OVS AF_XDP netdev.
Before the XDP program sending packet to AF_XDP socket, the
xdp_flow can execute first, and if not match, then send to AF_XDP.
So in your patch set, implement s.t like
  bpf_redirect_map(&xsks_map, index, 0);

Another thing is that at each layer we are doing its own packet parsing.
From your graph, first parse at XDP program, then at TC flow, then at
openvswitch kmod.
I wonder if we can reuse some parsing result.

Regards,
William

> This is useful especially when the device does not support HW-offload.
> Such interfaces include virtual interfaces like veth.
>
>
> * How to use
>
> It only supports ingress (clsact) flower filter at this point.
> Enable the feature via ethtool before adding ingress/clsact qdisc.
>
>  $ ethtool -K eth0 tc-offload-xdp on
>
> Then add qdisc/filters as normal.
>
>  $ tc qdisc add dev eth0 clsact
>  $ tc filter add dev eth0 ingress protocol ip flower skip_sw ...
>
> Alternatively, when using OVS, adding qdisc and filters will be
> automatically done by setting hw-offload.
>
>  $ ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
>  $ systemctl stop openvswitch
>  $ tc qdisc del dev eth0 ingress # or reboot
>  $ ethtool -K eth0 tc-offload-xdp on
>  $ systemctl start openvswitch
>
>
> * Performance
>
> I measured drop rate at veth interface with redirect action from physical
> interface (i40e 25G NIC, XXV 710) to veth. The CPU is Xeon Silver 4114
> (2.20 GHz).
>                                                                  XDP_DROP
>                     +------+                        +-------+    +-------+
>  pktgen -- wire --> | eth0 | -- TC/OVS redirect --> | veth0 |----| veth1 |
>                     +------+   (offloaded to XDP)   +-------+    +-------+
>
> The setup for redirect is done by OVS like this.
>
>  $ ovs-vsctl add-br ovsbr0
>  $ ovs-vsctl add-port ovsbr0 eth0
>  $ ovs-vsctl add-port ovsbr0 veth0
>  $ ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
>  $ systemctl stop openvswitch
>  $ tc qdisc del dev eth0 ingress
>  $ tc qdisc del dev veth0 ingress
>  $ ethtool -K eth0 tc-offload-xdp on
>  $ ethtool -K veth0 tc-offload-xdp on
>  $ systemctl start openvswitch
>
> Tested single core/single flow with 3 configurations.
> - xdp_flow: hw-offload=true, tc-offload-xdp on
> - TC:       hw-offload=true, tc-offload-xdp off (software TC)
> - ovs kmod: hw-offload=false
>
>  xdp_flow  TC        ovs kmod
>  --------  --------  --------
>  4.0 Mpps  1.1 Mpps  1.1 Mpps
>
> So xdp_flow drop rate is roughly 4x faster than software TC or ovs kmod.
>
> OTOH the time to add a flow increases with xdp_flow.
>
> ping latency of first packet when veth1 does XDP_PASS instead of DROP:
>
>  xdp_flow  TC        ovs kmod
>  --------  --------  --------
>  25ms      12ms      0.6ms
>
> xdp_flow does a lot of work to emulate TC behavior including UMH
> transaction and multiple bpf map update from UMH which I think increases
> the latency.
>
>
> * Implementation
>
> xdp_flow makes use of UMH to load an eBPF program for XDP, similar to
> bpfilter. The difference is that xdp_flow does not generate the eBPF
> program dynamically but a prebuilt program is embedded in UMH. This is
> mainly because flow insertion is considerably frequent. If we generate
> and load an eBPF program on each insertion of a flow, the latency of the
> first packet of ping in above test will incease, which I want to avoid.
>
>                          +----------------------+
>                          |    xdp_flow_umh      | load eBPF prog for XDP
>                          | (eBPF prog embedded) | update maps for flow tables
>                          +----------------------+
>                                    ^ |
>                            request | v eBPF prog id
>  +-----------+  offload  +-----------------------+
>  | TC flower | --------> |    xdp_flow kmod      | attach the prog to XDP
>  +-----------+           | (flow offload driver) |
>                          +-----------------------+
>
> - When ingress/clsact qdisc is created, i.e. a device is bound to a flow
>   block, xdp_flow kmod requests xdp_flow_umh to load eBPF prog.
>   xdp_flow_umh returns prog id and xdp_flow kmod attach the prog to XDP
>   (the reason of attaching XDP from kmod is that rtnl_lock is held here).
>
> - When flower filter is added, xdp_flow kmod requests xdp_flow_umh to
>   update maps for flow tables.
>
>
> * Patches
>
> - patch 1
>  Basic framework for xdp_flow kmod and UMH.
>
> - patch 2
>  Add prebuilt eBPF program embedded in UMH.
>
> - patch 3, 4
>  Attach the prog to XDP in kmod after using the prog id returned from
>  UMH.
>
> - patch 5, 6
>  Add maps for flow tables and flow table manipulation logic in UMH.
>
> - patch 7
>  Implement flow lookup and basic actions in eBPF prog.
>
> - patch 8
>  Implement flow manipulation logic, serialize flow key and actions from
>  TC flower and make requests to UMH in kmod.
>
> - patch 9
>  Add tc-offload-xdp netdev feature and hooks to call xdp_flow kmod in
>  TC flower offload code.
>
> - patch 10, 11
>  Add example actions, redirect and vlan_push.
>
> - patch 12
>  Add testcase for xdp_flow.
>
> - patch 13, 14
>  These are unrelated patches. They just improves XDP program's
>  performance. They are included to demonstrate to what extent xdp_flow
>  performance can increase. Without them, drop rate goes down from 4Mpps
>  to 3Mpps.
>
>
> * About OVS AF_XDP netdev
>
> Recently OVS has added AF_XDP netdev type support. This also makes use
> of XDP, but in some ways different from this patch set.
>
> - AF_XDP work originally started in order to bring BPF's flexibility to
>   OVS, which enables us to upgrade datapath without updating kernel.
>   AF_XDP solution uses userland datapath so it achieved its goal.
>   xdp_flow will not replace OVS datapath completely, but offload it
>   partially just for speed up.
>
> - OVS AF_XDP requires PMD for the best performance so consumes 100% CPU.
>
> - OVS AF_XDP needs packet copy when forwarding packets.
>
> - xdp_flow can be used not only for OVS. It works for direct use of TC
>   flower. nftables also can be offloaded by the same mechanism in the
>   future.
>
>
> * About alternative userland (ovs-vswitchd etc.) implementation
>
> Maybe a similar logic can be implemented in ovs-vswitchd offload
> mechanism, instead of adding code to kernel. I just thought offloading
> TC is more generic and allows wider usage with direct TC command.
>
> For example, considering that OVS inserts a flow to kernel only when
> flow miss happens in kernel, we can in advance add offloaded flows via
> tc filter to avoid flow insertion latency for certain sensitive flows.
> TC flower usage without using OVS is also possible.
>
> Also as written above nftables can be offloaded to XDP with this
> mechanism as well.
>
>
> * Note
>
> This patch set is based on top of commit a664a834579a ("tools: bpftool:
> fix reading from /proc/config.gz").
>
> Any feedback is welcome.
> Thanks!
>
> Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
>
> Toshiaki Makita (14):
>   xdp_flow: Add skeleton of XDP based TC offload driver
>   xdp_flow: Add skeleton bpf program for XDP
>   bpf: Add API to get program from id
>   xdp_flow: Attach bpf prog to XDP in kernel after UMH loaded program
>   xdp_flow: Prepare flow tables in bpf
>   xdp_flow: Add flow entry insertion/deletion logic in UMH
>   xdp_flow: Add flow handling and basic actions in bpf prog
>   xdp_flow: Implement flow replacement/deletion logic in xdp_flow kmod
>   xdp_flow: Add netdev feature for enabling TC flower offload to XDP
>   xdp_flow: Implement redirect action
>   xdp_flow: Implement vlan_push action
>   bpf, selftest: Add test for xdp_flow
>   i40e: prefetch xdp->data before running XDP prog
>   bpf, hashtab: Compare keys in long
>
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c  |    1 +
>  include/linux/bpf.h                          |    6 +
>  include/linux/netdev_features.h              |    2 +
>  include/linux/netdevice.h                    |    4 +
>  include/net/flow_offload_xdp.h               |   33 +
>  include/net/pkt_cls.h                        |    5 +
>  include/net/sch_generic.h                    |    1 +
>  kernel/bpf/hashtab.c                         |   27 +-
>  kernel/bpf/syscall.c                         |   26 +-
>  net/Kconfig                                  |    1 +
>  net/Makefile                                 |    1 +
>  net/core/dev.c                               |   13 +-
>  net/core/ethtool.c                           |    1 +
>  net/sched/cls_api.c                          |   67 +-
>  net/xdp_flow/.gitignore                      |    1 +
>  net/xdp_flow/Kconfig                         |   16 +
>  net/xdp_flow/Makefile                        |  112 +++
>  net/xdp_flow/msgfmt.h                        |  102 +++
>  net/xdp_flow/umh_bpf.h                       |   34 +
>  net/xdp_flow/xdp_flow_core.c                 |  126 ++++
>  net/xdp_flow/xdp_flow_kern_bpf.c             |  358 +++++++++
>  net/xdp_flow/xdp_flow_kern_bpf_blob.S        |    7 +
>  net/xdp_flow/xdp_flow_kern_mod.c             |  645 ++++++++++++++++
>  net/xdp_flow/xdp_flow_umh.c                  | 1034 ++++++++++++++++++++++++++
>  net/xdp_flow/xdp_flow_umh_blob.S             |    7 +
>  tools/testing/selftests/bpf/Makefile         |    1 +
>  tools/testing/selftests/bpf/test_xdp_flow.sh |  103 +++
>  27 files changed, 2716 insertions(+), 18 deletions(-)
>  create mode 100644 include/net/flow_offload_xdp.h
>  create mode 100644 net/xdp_flow/.gitignore
>  create mode 100644 net/xdp_flow/Kconfig
>  create mode 100644 net/xdp_flow/Makefile
>  create mode 100644 net/xdp_flow/msgfmt.h
>  create mode 100644 net/xdp_flow/umh_bpf.h
>  create mode 100644 net/xdp_flow/xdp_flow_core.c
>  create mode 100644 net/xdp_flow/xdp_flow_kern_bpf.c
>  create mode 100644 net/xdp_flow/xdp_flow_kern_bpf_blob.S
>  create mode 100644 net/xdp_flow/xdp_flow_kern_mod.c
>  create mode 100644 net/xdp_flow/xdp_flow_umh.c
>  create mode 100644 net/xdp_flow/xdp_flow_umh_blob.S
>  create mode 100755 tools/testing/selftests/bpf/test_xdp_flow.sh
>
> --
> 1.8.3.1
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH bpf-next 00/14] xdp_flow: Flow offload to XDP
  2019-08-15 15:21     ` Stanislav Fomichev
@ 2019-08-15 19:22       ` Jakub Kicinski
  2019-08-16  1:28         ` Toshiaki Makita
  2019-08-16 15:59         ` Stanislav Fomichev
  2019-08-16  1:09       ` Toshiaki Makita
  1 sibling, 2 replies; 35+ messages in thread
From: Jakub Kicinski @ 2019-08-15 19:22 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Toshiaki Makita, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, David S. Miller,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, netdev, bpf, William Tu

On Thu, 15 Aug 2019 08:21:00 -0700, Stanislav Fomichev wrote:
> On 08/15, Toshiaki Makita wrote:
> > On 2019/08/15 2:07, Stanislav Fomichev wrote:  
> > > On 08/13, Toshiaki Makita wrote:  
> > > > * Implementation
> > > > 
> > > > xdp_flow makes use of UMH to load an eBPF program for XDP, similar to
> > > > bpfilter. The difference is that xdp_flow does not generate the eBPF
> > > > program dynamically but a prebuilt program is embedded in UMH. This is
> > > > mainly because flow insertion is considerably frequent. If we generate
> > > > and load an eBPF program on each insertion of a flow, the latency of the
> > > > first packet of ping in above test will incease, which I want to avoid.  
> > > Can this be instead implemented with a new hook that will be called
> > > for TC events? This hook can write to perf event buffer and control
> > > plane will insert/remove/modify flow tables in the BPF maps (contol
> > > plane will also install xdp program).
> > > 
> > > Why do we need UMH? What am I missing?  
> > 
> > So you suggest doing everything in xdp_flow kmod?  
> You probably don't even need xdp_flow kmod. Add new tc "offload" mode
> (bypass) that dumps every command via netlink (or calls the BPF hook
> where you can dump it into perf event buffer) and then read that info
> from userspace and install xdp programs and modify flow tables.
> I don't think you need any kernel changes besides that stream
> of data from the kernel about qdisc/tc flow creation/removal/etc.

There's a certain allure in bringing the in-kernel BPF translation
infrastructure forward. OTOH from system architecture perspective IMHO
it does seem like a task best handed in user space. bpfilter can replace
iptables completely, here we're looking at an acceleration relatively
loosely coupled with flower.

FWIW Quentin spent some time working on a universal flow rule to BPF
translation library:

https://github.com/Netronome/libkefir

A lot remains to be done there, but flower front end is one of the
targets. A library can be tuned for any application, without a
dependency on flower uAPI.

> But, I haven't looked at the series deeply, so I might be missing
> something :-)

I don't think you are :)

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH bpf-next 00/14] xdp_flow: Flow offload to XDP
  2019-08-15 15:21     ` Stanislav Fomichev
  2019-08-15 19:22       ` Jakub Kicinski
@ 2019-08-16  1:09       ` Toshiaki Makita
  2019-08-16 15:35         ` Stanislav Fomichev
  1 sibling, 1 reply; 35+ messages in thread
From: Toshiaki Makita @ 2019-08-16  1:09 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, netdev, bpf, William Tu

On 2019/08/16 0:21, Stanislav Fomichev wrote:
> On 08/15, Toshiaki Makita wrote:
>> On 2019/08/15 2:07, Stanislav Fomichev wrote:
>>> On 08/13, Toshiaki Makita wrote:
>>>> * Implementation
>>>>
>>>> xdp_flow makes use of UMH to load an eBPF program for XDP, similar to
>>>> bpfilter. The difference is that xdp_flow does not generate the eBPF
>>>> program dynamically but a prebuilt program is embedded in UMH. This is
>>>> mainly because flow insertion is considerably frequent. If we generate
>>>> and load an eBPF program on each insertion of a flow, the latency of the
>>>> first packet of ping in above test will incease, which I want to avoid.
>>> Can this be instead implemented with a new hook that will be called
>>> for TC events? This hook can write to perf event buffer and control
>>> plane will insert/remove/modify flow tables in the BPF maps (contol
>>> plane will also install xdp program).
>>>
>>> Why do we need UMH? What am I missing?
>>
>> So you suggest doing everything in xdp_flow kmod?
> You probably don't even need xdp_flow kmod. Add new tc "offload" mode
> (bypass) that dumps every command via netlink (or calls the BPF hook
> where you can dump it into perf event buffer) and then read that info
> from userspace and install xdp programs and modify flow tables.
> I don't think you need any kernel changes besides that stream
> of data from the kernel about qdisc/tc flow creation/removal/etc.

My intention is to make more people who want high speed network easily use XDP,
so making transparent XDP offload with current TC interface.

What userspace program would monitor TC events with your suggestion?
ovs-vswitchd? If so, it even does not need to monitor TC. It can
implement XDP offload directly.
(However I prefer kernel solution. Please refer to "About alternative
userland (ovs-vswitchd etc.) implementation" section in the cover letter.)

Also such a TC monitoring solution easily can be out-of-sync with real TC
behavior as TC filter/flower is being heavily developed and changed,
e.g. introduction of TC block, support multiple masks with the same pref, etc.
I'm not sure such an unreliable solution have much value.

> But, I haven't looked at the series deeply, so I might be missing
> something :-)
> 
>> I also thought about that. There are two phases so let's think about them separately.
>>
>> 1) TC block (qdisc) creation / eBPF load
>>
>> I saw eBPF maintainers repeatedly saying eBPF program loading needs to be
>> done from userland, not from kernel, to run the verifier for safety.
>> However xdp_flow eBPF program is prebuilt and embedded in kernel so we may
>> allow such programs to be loaded from kernel? I currently don't have the will
>> to make such an API as loading can be done with current UMH mechanism.
>>
>> 2) flow insertion / eBPF map update
>>
>> Not sure if this needs to be done from userland. One concern is that eBPF maps can
>> be modified by unrelated processes and we need to handle all unexpected state of maps.
>> Such handling tends to be difficult and may cause unexpected kernel behavior.
>> OTOH updating maps from kmod may reduces the latency of flow insertion drastically.
> Latency from the moment I type 'tc filter add ...' to the moment the rule
> is installed into the maps? Does it really matter?

Yes it matters. Flow insertion is kind of data path in OVS.
Please see how ping latency is affected in the cover letter.

> Do I understand correctly that both of those events (qdisc creation and
> flow insertion) are triggered from tcf_block_offload_cmd (or similar)?

Both of eBPF load and map update are triggered from tcf_block_offload_cmd.
I think you understand it correctly.

Toshiaki Makita

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH bpf-next 00/14] xdp_flow: Flow offload to XDP
  2019-08-15 19:22       ` Jakub Kicinski
@ 2019-08-16  1:28         ` Toshiaki Makita
  2019-08-16 18:52           ` Jakub Kicinski
  2019-08-16 15:59         ` Stanislav Fomichev
  1 sibling, 1 reply; 35+ messages in thread
From: Toshiaki Makita @ 2019-08-16  1:28 UTC (permalink / raw)
  To: Jakub Kicinski, Stanislav Fomichev
  Cc: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jesper Dangaard Brouer,
	John Fastabend, Jamal Hadi Salim, Cong Wang, Jiri Pirko, netdev,
	bpf, William Tu

On 2019/08/16 4:22, Jakub Kicinski wrote:
> On Thu, 15 Aug 2019 08:21:00 -0700, Stanislav Fomichev wrote:
>> On 08/15, Toshiaki Makita wrote:
>>> On 2019/08/15 2:07, Stanislav Fomichev wrote:
>>>> On 08/13, Toshiaki Makita wrote:
>>>>> * Implementation
>>>>>
>>>>> xdp_flow makes use of UMH to load an eBPF program for XDP, similar to
>>>>> bpfilter. The difference is that xdp_flow does not generate the eBPF
>>>>> program dynamically but a prebuilt program is embedded in UMH. This is
>>>>> mainly because flow insertion is considerably frequent. If we generate
>>>>> and load an eBPF program on each insertion of a flow, the latency of the
>>>>> first packet of ping in above test will incease, which I want to avoid.
>>>> Can this be instead implemented with a new hook that will be called
>>>> for TC events? This hook can write to perf event buffer and control
>>>> plane will insert/remove/modify flow tables in the BPF maps (contol
>>>> plane will also install xdp program).
>>>>
>>>> Why do we need UMH? What am I missing?
>>>
>>> So you suggest doing everything in xdp_flow kmod?
>> You probably don't even need xdp_flow kmod. Add new tc "offload" mode
>> (bypass) that dumps every command via netlink (or calls the BPF hook
>> where you can dump it into perf event buffer) and then read that info
>> from userspace and install xdp programs and modify flow tables.
>> I don't think you need any kernel changes besides that stream
>> of data from the kernel about qdisc/tc flow creation/removal/etc.
> 
> There's a certain allure in bringing the in-kernel BPF translation
> infrastructure forward. OTOH from system architecture perspective IMHO
> it does seem like a task best handed in user space. bpfilter can replace
> iptables completely, here we're looking at an acceleration relatively
> loosely coupled with flower.

I don't think it's loosely coupled. Emulating TC behavior in userspace
is not so easy.

Think about recent multi-mask support in flower. Previously userspace could
assume there is one mask and hash table for each preference in TC. After the
change TC accepts different masks with the same pref. Such a change tends to
break userspace emulation. It may ignore masks passed from flow insertion
and use the mask remembered when the first flow of the pref is inserted. It
may override the mask of all existing flows with the pref. It may fail to
insert such flows. Any of them would result in unexpected wrong datapath
handling which is critical.
I think such an emulation layer needs to be updated in sync with TC.

Toshiaki Makita

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH bpf-next 00/14] xdp_flow: Flow offload to XDP
  2019-08-15 15:46 ` William Tu
@ 2019-08-16  1:38   ` Toshiaki Makita
  0 siblings, 0 replies; 35+ messages in thread
From: Toshiaki Makita @ 2019-08-16  1:38 UTC (permalink / raw)
  To: William Tu
  Cc: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, Linux Kernel Network Developers, bpf

On 2019/08/16 0:46, William Tu wrote:
> On Tue, Aug 13, 2019 at 5:07 AM Toshiaki Makita
> <toshiaki.makita1@gmail.com> wrote:
>>
>> This is a rough PoC for an idea to offload TC flower to XDP.
>>
>>
>> * Motivation
>>
>> The purpose is to speed up software TC flower by using XDP.
>>
>> I chose TC flower because my current interest is in OVS. OVS uses TC to
>> offload flow tables to hardware, so if TC can offload flows to XDP, OVS
>> also can be offloaded to XDP.
>>
>> When TC flower filter is offloaded to XDP, the received packets are
>> handled by XDP first, and if their protocol or something is not
>> supported by the eBPF program, the program returns XDP_PASS and packets
>> are passed to upper layer TC.
>>
>> The packet processing flow will be like this when this mechanism,
>> xdp_flow, is used with OVS.
>>
>>   +-------------+
>>   | openvswitch |
>>   |    kmod     |
>>   +-------------+
>>          ^
>>          | if not match in filters (flow key or action not supported by TC)
>>   +-------------+
>>   |  TC flower  |
>>   +-------------+
>>          ^
>>          | if not match in flow tables (flow key or action not supported by XDP)
>>   +-------------+
>>   |  XDP prog   |
>>   +-------------+
>>          ^
>>          | incoming packets
>>
> I like this idea, some comments about the OVS AF_XDP work.
> 
> Another way when using OVS AF_XDP is to serve as slow path of TC flow
> HW offload.
> For example:
> 
>   Userspace OVS datapath (The one used by OVS-DPDK)
>       ^
>        |
>    +------------------------------+
>    |  OVS AF_XDP netdev |
>    +------------------------------+
>           ^
>           | if not supported or not match in flow tables
>    +---------------------+
>    |  TC HW flower  |
>    +---------------------+
>           ^
>           | incoming packets
> 
> So in this case it's either TC HW flower offload, or the userspace PMD OVS.
> Both cases should be pretty fast.
> 
> I think xdp_flow can also be used by OVS AF_XDP netdev, sitting between
> TC HW flower and OVS AF_XDP netdev.
> Before the XDP program sending packet to AF_XDP socket, the
> xdp_flow can execute first, and if not match, then send to AF_XDP.
> So in your patch set, implement s.t like
>    bpf_redirect_map(&xsks_map, index, 0);

Thanks, the concept sounds good but this is probably difficult as long as
this is a TC offload, which is emulating TC.
If I changed the direction and implement offload in ovs-vswitchd, it would
be possible. I'll remember this optimization.

> Another thing is that at each layer we are doing its own packet parsing.
>  From your graph, first parse at XDP program, then at TC flow, then at
> openvswitch kmod.
> I wonder if we can reuse some parsing result.

That would be nice if possible...
Currently I don't have any ideas to do that. Someday XDP may support more
metadata for this or HW-offload like checksum. Then we can store the information
and upper layers may be able to use that.

Toshiaki Makita

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH bpf-next 00/14] xdp_flow: Flow offload to XDP
  2019-08-16  1:09       ` Toshiaki Makita
@ 2019-08-16 15:35         ` Stanislav Fomichev
  2019-08-17 14:10           ` Toshiaki Makita
  0 siblings, 1 reply; 35+ messages in thread
From: Stanislav Fomichev @ 2019-08-16 15:35 UTC (permalink / raw)
  To: Toshiaki Makita
  Cc: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, netdev, bpf, William Tu

On 08/16, Toshiaki Makita wrote:
> On 2019/08/16 0:21, Stanislav Fomichev wrote:
> > On 08/15, Toshiaki Makita wrote:
> > > On 2019/08/15 2:07, Stanislav Fomichev wrote:
> > > > On 08/13, Toshiaki Makita wrote:
> > > > > * Implementation
> > > > > 
> > > > > xdp_flow makes use of UMH to load an eBPF program for XDP, similar to
> > > > > bpfilter. The difference is that xdp_flow does not generate the eBPF
> > > > > program dynamically but a prebuilt program is embedded in UMH. This is
> > > > > mainly because flow insertion is considerably frequent. If we generate
> > > > > and load an eBPF program on each insertion of a flow, the latency of the
> > > > > first packet of ping in above test will incease, which I want to avoid.
> > > > Can this be instead implemented with a new hook that will be called
> > > > for TC events? This hook can write to perf event buffer and control
> > > > plane will insert/remove/modify flow tables in the BPF maps (contol
> > > > plane will also install xdp program).
> > > > 
> > > > Why do we need UMH? What am I missing?
> > > 
> > > So you suggest doing everything in xdp_flow kmod?
> > You probably don't even need xdp_flow kmod. Add new tc "offload" mode
> > (bypass) that dumps every command via netlink (or calls the BPF hook
> > where you can dump it into perf event buffer) and then read that info
> > from userspace and install xdp programs and modify flow tables.
> > I don't think you need any kernel changes besides that stream
> > of data from the kernel about qdisc/tc flow creation/removal/etc.
> 
> My intention is to make more people who want high speed network easily use XDP,
> so making transparent XDP offload with current TC interface.
> 
> What userspace program would monitor TC events with your suggestion?
Have a new system daemon (xdpflowerd) that is independently
packaged/shipped/installed. Anybody who wants accelerated TC can
download/install it. OVS can be completely unaware of this.

> ovs-vswitchd? If so, it even does not need to monitor TC. It can
> implement XDP offload directly.
> (However I prefer kernel solution. Please refer to "About alternative
> userland (ovs-vswitchd etc.) implementation" section in the cover letter.)
> 
> Also such a TC monitoring solution easily can be out-of-sync with real TC
> behavior as TC filter/flower is being heavily developed and changed,
> e.g. introduction of TC block, support multiple masks with the same pref, etc.
> I'm not sure such an unreliable solution have much value.
This same issue applies to the in-kernel implementation, isn't it?
What happens if somebody sends patches for a new flower feature but
doesn't add appropriate xdp support? Do we reject them?

That's why I'm suggesting to move this problem to the userspace :-)

> > But, I haven't looked at the series deeply, so I might be missing
> > something :-)
> > 
> > > I also thought about that. There are two phases so let's think about them separately.
> > > 
> > > 1) TC block (qdisc) creation / eBPF load
> > > 
> > > I saw eBPF maintainers repeatedly saying eBPF program loading needs to be
> > > done from userland, not from kernel, to run the verifier for safety.
> > > However xdp_flow eBPF program is prebuilt and embedded in kernel so we may
> > > allow such programs to be loaded from kernel? I currently don't have the will
> > > to make such an API as loading can be done with current UMH mechanism.
> > > 
> > > 2) flow insertion / eBPF map update
> > > 
> > > Not sure if this needs to be done from userland. One concern is that eBPF maps can
> > > be modified by unrelated processes and we need to handle all unexpected state of maps.
> > > Such handling tends to be difficult and may cause unexpected kernel behavior.
> > > OTOH updating maps from kmod may reduces the latency of flow insertion drastically.
> > Latency from the moment I type 'tc filter add ...' to the moment the rule
> > is installed into the maps? Does it really matter?
> 
> Yes it matters. Flow insertion is kind of data path in OVS.
> Please see how ping latency is affected in the cover letter.
Ok, but what I'm suggesting shouldn't be less performant.
We are talking about UMH writing into a pipe vs writing TC events into
a netlink.

> > Do I understand correctly that both of those events (qdisc creation and
> > flow insertion) are triggered from tcf_block_offload_cmd (or similar)?
> 
> Both of eBPF load and map update are triggered from tcf_block_offload_cmd.
> I think you understand it correctly.
> 
> Toshiaki Makita

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH bpf-next 00/14] xdp_flow: Flow offload to XDP
  2019-08-15 19:22       ` Jakub Kicinski
  2019-08-16  1:28         ` Toshiaki Makita
@ 2019-08-16 15:59         ` Stanislav Fomichev
  2019-08-16 16:20           ` Stanislav Fomichev
  1 sibling, 1 reply; 35+ messages in thread
From: Stanislav Fomichev @ 2019-08-16 15:59 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Toshiaki Makita, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, David S. Miller,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, netdev, bpf, William Tu

On 08/15, Jakub Kicinski wrote:
> On Thu, 15 Aug 2019 08:21:00 -0700, Stanislav Fomichev wrote:
> > On 08/15, Toshiaki Makita wrote:
> > > On 2019/08/15 2:07, Stanislav Fomichev wrote:  
> > > > On 08/13, Toshiaki Makita wrote:  
> > > > > * Implementation
> > > > > 
> > > > > xdp_flow makes use of UMH to load an eBPF program for XDP, similar to
> > > > > bpfilter. The difference is that xdp_flow does not generate the eBPF
> > > > > program dynamically but a prebuilt program is embedded in UMH. This is
> > > > > mainly because flow insertion is considerably frequent. If we generate
> > > > > and load an eBPF program on each insertion of a flow, the latency of the
> > > > > first packet of ping in above test will incease, which I want to avoid.  
> > > > Can this be instead implemented with a new hook that will be called
> > > > for TC events? This hook can write to perf event buffer and control
> > > > plane will insert/remove/modify flow tables in the BPF maps (contol
> > > > plane will also install xdp program).
> > > > 
> > > > Why do we need UMH? What am I missing?  
> > > 
> > > So you suggest doing everything in xdp_flow kmod?  
> > You probably don't even need xdp_flow kmod. Add new tc "offload" mode
> > (bypass) that dumps every command via netlink (or calls the BPF hook
> > where you can dump it into perf event buffer) and then read that info
> > from userspace and install xdp programs and modify flow tables.
> > I don't think you need any kernel changes besides that stream
> > of data from the kernel about qdisc/tc flow creation/removal/etc.
> 
> There's a certain allure in bringing the in-kernel BPF translation
> infrastructure forward. OTOH from system architecture perspective IMHO
> it does seem like a task best handed in user space. bpfilter can replace
> iptables completely, here we're looking at an acceleration relatively
> loosely coupled with flower.
Even for bpfilter I would've solved it using something similar:
iptables bypass + redirect iptables netlink requests to some
userspace helper that was registered to be iptables compatibility
manager. And then, again, it becomes a purely userspace problem.

The issue with UMH is that the helper has to be statically compiled
from the kernel tree, which means we can't bring in any dependencies
(stuff like libkefir you mentioned below).

But I digress :-)

> FWIW Quentin spent some time working on a universal flow rule to BPF
> translation library:
> 
> https://github.com/Netronome/libkefir
> 
> A lot remains to be done there, but flower front end is one of the
> targets. A library can be tuned for any application, without a
> dependency on flower uAPI.
> 
> > But, I haven't looked at the series deeply, so I might be missing
> > something :-)
> 
> I don't think you are :)

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH bpf-next 00/14] xdp_flow: Flow offload to XDP
  2019-08-16 15:59         ` Stanislav Fomichev
@ 2019-08-16 16:20           ` Stanislav Fomichev
  0 siblings, 0 replies; 35+ messages in thread
From: Stanislav Fomichev @ 2019-08-16 16:20 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Toshiaki Makita, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, David S. Miller,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, netdev, bpf, William Tu

On 08/16, Stanislav Fomichev wrote:
> On 08/15, Jakub Kicinski wrote:
> > On Thu, 15 Aug 2019 08:21:00 -0700, Stanislav Fomichev wrote:
> > > On 08/15, Toshiaki Makita wrote:
> > > > On 2019/08/15 2:07, Stanislav Fomichev wrote:  
> > > > > On 08/13, Toshiaki Makita wrote:  
> > > > > > * Implementation
> > > > > > 
> > > > > > xdp_flow makes use of UMH to load an eBPF program for XDP, similar to
> > > > > > bpfilter. The difference is that xdp_flow does not generate the eBPF
> > > > > > program dynamically but a prebuilt program is embedded in UMH. This is
> > > > > > mainly because flow insertion is considerably frequent. If we generate
> > > > > > and load an eBPF program on each insertion of a flow, the latency of the
> > > > > > first packet of ping in above test will incease, which I want to avoid.  
> > > > > Can this be instead implemented with a new hook that will be called
> > > > > for TC events? This hook can write to perf event buffer and control
> > > > > plane will insert/remove/modify flow tables in the BPF maps (contol
> > > > > plane will also install xdp program).
> > > > > 
> > > > > Why do we need UMH? What am I missing?  
> > > > 
> > > > So you suggest doing everything in xdp_flow kmod?  
> > > You probably don't even need xdp_flow kmod. Add new tc "offload" mode
> > > (bypass) that dumps every command via netlink (or calls the BPF hook
> > > where you can dump it into perf event buffer) and then read that info
> > > from userspace and install xdp programs and modify flow tables.
> > > I don't think you need any kernel changes besides that stream
> > > of data from the kernel about qdisc/tc flow creation/removal/etc.
> > 
> > There's a certain allure in bringing the in-kernel BPF translation
> > infrastructure forward. OTOH from system architecture perspective IMHO
> > it does seem like a task best handed in user space. bpfilter can replace
> > iptables completely, here we're looking at an acceleration relatively
> > loosely coupled with flower.
> Even for bpfilter I would've solved it using something similar:
> iptables bypass + redirect iptables netlink requests to some
> userspace helper that was registered to be iptables compatibility
> manager. And then, again, it becomes a purely userspace problem.
Oh, wait, isn't iptables kernel api is setsockopt/getsockopt?
With the new cgroup hooks you can now try to do bpfilter completely
in BPF 🤯

> The issue with UMH is that the helper has to be statically compiled
> from the kernel tree, which means we can't bring in any dependencies
> (stuff like libkefir you mentioned below).
> 
> But I digress :-)
> 
> > FWIW Quentin spent some time working on a universal flow rule to BPF
> > translation library:
> > 
> > https://github.com/Netronome/libkefir
> > 
> > A lot remains to be done there, but flower front end is one of the
> > targets. A library can be tuned for any application, without a
> > dependency on flower uAPI.
> > 
> > > But, I haven't looked at the series deeply, so I might be missing
> > > something :-)
> > 
> > I don't think you are :)

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH bpf-next 00/14] xdp_flow: Flow offload to XDP
  2019-08-16  1:28         ` Toshiaki Makita
@ 2019-08-16 18:52           ` Jakub Kicinski
  2019-08-17 14:01             ` Toshiaki Makita
  0 siblings, 1 reply; 35+ messages in thread
From: Jakub Kicinski @ 2019-08-16 18:52 UTC (permalink / raw)
  To: Toshiaki Makita
  Cc: Stanislav Fomichev, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, David S. Miller,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, netdev, bpf, William Tu

On Fri, 16 Aug 2019 10:28:10 +0900, Toshiaki Makita wrote:
> On 2019/08/16 4:22, Jakub Kicinski wrote:
> > There's a certain allure in bringing the in-kernel BPF translation
> > infrastructure forward. OTOH from system architecture perspective IMHO
> > it does seem like a task best handed in user space. bpfilter can replace
> > iptables completely, here we're looking at an acceleration relatively
> > loosely coupled with flower.  
> 
> I don't think it's loosely coupled. Emulating TC behavior in userspace
> is not so easy.
> 
> Think about recent multi-mask support in flower. Previously userspace could
> assume there is one mask and hash table for each preference in TC. After the
> change TC accepts different masks with the same pref. Such a change tends to
> break userspace emulation. It may ignore masks passed from flow insertion
> and use the mask remembered when the first flow of the pref is inserted. It
> may override the mask of all existing flows with the pref. It may fail to
> insert such flows. Any of them would result in unexpected wrong datapath
> handling which is critical.
> I think such an emulation layer needs to be updated in sync with TC.

Oh, so you're saying that if xdp_flow is merged all patches to
cls_flower and netfilter which affect flow offload will be required 
to update xdp_flow as well?

That's a question of policy. Technically the implementation in user
space is equivalent.

The advantage of user space implementation is that you can add more
to it and explore use cases which do not fit in the flow offload API,
but are trivial for BPF. Not to mention the obvious advantage of
decoupling the upgrade path.


Personally I'm not happy with the way this patch set messes with the
flow infrastructure. You should use the indirect callback
infrastructure instead, and that way you can build the whole thing
touching none of the flow offload core.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH bpf-next 00/14] xdp_flow: Flow offload to XDP
  2019-08-16 18:52           ` Jakub Kicinski
@ 2019-08-17 14:01             ` Toshiaki Makita
  2019-08-19 18:15               ` Jakub Kicinski
  0 siblings, 1 reply; 35+ messages in thread
From: Toshiaki Makita @ 2019-08-17 14:01 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Stanislav Fomichev, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, David S. Miller,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, netdev, bpf, William Tu

On 19/08/17 (土) 3:52:24, Jakub Kicinski wrote:
> On Fri, 16 Aug 2019 10:28:10 +0900, Toshiaki Makita wrote:
>> On 2019/08/16 4:22, Jakub Kicinski wrote:
>>> There's a certain allure in bringing the in-kernel BPF translation
>>> infrastructure forward. OTOH from system architecture perspective IMHO
>>> it does seem like a task best handed in user space. bpfilter can replace
>>> iptables completely, here we're looking at an acceleration relatively
>>> loosely coupled with flower.
>>
>> I don't think it's loosely coupled. Emulating TC behavior in userspace
>> is not so easy.
>>
>> Think about recent multi-mask support in flower. Previously userspace could
>> assume there is one mask and hash table for each preference in TC. After the
>> change TC accepts different masks with the same pref. Such a change tends to
>> break userspace emulation. It may ignore masks passed from flow insertion
>> and use the mask remembered when the first flow of the pref is inserted. It
>> may override the mask of all existing flows with the pref. It may fail to
>> insert such flows. Any of them would result in unexpected wrong datapath
>> handling which is critical.
>> I think such an emulation layer needs to be updated in sync with TC.
> 
> Oh, so you're saying that if xdp_flow is merged all patches to
> cls_flower and netfilter which affect flow offload will be required
> to update xdp_flow as well?

Hmm... you are saying that we are allowed to break other in-kernel 
subsystem by some change? Sounds strange...

> That's a question of policy. Technically the implementation in user
> space is equivalent.
 >
> The advantage of user space implementation is that you can add more
> to it and explore use cases which do not fit in the flow offload API,
> but are trivial for BPF. Not to mention the obvious advantage of
> decoupling the upgrade path.

I understand the advantage, but I can't trust such a third-party kernel 
emulation solution for this kind of thing which handles critical data path.

> Personally I'm not happy with the way this patch set messes with the
> flow infrastructure. You should use the indirect callback
> infrastructure instead, and that way you can build the whole thing
> touching none of the flow offload core.

I don't want to mess up the core flow infrastructure either. I'm all 
ears about less invasive ways. Using indirect callback sounds like a 
good idea. Will give it a try. Many thanks.

Toshiaki Makita

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH bpf-next 00/14] xdp_flow: Flow offload to XDP
  2019-08-16 15:35         ` Stanislav Fomichev
@ 2019-08-17 14:10           ` Toshiaki Makita
  0 siblings, 0 replies; 35+ messages in thread
From: Toshiaki Makita @ 2019-08-17 14:10 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, David S. Miller, Jakub Kicinski,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, netdev, bpf, William Tu

On 19/08/17 (土) 0:35:50, Stanislav Fomichev wrote:
> On 08/16, Toshiaki Makita wrote:
>> On 2019/08/16 0:21, Stanislav Fomichev wrote:
>>> On 08/15, Toshiaki Makita wrote:
>>>> On 2019/08/15 2:07, Stanislav Fomichev wrote:
>>>>> On 08/13, Toshiaki Makita wrote:
>>>>>> * Implementation
>>>>>>
>>>>>> xdp_flow makes use of UMH to load an eBPF program for XDP, similar to
>>>>>> bpfilter. The difference is that xdp_flow does not generate the eBPF
>>>>>> program dynamically but a prebuilt program is embedded in UMH. This is
>>>>>> mainly because flow insertion is considerably frequent. If we generate
>>>>>> and load an eBPF program on each insertion of a flow, the latency of the
>>>>>> first packet of ping in above test will incease, which I want to avoid.
>>>>> Can this be instead implemented with a new hook that will be called
>>>>> for TC events? This hook can write to perf event buffer and control
>>>>> plane will insert/remove/modify flow tables in the BPF maps (contol
>>>>> plane will also install xdp program).
>>>>>
>>>>> Why do we need UMH? What am I missing?
>>>>
>>>> So you suggest doing everything in xdp_flow kmod?
>>> You probably don't even need xdp_flow kmod. Add new tc "offload" mode
>>> (bypass) that dumps every command via netlink (or calls the BPF hook
>>> where you can dump it into perf event buffer) and then read that info
>>> from userspace and install xdp programs and modify flow tables.
>>> I don't think you need any kernel changes besides that stream
>>> of data from the kernel about qdisc/tc flow creation/removal/etc.
>>
>> My intention is to make more people who want high speed network easily use XDP,
>> so making transparent XDP offload with current TC interface.
>>
>> What userspace program would monitor TC events with your suggestion?
> Have a new system daemon (xdpflowerd) that is independently
> packaged/shipped/installed. Anybody who wants accelerated TC can
> download/install it. OVS can be completely unaware of this.

Thanks, but that's what I called an unreliable solution...

>> ovs-vswitchd? If so, it even does not need to monitor TC. It can
>> implement XDP offload directly.
>> (However I prefer kernel solution. Please refer to "About alternative
>> userland (ovs-vswitchd etc.) implementation" section in the cover letter.)
>>
>> Also such a TC monitoring solution easily can be out-of-sync with real TC
>> behavior as TC filter/flower is being heavily developed and changed,
>> e.g. introduction of TC block, support multiple masks with the same pref, etc.
>> I'm not sure such an unreliable solution have much value.
> This same issue applies to the in-kernel implementation, isn't it?
> What happens if somebody sends patches for a new flower feature but
> doesn't add appropriate xdp support? Do we reject them?

Why can we accept a patch which breaks other in-kernel subsystem...
Such patches can be applied accidentally but we are supposed to fix such 
problems in -rc phase, aren't we?

Toshiaki Makita

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH bpf-next 00/14] xdp_flow: Flow offload to XDP
  2019-08-17 14:01             ` Toshiaki Makita
@ 2019-08-19 18:15               ` Jakub Kicinski
  2019-08-21  8:49                 ` Toshiaki Makita
  0 siblings, 1 reply; 35+ messages in thread
From: Jakub Kicinski @ 2019-08-19 18:15 UTC (permalink / raw)
  To: Toshiaki Makita
  Cc: Stanislav Fomichev, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, David S. Miller,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, netdev, bpf, William Tu

On Sat, 17 Aug 2019 23:01:59 +0900, Toshiaki Makita wrote:
> On 19/08/17 (土) 3:52:24, Jakub Kicinski wrote:
> > On Fri, 16 Aug 2019 10:28:10 +0900, Toshiaki Makita wrote:  
> >> On 2019/08/16 4:22, Jakub Kicinski wrote:  
> >>> There's a certain allure in bringing the in-kernel BPF translation
> >>> infrastructure forward. OTOH from system architecture perspective IMHO
> >>> it does seem like a task best handed in user space. bpfilter can replace
> >>> iptables completely, here we're looking at an acceleration relatively
> >>> loosely coupled with flower.  
> >>
> >> I don't think it's loosely coupled. Emulating TC behavior in userspace
> >> is not so easy.
> >>
> >> Think about recent multi-mask support in flower. Previously userspace could
> >> assume there is one mask and hash table for each preference in TC. After the
> >> change TC accepts different masks with the same pref. Such a change tends to
> >> break userspace emulation. It may ignore masks passed from flow insertion
> >> and use the mask remembered when the first flow of the pref is inserted. It
> >> may override the mask of all existing flows with the pref. It may fail to
> >> insert such flows. Any of them would result in unexpected wrong datapath
> >> handling which is critical.
> >> I think such an emulation layer needs to be updated in sync with TC.  
> > 
> > Oh, so you're saying that if xdp_flow is merged all patches to
> > cls_flower and netfilter which affect flow offload will be required
> > to update xdp_flow as well?  
> 
> Hmm... you are saying that we are allowed to break other in-kernel 
> subsystem by some change? Sounds strange...

No I'm not saying that, please don't put words in my mouth.
I'm asking you if that's your intention.

Having an implementation nor support a feature of another implementation
and degrade gracefully to the slower one is not necessarily breakage.
We need to make a concious decision here, hence the clarifying question.

> > That's a question of policy. Technically the implementation in user
> > space is equivalent.
> > 
> > The advantage of user space implementation is that you can add more
> > to it and explore use cases which do not fit in the flow offload API,
> > but are trivial for BPF. Not to mention the obvious advantage of
> > decoupling the upgrade path.  
> 
> I understand the advantage, but I can't trust such a third-party kernel 
> emulation solution for this kind of thing which handles critical data path.

That's a strange argument to make. All production data path BPF today
comes from user space.

> > Personally I'm not happy with the way this patch set messes with the
> > flow infrastructure. You should use the indirect callback
> > infrastructure instead, and that way you can build the whole thing
> > touching none of the flow offload core.  
> 
> I don't want to mess up the core flow infrastructure either. I'm all 
> ears about less invasive ways. Using indirect callback sounds like a 
> good idea. Will give it a try. Many thanks.

Excellent, thanks!

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH bpf-next 00/14] xdp_flow: Flow offload to XDP
  2019-08-19 18:15               ` Jakub Kicinski
@ 2019-08-21  8:49                 ` Toshiaki Makita
  2019-08-21 18:38                   ` Jakub Kicinski
  0 siblings, 1 reply; 35+ messages in thread
From: Toshiaki Makita @ 2019-08-21  8:49 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Stanislav Fomichev, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, David S. Miller,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, netdev, bpf, William Tu

On 19/08/20 (火) 3:15:46, Jakub Kicinski wrote:

I'm on vacation and replying slowly. Sorry for any inconvenience.

> On Sat, 17 Aug 2019 23:01:59 +0900, Toshiaki Makita wrote:
>> On 19/08/17 (土) 3:52:24, Jakub Kicinski wrote:
>>> On Fri, 16 Aug 2019 10:28:10 +0900, Toshiaki Makita wrote:
>>>> On 2019/08/16 4:22, Jakub Kicinski wrote:
>>>>> There's a certain allure in bringing the in-kernel BPF translation
>>>>> infrastructure forward. OTOH from system architecture perspective IMHO
>>>>> it does seem like a task best handed in user space. bpfilter can replace
>>>>> iptables completely, here we're looking at an acceleration relatively
>>>>> loosely coupled with flower.
>>>>
>>>> I don't think it's loosely coupled. Emulating TC behavior in userspace
>>>> is not so easy.
>>>>
>>>> Think about recent multi-mask support in flower. Previously userspace could
>>>> assume there is one mask and hash table for each preference in TC. After the
>>>> change TC accepts different masks with the same pref. Such a change tends to
>>>> break userspace emulation. It may ignore masks passed from flow insertion
>>>> and use the mask remembered when the first flow of the pref is inserted. It
>>>> may override the mask of all existing flows with the pref. It may fail to
>>>> insert such flows. Any of them would result in unexpected wrong datapath
>>>> handling which is critical.
>>>> I think such an emulation layer needs to be updated in sync with TC.
>>>
>>> Oh, so you're saying that if xdp_flow is merged all patches to
>>> cls_flower and netfilter which affect flow offload will be required
>>> to update xdp_flow as well?
>>
>> Hmm... you are saying that we are allowed to break other in-kernel
>> subsystem by some change? Sounds strange...
> 
> No I'm not saying that, please don't put words in my mouth.

If we ignore xdp_flow when modifying something which affects flow 
offload, that may cause breakage. I showed such an example using 
multi-mask support. So I just wondered what you mean and guessed you 
think we can break other subsystem in some situation.

I admit I should not have used the wording "you are saying...?". If it 
was not unpleasant to you I'm sorry about that. But I think you should 
not use it as well. I did not say "cls_flower and netfilter which affect 
flow offload will be required to update xdp_flow". I guess most patches 
which affect flow offload core will not break xdp_flow. In some cases 
breakage may happen. In that case we need to fix xdp_flow as well.

> I'm asking you if that's your intention.
> 
> Having an implementation nor support a feature of another implementation
> and degrade gracefully to the slower one is not necessarily breakage.
> We need to make a concious decision here, hence the clarifying question.

As I described above, breakage can happen in some case, and if the patch 
breaks xdp_flow I think we need to fix xdp_flow at the same time. If 
xdp_flow does not support newly added features but it works for existing 
ones, it is OK. In the first place not all features can be offloaded to 
xdp_flow. I think this is the same as HW-offload.

>>> That's a question of policy. Technically the implementation in user
>>> space is equivalent.
>>>
>>> The advantage of user space implementation is that you can add more
>>> to it and explore use cases which do not fit in the flow offload API,
>>> but are trivial for BPF. Not to mention the obvious advantage of
>>> decoupling the upgrade path.
>>
>> I understand the advantage, but I can't trust such a third-party kernel
>> emulation solution for this kind of thing which handles critical data path.
> 
> That's a strange argument to make. All production data path BPF today
> comes from user space.

Probably my explanation was not sufficient. What I'm concerned about is 
that this needs to emulate kernel behavior, and it is difficult.
I don't think userspace-defined datapath itself is not reliable, nor 
eBPF ecosystem.

Toshiaki Makita

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH bpf-next 00/14] xdp_flow: Flow offload to XDP
  2019-08-21  8:49                 ` Toshiaki Makita
@ 2019-08-21 18:38                   ` Jakub Kicinski
  0 siblings, 0 replies; 35+ messages in thread
From: Jakub Kicinski @ 2019-08-21 18:38 UTC (permalink / raw)
  To: Toshiaki Makita
  Cc: Stanislav Fomichev, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, David S. Miller,
	Jesper Dangaard Brouer, John Fastabend, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, netdev, bpf, William Tu

On Wed, 21 Aug 2019 17:49:33 +0900, Toshiaki Makita wrote:
> > Having an implementation nor support a feature of another implementation
> > and degrade gracefully to the slower one is not necessarily breakage.
> > We need to make a concious decision here, hence the clarifying question.  
> 
> As I described above, breakage can happen in some case, and if the patch 
> breaks xdp_flow I think we need to fix xdp_flow at the same time. If 
> xdp_flow does not support newly added features but it works for existing 
> ones, it is OK. In the first place not all features can be offloaded to 
> xdp_flow. I think this is the same as HW-offload.

I see, that sounds reasonable, yes. Thanks for clarifying.

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, back to index

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-13 12:05 [RFC PATCH bpf-next 00/14] xdp_flow: Flow offload to XDP Toshiaki Makita
2019-08-13 12:05 ` [RFC PATCH bpf-next 01/14] xdp_flow: Add skeleton of XDP based TC offload driver Toshiaki Makita
2019-08-13 12:05 ` [RFC PATCH bpf-next 02/14] xdp_flow: Add skeleton bpf program for XDP Toshiaki Makita
2019-08-13 12:05 ` [RFC PATCH bpf-next 03/14] bpf: Add API to get program from id Toshiaki Makita
2019-08-13 12:05 ` [RFC PATCH bpf-next 04/14] xdp_flow: Attach bpf prog to XDP in kernel after UMH loaded program Toshiaki Makita
2019-08-13 12:05 ` [RFC PATCH bpf-next 05/14] xdp_flow: Prepare flow tables in bpf Toshiaki Makita
2019-08-13 12:05 ` [RFC PATCH bpf-next 06/14] xdp_flow: Add flow entry insertion/deletion logic in UMH Toshiaki Makita
2019-08-13 12:05 ` [RFC PATCH bpf-next 07/14] xdp_flow: Add flow handling and basic actions in bpf prog Toshiaki Makita
2019-08-13 12:05 ` [RFC PATCH bpf-next 08/14] xdp_flow: Implement flow replacement/deletion logic in xdp_flow kmod Toshiaki Makita
2019-08-13 12:05 ` [RFC PATCH bpf-next 09/14] xdp_flow: Add netdev feature for enabling TC flower offload to XDP Toshiaki Makita
2019-08-13 12:05 ` [RFC PATCH bpf-next 10/14] xdp_flow: Implement redirect action Toshiaki Makita
2019-08-13 12:05 ` [RFC PATCH bpf-next 11/14] xdp_flow: Implement vlan_push action Toshiaki Makita
2019-08-13 12:05 ` [RFC PATCH bpf-next 12/14] bpf, selftest: Add test for xdp_flow Toshiaki Makita
2019-08-13 12:05 ` [RFC PATCH bpf-next 13/14] i40e: prefetch xdp->data before running XDP prog Toshiaki Makita
2019-08-13 12:05 ` [RFC PATCH bpf-next 14/14] bpf, hashtab: Compare keys in long Toshiaki Makita
2019-08-14  1:44 ` [RFC PATCH bpf-next 00/14] xdp_flow: Flow offload to XDP Alexei Starovoitov
2019-08-14  7:33   ` Toshiaki Makita
2019-08-15 10:59     ` Toshiaki Makita
2019-08-14 17:07 ` Stanislav Fomichev
2019-08-15 10:26   ` Toshiaki Makita
2019-08-15 15:21     ` Stanislav Fomichev
2019-08-15 19:22       ` Jakub Kicinski
2019-08-16  1:28         ` Toshiaki Makita
2019-08-16 18:52           ` Jakub Kicinski
2019-08-17 14:01             ` Toshiaki Makita
2019-08-19 18:15               ` Jakub Kicinski
2019-08-21  8:49                 ` Toshiaki Makita
2019-08-21 18:38                   ` Jakub Kicinski
2019-08-16 15:59         ` Stanislav Fomichev
2019-08-16 16:20           ` Stanislav Fomichev
2019-08-16  1:09       ` Toshiaki Makita
2019-08-16 15:35         ` Stanislav Fomichev
2019-08-17 14:10           ` Toshiaki Makita
2019-08-15 15:46 ` William Tu
2019-08-16  1:38   ` Toshiaki Makita

BPF Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/bpf/0 bpf/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 bpf bpf/ https://lore.kernel.org/bpf \
		bpf@vger.kernel.org
	public-inbox-index bpf

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.bpf


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git