bpf.vger.kernel.org archive mirror
* [PATCH bpf-next v2 0/8] Introduce bpf_redirect_xsk() helper
@ 2021-01-19 15:50 Björn Töpel
  2021-01-19 15:50 ` [PATCH bpf-next v2 1/8] xdp: restructure redirect actions Björn Töpel
                   ` (8 more replies)
  0 siblings, 9 replies; 45+ messages in thread
From: Björn Töpel @ 2021-01-19 15:50 UTC (permalink / raw)
  To: ast, daniel, netdev, bpf
  Cc: Björn Töpel, bjorn.topel, magnus.karlsson,
	maciej.fijalkowski, kuba, jonathan.lemon, maximmi, davem, hawk,
	john.fastabend, ciara.loftus, weqaar.a.janjua

This series extends bind() for XDP sockets, so that the bound socket
is added to the netdev_rx_queue _rx array in the netdevice. We call
this registering the socket. To redirect packets to the registered
socket, a new BPF helper is used: bpf_redirect_xsk().

For shared XDP sockets, only the first bound socket is
registered. Users that need a more complex setup have to use XSKMAP
and bpf_redirect_map().

Now, why would one use bpf_redirect_xsk() over the regular
bpf_redirect_map() helper?

* Better performance!
* Convenience: most users use one socket per queue. This scenario is
  what registered sockets support. There is no need to create an
  XSKMAP. This can also reduce complexity in containerized setups,
  where users might want to use XDP sockets without the CAP_SYS_ADMIN
  capability.
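To illustrate the convenience point, here is a hedged sketch of a
complete XDP program using the new helper (not taken verbatim from the
selftests; it assumes the bpf_redirect_xsk() UAPI declaration from this
series is available via the tools headers):

```c
// SPDX-License-Identifier: GPL-2.0
/* Minimal sketch: redirect every packet to the AF_XDP socket
 * registered on the receiving queue. No XSKMAP is needed; if no
 * socket is registered, the helper returns the fallback action
 * (here XDP_PASS) instead of XDP_REDIRECT.
 */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int redirect_xsk_prog(struct xdp_md *ctx)
{
	return bpf_redirect_xsk(ctx, XDP_PASS);
}

char _license[] SEC("license") = "GPL";
```

With bpf_redirect_map(), the same program would additionally need an
XSKMAP definition plus userspace code to populate it.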

The first patch restructures xdp_do_redirect() a bit, to make it
easier to add the new helper. This restructure also gives us a slight
performance benefit. The following three patches extend bind() and
add the new helper. After that, two libbpf patches select the XDP
program based on the running kernel. Finally, selftests for the new
functionality are added.

Note that the libbpf "auto-selection" is based on kernel version, so
it is hard coded to the "-next" version (5.12). If you would like to
try this out, you will need to change the libbpf patch locally!

Thanks to Maciej and Magnus for the internal review/comments!

Performance (rxdrop, zero-copy)

Baseline
Two cores:                   21.3 Mpps
One core:                    24.5 Mpps

Patched
Two cores, bpf_redirect_map: 21.7 Mpps + 2%
One core, bpf_redirect_map:  24.9 Mpps + 2%

Two cores, bpf_redirect_xsk: 24.0 Mpps +13%
One core, bpf_redirect_xsk:  25.5 Mpps + 4%

Thanks!
Björn

v1->v2: 
  * Added missing XDP programs to selftests.
  * Fixed checkpatch warning in selftests.

Björn Töpel (8):
  xdp: restructure redirect actions
  xsk: remove explicit_free parameter from __xsk_rcv()
  xsk: fold xp_assign_dev and __xp_assign_dev
  xsk: register XDP sockets at bind(), and add new AF_XDP BPF helper
  libbpf, xsk: select AF_XDP BPF program based on kernel version
  libbpf, xsk: select bpf_redirect_xsk(), if supported
  selftest/bpf: add XDP socket tests for bpf_redirect_{xsk, map}()
  selftest/bpf: remove a lot of ifobject casting in xdpxceiver

 include/linux/filter.h                        |  10 +
 include/linux/netdevice.h                     |   1 +
 include/net/xdp_sock.h                        |  12 +
 include/net/xsk_buff_pool.h                   |   2 +-
 include/trace/events/xdp.h                    |  46 ++--
 include/uapi/linux/bpf.h                      |   7 +
 net/core/filter.c                             | 205 ++++++++++--------
 net/xdp/xsk.c                                 | 112 ++++++++--
 net/xdp/xsk_buff_pool.c                       |  12 +-
 tools/include/uapi/linux/bpf.h                |   7 +
 tools/lib/bpf/libbpf.c                        |   2 +-
 tools/lib/bpf/libbpf_internal.h               |   2 +
 tools/lib/bpf/libbpf_probes.c                 |  16 --
 tools/lib/bpf/xsk.c                           |  83 ++++++-
 .../selftests/bpf/progs/xdpxceiver_ext1.c     |  15 ++
 .../selftests/bpf/progs/xdpxceiver_ext2.c     |   9 +
 tools/testing/selftests/bpf/test_xsk.sh       |  48 ++++
 tools/testing/selftests/bpf/xdpxceiver.c      | 164 +++++++++-----
 tools/testing/selftests/bpf/xdpxceiver.h      |   2 +
 19 files changed, 554 insertions(+), 201 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/progs/xdpxceiver_ext1.c
 create mode 100644 tools/testing/selftests/bpf/progs/xdpxceiver_ext2.c


base-commit: 95204c9bfa48d2f4d3bab7df55c1cc823957ff81
-- 
2.27.0


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH bpf-next v2 1/8] xdp: restructure redirect actions
  2021-01-19 15:50 [PATCH bpf-next v2 0/8] Introduce bpf_redirect_xsk() helper Björn Töpel
@ 2021-01-19 15:50 ` Björn Töpel
  2021-01-20 12:44   ` Toke Høiland-Jørgensen
  2021-01-19 15:50 ` [PATCH bpf-next v2 2/8] xsk: remove explicit_free parameter from __xsk_rcv() Björn Töpel
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 45+ messages in thread
From: Björn Töpel @ 2021-01-19 15:50 UTC (permalink / raw)
  To: ast, daniel, netdev, bpf
  Cc: Björn Töpel, magnus.karlsson, maciej.fijalkowski, kuba,
	jonathan.lemon, maximmi, davem, hawk, john.fastabend,
	ciara.loftus, weqaar.a.janjua

From: Björn Töpel <bjorn.topel@intel.com>

The XDP_REDIRECT implementations for maps and non-maps are fairly
similar, but obviously need to take different code paths depending on
whether the target is using a map or not. Today, the redirect targets
for XDP either use a map or are based on ifindex.

Future commits will introduce yet another redirect target via a new
helper, bpf_redirect_xsk(). To pave the way for that, we introduce an
explicit redirect type to bpf_redirect_info. This makes the code
easier to follow, and makes it easier to add new redirect targets.

Further, using an explicit type in bpf_redirect_info has a slight
positive performance impact by avoiding a pointer indirection for the
map type lookup, instead using the hot cacheline of
bpf_redirect_info.

The bpf_redirect_info flags member is not used by XDP, and not
read/written any more. The map member is only written to when
required/used, and not unconditionally.

Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 include/linux/filter.h     |   9 ++
 include/trace/events/xdp.h |  46 +++++++----
 net/core/filter.c          | 164 ++++++++++++++++++-------------------
 3 files changed, 117 insertions(+), 102 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 7fdce5407214..5fc336a271c2 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -637,10 +637,19 @@ struct bpf_redirect_info {
 	u32 tgt_index;
 	void *tgt_value;
 	struct bpf_map *map;
+	u32 tgt_type;
 	u32 kern_flags;
 	struct bpf_nh_params nh;
 };
 
+enum xdp_redirect_type {
+	XDP_REDIR_UNSET,
+	XDP_REDIR_DEV_IFINDEX,
+	XDP_REDIR_DEV_MAP,
+	XDP_REDIR_CPU_MAP,
+	XDP_REDIR_XSK_MAP,
+};
+
 DECLARE_PER_CPU(struct bpf_redirect_info, bpf_redirect_info);
 
 /* flags for bpf_redirect_info kern_flags */
diff --git a/include/trace/events/xdp.h b/include/trace/events/xdp.h
index 76a97176ab81..0e17b9a74f28 100644
--- a/include/trace/events/xdp.h
+++ b/include/trace/events/xdp.h
@@ -96,9 +96,10 @@ DECLARE_EVENT_CLASS(xdp_redirect_template,
 	TP_PROTO(const struct net_device *dev,
 		 const struct bpf_prog *xdp,
 		 const void *tgt, int err,
-		 const struct bpf_map *map, u32 index),
+		 enum xdp_redirect_type type,
+		 const struct bpf_redirect_info *ri),
 
-	TP_ARGS(dev, xdp, tgt, err, map, index),
+	TP_ARGS(dev, xdp, tgt, err, type, ri),
 
 	TP_STRUCT__entry(
 		__field(int, prog_id)
@@ -111,12 +112,19 @@ DECLARE_EVENT_CLASS(xdp_redirect_template,
 	),
 
 	TP_fast_assign(
+		struct bpf_map *map = NULL;
+		u32 index = ri->tgt_index;
+
+		if (type == XDP_REDIR_DEV_MAP || type == XDP_REDIR_CPU_MAP ||
+		    type == XDP_REDIR_XSK_MAP)
+			map = READ_ONCE(ri->map);
+
 		__entry->prog_id	= xdp->aux->id;
 		__entry->act		= XDP_REDIRECT;
 		__entry->ifindex	= dev->ifindex;
 		__entry->err		= err;
 		__entry->to_ifindex	= map ? devmap_ifindex(tgt, map) :
-						index;
+						(u32)(long)tgt;
 		__entry->map_id		= map ? map->id : 0;
 		__entry->map_index	= map ? index : 0;
 	),
@@ -133,45 +141,49 @@ DEFINE_EVENT(xdp_redirect_template, xdp_redirect,
 	TP_PROTO(const struct net_device *dev,
 		 const struct bpf_prog *xdp,
 		 const void *tgt, int err,
-		 const struct bpf_map *map, u32 index),
-	TP_ARGS(dev, xdp, tgt, err, map, index)
+		 enum xdp_redirect_type type,
+		 const struct bpf_redirect_info *ri),
+	TP_ARGS(dev, xdp, tgt, err, type, ri)
 );
 
 DEFINE_EVENT(xdp_redirect_template, xdp_redirect_err,
 	TP_PROTO(const struct net_device *dev,
 		 const struct bpf_prog *xdp,
 		 const void *tgt, int err,
-		 const struct bpf_map *map, u32 index),
-	TP_ARGS(dev, xdp, tgt, err, map, index)
+		 enum xdp_redirect_type type,
+		 const struct bpf_redirect_info *ri),
+	TP_ARGS(dev, xdp, tgt, err, type, ri)
 );
 
 #define _trace_xdp_redirect(dev, xdp, to)				\
-	 trace_xdp_redirect(dev, xdp, NULL, 0, NULL, to)
+	trace_xdp_redirect(dev, xdp, NULL, 0, XDP_REDIR_DEV_IFINDEX, NULL)
 
 #define _trace_xdp_redirect_err(dev, xdp, to, err)			\
-	 trace_xdp_redirect_err(dev, xdp, NULL, err, NULL, to)
+	trace_xdp_redirect_err(dev, xdp, NULL, err, XDP_REDIR_DEV_IFINDEX, NULL)
 
-#define _trace_xdp_redirect_map(dev, xdp, to, map, index)		\
-	 trace_xdp_redirect(dev, xdp, to, 0, map, index)
+#define _trace_xdp_redirect_map(dev, xdp, to, type, ri)		\
+	trace_xdp_redirect(dev, xdp, to, 0, type, ri)
 
-#define _trace_xdp_redirect_map_err(dev, xdp, to, map, index, err)	\
-	 trace_xdp_redirect_err(dev, xdp, to, err, map, index)
+#define _trace_xdp_redirect_map_err(dev, xdp, to, type, ri, err)	\
+	trace_xdp_redirect_err(dev, xdp, to, err, type, ri)
 
 /* not used anymore, but kept around so as not to break old programs */
 DEFINE_EVENT(xdp_redirect_template, xdp_redirect_map,
 	TP_PROTO(const struct net_device *dev,
 		 const struct bpf_prog *xdp,
 		 const void *tgt, int err,
-		 const struct bpf_map *map, u32 index),
-	TP_ARGS(dev, xdp, tgt, err, map, index)
+		 enum xdp_redirect_type type,
+		 const struct bpf_redirect_info *ri),
+	TP_ARGS(dev, xdp, tgt, err, type, ri)
 );
 
 DEFINE_EVENT(xdp_redirect_template, xdp_redirect_map_err,
 	TP_PROTO(const struct net_device *dev,
 		 const struct bpf_prog *xdp,
 		 const void *tgt, int err,
-		 const struct bpf_map *map, u32 index),
-	TP_ARGS(dev, xdp, tgt, err, map, index)
+		 enum xdp_redirect_type type,
+		 const struct bpf_redirect_info *ri),
+	TP_ARGS(dev, xdp, tgt, err, type, ri)
 );
 
 TRACE_EVENT(xdp_cpumap_kthread,
diff --git a/net/core/filter.c b/net/core/filter.c
index 9ab94e90d660..5f31e21be531 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3923,23 +3923,6 @@ static const struct bpf_func_proto bpf_xdp_adjust_meta_proto = {
 	.arg2_type	= ARG_ANYTHING,
 };
 
-static int __bpf_tx_xdp_map(struct net_device *dev_rx, void *fwd,
-			    struct bpf_map *map, struct xdp_buff *xdp)
-{
-	switch (map->map_type) {
-	case BPF_MAP_TYPE_DEVMAP:
-	case BPF_MAP_TYPE_DEVMAP_HASH:
-		return dev_map_enqueue(fwd, xdp, dev_rx);
-	case BPF_MAP_TYPE_CPUMAP:
-		return cpu_map_enqueue(fwd, xdp, dev_rx);
-	case BPF_MAP_TYPE_XSKMAP:
-		return __xsk_map_redirect(fwd, xdp);
-	default:
-		return -EBADRQC;
-	}
-	return 0;
-}
-
 void xdp_do_flush(void)
 {
 	__dev_flush();
@@ -3948,22 +3931,6 @@ void xdp_do_flush(void)
 }
 EXPORT_SYMBOL_GPL(xdp_do_flush);
 
-static inline void *__xdp_map_lookup_elem(struct bpf_map *map, u32 index)
-{
-	switch (map->map_type) {
-	case BPF_MAP_TYPE_DEVMAP:
-		return __dev_map_lookup_elem(map, index);
-	case BPF_MAP_TYPE_DEVMAP_HASH:
-		return __dev_map_hash_lookup_elem(map, index);
-	case BPF_MAP_TYPE_CPUMAP:
-		return __cpu_map_lookup_elem(map, index);
-	case BPF_MAP_TYPE_XSKMAP:
-		return __xsk_map_lookup_elem(map, index);
-	default:
-		return NULL;
-	}
-}
-
 void bpf_clear_redirect_map(struct bpf_map *map)
 {
 	struct bpf_redirect_info *ri;
@@ -3985,34 +3952,42 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 		    struct bpf_prog *xdp_prog)
 {
 	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
-	struct bpf_map *map = READ_ONCE(ri->map);
-	u32 index = ri->tgt_index;
+	enum xdp_redirect_type type = ri->tgt_type;
 	void *fwd = ri->tgt_value;
 	int err;
 
-	ri->tgt_index = 0;
+	ri->tgt_type = XDP_REDIR_UNSET;
 	ri->tgt_value = NULL;
-	WRITE_ONCE(ri->map, NULL);
 
-	if (unlikely(!map)) {
-		fwd = dev_get_by_index_rcu(dev_net(dev), index);
+	switch (type) {
+	case XDP_REDIR_DEV_IFINDEX:
+		fwd = dev_get_by_index_rcu(dev_net(dev), (u32)(long)fwd);
 		if (unlikely(!fwd)) {
 			err = -EINVAL;
-			goto err;
+			break;
 		}
-
 		err = dev_xdp_enqueue(fwd, xdp, dev);
-	} else {
-		err = __bpf_tx_xdp_map(dev, fwd, map, xdp);
+		break;
+	case XDP_REDIR_DEV_MAP:
+		err = dev_map_enqueue(fwd, xdp, dev);
+		break;
+	case XDP_REDIR_CPU_MAP:
+		err = cpu_map_enqueue(fwd, xdp, dev);
+		break;
+	case XDP_REDIR_XSK_MAP:
+		err = __xsk_map_redirect(fwd, xdp);
+		break;
+	default:
+		err = -EBADRQC;
 	}
 
 	if (unlikely(err))
 		goto err;
 
-	_trace_xdp_redirect_map(dev, xdp_prog, fwd, map, index);
+	_trace_xdp_redirect_map(dev, xdp_prog, fwd, type, ri);
 	return 0;
 err:
-	_trace_xdp_redirect_map_err(dev, xdp_prog, fwd, map, index, err);
+	_trace_xdp_redirect_map_err(dev, xdp_prog, fwd, type, ri, err);
 	return err;
 }
 EXPORT_SYMBOL_GPL(xdp_do_redirect);
@@ -4021,41 +3996,40 @@ static int xdp_do_generic_redirect_map(struct net_device *dev,
 				       struct sk_buff *skb,
 				       struct xdp_buff *xdp,
 				       struct bpf_prog *xdp_prog,
-				       struct bpf_map *map)
+				       void *fwd,
+				       enum xdp_redirect_type type)
 {
 	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
-	u32 index = ri->tgt_index;
-	void *fwd = ri->tgt_value;
-	int err = 0;
-
-	ri->tgt_index = 0;
-	ri->tgt_value = NULL;
-	WRITE_ONCE(ri->map, NULL);
+	int err;
 
-	if (map->map_type == BPF_MAP_TYPE_DEVMAP ||
-	    map->map_type == BPF_MAP_TYPE_DEVMAP_HASH) {
+	switch (type) {
+	case XDP_REDIR_DEV_MAP: {
 		struct bpf_dtab_netdev *dst = fwd;
 
 		err = dev_map_generic_redirect(dst, skb, xdp_prog);
 		if (unlikely(err))
 			goto err;
-	} else if (map->map_type == BPF_MAP_TYPE_XSKMAP) {
+		break;
+	}
+	case XDP_REDIR_XSK_MAP: {
 		struct xdp_sock *xs = fwd;
 
 		err = xsk_generic_rcv(xs, xdp);
 		if (err)
 			goto err;
 		consume_skb(skb);
-	} else {
+		break;
+	}
+	default:
 		/* TODO: Handle BPF_MAP_TYPE_CPUMAP */
 		err = -EBADRQC;
 		goto err;
 	}
 
-	_trace_xdp_redirect_map(dev, xdp_prog, fwd, map, index);
+	_trace_xdp_redirect_map(dev, xdp_prog, fwd, type, ri);
 	return 0;
 err:
-	_trace_xdp_redirect_map_err(dev, xdp_prog, fwd, map, index, err);
+	_trace_xdp_redirect_map_err(dev, xdp_prog, fwd, type, ri, err);
 	return err;
 }
 
@@ -4063,29 +4037,31 @@ int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
 			    struct xdp_buff *xdp, struct bpf_prog *xdp_prog)
 {
 	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
-	struct bpf_map *map = READ_ONCE(ri->map);
-	u32 index = ri->tgt_index;
-	struct net_device *fwd;
+	enum xdp_redirect_type type = ri->tgt_type;
+	void *fwd = ri->tgt_value;
 	int err = 0;
 
-	if (map)
-		return xdp_do_generic_redirect_map(dev, skb, xdp, xdp_prog,
-						   map);
-	ri->tgt_index = 0;
-	fwd = dev_get_by_index_rcu(dev_net(dev), index);
-	if (unlikely(!fwd)) {
-		err = -EINVAL;
-		goto err;
-	}
+	ri->tgt_type = XDP_REDIR_UNSET;
+	ri->tgt_value = NULL;
 
-	err = xdp_ok_fwd_dev(fwd, skb->len);
-	if (unlikely(err))
-		goto err;
+	if (type == XDP_REDIR_DEV_IFINDEX) {
+		fwd = dev_get_by_index_rcu(dev_net(dev), (u32)(long)fwd);
+		if (unlikely(!fwd)) {
+			err = -EINVAL;
+			goto err;
+		}
 
-	skb->dev = fwd;
-	_trace_xdp_redirect(dev, xdp_prog, index);
-	generic_xdp_tx(skb, xdp_prog);
-	return 0;
+		err = xdp_ok_fwd_dev(fwd, skb->len);
+		if (unlikely(err))
+			goto err;
+
+		skb->dev = fwd;
+		_trace_xdp_redirect(dev, xdp_prog, index);
+		generic_xdp_tx(skb, xdp_prog);
+		return 0;
+	}
+
+	return xdp_do_generic_redirect_map(dev, skb, xdp, xdp_prog, fwd, type);
 err:
 	_trace_xdp_redirect_err(dev, xdp_prog, index, err);
 	return err;
@@ -4098,10 +4074,9 @@ BPF_CALL_2(bpf_xdp_redirect, u32, ifindex, u64, flags)
 	if (unlikely(flags))
 		return XDP_ABORTED;
 
-	ri->flags = flags;
-	ri->tgt_index = ifindex;
-	ri->tgt_value = NULL;
-	WRITE_ONCE(ri->map, NULL);
+	ri->tgt_type = XDP_REDIR_DEV_IFINDEX;
+	ri->tgt_index = 0;
+	ri->tgt_value = (void *)(long)ifindex;
 
 	return XDP_REDIRECT;
 }
@@ -4123,18 +4098,37 @@ BPF_CALL_3(bpf_xdp_redirect_map, struct bpf_map *, map, u32, ifindex,
 	if (unlikely(flags > XDP_TX))
 		return XDP_ABORTED;
 
-	ri->tgt_value = __xdp_map_lookup_elem(map, ifindex);
+	switch (map->map_type) {
+	case BPF_MAP_TYPE_DEVMAP:
+		ri->tgt_value = __dev_map_lookup_elem(map, ifindex);
+		ri->tgt_type = XDP_REDIR_DEV_MAP;
+		break;
+	case BPF_MAP_TYPE_DEVMAP_HASH:
+		ri->tgt_value = __dev_map_hash_lookup_elem(map, ifindex);
+		ri->tgt_type = XDP_REDIR_DEV_MAP;
+		break;
+	case BPF_MAP_TYPE_CPUMAP:
+		ri->tgt_value = __cpu_map_lookup_elem(map, ifindex);
+		ri->tgt_type = XDP_REDIR_CPU_MAP;
+		break;
+	case BPF_MAP_TYPE_XSKMAP:
+		ri->tgt_value = __xsk_map_lookup_elem(map, ifindex);
+		ri->tgt_type = XDP_REDIR_XSK_MAP;
+		break;
+	default:
+		ri->tgt_value = NULL;
+	}
+
 	if (unlikely(!ri->tgt_value)) {
 		/* If the lookup fails we want to clear out the state in the
 		 * redirect_info struct completely, so that if an eBPF program
 		 * performs multiple lookups, the last one always takes
 		 * precedence.
 		 */
-		WRITE_ONCE(ri->map, NULL);
+		ri->tgt_type = XDP_REDIR_UNSET;
 		return flags;
 	}
 
-	ri->flags = flags;
 	ri->tgt_index = ifindex;
 	WRITE_ONCE(ri->map, map);
 
-- 
2.27.0



* [PATCH bpf-next v2 2/8] xsk: remove explicit_free parameter from __xsk_rcv()
  2021-01-19 15:50 [PATCH bpf-next v2 0/8] Introduce bpf_redirect_xsk() helper Björn Töpel
  2021-01-19 15:50 ` [PATCH bpf-next v2 1/8] xdp: restructure redirect actions Björn Töpel
@ 2021-01-19 15:50 ` Björn Töpel
  2021-01-19 15:50 ` [PATCH bpf-next v2 3/8] xsk: fold xp_assign_dev and __xp_assign_dev Björn Töpel
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 45+ messages in thread
From: Björn Töpel @ 2021-01-19 15:50 UTC (permalink / raw)
  To: ast, daniel, netdev, bpf
  Cc: Björn Töpel, magnus.karlsson, maciej.fijalkowski, kuba,
	jonathan.lemon, maximmi, davem, hawk, john.fastabend,
	ciara.loftus, weqaar.a.janjua

From: Björn Töpel <bjorn.topel@intel.com>

The explicit_free parameter of the __xsk_rcv() function was used to
mark whether the call was via the generic XDP or the native XDP
path. Instead of cluttering the code with if-statements and
"true/false" parameters which are hard to understand, simply move the
explicit free to __xsk_map_redirect(), which is always called from the
native XDP path.

Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 net/xdp/xsk.c | 47 +++++++++++++++++++++++++++++++----------------
 1 file changed, 31 insertions(+), 16 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 8037b04a9edd..5820de65060b 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -184,12 +184,13 @@ static void xsk_copy_xdp(struct xdp_buff *to, struct xdp_buff *from, u32 len)
 	memcpy(to_buf, from_buf, len + metalen);
 }
 
-static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len,
-		     bool explicit_free)
+static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 {
 	struct xdp_buff *xsk_xdp;
 	int err;
+	u32 len;
 
+	len = xdp->data_end - xdp->data;
 	if (len > xsk_pool_get_rx_frame_size(xs->pool)) {
 		xs->rx_dropped++;
 		return -ENOSPC;
@@ -207,8 +208,6 @@ static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len,
 		xsk_buff_free(xsk_xdp);
 		return err;
 	}
-	if (explicit_free)
-		xdp_return_buff(xdp);
 	return 0;
 }
 
@@ -230,11 +229,8 @@ static bool xsk_is_bound(struct xdp_sock *xs)
 	return false;
 }
 
-static int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp,
-		   bool explicit_free)
+static int xsk_rcv_check(struct xdp_sock *xs, struct xdp_buff *xdp)
 {
-	u32 len;
-
 	if (!xsk_is_bound(xs))
 		return -EINVAL;
 
@@ -242,11 +238,7 @@ static int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp,
 		return -EINVAL;
 
 	sk_mark_napi_id_once_xdp(&xs->sk, xdp);
-	len = xdp->data_end - xdp->data;
-
-	return xdp->rxq->mem.type == MEM_TYPE_XSK_BUFF_POOL ?
-		__xsk_rcv_zc(xs, xdp, len) :
-		__xsk_rcv(xs, xdp, len, explicit_free);
+	return 0;
 }
 
 static void xsk_flush(struct xdp_sock *xs)
@@ -261,18 +253,41 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 	int err;
 
 	spin_lock_bh(&xs->rx_lock);
-	err = xsk_rcv(xs, xdp, false);
-	xsk_flush(xs);
+	err = xsk_rcv_check(xs, xdp);
+	if (!err) {
+		err = __xsk_rcv(xs, xdp);
+		xsk_flush(xs);
+	}
 	spin_unlock_bh(&xs->rx_lock);
 	return err;
 }
 
+static int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
+{
+	int err;
+	u32 len;
+
+	err = xsk_rcv_check(xs, xdp);
+	if (err)
+		return err;
+
+	if (xdp->rxq->mem.type == MEM_TYPE_XSK_BUFF_POOL) {
+		len = xdp->data_end - xdp->data;
+		return __xsk_rcv_zc(xs, xdp, len);
+	}
+
+	err = __xsk_rcv(xs, xdp);
+	if (!err)
+		xdp_return_buff(xdp);
+	return err;
+}
+
 int __xsk_map_redirect(struct xdp_sock *xs, struct xdp_buff *xdp)
 {
 	struct list_head *flush_list = this_cpu_ptr(&xskmap_flush_list);
 	int err;
 
-	err = xsk_rcv(xs, xdp, true);
+	err = xsk_rcv(xs, xdp);
 	if (err)
 		return err;
 
-- 
2.27.0



* [PATCH bpf-next v2 3/8] xsk: fold xp_assign_dev and __xp_assign_dev
  2021-01-19 15:50 [PATCH bpf-next v2 0/8] Introduce bpf_redirect_xsk() helper Björn Töpel
  2021-01-19 15:50 ` [PATCH bpf-next v2 1/8] xdp: restructure redirect actions Björn Töpel
  2021-01-19 15:50 ` [PATCH bpf-next v2 2/8] xsk: remove explicit_free parameter from __xsk_rcv() Björn Töpel
@ 2021-01-19 15:50 ` Björn Töpel
  2021-01-19 15:50 ` [PATCH bpf-next v2 4/8] xsk: register XDP sockets at bind(), and add new AF_XDP BPF helper Björn Töpel
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 45+ messages in thread
From: Björn Töpel @ 2021-01-19 15:50 UTC (permalink / raw)
  To: ast, daniel, netdev, bpf
  Cc: Björn Töpel, magnus.karlsson, maciej.fijalkowski, kuba,
	jonathan.lemon, maximmi, davem, hawk, john.fastabend,
	ciara.loftus, weqaar.a.janjua

From: Björn Töpel <bjorn.topel@intel.com>

Fold xp_assign_dev and __xp_assign_dev. The former directly calls the
latter.

Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 net/xdp/xsk_buff_pool.c | 12 +++---------
 1 file changed, 3 insertions(+), 9 deletions(-)

diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
index 20598eea658c..8de01aaac4a0 100644
--- a/net/xdp/xsk_buff_pool.c
+++ b/net/xdp/xsk_buff_pool.c
@@ -119,8 +119,8 @@ static void xp_disable_drv_zc(struct xsk_buff_pool *pool)
 	}
 }
 
-static int __xp_assign_dev(struct xsk_buff_pool *pool,
-			   struct net_device *netdev, u16 queue_id, u16 flags)
+int xp_assign_dev(struct xsk_buff_pool *pool,
+		  struct net_device *netdev, u16 queue_id, u16 flags)
 {
 	bool force_zc, force_copy;
 	struct netdev_bpf bpf;
@@ -191,12 +191,6 @@ static int __xp_assign_dev(struct xsk_buff_pool *pool,
 	return err;
 }
 
-int xp_assign_dev(struct xsk_buff_pool *pool, struct net_device *dev,
-		  u16 queue_id, u16 flags)
-{
-	return __xp_assign_dev(pool, dev, queue_id, flags);
-}
-
 int xp_assign_dev_shared(struct xsk_buff_pool *pool, struct xdp_umem *umem,
 			 struct net_device *dev, u16 queue_id)
 {
@@ -210,7 +204,7 @@ int xp_assign_dev_shared(struct xsk_buff_pool *pool, struct xdp_umem *umem,
 	if (pool->uses_need_wakeup)
 		flags |= XDP_USE_NEED_WAKEUP;
 
-	return __xp_assign_dev(pool, dev, queue_id, flags);
+	return xp_assign_dev(pool, dev, queue_id, flags);
 }
 
 void xp_clear_dev(struct xsk_buff_pool *pool)
-- 
2.27.0



* [PATCH bpf-next v2 4/8] xsk: register XDP sockets at bind(), and add new AF_XDP BPF helper
  2021-01-19 15:50 [PATCH bpf-next v2 0/8] Introduce bpf_redirect_xsk() helper Björn Töpel
                   ` (2 preceding siblings ...)
  2021-01-19 15:50 ` [PATCH bpf-next v2 3/8] xsk: fold xp_assign_dev and __xp_assign_dev Björn Töpel
@ 2021-01-19 15:50 ` Björn Töpel
  2021-01-20  8:25   ` kernel test robot
                     ` (2 more replies)
  2021-01-19 15:50 ` [PATCH bpf-next v2 5/8] libbpf, xsk: select AF_XDP BPF program based on kernel version Björn Töpel
                   ` (4 subsequent siblings)
  8 siblings, 3 replies; 45+ messages in thread
From: Björn Töpel @ 2021-01-19 15:50 UTC (permalink / raw)
  To: ast, daniel, netdev, bpf
  Cc: Björn Töpel, magnus.karlsson, maciej.fijalkowski, kuba,
	jonathan.lemon, maximmi, davem, hawk, john.fastabend,
	ciara.loftus, weqaar.a.janjua

From: Björn Töpel <bjorn.topel@intel.com>

Extend bind() for XDP sockets, so that the bound socket is added to
the netdev_rx_queue _rx array in the netdevice. We call this
registering an XDP socket. To redirect packets to a registered socket,
a new BPF helper is used: bpf_redirect_xsk().

For shared XDP sockets, only the first bound socket is
registered. Users that require more advanced setups continue to use
the XSKMAP and bpf_redirect_map().

Now, why would one use bpf_redirect_xsk() over the regular
bpf_redirect_map() helper? First: slightly better performance. Second:
convenience. Most users use one socket per queue. This scenario is
what registered sockets support. There is no need to create an
XSKMAP. This can also reduce complexity in containerized setups, where
users might want to use XDP sockets without the CAP_SYS_ADMIN
capability.
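No new userspace API is needed: with this patch, the existing bind()
path performs the registration. A hedged sketch of the userspace side
using libbpf's existing xsk API (interface name "eth0" and queue 0 are
placeholders; error handling and ring setup beyond creation are
elided):

```c
#include <sys/mman.h>
#include <bpf/xsk.h>

#define NUM_FRAMES 4096

/* Create a umem and bind one AF_XDP socket to queue 0 of "eth0".
 * The bind() performed inside xsk_socket__create() now also
 * registers the socket on the rx queue, so an XDP program can reach
 * it via bpf_redirect_xsk() without any XSKMAP.
 */
static int setup_xsk(struct xsk_socket **xsk)
{
	struct xsk_ring_prod fill, tx;
	struct xsk_ring_cons comp, rx;
	struct xsk_umem *umem;
	void *bufs;

	bufs = mmap(NULL, NUM_FRAMES * XSK_UMEM__DEFAULT_FRAME_SIZE,
		    PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (bufs == MAP_FAILED)
		return -1;

	if (xsk_umem__create(&umem, bufs,
			     NUM_FRAMES * XSK_UMEM__DEFAULT_FRAME_SIZE,
			     &fill, &comp, NULL))
		return -1;

	/* bind() happens here; first bound socket gets registered. */
	return xsk_socket__create(xsk, "eth0", 0, umem, &rx, &tx, NULL);
}
```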

Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 include/linux/filter.h         |  1 +
 include/linux/netdevice.h      |  1 +
 include/net/xdp_sock.h         | 12 +++++
 include/net/xsk_buff_pool.h    |  2 +-
 include/uapi/linux/bpf.h       |  7 +++
 net/core/filter.c              | 49 ++++++++++++++++--
 net/xdp/xsk.c                  | 93 ++++++++++++++++++++++++++++------
 net/xdp/xsk_buff_pool.c        |  4 +-
 tools/include/uapi/linux/bpf.h |  7 +++
 9 files changed, 153 insertions(+), 23 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 5fc336a271c2..3f9efbd08cba 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -648,6 +648,7 @@ enum xdp_redirect_type {
 	XDP_REDIR_DEV_MAP,
 	XDP_REDIR_CPU_MAP,
 	XDP_REDIR_XSK_MAP,
+	XDP_REDIR_XSK,
 };
 
 DECLARE_PER_CPU(struct bpf_redirect_info, bpf_redirect_info);
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 5b949076ed23..cb0e215e981c 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -749,6 +749,7 @@ struct netdev_rx_queue {
 	struct xdp_rxq_info		xdp_rxq;
 #ifdef CONFIG_XDP_SOCKETS
 	struct xsk_buff_pool            *pool;
+	struct xdp_sock			*xsk;
 #endif
 } ____cacheline_aligned_in_smp;
 
diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index cc17bc957548..97b21c483baf 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -77,8 +77,10 @@ struct xdp_sock {
 #ifdef CONFIG_XDP_SOCKETS
 
 int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp);
+int xsk_generic_redirect(struct net_device *dev, struct xdp_buff *xdp);
 int __xsk_map_redirect(struct xdp_sock *xs, struct xdp_buff *xdp);
 void __xsk_map_flush(void);
+int xsk_redirect(struct xdp_sock *xs, struct xdp_buff *xdp);
 
 static inline struct xdp_sock *__xsk_map_lookup_elem(struct bpf_map *map,
 						     u32 key)
@@ -100,6 +102,11 @@ static inline int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 	return -ENOTSUPP;
 }
 
+static inline int xsk_generic_redirect(struct net_device *dev, struct xdp_buff *xdp)
+{
+	return -EOPNOTSUPP;
+}
+
 static inline int __xsk_map_redirect(struct xdp_sock *xs, struct xdp_buff *xdp)
 {
 	return -EOPNOTSUPP;
@@ -115,6 +122,11 @@ static inline struct xdp_sock *__xsk_map_lookup_elem(struct bpf_map *map,
 	return NULL;
 }
 
+static inline int xsk_redirect(struct net_device *dev, struct xdp_buff *xdp)
+{
+	return -EOPNOTSUPP;
+}
+
 #endif /* CONFIG_XDP_SOCKETS */
 
 #endif /* _LINUX_XDP_SOCK_H */
diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
index eaa8386dbc63..bd531d561c60 100644
--- a/include/net/xsk_buff_pool.h
+++ b/include/net/xsk_buff_pool.h
@@ -84,7 +84,7 @@ struct xsk_buff_pool {
 /* AF_XDP core. */
 struct xsk_buff_pool *xp_create_and_assign_umem(struct xdp_sock *xs,
 						struct xdp_umem *umem);
-int xp_assign_dev(struct xsk_buff_pool *pool, struct net_device *dev,
+int xp_assign_dev(struct xdp_sock *xs, struct xsk_buff_pool *pool, struct net_device *dev,
 		  u16 queue_id, u16 flags);
 int xp_assign_dev_shared(struct xsk_buff_pool *pool, struct xdp_umem *umem,
 			 struct net_device *dev, u16 queue_id);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index c001766adcbc..bbc7d9a57262 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -3836,6 +3836,12 @@ union bpf_attr {
  *	Return
  *		A pointer to a struct socket on success or NULL if the file is
  *		not a socket.
+ *
+ * long bpf_redirect_xsk(struct xdp_buff *xdp_md, u64 action)
+ *	Description
+ *		Redirect to the registered AF_XDP socket.
+ *	Return
+ *		**XDP_REDIRECT** on success, otherwise the action parameter is returned.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -4001,6 +4007,7 @@ union bpf_attr {
 	FN(ktime_get_coarse_ns),	\
 	FN(ima_inode_hash),		\
 	FN(sock_from_file),		\
+	FN(redirect_xsk),		\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/net/core/filter.c b/net/core/filter.c
index 5f31e21be531..b457c83fba70 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3977,6 +3977,9 @@ int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 	case XDP_REDIR_XSK_MAP:
 		err = __xsk_map_redirect(fwd, xdp);
 		break;
+	case XDP_REDIR_XSK:
+		err = xsk_redirect(fwd, xdp);
+		break;
 	default:
 		err = -EBADRQC;
 	}
@@ -4044,25 +4047,33 @@ int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
 	ri->tgt_type = XDP_REDIR_UNSET;
 	ri->tgt_value = NULL;
 
-	if (type == XDP_REDIR_DEV_IFINDEX) {
+	switch (type) {
+	case XDP_REDIR_DEV_IFINDEX: {
 		fwd = dev_get_by_index_rcu(dev_net(dev), (u32)(long)fwd);
 		if (unlikely(!fwd)) {
 			err = -EINVAL;
-			goto err;
+			break;
 		}
 
 		err = xdp_ok_fwd_dev(fwd, skb->len);
 		if (unlikely(err))
-			goto err;
+			break;
 
 		skb->dev = fwd;
 		_trace_xdp_redirect(dev, xdp_prog, index);
 		generic_xdp_tx(skb, xdp_prog);
 		return 0;
 	}
+	case XDP_REDIR_XSK:
+		err = xsk_generic_redirect(dev, xdp);
+		if (err)
+			break;
+		consume_skb(skb);
+		break;
+	default:
+		return xdp_do_generic_redirect_map(dev, skb, xdp, xdp_prog, fwd, type);
+	}
 
-	return xdp_do_generic_redirect_map(dev, skb, xdp, xdp_prog, fwd, type);
-err:
 	_trace_xdp_redirect_err(dev, xdp_prog, index, err);
 	return err;
 }
@@ -4144,6 +4155,32 @@ static const struct bpf_func_proto bpf_xdp_redirect_map_proto = {
 	.arg3_type      = ARG_ANYTHING,
 };
 
+BPF_CALL_2(bpf_xdp_redirect_xsk, struct xdp_buff *, xdp, u64, action)
+{
+	struct net_device *dev = xdp->rxq->dev;
+	u32 queue_id = xdp->rxq->queue_index;
+	struct bpf_redirect_info *ri;
+	struct xdp_sock *xs;
+
+	xs = READ_ONCE(dev->_rx[queue_id].xsk);
+	if (!xs)
+		return action;
+
+	ri = this_cpu_ptr(&bpf_redirect_info);
+	ri->tgt_type = XDP_REDIR_XSK;
+	ri->tgt_value = xs;
+
+	return XDP_REDIRECT;
+}
+
+static const struct bpf_func_proto bpf_xdp_redirect_xsk_proto = {
+	.func           = bpf_xdp_redirect_xsk,
+	.gpl_only       = false,
+	.ret_type       = RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_ANYTHING,
+};
+
 static unsigned long bpf_skb_copy(void *dst_buff, const void *skb,
 				  unsigned long off, unsigned long len)
 {
@@ -7260,6 +7297,8 @@ xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	case BPF_FUNC_tcp_gen_syncookie:
 		return &bpf_tcp_gen_syncookie_proto;
 #endif
+	case BPF_FUNC_redirect_xsk:
+		return &bpf_xdp_redirect_xsk_proto;
 	default:
 		return bpf_sk_base_func_proto(func_id);
 	}
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 5820de65060b..79f1492e71e2 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -134,6 +134,28 @@ int xsk_reg_pool_at_qid(struct net_device *dev, struct xsk_buff_pool *pool,
 	return 0;
 }
 
+static struct xdp_sock *xsk_get_at_qid(struct net_device *dev, u16 queue_id)
+{
+	return READ_ONCE(dev->_rx[queue_id].xsk);
+}
+
+static void xsk_clear(struct xdp_sock *xs)
+{
+	struct net_device *dev = xs->dev;
+	u16 queue_id = xs->queue_id;
+
+	if (queue_id < dev->num_rx_queues)
+		WRITE_ONCE(dev->_rx[queue_id].xsk, NULL);
+}
+
+static void xsk_reg(struct xdp_sock *xs)
+{
+	struct net_device *dev = xs->dev;
+	u16 queue_id = xs->queue_id;
+
+	WRITE_ONCE(dev->_rx[queue_id].xsk, xs);
+}
+
 void xp_release(struct xdp_buff_xsk *xskb)
 {
 	xskb->pool->free_heads[xskb->pool->free_heads_cnt++] = xskb;
@@ -184,7 +206,7 @@ static void xsk_copy_xdp(struct xdp_buff *to, struct xdp_buff *from, u32 len)
 	memcpy(to_buf, from_buf, len + metalen);
 }
 
-static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
+static int ____xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 {
 	struct xdp_buff *xsk_xdp;
 	int err;
@@ -211,6 +233,22 @@ static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 	return 0;
 }
 
+static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
+{
+	int err;
+	u32 len;
+
+	if (likely(xdp->rxq->mem.type == MEM_TYPE_XSK_BUFF_POOL)) {
+		len = xdp->data_end - xdp->data;
+		return __xsk_rcv_zc(xs, xdp, len);
+	}
+
+	err = ____xsk_rcv(xs, xdp);
+	if (!err)
+		xdp_return_buff(xdp);
+	return err;
+}
+
 static bool xsk_tx_writeable(struct xdp_sock *xs)
 {
 	if (xskq_cons_present_entries(xs->tx) > xs->tx->nentries / 2)
@@ -248,6 +286,39 @@ static void xsk_flush(struct xdp_sock *xs)
 	sock_def_readable(&xs->sk);
 }
 
+int xsk_redirect(struct xdp_sock *xs, struct xdp_buff *xdp)
+{
+	struct list_head *flush_list = this_cpu_ptr(&xskmap_flush_list);
+	int err;
+
+	sk_mark_napi_id_once_xdp(&xs->sk, xdp);
+	err = __xsk_rcv(xs, xdp);
+	if (err)
+		return err;
+
+	if (!xs->flush_node.prev)
+		list_add(&xs->flush_node, flush_list);
+	return 0;
+}
+
+int xsk_generic_redirect(struct net_device *dev, struct xdp_buff *xdp)
+{
+	struct xdp_sock *xs;
+	u32 queue_id;
+	int err;
+
+	queue_id = xdp->rxq->queue_index;
+	xs = xsk_get_at_qid(dev, queue_id);
+	if (!xs)
+		return -EINVAL;
+
+	spin_lock_bh(&xs->rx_lock);
+	err = ____xsk_rcv(xs, xdp);
+	xsk_flush(xs);
+	spin_unlock_bh(&xs->rx_lock);
+	return err;
+}
+
 int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 {
 	int err;
@@ -255,7 +326,7 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 	spin_lock_bh(&xs->rx_lock);
 	err = xsk_rcv_check(xs, xdp);
 	if (!err) {
-		err = __xsk_rcv(xs, xdp);
+		err = ____xsk_rcv(xs, xdp);
 		xsk_flush(xs);
 	}
 	spin_unlock_bh(&xs->rx_lock);
@@ -264,22 +335,12 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 
 static int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 {
-	int err;
-	u32 len;
+	int err = xsk_rcv_check(xs, xdp);
 
-	err = xsk_rcv_check(xs, xdp);
 	if (err)
 		return err;
 
-	if (xdp->rxq->mem.type == MEM_TYPE_XSK_BUFF_POOL) {
-		len = xdp->data_end - xdp->data;
-		return __xsk_rcv_zc(xs, xdp, len);
-	}
-
-	err = __xsk_rcv(xs, xdp);
-	if (!err)
-		xdp_return_buff(xdp);
-	return err;
+	return __xsk_rcv(xs, xdp);
 }
 
 int __xsk_map_redirect(struct xdp_sock *xs, struct xdp_buff *xdp)
@@ -661,6 +722,7 @@ static void xsk_unbind_dev(struct xdp_sock *xs)
 
 	if (xs->state != XSK_BOUND)
 		return;
+	xsk_clear(xs);
 	WRITE_ONCE(xs->state, XSK_UNBOUND);
 
 	/* Wait for driver to stop using the xdp socket. */
@@ -892,7 +954,7 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
 			goto out_unlock;
 		}
 
-		err = xp_assign_dev(xs->pool, dev, qid, flags);
+		err = xp_assign_dev(xs, xs->pool, dev, qid, flags);
 		if (err) {
 			xp_destroy(xs->pool);
 			xs->pool = NULL;
@@ -918,6 +980,7 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
 		 */
 		smp_wmb();
 		WRITE_ONCE(xs->state, XSK_BOUND);
+		xsk_reg(xs);
 	}
 out_release:
 	mutex_unlock(&xs->mutex);
diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
index 8de01aaac4a0..af02a69d0bf7 100644
--- a/net/xdp/xsk_buff_pool.c
+++ b/net/xdp/xsk_buff_pool.c
@@ -119,7 +119,7 @@ static void xp_disable_drv_zc(struct xsk_buff_pool *pool)
 	}
 }
 
-int xp_assign_dev(struct xsk_buff_pool *pool,
+int xp_assign_dev(struct xdp_sock *xs, struct xsk_buff_pool *pool,
 		  struct net_device *netdev, u16 queue_id, u16 flags)
 {
 	bool force_zc, force_copy;
@@ -204,7 +204,7 @@ int xp_assign_dev_shared(struct xsk_buff_pool *pool, struct xdp_umem *umem,
 	if (pool->uses_need_wakeup)
 		flags |= XDP_USE_NEED_WAKEUP;
 
-	return xp_assign_dev(pool, dev, queue_id, flags);
+	return xp_assign_dev(NULL, pool, dev, queue_id, flags);
 }
 
 void xp_clear_dev(struct xsk_buff_pool *pool)
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index c001766adcbc..bbc7d9a57262 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -3836,6 +3836,12 @@ union bpf_attr {
  *	Return
  *		A pointer to a struct socket on success or NULL if the file is
  *		not a socket.
+ *
+ * long bpf_redirect_xsk(struct xdp_buff *xdp_md, u64 action)
+ *	Description
+ *		Redirect to the registered AF_XDP socket.
+ *	Return
+ *		**XDP_REDIRECT** on success, otherwise the action parameter is returned.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -4001,6 +4007,7 @@ union bpf_attr {
 	FN(ktime_get_coarse_ns),	\
 	FN(ima_inode_hash),		\
 	FN(sock_from_file),		\
+	FN(redirect_xsk),		\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
-- 
2.27.0


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH bpf-next v2 5/8] libbpf, xsk: select AF_XDP BPF program based on kernel version
  2021-01-19 15:50 [PATCH bpf-next v2 0/8] Introduce bpf_redirect_xsk() helper Björn Töpel
                   ` (3 preceding siblings ...)
  2021-01-19 15:50 ` [PATCH bpf-next v2 4/8] xsk: register XDP sockets at bind(), and add new AF_XDP BPF helper Björn Töpel
@ 2021-01-19 15:50 ` Björn Töpel
  2021-01-20 12:52   ` Toke Høiland-Jørgensen
  2021-01-19 15:50 ` [PATCH bpf-next v2 6/8] libbpf, xsk: select bpf_redirect_xsk(), if supported Björn Töpel
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 45+ messages in thread
From: Björn Töpel @ 2021-01-19 15:50 UTC (permalink / raw)
  To: ast, daniel, netdev, bpf
  Cc: Björn Töpel, magnus.karlsson, maciej.fijalkowski, kuba,
	jonathan.lemon, maximmi, davem, hawk, john.fastabend,
	ciara.loftus, weqaar.a.janjua, Marek Majtyka

From: Björn Töpel <bjorn.topel@intel.com>

Add kernel version detection, and adapt the default BPF program to
what the running kernel supports. This way, users get the best
possible performance from the BPF program.

Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Marek Majtyka <alardam@gmail.com>
---
 tools/lib/bpf/libbpf.c          |  2 +-
 tools/lib/bpf/libbpf_internal.h |  2 ++
 tools/lib/bpf/libbpf_probes.c   | 16 -------------
 tools/lib/bpf/xsk.c             | 41 ++++++++++++++++++++++++++++++---
 4 files changed, 41 insertions(+), 20 deletions(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 2abbc3800568..6a53adf14a9c 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -693,7 +693,7 @@ bpf_object__add_programs(struct bpf_object *obj, Elf_Data *sec_data,
 	return 0;
 }
 
-static __u32 get_kernel_version(void)
+__u32 get_kernel_version(void)
 {
 	__u32 major, minor, patch;
 	struct utsname info;
diff --git a/tools/lib/bpf/libbpf_internal.h b/tools/lib/bpf/libbpf_internal.h
index 969d0ac592ba..dafb780e2dd2 100644
--- a/tools/lib/bpf/libbpf_internal.h
+++ b/tools/lib/bpf/libbpf_internal.h
@@ -349,4 +349,6 @@ struct bpf_core_relo {
 	enum bpf_core_relo_kind kind;
 };
 
+__u32 get_kernel_version(void);
+
 #endif /* __LIBBPF_LIBBPF_INTERNAL_H */
diff --git a/tools/lib/bpf/libbpf_probes.c b/tools/lib/bpf/libbpf_probes.c
index ecaae2927ab8..aae0231371d0 100644
--- a/tools/lib/bpf/libbpf_probes.c
+++ b/tools/lib/bpf/libbpf_probes.c
@@ -48,22 +48,6 @@ static int get_vendor_id(int ifindex)
 	return strtol(buf, NULL, 0);
 }
 
-static int get_kernel_version(void)
-{
-	int version, subversion, patchlevel;
-	struct utsname utsn;
-
-	/* Return 0 on failure, and attempt to probe with empty kversion */
-	if (uname(&utsn))
-		return 0;
-
-	if (sscanf(utsn.release, "%d.%d.%d",
-		   &version, &subversion, &patchlevel) != 3)
-		return 0;
-
-	return (version << 16) + (subversion << 8) + patchlevel;
-}
-
 static void
 probe_load(enum bpf_prog_type prog_type, const struct bpf_insn *insns,
 	   size_t insns_cnt, char *buf, size_t buf_len, __u32 ifindex)
diff --git a/tools/lib/bpf/xsk.c b/tools/lib/bpf/xsk.c
index e3e41ceeb1bc..c8642c6cb5d6 100644
--- a/tools/lib/bpf/xsk.c
+++ b/tools/lib/bpf/xsk.c
@@ -23,6 +23,7 @@
 #include <linux/kernel.h>
 #include <linux/list.h>
 #include <linux/sockios.h>
+#include <linux/version.h>
 #include <net/if.h>
 #include <sys/ioctl.h>
 #include <sys/mman.h>
@@ -46,6 +47,11 @@
  #define PF_XDP AF_XDP
 #endif
 
+enum xsk_prog {
+	XSK_PROG_FALLBACK,
+	XSK_PROG_REDIRECT_FLAGS,
+};
+
 struct xsk_umem {
 	struct xsk_ring_prod *fill_save;
 	struct xsk_ring_cons *comp_save;
@@ -351,6 +357,13 @@ int xsk_umem__create_v0_0_2(struct xsk_umem **umem_ptr, void *umem_area,
 COMPAT_VERSION(xsk_umem__create_v0_0_2, xsk_umem__create, LIBBPF_0.0.2)
 DEFAULT_VERSION(xsk_umem__create_v0_0_4, xsk_umem__create, LIBBPF_0.0.4)
 
+static enum xsk_prog get_xsk_prog(void)
+{
+	__u32 kver = get_kernel_version();
+
+	return kver < KERNEL_VERSION(5, 3, 0) ? XSK_PROG_FALLBACK : XSK_PROG_REDIRECT_FLAGS;
+}
+
 static int xsk_load_xdp_prog(struct xsk_socket *xsk)
 {
 	static const int log_buf_size = 16 * 1024;
@@ -358,7 +371,7 @@ static int xsk_load_xdp_prog(struct xsk_socket *xsk)
 	char log_buf[log_buf_size];
 	int err, prog_fd;
 
-	/* This is the C-program:
+	/* This is the fallback C-program:
 	 * SEC("xdp_sock") int xdp_sock_prog(struct xdp_md *ctx)
 	 * {
 	 *     int ret, index = ctx->rx_queue_index;
@@ -414,9 +427,31 @@ static int xsk_load_xdp_prog(struct xsk_socket *xsk)
 		/* The jumps are to this instruction */
 		BPF_EXIT_INSN(),
 	};
-	size_t insns_cnt = sizeof(prog) / sizeof(struct bpf_insn);
 
-	prog_fd = bpf_load_program(BPF_PROG_TYPE_XDP, prog, insns_cnt,
+	/* This is the post-5.3 kernel C-program:
+	 * SEC("xdp_sock") int xdp_sock_prog(struct xdp_md *ctx)
+	 * {
+	 *     return bpf_redirect_map(&xsks_map, ctx->rx_queue_index, XDP_PASS);
+	 * }
+	 */
+	struct bpf_insn prog_redirect_flags[] = {
+		/* r2 = *(u32 *)(r1 + 16) */
+		BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_1, 16),
+		/* r1 = xskmap[] */
+		BPF_LD_MAP_FD(BPF_REG_1, ctx->xsks_map_fd),
+		/* r3 = XDP_PASS */
+		BPF_MOV64_IMM(BPF_REG_3, 2),
+		/* call bpf_redirect_map */
+		BPF_EMIT_CALL(BPF_FUNC_redirect_map),
+		BPF_EXIT_INSN(),
+	};
+	size_t insns_cnt[] = {sizeof(prog) / sizeof(struct bpf_insn),
+			      sizeof(prog_redirect_flags) / sizeof(struct bpf_insn),
+	};
+	struct bpf_insn *progs[] = {prog, prog_redirect_flags};
+	enum xsk_prog option = get_xsk_prog();
+
+	prog_fd = bpf_load_program(BPF_PROG_TYPE_XDP, progs[option], insns_cnt[option],
 				   "LGPL-2.1 or BSD-2-Clause", 0, log_buf,
 				   log_buf_size);
 	if (prog_fd < 0) {
-- 
2.27.0



* [PATCH bpf-next v2 6/8] libbpf, xsk: select bpf_redirect_xsk(), if supported
  2021-01-19 15:50 [PATCH bpf-next v2 0/8] Introduce bpf_redirect_xsk() helper Björn Töpel
                   ` (4 preceding siblings ...)
  2021-01-19 15:50 ` [PATCH bpf-next v2 5/8] libbpf, xsk: select AF_XDP BPF program based on kernel version Björn Töpel
@ 2021-01-19 15:50 ` Björn Töpel
  2021-01-19 15:50 ` [PATCH bpf-next v2 7/8] selftest/bpf: add XDP socket tests for bpf_redirect_{xsk, map}() Björn Töpel
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 45+ messages in thread
From: Björn Töpel @ 2021-01-19 15:50 UTC (permalink / raw)
  To: ast, daniel, netdev, bpf
  Cc: Björn Töpel, magnus.karlsson, maciej.fijalkowski, kuba,
	jonathan.lemon, maximmi, davem, hawk, john.fastabend,
	ciara.loftus, weqaar.a.janjua

From: Björn Töpel <bjorn.topel@intel.com>

Select bpf_redirect_xsk() as the default AF_XDP BPF program, if
supported.

The bpf_redirect_xsk() helper does not require an XSKMAP, so make sure
that no map is created/updated when using it.

Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 tools/lib/bpf/xsk.c | 46 +++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 42 insertions(+), 4 deletions(-)

diff --git a/tools/lib/bpf/xsk.c b/tools/lib/bpf/xsk.c
index c8642c6cb5d6..27e36d6d92a6 100644
--- a/tools/lib/bpf/xsk.c
+++ b/tools/lib/bpf/xsk.c
@@ -47,9 +47,12 @@
  #define PF_XDP AF_XDP
 #endif
 
+#define XSKMAP_NOT_NEEDED -1
+
 enum xsk_prog {
 	XSK_PROG_FALLBACK,
 	XSK_PROG_REDIRECT_FLAGS,
+	XSK_PROG_REDIRECT_XSK,
 };
 
 struct xsk_umem {
@@ -361,7 +364,11 @@ static enum xsk_prog get_xsk_prog(void)
 {
 	__u32 kver = get_kernel_version();
 
-	return kver < KERNEL_VERSION(5, 3, 0) ? XSK_PROG_FALLBACK : XSK_PROG_REDIRECT_FLAGS;
+	if (kver < KERNEL_VERSION(5, 3, 0))
+		return XSK_PROG_FALLBACK;
+	if (kver < KERNEL_VERSION(5, 12, 0))
+		return XSK_PROG_REDIRECT_FLAGS;
+	return XSK_PROG_REDIRECT_XSK;
 }
 
 static int xsk_load_xdp_prog(struct xsk_socket *xsk)
@@ -445,10 +452,25 @@ static int xsk_load_xdp_prog(struct xsk_socket *xsk)
 		BPF_EMIT_CALL(BPF_FUNC_redirect_map),
 		BPF_EXIT_INSN(),
 	};
+
+	/* This is the post-5.12 kernel C-program:
+	 * SEC("xdp_sock") int xdp_sock_prog(struct xdp_md *ctx)
+	 * {
+	 *     return bpf_redirect_xsk(ctx, XDP_PASS);
+	 * }
+	 */
+	struct bpf_insn prog_redirect_xsk[] = {
+		/* r2 = XDP_PASS */
+		BPF_MOV64_IMM(BPF_REG_2, 2),
+		/* call bpf_redirect_xsk */
+		BPF_EMIT_CALL(BPF_FUNC_redirect_xsk),
+		BPF_EXIT_INSN(),
+	};
 	size_t insns_cnt[] = {sizeof(prog) / sizeof(struct bpf_insn),
 			      sizeof(prog_redirect_flags) / sizeof(struct bpf_insn),
+			      sizeof(prog_redirect_xsk) / sizeof(struct bpf_insn),
 	};
-	struct bpf_insn *progs[] = {prog, prog_redirect_flags};
+	struct bpf_insn *progs[] = {prog, prog_redirect_flags, prog_redirect_xsk};
 	enum xsk_prog option = get_xsk_prog();
 
 	prog_fd = bpf_load_program(BPF_PROG_TYPE_XDP, progs[option], insns_cnt[option],
@@ -508,12 +530,22 @@ static int xsk_get_max_queues(struct xsk_socket *xsk)
 	return ret;
 }
 
+static bool xskmap_required(void)
+{
+	return get_xsk_prog() != XSK_PROG_REDIRECT_XSK;
+}
+
 static int xsk_create_bpf_maps(struct xsk_socket *xsk)
 {
 	struct xsk_ctx *ctx = xsk->ctx;
 	int max_queues;
 	int fd;
 
+	if (!xskmap_required()) {
+		ctx->xsks_map_fd = XSKMAP_NOT_NEEDED;
+		return 0;
+	}
+
 	max_queues = xsk_get_max_queues(xsk);
 	if (max_queues < 0)
 		return max_queues;
@@ -532,6 +564,9 @@ static void xsk_delete_bpf_maps(struct xsk_socket *xsk)
 {
 	struct xsk_ctx *ctx = xsk->ctx;
 
+	if (ctx->xsks_map_fd == XSKMAP_NOT_NEEDED)
+		return;
+
 	bpf_map_delete_elem(ctx->xsks_map_fd, &ctx->queue_id);
 	close(ctx->xsks_map_fd);
 }
@@ -563,7 +598,7 @@ static int xsk_lookup_bpf_maps(struct xsk_socket *xsk)
 	if (err)
 		goto out_map_ids;
 
-	ctx->xsks_map_fd = -1;
+	ctx->xsks_map_fd = XSKMAP_NOT_NEEDED;
 
 	for (i = 0; i < prog_info.nr_map_ids; i++) {
 		fd = bpf_map_get_fd_by_id(map_ids[i]);
@@ -585,7 +620,7 @@ static int xsk_lookup_bpf_maps(struct xsk_socket *xsk)
 	}
 
 	err = 0;
-	if (ctx->xsks_map_fd == -1)
+	if (ctx->xsks_map_fd == XSKMAP_NOT_NEEDED && xskmap_required())
 		err = -ENOENT;
 
 out_map_ids:
@@ -597,6 +632,9 @@ static int xsk_set_bpf_maps(struct xsk_socket *xsk)
 {
 	struct xsk_ctx *ctx = xsk->ctx;
 
+	if (ctx->xsks_map_fd == XSKMAP_NOT_NEEDED)
+		return 0;
+
 	return bpf_map_update_elem(ctx->xsks_map_fd, &ctx->queue_id,
 				   &xsk->fd, 0);
 }
-- 
2.27.0



* [PATCH bpf-next v2 7/8] selftest/bpf: add XDP socket tests for bpf_redirect_{xsk, map}()
  2021-01-19 15:50 [PATCH bpf-next v2 0/8] Introduce bpf_redirect_xsk() helper Björn Töpel
                   ` (5 preceding siblings ...)
  2021-01-19 15:50 ` [PATCH bpf-next v2 6/8] libbpf, xsk: select bpf_redirect_xsk(), if supported Björn Töpel
@ 2021-01-19 15:50 ` Björn Töpel
  2021-01-21  7:39   ` Andrii Nakryiko
  2021-01-19 15:50 ` [PATCH bpf-next v2 8/8] selftest/bpf: remove a lot of ifobject casting in xdpxceiver Björn Töpel
  2021-01-20 13:15 ` [PATCH bpf-next v2 0/8] Introduce bpf_redirect_xsk() helper Maxim Mikityanskiy
  8 siblings, 1 reply; 45+ messages in thread
From: Björn Töpel @ 2021-01-19 15:50 UTC (permalink / raw)
  To: ast, daniel, netdev, bpf
  Cc: Björn Töpel, magnus.karlsson, maciej.fijalkowski, kuba,
	jonathan.lemon, maximmi, davem, hawk, john.fastabend,
	ciara.loftus, weqaar.a.janjua

From: Björn Töpel <bjorn.topel@intel.com>

Add support for externally loaded XDP programs to
xdpxceiver/test_xsk.sh, so that bpf_redirect_xsk() and
bpf_redirect_map() can be exercised.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 .../selftests/bpf/progs/xdpxceiver_ext1.c     | 15 ++++
 .../selftests/bpf/progs/xdpxceiver_ext2.c     |  9 +++
 tools/testing/selftests/bpf/test_xsk.sh       | 48 ++++++++++++
 tools/testing/selftests/bpf/xdpxceiver.c      | 77 ++++++++++++++++++-
 tools/testing/selftests/bpf/xdpxceiver.h      |  2 +
 5 files changed, 147 insertions(+), 4 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/progs/xdpxceiver_ext1.c
 create mode 100644 tools/testing/selftests/bpf/progs/xdpxceiver_ext2.c

diff --git a/tools/testing/selftests/bpf/progs/xdpxceiver_ext1.c b/tools/testing/selftests/bpf/progs/xdpxceiver_ext1.c
new file mode 100644
index 000000000000..18894040cca6
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/xdpxceiver_ext1.c
@@ -0,0 +1,15 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+
+struct {
+	__uint(type, BPF_MAP_TYPE_XSKMAP);
+	__uint(max_entries, 32);
+	__uint(key_size, sizeof(int));
+	__uint(value_size, sizeof(int));
+} xsks_map SEC(".maps");
+
+SEC("xdp_sock") int xdp_sock_prog(struct xdp_md *ctx)
+{
+	return bpf_redirect_map(&xsks_map, ctx->rx_queue_index, XDP_DROP);
+}
diff --git a/tools/testing/selftests/bpf/progs/xdpxceiver_ext2.c b/tools/testing/selftests/bpf/progs/xdpxceiver_ext2.c
new file mode 100644
index 000000000000..bd239b958c01
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/xdpxceiver_ext2.c
@@ -0,0 +1,9 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/bpf.h>
+#include <bpf/bpf_helpers.h>
+
+SEC("xdp_sock") int xdp_sock_prog(struct xdp_md *ctx)
+{
+	return bpf_redirect_xsk(ctx, XDP_DROP);
+}
+
diff --git a/tools/testing/selftests/bpf/test_xsk.sh b/tools/testing/selftests/bpf/test_xsk.sh
index 88a7483eaae4..3a3996edf527 100755
--- a/tools/testing/selftests/bpf/test_xsk.sh
+++ b/tools/testing/selftests/bpf/test_xsk.sh
@@ -245,6 +245,54 @@ retval=$?
 test_status $retval "${TEST_NAME}"
 statusList+=($retval)
 
+### TEST 10
+TEST_NAME="SKB EXT BPF_REDIRECT_MAP"
+
+vethXDPgeneric ${VETH0} ${VETH1} ${NS1}
+
+params=("-S" "--ext-prog1")
+execxdpxceiver params
+
+retval=$?
+test_status $retval "${TEST_NAME}"
+statusList+=($retval)
+
+### TEST 11
+TEST_NAME="DRV EXT BPF_REDIRECT_MAP"
+
+vethXDPnative ${VETH0} ${VETH1} ${NS1}
+
+params=("-N" "--ext-prog1")
+execxdpxceiver params
+
+retval=$?
+test_status $retval "${TEST_NAME}"
+statusList+=($retval)
+
+### TEST 12
+TEST_NAME="SKB EXT BPF_REDIRECT_XSK"
+
+vethXDPgeneric ${VETH0} ${VETH1} ${NS1}
+
+params=("-S" "--ext-prog2")
+execxdpxceiver params
+
+retval=$?
+test_status $retval "${TEST_NAME}"
+statusList+=($retval)
+
+### TEST 13
+TEST_NAME="DRV EXT BPF_REDIRECT_XSK"
+
+vethXDPnative ${VETH0} ${VETH1} ${NS1}
+
+params=("-N" "--ext-prog2")
+execxdpxceiver params
+
+retval=$?
+test_status $retval "${TEST_NAME}"
+statusList+=($retval)
+
 ## END TESTS
 
 cleanup_exit ${VETH0} ${VETH1} ${NS1}
diff --git a/tools/testing/selftests/bpf/xdpxceiver.c b/tools/testing/selftests/bpf/xdpxceiver.c
index 1e722ee76b1f..fd0852fdd97d 100644
--- a/tools/testing/selftests/bpf/xdpxceiver.c
+++ b/tools/testing/selftests/bpf/xdpxceiver.c
@@ -45,7 +45,7 @@
  *    - Only copy mode is supported because veth does not currently support
  *      zero-copy mode
  *
- * Total tests: 8
+ * Total tests: 13
  *
  * Flow:
  * -----
@@ -93,6 +93,7 @@ typedef __u16 __sum16;
 #include <unistd.h>
 #include <stdatomic.h>
 #include <bpf/xsk.h>
+#include <bpf/bpf.h>
 #include "xdpxceiver.h"
 #include "../kselftest.h"
 
@@ -296,6 +297,23 @@ static void xsk_populate_fill_ring(struct xsk_umem_info *umem)
 	xsk_ring_prod__submit(&umem->fq, XSK_RING_PROD__DEFAULT_NUM_DESCS);
 }
 
+static int update_xskmap(struct bpf_object *obj, struct xsk_socket_info *xsk)
+{
+	int xskmap, fd, key = opt_queue;
+	struct bpf_map *map;
+
+	map = bpf_object__find_map_by_name(obj, "xsks_map");
+	xskmap = bpf_map__fd(map);
+	if (xskmap < 0)
+		return 0;
+
+	fd = xsk_socket__fd(xsk->xsk);
+	if (bpf_map_update_elem(xskmap, &key, &fd, 0))
+		return -1;
+
+	return 0;
+}
+
 static int xsk_configure_socket(struct ifobject *ifobject)
 {
 	struct xsk_socket_config cfg;
@@ -310,7 +328,7 @@ static int xsk_configure_socket(struct ifobject *ifobject)
 	ifobject->xsk->umem = ifobject->umem;
 	cfg.rx_size = XSK_RING_CONS__DEFAULT_NUM_DESCS;
 	cfg.tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS;
-	cfg.libbpf_flags = 0;
+	cfg.libbpf_flags = ifobject->obj ? XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD : 0;
 	cfg.xdp_flags = opt_xdp_flags;
 	cfg.bind_flags = opt_xdp_bind_flags;
 
@@ -328,6 +346,11 @@ static int xsk_configure_socket(struct ifobject *ifobject)
 	if (ret)
 		return 1;
 
+	if (ifobject->obj) {
+		if (update_xskmap(ifobject->obj, ifobject->xsk))
+			exit_with_error(errno);
+	}
+
 	return 0;
 }
 
@@ -342,6 +365,8 @@ static struct option long_options[] = {
 	{"bidi", optional_argument, 0, 'B'},
 	{"debug", optional_argument, 0, 'D'},
 	{"tx-pkt-count", optional_argument, 0, 'C'},
+	{"ext-prog1", no_argument, 0, 1},
+	{"ext-prog2", no_argument, 0, 1},
 	{0, 0, 0, 0}
 };
 
@@ -441,9 +466,30 @@ static int validate_interfaces(void)
 	return ret;
 }
 
+static int load_xdp_program(char *argv0, struct bpf_object **obj, int ext_prog)
+{
+	struct bpf_prog_load_attr prog_load_attr = {
+		.prog_type      = BPF_PROG_TYPE_XDP,
+	};
+	char xdp_filename[256];
+	int prog_fd;
+
+	snprintf(xdp_filename, sizeof(xdp_filename), "%s_ext%d.o", argv0, ext_prog);
+	prog_load_attr.file = xdp_filename;
+
+	if (bpf_prog_load_xattr(&prog_load_attr, obj, &prog_fd))
+		return -1;
+	return prog_fd;
+}
+
+static int attach_xdp_program(int ifindex, int prog_fd)
+{
+	return bpf_set_link_xdp_fd(ifindex, prog_fd, opt_xdp_flags);
+}
+
 static void parse_command_line(int argc, char **argv)
 {
-	int option_index, interface_index = 0, c;
+	int option_index = 0, interface_index = 0, ext_prog = 0, c;
 
 	opterr = 0;
 
@@ -454,6 +500,9 @@ static void parse_command_line(int argc, char **argv)
 			break;
 
 		switch (c) {
+		case 1:
+			ext_prog = atoi(long_options[option_index].name + strlen("ext-prog"));
+			break;
 		case 'i':
 			if (interface_index == MAX_INTERFACES)
 				break;
@@ -509,6 +558,22 @@ static void parse_command_line(int argc, char **argv)
 		usage(basename(argv[0]));
 		ksft_exit_xfail();
 	}
+
+	if (ext_prog) {
+		struct bpf_object *obj;
+		int prog_fd;
+
+		for (int i = 0; i < MAX_INTERFACES; i++) {
+			prog_fd = load_xdp_program(argv[0], &obj, ext_prog);
+			if (prog_fd < 0) {
+				ksft_test_result_fail("ERROR: could not load ext XDP program\n");
+				ksft_exit_xfail();
+			}
+
+			ifdict[i]->prog_fd = prog_fd;
+			ifdict[i]->obj = obj;
+		}
+	}
 }
 
 static void kick_tx(struct xsk_socket_info *xsk)
@@ -818,6 +883,7 @@ static void *worker_testapp_validate(void *arg)
 	struct generic_data *data = (struct generic_data *)malloc(sizeof(struct generic_data));
 	struct iphdr *ip_hdr = (struct iphdr *)(pkt_data + sizeof(struct ethhdr));
 	struct ethhdr *eth_hdr = (struct ethhdr *)pkt_data;
+	struct ifobject *ifobject = (struct ifobject *)arg;
 	void *bufs = NULL;
 
 	pthread_attr_setstacksize(&attr, THREAD_STACK);
@@ -830,6 +896,9 @@ static void *worker_testapp_validate(void *arg)
 
 		if (strcmp(((struct ifobject *)arg)->nsname, ""))
 			switch_namespace(((struct ifobject *)arg)->ifdict_index);
+
+		if (ifobject->obj && attach_xdp_program(ifobject->ifindex, ifobject->prog_fd) < 0)
+			exit_with_error(errno);
 	}
 
 	if (((struct ifobject *)arg)->fv.vector == tx) {
@@ -1035,7 +1104,7 @@ int main(int argc, char **argv)
 	ifaceconfig->src_port = UDP_SRC_PORT;
 
 	for (int i = 0; i < MAX_INTERFACES; i++) {
-		ifdict[i] = (struct ifobject *)malloc(sizeof(struct ifobject));
+		ifdict[i] = (struct ifobject *)calloc(1, sizeof(struct ifobject));
 		if (!ifdict[i])
 			exit_with_error(errno);
 
diff --git a/tools/testing/selftests/bpf/xdpxceiver.h b/tools/testing/selftests/bpf/xdpxceiver.h
index 61f595b6f200..3c15c2e95026 100644
--- a/tools/testing/selftests/bpf/xdpxceiver.h
+++ b/tools/testing/selftests/bpf/xdpxceiver.h
@@ -124,6 +124,8 @@ struct ifobject {
 	u32 src_ip;
 	u16 src_port;
 	u16 dst_port;
+	int prog_fd;
+	struct bpf_object *obj;
 };
 
 static struct ifobject *ifdict[MAX_INTERFACES];
-- 
2.27.0



* [PATCH bpf-next v2 8/8] selftest/bpf: remove a lot of ifobject casting in xdpxceiver
  2021-01-19 15:50 [PATCH bpf-next v2 0/8] Introduce bpf_redirect_xsk() helper Björn Töpel
                   ` (6 preceding siblings ...)
  2021-01-19 15:50 ` [PATCH bpf-next v2 7/8] selftest/bpf: add XDP socket tests for bpf_redirect_{xsk, map}() Björn Töpel
@ 2021-01-19 15:50 ` Björn Töpel
  2021-01-20 13:15 ` [PATCH bpf-next v2 0/8] Introduce bpf_redirect_xsk() helper Maxim Mikityanskiy
  8 siblings, 0 replies; 45+ messages in thread
From: Björn Töpel @ 2021-01-19 15:50 UTC (permalink / raw)
  To: ast, daniel, netdev, bpf
  Cc: Björn Töpel, magnus.karlsson, maciej.fijalkowski, kuba,
	jonathan.lemon, maximmi, davem, hawk, john.fastabend,
	ciara.loftus, weqaar.a.janjua

From: Björn Töpel <bjorn.topel@intel.com>

Instead of passing void * all over the place, let us pass the actual
type (ifobject) and remove the void-ptr-to-type-ptr casting.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 tools/testing/selftests/bpf/xdpxceiver.c | 87 ++++++++++++------------
 1 file changed, 42 insertions(+), 45 deletions(-)

diff --git a/tools/testing/selftests/bpf/xdpxceiver.c b/tools/testing/selftests/bpf/xdpxceiver.c
index fd0852fdd97d..7734fc87124f 100644
--- a/tools/testing/selftests/bpf/xdpxceiver.c
+++ b/tools/testing/selftests/bpf/xdpxceiver.c
@@ -225,14 +225,14 @@ static inline u16 udp_csum(u32 saddr, u32 daddr, u32 len, u8 proto, u16 *udp_pkt
 	return csum_tcpudp_magic(saddr, daddr, len, proto, csum);
 }
 
-static void gen_eth_hdr(void *data, struct ethhdr *eth_hdr)
+static void gen_eth_hdr(struct ifobject *ifobject, struct ethhdr *eth_hdr)
 {
-	memcpy(eth_hdr->h_dest, ((struct ifobject *)data)->dst_mac, ETH_ALEN);
-	memcpy(eth_hdr->h_source, ((struct ifobject *)data)->src_mac, ETH_ALEN);
+	memcpy(eth_hdr->h_dest, ifobject->dst_mac, ETH_ALEN);
+	memcpy(eth_hdr->h_source, ifobject->src_mac, ETH_ALEN);
 	eth_hdr->h_proto = htons(ETH_P_IP);
 }
 
-static void gen_ip_hdr(void *data, struct iphdr *ip_hdr)
+static void gen_ip_hdr(struct ifobject *ifobject, struct iphdr *ip_hdr)
 {
 	ip_hdr->version = IP_PKT_VER;
 	ip_hdr->ihl = 0x5;
@@ -242,15 +242,15 @@ static void gen_ip_hdr(void *data, struct iphdr *ip_hdr)
 	ip_hdr->frag_off = 0;
 	ip_hdr->ttl = IPDEFTTL;
 	ip_hdr->protocol = IPPROTO_UDP;
-	ip_hdr->saddr = ((struct ifobject *)data)->src_ip;
-	ip_hdr->daddr = ((struct ifobject *)data)->dst_ip;
+	ip_hdr->saddr = ifobject->src_ip;
+	ip_hdr->daddr = ifobject->dst_ip;
 	ip_hdr->check = 0;
 }
 
-static void gen_udp_hdr(void *data, void *arg, struct udphdr *udp_hdr)
+static void gen_udp_hdr(void *data, struct ifobject *ifobject, struct udphdr *udp_hdr)
 {
-	udp_hdr->source = htons(((struct ifobject *)arg)->src_port);
-	udp_hdr->dest = htons(((struct ifobject *)arg)->dst_port);
+	udp_hdr->source = htons(ifobject->src_port);
+	udp_hdr->dest = htons(ifobject->dst_port);
 	udp_hdr->len = htons(UDP_PKT_SIZE);
 	memset32_htonl(pkt_data + PKT_HDR_SIZE,
 		       htonl(((struct generic_data *)data)->seqnum), UDP_PKT_DATA_SIZE);
@@ -693,28 +693,27 @@ static inline int get_batch_size(int pkt_cnt)
 	return opt_pkt_count - pkt_cnt;
 }
 
-static void complete_tx_only_all(void *arg)
+static void complete_tx_only_all(struct ifobject *ifobject)
 {
 	bool pending;
 
 	do {
 		pending = false;
-		if (((struct ifobject *)arg)->xsk->outstanding_tx) {
-			complete_tx_only(((struct ifobject *)
-					  arg)->xsk, BATCH_SIZE);
-			pending = !!((struct ifobject *)arg)->xsk->outstanding_tx;
+		if (ifobject->xsk->outstanding_tx) {
+			complete_tx_only(ifobject->xsk, BATCH_SIZE);
+			pending = !!ifobject->xsk->outstanding_tx;
 		}
 	} while (pending);
 }
 
-static void tx_only_all(void *arg)
+static void tx_only_all(struct ifobject *ifobject)
 {
 	struct pollfd fds[MAX_SOCKS] = { };
 	u32 frame_nb = 0;
 	int pkt_cnt = 0;
 	int ret;
 
-	fds[0].fd = xsk_socket__fd(((struct ifobject *)arg)->xsk->xsk);
+	fds[0].fd = xsk_socket__fd(ifobject->xsk->xsk);
 	fds[0].events = POLLOUT;
 
 	while ((opt_pkt_count && pkt_cnt < opt_pkt_count) || !opt_pkt_count) {
@@ -729,12 +728,12 @@ static void tx_only_all(void *arg)
 				continue;
 		}
 
-		tx_only(((struct ifobject *)arg)->xsk, &frame_nb, batch_size);
+		tx_only(ifobject->xsk, &frame_nb, batch_size);
 		pkt_cnt += batch_size;
 	}
 
 	if (opt_pkt_count)
-		complete_tx_only_all(arg);
+		complete_tx_only_all(ifobject);
 }
 
 static void worker_pkt_dump(void)
@@ -845,14 +844,14 @@ static void worker_pkt_validate(void)
 	}
 }
 
-static void thread_common_ops(void *arg, void *bufs, pthread_mutex_t *mutexptr,
+static void thread_common_ops(struct ifobject *ifobject, void *bufs, pthread_mutex_t *mutexptr,
 			      atomic_int *spinningptr)
 {
 	int ctr = 0;
 	int ret;
 
-	xsk_configure_umem((struct ifobject *)arg, bufs, num_frames * XSK_UMEM__DEFAULT_FRAME_SIZE);
-	ret = xsk_configure_socket((struct ifobject *)arg);
+	xsk_configure_umem(ifobject, bufs, num_frames * XSK_UMEM__DEFAULT_FRAME_SIZE);
+	ret = xsk_configure_socket(ifobject);
 
 	/* Retry Create Socket if it fails as xsk_socket__create()
 	 * is asynchronous
@@ -863,9 +862,8 @@ static void thread_common_ops(void *arg, void *bufs, pthread_mutex_t *mutexptr,
 	pthread_mutex_lock(mutexptr);
 	while (ret && ctr < SOCK_RECONF_CTR) {
 		atomic_store(spinningptr, 1);
-		xsk_configure_umem((struct ifobject *)arg,
-				   bufs, num_frames * XSK_UMEM__DEFAULT_FRAME_SIZE);
-		ret = xsk_configure_socket((struct ifobject *)arg);
+		xsk_configure_umem(ifobject, bufs, num_frames * XSK_UMEM__DEFAULT_FRAME_SIZE);
+		ret = xsk_configure_socket(ifobject);
 		usleep(USLEEP_MAX);
 		ctr++;
 	}
@@ -894,52 +892,51 @@ static void *worker_testapp_validate(void *arg)
 		if (bufs == MAP_FAILED)
 			exit_with_error(errno);
 
-		if (strcmp(((struct ifobject *)arg)->nsname, ""))
-			switch_namespace(((struct ifobject *)arg)->ifdict_index);
+		if (strcmp(ifobject->nsname, ""))
+			switch_namespace(ifobject->ifdict_index);
 
 		if (ifobject->obj && attach_xdp_program(ifobject->ifindex, ifobject->prog_fd) < 0)
 			exit_with_error(errno);
 	}
 
-	if (((struct ifobject *)arg)->fv.vector == tx) {
+	if (ifobject->fv.vector == tx) {
 		int spinningrxctr = 0;
 
 		if (!bidi_pass)
-			thread_common_ops(arg, bufs, &sync_mutex_tx, &spinning_tx);
+			thread_common_ops(ifobject, bufs, &sync_mutex_tx, &spinning_tx);
 
 		while (atomic_load(&spinning_rx) && spinningrxctr < SOCK_RECONF_CTR) {
 			spinningrxctr++;
 			usleep(USLEEP_MAX);
 		}
 
-		ksft_print_msg("Interface [%s] vector [Tx]\n", ((struct ifobject *)arg)->ifname);
+		ksft_print_msg("Interface [%s] vector [Tx]\n", ifobject->ifname);
 		for (int i = 0; i < num_frames; i++) {
 			/*send EOT frame */
 			if (i == (num_frames - 1))
 				data->seqnum = -1;
 			else
 				data->seqnum = i;
-			gen_udp_hdr((void *)data, (void *)arg, udp_hdr);
-			gen_ip_hdr((void *)arg, ip_hdr);
+			gen_udp_hdr((void *)data, ifobject, udp_hdr);
+			gen_ip_hdr(ifobject, ip_hdr);
 			gen_udp_csum(udp_hdr, ip_hdr);
-			gen_eth_hdr((void *)arg, eth_hdr);
-			gen_eth_frame(((struct ifobject *)arg)->umem,
-				      i * XSK_UMEM__DEFAULT_FRAME_SIZE);
+			gen_eth_hdr(ifobject, eth_hdr);
+			gen_eth_frame(ifobject->umem, i * XSK_UMEM__DEFAULT_FRAME_SIZE);
 		}
 
 		free(data);
 		ksft_print_msg("Sending %d packets on interface %s\n",
-			       (opt_pkt_count - 1), ((struct ifobject *)arg)->ifname);
-		tx_only_all(arg);
-	} else if (((struct ifobject *)arg)->fv.vector == rx) {
+			       (opt_pkt_count - 1), ifobject->ifname);
+		tx_only_all(ifobject);
+	} else if (ifobject->fv.vector == rx) {
 		struct pollfd fds[MAX_SOCKS] = { };
 		int ret;
 
 		if (!bidi_pass)
-			thread_common_ops(arg, bufs, &sync_mutex_tx, &spinning_rx);
+			thread_common_ops(ifobject, bufs, &sync_mutex_tx, &spinning_rx);
 
-		ksft_print_msg("Interface [%s] vector [Rx]\n", ((struct ifobject *)arg)->ifname);
-		xsk_populate_fill_ring(((struct ifobject *)arg)->umem);
+		ksft_print_msg("Interface [%s] vector [Rx]\n", ifobject->ifname);
+		xsk_populate_fill_ring(ifobject->umem);
 
 		TAILQ_INIT(&head);
 		if (debug_pkt_dump) {
@@ -948,7 +945,7 @@ static void *worker_testapp_validate(void *arg)
 				exit_with_error(errno);
 		}
 
-		fds[0].fd = xsk_socket__fd(((struct ifobject *)arg)->xsk->xsk);
+		fds[0].fd = xsk_socket__fd(ifobject->xsk->xsk);
 		fds[0].events = POLLIN;
 
 		pthread_mutex_lock(&sync_mutex);
@@ -961,7 +958,7 @@ static void *worker_testapp_validate(void *arg)
 				if (ret <= 0)
 					continue;
 			}
-			rx_pkt(((struct ifobject *)arg)->xsk, fds);
+			rx_pkt(ifobject->xsk, fds);
 			worker_pkt_validate();
 
 			if (sigvar)
@@ -969,15 +966,15 @@ static void *worker_testapp_validate(void *arg)
 		}
 
 		ksft_print_msg("Received %d packets on interface %s\n",
-			       pkt_counter, ((struct ifobject *)arg)->ifname);
+			       pkt_counter, ifobject->ifname);
 
 		if (opt_teardown)
 			ksft_print_msg("Destroying socket\n");
 	}
 
 	if (!opt_bidi || (opt_bidi && bidi_pass)) {
-		xsk_socket__delete(((struct ifobject *)arg)->xsk->xsk);
-		(void)xsk_umem__delete(((struct ifobject *)arg)->umem->umem);
+		xsk_socket__delete(ifobject->xsk->xsk);
+		(void)xsk_umem__delete(ifobject->umem->umem);
 	}
 	pthread_exit(NULL);
 }
-- 
2.27.0


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH bpf-next v2 4/8] xsk: register XDP sockets at bind(), and add new AF_XDP BPF helper
  2021-01-19 15:50 ` [PATCH bpf-next v2 4/8] xsk: register XDP sockets at bind(), and add new AF_XDP BPF helper Björn Töpel
@ 2021-01-20  8:25   ` kernel test robot
  2021-01-20  8:41     ` Björn Töpel
  2021-01-20  8:50   ` kernel test robot
  2021-01-20 12:50   ` Toke Høiland-Jørgensen
  2 siblings, 1 reply; 45+ messages in thread
From: kernel test robot @ 2021-01-20  8:25 UTC (permalink / raw)
  To: Björn Töpel, ast, daniel, netdev, bpf
  Cc: kbuild-all, Björn Töpel, magnus.karlsson,
	maciej.fijalkowski, kuba, jonathan.lemon, maximmi

[-- Attachment #1: Type: text/plain, Size: 9746 bytes --]

Hi Björn,

I love your patch! Yet something to improve:

[auto build test ERROR on 95204c9bfa48d2f4d3bab7df55c1cc823957ff81]

url:    https://github.com/0day-ci/linux/commits/Bj-rn-T-pel/Introduce-bpf_redirect_xsk-helper/20210120-150357
base:    95204c9bfa48d2f4d3bab7df55c1cc823957ff81
config: x86_64-randconfig-m031-20210120 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-15) 9.3.0
reproduce (this is a W=1 build):
        # https://github.com/0day-ci/linux/commit/419b1341d7980ee57fb10f4306719eef3c1a15f8
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Bj-rn-T-pel/Introduce-bpf_redirect_xsk-helper/20210120-150357
        git checkout 419b1341d7980ee57fb10f4306719eef3c1a15f8
        # save the attached .config to linux build tree
        make W=1 ARCH=x86_64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   In file included from <command-line>:
   net/core/filter.c: In function '____bpf_xdp_redirect_xsk':
>> net/core/filter.c:4165:35: error: 'struct netdev_rx_queue' has no member named 'xsk'
    4165 |  xs = READ_ONCE(dev->_rx[queue_id].xsk);
         |                                   ^
   include/linux/compiler_types.h:300:9: note: in definition of macro '__compiletime_assert'
     300 |   if (!(condition))     \
         |         ^~~~~~~~~
   include/linux/compiler_types.h:320:2: note: in expansion of macro '_compiletime_assert'
     320 |  _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |  ^~~~~~~~~~~~~~~~~~~
   include/asm-generic/rwonce.h:36:2: note: in expansion of macro 'compiletime_assert'
      36 |  compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long), \
         |  ^~~~~~~~~~~~~~~~~~
   include/asm-generic/rwonce.h:36:21: note: in expansion of macro '__native_word'
      36 |  compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long), \
         |                     ^~~~~~~~~~~~~
   include/asm-generic/rwonce.h:49:2: note: in expansion of macro 'compiletime_assert_rwonce_type'
      49 |  compiletime_assert_rwonce_type(x);    \
         |  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   net/core/filter.c:4165:7: note: in expansion of macro 'READ_ONCE'
    4165 |  xs = READ_ONCE(dev->_rx[queue_id].xsk);
         |       ^~~~~~~~~
>> net/core/filter.c:4165:35: error: 'struct netdev_rx_queue' has no member named 'xsk'
    4165 |  xs = READ_ONCE(dev->_rx[queue_id].xsk);
         |                                   ^
   include/linux/compiler_types.h:300:9: note: in definition of macro '__compiletime_assert'
     300 |   if (!(condition))     \
         |         ^~~~~~~~~
   include/linux/compiler_types.h:320:2: note: in expansion of macro '_compiletime_assert'
     320 |  _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |  ^~~~~~~~~~~~~~~~~~~
   include/asm-generic/rwonce.h:36:2: note: in expansion of macro 'compiletime_assert'
      36 |  compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long), \
         |  ^~~~~~~~~~~~~~~~~~
   include/asm-generic/rwonce.h:36:21: note: in expansion of macro '__native_word'
      36 |  compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long), \
         |                     ^~~~~~~~~~~~~
   include/asm-generic/rwonce.h:49:2: note: in expansion of macro 'compiletime_assert_rwonce_type'
      49 |  compiletime_assert_rwonce_type(x);    \
         |  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   net/core/filter.c:4165:7: note: in expansion of macro 'READ_ONCE'
    4165 |  xs = READ_ONCE(dev->_rx[queue_id].xsk);
         |       ^~~~~~~~~
>> net/core/filter.c:4165:35: error: 'struct netdev_rx_queue' has no member named 'xsk'
    4165 |  xs = READ_ONCE(dev->_rx[queue_id].xsk);
         |                                   ^
   include/linux/compiler_types.h:300:9: note: in definition of macro '__compiletime_assert'
     300 |   if (!(condition))     \
         |         ^~~~~~~~~
   include/linux/compiler_types.h:320:2: note: in expansion of macro '_compiletime_assert'
     320 |  _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |  ^~~~~~~~~~~~~~~~~~~
   include/asm-generic/rwonce.h:36:2: note: in expansion of macro 'compiletime_assert'
      36 |  compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long), \
         |  ^~~~~~~~~~~~~~~~~~
   include/asm-generic/rwonce.h:36:21: note: in expansion of macro '__native_word'
      36 |  compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long), \
         |                     ^~~~~~~~~~~~~
   include/asm-generic/rwonce.h:49:2: note: in expansion of macro 'compiletime_assert_rwonce_type'
      49 |  compiletime_assert_rwonce_type(x);    \
         |  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   net/core/filter.c:4165:7: note: in expansion of macro 'READ_ONCE'
    4165 |  xs = READ_ONCE(dev->_rx[queue_id].xsk);
         |       ^~~~~~~~~
>> net/core/filter.c:4165:35: error: 'struct netdev_rx_queue' has no member named 'xsk'
    4165 |  xs = READ_ONCE(dev->_rx[queue_id].xsk);
         |                                   ^
   include/linux/compiler_types.h:300:9: note: in definition of macro '__compiletime_assert'
     300 |   if (!(condition))     \
         |         ^~~~~~~~~
   include/linux/compiler_types.h:320:2: note: in expansion of macro '_compiletime_assert'
     320 |  _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |  ^~~~~~~~~~~~~~~~~~~
   include/asm-generic/rwonce.h:36:2: note: in expansion of macro 'compiletime_assert'
      36 |  compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long), \
         |  ^~~~~~~~~~~~~~~~~~
   include/asm-generic/rwonce.h:36:21: note: in expansion of macro '__native_word'
      36 |  compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long), \
         |                     ^~~~~~~~~~~~~
   include/asm-generic/rwonce.h:49:2: note: in expansion of macro 'compiletime_assert_rwonce_type'
      49 |  compiletime_assert_rwonce_type(x);    \
         |  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   net/core/filter.c:4165:7: note: in expansion of macro 'READ_ONCE'
    4165 |  xs = READ_ONCE(dev->_rx[queue_id].xsk);
         |       ^~~~~~~~~
>> net/core/filter.c:4165:35: error: 'struct netdev_rx_queue' has no member named 'xsk'
    4165 |  xs = READ_ONCE(dev->_rx[queue_id].xsk);
         |                                   ^
   include/linux/compiler_types.h:300:9: note: in definition of macro '__compiletime_assert'
     300 |   if (!(condition))     \
         |         ^~~~~~~~~
   include/linux/compiler_types.h:320:2: note: in expansion of macro '_compiletime_assert'
     320 |  _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |  ^~~~~~~~~~~~~~~~~~~
   include/asm-generic/rwonce.h:36:2: note: in expansion of macro 'compiletime_assert'
      36 |  compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long), \
         |  ^~~~~~~~~~~~~~~~~~
   include/asm-generic/rwonce.h:49:2: note: in expansion of macro 'compiletime_assert_rwonce_type'
      49 |  compiletime_assert_rwonce_type(x);    \
         |  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   net/core/filter.c:4165:7: note: in expansion of macro 'READ_ONCE'
    4165 |  xs = READ_ONCE(dev->_rx[queue_id].xsk);
         |       ^~~~~~~~~
>> net/core/filter.c:4165:35: error: 'struct netdev_rx_queue' has no member named 'xsk'
    4165 |  xs = READ_ONCE(dev->_rx[queue_id].xsk);
         |                                   ^
   include/linux/compiler_types.h:271:13: note: in definition of macro '__unqual_scalar_typeof'
     271 |   _Generic((x),      \
         |             ^
   include/asm-generic/rwonce.h:50:2: note: in expansion of macro '__READ_ONCE'
      50 |  __READ_ONCE(x);       \
         |  ^~~~~~~~~~~
   net/core/filter.c:4165:7: note: in expansion of macro 'READ_ONCE'
    4165 |  xs = READ_ONCE(dev->_rx[queue_id].xsk);
         |       ^~~~~~~~~
   In file included from ./arch/x86/include/generated/asm/rwonce.h:1,
                    from include/linux/compiler.h:246,
                    from include/linux/export.h:43,
                    from include/linux/linkage.h:7,
                    from include/linux/kernel.h:7,
                    from include/linux/list.h:9,
                    from include/linux/module.h:12,
                    from net/core/filter.c:20:
>> net/core/filter.c:4165:35: error: 'struct netdev_rx_queue' has no member named 'xsk'
    4165 |  xs = READ_ONCE(dev->_rx[queue_id].xsk);
         |                                   ^
   include/asm-generic/rwonce.h:44:72: note: in definition of macro '__READ_ONCE'
      44 | #define __READ_ONCE(x) (*(const volatile __unqual_scalar_typeof(x) *)&(x))
         |                                                                        ^
   net/core/filter.c:4165:7: note: in expansion of macro 'READ_ONCE'
    4165 |  xs = READ_ONCE(dev->_rx[queue_id].xsk);
         |       ^~~~~~~~~


vim +4165 net/core/filter.c

  4157	
  4158	BPF_CALL_2(bpf_xdp_redirect_xsk, struct xdp_buff *, xdp, u64, action)
  4159	{
  4160		struct net_device *dev = xdp->rxq->dev;
  4161		u32 queue_id = xdp->rxq->queue_index;
  4162		struct bpf_redirect_info *ri;
  4163		struct xdp_sock *xs;
  4164	
> 4165		xs = READ_ONCE(dev->_rx[queue_id].xsk);
  4166		if (!xs)
  4167			return action;
  4168	
  4169		ri = this_cpu_ptr(&bpf_redirect_info);
  4170		ri->tgt_type = XDP_REDIR_XSK;
  4171		ri->tgt_value = xs;
  4172	
  4173		return XDP_REDIRECT;
  4174	}
  4175	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 35638 bytes --]


* Re: [PATCH bpf-next v2 4/8] xsk: register XDP sockets at bind(), and add new AF_XDP BPF helper
  2021-01-20  8:25   ` kernel test robot
@ 2021-01-20  8:41     ` Björn Töpel
  0 siblings, 0 replies; 45+ messages in thread
From: Björn Töpel @ 2021-01-20  8:41 UTC (permalink / raw)
  To: kernel test robot, Björn Töpel, ast, daniel, netdev, bpf
  Cc: kbuild-all, magnus.karlsson, maciej.fijalkowski, kuba,
	jonathan.lemon, maximmi

On 2021-01-20 09:25, kernel test robot wrote:
> Hi Björn,
> 
> I love your patch! Yet something to improve:
> 
> [auto build test ERROR on 95204c9bfa48d2f4d3bab7df55c1cc823957ff81]
> 
> url:    https://github.com/0day-ci/linux/commits/Bj-rn-T-pel/Introduce-bpf_redirect_xsk-helper/20210120-150357
> base:    95204c9bfa48d2f4d3bab7df55c1cc823957ff81
> config: x86_64-randconfig-m031-20210120 (attached as .config)
> compiler: gcc-9 (Debian 9.3.0-15) 9.3.0
> reproduce (this is a W=1 build):
>          # https://github.com/0day-ci/linux/commit/419b1341d7980ee57fb10f4306719eef3c1a15f8
>          git remote add linux-review https://github.com/0day-ci/linux
>          git fetch --no-tags linux-review Bj-rn-T-pel/Introduce-bpf_redirect_xsk-helper/20210120-150357
>          git checkout 419b1341d7980ee57fb10f4306719eef3c1a15f8
>          # save the attached .config to linux build tree
>          make W=1 ARCH=x86_64
> 
> If you fix the issue, kindly add following tag as appropriate
> Reported-by: kernel test robot <lkp@intel.com>
> 
> All errors (new ones prefixed by >>):
> 
>     In file included from <command-line>:
>     net/core/filter.c: In function '____bpf_xdp_redirect_xsk':
>>> net/core/filter.c:4165:35: error: 'struct netdev_rx_queue' has no member named 'xsk'

This rev is missing proper CONFIG_XDP_SOCKETS ifdefs. I'll fix this in
the next spin, but I'll wait a bit for more comments.


Björn

[...]


* Re: [PATCH bpf-next v2 4/8] xsk: register XDP sockets at bind(), and add new AF_XDP BPF helper
  2021-01-19 15:50 ` [PATCH bpf-next v2 4/8] xsk: register XDP sockets at bind(), and add new AF_XDP BPF helper Björn Töpel
  2021-01-20  8:25   ` kernel test robot
@ 2021-01-20  8:50   ` kernel test robot
  2021-01-20 12:50   ` Toke Høiland-Jørgensen
  2 siblings, 0 replies; 45+ messages in thread
From: kernel test robot @ 2021-01-20  8:50 UTC (permalink / raw)
  To: Björn Töpel, ast, daniel, netdev, bpf
  Cc: kbuild-all, clang-built-linux, Björn Töpel,
	magnus.karlsson, maciej.fijalkowski, kuba, jonathan.lemon,
	maximmi

[-- Attachment #1: Type: text/plain, Size: 11291 bytes --]

Hi Björn,

I love your patch! Yet something to improve:

[auto build test ERROR on 95204c9bfa48d2f4d3bab7df55c1cc823957ff81]

url:    https://github.com/0day-ci/linux/commits/Bj-rn-T-pel/Introduce-bpf_redirect_xsk-helper/20210120-150357
base:    95204c9bfa48d2f4d3bab7df55c1cc823957ff81
config: arm-randconfig-r013-20210120 (attached as .config)
compiler: clang version 12.0.0 (https://github.com/llvm/llvm-project 22b68440e1647e16b5ee24b924986207173c02d1)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install arm cross compiling tool for clang build
        # apt-get install binutils-arm-linux-gnueabi
        # https://github.com/0day-ci/linux/commit/419b1341d7980ee57fb10f4306719eef3c1a15f8
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Bj-rn-T-pel/Introduce-bpf_redirect_xsk-helper/20210120-150357
        git checkout 419b1341d7980ee57fb10f4306719eef3c1a15f8
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=arm 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

>> net/core/filter.c:4165:36: error: no member named 'xsk' in 'struct netdev_rx_queue'
           xs = READ_ONCE(dev->_rx[queue_id].xsk);
                          ~~~~~~~~~~~~~~~~~~ ^
   include/asm-generic/rwonce.h:49:33: note: expanded from macro 'READ_ONCE'
           compiletime_assert_rwonce_type(x);                              \
                                          ^
   include/asm-generic/rwonce.h:36:35: note: expanded from macro 'compiletime_assert_rwonce_type'
           compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
                                            ^
   include/linux/compiler_types.h:282:10: note: expanded from macro '__native_word'
           (sizeof(t) == sizeof(char) || sizeof(t) == sizeof(short) || \
                   ^
   include/linux/compiler_types.h:320:22: note: expanded from macro 'compiletime_assert'
           _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
                               ^~~~~~~~~
   include/linux/compiler_types.h:308:23: note: expanded from macro '_compiletime_assert'
           __compiletime_assert(condition, msg, prefix, suffix)
                                ^~~~~~~~~
   include/linux/compiler_types.h:300:9: note: expanded from macro '__compiletime_assert'
                   if (!(condition))                                       \
                         ^~~~~~~~~
>> net/core/filter.c:4165:36: error: no member named 'xsk' in 'struct netdev_rx_queue'
           xs = READ_ONCE(dev->_rx[queue_id].xsk);
                          ~~~~~~~~~~~~~~~~~~ ^
   include/asm-generic/rwonce.h:49:33: note: expanded from macro 'READ_ONCE'
           compiletime_assert_rwonce_type(x);                              \
                                          ^
   include/asm-generic/rwonce.h:36:35: note: expanded from macro 'compiletime_assert_rwonce_type'
           compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
                                            ^
   include/linux/compiler_types.h:282:39: note: expanded from macro '__native_word'
           (sizeof(t) == sizeof(char) || sizeof(t) == sizeof(short) || \
                                                ^
   include/linux/compiler_types.h:320:22: note: expanded from macro 'compiletime_assert'
           _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
                               ^~~~~~~~~
   include/linux/compiler_types.h:308:23: note: expanded from macro '_compiletime_assert'
           __compiletime_assert(condition, msg, prefix, suffix)
                                ^~~~~~~~~
   include/linux/compiler_types.h:300:9: note: expanded from macro '__compiletime_assert'
                   if (!(condition))                                       \
                         ^~~~~~~~~
>> net/core/filter.c:4165:36: error: no member named 'xsk' in 'struct netdev_rx_queue'
           xs = READ_ONCE(dev->_rx[queue_id].xsk);
                          ~~~~~~~~~~~~~~~~~~ ^
   include/asm-generic/rwonce.h:49:33: note: expanded from macro 'READ_ONCE'
           compiletime_assert_rwonce_type(x);                              \
                                          ^
   include/asm-generic/rwonce.h:36:35: note: expanded from macro 'compiletime_assert_rwonce_type'
           compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
                                            ^
   include/linux/compiler_types.h:283:10: note: expanded from macro '__native_word'
            sizeof(t) == sizeof(int) || sizeof(t) == sizeof(long))
                   ^
   include/linux/compiler_types.h:320:22: note: expanded from macro 'compiletime_assert'
           _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
                               ^~~~~~~~~
   include/linux/compiler_types.h:308:23: note: expanded from macro '_compiletime_assert'
           __compiletime_assert(condition, msg, prefix, suffix)
                                ^~~~~~~~~
   include/linux/compiler_types.h:300:9: note: expanded from macro '__compiletime_assert'
                   if (!(condition))                                       \
                         ^~~~~~~~~
>> net/core/filter.c:4165:36: error: no member named 'xsk' in 'struct netdev_rx_queue'
           xs = READ_ONCE(dev->_rx[queue_id].xsk);
                          ~~~~~~~~~~~~~~~~~~ ^
   include/asm-generic/rwonce.h:49:33: note: expanded from macro 'READ_ONCE'
           compiletime_assert_rwonce_type(x);                              \
                                          ^
   include/asm-generic/rwonce.h:36:35: note: expanded from macro 'compiletime_assert_rwonce_type'
           compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
                                            ^
   include/linux/compiler_types.h:283:38: note: expanded from macro '__native_word'
            sizeof(t) == sizeof(int) || sizeof(t) == sizeof(long))
                                               ^
   include/linux/compiler_types.h:320:22: note: expanded from macro 'compiletime_assert'
           _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
                               ^~~~~~~~~
   include/linux/compiler_types.h:308:23: note: expanded from macro '_compiletime_assert'
           __compiletime_assert(condition, msg, prefix, suffix)
                                ^~~~~~~~~
   include/linux/compiler_types.h:300:9: note: expanded from macro '__compiletime_assert'
                   if (!(condition))                                       \
                         ^~~~~~~~~
>> net/core/filter.c:4165:36: error: no member named 'xsk' in 'struct netdev_rx_queue'
           xs = READ_ONCE(dev->_rx[queue_id].xsk);
                          ~~~~~~~~~~~~~~~~~~ ^
   include/asm-generic/rwonce.h:49:33: note: expanded from macro 'READ_ONCE'
           compiletime_assert_rwonce_type(x);                              \
                                          ^
   include/asm-generic/rwonce.h:36:48: note: expanded from macro 'compiletime_assert_rwonce_type'
           compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
                                                         ^
   include/linux/compiler_types.h:320:22: note: expanded from macro 'compiletime_assert'
           _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
                               ^~~~~~~~~
   include/linux/compiler_types.h:308:23: note: expanded from macro '_compiletime_assert'
           __compiletime_assert(condition, msg, prefix, suffix)
                                ^~~~~~~~~
   include/linux/compiler_types.h:300:9: note: expanded from macro '__compiletime_assert'
                   if (!(condition))                                       \
                         ^~~~~~~~~
>> net/core/filter.c:4165:36: error: no member named 'xsk' in 'struct netdev_rx_queue'
           xs = READ_ONCE(dev->_rx[queue_id].xsk);
                          ~~~~~~~~~~~~~~~~~~ ^
   include/asm-generic/rwonce.h:50:14: note: expanded from macro 'READ_ONCE'
           __READ_ONCE(x);                                                 \
                       ^
   include/asm-generic/rwonce.h:44:65: note: expanded from macro '__READ_ONCE'
   #define __READ_ONCE(x)  (*(const volatile __unqual_scalar_typeof(x) *)&(x))
                                                                    ^
   include/linux/compiler_types.h:271:13: note: expanded from macro '__unqual_scalar_typeof'
                   _Generic((x),                                           \
                             ^
>> net/core/filter.c:4165:36: error: no member named 'xsk' in 'struct netdev_rx_queue'
           xs = READ_ONCE(dev->_rx[queue_id].xsk);
                          ~~~~~~~~~~~~~~~~~~ ^
   include/asm-generic/rwonce.h:50:14: note: expanded from macro 'READ_ONCE'
           __READ_ONCE(x);                                                 \
                       ^
   include/asm-generic/rwonce.h:44:65: note: expanded from macro '__READ_ONCE'
   #define __READ_ONCE(x)  (*(const volatile __unqual_scalar_typeof(x) *)&(x))
                                                                    ^
   include/linux/compiler_types.h:278:15: note: expanded from macro '__unqual_scalar_typeof'
                            default: (x)))
                                      ^
>> net/core/filter.c:4165:36: error: no member named 'xsk' in 'struct netdev_rx_queue'
           xs = READ_ONCE(dev->_rx[queue_id].xsk);
                          ~~~~~~~~~~~~~~~~~~ ^
   include/asm-generic/rwonce.h:50:14: note: expanded from macro 'READ_ONCE'
           __READ_ONCE(x);                                                 \
                       ^
   include/asm-generic/rwonce.h:44:72: note: expanded from macro '__READ_ONCE'
   #define __READ_ONCE(x)  (*(const volatile __unqual_scalar_typeof(x) *)&(x))
                                                                           ^
>> net/core/filter.c:4165:5: error: assigning to 'struct xdp_sock *' from incompatible type 'void'
           xs = READ_ONCE(dev->_rx[queue_id].xsk);
              ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   9 errors generated.


vim +4165 net/core/filter.c

  4157	
  4158	BPF_CALL_2(bpf_xdp_redirect_xsk, struct xdp_buff *, xdp, u64, action)
  4159	{
  4160		struct net_device *dev = xdp->rxq->dev;
  4161		u32 queue_id = xdp->rxq->queue_index;
  4162		struct bpf_redirect_info *ri;
  4163		struct xdp_sock *xs;
  4164	
> 4165		xs = READ_ONCE(dev->_rx[queue_id].xsk);
  4166		if (!xs)
  4167			return action;
  4168	
  4169		ri = this_cpu_ptr(&bpf_redirect_info);
  4170		ri->tgt_type = XDP_REDIR_XSK;
  4171		ri->tgt_value = xs;
  4172	
  4173		return XDP_REDIRECT;
  4174	}
  4175	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 32144 bytes --]


* Re: [PATCH bpf-next v2 1/8] xdp: restructure redirect actions
  2021-01-19 15:50 ` [PATCH bpf-next v2 1/8] xdp: restructure redirect actions Björn Töpel
@ 2021-01-20 12:44   ` Toke Høiland-Jørgensen
  2021-01-20 13:40     ` Björn Töpel
  0 siblings, 1 reply; 45+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-01-20 12:44 UTC (permalink / raw)
  To: Björn Töpel, ast, daniel, netdev, bpf
  Cc: Björn Töpel, magnus.karlsson, maciej.fijalkowski, kuba,
	jonathan.lemon, maximmi, davem, hawk, john.fastabend,
	ciara.loftus, weqaar.a.janjua

Björn Töpel <bjorn.topel@gmail.com> writes:

> From: Björn Töpel <bjorn.topel@intel.com>
>
> The XDP_REDIRECT implementations for maps and non-maps are fairly
> similar, but obviously need to take different code paths depending on
> if the target is using a map or not. Today, the redirect targets for
> XDP either uses a map, or is based on ifindex.
>
> Future commits will introduce yet another redirect target via a
> new helper, bpf_redirect_xsk(). To pave the way for that, we introduce
> an explicit redirect type to bpf_redirect_info. This makes the code
> easier to follow, and makes it easier to add new redirect targets.
>
> Further, using an explicit type in bpf_redirect_info has a slight
> positive performance impact by avoiding a pointer indirection for the
> map type lookup, and instead use the hot cacheline for
> bpf_redirect_info.
>
> The bpf_redirect_info flags member is not used by XDP, and not
> read/written any more. The map member is only written to when
> required/used, and not unconditionally.

I like the simplification. However, the handling of map clearing becomes
a bit murky with this change:

You're not changing anything in bpf_clear_redirect_map(), and you're
removing most of the reads and writes of ri->map. Instead,
bpf_xdp_redirect_map() will store the bpf_dtab_netdev pointer in
ri->tgt_value, which xdp_do_redirect() will just read and use without
checking. But if the map element (or the entire map) has been freed in
the meantime that will be a dangling pointer. I *think* the RCU callback
in dev_map_delete_elem() and the rcu_barrier() in dev_map_free()
protects against this, but that is by no means obvious. So confirming
this, and explaining it in a comment would be good.

Also, as far as I can tell after this, ri->map is only used for the
tracepoint. So how about just storing the map ID and getting rid of the
READ/WRITE_ONCE() entirely?

(Oh, and related to this I think this patch set will conflict with
Hangbin's multi-redirect series, so maybe you two ought to coordinate? :))

-Toke



* Re: [PATCH bpf-next v2 4/8] xsk: register XDP sockets at bind(), and add new AF_XDP BPF helper
  2021-01-19 15:50 ` [PATCH bpf-next v2 4/8] xsk: register XDP sockets at bind(), and add new AF_XDP BPF helper Björn Töpel
  2021-01-20  8:25   ` kernel test robot
  2021-01-20  8:50   ` kernel test robot
@ 2021-01-20 12:50   ` Toke Høiland-Jørgensen
  2021-01-20 13:25     ` Björn Töpel
  2 siblings, 1 reply; 45+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-01-20 12:50 UTC (permalink / raw)
  To: Björn Töpel, ast, daniel, netdev, bpf
  Cc: Björn Töpel, magnus.karlsson, maciej.fijalkowski, kuba,
	jonathan.lemon, maximmi, davem, hawk, john.fastabend,
	ciara.loftus, weqaar.a.janjua

Björn Töpel <bjorn.topel@gmail.com> writes:

> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index c001766adcbc..bbc7d9a57262 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -3836,6 +3836,12 @@ union bpf_attr {
>   *	Return
>   *		A pointer to a struct socket on success or NULL if the file is
>   *		not a socket.
> + *
> + * long bpf_redirect_xsk(struct xdp_buff *xdp_md, u64 action)
> + *	Description
> + *		Redirect to the registered AF_XDP socket.
> + *	Return
> + *		**XDP_REDIRECT** on success, otherwise the action parameter is returned.
>   */

I think it would be better to make the second argument a 'flags'
argument and make values > XDP_TX invalid (like we do in
bpf_xdp_redirect_map() now). By allowing any value as return you lose
the ability to turn it into a flags argument later...

-Toke



* Re: [PATCH bpf-next v2 5/8] libbpf, xsk: select AF_XDP BPF program based on kernel version
  2021-01-19 15:50 ` [PATCH bpf-next v2 5/8] libbpf, xsk: select AF_XDP BPF program based on kernel version Björn Töpel
@ 2021-01-20 12:52   ` Toke Høiland-Jørgensen
  2021-01-20 13:25     ` Björn Töpel
  0 siblings, 1 reply; 45+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-01-20 12:52 UTC (permalink / raw)
  To: Björn Töpel, ast, daniel, netdev, bpf
  Cc: Björn Töpel, magnus.karlsson, maciej.fijalkowski, kuba,
	jonathan.lemon, maximmi, davem, hawk, john.fastabend,
	ciara.loftus, weqaar.a.janjua, Marek Majtyka

Björn Töpel <bjorn.topel@gmail.com> writes:

> From: Björn Töpel <bjorn.topel@intel.com>
>
> Add detection for kernel version, and adapt the BPF program based on
> kernel support. This way, users will get the best possible performance
> from the BPF program.

Please do explicit feature detection instead of relying on the kernel
version number; some distro kernels are known to have a creative notion
of their own version, which is not really related to the features they
actually support (I'm sure you know which one I'm referring to ;)).

-Toke



* Re: [PATCH bpf-next v2 0/8] Introduce bpf_redirect_xsk() helper
  2021-01-19 15:50 [PATCH bpf-next v2 0/8] Introduce bpf_redirect_xsk() helper Björn Töpel
                   ` (7 preceding siblings ...)
  2021-01-19 15:50 ` [PATCH bpf-next v2 8/8] selftest/bpf: remove a lot of ifobject casting in xdpxceiver Björn Töpel
@ 2021-01-20 13:15 ` Maxim Mikityanskiy
  2021-01-20 13:27   ` Björn Töpel
  2021-01-20 15:57   ` Jesper Dangaard Brouer
  8 siblings, 2 replies; 45+ messages in thread
From: Maxim Mikityanskiy @ 2021-01-20 13:15 UTC (permalink / raw)
  To: Björn Töpel, ast, daniel, netdev, bpf
  Cc: bjorn.topel, magnus.karlsson, maciej.fijalkowski, kuba,
	jonathan.lemon, davem, hawk, john.fastabend, ciara.loftus,
	weqaar.a.janjua

On 2021-01-19 17:50, Björn Töpel wrote:
> This series extends bind() for XDP sockets, so that the bound socket
> is added to the netdev_rx_queue _rx array in the netdevice. We call
> this to register the socket. To redirect packets to the registered
> socket, a new BPF helper is used: bpf_redirect_xsk().
> 
> For shared XDP sockets, only the first bound socket is
> registered. Users that need a more complex setup have to use XSKMAP and
> bpf_redirect_map().
> 
> Now, why would one use bpf_redirect_xsk() over the regular
> bpf_redirect_map() helper?
> 
> * Better performance!
> * Convenience; Most users use one socket per queue. This scenario is
>    what registered sockets support. There is no need to create an
>    XSKMAP. This can also reduce complexity in containerized setups,
>    where users might want to use XDP sockets without CAP_SYS_ADMIN
>    capabilities.
> 
> The first patch restructures xdp_do_redirect() a bit, to make it
> easier to add the new helper. This restructure also gives us a slight
> performance benefit. The following three patches extend bind() and
> add the new helper. After that, two libbpf patches select the XDP
> program based on what kernel is running. Finally, selftests for the new
> functionality are added.
> 
> Note that the libbpf "auto-selection" is based on kernel version, so
> it is hard coded to the "-next" version (5.12). If you would like to
> try this out, you will need to change the libbpf patch locally!
> 
> Thanks to Maciej and Magnus for the internal review/comments!
> 
> Performance (rxdrop, zero-copy)
> 
> Baseline
> Two cores:                   21.3 Mpps
> One core:                    24.5 Mpps

Two cores is slower? It used to be faster all the time, didn't it?

> Patched
> Two cores, bpf_redirect_map: 21.7 Mpps + 2%
> One core, bpf_redirect_map:  24.9 Mpps + 2%
> 
> Two cores, bpf_redirect_xsk: 24.0 Mpps +13%

Nice, impressive improvement!

> One core, bpf_redirect_xsk:  25.5 Mpps + 4%
> 
> Thanks!
> Björn
> 
> v1->v2:
>    * Added missing XDP programs to selftests.
>    * Fixed checkpatch warning in selftests.
> 
> Björn Töpel (8):
>    xdp: restructure redirect actions
>    xsk: remove explicit_free parameter from __xsk_rcv()
>    xsk: fold xp_assign_dev and __xp_assign_dev
>    xsk: register XDP sockets at bind(), and add new AF_XDP BPF helper
>    libbpf, xsk: select AF_XDP BPF program based on kernel version
>    libbpf, xsk: select bpf_redirect_xsk(), if supported
>    selftest/bpf: add XDP socket tests for bpf_redirect_{xsk, map}()
>    selftest/bpf: remove a lot of ifobject casting in xdpxceiver
> 
>   include/linux/filter.h                        |  10 +
>   include/linux/netdevice.h                     |   1 +
>   include/net/xdp_sock.h                        |  12 +
>   include/net/xsk_buff_pool.h                   |   2 +-
>   include/trace/events/xdp.h                    |  46 ++--
>   include/uapi/linux/bpf.h                      |   7 +
>   net/core/filter.c                             | 205 ++++++++++--------
>   net/xdp/xsk.c                                 | 112 ++++++++--
>   net/xdp/xsk_buff_pool.c                       |  12 +-
>   tools/include/uapi/linux/bpf.h                |   7 +
>   tools/lib/bpf/libbpf.c                        |   2 +-
>   tools/lib/bpf/libbpf_internal.h               |   2 +
>   tools/lib/bpf/libbpf_probes.c                 |  16 --
>   tools/lib/bpf/xsk.c                           |  83 ++++++-
>   .../selftests/bpf/progs/xdpxceiver_ext1.c     |  15 ++
>   .../selftests/bpf/progs/xdpxceiver_ext2.c     |   9 +
>   tools/testing/selftests/bpf/test_xsk.sh       |  48 ++++
>   tools/testing/selftests/bpf/xdpxceiver.c      | 164 +++++++++-----
>   tools/testing/selftests/bpf/xdpxceiver.h      |   2 +
>   19 files changed, 554 insertions(+), 201 deletions(-)
>   create mode 100644 tools/testing/selftests/bpf/progs/xdpxceiver_ext1.c
>   create mode 100644 tools/testing/selftests/bpf/progs/xdpxceiver_ext2.c
> 
> 
> base-commit: 95204c9bfa48d2f4d3bab7df55c1cc823957ff81
> 



* Re: [PATCH bpf-next v2 5/8] libbpf, xsk: select AF_XDP BPF program based on kernel version
  2021-01-20 12:52   ` Toke Høiland-Jørgensen
@ 2021-01-20 13:25     ` Björn Töpel
  2021-01-20 14:49       ` Björn Töpel
  2021-01-20 14:56       ` Toke Høiland-Jørgensen
  0 siblings, 2 replies; 45+ messages in thread
From: Björn Töpel @ 2021-01-20 13:25 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Björn Töpel, ast,
	daniel, netdev, bpf
  Cc: magnus.karlsson, maciej.fijalkowski, kuba, jonathan.lemon,
	maximmi, davem, hawk, john.fastabend, ciara.loftus,
	weqaar.a.janjua, Marek Majtyka

On 2021-01-20 13:52, Toke Høiland-Jørgensen wrote:
> Björn Töpel <bjorn.topel@gmail.com> writes:
> 
>> From: Björn Töpel <bjorn.topel@intel.com>
>>
>> Add detection for kernel version, and adapt the BPF program based on
>> kernel support. This way, users will get the best possible performance
>> from the BPF program.
> 
> Please do explicit feature detection instead of relying on the kernel
> version number; some distro kernels are known to have a creative notion
> of their own version, which is not really related to the features they
> actually support (I'm sure you know which one I'm referring to ;)).
>

Right. For a *new* helper, like bpf_redirect_xsk, we rely on rejection
from the verifier to detect support. What about "bpf_redirect_map() now
supports passing return value as flags"? Any ideas how to do that in a
robust, non-version number-based scheme?


Thanks,
Björn


* Re: [PATCH bpf-next v2 4/8] xsk: register XDP sockets at bind(), and add new AF_XDP BPF helper
  2021-01-20 12:50   ` Toke Høiland-Jørgensen
@ 2021-01-20 13:25     ` Björn Töpel
  2021-01-20 14:54       ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 45+ messages in thread
From: Björn Töpel @ 2021-01-20 13:25 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Björn Töpel, ast,
	daniel, netdev, bpf
  Cc: magnus.karlsson, maciej.fijalkowski, kuba, jonathan.lemon,
	maximmi, davem, hawk, john.fastabend, ciara.loftus,
	weqaar.a.janjua

On 2021-01-20 13:50, Toke Høiland-Jørgensen wrote:
> Björn Töpel <bjorn.topel@gmail.com> writes:
> 
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index c001766adcbc..bbc7d9a57262 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -3836,6 +3836,12 @@ union bpf_attr {
>>    *	Return
>>    *		A pointer to a struct socket on success or NULL if the file is
>>    *		not a socket.
>> + *
>> + * long bpf_redirect_xsk(struct xdp_buff *xdp_md, u64 action)
>> + *	Description
>> + *		Redirect to the registered AF_XDP socket.
>> + *	Return
>> + *		**XDP_REDIRECT** on success, otherwise the action parameter is returned.
>>    */
> 
> I think it would be better to make the second argument a 'flags'
> argument and make values > XDP_TX invalid (like we do in
> bpf_xdp_redirect_map() now). By allowing any value as return you lose
> the ability to turn it into a flags argument later...
>

Yes, but that adds a run-time check. I prefer this non-checked version,
even though it is a bit less futureproof.


Björn


* Re: [PATCH bpf-next v2 0/8] Introduce bpf_redirect_xsk() helper
  2021-01-20 13:15 ` [PATCH bpf-next v2 0/8] Introduce bpf_redirect_xsk() helper Maxim Mikityanskiy
@ 2021-01-20 13:27   ` Björn Töpel
  2021-01-20 15:57   ` Jesper Dangaard Brouer
  1 sibling, 0 replies; 45+ messages in thread
From: Björn Töpel @ 2021-01-20 13:27 UTC (permalink / raw)
  To: Maxim Mikityanskiy, Björn Töpel, ast, daniel, netdev, bpf
  Cc: magnus.karlsson, maciej.fijalkowski, kuba, jonathan.lemon, davem,
	hawk, john.fastabend, ciara.loftus, weqaar.a.janjua

On 2021-01-20 14:15, Maxim Mikityanskiy wrote:
> On 2021-01-19 17:50, Björn Töpel wrote:
>> This series extends bind() for XDP sockets, so that the bound socket
>> is added to the netdev_rx_queue _rx array in the netdevice. We call
>> this to register the socket. To redirect packets to the registered
>> socket, a new BPF helper is used: bpf_redirect_xsk().
>>
>> For shared XDP sockets, only the first bound socket is
>> registered. Users that need a more complex setup have to use XSKMAP and
>> bpf_redirect_map().
>>
>> Now, why would one use bpf_redirect_xsk() over the regular
>> bpf_redirect_map() helper?
>>
>> * Better performance!
>> * Convenience; Most users use one socket per queue. This scenario is
>>    what registered sockets support. There is no need to create an
>>    XSKMAP. This can also reduce complexity in containerized setups,
>>    where users might want to use XDP sockets without CAP_SYS_ADMIN
>>    capabilities.
>>
>> The first patch restructures xdp_do_redirect() a bit, to make it
>> easier to add the new helper. This restructure also gives us a slight
>> performance benefit. The following three patches extend bind() and
>> add the new helper. After that, two libbpf patches select the XDP
>> program based on what kernel is running. Finally, selftests for the new
>> functionality are added.
>>
>> Note that the libbpf "auto-selection" is based on kernel version, so
>> it is hard coded to the "-next" version (5.12). If you would like to
>> try this out, you will need to change the libbpf patch locally!
>>
>> Thanks to Maciej and Magnus for the internal review/comments!
>>
>> Performance (rxdrop, zero-copy)
>>
>> Baseline
>> Two cores:                   21.3 Mpps
>> One core:                    24.5 Mpps
> 
> Two cores is slower? It used to be faster all the time, didn't it?
>

Up until busy-polling. Note that I'm using a busy-poll napi budget of 512.


>> Patched
>> Two cores, bpf_redirect_map: 21.7 Mpps + 2%
>> One core, bpf_redirect_map:  24.9 Mpps + 2%
>>
>> Two cores, bpf_redirect_xsk: 24.0 Mpps +13%
> 
> Nice, impressive improvement!
>

Thanks! Getting rid of the queue/netdev checks really paid off!


Björn


* Re: [PATCH bpf-next v2 1/8] xdp: restructure redirect actions
  2021-01-20 12:44   ` Toke Høiland-Jørgensen
@ 2021-01-20 13:40     ` Björn Töpel
  2021-01-20 14:52       ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 45+ messages in thread
From: Björn Töpel @ 2021-01-20 13:40 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Björn Töpel, ast,
	daniel, netdev, bpf
  Cc: magnus.karlsson, maciej.fijalkowski, kuba, jonathan.lemon,
	maximmi, davem, hawk, john.fastabend, ciara.loftus,
	weqaar.a.janjua



On 2021-01-20 13:44, Toke Høiland-Jørgensen wrote:
> Björn Töpel <bjorn.topel@gmail.com> writes:
> 
>> From: Björn Töpel <bjorn.topel@intel.com>
>>
>> The XDP_REDIRECT implementations for maps and non-maps are fairly
>> similar, but obviously need to take different code paths depending on
>> whether the target is using a map or not. Today, the redirect targets for
>> XDP either use a map or are based on ifindex.
>>
>> Future commits will introduce yet another redirect target via a
>> new helper, bpf_redirect_xsk(). To pave the way for that, we introduce
>> an explicit redirect type to bpf_redirect_info. This makes the code
>> easier to follow, and makes it easier to add new redirect targets.
>>
>> Further, using an explicit type in bpf_redirect_info has a slight
>> positive performance impact by avoiding a pointer indirection for the
>> map type lookup, and instead using the hot cacheline for
>> bpf_redirect_info.
>>
>> The bpf_redirect_info flags member is not used by XDP, and not
>> read/written any more. The map member is only written to when
>> required/used, and not unconditionally.
> 
> I like the simplification. However, the handling of map clearing becomes
> a bit murky with this change:
> 
> You're not changing anything in bpf_clear_redirect_map(), and you're
> removing most of the reads and writes of ri->map. Instead,
> bpf_xdp_redirect_map() will store the bpf_dtab_netdev pointer in
> ri->tgt_value, which xdp_do_redirect() will just read and use without
> checking. But if the map element (or the entire map) has been freed in
> the meantime that will be a dangling pointer. I *think* the RCU callback
> in dev_map_delete_elem() and the rcu_barrier() in dev_map_free()
> protects against this, but that is by no means obvious. So confirming
> this, and explaining it in a comment would be good.
>

Yes, *most* of the READ_ONCE(ri->map) calls are removed; what remains is
pretty much only bpf_redirect_map() and, as you write, the tracepoints.

The content/element of the map is RCU protected, and actually even the
map will be around until the XDP processing is complete. Note the
synchronize_rcu() that follows all bpf_clear_redirect_map() calls.

I'll try to make it clearer in the commit message! Thanks for pointing 
that out!

> Also, as far as I can tell after this, ri->map is only used for the
> tracepoint. So how about just storing the map ID and getting rid of the
> READ/WRITE_ONCE() entirely?
>

...and the bpf_redirect_map() helper. Don't you think the current
READ_ONCE(ri->map) scheme is more obvious/clear?


> (Oh, and related to this I think this patch set will conflict with
> Hangbin's multi-redirect series, so maybe you two ought to coordinate? :))
>

Yeah, good idea! I would guess Hangbin's would go in before this, so I
would need to adapt.


Thanks for taking a look at the series, Toke! Much appreciated!


Björn



* Re: [PATCH bpf-next v2 5/8] libbpf, xsk: select AF_XDP BPF program based on kernel version
  2021-01-20 13:25     ` Björn Töpel
@ 2021-01-20 14:49       ` Björn Töpel
  2021-01-20 15:11         ` Toke Høiland-Jørgensen
  2021-01-20 14:56       ` Toke Høiland-Jørgensen
  1 sibling, 1 reply; 45+ messages in thread
From: Björn Töpel @ 2021-01-20 14:49 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Björn Töpel, ast,
	daniel, netdev, bpf
  Cc: magnus.karlsson, maciej.fijalkowski, kuba, jonathan.lemon,
	maximmi, davem, hawk, john.fastabend, ciara.loftus,
	weqaar.a.janjua, Marek Majtyka

On 2021-01-20 14:25, Björn Töpel wrote:
> On 2021-01-20 13:52, Toke Høiland-Jørgensen wrote:
>> Björn Töpel <bjorn.topel@gmail.com> writes:
>>
>>> From: Björn Töpel <bjorn.topel@intel.com>
>>>
>>> Add detection for kernel version, and adapt the BPF program based on
>>> kernel support. This way, users will get the best possible performance
>>> from the BPF program.
>>
>> Please do explicit feature detection instead of relying on the kernel
>> version number; some distro kernels are known to have a creative notion
>> of their own version, which is not really related to the features they
>> actually support (I'm sure you know which one I'm referring to ;)).
>>
> 
> Right. For a *new* helper, like bpf_redirect_xsk, we rely on rejection
> from the verifier to detect support. What about "bpf_redirect_map() now
> supports passing return value as flags"? Any ideas how to do that in a
> robust, non-version number-based scheme?
>

Just so that I understand this correctly. Red^WSome distro vendors
backport the world, and call that frankenkernel, say, 3.10. Is that
interpretation correct? My hope was that wasn't the case. :-(

Would it make sense with some kind of BPF-specific "supported features"
mechanism? Something else with a bigger scope (whole kernel)?



Cheers,
Björn



* Re: [PATCH bpf-next v2 1/8] xdp: restructure redirect actions
  2021-01-20 13:40     ` Björn Töpel
@ 2021-01-20 14:52       ` Toke Høiland-Jørgensen
  2021-01-20 15:49         ` Björn Töpel
  0 siblings, 1 reply; 45+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-01-20 14:52 UTC (permalink / raw)
  To: Björn Töpel, Björn Töpel, ast, daniel, netdev, bpf
  Cc: magnus.karlsson, maciej.fijalkowski, kuba, jonathan.lemon,
	maximmi, davem, hawk, john.fastabend, ciara.loftus,
	weqaar.a.janjua

Björn Töpel <bjorn.topel@intel.com> writes:

> On 2021-01-20 13:44, Toke Høiland-Jørgensen wrote:
>> Björn Töpel <bjorn.topel@gmail.com> writes:
>> 
>>> From: Björn Töpel <bjorn.topel@intel.com>
>>>
>>> The XDP_REDIRECT implementations for maps and non-maps are fairly
>>> similar, but obviously need to take different code paths depending on
>>> if the target is using a map or not. Today, the redirect targets for
>>> XDP either uses a map, or is based on ifindex.
>>>
>>> Future commits will introduce yet another redirect target via a
>>> new helper, bpf_redirect_xsk(). To pave the way for that, we introduce
>>> an explicit redirect type to bpf_redirect_info. This makes the code
>>> easier to follow, and makes it easier to add new redirect targets.
>>>
>>> Further, using an explicit type in bpf_redirect_info has a slight
>>> positive performance impact by avoiding a pointer indirection for the
>>> map type lookup, and instead using the hot cacheline for
>>> bpf_redirect_info.
>>>
>>> The bpf_redirect_info flags member is not used by XDP, and not
>>> read/written any more. The map member is only written to when
>>> required/used, and not unconditionally.
>> 
>> I like the simplification. However, the handling of map clearing becomes
>> a bit murky with this change:
>> 
>> You're not changing anything in bpf_clear_redirect_map(), and you're
>> removing most of the reads and writes of ri->map. Instead,
>> bpf_xdp_redirect_map() will store the bpf_dtab_netdev pointer in
>> ri->tgt_value, which xdp_do_redirect() will just read and use without
>> checking. But if the map element (or the entire map) has been freed in
>> the meantime that will be a dangling pointer. I *think* the RCU callback
>> in dev_map_delete_elem() and the rcu_barrier() in dev_map_free()
>> protects against this, but that is by no means obvious. So confirming
>> this, and explaining it in a comment would be good.
>>
>
> Yes, *most* of the READ_ONCE(ri->map) calls are removed; what remains is
> pretty much only bpf_redirect_map() and, as you write, the tracepoints.
>
> The content/element of the map is RCU protected, and actually even the
> map will be around until the XDP processing is complete. Note the
> synchronize_rcu() that follows all bpf_clear_redirect_map() calls.
>
> I'll try to make it clearer in the commit message! Thanks for pointing 
> that out!
>
>> Also, as far as I can tell after this, ri->map is only used for the
>> tracepoint. So how about just storing the map ID and getting rid of the
>> READ/WRITE_ONCE() entirely?
>>
>
> ...and the bpf_redirect_map() helper. Don't you think the current
> READ_ONCE(ri->map) scheme is more obvious/clear?

Yeah, after your patch we WRITE_ONCE() the pointer in
bpf_redirect_map(), but the only place it is actually *read* is in the
tracepoint. So the only purpose of bpf_clear_redirect_map() is to ensure
that an invalid pointer is not read in the tracepoint function. Which
seems a bit excessive when we could just store the map ID for direct use
in the tracepoint and get rid of bpf_clear_redirect_map() entirely, no?

Besides, from a UX point of view, having the tracepoint display the map
ID even if that map ID is no longer valid seems to me like it makes more
sense than just displaying a map ID of 0 and leaving it up to the user
to figure out that this is because the map was cleared. I mean, at the
time the redirect was made, that *was* the map ID that was used...

Oh, and as you say due to the synchronize_rcu() call in dev_map_free() I
think this whole discussion is superfluous anyway, since it can't
actually happen that the map gets freed between the setting and reading
of ri->map, no?

>> (Oh, and related to this I think this patch set will conflict with
>> Hangbin's multi-redirect series, so maybe you two ought to coordinate? :))
>>
>
> Yeah, good idea! I would guess Hangbin's would go in before this, so I
> would need to adapt.
>
>
> Thanks for taking a look at the series, Toke! Much appreciated!

You're welcome :)

-Toke



* Re: [PATCH bpf-next v2 4/8] xsk: register XDP sockets at bind(), and add new AF_XDP BPF helper
  2021-01-20 13:25     ` Björn Töpel
@ 2021-01-20 14:54       ` Toke Høiland-Jørgensen
  2021-01-20 15:18         ` Björn Töpel
  0 siblings, 1 reply; 45+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-01-20 14:54 UTC (permalink / raw)
  To: Björn Töpel, Björn Töpel, ast, daniel, netdev, bpf
  Cc: magnus.karlsson, maciej.fijalkowski, kuba, jonathan.lemon,
	maximmi, davem, hawk, john.fastabend, ciara.loftus,
	weqaar.a.janjua

Björn Töpel <bjorn.topel@intel.com> writes:

> On 2021-01-20 13:50, Toke Høiland-Jørgensen wrote:
>> Björn Töpel <bjorn.topel@gmail.com> writes:
>> 
>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>>> index c001766adcbc..bbc7d9a57262 100644
>>> --- a/include/uapi/linux/bpf.h
>>> +++ b/include/uapi/linux/bpf.h
>>> @@ -3836,6 +3836,12 @@ union bpf_attr {
>>>    *	Return
>>>    *		A pointer to a struct socket on success or NULL if the file is
>>>    *		not a socket.
>>> + *
>>> + * long bpf_redirect_xsk(struct xdp_buff *xdp_md, u64 action)
>>> + *	Description
>>> + *		Redirect to the registered AF_XDP socket.
>>> + *	Return
>>> + *		**XDP_REDIRECT** on success, otherwise the action parameter is returned.
>>>    */
>> 
>> I think it would be better to make the second argument a 'flags'
>> argument and make values > XDP_TX invalid (like we do in
>> bpf_xdp_redirect_map() now). By allowing any value as return you lose
>> the ability to turn it into a flags argument later...
>>
>
> Yes, but that adds a run-time check. I prefer this non-checked version,
> even though it is a bit less futureproof.

That...seems a bit short-sighted? :)
Can you actually see a difference in your performance numbers?

-Toke



* Re: [PATCH bpf-next v2 5/8] libbpf, xsk: select AF_XDP BPF program based on kernel version
  2021-01-20 13:25     ` Björn Töpel
  2021-01-20 14:49       ` Björn Töpel
@ 2021-01-20 14:56       ` Toke Høiland-Jørgensen
  1 sibling, 0 replies; 45+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-01-20 14:56 UTC (permalink / raw)
  To: Björn Töpel, Björn Töpel, ast, daniel, netdev, bpf
  Cc: magnus.karlsson, maciej.fijalkowski, kuba, jonathan.lemon,
	maximmi, davem, hawk, john.fastabend, ciara.loftus,
	weqaar.a.janjua, Marek Majtyka

Björn Töpel <bjorn.topel@intel.com> writes:

> On 2021-01-20 13:52, Toke Høiland-Jørgensen wrote:
>> Björn Töpel <bjorn.topel@gmail.com> writes:
>> 
>>> From: Björn Töpel <bjorn.topel@intel.com>
>>>
>>> Add detection for kernel version, and adapt the BPF program based on
>>> kernel support. This way, users will get the best possible performance
>>> from the BPF program.
>> 
>> Please do explicit feature detection instead of relying on the kernel
>> version number; some distro kernels are known to have a creative notion
>> of their own version, which is not really related to the features they
>> actually support (I'm sure you know which one I'm referring to ;)).
>>
>
> Right. For a *new* helper, like bpf_redirect_xsk, we rely on rejection
> from the verifier to detect support. What about "bpf_redirect_map() now
> supports passing return value as flags"? Any ideas how to do that in a
> robust, non-version number-based scheme?

Well, having a BPF program pass in a flag of '1' with an invalid lookup
and checking if it returns 1 or 0. But how to do that from libbpf, hmm,
good question. BPF_PROG_RUN()?

An alternative could be to default to a program that will handle both
cases in the BPF code, and make it opt-in to use the optimised versions
if the user knows their kernel supports it?

-Toke



* Re: [PATCH bpf-next v2 5/8] libbpf, xsk: select AF_XDP BPF program based on kernel version
  2021-01-20 14:49       ` Björn Töpel
@ 2021-01-20 15:11         ` Toke Høiland-Jørgensen
  2021-01-20 15:27           ` Björn Töpel
  0 siblings, 1 reply; 45+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-01-20 15:11 UTC (permalink / raw)
  To: Björn Töpel, Björn Töpel, ast, daniel, netdev, bpf
  Cc: magnus.karlsson, maciej.fijalkowski, kuba, jonathan.lemon,
	maximmi, davem, hawk, john.fastabend, ciara.loftus,
	weqaar.a.janjua, Marek Majtyka

Björn Töpel <bjorn.topel@intel.com> writes:

> On 2021-01-20 14:25, Björn Töpel wrote:
>> On 2021-01-20 13:52, Toke Høiland-Jørgensen wrote:
>>> Björn Töpel <bjorn.topel@gmail.com> writes:
>>>
>>>> From: Björn Töpel <bjorn.topel@intel.com>
>>>>
>>>> Add detection for kernel version, and adapt the BPF program based on
>>>> kernel support. This way, users will get the best possible performance
>>>> from the BPF program.
>>>
>>> Please do explicit feature detection instead of relying on the kernel
>>> version number; some distro kernels are known to have a creative notion
>>> of their own version, which is not really related to the features they
>>> actually support (I'm sure you know which one I'm referring to ;)).
>>>
>> 
>> Right. For a *new* helper, like bpf_redirect_xsk, we rely on rejection
>> from the verifier to detect support. What about "bpf_redirect_map() now
>> supports passing return value as flags"? Any ideas how to do that in a
>> robust, non-version number-based scheme?
>>
>
> Just so that I understand this correctly. Red^WSome distro vendors
> backport the world, and call that franken kernel, say, 3.10. Is that
> interpretation correct? My hope was that wasn't the case. :-(

Yup, indeed. All kernels shipped for the entire lifetime of RHEL8 think
they are v4.18.0... :/

I don't think we're the only ones doing it (there are examples in the
embedded world as well, for instance, and not sure about the other
enterprise distros), but RHEL is probably the most extreme example.

We could patch the version check in the distro-supplied version of
libbpf, of course, but that doesn't help anyone using upstream versions,
and given the prevalence of vendoring libbpf, I fear that going with the
version check will just result in a bad user experience...

> Would it make sense with some kind of BPF-specific "supported
> features" mechanism? Something else with a bigger scope (whole
> kernel)?

Heh, in my opinion, yeah. Seems like we'll finally get it for XDP, but
for BPF in general the approach has always been probing AFAICT.

For the particular case of arguments to helpers, I suppose the verifier
could technically validate value ranges for flags arguments, say. That
would be nice as an early reject anyway, but I'm not sure if it is
possible to add after-the-fact without breaking existing programs
because the verifier can't prove the argument is within the valid range.
And of course it doesn't help you with compatibility with
already-released kernels.

-Toke



* Re: [PATCH bpf-next v2 4/8] xsk: register XDP sockets at bind(), and add new AF_XDP BPF helper
  2021-01-20 14:54       ` Toke Høiland-Jørgensen
@ 2021-01-20 15:18         ` Björn Töpel
  2021-01-20 17:29           ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 45+ messages in thread
From: Björn Töpel @ 2021-01-20 15:18 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Björn Töpel, ast,
	daniel, netdev, bpf
  Cc: magnus.karlsson, maciej.fijalkowski, kuba, jonathan.lemon,
	maximmi, davem, hawk, john.fastabend, ciara.loftus,
	weqaar.a.janjua

On 2021-01-20 15:54, Toke Høiland-Jørgensen wrote:
> Björn Töpel <bjorn.topel@intel.com> writes:
> 
>> On 2021-01-20 13:50, Toke Høiland-Jørgensen wrote:
>>> Björn Töpel <bjorn.topel@gmail.com> writes:
>>>
>>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>>>> index c001766adcbc..bbc7d9a57262 100644
>>>> --- a/include/uapi/linux/bpf.h
>>>> +++ b/include/uapi/linux/bpf.h
>>>> @@ -3836,6 +3836,12 @@ union bpf_attr {
>>>>     *	Return
>>>>     *		A pointer to a struct socket on success or NULL if the file is
>>>>     *		not a socket.
>>>> + *
>>>> + * long bpf_redirect_xsk(struct xdp_buff *xdp_md, u64 action)
>>>> + *	Description
>>>> + *		Redirect to the registered AF_XDP socket.
>>>> + *	Return
>>>> + *		**XDP_REDIRECT** on success, otherwise the action parameter is returned.
>>>>     */
>>>
>>> I think it would be better to make the second argument a 'flags'
>>> argument and make values > XDP_TX invalid (like we do in
>>> bpf_xdp_redirect_map() now). By allowing any value as return you lose
>>> the ability to turn it into a flags argument later...
>>>
>>
>> Yes, but that adds a run-time check. I prefer this non-checked version,
>> even though it is a bit less futureproof.
> 
> That...seems a bit short-sighted? :)
> Can you actually see a difference in your performance numbers?
>

I would rather add an additional helper *if* we see the need for flags,
instead of paying for that upfront. For me, BPF is about being able to
specialize, and not having one call with tons of checks.

(Related; Going forward, the growing switch() for redirect targets in
xdp_do_redirect() is a concern for me...)

And yes, even with all those fancy branch predictors, fewer instructions
are still fewer. :-) (It shows in my ubenchmarks.)


Björn



* Re: [PATCH bpf-next v2 5/8] libbpf, xsk: select AF_XDP BPF program based on kernel version
  2021-01-20 15:11         ` Toke Høiland-Jørgensen
@ 2021-01-20 15:27           ` Björn Töpel
  2021-01-20 17:30             ` Toke Høiland-Jørgensen
  2021-01-20 18:25             ` Alexei Starovoitov
  0 siblings, 2 replies; 45+ messages in thread
From: Björn Töpel @ 2021-01-20 15:27 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Björn Töpel, ast,
	daniel, netdev, bpf
  Cc: magnus.karlsson, maciej.fijalkowski, kuba, jonathan.lemon,
	maximmi, davem, hawk, john.fastabend, ciara.loftus,
	weqaar.a.janjua, Marek Majtyka

On 2021-01-20 16:11, Toke Høiland-Jørgensen wrote:
> Björn Töpel <bjorn.topel@intel.com> writes:
> 
>> On 2021-01-20 14:25, Björn Töpel wrote:
>>> On 2021-01-20 13:52, Toke Høiland-Jørgensen wrote:
>>>> Björn Töpel <bjorn.topel@gmail.com> writes:
>>>>
>>>>> From: Björn Töpel <bjorn.topel@intel.com>
>>>>>
>>>>> Add detection for kernel version, and adapt the BPF program based on
>>>>> kernel support. This way, users will get the best possible performance
>>>>> from the BPF program.
>>>>
>>>> Please do explicit feature detection instead of relying on the kernel
>>>> version number; some distro kernels are known to have a creative notion
>>>> of their own version, which is not really related to the features they
>>>> actually support (I'm sure you know which one I'm referring to ;)).
>>>>
>>>
>>> Right. For a *new* helper, like bpf_redirect_xsk, we rely on rejection
>>> from the verifier to detect support. What about "bpf_redirect_map() now
>>> supports passing return value as flags"? Any ideas how to do that in a
>>> robust, non-version number-based scheme?
>>>
>>
>> Just so that I understand this correctly. Red^WSome distro vendors
>> backport the world, and call that franken kernel, say, 3.10. Is that
>> interpretation correct? My hope was that wasn't the case. :-(
> 
> Yup, indeed. All kernels shipped for the entire lifetime of RHEL8 think
> they are v4.18.0... :/
> 
> I don't think we're the only ones doing it (there are examples in the
> embedded world as well, for instance, and not sure about the other
> enterprise distros), but RHEL is probably the most extreme example.
> 
> We could patch the version check in the distro-supplied version of
> libbpf, of course, but that doesn't help anyone using upstream versions,
> and given the prevalence of vendoring libbpf, I fear that going with the
> version check will just result in a bad user experience...
>

Ok! Thanks for clearing that up!

>> Would it make sense with some kind of BPF-specific "supported
>> features" mechanism? Something else with a bigger scope (whole
>> kernel)?
> 
> Heh, in my opinion, yeah. Seems like we'll finally get it for XDP, but
> for BPF in general the approach has always been probing AFAICT.
> 
> For the particular case of arguments to helpers, I suppose the verifier
> could technically validate value ranges for flags arguments, say. That
> would be nice as an early reject anyway, but I'm not sure if it is
> possible to add after-the-fact without breaking existing programs
> because the verifier can't prove the argument is within the valid range.
> And of course it doesn't help you with compatibility with
> already-released kernels.
>

Hmm, think I have a way forward. I'll use BPF_PROG_TEST_RUN.

If the load fails for the new helper, fall back to bpf_redirect_map(). Use
BPF_PROG_TEST_RUN to make sure that "action via flags" passes.

That should work for you guys as well, right? I'll take a stab at it.


Björn


* Re: [PATCH bpf-next v2 1/8] xdp: restructure redirect actions
  2021-01-20 14:52       ` Toke Høiland-Jørgensen
@ 2021-01-20 15:49         ` Björn Töpel
  2021-01-20 16:30           ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 45+ messages in thread
From: Björn Töpel @ 2021-01-20 15:49 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Björn Töpel, ast,
	daniel, netdev, bpf
  Cc: magnus.karlsson, maciej.fijalkowski, kuba, jonathan.lemon,
	maximmi, davem, hawk, john.fastabend, ciara.loftus,
	weqaar.a.janjua

On 2021-01-20 15:52, Toke Høiland-Jørgensen wrote:
> Björn Töpel <bjorn.topel@intel.com> writes:
> 
>> On 2021-01-20 13:44, Toke Høiland-Jørgensen wrote:
>>> Björn Töpel <bjorn.topel@gmail.com> writes:
>>>
>>>> From: Björn Töpel <bjorn.topel@intel.com>
>>>>
>>>> The XDP_REDIRECT implementations for maps and non-maps are fairly
>>>> similar, but obviously need to take different code paths depending on
>>>> if the target is using a map or not. Today, the redirect targets for
>>>> XDP either uses a map, or is based on ifindex.
>>>>
>>>> Future commits will introduce yet another redirect target via the a
>>>> new helper, bpf_redirect_xsk(). To pave the way for that, we introduce
>>>> an explicit redirect type to bpf_redirect_info. This makes the code
>>>> easier to follow, and makes it easier to add new redirect targets.
>>>>
>>>> Further, using an explicit type in bpf_redirect_info has a slight
>>>> positive performance impact by avoiding a pointer indirection for the
>>>> map type lookup, and instead use the hot cacheline for
>>>> bpf_redirect_info.
>>>>
>>>> The bpf_redirect_info flags member is not used by XDP, and not
>>>> read/written any more. The map member is only written to when
>>>> required/used, and not unconditionally.
>>>
>>> I like the simplification. However, the handling of map clearing becomes
>>> a bit murky with this change:
>>>
>>> You're not changing anything in bpf_clear_redirect_map(), and you're
>>> removing most of the reads and writes of ri->map. Instead,
>>> bpf_xdp_redirect_map() will store the bpf_dtab_netdev pointer in
>>> ri->tgt_value, which xdp_do_redirect() will just read and use without
>>> checking. But if the map element (or the entire map) has been freed in
>>> the meantime that will be a dangling pointer. I *think* the RCU callback
>>> in dev_map_delete_elem() and the rcu_barrier() in dev_map_free()
>>> protects against this, but that is by no means obvious. So confirming
>>> this, and explaining it in a comment would be good.
>>>
>>
>> Yes, *most* of the READ_ONCE(ri->map) are removed, it's pretty much only
>> the bpf_redirect_map(), and as you write, the tracepoints.
>>
>> The content/element of the map is RCU protected, and actually even the
>> map will be around until the XDP processing is complete. Note the
>> synchronize_rcu() followed after all bpf_clear_redirect_map() calls.
>>
>> I'll try to make it clearer in the commit message! Thanks for pointing
>> that out!
>>
>>> Also, as far as I can tell after this, ri->map is only used for the
>>> tracepoint. So how about just storing the map ID and getting rid of the
>>> READ/WRITE_ONCE() entirely?
>>>
>>
>> ...and the bpf_redirect_map() helper. Don't you think the current
>> READ_ONCE(ri->map) scheme is more obvious/clear?
> 
> Yeah, after your patch we WRITE_ONCE() the pointer in
> bpf_redirect_map(), but the only place it is actually *read* is in the
> tracepoint. So the only purpose of bpf_clear_redirect_map() is to ensure
> that an invalid pointer is not read in the tracepoint function. Which
> seems a bit excessive when we could just store the map ID for direct use
> in the tracepoint and get rid of bpf_clear_redirect_map() entirely, no?
> 
> Besides, from a UX point of view, having the tracepoint display the map
> ID even if that map ID is no longer valid seems to me like it makes more
> sense than just displaying a map ID of 0 and leaving it up to the user
> to figure out that this is because the map was cleared. I mean, at the
> time the redirect was made, that *was* the map ID that was used...
>

Convinced! Getting rid of bpf_clear_redirect_map() would be good! I'll
take a stab at this for v3!


> Oh, and as you say due to the synchronize_rcu() call in dev_map_free() I
> think this whole discussion is superfluous anyway, since it can't
> actually happen that the map gets freed between the setting and reading
> of ri->map, no?
>

It can't be freed, but ri->map can be cleared via
bpf_clear_redirect_map(). So, between the helper (setting) and the
tracepoint in xdp_do_redirect() it can be cleared (say, if the XDP
program is swapped out prior to running xdp_do_redirect()).

Moving to the scheme you suggested does make the discussion
superfluous. :-)


Thanks for the input!
Björn



* Re: [PATCH bpf-next v2 0/8] Introduce bpf_redirect_xsk() helper
  2021-01-20 13:15 ` [PATCH bpf-next v2 0/8] Introduce bpf_redirect_xsk() helper Maxim Mikityanskiy
  2021-01-20 13:27   ` Björn Töpel
@ 2021-01-20 15:57   ` Jesper Dangaard Brouer
  2021-01-20 16:19     ` Maciej Fijalkowski
  1 sibling, 1 reply; 45+ messages in thread
From: Jesper Dangaard Brouer @ 2021-01-20 15:57 UTC (permalink / raw)
  To: Maxim Mikityanskiy
  Cc: brouer, Björn Töpel, ast, daniel, netdev, bpf,
	bjorn.topel, magnus.karlsson, maciej.fijalkowski, kuba,
	jonathan.lemon, davem, john.fastabend, ciara.loftus,
	weqaar.a.janjua

On Wed, 20 Jan 2021 15:15:22 +0200
Maxim Mikityanskiy <maximmi@nvidia.com> wrote:

> On 2021-01-19 17:50, Björn Töpel wrote:
> > This series extends bind() for XDP sockets, so that the bound socket
> > is added to the netdev_rx_queue _rx array in the netdevice. We call
> > this to register the socket. To redirect packets to the registered
> > socket, a new BPF helper is used: bpf_redirect_xsk().
> > 
> > For shared XDP sockets, only the first bound socket is
> > registered. Users that need more complex setup has to use XSKMAP and
> > bpf_redirect_map().
> > 
> > Now, why would one use bpf_redirect_xsk() over the regular
> > bpf_redirect_map() helper?
> > 
> > * Better performance!
> > * Convenience; Most user use one socket per queue. This scenario is
> >    what registered sockets support. There is no need to create an
> >    XSKMAP. This can also reduce complexity from containerized setups,
> >    where users might what to use XDP sockets without CAP_SYS_ADMIN
> >    capabilities.

I'm buying into the convenience and reduced complexity, and XDP sockets
without CAP_SYS_ADMIN in containers.

People might be surprised that I'm actually NOT buying into the better
performance argument here.  At these speeds we are basically comparing
how close we are to zero (and have to use nanosec time scale for our
comparisons), more below.

 
> > The first patch restructures xdp_do_redirect() a bit, to make it
> > easier to add the new helper. This restructure also give us a slight
> > performance benefit. The following three patches extends bind() and
> > adds the new helper. After that, two libbpf patches that selects XDP
> > program based on what kernel is running. Finally, selftests for the new
> > functionality is added.
> > 
> > Note that the libbpf "auto-selection" is based on kernel version, so
> > it is hard coded to the "-next" version (5.12). If you would like to
> > try this is out, you will need to change the libbpf patch locally!
> > 
> > Thanks to Maciej and Magnus for the internal review/comments!
> > 
> > Performance (rxdrop, zero-copy)
> > 
> > Baseline
> > Two cores:                   21.3 Mpps
> > One core:                    24.5 Mpps  
> 
> Two cores is slower? It used to be faster all the time, didn't it?
> 
> > Patched
> > Two cores, bpf_redirect_map: 21.7 Mpps + 2%
> > One core, bpf_redirect_map:  24.9 Mpps + 2%
> > 
> > Two cores, bpf_redirect_xsk: 24.0 Mpps +13%  
> 
> Nice, impressive improvement!

I do appreciate your work and performance optimizations at this level,
because when we are down to this few CPU cycles per packet, it is
really hard to find new ways to reduce cycles further.

Thank you for saying +13% instead of saying +2.7 Mpps.
It *is* impressive to basically reduce cycles by 13%.

 21.3 Mpps = 46.94 nanosec per packet
 24.0 Mpps = 41.66 nanosec per packet

 21.3 Mpps -> 24.0 Mpps = 5.28 nanosec saved

On my 3.60GHz testlab machine that gives me 19 cycles.

 
> > One core, bpf_redirect_xsk:  25.5 Mpps + 4%
> >

 24.5 Mpps -> 25.5 Mpps = 1.6 nanosec saved

At this point, saving 1.6 ns is around the cost of a function call (~1.3 ns).


We still need these optimizations in the kernel, but end-users in
userspace are very quickly going to waste the 19 cycles we found.
I still support/believe that the OS needs to have as little overhead as
possible, but for me 42 nanosec overhead is close to zero overhead. For
comparison, I just ran a udp_sink[3] test, and it "cost" 625 ns for
delivery of UDP packets into a socket (almost 15 times slower).

I guess my point is that with XDP we have already achieved and exceeded
(my original) performance goals, making it even faster is just showing off ;-P
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_sample.c
[2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_memset.c
[3] https://github.com/netoptimizer/network-testing/blob/master/src/udp_sink.c



* Re: [PATCH bpf-next v2 0/8] Introduce bpf_redirect_xsk() helper
  2021-01-20 15:57   ` Jesper Dangaard Brouer
@ 2021-01-20 16:19     ` Maciej Fijalkowski
  2021-01-21 17:01       ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 45+ messages in thread
From: Maciej Fijalkowski @ 2021-01-20 16:19 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Maxim Mikityanskiy, Björn Töpel, ast, daniel, netdev,
	bpf, bjorn.topel, magnus.karlsson, kuba, jonathan.lemon, davem,
	john.fastabend, ciara.loftus, weqaar.a.janjua

On Wed, Jan 20, 2021 at 04:57:08PM +0100, Jesper Dangaard Brouer wrote:
> On Wed, 20 Jan 2021 15:15:22 +0200
> Maxim Mikityanskiy <maximmi@nvidia.com> wrote:
> 
> > On 2021-01-19 17:50, Björn Töpel wrote:
> > > This series extends bind() for XDP sockets, so that the bound socket
> > > is added to the netdev_rx_queue _rx array in the netdevice. We call
> > > this to register the socket. To redirect packets to the registered
> > > socket, a new BPF helper is used: bpf_redirect_xsk().
> > > 
> > > For shared XDP sockets, only the first bound socket is
> > > registered. Users that need more complex setup has to use XSKMAP and
> > > bpf_redirect_map().
> > > 
> > > Now, why would one use bpf_redirect_xsk() over the regular
> > > bpf_redirect_map() helper?
> > > 
> > > * Better performance!
> > > * Convenience; Most user use one socket per queue. This scenario is
> > >    what registered sockets support. There is no need to create an
> > >    XSKMAP. This can also reduce complexity from containerized setups,
> > >    where users might what to use XDP sockets without CAP_SYS_ADMIN
> > >    capabilities.
> 
> I'm buying into the convenience and reduce complexity, and XDP sockets
> without CAP_SYS_ADMIN into containers.
> 
> People might be surprised that I'm actually NOT buying into the better
> performance argument here.  At these speeds we are basically comparing
> how close we are to zero (and have to use nanosec time scale for our
> comparisons), more below.
> 
>  
> > > The first patch restructures xdp_do_redirect() a bit, to make it
> > > easier to add the new helper. This restructure also give us a slight
> > > performance benefit. The following three patches extends bind() and
> > > adds the new helper. After that, two libbpf patches that selects XDP
> > > program based on what kernel is running. Finally, selftests for the new
> > > functionality is added.
> > > 
> > > Note that the libbpf "auto-selection" is based on kernel version, so
> > > it is hard coded to the "-next" version (5.12). If you would like to
> > > try this is out, you will need to change the libbpf patch locally!
> > > 
> > > Thanks to Maciej and Magnus for the internal review/comments!
> > > 
> > > Performance (rxdrop, zero-copy)
> > > 
> > > Baseline
> > > Two cores:                   21.3 Mpps
> > > One core:                    24.5 Mpps  
> > 
> > Two cores is slower? It used to be faster all the time, didn't it?
> > 
> > > Patched
> > > Two cores, bpf_redirect_map: 21.7 Mpps + 2%
> > > One core, bpf_redirect_map:  24.9 Mpps + 2%
> > > 
> > > Two cores, bpf_redirect_xsk: 24.0 Mpps +13%  
> > 
> > Nice, impressive improvement!
> 
> I do appreciate you work and performance optimizations at this level,
> because when we are using this few CPU cycles per packet, then it is
> really hard to find new ways to reduce cycles further.
> 
> Thank for you saying +13% instead of saying +2.7 Mpps.
> It *is* impressive to basically reduce cycles with 13%.
> 
>  21.3 Mpps = 46.94 nanosec per packet
>  24.0 Mpps = 41.66 nanosec per packet
> 
>  21.3 Mpps -> 24.0 Mpps = 5.28 nanosec saved
> 
> On my 3.60GHz testlab machine that gives me 19 cycles.
> 
>  
> > > One core, bpf_redirect_xsk:  25.5 Mpps + 4%
> > >
> 
>  24.5 Mpps -> 25.5 Mpps = 1.6 nanosec saved
> 
> At this point with saving 1.6 ns this is around the cost of a function call 1.3 ns.
> 
> 
> We still need these optimization in the kernel, but end-users in
> userspace are very quickly going to waste the 19 cycles we found.
> I still support/believe that the OS need to have a little overhead as
> possible, but for me 42 nanosec overhead is close to zero overhead. For
> comparison, I just ran a udp_sink[3] test, and it "cost" 625 ns for
> delivery of UDP packets into socket (almost 15 times slower).
> 
> I guess my point is that with XDP we have already achieved and exceeded
> (my original) performance goals, making it even faster is just showing off ;-P

Even though I'll let Bjorn elaborate on this, we're talking here about
AF_XDP, which is a bit of a different pair of shoes to me in terms of
performance. AFAIK we still have a gap when compared to DPDK's numbers. So
I'm really not sure why better performance bothers you? :)

Let's rather be harsher on changes that actually decrease the
performance, not the other way around. And I suppose you were the one that
was always bringing up the 'death by a 1000 paper cuts' of XDP. So yeah,
I'm a bit confused by your statement.

> -- 
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer
> 
> [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_sample.c
> [2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_memset.c
> [3] https://github.com/netoptimizer/network-testing/blob/master/src/udp_sink.c
> 


* Re: [PATCH bpf-next v2 1/8] xdp: restructure redirect actions
  2021-01-20 15:49         ` Björn Töpel
@ 2021-01-20 16:30           ` Toke Høiland-Jørgensen
  2021-01-20 17:26             ` Björn Töpel
  0 siblings, 1 reply; 45+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-01-20 16:30 UTC (permalink / raw)
  To: Björn Töpel, Björn Töpel, ast, daniel, netdev, bpf
  Cc: magnus.karlsson, maciej.fijalkowski, kuba, jonathan.lemon,
	maximmi, davem, hawk, john.fastabend, ciara.loftus,
	weqaar.a.janjua

Björn Töpel <bjorn.topel@intel.com> writes:

> On 2021-01-20 15:52, Toke Høiland-Jørgensen wrote:
>> Björn Töpel <bjorn.topel@intel.com> writes:
>> 
>>> On 2021-01-20 13:44, Toke Høiland-Jørgensen wrote:
>>>> Björn Töpel <bjorn.topel@gmail.com> writes:
>>>>
>>>>> From: Björn Töpel <bjorn.topel@intel.com>
>>>>>
>>>>> The XDP_REDIRECT implementations for maps and non-maps are fairly
>>>>> similar, but obviously need to take different code paths depending on
>>>>> if the target is using a map or not. Today, the redirect targets for
>>>>> XDP either uses a map, or is based on ifindex.
>>>>>
>>>>> Future commits will introduce yet another redirect target via the a
>>>>> new helper, bpf_redirect_xsk(). To pave the way for that, we introduce
>>>>> an explicit redirect type to bpf_redirect_info. This makes the code
>>>>> easier to follow, and makes it easier to add new redirect targets.
>>>>>
>>>>> Further, using an explicit type in bpf_redirect_info has a slight
>>>>> positive performance impact by avoiding a pointer indirection for the
>>>>> map type lookup, and instead use the hot cacheline for
>>>>> bpf_redirect_info.
>>>>>
>>>>> The bpf_redirect_info flags member is not used by XDP, and not
>>>>> read/written any more. The map member is only written to when
>>>>> required/used, and not unconditionally.
>>>>
>>>> I like the simplification. However, the handling of map clearing becomes
>>>> a bit murky with this change:
>>>>
>>>> You're not changing anything in bpf_clear_redirect_map(), and you're
>>>> removing most of the reads and writes of ri->map. Instead,
>>>> bpf_xdp_redirect_map() will store the bpf_dtab_netdev pointer in
>>>> ri->tgt_value, which xdp_do_redirect() will just read and use without
>>>> checking. But if the map element (or the entire map) has been freed in
>>>> the meantime that will be a dangling pointer. I *think* the RCU callback
>>>> in dev_map_delete_elem() and the rcu_barrier() in dev_map_free()
>>>> protects against this, but that is by no means obvious. So confirming
>>>> this, and explaining it in a comment would be good.
>>>>
>>>
>>> Yes, *most* of the READ_ONCE(ri->map) are removed, it's pretty much only
>>> the bpf_redirect_map(), and as you write, the tracepoints.
>>>
>>> The content/element of the map is RCU protected, and actually even the
>>> map will be around until the XDP processing is complete. Note the
>>> synchronize_rcu() followed after all bpf_clear_redirect_map() calls.
>>>
>>> I'll try to make it clearer in the commit message! Thanks for pointing
>>> that out!
>>>
>>>> Also, as far as I can tell after this, ri->map is only used for the
>>>> tracepoint. So how about just storing the map ID and getting rid of the
>>>> READ/WRITE_ONCE() entirely?
>>>>
>>>
>>> ...and the bpf_redirect_map() helper. Don't you think the current
>>> READ_ONCE(ri->map) scheme is more obvious/clear?
>> 
>> Yeah, after your patch we WRITE_ONCE() the pointer in
>> bpf_redirect_map(), but the only place it is actually *read* is in the
>> tracepoint. So the only purpose of bpf_clear_redirect_map() is to ensure
>> that an invalid pointer is not read in the tracepoint function. Which
>> seems a bit excessive when we could just store the map ID for direct use
>> in the tracepoint and get rid of bpf_clear_redirect_map() entirely, no?
>> 
>> Besides, from a UX point of view, having the tracepoint display the map
>> ID even if that map ID is no longer valid seems to me like it makes more
>> sense than just displaying a map ID of 0 and leaving it up to the user
>> to figure out that this is because the map was cleared. I mean, at the
>> time the redirect was made, that *was* the map ID that was used...
>>
>
> Convinced! Getting rid of bpf_clear_redirect_map() would be good! I'll
> take a stab at this for v3!

Cool!

>> Oh, and as you say due to the synchronize_rcu() call in dev_map_free() I
>> think this whole discussion is superfluous anyway, since it can't
>> actually happen that the map gets freed between the setting and reading
>> of ri->map, no?
>>
>
> It can't be free'd but, ri->map can be cleared via
> bpf_clear_redirect_map(). So, between the helper (setting) and the
> tracepoint in xdp_do_redirect() it can be cleared (say if the XDP
> program is swapped out, prior running xdp_do_redirect()).

But xdp_do_redirect() should be called on driver flush before exiting
the NAPI cycle, so how can the XDP program be swapped out?

> Moving to the scheme you suggested, does make the discussion
> superfluous. :-)

Yup, awesome :)

-Toke



* Re: [PATCH bpf-next v2 1/8] xdp: restructure redirect actions
  2021-01-20 16:30           ` Toke Høiland-Jørgensen
@ 2021-01-20 17:26             ` Björn Töpel
  0 siblings, 0 replies; 45+ messages in thread
From: Björn Töpel @ 2021-01-20 17:26 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Björn Töpel, ast,
	daniel, netdev, bpf
  Cc: magnus.karlsson, maciej.fijalkowski, kuba, jonathan.lemon,
	maximmi, davem, hawk, john.fastabend, ciara.loftus,
	weqaar.a.janjua

On 2021-01-20 17:30, Toke Høiland-Jørgensen wrote:
> Björn Töpel <bjorn.topel@intel.com> writes:
> 
[...]
>>
>> It can't be free'd but, ri->map can be cleared via
>> bpf_clear_redirect_map(). So, between the helper (setting) and the
>> tracepoint in xdp_do_redirect() it can be cleared (say if the XDP
>> program is swapped out, prior running xdp_do_redirect()).
> 
> But xdp_do_redirect() should be called on driver flush before exiting
> the NAPI cycle, so how can the XDP program be swapped out?
> 

To clarify: xdp_do_redirect() is called for each packet in the NAPI poll
loop.

Yeah, you're right. The xdp_do_redirect() call is within the RCU scope, so
the program won't be destroyed (but it can be swapped, though!).

Hmm, so IOW bpf_clear_redirect_map() is not needed anymore...


Björn

[...]


* Re: [PATCH bpf-next v2 4/8] xsk: register XDP sockets at bind(), and add new AF_XDP BPF helper
  2021-01-20 15:18         ` Björn Töpel
@ 2021-01-20 17:29           ` Toke Høiland-Jørgensen
  2021-01-20 18:22             ` Björn Töpel
  0 siblings, 1 reply; 45+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-01-20 17:29 UTC (permalink / raw)
  To: Björn Töpel, Björn Töpel, ast, daniel, netdev, bpf
  Cc: magnus.karlsson, maciej.fijalkowski, kuba, jonathan.lemon,
	maximmi, davem, hawk, john.fastabend, ciara.loftus,
	weqaar.a.janjua

Björn Töpel <bjorn.topel@intel.com> writes:

> On 2021-01-20 15:54, Toke Høiland-Jørgensen wrote:
>> Björn Töpel <bjorn.topel@intel.com> writes:
>> 
>>> On 2021-01-20 13:50, Toke Høiland-Jørgensen wrote:
>>>> Björn Töpel <bjorn.topel@gmail.com> writes:
>>>>
>>>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>>>>> index c001766adcbc..bbc7d9a57262 100644
>>>>> --- a/include/uapi/linux/bpf.h
>>>>> +++ b/include/uapi/linux/bpf.h
>>>>> @@ -3836,6 +3836,12 @@ union bpf_attr {
>>>>>     *	Return
>>>>>     *		A pointer to a struct socket on success or NULL if the file is
>>>>>     *		not a socket.
>>>>> + *
>>>>> + * long bpf_redirect_xsk(struct xdp_buff *xdp_md, u64 action)
>>>>> + *	Description
>>>>> + *		Redirect to the registered AF_XDP socket.
>>>>> + *	Return
>>>>> + *		**XDP_REDIRECT** on success, otherwise the action parameter is returned.
>>>>>     */
>>>>
>>>> I think it would be better to make the second argument a 'flags'
>>>> argument and make values > XDP_TX invalid (like we do in
>>>> bpf_xdp_redirect_map() now). By allowing any value as return you lose
>>>> the ability to turn it into a flags argument later...
>>>>
>>>
>>> Yes, but that adds a run-time check. I prefer this non-checked version,
>>> even though it is a bit less futureproof.
>> 
>> That...seems a bit short-sighted? :)
>> Can you actually see a difference in your performance numbers?
>>
>
> I would rather add an additional helper *if* we see the need for flags,
> instead of paying for that upfront. For me, BPF is about being able to
> specialize, and not having one call with tons of checks.

I get that, I'm just pushing back because omitting a 'flags' argument is
literally among the most frequent reasons for having to replace a
syscall (see e.g., [0]) instead of extending it. And yeah, I do realise
that the performance implications are different for XDP than for
syscalls, but maintainability of the API is also important; it's all a
tradeoff. This will be the third redirect helper variant for XDP and I'd
hate for the fourth one to have to be bpf_redirect_xsk_flags() because
it did turn out to be needed...

(One potential concrete reason for this: I believe Magnus was talking
about an API that would allow a BPF program to redirect a packet into
more than one socket (cloning it in the process), or to redirect to a
socket+another target. How would you do that with this new helper?)

[0] https://lwn.net/Articles/585415/

> (Related; Going forward, the growing switch() for redirect targets in
> xdp_do_redirect() is a concern for me...)
>
> And yes, even with all those fancy branch predictors, less instructions
> is still less. :-) (It shows in my ubenchs.)

Right, I do agree that the run-time performance hit of checking the flag
sucks (along with being hard to check for, cf. our parallel discussion
about version checks). So ideally this would be fixed by having the
verifier enforce the argument ranges instead; but if we merge this
without the runtime check now we can't add that later without
potentially breaking programs... :(

-Toke



* Re: [PATCH bpf-next v2 5/8] libbpf, xsk: select AF_XDP BPF program based on kernel version
  2021-01-20 15:27           ` Björn Töpel
@ 2021-01-20 17:30             ` Toke Høiland-Jørgensen
  2021-01-20 18:25             ` Alexei Starovoitov
  1 sibling, 0 replies; 45+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-01-20 17:30 UTC (permalink / raw)
  To: Björn Töpel, Björn Töpel, ast, daniel, netdev, bpf
  Cc: magnus.karlsson, maciej.fijalkowski, kuba, jonathan.lemon,
	maximmi, davem, hawk, john.fastabend, ciara.loftus,
	weqaar.a.janjua, Marek Majtyka

Björn Töpel <bjorn.topel@intel.com> writes:

> On 2021-01-20 16:11, Toke Høiland-Jørgensen wrote:
>> Björn Töpel <bjorn.topel@intel.com> writes:
>> 
>>> On 2021-01-20 14:25, Björn Töpel wrote:
>>>> On 2021-01-20 13:52, Toke Høiland-Jørgensen wrote:
>>>>> Björn Töpel <bjorn.topel@gmail.com> writes:
>>>>>
>>>>>> From: Björn Töpel <bjorn.topel@intel.com>
>>>>>>
>>>>>> Add detection for kernel version, and adapt the BPF program based on
>>>>>> kernel support. This way, users will get the best possible performance
>>>>>> from the BPF program.
>>>>>
>>>>> Please do explicit feature detection instead of relying on the kernel
>>>>> version number; some distro kernels are known to have a creative notion
>>>>> of their own version, which is not really related to the features they
>>>>> actually support (I'm sure you know which one I'm referring to ;)).
>>>>>
>>>>
>>>> Right. For a *new* helper, like bpf_redirect_xsk, we rely on rejection
>>>> from the verifier to detect support. What about "bpf_redirect_map() now
>>>> supports passing return value as flags"? Any ideas how to do that in a
>>>> robust, non-version number-based scheme?
>>>>
>>>
>>> Just so that I understand this correctly. Red^WSome distro vendors
>>> backport the world, and call that franken kernel, say, 3.10. Is that
>>> interpretation correct? My hope was that wasn't the case. :-(
>> 
>> Yup, indeed. All kernels shipped for the entire lifetime of RHEL8 think
>> they are v4.18.0... :/
>> 
>> I don't think we're the only ones doing it (there are examples in the
>> embedded world as well, for instance, and not sure about the other
>> enterprise distros), but RHEL is probably the most extreme example.
>> 
>> We could patch the version check in the distro-supplied version of
>> libbpf, of course, but that doesn't help anyone using upstream versions,
>> and given the prevalence of vendoring libbpf, I fear that going with the
>> version check will just result in a bad user experience...
>>
>
> Ok! Thanks for clearing that out!
>
>>> Would it make sense with some kind of BPF-specific "supported
>>> features" mechanism? Something else with a bigger scope (whole
>>> kernel)?
>> 
>> Heh, in my opinion, yeah. Seems like we'll finally get it for XDP, but
>> for BPF in general the approach has always been probing AFAICT.
>> 
>> For the particular case of arguments to helpers, I suppose the verifier
>> could technically validate value ranges for flags arguments, say. That
>> would be nice as an early reject anyway, but I'm not sure if it is
>> possible to add after-the-fact without breaking existing programs
>> because the verifier can't prove the argument is within the valid range.
>> And of course it doesn't help you with compatibility with
>> already-released kernels.
>>
>
> Hmm, think I have a way forward. I'll use BPF_PROG_TEST_RUN.
>
> If the load fail for the new helper, fallback to bpf_redirect_map(). Use
> BPF_PROG_TEST_RUN to make sure that "action via flags" passes.
>
> That should work for you guys as well, right? I'll take a stab at it.

Yup, think so - SGTM! :)

-Toke



* Re: [PATCH bpf-next v2 4/8] xsk: register XDP sockets at bind(), and add new AF_XDP BPF helper
  2021-01-20 17:29           ` Toke Høiland-Jørgensen
@ 2021-01-20 18:22             ` Björn Töpel
  2021-01-20 20:26               ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 45+ messages in thread
From: Björn Töpel @ 2021-01-20 18:22 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Björn Töpel, ast,
	daniel, netdev, bpf
  Cc: magnus.karlsson, maciej.fijalkowski, kuba, jonathan.lemon,
	maximmi, davem, hawk, john.fastabend, ciara.loftus,
	weqaar.a.janjua

On 2021-01-20 18:29, Toke Høiland-Jørgensen wrote:
> Björn Töpel <bjorn.topel@intel.com> writes:
> 
>> On 2021-01-20 15:54, Toke Høiland-Jørgensen wrote:
>>> Björn Töpel <bjorn.topel@intel.com> writes:
>>>
>>>> On 2021-01-20 13:50, Toke Høiland-Jørgensen wrote:
>>>>> Björn Töpel <bjorn.topel@gmail.com> writes:
>>>>>
>>>>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>>>>>> index c001766adcbc..bbc7d9a57262 100644
>>>>>> --- a/include/uapi/linux/bpf.h
>>>>>> +++ b/include/uapi/linux/bpf.h
>>>>>> @@ -3836,6 +3836,12 @@ union bpf_attr {
>>>>>>      *	Return
>>>>>>      *		A pointer to a struct socket on success or NULL if the file is
>>>>>>      *		not a socket.
>>>>>> + *
>>>>>> + * long bpf_redirect_xsk(struct xdp_buff *xdp_md, u64 action)
>>>>>> + *	Description
>>>>>> + *		Redirect to the registered AF_XDP socket.
>>>>>> + *	Return
>>>>>> + *		**XDP_REDIRECT** on success, otherwise the action parameter is returned.
>>>>>>      */
>>>>>
>>>>> I think it would be better to make the second argument a 'flags'
>>>>> argument and make values > XDP_TX invalid (like we do in
>>>>> bpf_xdp_redirect_map() now). By allowing any value as return you lose
>>>>> the ability to turn it into a flags argument later...
>>>>>
>>>>
>>>> Yes, but that adds a run-time check. I prefer this non-checked version,
>>>> even though it is a bit less futureproof.
>>>
>>> That...seems a bit short-sighted? :)
>>> Can you actually see a difference in your performance numbers?
>>>
>>
>> I would rather add an additional helper *if* we see the need for flags,
>> instead of paying for that upfront. For me, BPF is about being able to
>> specialize, and not having one call with tons of checks.
> 
> I get that, I'm just pushing back because omitting a 'flags' argument is
> literally among the most frequent reasons for having to replace a
> syscall (see e.g., [0]) instead of extending it. And yeah, I do realise
> that the performance implications are different for XDP than for
> syscalls, but maintainability of the API is also important; it's all a
> tradeoff. This will be the third redirect helper variant for XDP and I'd
> hate for the fourth one to have to be bpf_redirect_xsk_flags() because
> it did turn out to be needed...
> 
> (One potential concrete reason for this: I believe Magnus was talking
> about an API that would allow a BPF program to redirect a packet into
> more than one socket (cloning it in the process), or to redirect to a
> socket+another target. How would you do that with this new helper?)
> 
> [0] https://lwn.net/Articles/585415/
>

I have a bit of a different view. One of the really nice parts about BPF
is exactly specialization. A user can tailor the kernel to do a specific
thing. I *don't* see an issue with yet another helper, if that is needed
in the future. I think that is better than bloated helpers trying to
cope with all scenarios. I don't mean we should just add helpers all over
the place, but I do look more lightly on adding helpers than on adding
syscalls.

Elaborating a bit on this: many device drivers try to handle all the
things in the fast path. I see BPF as one way forward to move away
from that. Set up what you need, and only run what you currently need,
instead of the current "Is bleh on, then baz? Is this on, then that."

So, I would like to avoid "future proofing" the helpers, if that makes
sense. Use what you need. That's why BPF is so good (one of the things)!

As for bpf_redirect_xsk(), it's a leaner version of bpf_redirect_map().
You want flags/shared sockets/...? Well, go use bpf_redirect_map() and
XSKMAP. bpf_redirect_xsk() is not for you.

A lot of back-and-forth for *one* if-statement, but it's kind of a
design thing for me. ;-)


Björn


>> (Related; Going forward, the growing switch() for redirect targets in
>> xdp_do_redirect() is a concern for me...)
>>
>> And yes, even with all those fancy branch predictors, less instructions
>> is still less. :-) (It shows in my ubenchs.)
> 
> Right, I do agree that the run-time performance hit of checking the flag
> sucks (along with being hard to check for, cf. our parallel discussion
> about version checks). So ideally this would be fixed by having the
> verifier enforce the argument ranges instead; but if we merge this
> without the runtime check now we can't add that later without
> potentially breaking programs... :(
>
> -Toke
> 
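The "one if-statement" being argued over can be made concrete with a small userspace model. This is a sketch, not the proposed kernel code: the checked variant rejects values above XDP_TX up front (as bpf_xdp_redirect_map() does today, reserving higher bits for future flags), while the unchecked variant passes any value through. The enum values follow the UAPI xdp_action definition.

```c
#include <assert.h>

/* Subset of the UAPI enum xdp_action. */
enum xdp_action {
	XDP_ABORTED = 0,
	XDP_DROP,
	XDP_PASS,
	XDP_TX,
	XDP_REDIRECT,
};

/* Unchecked variant: whatever value the program passed is returned
 * when no socket is registered. Cheapest, but higher values can never
 * be repurposed as flag bits later. */
static long redirect_xsk_unchecked(int registered, unsigned long action)
{
	return registered ? XDP_REDIRECT : (long)action;
}

/* Checked variant: the run-time check under discussion. Values above
 * XDP_TX are rejected, keeping them available as future flags. */
static long redirect_xsk_checked(int registered, unsigned long action)
{
	if (action > XDP_TX)
		return XDP_ABORTED;
	return registered ? XDP_REDIRECT : (long)action;
}
```

The functional difference is exactly one comparison per packet; the API difference is whether the upper value range stays reserved.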


* Re: [PATCH bpf-next v2 5/8] libbpf, xsk: select AF_XDP BPF program based on kernel version
  2021-01-20 15:27           ` Björn Töpel
  2021-01-20 17:30             ` Toke Høiland-Jørgensen
@ 2021-01-20 18:25             ` Alexei Starovoitov
  2021-01-20 18:30               ` Björn Töpel
  1 sibling, 1 reply; 45+ messages in thread
From: Alexei Starovoitov @ 2021-01-20 18:25 UTC (permalink / raw)
  To: Björn Töpel, Andrii Nakryiko
  Cc: Toke Høiland-Jørgensen, Björn Töpel,
	Alexei Starovoitov, Daniel Borkmann, Network Development, bpf,
	Karlsson, Magnus, Fijalkowski, Maciej, Jakub Kicinski,
	Jonathan Lemon, maximmi, David S. Miller, Jesper Dangaard Brouer,
	John Fastabend, Ciara Loftus, weqaar.a.janjua, Marek Majtyka

On Wed, Jan 20, 2021 at 7:27 AM Björn Töpel <bjorn.topel@intel.com> wrote:
>
> >> Would it make sense with some kind of BPF-specific "supported
> >> features" mechanism? Something else with a bigger scope (whole
> >> kernel)?
> >
> > Heh, in my opinion, yeah. Seems like we'll finally get it for XDP, but
> > for BPF in general the approach has always been probing AFAICT.
> >
> > For the particular case of arguments to helpers, I suppose the verifier
> > could technically validate value ranges for flags arguments, say. That
> > would be nice as an early reject anyway, but I'm not sure if it is
> > possible to add after-the-fact without breaking existing programs
> > because the verifier can't prove the argument is within the valid range.
> > And of course it doesn't help you with compatibility with
> > already-released kernels.
> >
>
> Hmm, think I have a way forward. I'll use BPF_PROG_TEST_RUN.
>
> If the load fails for the new helper, fall back to bpf_redirect_map(). Use
> BPF_PROG_TEST_RUN to make sure that "action via flags" passes.

+1 to Toke's point. No version checks please.
One way to detect is to try prog_load. Search for FEAT_* in libbpf.
Another approach is to scan vmlinux BTF for necessary helpers.
Currently libbpf is relying on the former.
I think going forward would be good to detect features via BTF.
It's going to be much faster and won't create noise for audit that
could be looking at prog_load calls.


* Re: [PATCH bpf-next v2 5/8] libbpf, xsk: select AF_XDP BPF program based on kernel version
  2021-01-20 18:25             ` Alexei Starovoitov
@ 2021-01-20 18:30               ` Björn Töpel
  0 siblings, 0 replies; 45+ messages in thread
From: Björn Töpel @ 2021-01-20 18:30 UTC (permalink / raw)
  To: Alexei Starovoitov, Andrii Nakryiko
  Cc: Toke Høiland-Jørgensen, Björn Töpel,
	Alexei Starovoitov, Daniel Borkmann, Network Development, bpf,
	Karlsson, Magnus, Fijalkowski, Maciej, Jakub Kicinski,
	Jonathan Lemon, maximmi, David S. Miller, Jesper Dangaard Brouer,
	John Fastabend, Ciara Loftus, weqaar.a.janjua, Marek Majtyka

On 2021-01-20 19:25, Alexei Starovoitov wrote:
> On Wed, Jan 20, 2021 at 7:27 AM Björn Töpel <bjorn.topel@intel.com> wrote:
>>
>>>> Would it make sense with some kind of BPF-specific "supported
>>>> features" mechanism? Something else with a bigger scope (whole
>>>> kernel)?
>>>
>>> Heh, in my opinion, yeah. Seems like we'll finally get it for XDP, but
>>> for BPF in general the approach has always been probing AFAICT.
>>>
>>> For the particular case of arguments to helpers, I suppose the verifier
>>> could technically validate value ranges for flags arguments, say. That
>>> would be nice as an early reject anyway, but I'm not sure if it is
>>> possible to add after-the-fact without breaking existing programs
>>> because the verifier can't prove the argument is within the valid range.
>>> And of course it doesn't help you with compatibility with
>>> already-released kernels.
>>>
>>
>> Hmm, think I have a way forward. I'll use BPF_PROG_TEST_RUN.
>>
>> If the load fails for the new helper, fall back to bpf_redirect_map(). Use
>> BPF_PROG_TEST_RUN to make sure that "action via flags" passes.
> 
> +1 to Toke's point. No version checks please.
> One way to detect is to try prog_load. Search for FEAT_* in libbpf.
> Another approach is to scan vmlinux BTF for necessary helpers.
> Currently libbpf is relying on the former.
> I think going forward would be good to detect features via BTF.
> It's going to be much faster and won't create noise for audit that
> could be looking at prog_load calls.
> 

Thanks Alexei. I'll explore both options for the next spin!


Björn


* Re: [PATCH bpf-next v2 4/8] xsk: register XDP sockets at bind(), and add new AF_XDP BPF helper
  2021-01-20 18:22             ` Björn Töpel
@ 2021-01-20 20:26               ` Toke Høiland-Jørgensen
  2021-01-20 21:15                 ` Alexei Starovoitov
  0 siblings, 1 reply; 45+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-01-20 20:26 UTC (permalink / raw)
  To: Björn Töpel, Björn Töpel, ast, daniel, netdev, bpf
  Cc: magnus.karlsson, maciej.fijalkowski, kuba, jonathan.lemon,
	maximmi, davem, hawk, john.fastabend, ciara.loftus,
	weqaar.a.janjua

Björn Töpel <bjorn.topel@intel.com> writes:

> On 2021-01-20 18:29, Toke Høiland-Jørgensen wrote:
>> Björn Töpel <bjorn.topel@intel.com> writes:
>> 
>>> On 2021-01-20 15:54, Toke Høiland-Jørgensen wrote:
>>>> Björn Töpel <bjorn.topel@intel.com> writes:
>>>>
>>>>> On 2021-01-20 13:50, Toke Høiland-Jørgensen wrote:
>>>>>> Björn Töpel <bjorn.topel@gmail.com> writes:
>>>>>>
>>>>>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>>>>>>> index c001766adcbc..bbc7d9a57262 100644
>>>>>>> --- a/include/uapi/linux/bpf.h
>>>>>>> +++ b/include/uapi/linux/bpf.h
>>>>>>> @@ -3836,6 +3836,12 @@ union bpf_attr {
>>>>>>>      *	Return
>>>>>>>      *		A pointer to a struct socket on success or NULL if the file is
>>>>>>>      *		not a socket.
>>>>>>> + *
>>>>>>> + * long bpf_redirect_xsk(struct xdp_buff *xdp_md, u64 action)
>>>>>>> + *	Description
>>>>>>> + *		Redirect to the registered AF_XDP socket.
>>>>>>> + *	Return
>>>>>>> + *		**XDP_REDIRECT** on success, otherwise the action parameter is returned.
>>>>>>>      */
>>>>>>
>>>>>> I think it would be better to make the second argument a 'flags'
>>>>>> argument and make values > XDP_TX invalid (like we do in
>>>>>> bpf_xdp_redirect_map() now). By allowing any value as return you lose
>>>>>> the ability to turn it into a flags argument later...
>>>>>>
>>>>>
>>>>> Yes, but that adds a run-time check. I prefer this non-checked version,
>>>>> even though it is a bit less futureproof.
>>>>
>>>> That...seems a bit short-sighted? :)
>>>> Can you actually see a difference in your performance numbers?
>>>>
>>>
>>> I would rather add an additional helper *if* we see the need for flags,
>>> instead of paying for that upfront. For me, BPF is about being able to
>>> specialize, and not having one call with tons of checks.
>> 
>> I get that, I'm just pushing back because omitting a 'flags' argument is
>> literally among the most frequent reasons for having to replace a
>> syscall (see e.g., [0]) instead of extending it. And yeah, I do realise
>> that the performance implications are different for XDP than for
>> syscalls, but maintainability of the API is also important; it's all a
>> tradeoff. This will be the third redirect helper variant for XDP and I'd
>> hate for the fourth one to have to be bpf_redirect_xsk_flags() because
>> it did turn out to be needed...
>> 
>> (One potential concrete reason for this: I believe Magnus was talking
>> about an API that would allow a BPF program to redirect a packet into
>> more than one socket (cloning it in the process), or to redirect to a
>> socket+another target. How would you do that with this new helper?)
>> 
>> [0] https://lwn.net/Articles/585415/
>>
>
> I have a bit of a different view. One of the really nice parts about BPF
> is exactly specialization. A user can tailor the kernel to do a specific
> thing. I *don't* see an issue with yet another helper, if that is needed
> in the future. I think that is better than bloated helpers trying to
> cope with all scenarios. I don't mean we should just add helpers all over
> the place, but I do look more lightly on adding helpers than on adding
> syscalls.
>
> Elaborating a bit on this: many device drivers try to handle all the
> things in the fast path. I see BPF as one way forward to move away
> from that. Set up what you need, and only run what you currently need,
> instead of the current "Is bleh on, then baz? Is this on, then that."
>
> So, I would like to avoid "future proofing" the helpers, if that makes
> sense. Use what you need. That's why BPF is so good (one of the
> things)!

Well, it's a tradeoff. We're still defining an API that should not be
(too) confusing...

> As for bpf_redirect_xsk(), it's a leaner version of bpf_redirect_map().
> You want flags/shared sockets/...? Well, go use bpf_redirect_map() and
> XSKMAP. bpf_redirect_xsk() is not for you.

This argument, however, I buy: bpf_redirect() is the single-purpose
helper for redirecting to an ifindex, bpf_redirect_xsk() is the
single-purpose helper for redirecting to an XSK, and bpf_redirect_map()
is the generic one that does both of those and more. Fair enough,
consider me convinced :)

> A lot of back-and-forth for *one* if-statement, but it's kind of a
> design thing for me. ;-)

Surely you don't mean to imply that you have *better* things to do with
your time than have a 10-emails-long argument over a single if
statement? ;)

-Toke



* Re: [PATCH bpf-next v2 4/8] xsk: register XDP sockets at bind(), and add new AF_XDP BPF helper
  2021-01-20 20:26               ` Toke Høiland-Jørgensen
@ 2021-01-20 21:15                 ` Alexei Starovoitov
  2021-01-21  8:18                   ` Björn Töpel
  0 siblings, 1 reply; 45+ messages in thread
From: Alexei Starovoitov @ 2021-01-20 21:15 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Björn Töpel, Björn Töpel, Alexei Starovoitov,
	Daniel Borkmann, Network Development, bpf, Karlsson, Magnus,
	Fijalkowski, Maciej, Jakub Kicinski, Jonathan Lemon, maximmi,
	David S. Miller, Jesper Dangaard Brouer, John Fastabend,
	Ciara Loftus, weqaar.a.janjua

On Wed, Jan 20, 2021 at 12:26 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> This argument, however, I buy: bpf_redirect() is the single-purpose
> helper for redirecting to an ifindex, bpf_redirect_xsk() is the
> single-purpose helper for redirecting to an XSK, and bpf_redirect_map()
> is the generic one that does both of those and more. Fair enough,
> consider me convinced :)
>
> > A lot of back-and-forth for *one* if-statement, but it's kind of a
> > design thing for me. ;-)
>
> Surely you don't mean to imply that you have *better* things to do with
> your time than have a 10-emails-long argument over a single if
> statement? ;)

After reading this thread I think I have to pour cold water on the design.

The performance blip comes from hard coded assumptions:
+ queue_id = xdp->rxq->queue_index;
+ xs = READ_ONCE(dev->_rx[queue_id].xsk);

bpf can have specialized helpers, but imo this is beyond what's reasonable.
Please move such things into the program and try to make
bpf_redirect_map faster.

Making af_xdp non-root is orthogonal. If there is actual need for that
it has to be designed thoroughly and not presented as "this helper may
help to do that".
I don't think "may" will materialize unless people actually work
toward the goal of non-root.
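The two quoted lines amount to a single indexed load keyed by the receive queue, which is where the performance comes from. A userspace model of that fast path (struct and field names mirror the patch for readability, but this is an illustration, not the kernel code):

```c
#include <assert.h>
#include <stddef.h>

struct xdp_sock { int fd; };		/* stand-in for the real socket */

struct netdev_rx_queue {
	struct xdp_sock *xsk;		/* socket registered at bind() */
};

struct net_device {
	struct netdev_rx_queue _rx[4];	/* per-queue state; 4 queues here */
};

/* Model of the bpf_redirect_xsk() fast path: one array index, no map
 * lookup and no netdev/queue-id validation, because registration at
 * bind() already established those invariants. */
static struct xdp_sock *lookup_registered_xsk(struct net_device *dev,
					      unsigned int queue_id)
{
	return dev->_rx[queue_id].xsk;
}
```

The objection above is precisely that these invariants are hard-coded into the kernel rather than expressed in the BPF program.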


* Re: [PATCH bpf-next v2 7/8] selftest/bpf: add XDP socket tests for bpf_redirect_{xsk, map}()
  2021-01-19 15:50 ` [PATCH bpf-next v2 7/8] selftest/bpf: add XDP socket tests for bpf_redirect_{xsk, map}() Björn Töpel
@ 2021-01-21  7:39   ` Andrii Nakryiko
  2021-01-21 12:31     ` Björn Töpel
  0 siblings, 1 reply; 45+ messages in thread
From: Andrii Nakryiko @ 2021-01-21  7:39 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Alexei Starovoitov, Daniel Borkmann, Networking, bpf,
	Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
	Jakub Kicinski, Jonathan Lemon, maximmi, David S. Miller,
	Jesper Dangaard Brouer, john fastabend, ciara.loftus,
	weqaar.a.janjua

On Tue, Jan 19, 2021 at 7:55 AM Björn Töpel <bjorn.topel@gmail.com> wrote:
>
> From: Björn Töpel <bjorn.topel@intel.com>
>
> Add support for externally loaded XDP programs to
> xdpxceiver/test_xsk.sh, so that bpf_redirect_xsk() and
> bpf_redirect_map() can be exercised.
>
> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
> ---
>  .../selftests/bpf/progs/xdpxceiver_ext1.c     | 15 ++++
>  .../selftests/bpf/progs/xdpxceiver_ext2.c     |  9 +++
>  tools/testing/selftests/bpf/test_xsk.sh       | 48 ++++++++++++
>  tools/testing/selftests/bpf/xdpxceiver.c      | 77 ++++++++++++++++++-
>  tools/testing/selftests/bpf/xdpxceiver.h      |  2 +
>  5 files changed, 147 insertions(+), 4 deletions(-)
>  create mode 100644 tools/testing/selftests/bpf/progs/xdpxceiver_ext1.c
>  create mode 100644 tools/testing/selftests/bpf/progs/xdpxceiver_ext2.c
>
> diff --git a/tools/testing/selftests/bpf/progs/xdpxceiver_ext1.c b/tools/testing/selftests/bpf/progs/xdpxceiver_ext1.c
> new file mode 100644
> index 000000000000..18894040cca6
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/xdpxceiver_ext1.c
> @@ -0,0 +1,15 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <linux/bpf.h>
> +#include <bpf/bpf_helpers.h>
> +
> +struct {
> +       __uint(type, BPF_MAP_TYPE_XSKMAP);
> +       __uint(max_entries, 32);
> +       __uint(key_size, sizeof(int));
> +       __uint(value_size, sizeof(int));
> +} xsks_map SEC(".maps");
> +
> +SEC("xdp_sock") int xdp_sock_prog(struct xdp_md *ctx)

hmm.. that's unconventional... please keep SEC() on separate line

> +{
> +       return bpf_redirect_map(&xsks_map, ctx->rx_queue_index, XDP_DROP);
> +}
> diff --git a/tools/testing/selftests/bpf/progs/xdpxceiver_ext2.c b/tools/testing/selftests/bpf/progs/xdpxceiver_ext2.c
> new file mode 100644
> index 000000000000..bd239b958c01
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/xdpxceiver_ext2.c
> @@ -0,0 +1,9 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <linux/bpf.h>
> +#include <bpf/bpf_helpers.h>
> +
> +SEC("xdp_sock") int xdp_sock_prog(struct xdp_md *ctx)

same here

> +{
> +       return bpf_redirect_xsk(ctx, XDP_DROP);
> +}
> +

[...]


* Re: [PATCH bpf-next v2 4/8] xsk: register XDP sockets at bind(), and add new AF_XDP BPF helper
  2021-01-20 21:15                 ` Alexei Starovoitov
@ 2021-01-21  8:18                   ` Björn Töpel
  0 siblings, 0 replies; 45+ messages in thread
From: Björn Töpel @ 2021-01-21  8:18 UTC (permalink / raw)
  To: Alexei Starovoitov, Toke Høiland-Jørgensen
  Cc: Björn Töpel, Alexei Starovoitov, Daniel Borkmann,
	Network Development, bpf, Karlsson, Magnus, Fijalkowski, Maciej,
	Jakub Kicinski, Jonathan Lemon, maximmi, David S. Miller,
	Jesper Dangaard Brouer, John Fastabend, Ciara Loftus,
	weqaar.a.janjua

On 2021-01-20 22:15, Alexei Starovoitov wrote:
> On Wed, Jan 20, 2021 at 12:26 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> This argument, however, I buy: bpf_redirect() is the single-purpose
>> helper for redirecting to an ifindex, bpf_redirect_xsk() is the
>> single-purpose helper for redirecting to an XSK, and bpf_redirect_map()
>> is the generic one that does both of those and more. Fair enough,
>> consider me convinced :)
>>
>>> A lot of back-and-forth for *one* if-statement, but it's kind of a
>>> design thing for me. ;-)
>>
>> Surely you don't mean to imply that you have *better* things to do with
>> your time than have a 10-emails-long argument over a single if
>> statement? ;)
> 
> After reading this thread I think I have to pour cold water on the design.
> 
> The performance blip comes from hard coded assumptions:
> + queue_id = xdp->rxq->queue_index;
> + xs = READ_ONCE(dev->_rx[queue_id].xsk);
>

Yes, one can see this as a constrained map:

* The map belongs to a certain netdev.
* Each entry corresponds to a certain queue id.

I.e. if we do a successful (non-NULL) lookup, we *know* that all sockets
in that map belong to the netdev and have the correct queue id.

By doing that we can get rid of two run-time checks: "Is the socket
bound to this netdev?" and "Is this the correct queue id?".

> bpf can have specialized helpers, but imo this is beyond what's reasonable.
> Please move such things into the program and try to make
> bpf_redirect_map faster.
>

I obviously prefer this path, and ideally combined with a way to even
more specialize xdp_do_redirect(). Then again, you are the maintainer! :-)

Maybe an alternative would be adding a new type of XSKMAP, constrained in
a similar way as above, and continuing with bpf_redirect_map(), but with
this new map the index argument would be ignored. Unfortunately the BPF
context (xdp_buff in this case) is not passed to bpf_redirect_map(), so
getting the actual queue_id in the helper is hard. Adding the context as
an additional argument would be a new helper...

I'll need to think a bit more about it. Input/ideas are welcome!


> Making af_xdp non-root is orthogonal. If there is actual need for that
> it has to be designed thoroughly and not presented as "this helper may
> help to do that".
> I don't think "may" will materialize unless people actually work
> toward the goal of non-root.
> 

Fair enough! The same goal could be reached using the existing map approach.


Björn
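The two run-time checks that a constrained map would eliminate can be shown side by side. This is a userspace model with illustrative field names, not kernel code: a generic XSKMAP entry may hold a socket bound to any device/queue, so the redirect path must validate both, while a per-netdev, per-queue slot filled at bind() needs no validation.

```c
#include <assert.h>
#include <stddef.h>

struct xsk_sock {
	int ifindex;	/* device the socket is bound to */
	int queue_id;	/* queue the socket is bound to */
};

/* Generic XSKMAP path: the entry could point anywhere, so two checks
 * are needed before redirecting. */
static struct xsk_sock *xskmap_lookup(struct xsk_sock **map, int index,
				      int rx_ifindex, int rx_queue)
{
	struct xsk_sock *xs = map[index];

	if (!xs || xs->ifindex != rx_ifindex || xs->queue_id != rx_queue)
		return NULL;
	return xs;
}

/* Constrained path: registration guarantees the invariants, so a
 * non-NULL slot is always valid for this device/queue. */
static struct xsk_sock *registered_lookup(struct xsk_sock **rx_slots,
					  int rx_queue)
{
	return rx_slots[rx_queue];
}
```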


* Re: [PATCH bpf-next v2 7/8] selftest/bpf: add XDP socket tests for bpf_redirect_{xsk, map}()
  2021-01-21  7:39   ` Andrii Nakryiko
@ 2021-01-21 12:31     ` Björn Töpel
  0 siblings, 0 replies; 45+ messages in thread
From: Björn Töpel @ 2021-01-21 12:31 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Alexei Starovoitov, Daniel Borkmann, Networking, bpf,
	Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
	Jakub Kicinski, Jonathan Lemon, maximmi, David S. Miller,
	Jesper Dangaard Brouer, john fastabend, Ciara Loftus,
	Weqaar Janjua

On Thu, 21 Jan 2021 at 08:39, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>
> On Tue, Jan 19, 2021 at 7:55 AM Björn Töpel <bjorn.topel@gmail.com> wrote:
> >
> > From: Björn Töpel <bjorn.topel@intel.com>
> >
> > Add support for externally loaded XDP programs to
> > xdpxceiver/test_xsk.sh, so that bpf_redirect_xsk() and
> > bpf_redirect_map() can be exercised.
> >
> > Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
> > ---
> >  .../selftests/bpf/progs/xdpxceiver_ext1.c     | 15 ++++
> >  .../selftests/bpf/progs/xdpxceiver_ext2.c     |  9 +++
> >  tools/testing/selftests/bpf/test_xsk.sh       | 48 ++++++++++++
> >  tools/testing/selftests/bpf/xdpxceiver.c      | 77 ++++++++++++++++++-
> >  tools/testing/selftests/bpf/xdpxceiver.h      |  2 +
> >  5 files changed, 147 insertions(+), 4 deletions(-)
> >  create mode 100644 tools/testing/selftests/bpf/progs/xdpxceiver_ext1.c
> >  create mode 100644 tools/testing/selftests/bpf/progs/xdpxceiver_ext2.c
> >
> > diff --git a/tools/testing/selftests/bpf/progs/xdpxceiver_ext1.c b/tools/testing/selftests/bpf/progs/xdpxceiver_ext1.c
> > new file mode 100644
> > index 000000000000..18894040cca6
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/progs/xdpxceiver_ext1.c
> > @@ -0,0 +1,15 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +#include <linux/bpf.h>
> > +#include <bpf/bpf_helpers.h>
> > +
> > +struct {
> > +       __uint(type, BPF_MAP_TYPE_XSKMAP);
> > +       __uint(max_entries, 32);
> > +       __uint(key_size, sizeof(int));
> > +       __uint(value_size, sizeof(int));
> > +} xsks_map SEC(".maps");
> > +
> > +SEC("xdp_sock") int xdp_sock_prog(struct xdp_md *ctx)
>
> hmm.. that's unconventional... please keep SEC() on separate line
>
> > +{
> > +       return bpf_redirect_map(&xsks_map, ctx->rx_queue_index, XDP_DROP);
> > +}
> > diff --git a/tools/testing/selftests/bpf/progs/xdpxceiver_ext2.c b/tools/testing/selftests/bpf/progs/xdpxceiver_ext2.c
> > new file mode 100644
> > index 000000000000..bd239b958c01
> > --- /dev/null
> > +++ b/tools/testing/selftests/bpf/progs/xdpxceiver_ext2.c
> > @@ -0,0 +1,9 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +#include <linux/bpf.h>
> > +#include <bpf/bpf_helpers.h>
> > +
> > +SEC("xdp_sock") int xdp_sock_prog(struct xdp_md *ctx)
>
> same here
>

Thanks Andrii! I'll make sure to have the SECs on separate lines going forward!

Björn


* Re: [PATCH bpf-next v2 0/8] Introduce bpf_redirect_xsk() helper
  2021-01-20 16:19     ` Maciej Fijalkowski
@ 2021-01-21 17:01       ` Jesper Dangaard Brouer
  2021-01-22  8:59         ` Magnus Karlsson
  0 siblings, 1 reply; 45+ messages in thread
From: Jesper Dangaard Brouer @ 2021-01-21 17:01 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: Maxim Mikityanskiy, Björn Töpel, ast, daniel, netdev,
	bpf, bjorn.topel, magnus.karlsson, kuba, jonathan.lemon, davem,
	john.fastabend, ciara.loftus, weqaar.a.janjua, brouer

On Wed, 20 Jan 2021 17:19:31 +0100
Maciej Fijalkowski <maciej.fijalkowski@intel.com> wrote:

> On Wed, Jan 20, 2021 at 04:57:08PM +0100, Jesper Dangaard Brouer wrote:
> > On Wed, 20 Jan 2021 15:15:22 +0200
> > Maxim Mikityanskiy <maximmi@nvidia.com> wrote:
> >   
> > > On 2021-01-19 17:50, Björn Töpel wrote:  
> > > > This series extends bind() for XDP sockets, so that the bound socket
> > > > is added to the netdev_rx_queue _rx array in the netdevice. We call
> > > > this to register the socket. To redirect packets to the registered
> > > > socket, a new BPF helper is used: bpf_redirect_xsk().
> > > > 
> > > > For shared XDP sockets, only the first bound socket is
> > > > > registered. Users that need more complex setup have to use XSKMAP and
> > > > bpf_redirect_map().
> > > > 
> > > > Now, why would one use bpf_redirect_xsk() over the regular
> > > > bpf_redirect_map() helper?
> > > > 
> > > > * Better performance!
> > > > > * Convenience; Most users use one socket per queue. This scenario is
> > > > >    what registered sockets support. There is no need to create an
> > > > >    XSKMAP. This can also reduce complexity in containerized setups,
> > > > >    where users might want to use XDP sockets without CAP_SYS_ADMIN
> > > >    capabilities.  
> > 
> > I'm buying into the convenience and reduced complexity, and XDP sockets
> > without CAP_SYS_ADMIN into containers.
> > 
> > People might be surprised that I'm actually NOT buying into the better
> > performance argument here.  At these speeds we are basically comparing
> > how close we are to zero (and have to use nanosec time scale for our
> > comparisons), more below.
> > 
> >    
> > > > The first patch restructures xdp_do_redirect() a bit, to make it
> > > > easier to add the new helper. This restructure also gives us a slight
> > > > performance benefit. The following three patches extend bind() and
> > > > add the new helper. After that, two libbpf patches that select the XDP
> > > > program based on what kernel is running. Finally, selftests for the new
> > > > functionality are added.
> > > > 
> > > > Note that the libbpf "auto-selection" is based on kernel version, so
> > > > it is hard coded to the "-next" version (5.12). If you would like to
> > > > try this out, you will need to change the libbpf patch locally!
> > > > 
> > > > Thanks to Maciej and Magnus for the internal review/comments!
> > > > 
> > > > Performance (rxdrop, zero-copy)
> > > > 
> > > > Baseline
> > > > Two cores:                   21.3 Mpps
> > > > One core:                    24.5 Mpps    
> > > 
> > > Two cores is slower? It used to be faster all the time, didn't it?
> > >   
> > > > Patched
> > > > Two cores, bpf_redirect_map: 21.7 Mpps + 2%
> > > > One core, bpf_redirect_map:  24.9 Mpps + 2%
> > > > 
> > > > Two cores, bpf_redirect_xsk: 24.0 Mpps +13%    
> > > 
> > > Nice, impressive improvement!  
> > 
> > I do appreciate you work and performance optimizations at this level,
> > because when we are using this few CPU cycles per packet, then it is
> > really hard to find new ways to reduce cycles further.
> > 
> > Thank for you saying +13% instead of saying +2.7 Mpps.
> > It *is* impressive to basically reduce cycles with 13%.
> > 
> >  21.3 Mpps = 46.94 nanosec per packet
> >  24.0 Mpps = 41.66 nanosec per packet
> > 
> >  21.3 Mpps -> 24.0 Mpps = 5.28 nanosec saved
> > 
> > On my 3.60GHz testlab machine that gives me 19 cycles.
> > 
> >    
> > > > One core, bpf_redirect_xsk:  25.5 Mpps + 4%
> > > >  
> > 
> >  24.5 Mpps -> 25.5 Mpps = 1.6 nanosec saved
> > 
> > At this point with saving 1.6 ns this is around the cost of a function call 1.3 ns.
> > 
> > 
> > We still need these optimizations in the kernel, but end-users in
> > userspace are very quickly going to waste the 19 cycles we found.
> > I still support/believe that the OS needs to have as little overhead as
> > possible, but for me 42 nanosec overhead is close to zero overhead. For
> > comparison, I just ran a udp_sink[3] test, and it "cost" 625 ns for
> > delivery of UDP packets into socket (almost 15 times slower).
> > 
> > I guess my point is that with XDP we have already achieved and exceeded
> > (my original) performance goals; making it even faster is just showing off ;-P  
> 
> Even though I'll let Bjorn elaborate on this, we're talking here about
> AF-XDP, which is a bit of a different pair of shoes to me in terms of
> performance. AFAIK we still have a gap when compared to DPDK's numbers. So

AFAIK the gap on this i40e hardware is DPDK=36 Mpps, and let's say
AF_XDP=25.5 Mpps. That looks large, +10.5 Mpps, but it is only 11.4
nanosec.
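The Mpps-to-nanoseconds conversions used throughout this thread are simply 1000 / Mpps, and cycles are nanoseconds times the clock rate in GHz. A few lines of C to check the arithmetic (a sketch for the numbers only):

```c
#include <assert.h>

/* Nanoseconds per packet at a given rate in millions of packets/sec:
 * 1e9 ns/s divided by (mpps * 1e6) packets/s = 1000 / mpps. */
static double ns_per_packet(double mpps)
{
	return 1000.0 / mpps;
}

/* CPU cycles corresponding to a time delta at a given clock in GHz. */
static double cycles(double ns, double ghz)
{
	return ns * ghz;
}
```

With these, 21.3 Mpps comes out at ~46.95 ns/packet, the 21.3 to 24.0 Mpps improvement saves ~5.28 ns (~19 cycles at 3.6 GHz), and the 25.5 vs 36 Mpps gap is ~11.4 ns, matching the figures quoted above.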

> I'm really not sure why better performance bothers you? :)

Finding those 11.4 nanosec is going to be really difficult and take a
huge effort. My fear is also that some of these optimizations will make
the code harder to maintain and understand.

If you are ready to do this huge effort, then I'm actually ready to
provide ideas on what you can optimize.

E.g. you can get 2-4 nanosec if we prefetch to L2 in the driver (assuming
the data is already in L3 via DDIO). (The CPU supports L2 prefetch, but
the kernel has no users of that feature.)

The base design of the XDP redirect API, which has a number of function
calls per packet, in itself becomes a limiting factor.  Even the cost
of getting the per-CPU pointer to bpf_redirect_info becomes something
to look at.


> Let's rather be more harsh on changes that actually decrease the
> performance, not the other way around. And I suppose you were the one that
> always was bringing up the 'death by a 1000 paper cuts' of XDP. So yeah,
> I'm a bit confused with your statement.

Yes, we really need to protect the existing performance.

I think I'm mostly demotivated by the 'death by a 1000 paper cuts'.
People with good intentions are adding features, and performance slowly
disappears.  I've been spending weeks/months finding nanosec-level
improvements, which I don't have time to "defend". Recently I realized
that the 2nd XDP prog on egress, even when no prog is attached, slows
down the performance of base redirect (I don't fully understand why;
maybe it is simply the I-cache increase that I didn't notice when
reviewing the code).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH bpf-next v2 0/8] Introduce bpf_redirect_xsk() helper
  2021-01-21 17:01       ` Jesper Dangaard Brouer
@ 2021-01-22  8:59         ` Magnus Karlsson
  2021-01-22  9:45           ` Maciej Fijalkowski
  0 siblings, 1 reply; 45+ messages in thread
From: Magnus Karlsson @ 2021-01-22  8:59 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Maciej Fijalkowski, Maxim Mikityanskiy, Björn Töpel,
	Alexei Starovoitov, Daniel Borkmann, Network Development, bpf,
	Björn Töpel, Karlsson, Magnus, Jakub Kicinski,
	Jonathan Lemon, David S. Miller, John Fastabend, Ciara Loftus,
	Weqaar Janjua

On Thu, Jan 21, 2021 at 6:07 PM Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>
> On Wed, 20 Jan 2021 17:19:31 +0100
> Maciej Fijalkowski <maciej.fijalkowski@intel.com> wrote:
>
> > On Wed, Jan 20, 2021 at 04:57:08PM +0100, Jesper Dangaard Brouer wrote:
> > > On Wed, 20 Jan 2021 15:15:22 +0200
> > > Maxim Mikityanskiy <maximmi@nvidia.com> wrote:
> > >
> > > > On 2021-01-19 17:50, Björn Töpel wrote:
> > > > > This series extends bind() for XDP sockets, so that the bound socket
> > > > > is added to the netdev_rx_queue _rx array in the netdevice. We call
> > > > > this to register the socket. To redirect packets to the registered
> > > > > socket, a new BPF helper is used: bpf_redirect_xsk().
> > > > >
> > > > > For shared XDP sockets, only the first bound socket is
> > > > > registered. Users that need more complex setup has to use XSKMAP and
> > > > > bpf_redirect_map().
> > > > >
> > > > > Now, why would one use bpf_redirect_xsk() over the regular
> > > > > bpf_redirect_map() helper?
> > > > >
> > > > > * Better performance!
> > > > > * Convenience; Most user use one socket per queue. This scenario is
> > > > >    what registered sockets support. There is no need to create an
> > > > >    XSKMAP. This can also reduce complexity from containerized setups,
> > > > >    where users might what to use XDP sockets without CAP_SYS_ADMIN
> > > > >    capabilities.
> > >
> > > I'm buying into the convenience and reduce complexity, and XDP sockets
> > > without CAP_SYS_ADMIN into containers.
> > >
> > > People might be surprised that I'm actually NOT buying into the better
> > > performance argument here.  At these speeds we are basically comparing
> > > how close we are to zero (and have to use nanosec time scale for our
> > > comparisons), more below.
> > >
> > >
> > > > > The first patch restructures xdp_do_redirect() a bit, to make it
> > > > > easier to add the new helper. This restructure also give us a slight
> > > > > performance benefit. The following three patches extends bind() and
> > > > > adds the new helper. After that, two libbpf patches that selects XDP
> > > > > program based on what kernel is running. Finally, selftests for the new
> > > > > functionality is added.
> > > > >
> > > > > Note that the libbpf "auto-selection" is based on kernel version, so
> > > > > it is hard coded to the "-next" version (5.12). If you would like to
> > > > > try this is out, you will need to change the libbpf patch locally!
> > > > >
> > > > > Thanks to Maciej and Magnus for the internal review/comments!
> > > > >
> > > > > Performance (rxdrop, zero-copy)
> > > > >
> > > > > Baseline
> > > > > Two cores:                   21.3 Mpps
> > > > > One core:                    24.5 Mpps
> > > >
> > > > Two cores is slower? It used to be faster all the time, didn't it?
> > > >
> > > > > Patched
> > > > > Two cores, bpf_redirect_map: 21.7 Mpps + 2%
> > > > > One core, bpf_redirect_map:  24.9 Mpps + 2%
> > > > >
> > > > > Two cores, bpf_redirect_xsk: 24.0 Mpps +13%
> > > >
> > > > Nice, impressive improvement!
> > >
> > > I do appreciate you work and performance optimizations at this level,
> > > because when we are using this few CPU cycles per packet, then it is
> > > really hard to find new ways to reduce cycles further.
> > >
> > > Thank for you saying +13% instead of saying +2.7 Mpps.
> > > It *is* impressive to basically reduce cycles with 13%.
> > >
> > >  21.3 Mpps = 46.94 nanosec per packet
> > >  24.0 Mpps = 41.66 nanosec per packet
> > >
> > >  21.3 Mpps -> 24.0 Mpps = 5.28 nanosec saved
> > >
> > > On my 3.60GHz testlab machine that gives me 19 cycles.
> > >
> > >
> > > > > One core, bpf_redirect_xsk:  25.5 Mpps + 4%
> > > > >
> > >
> > >  24.5 Mpps -> 25.5 Mpps = 1.6 nanosec saved
> > >
> > > At this point with saving 1.6 ns this is around the cost of a function call 1.3 ns.
> > >
> > >
> > > We still need these optimization in the kernel, but end-users in
> > > userspace are very quickly going to waste the 19 cycles we found.
> > > I still support/believe that the OS need to have a little overhead as
> > > possible, but for me 42 nanosec overhead is close to zero overhead. For
> > > comparison, I just ran a udp_sink[3] test, and it "cost" 625 ns for
> > > delivery of UDP packets into socket (almost 15 times slower).
> > >
> > > I guess my point is that with XDP we have already achieved and exceeded
> > > (my original) performance goals, making it even faster is just showing off ;-P
> >
> > Even though I'll let Bjorn elaborating on this, we're talking here about
> > AF-XDP which is a bit different pair of shoes to me in terms of
> > performance. AFAIK we still have a gap when compared to DPDK's numbers. So
>
> AFAIK the gap on this i40e hardware is DPDK=36 Mpps vs. AF_XDP=25.5
> Mpps. That looks like a large gap, +10.5 Mpps, but it is only 11.4
> nanosec per packet.
>
> > I'm really not sure why better performance bothers you? :)
>
> Finding those 11.4 nanosec is going to be really difficult and take a
> huge effort. My fear is also that some of these optimizations will make
> the code harder to maintain and understand.
>
> If you are ready to do this huge effort, then I'm actually ready to
> provide ideas on what you can optimize.
>
> E.g. you can get 2-4 nanosec if we prefetch to L2 in the driver
> (assuming data is already in L3 via DDIO).  (The CPU supports L2
> prefetch, but the kernel has no users of that feature.)

In my experience, I would caution against starting to use L2 prefetch
at this point since it is somewhat finicky. It is possible to get
better performance using it on one platform, but significantly harder
to get it to work on a three-year-old platform as well, not to mention
other platforms and architectures.

Instead, I would start with optimizations that always work such as not
having to execute code (or removing code) that is not necessary for
the functionality that is provided. One example of this is having to
execute all the bpf_redirect code when we are using packet steering in
HW to AF_XDP sockets and other potential consumers. In this scenario,
the core bpf_redirect code (xdp_do_redirect + bpf_xdp_redirect_map +
__xsk_map_redirect for the XSKMAP case) provides no value and amounts
to 19% of the total execution time on my machine (application plus
kernel, and the XDP program itself is another 7% on top of the 19%).
By removing this, you could cut down a large chunk of those 11.4 ns we
would need to find in our optimization work. Just to be clear, there
are plenty of other scenarios where the bpf_redirect functionality
provides great value, so it is really good to have. I just think it is
weird that it has to be executed all the time even when it is not
needed. The big question is just how to achieve this in an
architecturally sound way. Any ideas?

Just note that there is plenty that could be done in other areas too.
For example, the Rx part of the driver is around 22% of the execution
time and I am sure we can optimize this part too and increase the
readability and maintainability of the code at the same time. But not
trim it down to 0% as some of this is actually necessary work in all
cases where traffic is received :-). xp_alloc is another big time
consumer of cycles, but for this one I do have a patch set that cuts
it down from 21% to 8% that I plan on submitting in February.

perf top data:
21.58%  [ice]                      [k] ice_clean_rx_irq_zc
21.21%  [kernel]                   [k] xp_alloc
13.93%  [kernel]                   [k] __xsk_rcv_zc
 8.53%  [kernel]                   [k] xdp_do_redirect
 8.15%  [kernel]                   [k] xsk_rcv
 6.66%  [kernel]                   [k] bpf_xdp_redirect_map
 5.08%  bpf_prog_992d9ddc835e5629  [k] bpf_prog_992d9ddc835e5629
 3.73%  [ice]                      [k] ice_alloc_rx_bufs_zc
 3.36%  [kernel]                   [k] __xsk_map_redirect
 2.62%  xdpsock                    [.] main
 0.96%  [kernel]                   [k] rcu_read_unlock_strict
 0.90%  bpf_dispatcher_xdp         [k] bpf_dispatcher_xdp
 0.87%  [kernel]                   [k] bpf_dispatcher_xdp_func
 0.51%  [kernel]                   [k] xsk_flush
 0.38%  [kernel]                   [k] lapic_next_deadline
 0.13%  [ice]                      [k] ice_napi_poll

> The base design of XDP redirect API, that have a number of function
> calls per packet, in itself becomes a limiting factor.  Even the cost
> of getting the per-CPU pointer to bpf_redirect_info becomes something
> to look at.
>
>
> > Let's rather be more harsh on changes that actually decrease the
> > performance, not the other way around. And I suppose you were the one that
> > always was bringing up the 'death by a 1000 paper cuts' of XDP. So yeah,
> > I'm a bit confused with your statement.
>
> Yes, we really need to protect the existing performance.
>
> I think I'm mostly demotivated by the 'death by a 1000 paper cuts'.
> People with good intentions are adding features, and performance slowly
> disappears.  I've been spending weeks/months finding nanosec level
> improvements, which I don't have time to "defend". Recently I realized
> that the 2nd XDP prog on egress, even when no prog is attached, is
> slowing down the performance of base redirect (I don't fully understand
> why, maybe it is simply the I-cache increase, I didn't realize when
> reviewing the code).
>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer
>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH bpf-next v2 0/8] Introduce bpf_redirect_xsk() helper
  2021-01-22  8:59         ` Magnus Karlsson
@ 2021-01-22  9:45           ` Maciej Fijalkowski
  0 siblings, 0 replies; 45+ messages in thread
From: Maciej Fijalkowski @ 2021-01-22  9:45 UTC (permalink / raw)
  To: Magnus Karlsson
  Cc: Jesper Dangaard Brouer, Maxim Mikityanskiy,
	Björn Töpel, Alexei Starovoitov, Daniel Borkmann,
	Network Development, bpf, Björn Töpel, Karlsson,
	Magnus, Jakub Kicinski, Jonathan Lemon, David S. Miller,
	John Fastabend, Ciara Loftus, Weqaar Janjua

On Fri, Jan 22, 2021 at 09:59:53AM +0100, Magnus Karlsson wrote:
> On Thu, Jan 21, 2021 at 6:07 PM Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
> >
> > On Wed, 20 Jan 2021 17:19:31 +0100
> > Maciej Fijalkowski <maciej.fijalkowski@intel.com> wrote:
> >
> > > On Wed, Jan 20, 2021 at 04:57:08PM +0100, Jesper Dangaard Brouer wrote:
> > > > On Wed, 20 Jan 2021 15:15:22 +0200
> > > > Maxim Mikityanskiy <maximmi@nvidia.com> wrote:
> > > >
> > > > > On 2021-01-19 17:50, Björn Töpel wrote:
> > > > > > This series extends bind() for XDP sockets, so that the bound socket
> > > > > > is added to the netdev_rx_queue _rx array in the netdevice. We call
> > > > > > this to register the socket. To redirect packets to the registered
> > > > > > socket, a new BPF helper is used: bpf_redirect_xsk().
> > > > > >
> > > > > > For shared XDP sockets, only the first bound socket is
> > > > > > registered. Users that need more complex setup has to use XSKMAP and
> > > > > > bpf_redirect_map().
> > > > > >
> > > > > > Now, why would one use bpf_redirect_xsk() over the regular
> > > > > > bpf_redirect_map() helper?
> > > > > >
> > > > > > * Better performance!
> > > > > > * Convenience; Most user use one socket per queue. This scenario is
> > > > > >    what registered sockets support. There is no need to create an
> > > > > >    XSKMAP. This can also reduce complexity from containerized setups,
> > > > > >    where users might what to use XDP sockets without CAP_SYS_ADMIN
> > > > > >    capabilities.
> > > >
> > > > I'm buying into the convenience and reduce complexity, and XDP sockets
> > > > without CAP_SYS_ADMIN into containers.
> > > >
> > > > People might be surprised that I'm actually NOT buying into the better
> > > > performance argument here.  At these speeds we are basically comparing
> > > > how close we are to zero (and have to use nanosec time scale for our
> > > > comparisons), more below.
> > > >
> > > >
> > > > > > The first patch restructures xdp_do_redirect() a bit, to make it
> > > > > > easier to add the new helper. This restructure also give us a slight
> > > > > > performance benefit. The following three patches extends bind() and
> > > > > > adds the new helper. After that, two libbpf patches that selects XDP
> > > > > > program based on what kernel is running. Finally, selftests for the new
> > > > > > functionality is added.
> > > > > >
> > > > > > Note that the libbpf "auto-selection" is based on kernel version, so
> > > > > > it is hard coded to the "-next" version (5.12). If you would like to
> > > > > > try this is out, you will need to change the libbpf patch locally!
> > > > > >
> > > > > > Thanks to Maciej and Magnus for the internal review/comments!
> > > > > >
> > > > > > Performance (rxdrop, zero-copy)
> > > > > >
> > > > > > Baseline
> > > > > > Two cores:                   21.3 Mpps
> > > > > > One core:                    24.5 Mpps
> > > > >
> > > > > Two cores is slower? It used to be faster all the time, didn't it?
> > > > >
> > > > > > Patched
> > > > > > Two cores, bpf_redirect_map: 21.7 Mpps + 2%
> > > > > > One core, bpf_redirect_map:  24.9 Mpps + 2%
> > > > > >
> > > > > > Two cores, bpf_redirect_xsk: 24.0 Mpps +13%
> > > > >
> > > > > Nice, impressive improvement!
> > > >
> > > > I do appreciate you work and performance optimizations at this level,
> > > > because when we are using this few CPU cycles per packet, then it is
> > > > really hard to find new ways to reduce cycles further.
> > > >
> > > > Thank for you saying +13% instead of saying +2.7 Mpps.
> > > > It *is* impressive to basically reduce cycles with 13%.
> > > >
> > > >  21.3 Mpps = 46.94 nanosec per packet
> > > >  24.0 Mpps = 41.66 nanosec per packet
> > > >
> > > >  21.3 Mpps -> 24.0 Mpps = 5.28 nanosec saved
> > > >
> > > > On my 3.60GHz testlab machine that gives me 19 cycles.
> > > >
> > > >
> > > > > > One core, bpf_redirect_xsk:  25.5 Mpps + 4%
> > > > > >
> > > >
> > > >  24.5 Mpps -> 25.5 Mpps = 1.6 nanosec saved
> > > >
> > > > At this point with saving 1.6 ns this is around the cost of a function call 1.3 ns.
> > > >
> > > >
> > > > We still need these optimization in the kernel, but end-users in
> > > > userspace are very quickly going to waste the 19 cycles we found.
> > > > I still support/believe that the OS need to have a little overhead as
> > > > possible, but for me 42 nanosec overhead is close to zero overhead. For
> > > > comparison, I just ran a udp_sink[3] test, and it "cost" 625 ns for
> > > > delivery of UDP packets into socket (almost 15 times slower).
> > > >
> > > > I guess my point is that with XDP we have already achieved and exceeded
> > > > (my original) performance goals, making it even faster is just showing off ;-P
> > >
> > > Even though I'll let Bjorn elaborating on this, we're talking here about
> > > AF-XDP which is a bit different pair of shoes to me in terms of
> > > performance. AFAIK we still have a gap when compared to DPDK's numbers. So
> >
> > AFAIK the gap on this i40e hardware is DPDK=36 Mpps vs. AF_XDP=25.5
> > Mpps. That looks like a large gap, +10.5 Mpps, but it is only 11.4
> > nanosec per packet.
> >
> > > I'm really not sure why better performance bothers you? :)
> >
> > Finding those 11.4 nanosec is going to be really difficult and take a
> > huge effort. My fear is also that some of these optimizations will make
> > the code harder to maintain and understand.
> >
> > If you are ready to do this huge effort, then I'm actually ready to
> > provide ideas on what you can optimize.
> >
> > E.g. you can get 2-4 nanosec if we prefetch to L2 in the driver
> > (assuming data is already in L3 via DDIO).  (The CPU supports L2
> > prefetch, but the kernel has no users of that feature.)

Just to add to what Magnus wrote below: when comparing technologies,
XDP is measured against the kernel's netstack, and as we know, the
difference is really big and it's no secret who the winner is in terms
of performance in specific scenarios. So, I'm not saying I'm super OK
with that, but we can allow ourselves to have a cool feature that
boosts our control plane but degrades performance a bit, as you found
out with the egress XDP prog. We are still going to be fine.

For AF_XDP the case is the opposite, meaning, we are the chaser in this
scenario. Performance work is still the top priority for AF_XDP in my
eyes.

> 
> In my experience, I would caution against starting to use L2 prefetch
> at this point since it is somewhat finicky. It is possible to get
> better performance using it on one platform, but significantly harder
> to get it to work on a three-year-old platform as well, not to mention
> other platforms and architectures.
> 
> Instead, I would start with optimizations that always work such as not
> having to execute code (or removing code) that is not necessary for
> the functionality that is provided. One example of this is having to
> execute all the bpf_redirect code when we are using packet steering in
> HW to AF_XDP sockets and other potential consumers. In this scenario,
> the core bpf_redirect code (xdp_do_redirect + bpf_xdp_redirect_map +
> __xsk_map_redirect for the XSKMAP case) provides no value and amounts
> to 19% of the total execution time on my machine (application plus
> kernel, and the XDP program itself is another 7% on top of the 19%).
> By removing this, you could cut down a large chunk of those 11.4 ns we
> would need to find in our optimization work. Just to be clear, there
> are plenty of other scenarios where the bpf_redirect functionality
> provides great value, so it is really good to have. I just think it is
> weird that it has to be executed all the time even when it is not
> needed. The big question is just how to achieve this in an
> architecturally sound way. Any ideas?
> 
> Just note that there are plenty that could be done in other areas too.
> For example, the Rx part of the driver is around 22% of the execution
> time and I am sure we can optimize this part too and increase the
> readability and maintainability of the code at the same time. But not
> trim it down to 0% as some of this is actually necessary work in all
> cases where traffic is received :-). xp_alloc is another big time
> consumer of cycles, but for this one I do have a patch set that cuts
> it down from 21% to 8% that I plan on submitting in February.
> 
> perf top data:
> 21.58%  [ice]                      [k] ice_clean_rx_irq_zc
> 21.21%  [kernel]                   [k] xp_alloc
> 13.93%  [kernel]                   [k] __xsk_rcv_zc
>  8.53%  [kernel]                   [k] xdp_do_redirect
>  8.15%  [kernel]                   [k] xsk_rcv
>  6.66%  [kernel]                   [k] bpf_xdp_redirect_map
>  5.08%  bpf_prog_992d9ddc835e5629  [k] bpf_prog_992d9ddc835e5629
>  3.73%  [ice]                      [k] ice_alloc_rx_bufs_zc
>  3.36%  [kernel]                   [k] __xsk_map_redirect
>  2.62%  xdpsock                    [.] main
>  0.96%  [kernel]                   [k] rcu_read_unlock_strict
>  0.90%  bpf_dispatcher_xdp         [k] bpf_dispatcher_xdp
>  0.87%  [kernel]                   [k] bpf_dispatcher_xdp_func
>  0.51%  [kernel]                   [k] xsk_flush
>  0.38%  [kernel]                   [k] lapic_next_deadline
>  0.13%  [ice]                      [k] ice_napi_poll
> 
> > The base design of XDP redirect API, that have a number of function
> > calls per packet, in itself becomes a limiting factor.  Even the cost
> > of getting the per-CPU pointer to bpf_redirect_info becomes something
> > to look at.
> >
> >
> > > Let's rather be more harsh on changes that actually decrease the
> > > performance, not the other way around. And I suppose you were the one that
> > > always was bringing up the 'death by a 1000 paper cuts' of XDP. So yeah,
> > > I'm a bit confused with your statement.
> >
> > Yes, we really need to protect the existing performance.
> >
> > I think I'm mostly demotivated by the 'death by a 1000 paper cuts'.
> > People with good intentions are adding features, and performance slowly
> > disappears.  I've been spending weeks/months finding nanosec level
> > improvements, which I don't have time to "defend". Recently I realized
> > that the 2nd XDP prog on egress, even when no prog is attached, is
> > slowing down the performance of base redirect (I don't fully understand
> > why, maybe it is simply the I-cache increase, I didn't realize when
> > reviewing the code).
> >
> > --
> > Best regards,
> >   Jesper Dangaard Brouer
> >   MSc.CS, Principal Kernel Engineer at Red Hat
> >   LinkedIn: http://www.linkedin.com/in/brouer
> >

^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2021-01-22  9:56 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-19 15:50 [PATCH bpf-next v2 0/8] Introduce bpf_redirect_xsk() helper Björn Töpel
2021-01-19 15:50 ` [PATCH bpf-next v2 1/8] xdp: restructure redirect actions Björn Töpel
2021-01-20 12:44   ` Toke Høiland-Jørgensen
2021-01-20 13:40     ` Björn Töpel
2021-01-20 14:52       ` Toke Høiland-Jørgensen
2021-01-20 15:49         ` Björn Töpel
2021-01-20 16:30           ` Toke Høiland-Jørgensen
2021-01-20 17:26             ` Björn Töpel
2021-01-19 15:50 ` [PATCH bpf-next v2 2/8] xsk: remove explicit_free parameter from __xsk_rcv() Björn Töpel
2021-01-19 15:50 ` [PATCH bpf-next v2 3/8] xsk: fold xp_assign_dev and __xp_assign_dev Björn Töpel
2021-01-19 15:50 ` [PATCH bpf-next v2 4/8] xsk: register XDP sockets at bind(), and add new AF_XDP BPF helper Björn Töpel
2021-01-20  8:25   ` kernel test robot
2021-01-20  8:41     ` Björn Töpel
2021-01-20  8:50   ` kernel test robot
2021-01-20 12:50   ` Toke Høiland-Jørgensen
2021-01-20 13:25     ` Björn Töpel
2021-01-20 14:54       ` Toke Høiland-Jørgensen
2021-01-20 15:18         ` Björn Töpel
2021-01-20 17:29           ` Toke Høiland-Jørgensen
2021-01-20 18:22             ` Björn Töpel
2021-01-20 20:26               ` Toke Høiland-Jørgensen
2021-01-20 21:15                 ` Alexei Starovoitov
2021-01-21  8:18                   ` Björn Töpel
2021-01-19 15:50 ` [PATCH bpf-next v2 5/8] libbpf, xsk: select AF_XDP BPF program based on kernel version Björn Töpel
2021-01-20 12:52   ` Toke Høiland-Jørgensen
2021-01-20 13:25     ` Björn Töpel
2021-01-20 14:49       ` Björn Töpel
2021-01-20 15:11         ` Toke Høiland-Jørgensen
2021-01-20 15:27           ` Björn Töpel
2021-01-20 17:30             ` Toke Høiland-Jørgensen
2021-01-20 18:25             ` Alexei Starovoitov
2021-01-20 18:30               ` Björn Töpel
2021-01-20 14:56       ` Toke Høiland-Jørgensen
2021-01-19 15:50 ` [PATCH bpf-next v2 6/8] libbpf, xsk: select bpf_redirect_xsk(), if supported Björn Töpel
2021-01-19 15:50 ` [PATCH bpf-next v2 7/8] selftest/bpf: add XDP socket tests for bpf_redirect_{xsk, map}() Björn Töpel
2021-01-21  7:39   ` Andrii Nakryiko
2021-01-21 12:31     ` Björn Töpel
2021-01-19 15:50 ` [PATCH bpf-next v2 8/8] selftest/bpf: remove a lot of ifobject casting in xdpxceiver Björn Töpel
2021-01-20 13:15 ` [PATCH bpf-next v2 0/8] Introduce bpf_redirect_xsk() helper Maxim Mikityanskiy
2021-01-20 13:27   ` Björn Töpel
2021-01-20 15:57   ` Jesper Dangaard Brouer
2021-01-20 16:19     ` Maciej Fijalkowski
2021-01-21 17:01       ` Jesper Dangaard Brouer
2021-01-22  8:59         ` Magnus Karlsson
2021-01-22  9:45           ` Maciej Fijalkowski
