BPF Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH bpf-next 0/6] Various BPF helper improvements
@ 2020-09-24 18:21 Daniel Borkmann
  2020-09-24 18:21 ` [PATCH bpf-next 1/6] bpf: add classid helper only based on skb->sk Daniel Borkmann
                   ` (5 more replies)
  0 siblings, 6 replies; 28+ messages in thread
From: Daniel Borkmann @ 2020-09-24 18:21 UTC (permalink / raw)
  To: ast; +Cc: daniel, john.fastabend, netdev, bpf

This series adds two BPF helpers, that is, one for retrieving the classid
of an skb and another one to redirect via the neigh subsystem, and improves
also the cookie helpers by removing the atomic counter. I've also added
the bpf_tail_call_static() helper to the libbpf API that we've been using
in Cilium for a while now, and last but not least the series adds a few
selftests. For details, please check individual patches, thanks!

Daniel Borkmann (6):
  bpf: add classid helper only based on skb->sk
  bpf, net: rework cookie generator as per-cpu one
  bpf: add redirect_neigh helper as redirect drop-in
  bpf, libbpf: add bpf_tail_call_static helper for bpf programs
  bpf, selftests: use bpf_tail_call_static where appropriate
  bpf, selftests: add redirect_neigh selftest

 include/linux/cookie.h                        |  41 +++
 include/linux/skbuff.h                        |   5 +
 include/uapi/linux/bpf.h                      |  24 ++
 net/core/filter.c                             | 277 +++++++++++++++++-
 net/core/net_namespace.c                      |   5 +-
 net/core/sock_diag.c                          |   7 +-
 tools/include/uapi/linux/bpf.h                |  24 ++
 tools/lib/bpf/bpf_helpers.h                   |  32 ++
 tools/testing/selftests/bpf/progs/bpf_flow.c  |  12 +-
 tools/testing/selftests/bpf/progs/tailcall1.c |  28 +-
 tools/testing/selftests/bpf/progs/tailcall2.c |  14 +-
 tools/testing/selftests/bpf/progs/tailcall3.c |   4 +-
 .../selftests/bpf/progs/tailcall_bpf2bpf1.c   |   4 +-
 .../selftests/bpf/progs/tailcall_bpf2bpf2.c   |   6 +-
 .../selftests/bpf/progs/tailcall_bpf2bpf3.c   |   6 +-
 .../selftests/bpf/progs/tailcall_bpf2bpf4.c   |   6 +-
 .../selftests/bpf/progs/test_tc_neigh.c       | 144 +++++++++
 tools/testing/selftests/bpf/test_tc_neigh.sh  | 168 +++++++++++
 18 files changed, 749 insertions(+), 58 deletions(-)
 create mode 100644 include/linux/cookie.h
 create mode 100644 tools/testing/selftests/bpf/progs/test_tc_neigh.c
 create mode 100755 tools/testing/selftests/bpf/test_tc_neigh.sh

-- 
2.21.0


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH bpf-next 1/6] bpf: add classid helper only based on skb->sk
  2020-09-24 18:21 [PATCH bpf-next 0/6] Various BPF helper improvements Daniel Borkmann
@ 2020-09-24 18:21 ` Daniel Borkmann
  2020-09-25 14:46   ` Jakub Kicinski
  2020-09-24 18:21 ` [PATCH bpf-next 2/6] bpf, net: rework cookie generator as per-cpu one Daniel Borkmann
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 28+ messages in thread
From: Daniel Borkmann @ 2020-09-24 18:21 UTC (permalink / raw)
  To: ast; +Cc: daniel, john.fastabend, netdev, bpf

Similarly to 5a52ae4e32a6 ("bpf: Allow to retrieve cgroup v1 classid
from v2 hooks"), add a helper to retrieve cgroup v1 classid solely
based on the skb->sk, so it can be used as key as part of BPF map
lookups out of tc from host ns, in particular given the skb->sk is
retained these days when crossing net ns thanks to 9c4c325252c5
("skbuff: preserve sock reference when scrubbing the skb."). This
is similar to bpf_skb_cgroup_id() which implements the same for v2.
Kubernetes ecosystem is still operating on v1 however, hence net_cls
needs to be used there until this can be dropped in with the v2
helper of bpf_skb_cgroup_id().

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 include/uapi/linux/bpf.h       | 10 ++++++++++
 net/core/filter.c              | 21 +++++++++++++++++++++
 tools/include/uapi/linux/bpf.h | 10 ++++++++++
 3 files changed, 41 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index a22812561064..48ecf246d047 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -3586,6 +3586,15 @@ union bpf_attr {
  * 		the data in *dst*. This is a wrapper of **copy_from_user**\ ().
  * 	Return
  * 		0 on success, or a negative error in case of failure.
+ *
+ * u64 bpf_skb_cgroup_classid(struct sk_buff *skb)
+ * 	Description
+ * 		See **bpf_get_cgroup_classid**\ () for the main description.
+ * 		This helper differs from **bpf_get_cgroup_classid**\ () in that
+ * 		the cgroup v1 net_cls class is retrieved only from the *skb*'s
+ * 		associated socket instead of the current process.
+ * 	Return
+ * 		The id is returned or 0 in case the id could not be retrieved.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -3737,6 +3746,7 @@ union bpf_attr {
 	FN(inode_storage_delete),	\
 	FN(d_path),			\
 	FN(copy_from_user),		\
+	FN(skb_cgroup_classid),		\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/net/core/filter.c b/net/core/filter.c
index 6014e5f40c58..0f913755bcba 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2704,6 +2704,23 @@ static const struct bpf_func_proto bpf_get_cgroup_classid_curr_proto = {
 	.gpl_only	= false,
 	.ret_type	= RET_INTEGER,
 };
+
+BPF_CALL_1(bpf_skb_cgroup_classid, const struct sk_buff *, skb)
+{
+	struct sock *sk = skb_to_full_sk(skb);
+
+	if (!sk || !sk_fullsock(sk))
+		return 0;
+
+	return sock_cgroup_classid(&sk->sk_cgrp_data);
+}
+
+static const struct bpf_func_proto bpf_skb_cgroup_classid_proto = {
+	.func		= bpf_skb_cgroup_classid,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+};
 #endif
 
 BPF_CALL_1(bpf_get_cgroup_classid, const struct sk_buff *, skb)
@@ -6770,6 +6787,10 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	case BPF_FUNC_skb_get_xfrm_state:
 		return &bpf_skb_get_xfrm_state_proto;
 #endif
+#ifdef CONFIG_CGROUP_NET_CLASSID
+	case BPF_FUNC_skb_cgroup_classid:
+		return &bpf_skb_cgroup_classid_proto;
+#endif
 #ifdef CONFIG_SOCK_CGROUP_DATA
 	case BPF_FUNC_skb_cgroup_id:
 		return &bpf_skb_cgroup_id_proto;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index a22812561064..48ecf246d047 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -3586,6 +3586,15 @@ union bpf_attr {
  * 		the data in *dst*. This is a wrapper of **copy_from_user**\ ().
  * 	Return
  * 		0 on success, or a negative error in case of failure.
+ *
+ * u64 bpf_skb_cgroup_classid(struct sk_buff *skb)
+ * 	Description
+ * 		See **bpf_get_cgroup_classid**\ () for the main description.
+ * 		This helper differs from **bpf_get_cgroup_classid**\ () in that
+ * 		the cgroup v1 net_cls class is retrieved only from the *skb*'s
+ * 		associated socket instead of the current process.
+ * 	Return
+ * 		The id is returned or 0 in case the id could not be retrieved.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -3737,6 +3746,7 @@ union bpf_attr {
 	FN(inode_storage_delete),	\
 	FN(d_path),			\
 	FN(copy_from_user),		\
+	FN(skb_cgroup_classid),		\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
-- 
2.21.0


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH bpf-next 2/6] bpf, net: rework cookie generator as per-cpu one
  2020-09-24 18:21 [PATCH bpf-next 0/6] Various BPF helper improvements Daniel Borkmann
  2020-09-24 18:21 ` [PATCH bpf-next 1/6] bpf: add classid helper only based on skb->sk Daniel Borkmann
@ 2020-09-24 18:21 ` Daniel Borkmann
  2020-09-24 18:58   ` Eric Dumazet
  2020-09-24 18:21 ` [PATCH bpf-next 3/6] bpf: add redirect_neigh helper as redirect drop-in Daniel Borkmann
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 28+ messages in thread
From: Daniel Borkmann @ 2020-09-24 18:21 UTC (permalink / raw)
  To: ast; +Cc: daniel, john.fastabend, netdev, bpf

With its use in BPF the cookie generator can be called very frequently
in particular when used out of cgroup v2 hooks (e.g. connect / sendmsg)
and attached to the root cgroup, for example, when used in v1/v2 mixed
environments. In particular when there's a high churn on sockets in the
system there can be many parallel requests to the bpf_get_socket_cookie()
and bpf_get_netns_cookie() helpers which then cause contention on the
shared atomic counter. As similarly done in f991bd2e1421 ("fs: introduce
a per-cpu last_ino allocator"), add a small helper library that both can
then use for the 64 bit counters.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 include/linux/cookie.h   | 41 ++++++++++++++++++++++++++++++++++++++++
 net/core/net_namespace.c |  5 +++--
 net/core/sock_diag.c     |  7 ++++---
 3 files changed, 48 insertions(+), 5 deletions(-)
 create mode 100644 include/linux/cookie.h

diff --git a/include/linux/cookie.h b/include/linux/cookie.h
new file mode 100644
index 000000000000..2488203dc004
--- /dev/null
+++ b/include/linux/cookie.h
@@ -0,0 +1,41 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __LINUX_COOKIE_H
+#define __LINUX_COOKIE_H
+
+#include <linux/atomic.h>
+#include <linux/percpu.h>
+
+struct gen_cookie {
+	u64 __percpu	*local_last;
+	atomic64_t	 shared_last ____cacheline_aligned_in_smp;
+};
+
+#define COOKIE_LOCAL_BATCH	4096
+
+#define DEFINE_COOKIE(name)					\
+	static DEFINE_PER_CPU(u64, __##name);			\
+	static struct gen_cookie name = {			\
+		.local_last	= &__##name,			\
+		.shared_last	= ATOMIC64_INIT(0),		\
+	}
+
+static inline u64 gen_cookie_next(struct gen_cookie *gc)
+{
+	u64 *local_last = &get_cpu_var(*gc->local_last);
+	u64 val = *local_last;
+
+	if (__is_defined(CONFIG_SMP) &&
+	    unlikely((val & (COOKIE_LOCAL_BATCH - 1)) == 0)) {
+		s64 next = atomic64_add_return(COOKIE_LOCAL_BATCH,
+					       &gc->shared_last);
+		val = next - COOKIE_LOCAL_BATCH;
+	}
+	val++;
+	if (unlikely(!val))
+		val++;
+	*local_last = val;
+	put_cpu_var(local_last);
+	return val;
+}
+
+#endif /* __LINUX_COOKIE_H */
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index dcd61aca343e..cf29ed8950d1 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -19,6 +19,7 @@
 #include <linux/net_namespace.h>
 #include <linux/sched/task.h>
 #include <linux/uidgid.h>
+#include <linux/cookie.h>
 
 #include <net/sock.h>
 #include <net/netlink.h>
@@ -69,7 +70,7 @@ EXPORT_SYMBOL_GPL(pernet_ops_rwsem);
 
 static unsigned int max_gen_ptrs = INITIAL_NET_GEN_PTRS;
 
-static atomic64_t cookie_gen;
+DEFINE_COOKIE(net_cookie);
 
 u64 net_gen_cookie(struct net *net)
 {
@@ -78,7 +79,7 @@ u64 net_gen_cookie(struct net *net)
 
 		if (res)
 			return res;
-		res = atomic64_inc_return(&cookie_gen);
+		res = gen_cookie_next(&net_cookie);
 		atomic64_cmpxchg(&net->net_cookie, 0, res);
 	}
 }
diff --git a/net/core/sock_diag.c b/net/core/sock_diag.c
index c13ffbd33d8d..4a4180e8dd35 100644
--- a/net/core/sock_diag.c
+++ b/net/core/sock_diag.c
@@ -11,7 +11,7 @@
 #include <linux/tcp.h>
 #include <linux/workqueue.h>
 #include <linux/nospec.h>
-
+#include <linux/cookie.h>
 #include <linux/inet_diag.h>
 #include <linux/sock_diag.h>
 
@@ -19,7 +19,8 @@ static const struct sock_diag_handler *sock_diag_handlers[AF_MAX];
 static int (*inet_rcv_compat)(struct sk_buff *skb, struct nlmsghdr *nlh);
 static DEFINE_MUTEX(sock_diag_table_mutex);
 static struct workqueue_struct *broadcast_wq;
-static atomic64_t cookie_gen;
+
+DEFINE_COOKIE(sock_cookie);
 
 u64 sock_gen_cookie(struct sock *sk)
 {
@@ -28,7 +29,7 @@ u64 sock_gen_cookie(struct sock *sk)
 
 		if (res)
 			return res;
-		res = atomic64_inc_return(&cookie_gen);
+		res = gen_cookie_next(&sock_cookie);
 		atomic64_cmpxchg(&sk->sk_cookie, 0, res);
 	}
 }
-- 
2.21.0


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH bpf-next 3/6] bpf: add redirect_neigh helper as redirect drop-in
  2020-09-24 18:21 [PATCH bpf-next 0/6] Various BPF helper improvements Daniel Borkmann
  2020-09-24 18:21 ` [PATCH bpf-next 1/6] bpf: add classid helper only based on skb->sk Daniel Borkmann
  2020-09-24 18:21 ` [PATCH bpf-next 2/6] bpf, net: rework cookie generator as per-cpu one Daniel Borkmann
@ 2020-09-24 18:21 ` Daniel Borkmann
  2020-09-24 22:12   ` David Ahern
  2020-09-24 18:21 ` [PATCH bpf-next 4/6] bpf, libbpf: add bpf_tail_call_static helper for bpf programs Daniel Borkmann
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 28+ messages in thread
From: Daniel Borkmann @ 2020-09-24 18:21 UTC (permalink / raw)
  To: ast; +Cc: daniel, john.fastabend, netdev, bpf

Add a redirect_neigh() helper as redirect() drop-in replacement
for the xmit side. Main idea for the helper is to be very similar
in semantics to the latter just that the skb gets injected into
the neighboring subsystem in order to let the stack do the work
it knows best anyway to populate the L2 addresses of the packet
and then hand over to dev_queue_xmit() as redirect() does.

This solves two bigger items: i) skbs don't need to go up to the
stack on the host facing veth ingress side for traffic egressing
the container to achieve the same for populating L2 which also
has the huge advantage that ii) the skb->sk won't get orphaned in
ip_rcv_core() when entering the IP routing layer on the host stack.

Given that skb->sk neither gets orphaned when crossing the netns
as per 9c4c325252c5 ("skbuff: preserve sock reference when scrubbing
the skb.") the helper can then push the skbs directly to the phys
device where FQ scheduler can do its work and TCP stack gets proper
back-pressure given we hold on to skb->sk as long as skb is still
residing in queues.

With the helper used in BPF data path to then push the skb to the
phys device, I observed a stable/consistent TCP_STREAM improvement
on veth devices for traffic going container -> host -> host ->
container from ~10Gbps to ~15Gbps for a single stream in my test
environment.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 include/linux/skbuff.h         |   5 +
 include/uapi/linux/bpf.h       |  14 ++
 net/core/filter.c              | 256 +++++++++++++++++++++++++++++++--
 tools/include/uapi/linux/bpf.h |  14 ++
 4 files changed, 276 insertions(+), 13 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 46881d902124..ec3254acca77 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2539,6 +2539,11 @@ static inline int skb_mac_header_was_set(const struct sk_buff *skb)
 	return skb->mac_header != (typeof(skb->mac_header))~0U;
 }
 
+static inline void skb_unset_mac_header(struct sk_buff *skb)
+{
+	skb->mac_header = (typeof(skb->mac_header))~0U;
+}
+
 static inline void skb_reset_mac_header(struct sk_buff *skb)
 {
 	skb->mac_header = skb->data - skb->head;
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 48ecf246d047..95337b650d7f 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -3595,6 +3595,19 @@ union bpf_attr {
  * 		associated socket instead of the current process.
  * 	Return
  * 		The id is returned or 0 in case the id could not be retrieved.
+ *
+ * long bpf_redirect_neigh(u32 ifindex, u64 flags)
+ * 	Description
+ * 		Redirect the packet to another net device of index *ifindex*
+ * 		and fill in L2 addresses from neighboring subsystem. This helper
+ * 		is somewhat similar to **bpf_redirect**\ (), except that it
+ * 		fills in e.g. MAC addresses based on the L3 information from
+ * 		the packet. This helper is supported for IPv4 and IPv6 protocols.
+ * 		The *flags* argument is reserved and must be 0. The helper is
+ * 		currently only supported for tc BPF program types.
+ * 	Return
+ * 		The helper returns **TC_ACT_REDIRECT** on success or
+ * 		**TC_ACT_SHOT** on error.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -3747,6 +3760,7 @@ union bpf_attr {
 	FN(d_path),			\
 	FN(copy_from_user),		\
 	FN(skb_cgroup_classid),		\
+	FN(redirect_neigh),		\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/net/core/filter.c b/net/core/filter.c
index 0f913755bcba..19caa2fc21e8 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2160,6 +2160,205 @@ static int __bpf_redirect(struct sk_buff *skb, struct net_device *dev,
 		return __bpf_redirect_no_mac(skb, dev, flags);
 }
 
+#if IS_ENABLED(CONFIG_IPV6)
+static int bpf_out_neigh_v6(struct net *net, struct sk_buff *skb)
+{
+	struct dst_entry *dst = skb_dst(skb);
+	struct net_device *dev = dst->dev;
+	const struct in6_addr *nexthop;
+	struct neighbour *neigh;
+
+	if (dev_xmit_recursion())
+		goto out_rec;
+	skb->dev = dev;
+	rcu_read_lock_bh();
+	nexthop = rt6_nexthop((struct rt6_info *)dst, &ipv6_hdr(skb)->daddr);
+	neigh = __ipv6_neigh_lookup_noref_stub(dev, nexthop);
+	if (unlikely(!neigh))
+		neigh = __neigh_create(ipv6_stub->nd_tbl, nexthop, dev, false);
+	if (likely(!IS_ERR(neigh))) {
+		int ret;
+
+		sock_confirm_neigh(skb, neigh);
+		dev_xmit_recursion_inc();
+		ret = neigh_output(neigh, skb, false);
+		dev_xmit_recursion_dec();
+		rcu_read_unlock_bh();
+		return ret;
+	}
+	rcu_read_unlock_bh();
+	IP6_INC_STATS(dev_net(dst->dev),
+		      ip6_dst_idev(dst), IPSTATS_MIB_OUTNOROUTES);
+out_drop:
+	kfree_skb(skb);
+	return -EINVAL;
+out_rec:
+	net_crit_ratelimited("bpf: recursion limit reached on datapath, buggy bpf program?\n");
+	goto out_drop;
+}
+
+static int __bpf_redirect_neigh_v6(struct sk_buff *skb, struct net_device *dev)
+{
+	const struct ipv6hdr *ip6h = ipv6_hdr(skb);
+	struct net *net = dev_net(dev);
+	int err, ret = NET_XMIT_DROP;
+	struct flowi6 fl6 = {
+		.flowi6_flags	= FLOWI_FLAG_ANYSRC,
+		.flowi6_mark	= skb->mark,
+		.flowlabel	= ip6_flowinfo(ip6h),
+		.flowi6_proto	= ip6h->nexthdr,
+		.flowi6_oif	= dev->ifindex,
+		.daddr		= ip6h->daddr,
+		.saddr		= ip6h->saddr,
+	};
+	struct dst_entry *dst;
+
+	skb->dev = dev;
+	skb->tstamp = 0;
+
+	dst = ipv6_stub->ipv6_dst_lookup_flow(net, NULL, &fl6, NULL);
+	if (IS_ERR(dst))
+		goto out_drop;
+
+	skb_dst_set(skb, dst);
+
+	err = bpf_out_neigh_v6(net, skb);
+	if (unlikely(net_xmit_eval(err)))
+		dev->stats.tx_errors++;
+	else
+		ret = NET_XMIT_SUCCESS;
+	goto out_xmit;
+out_drop:
+	dev->stats.tx_errors++;
+	kfree_skb(skb);
+out_xmit:
+	return ret;
+}
+#else
+static int __bpf_redirect_neigh_v6(struct sk_buff *skb, struct net_device *dev)
+{
+	kfree_skb(skb);
+	return NET_XMIT_DROP;
+}
+#endif /* CONFIG_IPV6 */
+
+#if IS_ENABLED(CONFIG_INET)
+static int bpf_out_neigh_v4(struct net *net, struct sk_buff *skb)
+{
+	struct dst_entry *dst = skb_dst(skb);
+	struct rtable *rt = (struct rtable *)dst;
+	struct net_device *dev = dst->dev;
+	u32 hh_len = LL_RESERVED_SPACE(dev);
+	struct neighbour *neigh;
+	bool is_v6gw = false;
+
+	if (dev_xmit_recursion())
+		goto out_rec;
+	if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
+		struct sk_buff *skb2;
+
+		skb2 = skb_realloc_headroom(skb, hh_len);
+		if (!skb2) {
+			kfree_skb(skb);
+			return -ENOMEM;
+		}
+		if (skb->sk)
+			skb_set_owner_w(skb2, skb->sk);
+		consume_skb(skb);
+		skb = skb2;
+	}
+	rcu_read_lock_bh();
+	neigh = ip_neigh_for_gw(rt, skb, &is_v6gw);
+	if (likely(!IS_ERR(neigh))) {
+		int ret;
+
+		sock_confirm_neigh(skb, neigh);
+		dev_xmit_recursion_inc();
+		ret = neigh_output(neigh, skb, is_v6gw);
+		dev_xmit_recursion_dec();
+		rcu_read_unlock_bh();
+		return ret;
+	}
+	rcu_read_unlock_bh();
+out_drop:
+	kfree_skb(skb);
+	return -EINVAL;
+out_rec:
+	net_crit_ratelimited("bpf: recursion limit reached on datapath, buggy bpf program?\n");
+	goto out_drop;
+}
+
+static int __bpf_redirect_neigh_v4(struct sk_buff *skb, struct net_device *dev)
+{
+	const struct iphdr *ip4h = ip_hdr(skb);
+	struct net *net = dev_net(dev);
+	int err, ret = NET_XMIT_DROP;
+	struct flowi4 fl4 = {
+		.flowi4_flags	= FLOWI_FLAG_ANYSRC,
+		.flowi4_mark	= skb->mark,
+		.flowi4_tos	= RT_TOS(ip4h->tos),
+		.flowi4_oif	= dev->ifindex,
+		.daddr		= ip4h->daddr,
+		.saddr		= ip4h->saddr,
+	};
+	struct rtable *rt;
+
+	skb->dev = dev;
+	skb->tstamp = 0;
+
+	rt = ip_route_output_flow(net, &fl4, NULL);
+	if (IS_ERR(rt))
+		goto out_drop;
+	if (rt->rt_type != RTN_UNICAST && rt->rt_type != RTN_LOCAL) {
+		ip_rt_put(rt);
+		goto out_drop;
+	}
+
+	skb_dst_set(skb, &rt->dst);
+
+	err = bpf_out_neigh_v4(net, skb);
+	if (unlikely(net_xmit_eval(err)))
+		dev->stats.tx_errors++;
+	else
+		ret = NET_XMIT_SUCCESS;
+	goto out_xmit;
+out_drop:
+	dev->stats.tx_errors++;
+	kfree_skb(skb);
+out_xmit:
+	return ret;
+}
+#else
+static int __bpf_redirect_neigh_v4(struct sk_buff *skb, struct net_device *dev)
+{
+	kfree_skb(skb);
+	return NET_XMIT_DROP;
+}
+#endif /* CONFIG_INET */
+
+static int __bpf_redirect_neigh(struct sk_buff *skb, struct net_device *dev)
+{
+	struct ethhdr *ethh = eth_hdr(skb);
+
+	if (unlikely(skb->mac_header >= skb->network_header))
+		goto out;
+	bpf_push_mac_rcsum(skb);
+	if (is_multicast_ether_addr(ethh->h_dest))
+		goto out;
+
+	skb_pull(skb, sizeof(*ethh));
+	skb_unset_mac_header(skb);
+	skb_reset_network_header(skb);
+
+	if (skb->protocol == htons(ETH_P_IP))
+		return __bpf_redirect_neigh_v4(skb, dev);
+	else if (skb->protocol == htons(ETH_P_IPV6))
+		return __bpf_redirect_neigh_v6(skb, dev);
+out:
+	kfree_skb(skb);
+	return -ENOTSUPP;
+}
+
 BPF_CALL_3(bpf_clone_redirect, struct sk_buff *, skb, u32, ifindex, u64, flags)
 {
 	struct net_device *dev;
@@ -2203,23 +2402,16 @@ static const struct bpf_func_proto bpf_clone_redirect_proto = {
 DEFINE_PER_CPU(struct bpf_redirect_info, bpf_redirect_info);
 EXPORT_PER_CPU_SYMBOL_GPL(bpf_redirect_info);
 
-BPF_CALL_2(bpf_redirect, u32, ifindex, u64, flags)
-{
-	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
-
-	if (unlikely(flags & ~(BPF_F_INGRESS)))
-		return TC_ACT_SHOT;
-
-	ri->flags = flags;
-	ri->tgt_index = ifindex;
-
-	return TC_ACT_REDIRECT;
-}
+/* Internal, non-exposed redirect flags. */
+enum {
+	BPF_F_NEIGH = (1ULL << 1),
+};
 
 int skb_do_redirect(struct sk_buff *skb)
 {
 	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
 	struct net_device *dev;
+	u32 flags = ri->flags;
 
 	dev = dev_get_by_index_rcu(dev_net(skb->dev), ri->tgt_index);
 	ri->tgt_index = 0;
@@ -2228,7 +2420,22 @@ int skb_do_redirect(struct sk_buff *skb)
 		return -EINVAL;
 	}
 
-	return __bpf_redirect(skb, dev, ri->flags);
+	return flags & BPF_F_NEIGH ?
+	       __bpf_redirect_neigh(skb, dev) :
+	       __bpf_redirect(skb, dev, flags);
+}
+
+BPF_CALL_2(bpf_redirect, u32, ifindex, u64, flags)
+{
+	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+
+	if (unlikely(flags & ~(BPF_F_INGRESS)))
+		return TC_ACT_SHOT;
+
+	ri->flags = flags;
+	ri->tgt_index = ifindex;
+
+	return TC_ACT_REDIRECT;
 }
 
 static const struct bpf_func_proto bpf_redirect_proto = {
@@ -2239,6 +2446,27 @@ static const struct bpf_func_proto bpf_redirect_proto = {
 	.arg2_type      = ARG_ANYTHING,
 };
 
+BPF_CALL_2(bpf_redirect_neigh, u32, ifindex, u64, flags)
+{
+	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+
+	if (unlikely(flags))
+		return TC_ACT_SHOT;
+
+	ri->flags = BPF_F_NEIGH;
+	ri->tgt_index = ifindex;
+
+	return TC_ACT_REDIRECT;
+}
+
+static const struct bpf_func_proto bpf_redirect_neigh_proto = {
+	.func		= bpf_redirect_neigh,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_ANYTHING,
+	.arg2_type	= ARG_ANYTHING,
+};
+
 BPF_CALL_2(bpf_msg_apply_bytes, struct sk_msg *, msg, u32, bytes)
 {
 	msg->apply_bytes = bytes;
@@ -6757,6 +6985,8 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return bpf_get_skb_set_tunnel_proto(func_id);
 	case BPF_FUNC_redirect:
 		return &bpf_redirect_proto;
+	case BPF_FUNC_redirect_neigh:
+		return &bpf_redirect_neigh_proto;
 	case BPF_FUNC_get_route_realm:
 		return &bpf_get_route_realm_proto;
 	case BPF_FUNC_get_hash_recalc:
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 48ecf246d047..95337b650d7f 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -3595,6 +3595,19 @@ union bpf_attr {
  * 		associated socket instead of the current process.
  * 	Return
  * 		The id is returned or 0 in case the id could not be retrieved.
+ *
+ * long bpf_redirect_neigh(u32 ifindex, u64 flags)
+ * 	Description
+ * 		Redirect the packet to another net device of index *ifindex*
+ * 		and fill in L2 addresses from neighboring subsystem. This helper
+ * 		is somewhat similar to **bpf_redirect**\ (), except that it
+ * 		fills in e.g. MAC addresses based on the L3 information from
+ * 		the packet. This helper is supported for IPv4 and IPv6 protocols.
+ * 		The *flags* argument is reserved and must be 0. The helper is
+ * 		currently only supported for tc BPF program types.
+ * 	Return
+ * 		The helper returns **TC_ACT_REDIRECT** on success or
+ * 		**TC_ACT_SHOT** on error.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -3747,6 +3760,7 @@ union bpf_attr {
 	FN(d_path),			\
 	FN(copy_from_user),		\
 	FN(skb_cgroup_classid),		\
+	FN(redirect_neigh),		\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
-- 
2.21.0


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH bpf-next 4/6] bpf, libbpf: add bpf_tail_call_static helper for bpf programs
  2020-09-24 18:21 [PATCH bpf-next 0/6] Various BPF helper improvements Daniel Borkmann
                   ` (2 preceding siblings ...)
  2020-09-24 18:21 ` [PATCH bpf-next 3/6] bpf: add redirect_neigh helper as redirect drop-in Daniel Borkmann
@ 2020-09-24 18:21 ` Daniel Borkmann
  2020-09-24 20:53   ` Andrii Nakryiko
  2020-09-25  6:13   ` Yonghong Song
  2020-09-24 18:21 ` [PATCH bpf-next 5/6] bpf, selftests: use bpf_tail_call_static where appropriate Daniel Borkmann
  2020-09-24 18:21 ` [PATCH bpf-next 6/6] bpf, selftests: add redirect_neigh selftest Daniel Borkmann
  5 siblings, 2 replies; 28+ messages in thread
From: Daniel Borkmann @ 2020-09-24 18:21 UTC (permalink / raw)
  To: ast; +Cc: daniel, john.fastabend, netdev, bpf

Port of tail_call_static() helper function from Cilium's BPF code base [0]
to libbpf, so others can easily consume it as well. We've been using this
in production code for some time now. The main idea is that we guarantee
that the kernel's BPF infrastructure and JIT (here: x86_64) can patch the
JITed BPF insns with direct jumps instead of having to fall back to using
expensive retpolines. By using inline asm, we guarantee that the compiler
won't merge the call from different paths with potentially different
content of r2/r3.

We're also using __throw_build_bug() macro in different places as a neat
trick to trigger compilation errors when compiler does not remove code at
compilation time. This works for the BPF backend as it does not implement
the __builtin_trap().

  [0] https://github.com/cilium/cilium/commit/f5537c26020d5297b70936c6b7d03a1e412a1035

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 tools/lib/bpf/bpf_helpers.h | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/tools/lib/bpf/bpf_helpers.h b/tools/lib/bpf/bpf_helpers.h
index 1106777df00b..18b75a4c82e6 100644
--- a/tools/lib/bpf/bpf_helpers.h
+++ b/tools/lib/bpf/bpf_helpers.h
@@ -53,6 +53,38 @@
 	})
 #endif
 
+/*
+ * Misc useful helper macros
+ */
+#ifndef __throw_build_bug
+# define __throw_build_bug()	__builtin_trap()
+#endif
+
+static __always_inline void
+bpf_tail_call_static(void *ctx, const void *map, const __u32 slot)
+{
+	if (!__builtin_constant_p(slot))
+		__throw_build_bug();
+
+	/*
+	 * Don't gamble, but _guarantee_ that LLVM won't optimize setting
+	 * r2 and r3 from different paths ending up at the same call insn as
+	 * otherwise we won't be able to use the jmpq/nopl retpoline-free
+	 * patching by the x86-64 JIT in the kernel.
+	 *
+	 * Note on clobber list: we need to stay in-line with BPF calling
+	 * convention, so even if we don't end up using r0, r4, r5, we need
+	 * to mark them as clobber so that LLVM doesn't end up using them
+	 * before / after the call.
+	 */
+	asm volatile("r1 = %[ctx]\n\t"
+		     "r2 = %[map]\n\t"
+		     "r3 = %[slot]\n\t"
+		     "call 12\n\t"
+		     :: [ctx]"r"(ctx), [map]"r"(map), [slot]"i"(slot)
+		     : "r0", "r1", "r2", "r3", "r4", "r5");
+}
+
 /*
  * Helper structure used by eBPF C program
  * to describe BPF map attributes to libbpf loader
-- 
2.21.0


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH bpf-next 5/6] bpf, selftests: use bpf_tail_call_static where appropriate
  2020-09-24 18:21 [PATCH bpf-next 0/6] Various BPF helper improvements Daniel Borkmann
                   ` (3 preceding siblings ...)
  2020-09-24 18:21 ` [PATCH bpf-next 4/6] bpf, libbpf: add bpf_tail_call_static helper for bpf programs Daniel Borkmann
@ 2020-09-24 18:21 ` Daniel Borkmann
  2020-09-24 19:25   ` Maciej Fijalkowski
  2020-09-24 18:21 ` [PATCH bpf-next 6/6] bpf, selftests: add redirect_neigh selftest Daniel Borkmann
  5 siblings, 1 reply; 28+ messages in thread
From: Daniel Borkmann @ 2020-09-24 18:21 UTC (permalink / raw)
  To: ast; +Cc: daniel, john.fastabend, netdev, bpf

For those locations where we use an immediate tail call map index use the
newly added bpf_tail_call_static() helper.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 tools/testing/selftests/bpf/progs/bpf_flow.c  | 12 ++++----
 tools/testing/selftests/bpf/progs/tailcall1.c | 28 +++++++++----------
 tools/testing/selftests/bpf/progs/tailcall2.c | 14 +++++-----
 tools/testing/selftests/bpf/progs/tailcall3.c |  4 +--
 .../selftests/bpf/progs/tailcall_bpf2bpf1.c   |  4 +--
 .../selftests/bpf/progs/tailcall_bpf2bpf2.c   |  6 ++--
 .../selftests/bpf/progs/tailcall_bpf2bpf3.c   |  6 ++--
 .../selftests/bpf/progs/tailcall_bpf2bpf4.c   |  6 ++--
 8 files changed, 40 insertions(+), 40 deletions(-)

diff --git a/tools/testing/selftests/bpf/progs/bpf_flow.c b/tools/testing/selftests/bpf/progs/bpf_flow.c
index de6de9221518..5a65f6b51377 100644
--- a/tools/testing/selftests/bpf/progs/bpf_flow.c
+++ b/tools/testing/selftests/bpf/progs/bpf_flow.c
@@ -118,18 +118,18 @@ static __always_inline int parse_eth_proto(struct __sk_buff *skb, __be16 proto)
 
 	switch (proto) {
 	case bpf_htons(ETH_P_IP):
-		bpf_tail_call(skb, &jmp_table, IP);
+		bpf_tail_call_static(skb, &jmp_table, IP);
 		break;
 	case bpf_htons(ETH_P_IPV6):
-		bpf_tail_call(skb, &jmp_table, IPV6);
+		bpf_tail_call_static(skb, &jmp_table, IPV6);
 		break;
 	case bpf_htons(ETH_P_MPLS_MC):
 	case bpf_htons(ETH_P_MPLS_UC):
-		bpf_tail_call(skb, &jmp_table, MPLS);
+		bpf_tail_call_static(skb, &jmp_table, MPLS);
 		break;
 	case bpf_htons(ETH_P_8021Q):
 	case bpf_htons(ETH_P_8021AD):
-		bpf_tail_call(skb, &jmp_table, VLAN);
+		bpf_tail_call_static(skb, &jmp_table, VLAN);
 		break;
 	default:
 		/* Protocol not supported */
@@ -246,10 +246,10 @@ static __always_inline int parse_ipv6_proto(struct __sk_buff *skb, __u8 nexthdr)
 	switch (nexthdr) {
 	case IPPROTO_HOPOPTS:
 	case IPPROTO_DSTOPTS:
-		bpf_tail_call(skb, &jmp_table, IPV6OP);
+		bpf_tail_call_static(skb, &jmp_table, IPV6OP);
 		break;
 	case IPPROTO_FRAGMENT:
-		bpf_tail_call(skb, &jmp_table, IPV6FR);
+		bpf_tail_call_static(skb, &jmp_table, IPV6FR);
 		break;
 	default:
 		return parse_ip_proto(skb, nexthdr);
diff --git a/tools/testing/selftests/bpf/progs/tailcall1.c b/tools/testing/selftests/bpf/progs/tailcall1.c
index 1f407e65ae52..7115bcefbe8a 100644
--- a/tools/testing/selftests/bpf/progs/tailcall1.c
+++ b/tools/testing/selftests/bpf/progs/tailcall1.c
@@ -26,20 +26,20 @@ int entry(struct __sk_buff *skb)
 	/* Multiple locations to make sure we patch
 	 * all of them.
 	 */
-	bpf_tail_call(skb, &jmp_table, 0);
-	bpf_tail_call(skb, &jmp_table, 0);
-	bpf_tail_call(skb, &jmp_table, 0);
-	bpf_tail_call(skb, &jmp_table, 0);
-
-	bpf_tail_call(skb, &jmp_table, 1);
-	bpf_tail_call(skb, &jmp_table, 1);
-	bpf_tail_call(skb, &jmp_table, 1);
-	bpf_tail_call(skb, &jmp_table, 1);
-
-	bpf_tail_call(skb, &jmp_table, 2);
-	bpf_tail_call(skb, &jmp_table, 2);
-	bpf_tail_call(skb, &jmp_table, 2);
-	bpf_tail_call(skb, &jmp_table, 2);
+	bpf_tail_call_static(skb, &jmp_table, 0);
+	bpf_tail_call_static(skb, &jmp_table, 0);
+	bpf_tail_call_static(skb, &jmp_table, 0);
+	bpf_tail_call_static(skb, &jmp_table, 0);
+
+	bpf_tail_call_static(skb, &jmp_table, 1);
+	bpf_tail_call_static(skb, &jmp_table, 1);
+	bpf_tail_call_static(skb, &jmp_table, 1);
+	bpf_tail_call_static(skb, &jmp_table, 1);
+
+	bpf_tail_call_static(skb, &jmp_table, 2);
+	bpf_tail_call_static(skb, &jmp_table, 2);
+	bpf_tail_call_static(skb, &jmp_table, 2);
+	bpf_tail_call_static(skb, &jmp_table, 2);
 
 	return 3;
 }
diff --git a/tools/testing/selftests/bpf/progs/tailcall2.c b/tools/testing/selftests/bpf/progs/tailcall2.c
index a093e739cf0e..0431e4fe7efd 100644
--- a/tools/testing/selftests/bpf/progs/tailcall2.c
+++ b/tools/testing/selftests/bpf/progs/tailcall2.c
@@ -13,14 +13,14 @@ struct {
 SEC("classifier/0")
 int bpf_func_0(struct __sk_buff *skb)
 {
-	bpf_tail_call(skb, &jmp_table, 1);
+	bpf_tail_call_static(skb, &jmp_table, 1);
 	return 0;
 }
 
 SEC("classifier/1")
 int bpf_func_1(struct __sk_buff *skb)
 {
-	bpf_tail_call(skb, &jmp_table, 2);
+	bpf_tail_call_static(skb, &jmp_table, 2);
 	return 1;
 }
 
@@ -33,25 +33,25 @@ int bpf_func_2(struct __sk_buff *skb)
 SEC("classifier/3")
 int bpf_func_3(struct __sk_buff *skb)
 {
-	bpf_tail_call(skb, &jmp_table, 4);
+	bpf_tail_call_static(skb, &jmp_table, 4);
 	return 3;
 }
 
 SEC("classifier/4")
 int bpf_func_4(struct __sk_buff *skb)
 {
-	bpf_tail_call(skb, &jmp_table, 3);
+	bpf_tail_call_static(skb, &jmp_table, 3);
 	return 4;
 }
 
 SEC("classifier")
 int entry(struct __sk_buff *skb)
 {
-	bpf_tail_call(skb, &jmp_table, 0);
+	bpf_tail_call_static(skb, &jmp_table, 0);
 	/* Check multi-prog update. */
-	bpf_tail_call(skb, &jmp_table, 2);
+	bpf_tail_call_static(skb, &jmp_table, 2);
 	/* Check tail call limit. */
-	bpf_tail_call(skb, &jmp_table, 3);
+	bpf_tail_call_static(skb, &jmp_table, 3);
 	return 3;
 }
 
diff --git a/tools/testing/selftests/bpf/progs/tailcall3.c b/tools/testing/selftests/bpf/progs/tailcall3.c
index cabda877cf0a..739dc2a51e74 100644
--- a/tools/testing/selftests/bpf/progs/tailcall3.c
+++ b/tools/testing/selftests/bpf/progs/tailcall3.c
@@ -16,14 +16,14 @@ SEC("classifier/0")
 int bpf_func_0(struct __sk_buff *skb)
 {
 	count++;
-	bpf_tail_call(skb, &jmp_table, 0);
+	bpf_tail_call_static(skb, &jmp_table, 0);
 	return 1;
 }
 
 SEC("classifier")
 int entry(struct __sk_buff *skb)
 {
-	bpf_tail_call(skb, &jmp_table, 0);
+	bpf_tail_call_static(skb, &jmp_table, 0);
 	return 0;
 }
 
diff --git a/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf1.c b/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf1.c
index b5d9c8e778ae..0103f3dd9f02 100644
--- a/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf1.c
+++ b/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf1.c
@@ -21,7 +21,7 @@ TAIL_FUNC(1)
 static __noinline
 int subprog_tail(struct __sk_buff *skb)
 {
-	bpf_tail_call(skb, &jmp_table, 0);
+	bpf_tail_call_static(skb, &jmp_table, 0);
 
 	return skb->len * 2;
 }
@@ -29,7 +29,7 @@ int subprog_tail(struct __sk_buff *skb)
 SEC("classifier")
 int entry(struct __sk_buff *skb)
 {
-	bpf_tail_call(skb, &jmp_table, 1);
+	bpf_tail_call_static(skb, &jmp_table, 1);
 
 	return subprog_tail(skb);
 }
diff --git a/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf2.c b/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf2.c
index a004ab28ce28..7b1c04183824 100644
--- a/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf2.c
+++ b/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf2.c
@@ -14,9 +14,9 @@ static __noinline
 int subprog_tail(struct __sk_buff *skb)
 {
 	if (load_byte(skb, 0))
-		bpf_tail_call(skb, &jmp_table, 1);
+		bpf_tail_call_static(skb, &jmp_table, 1);
 	else
-		bpf_tail_call(skb, &jmp_table, 0);
+		bpf_tail_call_static(skb, &jmp_table, 0);
 	return 1;
 }
 
@@ -32,7 +32,7 @@ int bpf_func_0(struct __sk_buff *skb)
 SEC("classifier")
 int entry(struct __sk_buff *skb)
 {
-	bpf_tail_call(skb, &jmp_table, 0);
+	bpf_tail_call_static(skb, &jmp_table, 0);
 
 	return 0;
 }
diff --git a/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf3.c b/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf3.c
index 96dbef2b6b7c..0d5482bea6c9 100644
--- a/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf3.c
+++ b/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf3.c
@@ -16,9 +16,9 @@ int subprog_tail2(struct __sk_buff *skb)
 	volatile char arr[64] = {};
 
 	if (load_word(skb, 0) || load_half(skb, 0))
-		bpf_tail_call(skb, &jmp_table, 10);
+		bpf_tail_call_static(skb, &jmp_table, 10);
 	else
-		bpf_tail_call(skb, &jmp_table, 1);
+		bpf_tail_call_static(skb, &jmp_table, 1);
 
 	return skb->len;
 }
@@ -28,7 +28,7 @@ int subprog_tail(struct __sk_buff *skb)
 {
 	volatile char arr[64] = {};
 
-	bpf_tail_call(skb, &jmp_table, 0);
+	bpf_tail_call_static(skb, &jmp_table, 0);
 
 	return skb->len * 2;
 }
diff --git a/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf4.c b/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf4.c
index 98b40a95bc67..9a1b166b7fbe 100644
--- a/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf4.c
+++ b/tools/testing/selftests/bpf/progs/tailcall_bpf2bpf4.c
@@ -14,21 +14,21 @@ static volatile int count;
 __noinline
 int subprog_tail_2(struct __sk_buff *skb)
 {
-	bpf_tail_call(skb, &jmp_table, 2);
+	bpf_tail_call_static(skb, &jmp_table, 2);
 	return skb->len * 3;
 }
 
 __noinline
 int subprog_tail_1(struct __sk_buff *skb)
 {
-	bpf_tail_call(skb, &jmp_table, 1);
+	bpf_tail_call_static(skb, &jmp_table, 1);
 	return skb->len * 2;
 }
 
 __noinline
 int subprog_tail(struct __sk_buff *skb)
 {
-	bpf_tail_call(skb, &jmp_table, 0);
+	bpf_tail_call_static(skb, &jmp_table, 0);
 	return skb->len;
 }
 
-- 
2.21.0


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH bpf-next 6/6] bpf, selftests: add redirect_neigh selftest
  2020-09-24 18:21 [PATCH bpf-next 0/6] Various BPF helper improvements Daniel Borkmann
                   ` (4 preceding siblings ...)
  2020-09-24 18:21 ` [PATCH bpf-next 5/6] bpf, selftests: use bpf_tail_call_static where appropriate Daniel Borkmann
@ 2020-09-24 18:21 ` Daniel Borkmann
  5 siblings, 0 replies; 28+ messages in thread
From: Daniel Borkmann @ 2020-09-24 18:21 UTC (permalink / raw)
  To: ast; +Cc: daniel, john.fastabend, netdev, bpf

Add a small test that excercises the new redirect_neigh() helper for the
IPv4 and IPv6 case.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 .../selftests/bpf/progs/test_tc_neigh.c       | 144 +++++++++++++++
 tools/testing/selftests/bpf/test_tc_neigh.sh  | 168 ++++++++++++++++++
 2 files changed, 312 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/progs/test_tc_neigh.c
 create mode 100755 tools/testing/selftests/bpf/test_tc_neigh.sh

diff --git a/tools/testing/selftests/bpf/progs/test_tc_neigh.c b/tools/testing/selftests/bpf/progs/test_tc_neigh.c
new file mode 100644
index 000000000000..889a72c3024f
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_tc_neigh.c
@@ -0,0 +1,144 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <stdint.h>
+#include <stdbool.h>
+
+#include <linux/bpf.h>
+#include <linux/stddef.h>
+#include <linux/pkt_cls.h>
+#include <linux/if_ether.h>
+#include <linux/in.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_endian.h>
+
+#ifndef barrier_data
+# define barrier_data(ptr)	asm volatile("": :"r"(ptr) :"memory")
+#endif
+
+#ifndef ctx_ptr
+# define ctx_ptr(field)		(void *)(long)(field)
+#endif
+
+#define dst_to_src_tmp		0xeeddddeeU
+#define src_to_dst_tmp		0xeeffffeeU
+
+#define ip4_src			0xac100164 /* 172.16.1.100 */
+#define ip4_dst			0xac100264 /* 172.16.2.100 */
+
+#define ip6_src			{ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, \
+				  0x00, 0x01, 0xde, 0xad, 0xbe, 0xef, 0xca, 0xfe }
+#define ip6_dst			{ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, \
+				  0x00, 0x02, 0xde, 0xad, 0xbe, 0xef, 0xca, 0xfe }
+
+#ifndef v6_equal
+# define v6_equal(a, b)		(a.s6_addr32[0] == b.s6_addr32[0] && \
+				 a.s6_addr32[1] == b.s6_addr32[1] && \
+				 a.s6_addr32[2] == b.s6_addr32[2] && \
+				 a.s6_addr32[3] == b.s6_addr32[3])
+#endif
+
+static __always_inline bool is_remote_ep_v4(struct __sk_buff *skb,
+					    __be32 addr)
+{
+	void *data_end = ctx_ptr(skb->data_end);
+	void *data = ctx_ptr(skb->data);
+	struct iphdr *ip4h;
+
+	if (data + sizeof(struct ethhdr) > data_end)
+		return false;
+
+	ip4h = (struct iphdr *)(data + sizeof(struct ethhdr));
+	if ((void *)(ip4h + 1) > data_end)
+		return false;
+
+	return ip4h->daddr == addr;
+}
+
+static __always_inline bool is_remote_ep_v6(struct __sk_buff *skb,
+					    struct in6_addr addr)
+{
+	void *data_end = ctx_ptr(skb->data_end);
+	void *data = ctx_ptr(skb->data);
+	struct ipv6hdr *ip6h;
+
+	if (data + sizeof(struct ethhdr) > data_end)
+		return false;
+
+	ip6h = (struct ipv6hdr *)(data + sizeof(struct ethhdr));
+	if ((void *)(ip6h + 1) > data_end)
+		return false;
+
+	return v6_equal(ip6h->daddr, addr);
+}
+
+SEC("chk_neigh") int tc_chk(struct __sk_buff *skb)
+{
+	void *data_end = ctx_ptr(skb->data_end);
+	void *data = ctx_ptr(skb->data);
+	__u32 *raw = data;
+
+	if (data + sizeof(struct ethhdr) > data_end)
+		return TC_ACT_SHOT;
+
+	return !raw[0] && !raw[1] && !raw[2] ? TC_ACT_SHOT : TC_ACT_OK;
+}
+
+SEC("dst_ingress") int tc_dst(struct __sk_buff *skb)
+{
+	int idx = dst_to_src_tmp;
+	__u8 zero[ETH_ALEN * 2];
+	bool redirect = false;
+
+	switch (skb->protocol) {
+	case __bpf_constant_htons(ETH_P_IP):
+		redirect = is_remote_ep_v4(skb, __bpf_constant_htonl(ip4_src));
+		break;
+	case __bpf_constant_htons(ETH_P_IPV6):
+		redirect = is_remote_ep_v6(skb, (struct in6_addr)ip6_src);
+		break;
+	}
+
+	if (!redirect)
+		return TC_ACT_OK;
+
+	barrier_data(&idx);
+	idx = bpf_ntohl(idx);
+
+	__builtin_memset(&zero, 0, sizeof(zero));
+	if (bpf_skb_store_bytes(skb, 0, &zero, sizeof(zero), 0) < 0)
+		return TC_ACT_SHOT;
+
+	return bpf_redirect_neigh(idx, 0);
+}
+
+SEC("src_ingress") int tc_src(struct __sk_buff *skb)
+{
+	int idx = src_to_dst_tmp;
+	__u8 zero[ETH_ALEN * 2];
+	bool redirect = false;
+
+	switch (skb->protocol) {
+	case __bpf_constant_htons(ETH_P_IP):
+		redirect = is_remote_ep_v4(skb, __bpf_constant_htonl(ip4_dst));
+		break;
+	case __bpf_constant_htons(ETH_P_IPV6):
+		redirect = is_remote_ep_v6(skb, (struct in6_addr)ip6_dst);
+		break;
+	}
+
+	if (!redirect)
+		return TC_ACT_OK;
+
+	barrier_data(&idx);
+	idx = bpf_ntohl(idx);
+
+	__builtin_memset(&zero, 0, sizeof(zero));
+	if (bpf_skb_store_bytes(skb, 0, &zero, sizeof(zero), 0) < 0)
+		return TC_ACT_SHOT;
+
+	return bpf_redirect_neigh(idx, 0);
+}
+
+char __license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/test_tc_neigh.sh b/tools/testing/selftests/bpf/test_tc_neigh.sh
new file mode 100755
index 000000000000..31d8c3df8b24
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_tc_neigh.sh
@@ -0,0 +1,168 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# This test sets up 3 netns (src <-> fwd <-> dst). There is no direct veth link
+# between src and dst. The netns fwd has veth links to each src and dst. The
+# client is in src and server in dst. The test installs a TC BPF program to each
+# host facing veth in fwd which calls into bpf_redirect_peer() to perform the
+# neigh addr population and redirect; it also installs a dropper prog on the
+# egress side to drop skbs if neigh addrs were not populated.
+
+if [[ $EUID -ne 0 ]]; then
+	echo "This script must be run as root"
+	echo "FAIL"
+	exit 1
+fi
+
+# check that nc, dd, ping, ping6 and timeout are present
+command -v nc >/dev/null 2>&1 || \
+	{ echo >&2 "nc is not available"; exit 1; }
+command -v dd >/dev/null 2>&1 || \
+	{ echo >&2 "dd is not available"; exit 1; }
+command -v timeout >/dev/null 2>&1 || \
+	{ echo >&2 "timeout is not available"; exit 1; }
+command -v ping >/dev/null 2>&1 || \
+	{ echo >&2 "ping is not available"; exit 1; }
+command -v ping6 >/dev/null 2>&1 || \
+	{ echo >&2 "ping6 is not available"; exit 1; }
+
+readonly GREEN='\033[0;92m'
+readonly RED='\033[0;31m'
+readonly NC='\033[0m' # No Color
+
+readonly PING_ARG="-c 3 -w 10 -q"
+
+readonly TIMEOUT=10
+
+readonly NS_SRC="ns-src-$(mktemp -u XXXXXX)"
+readonly NS_FWD="ns-fwd-$(mktemp -u XXXXXX)"
+readonly NS_DST="ns-dst-$(mktemp -u XXXXXX)"
+
+readonly IP4_SRC="172.16.1.100"
+readonly IP4_DST="172.16.2.100"
+
+readonly IP6_SRC="::1:dead:beef:cafe"
+readonly IP6_DST="::2:dead:beef:cafe"
+
+readonly IP4_SLL="169.254.0.1"
+readonly IP4_DLL="169.254.0.2"
+readonly IP4_NET="169.254.0.0"
+
+cleanup()
+{
+	ip netns del ${NS_SRC}
+	ip netns del ${NS_FWD}
+	ip netns del ${NS_DST}
+}
+
+trap cleanup EXIT
+
+set -e
+
+ip netns add "${NS_SRC}"
+ip netns add "${NS_FWD}"
+ip netns add "${NS_DST}"
+
+ip link add veth_src type veth peer name veth_src_fwd
+ip link add veth_dst type veth peer name veth_dst_fwd
+
+ip link set veth_src netns ${NS_SRC}
+ip link set veth_src_fwd netns ${NS_FWD}
+
+ip link set veth_dst netns ${NS_DST}
+ip link set veth_dst_fwd netns ${NS_FWD}
+
+ip -netns ${NS_SRC} addr add ${IP4_SRC}/32 dev veth_src
+ip -netns ${NS_DST} addr add ${IP4_DST}/32 dev veth_dst
+
+# The fwd netns automatically get a v6 LL address / routes, but also needs v4
+# one in order to start ARP probing. IP4_NET route is added to the endpoints
+# so that the ARP processing will reply.
+
+ip -netns ${NS_FWD} addr add ${IP4_SLL}/32 dev veth_src_fwd
+ip -netns ${NS_FWD} addr add ${IP4_DLL}/32 dev veth_dst_fwd
+
+ip -netns ${NS_SRC} addr add ${IP6_SRC}/128 dev veth_src nodad
+ip -netns ${NS_DST} addr add ${IP6_DST}/128 dev veth_dst nodad
+
+ip -netns ${NS_SRC} link set dev veth_src up
+ip -netns ${NS_FWD} link set dev veth_src_fwd up
+
+ip -netns ${NS_DST} link set dev veth_dst up
+ip -netns ${NS_FWD} link set dev veth_dst_fwd up
+
+ip -netns ${NS_SRC} route add ${IP4_DST}/32 dev veth_src scope global
+ip -netns ${NS_SRC} route add ${IP4_NET}/16 dev veth_src scope global
+ip -netns ${NS_FWD} route add ${IP4_SRC}/32 dev veth_src_fwd scope global
+
+ip -netns ${NS_SRC} route add ${IP6_DST}/128 dev veth_src scope global
+ip -netns ${NS_FWD} route add ${IP6_SRC}/128 dev veth_src_fwd scope global
+
+ip -netns ${NS_DST} route add ${IP4_SRC}/32 dev veth_dst scope global
+ip -netns ${NS_DST} route add ${IP4_NET}/16 dev veth_dst scope global
+ip -netns ${NS_FWD} route add ${IP4_DST}/32 dev veth_dst_fwd scope global
+
+ip -netns ${NS_DST} route add ${IP6_SRC}/128 dev veth_dst scope global
+ip -netns ${NS_FWD} route add ${IP6_DST}/128 dev veth_dst_fwd scope global
+
+fmac_src=$(ip netns exec ${NS_FWD} cat /sys/class/net/veth_src_fwd/address)
+fmac_dst=$(ip netns exec ${NS_FWD} cat /sys/class/net/veth_dst_fwd/address)
+
+ip -netns ${NS_SRC} neigh add ${IP4_DST} dev veth_src lladdr $fmac_src
+ip -netns ${NS_DST} neigh add ${IP4_SRC} dev veth_dst lladdr $fmac_dst
+
+ip -netns ${NS_SRC} neigh add ${IP6_DST} dev veth_src lladdr $fmac_src
+ip -netns ${NS_DST} neigh add ${IP6_SRC} dev veth_dst lladdr $fmac_dst
+
+veth_dst=$(ip netns exec ${NS_FWD} cat /sys/class/net/veth_dst_fwd/ifindex | awk '{printf "%08x\n", $1}')
+veth_src=$(ip netns exec ${NS_FWD} cat /sys/class/net/veth_src_fwd/ifindex | awk '{printf "%08x\n", $1}')
+
+xxd -p < test_tc_neigh.o   | sed "s/eeddddee/$veth_src/g" | xxd -r -p > test_tc_neigh.x.o
+xxd -p < test_tc_neigh.x.o | sed "s/eeffffee/$veth_dst/g" | xxd -r -p > test_tc_neigh.y.o
+
+ip netns exec ${NS_FWD} tc qdisc add dev veth_src_fwd clsact
+ip netns exec ${NS_FWD} tc filter add dev veth_src_fwd ingress bpf da obj test_tc_neigh.y.o sec src_ingress
+ip netns exec ${NS_FWD} tc filter add dev veth_src_fwd egress  bpf da obj test_tc_neigh.y.o sec chk_neigh
+
+ip netns exec ${NS_FWD} tc qdisc add dev veth_dst_fwd clsact
+ip netns exec ${NS_FWD} tc filter add dev veth_dst_fwd ingress bpf da obj test_tc_neigh.y.o sec dst_ingress
+ip netns exec ${NS_FWD} tc filter add dev veth_dst_fwd egress  bpf da obj test_tc_neigh.y.o sec chk_neigh
+
+rm -f test_tc_neigh.x.o test_tc_neigh.y.o
+
+ip netns exec ${NS_DST} bash -c "nc -4 -l -p 9004 &"
+ip netns exec ${NS_DST} bash -c "nc -6 -l -p 9006 &"
+
+set +e
+
+TEST="TCPv4 connectivity test"
+ip netns exec ${NS_SRC} bash -c "timeout ${TIMEOUT} dd if=/dev/zero bs=1000 count=100 > /dev/tcp/${IP4_DST}/9004"
+if [ $? -ne 0 ]; then
+	echo -e "${TEST}: ${RED}FAIL${NC}"
+	exit 1
+fi
+echo -e "${TEST}: ${GREEN}PASS${NC}"
+
+TEST="TCPv6 connectivity test"
+ip netns exec ${NS_SRC} bash -c "timeout ${TIMEOUT} dd if=/dev/zero bs=1000 count=100 > /dev/tcp/${IP6_DST}/9006"
+if [ $? -ne 0 ]; then
+	echo -e "${TEST}: ${RED}FAIL${NC}"
+	exit 1
+fi
+echo -e "${TEST}: ${GREEN}PASS${NC}"
+
+TEST="ICMPv4 connectivity test"
+ip netns exec ${NS_SRC} ping  $PING_ARG ${IP4_DST}
+if [ $? -ne 0 ]; then
+	echo -e "${TEST}: ${RED}FAIL${NC}"
+	exit 1
+fi
+echo -e "${TEST}: ${GREEN}PASS${NC}"
+
+TEST="ICMPv6 connectivity test"
+ip netns exec ${NS_SRC} ping6 $PING_ARG ${IP6_DST}
+if [ $? -ne 0 ]; then
+	echo -e "${TEST}: ${RED}FAIL${NC}"
+	exit 1
+fi
+echo -e "${TEST}: ${GREEN}PASS${NC}"
-- 
2.21.0


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf-next 2/6] bpf, net: rework cookie generator as per-cpu one
  2020-09-24 18:21 ` [PATCH bpf-next 2/6] bpf, net: rework cookie generator as per-cpu one Daniel Borkmann
@ 2020-09-24 18:58   ` Eric Dumazet
  2020-09-24 22:03     ` Daniel Borkmann
  0 siblings, 1 reply; 28+ messages in thread
From: Eric Dumazet @ 2020-09-24 18:58 UTC (permalink / raw)
  To: Daniel Borkmann, ast; +Cc: john.fastabend, netdev, bpf



On 9/24/20 8:21 PM, Daniel Borkmann wrote:
> With its use in BPF the cookie generator can be called very frequently
> in particular when used out of cgroup v2 hooks (e.g. connect / sendmsg)
> and attached to the root cgroup, for example, when used in v1/v2 mixed
> environments. In particular when there's a high churn on sockets in the
> system there can be many parallel requests to the bpf_get_socket_cookie()
> and bpf_get_netns_cookie() helpers which then cause contention on the
> shared atomic counter. As similarly done in f991bd2e1421 ("fs: introduce
> a per-cpu last_ino allocator"), add a small helper library that both can
> then use for the 64 bit counters.
> 
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> ---
>  include/linux/cookie.h   | 41 ++++++++++++++++++++++++++++++++++++++++
>  net/core/net_namespace.c |  5 +++--
>  net/core/sock_diag.c     |  7 ++++---
>  3 files changed, 48 insertions(+), 5 deletions(-)
>  create mode 100644 include/linux/cookie.h
> 
> diff --git a/include/linux/cookie.h b/include/linux/cookie.h
> new file mode 100644
> index 000000000000..2488203dc004
> --- /dev/null
> +++ b/include/linux/cookie.h
> @@ -0,0 +1,41 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef __LINUX_COOKIE_H
> +#define __LINUX_COOKIE_H
> +
> +#include <linux/atomic.h>
> +#include <linux/percpu.h>
> +
> +struct gen_cookie {
> +	u64 __percpu	*local_last;
> +	atomic64_t	 shared_last ____cacheline_aligned_in_smp;
> +};
> +
> +#define COOKIE_LOCAL_BATCH	4096
> +
> +#define DEFINE_COOKIE(name)					\
> +	static DEFINE_PER_CPU(u64, __##name);			\
> +	static struct gen_cookie name = {			\
> +		.local_last	= &__##name,			\
> +		.shared_last	= ATOMIC64_INIT(0),		\
> +	}
> +
> +static inline u64 gen_cookie_next(struct gen_cookie *gc)
> +{
> +	u64 *local_last = &get_cpu_var(*gc->local_last);
> +	u64 val = *local_last;
> +
> +	if (__is_defined(CONFIG_SMP) &&
> +	    unlikely((val & (COOKIE_LOCAL_BATCH - 1)) == 0)) {
> +		s64 next = atomic64_add_return(COOKIE_LOCAL_BATCH,
> +					       &gc->shared_last);
> +		val = next - COOKIE_LOCAL_BATCH;
> +	}
> +	val++;
> +	if (unlikely(!val))
> +		val++;
> +	*local_last = val;
> +	put_cpu_var(local_last);
> +	return val;


This is not interrupt safe.

I think sock_gen_cookie() can be called from interrupt context.

get_next_ino() is only called from process context, that is what I used get_cpu_var()
and put_cpu_var()


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf-next 5/6] bpf, selftests: use bpf_tail_call_static where appropriate
  2020-09-24 18:21 ` [PATCH bpf-next 5/6] bpf, selftests: use bpf_tail_call_static where appropriate Daniel Borkmann
@ 2020-09-24 19:25   ` Maciej Fijalkowski
  2020-09-24 22:03     ` Daniel Borkmann
  0 siblings, 1 reply; 28+ messages in thread
From: Maciej Fijalkowski @ 2020-09-24 19:25 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: ast, john.fastabend, netdev, bpf

On Thu, Sep 24, 2020 at 08:21:26PM +0200, Daniel Borkmann wrote:
> For those locations where we use an immediate tail call map index use the
> newly added bpf_tail_call_static() helper.
> 
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> ---
>  tools/testing/selftests/bpf/progs/bpf_flow.c  | 12 ++++----
>  tools/testing/selftests/bpf/progs/tailcall1.c | 28 +++++++++----------
>  tools/testing/selftests/bpf/progs/tailcall2.c | 14 +++++-----
>  tools/testing/selftests/bpf/progs/tailcall3.c |  4 +--
>  .../selftests/bpf/progs/tailcall_bpf2bpf1.c   |  4 +--
>  .../selftests/bpf/progs/tailcall_bpf2bpf2.c   |  6 ++--
>  .../selftests/bpf/progs/tailcall_bpf2bpf3.c   |  6 ++--
>  .../selftests/bpf/progs/tailcall_bpf2bpf4.c   |  6 ++--
>  8 files changed, 40 insertions(+), 40 deletions(-)

One nit, while you're at it, maybe it would be good to also address the
samples/bpf/sockex3_kern.c that is also using the immediate map index?

[...]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf-next 4/6] bpf, libbpf: add bpf_tail_call_static helper for bpf programs
  2020-09-24 18:21 ` [PATCH bpf-next 4/6] bpf, libbpf: add bpf_tail_call_static helper for bpf programs Daniel Borkmann
@ 2020-09-24 20:53   ` Andrii Nakryiko
  2020-09-24 22:17     ` Daniel Borkmann
  2020-09-25  6:13   ` Yonghong Song
  1 sibling, 1 reply; 28+ messages in thread
From: Andrii Nakryiko @ 2020-09-24 20:53 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: Alexei Starovoitov, john fastabend, Networking, bpf

On Thu, Sep 24, 2020 at 11:22 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> Port of tail_call_static() helper function from Cilium's BPF code base [0]
> to libbpf, so others can easily consume it as well. We've been using this
> in production code for some time now. The main idea is that we guarantee
> that the kernel's BPF infrastructure and JIT (here: x86_64) can patch the
> JITed BPF insns with direct jumps instead of having to fall back to using
> expensive retpolines. By using inline asm, we guarantee that the compiler
> won't merge the call from different paths with potentially different
> content of r2/r3.
>
> We're also using __throw_build_bug() macro in different places as a neat
> trick to trigger compilation errors when compiler does not remove code at
> compilation time. This works for the BPF backend as it does not implement
> the __builtin_trap().
>
>   [0] https://github.com/cilium/cilium/commit/f5537c26020d5297b70936c6b7d03a1e412a1035
>
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> ---
>  tools/lib/bpf/bpf_helpers.h | 32 ++++++++++++++++++++++++++++++++
>  1 file changed, 32 insertions(+)
>
> diff --git a/tools/lib/bpf/bpf_helpers.h b/tools/lib/bpf/bpf_helpers.h
> index 1106777df00b..18b75a4c82e6 100644
> --- a/tools/lib/bpf/bpf_helpers.h
> +++ b/tools/lib/bpf/bpf_helpers.h
> @@ -53,6 +53,38 @@
>         })
>  #endif
>
> +/*
> + * Misc useful helper macros
> + */
> +#ifndef __throw_build_bug
> +# define __throw_build_bug()   __builtin_trap()
> +#endif

this will become part of libbpf stable API, do we want/need to expose
it? If we want to expose it, then we should probably provide a better
description.

But also curious, how is it better than _Static_assert() (see
test_cls_redirect.c), which also allows to provide a better error
message?

> +
> +static __always_inline void
> +bpf_tail_call_static(void *ctx, const void *map, const __u32 slot)
> +{
> +       if (!__builtin_constant_p(slot))
> +               __throw_build_bug();
> +
> +       /*
> +        * Don't gamble, but _guarantee_ that LLVM won't optimize setting
> +        * r2 and r3 from different paths ending up at the same call insn as
> +        * otherwise we won't be able to use the jmpq/nopl retpoline-free
> +        * patching by the x86-64 JIT in the kernel.
> +        *

So the clobbering comment below is completely clear. But this one is
less clear without some sort of example situation in which bad things
happen. Do you mind providing some pseudo-C example in which the
compiler will optimize things in such a way that the tail call
patching won't happen?

> +        * Note on clobber list: we need to stay in-line with BPF calling
> +        * convention, so even if we don't end up using r0, r4, r5, we need
> +        * to mark them as clobber so that LLVM doesn't end up using them
> +        * before / after the call.
> +        */
> +       asm volatile("r1 = %[ctx]\n\t"
> +                    "r2 = %[map]\n\t"
> +                    "r3 = %[slot]\n\t"
> +                    "call 12\n\t"
> +                    :: [ctx]"r"(ctx), [map]"r"(map), [slot]"i"(slot)
> +                    : "r0", "r1", "r2", "r3", "r4", "r5");
> +}
> +
>  /*
>   * Helper structure used by eBPF C program
>   * to describe BPF map attributes to libbpf loader
> --
> 2.21.0
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf-next 2/6] bpf, net: rework cookie generator as per-cpu one
  2020-09-24 18:58   ` Eric Dumazet
@ 2020-09-24 22:03     ` Daniel Borkmann
  2020-09-25  7:49       ` Eric Dumazet
  2020-09-25 15:00       ` Jakub Kicinski
  0 siblings, 2 replies; 28+ messages in thread
From: Daniel Borkmann @ 2020-09-24 22:03 UTC (permalink / raw)
  To: Eric Dumazet, ast; +Cc: john.fastabend, netdev, bpf

On 9/24/20 8:58 PM, Eric Dumazet wrote:
> On 9/24/20 8:21 PM, Daniel Borkmann wrote:
[...]
>> diff --git a/include/linux/cookie.h b/include/linux/cookie.h
>> new file mode 100644
>> index 000000000000..2488203dc004
>> --- /dev/null
>> +++ b/include/linux/cookie.h
>> @@ -0,0 +1,41 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#ifndef __LINUX_COOKIE_H
>> +#define __LINUX_COOKIE_H
>> +
>> +#include <linux/atomic.h>
>> +#include <linux/percpu.h>
>> +
>> +struct gen_cookie {
>> +	u64 __percpu	*local_last;
>> +	atomic64_t	 shared_last ____cacheline_aligned_in_smp;
>> +};
>> +
>> +#define COOKIE_LOCAL_BATCH	4096
>> +
>> +#define DEFINE_COOKIE(name)					\
>> +	static DEFINE_PER_CPU(u64, __##name);			\
>> +	static struct gen_cookie name = {			\
>> +		.local_last	= &__##name,			\
>> +		.shared_last	= ATOMIC64_INIT(0),		\
>> +	}
>> +
>> +static inline u64 gen_cookie_next(struct gen_cookie *gc)
>> +{
>> +	u64 *local_last = &get_cpu_var(*gc->local_last);
>> +	u64 val = *local_last;
>> +
>> +	if (__is_defined(CONFIG_SMP) &&
>> +	    unlikely((val & (COOKIE_LOCAL_BATCH - 1)) == 0)) {
>> +		s64 next = atomic64_add_return(COOKIE_LOCAL_BATCH,
>> +					       &gc->shared_last);
>> +		val = next - COOKIE_LOCAL_BATCH;
>> +	}
>> +	val++;
>> +	if (unlikely(!val))
>> +		val++;
>> +	*local_last = val;
>> +	put_cpu_var(local_last);
>> +	return val;
> 
> This is not interrupt safe.
> 
> I think sock_gen_cookie() can be called from interrupt context.
> 
> get_next_ino() is only called from process context, that is what I used get_cpu_var()
> and put_cpu_var()

Hmm, agree, good point. Need to experiment a bit more .. initial thinking
potentially something like the below could do where we fall back to atomic
counter iff we encounter nesting (which should be an extremely rare case
normally).

BPF progs where this can be called from are non-preemptible, so we could
actually move the temp preempt_disable/enable() from get/put_cpu_var() into
a wrapper func for slow path non-BPF users as well.

static inline u64 gen_cookie_next(struct gen_cookie *gc)
{
         u64 val;

         if (likely(this_cpu_inc_return(*gc->level_nesting) == 1)) {
                 u64 *local_last = this_cpu_ptr(gc->local_last);

                 val = *local_last;
                 if (__is_defined(CONFIG_SMP) &&
                     unlikely((val & (COOKIE_LOCAL_BATCH - 1)) == 0)) {
                         s64 next = atomic64_add_return(COOKIE_LOCAL_BATCH,
                                                        &gc->shared_last);
                         val = next - COOKIE_LOCAL_BATCH;
                 }
                 val++;
                 if (unlikely(!val))
                         val++;
                 *local_last = val;
         } else {
                 val = atomic64_add_return(COOKIE_LOCAL_BATCH,
                                           &gc->shared_last);
         }
         this_cpu_dec(*gc->level_nesting);
         return val;
}

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf-next 5/6] bpf, selftests: use bpf_tail_call_static where appropriate
  2020-09-24 19:25   ` Maciej Fijalkowski
@ 2020-09-24 22:03     ` Daniel Borkmann
  0 siblings, 0 replies; 28+ messages in thread
From: Daniel Borkmann @ 2020-09-24 22:03 UTC (permalink / raw)
  To: Maciej Fijalkowski; +Cc: ast, john.fastabend, netdev, bpf

On 9/24/20 9:25 PM, Maciej Fijalkowski wrote:
> On Thu, Sep 24, 2020 at 08:21:26PM +0200, Daniel Borkmann wrote:
>> For those locations where we use an immediate tail call map index use the
>> newly added bpf_tail_call_static() helper.
>>
>> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
>> ---
>>   tools/testing/selftests/bpf/progs/bpf_flow.c  | 12 ++++----
>>   tools/testing/selftests/bpf/progs/tailcall1.c | 28 +++++++++----------
>>   tools/testing/selftests/bpf/progs/tailcall2.c | 14 +++++-----
>>   tools/testing/selftests/bpf/progs/tailcall3.c |  4 +--
>>   .../selftests/bpf/progs/tailcall_bpf2bpf1.c   |  4 +--
>>   .../selftests/bpf/progs/tailcall_bpf2bpf2.c   |  6 ++--
>>   .../selftests/bpf/progs/tailcall_bpf2bpf3.c   |  6 ++--
>>   .../selftests/bpf/progs/tailcall_bpf2bpf4.c   |  6 ++--
>>   8 files changed, 40 insertions(+), 40 deletions(-)
> 
> One nit, while you're at it, maybe it would be good to also address the
> samples/bpf/sockex3_kern.c that is also using the immediate map index?

Sure, np, will add.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf-next 3/6] bpf: add redirect_neigh helper as redirect drop-in
  2020-09-24 18:21 ` [PATCH bpf-next 3/6] bpf: add redirect_neigh helper as redirect drop-in Daniel Borkmann
@ 2020-09-24 22:12   ` David Ahern
  2020-09-24 22:19     ` Daniel Borkmann
  0 siblings, 1 reply; 28+ messages in thread
From: David Ahern @ 2020-09-24 22:12 UTC (permalink / raw)
  To: Daniel Borkmann, ast; +Cc: john.fastabend, netdev, bpf

On 9/24/20 12:21 PM, Daniel Borkmann wrote:
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 0f913755bcba..19caa2fc21e8 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -2160,6 +2160,205 @@ static int __bpf_redirect(struct sk_buff *skb, struct net_device *dev,
>  		return __bpf_redirect_no_mac(skb, dev, flags);
>  }
>  
> +#if IS_ENABLED(CONFIG_IPV6)
> +static int bpf_out_neigh_v6(struct net *net, struct sk_buff *skb)
> +{
> +	struct dst_entry *dst = skb_dst(skb);
> +	struct net_device *dev = dst->dev;
> +	const struct in6_addr *nexthop;
> +	struct neighbour *neigh;
> +
> +	if (dev_xmit_recursion())
> +		goto out_rec;
> +	skb->dev = dev;
> +	rcu_read_lock_bh();
> +	nexthop = rt6_nexthop((struct rt6_info *)dst, &ipv6_hdr(skb)->daddr);
> +	neigh = __ipv6_neigh_lookup_noref_stub(dev, nexthop);
> +	if (unlikely(!neigh))
> +		neigh = __neigh_create(ipv6_stub->nd_tbl, nexthop, dev, false);

the last 3 lines can be replaced with ip_neigh_gw6.

> +	if (likely(!IS_ERR(neigh))) {
> +		int ret;
> +
> +		sock_confirm_neigh(skb, neigh);
> +		dev_xmit_recursion_inc();
> +		ret = neigh_output(neigh, skb, false);
> +		dev_xmit_recursion_dec();
> +		rcu_read_unlock_bh();
> +		return ret;
> +	}
> +	rcu_read_unlock_bh();
> +	IP6_INC_STATS(dev_net(dst->dev),
> +		      ip6_dst_idev(dst), IPSTATS_MIB_OUTNOROUTES);
> +out_drop:
> +	kfree_skb(skb);
> +	return -EINVAL;
> +out_rec:
> +	net_crit_ratelimited("bpf: recursion limit reached on datapath, buggy bpf program?\n");
> +	goto out_drop;
> +}
> +

...

> +
> +#if IS_ENABLED(CONFIG_INET)
> +static int bpf_out_neigh_v4(struct net *net, struct sk_buff *skb)
> +{
> +	struct dst_entry *dst = skb_dst(skb);
> +	struct rtable *rt = (struct rtable *)dst;

please use container_of here; I'd like to see the typecasts get removed.

> +	struct net_device *dev = dst->dev;
> +	u32 hh_len = LL_RESERVED_SPACE(dev);
> +	struct neighbour *neigh;
> +	bool is_v6gw = false;
> +
> +	if (dev_xmit_recursion())
> +		goto out_rec;

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf-next 4/6] bpf, libbpf: add bpf_tail_call_static helper for bpf programs
  2020-09-24 20:53   ` Andrii Nakryiko
@ 2020-09-24 22:17     ` Daniel Borkmann
  2020-09-25 15:42       ` Daniel Borkmann
  0 siblings, 1 reply; 28+ messages in thread
From: Daniel Borkmann @ 2020-09-24 22:17 UTC (permalink / raw)
  To: Andrii Nakryiko; +Cc: Alexei Starovoitov, john fastabend, Networking, bpf

On 9/24/20 10:53 PM, Andrii Nakryiko wrote:
> On Thu, Sep 24, 2020 at 11:22 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>>
>> Port of tail_call_static() helper function from Cilium's BPF code base [0]
>> to libbpf, so others can easily consume it as well. We've been using this
>> in production code for some time now. The main idea is that we guarantee
>> that the kernel's BPF infrastructure and JIT (here: x86_64) can patch the
>> JITed BPF insns with direct jumps instead of having to fall back to using
>> expensive retpolines. By using inline asm, we guarantee that the compiler
>> won't merge the call from different paths with potentially different
>> content of r2/r3.
>>
>> We're also using __throw_build_bug() macro in different places as a neat
>> trick to trigger compilation errors when compiler does not remove code at
>> compilation time. This works for the BPF backend as it does not implement
>> the __builtin_trap().
>>
>>    [0] https://github.com/cilium/cilium/commit/f5537c26020d5297b70936c6b7d03a1e412a1035
>>
>> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
>> ---
>>   tools/lib/bpf/bpf_helpers.h | 32 ++++++++++++++++++++++++++++++++
>>   1 file changed, 32 insertions(+)
>>
>> diff --git a/tools/lib/bpf/bpf_helpers.h b/tools/lib/bpf/bpf_helpers.h
>> index 1106777df00b..18b75a4c82e6 100644
>> --- a/tools/lib/bpf/bpf_helpers.h
>> +++ b/tools/lib/bpf/bpf_helpers.h
>> @@ -53,6 +53,38 @@
>>          })
>>   #endif
>>
>> +/*
>> + * Misc useful helper macros
>> + */
>> +#ifndef __throw_build_bug
>> +# define __throw_build_bug()   __builtin_trap()
>> +#endif
> 
> this will become part of libbpf stable API, do we want/need to expose
> it? If we want to expose it, then we should probably provide a better
> description.
> 
> But also curious, how is it better than _Static_assert() (see
> test_cls_redirect.c), which also allows to provide a better error
> message?

Need to get back to you whether that has same semantics. We use the __throw_build_bug()
also in __bpf_memzero() and friends [0] as a way to trigger a hard build bug if we hit
a default switch-case [0], so we detect unsupported sizes which are not covered by the
implementation yet. If _Static_assert (0, "foo") does the trick, we could also use that;
will check with our code base.

   [0] https://github.com/cilium/cilium/blob/master/bpf/include/bpf/builtins.h

>> +static __always_inline void
>> +bpf_tail_call_static(void *ctx, const void *map, const __u32 slot)
>> +{
>> +       if (!__builtin_constant_p(slot))
>> +               __throw_build_bug();
>> +
>> +       /*
>> +        * Don't gamble, but _guarantee_ that LLVM won't optimize setting
>> +        * r2 and r3 from different paths ending up at the same call insn as
>> +        * otherwise we won't be able to use the jmpq/nopl retpoline-free
>> +        * patching by the x86-64 JIT in the kernel.
>> +        *
> 
> So the clobbering comment below is completely clear. But this one is
> less clear without some sort of example situation in which bad things
> happen. Do you mind providing some pseudo-C example in which the
> compiler will optimize things in such a way that the tail call
> patching won't happen?

The details are pretty much here [1] and mentioned at plumbers, so if we end up from
different paths with different map or const key at the same tail-call call insn, then
the record_func_key() will determine that we need to emit retpoline instead of patching
which could have occured with two tail-call call insns, for example. This helper is just
to guarantee that the latter will always happen.

   [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d2e4c1e6c2947269346054ac8937ccfe9e0bcc6b

>> +        * Note on clobber list: we need to stay in-line with BPF calling
>> +        * convention, so even if we don't end up using r0, r4, r5, we need
>> +        * to mark them as clobber so that LLVM doesn't end up using them
>> +        * before / after the call.
>> +        */
>> +       asm volatile("r1 = %[ctx]\n\t"
>> +                    "r2 = %[map]\n\t"
>> +                    "r3 = %[slot]\n\t"
>> +                    "call 12\n\t"
>> +                    :: [ctx]"r"(ctx), [map]"r"(map), [slot]"i"(slot)
>> +                    : "r0", "r1", "r2", "r3", "r4", "r5");
>> +}
>> +
>>   /*
>>    * Helper structure used by eBPF C program
>>    * to describe BPF map attributes to libbpf loader
>> --
>> 2.21.0
>>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf-next 3/6] bpf: add redirect_neigh helper as redirect drop-in
  2020-09-24 22:12   ` David Ahern
@ 2020-09-24 22:19     ` Daniel Borkmann
  0 siblings, 0 replies; 28+ messages in thread
From: Daniel Borkmann @ 2020-09-24 22:19 UTC (permalink / raw)
  To: David Ahern, ast; +Cc: john.fastabend, netdev, bpf

On 9/25/20 12:12 AM, David Ahern wrote:
> On 9/24/20 12:21 PM, Daniel Borkmann wrote:
>> diff --git a/net/core/filter.c b/net/core/filter.c
>> index 0f913755bcba..19caa2fc21e8 100644
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
>> @@ -2160,6 +2160,205 @@ static int __bpf_redirect(struct sk_buff *skb, struct net_device *dev,
>>   		return __bpf_redirect_no_mac(skb, dev, flags);
>>   }
>>   
>> +#if IS_ENABLED(CONFIG_IPV6)
>> +static int bpf_out_neigh_v6(struct net *net, struct sk_buff *skb)
>> +{
>> +	struct dst_entry *dst = skb_dst(skb);
>> +	struct net_device *dev = dst->dev;
>> +	const struct in6_addr *nexthop;
>> +	struct neighbour *neigh;
>> +
>> +	if (dev_xmit_recursion())
>> +		goto out_rec;
>> +	skb->dev = dev;
>> +	rcu_read_lock_bh();
>> +	nexthop = rt6_nexthop((struct rt6_info *)dst, &ipv6_hdr(skb)->daddr);
>> +	neigh = __ipv6_neigh_lookup_noref_stub(dev, nexthop);
>> +	if (unlikely(!neigh))
>> +		neigh = __neigh_create(ipv6_stub->nd_tbl, nexthop, dev, false);
> 
> the last 3 lines can be replaced with ip_neigh_gw6.

Ah, nice, I wasn't aware of that one. I'll take it. :)

>> +	if (likely(!IS_ERR(neigh))) {
>> +		int ret;
>> +
>> +		sock_confirm_neigh(skb, neigh);
>> +		dev_xmit_recursion_inc();
>> +		ret = neigh_output(neigh, skb, false);
>> +		dev_xmit_recursion_dec();
>> +		rcu_read_unlock_bh();
>> +		return ret;
>> +	}
>> +	rcu_read_unlock_bh();
>> +	IP6_INC_STATS(dev_net(dst->dev),
>> +		      ip6_dst_idev(dst), IPSTATS_MIB_OUTNOROUTES);
>> +out_drop:
>> +	kfree_skb(skb);
>> +	return -EINVAL;
>> +out_rec:
>> +	net_crit_ratelimited("bpf: recursion limit reached on datapath, buggy bpf program?\n");
>> +	goto out_drop;
>> +}
>> +
> 
> ...
> 
>> +#if IS_ENABLED(CONFIG_INET)
>> +static int bpf_out_neigh_v4(struct net *net, struct sk_buff *skb)
>> +{
>> +	struct dst_entry *dst = skb_dst(skb);
>> +	struct rtable *rt = (struct rtable *)dst;
> 
> please use container_of here; I'd like to see the typecasts get removed.

Will do, thx!

>> +	struct net_device *dev = dst->dev;
>> +	u32 hh_len = LL_RESERVED_SPACE(dev);
>> +	struct neighbour *neigh;
>> +	bool is_v6gw = false;
>> +
>> +	if (dev_xmit_recursion())
>> +		goto out_rec;


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf-next 4/6] bpf, libbpf: add bpf_tail_call_static helper for bpf programs
  2020-09-24 18:21 ` [PATCH bpf-next 4/6] bpf, libbpf: add bpf_tail_call_static helper for bpf programs Daniel Borkmann
  2020-09-24 20:53   ` Andrii Nakryiko
@ 2020-09-25  6:13   ` Yonghong Song
  1 sibling, 0 replies; 28+ messages in thread
From: Yonghong Song @ 2020-09-25  6:13 UTC (permalink / raw)
  To: Daniel Borkmann, ast; +Cc: john.fastabend, netdev, bpf



On 9/24/20 11:21 AM, Daniel Borkmann wrote:
> Port of tail_call_static() helper function from Cilium's BPF code base [0]
> to libbpf, so others can easily consume it as well. We've been using this
> in production code for some time now. The main idea is that we guarantee
> that the kernel's BPF infrastructure and JIT (here: x86_64) can patch the
> JITed BPF insns with direct jumps instead of having to fall back to using
> expensive retpolines. By using inline asm, we guarantee that the compiler
> won't merge the call from different paths with potentially different
> content of r2/r3.
> 
> We're also using __throw_build_bug() macro in different places as a neat
> trick to trigger compilation errors when compiler does not remove code at
> compilation time. This works for the BPF backend as it does not implement
> the __builtin_trap().
> 
>    [0] https://github.com/cilium/cilium/commit/f5537c26020d5297b70936c6b7d03a1e412a1035
> 
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> ---
>   tools/lib/bpf/bpf_helpers.h | 32 ++++++++++++++++++++++++++++++++
>   1 file changed, 32 insertions(+)
> 
> diff --git a/tools/lib/bpf/bpf_helpers.h b/tools/lib/bpf/bpf_helpers.h
> index 1106777df00b..18b75a4c82e6 100644
> --- a/tools/lib/bpf/bpf_helpers.h
> +++ b/tools/lib/bpf/bpf_helpers.h
> @@ -53,6 +53,38 @@
>   	})
>   #endif
>   
> +/*
> + * Misc useful helper macros
> + */
> +#ifndef __throw_build_bug
> +# define __throw_build_bug()	__builtin_trap()

Just some general comments below. The patch itself is fine to me.

I guess we will never implement a 'trap' insn? The only possible
use I know is for failed CORE relocation, which currently encoded
as an illegal call insn.

> +#endif
> +
> +static __always_inline void
> +bpf_tail_call_static(void *ctx, const void *map, const __u32 slot)
> +{
> +	if (!__builtin_constant_p(slot))
> +		__throw_build_bug();
> +
> +	/*
> +	 * Don't gamble, but _guarantee_ that LLVM won't optimize setting
> +	 * r2 and r3 from different paths ending up at the same call insn as
> +	 * otherwise we won't be able to use the jmpq/nopl retpoline-free
> +	 * patching by the x86-64 JIT in the kernel.
> +	 *
> +	 * Note on clobber list: we need to stay in-line with BPF calling
> +	 * convention, so even if we don't end up using r0, r4, r5, we need
> +	 * to mark them as clobber so that LLVM doesn't end up using them
> +	 * before / after the call.
> +	 */
> +	asm volatile("r1 = %[ctx]\n\t"
> +		     "r2 = %[map]\n\t"
> +		     "r3 = %[slot]\n\t"
> +		     "call 12\n\t"
> +		     :: [ctx]"r"(ctx), [map]"r"(map), [slot]"i"(slot)
> +		     : "r0", "r1", "r2", "r3", "r4", "r5");

Clever scheme to enforce a constant!

This particular case is to avoid sinking common code.
We have a similar use case in https://reviews.llvm.org/D87153
which is to avoid PHI merging of two relocation globals like
    phi = [reloc_global1, reloc_global2]  // reachable from two paths
    ... phi ...

The solution is to insert bpf internal __builtin
    val = __builtin_bpf_passthrough(seq_num, val)
in proper places to prevent sinking common code.

I guess in this example, we could prevent this with
compiler inserting passthrough builtin's like
    ...
    ret = bpf_tail_call(ctx, map, slot)
    ret = __builtin_bpf_passthrough(seq_num, ret)
    ...

If in the future, such a builtin is proved useful for bpf program
writers as well. we could just define
    val = __builtin_bpf_passthrough(val)
and internally, it will be transformed to
    val = llvm.bpf.passthrough(seq_num, val)

> +}
> +
>   /*
>    * Helper structure used by eBPF C program
>    * to describe BPF map attributes to libbpf loader
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf-next 2/6] bpf, net: rework cookie generator as per-cpu one
  2020-09-24 22:03     ` Daniel Borkmann
@ 2020-09-25  7:49       ` Eric Dumazet
  2020-09-25  9:26         ` Daniel Borkmann
  2020-09-25 15:00       ` Jakub Kicinski
  1 sibling, 1 reply; 28+ messages in thread
From: Eric Dumazet @ 2020-09-25  7:49 UTC (permalink / raw)
  To: Daniel Borkmann, Eric Dumazet, ast; +Cc: john.fastabend, netdev, bpf



On 9/25/20 12:03 AM, Daniel Borkmann wrote:
> On 9/24/20 8:58 PM, Eric Dumazet wrote:
>> On 9/24/20 8:21 PM, Daniel Borkmann wrote:
> [...]
>>> diff --git a/include/linux/cookie.h b/include/linux/cookie.h
>>> new file mode 100644
>>> index 000000000000..2488203dc004
>>> --- /dev/null
>>> +++ b/include/linux/cookie.h
>>> @@ -0,0 +1,41 @@
>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>> +#ifndef __LINUX_COOKIE_H
>>> +#define __LINUX_COOKIE_H
>>> +
>>> +#include <linux/atomic.h>
>>> +#include <linux/percpu.h>
>>> +
>>> +struct gen_cookie {
>>> +    u64 __percpu    *local_last;
>>> +    atomic64_t     shared_last ____cacheline_aligned_in_smp;
>>> +};
>>> +
>>> +#define COOKIE_LOCAL_BATCH    4096
>>> +
>>> +#define DEFINE_COOKIE(name)                    \
>>> +    static DEFINE_PER_CPU(u64, __##name);            \
>>> +    static struct gen_cookie name = {            \
>>> +        .local_last    = &__##name,            \
>>> +        .shared_last    = ATOMIC64_INIT(0),        \
>>> +    }
>>> +
>>> +static inline u64 gen_cookie_next(struct gen_cookie *gc)
>>> +{
>>> +    u64 *local_last = &get_cpu_var(*gc->local_last);
>>> +    u64 val = *local_last;
>>> +
>>> +    if (__is_defined(CONFIG_SMP) &&
>>> +        unlikely((val & (COOKIE_LOCAL_BATCH - 1)) == 0)) {
>>> +        s64 next = atomic64_add_return(COOKIE_LOCAL_BATCH,
>>> +                           &gc->shared_last);
>>> +        val = next - COOKIE_LOCAL_BATCH;
>>> +    }
>>> +    val++;
>>> +    if (unlikely(!val))
>>> +        val++;
>>> +    *local_last = val;
>>> +    put_cpu_var(local_last);
>>> +    return val;
>>
>> This is not interrupt safe.
>>
>> I think sock_gen_cookie() can be called from interrupt context.
>>
>> get_next_ino() is only called from process context, that is what I used get_cpu_var()
>> and put_cpu_var()
> 
> Hmm, agree, good point. Need to experiment a bit more .. initial thinking
> potentially something like the below could do where we fall back to atomic
> counter iff we encounter nesting (which should be an extremely rare case
> normally).
> 
> BPF progs where this can be called from are non-preemptible, so we could
> actually move the temp preempt_disable/enable() from get/put_cpu_var() into
> a wrapper func for slow path non-BPF users as well.
> 
> static inline u64 gen_cookie_next(struct gen_cookie *gc)
> {
>         u64 val;
> 

I presume you would use a single structure to hold level_nesting and local_last
in the same cache line.

struct pcpu_gen_cookie {
    int level_nesting;
    u64 local_last;
} __aligned(16);

    

>         if (likely(this_cpu_inc_return(*gc->level_nesting) == 1)) {
>                 u64 *local_last = this_cpu_ptr(gc->local_last);
> 
>                 val = *local_last;
>                 if (__is_defined(CONFIG_SMP) &&
>                     unlikely((val & (COOKIE_LOCAL_BATCH - 1)) == 0)) {
>                         s64 next = atomic64_add_return(COOKIE_LOCAL_BATCH,
>                                                        &gc->shared_last);
>                         val = next - COOKIE_LOCAL_BATCH;
>                 }
>                 val++;



>                 if (unlikely(!val))
>                         val++;

Note that we really expect this wrapping will never happen, with 64bit value.
(We had to take care of the wrapping in get_next_ino() as it was dealing with 32bit values)

>                 *local_last = val;
>         } else {
>                 val = atomic64_add_return(COOKIE_LOCAL_BATCH,
>                                           &gc->shared_last);

Or val = atomic64_dec_return(&reverse_counter)

With reverse_counter initial value set to ATOMIC64_INIT(0) ? 

This will start sending 'big cookies like 0xFFFFFFFFxxxxxxxx' to make sure applications
are not breaking with them, after few months of uptime.

This would also not consume COOKIE_LOCAL_BATCH units per value,
but this seems minor based on the available space.


>         }
>         this_cpu_dec(*gc->level_nesting);
>         return val;
> }
> 
> Thanks,
> Daniel

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf-next 2/6] bpf, net: rework cookie generator as per-cpu one
  2020-09-25  7:49       ` Eric Dumazet
@ 2020-09-25  9:26         ` Daniel Borkmann
  0 siblings, 0 replies; 28+ messages in thread
From: Daniel Borkmann @ 2020-09-25  9:26 UTC (permalink / raw)
  To: Eric Dumazet, ast; +Cc: john.fastabend, netdev, bpf

On 9/25/20 9:49 AM, Eric Dumazet wrote:
> On 9/25/20 12:03 AM, Daniel Borkmann wrote:
>> On 9/24/20 8:58 PM, Eric Dumazet wrote:
>>> On 9/24/20 8:21 PM, Daniel Borkmann wrote:
>> [...]
>>>> diff --git a/include/linux/cookie.h b/include/linux/cookie.h
>>>> new file mode 100644
>>>> index 000000000000..2488203dc004
>>>> --- /dev/null
>>>> +++ b/include/linux/cookie.h
>>>> @@ -0,0 +1,41 @@
>>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>>> +#ifndef __LINUX_COOKIE_H
>>>> +#define __LINUX_COOKIE_H
>>>> +
>>>> +#include <linux/atomic.h>
>>>> +#include <linux/percpu.h>
>>>> +
>>>> +struct gen_cookie {
>>>> +    u64 __percpu    *local_last;
>>>> +    atomic64_t     shared_last ____cacheline_aligned_in_smp;
>>>> +};
>>>> +
>>>> +#define COOKIE_LOCAL_BATCH    4096
>>>> +
>>>> +#define DEFINE_COOKIE(name)                    \
>>>> +    static DEFINE_PER_CPU(u64, __##name);            \
>>>> +    static struct gen_cookie name = {            \
>>>> +        .local_last    = &__##name,            \
>>>> +        .shared_last    = ATOMIC64_INIT(0),        \
>>>> +    }
>>>> +
>>>> +static inline u64 gen_cookie_next(struct gen_cookie *gc)
>>>> +{
>>>> +    u64 *local_last = &get_cpu_var(*gc->local_last);
>>>> +    u64 val = *local_last;
>>>> +
>>>> +    if (__is_defined(CONFIG_SMP) &&
>>>> +        unlikely((val & (COOKIE_LOCAL_BATCH - 1)) == 0)) {
>>>> +        s64 next = atomic64_add_return(COOKIE_LOCAL_BATCH,
>>>> +                           &gc->shared_last);
>>>> +        val = next - COOKIE_LOCAL_BATCH;
>>>> +    }
>>>> +    val++;
>>>> +    if (unlikely(!val))
>>>> +        val++;
>>>> +    *local_last = val;
>>>> +    put_cpu_var(local_last);
>>>> +    return val;
>>>
>>> This is not interrupt safe.
>>>
>>> I think sock_gen_cookie() can be called from interrupt context.
>>>
>>> get_next_ino() is only called from process context, that is what I used get_cpu_var()
>>> and put_cpu_var()
>>
>> Hmm, agree, good point. Need to experiment a bit more .. initial thinking
>> potentially something like the below could do where we fall back to atomic
>> counter iff we encounter nesting (which should be an extremely rare case
>> normally).
>>
>> BPF progs where this can be called from are non-preemptible, so we could
>> actually move the temp preempt_disable/enable() from get/put_cpu_var() into
>> a wrapper func for slow path non-BPF users as well.
>>
>> static inline u64 gen_cookie_next(struct gen_cookie *gc)
>> {
>>          u64 val;
> 
> I presume you would use a single structure to hold level_nesting and local_last
> in the same cache line.
> 
> struct pcpu_gen_cookie {
>      int level_nesting;
>      u64 local_last;
> } __aligned(16);

Yes.

>>          if (likely(this_cpu_inc_return(*gc->level_nesting) == 1)) {
>>                  u64 *local_last = this_cpu_ptr(gc->local_last);
>>
>>                  val = *local_last;
>>                  if (__is_defined(CONFIG_SMP) &&
>>                      unlikely((val & (COOKIE_LOCAL_BATCH - 1)) == 0)) {
>>                          s64 next = atomic64_add_return(COOKIE_LOCAL_BATCH,
>>                                                         &gc->shared_last);
>>                          val = next - COOKIE_LOCAL_BATCH;
>>                  }
>>                  val++;
> 
>>                  if (unlikely(!val))
>>                          val++;
> 
> Note that we really expect this wrapping will never happen, with 64bit value.
> (We had to take care of the wrapping in get_next_ino() as it was dealing with 32bit values)

Agree, all local counters will start off at 0, but we inc right after the batch and
thus never run into it anyway and neither via overflow. Will remove.

>>                  *local_last = val;
>>          } else {
>>                  val = atomic64_add_return(COOKIE_LOCAL_BATCH,
>>                                            &gc->shared_last);
> 
> Or val = atomic64_dec_return(&reverse_counter)
> 
> With reverse_counter initial value set to ATOMIC64_INIT(0) ?
> 
> This will start sending 'big cookies like 0xFFFFFFFFxxxxxxxx' to make sure applications
> are not breaking with them, after few months of uptime.
> 
> This would also not consume COOKIE_LOCAL_BATCH units per value,
> but this seems minor based on the available space.

Excellent idea, I like it given it doesn't waste COOKIE_LOCAL_BATCH space. Thanks for
the feedback!

>>          }
>>          this_cpu_dec(*gc->level_nesting);
>>          return val;
>> }
>>
>> Thanks,
>> Daniel


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf-next 1/6] bpf: add classid helper only based on skb->sk
  2020-09-24 18:21 ` [PATCH bpf-next 1/6] bpf: add classid helper only based on skb->sk Daniel Borkmann
@ 2020-09-25 14:46   ` Jakub Kicinski
  2020-09-25 15:35     ` Daniel Borkmann
  0 siblings, 1 reply; 28+ messages in thread
From: Jakub Kicinski @ 2020-09-25 14:46 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: ast, john.fastabend, netdev, bpf

On Thu, 24 Sep 2020 20:21:22 +0200 Daniel Borkmann wrote:
> Similarly to 5a52ae4e32a6 ("bpf: Allow to retrieve cgroup v1 classid
> from v2 hooks"), add a helper to retrieve cgroup v1 classid solely
> based on the skb->sk, so it can be used as key as part of BPF map
> lookups out of tc from host ns, in particular given the skb->sk is
> retained these days when crossing net ns thanks to 9c4c325252c5
> ("skbuff: preserve sock reference when scrubbing the skb."). This
> is similar to bpf_skb_cgroup_id() which implements the same for v2.
> Kubernetes ecosystem is still operating on v1 however, hence net_cls
> needs to be used there until this can be dropped in with the v2
> helper of bpf_skb_cgroup_id().
> 
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

FWIW lot's of whitespace warnings from checkpatch --strict about
comments having spaces before tabs here.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf-next 2/6] bpf, net: rework cookie generator as per-cpu one
  2020-09-24 22:03     ` Daniel Borkmann
  2020-09-25  7:49       ` Eric Dumazet
@ 2020-09-25 15:00       ` Jakub Kicinski
  2020-09-25 15:15         ` Eric Dumazet
  1 sibling, 1 reply; 28+ messages in thread
From: Jakub Kicinski @ 2020-09-25 15:00 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: Eric Dumazet, ast, john.fastabend, netdev, bpf

On Fri, 25 Sep 2020 00:03:14 +0200 Daniel Borkmann wrote:
> static inline u64 gen_cookie_next(struct gen_cookie *gc)
> {
>          u64 val;
> 
>          if (likely(this_cpu_inc_return(*gc->level_nesting) == 1)) {

Is this_cpu_inc() in itself atomic?

Is there a comparison of performance of various atomic ops and locking
somewhere? I wonder how this scheme would compare to a using a cmpxchg.

>                  u64 *local_last = this_cpu_ptr(gc->local_last);
> 
>                  val = *local_last;
>                  if (__is_defined(CONFIG_SMP) &&
>                      unlikely((val & (COOKIE_LOCAL_BATCH - 1)) == 0)) {

Can we reasonably assume we won't have more than 4k CPUs and just
statically divide this space by encoding CPU id in top bits?

>                          s64 next = atomic64_add_return(COOKIE_LOCAL_BATCH,
>                                                         &gc->shared_last);
>                          val = next - COOKIE_LOCAL_BATCH;
>                  }
>                  val++;
>                  if (unlikely(!val))
>                          val++;
>                  *local_last = val;
>          } else {
>                  val = atomic64_add_return(COOKIE_LOCAL_BATCH,
>                                            &gc->shared_last);
>          }
>          this_cpu_dec(*gc->level_nesting);
>          return val;
> }

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf-next 2/6] bpf, net: rework cookie generator as per-cpu one
  2020-09-25 15:00       ` Jakub Kicinski
@ 2020-09-25 15:15         ` Eric Dumazet
  2020-09-25 15:31           ` Jakub Kicinski
  0 siblings, 1 reply; 28+ messages in thread
From: Eric Dumazet @ 2020-09-25 15:15 UTC (permalink / raw)
  To: Jakub Kicinski, Daniel Borkmann
  Cc: Eric Dumazet, ast, john.fastabend, netdev, bpf



On 9/25/20 5:00 PM, Jakub Kicinski wrote:
                    unlikely((val & (COOKIE_LOCAL_BATCH - 1)) == 0)) {
> 
> Can we reasonably assume we won't have more than 4k CPUs and just
> statically divide this space by encoding CPU id in top bits?

This might give some food to side channel attacks, since this would
give an indication of cpu that allocated the id.

Also, I hear that some distros enabled 8K cpus.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf-next 2/6] bpf, net: rework cookie generator as per-cpu one
  2020-09-25 15:15         ` Eric Dumazet
@ 2020-09-25 15:31           ` Jakub Kicinski
  2020-09-25 15:45             ` Eric Dumazet
  0 siblings, 1 reply; 28+ messages in thread
From: Jakub Kicinski @ 2020-09-25 15:31 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Daniel Borkmann, ast, john.fastabend, netdev, bpf

On Fri, 25 Sep 2020 17:15:17 +0200 Eric Dumazet wrote:
> On 9/25/20 5:00 PM, Jakub Kicinski wrote:
> > Is this_cpu_inc() in itself atomic?

To answer my own question - it is :)

> >                     unlikely((val & (COOKIE_LOCAL_BATCH - 1)) == 0))
> > 
> > Can we reasonably assume we won't have more than 4k CPUs and just
> > statically divide this space by encoding CPU id in top bits?  
> 
> This might give some food to side channel attacks, since this would
> give an indication of cpu that allocated the id.
> 
> Also, I hear that some distros enabled 8K cpus.

Ok :(

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf-next 1/6] bpf: add classid helper only based on skb->sk
  2020-09-25 14:46   ` Jakub Kicinski
@ 2020-09-25 15:35     ` Daniel Borkmann
  0 siblings, 0 replies; 28+ messages in thread
From: Daniel Borkmann @ 2020-09-25 15:35 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: ast, john.fastabend, netdev, bpf

On 9/25/20 4:46 PM, Jakub Kicinski wrote:
> On Thu, 24 Sep 2020 20:21:22 +0200 Daniel Borkmann wrote:
>> Similarly to 5a52ae4e32a6 ("bpf: Allow to retrieve cgroup v1 classid
>> from v2 hooks"), add a helper to retrieve cgroup v1 classid solely
>> based on the skb->sk, so it can be used as key as part of BPF map
>> lookups out of tc from host ns, in particular given the skb->sk is
>> retained these days when crossing net ns thanks to 9c4c325252c5
>> ("skbuff: preserve sock reference when scrubbing the skb."). This
>> is similar to bpf_skb_cgroup_id() which implements the same for v2.
>> Kubernetes ecosystem is still operating on v1 however, hence net_cls
>> needs to be used there until this can be dropped in with the v2
>> helper of bpf_skb_cgroup_id().
>>
>> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> 
> FWIW lot's of whitespace warnings from checkpatch --strict about
> comments having spaces before tabs here.

Expected given the way the UAPI helper comment is formatted / done in
order to then render the man page. So it's formatted the same way as
the other helper descriptions.

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf-next 4/6] bpf, libbpf: add bpf_tail_call_static helper for bpf programs
  2020-09-24 22:17     ` Daniel Borkmann
@ 2020-09-25 15:42       ` Daniel Borkmann
  2020-09-25 15:52         ` Daniel Borkmann
  0 siblings, 1 reply; 28+ messages in thread
From: Daniel Borkmann @ 2020-09-25 15:42 UTC (permalink / raw)
  To: Andrii Nakryiko; +Cc: Alexei Starovoitov, john fastabend, Networking, bpf

On 9/25/20 12:17 AM, Daniel Borkmann wrote:
> On 9/24/20 10:53 PM, Andrii Nakryiko wrote:
>> On Thu, Sep 24, 2020 at 11:22 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>>>
>>> Port of tail_call_static() helper function from Cilium's BPF code base [0]
>>> to libbpf, so others can easily consume it as well. We've been using this
>>> in production code for some time now. The main idea is that we guarantee
>>> that the kernel's BPF infrastructure and JIT (here: x86_64) can patch the
>>> JITed BPF insns with direct jumps instead of having to fall back to using
>>> expensive retpolines. By using inline asm, we guarantee that the compiler
>>> won't merge the call from different paths with potentially different
>>> content of r2/r3.
>>>
>>> We're also using __throw_build_bug() macro in different places as a neat
>>> trick to trigger compilation errors when compiler does not remove code at
>>> compilation time. This works for the BPF backend as it does not implement
>>> the __builtin_trap().
>>>
>>>    [0] https://github.com/cilium/cilium/commit/f5537c26020d5297b70936c6b7d03a1e412a1035
>>>
>>> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
>>> ---
>>>   tools/lib/bpf/bpf_helpers.h | 32 ++++++++++++++++++++++++++++++++
>>>   1 file changed, 32 insertions(+)
>>>
>>> diff --git a/tools/lib/bpf/bpf_helpers.h b/tools/lib/bpf/bpf_helpers.h
>>> index 1106777df00b..18b75a4c82e6 100644
>>> --- a/tools/lib/bpf/bpf_helpers.h
>>> +++ b/tools/lib/bpf/bpf_helpers.h
>>> @@ -53,6 +53,38 @@
>>>          })
>>>   #endif
>>>
>>> +/*
>>> + * Misc useful helper macros
>>> + */
>>> +#ifndef __throw_build_bug
>>> +# define __throw_build_bug()   __builtin_trap()
>>> +#endif
>>
>> this will become part of libbpf stable API, do we want/need to expose
>> it? If we want to expose it, then we should probably provide a better
>> description.
>>
>> But also curious, how is it better than _Static_assert() (see
>> test_cls_redirect.c), which also allows to provide a better error
>> message?
> 
> Need to get back to you whether that has same semantics. We use the __throw_build_bug()
> also in __bpf_memzero() and friends [0] as a way to trigger a hard build bug if we hit
> a default switch-case [0], so we detect unsupported sizes which are not covered by the
> implementation yet. If _Static_assert (0, "foo") does the trick, we could also use that;
> will check with our code base.

So _Static_assert() won't work here, for example consider:

   # cat f1.c
   int main(void)
   {
	if (0)
		_Static_assert(0, "foo");
	return 0;
   }
   # clang -target bpf -Wall -O2 -c f1.c -o f1.o
   f1.c:4:3: error: expected expression
                 _Static_assert(0, "foo");
                 ^
   1 error generated.

In order for it to work as required form the use-case, the _Static_assert() must not trigger
here given the path is unreachable and will be optimized away. I'll add a comment to the
__throw_build_bug() helper. Given libbpf we should probably also prefix with bpf_. If you see
a better name that would fit, pls let me know.

>    [0] https://github.com/cilium/cilium/blob/master/bpf/include/bpf/builtins.h
Thanks,
Daniel

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf-next 2/6] bpf, net: rework cookie generator as per-cpu one
  2020-09-25 15:31           ` Jakub Kicinski
@ 2020-09-25 15:45             ` Eric Dumazet
  0 siblings, 0 replies; 28+ messages in thread
From: Eric Dumazet @ 2020-09-25 15:45 UTC (permalink / raw)
  To: Jakub Kicinski, Eric Dumazet
  Cc: Daniel Borkmann, ast, john.fastabend, netdev, bpf



On 9/25/20 5:31 PM, Jakub Kicinski wrote:
> On Fri, 25 Sep 2020 17:15:17 +0200 Eric Dumazet wrote:
>> On 9/25/20 5:00 PM, Jakub Kicinski wrote:
>>> Is this_cpu_inc() in itself atomic?
> 
> To answer my own question - it is :)
> 
>>>                     unlikely((val & (COOKIE_LOCAL_BATCH - 1)) == 0))
>>>
>>> Can we reasonably assume we won't have more than 4k CPUs and just
>>> statically divide this space by encoding CPU id in top bits?  
>>
>> This might give some food to side channel attacks, since this would
>> give an indication of cpu that allocated the id.
>>
>> Also, I hear that some distros enabled 8K cpus.
> 
> Ok :(
> 

I was not really serious about the side channel attacks, just some
thought about possible implications :)

Even with 8192 max cpus, splitting space into 2^(64-13) blocks would be fine I think.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf-next 4/6] bpf, libbpf: add bpf_tail_call_static helper for bpf programs
  2020-09-25 15:42       ` Daniel Borkmann
@ 2020-09-25 15:52         ` Daniel Borkmann
  2020-09-25 16:50           ` Andrii Nakryiko
  0 siblings, 1 reply; 28+ messages in thread
From: Daniel Borkmann @ 2020-09-25 15:52 UTC (permalink / raw)
  To: Andrii Nakryiko; +Cc: Alexei Starovoitov, john fastabend, Networking, bpf

On 9/25/20 5:42 PM, Daniel Borkmann wrote:
> On 9/25/20 12:17 AM, Daniel Borkmann wrote:
>> On 9/24/20 10:53 PM, Andrii Nakryiko wrote:
>>> On Thu, Sep 24, 2020 at 11:22 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>>>>
>>>> Port of tail_call_static() helper function from Cilium's BPF code base [0]
>>>> to libbpf, so others can easily consume it as well. We've been using this
>>>> in production code for some time now. The main idea is that we guarantee
>>>> that the kernel's BPF infrastructure and JIT (here: x86_64) can patch the
>>>> JITed BPF insns with direct jumps instead of having to fall back to using
>>>> expensive retpolines. By using inline asm, we guarantee that the compiler
>>>> won't merge the call from different paths with potentially different
>>>> content of r2/r3.
>>>>
>>>> We're also using __throw_build_bug() macro in different places as a neat
>>>> trick to trigger compilation errors when compiler does not remove code at
>>>> compilation time. This works for the BPF backend as it does not implement
>>>> the __builtin_trap().
>>>>
>>>>    [0] https://github.com/cilium/cilium/commit/f5537c26020d5297b70936c6b7d03a1e412a1035
>>>>
>>>> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
>>>> ---
>>>>   tools/lib/bpf/bpf_helpers.h | 32 ++++++++++++++++++++++++++++++++
>>>>   1 file changed, 32 insertions(+)
>>>>
>>>> diff --git a/tools/lib/bpf/bpf_helpers.h b/tools/lib/bpf/bpf_helpers.h
>>>> index 1106777df00b..18b75a4c82e6 100644
>>>> --- a/tools/lib/bpf/bpf_helpers.h
>>>> +++ b/tools/lib/bpf/bpf_helpers.h
>>>> @@ -53,6 +53,38 @@
>>>>          })
>>>>   #endif
>>>>
>>>> +/*
>>>> + * Misc useful helper macros
>>>> + */
>>>> +#ifndef __throw_build_bug
>>>> +# define __throw_build_bug()   __builtin_trap()
>>>> +#endif
>>>
>>> this will become part of libbpf stable API, do we want/need to expose
>>> it? If we want to expose it, then we should probably provide a better
>>> description.
>>>
>>> But also curious, how is it better than _Static_assert() (see
>>> test_cls_redirect.c), which also allows to provide a better error
>>> message?
>>
>> Need to get back to you whether that has same semantics. We use the __throw_build_bug()
>> also in __bpf_memzero() and friends [0] as a way to trigger a hard build bug if we hit
>> a default switch-case [0], so we detect unsupported sizes which are not covered by the
>> implementation yet. If _Static_assert (0, "foo") does the trick, we could also use that;
>> will check with our code base.
> 
> So _Static_assert() won't work here, for example consider:
> 
>    # cat f1.c
>    int main(void)
>    {
>      if (0)
>          _Static_assert(0, "foo");
>      return 0;
>    }
>    # clang -target bpf -Wall -O2 -c f1.c -o f1.o
>    f1.c:4:3: error: expected expression
>                  _Static_assert(0, "foo");
>                  ^
>    1 error generated.

.. aaand it looks like I need some more coffee. ;-) But result is the same after all:

   # clang -target bpf -Wall -O2 -c f1.c -o f1.o
   f1.c:4:3: error: static_assert failed "foo"
                 _Static_assert(0, "foo");
                 ^              ~
   1 error generated.

   # cat f1.c
   int main(void)
   {
	if (0) {
		_Static_assert(0, "foo");
	}
	return 0;
   }

> In order for it to work as required form the use-case, the _Static_assert() must not trigger
> here given the path is unreachable and will be optimized away. I'll add a comment to the
> __throw_build_bug() helper. Given libbpf we should probably also prefix with bpf_. If you see
> a better name that would fit, pls let me know.
> 
>>    [0] https://github.com/cilium/cilium/blob/master/bpf/include/bpf/builtins.h
> Thanks,
> Daniel


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf-next 4/6] bpf, libbpf: add bpf_tail_call_static helper for bpf programs
  2020-09-25 15:52         ` Daniel Borkmann
@ 2020-09-25 16:50           ` Andrii Nakryiko
  2020-09-25 19:52             ` Daniel Borkmann
  0 siblings, 1 reply; 28+ messages in thread
From: Andrii Nakryiko @ 2020-09-25 16:50 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: Alexei Starovoitov, john fastabend, Networking, bpf

On Fri, Sep 25, 2020 at 8:52 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 9/25/20 5:42 PM, Daniel Borkmann wrote:
> > On 9/25/20 12:17 AM, Daniel Borkmann wrote:
> >> On 9/24/20 10:53 PM, Andrii Nakryiko wrote:
> >>> On Thu, Sep 24, 2020 at 11:22 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
> >>>>
> >>>> Port of tail_call_static() helper function from Cilium's BPF code base [0]
> >>>> to libbpf, so others can easily consume it as well. We've been using this
> >>>> in production code for some time now. The main idea is that we guarantee
> >>>> that the kernel's BPF infrastructure and JIT (here: x86_64) can patch the
> >>>> JITed BPF insns with direct jumps instead of having to fall back to using
> >>>> expensive retpolines. By using inline asm, we guarantee that the compiler
> >>>> won't merge the call from different paths with potentially different
> >>>> content of r2/r3.
> >>>>
> >>>> We're also using __throw_build_bug() macro in different places as a neat
> >>>> trick to trigger compilation errors when compiler does not remove code at
> >>>> compilation time. This works for the BPF backend as it does not implement
> >>>> the __builtin_trap().
> >>>>
> >>>>    [0] https://github.com/cilium/cilium/commit/f5537c26020d5297b70936c6b7d03a1e412a1035
> >>>>
> >>>> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> >>>> ---
> >>>>   tools/lib/bpf/bpf_helpers.h | 32 ++++++++++++++++++++++++++++++++
> >>>>   1 file changed, 32 insertions(+)
> >>>>
> >>>> diff --git a/tools/lib/bpf/bpf_helpers.h b/tools/lib/bpf/bpf_helpers.h
> >>>> index 1106777df00b..18b75a4c82e6 100644
> >>>> --- a/tools/lib/bpf/bpf_helpers.h
> >>>> +++ b/tools/lib/bpf/bpf_helpers.h
> >>>> @@ -53,6 +53,38 @@
> >>>>          })
> >>>>   #endif
> >>>>
> >>>> +/*
> >>>> + * Misc useful helper macros
> >>>> + */
> >>>> +#ifndef __throw_build_bug
> >>>> +# define __throw_build_bug()   __builtin_trap()
> >>>> +#endif
> >>>
> >>> this will become part of libbpf stable API, do we want/need to expose
> >>> it? If we want to expose it, then we should probably provide a better
> >>> description.
> >>>
> >>> But also curious, how is it better than _Static_assert() (see
> >>> test_cls_redirect.c), which also allows to provide a better error
> >>> message?
> >>
> >> Need to get back to you whether that has same semantics. We use the __throw_build_bug()
> >> also in __bpf_memzero() and friends [0] as a way to trigger a hard build bug if we hit
> >> a default switch-case [0], so we detect unsupported sizes which are not covered by the
> >> implementation yet. If _Static_assert (0, "foo") does the trick, we could also use that;
> >> will check with our code base.
> >
> > So _Static_assert() won't work here, for example consider:
> >
> >    # cat f1.c
> >    int main(void)
> >    {
> >      if (0)
> >          _Static_assert(0, "foo");
> >      return 0;
> >    }
> >    # clang -target bpf -Wall -O2 -c f1.c -o f1.o
> >    f1.c:4:3: error: expected expression
> >                  _Static_assert(0, "foo");
> >                  ^
> >    1 error generated.
>
> .. aaand it looks like I need some more coffee. ;-) But result is the same after all:
>
>    # clang -target bpf -Wall -O2 -c f1.c -o f1.o
>    f1.c:4:3: error: static_assert failed "foo"
>                  _Static_assert(0, "foo");
>                  ^              ~
>    1 error generated.
>
>    # cat f1.c
>    int main(void)
>    {
>         if (0) {
>                 _Static_assert(0, "foo");
>         }
>         return 0;
>    }

You need still more :-P. For you use case it will look like this:

$ cat test-bla.c
int bar(int x) {
       _Static_assert(!__builtin_constant_p(x), "not a constant!");
       return x;
}

int foo() {
        bar(123);
        return 0;
}
$ clang -target bpf -O2 -c test-bla.c -o test-bla.o
$ echo $?
0

But in general to ensure unreachable code it's probably useful anyway
to have this. How about calling it __bpf_build_error() or maybe even
__bpf_unreachable()?

>
> > In order for it to work as required form the use-case, the _Static_assert() must not trigger
> > here given the path is unreachable and will be optimized away. I'll add a comment to the
> > __throw_build_bug() helper. Given libbpf we should probably also prefix with bpf_. If you see
> > a better name that would fit, pls let me know.
> >
> >>    [0] https://github.com/cilium/cilium/blob/master/bpf/include/bpf/builtins.h
> > Thanks,
> > Daniel
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH bpf-next 4/6] bpf, libbpf: add bpf_tail_call_static helper for bpf programs
  2020-09-25 16:50           ` Andrii Nakryiko
@ 2020-09-25 19:52             ` Daniel Borkmann
  0 siblings, 0 replies; 28+ messages in thread
From: Daniel Borkmann @ 2020-09-25 19:52 UTC (permalink / raw)
  To: Andrii Nakryiko; +Cc: Alexei Starovoitov, john fastabend, Networking, bpf

On 9/25/20 6:50 PM, Andrii Nakryiko wrote:
> On Fri, Sep 25, 2020 at 8:52 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>> On 9/25/20 5:42 PM, Daniel Borkmann wrote:
>>> On 9/25/20 12:17 AM, Daniel Borkmann wrote:
>>>> On 9/24/20 10:53 PM, Andrii Nakryiko wrote:
>>>>> On Thu, Sep 24, 2020 at 11:22 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>>>>>>
>>>>>> Port of tail_call_static() helper function from Cilium's BPF code base [0]
>>>>>> to libbpf, so others can easily consume it as well. We've been using this
>>>>>> in production code for some time now. The main idea is that we guarantee
>>>>>> that the kernel's BPF infrastructure and JIT (here: x86_64) can patch the
>>>>>> JITed BPF insns with direct jumps instead of having to fall back to using
>>>>>> expensive retpolines. By using inline asm, we guarantee that the compiler
>>>>>> won't merge the call from different paths with potentially different
>>>>>> content of r2/r3.
>>>>>>
>>>>>> We're also using __throw_build_bug() macro in different places as a neat
>>>>>> trick to trigger compilation errors when compiler does not remove code at
>>>>>> compilation time. This works for the BPF backend as it does not implement
>>>>>> the __builtin_trap().
>>>>>>
>>>>>>     [0] https://github.com/cilium/cilium/commit/f5537c26020d5297b70936c6b7d03a1e412a1035
>>>>>>
>>>>>> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
>>>>>> ---
>>>>>>    tools/lib/bpf/bpf_helpers.h | 32 ++++++++++++++++++++++++++++++++
>>>>>>    1 file changed, 32 insertions(+)
>>>>>>
>>>>>> diff --git a/tools/lib/bpf/bpf_helpers.h b/tools/lib/bpf/bpf_helpers.h
>>>>>> index 1106777df00b..18b75a4c82e6 100644
>>>>>> --- a/tools/lib/bpf/bpf_helpers.h
>>>>>> +++ b/tools/lib/bpf/bpf_helpers.h
>>>>>> @@ -53,6 +53,38 @@
>>>>>>           })
>>>>>>    #endif
>>>>>>
>>>>>> +/*
>>>>>> + * Misc useful helper macros
>>>>>> + */
>>>>>> +#ifndef __throw_build_bug
>>>>>> +# define __throw_build_bug()   __builtin_trap()
>>>>>> +#endif
>>>>>
>>>>> this will become part of libbpf stable API, do we want/need to expose
>>>>> it? If we want to expose it, then we should probably provide a better
>>>>> description.
>>>>>
>>>>> But also curious, how is it better than _Static_assert() (see
>>>>> test_cls_redirect.c), which also allows to provide a better error
>>>>> message?
>>>>
>>>> Need to get back to you whether that has same semantics. We use the __throw_build_bug()
>>>> also in __bpf_memzero() and friends [0] as a way to trigger a hard build bug if we hit
>>>> a default switch-case [0], so we detect unsupported sizes which are not covered by the
>>>> implementation yet. If _Static_assert (0, "foo") does the trick, we could also use that;
>>>> will check with our code base.
>>>
>>> So _Static_assert() won't work here, for example consider:
>>>
>>>     # cat f1.c
>>>     int main(void)
>>>     {
>>>       if (0)
>>>           _Static_assert(0, "foo");
>>>       return 0;
>>>     }
>>>     # clang -target bpf -Wall -O2 -c f1.c -o f1.o
>>>     f1.c:4:3: error: expected expression
>>>                   _Static_assert(0, "foo");
>>>                   ^
>>>     1 error generated.
>>
>> .. aaand it looks like I need some more coffee. ;-) But result is the same after all:
>>
>>     # clang -target bpf -Wall -O2 -c f1.c -o f1.o
>>     f1.c:4:3: error: static_assert failed "foo"
>>                   _Static_assert(0, "foo");
>>                   ^              ~
>>     1 error generated.
>>
>>     # cat f1.c
>>     int main(void)
>>     {
>>          if (0) {
>>                  _Static_assert(0, "foo");
>>          }
>>          return 0;
>>     }
> 
> You need still more :-P. For you use case it will look like this:
> 
> $ cat test-bla.c
> int bar(int x) {
>         _Static_assert(!__builtin_constant_p(x), "not a constant!");
>         return x;
> }
> 
> int foo() {
>          bar(123);
>          return 0;
> }
> $ clang -target bpf -O2 -c test-bla.c -o test-bla.o
> $ echo $?
> 0

Right, but that won't work for example for the use case to detect switch cases which fall
into default case as mentioned with the mem* optimizations earlier in this thread.

> But in general to ensure unreachable code it's probably useful anyway
> to have this. How about calling it __bpf_build_error() or maybe even
> __bpf_unreachable()?

I think the __bpf_unreachable() sounds best to me, will use that.

>>> In order for it to work as required form the use-case, the _Static_assert() must not trigger
>>> here given the path is unreachable and will be optimized away. I'll add a comment to the
>>> __throw_build_bug() helper. Given libbpf we should probably also prefix with bpf_. If you see
>>> a better name that would fit, pls let me know.
>>>
>>>>     [0] https://github.com/cilium/cilium/blob/master/bpf/include/bpf/builtins.h
>>> Thanks,
>>> Daniel
>>


^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, back to index

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-24 18:21 [PATCH bpf-next 0/6] Various BPF helper improvements Daniel Borkmann
2020-09-24 18:21 ` [PATCH bpf-next 1/6] bpf: add classid helper only based on skb->sk Daniel Borkmann
2020-09-25 14:46   ` Jakub Kicinski
2020-09-25 15:35     ` Daniel Borkmann
2020-09-24 18:21 ` [PATCH bpf-next 2/6] bpf, net: rework cookie generator as per-cpu one Daniel Borkmann
2020-09-24 18:58   ` Eric Dumazet
2020-09-24 22:03     ` Daniel Borkmann
2020-09-25  7:49       ` Eric Dumazet
2020-09-25  9:26         ` Daniel Borkmann
2020-09-25 15:00       ` Jakub Kicinski
2020-09-25 15:15         ` Eric Dumazet
2020-09-25 15:31           ` Jakub Kicinski
2020-09-25 15:45             ` Eric Dumazet
2020-09-24 18:21 ` [PATCH bpf-next 3/6] bpf: add redirect_neigh helper as redirect drop-in Daniel Borkmann
2020-09-24 22:12   ` David Ahern
2020-09-24 22:19     ` Daniel Borkmann
2020-09-24 18:21 ` [PATCH bpf-next 4/6] bpf, libbpf: add bpf_tail_call_static helper for bpf programs Daniel Borkmann
2020-09-24 20:53   ` Andrii Nakryiko
2020-09-24 22:17     ` Daniel Borkmann
2020-09-25 15:42       ` Daniel Borkmann
2020-09-25 15:52         ` Daniel Borkmann
2020-09-25 16:50           ` Andrii Nakryiko
2020-09-25 19:52             ` Daniel Borkmann
2020-09-25  6:13   ` Yonghong Song
2020-09-24 18:21 ` [PATCH bpf-next 5/6] bpf, selftests: use bpf_tail_call_static where appropriate Daniel Borkmann
2020-09-24 19:25   ` Maciej Fijalkowski
2020-09-24 22:03     ` Daniel Borkmann
2020-09-24 18:21 ` [PATCH bpf-next 6/6] bpf, selftests: add redirect_neigh selftest Daniel Borkmann

BPF Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/bpf/0 bpf/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 bpf bpf/ https://lore.kernel.org/bpf \
		bpf@vger.kernel.org
	public-inbox-index bpf

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.bpf


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git