* [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation
@ 2016-11-01  0:37 Thomas Graf
  2016-11-01  0:37 ` [PATCH net-next v2 1/5] route: Set orig_output when redirecting to lwt on locally generated traffic Thomas Graf
                   ` (6 more replies)
  0 siblings, 7 replies; 36+ messages in thread
From: Thomas Graf @ 2016-11-01  0:37 UTC (permalink / raw)
  To: davem; +Cc: alexei.starovoitov, daniel, tom, roopa, netdev

{Open question:
 Tom brought up the question on whether it is safe to modify the packet
 in arbitrary ways before dst_output(). This is equivalent to a raw
 socket injecting illegal headers. This v2 currently assumes that
 dst_output() is ready to accept invalid header values. This needs to be
 verified and if not the case, then raw sockets or dst_output() handlers
 must be fixed as well. Another option is to mark lwtunnel_output() as
 read-only for now.}

This series implements BPF program invocation from dst entries via the
lightweight tunnels infrastructure. The BPF program can be attached to
lwtunnel_input(), lwtunnel_output() or lwtunnel_xmit() and sees an L3
skb as context. The input hook is read-only; the output hook can write;
the xmit hook can write, push headers, and redirect.

Motivation for this work:
 - Restricting outgoing routes beyond what the route tuple supports
 - Per route accounting beyond realms
 - Fast attachment of L2 headers where header does not require resolving
   L2 addresses
 - ILA-like use cases where L3 addresses are resolved and then routed
   in an async manner
 - Fast encapsulation + redirect. For now limited to use cases where not
   setting inner and outer offset/protocol is OK.

A couple of samples on how to use it can be found in patch 04.
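For illustration, attaching such programs to a route could look roughly
like the following via iproute2. This is a sketch only: it assumes a
matching `encap bpf` keyword in iproute2 (which is not part of this
series), and the file, section, and device names are made up:

```shell
# Compile a BPF program to an ELF object (hypothetical source/section names)
clang -O2 -target bpf -c lwt_prog.c -o lwt_prog.o

# Attach at the output hook of a route (read/write, no redirect)
ip route add 192.168.253.0/24 encap bpf out obj lwt_prog.o section out dev veth0

# Attach at the xmit hook, where header push and redirect are allowed
ip route add 192.168.254.0/24 encap bpf xmit obj lwt_prog.o section xmit dev veth0
```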

v1 -> v2:
 - Added new BPF_LWT_REROUTE return code for the program to indicate
   that a new route lookup should be performed. Suggested by Tom.
 - New sample to illustrate rerouting
 - New patch 05: Recursion limit for lwtunnel_output for the case
   when user creates circular dst redirection. Also resolves the
   issue for ILA.
 - Fix to ensure headroom for potential future L2 header is still
   guaranteed

Thomas Graf (5):
  route: Set orig_output when redirecting to lwt on locally generated
    traffic
  route: Set lwtstate for local traffic and cached input dsts
  bpf: BPF for lightweight tunnel encapsulation
  bpf: Add samples for LWT-BPF
  lwtunnel: Limit number of recursions on output to 5

 include/linux/filter.h        |   2 +-
 include/uapi/linux/bpf.h      |  37 +++-
 include/uapi/linux/lwtunnel.h |  21 ++
 kernel/bpf/verifier.c         |  16 +-
 net/Kconfig                   |   1 +
 net/core/Makefile             |   2 +-
 net/core/filter.c             | 148 ++++++++++++-
 net/core/lwt_bpf.c            | 504 ++++++++++++++++++++++++++++++++++++++++++
 net/core/lwtunnel.c           |  15 +-
 net/ipv4/route.c              |  37 +++-
 samples/bpf/bpf_helpers.h     |   4 +
 samples/bpf/lwt_bpf.c         | 235 ++++++++++++++++++++
 samples/bpf/test_lwt_bpf.sh   | 370 +++++++++++++++++++++++++++++++
 13 files changed, 1373 insertions(+), 19 deletions(-)
 create mode 100644 net/core/lwt_bpf.c
 create mode 100644 samples/bpf/lwt_bpf.c
 create mode 100755 samples/bpf/test_lwt_bpf.sh

-- 
2.7.4


* [PATCH net-next v2 1/5] route: Set orig_output when redirecting to lwt on locally generated traffic
  2016-11-01  0:37 [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation Thomas Graf
@ 2016-11-01  0:37 ` Thomas Graf
  2016-11-01  0:37 ` [PATCH net-next v2 2/5] route: Set lwtstate for local traffic and cached input dsts Thomas Graf
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 36+ messages in thread
From: Thomas Graf @ 2016-11-01  0:37 UTC (permalink / raw)
  To: davem; +Cc: alexei.starovoitov, daniel, tom, roopa, netdev

orig_output for IPv4 was only set for dsts which hit an input route.
Set it consistently for locally generated traffic as well to allow
lwt to continue the dst_output() path as configured by the nexthop.

Fixes: 2536862311d ("lwt: Add support to redirect dst.input")
Signed-off-by: Thomas Graf <tgraf@suug.ch>
---
 net/ipv4/route.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 62d4d90..7da886e 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2138,8 +2138,10 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
 	}
 
 	rt_set_nexthop(rth, fl4->daddr, res, fnhe, fi, type, 0);
-	if (lwtunnel_output_redirect(rth->dst.lwtstate))
+	if (lwtunnel_output_redirect(rth->dst.lwtstate)) {
+		rth->dst.lwtstate->orig_output = rth->dst.output;
 		rth->dst.output = lwtunnel_output;
+	}
 
 	return rth;
 }
-- 
2.7.4


* [PATCH net-next v2 2/5] route: Set lwtstate for local traffic and cached input dsts
  2016-11-01  0:37 [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation Thomas Graf
  2016-11-01  0:37 ` [PATCH net-next v2 1/5] route: Set orig_output when redirecting to lwt on locally generated traffic Thomas Graf
@ 2016-11-01  0:37 ` Thomas Graf
  2016-11-01  0:37 ` [PATCH net-next v2 3/5] bpf: BPF for lightweight tunnel encapsulation Thomas Graf
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 36+ messages in thread
From: Thomas Graf @ 2016-11-01  0:37 UTC (permalink / raw)
  To: davem; +Cc: alexei.starovoitov, daniel, tom, roopa, netdev

A route on the output path hitting a RTN_LOCAL route will keep the dst
associated on its way through the loopback device. On the receive path,
the dst_input() call will thus invoke the input handler of the route
created in the output path. Hence, lwt redirection for input must be
done for dsts allocated in the output path as well.

Also, if a route is cached in the input path, the allocated dst should
respect lwtunnel configuration on the nexthop as well.

Signed-off-by: Thomas Graf <tgraf@suug.ch>
---
 net/ipv4/route.c | 39 ++++++++++++++++++++++++++-------------
 1 file changed, 26 insertions(+), 13 deletions(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 7da886e..44f5403 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1596,6 +1596,19 @@ static void ip_del_fnhe(struct fib_nh *nh, __be32 daddr)
 	spin_unlock_bh(&fnhe_lock);
 }
 
+static void set_lwt_redirect(struct rtable *rth)
+{
+	if (lwtunnel_output_redirect(rth->dst.lwtstate)) {
+		rth->dst.lwtstate->orig_output = rth->dst.output;
+		rth->dst.output = lwtunnel_output;
+	}
+
+	if (lwtunnel_input_redirect(rth->dst.lwtstate)) {
+		rth->dst.lwtstate->orig_input = rth->dst.input;
+		rth->dst.input = lwtunnel_input;
+	}
+}
+
 /* called in rcu_read_lock() section */
 static int __mkroute_input(struct sk_buff *skb,
 			   const struct fib_result *res,
@@ -1685,14 +1698,7 @@ static int __mkroute_input(struct sk_buff *skb,
 	rth->dst.input = ip_forward;
 
 	rt_set_nexthop(rth, daddr, res, fnhe, res->fi, res->type, itag);
-	if (lwtunnel_output_redirect(rth->dst.lwtstate)) {
-		rth->dst.lwtstate->orig_output = rth->dst.output;
-		rth->dst.output = lwtunnel_output;
-	}
-	if (lwtunnel_input_redirect(rth->dst.lwtstate)) {
-		rth->dst.lwtstate->orig_input = rth->dst.input;
-		rth->dst.input = lwtunnel_input;
-	}
+	set_lwt_redirect(rth);
 	skb_dst_set(skb, &rth->dst);
 out:
 	err = 0;
@@ -1919,8 +1925,18 @@ out:	return err;
 		rth->dst.error= -err;
 		rth->rt_flags 	&= ~RTCF_LOCAL;
 	}
+
 	if (do_cache) {
-		if (unlikely(!rt_cache_route(&FIB_RES_NH(res), rth))) {
+		struct fib_nh *nh = &FIB_RES_NH(res);
+
+		rth->dst.lwtstate = lwtstate_get(nh->nh_lwtstate);
+		if (lwtunnel_input_redirect(rth->dst.lwtstate)) {
+			WARN_ON(rth->dst.input == lwtunnel_input);
+			rth->dst.lwtstate->orig_input = rth->dst.input;
+			rth->dst.input = lwtunnel_input;
+		}
+
+		if (unlikely(!rt_cache_route(nh, rth))) {
 			rth->dst.flags |= DST_NOCACHE;
 			rt_add_uncached_list(rth);
 		}
@@ -2138,10 +2154,7 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
 	}
 
 	rt_set_nexthop(rth, fl4->daddr, res, fnhe, fi, type, 0);
-	if (lwtunnel_output_redirect(rth->dst.lwtstate)) {
-		rth->dst.lwtstate->orig_output = rth->dst.output;
-		rth->dst.output = lwtunnel_output;
-	}
+	set_lwt_redirect(rth);
 
 	return rth;
 }
-- 
2.7.4


* [PATCH net-next v2 3/5] bpf: BPF for lightweight tunnel encapsulation
  2016-11-01  0:37 [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation Thomas Graf
  2016-11-01  0:37 ` [PATCH net-next v2 1/5] route: Set orig_output when redirecting to lwt on locally generated traffic Thomas Graf
  2016-11-01  0:37 ` [PATCH net-next v2 2/5] route: Set lwtstate for local traffic and cached input dsts Thomas Graf
@ 2016-11-01  0:37 ` Thomas Graf
  2016-11-01  1:10   ` kbuild test robot
                     ` (3 more replies)
  2016-11-01  0:37 ` [PATCH net-next v2 4/5] bpf: Add samples for LWT-BPF Thomas Graf
                   ` (3 subsequent siblings)
  6 siblings, 4 replies; 36+ messages in thread
From: Thomas Graf @ 2016-11-01  0:37 UTC (permalink / raw)
  To: davem; +Cc: alexei.starovoitov, daniel, tom, roopa, netdev

Register two new BPF prog types BPF_PROG_TYPE_LWT_IN and
BPF_PROG_TYPE_LWT_OUT which are invoked if a route contains a
LWT redirection of type LWTUNNEL_ENCAP_BPF.

The separate program types are required because manipulation of
packet data is only allowed on the output and transmit path as
the subsequent dst_input() call path assumes an IP header
validated by ip_rcv(). The BPF programs will be handed an skb
with the L3 header attached and may return one of the following
return codes:

 BPF_OK - Continue routing as per nexthop
 BPF_DROP - Drop skb and return EPERM
 BPF_REDIRECT - Redirect skb to device as per redirect() helper.
                (Only valid on lwtunnel_xmit() hook)

The return codes are binary compatible with their TC_ACT_*
counterparts to ease migration of existing TC programs.

A new helper bpf_skb_push() is added which allows prepending an
L2 header in front of the skb, extending the existing L3 header, or
both. This addresses a wide range of use cases:
 - Optimize L2 header construction when L2 information is always
   static to avoid ARP/NDisc lookup.
 - Extend IP header to add additional IP options.
 - Perform simple encapsulation where offload is of no concern.
   (The existing functionality of attaching a tunnel key to the skb
    and redirecting to a tunnel net_device to allow for offload
    obviously continues to work.)

Signed-off-by: Thomas Graf <tgraf@suug.ch>
---
 include/linux/filter.h        |   2 +-
 include/uapi/linux/bpf.h      |  37 +++-
 include/uapi/linux/lwtunnel.h |  21 ++
 kernel/bpf/verifier.c         |  16 +-
 net/Kconfig                   |   1 +
 net/core/Makefile             |   2 +-
 net/core/filter.c             | 148 ++++++++++++-
 net/core/lwt_bpf.c            | 504 ++++++++++++++++++++++++++++++++++++++++++
 net/core/lwtunnel.c           |   1 +
 9 files changed, 725 insertions(+), 7 deletions(-)
 create mode 100644 net/core/lwt_bpf.c

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 1f09c52..aad7f81 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -438,7 +438,7 @@ struct xdp_buff {
 };
 
 /* compute the linear packet data range [data, data_end) which
- * will be accessed by cls_bpf and act_bpf programs
+ * will be accessed by cls_bpf, act_bpf and lwt programs
  */
 static inline void bpf_compute_data_end(struct sk_buff *skb)
 {
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index e2f38e0..c034a2d 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -96,6 +96,9 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_TRACEPOINT,
 	BPF_PROG_TYPE_XDP,
 	BPF_PROG_TYPE_PERF_EVENT,
+	BPF_PROG_TYPE_LWT_IN,
+	BPF_PROG_TYPE_LWT_OUT,
+	BPF_PROG_TYPE_LWT_XMIT,
 };
 
 #define BPF_PSEUDO_MAP_FD	1
@@ -383,6 +386,16 @@ union bpf_attr {
  *
  * int bpf_get_numa_node_id()
  *     Return: Id of current NUMA node.
+ *
+ * int bpf_skb_push()
+ *     Add room to the beginning of the skb and adjust the MAC header offset.
+ *     Extends/reallocates the needed skb headroom automatically.
+ *     May change skb data pointer and will thus invalidate any check done
+ *     for direct packet access.
+ *     @skb: pointer to skb
+ *     @len: length of header to be pushed in front
+ *     @flags: Flags (unused for now)
+ *     Return: 0 on success or negative error
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -427,7 +440,8 @@ union bpf_attr {
 	FN(skb_pull_data),		\
 	FN(csum_update),		\
 	FN(set_hash_invalid),		\
-	FN(get_numa_node_id),
+	FN(get_numa_node_id),		\
+	FN(skb_push),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -511,6 +525,27 @@ struct bpf_tunnel_key {
 	__u32 tunnel_label;
 };
 
+/* Generic BPF return codes which all BPF program types may support.
+ * The values are binary compatible with their TC_ACT_* counter-part to
+ * provide backwards compatibility with existing SCHED_CLS and SCHED_ACT
+ * programs.
+ *
+ * XDP is handled separately, see XDP_*.
+ */
+enum bpf_ret_code {
+	BPF_OK = 0,
+	/* 1 reserved */
+	BPF_DROP = 2,
+	/* 3-6 reserved */
+	BPF_REDIRECT = 7,
+	/* >127 are reserved for prog type specific return codes */
+};
+
+/* LWT specific return codes */
+enum bpf_lwt_ret_code {
+	BPF_LWT_REROUTE = 128,
+};
+
 /* User return codes for XDP prog type.
  * A valid XDP program must return one of these defined values. All other
  * return codes are reserved for future use. Unknown return codes will result
diff --git a/include/uapi/linux/lwtunnel.h b/include/uapi/linux/lwtunnel.h
index a478fe8..9354d997 100644
--- a/include/uapi/linux/lwtunnel.h
+++ b/include/uapi/linux/lwtunnel.h
@@ -9,6 +9,7 @@ enum lwtunnel_encap_types {
 	LWTUNNEL_ENCAP_IP,
 	LWTUNNEL_ENCAP_ILA,
 	LWTUNNEL_ENCAP_IP6,
+	LWTUNNEL_ENCAP_BPF,
 	__LWTUNNEL_ENCAP_MAX,
 };
 
@@ -42,4 +43,24 @@ enum lwtunnel_ip6_t {
 
 #define LWTUNNEL_IP6_MAX (__LWTUNNEL_IP6_MAX - 1)
 
+enum {
+	LWT_BPF_PROG_UNSPEC,
+	LWT_BPF_PROG_FD,
+	LWT_BPF_PROG_NAME,
+	__LWT_BPF_PROG_MAX,
+};
+
+#define LWT_BPF_PROG_MAX (__LWT_BPF_PROG_MAX - 1)
+
+enum {
+	LWT_BPF_UNSPEC,
+	LWT_BPF_IN,
+	LWT_BPF_OUT,
+	LWT_BPF_XMIT,
+	__LWT_BPF_MAX,
+};
+
+#define LWT_BPF_MAX (__LWT_BPF_MAX - 1)
+
+
 #endif /* _UAPI_LWTUNNEL_H_ */
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 9002575..519b58e 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -633,12 +633,21 @@ static int check_map_access(struct bpf_verifier_env *env, u32 regno, int off,
 #define MAX_PACKET_OFF 0xffff
 
 static bool may_access_direct_pkt_data(struct bpf_verifier_env *env,
-				       const struct bpf_call_arg_meta *meta)
+				       const struct bpf_call_arg_meta *meta,
+				       enum bpf_access_type t)
 {
 	switch (env->prog->type) {
+	case BPF_PROG_TYPE_LWT_IN:
+		/* dst_input() can't write for now, orig_input may depend on
+		 * IP header parsed by ip_rcv().
+		 */
+		if (t == BPF_WRITE)
+			return false;
 	case BPF_PROG_TYPE_SCHED_CLS:
 	case BPF_PROG_TYPE_SCHED_ACT:
 	case BPF_PROG_TYPE_XDP:
+	case BPF_PROG_TYPE_LWT_OUT:
+	case BPF_PROG_TYPE_LWT_XMIT:
 		if (meta)
 			return meta->pkt_access;
 
@@ -837,7 +846,7 @@ static int check_mem_access(struct bpf_verifier_env *env, u32 regno, int off,
 			err = check_stack_read(state, off, size, value_regno);
 		}
 	} else if (state->regs[regno].type == PTR_TO_PACKET) {
-		if (t == BPF_WRITE && !may_access_direct_pkt_data(env, NULL)) {
+		if (t == BPF_WRITE && !may_access_direct_pkt_data(env, NULL, t)) {
 			verbose("cannot write into packet\n");
 			return -EACCES;
 		}
@@ -970,7 +979,8 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
 		return 0;
 	}
 
-	if (type == PTR_TO_PACKET && !may_access_direct_pkt_data(env, meta)) {
+	if (type == PTR_TO_PACKET &&
+	    !may_access_direct_pkt_data(env, meta, BPF_READ)) {
 		verbose("helper access to the packet is not allowed\n");
 		return -EACCES;
 	}
diff --git a/net/Kconfig b/net/Kconfig
index 7b6cd34..7554f12 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -396,6 +396,7 @@ source "net/nfc/Kconfig"
 
 config LWTUNNEL
 	bool "Network light weight tunnels"
+	depends on IPV6 || IPV6=n
 	---help---
 	  This feature provides an infrastructure to support light weight
 	  tunnels like mpls. There is no netdevice associated with a light
diff --git a/net/core/Makefile b/net/core/Makefile
index d6508c2..a675fd3 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -23,7 +23,7 @@ obj-$(CONFIG_NETWORK_PHY_TIMESTAMPING) += timestamping.o
 obj-$(CONFIG_NET_PTP_CLASSIFY) += ptp_classifier.o
 obj-$(CONFIG_CGROUP_NET_PRIO) += netprio_cgroup.o
 obj-$(CONFIG_CGROUP_NET_CLASSID) += netclassid_cgroup.o
-obj-$(CONFIG_LWTUNNEL) += lwtunnel.o
+obj-$(CONFIG_LWTUNNEL) += lwtunnel.o lwt_bpf.o
 obj-$(CONFIG_DST_CACHE) += dst_cache.o
 obj-$(CONFIG_HWBM) += hwbm.o
 obj-$(CONFIG_NET_DEVLINK) += devlink.o
diff --git a/net/core/filter.c b/net/core/filter.c
index cd9e2ba..325a9d8 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2138,6 +2138,43 @@ static const struct bpf_func_proto bpf_skb_change_tail_proto = {
 	.arg3_type	= ARG_ANYTHING,
 };
 
+BPF_CALL_3(bpf_skb_push, struct sk_buff *, skb, __u32, len, u64, flags)
+{
+	u32 new_len = skb->len + len;
+
+	/* restrict max skb size and check for overflow */
+	if (new_len > __bpf_skb_max_len(skb) || new_len < skb->len)
+		return -ERANGE;
+
+	if (flags)
+		return -EINVAL;
+
+	if (len > 0) {
+		int ret;
+
+		ret = skb_cow(skb, len);
+		if (unlikely(ret < 0))
+			return ret;
+
+		__skb_push(skb, len);
+		memset(skb->data, 0, len);
+	}
+
+	skb_reset_mac_header(skb);
+
+	bpf_compute_data_end(skb);
+	return 0;
+}
+
+static const struct bpf_func_proto bpf_skb_push_proto = {
+	.func		= bpf_skb_push,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_ANYTHING,
+	.arg3_type	= ARG_ANYTHING,
+};
+
 bool bpf_helper_changes_skb_data(void *func)
 {
 	if (func == bpf_skb_vlan_push ||
@@ -2147,7 +2184,8 @@ bool bpf_helper_changes_skb_data(void *func)
 	    func == bpf_skb_change_tail ||
 	    func == bpf_skb_pull_data ||
 	    func == bpf_l3_csum_replace ||
-	    func == bpf_l4_csum_replace)
+	    func == bpf_l4_csum_replace ||
+	    func == bpf_skb_push)
 		return true;
 
 	return false;
@@ -2578,6 +2616,75 @@ xdp_func_proto(enum bpf_func_id func_id)
 	}
 }
 
+static const struct bpf_func_proto *
+lwt_in_func_proto(enum bpf_func_id func_id)
+{
+	switch (func_id) {
+	case BPF_FUNC_skb_load_bytes:
+		return &bpf_skb_load_bytes_proto;
+	case BPF_FUNC_skb_pull_data:
+		return &bpf_skb_pull_data_proto;
+	case BPF_FUNC_csum_diff:
+		return &bpf_csum_diff_proto;
+	case BPF_FUNC_get_cgroup_classid:
+		return &bpf_get_cgroup_classid_proto;
+	case BPF_FUNC_get_route_realm:
+		return &bpf_get_route_realm_proto;
+	case BPF_FUNC_get_hash_recalc:
+		return &bpf_get_hash_recalc_proto;
+	case BPF_FUNC_perf_event_output:
+		return &bpf_skb_event_output_proto;
+	case BPF_FUNC_get_smp_processor_id:
+		return &bpf_get_smp_processor_id_proto;
+	case BPF_FUNC_skb_under_cgroup:
+		return &bpf_skb_under_cgroup_proto;
+	default:
+		return sk_filter_func_proto(func_id);
+	}
+}
+
+static const struct bpf_func_proto *
+lwt_out_func_proto(enum bpf_func_id func_id)
+{
+	switch (func_id) {
+	case BPF_FUNC_skb_store_bytes:
+		return &bpf_skb_store_bytes_proto;
+	case BPF_FUNC_csum_update:
+		return &bpf_csum_update_proto;
+	case BPF_FUNC_l3_csum_replace:
+		return &bpf_l3_csum_replace_proto;
+	case BPF_FUNC_l4_csum_replace:
+		return &bpf_l4_csum_replace_proto;
+	case BPF_FUNC_set_hash_invalid:
+		return &bpf_set_hash_invalid_proto;
+	default:
+		return lwt_in_func_proto(func_id);
+	}
+}
+
+static const struct bpf_func_proto *
+lwt_xmit_func_proto(enum bpf_func_id func_id)
+{
+	switch (func_id) {
+	case BPF_FUNC_skb_get_tunnel_key:
+		return &bpf_skb_get_tunnel_key_proto;
+	case BPF_FUNC_skb_set_tunnel_key:
+		return bpf_get_skb_set_tunnel_proto(func_id);
+	case BPF_FUNC_skb_get_tunnel_opt:
+		return &bpf_skb_get_tunnel_opt_proto;
+	case BPF_FUNC_skb_set_tunnel_opt:
+		return bpf_get_skb_set_tunnel_proto(func_id);
+	case BPF_FUNC_redirect:
+		return &bpf_redirect_proto;
+	case BPF_FUNC_skb_change_tail:
+		return &bpf_skb_change_tail_proto;
+	case BPF_FUNC_skb_push:
+		return &bpf_skb_push_proto;
+	default:
+		return lwt_out_func_proto(func_id);
+	}
+}
+
 static bool __is_valid_access(int off, int size, enum bpf_access_type type)
 {
 	if (off < 0 || off >= sizeof(struct __sk_buff))
@@ -2940,6 +3047,27 @@ static const struct bpf_verifier_ops xdp_ops = {
 	.convert_ctx_access	= xdp_convert_ctx_access,
 };
 
+static const struct bpf_verifier_ops lwt_in_ops = {
+	.get_func_proto		= lwt_in_func_proto,
+	.is_valid_access	= tc_cls_act_is_valid_access,
+	.convert_ctx_access	= sk_filter_convert_ctx_access,
+	.gen_prologue		= tc_cls_act_prologue,
+};
+
+static const struct bpf_verifier_ops lwt_out_ops = {
+	.get_func_proto		= lwt_out_func_proto,
+	.is_valid_access	= tc_cls_act_is_valid_access,
+	.convert_ctx_access	= sk_filter_convert_ctx_access,
+	.gen_prologue		= tc_cls_act_prologue,
+};
+
+static const struct bpf_verifier_ops lwt_xmit_ops = {
+	.get_func_proto		= lwt_xmit_func_proto,
+	.is_valid_access	= tc_cls_act_is_valid_access,
+	.convert_ctx_access	= sk_filter_convert_ctx_access,
+	.gen_prologue		= tc_cls_act_prologue,
+};
+
 static struct bpf_prog_type_list sk_filter_type __read_mostly = {
 	.ops	= &sk_filter_ops,
 	.type	= BPF_PROG_TYPE_SOCKET_FILTER,
@@ -2960,12 +3088,30 @@ static struct bpf_prog_type_list xdp_type __read_mostly = {
 	.type	= BPF_PROG_TYPE_XDP,
 };
 
+static struct bpf_prog_type_list lwt_in_type __read_mostly = {
+	.ops	= &lwt_in_ops,
+	.type	= BPF_PROG_TYPE_LWT_IN,
+};
+
+static struct bpf_prog_type_list lwt_out_type __read_mostly = {
+	.ops	= &lwt_out_ops,
+	.type	= BPF_PROG_TYPE_LWT_OUT,
+};
+
+static struct bpf_prog_type_list lwt_xmit_type __read_mostly = {
+	.ops	= &lwt_xmit_ops,
+	.type	= BPF_PROG_TYPE_LWT_XMIT,
+};
+
 static int __init register_sk_filter_ops(void)
 {
 	bpf_register_prog_type(&sk_filter_type);
 	bpf_register_prog_type(&sched_cls_type);
 	bpf_register_prog_type(&sched_act_type);
 	bpf_register_prog_type(&xdp_type);
+	bpf_register_prog_type(&lwt_in_type);
+	bpf_register_prog_type(&lwt_out_type);
+	bpf_register_prog_type(&lwt_xmit_type);
 
 	return 0;
 }
diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
new file mode 100644
index 0000000..c9b0494
--- /dev/null
+++ b/net/core/lwt_bpf.c
@@ -0,0 +1,504 @@
+/* Copyright (c) 2016 Thomas Graf <tgraf@tgraf.ch>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/skbuff.h>
+#include <linux/types.h>
+#include <linux/bpf.h>
+#include <net/lwtunnel.h>
+#include <net/dst_cache.h>
+#include <net/ip6_route.h>
+
+struct bpf_lwt_prog {
+	struct bpf_prog *prog;
+	char *name;
+};
+
+struct bpf_lwt {
+	struct bpf_lwt_prog in;
+	struct bpf_lwt_prog out;
+	struct bpf_lwt_prog xmit;
+	struct dst_cache dst_cache;
+	int family;
+};
+
+#define MAX_PROG_NAME 256
+
+static inline struct bpf_lwt *bpf_lwt_lwtunnel(struct lwtunnel_state *lwt)
+{
+	return (struct bpf_lwt *)lwt->data;
+}
+
+#define NO_REDIRECT false
+#define CAN_REDIRECT true
+
+static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
+		       struct dst_entry *dst, bool can_redirect)
+{
+	int ret;
+
+	/* Preempt disable is needed to protect per-cpu redirect_info between
+	 * BPF prog and skb_do_redirect(). The call_rcu in bpf_prog_put() and
+	 * access to maps strictly require a rcu_read_lock() for protection,
+	 * mixing with BH RCU lock doesn't work.
+	 */
+	preempt_disable();
+	rcu_read_lock();
+	bpf_compute_data_end(skb);
+	ret = BPF_PROG_RUN(lwt->prog, skb);
+	rcu_read_unlock();
+
+	switch (ret) {
+	case BPF_OK:
+		break;
+
+	case BPF_REDIRECT:
+		if (!can_redirect) {
+			WARN_ONCE(1, "Illegal redirect return code in prog %s\n",
+				  lwt->name ? : "<unknown>");
+			ret = BPF_OK;
+		} else {
+			ret = skb_do_redirect(skb);
+			if (ret == 0)
+				ret = BPF_REDIRECT;
+		}
+		break;
+
+	case BPF_DROP:
+		kfree_skb(skb);
+		ret = -EPERM;
+		break;
+
+	case BPF_LWT_REROUTE:
+		break;
+
+	default:
+		WARN_ONCE(1, "Illegal LWT BPF return value %u, expect packet loss\n",
+			  ret);
+		kfree_skb(skb);
+		ret = -EINVAL;
+		break;
+	}
+
+	preempt_enable();
+
+	return ret;
+}
+
+static int bpf_input(struct sk_buff *skb)
+{
+	struct dst_entry *dst = skb_dst(skb);
+	struct bpf_lwt *bpf;
+	int ret;
+
+	bpf = bpf_lwt_lwtunnel(dst->lwtstate);
+	if (bpf->in.prog) {
+		ret = run_lwt_bpf(skb, &bpf->in, dst, NO_REDIRECT);
+		if (ret < 0)
+			return ret;
+	}
+
+	if (unlikely(!dst->lwtstate->orig_input)) {
+		WARN_ONCE(1, "orig_input not set on dst for prog %s\n",
+			  bpf->out.name);
+		kfree_skb(skb);
+		return -EINVAL;
+	}
+
+	return dst->lwtstate->orig_input(skb);
+}
+
+#if IS_ENABLED(CONFIG_IPV6)
+static struct dst_entry *bpf_lwt_lookup6(struct net *net, struct sk_buff *skb,
+					 struct bpf_lwt *bpf)
+{
+	struct ipv6hdr *ip6h = ipv6_hdr(skb);
+	struct dst_entry *dst;
+	struct flowi6 fl6 = {
+		.daddr = ip6h->daddr,
+		.saddr = ip6h->saddr,
+		.flowlabel = ip6_flowinfo(ip6h),
+		.flowi6_mark = skb->mark,
+		.flowi6_proto = ip6h->nexthdr,
+		.flowi6_oif = skb->sk ? skb->sk->sk_bound_dev_if : 0,
+	};
+
+	dst = ip6_route_output(net, skb->sk, &fl6);
+	if (unlikely(dst->error)) {
+		int err = dst->error;
+		dst_release(dst);
+		return ERR_PTR(err);
+	}
+
+	dst = xfrm_lookup(net, dst, flowi6_to_flowi(&fl6), NULL, 0);
+	if (IS_ERR(dst))
+		return dst;
+
+	dst_cache_set_ip6(&bpf->dst_cache, dst, &fl6.saddr);
+
+	return dst;
+}
+#endif
+
+static struct dst_entry *bpf_lwt_lookup4(struct net *net, struct sk_buff *skb,
+					 struct bpf_lwt *bpf)
+{
+	struct iphdr *ip4 = ip_hdr(skb);
+	struct dst_entry *dst;
+	struct rtable *rt;
+	struct flowi4 fl4 = {
+		.flowi4_oif = skb->sk ? skb->sk->sk_bound_dev_if : 0,
+		.flowi4_mark = skb->mark,
+		.flowi4_proto = ip4->protocol,
+		.flowi4_tos = RT_TOS(ip4->tos),
+		.flowi4_flags = skb->sk ? inet_sk_flowi_flags(skb->sk) : 0,
+		.saddr = ip4->saddr,
+		.daddr = ip4->daddr,
+	};
+
+	rt = ip_route_output_key(net, &fl4);
+	if (IS_ERR(rt))
+		return ERR_CAST(rt);
+
+	dst = &rt->dst;
+	if (dst->error) {
+		int err = dst->error;
+		dst_release(dst);
+		return ERR_PTR(err);
+	}
+
+	dst = xfrm_lookup(net, dst, flowi4_to_flowi(&fl4), NULL, 0);
+	if (IS_ERR(dst))
+		return dst;
+
+	dst_cache_set_ip4(&bpf->dst_cache, dst, fl4.saddr);
+
+	return dst;
+}
+
+static int bpf_lwt_reroute(struct net *net, struct sk_buff *skb,
+			   struct bpf_lwt *bpf)
+{
+	struct dst_entry *dst;
+
+	dst = dst_cache_get(&bpf->dst_cache);
+	if (unlikely(!dst)) {
+		switch (bpf->family) {
+		case AF_INET:
+			dst = bpf_lwt_lookup4(net, skb, bpf);
+			break;
+#if IS_ENABLED(CONFIG_IPV6)
+		case AF_INET6:
+			dst = bpf_lwt_lookup6(net, skb, bpf);
+			break;
+#endif
+		default:
+			return -EAFNOSUPPORT;
+		}
+
+		if (IS_ERR(dst))
+			return PTR_ERR(dst);
+	}
+
+	skb_dst_drop(skb);
+	skb_dst_set(skb, dst);
+
+	return 0;
+}
+
+static int bpf_output(struct net *net, struct sock *sk, struct sk_buff *skb)
+{
+	struct dst_entry *dst = skb_dst(skb);
+	struct bpf_lwt *bpf;
+	int ret;
+
+	bpf = bpf_lwt_lwtunnel(dst->lwtstate);
+	if (bpf->out.prog) {
+		ret = run_lwt_bpf(skb, &bpf->out, dst, NO_REDIRECT);
+		if (ret < 0)
+			return ret;
+
+		if (ret == BPF_LWT_REROUTE) {
+			ret = bpf_lwt_reroute(net, skb, bpf);
+			if (ret < 0) {
+				kfree_skb(skb);
+				return ret;
+			}
+
+			return dst_output(net, sk, skb);
+		}
+	}
+
+	if (unlikely(!dst->lwtstate->orig_output)) {
+		WARN_ONCE(1, "orig_output not set on dst for prog %s\n",
+			  bpf->out.name);
+		kfree_skb(skb);
+		return -EINVAL;
+	}
+
+	return dst->lwtstate->orig_output(net, sk, skb);
+}
+
+static int xmit_check_hhlen(struct sk_buff *skb)
+{
+	int hh_len = skb_dst(skb)->dev->hard_header_len;
+
+	if (skb_headroom(skb) < hh_len) {
+		int nhead = HH_DATA_ALIGN(hh_len - skb_headroom(skb));
+
+		if (pskb_expand_head(skb, nhead, 0, GFP_ATOMIC))
+			return -ENOMEM;
+	}
+
+	return 0;
+}
+
+static int bpf_xmit(struct sk_buff *skb)
+{
+	struct dst_entry *dst = skb_dst(skb);
+	struct bpf_lwt *bpf;
+
+	bpf = bpf_lwt_lwtunnel(dst->lwtstate);
+	if (bpf->xmit.prog) {
+		int ret;
+
+		ret = run_lwt_bpf(skb, &bpf->xmit, dst, CAN_REDIRECT);
+		switch (ret) {
+		case BPF_OK:
+			/* If the L3 header was expanded, headroom might be too
+			 * small for L2 header now, expand as needed.
+			 */
+			ret = xmit_check_hhlen(skb);
+			if (unlikely(ret))
+				return ret;
+
+			return LWTUNNEL_XMIT_CONTINUE;
+		case BPF_REDIRECT:
+			return LWTUNNEL_XMIT_DONE;
+		default:
+			return ret;
+		}
+	}
+
+	return LWTUNNEL_XMIT_CONTINUE;
+}
+
+static void bpf_lwt_prog_destroy(struct bpf_lwt_prog *prog)
+{
+	if (prog->prog)
+		bpf_prog_put(prog->prog);
+
+	kfree(prog->name);
+}
+
+static void bpf_destroy_state(struct lwtunnel_state *lwt)
+{
+	struct bpf_lwt *bpf = bpf_lwt_lwtunnel(lwt);
+
+	dst_cache_destroy(&bpf->dst_cache);
+	bpf_lwt_prog_destroy(&bpf->in);
+	bpf_lwt_prog_destroy(&bpf->out);
+	bpf_lwt_prog_destroy(&bpf->xmit);
+}
+
+static const struct nla_policy bpf_prog_policy[LWT_BPF_PROG_MAX + 1] = {
+	[LWT_BPF_PROG_FD] = { .type = NLA_U32, },
+	[LWT_BPF_PROG_NAME] = { .type = NLA_NUL_STRING,
+				.len = MAX_PROG_NAME },
+};
+
+static int bpf_parse_prog(struct nlattr *attr, struct bpf_lwt_prog *prog,
+			  enum bpf_prog_type type)
+{
+	struct nlattr *tb[LWT_BPF_PROG_MAX + 1];
+	struct bpf_prog *p;
+	int ret;
+	u32 fd;
+
+	ret = nla_parse_nested(tb, LWT_BPF_PROG_MAX, attr, bpf_prog_policy);
+	if (ret < 0)
+		return ret;
+
+	if (!tb[LWT_BPF_PROG_FD] || !tb[LWT_BPF_PROG_NAME])
+		return -EINVAL;
+
+	prog->name = nla_memdup(tb[LWT_BPF_PROG_NAME], GFP_KERNEL);
+	if (!prog->name)
+		return -ENOMEM;
+
+	fd = nla_get_u32(tb[LWT_BPF_PROG_FD]);
+	p = bpf_prog_get_type(fd, type);
+	if (IS_ERR(p))
+		return PTR_ERR(p);
+
+	prog->prog = p;
+
+	return 0;
+}
+
+static const struct nla_policy bpf_nl_policy[LWT_BPF_MAX + 1] = {
+	[LWT_BPF_IN]   = { .type = NLA_NESTED, },
+	[LWT_BPF_OUT]  = { .type = NLA_NESTED, },
+	[LWT_BPF_XMIT] = { .type = NLA_NESTED, },
+};
+
+static int bpf_build_state(struct net_device *dev, struct nlattr *nla,
+			   unsigned int family, const void *cfg,
+			   struct lwtunnel_state **ts)
+{
+	struct nlattr *tb[LWT_BPF_MAX + 1];
+	struct lwtunnel_state *newts;
+	struct bpf_lwt *bpf;
+	int ret;
+
+	if (family != AF_INET && family != AF_INET6)
+		return -EAFNOSUPPORT;
+
+	ret = nla_parse_nested(tb, LWT_BPF_MAX, nla, bpf_nl_policy);
+	if (ret < 0)
+		return ret;
+
+	if (!tb[LWT_BPF_IN] && !tb[LWT_BPF_OUT] && !tb[LWT_BPF_XMIT])
+		return -EINVAL;
+
+	newts = lwtunnel_state_alloc(sizeof(*bpf));
+	if (!newts)
+		return -ENOMEM;
+
+	newts->type = LWTUNNEL_ENCAP_BPF;
+	bpf = bpf_lwt_lwtunnel(newts);
+
+	if (tb[LWT_BPF_IN]) {
+		newts->flags |= LWTUNNEL_STATE_INPUT_REDIRECT;
+		ret = bpf_parse_prog(tb[LWT_BPF_IN], &bpf->in,
+				     BPF_PROG_TYPE_LWT_IN);
+		if (ret  < 0)
+			goto errout;
+	}
+
+	if (tb[LWT_BPF_OUT]) {
+		newts->flags |= LWTUNNEL_STATE_OUTPUT_REDIRECT;
+		ret = bpf_parse_prog(tb[LWT_BPF_OUT], &bpf->out,
+				     BPF_PROG_TYPE_LWT_OUT);
+		if (ret < 0)
+			goto errout;
+	}
+
+	if (tb[LWT_BPF_XMIT]) {
+		newts->flags |= LWTUNNEL_STATE_XMIT_REDIRECT;
+		ret = bpf_parse_prog(tb[LWT_BPF_XMIT], &bpf->xmit,
+				     BPF_PROG_TYPE_LWT_XMIT);
+		if (ret < 0)
+			goto errout;
+	}
+
+	ret = dst_cache_init(&bpf->dst_cache, GFP_KERNEL);
+	if (ret)
+		goto errout;
+
+	bpf->family = family;
+	*ts = newts;
+
+	return 0;
+
+errout:
+	bpf_destroy_state(newts);
+	kfree(newts);
+	return ret;
+}
+
+static int bpf_fill_lwt_prog(struct sk_buff *skb, int attr,
+			     struct bpf_lwt_prog *prog)
+{
+	struct nlattr *nest;
+
+	if (!prog->prog)
+		return 0;
+
+	nest = nla_nest_start(skb, attr);
+	if (!nest)
+		return -EMSGSIZE;
+
+	if (prog->name &&
+	    nla_put_string(skb, LWT_BPF_PROG_NAME, prog->name))
+		return -EMSGSIZE;
+
+	return nla_nest_end(skb, nest);
+}
+
+static int bpf_fill_encap_info(struct sk_buff *skb, struct lwtunnel_state *lwt)
+{
+	struct bpf_lwt *bpf = bpf_lwt_lwtunnel(lwt);
+
+	if (bpf_fill_lwt_prog(skb, LWT_BPF_IN, &bpf->in) < 0 ||
+	    bpf_fill_lwt_prog(skb, LWT_BPF_OUT, &bpf->out) < 0 ||
+	    bpf_fill_lwt_prog(skb, LWT_BPF_XMIT, &bpf->xmit) < 0)
+		return -EMSGSIZE;
+
+	return 0;
+}
+
+static int bpf_encap_nlsize(struct lwtunnel_state *lwtstate)
+{
+	int nest_len = nla_total_size(sizeof(struct nlattr)) +
+		       nla_total_size(MAX_PROG_NAME) + /* LWT_BPF_PROG_NAME */
+		       0;
+
+	return nest_len + /* LWT_BPF_IN */
+	       nest_len + /* LWT_BPF_OUT */
+	       nest_len + /* LWT_BPF_XMIT */
+	       0;
+}
+
+static int bpf_lwt_prog_cmp(struct bpf_lwt_prog *a, struct bpf_lwt_prog *b)
+{
+	/* FIXME:
+	 * The LWT state is currently rebuilt for delete requests which
+	 * results in a new bpf_prog instance. Comparing names for now.
+	 */
+	if (!a->name && !b->name)
+		return 0;
+
+	if (!a->name || !b->name)
+		return 1;
+
+	return strcmp(a->name, b->name);
+}
+
+static int bpf_encap_cmp(struct lwtunnel_state *a, struct lwtunnel_state *b)
+{
+	struct bpf_lwt *a_bpf = bpf_lwt_lwtunnel(a);
+	struct bpf_lwt *b_bpf = bpf_lwt_lwtunnel(b);
+
+	return bpf_lwt_prog_cmp(&a_bpf->in, &b_bpf->in) ||
+	       bpf_lwt_prog_cmp(&a_bpf->out, &b_bpf->out) ||
+	       bpf_lwt_prog_cmp(&a_bpf->xmit, &b_bpf->xmit);
+}
+
+static const struct lwtunnel_encap_ops bpf_encap_ops = {
+	.build_state	= bpf_build_state,
+	.destroy_state	= bpf_destroy_state,
+	.input		= bpf_input,
+	.output		= bpf_output,
+	.xmit		= bpf_xmit,
+	.fill_encap	= bpf_fill_encap_info,
+	.get_encap_size = bpf_encap_nlsize,
+	.cmp_encap	= bpf_encap_cmp,
+};
+
+static int __init bpf_lwt_init(void)
+{
+	return lwtunnel_encap_add_ops(&bpf_encap_ops, LWTUNNEL_ENCAP_BPF);
+}
+
+subsys_initcall(bpf_lwt_init)
diff --git a/net/core/lwtunnel.c b/net/core/lwtunnel.c
index 88fd642..554d901 100644
--- a/net/core/lwtunnel.c
+++ b/net/core/lwtunnel.c
@@ -39,6 +39,7 @@ static const char *lwtunnel_encap_str(enum lwtunnel_encap_types encap_type)
 		return "MPLS";
 	case LWTUNNEL_ENCAP_ILA:
 		return "ILA";
+	case LWTUNNEL_ENCAP_BPF:
 	case LWTUNNEL_ENCAP_IP6:
 	case LWTUNNEL_ENCAP_IP:
 	case LWTUNNEL_ENCAP_NONE:
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH net-next v2 4/5] bpf: Add samples for LWT-BPF
  2016-11-01  0:37 [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation Thomas Graf
                   ` (2 preceding siblings ...)
  2016-11-01  0:37 ` [PATCH net-next v2 3/5] bpf: BPF for lightweight tunnel encapsulation Thomas Graf
@ 2016-11-01  0:37 ` Thomas Graf
  2016-11-01  0:37 ` [PATCH net-next v2 5/5] lwtunnel: Limit number of recursions on output to 5 Thomas Graf
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 36+ messages in thread
From: Thomas Graf @ 2016-11-01  0:37 UTC (permalink / raw)
  To: davem; +Cc: alexei.starovoitov, daniel, tom, roopa, netdev

This adds a set of samples demonstrating the use of lwt-bpf combined
with a shell script which allows running the samples in the form of
a basic selftest.

The samples include:
 - Allowing all packets
 - Dropping all packets
 - Printing context information
 - Access packet data
 - IPv4 daddr rewrite in dst_output()
 - L2 MAC header push + redirect in lwt xmit
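For reference, the samples are attached through iproute2's bpf encap type, mirroring what install_prog in the test script does. An illustrative transcript only — the device name veth0 and the zeroed -D defines are placeholders, and it requires root, a kernel with this series and a matching iproute2:

```sh
# Build the object (test_lwt_bpf.sh passes real -DSRC_MAC/-DDST_MAC/-DDST_IFINDEX values)
clang -O2 -target bpf -DSRC_MAC=0 -DDST_MAC=0 -DDST_IFINDEX=0 -c lwt_bpf.c -o lwt_bpf.o

# Attach the drop_all program on the output path of a /32 route
ip route add 192.168.254.2/32 encap bpf out obj lwt_bpf.o section drop_all dev veth0

# Program output lands in the trace buffer
cat /sys/kernel/debug/tracing/trace
```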

Signed-off-by: Thomas Graf <tgraf@suug.ch>
---
 samples/bpf/bpf_helpers.h   |   4 +
 samples/bpf/lwt_bpf.c       | 235 ++++++++++++++++++++++++++++
 samples/bpf/test_lwt_bpf.sh | 370 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 609 insertions(+)
 create mode 100644 samples/bpf/lwt_bpf.c
 create mode 100755 samples/bpf/test_lwt_bpf.sh

diff --git a/samples/bpf/bpf_helpers.h b/samples/bpf/bpf_helpers.h
index 90f44bd..f34e417 100644
--- a/samples/bpf/bpf_helpers.h
+++ b/samples/bpf/bpf_helpers.h
@@ -80,6 +80,8 @@ struct bpf_map_def {
 	unsigned int map_flags;
 };
 
+static int (*bpf_skb_load_bytes)(void *ctx, int off, void *to, int len) =
+	(void *) BPF_FUNC_skb_load_bytes;
 static int (*bpf_skb_store_bytes)(void *ctx, int off, void *from, int len, int flags) =
 	(void *) BPF_FUNC_skb_store_bytes;
 static int (*bpf_l3_csum_replace)(void *ctx, int off, int from, int to, int flags) =
@@ -88,6 +90,8 @@ static int (*bpf_l4_csum_replace)(void *ctx, int off, int from, int to, int flag
 	(void *) BPF_FUNC_l4_csum_replace;
 static int (*bpf_skb_under_cgroup)(void *ctx, void *map, int index) =
 	(void *) BPF_FUNC_skb_under_cgroup;
+static int (*bpf_skb_push)(void *, int len, int flags) =
+	(void *) BPF_FUNC_skb_push;
 
 #if defined(__x86_64__)
 
diff --git a/samples/bpf/lwt_bpf.c b/samples/bpf/lwt_bpf.c
new file mode 100644
index 0000000..fc86275
--- /dev/null
+++ b/samples/bpf/lwt_bpf.c
@@ -0,0 +1,235 @@
+/* Copyright (c) 2016 Thomas Graf <tgraf@tgraf.ch>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+
+#include <stdint.h>
+#include <stddef.h>
+#include <linux/bpf.h>
+#include <linux/ip.h>
+#include <linux/in.h>
+#include <linux/in6.h>
+#include <linux/tcp.h>
+#include <linux/udp.h>
+#include <linux/icmpv6.h>
+#include <linux/if_ether.h>
+#include "bpf_helpers.h"
+#include <string.h>
+
+# define printk(fmt, ...)						\
+		({							\
+			char ____fmt[] = fmt;				\
+			bpf_trace_printk(____fmt, sizeof(____fmt),	\
+				     ##__VA_ARGS__);			\
+		})
+
+#define CB_MAGIC 1234
+
+/* Let all packets pass */
+SEC("nop")
+int do_nop(struct __sk_buff *skb)
+{
+	return BPF_OK;
+}
+
+/* Print some context information per packet to tracing buffer.
+ */
+SEC("ctx_test")
+int do_ctx_test(struct __sk_buff *skb)
+{
+	skb->cb[0] = CB_MAGIC;
+	printk("len %d hash %d protocol %d\n", skb->len, skb->hash,
+	       skb->protocol);
+	printk("cb %d ingress_ifindex %d ifindex %d\n", skb->cb[0],
+	       skb->ingress_ifindex, skb->ifindex);
+
+	return BPF_OK;
+}
+
+/* Print content of skb->cb[] to tracing buffer */
+SEC("print_cb")
+int do_print_cb(struct __sk_buff *skb)
+{
+	printk("cb0: %x cb1: %x cb2: %x\n", skb->cb[0], skb->cb[1],
+	       skb->cb[2]);
+	printk("cb3: %x cb4: %x\n", skb->cb[3], skb->cb[4]);
+
+	return BPF_OK;
+}
+
+/* Print source and destination IPv4 address to tracing buffer */
+SEC("data_test")
+int do_data_test(struct __sk_buff *skb)
+{
+	void *data = (void *)(long)skb->data;
+	void *data_end = (void *)(long)skb->data_end;
+	struct iphdr *iph = data;
+
+	if (data + sizeof(*iph) > data_end) {
+		printk("packet truncated\n");
+		return BPF_DROP;
+	}
+
+	printk("src: %x dst: %x\n", iph->saddr, iph->daddr);
+
+	return BPF_OK;
+}
+
+#define IP_CSUM_OFF offsetof(struct iphdr, check)
+#define IP_DST_OFF offsetof(struct iphdr, daddr)
+#define IP_SRC_OFF offsetof(struct iphdr, saddr)
+#define IP_PROTO_OFF offsetof(struct iphdr, protocol)
+#define TCP_CSUM_OFF offsetof(struct tcphdr, check)
+#define UDP_CSUM_OFF offsetof(struct udphdr, check)
+#define IS_PSEUDO 0x10
+
+static inline int rewrite(struct __sk_buff *skb, uint32_t old_ip,
+			  uint32_t new_ip, int rw_daddr)
+{
+	int ret, off = 0, flags = IS_PSEUDO;
+	uint8_t proto;
+
+	ret = bpf_skb_load_bytes(skb, IP_PROTO_OFF, &proto, 1);
+	if (ret < 0) {
+		printk("bpf_skb_load_bytes failed: %d\n", ret);
+		return BPF_DROP;
+	}
+
+	switch (proto) {
+	case IPPROTO_TCP:
+		off = TCP_CSUM_OFF;
+		break;
+
+	case IPPROTO_UDP:
+		off = UDP_CSUM_OFF;
+		flags |= BPF_F_MARK_MANGLED_0;
+		break;
+
+	case IPPROTO_ICMPV6:
+		off = offsetof(struct icmp6hdr, icmp6_cksum);
+		break;
+	}
+
+	if (off) {
+		ret = bpf_l4_csum_replace(skb, off, old_ip, new_ip,
+					  flags | sizeof(new_ip));
+		if (ret < 0) {
+			printk("bpf_l4_csum_replace failed: %d\n", ret);
+			return BPF_DROP;
+		}
+	}
+
+	ret = bpf_l3_csum_replace(skb, IP_CSUM_OFF, old_ip, new_ip, sizeof(new_ip));
+	if (ret < 0) {
+		printk("bpf_l3_csum_replace failed: %d\n", ret);
+		return BPF_DROP;
+	}
+
+	if (rw_daddr)
+		ret = bpf_skb_store_bytes(skb, IP_DST_OFF, &new_ip, sizeof(new_ip), 0);
+	else
+		ret = bpf_skb_store_bytes(skb, IP_SRC_OFF, &new_ip, sizeof(new_ip), 0);
+
+	if (ret < 0) {
+		printk("bpf_skb_store_bytes() failed: %d\n", ret);
+		return BPF_DROP;
+	}
+
+	return BPF_OK;
+}
+
+/* Rewrite IPv4 destination address from 192.168.254.2 to 192.168.254.3 */
+SEC("rw_out")
+int do_rw_out(struct __sk_buff *skb)
+{
+	uint32_t old_ip, new_ip = 0x3fea8c0;
+	int ret;
+
+	ret = bpf_skb_load_bytes(skb, IP_DST_OFF, &old_ip, 4);
+	if (ret < 0) {
+		printk("bpf_skb_load_bytes failed: %d\n", ret);
+		return BPF_DROP;
+	}
+
+	if (old_ip == 0x2fea8c0) {
+		printk("out: rewriting from %x to %x\n", old_ip, new_ip);
+		return rewrite(skb, old_ip, new_ip, 1);
+	}
+
+	return BPF_OK;
+}
+
+/* Rewrite IPv4 destination address from 192.168.254.2 to 192.168.111.2 */
+SEC("rw_out_reroute")
+int do_rw_out_reroute(struct __sk_buff *skb)
+{
+	uint32_t old_ip, new_ip = 0x26fa8c0;
+	int ret;
+
+	ret = bpf_skb_load_bytes(skb, IP_DST_OFF, &old_ip, 4);
+	if (ret < 0) {
+		printk("bpf_skb_load_bytes failed: %d\n", ret);
+		return BPF_DROP;
+	}
+
+	if (old_ip == 0x2fea8c0) {
+		printk("out: rewriting from %x to %x\n", old_ip, new_ip);
+		ret = rewrite(skb, old_ip, new_ip, 1);
+		if (ret < 0)
+			return ret;
+
+		return BPF_LWT_REROUTE;
+	}
+
+	return BPF_OK;
+}
+
+SEC("redirect")
+int do_redirect(struct __sk_buff *skb)
+{
+	uint64_t smac = SRC_MAC, dmac = DST_MAC;
+	int ret, ifindex = DST_IFINDEX;
+	struct ethhdr ehdr;
+
+	ret = bpf_skb_push(skb, 14, 0);
+	if (ret < 0) {
+		printk("skb_push() failed: %d\n", ret);
+		return BPF_DROP;
+	}
+
+	ehdr.h_proto = __constant_htons(ETH_P_IP);
+	memcpy(&ehdr.h_source, &smac, 6);
+	memcpy(&ehdr.h_dest, &dmac, 6);
+
+	ret = bpf_skb_store_bytes(skb, 0, &ehdr, sizeof(ehdr), 0);
+	if (ret < 0) {
+		printk("skb_store_bytes() failed: %d\n", ret);
+		return BPF_DROP;
+	}
+
+	ret = bpf_redirect(ifindex, 0);
+	if (ret < 0) {
+		printk("bpf_redirect() failed: %d\n", ret);
+		return BPF_DROP;
+	}
+
+	printk("redirected to %d\n", ifindex);
+
+	return BPF_REDIRECT;
+}
+
+/* Drop all packets */
+SEC("drop_all")
+int do_drop_all(struct __sk_buff *skb)
+{
+	printk("dropping with: %d\n", BPF_DROP);
+	return BPF_DROP;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/test_lwt_bpf.sh b/samples/bpf/test_lwt_bpf.sh
new file mode 100755
index 0000000..6cbf96e
--- /dev/null
+++ b/samples/bpf/test_lwt_bpf.sh
@@ -0,0 +1,370 @@
+#!/bin/bash
+
+# Uncomment to see generated bytecode
+#VERBOSE=verbose
+
+NS1=lwt_ns1
+NS2=lwt_ns2
+VETH0=tst_lwt1a
+VETH1=tst_lwt1b
+VETH2=tst_lwt2a
+VETH3=tst_lwt2b
+IPVETH0="192.168.254.1"
+IPVETH1="192.168.254.2"
+IPVETH1b="192.168.254.3"
+IPVETH2="192.168.111.1"
+IPVETH3="192.168.111.2"
+IP_LOCAL="192.168.99.1"
+
+TRACE_ROOT=/sys/kernel/debug/tracing
+
+function hton_mac()
+{
+	MAC="${1//:/}"
+	echo "0x${MAC:10:2}${MAC:8:2}${MAC:6:2}${MAC:4:2}${MAC:2:2}${MAC:0:2}"
+}
+
+function lookup_mac()
+{
+	set +x
+	if [ ! -z "$2" ]; then
+		MAC=$(ip netns exec $2 ip link show $1 | grep ether | awk '{print $2}')
+	else
+		MAC=$(ip link show $1 | grep ether | awk '{print $2}')
+	fi
+	echo $(hton_mac $MAC)
+	set -x
+}
+
+function cleanup {
+        set +ex
+        rm lwt_bpf.o 2> /dev/null
+        ip link del $VETH0 2> /dev/null
+        ip link del $VETH1 2> /dev/null
+        ip link del $VETH2 2> /dev/null
+        ip link del $VETH3 2> /dev/null
+        ip netns delete $NS1 2> /dev/null
+        ip netns delete $NS2 2> /dev/null
+        set -ex
+}
+
+function setup_one_veth {
+        ip netns add $1
+
+        ip link add $2 type veth peer name $3
+
+        ip link set dev $2 up
+        ip addr add $4/24 dev $2
+
+        ip link set $3 netns $1
+        ip netns exec $1 ip link set dev $3 up
+        ip netns exec $1 ip addr add $5/24 dev $3
+
+	if [ "$6" ]; then
+		ip netns exec $1 ip addr add $6/32 dev $3
+	fi
+}
+
+function setup_veth {
+	setup_one_veth $NS1 $VETH0 $VETH1 $IPVETH0 $IPVETH1 $IPVETH1b
+	ip netns exec $NS1 ip route add 192.168.111.0/24 dev $VETH1
+	setup_one_veth $NS2 $VETH2 $VETH3 $IPVETH2 $IPVETH3
+	ip netns exec $NS2 ip route add 192.168.254.0/24 dev $VETH3
+
+        echo 1 > ${TRACE_ROOT}/tracing_on
+}
+
+function get_trace {
+	set +x
+        grep -v '^#' ${TRACE_ROOT}/trace
+	set -x
+}
+
+function install_prog {
+	ip route del ${IPVETH1}/32 dev $VETH0 2> /dev/null || true
+	ip route del table local local ${IP_LOCAL}/32 dev lo 2> /dev/null || true
+	cp /dev/null ${TRACE_ROOT}/trace
+
+	OPTS="encap bpf $1 obj lwt_bpf.o section $2 $VERBOSE"
+
+	if [ "$1" == "in" ];  then
+		ip route add table local local ${IP_LOCAL}/32 $OPTS dev lo
+	else
+		ip route add ${IPVETH1}/32 $OPTS dev $VETH0
+	fi
+}
+
+function remove_prog {
+	if [ "$1" == "in" ];  then
+		ip route del table local local ${IP_LOCAL}/32 dev lo
+	else
+		ip route del ${IPVETH1}/32 dev $VETH0
+	fi
+}
+
+function filter_trace {
+	# Add newline to allow starting EXPECT= variables on newline
+	NL=$'\n'
+	echo "${NL}$*" | sed -e 's/^.*: : //g'
+}
+
+function expect_fail {
+	set +x
+	echo "FAIL:"
+	echo "Expected: $1"
+	echo "Got: $2"
+	set -x
+	exit 1
+}
+
+function match_trace {
+	set +x
+	RET=0
+	TRACE=$1
+	EXPECT=$2
+	GOT="$(filter_trace "$TRACE")"
+
+	[ "$GOT" != "$EXPECT" ] && {
+		expect_fail "$EXPECT" "$GOT"
+		RET=1
+	}
+	set -x
+	return $RET
+}
+
+function test_start {
+	set +x
+	echo "----------------------------------------------------------------"
+	echo "Starting test: $*"
+	echo "----------------------------------------------------------------"
+	set -x
+}
+
+function failure {
+	get_trace
+	echo "FAIL: $*"
+	exit 1
+}
+
+function test_ctx_xmit {
+	test_start "test_ctx on lwt xmit"
+	install_prog xmit ctx_test
+	ping -c 3 $IPVETH1 || {
+		failure "test_ctx xmit: packets are dropped"
+	}
+	match_trace "$(get_trace)" "
+len 84 hash 0 protocol 8
+cb 1234 ingress_ifindex 0 ifindex $DST_IFINDEX
+len 84 hash 0 protocol 8
+cb 1234 ingress_ifindex 0 ifindex $DST_IFINDEX
+len 84 hash 0 protocol 8
+cb 1234 ingress_ifindex 0 ifindex $DST_IFINDEX" || exit 1
+	remove_prog xmit
+}
+
+function test_ctx_out {
+	test_start "test_ctx on lwt out"
+	install_prog out ctx_test
+	ping -c 3 $IPVETH1 || {
+		failure "test_ctx out: packets are dropped"
+	}
+	match_trace "$(get_trace)" "
+len 84 hash 0 protocol 0
+cb 1234 ingress_ifindex 0 ifindex 0
+len 84 hash 0 protocol 0
+cb 1234 ingress_ifindex 0 ifindex 0
+len 84 hash 0 protocol 0
+cb 1234 ingress_ifindex 0 ifindex 0" || exit 1
+	remove_prog out
+}
+
+function test_ctx_in {
+	test_start "test_ctx on lwt in"
+	install_prog in ctx_test
+	ping -c 3 $IP_LOCAL || {
+		failure "test_ctx in: packets are dropped"
+	}
+	# We will see both request & reply packets as the packets will
+	# be from $IP_LOCAL => $IP_LOCAL
+	match_trace "$(get_trace)" "
+len 84 hash 0 protocol 8
+cb 1234 ingress_ifindex 1 ifindex 1
+len 84 hash 0 protocol 8
+cb 1234 ingress_ifindex 1 ifindex 1
+len 84 hash 0 protocol 8
+cb 1234 ingress_ifindex 1 ifindex 1
+len 84 hash 0 protocol 8
+cb 1234 ingress_ifindex 1 ifindex 1
+len 84 hash 0 protocol 8
+cb 1234 ingress_ifindex 1 ifindex 1
+len 84 hash 0 protocol 8
+cb 1234 ingress_ifindex 1 ifindex 1" || exit 1
+	remove_prog in
+}
+
+function test_data {
+	test_start "test_data on lwt $1"
+	install_prog $1 data_test
+	ping -c 3 $IPVETH1 || {
+		failure "test_data ${1}: packets are dropped"
+	}
+	match_trace "$(get_trace)" "
+src: 1fea8c0 dst: 2fea8c0
+src: 1fea8c0 dst: 2fea8c0
+src: 1fea8c0 dst: 2fea8c0" || exit 1
+	remove_prog $1
+}
+
+function test_data_in {
+	test_start "test_data on lwt in"
+	install_prog in data_test
+	ping -c 3 $IP_LOCAL || {
+		failure "test_data in: packets are dropped"
+	}
+	# We will see both request & reply packets as the packets will
+	# be from $IP_LOCAL => $IP_LOCAL
+	match_trace "$(get_trace)" "
+src: 163a8c0 dst: 163a8c0
+src: 163a8c0 dst: 163a8c0
+src: 163a8c0 dst: 163a8c0
+src: 163a8c0 dst: 163a8c0
+src: 163a8c0 dst: 163a8c0
+src: 163a8c0 dst: 163a8c0" || exit 1
+	remove_prog in
+}
+
+function test_cb {
+	test_start "test_cb on lwt $1"
+	install_prog $1 print_cb
+	ping -c 3 $IPVETH1 || {
+		failure "test_cb ${1}: packets are dropped"
+	}
+	match_trace "$(get_trace)" "
+cb0: 0 cb1: 0 cb2: 0
+cb3: 0 cb4: 0
+cb0: 0 cb1: 0 cb2: 0
+cb3: 0 cb4: 0
+cb0: 0 cb1: 0 cb2: 0
+cb3: 0 cb4: 0" || exit 1
+	remove_prog $1
+}
+
+function test_cb_in {
+	test_start "test_cb on lwt in"
+	install_prog in print_cb
+	ping -c 3 $IP_LOCAL || {
+		failure "test_cb in: packets are dropped"
+	}
+	# We will see both request & reply packets as the packets will
+	# be from $IP_LOCAL => $IP_LOCAL
+	match_trace "$(get_trace)" "
+cb0: 0 cb1: 0 cb2: 0
+cb3: 0 cb4: 0
+cb0: 0 cb1: 0 cb2: 0
+cb3: 0 cb4: 0
+cb0: 0 cb1: 0 cb2: 0
+cb3: 0 cb4: 0
+cb0: 0 cb1: 0 cb2: 0
+cb3: 0 cb4: 0
+cb0: 0 cb1: 0 cb2: 0
+cb3: 0 cb4: 0
+cb0: 0 cb1: 0 cb2: 0
+cb3: 0 cb4: 0" || exit 1
+	remove_prog in
+}
+
+function test_drop_all {
+	test_start "test_drop_all on lwt $1"
+	install_prog $1 drop_all
+	ping -c 3 $IPVETH1 && {
+		failure "test_drop_all ${1}: Unexpected success of ping"
+	}
+	match_trace "$(get_trace)" "
+dropping with: 2
+dropping with: 2
+dropping with: 2" || exit 1
+	remove_prog $1
+}
+
+function test_drop_all_in {
+	test_start "test_drop_all on lwt in"
+	install_prog in drop_all
+	ping -c 3 $IP_LOCAL && {
+		failure "test_drop_all in: Unexpected success of ping"
+	}
+	match_trace "$(get_trace)" "
+dropping with: 2
+dropping with: 2
+dropping with: 2" || exit 1
+	remove_prog in
+}
+
+function test_redirect_xmit {
+	test_start "test_redirect on lwt xmit"
+	install_prog xmit redirect
+	ping -c 3 $IPVETH1 || {
+		failure "Redirected packets appear to be dropped"
+	}
+	match_trace "$(get_trace)" "
+redirected to $DST_IFINDEX
+redirected to $DST_IFINDEX
+redirected to $DST_IFINDEX" || exit 1
+	remove_prog xmit
+}
+
+function test_rw_out {
+	test_start "test_rw_out on lwt out"
+	install_prog out rw_out
+	ping -c 3 $IPVETH1 || {
+		failure "Rewritten packets appear to be dropped"
+	}
+	match_trace "$(get_trace)" "
+out: rewriting from 2fea8c0 to 3fea8c0
+out: rewriting from 2fea8c0 to 3fea8c0
+out: rewriting from 2fea8c0 to 3fea8c0" || exit 1
+	remove_prog out
+}
+
+function test_rw_out_reroute {
+	test_start "test_rw_out_reroute on lwt out"
+	install_prog out rw_out_reroute
+	ping -c 3 $IPVETH1 || {
+		failure "Rewritten packets appear to be dropped"
+	}
+	match_trace "$(get_trace)" "
+out: rewriting from 2fea8c0 to 26fa8c0
+out: rewriting from 2fea8c0 to 26fa8c0
+out: rewriting from 2fea8c0 to 26fa8c0" || exit 1
+	remove_prog out
+}
+
+cleanup
+setup_veth
+
+DST_MAC=$(lookup_mac $VETH1 $NS1)
+SRC_MAC=$(lookup_mac $VETH0)
+DST_IFINDEX=$(cat /sys/class/net/$VETH0/ifindex)
+
+CLANG_OPTS="-O2 -target bpf -I ../include/"
+CLANG_OPTS+=" -DSRC_MAC=$SRC_MAC -DDST_MAC=$DST_MAC -DDST_IFINDEX=$DST_IFINDEX"
+clang $CLANG_OPTS -c lwt_bpf.c -o lwt_bpf.o
+
+test_ctx_xmit
+test_ctx_out
+test_ctx_in
+test_data "xmit"
+test_data "out"
+test_data_in
+test_cb "xmit"
+test_cb "out"
+test_cb_in
+test_drop_all "xmit"
+test_drop_all "out"
+test_drop_all_in
+test_redirect_xmit
+test_rw_out
+test_rw_out_reroute
+
+cleanup
+echo 0 > ${TRACE_ROOT}/tracing_on
+exit 0
-- 
2.7.4
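The hex constants in lwt_bpf.c (0x2fea8c0, 0x3fea8c0, 0x26fa8c0) and the expected trace lines above are IPv4 addresses in network byte order, read as a host-order u32 on a little-endian machine. A small sketch of the mapping (the helper name is ours, in the same spirit as the script's hton_mac):

```shell
# Convert a dotted-quad IPv4 address to the hex form the samples
# print: the network-byte-order bytes read as a little-endian u32,
# so 192.168.254.2 (bytes c0 a8 fe 02) shows up as 2fea8c0.
ip_trace_hex() {
	old_ifs=$IFS
	IFS=.
	set -- $1
	IFS=$old_ifs
	printf '%x\n' $(( ($4 << 24) | ($3 << 16) | ($2 << 8) | $1 ))
}

ip_trace_hex 192.168.254.2   # 2fea8c0 (do_rw_out's old_ip)
ip_trace_hex 192.168.99.1    # 163a8c0 (test_data_in's expected trace)
```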

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH net-next v2 5/5] lwtunnel: Limit number of recursions on output to 5
  2016-11-01  0:37 [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation Thomas Graf
                   ` (3 preceding siblings ...)
  2016-11-01  0:37 ` [PATCH net-next v2 4/5] bpf: Add samples for LWT-BPF Thomas Graf
@ 2016-11-01  0:37 ` Thomas Graf
  2016-11-01  4:52   ` kbuild test robot
  2016-11-01 10:54 ` [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation Hannes Frederic Sowa
  2016-11-01 16:17 ` Tom Herbert
  6 siblings, 1 reply; 36+ messages in thread
From: Thomas Graf @ 2016-11-01  0:37 UTC (permalink / raw)
  To: davem; +Cc: alexei.starovoitov, daniel, tom, roopa, netdev

Signed-off-by: Thomas Graf <tgraf@suug.ch>
---
 net/core/lwtunnel.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/net/core/lwtunnel.c b/net/core/lwtunnel.c
index 554d901..6363d0b 100644
--- a/net/core/lwtunnel.c
+++ b/net/core/lwtunnel.c
@@ -231,6 +231,10 @@ int lwtunnel_cmp_encap(struct lwtunnel_state *a, struct lwtunnel_state *b)
 }
 EXPORT_SYMBOL(lwtunnel_cmp_encap);
 
+/* Per CPU recursion counter for dst_output() redirections via LWT */
+#define DST_RECURSION_LIMIT 5
+DEFINE_PER_CPU(int, dst_recursion);
+
 int lwtunnel_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 {
 	struct dst_entry *dst = skb_dst(skb);
@@ -246,11 +250,19 @@ int lwtunnel_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 	    lwtstate->type > LWTUNNEL_ENCAP_MAX)
 		return 0;
 
+	if (unlikely(__this_cpu_read(dst_recursion) > DST_RECURSION_LIMIT)) {
+		net_crit_ratelimited("lwt: recursion limit reached on redirected dst_output() calls\n");
+		return -EFAULT;
+	}
+
 	ret = -EOPNOTSUPP;
 	rcu_read_lock();
 	ops = rcu_dereference(lwtun_encaps[lwtstate->type]);
-	if (likely(ops && ops->output))
+	if (likely(ops && ops->output)) {
+		__this_cpu_inc(dst_recursion);
 		ret = ops->output(net, sk, skb);
+		__this_cpu_dec(dst_recursion);
+	}
 	rcu_read_unlock();
 
 	if (ret == -EOPNOTSUPP)
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next v2 3/5] bpf: BPF for lightweight tunnel encapsulation
  2016-11-01  0:37 ` [PATCH net-next v2 3/5] bpf: BPF for lightweight tunnel encapsulation Thomas Graf
@ 2016-11-01  1:10   ` kbuild test robot
  2016-11-01  2:39   ` kbuild test robot
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 36+ messages in thread
From: kbuild test robot @ 2016-11-01  1:10 UTC (permalink / raw)
  To: Thomas Graf
  Cc: kbuild-all, davem, alexei.starovoitov, daniel, tom, roopa, netdev

[-- Attachment #1: Type: text/plain, Size: 838 bytes --]

Hi Thomas,

[auto build test ERROR on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Thomas-Graf/bpf-BPF-for-lightweight-tunnel-encapsulation/20161101-084038
config: m68k-sun3_defconfig (attached as .config)
compiler: m68k-linux-gcc (GCC) 4.9.0
reproduce:
        wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=m68k 

All errors (new ones prefixed by >>):

   net/built-in.o: In function `bpf_output':
>> lwt_bpf.c:(.text+0x2cff4): undefined reference to `ip6_route_output_flags'

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 11508 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next v2 3/5] bpf: BPF for lightweight tunnel encapsulation
  2016-11-01  0:37 ` [PATCH net-next v2 3/5] bpf: BPF for lightweight tunnel encapsulation Thomas Graf
  2016-11-01  1:10   ` kbuild test robot
@ 2016-11-01  2:39   ` kbuild test robot
  2016-11-01 20:11   ` David Ahern
  2016-11-02 13:39   ` Roopa Prabhu
  3 siblings, 0 replies; 36+ messages in thread
From: kbuild test robot @ 2016-11-01  2:39 UTC (permalink / raw)
  To: Thomas Graf
  Cc: kbuild-all, davem, alexei.starovoitov, daniel, tom, roopa, netdev

[-- Attachment #1: Type: text/plain, Size: 1603 bytes --]

Hi Thomas,

[auto build test WARNING on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Thomas-Graf/bpf-BPF-for-lightweight-tunnel-encapsulation/20161101-084038
config: x86_64-randconfig-s0-11010954 (attached as .config)
compiler: gcc-4.4 (Debian 4.4.7-8) 4.4.7
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All warnings (new ones prefixed by >>):

   net/core/lwt_bpf.c: In function 'bpf_lwt_lookup6':
>> net/core/lwt_bpf.c:132: warning: initialized field with side-effects overwritten
   net/core/lwt_bpf.c:132: warning: (near initialization for 'fl6')

vim +132 net/core/lwt_bpf.c

   116		}
   117	
   118		return dst->lwtstate->orig_input(skb);
   119	}
   120	
   121	#if IS_ENABLED(CONFIG_IPV6)
   122	static struct dst_entry *bpf_lwt_lookup6(struct net *net, struct sk_buff *skb,
   123						 struct bpf_lwt *bpf)
   124	{
   125		struct ipv6hdr *ip6h = ipv6_hdr(skb);
   126		struct dst_entry *dst;
   127		struct flowi6 fl6 = {
   128			.daddr = ip6h->daddr,
   129			.saddr = ip6h->saddr,
   130			.flowlabel = ip6_flowinfo(ip6h),
   131			.flowi6_mark = skb->mark,
 > 132			.flowi6_proto = ip6h->nexthdr,
   133			.flowi6_oif = skb->sk ? skb->sk->sk_bound_dev_if : 0,
   134		};
   135	
   136		dst = ip6_route_output(net, skb->sk, &fl6);
   137		if (unlikely(dst->error)) {
   138			int err = dst->error;
   139			dst_release(dst);
   140			return ERR_PTR(err);

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 27506 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next v2 5/5] lwtunnel: Limit number of recursions on output to 5
  2016-11-01  0:37 ` [PATCH net-next v2 5/5] lwtunnel: Limit number of recursions on output to 5 Thomas Graf
@ 2016-11-01  4:52   ` kbuild test robot
  2016-11-01  8:09     ` Thomas Graf
  0 siblings, 1 reply; 36+ messages in thread
From: kbuild test robot @ 2016-11-01  4:52 UTC (permalink / raw)
  To: Thomas Graf
  Cc: kbuild-all, davem, alexei.starovoitov, daniel, tom, roopa, netdev

[-- Attachment #1: Type: text/plain, Size: 869 bytes --]

Hi Thomas,

[auto build test ERROR on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Thomas-Graf/bpf-BPF-for-lightweight-tunnel-encapsulation/20161101-084038
config: arm64-allmodconfig (attached as .config)
compiler: aarch64-linux-gnu-gcc (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
        wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=arm64 

All errors (new ones prefixed by >>):

   net/built-in.o: In function `bpf_output':
>> ncsi-manage.c:(.text+0x8e9f4): undefined reference to `ip6_route_output_flags'

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 52482 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next v2 5/5] lwtunnel: Limit number of recursions on output to 5
  2016-11-01  4:52   ` kbuild test robot
@ 2016-11-01  8:09     ` Thomas Graf
  0 siblings, 0 replies; 36+ messages in thread
From: Thomas Graf @ 2016-11-01  8:09 UTC (permalink / raw)
  To: netdev; +Cc: davem, alexei.starovoitov, daniel, tom, roopa

On 11/01/16 at 12:52pm, kbuild test robot wrote:
> Hi Thomas,
> 
> [auto build test ERROR on net-next/master]
> 
> url:    https://github.com/0day-ci/linux/commits/Thomas-Graf/bpf-BPF-for-lightweight-tunnel-encapsulation/20161101-084038
> config: arm64-allmodconfig (attached as .config)
> compiler: aarch64-linux-gnu-gcc (Debian 6.1.1-9) 6.1.1 20160705
> reproduce:
>         wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
>         chmod +x ~/bin/make.cross
>         # save the attached .config to linux build tree
>         make.cross ARCH=arm64 
> 
> All errors (new ones prefixed by >>):
> 
>    net/built-in.o: In function `bpf_output':
> >> ncsi-manage.c:(.text+0x8e9f4): undefined reference to `ip6_route_output_flags'

Needs
	depends on IPV6_MULTIPLE_TABLES || IPV6=n 

Compile testing with IPV6=y and IPV6_MULTIPLE_TABLES=n would have been
great. I'll submit a v3 after some time has passed to review the new
rerouting bits.
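The extra dependency would presumably go on the Kconfig option guarding net/core/lwt_bpf.c. A sketch only — the symbol name LWTUNNEL_BPF and its prompt text are assumptions, since the Kconfig hunk is not quoted in this thread:

```kconfig
config LWTUNNEL_BPF
	bool "Execute BPF program as route nexthop action"
	depends on LWTUNNEL
	# ip6_route_output_flags is unavailable otherwise, per the
	# build failures reported above
	depends on IPV6_MULTIPLE_TABLES || IPV6=n
```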

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation
  2016-11-01  0:37 [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation Thomas Graf
                   ` (4 preceding siblings ...)
  2016-11-01  0:37 ` [PATCH net-next v2 5/5] lwtunnel: Limit number of recursions on output to 5 Thomas Graf
@ 2016-11-01 10:54 ` Hannes Frederic Sowa
  2016-11-01 18:51   ` Thomas Graf
  2016-11-01 16:17 ` Tom Herbert
  6 siblings, 1 reply; 36+ messages in thread
From: Hannes Frederic Sowa @ 2016-11-01 10:54 UTC (permalink / raw)
  To: Thomas Graf, davem; +Cc: alexei.starovoitov, daniel, tom, roopa, netdev

Hi Thomas,

On 01.11.2016 01:37, Thomas Graf wrote:
> {Open question:
>  Tom brought up the question on whether it is safe to modify the packet
>  in arbitrary ways before dst_output(). This is the equivalent of a raw
>  socket injecting illegal headers. This v2 currently assumes that
>  dst_output() is ready to accept invalid header values. This needs to be
>  verified and if not the case, then raw sockets or dst_output() handlers
>  must be fixed as well. Another option is to mark lwtunnel_output() as
>  read-only for now.}
> 
> This series implements BPF program invocation from dst entries via the
> lightweight tunnels infrastructure. The BPF program can be attached to
> lwtunnel_input(), lwtunnel_output() or lwtunnel_xmit() and sees an L3
> skb as context. input is read-only, output can write, xmit can write,
> push headers, and redirect.
> 
> Motivation for this work:
>  - Restricting outgoing routes beyond what the route tuple supports
>  - Per route accounting beyond realms
>  - Fast attachment of L2 headers where header does not require resolving
>    L2 addresses
>  - ILA like uses cases where L3 addresses are resolved and then routed
>    in an async manner
>  - Fast encapsulation + redirect. For now limited to use cases where not
>    setting inner and outer offset/protocol is OK.
> 
> A couple of samples on how to use it can be found in patch 04.

I do fear the complexity and the loss of debuggability introduced by
this patch set quite a bit.

I wonder if architecturally it would be more feasible to add a generic
(bpf-)hook into dst_output(_sk) and allow arbitrary metadata to be
added into the dsts.

BPF could then be able to access parts of the metadata in the attached
metadata dst entries and performing the matching this way?

The reason I would prefer an approach like this: regardless of the
routing lookup we would process the skb with bpf or not. This gives a
single point to debug, whereas in your approach we first must figure
out the corresponding bpf program and then check for it specifically.

Thanks,
Hannes

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation
  2016-11-01  0:37 [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation Thomas Graf
                   ` (5 preceding siblings ...)
  2016-11-01 10:54 ` [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation Hannes Frederic Sowa
@ 2016-11-01 16:17 ` Tom Herbert
  2016-11-01 18:20   ` Thomas Graf
  6 siblings, 1 reply; 36+ messages in thread
From: Tom Herbert @ 2016-11-01 16:17 UTC (permalink / raw)
  To: Thomas Graf
  Cc: David S. Miller, Alexei Starovoitov, Daniel Borkmann, roopa,
	Linux Kernel Network Developers

On Mon, Oct 31, 2016 at 5:37 PM, Thomas Graf <tgraf@suug.ch> wrote:
> {Open question:
>  Tom brought up the question on whether it is safe to modify the packet
>  in arbitrary ways before dst_output(). This is the equivalent of a raw
>  socket injecting illegal headers. This v2 currently assumes that
>  dst_output() is ready to accept invalid header values. This needs to be
>  verified and if not the case, then raw sockets or dst_output() handlers
>  must be fixed as well. Another option is to mark lwtunnel_output() as
>  read-only for now.}
>
The question might not be so much about illegal headers but whether
fields in the skbuff related to the packet contents are kept correct.
We have protocol, header offsets, offsets for inner protocols also,
encapsulation settings, checksum status, checksum offset, checksum
complete value, vlan information. Any or all of these I believe could
become incorrect if we allow the packet to be
arbitrarily modified by BPF. This problem is different from raw
sockets because LWT operates in the middle of the stack, where the skbuff
has already been set up with such things.

> This series implements BPF program invocation from dst entries via the
> lightweight tunnels infrastructure. The BPF program can be attached to
> lwtunnel_input(), lwtunnel_output() or lwtunnel_xmit() and sees an L3
> skb as context. input is read-only, output can write, xmit can write,
> push headers, and redirect.
>
> Motivation for this work:
>  - Restricting outgoing routes beyond what the route tuple supports
>  - Per route accounting beyond realms
>  - Fast attachment of L2 headers where the header does not require
>    resolving L2 addresses
>  - ILA-like use cases where L3 addresses are resolved and then routed
>    in an async manner
>  - Fast encapsulation + redirect. For now limited to use cases where not
>    setting inner and outer offset/protocol is OK.
>
Is checksum offload supported? By default, at least for Linux, we
offload the outer UDP checksum in VXLAN and the other UDP
encapsulations for performance.

Tom

> A couple of samples on how to use it can be found in patch 04.
>
> v1 -> v2:
>  - Added new BPF_LWT_REROUTE return code for program to indicate
>    that new route lookup should be performed. Suggested by Tom.
>  - New sample to illustrate rerouting
>  - New patch 05: Recursion limit for lwtunnel_output for the case
>    when user creates circular dst redirection. Also resolves the
>    issue for ILA.
>  - Fix to ensure headroom for potential future L2 header is still
>    guaranteed
>
> Thomas Graf (5):
>   route: Set orig_output when redirecting to lwt on locally generated
>     traffic
>   route: Set lwtstate for local traffic and cached input dsts
>   bpf: BPF for lightweight tunnel encapsulation
>   bpf: Add samples for LWT-BPF
>   lwtunnel: Limit number of recursions on output to 5
>
>  include/linux/filter.h        |   2 +-
>  include/uapi/linux/bpf.h      |  37 +++-
>  include/uapi/linux/lwtunnel.h |  21 ++
>  kernel/bpf/verifier.c         |  16 +-
>  net/Kconfig                   |   1 +
>  net/core/Makefile             |   2 +-
>  net/core/filter.c             | 148 ++++++++++++-
>  net/core/lwt_bpf.c            | 504 ++++++++++++++++++++++++++++++++++++++++++
>  net/core/lwtunnel.c           |  15 +-
>  net/ipv4/route.c              |  37 +++-
>  samples/bpf/bpf_helpers.h     |   4 +
>  samples/bpf/lwt_bpf.c         | 235 ++++++++++++++++++++
>  samples/bpf/test_lwt_bpf.sh   | 370 +++++++++++++++++++++++++++++++
>  13 files changed, 1373 insertions(+), 19 deletions(-)
>  create mode 100644 net/core/lwt_bpf.c
>  create mode 100644 samples/bpf/lwt_bpf.c
>  create mode 100755 samples/bpf/test_lwt_bpf.sh
>
> --
> 2.7.4
>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation
  2016-11-01 16:17 ` Tom Herbert
@ 2016-11-01 18:20   ` Thomas Graf
  2016-11-01 18:51     ` Tom Herbert
  0 siblings, 1 reply; 36+ messages in thread
From: Thomas Graf @ 2016-11-01 18:20 UTC (permalink / raw)
  To: Tom Herbert
  Cc: David S. Miller, Alexei Starovoitov, Daniel Borkmann, roopa,
	Linux Kernel Network Developers

On 1 November 2016 at 09:17, Tom Herbert <tom@herbertland.com> wrote:
> On Mon, Oct 31, 2016 at 5:37 PM, Thomas Graf <tgraf@suug.ch> wrote:
>> {Open question:
>>  Tom brought up the question on whether it is safe to modify the packet
>>  in arbitrary ways before dst_output(). This is the equivalent of a raw
>>  socket injecting illegal headers. This v2 currently assumes that
>>  dst_output() is ready to accept invalid header values. This needs to be
>>  verified and if not the case, then raw sockets or dst_output() handlers
>>  must be fixed as well. Another option is to mark lwtunnel_output() as
>>  read-only for now.}
>>
> The question might not be so much about illegal headers but whether
> fields in the skbuff related to the packet contents are kept correct.
> We have protocol, header offsets, offsets for inner protocols also,
> encapsulation settings, checksum status, checksum offset, checksum

The headers cannot be extended or reduced so the offsets always remain
correct. What can happen is that the header contains invalid data.

> complete value, vlan information. Any or all of these I believe could
> become incorrect if we allow the packet to be
> arbitrarily modified by BPF. This problem is different from raw
> sockets because LWT operates in the middle of the stack, where the skbuff
> has already been set up with such things.

You keep saying this "middle in the stack" but the point is exactly
the same as a raw socket with IPPROTO_RAW and hdrincl, see
rawv6_sendmsg() and rawv6_send_hdrincl(). An IPv6 raw socket can feed
arbitrary garbage into dst_output(). IPv4 does some minimal sanity
checks.

If this is a concern I'm fine with making the dst_output path read-only for now.

>> This series implements BPF program invocation from dst entries via the
>> lightweight tunnels infrastructure. The BPF program can be attached to
>> lwtunnel_input(), lwtunnel_output() or lwtunnel_xmit() and sees an L3
>> skb as context. input is read-only, output can write, xmit can write,
>> push headers, and redirect.
>>
>> Motivation for this work:
>>  - Restricting outgoing routes beyond what the route tuple supports
>>  - Per route accounting beyond realms
>>  - Fast attachment of L2 headers where the header does not require
>>    resolving L2 addresses
>>  - ILA-like use cases where L3 addresses are resolved and then routed
>>    in an async manner
>>  - Fast encapsulation + redirect. For now limited to use cases where not
>>    setting inner and outer offset/protocol is OK.
>>
> Is checksum offload supported? By default, at least for Linux, we
> offload the outer UDP checksum in VXLAN and the other UDP
> encapsulations for performance.

No. UDP encap is done by setting a tunnel key through a helper and
letting the encapsulation device handle this. I don't currently see a
point in replicating all of that logic.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation
  2016-11-01 18:20   ` Thomas Graf
@ 2016-11-01 18:51     ` Tom Herbert
  2016-11-01 19:11       ` Thomas Graf
  0 siblings, 1 reply; 36+ messages in thread
From: Tom Herbert @ 2016-11-01 18:51 UTC (permalink / raw)
  To: Thomas Graf
  Cc: David S. Miller, Alexei Starovoitov, Daniel Borkmann, roopa,
	Linux Kernel Network Developers

On Tue, Nov 1, 2016 at 11:20 AM, Thomas Graf <tgraf@suug.ch> wrote:
> On 1 November 2016 at 09:17, Tom Herbert <tom@herbertland.com> wrote:
>> On Mon, Oct 31, 2016 at 5:37 PM, Thomas Graf <tgraf@suug.ch> wrote:
>>> {Open question:
>>>  Tom brought up the question on whether it is safe to modify the packet
>>>  in arbitrary ways before dst_output(). This is the equivalent of a raw
>>>  socket injecting illegal headers. This v2 currently assumes that
>>>  dst_output() is ready to accept invalid header values. This needs to be
>>>  verified and if not the case, then raw sockets or dst_output() handlers
>>>  must be fixed as well. Another option is to mark lwtunnel_output() as
>>>  read-only for now.}
>>>
>> The question might not be so much about illegal headers but whether
>> fields in the skbuff related to the packet contents are kept correct.
>> We have protocol, header offsets, offsets for inner protocols also,
>> encapsulation settings, checksum status, checksum offset, checksum
>
> The headers cannot be extended or reduced so the offsets always remain
> correct. What can happen is that the header contains invalid data.
>
If we can't add/remove headers then doesn't that really limit the
utility of these patches? My assumption was that BPF+LWT is needed to
allow users to define and implement their own encapsulations, EH
insertion, packet modification, etc.

>> complete value, vlan information. Any or all of these I believe could
>> become incorrect if we allow the packet to be
>> arbitrarily modified by BPF. This problem is different from raw
>> sockets because LWT operates in the middle of the stack, where the skbuff
>> has already been set up with such things.
>
> You keep saying this "middle in the stack" but the point is exactly
> the same as a raw socket with IPPROTO_RAW and hdrincl, see
> rawv6_sendmsg() and rawv6_send_hdrincl(). An IPv6 raw socket can feed
> arbitrary garbage into dst_output(). IPv4 does some minimal sanity
> checks.
>
What I mean is that an admin can create a BPF program that runs on any
user packets (for instance, it could be set on the default route). This
would be in the path of TCP, UDP, and other protocols tightly integrated
with the stack. Packets being routed may be encapsulated, carry VLAN
tags, have checksum offload or GSO set, etc. They also might be looped
back, in which case the settings in the skbuff become receive parameters.

> If this is a concern I'm fine with making the dst_output path read-only for now.
>
That might be good. The ramifications of allowing an open-ended
method for users to modify L3/L2 packets need more consideration.

>>> This series implements BPF program invocation from dst entries via the
>>> lightweight tunnels infrastructure. The BPF program can be attached to
>>> lwtunnel_input(), lwtunnel_output() or lwtunnel_xmit() and sees an L3
>>> skb as context. input is read-only, output can write, xmit can write,
>>> push headers, and redirect.
>>>
>>> Motivation for this work:
>>>  - Restricting outgoing routes beyond what the route tuple supports
>>>  - Per route accounting beyond realms
>>>  - Fast attachment of L2 headers where the header does not require
>>>    resolving L2 addresses
>>>  - ILA-like use cases where L3 addresses are resolved and then routed
>>>    in an async manner
>>>  - Fast encapsulation + redirect. For now limited to use cases where not
>>>    setting inner and outer offset/protocol is OK.
>>>
>> Is checksum offload supported? By default, at least for Linux, we
>> offload the outer UDP checksum in VXLAN and the other UDP
>> encapsulations for performance.
>
> No. UDP encap is done by setting a tunnel key through a helper and
> letting the encapsulation device handle this. I don't currently see a
> point in replicating all of that logic.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation
  2016-11-01 10:54 ` [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation Hannes Frederic Sowa
@ 2016-11-01 18:51   ` Thomas Graf
  2016-11-01 20:08     ` Hannes Frederic Sowa
  0 siblings, 1 reply; 36+ messages in thread
From: Thomas Graf @ 2016-11-01 18:51 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: David S. Miller, Alexei Starovoitov, Daniel Borkmann,
	Tom Herbert, roopa, netdev

On 1 November 2016 at 03:54, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> I do fear the complexity and debuggability issues introduced by this
> patch set quite a bit.

What is the complexity concern? This is pretty straightforward. I
agree on debuggability. This is being worked on separately, as Alexei
mentioned, to address this for all BPF integrations.

> I wonder if architecturally it would be more feasible to add a generic
> (bpf-)hook into dst_output(_sk) and allow arbitrary metadata to be
> added into the dsts.
>
> BPF could then be able to access parts of the metadata in the attached
> metadata dst entries and perform the matching this way?

If I understand you correctly then a single BPF program would be
loaded which then applies to all dst_output() calls? This has a huge
drawback, instead of multiple small BPF programs which do exactly what
is required per dst, a large BPF program is needed which matches on
metadata. That's way slower and renders one of the biggest advantages
of BPF invalid, the ability to generate a small program tailored to
a particular use. See Cilium.

> The reason why I would prefer an approach like this: regardless of the
> routing lookup we would process the skb with bpf or not. This gives a
> single point to debug, whereas in your approach we first must
> figure out the corresponding bpf program and then check for it specifically.

Not sure I see what kind of advantage this actually provides. You can
dump the routes and see which programs get invoked and which section.
If it's based on metadata then you need to know the program logic and
associate it with the metadata in the dst. It actually doesn't get
much easier than to debug one of the samples, they are completely
static once compiled and it's very simple to verify if they do what
they are supposed to do.

If you like the single program approach, feel free to load the same
program for every dst. Perfectly acceptable but I don't see why we
should force everybody to use that model.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation
  2016-11-01 18:51     ` Tom Herbert
@ 2016-11-01 19:11       ` Thomas Graf
  2016-11-01 19:27         ` Tom Herbert
  0 siblings, 1 reply; 36+ messages in thread
From: Thomas Graf @ 2016-11-01 19:11 UTC (permalink / raw)
  To: Tom Herbert
  Cc: David S. Miller, Alexei Starovoitov, Daniel Borkmann, roopa,
	Linux Kernel Network Developers

On 1 November 2016 at 11:51, Tom Herbert <tom@herbertland.com> wrote:
> On Tue, Nov 1, 2016 at 11:20 AM, Thomas Graf <tgraf@suug.ch> wrote:
>> The headers cannot be extended or reduced so the offsets always remain
>> correct. What can happen is that the header contains invalid data.
>>
> If we can't add/remove headers then doesn't that really limit the
> utility of these patches? My assumption was that BPF+LWT is needed to
> allow users to define and implement their own encapsulations, EH
> insertion, packet modification, etc.

I thought you were specifically referring to lwtunnel_output(). You
can extend headers at lwtunnel_xmit() but you can't for
lwtunnel_output() and lwtunnel_input(). The xmit hook is in
ip_finish_output2() where it is fine to do so. We have existing code
doing this there.

>> You keep saying this "middle in the stack" but the point is exactly
>> the same as a raw socket with IPPROTO_RAW and hdrincl, see
>> rawv6_sendmsg() and rawv6_send_hdrincl(). An IPv6 raw socket can feed
>> arbitrary garbage into dst_output(). IPv4 does some minimal sanity
>> checks.
>>
> What I mean is that an admin can create a BPF program that runs on any
> user packets (for instance, it could be set on the default route). This
> would be in the path of TCP, UDP, and other protocols tightly integrated
> with the stack. Packets being routed may be encapsulated, carry VLAN
> tags, have checksum offload or GSO set, etc. They also might be looped
> back, in which case the settings in the skbuff become receive parameters.

You can do the same with act_pedit or cls_bpf at dev_queue_xmit(),
before or after you go into the encapsulation device. This is a tool
for root; if you install a drop-all ssh rule then you won't be able to
log into the box anymore. If you modify the packet and render it
invalid, the packet will be dropped.
If you attach a VLAN header while VLAN offload is already set up, then
the hardware will add another VLAN header, this is what I would
expect. I truly hope that we don't have code which crashes if we
dev_queue_xmit() garbage into any device. That would be horrible.

Do you have a specific example that could be a problem?

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation
  2016-11-01 19:11       ` Thomas Graf
@ 2016-11-01 19:27         ` Tom Herbert
  2016-11-01 19:59           ` Thomas Graf
  0 siblings, 1 reply; 36+ messages in thread
From: Tom Herbert @ 2016-11-01 19:27 UTC (permalink / raw)
  To: Thomas Graf
  Cc: David S. Miller, Alexei Starovoitov, Daniel Borkmann, roopa,
	Linux Kernel Network Developers

On Tue, Nov 1, 2016 at 12:11 PM, Thomas Graf <tgraf@suug.ch> wrote:
> On 1 November 2016 at 11:51, Tom Herbert <tom@herbertland.com> wrote:
>> On Tue, Nov 1, 2016 at 11:20 AM, Thomas Graf <tgraf@suug.ch> wrote:
>>> The headers cannot be extended or reduced so the offsets always remain
>>> correct. What can happen is that the header contains invalid data.
>>>
>> If we can't add/remove headers then doesn't that really limit the
>> utility of these patches? My assumption was that BPF+LWT is needed to
>> allow users to define and implement their own encapsulations, EH
>> insertion, packet modification, etc.
>
> I thought you were specifically referring to lwtunnel_output(). You
> can extend headers at lwtunnel_xmit() but you can't for
> lwtunnel_output() and lwtunnel_input(). The xmit hook is in
> ip_finish_output2() where it is fine to do so. We have existing code
> doing this there.
>
>>> You keep saying this "middle in the stack" but the point is exactly
>>> the same as a raw socket with IPPROTO_RAW and hdrincl, see
>>> rawv6_sendmsg() and rawv6_send_hdrincl(). An IPv6 raw socket can feed
>>> arbitrary garbage into dst_output(). IPv4 does some minimal sanity
>>> checks.
>>>
>> What I mean is that an admin can create a BPF program that runs on any
>> user packets (for instance, it could be set on the default route). This
>> would be in the path of TCP, UDP, and other protocols tightly integrated
>> with the stack. Packets being routed may be encapsulated, carry VLAN
>> tags, have checksum offload or GSO set, etc. They also might be looped
>> back, in which case the settings in the skbuff become receive parameters.
>
>> You can do the same with act_pedit or cls_bpf at dev_queue_xmit(),
>> before or after you go into the encapsulation device. This is a tool
>> for root; if you install a drop-all ssh rule then you won't be able to
>> log into the box anymore. If you modify the packet and render it
>> invalid, the packet will be dropped.
> If you attach a VLAN header while VLAN offload is already set up, then
> the hardware will add another VLAN header, this is what I would
> expect. I truly hope that we don't have code which crashes if we
> dev_queue_xmit() garbage into any device. That would be horrible.
>
> Do you have a specific example that could be a problem?

I believe Alexander was dealing with problems where drivers were not
properly handling IP extension headers. We regularly get reports about
driver checksum failures on edge conditions. Making a fully functional
and efficient transmit data path is nontrivial; there are many
assumptions being made, some documented, some not. When drivers crash we
either fix the driver or the protocol stack that is doing something
bad.

Tom

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation
  2016-11-01 19:27         ` Tom Herbert
@ 2016-11-01 19:59           ` Thomas Graf
  2016-11-01 20:33             ` Tom Herbert
  0 siblings, 1 reply; 36+ messages in thread
From: Thomas Graf @ 2016-11-01 19:59 UTC (permalink / raw)
  To: Tom Herbert
  Cc: David S. Miller, Alexei Starovoitov, Daniel Borkmann, roopa,
	Linux Kernel Network Developers

On 1 November 2016 at 12:27, Tom Herbert <tom@herbertland.com> wrote:
> On Tue, Nov 1, 2016 at 12:11 PM, Thomas Graf <tgraf@suug.ch> wrote:
>> You can do the same with act_pedit or cls_bpf at dev_queue_xmit(),
>> before or after you go into the encapsulation device. This is a tool
>> for root; if you install a drop-all ssh rule then you won't be able to
>> log into the box anymore. If you modify the packet and render it
>> invalid, the packet will be dropped.
>> If you attach a VLAN header while VLAN offload is already set up, then
>> the hardware will add another VLAN header, this is what I would
>> expect. I truly hope that we don't have code which crashes if we
>> dev_queue_xmit() garbage into any device. That would be horrible.
>>
>> Do you have a specific example that could be a problem?
>
> I believe Alexander was dealing with problems where drivers were not
> properly handling IP extension headers. We regularly get reports about

There are many ways to cause IP extension headers to be inserted. How
is this specific to BPF or this series?

> driver checksum failures on edge conditions. Making a fully functional

Not sure what an "edge condition" is? Untrusted networks? How are
drivers crashing on receive related to this series?

> and efficient transmit data path is nontrivial; there are many
> assumptions being made, some documented, some not. When drivers crash we
> either fix the driver or the protocol stack that is doing something
> bad.

Tom, we have a dozen ways to modify packet content already and we have
multiple ways to take raw packet data from userspace and inject them
at dev_queue_xmit() level and some even above. What is different for
lwtunnel_xmit()?

This is entirely optional and nobody is forced to use any of this. If
you don't trust the BPF program then you'd better not run it. It's about
the same as trusting a random tc filter or iptables ruleset. The
important point is that integrity is maintained at all times. I would
love to address any specific concern.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation
  2016-11-01 18:51   ` Thomas Graf
@ 2016-11-01 20:08     ` Hannes Frederic Sowa
  2016-11-01 20:59       ` Thomas Graf
  0 siblings, 1 reply; 36+ messages in thread
From: Hannes Frederic Sowa @ 2016-11-01 20:08 UTC (permalink / raw)
  To: Thomas Graf
  Cc: David S. Miller, Alexei Starovoitov, Daniel Borkmann,
	Tom Herbert, roopa, netdev

On Tue, Nov 1, 2016, at 19:51, Thomas Graf wrote:
> On 1 November 2016 at 03:54, Hannes Frederic Sowa
> <hannes@stressinduktion.org> wrote:
> > I do fear the complexity and debuggability issues introduced by this
> > patch set quite a bit.
> 
> What is the complexity concern? This is pretty straight forward. I
> agree on debugability. This is being worked on separately as Alexei
> mentioned, to address this for all BPF integrations.

We have a multi-layered policy engine which is actually hard to inspect
from user space already.

We first resolve the rules, which forward us to the table_id, where we
do the fib lookup, which in the end returns the eBPF program to use.

> > I wonder if architecturally it would be more feasible to add a generic
> > (bpf-)hook into dst_output(_sk) and allow arbitrary metadata to be
> > added into the dsts.
> >
> > BPF could then be able to access parts of the metadata in the attached
> > metadata dst entries and perform the matching this way?
> 
> If I understand you correctly then a single BPF program would be
> loaded which then applies to all dst_output() calls? This has a huge
> drawback, instead of multiple small BPF programs which do exactly what
> is required per dst, a large BPF program is needed which matches on
> metadata. That's way slower and renders one of the biggest advantages
> of BPF invalid, the ability to generate a small program tailored to
> a particular use. See Cilium.

I thought more of hooks in the actual output/input functions, specific to
the protocol type (unfortunately, again), protected by jump labels. Those
hooks would get part of the dst_entry mapped so they can act on them.

Another idea would be to put the eBPF hooks into the fib rules
infrastructure. But I fear this wouldn't get you the hooks you were
looking for? There they would only end up in the runtime path if
actually activated.

> > The reason why I would prefer an approach like this: regardless of the
> > routing lookup we would process the skb with bpf or not. This gives a
> > single point to debug, whereas in your approach we first must
> > figure out the corresponding bpf program and then check for it specifically.
> 
> Not sure I see what kind of advantage this actually provides. You can
> dump the routes and see which programs get invoked and which section.

Dumping and verifying which routes get used might actually already be
quite complex on its own. Thus my fear.

> If it's based on metadata then you need to know the program logic and
> associate it with the metadata in the dst. It actually doesn't get
> much easier than to debug one of the samples, they are completely
> static once compiled and it's very simple to verify if they do what
> they are supposed to do.

At the same time you can have lots of those programs, and you would,
e.g., also need to verify whether they are acting on the same data
structures or have identical code.

It all reminds me a bit of grepping in source code which makes heavy use
of function pointers with very generic and short names.

> If you like the single program approach, feel free to load the same
> program for every dst. Perfectly acceptable but I don't see why we
> should force everybody to use that model.

I am concerned about having hundreds of BPF programs, all specialized on a
particular route, to debug. Looking at one code file and its associated
tables still seems easier to me.

E.g. imagine we have input routes and output routes with different BPF
programs. We somehow must make sure all nodes behave according
to "sane" network semantics. If you end up with an input route doing bpf
processing and the corresponding output route, which e.g. might be needed to
reflect ICMP packets, doesn't behave accordingly, you already have at least
two programs to debug instead of a switch- or if-condition in one
single code location. I would like to "force" this kind of symmetry on
developers using eBPF, thus I thought metadata manipulation and
verification inside the kernel would be a better attack on this problem,
no?

Bye,
Hannes

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next v2 3/5] bpf: BPF for lightweight tunnel encapsulation
  2016-11-01  0:37 ` [PATCH net-next v2 3/5] bpf: BPF for lightweight tunnel encapsulation Thomas Graf
  2016-11-01  1:10   ` kbuild test robot
  2016-11-01  2:39   ` kbuild test robot
@ 2016-11-01 20:11   ` David Ahern
  2016-11-01 21:03     ` Thomas Graf
  2016-11-02 13:39   ` Roopa Prabhu
  3 siblings, 1 reply; 36+ messages in thread
From: David Ahern @ 2016-11-01 20:11 UTC (permalink / raw)
  To: Thomas Graf, davem; +Cc: alexei.starovoitov, daniel, tom, roopa, netdev

On 10/31/16 6:37 PM, Thomas Graf wrote:
> Register two new BPF prog types BPF_PROG_TYPE_LWT_IN and
> BPF_PROG_TYPE_LWT_OUT which are invoked if a route contains a
> LWT redirection of type LWTUNNEL_ENCAP_BPF.
> 
> The separate program types are required because manipulation of
> packet data is only allowed on the output and transmit path as
> the subsequent dst_input() call path assumes an IP header
> validated by ip_rcv(). The BPF programs will be handed an skb
> with the L3 header attached and may return one of the following
> return codes:
> 
>  BPF_OK - Continue routing as per nexthop
>  BPF_DROP - Drop skb and return EPERM
>  BPF_REDIRECT - Redirect skb to device as per redirect() helper.
>                 (Only valid on lwtunnel_xmit() hook)
> 
> The return codes are binary compatible with their TC_ACT_
> relatives to ease compatibility.
> 
> A new helper bpf_skb_push() is added which allows prepending an
> L2 header in front of the skb, extending the existing L3 header, or
> both. This allows addressing a wide range of issues:
>  - Optimize L2 header construction when L2 information is always
>    static to avoid ARP/NDisc lookup.
>  - Extend IP header to add additional IP options.
>  - Perform simple encapsulation where offload is of no concern.
>    (The existing functionality to attach a tunnel key to the skb
>     and redirect to a tunnel net_device to allow for offload
>     continues to work, obviously.)

Have you tested the adding of headers with large packets hitting GSO and fragmentation? I am wondering if the MTU is used properly when headers are added and the lwt->headroom is not accounted for. See commit 14972cbd34ff668c390cbd2e6497323484c9e812.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation
  2016-11-01 19:59           ` Thomas Graf
@ 2016-11-01 20:33             ` Tom Herbert
  2016-11-01 21:05               ` Thomas Graf
  0 siblings, 1 reply; 36+ messages in thread
From: Tom Herbert @ 2016-11-01 20:33 UTC (permalink / raw)
  To: Thomas Graf
  Cc: David S. Miller, Alexei Starovoitov, Daniel Borkmann, roopa,
	Linux Kernel Network Developers

On Tue, Nov 1, 2016 at 12:59 PM, Thomas Graf <tgraf@suug.ch> wrote:
> On 1 November 2016 at 12:27, Tom Herbert <tom@herbertland.com> wrote:
>> On Tue, Nov 1, 2016 at 12:11 PM, Thomas Graf <tgraf@suug.ch> wrote:
>>> You can do the same with act_pedit or cls_bpf at dev_queue_xmit(),
>>> before or after you go into the encapsulation device. This is a tool
>>> for root; if you install a drop-all ssh rule then you won't be able to
>>> log into the box anymore. If you modify the packet and render it
>>> invalid, the packet will be dropped.
>>> If you attach a VLAN header while VLAN offload is already set up, then
>>> the hardware will add another VLAN header, this is what I would
>>> expect. I truly hope that we don't have code which crashes if we
>>> dev_queue_xmit() garbage into any device. That would be horrible.
>>>
>>> Do you have a specific example that could be a problem?
>>
>> I believe Alexander was dealing with problems where drivers were not
>> properly handling IP extension headers. We regularly get reports about
>
> There are many ways to cause IP extension headers to be inserted. How
> is this specific to BPF or this series?
>
>> driver checksum failures on edge conditions. Making a fully functional
>
> Not sure what an "edge condition" is? Untrusted networks? How is
> drivers crashing on receive related to this series?
>
>> and efficient transmit data path is nontrivial; there are many
>> assumptions being made, some documented, some not. When drivers crash we
>> either fix the driver or the protocol stack that is doing something
>> bad.
>
> Tom, we have a dozen ways to modify packet content already and we have
> multiple ways to take raw packet data from userspace and inject them
> at dev_queue_xmit() level and some even above. What is different for
> lwtunnel_xmit()?
>
> This is entirely optional and nobody is forced to use any of this. If
>> you don't trust the BPF program then you'd better not run it. It's about
>> the same as trusting a random tc filter or iptables ruleset. The
>> important point is that integrity is maintained at all times. I would
> love to address any specific concern.

You need to show that integrity is maintained with these patches.
Part of this can be done, but part of this needs to be established in
testing.

I've already given specifics for at least one potential source of
issues in routing. I would like to know what happens if someone
rewrites an IPv4 packet into an IPv6 packet or vice versa. AFAICT we
would be sending an IPv6 packet using an IPv4 route, or an IPv4 packet
using an IPv6 route. What is supposed to happen in these circumstances?
What actually happens?

Similarly, someone overwrites the whole packet with 0xff. What happens
when we try to send that?

Tom

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation
  2016-11-01 20:08     ` Hannes Frederic Sowa
@ 2016-11-01 20:59       ` Thomas Graf
  2016-11-01 22:12         ` Hannes Frederic Sowa
  0 siblings, 1 reply; 36+ messages in thread
From: Thomas Graf @ 2016-11-01 20:59 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: David S. Miller, Alexei Starovoitov, Daniel Borkmann,
	Tom Herbert, roopa, netdev

On 1 November 2016 at 13:08, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> On Tue, Nov 1, 2016, at 19:51, Thomas Graf wrote:
>> If I understand you correctly then a single BPF program would be
>> loaded which then applies to all dst_output() calls? This has a huge
>> drawback, instead of multiple small BPF programs which do exactly what
>> is required per dst, a large BPF program is needed which matches on
>> metadata. That's way slower and renders one of the biggest advantages
>> of BPF invalid, the ability to generate a small program tailored to
>> a particular use. See Cilium.
>
> I thought more of hooks in the actual output/input functions specific to
> the protocol type (unfortunately again) protected by jump labels? Those
> hooks get part of the dst_entry mapped so they can act on them.

This has no advantage over installing a BPF program at tc egress and
enabling to store/access metadata per dst. The whole point is to
execute bpf for a specific route.

> Another idea would be to put the eBPF hooks into the fib rules
> infrastructure. But I fear this wouldn't get you the hooks you were
> looking for? There they would only end up in the runtime path if
> actually activated.

Use of fib rules kills performance so it's not an option. I'm not even
sure that would be any simpler.

> Dumping and verifying which routes get used might actually already be
> quite complex on its own. Thus my fear.

We even have an API to query which route is used for a tuple. What
else would you like to see?

>> If it's based on metadata then you need to know the program logic and
>> associate it with the metadata in the dst. It actually doesn't get
>> much easier than to debug one of the samples, they are completely
>> static once compiled and it's very simple to verify if they do what
>> they are supposed to do.
>
> At the same time you can have lots of those programs and you e.g. would
> also need to verify if they are acting on the same data structures or
> have the identical code.

This will be addressed with signing AFAIK.

> It all reminds me a bit on grepping in source code which makes heavy use
> of function pointers with very generic and short names.

Is this statement related to routing? I don't get the reference to
function pointers and generic short names.

>> If you like the single program approach, feel free to load the same
>> program for every dst. Perfectly acceptable but I don't see why we
>> should force everybody to use that model.
>
> I am concerned having hundreds of BPF programs, all specialized on a
> particular route, to debug. Looking at one code file and its associated
> tables seems still easier to me.

100 programs != 100 source files. A lot more realistic is a single or
a handful of programs which get compiled for a particular route with
certain pieces enabled/disabled.

> E.g. imagine we have input routes and output routes with different BPF
> programs. We somehow must make sure all nodes kind of behave according
> to "sane" network semantics. If you end up with an input route doing bpf

As soon as we have signing, you can verify your programs in testing,
sign the programs and then quickly verify on all your nodes whether
you are running the correct programs.

Would it help if we allowed storing the original source used for
bytecode generation? That would make it clear which program was used.

> processing and the corresponding output node, which e.g. might be needed to
> reflect ICMP packets, doesn't behave accordingly, you at least have two
> programs to debug already instead of a switch- or if-condition in one
> single code location. I would like to "force" this kind of symmetry to
> developers using eBPF, thus I thought meta-data manipulation and
> verification inside the kernel would be a better attack at this problem,
> no?

Are you saying you want a single gigantic program for both input and output?

That's not possible. The BPF program has different limitations
depending on where it runs. On input, any write action on the packet
is not allowed, extending the header is only allowed on xmit, and so
on.

I also don't see how this could possibly scale if all packets must go
through a single BPF program. The overhead will be tremendous if you
only want to filter a couple of prefixes.


* Re: [PATCH net-next v2 3/5] bpf: BPF for lightweight tunnel encapsulation
  2016-11-01 20:11   ` David Ahern
@ 2016-11-01 21:03     ` Thomas Graf
  0 siblings, 0 replies; 36+ messages in thread
From: Thomas Graf @ 2016-11-01 21:03 UTC (permalink / raw)
  To: David Ahern
  Cc: David S. Miller, Alexei Starovoitov, Daniel Borkmann,
	Tom Herbert, roopa, netdev

On 1 November 2016 at 13:11, David Ahern <dsa@cumulusnetworks.com> wrote:
> On 10/31/16 6:37 PM, Thomas Graf wrote:
>>  - Perform simple encapsulation where offload is of no concern.
>>    (The existing functionality to attach a tunnel key to the skb
>>     and redirect to a tunnel net_device to allow for offload
>>     continues to work obviously).
>
> have you tested adding headers with large packets hitting GSO and fragmentation? Wondering if the MTU is used properly when headers are added and lwt->headroom is not accounted for. See 14972cbd34ff668c390cbd2e6497323484c9e812

Thanks for the pointer David, I will look into this.

The packet size is currently limited to this when adding L2 headers:
skb->dev->mtu + skb->dev->hard_header_len
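As a toy userspace sketch of that check (the struct and function names are
made up for illustration; this is not the kernel's actual code):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative only: after a BPF program pushes an L2 header, the
 * resulting packet may not exceed dev->mtu + dev->hard_header_len.
 * These names mimic, but are not, the kernel's structures. */
struct fake_dev {
	unsigned int mtu;
	unsigned short hard_header_len;
};

static bool push_within_limit(const struct fake_dev *dev,
			      unsigned int skb_len,
			      unsigned int push_len)
{
	return skb_len + push_len <= dev->mtu + dev->hard_header_len;
}
```

So a full-MTU packet can still take an Ethernet header, but nothing beyond
hard_header_len worth of extra bytes.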


* Re: [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation
  2016-11-01 20:33             ` Tom Herbert
@ 2016-11-01 21:05               ` Thomas Graf
  0 siblings, 0 replies; 36+ messages in thread
From: Thomas Graf @ 2016-11-01 21:05 UTC (permalink / raw)
  To: Tom Herbert
  Cc: David S. Miller, Alexei Starovoitov, Daniel Borkmann, roopa,
	Linux Kernel Network Developers

On 1 November 2016 at 13:33, Tom Herbert <tom@herbertland.com> wrote:
> You need to show that is integrity is maintained with these patches.
> Part of this can be done, but part of this needs to be established in
> testing.
>
> I've already given specifics for at least one potential source of
> issues in routing issues. I would like to know what happens if someone
> rewrites an IPv4 packet into IPv6 packet or vice versa. AFAICT we
> would be send an IPv6 using an IPv4 route, or an IPv4 using an IPv6
> route. What is supposed to happen in these circumstances? What
> actually happens?
>
> Similarly, someone overwrites the whole packet with 0xff. What happens
> when we try to send that?

OK, I will add these tests to the selftest in the next iteration.


* Re: [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation
  2016-11-01 20:59       ` Thomas Graf
@ 2016-11-01 22:12         ` Hannes Frederic Sowa
  2016-11-01 23:07           ` Tom Herbert
  2016-11-02 22:54           ` Thomas Graf
  0 siblings, 2 replies; 36+ messages in thread
From: Hannes Frederic Sowa @ 2016-11-01 22:12 UTC (permalink / raw)
  To: Thomas Graf
  Cc: David S. Miller, Alexei Starovoitov, Daniel Borkmann,
	Tom Herbert, roopa, netdev

On 01.11.2016 21:59, Thomas Graf wrote:
> On 1 November 2016 at 13:08, Hannes Frederic Sowa
> <hannes@stressinduktion.org> wrote:
>> On Tue, Nov 1, 2016, at 19:51, Thomas Graf wrote:
>>> If I understand you correctly then a single BPF program would be
>>> loaded which then applies to all dst_output() calls? This has a huge
>>> drawback, instead of multiple small BPF programs which do exactly what
>>> is required per dst, a large BPF program is needed which matches on
>>> metadata. That's way slower and renders one of the biggest advantages
>>> of BPF invalid, the ability to generate a small program tailored to
>>> a particular use. See Cilium.
>>
>> I thought more of hooks in the actual output/input functions specific to
>> the protocol type (unfortunately again) protected by jump labels? Those
>> hooks get part of the dst_entry mapped so they can act on them.
> 
> This has no advantage over installing a BPF program at tc egress and
> enabling to store/access metadata per dst. The whole point is to
> execute bpf for a specific route.

The advantage I saw here was that in your proposal the tc egress path
would have to be chosen by a route. Otherwise I would already have
proposed it. :)

>> Another idea would be to put the eBPF hooks into the fib rules
>> infrastructure. But I fear this wouldn't get you the hooks you were
>> looking for? There they would only end up in the runtime path if
>> actually activated.
> 
> Use of fib rules kills performance so it's not an option. I'm not even
> sure that would be any simpler.

It very much depends on the number of rules installed. If there are just
a few rules, it shouldn't hurt performance that much (but I haven't
verified).

>> Dumping and verifying which routes get used might actually already be
>> quite complex on its own. Thus my fear.
> 
> We even have an API to query which route is used for a tuple. What
> else would you like to see?

I am not sure here. Some ideas I had were to allow tcpdump (pf_packet)
sockets to sniff at interfaces and also gather and dump the metadata to
user space (this would depend on bpf programs only doing the
modifications in metadata and not in the actual packet).

Or maybe just tracing support (without depending on the eBPF program
developer to have added debugging in the BPF program).

>>> If it's based on metadata then you need to know the program logic and
>>> associate it with the metadata in the dst. It actually doesn't get
>>> much easier than to debug one of the samples, they are completely
>>> static once compiled and it's very simple to verify if they do what
>>> they are supposed to do.
>>
>> At the same time you can have lots of those programs and you e.g. would
>> also need to verify if they are acting on the same data structures or
>> have the identical code.
> 
> This will be addressed with signing AFAIK.

This sounds a bit unrealistic. Signing lots of small programs can be a
huge burden to the entity doing the signing (if it is not on the same
computer). And as far as I understood the programs should be generated
dynamically?

>> It all reminds me a bit on grepping in source code which makes heavy use
>> of function pointers with very generic and short names.
> 
> Is this statement related to routing? I don't get the reference to
> function pointers and generic short names.

No, just an anecdotal side note how I felt when I saw the patchset. ;)

>>> If you like the single program approach, feel free to load the same
>>> program for every dst. Perfectly acceptable but I don't see why we
>>> should force everybody to use that model.
>>
>> I am concerned having hundreds of BPF programs, all specialized on a
>> particular route, to debug. Looking at one code file and its associated
>> tables seems still easier to me.
> 
> 100 programs != 100 source files. A lot more realistic is a single or
> a handful of programs which get compiled for a particular route with
> certain pieces enabled/disabled.
> 
>> E.g. imagine we have input routes and output routes with different BPF
>> programs. We somehow must make sure all nodes kind of behave according
>> to "sane" network semantics. If you end up with an input route doing bpf
> 
> As soon as we have signing, you can verify your programs in testing,
> sign the programs and then quickly verify on all your nodes whether
> you are running the correct programs.
> 
>> Would it help if we allowed storing the original source used for
>> bytecode generation? That would make it clear which program was used.

I would also be fine with just a strong hash of the bytecode, so the
program can be identified accurately. Maybe helps with deduplication
later on, too. ;)

>> processing and the corresponding output node, which e.g. might be needed to
>> reflect ICMP packets, doesn't behave accordingly, you at least have two
>> programs to debug already instead of a switch- or if-condition in one
>> single code location. I would like to "force" this kind of symmetry to
>> developers using eBPF, thus I thought meta-data manipulation and
>> verification inside the kernel would be a better attack at this problem,
>> no?
> 
> Are you saying you want a single gigantic program for both input and output?

Even though I read through the patchset I am not absolutely sure which
problem it really solves. Especially because lots of things can be done
already at the ingress vs. egress interface (I looked at patch 4 but I
am not sure how realistic they are).

> That's not possible. The BPF program has different limitations
> depending on where it runs. On input, any write action on the packet
> is not allowed, extending the header is only allowed on xmit, and so
> on.
> 
> I also don't see how this could possibly scale if all packets must go
> through a single BPF program. The overhead will be tremendous if you
> only want to filter a couple of prefixes.

In case of a hash table lookup it should be fast. llvm will probably also
generate a jump table for a few hundred IP addresses, no? Additionally,
the routing table lookup could be skipped entirely.
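For illustration, the kind of per-address hash lookup I mean, as a toy
userspace sketch (table size, hash constant and names are all arbitrary):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Toy open-addressing hash set of IPv4 destinations, illustrating why
 * matching "a couple of prefixes" can be O(1) instead of a walk over
 * the routing table. Keep SLOTS well above the entry count so the
 * probe loops below terminate. */
#define SLOTS 256  /* power of two */

struct ip_set {
	uint32_t slot[SLOTS];
	bool used[SLOTS];
};

static void ip_set_add(struct ip_set *s, uint32_t addr)
{
	uint32_t i = (addr * 2654435761u) & (SLOTS - 1);

	while (s->used[i] && s->slot[i] != addr)
		i = (i + 1) & (SLOTS - 1);
	s->slot[i] = addr;
	s->used[i] = true;
}

static bool ip_set_contains(const struct ip_set *s, uint32_t addr)
{
	uint32_t i = (addr * 2654435761u) & (SLOTS - 1);

	while (s->used[i]) {
		if (s->slot[i] == addr)
			return true;
		i = (i + 1) & (SLOTS - 1);
	}
	return false;
}
```

With only a handful of entries, the lookup cost is independent of the size
of the routing table.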

Thanks,
Hannes


* Re: [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation
  2016-11-01 22:12         ` Hannes Frederic Sowa
@ 2016-11-01 23:07           ` Tom Herbert
  2016-11-02 10:48             ` Hannes Frederic Sowa
  2016-11-02 22:57             ` Thomas Graf
  2016-11-02 22:54           ` Thomas Graf
  1 sibling, 2 replies; 36+ messages in thread
From: Tom Herbert @ 2016-11-01 23:07 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Thomas Graf, David S. Miller, Alexei Starovoitov,
	Daniel Borkmann, roopa, netdev

On Tue, Nov 1, 2016 at 3:12 PM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> On 01.11.2016 21:59, Thomas Graf wrote:
>> On 1 November 2016 at 13:08, Hannes Frederic Sowa
>> <hannes@stressinduktion.org> wrote:
>>> On Tue, Nov 1, 2016, at 19:51, Thomas Graf wrote:
>>>> If I understand you correctly then a single BPF program would be
>>>> loaded which then applies to all dst_output() calls? This has a huge
>>>> drawback, instead of multiple small BPF programs which do exactly what
>>>> is required per dst, a large BPF program is needed which matches on
>>>> metadata. That's way slower and renders one of the biggest advantages
>>>> of BPF invalid, the ability to generate a small program tailored to
>>>> a particular use. See Cilium.
>>>
>>> I thought more of hooks in the actual output/input functions specific to
>>> the protocol type (unfortunately again) protected by jump labels? Those
>>> hooks get part of the dst_entry mapped so they can act on them.
>>
>> This has no advantage over installing a BPF program at tc egress and
>> enabling to store/access metadata per dst. The whole point is to
>> execute bpf for a specific route.
>
> The advantage I saw here was that in your proposal the tc egress path
> would have to be chosen by a route. Otherwise I would already have
> proposed it. :)
>
>>> Another idea would be to put the eBPF hooks into the fib rules
>>> infrastructure. But I fear this wouldn't get you the hooks you were
>>> looking for? There they would only end up in the runtime path if
>>> actually activated.
>>
>> Use of fib rules kills performance so it's not an option. I'm not even
>> sure that would be any simpler.
>
> It very much depends on the number of rules installed. If there are just
> a few rules, it shouldn't hurt performance that much (but I haven't
> verified).
>
Hannes,

I can say that the primary value we get out of using ILA+LWT is that
we can essentially cache a policy decision in connected sockets. That
is we are able to create a host route for each destination (thousands
of them) that describes how to do the translation for each one. There
is no route lookup per packet, and actually no extra lookup otherwise.
The translation code doesn't do much at all, basically just copies in
new destination to the packet. We need a route lookup for the
rewritten destination, but that is easily cached in the LWT structure.
The net result is that the transmit path for ILA is _really_ fast. I'm
not sure how we can match this same performance with tc egress; it
seems like we would want to cache the matching rules in the socket to
avoid rule lookups.

On the other hand, I'm not really sure how to implement this level of
performance in LWT+BPF either. It seems like one way to do that would
be to create a program for each destination and set it on each host.
As you point out, that would create a million different programs,
which doesn't seem manageable. I don't think the BPF map works either,
since that implies we need a lookup (?). It seems like what we need is
one program but allow it to be parameterized with per destination
information saved in the route (LWT structure).
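A rough sketch of what I mean by a parameterized program, in plain C
standing in for the BPF code, with a made-up per-route parameter struct
(nothing here is from the actual patchset):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical per-route parameter block, standing in for data the
 * kernel would hand the shared program from the LWT state. */
struct lwt_params {
	uint8_t new_dst[16];	/* rewritten IPv6 destination (ILA-style) */
};

/* Offset of the destination address in a fixed IPv6 header. */
#define IPV6_DADDR_OFF 24

/* The "program": one shared piece of code, parameterized per route.
 * It only copies the new destination into the packet, mirroring the
 * ILA translation described above. */
static void ila_xlat(uint8_t *pkt, size_t len, const struct lwt_params *p)
{
	if (len < IPV6_DADDR_OFF + 16)
		return;		/* too short to hold an IPv6 header */
	memcpy(pkt + IPV6_DADDR_OFF, p->new_dst, 16);
}
```

The point being that the code is shared across all destinations; only the
small per-destination parameter block differs per route.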

Tom

>>> Dumping and verifying which routes get used might actually already be
>>> quite complex on its own. Thus my fear.
>>
>> We even have an API to query which route is used for a tuple. What
>> else would you like to see?
>
> I am not sure here. Some ideas I had were to allow tcpdump (pf_packet)
> sockets to sniff at interfaces and also gather and dump the metadata to
> user space (this would depend on bpf programs only doing the
> modifications in metadata and not in the actual packet).
>
> Or maybe just tracing support (without depending on the eBPF program
> developer to have added debugging in the BPF program).
>
>>>> If it's based on metadata then you need to know the program logic and
>>>> associate it with the metadata in the dst. It actually doesn't get
>>>> much easier than to debug one of the samples, they are completely
>>>> static once compiled and it's very simple to verify if they do what
>>>> they are supposed to do.
>>>
>>> At the same time you can have lots of those programs and you e.g. would
>>> also need to verify if they are acting on the same data structures or
>>> have the identical code.
>>
>> This will be addressed with signing AFAIK.
>
> This sounds a bit unrealistic. Signing lots of small programs can be a
> huge burden to the entity doing the signing (if it is not on the same
> computer). And as far as I understood the programs should be generated
> dynamically?
>
>>> It all reminds me a bit on grepping in source code which makes heavy use
>>> of function pointers with very generic and short names.
>>
>> Is this statement related to routing? I don't get the reference to
>> function pointers and generic short names.
>
> No, just an anecdotal side note how I felt when I saw the patchset. ;)
>
>>>> If you like the single program approach, feel free to load the same
>>>> program for every dst. Perfectly acceptable but I don't see why we
>>>> should force everybody to use that model.
>>>
>>> I am concerned having hundreds of BPF programs, all specialized on a
>>> particular route, to debug. Looking at one code file and its associated
>>> tables seems still easier to me.
>>
>> 100 programs != 100 source files. A lot more realistic is a single or
>> a handful of programs which get compiled for a particular route with
>> certain pieces enabled/disabled.
>>
>>> E.g. imagine we have input routes and output routes with different BPF
>>> programs. We somehow must make sure all nodes kind of behave according
>>> to "sane" network semantics. If you end up with an input route doing bpf
>>
>> As soon as we have signing, you can verify your programs in testing,
>> sign the programs and then quickly verify on all your nodes whether
>> you are running the correct programs.
>>
>> Would it help if we allowed storing the original source used for
>> bytecode generation? That would make it clear which program was used.
>
> I would also be fine with just a strong hash of the bytecode, so the
> program can be identified accurately. Maybe helps with deduplication
> later on, too. ;)
>
>>> processing and the corresponding output node, which e.g. might be needed to
>>> reflect ICMP packets, doesn't behave accordingly, you at least have two
>>> programs to debug already instead of a switch- or if-condition in one
>>> single code location. I would like to "force" this kind of symmetry to
>>> developers using eBPF, thus I thought meta-data manipulation and
>>> verification inside the kernel would be a better attack at this problem,
>>> no?
>>
>> Are you saying you want a single gigantic program for both input and output?
>
> Even though I read through the patchset I am not absolutely sure which
> problem it really solves. Especially because lots of things can be done
> already at the ingress vs. egress interface (I looked at patch 4 but I
> am not sure how realistic they are).
>
>> That's not possible. The BPF program has different limitations
>> depending on where it runs. On input, any write action on the packet
>> is not allowed, extending the header is only allowed on xmit, and so
>> on.
>>
>> I also don't see how this could possibly scale if all packets must go
>> through a single BPF program. The overhead will be tremendous if you
>> only want to filter a couple of prefixes.
>
> In case of a hash table lookup it should be fast. llvm will probably also
> generate a jump table for a few hundred IP addresses, no? Additionally,
> the routing table lookup could be skipped entirely.
>
> Thanks,
> Hannes
>


* Re: [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation
  2016-11-01 23:07           ` Tom Herbert
@ 2016-11-02 10:48             ` Hannes Frederic Sowa
  2016-11-02 16:59               ` Tom Herbert
  2016-11-03 14:16               ` Thomas Graf
  2016-11-02 22:57             ` Thomas Graf
  1 sibling, 2 replies; 36+ messages in thread
From: Hannes Frederic Sowa @ 2016-11-02 10:48 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Thomas Graf, David S. Miller, Alexei Starovoitov,
	Daniel Borkmann, roopa, netdev

Hi Tom,

On Wed, Nov 2, 2016, at 00:07, Tom Herbert wrote:
> On Tue, Nov 1, 2016 at 3:12 PM, Hannes Frederic Sowa
> <hannes@stressinduktion.org> wrote:
> > On 01.11.2016 21:59, Thomas Graf wrote:
> >> On 1 November 2016 at 13:08, Hannes Frederic Sowa
> >> <hannes@stressinduktion.org> wrote:
> >>> On Tue, Nov 1, 2016, at 19:51, Thomas Graf wrote:
> >>>> If I understand you correctly then a single BPF program would be
> >>>> loaded which then applies to all dst_output() calls? This has a huge
> >>>> drawback, instead of multiple small BPF programs which do exactly what
> >>>> is required per dst, a large BPF program is needed which matches on
> >>>> metadata. That's way slower and renders one of the biggest advantages
> >>>> of BPF invalid, the ability to generate a small program tailored to
> >>>> a particular use. See Cilium.
> >>>
> >>> I thought more of hooks in the actual output/input functions specific to
> >>> the protocol type (unfortunately again) protected by jump labels? Those
> >>> hooks get part of the dst_entry mapped so they can act on them.
> >>
> >> This has no advantage over installing a BPF program at tc egress and
> >> enabling to store/access metadata per dst. The whole point is to
> >> execute bpf for a specific route.
> >
> > The advantage I saw here was that in your proposal the tc egress path
> > would have to be chosen by a route. Otherwise I would already have
> > proposed it. :)
> >
> >>> Another idea would be to put the eBPF hooks into the fib rules
> >>> infrastructure. But I fear this wouldn't get you the hooks you were
> >>> looking for? There they would only end up in the runtime path if
> >>> actually activated.
> >>
> >> Use of fib rules kills performance so it's not an option. I'm not even
> >> sure that would be any simpler.
> >
> > It very much depends on the number of rules installed. If there are just
> > a few rules, it shouldn't hurt performance that much (but I haven't
> > verified).
> >
> Hannes,
> 
> I can say that the primary value we get out of using ILA+LWT is that
> we can essentially cache a policy decision in connected sockets. That
> is we are able to create a host route for each destination (thousands
> of them) that describes how to do the translation for each one. There
> is no route lookup per packet, and actually no extra lookup otherwise.

Exactly, that is why I do like LWT and the dst_entry socket caching
shows its benefits here. Also the dst_entries communicate enough vital
information up the stack so that allocation of sk_buffs is done
according to the headers that might need to be inserted later on.

(On the other hand, the looked up BPF program can also be cached. This
becomes more difficult if we can't share the socket structs between
namespaces though.)

> The translation code doesn't do much at all, basically just copies in
> new destination to the packet. We need a route lookup for the
> rewritten destination, but that is easily cached in the LWT structure.
> The net result is that the transmit path for ILA is _really_ fast. I'm
> not sure how we can match this same performance with tc egress; it
> seems like we would want to cache the matching rules in the socket to
> avoid rule lookups.

In case of namespaces, do you allocate the host routes in the parent or
child (net-)namespaces? Or don't we talk about namespaces right now at all?

Why do we want to do the packet manipulation in tc egress and not using
LWT + interfaces? The dst_entries should be able to express all possible
allocation strategies etc. so that we don't need to shift/reallocate
packets around when inserting an additional header. We can't express
those semantics with tc egress.

> On the other hand, I'm not really sure how to implement this level of
> performance in LWT+BPF either. It seems like one way to do that would
> be to create a program for each destination and set it on each host.
> As you point out, that would create a million different programs,
> which doesn't seem manageable. I don't think the BPF map works either,
> since that implies we need a lookup (?). It seems like what we need is
> one program but allow it to be parameterized with per destination
> information saved in the route (LWT structure).

Yes, that is my proposal. Just using the dst entry as meta-data (which
can actually also be an ID for the network namespace the packet is
coming from).

My concern with using BPF is that the rest of the kernel doesn't really
see the semantics and can't optimize or cache at specific points,
because the kernel cannot introspect what the BPF program does (for
metadata manipulation, one can e.g. specify that the program is "pure",
and always provides the same output for a given input, thus things can
be cached and memoized, but that framework seems very hard to build).
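As a toy illustration of that caching idea (entirely hypothetical; nothing
like this exists in the patchset): a "pure" verdict function whose result
per input key can be memoized instead of re-running the program.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A "pure" verdict function: the same input key always yields the
 * same verdict, so a caller could cache results instead of
 * re-running it. call_count only exists to show when it runs. */
static int call_count;

static bool verdict(uint32_t key)
{
	call_count++;
	return (key & 1) == 0;	/* arbitrary deterministic policy */
}

/* One-entry memo cache in front of the pure function. */
static bool cached_verdict(uint32_t key)
{
	static bool have;
	static uint32_t last_key;
	static bool last_val;

	if (have && last_key == key)
		return last_val;
	last_val = verdict(key);
	last_key = key;
	have = true;
	return last_val;
}
```

The hard part, as said above, is proving purity for an arbitrary program;
the cache itself is trivial.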

That's why I am in favor of splitting this patchset up and allowing the
policies that should be expressed by BPF programs to be applied to the
specific subsystems (I am not totally against a generic BPF hook in
input or output of the protocol engines). E.g. can we deal with static
rewriting of L2 addresses in the neighbor cache? We already provide a
fast header cache for L2 data which might be used here?
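For example, a toy version of such an L2 header cache (illustrative names
only, not the kernel's actual hh_cache code):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_HLEN 14

/* Toy L2 header cache: the resolved Ethernet header is stored once
 * and then copied in front of every outgoing packet, instead of
 * resolving L2 addresses per packet. Purely illustrative. */
struct hh_cache_toy {
	uint8_t data[ETH_HLEN];
};

/* buf must have at least ETH_HLEN bytes of headroom before payload_off. */
static void push_cached_l2(uint8_t *buf, size_t payload_off,
			   const struct hh_cache_toy *hh)
{
	assert(payload_off >= ETH_HLEN);
	memcpy(buf + payload_off - ETH_HLEN, hh->data, ETH_HLEN);
}
```

The per-packet cost is a single copy, which is the property the neighbor
cache already exploits today.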

I also fear this becomes a kernel by-pass:

It might be very hard e.g. to apply NFT/netfilter to such packets, if
e.g. a redirect happens suddenly and packet flow is diverted from the
one the user sees currently based on the interfaces and routing tables.

Those are just some thoughts so far, I still have to think more about
this. Thanks for the discussion, it is very interesting.

Bye,
Hannes


* Re: [PATCH net-next v2 3/5] bpf: BPF for lightweight tunnel encapsulation
  2016-11-01  0:37 ` [PATCH net-next v2 3/5] bpf: BPF for lightweight tunnel encapsulation Thomas Graf
                     ` (2 preceding siblings ...)
  2016-11-01 20:11   ` David Ahern
@ 2016-11-02 13:39   ` Roopa Prabhu
  2016-11-02 22:58     ` Thomas Graf
  3 siblings, 1 reply; 36+ messages in thread
From: Roopa Prabhu @ 2016-11-02 13:39 UTC (permalink / raw)
  To: Thomas Graf; +Cc: davem, alexei.starovoitov, daniel, tom, netdev

On 10/31/16, 5:37 PM, Thomas Graf wrote:
> Register two new BPF prog types BPF_PROG_TYPE_LWT_IN and
> BPF_PROG_TYPE_LWT_OUT which are invoked if a route contains a
> LWT redirection of type LWTUNNEL_ENCAP_BPF.
>
> The separate program types are required because manipulation of
> packet data is only allowed on the output and transmit path as
> the subsequent dst_input() call path assumes an IP header
> validated by ip_rcv(). The BPF programs will be handed an skb
> with the L3 header attached and may return one of the following
> return codes:
>
>  BPF_OK - Continue routing as per nexthop
>  BPF_DROP - Drop skb and return EPERM
>  BPF_REDIRECT - Redirect skb to device as per redirect() helper.
>                 (Only valid on lwtunnel_xmit() hook)
>
> The return codes are binary compatible with their TC_ACT_
> relatives to ease compatibility.
>
> A new helper bpf_skb_push() is added which allows to prepend an
> L2 header in front of the skb, extend the existing L3 header, or
> both. This allows to address a wide range of issues:
>  - Optimize L2 header construction when L2 information is always
>    static to avoid ARP/NDisc lookup.
>  - Extend IP header to add additional IP options.
>  - Perform simple encapsulation where offload is of no concern.
>    (The existing functionality to attach a tunnel key to the skb
>     and redirect to a tunnel net_device to allow for offload
>     continues to work obviously).
>
> Signed-off-by: Thomas Graf <tgraf@suug.ch>
> ---
>  
[snip]
> diff --git a/net/Kconfig b/net/Kconfig
> index 7b6cd34..7554f12 100644
> --- a/net/Kconfig
> +++ b/net/Kconfig
> @@ -396,6 +396,7 @@ source "net/nfc/Kconfig"
>  
>  config LWTUNNEL
>  	bool "Network light weight tunnels"
> +	depends on IPV6 || IPV6=n
>  	---help---
>  	  This feature provides an infrastructure to support light weight
>  	  tunnels like mpls. There is no netdevice associated with a light
> diff --git a/net/core/Makefile b/net/core/Makefile
> index d6508c2..a675fd3 100644
> --- a/net/core/Makefile
> +++ b/net/core/Makefile
> @@ -23,7 +23,7 @@ obj-$(CONFIG_NETWORK_PHY_TIMESTAMPING) += timestamping.o
>  obj-$(CONFIG_NET_PTP_CLASSIFY) += ptp_classifier.o
>  obj-$(CONFIG_CGROUP_NET_PRIO) += netprio_cgroup.o
>  obj-$(CONFIG_CGROUP_NET_CLASSID) += netclassid_cgroup.o
> -obj-$(CONFIG_LWTUNNEL) += lwtunnel.o
> +obj-$(CONFIG_LWTUNNEL) += lwtunnel.o lwt_bpf.o

Any reason you want to keep lwt bpf under the main CONFIG_LWTUNNEL infra config?
Since it is defined as yet another pluggable encap function, it seems it would be
better under a separate CONFIG_LWTUNNEL_BPF or CONFIG_LWT_BPF that depends on
CONFIG_LWTUNNEL.

^ permalink raw reply	[flat|nested] 36+ messages in thread
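
For illustration, the header push that bpf_skb_push() enables can be sketched in plain userspace C. Everything here (`toy_skb`, `toy_push_l2`) is a hypothetical stand-in, not the kernel's skb API; it only models the idea of consuming headroom to prepend a fixed, unresolved L2 header:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define TOY_ETH_HLEN 14

/* Toy model of an skb: payload sits at an offset inside a buffer,
 * leaving headroom in front for headers pushed later. */
struct toy_skb {
    uint8_t buf[256];
    size_t headroom;   /* bytes free in front of the payload */
    size_t len;        /* current packet length */
};

/* Sketch of what a bpf_skb_push()-style helper enables: prepend a
 * static Ethernet header without any ARP/NDisc resolution. */
static int toy_push_l2(struct toy_skb *skb,
                       const uint8_t dst[6], const uint8_t src[6],
                       uint16_t proto)
{
    if (skb->headroom < TOY_ETH_HLEN)
        return -1;                       /* not enough headroom */
    skb->headroom -= TOY_ETH_HLEN;
    uint8_t *h = skb->buf + skb->headroom;
    memcpy(h, dst, 6);                   /* fixed destination MAC */
    memcpy(h + 6, src, 6);               /* fixed source MAC */
    h[12] = proto >> 8;                  /* EtherType, network order */
    h[13] = proto & 0xff;
    skb->len += TOY_ETH_HLEN;
    return 0;
}
```

Because the header bytes are constant per route, no neighbour lookup happens on the fast path; that is the "static L2 information" optimization listed above.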

* Re: [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation
  2016-11-02 10:48             ` Hannes Frederic Sowa
@ 2016-11-02 16:59               ` Tom Herbert
  2016-11-03 14:16               ` Thomas Graf
  1 sibling, 0 replies; 36+ messages in thread
From: Tom Herbert @ 2016-11-02 16:59 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Thomas Graf, David S. Miller, Alexei Starovoitov,
	Daniel Borkmann, roopa, netdev

On Wed, Nov 2, 2016 at 3:48 AM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> Hi Tom,
>
> On Wed, Nov 2, 2016, at 00:07, Tom Herbert wrote:
>> On Tue, Nov 1, 2016 at 3:12 PM, Hannes Frederic Sowa
>> <hannes@stressinduktion.org> wrote:
>> > On 01.11.2016 21:59, Thomas Graf wrote:
>> >> On 1 November 2016 at 13:08, Hannes Frederic Sowa
>> >> <hannes@stressinduktion.org> wrote:
>> >>> On Tue, Nov 1, 2016, at 19:51, Thomas Graf wrote:
>> >>>> If I understand you correctly then a single BPF program would be
>> >>>> loaded which then applies to all dst_output() calls? This has a huge
>> >>>> drawback, instead of multiple small BPF programs which do exactly what
>> >>>> is required per dst, a large BPF program is needed which matches on
>> >>>> metadata. That's way slower and renders one of the biggest advantages
>> >>>> of BPF invalid, the ability to generate a small program tailored to
>> >>>> a particular use. See Cilium.
>> >>>
>> >>> I thought more of hooks in the actual output/input functions specific to
>> >>> the protocol type (unfortunately again) protected by jump labels? Those
>> >>> hook get part of the dst_entry mapped so they can act on them.
>> >>
>> >> This has no advantage over installing a BPF program at tc egress and
>> >> enabling to store/access metadata per dst. The whole point is to
>> >> execute bpf for a specific route.
>> >
>> > The advantage I saw here was that in your proposal the tc egress path
>> > would have to be chosen by a route. Otherwise I would already have
>> > proposed it. :)
>> >
>> >>> Another idea would be to put the eBPF hooks into the fib rules
>> >>> infrastructure. But I fear this wouldn't get you the hooks you were
>> >>> looking for? There they would only end up in the runtime path if
>> >>> actually activated.
>> >>
>> >> Use of fib rules kills performance so it's not an option. I'm not even
>> >> sure that would be any simpler.
>> >
>> > It very much depends on the number of rules installed. If there are just
>> > several very few rules, it shouldn't hurt performance that much (but
>> > haven't verified).
>> >
>> Hannes,
>>
>> I can say that the primary value we get out of using ILA+LWT is that
>> we can essentially cache a policy decision in connected sockets. That
>> is, we are able to create a host route for each destination (thousands
>> of them) that describes how to do the translation for each one. There
>> is no route lookup per packet, and actually no extra lookup otherwise.
>
> Exactly, that is why I do like LWT and the dst_entry socket caching
> shows its benefits here. Also the dst_entries communicate enough vital
> information up the stack so that allocation of sk_buffs is done
> according to the headers that might need to be inserted later on.
>
> (On the other hand, the looked up BPF program can also be cached. This
> becomes more difficult if we can't share the socket structs between
> namespaces though.)
>
>> The translation code doesn't do much at all, basically just copies the
>> new destination into the packet. We need a route lookup for the
>> rewritten destination, but that is easily cached in the LWT structure.
>> The net result is that the transmit path for ILA is _really_ fast. I'm
>> not sure how we can match this same performance at tc egress; it seems
>> like we would want to cache the matching rules in the socket to avoid
>> rule lookups.
>
> In case of namespaces, do you allocate the host routes in the parent or
> child (net-)namespaces? Or don't we talk about namespaces right now at all?
>
ILA is namespace aware, everything is set per namespace. I don't see
any issue with setting routes per namespace either, nor any namespace-
related issues with this patch, except maybe that I wouldn't know what
the interaction between BPF maps and namespaces are. Do maps belong to
namespaces?

> Why do we want to do the packet manipulation in tc egress and not using
> LWT + interfaces? The dst_entries should be able to express all possible
> allocation strategies etc. so that we don't need to shift/reallocate
> packets around when inserting an additional header. We can't express
> those semantics with tc egress.
>
I don't think we do want to do this sort of packet manipulation (ILA
in particular) in tc egress; that was my point. It's not appropriate
for netfilter either, I think.

>> On the other hand, I'm not really sure how to implement this level
>> of performance in LWT+BPF either. It seems like one way to do
>> that would be to create a program for each destination and set it on
>> each host. As you point out, that would create a million different programs which
>> doesn't seem manageable. I don't think the BPF map works either since
>> that implies we need a lookup (?). It seems like what we need is one
>> program but allow it to be parameterized with per destination
>> information saved in the route (LWT structure).
>
> Yes, that is my proposal. Just using the dst entry as meta-data (which
> can actually also be an ID for the network namespace the packet is
> coming from).
>
> My concern with using BPF is that the rest of the kernel doesn't really
> see the semantics and can't optimize or cache at specific points,
> because the kernel cannot introspect what the BPF program does (for
> metadata manipulation, one can e.g. specify that the program is "pure",
> and always provides the same output for some specified given input, thus
> things can be cached and memoized, but that framework seems very hard
> to build).
>
I share your concerns. This is along the lines of an issue that I
bring up when discussing kernel bypass. Most non-kernel people don't
grasp how much service the kernel actually provides nor how a
protocol, say TCP, is tightly integrated with the rest of the stack
and kernel. Trying to reliably replicate that functionality outside of
the kernel, create all the necessary interfaces between the bypass
code and kernel code to do effective integration, or otherwise make it
somehow work in concert with the kernel, is a really difficult
challenge.

> That's why I am in favor of splitting this patchset down and allow the
> policies that should be expressed by BPF programs being applied to the
> specific subsystems (I am not totally against a generic BPF hook in
> input or output of the protocol engines). E.g. can we deal with static
> rewriting of L2 addresses in the neighbor cache? We already provide a
> fast header cache for L2 data which might be used here?
>
> I also fear this becomes a kernel by-pass:
>
By the definitions I proposed at netdev, I believe it is already a
form of kernel bypass. In the taxonomy I would call this "partial
bypass"-- the kernel is aware that bypass is happening, but has no
insight into what the 3rd party code is doing or even its intent. This
is not to say it's inherently a bad idea, just need to be clear about
what it is.

Tom

> It might be very hard e.g. to apply NFT/netfilter to such packets, if
> e.g. a redirect happens suddenly and packet flow is diverted from the
> one the user sees currently based on the interfaces and routing tables.
>
> Those are just some thoughts so far, I still have to think more about
> this. Thanks for the discussion, it is very interesting.
>
> Bye,
> Hannes

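The per-destination translation Tom describes ("basically just copies in the new destination") can be sketched as below. The struct layout and names are illustrative stand-ins, not the kernel's ILA implementation: an identifier-locator address splits the IPv6 address into a 64-bit locator (upper half) and a 64-bit identifier (lower half), and the host route caches the target locator:

```c
#include <stdint.h>

/* Toy model of an ILA identifier-locator address: the upper 64 bits
 * of the IPv6 address carry the locator, the lower 64 bits the
 * identifier. */
struct toy_ila_addr {
    uint64_t locator;    /* upper 64 bits */
    uint64_t identifier; /* lower 64 bits */
};

/* Sketch of the per-packet translation: the route already caches the
 * target locator, so "translation" is just overwriting the upper half
 * of the destination address. (Checksum-neutral adjustment omitted.) */
static void toy_ila_xlat(struct toy_ila_addr *daddr, uint64_t locator)
{
    daddr->locator = locator;
}
```

With the rewritten-destination route also cached in the LWT structure, the transmit path stays a constant-time rewrite per packet, which is the performance property discussed above.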

* Re: [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation
  2016-11-01 22:12         ` Hannes Frederic Sowa
  2016-11-01 23:07           ` Tom Herbert
@ 2016-11-02 22:54           ` Thomas Graf
  2016-11-03 14:52             ` Hannes Frederic Sowa
  1 sibling, 1 reply; 36+ messages in thread
From: Thomas Graf @ 2016-11-02 22:54 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: David S. Miller, Alexei Starovoitov, Daniel Borkmann,
	Tom Herbert, roopa, netdev

On 1 November 2016 at 16:12, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> On 01.11.2016 21:59, Thomas Graf wrote:
>>> Dumping and verifying which routes get used might actually already be
>>> quite complex on its own. Thus my fear.
>>
>> We even have an API to query which route is used for a tuple. What
>> else would you like to see?
>
> I am not sure here. Some ideas I had were to allow tcpdump (pf_packet)
> sockets sniff at interfaces and also gather and dump the metadata to
> user space (this would depend on bpf programs only doing the
> modifications in metadata and not in the actual packet).

Not sure I understand. Why does this depend on BPF?

> Or maybe just tracing support (without depending on the eBPF program
> developer to have added debugging in the BPF program).

Absolutely in favour of that.

>> This will be addressed with signing AFAIK.
>
> This sounds a bit unrealistic. Signing lots of small programs can be a
> huge burden to the entity doing the signing (if it is not on the same
> computer). And as far as I understood the programs should be generated
> dynamically?

Right, for generated programs, a hash is a better fit and still sufficient.

>> Would it help if we allowed storing the original source used for
>> bytecode generation? That would make it clear which program was used.
>
> I would also be fine with just a strong hash of the bytecode, so the
> program can be identified accurately. Maybe helps with deduplication
> later on, too. ;)

OK, I think we all already agreed on doing this.
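
Identifying a loaded program by a digest of its bytecode could look like the sketch below. FNV-1a stands in here purely for brevity, and `prog_digest` is a hypothetical name; a real identifier would use a collision-resistant hash such as SHA-256 over the instruction stream:

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch: derive a stable identifier from a program's bytecode so a
 * loaded program can be matched against a known source. Each BPF
 * instruction is modeled as a 64-bit word. FNV-1a is used only as a
 * placeholder for a cryptographic hash. */
static uint64_t prog_digest(const uint64_t *insns, size_t n_insns)
{
    const uint8_t *p = (const uint8_t *)insns;
    uint64_t h = 0xcbf29ce484222325ULL;      /* FNV offset basis */
    for (size_t i = 0; i < n_insns * sizeof(*insns); i++) {
        h ^= p[i];
        h *= 0x100000001b3ULL;               /* FNV prime */
    }
    return h;
}
```

Equal bytecode always yields the same digest, which also gives the deduplication property mentioned above.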

> Even though I read through the patchset I am not absolutely sure which
> problem it really solves. Especially because lots of things can be done
> already at the ingress vs. egress interface (I looked at patch 4 but I
> am not sure how realistic they are).

Filtering at egress requires attaching the BPF program to all
potential outgoing interfaces and then passing every single packet
through the program, whereas with LWT BPF I'm only taking the cost
where it is actually needed.

>> I also don't see how this could possibly scale if all packets must go
>> through a single BPF program. The overhead will be tremendous if you
>> only want to filter a couple of prefixes.
>
> In case of hash table lookup it should be fast. llvm will probably also
> generate a jump table for a few hundred IP addresses, no? Additionally the
> routing table lookup need not be done at all.

Why would I want to accept the overhead if I can simply avoid it? Just
parsing the header and doing the hash lookup will add cost, cost for
each packet.


* Re: [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation
  2016-11-01 23:07           ` Tom Herbert
  2016-11-02 10:48             ` Hannes Frederic Sowa
@ 2016-11-02 22:57             ` Thomas Graf
  2016-11-02 23:27               ` Tom Herbert
  1 sibling, 1 reply; 36+ messages in thread
From: Thomas Graf @ 2016-11-02 22:57 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Hannes Frederic Sowa, David S. Miller, Alexei Starovoitov,
	Daniel Borkmann, roopa, netdev

On 1 November 2016 at 17:07, Tom Herbert <tom@herbertland.com> wrote:
> On the other hand, I'm not really sure how to implement this level
> of performance in LWT+BPF either. It seems like one way to do
> that would be to create a program for each destination and set it on
> each host. As you point out, that would create a million different programs which
> doesn't seem manageable. I don't think the BPF map works either since
> that implies we need a lookup (?). It seems like what we need is one
> program but allow it to be parameterized with per destination
> information saved in the route (LWT structure).

Attaching different BPF programs to millions of unique dsts doesn't
make any sense. That obviously will never scale and it's not
supposed to scale. This is meant to be used for prefixes which
represent a series of endpoints, f.e. all local containers, all
non-internal traffic, all vpn traffic, etc. I'm also not sure why we
are talking about ILA here, you have written a native implementation,
why would you want to solve it with BPF again?

If you want to run a single program for all dsts, feel free to run the
same BPF program for each dst. Nobody is forcing you to attach
individual programs.


* Re: [PATCH net-next v2 3/5] bpf: BPF for lightweight tunnel encapsulation
  2016-11-02 13:39   ` Roopa Prabhu
@ 2016-11-02 22:58     ` Thomas Graf
  0 siblings, 0 replies; 36+ messages in thread
From: Thomas Graf @ 2016-11-02 22:58 UTC (permalink / raw)
  To: Roopa Prabhu
  Cc: David S. Miller, Alexei Starovoitov, Daniel Borkmann,
	Tom Herbert, netdev

On 2 November 2016 at 07:39, Roopa Prabhu <roopa@cumulusnetworks.com> wrote:
>> diff --git a/net/core/Makefile b/net/core/Makefile
>> index d6508c2..a675fd3 100644
>> --- a/net/core/Makefile
>> +++ b/net/core/Makefile
>> @@ -23,7 +23,7 @@ obj-$(CONFIG_NETWORK_PHY_TIMESTAMPING) += timestamping.o
>>  obj-$(CONFIG_NET_PTP_CLASSIFY) += ptp_classifier.o
>>  obj-$(CONFIG_CGROUP_NET_PRIO) += netprio_cgroup.o
>>  obj-$(CONFIG_CGROUP_NET_CLASSID) += netclassid_cgroup.o
>> -obj-$(CONFIG_LWTUNNEL) += lwtunnel.o
>> +obj-$(CONFIG_LWTUNNEL) += lwtunnel.o lwt_bpf.o
>
> Any reason you want to keep lwt bpf under the main CONFIG_LWTUNNEL infra config?
> Since it is defined as yet another pluggable encap function, it seems like it would be better under a separate
> CONFIG_LWTUNNEL_BPF or CONFIG_LWT_BPF that depends on CONFIG_LWTUNNEL

The code was so minimal with no additional dependencies that I didn't
see a need for a separate Kconfig. I'm fine adding that in the next
iteration though. No objections.
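
A separate option as suggested could look roughly like the fragment below; the exact symbol name and any default are up to the next iteration:

```
config LWT_BPF
	bool "Execute BPF program as route nexthop action"
	depends on LWTUNNEL
	---help---
	  Allows to run BPF programs as lightweight tunnel encap
	  functions, invoked per route from lwtunnel input/output/xmit.
```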


* Re: [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation
  2016-11-02 22:57             ` Thomas Graf
@ 2016-11-02 23:27               ` Tom Herbert
  0 siblings, 0 replies; 36+ messages in thread
From: Tom Herbert @ 2016-11-02 23:27 UTC (permalink / raw)
  To: Thomas Graf
  Cc: Hannes Frederic Sowa, David S. Miller, Alexei Starovoitov,
	Daniel Borkmann, roopa, netdev

On Wed, Nov 2, 2016 at 3:57 PM, Thomas Graf <tgraf@suug.ch> wrote:
> On 1 November 2016 at 17:07, Tom Herbert <tom@herbertland.com> wrote:
>> On the other hand, I'm not really sure how to implement for this level
>> of performance this in LWT+BPF either. It seems like one way to do
>> that would be to create a program each destination and set it each
>> host. As you point out would create a million different programs which
>> doesn't seem manageable. I don't think the BPF map works either since
>> that implies we need a lookup (?). It seems like what we need is one
>> program but allow it to be parameterized with per destination
>> information saved in the route (LWT structure).
>
> Attaching different BPF programs to millions of unique dsts doesn't
> make any sense. That obviously will never scale and it's not
> supposed to scale. This is meant to be used for prefixes which
> represent a series of endpoints, f.e. all local containers, all
> non-internal traffic, all vpn traffic, etc. I'm also not sure why we
> are talking about ILA here, you have written a native implementation,
> why would you want to solve it with BPF again?
>
We are talking about ILA because you specifically mentioned it in the
cover letter as a use case: "ILA like uses cases where L3 addresses
are resolved and then routed".

Tom

> If you want to run a single program for all dsts, feel free to run the
> same BPF program for each dst. Nobody is forcing you to attach
> individual programs.


* Re: [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation
  2016-11-02 10:48             ` Hannes Frederic Sowa
  2016-11-02 16:59               ` Tom Herbert
@ 2016-11-03 14:16               ` Thomas Graf
  1 sibling, 0 replies; 36+ messages in thread
From: Thomas Graf @ 2016-11-03 14:16 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Tom Herbert, David S. Miller, Alexei Starovoitov,
	Daniel Borkmann, roopa, netdev

On 2 November 2016 at 04:48, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> On Wed, Nov 2, 2016, at 00:07, Tom Herbert wrote:
>> On the other hand, I'm not really sure how to implement this level
>> of performance in LWT+BPF either. It seems like one way to do
>> that would be to create a program for each destination and set it on
>> each host. As you point out, that would create a million different programs which
>> doesn't seem manageable. I don't think the BPF map works either since
>> that implies we need a lookup (?). It seems like what we need is one
>> program but allow it to be parameterized with per destination
>> information saved in the route (LWT structure).
>
> Yes, that is my proposal. Just using the dst entry as meta-data (which
> can actually also be an ID for the network namespace the packet is
> coming from).

I have no objection to doing this on top of this series.

> My concern with using BPF is that the rest of the kernel doesn't really
> see the semantics and can't optimize or cache at specific points,
> because the kernel cannot introspect what the BPF program does (for
> metadata manipulation, one can e.g. specify that the program is "pure",
> and always provides the same output for some specified given input, thus
> things can be cached and memoized, but that framework seems very hard
> to build).

So you want to reintroduce a routing cache? Each packet needs to pass
through the BPF program anyway for accounting purposes. This is not
just about getting the packets out the right nexthop in the fastest
possible way.

> I also fear this becomes a kernel by-pass:
>
> It might be very hard e.g. to apply NFT/netfilter to such packets, if
> e.g. a redirect happens suddenly and packet flow is diverted from the
> one the user sees currently based on the interfaces and routing tables.

The LWT xmit hook is after the POST_ROUTING hook. The input and output
hook cannot redirect and output will become read-only just like input
already is. We are not bypassing anything. Please stop throwing the
word bypass around. This is just a false claim.

> That's why I am in favor of splitting this patchset down and allow the
> policies that should be expressed by BPF programs being applied to the
> specific subsystems (I am not totally against a generic BPF hook in
> input or output of the protocol engines). E.g. can we deal with static
> rewriting of L2 addresses in the neighbor cache? We already provide a
> fast header cache for L2 data which might be used here?

Split what? What policies?

I have two primary use cases for this:
1) Traffic into local containers: Containers are only supposed to do
L3, all L2 traffic is dropped for security reasons. The L2 header for
any packets in and out of the container is fixed and does not require
any sort of resolving. I in order to feed packets from the local host
into the containers, a route with the container prefix is set up. It
points to a nexthop address which appears behind a veth pair. A BPF
program is listening at tc ingress on the veth pair and will enforce
policies and do accounting. It requires very ugly hacks because Linux
does not like to do forwarding to an address which is considered
local. It works but it is a hack.

What I want to do instead is to run the BPF program for the route
directly, apply the policies, do accounting, push the fixed dummy L2
header and redirect it to the container. If someone has netfilter
rules installed, they will still apply. Nothing is hidden.

2) For external traffic that is coming in. We have a BPF program
listening on tc ingress which matches on the destination address of
all incoming traffic. If the packet is for a container, we perform
the same actions as above. In this case we are bypassing the routing
table. This is ugly. What I want to do instead is to have the
container prefix invoke the BPF program, so every packet first gets a
route lookup and netfilter filtering; only after that is the BPF
program invoked, exclusively for the packets destined for local
containers. Yes, it would be possible to redirect into a temporary
veth again and listen on that, but that again requires faking an L2
segment, which is just unnecessary and slow.

This is not hiding anything and it is not bypassing anything.

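The per-route program flow described in use case 1 (account, enforce policy, push the fixed dummy L2 header, redirect into the container) can be sketched in plain C. All types, names, and return values here are illustrative stand-ins for the BPF context and helpers, not actual kernel API:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Illustrative stand-ins for BPF return codes. */
enum { TOY_OK = 0, TOY_DROP = 2, TOY_REDIRECT = 7 };

/* Toy packet context with room to prepend a 14-byte Ethernet header. */
struct toy_ctx {
    uint8_t  data[1514];
    size_t   len;
    uint64_t bytes_acct;       /* per-route accounting counter */
    int      redirect_ifindex;
};

/* Sketch of a program attached to the container route: account the
 * packet, apply a trivial policy, push a fixed (unresolved) L2 header
 * and redirect into the container's veth. */
static int toy_lwt_xmit(struct toy_ctx *ctx, const uint8_t l2hdr[14],
                        int veth_ifindex)
{
    ctx->bytes_acct += ctx->len;               /* per-dst accounting */
    if (ctx->len == 0 || ctx->len > 1500)      /* toy policy check */
        return TOY_DROP;
    memmove(ctx->data + 14, ctx->data, ctx->len);  /* make headroom */
    memcpy(ctx->data, l2hdr, 14);              /* fixed dummy L2 */
    ctx->len += 14;
    ctx->redirect_ifindex = veth_ifindex;
    return TOY_REDIRECT;
}
```

The point of running this per route rather than globally is that only packets matching the container prefix pay for it, and netfilter rules on the path still apply beforehand.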

* Re: [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation
  2016-11-02 22:54           ` Thomas Graf
@ 2016-11-03 14:52             ` Hannes Frederic Sowa
  2016-11-03 15:20               ` Thomas Graf
  0 siblings, 1 reply; 36+ messages in thread
From: Hannes Frederic Sowa @ 2016-11-03 14:52 UTC (permalink / raw)
  To: Thomas Graf
  Cc: David S. Miller, Alexei Starovoitov, Daniel Borkmann,
	Tom Herbert, roopa, netdev

On 02.11.2016 23:54, Thomas Graf wrote:
> On 1 November 2016 at 16:12, Hannes Frederic Sowa
> <hannes@stressinduktion.org> wrote:
>> On 01.11.2016 21:59, Thomas Graf wrote:
>>>> Dumping and verifying which routes get used might actually already be
>>>> quite complex on its own. Thus my fear.
>>>
>>> We even have an API to query which route is used for a tuple. What
>>> else would you like to see?
>>
>> I am not sure here. Some ideas I had were to allow tcpdump (pf_packet)
>> sockets sniff at interfaces and also gather and dump the metadata to
>> user space (this would depend on bpf programs only doing the
>> modifications in metadata and not in the actual packet).
> 
> Not sure I understand. Why does this depend on BPF?

It doesn't. My hope was, if BPF merely tries to modify meta-data, we can
provide better debugging tools than if we mangle the packet directly.

>> Or maybe just tracing support (without depending on the eBPF program
>> developer to have added debugging in the BPF program).
> 
> Absolutely in favour of that.
> 
>>> This will be addressed with signing AFAIK.
>>
>> This sounds a bit unrealistic. Signing lots of small programs can be a
>> huge burden to the entity doing the signing (if it is not on the same
>> computer). And as far as I understood the programs should be generated
>> dynamically?
> 
> Right, for generated programs, a hash is a better fit and still sufficient.
> 
>>> Would it help if we allowed storing the original source used for
>>> bytecode generation? That would make it clear which program was used.
>>
>> I would also be fine with just a strong hash of the bytecode, so the
>> program can be identified accurately. Maybe helps with deduplication
>> later on, too. ;)
> 
> OK, I think we all already agreed on doing this.
> 
>> Even though I read through the patchset I am not absolutely sure which
>> problem it really solves. Especially because lots of things can be done
>> already at the ingress vs. egress interface (I looked at patch 4 but I
>> am not sure how realistic they are).
> 
> Filtering at egress requires attaching the BPF program to all
> potential outgoing interfaces and then passing every single packet
> through the program, whereas with LWT BPF I'm only taking the cost
> where it is actually needed.

I do certainly see this point as a big plus. I definitely also thought
about this a lot when thinking about how flower can/should be used with
multiple interfaces and how to keep its flow tables synchronized.

>>> I also don't see how this could possibly scale if all packets must go
>>> through a single BPF program. The overhead will be tremendous if you
>>> only want to filter a couple of prefixes.
>>
>> In case of hash table lookup it should be fast. llvm will probably also
>> generate a jump table for a few hundred IP addresses, no? Additionally the
>> routing table lookup need not be done at all.
> 
> Why would I want to accept the overhead if I can simply avoid it? Just
> parsing the header and doing the hash lookup will add cost, cost for
> each packet.

That is true, but in case you are outside of the namespace, you still
have to calculate the cost of doing the FIB lookup for the BPF program
each time, too.

E.g. the lookup cost in a hash for a network namespace pointer
vs. the cost of doing a FIB lookup to get a program that does a specific
transformation sounds, at least in big-O notation, like the better
place to be. ;)

If you have to do both anyway, probably your patchset will perform
better, I agree.

Bye,
Hannes


* Re: [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation
  2016-11-03 14:52             ` Hannes Frederic Sowa
@ 2016-11-03 15:20               ` Thomas Graf
  0 siblings, 0 replies; 36+ messages in thread
From: Thomas Graf @ 2016-11-03 15:20 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: David S. Miller, Alexei Starovoitov, Daniel Borkmann,
	Tom Herbert, roopa, netdev

On 3 November 2016 at 08:52, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> On 02.11.2016 23:54, Thomas Graf wrote:
>> Why would I want to accept the overhead if I simply avoid it? Just
>> parsing the header and doing the hash lookup will add cost, cost for
>> each packet.
>
> That is true, but in case you are outside of the namespace, you still
> have to calculate the cost of doing the FIB lookup for the BPF program
> each time, too.
>
> E.g. the lookup cost in a hash for a network namespace pointer
> vs. the cost of doing a FIB lookup to get a program that does a specific
> transformation sounds, at least in big-O notation, like the better
> place to be. ;)
>
> If you have to do both anyway, probably your patchset will perform
> better, I agree.

Most containers are unprivileged, the route inside the container's
namespace is owned by the host and we can attach the BPF program
directly to the default route inside the container and all packets
egressing from the container will pass through it. That fib lookup is
needed anyway so we can leverage the cost of that lookup. We can drop
hostile packets early without ever going on L2 level.


end of thread, other threads:[~2016-11-03 15:20 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-11-01  0:37 [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation Thomas Graf
2016-11-01  0:37 ` [PATCH net-next v2 1/5] route: Set orig_output when redirecting to lwt on locally generated traffic Thomas Graf
2016-11-01  0:37 ` [PATCH net-next v2 2/5] route: Set lwtstate for local traffic and cached input dsts Thomas Graf
2016-11-01  0:37 ` [PATCH net-next v2 3/5] bpf: BPF for lightweight tunnel encapsulation Thomas Graf
2016-11-01  1:10   ` kbuild test robot
2016-11-01  2:39   ` kbuild test robot
2016-11-01 20:11   ` David Ahern
2016-11-01 21:03     ` Thomas Graf
2016-11-02 13:39   ` Roopa Prabhu
2016-11-02 22:58     ` Thomas Graf
2016-11-01  0:37 ` [PATCH net-next v2 4/5] bpf: Add samples for LWT-BPF Thomas Graf
2016-11-01  0:37 ` [PATCH net-next v2 5/5] lwtunnel: Limit number of recursions on output to 5 Thomas Graf
2016-11-01  4:52   ` kbuild test robot
2016-11-01  8:09     ` Thomas Graf
2016-11-01 10:54 ` [PATCH net-next v2 0/5] bpf: BPF for lightweight tunnel encapsulation Hannes Frederic Sowa
2016-11-01 18:51   ` Thomas Graf
2016-11-01 20:08     ` Hannes Frederic Sowa
2016-11-01 20:59       ` Thomas Graf
2016-11-01 22:12         ` Hannes Frederic Sowa
2016-11-01 23:07           ` Tom Herbert
2016-11-02 10:48             ` Hannes Frederic Sowa
2016-11-02 16:59               ` Tom Herbert
2016-11-03 14:16               ` Thomas Graf
2016-11-02 22:57             ` Thomas Graf
2016-11-02 23:27               ` Tom Herbert
2016-11-02 22:54           ` Thomas Graf
2016-11-03 14:52             ` Hannes Frederic Sowa
2016-11-03 15:20               ` Thomas Graf
2016-11-01 16:17 ` Tom Herbert
2016-11-01 18:20   ` Thomas Graf
2016-11-01 18:51     ` Tom Herbert
2016-11-01 19:11       ` Thomas Graf
2016-11-01 19:27         ` Tom Herbert
2016-11-01 19:59           ` Thomas Graf
2016-11-01 20:33             ` Tom Herbert
2016-11-01 21:05               ` Thomas Graf
