* [PATCH net-next 0/4] BPF for lightweight tunnel encapsulation
@ 2016-10-30 11:58 Thomas Graf
  2016-10-30 11:58 ` [PATCH net-next 1/4] route: Set orig_output when redirecting to lwt on locally generated traffic Thomas Graf
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Thomas Graf @ 2016-10-30 11:58 UTC
  To: davem; +Cc: alexei.starovoitov, daniel, tom, netdev, roopa

This series implements BPF program invocation from dst entries via the
lightweight tunnels infrastructure. A BPF program can be attached at
lwtunnel_input(), lwtunnel_output() or lwtunnel_xmit() and sees an L3
skb as context. The input hook is read-only; the output hook may write
to packet data; the xmit hook may write, push headers, and redirect.

Motivation for this work:
 - Restricting outgoing routes beyond what the route tuple supports
 - Per-route accounting beyond realms
 - Fast attachment of L2 headers where the header does not require
   resolving L2 addresses
 - ILA-like use cases where L3 addresses are resolved and then routed
   asynchronously
 - Fast encapsulation + redirect. For now limited to use cases where
   not setting the inner and outer offset/protocol is acceptable.

A couple of samples on how to use it can be found in patch 04.
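
As a quick illustration, attaching a program to a route looks roughly
as follows (a sketch using the iproute2 syntax exercised by the
selftest script in patch 4; addresses, device, object file and section
names are placeholders):

  # run a BPF program at the output hook for traffic to 192.168.254.2
  ip route add 192.168.254.2/32 encap bpf out obj lwt_bpf.o section out dev veth0

  # or at the xmit hook, e.g. to push an L2 header and redirect
  ip route add 192.168.254.2/32 encap bpf xmit obj lwt_bpf.o section xmit dev veth0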

Thomas Graf (4):
  route: Set orig_output when redirecting to lwt on locally generated
    traffic
  route: Set lwtstate for local traffic and cached input dsts
  bpf: BPF for lightweight tunnel encapsulation
  bpf: Add samples for LWT-BPF

 include/linux/filter.h        |   2 +-
 include/uapi/linux/bpf.h      |  31 +++-
 include/uapi/linux/lwtunnel.h |  21 +++
 kernel/bpf/verifier.c         |  16 +-
 net/core/Makefile             |   2 +-
 net/core/filter.c             | 148 ++++++++++++++++-
 net/core/lwt_bpf.c            | 365 ++++++++++++++++++++++++++++++++++++++++++
 net/core/lwtunnel.c           |   1 +
 net/ipv4/route.c              |  37 +++--
 samples/bpf/bpf_helpers.h     |   4 +
 samples/bpf/lwt_bpf.c         | 210 ++++++++++++++++++++++++
 samples/bpf/test_lwt_bpf.sh   | 337 ++++++++++++++++++++++++++++++++++++++
 12 files changed, 1156 insertions(+), 18 deletions(-)
 create mode 100644 net/core/lwt_bpf.c
 create mode 100644 samples/bpf/lwt_bpf.c
 create mode 100755 samples/bpf/test_lwt_bpf.sh

-- 
2.7.4


* [PATCH net-next 1/4] route: Set orig_output when redirecting to lwt on locally generated traffic
  2016-10-30 11:58 [PATCH net-next 0/4] BPF for lightweight tunnel encapsulation Thomas Graf
@ 2016-10-30 11:58 ` Thomas Graf
  2016-10-30 11:58 ` [PATCH net-next 2/4] route: Set lwtstate for local traffic and cached input dsts Thomas Graf
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 14+ messages in thread
From: Thomas Graf @ 2016-10-30 11:58 UTC
  To: davem; +Cc: alexei.starovoitov, daniel, tom, netdev, roopa

orig_output for IPv4 was only set for dsts which hit an input route.
Set it consistently for locally generated traffic as well to allow
lwt to continue the dst_output() path as configured by the nexthop.
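
For illustration, the resulting call chain for locally generated
traffic looks roughly as follows (a sketch; the LWT-BPF hooks added
later in this series are one example of an lwt user relying on this):

  dst_output(net, sk, skb)
    -> lwtunnel_output()                  /* rth->dst.output */
         -> encap handler, e.g. an LWT-BPF output program
         -> dst->lwtstate->orig_output(net, sk, skb)
                                          /* saved by this patch */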

Fixes: 2536862311d ("lwt: Add support to redirect dst.input")
Signed-off-by: Thomas Graf <tgraf@suug.ch>
---
 net/ipv4/route.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 62d4d90..7da886e 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2138,8 +2138,10 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
 	}
 
 	rt_set_nexthop(rth, fl4->daddr, res, fnhe, fi, type, 0);
-	if (lwtunnel_output_redirect(rth->dst.lwtstate))
+	if (lwtunnel_output_redirect(rth->dst.lwtstate)) {
+		rth->dst.lwtstate->orig_output = rth->dst.output;
 		rth->dst.output = lwtunnel_output;
+	}
 
 	return rth;
 }
-- 
2.7.4


* [PATCH net-next 2/4] route: Set lwtstate for local traffic and cached input dsts
  2016-10-30 11:58 [PATCH net-next 0/4] BPF for lightweight tunnel encapsulation Thomas Graf
  2016-10-30 11:58 ` [PATCH net-next 1/4] route: Set orig_output when redirecting to lwt on locally generated traffic Thomas Graf
@ 2016-10-30 11:58 ` Thomas Graf
  2016-10-30 11:58 ` [PATCH net-next 3/4] bpf: BPF for lightweight tunnel encapsulation Thomas Graf
  2016-10-30 11:58 ` [PATCH net-next 4/4] bpf: Add samples for LWT-BPF Thomas Graf
  3 siblings, 0 replies; 14+ messages in thread
From: Thomas Graf @ 2016-10-30 11:58 UTC
  To: davem; +Cc: alexei.starovoitov, daniel, tom, netdev, roopa

An output path lookup hitting a RTN_LOCAL route will keep the dst
associated with the skb on its way through the loopback device. On
the receive path, the dst_input() call will therefore invoke the
input handler of the route created on the output path. Hence, lwt
redirection for input must be done for dsts allocated on the output
path as well.

Also, if a route is cached in the input path, the allocated dst should
respect lwtunnel configuration on the nexthop as well.
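
A concrete way to exercise the input redirection on local traffic, as
the selftest in patch 4 does (a sketch; the addresses match the test
script, the object file and section names are placeholders):

  # packets to 192.168.254.4 loop back through the loopback device;
  # the dst created on the output path is reused by dst_input()
  ip route add table local local 192.168.254.4/32 \
      encap bpf in obj lwt_bpf.o section in dev lo
  ping -c 3 192.168.254.4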

Signed-off-by: Thomas Graf <tgraf@suug.ch>
---
 net/ipv4/route.c | 39 ++++++++++++++++++++++++++-------------
 1 file changed, 26 insertions(+), 13 deletions(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 7da886e..44f5403 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1596,6 +1596,19 @@ static void ip_del_fnhe(struct fib_nh *nh, __be32 daddr)
 	spin_unlock_bh(&fnhe_lock);
 }
 
+static void set_lwt_redirect(struct rtable *rth)
+{
+	if (lwtunnel_output_redirect(rth->dst.lwtstate)) {
+		rth->dst.lwtstate->orig_output = rth->dst.output;
+		rth->dst.output = lwtunnel_output;
+	}
+
+	if (lwtunnel_input_redirect(rth->dst.lwtstate)) {
+		rth->dst.lwtstate->orig_input = rth->dst.input;
+		rth->dst.input = lwtunnel_input;
+	}
+}
+
 /* called in rcu_read_lock() section */
 static int __mkroute_input(struct sk_buff *skb,
 			   const struct fib_result *res,
@@ -1685,14 +1698,7 @@ static int __mkroute_input(struct sk_buff *skb,
 	rth->dst.input = ip_forward;
 
 	rt_set_nexthop(rth, daddr, res, fnhe, res->fi, res->type, itag);
-	if (lwtunnel_output_redirect(rth->dst.lwtstate)) {
-		rth->dst.lwtstate->orig_output = rth->dst.output;
-		rth->dst.output = lwtunnel_output;
-	}
-	if (lwtunnel_input_redirect(rth->dst.lwtstate)) {
-		rth->dst.lwtstate->orig_input = rth->dst.input;
-		rth->dst.input = lwtunnel_input;
-	}
+	set_lwt_redirect(rth);
 	skb_dst_set(skb, &rth->dst);
 out:
 	err = 0;
@@ -1919,8 +1925,18 @@ out:	return err;
 		rth->dst.error= -err;
 		rth->rt_flags 	&= ~RTCF_LOCAL;
 	}
+
 	if (do_cache) {
-		if (unlikely(!rt_cache_route(&FIB_RES_NH(res), rth))) {
+		struct fib_nh *nh = &FIB_RES_NH(res);
+
+		rth->dst.lwtstate = lwtstate_get(nh->nh_lwtstate);
+		if (lwtunnel_input_redirect(rth->dst.lwtstate)) {
+			WARN_ON(rth->dst.input == lwtunnel_input);
+			rth->dst.lwtstate->orig_input = rth->dst.input;
+			rth->dst.input = lwtunnel_input;
+		}
+
+		if (unlikely(!rt_cache_route(nh, rth))) {
 			rth->dst.flags |= DST_NOCACHE;
 			rt_add_uncached_list(rth);
 		}
@@ -2138,10 +2154,7 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
 	}
 
 	rt_set_nexthop(rth, fl4->daddr, res, fnhe, fi, type, 0);
-	if (lwtunnel_output_redirect(rth->dst.lwtstate)) {
-		rth->dst.lwtstate->orig_output = rth->dst.output;
-		rth->dst.output = lwtunnel_output;
-	}
+	set_lwt_redirect(rth);
 
 	return rth;
 }
-- 
2.7.4


* [PATCH net-next 3/4] bpf: BPF for lightweight tunnel encapsulation
  2016-10-30 11:58 [PATCH net-next 0/4] BPF for lightweight tunnel encapsulation Thomas Graf
  2016-10-30 11:58 ` [PATCH net-next 1/4] route: Set orig_output when redirecting to lwt on locally generated traffic Thomas Graf
  2016-10-30 11:58 ` [PATCH net-next 2/4] route: Set lwtstate for local traffic and cached input dsts Thomas Graf
@ 2016-10-30 11:58 ` Thomas Graf
  2016-10-30 20:34   ` Tom Herbert
  2016-10-30 11:58 ` [PATCH net-next 4/4] bpf: Add samples for LWT-BPF Thomas Graf
  3 siblings, 1 reply; 14+ messages in thread
From: Thomas Graf @ 2016-10-30 11:58 UTC
  To: davem; +Cc: alexei.starovoitov, daniel, tom, netdev, roopa

Register three new BPF prog types BPF_PROG_TYPE_LWT_IN,
BPF_PROG_TYPE_LWT_OUT and BPF_PROG_TYPE_LWT_XMIT which are invoked
if a route contains an LWT redirection of type LWTUNNEL_ENCAP_BPF.

The separate program types are required because manipulation of
packet data is only allowed on the output and transmit paths, as
the subsequent dst_input() call path assumes an IP header already
validated by ip_rcv(). The BPF programs will be handed an skb
with the L3 header attached and may return one of the following
return codes:

 BPF_OK - Continue routing as per nexthop
 BPF_DROP - Drop skb and return -EPERM
 BPF_REDIRECT - Redirect skb to device as per the bpf_redirect() helper
                (only valid on the lwtunnel_xmit() hook)

The return codes are binary compatible with their TC_ACT_*
counterparts to ease interoperability with existing SCHED_CLS and
SCHED_ACT programs.

A new helper bpf_skb_push() is added which allows prepending an
L2 header in front of the skb, extending the existing L3 header,
or both. This addresses a wide range of use cases (see the sketch
below):
 - Optimize L2 header construction when the L2 information is
   static, avoiding an ARP/NDisc lookup.
 - Extend the IP header to add additional IP options.
 - Perform simple encapsulation where offload is of no concern.
   (The existing functionality of attaching a tunnel key to the
    skb and redirecting to a tunnel net_device to allow for
    offload obviously continues to work.)
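
As a minimal sketch of such an xmit program (mirroring the samples in
patch 4; SRC_MAC, DST_MAC and DST_IFINDEX stand for illustrative
compile-time constants, helper declarations as in
samples/bpf/bpf_helpers.h, and the section name is arbitrary):

  SEC("xmit")
  int do_push_and_redirect(struct __sk_buff *skb)
  {
  	uint64_t smac = SRC_MAC, dmac = DST_MAC;
  	struct ethhdr ehdr;

  	/* make room for an L2 header; this may reallocate and thus
  	 * invalidates any prior data/data_end checks
  	 */
  	if (bpf_skb_push(skb, sizeof(ehdr), 0) < 0)
  		return BPF_DROP;

  	ehdr.h_proto = __constant_htons(ETH_P_IP);
  	memcpy(&ehdr.h_source, &smac, 6);
  	memcpy(&ehdr.h_dest, &dmac, 6);
  	if (bpf_skb_store_bytes(skb, 0, &ehdr, sizeof(ehdr), 0) < 0)
  		return BPF_DROP;

  	if (bpf_redirect(DST_IFINDEX, 0) < 0)
  		return BPF_DROP;

  	return BPF_REDIRECT;
  }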

Signed-off-by: Thomas Graf <tgraf@suug.ch>
---
 include/linux/filter.h        |   2 +-
 include/uapi/linux/bpf.h      |  31 +++-
 include/uapi/linux/lwtunnel.h |  21 +++
 kernel/bpf/verifier.c         |  16 +-
 net/core/Makefile             |   2 +-
 net/core/filter.c             | 148 ++++++++++++++++-
 net/core/lwt_bpf.c            | 365 ++++++++++++++++++++++++++++++++++++++++++
 net/core/lwtunnel.c           |   1 +
 8 files changed, 579 insertions(+), 7 deletions(-)
 create mode 100644 net/core/lwt_bpf.c

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 1f09c52..aad7f81 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -438,7 +438,7 @@ struct xdp_buff {
 };
 
 /* compute the linear packet data range [data, data_end) which
- * will be accessed by cls_bpf and act_bpf programs
+ * will be accessed by cls_bpf, act_bpf and lwt programs
  */
 static inline void bpf_compute_data_end(struct sk_buff *skb)
 {
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index e2f38e0..2ebaa3c 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -96,6 +96,9 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_TRACEPOINT,
 	BPF_PROG_TYPE_XDP,
 	BPF_PROG_TYPE_PERF_EVENT,
+	BPF_PROG_TYPE_LWT_IN,
+	BPF_PROG_TYPE_LWT_OUT,
+	BPF_PROG_TYPE_LWT_XMIT,
 };
 
 #define BPF_PSEUDO_MAP_FD	1
@@ -383,6 +386,16 @@ union bpf_attr {
  *
  * int bpf_get_numa_node_id()
  *     Return: Id of current NUMA node.
+ *
+ * int bpf_skb_push()
+ *     Add room to the beginning of the skb and adjust the MAC header offset.
+ *     Extends/reallocates needed skb headroom automatically.
+ *     May change skb data pointer and will thus invalidate any check done
+ *     for direct packet access.
+ *     @skb: pointer to skb
+ *     @len: length of header to be pushed in front
+ *     @flags: Flags (unused for now)
+ *     Return: 0 on success or negative error
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -427,7 +440,8 @@ union bpf_attr {
 	FN(skb_pull_data),		\
 	FN(csum_update),		\
 	FN(set_hash_invalid),		\
-	FN(get_numa_node_id),
+	FN(get_numa_node_id),		\
+	FN(skb_push),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -511,6 +525,21 @@ struct bpf_tunnel_key {
 	__u32 tunnel_label;
 };
 
+/* Generic BPF return codes which all BPF program types may support.
+ * The values are binary compatible with their TC_ACT_* counter-part to
+ * provide backwards compatibility with existing SCHED_CLS and SCHED_ACT
+ * programs.
+ *
+ * XDP is handled separately, see XDP_*.
+ */
+enum bpf_ret_code {
+	BPF_OK = 0,
+	/* 1 reserved */
+	BPF_DROP = 2,
+	/* 3-6 reserved */
+	BPF_REDIRECT = 7,
+};
+
 /* User return codes for XDP prog type.
  * A valid XDP program must return one of these defined values. All other
  * return codes are reserved for future use. Unknown return codes will result
diff --git a/include/uapi/linux/lwtunnel.h b/include/uapi/linux/lwtunnel.h
index a478fe8..9354d997 100644
--- a/include/uapi/linux/lwtunnel.h
+++ b/include/uapi/linux/lwtunnel.h
@@ -9,6 +9,7 @@ enum lwtunnel_encap_types {
 	LWTUNNEL_ENCAP_IP,
 	LWTUNNEL_ENCAP_ILA,
 	LWTUNNEL_ENCAP_IP6,
+	LWTUNNEL_ENCAP_BPF,
 	__LWTUNNEL_ENCAP_MAX,
 };
 
@@ -42,4 +43,24 @@ enum lwtunnel_ip6_t {
 
 #define LWTUNNEL_IP6_MAX (__LWTUNNEL_IP6_MAX - 1)
 
+enum {
+	LWT_BPF_PROG_UNSPEC,
+	LWT_BPF_PROG_FD,
+	LWT_BPF_PROG_NAME,
+	__LWT_BPF_PROG_MAX,
+};
+
+#define LWT_BPF_PROG_MAX (__LWT_BPF_PROG_MAX - 1)
+
+enum {
+	LWT_BPF_UNSPEC,
+	LWT_BPF_IN,
+	LWT_BPF_OUT,
+	LWT_BPF_XMIT,
+	__LWT_BPF_MAX,
+};
+
+#define LWT_BPF_MAX (__LWT_BPF_MAX - 1)
+
+
 #endif /* _UAPI_LWTUNNEL_H_ */
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 9002575..519b58e 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -633,12 +633,21 @@ static int check_map_access(struct bpf_verifier_env *env, u32 regno, int off,
 #define MAX_PACKET_OFF 0xffff
 
 static bool may_access_direct_pkt_data(struct bpf_verifier_env *env,
-				       const struct bpf_call_arg_meta *meta)
+				       const struct bpf_call_arg_meta *meta,
+				       enum bpf_access_type t)
 {
 	switch (env->prog->type) {
+	case BPF_PROG_TYPE_LWT_IN:
+		/* dst_input() can't write for now, orig_input may depend on
+		 * IP header parsed by ip_rcv().
+		 */
+		if (t == BPF_WRITE)
+			return false;
 	case BPF_PROG_TYPE_SCHED_CLS:
 	case BPF_PROG_TYPE_SCHED_ACT:
 	case BPF_PROG_TYPE_XDP:
+	case BPF_PROG_TYPE_LWT_OUT:
+	case BPF_PROG_TYPE_LWT_XMIT:
 		if (meta)
 			return meta->pkt_access;
 
@@ -837,7 +846,7 @@ static int check_mem_access(struct bpf_verifier_env *env, u32 regno, int off,
 			err = check_stack_read(state, off, size, value_regno);
 		}
 	} else if (state->regs[regno].type == PTR_TO_PACKET) {
-		if (t == BPF_WRITE && !may_access_direct_pkt_data(env, NULL)) {
+		if (t == BPF_WRITE && !may_access_direct_pkt_data(env, NULL, t)) {
 			verbose("cannot write into packet\n");
 			return -EACCES;
 		}
@@ -970,7 +979,8 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
 		return 0;
 	}
 
-	if (type == PTR_TO_PACKET && !may_access_direct_pkt_data(env, meta)) {
+	if (type == PTR_TO_PACKET &&
+	    !may_access_direct_pkt_data(env, meta, BPF_READ)) {
 		verbose("helper access to the packet is not allowed\n");
 		return -EACCES;
 	}
diff --git a/net/core/Makefile b/net/core/Makefile
index d6508c2..a675fd3 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -23,7 +23,7 @@ obj-$(CONFIG_NETWORK_PHY_TIMESTAMPING) += timestamping.o
 obj-$(CONFIG_NET_PTP_CLASSIFY) += ptp_classifier.o
 obj-$(CONFIG_CGROUP_NET_PRIO) += netprio_cgroup.o
 obj-$(CONFIG_CGROUP_NET_CLASSID) += netclassid_cgroup.o
-obj-$(CONFIG_LWTUNNEL) += lwtunnel.o
+obj-$(CONFIG_LWTUNNEL) += lwtunnel.o lwt_bpf.o
 obj-$(CONFIG_DST_CACHE) += dst_cache.o
 obj-$(CONFIG_HWBM) += hwbm.o
 obj-$(CONFIG_NET_DEVLINK) += devlink.o
diff --git a/net/core/filter.c b/net/core/filter.c
index cd9e2ba..325a9d8 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2138,6 +2138,43 @@ static const struct bpf_func_proto bpf_skb_change_tail_proto = {
 	.arg3_type	= ARG_ANYTHING,
 };
 
+BPF_CALL_3(bpf_skb_push, struct sk_buff *, skb, __u32, len, u64, flags)
+{
+	u32 new_len = skb->len + len;
+
+	/* restrict max skb size and check for overflow */
+	if (new_len > __bpf_skb_max_len(skb) || new_len < skb->len)
+		return -ERANGE;
+
+	if (flags)
+		return -EINVAL;
+
+	if (len > 0) {
+		int ret;
+
+		ret = skb_cow(skb, len);
+		if (unlikely(ret < 0))
+			return ret;
+
+		__skb_push(skb, len);
+		memset(skb->data, 0, len);
+	}
+
+	skb_reset_mac_header(skb);
+
+	bpf_compute_data_end(skb);
+	return 0;
+}
+
+static const struct bpf_func_proto bpf_skb_push_proto = {
+	.func		= bpf_skb_push,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_ANYTHING,
+	.arg3_type	= ARG_ANYTHING,
+};
+
 bool bpf_helper_changes_skb_data(void *func)
 {
 	if (func == bpf_skb_vlan_push ||
@@ -2147,7 +2184,8 @@ bool bpf_helper_changes_skb_data(void *func)
 	    func == bpf_skb_change_tail ||
 	    func == bpf_skb_pull_data ||
 	    func == bpf_l3_csum_replace ||
-	    func == bpf_l4_csum_replace)
+	    func == bpf_l4_csum_replace ||
+	    func == bpf_skb_push)
 		return true;
 
 	return false;
@@ -2578,6 +2616,75 @@ xdp_func_proto(enum bpf_func_id func_id)
 	}
 }
 
+static const struct bpf_func_proto *
+lwt_in_func_proto(enum bpf_func_id func_id)
+{
+	switch (func_id) {
+	case BPF_FUNC_skb_load_bytes:
+		return &bpf_skb_load_bytes_proto;
+	case BPF_FUNC_skb_pull_data:
+		return &bpf_skb_pull_data_proto;
+	case BPF_FUNC_csum_diff:
+		return &bpf_csum_diff_proto;
+	case BPF_FUNC_get_cgroup_classid:
+		return &bpf_get_cgroup_classid_proto;
+	case BPF_FUNC_get_route_realm:
+		return &bpf_get_route_realm_proto;
+	case BPF_FUNC_get_hash_recalc:
+		return &bpf_get_hash_recalc_proto;
+	case BPF_FUNC_perf_event_output:
+		return &bpf_skb_event_output_proto;
+	case BPF_FUNC_get_smp_processor_id:
+		return &bpf_get_smp_processor_id_proto;
+	case BPF_FUNC_skb_under_cgroup:
+		return &bpf_skb_under_cgroup_proto;
+	default:
+		return sk_filter_func_proto(func_id);
+	}
+}
+
+static const struct bpf_func_proto *
+lwt_out_func_proto(enum bpf_func_id func_id)
+{
+	switch (func_id) {
+	case BPF_FUNC_skb_store_bytes:
+		return &bpf_skb_store_bytes_proto;
+	case BPF_FUNC_csum_update:
+		return &bpf_csum_update_proto;
+	case BPF_FUNC_l3_csum_replace:
+		return &bpf_l3_csum_replace_proto;
+	case BPF_FUNC_l4_csum_replace:
+		return &bpf_l4_csum_replace_proto;
+	case BPF_FUNC_set_hash_invalid:
+		return &bpf_set_hash_invalid_proto;
+	default:
+		return lwt_in_func_proto(func_id);
+	}
+}
+
+static const struct bpf_func_proto *
+lwt_xmit_func_proto(enum bpf_func_id func_id)
+{
+	switch (func_id) {
+	case BPF_FUNC_skb_get_tunnel_key:
+		return &bpf_skb_get_tunnel_key_proto;
+	case BPF_FUNC_skb_set_tunnel_key:
+		return bpf_get_skb_set_tunnel_proto(func_id);
+	case BPF_FUNC_skb_get_tunnel_opt:
+		return &bpf_skb_get_tunnel_opt_proto;
+	case BPF_FUNC_skb_set_tunnel_opt:
+		return bpf_get_skb_set_tunnel_proto(func_id);
+	case BPF_FUNC_redirect:
+		return &bpf_redirect_proto;
+	case BPF_FUNC_skb_change_tail:
+		return &bpf_skb_change_tail_proto;
+	case BPF_FUNC_skb_push:
+		return &bpf_skb_push_proto;
+	default:
+		return lwt_out_func_proto(func_id);
+	}
+}
+
 static bool __is_valid_access(int off, int size, enum bpf_access_type type)
 {
 	if (off < 0 || off >= sizeof(struct __sk_buff))
@@ -2940,6 +3047,27 @@ static const struct bpf_verifier_ops xdp_ops = {
 	.convert_ctx_access	= xdp_convert_ctx_access,
 };
 
+static const struct bpf_verifier_ops lwt_in_ops = {
+	.get_func_proto		= lwt_in_func_proto,
+	.is_valid_access	= tc_cls_act_is_valid_access,
+	.convert_ctx_access	= sk_filter_convert_ctx_access,
+	.gen_prologue		= tc_cls_act_prologue,
+};
+
+static const struct bpf_verifier_ops lwt_out_ops = {
+	.get_func_proto		= lwt_out_func_proto,
+	.is_valid_access	= tc_cls_act_is_valid_access,
+	.convert_ctx_access	= sk_filter_convert_ctx_access,
+	.gen_prologue		= tc_cls_act_prologue,
+};
+
+static const struct bpf_verifier_ops lwt_xmit_ops = {
+	.get_func_proto		= lwt_xmit_func_proto,
+	.is_valid_access	= tc_cls_act_is_valid_access,
+	.convert_ctx_access	= sk_filter_convert_ctx_access,
+	.gen_prologue		= tc_cls_act_prologue,
+};
+
 static struct bpf_prog_type_list sk_filter_type __read_mostly = {
 	.ops	= &sk_filter_ops,
 	.type	= BPF_PROG_TYPE_SOCKET_FILTER,
@@ -2960,12 +3088,30 @@ static struct bpf_prog_type_list xdp_type __read_mostly = {
 	.type	= BPF_PROG_TYPE_XDP,
 };
 
+static struct bpf_prog_type_list lwt_in_type __read_mostly = {
+	.ops	= &lwt_in_ops,
+	.type	= BPF_PROG_TYPE_LWT_IN,
+};
+
+static struct bpf_prog_type_list lwt_out_type __read_mostly = {
+	.ops	= &lwt_out_ops,
+	.type	= BPF_PROG_TYPE_LWT_OUT,
+};
+
+static struct bpf_prog_type_list lwt_xmit_type __read_mostly = {
+	.ops	= &lwt_xmit_ops,
+	.type	= BPF_PROG_TYPE_LWT_XMIT,
+};
+
 static int __init register_sk_filter_ops(void)
 {
 	bpf_register_prog_type(&sk_filter_type);
 	bpf_register_prog_type(&sched_cls_type);
 	bpf_register_prog_type(&sched_act_type);
 	bpf_register_prog_type(&xdp_type);
+	bpf_register_prog_type(&lwt_in_type);
+	bpf_register_prog_type(&lwt_out_type);
+	bpf_register_prog_type(&lwt_xmit_type);
 
 	return 0;
 }
diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
new file mode 100644
index 0000000..8404ac6
--- /dev/null
+++ b/net/core/lwt_bpf.c
@@ -0,0 +1,365 @@
+/* Copyright (c) 2016 Thomas Graf <tgraf@tgraf.ch>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/skbuff.h>
+#include <linux/types.h>
+#include <linux/bpf.h>
+#include <net/lwtunnel.h>
+
+struct bpf_lwt_prog {
+	struct bpf_prog *prog;
+	char *name;
+};
+
+struct bpf_lwt {
+	struct bpf_lwt_prog in;
+	struct bpf_lwt_prog out;
+	struct bpf_lwt_prog xmit;
+};
+
+#define MAX_PROG_NAME 256
+
+static inline struct bpf_lwt *bpf_lwt_lwtunnel(struct lwtunnel_state *lwt)
+{
+	return (struct bpf_lwt *)lwt->data;
+}
+
+#define NO_REDIRECT false
+#define CAN_REDIRECT true
+
+static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
+		       struct dst_entry *dst, bool can_redirect)
+{
+	int ret;
+
+	/* Preempt disable is needed to protect per-cpu redirect_info between
+	 * BPF prog and skb_do_redirect(). The call_rcu in bpf_prog_put() and
+	 * access to maps strictly require a rcu_read_lock() for protection,
+	 * mixing with BH RCU lock doesn't work.
+	 */
+	preempt_disable();
+	rcu_read_lock();
+	bpf_compute_data_end(skb);
+	ret = BPF_PROG_RUN(lwt->prog, skb);
+	rcu_read_unlock();
+
+	switch (ret) {
+	case BPF_OK:
+		break;
+
+	case BPF_REDIRECT:
+		if (!can_redirect) {
+			WARN_ONCE(1, "Illegal redirect return code in prog %s\n",
+				  lwt->name ? : "<unknown>");
+			ret = BPF_OK;
+		} else {
+			ret = skb_do_redirect(skb);
+			if (ret == 0)
+				ret = BPF_REDIRECT;
+		}
+		break;
+
+	case BPF_DROP:
+		kfree_skb(skb);
+		ret = -EPERM;
+		break;
+
+	default:
+		WARN_ONCE(1, "Illegal LWT BPF return value %u, expect packet loss\n",
+			  ret);
+		kfree_skb(skb);
+		ret = -EINVAL;
+		break;
+	}
+
+	preempt_enable();
+
+	return ret;
+}
+
+static int bpf_input(struct sk_buff *skb)
+{
+	struct dst_entry *dst = skb_dst(skb);
+	struct bpf_lwt *bpf;
+	int ret;
+
+	bpf = bpf_lwt_lwtunnel(dst->lwtstate);
+	if (bpf->in.prog) {
+		ret = run_lwt_bpf(skb, &bpf->in, dst, NO_REDIRECT);
+		if (ret < 0)
+			return ret;
+	}
+
+	if (unlikely(!dst->lwtstate->orig_input)) {
+		WARN_ONCE(1, "orig_input not set on dst for prog %s\n",
+			  bpf->in.name);
+		kfree_skb(skb);
+		return -EINVAL;
+	}
+
+	return dst->lwtstate->orig_input(skb);
+}
+
+static int bpf_output(struct net *net, struct sock *sk, struct sk_buff *skb)
+{
+	struct dst_entry *dst = skb_dst(skb);
+	struct bpf_lwt *bpf;
+	int ret;
+
+	bpf = bpf_lwt_lwtunnel(dst->lwtstate);
+	if (bpf->out.prog) {
+		ret = run_lwt_bpf(skb, &bpf->out, dst, NO_REDIRECT);
+		if (ret < 0)
+			return ret;
+	}
+
+	if (unlikely(!dst->lwtstate->orig_output)) {
+		WARN_ONCE(1, "orig_output not set on dst for prog %s\n",
+			  bpf->out.name);
+		kfree_skb(skb);
+		return -EINVAL;
+	}
+
+	return dst->lwtstate->orig_output(net, sk, skb);
+}
+
+static int bpf_xmit(struct sk_buff *skb)
+{
+	struct dst_entry *dst = skb_dst(skb);
+	struct bpf_lwt *bpf;
+
+	bpf = bpf_lwt_lwtunnel(dst->lwtstate);
+	if (bpf->xmit.prog) {
+		int ret;
+
+		ret = run_lwt_bpf(skb, &bpf->xmit, dst, CAN_REDIRECT);
+		switch (ret) {
+		case BPF_OK:
+			return LWTUNNEL_XMIT_CONTINUE;
+		case BPF_REDIRECT:
+			return LWTUNNEL_XMIT_DONE;
+		default:
+			return ret;
+		}
+	}
+
+	return LWTUNNEL_XMIT_CONTINUE;
+}
+
+static void bpf_lwt_prog_destroy(struct bpf_lwt_prog *prog)
+{
+	if (prog->prog)
+		bpf_prog_put(prog->prog);
+
+	kfree(prog->name);
+}
+
+static void bpf_destroy_state(struct lwtunnel_state *lwt)
+{
+	struct bpf_lwt *bpf = bpf_lwt_lwtunnel(lwt);
+
+	bpf_lwt_prog_destroy(&bpf->in);
+	bpf_lwt_prog_destroy(&bpf->out);
+	bpf_lwt_prog_destroy(&bpf->xmit);
+}
+
+static const struct nla_policy bpf_prog_policy[LWT_BPF_PROG_MAX + 1] = {
+	[LWT_BPF_PROG_FD] = { .type = NLA_U32, },
+	[LWT_BPF_PROG_NAME] = { .type = NLA_NUL_STRING,
+				.len = MAX_PROG_NAME },
+};
+
+static int bpf_parse_prog(struct nlattr *attr, struct bpf_lwt_prog *prog,
+			  enum bpf_prog_type type)
+{
+	struct nlattr *tb[LWT_BPF_PROG_MAX + 1];
+	struct bpf_prog *p;
+	int ret;
+	u32 fd;
+
+	ret = nla_parse_nested(tb, LWT_BPF_PROG_MAX, attr, bpf_prog_policy);
+	if (ret < 0)
+		return ret;
+
+	if (!tb[LWT_BPF_PROG_FD] || !tb[LWT_BPF_PROG_NAME])
+		return -EINVAL;
+
+	prog->name = nla_memdup(tb[LWT_BPF_PROG_NAME], GFP_KERNEL);
+	if (!prog->name)
+		return -ENOMEM;
+
+	fd = nla_get_u32(tb[LWT_BPF_PROG_FD]);
+	p = bpf_prog_get_type(fd, type);
+	if (IS_ERR(p))
+		return PTR_ERR(p);
+
+	prog->prog = p;
+
+	return 0;
+}
+
+static const struct nla_policy bpf_nl_policy[LWT_BPF_MAX + 1] = {
+	[LWT_BPF_IN]   = { .type = NLA_NESTED, },
+	[LWT_BPF_OUT]  = { .type = NLA_NESTED, },
+	[LWT_BPF_XMIT] = { .type = NLA_NESTED, },
+};
+
+static int bpf_build_state(struct net_device *dev, struct nlattr *nla,
+			   unsigned int family, const void *cfg,
+			   struct lwtunnel_state **ts)
+{
+	struct nlattr *tb[LWT_BPF_MAX + 1];
+	struct lwtunnel_state *newts;
+	struct bpf_lwt *bpf;
+	int ret;
+
+	ret = nla_parse_nested(tb, LWT_BPF_MAX, nla, bpf_nl_policy);
+	if (ret < 0)
+		return ret;
+
+	if (!tb[LWT_BPF_IN] && !tb[LWT_BPF_OUT] && !tb[LWT_BPF_XMIT])
+		return -EINVAL;
+
+	newts = lwtunnel_state_alloc(sizeof(*bpf));
+	if (!newts)
+		return -ENOMEM;
+
+	newts->type = LWTUNNEL_ENCAP_BPF;
+	bpf = bpf_lwt_lwtunnel(newts);
+
+	if (tb[LWT_BPF_IN]) {
+		ret = bpf_parse_prog(tb[LWT_BPF_IN], &bpf->in,
+				     BPF_PROG_TYPE_LWT_IN);
+		if (ret < 0) {
+			kfree(newts);
+			return ret;
+		}
+
+		newts->flags |= LWTUNNEL_STATE_INPUT_REDIRECT;
+	}
+
+	if (tb[LWT_BPF_OUT]) {
+		ret = bpf_parse_prog(tb[LWT_BPF_OUT], &bpf->out,
+				     BPF_PROG_TYPE_LWT_OUT);
+		if (ret < 0) {
+			bpf_destroy_state(newts);
+			kfree(newts);
+			return ret;
+		}
+
+		newts->flags |= LWTUNNEL_STATE_OUTPUT_REDIRECT;
+	}
+
+	if (tb[LWT_BPF_XMIT]) {
+		ret = bpf_parse_prog(tb[LWT_BPF_XMIT], &bpf->xmit,
+				     BPF_PROG_TYPE_LWT_XMIT);
+		if (ret < 0) {
+			bpf_destroy_state(newts);
+			kfree(newts);
+			return ret;
+		}
+
+		newts->flags |= LWTUNNEL_STATE_XMIT_REDIRECT;
+	}
+
+	*ts = newts;
+
+	return 0;
+}
+
+static int bpf_fill_lwt_prog(struct sk_buff *skb, int attr,
+			     struct bpf_lwt_prog *prog)
+{
+	struct nlattr *nest;
+
+	if (!prog->prog)
+		return 0;
+
+	nest = nla_nest_start(skb, attr);
+	if (!nest)
+		return -EMSGSIZE;
+
+	if (prog->name &&
+	    nla_put_string(skb, LWT_BPF_PROG_NAME, prog->name))
+		return -EMSGSIZE;
+
+	return nla_nest_end(skb, nest);
+}
+
+static int bpf_fill_encap_info(struct sk_buff *skb, struct lwtunnel_state *lwt)
+{
+	struct bpf_lwt *bpf = bpf_lwt_lwtunnel(lwt);
+
+	if (bpf_fill_lwt_prog(skb, LWT_BPF_IN, &bpf->in) < 0 ||
+	    bpf_fill_lwt_prog(skb, LWT_BPF_OUT, &bpf->out) < 0 ||
+	    bpf_fill_lwt_prog(skb, LWT_BPF_XMIT, &bpf->xmit) < 0)
+		return -EMSGSIZE;
+
+	return 0;
+}
+
+static int bpf_encap_nlsize(struct lwtunnel_state *lwtstate)
+{
+	int nest_len = nla_total_size(sizeof(struct nlattr)) +
+		       nla_total_size(MAX_PROG_NAME) + /* LWT_BPF_PROG_NAME */
+		       0;
+
+	return nest_len + /* LWT_BPF_IN */
+	       nest_len + /* LWT_BPF_OUT */
+	       nest_len + /* LWT_BPF_XMIT */
+	       0;
+}
+
+int bpf_lwt_prog_cmp(struct bpf_lwt_prog *a, struct bpf_lwt_prog *b)
+{
+	/* FIXME:
+	 * The LWT state is currently rebuilt for delete requests which
+	 * results in a new bpf_prog instance. Comparing names for now.
+	 */
+	if (!a->name && !b->name)
+		return 0;
+
+	if (!a->name || !b->name)
+		return 1;
+
+	return strcmp(a->name, b->name);
+}
+
+static int bpf_encap_cmp(struct lwtunnel_state *a, struct lwtunnel_state *b)
+{
+	struct bpf_lwt *a_bpf = bpf_lwt_lwtunnel(a);
+	struct bpf_lwt *b_bpf = bpf_lwt_lwtunnel(b);
+
+	return bpf_lwt_prog_cmp(&a_bpf->in, &b_bpf->in) ||
+	       bpf_lwt_prog_cmp(&a_bpf->out, &b_bpf->out) ||
+	       bpf_lwt_prog_cmp(&a_bpf->xmit, &b_bpf->xmit);
+}
+
+static const struct lwtunnel_encap_ops bpf_encap_ops = {
+	.build_state	= bpf_build_state,
+	.destroy_state	= bpf_destroy_state,
+	.input		= bpf_input,
+	.output		= bpf_output,
+	.xmit		= bpf_xmit,
+	.fill_encap	= bpf_fill_encap_info,
+	.get_encap_size = bpf_encap_nlsize,
+	.cmp_encap	= bpf_encap_cmp,
+};
+
+static int __init bpf_lwt_init(void)
+{
+	return lwtunnel_encap_add_ops(&bpf_encap_ops, LWTUNNEL_ENCAP_BPF);
+}
+
+subsys_initcall(bpf_lwt_init)
diff --git a/net/core/lwtunnel.c b/net/core/lwtunnel.c
index 88fd642..554d901 100644
--- a/net/core/lwtunnel.c
+++ b/net/core/lwtunnel.c
@@ -39,6 +39,7 @@ static const char *lwtunnel_encap_str(enum lwtunnel_encap_types encap_type)
 		return "MPLS";
 	case LWTUNNEL_ENCAP_ILA:
 		return "ILA";
+	case LWTUNNEL_ENCAP_BPF:
 	case LWTUNNEL_ENCAP_IP6:
 	case LWTUNNEL_ENCAP_IP:
 	case LWTUNNEL_ENCAP_NONE:
-- 
2.7.4


* [PATCH net-next 4/4] bpf: Add samples for LWT-BPF
  2016-10-30 11:58 [PATCH net-next 0/4] BPF for lightweight tunnel encapsulation Thomas Graf
                   ` (2 preceding siblings ...)
  2016-10-30 11:58 ` [PATCH net-next 3/4] bpf: BPF for lightweight tunnel encapsulation Thomas Graf
@ 2016-10-30 11:58 ` Thomas Graf
  3 siblings, 0 replies; 14+ messages in thread
From: Thomas Graf @ 2016-10-30 11:58 UTC
  To: davem; +Cc: alexei.starovoitov, daniel, tom, netdev, roopa

This adds a set of samples demonstrating the use of LWT-BPF, combined
with a shell script that runs the samples as a basic selftest.

The samples include:
 - Allowing all packets
 - Dropping all packets
 - Printing context information
 - Accessing packet data
 - IPv4 daddr rewrite in dst_output()
 - L2 MAC header push + redirect in lwt xmit
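
The selftest can be run directly; it compiles the programs with clang
and attaches them via iproute2. A sketch of the relevant invocations
(as performed by the script; the manual route below reuses the
script's address and device names):

  cd samples/bpf
  ./test_lwt_bpf.sh

  # or manually, e.g. for the "nop" section (the redirect sample
  # additionally needs the SRC_MAC/DST_MAC/DST_IFINDEX defines that
  # the script passes):
  clang -O2 -target bpf -c lwt_bpf.c -o lwt_bpf.o
  ip route add 192.168.254.2/32 encap bpf out obj lwt_bpf.o section nop dev tst_lwt0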

Signed-off-by: Thomas Graf <tgraf@suug.ch>
---
 samples/bpf/bpf_helpers.h   |   4 +
 samples/bpf/lwt_bpf.c       | 210 +++++++++++++++++++++++++++
 samples/bpf/test_lwt_bpf.sh | 337 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 551 insertions(+)
 create mode 100644 samples/bpf/lwt_bpf.c
 create mode 100755 samples/bpf/test_lwt_bpf.sh

diff --git a/samples/bpf/bpf_helpers.h b/samples/bpf/bpf_helpers.h
index 90f44bd..f34e417 100644
--- a/samples/bpf/bpf_helpers.h
+++ b/samples/bpf/bpf_helpers.h
@@ -80,6 +80,8 @@ struct bpf_map_def {
 	unsigned int map_flags;
 };
 
+static int (*bpf_skb_load_bytes)(void *ctx, int off, void *to, int len) =
+	(void *) BPF_FUNC_skb_load_bytes;
 static int (*bpf_skb_store_bytes)(void *ctx, int off, void *from, int len, int flags) =
 	(void *) BPF_FUNC_skb_store_bytes;
 static int (*bpf_l3_csum_replace)(void *ctx, int off, int from, int to, int flags) =
@@ -88,6 +90,8 @@ static int (*bpf_l4_csum_replace)(void *ctx, int off, int from, int to, int flag
 	(void *) BPF_FUNC_l4_csum_replace;
 static int (*bpf_skb_under_cgroup)(void *ctx, void *map, int index) =
 	(void *) BPF_FUNC_skb_under_cgroup;
+static int (*bpf_skb_push)(void *, int len, int flags) =
+	(void *) BPF_FUNC_skb_push;
 
 #if defined(__x86_64__)
 
diff --git a/samples/bpf/lwt_bpf.c b/samples/bpf/lwt_bpf.c
new file mode 100644
index 0000000..05be6ac
--- /dev/null
+++ b/samples/bpf/lwt_bpf.c
@@ -0,0 +1,210 @@
+/* Copyright (c) 2016 Thomas Graf <tgraf@tgraf.ch>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+
+#include <stdint.h>
+#include <stddef.h>
+#include <linux/bpf.h>
+#include <linux/ip.h>
+#include <linux/in.h>
+#include <linux/in6.h>
+#include <linux/tcp.h>
+#include <linux/udp.h>
+#include <linux/icmpv6.h>
+#include <linux/if_ether.h>
+#include "bpf_helpers.h"
+#include <string.h>
+
+# define printk(fmt, ...)						\
+		({							\
+			char ____fmt[] = fmt;				\
+			bpf_trace_printk(____fmt, sizeof(____fmt),	\
+				     ##__VA_ARGS__);			\
+		})
+
+#define CB_MAGIC 1234
+
+/* Let all packets pass */
+SEC("nop")
+int do_nop(struct __sk_buff *skb)
+{
+	return BPF_OK;
+}
+
+/* Print some context information per packet to tracing buffer.
+ */
+SEC("ctx_test")
+int do_ctx_test(struct __sk_buff *skb)
+{
+	skb->cb[0] = CB_MAGIC;
+	printk("len %d hash %d protocol %d\n", skb->len, skb->hash,
+	       skb->protocol);
+	printk("cb %d ingress_ifindex %d ifindex %d\n", skb->cb[0],
+	       skb->ingress_ifindex, skb->ifindex);
+
+	return BPF_OK;
+}
+
+/* Print content of skb->cb[] to tracing buffer */
+SEC("print_cb")
+int do_print_cb(struct __sk_buff *skb)
+{
+	printk("cb0: %x cb1: %x cb2: %x\n", skb->cb[0], skb->cb[1],
+	       skb->cb[2]);
+	printk("cb3: %x cb4: %x\n", skb->cb[3], skb->cb[4]);
+
+	return BPF_OK;
+}
+
+/* Print source and destination IPv4 address to tracing buffer */
+SEC("data_test")
+int do_data_test(struct __sk_buff *skb)
+{
+	void *data = (void *)(long)skb->data;
+	void *data_end = (void *)(long)skb->data_end;
+	struct iphdr *iph = data;
+
+	if (data + sizeof(*iph) > data_end) {
+		printk("packet truncated\n");
+		return BPF_DROP;
+	}
+
+	printk("src: %x dst: %x\n", iph->saddr, iph->daddr);
+
+	return BPF_OK;
+}
+
+#define IP_CSUM_OFF offsetof(struct iphdr, check)
+#define IP_DST_OFF offsetof(struct iphdr, daddr)
+#define IP_SRC_OFF offsetof(struct iphdr, saddr)
+#define IP_PROTO_OFF offsetof(struct iphdr, protocol)
+#define TCP_CSUM_OFF offsetof(struct tcphdr, check)
+#define UDP_CSUM_OFF offsetof(struct udphdr, check)
+#define IS_PSEUDO 0x10
+
+static inline int rewrite(struct __sk_buff *skb, uint32_t old_ip,
+			  uint32_t new_ip, int rw_daddr)
+{
+	int ret, off = 0, flags = IS_PSEUDO;
+	uint8_t proto;
+
+	ret = bpf_skb_load_bytes(skb, IP_PROTO_OFF, &proto, 1);
+	if (ret < 0) {
+		printk("bpf_l4_csum_replace failed: %d\n", ret);
+		return BPF_DROP;
+	}
+
+	switch (proto) {
+	case IPPROTO_TCP:
+		off = TCP_CSUM_OFF;
+		break;
+
+	case IPPROTO_UDP:
+		off = UDP_CSUM_OFF;
+		flags |= BPF_F_MARK_MANGLED_0;
+		break;
+
+	case IPPROTO_ICMPV6:
+		off = offsetof(struct icmp6hdr, icmp6_cksum);
+		break;
+	}
+
+	if (off) {
+		ret = bpf_l4_csum_replace(skb, off, old_ip, new_ip,
+					  flags | sizeof(new_ip));
+		if (ret < 0) {
+			printk("bpf_l4_csum_replace failed: %d\n");
+			return BPF_DROP;
+		}
+	}
+
+	ret = bpf_l3_csum_replace(skb, IP_CSUM_OFF, old_ip, new_ip, sizeof(new_ip));
+	if (ret < 0) {
+		printk("bpf_l3_csum_replace failed: %d\n", ret);
+		return BPF_DROP;
+	}
+
+	if (rw_daddr)
+		ret = bpf_skb_store_bytes(skb, IP_DST_OFF, &new_ip, sizeof(new_ip), 0);
+	else
+		ret = bpf_skb_store_bytes(skb, IP_SRC_OFF, &new_ip, sizeof(new_ip), 0);
+
+	if (ret < 0) {
+		printk("bpf_skb_store_bytes() failed: %d\n", ret);
+		return BPF_DROP;
+	}
+
+	return BPF_OK;
+}
+
+/* Rewrite IPv4 destination address from 192.168.254.2 to 192.168.254.3 */
+SEC("rw_out")
+int do_rw_out(struct __sk_buff *skb)
+{
+	uint32_t old_ip, new_ip = 0x3fea8c0;
+	int ret;
+
+	ret = bpf_skb_load_bytes(skb, IP_DST_OFF, &old_ip, 4);
+	if (ret < 0) {
+		printk("bpf_skb_load_bytes failed: %d\n", ret);
+		return BPF_DROP;
+	}
+
+	if (old_ip == 0x2fea8c0) {
+		printk("out: rewriting from %x to %x\n", old_ip, new_ip);
+		return rewrite(skb, old_ip, new_ip, 1);
+	}
+
+	return BPF_OK;
+}
+
+SEC("redirect")
+int do_redirect(struct __sk_buff *skb)
+{
+	uint64_t smac = SRC_MAC, dmac = DST_MAC;
+	int ret, ifindex = DST_IFINDEX;
+	struct ethhdr ehdr;
+
+	ret = bpf_skb_push(skb, 14, 0);
+	if (ret < 0) {
+		printk("skb_push() failed: %d\n", ret);
+	}
+
+	ehdr.h_proto = __constant_htons(ETH_P_IP);
+	memcpy(&ehdr.h_source, &smac, 6);
+	memcpy(&ehdr.h_dest, &dmac, 6);
+
+	ret = bpf_skb_store_bytes(skb, 0, &ehdr, sizeof(ehdr), 0);
+	if (ret < 0) {
+		printk("skb_store_bytes() failed: %d\n", ret);
+		return BPF_DROP;
+	}
+
+	ret = bpf_redirect(ifindex, 0);
+	if (ret < 0) {
+		printk("bpf_redirect() failed: %d\n", ret);
+		return BPF_DROP;
+	}
+
+	printk("redirected to %d\n", ifindex);
+
+	return BPF_REDIRECT;
+}
+
+/* Drop all packets */
+SEC("drop_all")
+int do_drop_all(struct __sk_buff *skb)
+{
+	printk("dropping with: %d\n", BPF_DROP);
+	return BPF_DROP;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/test_lwt_bpf.sh b/samples/bpf/test_lwt_bpf.sh
new file mode 100755
index 0000000..ea4921a
--- /dev/null
+++ b/samples/bpf/test_lwt_bpf.sh
@@ -0,0 +1,337 @@
+#!/bin/bash
+
+# Uncomment to see generated bytecode
+#VERBOSE=verbose
+
+NS=lwt_ns
+VETH0=tst_lwt0
+VETH1=tst_lwt1
+IP4_1="192.168.254.1"
+IP4_2="192.168.254.2"
+IP4_3="192.168.254.3"
+IP4_4="192.168.254.4"
+
+TRACE_ROOT=/sys/kernel/debug/tracing
+
+function hton_mac()
+{
+	MAC="${1//:/}"
+	echo "0x${MAC:10:2}${MAC:8:2}${MAC:6:2}${MAC:4:2}${MAC:2:2}${MAC:0:2}"
+}
+
+function lookup_mac()
+{
+	set +x
+	if [ ! -z "$2" ]; then
+		MAC=$(ip netns exec $2 ip link show $1 | grep ether | awk '{print $2}')
+	else
+		MAC=$(ip link show $1 | grep ether | awk '{print $2}')
+	fi
+	echo $(hton_mac $MAC)
+	set -x
+}
+
+function cleanup {
+        set +ex
+        rm lwt_bpf.o 2> /dev/null
+        ip link del $VETH0 2> /dev/null
+        ip netns delete $NS 2> /dev/null
+        set -ex
+}
+
+function setup_veth {
+        ip netns add $NS
+
+        ip link add $VETH0 type veth peer name $VETH1
+
+        ip link set dev $VETH0 up
+        ip addr add ${IP4_1}/24 dev $VETH0
+
+        ip link set $VETH1 netns $NS
+        ip netns exec $NS ip link set dev $VETH1 up
+        ip netns exec $NS ip addr add ${IP4_2}/24 dev $VETH1
+        ip netns exec $NS ip addr add ${IP4_3}/32 dev $VETH1
+
+        echo 1 > ${TRACE_ROOT}/tracing_on
+}
+
+function get_trace {
+	set +x
+        cat ${TRACE_ROOT}/trace | grep -v '^#'
+	set -x
+}
+
+function install_prog {
+	ip route del ${IP4_2}/32 dev $VETH0 2> /dev/null || true
+	ip route del table local local ${IP4_4}/32 dev lo 2> /dev/null || true
+	cp /dev/null ${TRACE_ROOT}/trace
+
+	OPTS="encap bpf $1 obj lwt_bpf.o section $2 $VERBOSE"
+
+	if [ "$1" == "in" ];  then
+		ip route add table local local ${IP4_4}/32 $OPTS dev lo
+	else
+		ip route add ${IP4_2}/32 $OPTS dev $VETH0
+	fi
+}
+
+function remove_prog {
+	if [ "$1" == "in" ];  then
+		ip route del table local local ${IP4_4}/32 dev lo
+	else
+		ip route del ${IP4_2}/32 dev $VETH0
+	fi
+}
+
+function filter_trace {
+	# Add newline to allow starting EXPECT= variables on newline
+	NL=$'\n'
+	echo "${NL}$*" | sed -e 's/^.*: : //g'
+}
+
+function expect_fail {
+	set +x
+	echo "FAIL:"
+	echo "Expected: $1"
+	echo "Got: $2"
+	set -x
+	exit 1
+}
+
+function match_trace {
+	set +x
+	RET=0
+	TRACE=$1
+	EXPECT=$2
+	GOT="$(filter_trace "$TRACE")"
+
+	[ "$GOT" != "$EXPECT" ] && {
+		expect_fail "$EXPECT" "$GOT"
+		RET=1
+	}
+	set -x
+	return $RET
+}
+
+function test_start {
+	set +x
+	echo "----------------------------------------------------------------"
+	echo "Starting test: $*"
+	echo "----------------------------------------------------------------"
+	set -x
+}
+
+function failure {
+	get_trace
+	echo "FAIL: $*"
+	exit 1
+}
+
+function test_ctx_xmit {
+	test_start "test_ctx on lwt xmit"
+	install_prog xmit ctx_test
+	ping -c 3 $IP4_2 || {
+		failure "test_ctx xmit: packets are dropped"
+	}
+	match_trace "$(get_trace)" "
+len 84 hash 0 protocol 8
+cb 1234 ingress_ifindex 0 ifindex $DST_IFINDEX
+len 84 hash 0 protocol 8
+cb 1234 ingress_ifindex 0 ifindex $DST_IFINDEX
+len 84 hash 0 protocol 8
+cb 1234 ingress_ifindex 0 ifindex $DST_IFINDEX" || exit 1
+	remove_prog xmit
+}
+
+function test_ctx_out {
+	test_start "test_ctx on lwt out"
+	install_prog out ctx_test
+	ping -c 3 $IP4_2 || {
+		failure "test_ctx out: packets are dropped"
+	}
+	match_trace "$(get_trace)" "
+len 84 hash 0 protocol 0
+cb 1234 ingress_ifindex 0 ifindex 0
+len 84 hash 0 protocol 0
+cb 1234 ingress_ifindex 0 ifindex 0
+len 84 hash 0 protocol 0
+cb 1234 ingress_ifindex 0 ifindex 0" || exit 1
+	remove_prog out
+}
+
+function test_ctx_in {
+	test_start "test_ctx on lwt in"
+	install_prog in ctx_test
+	ping -c 3 $IP4_4 || {
+		failure "test_ctx out: packets are dropped"
+	}
+	# We will see both request & reply packets as the packets
+	# are from $IP4_4 => $IP4_4
+	match_trace "$(get_trace)" "
+len 84 hash 0 protocol 8
+cb 1234 ingress_ifindex 1 ifindex 1
+len 84 hash 0 protocol 8
+cb 1234 ingress_ifindex 1 ifindex 1
+len 84 hash 0 protocol 8
+cb 1234 ingress_ifindex 1 ifindex 1
+len 84 hash 0 protocol 8
+cb 1234 ingress_ifindex 1 ifindex 1
+len 84 hash 0 protocol 8
+cb 1234 ingress_ifindex 1 ifindex 1
+len 84 hash 0 protocol 8
+cb 1234 ingress_ifindex 1 ifindex 1" || exit 1
+	remove_prog in
+}
+
+function test_data {
+	test_start "test_data on lwt $1"
+	install_prog $1 data_test
+	ping -c 3 $IP4_2 || {
+		failure "test_data ${1}: packets are dropped"
+	}
+	match_trace "$(get_trace)" "
+src: 1fea8c0 dst: 2fea8c0
+src: 1fea8c0 dst: 2fea8c0
+src: 1fea8c0 dst: 2fea8c0" || exit 1
+	remove_prog $1
+}
+
+function test_data_in {
+	test_start "test_data on lwt in"
+	install_prog in data_test
+	ping -c 3 $IP4_4 || {
+		failure "test_data in: packets are dropped"
+	}
+	# We will see both request & reply packets as the packets
+	# are from $IP4_4 => $IP4_4
+	match_trace "$(get_trace)" "
+src: 4fea8c0 dst: 4fea8c0
+src: 4fea8c0 dst: 4fea8c0
+src: 4fea8c0 dst: 4fea8c0
+src: 4fea8c0 dst: 4fea8c0
+src: 4fea8c0 dst: 4fea8c0
+src: 4fea8c0 dst: 4fea8c0" || exit 1
+	remove_prog in
+}
+
+function test_cb {
+	test_start "test_cb on lwt $1"
+	install_prog $1 print_cb
+	ping -c 3 $IP4_2 || {
+		failure "test_cb ${1}: packets are dropped"
+	}
+	match_trace "$(get_trace)" "
+cb0: 0 cb1: 0 cb2: 0
+cb3: 0 cb4: 0
+cb0: 0 cb1: 0 cb2: 0
+cb3: 0 cb4: 0
+cb0: 0 cb1: 0 cb2: 0
+cb3: 0 cb4: 0" || exit 1
+	remove_prog $1
+}
+
+function test_cb_in {
+	test_start "test_cb on lwt in"
+	install_prog in print_cb
+	ping -c 3 $IP4_4 || {
+		failure "test_cb in: packets are dropped"
+	}
+	# We will see both request & reply packets as the packets
+	# are from $IP4_4 => $IP4_4
+	match_trace "$(get_trace)" "
+cb0: 0 cb1: 0 cb2: 0
+cb3: 0 cb4: 0
+cb0: 0 cb1: 0 cb2: 0
+cb3: 0 cb4: 0
+cb0: 0 cb1: 0 cb2: 0
+cb3: 0 cb4: 0
+cb0: 0 cb1: 0 cb2: 0
+cb3: 0 cb4: 0
+cb0: 0 cb1: 0 cb2: 0
+cb3: 0 cb4: 0
+cb0: 0 cb1: 0 cb2: 0
+cb3: 0 cb4: 0" || exit 1
+	remove_prog in
+}
+
+function test_drop_all {
+	test_start "test_drop_all on lwt $1"
+	install_prog $1 drop_all
+	ping -c 3 $IP4_2 && {
+		failure "test_drop_all ${1}: Unexpected success of ping"
+	}
+	match_trace "$(get_trace)" "
+dropping with: 2
+dropping with: 2
+dropping with: 2" || exit 1
+	remove_prog $1
+}
+
+function test_drop_all_in {
+	test_start "test_drop_all on lwt in"
+	install_prog in drop_all
+	ping -c 3 $IP4_4 && {
+		failure "test_drop_all in: Unexpected success of ping"
+	}
+	match_trace "$(get_trace)" "
+dropping with: 2
+dropping with: 2
+dropping with: 2" || exit 1
+	remove_prog in
+}
+
+function test_redirect_xmit {
+	test_start "test_redirect on lwt xmit"
+	install_prog xmit redirect
+	ping -c 3 $IP4_2 || {
+		failure "Redirected packets appear to be dropped"
+	}
+	match_trace "$(get_trace)" "
+redirected to $DST_IFINDEX
+redirected to $DST_IFINDEX
+redirected to $DST_IFINDEX" || exit 1
+	remove_prog xmit
+}
+
+function test_rw_out {
+	test_start "test_rw on lwt out"
+	install_prog out rw_out
+	ping -c 3 $IP4_2 || {
+		failure "FAIL: Redirected packets appear to be dropped"
+	}
+	match_trace "$(get_trace)" "
+redirected to $DST_IFINDEX
+redirected to $DST_IFINDEX
+redirected to $DST_IFINDEX" || exit 1
+	remove_prog out
+}
+
+cleanup
+setup_veth
+
+DST_MAC=$(lookup_mac $VETH1 $NS)
+SRC_MAC=$(lookup_mac $VETH0)
+DST_IFINDEX=$(cat /sys/class/net/$VETH0/ifindex)
+
+CLANG_OPTS="-O2 -target bpf -I ../include/"
+CLANG_OPTS+=" -DSRC_MAC=$SRC_MAC -DDST_MAC=$DST_MAC -DDST_IFINDEX=$DST_IFINDEX"
+clang $CLANG_OPTS -c lwt_bpf.c -o lwt_bpf.o
+
+test_ctx_xmit
+test_ctx_out
+test_ctx_in
+test_data "xmit"
+test_data "out"
+test_data_in
+test_cb "xmit"
+test_cb "out"
+test_cb_in
+test_drop_all "xmit"
+test_drop_all "out"
+test_drop_all_in
+test_redirect_xmit
+test_rw_out
+
+cleanup
+echo 0 > ${TRACE_ROOT}/tracing_on
+exit 0
-- 
2.7.4


* Re: [PATCH net-next 3/4] bpf: BPF for lightweight tunnel encapsulation
  2016-10-30 11:58 ` [PATCH net-next 3/4] bpf: BPF for lightweight tunnel encapsulation Thomas Graf
@ 2016-10-30 20:34   ` Tom Herbert
  2016-10-30 21:47     ` Thomas Graf
  0 siblings, 1 reply; 14+ messages in thread
From: Tom Herbert @ 2016-10-30 20:34 UTC
  To: Thomas Graf
  Cc: David S. Miller, Alexei Starovoitov, Daniel Borkmann,
	Linux Kernel Network Developers, roopa

On Sun, Oct 30, 2016 at 4:58 AM, Thomas Graf <tgraf@suug.ch> wrote:
> Register three new BPF prog types BPF_PROG_TYPE_LWT_IN,
> BPF_PROG_TYPE_LWT_OUT and BPF_PROG_TYPE_LWT_XMIT which are invoked
> if a route contains an LWT redirection of type LWTUNNEL_ENCAP_BPF.
>
> The separate program types are required because manipulation of
> packet data is only allowed on the output and transmit paths, as
> the subsequent dst_input() call path assumes an IP header already
> validated by ip_rcv(). The BPF programs will be handed an skb
> with the L3 header attached and may return one of the following
> return codes:
>
>  BPF_OK - Continue routing as per nexthop
>  BPF_DROP - Drop skb and return -EPERM
>  BPF_REDIRECT - Redirect skb to device as per the bpf_redirect() helper
>                 (only valid on the lwtunnel_xmit() hook)
>
> The return codes are binary compatible with their TC_ACT_*
> counterparts to ease interoperability with existing SCHED_CLS and
> SCHED_ACT programs.
>
> A new helper bpf_skb_push() is added which allows prepending an
> L2 header in front of the skb, extending the existing L3 header,
> or both. This addresses a wide range of use cases:
>  - Optimize L2 header construction when the L2 information is
>    static, avoiding an ARP/NDisc lookup.
>  - Extend the IP header to add additional IP options.
>  - Perform simple encapsulation where offload is of no concern.
>    (The existing functionality of attaching a tunnel key to the
>     skb and redirecting to a tunnel net_device to allow for
>     offload obviously continues to work.)
>
> Signed-off-by: Thomas Graf <tgraf@suug.ch>
> ---
>  include/linux/filter.h        |   2 +-
>  include/uapi/linux/bpf.h      |  31 +++-
>  include/uapi/linux/lwtunnel.h |  21 +++
>  kernel/bpf/verifier.c         |  16 +-
>  net/core/Makefile             |   2 +-
>  net/core/filter.c             | 148 ++++++++++++++++-
>  net/core/lwt_bpf.c            | 365 ++++++++++++++++++++++++++++++++++++++++++
>  net/core/lwtunnel.c           |   1 +
>  8 files changed, 579 insertions(+), 7 deletions(-)
>  create mode 100644 net/core/lwt_bpf.c
>
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index 1f09c52..aad7f81 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -438,7 +438,7 @@ struct xdp_buff {
>  };
>
>  /* compute the linear packet data range [data, data_end) which
> - * will be accessed by cls_bpf and act_bpf programs
> + * will be accessed by cls_bpf, act_bpf and lwt programs
>   */
>  static inline void bpf_compute_data_end(struct sk_buff *skb)
>  {
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index e2f38e0..2ebaa3c 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -96,6 +96,9 @@ enum bpf_prog_type {
>         BPF_PROG_TYPE_TRACEPOINT,
>         BPF_PROG_TYPE_XDP,
>         BPF_PROG_TYPE_PERF_EVENT,
> +       BPF_PROG_TYPE_LWT_IN,
> +       BPF_PROG_TYPE_LWT_OUT,
> +       BPF_PROG_TYPE_LWT_XMIT,
>  };
>
>  #define BPF_PSEUDO_MAP_FD      1
> @@ -383,6 +386,16 @@ union bpf_attr {
>   *
>   * int bpf_get_numa_node_id()
>   *     Return: Id of current NUMA node.
> + *
> + * int bpf_skb_push()
> + *     Add room to the beginning of the skb and adjust the MAC header offset.
> + *     Extends/reallocates needed skb headroom automatically.
> + *     May change skb data pointer and will thus invalidate any check done
> + *     for direct packet access.
> + *     @skb: pointer to skb
> + *     @len: length of header to be pushed in front
> + *     @flags: Flags (unused for now)
> + *     Return: 0 on success or negative error
>   */
>  #define __BPF_FUNC_MAPPER(FN)          \
>         FN(unspec),                     \
> @@ -427,7 +440,8 @@ union bpf_attr {
>         FN(skb_pull_data),              \
>         FN(csum_update),                \
>         FN(set_hash_invalid),           \
> -       FN(get_numa_node_id),
> +       FN(get_numa_node_id),           \
> +       FN(skb_push),
>
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>   * function eBPF program intends to call
> @@ -511,6 +525,21 @@ struct bpf_tunnel_key {
>         __u32 tunnel_label;
>  };
>
> +/* Generic BPF return codes which all BPF program types may support.
> + * The values are binary compatible with their TC_ACT_* counter-part to
> + * provide backwards compatibility with existing SCHED_CLS and SCHED_ACT
> + * programs.
> + *
> + * XDP is handled separately, see XDP_*.
> + */
> +enum bpf_ret_code {
> +       BPF_OK = 0,
> +       /* 1 reserved */
> +       BPF_DROP = 2,
> +       /* 3-6 reserved */
> +       BPF_REDIRECT = 7,
> +};
> +
>  /* User return codes for XDP prog type.
>   * A valid XDP program must return one of these defined values. All other
>   * return codes are reserved for future use. Unknown return codes will result
> diff --git a/include/uapi/linux/lwtunnel.h b/include/uapi/linux/lwtunnel.h
> index a478fe8..9354d997 100644
> --- a/include/uapi/linux/lwtunnel.h
> +++ b/include/uapi/linux/lwtunnel.h
> @@ -9,6 +9,7 @@ enum lwtunnel_encap_types {
>         LWTUNNEL_ENCAP_IP,
>         LWTUNNEL_ENCAP_ILA,
>         LWTUNNEL_ENCAP_IP6,
> +       LWTUNNEL_ENCAP_BPF,
>         __LWTUNNEL_ENCAP_MAX,
>  };
>
> @@ -42,4 +43,24 @@ enum lwtunnel_ip6_t {
>
>  #define LWTUNNEL_IP6_MAX (__LWTUNNEL_IP6_MAX - 1)
>
> +enum {
> +       LWT_BPF_PROG_UNSPEC,
> +       LWT_BPF_PROG_FD,
> +       LWT_BPF_PROG_NAME,
> +       __LWT_BPF_PROG_MAX,
> +};
> +
> +#define LWT_BPF_PROG_MAX (__LWT_BPF_PROG_MAX - 1)
> +
> +enum {
> +       LWT_BPF_UNSPEC,
> +       LWT_BPF_IN,
> +       LWT_BPF_OUT,
> +       LWT_BPF_XMIT,
> +       __LWT_BPF_MAX,
> +};
> +
> +#define LWT_BPF_MAX (__LWT_BPF_MAX - 1)
> +
> +
>  #endif /* _UAPI_LWTUNNEL_H_ */
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 9002575..519b58e 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -633,12 +633,21 @@ static int check_map_access(struct bpf_verifier_env *env, u32 regno, int off,
>  #define MAX_PACKET_OFF 0xffff
>
>  static bool may_access_direct_pkt_data(struct bpf_verifier_env *env,
> -                                      const struct bpf_call_arg_meta *meta)
> +                                      const struct bpf_call_arg_meta *meta,
> +                                      enum bpf_access_type t)
>  {
>         switch (env->prog->type) {
> +       case BPF_PROG_TYPE_LWT_IN:
> +               /* dst_input() can't write for now, orig_input may depend on
> +                * IP header parsed by ip_rcv().
> +                */
> +               if (t == BPF_WRITE)
> +                       return false;
>         case BPF_PROG_TYPE_SCHED_CLS:
>         case BPF_PROG_TYPE_SCHED_ACT:
>         case BPF_PROG_TYPE_XDP:
> +       case BPF_PROG_TYPE_LWT_OUT:
> +       case BPF_PROG_TYPE_LWT_XMIT:
>                 if (meta)
>                         return meta->pkt_access;
>
> @@ -837,7 +846,7 @@ static int check_mem_access(struct bpf_verifier_env *env, u32 regno, int off,
>                         err = check_stack_read(state, off, size, value_regno);
>                 }
>         } else if (state->regs[regno].type == PTR_TO_PACKET) {
> -               if (t == BPF_WRITE && !may_access_direct_pkt_data(env, NULL)) {
> +               if (t == BPF_WRITE && !may_access_direct_pkt_data(env, NULL, t)) {
>                         verbose("cannot write into packet\n");
>                         return -EACCES;
>                 }
> @@ -970,7 +979,8 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 regno,
>                 return 0;
>         }
>
> -       if (type == PTR_TO_PACKET && !may_access_direct_pkt_data(env, meta)) {
> +       if (type == PTR_TO_PACKET &&
> +           !may_access_direct_pkt_data(env, meta, BPF_READ)) {
>                 verbose("helper access to the packet is not allowed\n");
>                 return -EACCES;
>         }
> diff --git a/net/core/Makefile b/net/core/Makefile
> index d6508c2..a675fd3 100644
> --- a/net/core/Makefile
> +++ b/net/core/Makefile
> @@ -23,7 +23,7 @@ obj-$(CONFIG_NETWORK_PHY_TIMESTAMPING) += timestamping.o
>  obj-$(CONFIG_NET_PTP_CLASSIFY) += ptp_classifier.o
>  obj-$(CONFIG_CGROUP_NET_PRIO) += netprio_cgroup.o
>  obj-$(CONFIG_CGROUP_NET_CLASSID) += netclassid_cgroup.o
> -obj-$(CONFIG_LWTUNNEL) += lwtunnel.o
> +obj-$(CONFIG_LWTUNNEL) += lwtunnel.o lwt_bpf.o
>  obj-$(CONFIG_DST_CACHE) += dst_cache.o
>  obj-$(CONFIG_HWBM) += hwbm.o
>  obj-$(CONFIG_NET_DEVLINK) += devlink.o
> diff --git a/net/core/filter.c b/net/core/filter.c
> index cd9e2ba..325a9d8 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -2138,6 +2138,43 @@ static const struct bpf_func_proto bpf_skb_change_tail_proto = {
>         .arg3_type      = ARG_ANYTHING,
>  };
>
> +BPF_CALL_3(bpf_skb_push, struct sk_buff *, skb, __u32, len, u64, flags)
> +{
> +       u32 new_len = skb->len + len;
> +
> +       /* restrict max skb size and check for overflow */
> +       if (new_len > __bpf_skb_max_len(skb) || new_len < skb->len)
> +               return -ERANGE;
> +
> +       if (flags)
> +               return -EINVAL;
> +
> +       if (len > 0) {
> +               int ret;
> +
> +               ret = skb_cow(skb, len);
> +               if (unlikely(ret < 0))
> +                       return ret;
> +
> +               __skb_push(skb, len);
> +               memset(skb->data, 0, len);
> +       }
> +
> +       skb_reset_mac_header(skb);
> +
> +       bpf_compute_data_end(skb);
> +       return 0;
> +}
> +
> +static const struct bpf_func_proto bpf_skb_push_proto = {
> +       .func           = bpf_skb_push,
> +       .gpl_only       = false,
> +       .ret_type       = RET_INTEGER,
> +       .arg1_type      = ARG_PTR_TO_CTX,
> +       .arg2_type      = ARG_ANYTHING,
> +       .arg3_type      = ARG_ANYTHING,
> +};
> +
>  bool bpf_helper_changes_skb_data(void *func)
>  {
>         if (func == bpf_skb_vlan_push ||
> @@ -2147,7 +2184,8 @@ bool bpf_helper_changes_skb_data(void *func)
>             func == bpf_skb_change_tail ||
>             func == bpf_skb_pull_data ||
>             func == bpf_l3_csum_replace ||
> -           func == bpf_l4_csum_replace)
> +           func == bpf_l4_csum_replace ||
> +           func == bpf_skb_push)
>                 return true;
>
>         return false;
> @@ -2578,6 +2616,75 @@ xdp_func_proto(enum bpf_func_id func_id)
>         }
>  }
>
> +static const struct bpf_func_proto *
> +lwt_in_func_proto(enum bpf_func_id func_id)
> +{
> +       switch (func_id) {
> +       case BPF_FUNC_skb_load_bytes:
> +               return &bpf_skb_load_bytes_proto;
> +       case BPF_FUNC_skb_pull_data:
> +               return &bpf_skb_pull_data_proto;
> +       case BPF_FUNC_csum_diff:
> +               return &bpf_csum_diff_proto;
> +       case BPF_FUNC_get_cgroup_classid:
> +               return &bpf_get_cgroup_classid_proto;
> +       case BPF_FUNC_get_route_realm:
> +               return &bpf_get_route_realm_proto;
> +       case BPF_FUNC_get_hash_recalc:
> +               return &bpf_get_hash_recalc_proto;
> +       case BPF_FUNC_perf_event_output:
> +               return &bpf_skb_event_output_proto;
> +       case BPF_FUNC_get_smp_processor_id:
> +               return &bpf_get_smp_processor_id_proto;
> +       case BPF_FUNC_skb_under_cgroup:
> +               return &bpf_skb_under_cgroup_proto;
> +       default:
> +               return sk_filter_func_proto(func_id);
> +       }
> +}
> +
> +static const struct bpf_func_proto *
> +lwt_out_func_proto(enum bpf_func_id func_id)
> +{
> +       switch (func_id) {
> +       case BPF_FUNC_skb_store_bytes:
> +               return &bpf_skb_store_bytes_proto;
> +       case BPF_FUNC_csum_update:
> +               return &bpf_csum_update_proto;
> +       case BPF_FUNC_l3_csum_replace:
> +               return &bpf_l3_csum_replace_proto;
> +       case BPF_FUNC_l4_csum_replace:
> +               return &bpf_l4_csum_replace_proto;
> +       case BPF_FUNC_set_hash_invalid:
> +               return &bpf_set_hash_invalid_proto;
> +       default:
> +               return lwt_in_func_proto(func_id);
> +       }
> +}
> +
> +static const struct bpf_func_proto *
> +lwt_xmit_func_proto(enum bpf_func_id func_id)
> +{
> +       switch (func_id) {
> +       case BPF_FUNC_skb_get_tunnel_key:
> +               return &bpf_skb_get_tunnel_key_proto;
> +       case BPF_FUNC_skb_set_tunnel_key:
> +               return bpf_get_skb_set_tunnel_proto(func_id);
> +       case BPF_FUNC_skb_get_tunnel_opt:
> +               return &bpf_skb_get_tunnel_opt_proto;
> +       case BPF_FUNC_skb_set_tunnel_opt:
> +               return bpf_get_skb_set_tunnel_proto(func_id);
> +       case BPF_FUNC_redirect:
> +               return &bpf_redirect_proto;
> +       case BPF_FUNC_skb_change_tail:
> +               return &bpf_skb_change_tail_proto;
> +       case BPF_FUNC_skb_push:
> +               return &bpf_skb_push_proto;
> +       default:
> +               return lwt_out_func_proto(func_id);
> +       }
> +}
> +
>  static bool __is_valid_access(int off, int size, enum bpf_access_type type)
>  {
>         if (off < 0 || off >= sizeof(struct __sk_buff))
> @@ -2940,6 +3047,27 @@ static const struct bpf_verifier_ops xdp_ops = {
>         .convert_ctx_access     = xdp_convert_ctx_access,
>  };
>
> +static const struct bpf_verifier_ops lwt_in_ops = {
> +       .get_func_proto         = lwt_in_func_proto,
> +       .is_valid_access        = tc_cls_act_is_valid_access,
> +       .convert_ctx_access     = sk_filter_convert_ctx_access,
> +       .gen_prologue           = tc_cls_act_prologue,
> +};
> +
> +static const struct bpf_verifier_ops lwt_out_ops = {
> +       .get_func_proto         = lwt_out_func_proto,
> +       .is_valid_access        = tc_cls_act_is_valid_access,
> +       .convert_ctx_access     = sk_filter_convert_ctx_access,
> +       .gen_prologue           = tc_cls_act_prologue,
> +};
> +
> +static const struct bpf_verifier_ops lwt_xmit_ops = {
> +       .get_func_proto         = lwt_xmit_func_proto,
> +       .is_valid_access        = tc_cls_act_is_valid_access,
> +       .convert_ctx_access     = sk_filter_convert_ctx_access,
> +       .gen_prologue           = tc_cls_act_prologue,
> +};
> +
>  static struct bpf_prog_type_list sk_filter_type __read_mostly = {
>         .ops    = &sk_filter_ops,
>         .type   = BPF_PROG_TYPE_SOCKET_FILTER,
> @@ -2960,12 +3088,30 @@ static struct bpf_prog_type_list xdp_type __read_mostly = {
>         .type   = BPF_PROG_TYPE_XDP,
>  };
>
> +static struct bpf_prog_type_list lwt_in_type __read_mostly = {
> +       .ops    = &lwt_in_ops,
> +       .type   = BPF_PROG_TYPE_LWT_IN,
> +};
> +
> +static struct bpf_prog_type_list lwt_out_type __read_mostly = {
> +       .ops    = &lwt_out_ops,
> +       .type   = BPF_PROG_TYPE_LWT_OUT,
> +};
> +
> +static struct bpf_prog_type_list lwt_xmit_type __read_mostly = {
> +       .ops    = &lwt_xmit_ops,
> +       .type   = BPF_PROG_TYPE_LWT_XMIT,
> +};
> +
>  static int __init register_sk_filter_ops(void)
>  {
>         bpf_register_prog_type(&sk_filter_type);
>         bpf_register_prog_type(&sched_cls_type);
>         bpf_register_prog_type(&sched_act_type);
>         bpf_register_prog_type(&xdp_type);
> +       bpf_register_prog_type(&lwt_in_type);
> +       bpf_register_prog_type(&lwt_out_type);
> +       bpf_register_prog_type(&lwt_xmit_type);
>
>         return 0;
>  }
> diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
> new file mode 100644
> index 0000000..8404ac6
> --- /dev/null
> +++ b/net/core/lwt_bpf.c
> @@ -0,0 +1,365 @@
> +/* Copyright (c) 2016 Thomas Graf <tgraf@tgraf.ch>
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of version 2 of the GNU General Public
> + * License as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful, but
> + * WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> + * General Public License for more details.
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/skbuff.h>
> +#include <linux/types.h>
> +#include <linux/bpf.h>
> +#include <net/lwtunnel.h>
> +
> +struct bpf_lwt_prog {
> +       struct bpf_prog *prog;
> +       char *name;
> +};
> +
> +struct bpf_lwt {
> +       struct bpf_lwt_prog in;
> +       struct bpf_lwt_prog out;
> +       struct bpf_lwt_prog xmit;
> +};
> +
> +#define MAX_PROG_NAME 256
> +
> +static inline struct bpf_lwt *bpf_lwt_lwtunnel(struct lwtunnel_state *lwt)
> +{
> +       return (struct bpf_lwt *)lwt->data;
> +}
> +
> +#define NO_REDIRECT false
> +#define CAN_REDIRECT true
> +
> +static int run_lwt_bpf(struct sk_buff *skb, struct bpf_lwt_prog *lwt,
> +                      struct dst_entry *dst, bool can_redirect)
> +{
> +       int ret;
> +
> +       /* Preempt disable is needed to protect per-cpu redirect_info between
> +        * BPF prog and skb_do_redirect(). The call_rcu in bpf_prog_put() and
> +        * access to maps strictly require a rcu_read_lock() for protection,
> +        * mixing with BH RCU lock doesn't work.
> +        */
> +       preempt_disable();
> +       rcu_read_lock();
> +       bpf_compute_data_end(skb);
> +       ret = BPF_PROG_RUN(lwt->prog, skb);
> +       rcu_read_unlock();
> +
> +       switch (ret) {
> +       case BPF_OK:
> +               break;
> +
> +       case BPF_REDIRECT:
> +               if (!can_redirect) {
> +                       WARN_ONCE(1, "Illegal redirect return code in prog %s\n",
> +                                 lwt->name ? : "<unknown>");
> +                       ret = BPF_OK;
> +               } else {
> +                       ret = skb_do_redirect(skb);
> +                       if (ret == 0)
> +                               ret = BPF_REDIRECT;
> +               }
> +               break;
> +
> +       case BPF_DROP:
> +               kfree_skb(skb);
> +               ret = -EPERM;
> +               break;
> +
> +       default:
> +               WARN_ONCE(1, "Illegal LWT BPF return value %u, expect packet loss\n",
> +                         ret);
> +               kfree_skb(skb);
> +               ret = -EINVAL;
> +               break;
> +       }
> +
> +       preempt_enable();
> +
> +       return ret;
> +}
> +
> +static int bpf_input(struct sk_buff *skb)
> +{
> +       struct dst_entry *dst = skb_dst(skb);
> +       struct bpf_lwt *bpf;
> +       int ret;
> +
> +       bpf = bpf_lwt_lwtunnel(dst->lwtstate);
> +       if (bpf->in.prog) {
> +               ret = run_lwt_bpf(skb, &bpf->in, dst, NO_REDIRECT);
> +               if (ret < 0)
> +                       return ret;
> +       }
> +
> +       if (unlikely(!dst->lwtstate->orig_input)) {
> +               WARN_ONCE(1, "orig_input not set on dst for prog %s\n",
> +                         bpf->in.name);
> +               kfree_skb(skb);
> +               return -EINVAL;
> +       }
> +
> +       return dst->lwtstate->orig_input(skb);
> +}
> +
> +static int bpf_output(struct net *net, struct sock *sk, struct sk_buff *skb)
> +{
> +       struct dst_entry *dst = skb_dst(skb);
> +       struct bpf_lwt *bpf;
> +       int ret;
> +
> +       bpf = bpf_lwt_lwtunnel(dst->lwtstate);
> +       if (bpf->out.prog) {
> +               ret = run_lwt_bpf(skb, &bpf->out, dst, NO_REDIRECT);
> +               if (ret < 0)
> +                       return ret;
> +       }
> +
> +       if (unlikely(!dst->lwtstate->orig_output)) {
> +               WARN_ONCE(1, "orig_output not set on dst for prog %s\n",
> +                         bpf->out.name);
> +               kfree_skb(skb);
> +               return -EINVAL;
> +       }
> +
> +       return dst->lwtstate->orig_output(net, sk, skb);

Thomas,

The BPF program may have changed the destination address, so
continuing with the original route in the skb may not be appropriate
here. This was fixed in ila_lwt by calling ip6_route_output, and we
were able to use the dst cache facility to cache the route and avoid
the cost of looking it up on every packet. Since the kernel has no
insight into what the BPF program does to the packet, I'd suggest
1) checking whether the destination address was changed by BPF and,
if it was, calling route_output to get a new route, and 2) if the
LWT destination is a host route, trying to keep a dst cache. This
would entail checking on return that the destination address is the
same one as kept in the dst cache.

Tom

> +}
> +
> +static int bpf_xmit(struct sk_buff *skb)
> +{
> +       struct dst_entry *dst = skb_dst(skb);
> +       struct bpf_lwt *bpf;
> +
> +       bpf = bpf_lwt_lwtunnel(dst->lwtstate);
> +       if (bpf->xmit.prog) {
> +               int ret;
> +
> +               ret = run_lwt_bpf(skb, &bpf->xmit, dst, CAN_REDIRECT);
> +               switch (ret) {
> +               case BPF_OK:
> +                       return LWTUNNEL_XMIT_CONTINUE;
> +               case BPF_REDIRECT:
> +                       return LWTUNNEL_XMIT_DONE;
> +               default:
> +                       return ret;
> +               }
> +       }
> +
> +       return LWTUNNEL_XMIT_CONTINUE;
> +}
> +
> +static void bpf_lwt_prog_destroy(struct bpf_lwt_prog *prog)
> +{
> +       if (prog->prog)
> +               bpf_prog_put(prog->prog);
> +
> +       kfree(prog->name);
> +}
> +
> +static void bpf_destroy_state(struct lwtunnel_state *lwt)
> +{
> +       struct bpf_lwt *bpf = bpf_lwt_lwtunnel(lwt);
> +
> +       bpf_lwt_prog_destroy(&bpf->in);
> +       bpf_lwt_prog_destroy(&bpf->out);
> +       bpf_lwt_prog_destroy(&bpf->xmit);
> +}
> +
> +static const struct nla_policy bpf_prog_policy[LWT_BPF_PROG_MAX + 1] = {
> +       [LWT_BPF_PROG_FD] = { .type = NLA_U32, },
> +       [LWT_BPF_PROG_NAME] = { .type = NLA_NUL_STRING,
> +                               .len = MAX_PROG_NAME },
> +};
> +
> +static int bpf_parse_prog(struct nlattr *attr, struct bpf_lwt_prog *prog,
> +                         enum bpf_prog_type type)
> +{
> +       struct nlattr *tb[LWT_BPF_PROG_MAX + 1];
> +       struct bpf_prog *p;
> +       int ret;
> +       u32 fd;
> +
> +       ret = nla_parse_nested(tb, LWT_BPF_PROG_MAX, attr, bpf_prog_policy);
> +       if (ret < 0)
> +               return ret;
> +
> +       if (!tb[LWT_BPF_PROG_FD] || !tb[LWT_BPF_PROG_NAME])
> +               return -EINVAL;
> +
> +       prog->name = nla_memdup(tb[LWT_BPF_PROG_NAME], GFP_KERNEL);
> +       if (!prog->name)
> +               return -ENOMEM;
> +
> +       fd = nla_get_u32(tb[LWT_BPF_PROG_FD]);
> +       p = bpf_prog_get_type(fd, type);
> +       if (IS_ERR(p))
> +               return PTR_ERR(p);
> +
> +       prog->prog = p;
> +
> +       return 0;
> +}
> +
> +static const struct nla_policy bpf_nl_policy[LWT_BPF_MAX + 1] = {
> +       [LWT_BPF_IN]   = { .type = NLA_NESTED, },
> +       [LWT_BPF_OUT]  = { .type = NLA_NESTED, },
> +       [LWT_BPF_XMIT] = { .type = NLA_NESTED, },
> +};
> +
> +static int bpf_build_state(struct net_device *dev, struct nlattr *nla,
> +                          unsigned int family, const void *cfg,
> +                          struct lwtunnel_state **ts)
> +{
> +       struct nlattr *tb[LWT_BPF_MAX + 1];
> +       struct lwtunnel_state *newts;
> +       struct bpf_lwt *bpf;
> +       int ret;
> +
> +       ret = nla_parse_nested(tb, LWT_BPF_MAX, nla, bpf_nl_policy);
> +       if (ret < 0)
> +               return ret;
> +
> +       if (!tb[LWT_BPF_IN] && !tb[LWT_BPF_OUT] && !tb[LWT_BPF_XMIT])
> +               return -EINVAL;
> +
> +       newts = lwtunnel_state_alloc(sizeof(*bpf));
> +       if (!newts)
> +               return -ENOMEM;
> +
> +       newts->type = LWTUNNEL_ENCAP_BPF;
> +       bpf = bpf_lwt_lwtunnel(newts);
> +
> +       if (tb[LWT_BPF_IN]) {
> +               ret = bpf_parse_prog(tb[LWT_BPF_IN], &bpf->in,
> +                                    BPF_PROG_TYPE_LWT_IN);
> +               if (ret < 0) {
> +                       bpf_destroy_state(newts);
> +                       kfree(newts);
> +                       return ret;
> +               }
> +
> +               newts->flags |= LWTUNNEL_STATE_INPUT_REDIRECT;
> +       }
> +
> +       if (tb[LWT_BPF_OUT]) {
> +               ret = bpf_parse_prog(tb[LWT_BPF_OUT], &bpf->out,
> +                                    BPF_PROG_TYPE_LWT_OUT);
> +               if (ret < 0) {
> +                       bpf_destroy_state(newts);
> +                       kfree(newts);
> +                       return ret;
> +               }
> +
> +               newts->flags |= LWTUNNEL_STATE_OUTPUT_REDIRECT;
> +       }
> +
> +       if (tb[LWT_BPF_XMIT]) {
> +               ret = bpf_parse_prog(tb[LWT_BPF_XMIT], &bpf->xmit,
> +                                    BPF_PROG_TYPE_LWT_XMIT);
> +               if (ret < 0) {
> +                       bpf_destroy_state(newts);
> +                       kfree(newts);
> +                       return ret;
> +               }
> +
> +               newts->flags |= LWTUNNEL_STATE_XMIT_REDIRECT;
> +       }
> +
> +       *ts = newts;
> +
> +       return 0;
> +}
> +
> +static int bpf_fill_lwt_prog(struct sk_buff *skb, int attr,
> +                            struct bpf_lwt_prog *prog)
> +{
> +       struct nlattr *nest;
> +
> +       if (!prog->prog)
> +               return 0;
> +
> +       nest = nla_nest_start(skb, attr);
> +       if (!nest)
> +               return -EMSGSIZE;
> +
> +       if (prog->name &&
> +           nla_put_string(skb, LWT_BPF_PROG_NAME, prog->name))
> +               return -EMSGSIZE;
> +
> +       return nla_nest_end(skb, nest);
> +}
> +
> +static int bpf_fill_encap_info(struct sk_buff *skb, struct lwtunnel_state *lwt)
> +{
> +       struct bpf_lwt *bpf = bpf_lwt_lwtunnel(lwt);
> +
> +       if (bpf_fill_lwt_prog(skb, LWT_BPF_IN, &bpf->in) < 0 ||
> +           bpf_fill_lwt_prog(skb, LWT_BPF_OUT, &bpf->out) < 0 ||
> +           bpf_fill_lwt_prog(skb, LWT_BPF_XMIT, &bpf->xmit) < 0)
> +               return -EMSGSIZE;
> +
> +       return 0;
> +}
> +
> +static int bpf_encap_nlsize(struct lwtunnel_state *lwtstate)
> +{
> +       int nest_len = nla_total_size(sizeof(struct nlattr)) +
> +                      nla_total_size(MAX_PROG_NAME) + /* LWT_BPF_PROG_NAME */
> +                      0;
> +
> +       return nest_len + /* LWT_BPF_IN */
> +              nest_len + /* LWT_BPF_OUT */
> +              nest_len + /* LWT_BPF_XMIT */
> +              0;
> +}
> +
> +int bpf_lwt_prog_cmp(struct bpf_lwt_prog *a, struct bpf_lwt_prog *b)
> +{
> +       /* FIXME:
> +        * The LWT state is currently rebuilt for delete requests which
> +        * results in a new bpf_prog instance. Comparing names for now.
> +        */
> +       if (!a->name && !b->name)
> +               return 0;
> +
> +       if (!a->name || !b->name)
> +               return 1;
> +
> +       return strcmp(a->name, b->name);
> +}
> +
> +static int bpf_encap_cmp(struct lwtunnel_state *a, struct lwtunnel_state *b)
> +{
> +       struct bpf_lwt *a_bpf = bpf_lwt_lwtunnel(a);
> +       struct bpf_lwt *b_bpf = bpf_lwt_lwtunnel(b);
> +
> +       return bpf_lwt_prog_cmp(&a_bpf->in, &b_bpf->in) ||
> +              bpf_lwt_prog_cmp(&a_bpf->out, &b_bpf->out) ||
> +              bpf_lwt_prog_cmp(&a_bpf->xmit, &b_bpf->xmit);
> +}
> +
> +static const struct lwtunnel_encap_ops bpf_encap_ops = {
> +       .build_state    = bpf_build_state,
> +       .destroy_state  = bpf_destroy_state,
> +       .input          = bpf_input,
> +       .output         = bpf_output,
> +       .xmit           = bpf_xmit,
> +       .fill_encap     = bpf_fill_encap_info,
> +       .get_encap_size = bpf_encap_nlsize,
> +       .cmp_encap      = bpf_encap_cmp,
> +};
> +
> +static int __init bpf_lwt_init(void)
> +{
> +       return lwtunnel_encap_add_ops(&bpf_encap_ops, LWTUNNEL_ENCAP_BPF);
> +}
> +
> +subsys_initcall(bpf_lwt_init)
> diff --git a/net/core/lwtunnel.c b/net/core/lwtunnel.c
> index 88fd642..554d901 100644
> --- a/net/core/lwtunnel.c
> +++ b/net/core/lwtunnel.c
> @@ -39,6 +39,7 @@ static const char *lwtunnel_encap_str(enum lwtunnel_encap_types encap_type)
>                 return "MPLS";
>         case LWTUNNEL_ENCAP_ILA:
>                 return "ILA";
> +       case LWTUNNEL_ENCAP_BPF:
>         case LWTUNNEL_ENCAP_IP6:
>         case LWTUNNEL_ENCAP_IP:
>         case LWTUNNEL_ENCAP_NONE:
> --
> 2.7.4
>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH net-next 3/4] bpf: BPF for lightweight tunnel encapsulation
  2016-10-30 20:34   ` Tom Herbert
@ 2016-10-30 21:47     ` Thomas Graf
  2016-10-31  1:28       ` Tom Herbert
  0 siblings, 1 reply; 14+ messages in thread
From: Thomas Graf @ 2016-10-30 21:47 UTC (permalink / raw)
  To: Tom Herbert
  Cc: David S. Miller, Alexei Starovoitov, Daniel Borkmann,
	Linux Kernel Network Developers, roopa

On 10/30/16 at 01:34pm, Tom Herbert wrote:
> On Sun, Oct 30, 2016 at 4:58 AM, Thomas Graf <tgraf@suug.ch> wrote:
> > +       if (unlikely(!dst->lwtstate->orig_output)) {
> > +               WARN_ONCE(1, "orig_output not set on dst for prog %s\n",
> > +                         bpf->out.name);
> > +               kfree_skb(skb);
> > +               return -EINVAL;
> > +       }
> > +
> > +       return dst->lwtstate->orig_output(net, sk, skb);
> 
> The BPF program may have changed the destination address, so
> continuing with the original route in the skb may not be appropriate
> here. This was fixed in ila_lwt by calling ip6_route_output, and we
> were able to use the dst cache facility to cache the route and avoid
> the cost of looking it up on every packet. Since the kernel has no
> insight into what the BPF program does to the packet, I'd suggest
> 1) checking whether the destination address was changed by BPF and,
> if it was, calling route_output to get a new route, and 2) if the
> LWT destination is a host route, trying to keep a dst cache. This
> would entail checking on return that the destination address is the
> same one as kept in the dst cache.

Instead of building complex logic, we can allow the program to
return a code to indicate when to perform another route lookup,
just as we do for the redirect case. Just because the destination
address has changed does not mean another lookup is required in all
cases. A typical example would be a program rewriting addresses
covered by the default route to other addresses which are also
handled by the default route. An unconditional lookup would hurt
performance in many cases.
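
To illustrate, an lwt out program could look roughly like this
(sketch only; BPF_LWT_REROUTE is a made-up return code that does not
exist in this series, and I'm assuming bpf_helpers.h style wrappers
for the helpers):

#include <stddef.h>
#include <linux/bpf.h>
#include <linux/ip.h>
#include "bpf_helpers.h"

#define BPF_LWT_REROUTE 128     /* made up for illustration */

SEC("rewrite_dst")
int rewrite_dst(struct __sk_buff *skb)
{
        /* 10.0.0.2 in network byte order (little-endian host
         * assumed); example address only.
         */
        __u32 new_dst = 0x0200000a;
        __u32 old_dst;

        if (bpf_skb_load_bytes(skb, offsetof(struct iphdr, daddr),
                               &old_dst, sizeof(old_dst)) < 0)
                return BPF_DROP;

        bpf_l3_csum_replace(skb, offsetof(struct iphdr, check),
                            old_dst, new_dst, sizeof(new_dst));
        bpf_skb_store_bytes(skb, offsetof(struct iphdr, daddr),
                            &new_dst, sizeof(new_dst), 0);

        /* Only request a second lookup when the destination actually
         * changed; a program that knows its routes can return BPF_OK
         * even for changed addresses.
         */
        return old_dst == new_dst ? BPF_OK : BPF_LWT_REROUTE;
}

char _license[] SEC("license") = "GPL";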

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH net-next 3/4] bpf: BPF for lightweight tunnel encapsulation
  2016-10-30 21:47     ` Thomas Graf
@ 2016-10-31  1:28       ` Tom Herbert
  2016-10-31  8:19         ` Thomas Graf
  2016-10-31 12:59         ` Thomas Graf
  0 siblings, 2 replies; 14+ messages in thread
From: Tom Herbert @ 2016-10-31  1:28 UTC (permalink / raw)
  To: Thomas Graf
  Cc: David S. Miller, Alexei Starovoitov, Daniel Borkmann,
	Linux Kernel Network Developers, roopa

On Sun, Oct 30, 2016 at 2:47 PM, Thomas Graf <tgraf@suug.ch> wrote:
> On 10/30/16 at 01:34pm, Tom Herbert wrote:
>> On Sun, Oct 30, 2016 at 4:58 AM, Thomas Graf <tgraf@suug.ch> wrote:
>> > +       if (unlikely(!dst->lwtstate->orig_output)) {
>> > +               WARN_ONCE(1, "orig_output not set on dst for prog %s\n",
>> > +                         bpf->out.name);
>> > +               kfree_skb(skb);
>> > +               return -EINVAL;
>> > +       }
>> > +
>> > +       return dst->lwtstate->orig_output(net, sk, skb);
>>
>> The BPF program may have changed the destination address, so
>> continuing with the original route in the skb may not be appropriate
>> here. This was fixed in ila_lwt by calling ip6_route_output, and we
>> were able to use the dst cache facility to cache the route and avoid
>> the cost of looking it up on every packet. Since the kernel has no
>> insight into what the BPF program does to the packet, I'd suggest
>> 1) checking whether the destination address was changed by BPF and,
>> if it was, calling route_output to get a new route, and 2) if the
>> LWT destination is a host route, trying to keep a dst cache. This
>> would entail checking on return that the destination address is the
>> same one as kept in the dst cache.
>
> Instead of building complex logic, we can allow the program to
> return a code to indicate when to perform another route lookup,
> just as we do for the redirect case. Just because the destination
> address has changed does not mean another lookup is required in all
> cases. A typical example would be a program rewriting addresses
> covered by the default route to other addresses which are also
> handled by the default route. An unconditional lookup would hurt
> performance in many cases.

Right, that's why we rely on a dst cache. Any use of LWT that
encapsulates or tunnels to a fixed destination (ILA, VXLAN, IPIP,
etc.) would want to use the dst cache optimization to avoid the
second lookup. The ILA LWT code used to call orig_output, and that
worked as long as we could set the default router as the gateway
"via". It was something we were able to deploy, but not a general
solution. Integrating properly with routing gives a much better
solution IMO. Note that David Lebrun's latest LWT Segment Routing
patch does the second lookup and uses the dst cache to try to avoid
it.
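
The pattern, from memory of how ila_lwt does it (sketch; cookie and
refcount details elided, cache being a struct dst_cache set up with
dst_cache_init() when the LWT state is built):

#include <net/dst_cache.h>
#include <net/ip6_route.h>

static int cached_output(struct net *net, struct sock *sk,
                         struct sk_buff *skb, struct dst_cache *cache)
{
        struct dst_entry *dst;

        dst = dst_cache_get(cache);
        if (!dst) {
                struct flowi6 fl6 = {
                        .daddr = ipv6_hdr(skb)->daddr,
                };

                /* Miss: do the second lookup once and cache the
                 * result for subsequent packets.
                 */
                dst = ip6_route_output(net, sk, &fl6);
                if (dst->error) {
                        dst_release(dst);
                        kfree_skb(skb);
                        return -EHOSTUNREACH;
                }
                dst_cache_set_ip6(cache, dst, &fl6.saddr);
        }

        skb_dst_drop(skb);
        skb_dst_set(skb, dst);

        return dst_output(net, sk, skb);
}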

Thanks,
Tom

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH net-next 3/4] bpf: BPF for lightweight tunnel encapsulation
  2016-10-31  1:28       ` Tom Herbert
@ 2016-10-31  8:19         ` Thomas Graf
  2016-10-31 12:59         ` Thomas Graf
  1 sibling, 0 replies; 14+ messages in thread
From: Thomas Graf @ 2016-10-31  8:19 UTC (permalink / raw)
  To: Tom Herbert
  Cc: David S. Miller, Alexei Starovoitov, Daniel Borkmann,
	Linux Kernel Network Developers, roopa

On 10/30/16 at 06:28pm, Tom Herbert wrote:
> On Sun, Oct 30, 2016 at 2:47 PM, Thomas Graf <tgraf@suug.ch> wrote:
> > Instead of building complex logic, we can allow the program to
> > return a code to indicate when to perform another route lookup,
> > just as we do for the redirect case. Just because the destination
> > address has changed does not mean another lookup is required in all
> > cases. A typical example would be a program rewriting addresses
> > covered by the default route to other addresses which are also
> > handled by the default route. An unconditional lookup would hurt
> > performance in many cases.
> 
> Right, that's why we rely on a dst cache. Any use of LWT that
> encapsulates or tunnels to a fixed destination (ILA, VXLAN, IPIP,
> etc.) would want to use the dst cache optimization to avoid the
> second lookup. The ILA LWT code used to call orig_output, and that
> worked as long as we could set the default router as the gateway
> "via". It was something we were able to deploy, but not a general
> solution. Integrating properly with routing gives a much better
> solution IMO. Note that David Lebrun's latest LWT Segment Routing
> patch does the second lookup and uses the dst cache to try to avoid
> it.

Yes, I saw both the ILA and SR dst_cache usage. I was planning on
addressing the conditional reroute in a second step, but it looks
fairly simple actually, so I'm fine adding this in a v2 based on a
return code. I will limit lwt-bpf to AF_INET && AF_INET6 though.
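
i.e. bpf_build_state() would grow something like this for the v2
(sketch):

        if (family != AF_INET && family != AF_INET6)
                return -EAFNOSUPPORT;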

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH net-next 3/4] bpf: BPF for lightweight tunnel encapsulation
  2016-10-31  1:28       ` Tom Herbert
  2016-10-31  8:19         ` Thomas Graf
@ 2016-10-31 12:59         ` Thomas Graf
  2016-10-31 14:17           ` Tom Herbert
  1 sibling, 1 reply; 14+ messages in thread
From: Thomas Graf @ 2016-10-31 12:59 UTC (permalink / raw)
  To: Tom Herbert
  Cc: David S. Miller, Alexei Starovoitov, Daniel Borkmann,
	Linux Kernel Network Developers, roopa

On 10/30/16 at 06:28pm, Tom Herbert wrote:
> Right, that's why we rely on a dst cache. Any use of LWT that
> encapsulates or tunnels to a fixed destination (ILA, VXLAN, IPIP,
> etc.) would want to use the dst cache optimization to avoid the
> second lookup. The ILA LWT code used to call orig_output, and that
> worked as long as we could set the default router as the gateway
> "via". It was something we were able to deploy, but not a general
> solution. Integrating properly with routing gives a much better
> solution IMO. Note that David Lebrun's latest LWT Segment Routing
> patch does the second lookup and uses the dst cache to try to avoid
> it.

Noticed while implementing this: How does ILA ensure that dst_output()
is not invoked in a circular manner?

dstA->output() -> dstB->output() -> dstA->output() -> ...

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH net-next 3/4] bpf: BPF for lightweight tunnel encapsulation
  2016-10-31 12:59         ` Thomas Graf
@ 2016-10-31 14:17           ` Tom Herbert
  2016-10-31 15:06             ` Thomas Graf
  0 siblings, 1 reply; 14+ messages in thread
From: Tom Herbert @ 2016-10-31 14:17 UTC (permalink / raw)
  To: Thomas Graf
  Cc: David S. Miller, Alexei Starovoitov, Daniel Borkmann,
	Linux Kernel Network Developers, roopa

On Mon, Oct 31, 2016 at 5:59 AM, Thomas Graf <tgraf@suug.ch> wrote:
> On 10/30/16 at 06:28pm, Tom Herbert wrote:
>> Right, that's why we rely on a dst cache. Any use of LWT that
>> encapsulates or tunnels to a fixed destination (ILA, VXLAN, IPIP,
>> etc.) would want to use the dst cache optimization to avoid the
>> second lookup. The ILA LWT code used to call orig_output, and that
>> worked as long as we could set the default router as the gateway
>> "via". It was something we were able to deploy, but not a general
>> solution. Integrating properly with routing gives a much better
>> solution IMO. Note that David Lebrun's latest LWT Segment Routing
>> patch does the second lookup and uses the dst cache to try to avoid
>> it.
>
> Noticed while implementing this: How does ILA ensure that dst_output()
> is not invoked in a circular manner?
>
> dstA->output() -> dstB->output() -> dstA->output() -> ...

It doesn't. We'll need to add a check for that. Maybe the rule should
be that an skbuff is only allowed to hit one LWT route?

Another scenario to consider: Suppose someone is doing protocol
translation like in RFC7915. This is one of the operations we'd need with
ILA or GRE to implement an IPv4 overlay network over IPv6. Would this
be allowed/supported in LWT BPF?

Tom

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH net-next 3/4] bpf: BPF for lightweight tunnel encapsulation
  2016-10-31 14:17           ` Tom Herbert
@ 2016-10-31 15:06             ` Thomas Graf
  2016-10-31 16:07               ` Tom Herbert
  0 siblings, 1 reply; 14+ messages in thread
From: Thomas Graf @ 2016-10-31 15:06 UTC (permalink / raw)
  To: Tom Herbert
  Cc: David S. Miller, Alexei Starovoitov, Daniel Borkmann,
	Linux Kernel Network Developers, roopa

On 10/31/16 at 07:17am, Tom Herbert wrote:
> On Mon, Oct 31, 2016 at 5:59 AM, Thomas Graf <tgraf@suug.ch> wrote:
> > Noticed while implementing this: How does ILA ensure that dst_output()
> > is not invoked in a circular manner?
> >
> > dstA->output() -> dstB->output() -> dstA->output() -> ...
> 
> It doesn't. We'll need to add a check for that. Maybe the rule should
> be that an skbuff is only allowed to hit one LWT route?

I'll add a per-cpu variable to implement a recursion limit for
dst_output() which callers that may recurse can use. ILA can use
that as well.
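
Along these lines (sketch, names made up):

#include <linux/percpu.h>
#include <net/dst.h>

#define DST_RECURSION_LIMIT     2
static DEFINE_PER_CPU(int, dst_recursion);

static int dst_output_recursion(struct net *net, struct sock *sk,
                                struct sk_buff *skb)
{
        int ret;

        /* Bound how deep a dst output path may re-enter itself,
         * e.g. via BPF or ILA changing the route.
         */
        if (unlikely(__this_cpu_read(dst_recursion) >
                     DST_RECURSION_LIMIT)) {
                kfree_skb(skb);
                return -ELOOP;
        }

        __this_cpu_inc(dst_recursion);
        ret = dst_output(net, sk, skb);
        __this_cpu_dec(dst_recursion);

        return ret;
}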

> Another scenario to consider: Suppose someone is doing protocol
> translation like in RFC7915. This is one of the operations we'd need with
> ILA or GRE to implement an IPv4 overlay network over IPv6. Would this
> be allowed/supported in LWT BPF?

In lwtunnel_xmit() yes, input and output would not support this right
now. It will need some logic as the orig_input and orig_output would
obviously be expecting the same protocol. This is the reason why the
xmit prog type currently has a wider set of allowed helpers.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH net-next 3/4] bpf: BPF for lightweight tunnel encapsulation
  2016-10-31 15:06             ` Thomas Graf
@ 2016-10-31 16:07               ` Tom Herbert
  2016-10-31 17:35                 ` Thomas Graf
  0 siblings, 1 reply; 14+ messages in thread
From: Tom Herbert @ 2016-10-31 16:07 UTC (permalink / raw)
  To: Thomas Graf
  Cc: David S. Miller, Alexei Starovoitov, Daniel Borkmann,
	Linux Kernel Network Developers, roopa

On Mon, Oct 31, 2016 at 8:06 AM, Thomas Graf <tgraf@suug.ch> wrote:
> On 10/31/16 at 07:17am, Tom Herbert wrote:
>> On Mon, Oct 31, 2016 at 5:59 AM, Thomas Graf <tgraf@suug.ch> wrote:
>> > Noticed while implementing this: How does ILA ensure that dst_output()
>> > is not invoked in a circular manner?
>> >
>> > dstA->output() -> dstB->output() -> dstA->output() -> ...
>>
>> It doesn't. We'll need to add a check for that. Maybe the rule should
>> be that an skbuff is only allowed to hit one LWT route?
>
> I'll add a per-cpu variable to implement a recursion limit for
> dst_output() which callers that may recurse can use. ILA can use
> that as well.
>
>> Another scenario to consider: Suppose someone is doing protocol
>> translation like in RFC7915. This is one of the operations we'd need with
>> ILA or GRE to implement an IPv4 overlay network over IPv6. Would this
>> be allowed/supported in LWT BPF?
>
> In lwtunnel_xmit() yes, input and output would not support this right
> now. It will need some logic as the orig_input and orig_output would
> obviously be expecting the same protocol. This is the reason why the
> xmit prog type currently has a wider set of allowed helpers.

I guess this leads to a more general question I have about the
effects of allowing userspace to insert code in the kernel that
modifies packets. If we allow BPF programs to arbitrarily modify
packets in LWT, how do we ensure that there are no insidious effects
later in the path? For instance, what if someone uses BPF to convert
an IPv6 packet to IPv4, or maybe converts a packet to something that
isn't even IP, or what if someone just decides to overwrite every
byte in a packet with 0xff? Are these things allowed, and if so what
is the effect? I would assume a policy that these can't cause any
insidious effects to unrelated traffic or the rest of the system; in
particular, such things should not cause the kernel to crash (based
on the principle that user space code should never cause the kernel
to crash). I think XDP might be okay since the path is
straightforward and only deals with raw packets, but LWT is much
higher in the stack.

Thanks,
Tom

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH net-next 3/4] bpf: BPF for lightweight tunnel encapsulation
  2016-10-31 16:07               ` Tom Herbert
@ 2016-10-31 17:35                 ` Thomas Graf
  0 siblings, 0 replies; 14+ messages in thread
From: Thomas Graf @ 2016-10-31 17:35 UTC (permalink / raw)
  To: Tom Herbert
  Cc: David S. Miller, Alexei Starovoitov, Daniel Borkmann,
	Linux Kernel Network Developers, roopa

On 10/31/16 at 09:07am, Tom Herbert wrote:
> I guess this leads to a more general question I have about the effects
> of allowing userspace to insert code in the kernel that modifies
> packets. If we allow BPF programs to arbitrarily modify packets in
> LWT, how do we ensure that there are no insidious effects later in the
> path? For instance, what if someone uses BPF to convert an IPv6
> packet to IPv4, or maybe converts a packet to something that isn't
> even IP, or what if someone just decides to overwrite every byte
> in a packet with 0xff?

This is why modifying packets is not allowed on input at all as it
would invalidate the IP parsing that has already been done.

Writing is allowed for dst_output() on the basis that it is the
equivalent of a raw socket with header inclusion. If you look at
rawv6_send_hdrinc(), it does not perform any validation and calls into
dst_output() directly. I agree though that this must be made
waterproof.

Pushing additional headers is only allowed at xmit; this is the
equivalent of LWT MPLS.
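
An xmit program attaching a pre-built encapsulation header could look
roughly like this (sketch; addresses, tot_len and checksum elided,
assuming the bpf_skb_push() wrapper patch 4 adds to bpf_helpers.h):

#include <linux/bpf.h>
#include <linux/in.h>
#include <linux/ip.h>
#include "bpf_helpers.h"

SEC("xmit_encap")
int xmit_encap(struct __sk_buff *skb)
{
        /* Pre-built outer IPv4 header; saddr/daddr/tot_len/check
         * would be filled in by a real program.
         */
        struct iphdr outer = {
                .version  = 4,
                .ihl      = 5,
                .ttl      = 64,
                .protocol = IPPROTO_IPIP,
        };

        /* Make room in front of the L3 packet, then write the new
         * outer header at offset 0.
         */
        if (bpf_skb_push(skb, sizeof(outer), 0) < 0)
                return BPF_DROP;
        if (bpf_skb_store_bytes(skb, 0, &outer, sizeof(outer), 0) < 0)
                return BPF_DROP;

        return BPF_OK;
}

char _license[] SEC("license") = "GPL";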

> Are these things allowed, and if so what is the effect? I would
> assume a policy that these can't cause any insidious effects to
> unrelated traffic or the rest of the system; in particular, such
> things should not cause the kernel to crash (based on the principle
> that user space code should never cause the kernel to crash). I
> think XDP might

Agreed. Although it's already possible to hook a kernel module at LWT
or Netfilter to do arbitrary packet modifications, BPF must be held
to a higher standard even in privileged mode.

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2016-10-31 17:35 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-10-30 11:58 [PATCH net-next 0/4] BPF for lightweight tunnel encapsulation Thomas Graf
2016-10-30 11:58 ` [PATCH net-next 1/4] route: Set orig_output when redirecting to lwt on locally generated traffic Thomas Graf
2016-10-30 11:58 ` [PATCH net-next 2/4] route: Set lwtstate for local traffic and cached input dsts Thomas Graf
2016-10-30 11:58 ` [PATCH net-next 3/4] bpf: BPF for lightweight tunnel encapsulation Thomas Graf
2016-10-30 20:34   ` Tom Herbert
2016-10-30 21:47     ` Thomas Graf
2016-10-31  1:28       ` Tom Herbert
2016-10-31  8:19         ` Thomas Graf
2016-10-31 12:59         ` Thomas Graf
2016-10-31 14:17           ` Tom Herbert
2016-10-31 15:06             ` Thomas Graf
2016-10-31 16:07               ` Tom Herbert
2016-10-31 17:35                 ` Thomas Graf
2016-10-30 11:58 ` [PATCH net-next 4/4] bpf: Add samples for LWT-BPF Thomas Graf

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.