netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC bpf-next 0/7] Programming socket lookup with BPF
@ 2019-06-18 13:00 Jakub Sitnicki
  2019-06-18 13:00 ` [RFC bpf-next 1/7] bpf: Introduce inet_lookup program type Jakub Sitnicki
                   ` (7 more replies)
  0 siblings, 8 replies; 19+ messages in thread
From: Jakub Sitnicki @ 2019-06-18 13:00 UTC (permalink / raw)
  To: netdev, bpf; +Cc: kernel-team

We have been exploring an idea of making listening socket
lookup (inet_lookup) programmable with BPF.

Why? At last Netdev Marek talked [1] about two limitations of bind() API
we're hitting when running services on our edge servers:

1) sharing a port between two services

   Services are accepting connections on different (disjoint) IP ranges but
   use the same port. Say, packets to 192.0.2.0/24 tcp/80 go to NGINX,
   while 198.51.100.0/24 tcp/80 is handled by Apache. Servers are running
   as different users, in a flat single-netns setup.

2) receiving traffic on all ports

   Proxy accepts connections a specific IP range but on any port [2].

In both cases we've found that bind() and a combination of INADDR_ANY,
SO_REUSEADDR, or SO_REUSEPORT doesn't allow for the setup we need, short of
binding each service to every IP:port pair combination :-)

We've resorted at first to custom patches [3], and more recently to traffic
steering with TPROXY. Not without pain points:

 - XDP programs using bpf_sk_lookup helpers, like load balancers, can't
   find the listening socket to check for SYN cookies with TPROXY redirect.

 - TPROXY takes a reference to the listening socket on dispatch, which
   raises lock contention concerns.

 - Traffic steering configuration is split over several iptables rules, at
   least one per service, which makes configuration changes error prone.

Now back to the patch set, it introduces a new BPF program type, dubbed
inet_lookup, that runs before listening socket lookup, and can override the
destination IP:port pair used as lookup key. Program attaches to netns in
scope of which the lookup happens.

What an inet_lookup program might look like? For the mentioned scenario
with two HTTP servers sharing port 80:

#define NET1 (IP4(192,  0,   2, 0) >> 8)
#define NET2 (IP4(198, 51, 100, 0) >> 8)

SEC("inet_lookup/demo_two_http_servers")
int demo_two_http_servers(struct bpf_inet_lookup *ctx)
{
        if (ctx->family != AF_INET)
                return BPF_OK;
        if (ctx->local_port != 80)
                return BPF_OK;

        switch (bpf_ntohl(ctx->local_ip4) >> 8) {
        case NET1:
                ctx->local_ip4 = bpf_htonl(IP4(127, 0, 0, 1));
                ctx->local_port = 81;
                return BPF_REDIRECT;
        case NET2:
                ctx->local_ip4 = bpf_htonl(IP4(127, 0, 0, 1));
                ctx->local_port = 82;
                return BPF_REDIRECT;
        }

        return BPF_OK;
}

What are the downsides?

 - BPF program, if attached, runs on the receive hot path,
 - introspection is worse than for TPROXY iptables rules.

Also UDP packet steering has to be reworked. In current form we run the
inet_lookup program before checking for any connected UDP sockets, which is
unexpected.

The patches, while still in their early stages, show what we're trying to
solve. We're reaching out early for feedback to see what are the technical
concerns and if we can address them.

Just in time for the coming Netconf conference.

Thanks,
Jakub

[1] https://netdevconf.org/0x13/session.html?panel-industry-perspectives
[2] https://blog.cloudflare.com/how-we-built-spectrum/
[3] https://www.spinics.net/lists/netdev/msg370789.html


Jakub Sitnicki (7):
  bpf: Introduce inet_lookup program type
  ipv4: Run inet_lookup bpf program on socket lookup
  ipv6: Run inet_lookup bpf program on socket lookup
  bpf: Sync linux/bpf.h to tools/
  libbpf: Add support for inet_lookup program type
  bpf: Test destination address remapping with inet_lookup
  bpf: Add verifier tests for inet_lookup context access

 include/linux/bpf_types.h                     |   1 +
 include/linux/filter.h                        |  17 +
 include/net/inet6_hashtables.h                |  39 ++
 include/net/inet_hashtables.h                 |  39 ++
 include/net/net_namespace.h                   |   3 +
 include/uapi/linux/bpf.h                      |  27 +
 kernel/bpf/syscall.c                          |  10 +
 net/core/filter.c                             | 216 ++++++++
 net/ipv4/inet_hashtables.c                    |  11 +-
 net/ipv4/udp.c                                |   1 +
 net/ipv6/inet6_hashtables.c                   |  11 +-
 net/ipv6/udp.c                                |   6 +-
 tools/include/uapi/linux/bpf.h                |  27 +
 tools/lib/bpf/libbpf.c                        |   4 +
 tools/lib/bpf/libbpf.h                        |   2 +
 tools/lib/bpf/libbpf.map                      |   2 +
 tools/lib/bpf/libbpf_probes.c                 |   1 +
 tools/testing/selftests/bpf/.gitignore        |   1 +
 tools/testing/selftests/bpf/Makefile          |   6 +-
 .../selftests/bpf/progs/inet_lookup_prog.c    |  68 +++
 .../testing/selftests/bpf/test_inet_lookup.c  | 392 ++++++++++++++
 .../testing/selftests/bpf/test_inet_lookup.sh |  35 ++
 .../selftests/bpf/verifier/ctx_inet_lookup.c  | 511 ++++++++++++++++++
 23 files changed, 1418 insertions(+), 12 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/progs/inet_lookup_prog.c
 create mode 100644 tools/testing/selftests/bpf/test_inet_lookup.c
 create mode 100755 tools/testing/selftests/bpf/test_inet_lookup.sh
 create mode 100644 tools/testing/selftests/bpf/verifier/ctx_inet_lookup.c

-- 
2.20.1


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [RFC bpf-next 1/7] bpf: Introduce inet_lookup program type
  2019-06-18 13:00 [RFC bpf-next 0/7] Programming socket lookup with BPF Jakub Sitnicki
@ 2019-06-18 13:00 ` Jakub Sitnicki
  2019-06-18 13:00 ` [RFC bpf-next 2/7] ipv4: Run inet_lookup bpf program on socket lookup Jakub Sitnicki
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 19+ messages in thread
From: Jakub Sitnicki @ 2019-06-18 13:00 UTC (permalink / raw)
  To: netdev, bpf; +Cc: kernel-team, Marek Majkowski

Add a new BPF program type that attaches to the network namespace.  Its
purpose is to run before the listening socket lookup so that we can
override the destination IP and/or port from BPF.

The inet_lookup program receives the lookup 4-tuple via context as input
for making a decision. It is allowed to overwrite the local IP & port that
are used for looking up a listening socket.

The program is not called anywhere yet. We hook it up to ipv4 and ipv6
stacks in subsequent patches.

Suggested-by: Marek Majkowski <marek@cloudflare.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 include/linux/bpf_types.h   |   1 +
 include/linux/filter.h      |  17 +++
 include/net/net_namespace.h |   3 +
 include/uapi/linux/bpf.h    |  27 +++++
 kernel/bpf/syscall.c        |  10 ++
 net/core/filter.c           | 216 ++++++++++++++++++++++++++++++++++++
 6 files changed, 274 insertions(+)

diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 5a9975678d6f..9f1424146de8 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -36,6 +36,7 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_LIRC_MODE2, lirc_mode2)
 #endif
 #ifdef CONFIG_INET
 BPF_PROG_TYPE(BPF_PROG_TYPE_SK_REUSEPORT, sk_reuseport)
+BPF_PROG_TYPE(BPF_PROG_TYPE_INET_LOOKUP, inet_lookup)
 #endif
 
 BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY, array_map_ops)
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 43b45d6db36d..f826fca6cc1c 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1199,4 +1199,21 @@ struct bpf_sysctl_kern {
 	u64 tmp_reg;
 };
 
+#ifdef CONFIG_INET
+struct bpf_inet_lookup_kern {
+	unsigned short	family;
+	__be32		saddr;
+	struct in6_addr	saddr6;
+	__be16		sport;
+	__be32		daddr;
+	struct in6_addr	daddr6;
+	unsigned short	hnum;
+};
+
+int inet_lookup_attach_bpf(const union bpf_attr *attr, struct bpf_prog *prog);
+int inet_lookup_detach_bpf(const union bpf_attr *attr);
+int inet_lookup_query_bpf(const union bpf_attr *attr,
+			  union bpf_attr __user *uattr);
+#endif
+
 #endif /* __LINUX_FILTER_H__ */
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index abb4f92456e1..6f2e5ecc8b08 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -167,6 +167,9 @@ struct net {
 #endif
 #ifdef CONFIG_XDP_SOCKETS
 	struct netns_xdp	xdp;
+#endif
+#ifdef CONFIG_BPF_SYSCALL
+	struct bpf_prog __rcu	*inet_lookup_prog;
 #endif
 	struct sock		*diag_nlsk;
 	atomic_t		fnhe_genid;
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index d0a23476f887..7776f36a43d1 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -170,6 +170,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_FLOW_DISSECTOR,
 	BPF_PROG_TYPE_CGROUP_SYSCTL,
 	BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE,
+	BPF_PROG_TYPE_INET_LOOKUP,
 };
 
 enum bpf_attach_type {
@@ -192,6 +193,7 @@ enum bpf_attach_type {
 	BPF_LIRC_MODE2,
 	BPF_FLOW_DISSECTOR,
 	BPF_CGROUP_SYSCTL,
+	BPF_INET_LOOKUP,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -3066,6 +3068,31 @@ struct bpf_tcp_sock {
 				 */
 };
 
+/* User accessible data for inet_lookup programs.
+ * New fields must be added at the end.
+ */
+struct bpf_inet_lookup {
+	__u32 family;
+	__u32 remote_ip4;	/* Allows 1,2,4-byte read but no write.
+				 * Stored in network byte order.
+				 */
+	__u32 local_ip4;	/* Allows 1,2,4-byte read and 4-byte write.
+				 * Stored in network byte order.
+				 */
+	__u32 remote_ip6[4];	/* Allows 1,2,4-byte read but no write.
+				 * Stored in network byte order.
+				 */
+	__u32 local_ip6[4];	/* Allows 1,2,4-byte read and 4-byte write.
+				 * Stored in network byte order.
+				 */
+	__u32 remote_port;	/* Allows 4-byte read but no write.
+				 * Stored in network byte order.
+				 */
+	__u32 local_port;	/* Allows 4-byte read and write.
+				 * Stored in host byte order.
+				 */
+};
+
 struct bpf_sock_tuple {
 	union {
 		struct {
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 4c53cbd3329d..57813c539a41 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1916,6 +1916,9 @@ static int bpf_prog_attach(const union bpf_attr *attr)
 	case BPF_CGROUP_SYSCTL:
 		ptype = BPF_PROG_TYPE_CGROUP_SYSCTL;
 		break;
+	case BPF_INET_LOOKUP:
+		ptype = BPF_PROG_TYPE_INET_LOOKUP;
+		break;
 	default:
 		return -EINVAL;
 	}
@@ -1940,6 +1943,9 @@ static int bpf_prog_attach(const union bpf_attr *attr)
 	case BPF_PROG_TYPE_FLOW_DISSECTOR:
 		ret = skb_flow_dissector_bpf_prog_attach(attr, prog);
 		break;
+	case BPF_PROG_TYPE_INET_LOOKUP:
+		ret = inet_lookup_attach_bpf(attr, prog);
+		break;
 	default:
 		ret = cgroup_bpf_prog_attach(attr, ptype, prog);
 	}
@@ -1997,6 +2003,8 @@ static int bpf_prog_detach(const union bpf_attr *attr)
 	case BPF_CGROUP_SYSCTL:
 		ptype = BPF_PROG_TYPE_CGROUP_SYSCTL;
 		break;
+	case BPF_INET_LOOKUP:
+		return inet_lookup_detach_bpf(attr);
 	default:
 		return -EINVAL;
 	}
@@ -2036,6 +2044,8 @@ static int bpf_prog_query(const union bpf_attr *attr,
 		return lirc_prog_query(attr, uattr);
 	case BPF_FLOW_DISSECTOR:
 		return skb_flow_dissector_prog_query(attr, uattr);
+	case BPF_INET_LOOKUP:
+		return inet_lookup_query_bpf(attr, uattr);
 	default:
 		return -EINVAL;
 	}
diff --git a/net/core/filter.c b/net/core/filter.c
index 8c18f2781afa..439e8eccb018 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -8716,4 +8716,220 @@ const struct bpf_verifier_ops sk_reuseport_verifier_ops = {
 
 const struct bpf_prog_ops sk_reuseport_prog_ops = {
 };
+
+static DEFINE_MUTEX(inet_lookup_prog_mutex);
+
+int inet_lookup_attach_bpf(const union bpf_attr *attr, struct bpf_prog *prog)
+{
+	struct net *net = current->nsproxy->net_ns;
+	struct bpf_prog *attached;
+
+	mutex_lock(&inet_lookup_prog_mutex);
+	attached = rcu_dereference_protected(net->inet_lookup_prog,
+					     lockdep_is_held(&inet_lookup_prog_mutex));
+	if (attached) {
+		/* Only one BPF program can be attached at a time */
+		mutex_unlock(&inet_lookup_prog_mutex);
+		return -EEXIST;
+	}
+	rcu_assign_pointer(net->inet_lookup_prog, prog);
+	mutex_unlock(&inet_lookup_prog_mutex);
+	return 0;
+}
+
+int inet_lookup_detach_bpf(const union bpf_attr *attr)
+{
+	struct net *net = current->nsproxy->net_ns;
+	struct bpf_prog *attached;
+
+	mutex_lock(&inet_lookup_prog_mutex);
+	attached = rcu_dereference_protected(net->inet_lookup_prog,
+					     lockdep_is_held(&inet_lookup_prog_mutex));
+	if (!attached) {
+		mutex_unlock(&inet_lookup_prog_mutex);
+		return -ENOENT;
+	}
+	bpf_prog_put(attached);
+	RCU_INIT_POINTER(net->inet_lookup_prog, NULL);
+	mutex_unlock(&inet_lookup_prog_mutex);
+
+	return 0;
+}
+
+int inet_lookup_query_bpf(const union bpf_attr *attr,
+			  union bpf_attr __user *uattr)
+{
+	return -EOPNOTSUPP;	/* TODO: Not implemented. */
+}
+
+static const struct bpf_func_proto *
+inet_lookup_func_proto(enum bpf_func_id func_id,
+		       const struct bpf_prog *prog)
+{
+	return bpf_base_func_proto(func_id);
+}
+
+static bool inet_lookup_is_valid_access(int off, int size,
+					enum bpf_access_type type,
+					const struct bpf_prog *prog,
+					struct bpf_insn_access_aux *info)
+{
+	const int size_default = sizeof(__u32);
+
+	if (off < 0 || off >= sizeof(struct bpf_inet_lookup))
+		return false;
+	if (off % size != 0)
+		return false;
+
+	switch (off) {
+	case bpf_ctx_range(struct bpf_inet_lookup, remote_ip4):
+	case bpf_ctx_range_till(struct bpf_inet_lookup,
+				remote_ip6[0], remote_ip6[3]):
+		if (type != BPF_READ)
+			return false;
+		if (!bpf_ctx_narrow_access_ok(off, size, size_default))
+			return false;
+		bpf_ctx_record_field_size(info, size_default);
+		break;
+
+	case bpf_ctx_range(struct bpf_inet_lookup, local_ip4):
+	case bpf_ctx_range_till(struct bpf_inet_lookup,
+				local_ip6[0], local_ip6[3]):
+		if (type == BPF_READ) {
+			if (!bpf_ctx_narrow_access_ok(off, size, size_default))
+				return false;
+			bpf_ctx_record_field_size(info, size_default);
+		} else {
+			if (size != size_default)
+				return false;
+		}
+		break;
+
+	case bpf_ctx_range(struct bpf_inet_lookup, family):
+	case bpf_ctx_range(struct bpf_inet_lookup, remote_port):
+		if (type != BPF_READ)
+			return false;
+		if (size != size_default)
+			return false;
+		break;
+
+	case bpf_ctx_range(struct bpf_inet_lookup, local_port):
+		if (size != size_default)
+			return false;
+		break;
+
+	default:
+		return false;
+	}
+
+	return true;
+}
+
+#define LOAD_FIELD_OFF(STRUCT, FIELD, OFF) ({				\
+	*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(STRUCT, FIELD),		\
+			      si->dst_reg, si->src_reg,			\
+			      bpf_target_off(STRUCT, FIELD,		\
+					     FIELD_SIZEOF(STRUCT,	\
+							  FIELD),	\
+					     target_size) + (OFF));	\
+})
+
+#define LOAD_FIELD(STRUCT, FIELD) LOAD_FIELD_OFF(STRUCT, FIELD, 0)
+
+#define STORE_FIELD_OFF(STRUCT, FIELD, OFF) ({				\
+	*insn++ = BPF_STX_MEM(BPF_FIELD_SIZEOF(STRUCT, FIELD),		\
+			      si->dst_reg, si->src_reg,			\
+			      bpf_target_off(STRUCT, FIELD,		\
+					     FIELD_SIZEOF(STRUCT,	\
+							  FIELD),	\
+					     target_size) + (OFF));	\
+})
+
+#define STORE_FIELD(STRUCT, FIELD) STORE_FIELD_OFF(STRUCT, FIELD, 0)
+
+/* TODO: Handle 1,2-byte reads from {local,remote}_ip[46]. */
+static u32 inet_lookup_convert_ctx_access(enum bpf_access_type type,
+					  const struct bpf_insn *si,
+					  struct bpf_insn *insn_buf,
+					  struct bpf_prog *prog,
+					  u32 *target_size)
+{
+	struct bpf_insn *insn = insn_buf;
+	int off;
+
+	switch (si->off) {
+	case offsetof(struct bpf_inet_lookup, family):
+		LOAD_FIELD(struct bpf_inet_lookup_kern, family);
+		break;
+
+	case offsetof(struct bpf_inet_lookup, remote_ip4):
+		LOAD_FIELD(struct bpf_inet_lookup_kern, saddr);
+		break;
+
+	case offsetof(struct bpf_inet_lookup, local_ip4):
+		if (type == BPF_READ)
+			LOAD_FIELD(struct bpf_inet_lookup_kern, daddr);
+		else
+			STORE_FIELD(struct bpf_inet_lookup_kern, daddr);
+		break;
+
+	case bpf_ctx_range_till(struct bpf_inet_lookup,
+				remote_ip6[0], remote_ip6[3]):
+#if IS_ENABLED(CONFIG_IPV6)
+		off = si->off;
+		off -= offsetof(struct bpf_inet_lookup, remote_ip6[0]);
+
+		LOAD_FIELD_OFF(struct bpf_inet_lookup_kern,
+			       saddr6.s6_addr32[0], off);
+#else
+		(void)off;
+
+		*insn++ = BPF_MOV32_IMM(si->dst_reg, 0);
+#endif
+		break;
+
+	case bpf_ctx_range_till(struct bpf_inet_lookup,
+				local_ip6[0], local_ip6[3]):
+#if IS_ENABLED(CONFIG_IPV6)
+		off = si->off;
+		off -= offsetof(struct bpf_inet_lookup, local_ip6[0]);
+
+		if (type == BPF_READ)
+			LOAD_FIELD_OFF(struct bpf_inet_lookup_kern,
+				       daddr6.s6_addr32[0], off);
+		else
+			STORE_FIELD_OFF(struct bpf_inet_lookup_kern,
+					daddr6.s6_addr32[0], off);
+#else
+		(void)off;
+
+		if (type == BPF_READ)
+			*insn++ = BPF_MOV32_IMM(si->dst_reg, 0);
+#endif
+		break;
+
+	case offsetof(struct bpf_inet_lookup, remote_port):
+		LOAD_FIELD(struct bpf_inet_lookup_kern, sport);
+		break;
+
+	case offsetof(struct bpf_inet_lookup, local_port):
+		if (type == BPF_READ)
+			LOAD_FIELD(struct bpf_inet_lookup_kern, hnum);
+		else
+			STORE_FIELD(struct bpf_inet_lookup_kern, hnum);
+		break;
+	}
+
+	return insn - insn_buf;
+}
+
+const struct bpf_prog_ops inet_lookup_prog_ops = {
+};
+
+const struct bpf_verifier_ops inet_lookup_verifier_ops = {
+	.get_func_proto		= inet_lookup_func_proto,
+	.is_valid_access	= inet_lookup_is_valid_access,
+	.convert_ctx_access	= inet_lookup_convert_ctx_access,
+};
+
 #endif /* CONFIG_INET */
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC bpf-next 2/7] ipv4: Run inet_lookup bpf program on socket lookup
  2019-06-18 13:00 [RFC bpf-next 0/7] Programming socket lookup with BPF Jakub Sitnicki
  2019-06-18 13:00 ` [RFC bpf-next 1/7] bpf: Introduce inet_lookup program type Jakub Sitnicki
@ 2019-06-18 13:00 ` Jakub Sitnicki
  2019-06-18 13:00 ` [RFC bpf-next 3/7] ipv6: " Jakub Sitnicki
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 19+ messages in thread
From: Jakub Sitnicki @ 2019-06-18 13:00 UTC (permalink / raw)
  To: netdev, bpf; +Cc: kernel-team, Marek Majkowski

Run a BPF program before looking up the listening socket, or in case of udp
before looking up any socket, connected or not. The program is allowed to
change the destination address & port we use as keys for the lookup,
providing its return code is BPF_REDIRECT.

This allows us to redirect traffic destined to a set of addresses and ports
without bindings sockets to all the addresses and ports we want to receive
on.

Suggested-by: Marek Majkowski <marek@cloudflare.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 include/net/inet_hashtables.h | 39 +++++++++++++++++++++++++++++++++++
 net/ipv4/inet_hashtables.c    | 11 ++++++----
 net/ipv4/udp.c                |  1 +
 3 files changed, 47 insertions(+), 4 deletions(-)

diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index babb14136705..7d8b58b2ded0 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -418,4 +418,43 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 
 int inet_hash_connect(struct inet_timewait_death_row *death_row,
 		      struct sock *sk);
+
+#ifdef CONFIG_BPF_SYSCALL
+static inline void inet_lookup_run_bpf(struct net *net,
+				       const __be32 saddr,
+				       const __be16 sport,
+				       __be32 *daddr,
+				       unsigned short *hnum)
+{
+	struct bpf_inet_lookup_kern ctx = {
+		.family	= AF_INET,
+		.saddr	= saddr,
+		.sport	= sport,
+		.daddr	= *daddr,
+		.hnum	= *hnum,
+	};
+	struct bpf_prog *prog;
+	int ret = BPF_OK;
+
+	rcu_read_lock();
+	prog = rcu_dereference(net->inet_lookup_prog);
+	if (prog)
+		ret = BPF_PROG_RUN(prog, &ctx);
+	rcu_read_unlock();
+
+	if (ret == BPF_REDIRECT) {
+		*daddr = ctx.daddr;
+		*hnum = ctx.hnum;
+	}
+}
+#else
+static inline void inet_lookup_run_bpf(struct sk_buff *skb,
+				       const __be32 saddr,
+				       const __be16 sport,
+				       __be32 *daddr,
+				       unsigned short *hnum)
+{
+}
+#endif	/* CONFIG_BPF_SYSCALL */
+
 #endif /* _INET_HASHTABLES_H */
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 942265d65eb3..ae3f1da1b4f6 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -300,24 +300,27 @@ struct sock *__inet_lookup_listener(struct net *net,
 				    const int dif, const int sdif)
 {
 	struct inet_listen_hashbucket *ilb2;
+	unsigned short hnum2 = hnum;
 	struct sock *result = NULL;
+	__be32 daddr2 = daddr;
 	unsigned int hash2;
 
-	hash2 = ipv4_portaddr_hash(net, daddr, hnum);
+	inet_lookup_run_bpf(net, saddr, sport, &daddr2, &hnum2);
+	hash2 = ipv4_portaddr_hash(net, daddr2, hnum2);
 	ilb2 = inet_lhash2_bucket(hashinfo, hash2);
 
 	result = inet_lhash2_lookup(net, ilb2, skb, doff,
-				    saddr, sport, daddr, hnum,
+				    saddr, sport, daddr2, hnum2,
 				    dif, sdif);
 	if (result)
 		goto done;
 
 	/* Lookup lhash2 with INADDR_ANY */
-	hash2 = ipv4_portaddr_hash(net, htonl(INADDR_ANY), hnum);
+	hash2 = ipv4_portaddr_hash(net, htonl(INADDR_ANY), hnum2);
 	ilb2 = inet_lhash2_bucket(hashinfo, hash2);
 
 	result = inet_lhash2_lookup(net, ilb2, skb, doff,
-				    saddr, sport, htonl(INADDR_ANY), hnum,
+				    saddr, sport, htonl(INADDR_ANY), hnum2,
 				    dif, sdif);
 done:
 	if (unlikely(IS_ERR(result)))
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 8fb250ed53d4..c4f4c94525ec 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -467,6 +467,7 @@ struct sock *__udp4_lib_lookup(struct net *net, __be32 saddr,
 	struct udp_hslot *hslot2;
 	bool exact_dif = udp_lib_exact_dif_match(net, skb);
 
+	inet_lookup_run_bpf(net, saddr, sport, &daddr, &hnum);
 	hash2 = ipv4_portaddr_hash(net, daddr, hnum);
 	slot2 = hash2 & udptable->mask;
 	hslot2 = &udptable->hash2[slot2];
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC bpf-next 3/7] ipv6: Run inet_lookup bpf program on socket lookup
  2019-06-18 13:00 [RFC bpf-next 0/7] Programming socket lookup with BPF Jakub Sitnicki
  2019-06-18 13:00 ` [RFC bpf-next 1/7] bpf: Introduce inet_lookup program type Jakub Sitnicki
  2019-06-18 13:00 ` [RFC bpf-next 2/7] ipv4: Run inet_lookup bpf program on socket lookup Jakub Sitnicki
@ 2019-06-18 13:00 ` Jakub Sitnicki
  2019-06-18 13:00 ` [RFC bpf-next 4/7] bpf: Sync linux/bpf.h to tools/ Jakub Sitnicki
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 19+ messages in thread
From: Jakub Sitnicki @ 2019-06-18 13:00 UTC (permalink / raw)
  To: netdev, bpf; +Cc: kernel-team, Marek Majkowski

Following the ipv4 changes, run a BPF program attached to netns in context
of which we're doing the socket lookup so that it can rewrite the
destination IP and port we use as keys for the lookup.

The program is called before the listening socket lookup for TCP, and
before connected or not-connected socket lookup for UDP.

Suggested-by: Marek Majkowski <marek@cloudflare.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 include/net/inet6_hashtables.h | 39 ++++++++++++++++++++++++++++++++++
 net/ipv6/inet6_hashtables.c    | 11 ++++++----
 net/ipv6/udp.c                 |  6 ++++--
 3 files changed, 50 insertions(+), 6 deletions(-)

diff --git a/include/net/inet6_hashtables.h b/include/net/inet6_hashtables.h
index 9db98af46985..ab06961d33a9 100644
--- a/include/net/inet6_hashtables.h
+++ b/include/net/inet6_hashtables.h
@@ -108,6 +108,45 @@ struct sock *inet6_lookup(struct net *net, struct inet_hashinfo *hashinfo,
 			  const int dif);
 
 int inet6_hash(struct sock *sk);
+
+#ifdef CONFIG_BPF_SYSCALL
+static inline void inet6_lookup_run_bpf(struct net *net,
+					const struct in6_addr *saddr,
+					const __be16 sport,
+					struct in6_addr *daddr,
+					unsigned short *hnum)
+{
+	struct bpf_inet_lookup_kern ctx = {
+		.family	= AF_INET6,
+		.saddr6	= *saddr,
+		.sport	= sport,
+		.daddr6	= *daddr,
+		.hnum	= *hnum,
+	};
+	struct bpf_prog *prog;
+	int ret = BPF_OK;
+
+	rcu_read_lock();
+	prog = rcu_dereference(net->inet_lookup_prog);
+	if (prog)
+		ret = BPF_PROG_RUN(prog, &ctx);
+	rcu_read_unlock();
+
+	if (ret == BPF_REDIRECT) {
+		*daddr = ctx.daddr6;
+		*hnum = ctx.hnum;
+	}
+}
+#else
+static inline void inet6_lookup_run_bpf(struct net *net,
+					const struct in6_addr *saddr,
+					const __be16 sport,
+					struct in6_addr *daddr,
+					unsigned short *hnum)
+{
+}
+#endif /* CONFIG_BPF_SYSCALL */
+
 #endif /* IS_ENABLED(CONFIG_IPV6) */
 
 #define INET6_MATCH(__sk, __net, __saddr, __daddr, __ports, __dif, __sdif) \
diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
index f3515ebe9b3a..280a9b8bf914 100644
--- a/net/ipv6/inet6_hashtables.c
+++ b/net/ipv6/inet6_hashtables.c
@@ -158,24 +158,27 @@ struct sock *inet6_lookup_listener(struct net *net,
 		const unsigned short hnum, const int dif, const int sdif)
 {
 	struct inet_listen_hashbucket *ilb2;
+	struct in6_addr daddr2 = *daddr;
+	unsigned short hnum2 = hnum;
 	struct sock *result = NULL;
 	unsigned int hash2;
 
-	hash2 = ipv6_portaddr_hash(net, daddr, hnum);
+	inet6_lookup_run_bpf(net, saddr, sport, &daddr2, &hnum2);
+	hash2 = ipv6_portaddr_hash(net, &daddr2, hnum2);
 	ilb2 = inet_lhash2_bucket(hashinfo, hash2);
 
 	result = inet6_lhash2_lookup(net, ilb2, skb, doff,
-				     saddr, sport, daddr, hnum,
+				     saddr, sport, &daddr2, hnum2,
 				     dif, sdif);
 	if (result)
 		goto done;
 
 	/* Lookup lhash2 with in6addr_any */
-	hash2 = ipv6_portaddr_hash(net, &in6addr_any, hnum);
+	hash2 = ipv6_portaddr_hash(net, &in6addr_any, hnum2);
 	ilb2 = inet_lhash2_bucket(hashinfo, hash2);
 
 	result = inet6_lhash2_lookup(net, ilb2, skb, doff,
-				     saddr, sport, &in6addr_any, hnum,
+				     saddr, sport, &in6addr_any, hnum2,
 				     dif, sdif);
 done:
 	if (unlikely(IS_ERR(result)))
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 07fa579dfb96..6c0030ba83c6 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -196,17 +196,19 @@ struct sock *__udp6_lib_lookup(struct net *net,
 			       struct sk_buff *skb)
 {
 	unsigned short hnum = ntohs(dport);
+	struct in6_addr daddr2 = *daddr;
 	unsigned int hash2, slot2;
 	struct udp_hslot *hslot2;
 	struct sock *result;
 	bool exact_dif = udp6_lib_exact_dif_match(net, skb);
 
-	hash2 = ipv6_portaddr_hash(net, daddr, hnum);
+	inet6_lookup_run_bpf(net, saddr, sport, &daddr2, &hnum);
+	hash2 = ipv6_portaddr_hash(net, &daddr2, hnum);
 	slot2 = hash2 & udptable->mask;
 	hslot2 = &udptable->hash2[slot2];
 
 	result = udp6_lib_lookup2(net, saddr, sport,
-				  daddr, hnum, dif, sdif, exact_dif,
+				  &daddr2, hnum, dif, sdif, exact_dif,
 				  hslot2, skb);
 	if (!result) {
 		hash2 = ipv6_portaddr_hash(net, &in6addr_any, hnum);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC bpf-next 4/7] bpf: Sync linux/bpf.h to tools/
  2019-06-18 13:00 [RFC bpf-next 0/7] Programming socket lookup with BPF Jakub Sitnicki
                   ` (2 preceding siblings ...)
  2019-06-18 13:00 ` [RFC bpf-next 3/7] ipv6: " Jakub Sitnicki
@ 2019-06-18 13:00 ` Jakub Sitnicki
  2019-06-18 13:00 ` [RFC bpf-next 5/7] libbpf: Add support for inet_lookup program type Jakub Sitnicki
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 19+ messages in thread
From: Jakub Sitnicki @ 2019-06-18 13:00 UTC (permalink / raw)
  To: netdev, bpf; +Cc: kernel-team

Newly added program and context type is needed for tests in subsequent
patches.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 tools/include/uapi/linux/bpf.h | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index d0a23476f887..7776f36a43d1 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -170,6 +170,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_FLOW_DISSECTOR,
 	BPF_PROG_TYPE_CGROUP_SYSCTL,
 	BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE,
+	BPF_PROG_TYPE_INET_LOOKUP,
 };
 
 enum bpf_attach_type {
@@ -192,6 +193,7 @@ enum bpf_attach_type {
 	BPF_LIRC_MODE2,
 	BPF_FLOW_DISSECTOR,
 	BPF_CGROUP_SYSCTL,
+	BPF_INET_LOOKUP,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -3066,6 +3068,31 @@ struct bpf_tcp_sock {
 				 */
 };
 
+/* User accessible data for inet_lookup programs.
+ * New fields must be added at the end.
+ */
+struct bpf_inet_lookup {
+	__u32 family;
+	__u32 remote_ip4;	/* Allows 1,2,4-byte read but no write.
+				 * Stored in network byte order.
+				 */
+	__u32 local_ip4;	/* Allows 1,2,4-byte read and 4-byte write.
+				 * Stored in network byte order.
+				 */
+	__u32 remote_ip6[4];	/* Allows 1,2,4-byte read but no write.
+				 * Stored in network byte order.
+				 */
+	__u32 local_ip6[4];	/* Allows 1,2,4-byte read and 4-byte write.
+				 * Stored in network byte order.
+				 */
+	__u32 remote_port;	/* Allows 4-byte read but no write.
+				 * Stored in network byte order.
+				 */
+	__u32 local_port;	/* Allows 4-byte read and write.
+				 * Stored in host byte order.
+				 */
+};
+
 struct bpf_sock_tuple {
 	union {
 		struct {
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC bpf-next 5/7] libbpf: Add support for inet_lookup program type
  2019-06-18 13:00 [RFC bpf-next 0/7] Programming socket lookup with BPF Jakub Sitnicki
                   ` (3 preceding siblings ...)
  2019-06-18 13:00 ` [RFC bpf-next 4/7] bpf: Sync linux/bpf.h to tools/ Jakub Sitnicki
@ 2019-06-18 13:00 ` Jakub Sitnicki
  2019-06-18 13:00 ` [RFC bpf-next 6/7] bpf: Test destination address remapping with inet_lookup Jakub Sitnicki
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 19+ messages in thread
From: Jakub Sitnicki @ 2019-06-18 13:00 UTC (permalink / raw)
  To: netdev, bpf; +Cc: kernel-team

Make libbpf aware of the newly added program type. Reserve a section name
for it.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 tools/lib/bpf/libbpf.c        | 4 ++++
 tools/lib/bpf/libbpf.h        | 2 ++
 tools/lib/bpf/libbpf.map      | 2 ++
 tools/lib/bpf/libbpf_probes.c | 1 +
 4 files changed, 9 insertions(+)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index e725fa86b189..84dfdfc0a971 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -2244,6 +2244,7 @@ static bool bpf_prog_type__needs_kver(enum bpf_prog_type type)
 	case BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE:
 	case BPF_PROG_TYPE_PERF_EVENT:
 	case BPF_PROG_TYPE_CGROUP_SYSCTL:
+	case BPF_PROG_TYPE_INET_LOOKUP:
 		return false;
 	case BPF_PROG_TYPE_KPROBE:
 	default:
@@ -3110,6 +3111,7 @@ BPF_PROG_TYPE_FNS(tracepoint, BPF_PROG_TYPE_TRACEPOINT);
 BPF_PROG_TYPE_FNS(raw_tracepoint, BPF_PROG_TYPE_RAW_TRACEPOINT);
 BPF_PROG_TYPE_FNS(xdp, BPF_PROG_TYPE_XDP);
 BPF_PROG_TYPE_FNS(perf_event, BPF_PROG_TYPE_PERF_EVENT);
+BPF_PROG_TYPE_FNS(inet_lookup, BPF_PROG_TYPE_INET_LOOKUP);
 
 void bpf_program__set_expected_attach_type(struct bpf_program *prog,
 					   enum bpf_attach_type type)
@@ -3197,6 +3199,8 @@ static const struct {
 						BPF_CGROUP_UDP6_SENDMSG),
 	BPF_EAPROG_SEC("cgroup/sysctl",		BPF_PROG_TYPE_CGROUP_SYSCTL,
 						BPF_CGROUP_SYSCTL),
+	BPF_EAPROG_SEC("inet_lookup",		BPF_PROG_TYPE_INET_LOOKUP,
+						BPF_INET_LOOKUP),
 };
 
 #undef BPF_PROG_SEC_IMPL
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 2e594a0fa961..283dac0f6d13 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -240,6 +240,7 @@ LIBBPF_API int bpf_program__set_sched_cls(struct bpf_program *prog);
 LIBBPF_API int bpf_program__set_sched_act(struct bpf_program *prog);
 LIBBPF_API int bpf_program__set_xdp(struct bpf_program *prog);
 LIBBPF_API int bpf_program__set_perf_event(struct bpf_program *prog);
+LIBBPF_API int bpf_program__set_inet_lookup(struct bpf_program *prog);
 LIBBPF_API void bpf_program__set_type(struct bpf_program *prog,
 				      enum bpf_prog_type type);
 LIBBPF_API void
@@ -254,6 +255,7 @@ LIBBPF_API bool bpf_program__is_sched_cls(struct bpf_program *prog);
 LIBBPF_API bool bpf_program__is_sched_act(struct bpf_program *prog);
 LIBBPF_API bool bpf_program__is_xdp(struct bpf_program *prog);
 LIBBPF_API bool bpf_program__is_perf_event(struct bpf_program *prog);
+LIBBPF_API bool bpf_program__is_inet_lookup(struct bpf_program *prog);
 
 /*
  * No need for __attribute__((packed)), all members of 'bpf_map_def'
diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
index 2c6d835620d2..e55d8e5d6fd4 100644
--- a/tools/lib/bpf/libbpf.map
+++ b/tools/lib/bpf/libbpf.map
@@ -67,6 +67,7 @@ LIBBPF_0.0.1 {
 		bpf_prog_test_run;
 		bpf_prog_test_run_xattr;
 		bpf_program__fd;
+		bpf_program__is_inet_lookup;
 		bpf_program__is_kprobe;
 		bpf_program__is_perf_event;
 		bpf_program__is_raw_tracepoint;
@@ -84,6 +85,7 @@ LIBBPF_0.0.1 {
 		bpf_program__priv;
 		bpf_program__set_expected_attach_type;
 		bpf_program__set_ifindex;
+		bpf_program__set_inet_lookup;
 		bpf_program__set_kprobe;
 		bpf_program__set_perf_event;
 		bpf_program__set_prep;
diff --git a/tools/lib/bpf/libbpf_probes.c b/tools/lib/bpf/libbpf_probes.c
index 5e2aa83f637a..5094f32d33a7 100644
--- a/tools/lib/bpf/libbpf_probes.c
+++ b/tools/lib/bpf/libbpf_probes.c
@@ -101,6 +101,7 @@ probe_load(enum bpf_prog_type prog_type, const struct bpf_insn *insns,
 	case BPF_PROG_TYPE_SK_REUSEPORT:
 	case BPF_PROG_TYPE_FLOW_DISSECTOR:
 	case BPF_PROG_TYPE_CGROUP_SYSCTL:
+	case BPF_PROG_TYPE_INET_LOOKUP:
 	default:
 		break;
 	}
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC bpf-next 6/7] bpf: Test destination address remapping with inet_lookup
  2019-06-18 13:00 [RFC bpf-next 0/7] Programming socket lookup with BPF Jakub Sitnicki
                   ` (4 preceding siblings ...)
  2019-06-18 13:00 ` [RFC bpf-next 5/7] libbpf: Add support for inet_lookup program type Jakub Sitnicki
@ 2019-06-18 13:00 ` Jakub Sitnicki
  2019-06-18 13:00 ` [RFC bpf-next 7/7] bpf: Add verifier tests for inet_lookup context access Jakub Sitnicki
  2019-06-18 13:52 ` [RFC bpf-next 0/7] Programming socket lookup with BPF Florian Westphal
  7 siblings, 0 replies; 19+ messages in thread
From: Jakub Sitnicki @ 2019-06-18 13:00 UTC (permalink / raw)
  To: netdev, bpf; +Cc: kernel-team

Check if destination IP address and/or port remapping happens when an
inet_lookup program is attached to the net namespace.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 tools/testing/selftests/bpf/.gitignore        |   1 +
 tools/testing/selftests/bpf/Makefile          |   6 +-
 .../selftests/bpf/progs/inet_lookup_prog.c    |  68 +++
 .../testing/selftests/bpf/test_inet_lookup.c  | 392 ++++++++++++++++++
 .../testing/selftests/bpf/test_inet_lookup.sh |  35 ++
 5 files changed, 500 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/progs/inet_lookup_prog.c
 create mode 100644 tools/testing/selftests/bpf/test_inet_lookup.c
 create mode 100755 tools/testing/selftests/bpf/test_inet_lookup.sh

diff --git a/tools/testing/selftests/bpf/.gitignore b/tools/testing/selftests/bpf/.gitignore
index 7470327edcfe..c1492cb188e5 100644
--- a/tools/testing/selftests/bpf/.gitignore
+++ b/tools/testing/selftests/bpf/.gitignore
@@ -39,3 +39,4 @@ libbpf.so.*
 test_hashmap
 test_btf_dump
 xdping
+test_inet_lookup
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index ae5c3e576c2f..aa68dc091888 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -61,7 +61,8 @@ TEST_PROGS := test_kmod.sh \
 	test_tcp_check_syncookie.sh \
 	test_tc_tunnel.sh \
 	test_tc_edt.sh \
-	test_xdping.sh
+	test_xdping.sh \
+	test_inet_lookup.sh
 
 TEST_PROGS_EXTENDED := with_addr.sh \
 	with_tunnels.sh \
@@ -70,7 +71,8 @@ TEST_PROGS_EXTENDED := with_addr.sh \
 
 # Compile but not part of 'make run_tests'
 TEST_GEN_PROGS_EXTENDED = test_libbpf_open test_sock_addr test_skb_cgroup_id_user \
-	flow_dissector_load test_flow_dissector test_tcp_check_syncookie_user
+	flow_dissector_load test_flow_dissector test_tcp_check_syncookie_user \
+	test_inet_lookup
 
 include ../lib.mk
 
diff --git a/tools/testing/selftests/bpf/progs/inet_lookup_prog.c b/tools/testing/selftests/bpf/progs/inet_lookup_prog.c
new file mode 100644
index 000000000000..5f3d2b1f94eb
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/inet_lookup_prog.c
@@ -0,0 +1,68 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/bpf.h>
+#include <sys/socket.h>
+
+#include "bpf_endian.h"
+#include "bpf_helpers.h"
+
+#define IP4(a, b, c, d)	((__u32)(		\
+	((__u32)((a) & (__u32)0xffUL) << 24) |	\
+	((__u32)((b) & (__u32)0xffUL) << 16) |	\
+	((__u32)((c) & (__u32)0xffUL) <<  8) |	\
+	((__u32)((d) & (__u32)0xffUL) <<  0)))
+
+static const __u32 CONNECT_PORT = 7007;
+static const __u32 LISTEN_PORT  = 8008;
+
+static const __u32 CONNECT_IP4 = IP4(127, 0, 0, 1);
+static const __u32 LISTEN_IP4  = IP4(127, 0, 0, 2);
+
+/* Remap destination port CONNECT_PORT -> LISTEN_PORT. */
+SEC("inet_lookup/remap_port")
+int inet4_remap_port(struct bpf_inet_lookup *ctx)
+{
+	if (ctx->local_port != CONNECT_PORT)
+		return BPF_OK;
+
+	ctx->local_port = LISTEN_PORT;
+	return BPF_REDIRECT;
+}
+
+/* Remap destination IP CONNECT_IP4 -> LISTEN_IP4. */
+SEC("inet_lookup/remap_ip4")
+int inet4_remap_ip(struct bpf_inet_lookup *ctx)
+{
+	if (ctx->family != AF_INET)
+		return BPF_OK;
+	if (ctx->local_port != CONNECT_PORT)
+		return BPF_OK;
+	if (ctx->local_ip4 != bpf_htonl(CONNECT_IP4))
+		return BPF_OK;
+
+	ctx->local_ip4 = bpf_htonl(LISTEN_IP4);
+	return BPF_REDIRECT;
+}
+
+/* Remap destination IP CONNECT_IP6 -> LISTEN_IP6. */
+SEC("inet_lookup/remap_ip6")
+int inet6_remap_ip(struct bpf_inet_lookup *ctx)
+{
+	if (ctx->family != AF_INET6)
+		return BPF_OK;
+	if (ctx->local_port != CONNECT_PORT)
+		return BPF_OK;
+	if (ctx->local_ip6[0] != bpf_htonl(0xfd000000) ||
+	    ctx->local_ip6[1] != bpf_htonl(0x00000000) ||
+	    ctx->local_ip6[2] != bpf_htonl(0x00000000) ||
+	    ctx->local_ip6[3] != bpf_htonl(0x00000001))
+		return BPF_OK;
+
+	ctx->local_ip6[0] = bpf_htonl(0xfd000000);
+	ctx->local_ip6[1] = bpf_htonl(0x00000000);
+	ctx->local_ip6[2] = bpf_htonl(0x00000000);
+	ctx->local_ip6[3] = bpf_htonl(0x00000002);
+	return BPF_REDIRECT;
+}
+
+char _license[] SEC("license") = "GPL";
+__u32 _version SEC("version") = 1;
diff --git a/tools/testing/selftests/bpf/test_inet_lookup.c b/tools/testing/selftests/bpf/test_inet_lookup.c
new file mode 100644
index 000000000000..d4e440655252
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_inet_lookup.c
@@ -0,0 +1,392 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Echo test with the server not receiving at the same IP:port as the
+ * client sends the request to. Use BPF inet_lookup program to remap
+ * IP/port on socket lookup and direct the packets to the server.
+ */
+
+#include <arpa/inet.h>
+#include <errno.h>
+#include <error.h>
+#include <stdio.h>
+#include <unistd.h>
+
+#include <bpf/libbpf.h>
+#include <bpf/bpf.h>
+
+#include "bpf_rlimit.h"
+#include "bpf_util.h"
+
+#define BPF_FILE	"./inet_lookup_prog.o"
+#define MAX_ERROR_LEN	256
+
+#define EXT_IP4		"127.0.0.1"
+#define INT_IP4		"127.0.0.2"
+#define EXT_IP6		"fd00::1"
+#define INT_IP6		"fd00::2"
+#define EXT_PORT	7007
+#define INT_PORT	8008
+
+struct inet_addr {
+	const char *ip;
+	unsigned short port;
+};
+
+struct test {
+	const char *desc;
+	const char *bpf_prog;
+
+	struct {
+		int family;
+		int type;
+	} socket;
+
+	struct inet_addr recv_at;
+	struct inet_addr send_to;
+};
+
+static const struct test tests[] = {
+	{
+		.desc		= "TCP IPv4 remap port",
+		.bpf_prog	= "inet_lookup/remap_port",
+		.socket		= { AF_INET, SOCK_STREAM },
+		.recv_at	= { EXT_IP4, INT_PORT },
+		.send_to	= { EXT_IP4, EXT_PORT },
+	},
+	{
+		.desc		= "TCP IPv4 remap IP",
+		.bpf_prog	= "inet_lookup/remap_ip4",
+		.socket		= { AF_INET, SOCK_STREAM },
+		.recv_at	= { INT_IP4, EXT_PORT },
+		.send_to	= { EXT_IP4, EXT_PORT },
+	},
+	{
+		.desc		= "TCP IPv6 remap port",
+		.bpf_prog	= "inet_lookup/remap_port",
+		.socket		= { AF_INET6, SOCK_STREAM },
+		.recv_at	= { EXT_IP6, INT_PORT },
+		.send_to	= { EXT_IP6, EXT_PORT },
+	},
+	{
+		.desc		= "TCP IPv6 remap IP",
+		.bpf_prog	= "inet_lookup/remap_ip6",
+		.socket		= { AF_INET6, SOCK_STREAM },
+		.recv_at	= { INT_IP6, EXT_PORT },
+		.send_to	= { EXT_IP6, EXT_PORT },
+	},
+	{
+		.desc		= "UDP IPv4 remap port",
+		.bpf_prog	= "inet_lookup/remap_port",
+		.socket		= { AF_INET, SOCK_DGRAM },
+		.recv_at	= { EXT_IP4, INT_PORT },
+		.send_to	= { EXT_IP4, EXT_PORT },
+	},
+	{
+		.desc		= "UDP IPv4 remap IP",
+		.bpf_prog	= "inet_lookup/remap_ip4",
+		.socket		= { AF_INET, SOCK_DGRAM },
+		.recv_at	= { INT_IP4, EXT_PORT },
+		.send_to	= { EXT_IP4, EXT_PORT },
+	},
+	{
+		.desc		= "UDP IPv6 remap port",
+		.bpf_prog	= "inet_lookup/remap_port",
+		.socket		= { AF_INET6, SOCK_DGRAM },
+		.recv_at	= { EXT_IP6, INT_PORT },
+		.send_to	= { EXT_IP6, EXT_PORT },
+	},
+	{
+		.desc		= "UDP IPv6 remap IP",
+		.bpf_prog	= "inet_lookup/remap_ip6",
+		.socket		= { AF_INET6, SOCK_DGRAM },
+		.recv_at	= { INT_IP6, EXT_PORT },
+		.send_to	= { EXT_IP6, EXT_PORT },
+	},
+};
+
+static void make_addr(int family, const char *ip, int port,
+		      struct sockaddr_storage *ss, int *sz)
+{
+	struct sockaddr_in *addr4;
+	struct sockaddr_in6 *addr6;
+
+	switch (family) {
+	case AF_INET:
+		addr4 = (struct sockaddr_in *)ss;
+		addr4->sin_family = AF_INET;
+		addr4->sin_port = htons(port);
+		if (!inet_pton(AF_INET, ip, &addr4->sin_addr))
+			error(1, errno, "inet_pton failed: %s", ip);
+		*sz = sizeof(*addr4);
+		break;
+	case AF_INET6:
+		addr6 = (struct sockaddr_in6 *)ss;
+		addr6->sin6_family = AF_INET6;
+		addr6->sin6_port = htons(port);
+		if (!inet_pton(AF_INET6, ip, &addr6->sin6_addr))
+			error(1, errno, "inet_pton failed: %s", ip);
+		*sz = sizeof(*addr6);
+		break;
+	default:
+		error(1, 0, "unsupported family %d", family);
+	}
+}
+
+static int make_server(int family, int type, const char *ip, int port)
+{
+	struct sockaddr_storage ss = {0};
+	int fd, opt, sz;
+
+	make_addr(family, ip, port, &ss, &sz);
+
+	fd = socket(family, type, 0);
+	if (fd < 0)
+		error(1, errno, "failed to create listen socket");
+
+	opt = 1;
+	if (setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &opt, sizeof(opt)))
+		error(1, errno, "failed to set SO_REUSEPORT");
+	if (family == AF_INET && type == SOCK_DGRAM) {
+		if (setsockopt(fd, SOL_IP, IP_RECVORIGDSTADDR,
+			       &opt, sizeof(opt)))
+			error(1, errno, "failed to set IP_RECVORIGDSTADDR");
+	}
+	if (family == AF_INET6 && type == SOCK_DGRAM) {
+		if (setsockopt(fd, SOL_IPV6, IPV6_RECVORIGDSTADDR,
+			       &opt, sizeof(opt)))
+			error(1, errno, "failed to set IPV6_RECVORIGDSTADDR");
+	}
+
+	if (bind(fd, (struct sockaddr *)&ss, sz))
+		error(1, errno, "failed to bind listen socket");
+
+	if (type == SOCK_STREAM && listen(fd, 1))
+		error(1, errno, "failed to listen on port %d", port);
+
+	return fd;
+}
+
+static int make_client(int family, int type, const char *ip, int port)
+{
+	struct sockaddr_storage ss = {0};
+	struct sockaddr *sa;
+	int fd, sz;
+
+	make_addr(family, ip, port, &ss, &sz);
+	sa = (struct sockaddr *)&ss;
+
+	fd = socket(family, type, 0);
+	if (fd < 0)
+		error(1, errno, "failed to create socket");
+
+	if (connect(fd, sa, sz))
+		error(1, errno, "failed to connect socket");
+
+	return fd;
+}
+
+static void send_byte(int fd)
+{
+	if (send(fd, "a", 1, 0) < 1)
+		error(1, errno, "failed to send message");
+}
+
+static void recv_byte(int fd)
+{
+	char buf[1];
+
+	if (recv(fd, buf, sizeof(buf), 0) < 1)
+		error(1, errno, "failed to receive message");
+}
+
+static void tcp_recv_send(int server_fd)
+{
+	char buf[1];
+	size_t len;
+	ssize_t n;
+	int fd;
+
+	fd = accept(server_fd, NULL, NULL);
+	if (fd < 0)
+		error(1, errno, "failed to accept");
+
+	len = sizeof(buf);
+	n = recv(fd, buf, len, 0);
+	if (n < 0)
+		error(1, errno, "failed to receive");
+	if (n < len)
+		error(1, 0, "partial receive");
+
+	n = send(fd, buf, len, 0);
+	if (n < 0)
+		error(1, errno, "failed to send");
+	if (n < len)
+		error(1, 0, "partial send");
+
+	close(fd);
+}
+
+static void udp_recv_send(int server_fd)
+{
+	char cmsg_buf[CMSG_SPACE(sizeof(struct sockaddr_storage))];
+	struct sockaddr_storage *dst_addr = NULL;
+	struct sockaddr_storage src_addr;
+	struct msghdr msg = { 0 };
+	struct iovec iov = { 0 };
+	struct cmsghdr *cm;
+	char buf[1];
+	ssize_t n;
+	int fd;
+
+	iov.iov_base = buf;
+	iov.iov_len = sizeof(buf);
+
+	msg.msg_name = &src_addr;
+	msg.msg_namelen = sizeof(src_addr);
+	msg.msg_iov = &iov;
+	msg.msg_iovlen = 1;
+	msg.msg_control = cmsg_buf;
+	msg.msg_controllen = sizeof(cmsg_buf);
+
+	n = recvmsg(server_fd, &msg, 0);
+	if (n < 0)
+		error(1, errno, "failed to receive");
+	if (n < sizeof(buf))
+		error(1, 0, "partial receive");
+	if (msg.msg_flags & MSG_CTRUNC)
+		error(1, errno, "truncated cmsg");
+
+	for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
+		if ((cm->cmsg_level == SOL_IP &&
+		     cm->cmsg_type == IP_ORIGDSTADDR) ||
+		    (cm->cmsg_level == SOL_IPV6 &&
+		     cm->cmsg_type == IPV6_ORIGDSTADDR)) {
+			dst_addr = (struct sockaddr_storage *)CMSG_DATA(cm);
+			break;
+		}
+		error(0, 0, "ignored cmsg at level %d type %d",
+		      cm->cmsg_level, cm->cmsg_type);
+	}
+	if (!dst_addr)
+		error(1, 0, "failed to get destination address");
+
+	/* Reply from original destination address. */
+	fd = socket(dst_addr->ss_family, SOCK_DGRAM, 0);
+	if (fd < 0)
+		error(1, errno, "failed to create socket");
+
+	if (bind(fd, (struct sockaddr *)dst_addr, sizeof(*dst_addr)))
+		error(1, errno, "failed to bind socket");
+
+	msg.msg_control = NULL;
+	msg.msg_controllen = 0;
+	n = sendmsg(fd, &msg, 0);
+	if (n < 0)
+		error(1, errno, "failed to send");
+	if (n < sizeof(buf))
+		error(1, 0, "partial send");
+
+	close(fd);
+}
+
+static void tcp_echo(int client_fd, int server_fd)
+{
+	send_byte(client_fd);
+	tcp_recv_send(server_fd);
+	recv_byte(client_fd);
+}
+
+static void udp_echo(int client_fd, int server_fd)
+{
+	send_byte(client_fd);
+	udp_recv_send(server_fd);
+	recv_byte(client_fd);
+}
+
+static struct bpf_object *load_progs(void)
+{
+	char buf[MAX_ERROR_LEN];
+	struct bpf_object *obj;
+	int prog_fd;
+	int err;
+
+	err = bpf_prog_load(BPF_FILE, BPF_PROG_TYPE_UNSPEC, &obj, &prog_fd);
+	if (err) {
+		libbpf_strerror(err, buf, ARRAY_SIZE(buf));
+		error(1, 0, "failed to open bpf file: %s", buf);
+	}
+
+	return obj;
+}
+
+static void attach_prog(struct bpf_object *obj, const char *sec)
+{
+	enum bpf_attach_type attach_type;
+	struct bpf_program *prog;
+	char buf[MAX_ERROR_LEN];
+	int target_fd = -1;
+	int prog_fd;
+	int err;
+
+	prog = bpf_object__find_program_by_title(obj, sec);
+	err = libbpf_get_error(prog);
+	if (err) {
+		libbpf_strerror(err, buf, ARRAY_SIZE(buf));
+		error(1, 0, "failed to find section \"%s\": %s", sec, buf);
+	}
+
+	err = libbpf_attach_type_by_name(sec, &attach_type);
+	if (err) {
+		libbpf_strerror(err, buf, ARRAY_SIZE(buf));
+		error(1, 0, "failed to identify attach type: %s", buf);
+	}
+
+	prog_fd = bpf_program__fd(prog);
+	if (prog_fd < 0)
+		error(1, errno, "failed to get prog fd");
+
+	err = bpf_prog_detach(target_fd, attach_type);
+	if (err && err != -EPERM)
+		error(1, -err, "failed to detach prog");
+
+	err = bpf_prog_attach(prog_fd, target_fd, attach_type, 0);
+	if (err)
+		error(1, -err, "failed to attach prog");
+}
+
+static void run_test(const struct test *t, struct bpf_object *obj)
+{
+	int client_fd, server_fd;
+
+	fprintf(stderr, "%s\n", t->desc);
+	attach_prog(obj, t->bpf_prog);
+
+	server_fd = make_server(t->socket.family, t->socket.type,
+				t->recv_at.ip, t->recv_at.port);
+	client_fd = make_client(t->socket.family, t->socket.type,
+				t->send_to.ip, t->send_to.port);
+
+	if (t->socket.type == SOCK_STREAM)
+		tcp_echo(client_fd, server_fd);
+	else
+		udp_echo(client_fd, server_fd);
+
+	close(client_fd);
+	close(server_fd);
+}
+
+int main(void)
+{
+	struct bpf_object *obj;
+	const struct test *t;
+
+	obj = load_progs();
+
+	for (t = tests; t < tests + ARRAY_SIZE(tests); t++)
+		run_test(t, obj);
+
+	bpf_object__unload(obj);
+
+	fprintf(stderr, "PASS\n");
+	return 0;
+}
diff --git a/tools/testing/selftests/bpf/test_inet_lookup.sh b/tools/testing/selftests/bpf/test_inet_lookup.sh
new file mode 100755
index 000000000000..5efb42fbdf59
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_inet_lookup.sh
@@ -0,0 +1,35 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+if [[ $EUID -ne 0 ]]; then
+        echo "This script must be run as root"
+        echo "FAIL"
+        exit 1
+fi
+
+# Run the script in a dedicated network namespace.
+if [[ -z $(ip netns identify $$) ]]; then
+        ../net/in_netns.sh "$0" "$@"
+        exit $?
+fi
+
+readonly IP6_1="fd00::1"
+readonly IP6_2="fd00::2"
+
+setup()
+{
+        ip -6 addr add ${IP6_1}/128 dev lo
+        ip -6 addr add ${IP6_2}/128 dev lo
+}
+
+cleanup()
+{
+        ip -6 addr del ${IP6_1}/128 dev lo
+        ip -6 addr del ${IP6_2}/128 dev lo
+}
+
+trap cleanup EXIT
+setup
+
+./test_inet_lookup
+exit $?
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [RFC bpf-next 7/7] bpf: Add verifier tests for inet_lookup context access
  2019-06-18 13:00 [RFC bpf-next 0/7] Programming socket lookup with BPF Jakub Sitnicki
                   ` (5 preceding siblings ...)
  2019-06-18 13:00 ` [RFC bpf-next 6/7] bpf: Test destination address remapping with inet_lookup Jakub Sitnicki
@ 2019-06-18 13:00 ` Jakub Sitnicki
  2019-06-18 13:52 ` [RFC bpf-next 0/7] Programming socket lookup with BPF Florian Westphal
  7 siblings, 0 replies; 19+ messages in thread
From: Jakub Sitnicki @ 2019-06-18 13:00 UTC (permalink / raw)
  To: netdev, bpf; +Cc: kernel-team

Exercise verifier access checks for bpf_inet_lookup context object fields.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
 .../selftests/bpf/verifier/ctx_inet_lookup.c  | 511 ++++++++++++++++++
 1 file changed, 511 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/verifier/ctx_inet_lookup.c

diff --git a/tools/testing/selftests/bpf/verifier/ctx_inet_lookup.c b/tools/testing/selftests/bpf/verifier/ctx_inet_lookup.c
new file mode 100644
index 000000000000..b4555fb03e17
--- /dev/null
+++ b/tools/testing/selftests/bpf/verifier/ctx_inet_lookup.c
@@ -0,0 +1,511 @@
+{
+	"valid 1,2,4-byte read bpf_inet_lookup remote_ip4",
+	.insns = {
+		/* 4-byte read */
+		BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_inet_lookup, remote_ip4)),
+		/* 2-byte read */
+		BPF_LDX_MEM(BPF_H, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_inet_lookup, remote_ip4)),
+		BPF_LDX_MEM(BPF_H, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_inet_lookup, remote_ip4) + 2),
+		/* 1-byte read */
+		BPF_LDX_MEM(BPF_B, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_inet_lookup, remote_ip4)),
+		BPF_LDX_MEM(BPF_B, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_inet_lookup, remote_ip4) + 3),
+		BPF_EXIT_INSN(),
+	},
+	.result = ACCEPT,
+	.prog_type = BPF_PROG_TYPE_INET_LOOKUP,
+},
+{
+	"invalid 8-byte read bpf_inet_lookup remote_ip4",
+	.insns = {
+		/* 8-byte read */
+		BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_inet_lookup, remote_ip4)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_INET_LOOKUP,
+},
+{
+	"invalid 8-byte write bpf_inet_lookup remote_ip4",
+	.insns = {
+		BPF_MOV64_IMM(BPF_REG_0, 0x7f000001U),
+		/* 4-byte write */
+		BPF_STX_MEM(BPF_DW, BPF_REG_1, BPF_REG_0,
+			    offsetof(struct bpf_inet_lookup, remote_ip4)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_INET_LOOKUP,
+},
+{
+	"invalid 4-byte write bpf_inet_lookup remote_ip4",
+	.insns = {
+		BPF_MOV64_IMM(BPF_REG_0, 0x7f000001U),
+		/* 4-byte write */
+		BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0,
+			    offsetof(struct bpf_inet_lookup, remote_ip4)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_INET_LOOKUP,
+},
+{
+	"invalid 2-byte write bpf_inet_lookup remote_ip4",
+	.insns = {
+		BPF_MOV64_IMM(BPF_REG_0, 0x7f000001U),
+		/* 2-byte write */
+		BPF_STX_MEM(BPF_H, BPF_REG_1, BPF_REG_0,
+			    offsetof(struct bpf_inet_lookup, remote_ip4)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_INET_LOOKUP,
+},
+{
+	"invalid 1-byte write bpf_inet_lookup remote_ip4",
+	.insns = {
+		BPF_MOV64_IMM(BPF_REG_0, 0x7f000001U),
+		/* 1-byte write */
+		BPF_STX_MEM(BPF_B, BPF_REG_1, BPF_REG_0,
+			    offsetof(struct bpf_inet_lookup, remote_ip4)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_INET_LOOKUP,
+},
+{
+	"valid 1,2,4-byte read bpf_inet_lookup local_ip4",
+	.insns = {
+		/* 4-byte read */
+		BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_inet_lookup, local_ip4)),
+		/* 2-byte read */
+		BPF_LDX_MEM(BPF_H, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_inet_lookup, local_ip4)),
+		BPF_LDX_MEM(BPF_H, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_inet_lookup, local_ip4) + 2),
+		/* 1-byte read */
+		BPF_LDX_MEM(BPF_B, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_inet_lookup, local_ip4)),
+		BPF_LDX_MEM(BPF_B, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_inet_lookup, local_ip4) + 3),
+		BPF_MOV64_IMM(BPF_REG_0, 0),
+		BPF_EXIT_INSN(),
+	},
+	.result = ACCEPT,
+	.prog_type = BPF_PROG_TYPE_INET_LOOKUP,
+},
+{
+	"invalid 8-byte read bpf_inet_lookup local_ip4",
+	.insns = {
+		BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_inet_lookup, local_ip4)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_INET_LOOKUP,
+},
+{
+	"valid 4-byte write bpf_inet_lookup local_ip4",
+	.insns = {
+		BPF_MOV64_IMM(BPF_REG_0, 0x7f000001U),
+		BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0,
+			    offsetof(struct bpf_inet_lookup, local_ip4)),
+		BPF_EXIT_INSN(),
+	},
+	.result = ACCEPT,
+	.prog_type = BPF_PROG_TYPE_INET_LOOKUP,
+},
+{
+	"invalid 8-byte write bpf_inet_lookup local_ip4",
+	.insns = {
+		BPF_MOV64_IMM(BPF_REG_0, 0x7f000001U),
+		BPF_STX_MEM(BPF_DW, BPF_REG_1, BPF_REG_0,
+			    offsetof(struct bpf_inet_lookup, local_ip4)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_INET_LOOKUP,
+},
+{
+	"invalid 2-byte write bpf_inet_lookup local_ip4",
+	.insns = {
+		BPF_MOV64_IMM(BPF_REG_0, 0x7f000001U),
+		BPF_STX_MEM(BPF_H, BPF_REG_1, BPF_REG_0,
+			    offsetof(struct bpf_inet_lookup, local_ip4)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_INET_LOOKUP,
+},
+{
+	"invalid 1-byte write bpf_inet_lookup local_ip4",
+	.insns = {
+		BPF_MOV64_IMM(BPF_REG_0, 0x7f000001U),
+		BPF_STX_MEM(BPF_B, BPF_REG_1, BPF_REG_0,
+			    offsetof(struct bpf_inet_lookup, local_ip4)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_INET_LOOKUP,
+},
+{
+	"valid 1,2,4-byte read bpf_inet_lookup remote_ip6",
+	.insns = {
+		/* 4-byte read */
+		BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_inet_lookup, remote_ip6[0])),
+		BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_inet_lookup, remote_ip6[3])),
+		/* 2-byte read */
+		BPF_LDX_MEM(BPF_H, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_inet_lookup, remote_ip6[0])),
+		BPF_LDX_MEM(BPF_H, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_inet_lookup,
+				     remote_ip6[3]) + 2),
+		/* 1-byte read */
+		BPF_LDX_MEM(BPF_B, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_inet_lookup, remote_ip6[0])),
+		BPF_LDX_MEM(BPF_B, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_inet_lookup,
+				     remote_ip6[3]) + 3),
+		BPF_EXIT_INSN(),
+	},
+	.result = ACCEPT,
+	.prog_type = BPF_PROG_TYPE_INET_LOOKUP,
+},
+{
+	"invalid 8-byte read bpf_inet_lookup remote_ip6",
+	.insns = {
+		BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_inet_lookup, remote_ip6[0])),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_INET_LOOKUP,
+},
+{
+	"invalid 8-byte write bpf_inet_lookup remote_ip6",
+	.insns = {
+		BPF_MOV64_IMM(BPF_REG_0, 0x00000001U),
+		BPF_STX_MEM(BPF_DW, BPF_REG_1, BPF_REG_0,
+			    offsetof(struct bpf_inet_lookup, remote_ip6[0])),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_INET_LOOKUP,
+},
+{
+	"invalid 4-byte write bpf_inet_lookup remote_ip6",
+	.insns = {
+		BPF_MOV64_IMM(BPF_REG_0, 0x00000001U),
+		BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0,
+			    offsetof(struct bpf_inet_lookup, remote_ip6[0])),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_INET_LOOKUP,
+},
+{
+	"invalid 2-byte write bpf_inet_lookup remote_ip6",
+	.insns = {
+		BPF_MOV64_IMM(BPF_REG_0, 0x00000001U),
+		BPF_STX_MEM(BPF_H, BPF_REG_1, BPF_REG_0,
+			    offsetof(struct bpf_inet_lookup, remote_ip6[0])),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_INET_LOOKUP,
+},
+{
+	"invalid 1-byte write bpf_inet_lookup remote_ip6",
+	.insns = {
+		BPF_MOV64_IMM(BPF_REG_0, 0x00000001U),
+		BPF_STX_MEM(BPF_B, BPF_REG_1, BPF_REG_0,
+			    offsetof(struct bpf_inet_lookup, remote_ip6[0])),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_INET_LOOKUP,
+},
+{
+	"valid 1,2,4-byte read bpf_inet_lookup local_ip6",
+	.insns = {
+		/* 4-byte read */
+		BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_inet_lookup, local_ip6[0])),
+		BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_inet_lookup, local_ip6[3])),
+		/* 2-byte read */
+		BPF_LDX_MEM(BPF_H, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_inet_lookup, local_ip6[0])),
+		BPF_LDX_MEM(BPF_H, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_inet_lookup, local_ip6[3]) + 2),
+		/* 1-byte read */
+		BPF_LDX_MEM(BPF_B, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_inet_lookup, local_ip6[0])),
+		BPF_LDX_MEM(BPF_B, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_inet_lookup, local_ip6[3]) + 3),
+		BPF_EXIT_INSN(),
+	},
+	.result = ACCEPT,
+	.prog_type = BPF_PROG_TYPE_INET_LOOKUP,
+},
+{
+	"invalid 8-byte read bpf_inet_lookup local_ip6",
+	.insns = {
+		BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_inet_lookup, local_ip6[0])),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_INET_LOOKUP,
+},
+{
+	"invalid 8-byte write bpf_inet_lookup local_ip6",
+	.insns = {
+		BPF_MOV64_IMM(BPF_REG_0, 0x00000001U),
+		BPF_STX_MEM(BPF_DW, BPF_REG_1, BPF_REG_0,
+			    offsetof(struct bpf_inet_lookup, local_ip6[0])),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_INET_LOOKUP,
+},
+{
+	"valid 4-byte write bpf_inet_lookup local_ip6",
+	.insns = {
+		BPF_MOV64_IMM(BPF_REG_0, 0x00000001U),
+		BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0,
+			    offsetof(struct bpf_inet_lookup, local_ip6[0])),
+		BPF_EXIT_INSN(),
+	},
+	.result = ACCEPT,
+	.prog_type = BPF_PROG_TYPE_INET_LOOKUP,
+},
+{
+	"invalid 2-byte write bpf_inet_lookup local_ip6",
+	.insns = {
+		BPF_MOV64_IMM(BPF_REG_0, 0x00000001U),
+		BPF_STX_MEM(BPF_H, BPF_REG_1, BPF_REG_0,
+			    offsetof(struct bpf_inet_lookup, local_ip6[0])),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_INET_LOOKUP,
+},
+{
+	"invalid 1-byte write bpf_inet_lookup local_ip6",
+	.insns = {
+		BPF_MOV64_IMM(BPF_REG_0, 0x00000001U),
+		BPF_STX_MEM(BPF_B, BPF_REG_1, BPF_REG_0,
+			    offsetof(struct bpf_inet_lookup, local_ip6[0])),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_INET_LOOKUP,
+},
+{
+	"valid 4-byte read bpf_inet_lookup remote_port",
+	.insns = {
+		BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_inet_lookup, remote_port)),
+		BPF_EXIT_INSN(),
+	},
+	.result = ACCEPT,
+	.prog_type = BPF_PROG_TYPE_INET_LOOKUP,
+},
+{
+	"invalid 8-byte read bpf_inet_lookup remote_port",
+	.insns = {
+		BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_inet_lookup, remote_port)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_INET_LOOKUP,
+},
+{
+	"invalid 2-byte read bpf_inet_lookup remote_port",
+	.insns = {
+		BPF_LDX_MEM(BPF_H, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_inet_lookup, remote_port)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_INET_LOOKUP,
+},
+{
+	"invalid 1-byte read bpf_inet_lookup remote_port",
+	.insns = {
+		BPF_LDX_MEM(BPF_B, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_inet_lookup, remote_port)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_INET_LOOKUP,
+},
+{
+	"invalid 8-byte write bpf_inet_lookup remote_port",
+	.insns = {
+		BPF_MOV64_IMM(BPF_REG_0, 1234),
+		BPF_STX_MEM(BPF_DW, BPF_REG_1, BPF_REG_0,
+			    offsetof(struct bpf_inet_lookup, remote_port)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_INET_LOOKUP,
+},
+{
+	"invalid 4-byte write bpf_inet_lookup remote_port",
+	.insns = {
+		BPF_MOV64_IMM(BPF_REG_0, 1234),
+		BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0,
+			    offsetof(struct bpf_inet_lookup, remote_port)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_INET_LOOKUP,
+},
+{
+	"invalid 2-byte write bpf_inet_lookup remote_port",
+	.insns = {
+		BPF_MOV64_IMM(BPF_REG_0, 1234),
+		BPF_STX_MEM(BPF_H, BPF_REG_1, BPF_REG_0,
+			    offsetof(struct bpf_inet_lookup, remote_port)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_INET_LOOKUP,
+},
+{
+	"invalid 1-byte write bpf_inet_lookup remote_port",
+	.insns = {
+		BPF_MOV64_IMM(BPF_REG_0, 1234),
+		BPF_STX_MEM(BPF_B, BPF_REG_1, BPF_REG_0,
+			    offsetof(struct bpf_inet_lookup, remote_port)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_INET_LOOKUP,
+},
+{
+	"valid 4-byte read bpf_inet_lookup local_port",
+	.insns = {
+		BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_inet_lookup, local_port)),
+		BPF_EXIT_INSN(),
+	},
+	.result = ACCEPT,
+	.prog_type = BPF_PROG_TYPE_INET_LOOKUP,
+},
+{
+	"invalid 8-byte read bpf_inet_lookup local_port",
+	.insns = {
+		BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_inet_lookup, local_port)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_INET_LOOKUP,
+},
+{
+	"invalid 2-byte read bpf_inet_lookup local_port",
+	.insns = {
+		BPF_LDX_MEM(BPF_H, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_inet_lookup, local_port)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_INET_LOOKUP,
+},
+{
+	"invalid 1-byte read bpf_inet_lookup local_port",
+	.insns = {
+		BPF_LDX_MEM(BPF_B, BPF_REG_0, BPF_REG_1,
+			    offsetof(struct bpf_inet_lookup, local_port)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_INET_LOOKUP,
+},
+{
+	"valid 4-byte write bpf_inet_lookup local_port",
+	.insns = {
+		BPF_MOV64_IMM(BPF_REG_0, 1234),
+		BPF_STX_MEM(BPF_W, BPF_REG_1, BPF_REG_0,
+			    offsetof(struct bpf_inet_lookup, local_port)),
+		BPF_EXIT_INSN(),
+	},
+	.result = ACCEPT,
+	.prog_type = BPF_PROG_TYPE_INET_LOOKUP,
+},
+{
+	"invalid 8-byte write bpf_inet_lookup local_port",
+	.insns = {
+		BPF_MOV64_IMM(BPF_REG_0, 1234),
+		BPF_STX_MEM(BPF_DW, BPF_REG_1, BPF_REG_0,
+			    offsetof(struct bpf_inet_lookup, local_port)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_INET_LOOKUP,
+},
+{
+	"invalid 2-byte write bpf_inet_lookup local_port",
+	.insns = {
+		BPF_MOV64_IMM(BPF_REG_0, 1234),
+		BPF_STX_MEM(BPF_H, BPF_REG_1, BPF_REG_0,
+			    offsetof(struct bpf_inet_lookup, local_port)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_INET_LOOKUP,
+},
+{
+	"invalid 1-byte write bpf_inet_lookup local_port",
+	.insns = {
+		BPF_MOV64_IMM(BPF_REG_0, 1234),
+		BPF_STX_MEM(BPF_B, BPF_REG_1, BPF_REG_0,
+			    offsetof(struct bpf_inet_lookup, local_port)),
+		BPF_EXIT_INSN(),
+	},
+	.errstr = "invalid bpf_context access",
+	.result = REJECT,
+	.prog_type = BPF_PROG_TYPE_INET_LOOKUP,
+},
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [RFC bpf-next 0/7] Programming socket lookup with BPF
  2019-06-18 13:00 [RFC bpf-next 0/7] Programming socket lookup with BPF Jakub Sitnicki
                   ` (6 preceding siblings ...)
  2019-06-18 13:00 ` [RFC bpf-next 7/7] bpf: Add verifier tests for inet_lookup context access Jakub Sitnicki
@ 2019-06-18 13:52 ` Florian Westphal
  2019-06-19  9:13   ` Jakub Sitnicki
  7 siblings, 1 reply; 19+ messages in thread
From: Florian Westphal @ 2019-06-18 13:52 UTC (permalink / raw)
  To: Jakub Sitnicki; +Cc: netdev, bpf, kernel-team

Jakub Sitnicki <jakub@cloudflare.com> wrote:
>  - XDP programs using bpf_sk_lookup helpers, like load balancers, can't
>    find the listening socket to check for SYN cookies with TPROXY redirect.

Sorry for the question, but where is the problem?
(i.e., is it with TPROXY or bpf side)?

>  - TPROXY takes a reference to the listening socket on dispatch, which
>    raises lock contention concerns.

FWIW this could be avoided in similar way as to how we handle noref dsts.

The only reason we need to take the reference at the moment is because
once skb leaves the TPROXY target hook, the skb could leave rcu
protection as well at some point (nfqueue for example).

Maybe its even enough to move reference taking to nfqueue and add
'noref' destructor, that would allow skb_steal_sock to propagate
refcounted value in __inet_lookup_skb.

So, at least for this part I don't see a technical reason why this
has to grab a reference for listener socket.

>  - Traffic steering configuration is split over several iptables rules, at
>    least one per service, which makes configuration changes error prone.

Could you perhaps sketch an example ruleset (doesn't have to be complete
nor parse-able by itpables-restore), I would just like to understand if
there is any room for improvement on netfilter/iptables/nft side.

Thanks,
Florian

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC bpf-next 0/7] Programming socket lookup with BPF
  2019-06-18 13:52 ` [RFC bpf-next 0/7] Programming socket lookup with BPF Florian Westphal
@ 2019-06-19  9:13   ` Jakub Sitnicki
  2019-06-20 11:56     ` Florian Westphal
                       ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Jakub Sitnicki @ 2019-06-19  9:13 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netdev, bpf, kernel-team

Hey Florian,

Thanks for taking a look at it.

On Tue, Jun 18, 2019 at 03:52 PM CEST, Florian Westphal wrote:
> Jakub Sitnicki <jakub@cloudflare.com> wrote:
>>  - XDP programs using bpf_sk_lookup helpers, like load balancers, can't
>>    find the listening socket to check for SYN cookies with TPROXY redirect.
>
> Sorry for the question, but where is the problem?
> (i.e., is it with TPROXY or bpf side)?

The way I see it is that the problem is that we have mappings for
steering traffic into sockets split between two places: (1) the socket
lookup tables, and (2) the TPROXY rules.

BPF programs that need to check if there is a socket the packet is
destined for have access to the socket lookup tables, via the mentioned
bpf_sk_lookup helper, but are unaware of TPROXY redirects.

For TCP we're able to look up from BPF if there are any established,
request, and "normal" listening sockets. The listening sockets that
receive connections via TPROXY are invisible to BPF progs.

Why are we interested in finding all listening sockets? To check if any
of them had SYN queue overflow recently and if we should honor SYN
cookies.

>>  - TPROXY takes a reference to the listening socket on dispatch, which
>>    raises lock contention concerns.
>
> FWIW this could be avoided in similar way as to how we handle noref dsts.
>
> The only reason we need to take the reference at the moment is because
> once skb leaves the TPROXY target hook, the skb could leave rcu
> protection as well at some point (nfqueue for example).
>
> Maybe its even enough to move reference taking to nfqueue and add
> 'noref' destructor, that would allow skb_steal_sock to propagate
> refcounted value in __inet_lookup_skb.
>
> So, at least for this part I don't see a technical reason why this
> has to grab a reference for listener socket.

That's helpful, thanks! We rely on TPROXY, so I would like to help with
that. Let me see if I can get time to work on it.

>
>>  - Traffic steering configuration is split over several iptables rules, at
>>    least one per service, which makes configuration changes error prone.
>
> Could you perhaps sketch an example ruleset (doesn't have to be complete
> nor parse-able by itpables-restore), I would just like to understand if
> there is any room for improvement on netfilter/iptables/nft side.

Happy to. Scenarios that are of interest to us:

1) Port sharing, while accepting on a set of subnets
   (same are the demo BPF prog from cover letter)

  ip route add local 192.0.2.0/24 dev lo
  ip route add local 198.51.100.0/24 dev lo
  ip route add local 203.0.113.0/24 dev lo

  ipset create net1 hash:net
  ipset create net2 hash:net
  ipset create net3 hash:net

  ipset add net1 192.0.2.0/24
  ipset add net2 198.51.100.0/24
  ipset add net3 203.0.113.0/24

  iptables -t mangle -A PREROUTING -p tcp --dport 80 \
           -m set --match-set net1 dst \
           -j TPROXY --on-ip=127.0.0.1 --on-port=81

  iptables -t mangle -A PREROUTING -p tcp --dport 80 \
           -m set --match-set net2 dst \
           -j TPROXY --on-ip=127.0.0.1 --on-port=82

2) Receving on all ports, except some

  iptables -t mangle -A PREROUTING -p tcp --dport 80 \
           -m set --match-set net3 dst \
           -j TPROXY --on-ip=127.0.0.1 --on-port=81

  iptables -t mangle -A PREROUTING -p tcp \
           -m set --match-set net3 dst \
           -j TPROXY --on-ip=127.0.0.1 --on-port=1

3) Steering part of the traffic to a different socket (A/B testing)

  iptables -t mangle -A PREROUTING -p tcp \
           -m set --match-set net3 dst \
           -m statistic --mode random --probability 0.01 \
           -j TPROXY --on-ip=127.0.0.1 --on-port=2

  iptables -t mangle -A PREROUTING -p tcp \
           -m set --match-set net3 dst \
           -j TPROXY --on-ip=127.0.0.1 --on-port=1

One thing I haven't touched on in the cover letter is that to use TPROXY
you need to set IP_TRANSPARENT on the listening socket. This requires
that your process runs with CAP_NET_RAW or CAP_NET_ADMIN, or that you
get the socket from systemd.

I haven't been able to explain why the process needs to be privileged to
receive traffic steered with TPROXY, but it turns out to be a pain point
too. We end up having to lock down the service to ensure it doesn't use
the elevated privileges for anything else than setting IP_TRANSPARENT.

Thanks,
Jakub

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC bpf-next 0/7] Programming socket lookup with BPF
  2019-06-19  9:13   ` Jakub Sitnicki
@ 2019-06-20 11:56     ` Florian Westphal
  2019-06-20 22:20     ` Joe Stringer
  2019-06-21 12:51     ` Florian Westphal
  2 siblings, 0 replies; 19+ messages in thread
From: Florian Westphal @ 2019-06-20 11:56 UTC (permalink / raw)
  To: Jakub Sitnicki; +Cc: Florian Westphal, netdev, bpf, kernel-team

Jakub Sitnicki <jakub@cloudflare.com> wrote:
> > Sorry for the question, but where is the problem?
> > (i.e., is it with TPROXY or bpf side)?
> 
> The way I see it is that the problem is that we have mappings for
> steering traffic into sockets split between two places: (1) the socket
> lookup tables, and (2) the TPROXY rules.
> 
> BPF programs that need to check if there is a socket the packet is
> destined for have access to the socket lookup tables, via the mentioned
> bpf_sk_lookup helper, but are unaware of TPROXY redirects.

Oh, right.

[ TPROXY setup ]

Thanks for sharing, it will take me some time to digest this.
It would be good to have a simpler way to express this.

> One thing I haven't touched on in the cover letter is that to use TPROXY
> you need to set IP_TRANSPARENT on the listening socket. This requires
> that your process runs with CAP_NET_RAW or CAP_NET_ADMIN, or that you
> get the socket from systemd.
> 
> I haven't been able to explain why the process needs to be privileged to
> receive traffic steered with TPROXY, but it turns out to be a pain point
> too. We end up having to lock down the service to ensure it doesn't use
> the elevated privileges for anything else than setting IP_TRANSPARENT.

Marek thinks its security measure:
1. TPROXY rule to redirect 80 to 8080 is added
2. UNPRIV binds 8080 -> Unpriv can then intercept packets for privileged
    port (it can't, as TPROXY rule refuses to redirect to sk that did not
    have IP_TRANSPARENT set).

AFAICS purely from stack pov, it sets IP_REPLY_ARG_NOSRCCHECK which in
turn sets FLOWI_FLAG_ANYSRC which bypasses a "fl->saddr is configured on
this machine" check in ip_route_output_key_hash_rcu.

I did not yet find similar entanglement for ipv6, will check.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC bpf-next 0/7] Programming socket lookup with BPF
  2019-06-19  9:13   ` Jakub Sitnicki
  2019-06-20 11:56     ` Florian Westphal
@ 2019-06-20 22:20     ` Joe Stringer
       [not found]       ` <CAGn+7TUmgsA8oKw-mM6S5iR4rmNt6sWxjUgw8=qSCHb=m0ROyg@mail.gmail.com>
  2019-06-25  7:28       ` Jakub Sitnicki
  2019-06-21 12:51     ` Florian Westphal
  2 siblings, 2 replies; 19+ messages in thread
From: Joe Stringer @ 2019-06-20 22:20 UTC (permalink / raw)
  To: Jakub Sitnicki; +Cc: Florian Westphal, netdev, bpf, kernel-team

On Wed, Jun 19, 2019 at 2:14 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>
> Hey Florian,
>
> Thanks for taking a look at it.
>
> On Tue, Jun 18, 2019 at 03:52 PM CEST, Florian Westphal wrote:
> > Jakub Sitnicki <jakub@cloudflare.com> wrote:
> >>  - XDP programs using bpf_sk_lookup helpers, like load balancers, can't
> >>    find the listening socket to check for SYN cookies with TPROXY redirect.
> >
> > Sorry for the question, but where is the problem?
> > (i.e., is it with TPROXY or bpf side)?
>
> The way I see it is that the problem is that we have mappings for
> steering traffic into sockets split between two places: (1) the socket
> lookup tables, and (2) the TPROXY rules.
>
> BPF programs that need to check if there is a socket the packet is
> destined for have access to the socket lookup tables, via the mentioned
> bpf_sk_lookup helper, but are unaware of TPROXY redirects.
>
> For TCP we're able to look up from BPF if there are any established,
> request, and "normal" listening sockets. The listening sockets that
> receive connections via TPROXY are invisible to BPF progs.
>
> Why are we interested in finding all listening sockets? To check if any
> of them had SYN queue overflow recently and if we should honor SYN
> cookies.

Why are they invisible? Can't you look them up with bpf_skc_lookup_tcp()?

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC bpf-next 0/7] Programming socket lookup with BPF
  2019-06-19  9:13   ` Jakub Sitnicki
  2019-06-20 11:56     ` Florian Westphal
  2019-06-20 22:20     ` Joe Stringer
@ 2019-06-21 12:51     ` Florian Westphal
  2019-06-21 14:33       ` Eric Dumazet
  2 siblings, 1 reply; 19+ messages in thread
From: Florian Westphal @ 2019-06-21 12:51 UTC (permalink / raw)
  To: Jakub Sitnicki; +Cc: Florian Westphal, netdev, bpf, kernel-team

Jakub Sitnicki <jakub@cloudflare.com> wrote:
> > So, at least for this part I don't see a technical reason why this
> > has to grab a reference for listener socket.
> 
> That's helpful, thanks! We rely on TPROXY, so I would like to help with
> that. Let me see if I can get time to work on it.

AFAICS so far this would be enough:

1. remove the BUG_ON() in skb_orphan, letting it clear skb->sk instead
2. in nf_queue_entry_get_refs(), if skb->sk and no destructor:
   call nf_tproxy_assign_sock() so a reference gets taken.
3. change skb_steal_sock:
   static inline struct sock *skb_steal_sock(struct sk_buff *skb, bool *refcounted)
    [..]
    *refcounted = skb->destructor != NULL;
4. make tproxy sk assign elide the destructor assigment in case of
   a listening sk.

This should work because TPROXY target is restricted to PRE_ROUTING, and
__netif_receive_skb_core runs with rcu readlock already held.

On a side note, it would also be interesting to see what breaks if the
nf_tproxy_sk_is_transparent() check in the tprox eval function is
removed -- if we need the transparent:1 marker only for output, i think
it would be ok to raise the bit transparently in the kernel in case
we assign skb->sk = found_sk; i.e.
 if (unlikely(!sk_is_transparent(sk))
	 make_sk_transparent(sk);

I don't see a reason why we need the explicit setsockopt(IP_TRANSPARENT)
from userspace.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC bpf-next 0/7] Programming socket lookup with BPF
  2019-06-21 12:51     ` Florian Westphal
@ 2019-06-21 14:33       ` Eric Dumazet
  2019-06-21 16:41         ` Florian Westphal
  0 siblings, 1 reply; 19+ messages in thread
From: Eric Dumazet @ 2019-06-21 14:33 UTC (permalink / raw)
  To: Florian Westphal, Jakub Sitnicki; +Cc: netdev, bpf, kernel-team



On 6/21/19 5:51 AM, Florian Westphal wrote:
> Jakub Sitnicki <jakub@cloudflare.com> wrote:
>>> So, at least for this part I don't see a technical reason why this
>>> has to grab a reference for listener socket.
>>
>> That's helpful, thanks! We rely on TPROXY, so I would like to help with
>> that. Let me see if I can get time to work on it.
> 
> AFAICS so far this would be enough:
> 
> 1. remove the BUG_ON() in skb_orphan, letting it clear skb->sk instead
> 2. in nf_queue_entry_get_refs(), if skb->sk and no destructor:
>    call nf_tproxy_assign_sock() so a reference gets taken.
> 3. change skb_steal_sock:
>    static inline struct sock *skb_steal_sock(struct sk_buff *skb, bool *refcounted)
>     [..]
>     *refcounted = skb->destructor != NULL;
> 4. make tproxy sk assign elide the destructor assigment in case of
>    a listening sk.
> 

Okay, but how do we make sure the skb->sk association does not leak from rcu section ?

Note we have the noref/refcounted magic for skb_dst(), we might try to use something similar
for skb->sk

> This should work because TPROXY target is restricted to PRE_ROUTING, and
> __netif_receive_skb_core runs with rcu readlock already held.
> 
> On a side note, it would also be interesting to see what breaks if the
> nf_tproxy_sk_is_transparent() check in the tprox eval function is
> removed -- if we need the transparent:1 marker only for output, i think
> it would be ok to raise the bit transparently in the kernel in case
> we assign skb->sk = found_sk; i.e.
>  if (unlikely(!sk_is_transparent(sk))
> 	 make_sk_transparent(sk);
> 
> I don't see a reason why we need the explicit setsockopt(IP_TRANSPARENT)
> from userspace.
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC bpf-next 0/7] Programming socket lookup with BPF
  2019-06-21 14:33       ` Eric Dumazet
@ 2019-06-21 16:41         ` Florian Westphal
  2019-06-21 16:54           ` Paolo Abeni
  0 siblings, 1 reply; 19+ messages in thread
From: Florian Westphal @ 2019-06-21 16:41 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Florian Westphal, Jakub Sitnicki, netdev, bpf, kernel-team

Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > AFAICS so far this would be enough:
> > 
> > 1. remove the BUG_ON() in skb_orphan, letting it clear skb->sk instead
> > 2. in nf_queue_entry_get_refs(), if skb->sk and no destructor:
> >    call nf_tproxy_assign_sock() so a reference gets taken.
> > 3. change skb_steal_sock:
> >    static inline struct sock *skb_steal_sock(struct sk_buff *skb, bool *refcounted)
> >     [..]
> >     *refcounted = skb->destructor != NULL;
> > 4. make tproxy sk assign elide the destructor assigment in case of
> >    a listening sk.
> > 
> 
> Okay, but how do we make sure the skb->sk association does not leak from rcu section ?

From netfilter pov the only escape point is nfqueue (and kfree_skb),
so for tcp/udp it will end up in their respective rx path eventually.
But you are right in that we need to also audit all NF_STOLEN users that
can be invoked from PRE_ROUTING and INPUT hooks.

OUTPUT/FORWARD/POSTROUTING are not relevant, in case skb enters IP forwarding,
it will be dropped there (we have a check to toss skb with socket
attached in forward).

In recent hallway discussion Eric suggested to add a empty destructor
stub, it would allow to do the needed annotation, i.e.
no need to change skb_orphan(), *refcounted would be set via
skb->destructor != noref_listen_skb_destructor check.

> Note we have the noref/refcounted magic for skb_dst(), we might try to use something similar
> for skb->sk

Yes, would be more code churn because we have to replace skb->sk access
by a helper to mask off NOREF bit (or we need to add a "noref" bit in
sk_buff itself).

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC bpf-next 0/7] Programming socket lookup with BPF
       [not found]       ` <CAGn+7TUmgsA8oKw-mM6S5iR4rmNt6sWxjUgw8=qSCHb=m0ROyg@mail.gmail.com>
@ 2019-06-21 16:50         ` Joe Stringer
  2019-06-25  8:11           ` Jakub Sitnicki
  0 siblings, 1 reply; 19+ messages in thread
From: Joe Stringer @ 2019-06-21 16:50 UTC (permalink / raw)
  To: Jakub Sitnicki; +Cc: Joe Stringer, Florian Westphal, netdev, bpf, kernel-team

On Fri, Jun 21, 2019 at 1:44 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>
> On Fri, Jun 21, 2019, 00:20 Joe Stringer <joe@wand.net.nz> wrote:
>>
>> On Wed, Jun 19, 2019 at 2:14 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>> >
>> > Hey Florian,
>> >
>> > Thanks for taking a look at it.
>> >
>> > On Tue, Jun 18, 2019 at 03:52 PM CEST, Florian Westphal wrote:
>> > > Jakub Sitnicki <jakub@cloudflare.com> wrote:
>> > >>  - XDP programs using bpf_sk_lookup helpers, like load balancers, can't
>> > >>    find the listening socket to check for SYN cookies with TPROXY redirect.
>> > >
>> > > Sorry for the question, but where is the problem?
>> > > (i.e., is it with TPROXY or bpf side)?
>> >
>> > The way I see it is that the problem is that we have mappings for
>> > steering traffic into sockets split between two places: (1) the socket
>> > lookup tables, and (2) the TPROXY rules.
>> >
>> > BPF programs that need to check if there is a socket the packet is
>> > destined for have access to the socket lookup tables, via the mentioned
>> > bpf_sk_lookup helper, but are unaware of TPROXY redirects.
>> >
>> > For TCP we're able to look up from BPF if there are any established,
>> > request, and "normal" listening sockets. The listening sockets that
>> > receive connections via TPROXY are invisible to BPF progs.
>> >
>> > Why are we interested in finding all listening sockets? To check if any
>> > of them had SYN queue overflow recently and if we should honor SYN
>> > cookies.
>>
>> Why are they invisible? Can't you look them up with bpf_skc_lookup_tcp()?
>
>
> They are invisible in that sense that you can't look them up using the packet 4-tuple. You have to somehow make the XDP/TC progs aware of the TPROXY redirects to find the target sockets.

Isn't that what you're doing in the example from the cover letter
(reincluded below for reference), except with the new program type
rather than XDP/TC progs?

       switch (bpf_ntohl(ctx->local_ip4) >> 8) {
        case NET1:
                ctx->local_ip4 = bpf_htonl(IP4(127, 0, 0, 1));
                ctx->local_port = 81;
                return BPF_REDIRECT;
        case NET2:
                ctx->local_ip4 = bpf_htonl(IP4(127, 0, 0, 1));
                ctx->local_port = 82;
                return BPF_REDIRECT;
        }

That said, I appreciate that even if you find the sockets from XDP,
you'd presumably need some way to retain the socket reference beyond
XDP execution to convince the stack to guide the traffic into that
socket, which would be a whole other effort. For your use case it may
or may not make the most sense.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC bpf-next 0/7] Programming socket lookup with BPF
  2019-06-21 16:41         ` Florian Westphal
@ 2019-06-21 16:54           ` Paolo Abeni
  0 siblings, 0 replies; 19+ messages in thread
From: Paolo Abeni @ 2019-06-21 16:54 UTC (permalink / raw)
  To: Florian Westphal, Eric Dumazet; +Cc: Jakub Sitnicki, netdev, bpf, kernel-team

Hi,

On Fri, 2019-06-21 at 18:41 +0200, Florian Westphal wrote:
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > > AFAICS so far this would be enough:
> > > 
> > > 1. remove the BUG_ON() in skb_orphan, letting it clear skb->sk instead
> > > 2. in nf_queue_entry_get_refs(), if skb->sk and no destructor:
> > >    call nf_tproxy_assign_sock() so a reference gets taken.
> > > 3. change skb_steal_sock:
> > >    static inline struct sock *skb_steal_sock(struct sk_buff *skb, bool *refcounted)
> > >     [..]
> > >     *refcounted = skb->destructor != NULL;
> > > 4. make tproxy sk assign elide the destructor assigment in case of
> > >    a listening sk.
> > > 
> > 
> > Okay, but how do we make sure the skb->sk association does not leak from rcu section ?
> 
> From netfilter pov the only escape point is nfqueue (and kfree_skb),
> so for tcp/udp it will end up in their respective rx path eventually.
> But you are right in that we need to also audit all NF_STOLEN users that
> can be invoked from PRE_ROUTING and INPUT hooks.
> 
> OUTPUT/FORWARD/POSTROUTING are not relevant, in case skb enters IP forwarding,
> it will be dropped there (we have a check to toss skb with socket
> attached in forward).
> 
> In recent hallway discussion Eric suggested to add a empty destructor
> stub, it would allow to do the needed annotation, i.e.
> no need to change skb_orphan(), *refcounted would be set via
> skb->destructor != noref_listen_skb_destructor check.

Perhaps I'm misreading the above, but it looks like this has some
overlapping with a past attempt:

https://marc.info/?l=linux-netdev&m=150611442802964&w=2

Cheers,

Paolo


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC bpf-next 0/7] Programming socket lookup with BPF
  2019-06-20 22:20     ` Joe Stringer
       [not found]       ` <CAGn+7TUmgsA8oKw-mM6S5iR4rmNt6sWxjUgw8=qSCHb=m0ROyg@mail.gmail.com>
@ 2019-06-25  7:28       ` Jakub Sitnicki
  1 sibling, 0 replies; 19+ messages in thread
From: Jakub Sitnicki @ 2019-06-25  7:28 UTC (permalink / raw)
  To: Joe Stringer; +Cc: Florian Westphal, netdev, bpf, kernel-team

[Reposting with correct format this time. Sorry.]

On Fri, Jun 21, 2019 at 12:20 AM CEST, Joe Stringer wrote:
> On Wed, Jun 19, 2019 at 2:14 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>>
>> Hey Florian,
>>
>> Thanks for taking a look at it.
>>
>> On Tue, Jun 18, 2019 at 03:52 PM CEST, Florian Westphal wrote:
>> > Jakub Sitnicki <jakub@cloudflare.com> wrote:
>> >>  - XDP programs using bpf_sk_lookup helpers, like load balancers, can't
>> >>    find the listening socket to check for SYN cookies with TPROXY redirect.
>> >
>> > Sorry for the question, but where is the problem?
>> > (i.e., is it with TPROXY or bpf side)?
>>
>> The way I see it is that the problem is that we have mappings for
>> steering traffic into sockets split between two places: (1) the socket
>> lookup tables, and (2) the TPROXY rules.
>>
>> BPF programs that need to check if there is a socket the packet is
>> destined for have access to the socket lookup tables, via the mentioned
>> bpf_sk_lookup helper, but are unaware of TPROXY redirects.
>>
>> For TCP we're able to look up from BPF if there are any established,
>> request, and "normal" listening sockets. The listening sockets that
>> receive connections via TPROXY are invisible to BPF progs.
>>
>> Why are we interested in finding all listening sockets? To check if any
>> of them had SYN queue overflow recently and if we should honor SYN
>> cookies.
>
> Why are they invisible? Can't you look them up with bpf_skc_lookup_tcp()?

They are invisible in that sense that you can't look them up using the
packet 4-tuple. You have to somehow make the XDP/TC progs aware of the
TPROXY redirects to find the target sockets.

-Jakub

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC bpf-next 0/7] Programming socket lookup with BPF
  2019-06-21 16:50         ` Joe Stringer
@ 2019-06-25  8:11           ` Jakub Sitnicki
  0 siblings, 0 replies; 19+ messages in thread
From: Jakub Sitnicki @ 2019-06-25  8:11 UTC (permalink / raw)
  To: Joe Stringer; +Cc: Florian Westphal, netdev, bpf, kernel-team

On Fri, Jun 21, 2019 at 06:50 PM CEST, Joe Stringer wrote:
> On Fri, Jun 21, 2019 at 1:44 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>>
>> On Fri, Jun 21, 2019, 00:20 Joe Stringer <joe@wand.net.nz> wrote:
>>>
>>> On Wed, Jun 19, 2019 at 2:14 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>>> >
>>> > Hey Florian,
>>> >
>>> > Thanks for taking a look at it.
>>> >
>>> > On Tue, Jun 18, 2019 at 03:52 PM CEST, Florian Westphal wrote:
>>> > > Jakub Sitnicki <jakub@cloudflare.com> wrote:
>>> > >>  - XDP programs using bpf_sk_lookup helpers, like load balancers, can't
>>> > >>    find the listening socket to check for SYN cookies with TPROXY redirect.
>>> > >
>>> > > Sorry for the question, but where is the problem?
>>> > > (i.e., is it with TPROXY or bpf side)?
>>> >
>>> > The way I see it is that the problem is that we have mappings for
>>> > steering traffic into sockets split between two places: (1) the socket
>>> > lookup tables, and (2) the TPROXY rules.
>>> >
>>> > BPF programs that need to check if there is a socket the packet is
>>> > destined for have access to the socket lookup tables, via the mentioned
>>> > bpf_sk_lookup helper, but are unaware of TPROXY redirects.
>>> >
>>> > For TCP we're able to look up from BPF if there are any established,
>>> > request, and "normal" listening sockets. The listening sockets that
>>> > receive connections via TPROXY are invisible to BPF progs.
>>> >
>>> > Why are we interested in finding all listening sockets? To check if any
>>> > of them had SYN queue overflow recently and if we should honor SYN
>>> > cookies.
>>>
>>> Why are they invisible? Can't you look them up with bpf_skc_lookup_tcp()?
>>
>>
>> They are invisible in that sense that you can't look them up using the packet 4-tuple. You have to somehow make the XDP/TC progs aware of the TPROXY redirects to find the target sockets.
>
> Isn't that what you're doing in the example from the cover letter
> (reincluded below for reference), except with the new program type
> rather than XDP/TC progs?
>
>        switch (bpf_ntohl(ctx->local_ip4) >> 8) {
>         case NET1:
>                 ctx->local_ip4 = bpf_htonl(IP4(127, 0, 0, 1));
>                 ctx->local_port = 81;
>                 return BPF_REDIRECT;
>         case NET2:
>                 ctx->local_ip4 = bpf_htonl(IP4(127, 0, 0, 1));
>                 ctx->local_port = 82;
>                 return BPF_REDIRECT;
>         }
>
> That said, I appreciate that even if you find the sockets from XDP,
> you'd presumably need some way to retain the socket reference beyond
> XDP execution to convince the stack to guide the traffic into that
> socket, which would be a whole other effort. For your use case it may
> or may not make the most sense.

Granted we're just moving steering logic from one place to another, that
is from TPROXY rules to a BPF program.

The key here is that the BPF prog runs during inet_lookup.  This let's
"lower level" BPF progs like XDP or TC check if there is a destination
socket, without having to know about steering rules.

If there is a local socket, we don't need to do socket dispatch from
BPF. Just pass the packet up the stack.

-Jakub

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2019-06-25  8:12 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-06-18 13:00 [RFC bpf-next 0/7] Programming socket lookup with BPF Jakub Sitnicki
2019-06-18 13:00 ` [RFC bpf-next 1/7] bpf: Introduce inet_lookup program type Jakub Sitnicki
2019-06-18 13:00 ` [RFC bpf-next 2/7] ipv4: Run inet_lookup bpf program on socket lookup Jakub Sitnicki
2019-06-18 13:00 ` [RFC bpf-next 3/7] ipv6: " Jakub Sitnicki
2019-06-18 13:00 ` [RFC bpf-next 4/7] bpf: Sync linux/bpf.h to tools/ Jakub Sitnicki
2019-06-18 13:00 ` [RFC bpf-next 5/7] libbpf: Add support for inet_lookup program type Jakub Sitnicki
2019-06-18 13:00 ` [RFC bpf-next 6/7] bpf: Test destination address remapping with inet_lookup Jakub Sitnicki
2019-06-18 13:00 ` [RFC bpf-next 7/7] bpf: Add verifier tests for inet_lookup context access Jakub Sitnicki
2019-06-18 13:52 ` [RFC bpf-next 0/7] Programming socket lookup with BPF Florian Westphal
2019-06-19  9:13   ` Jakub Sitnicki
2019-06-20 11:56     ` Florian Westphal
2019-06-20 22:20     ` Joe Stringer
     [not found]       ` <CAGn+7TUmgsA8oKw-mM6S5iR4rmNt6sWxjUgw8=qSCHb=m0ROyg@mail.gmail.com>
2019-06-21 16:50         ` Joe Stringer
2019-06-25  8:11           ` Jakub Sitnicki
2019-06-25  7:28       ` Jakub Sitnicki
2019-06-21 12:51     ` Florian Westphal
2019-06-21 14:33       ` Eric Dumazet
2019-06-21 16:41         ` Florian Westphal
2019-06-21 16:54           ` Paolo Abeni

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).